<li>Last month I enabled <code>log_lock_waits</code> in PostgreSQL, so I checked the log and was surprised to find only a few lock waits since I restarted PostgreSQL three days ago:</li>
<li>I think you could analyze the locks for the <code>dspaceWeb</code> user (XMLUI) and find out what queries were locking… but it’s so much information and I don’t know where to start
<ul>
<li>For now I just restarted PostgreSQL…</li>
<li>Francesca was able to do her submission immediately…</li>
</ul>
</li>
<li>On a related note, I want to enable the <code>pg_stat_statements</code> feature to see which queries get run the most, so I created the extension on the CGSpace database</li>
<li>I was doing some research on PostgreSQL locks and found some interesting things to consider
<ul>
<li>The default <code>lock_timeout</code> is 0, aka disabled</li>
<li>The default <code>statement_timeout</code> is 0, aka disabled</li>
<li>It seems to be recommended to start by setting <code>statement_timeout</code> first, with a rule of thumb of <a href="https://github.com/jberkus/annotated.conf/blob/master/postgresql.10.simple.conf#L211">ten times longer than your longest query</a></li>
</ul>
</li>
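<li>As a rough sketch of the rule of thumb above, the arithmetic could look like this in Python (the function names are hypothetical, and the three-second longest query is only an illustrative value, not a measurement from CGSpace):

```python
def suggest_statement_timeout_ms(longest_query_ms, factor=10):
    """Return a statement_timeout in milliseconds: factor times the longest query."""
    if not longest_query_ms > 0:
        raise ValueError("longest_query_ms must be positive")
    return longest_query_ms * factor


def render_setting(timeout_ms):
    """Format a postgresql.conf line; a bare integer is read as milliseconds."""
    return f"statement_timeout = {timeout_ms}"


# Illustrative only: assume the longest legitimate query takes 3 seconds
print(render_setting(suggest_statement_timeout_ms(3000)))
```

The longest query time would have to come from real measurements (for example from <code>pg_stat_statements</code>) before settling on a value.</li>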
<li>Mark Wood mentioned the <code>checker</code> cron job that apparently runs in one transaction and might be an issue
<ul>
<li>I definitely saw it holding a bunch of locks for ~30 minutes during the first part of its execution, then it dropped them and did some other less-intensive things without locks</li>
</ul>
</li>
<li>Bizuwork was still not receiving emails even after we fixed the SMTP access on CGSpace
<ul>
<li>After some troubleshooting it turns out that the emails from CGSpace were going into her Junk folder!</li>
</ul>
</li>
</ul>
<h2 id="2021-12-03">2021-12-03</h2>
<ul>
<li>I see GARDIAN is finally using a “GARDIAN” user agent</li>
<li>Proof fifty records Abenet sent me from Africa Rice Center (“AfricaRice 1st batch Import”)
<ul>
<li>Fixed forty-six incorrect collections</li>
<li>Cleaned up and normalized affiliations</li>
<li>Cleaned up dates (extra <code>*</code> character in all?)</li>
<li>Cleaned up citation format</li>
<li>Fixed some encoding issues in abstracts</li>
<li>Removed empty columns</li>
<li>Removed one duplicate: Enhancing Rice Productivity and Soil Nitrogen Using Dual-Purpose Cowpea-NERICA® Rice Sequence in Degraded Savanna</li>
<li>Added volume and issue metadata by extracting it from the citations</li>
<li>All PDFs hosted on davidpublishing.com are dead…</li>
<li>All DOIs linking to African Journal of Agricultural Research are dead…</li>
<li>Fixed a handful of items marked as “Open Access” that are actually closed</li>
<li>Added many missing ISSNs</li>
<li>Added many missing countries/regions</li>
<li>Fixed invalid AGROVOC terms and added some more based on article subjects</li>
</ul>
</li>
<li>I also made some minor changes to the <a href="https://github.com/ilri/csv-metadata-quality">CSV Metadata Quality Checker</a>
<ul>
<li>Added the ability to check if the item’s title exists in the citation</li>
<li>Updated to only run the mojibake check if we’re not running in unsafe mode (so we don’t print the same warning during both the check and fix steps)</li>
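<li>A minimal sketch of what a title-in-citation check could look like, not necessarily how csv-metadata-quality does it (the function name is hypothetical, and ignoring case and repeated whitespace is my assumption):

```python
def title_in_citation(title, citation):
    """Return True if the title appears in the citation.

    Assumption: case and runs of whitespace are ignored when comparing.
    """
    def normalize(text):
        return " ".join(text.split()).casefold()

    return normalize(title) in normalize(citation)


# Made-up example values, not real CGSpace metadata
print(title_in_citation(
    "Annual Report 2020",
    "Example Org. 2021. Annual  report 2020. Nairobi: Example Org.",
))
```

A stricter check would compare the strings exactly, which would flag differences in capitalization as well.</li>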
<li>Some minor work on the <code>check-duplicates.py</code> script I wrote last month
<ul>
<li>I found some corner cases where there were items that matched in the database, but they were <code>in_archive=f</code> and/or <code>withdrawn=t</code>, so I check that before trying to resolve the handles of potential duplicates</li>
</ul>
</li>
<li>More work on the Africa Rice Center 1st batch import
<ul>
<li>I merged the metadata for three duplicates in Africa Rice’s items and mapped them on CGSpace</li>
<li>I did a bit more work to add missing AGROVOC subjects, countries, regions, extents, etc., and then uploaded the forty-six items to CGSpace</li>
</ul>
</li>
<li>I started looking at the seventy CAS records that Abenet has been working on for the past few months</li>
<li>I sent Vini from CGIAR CAS some questions about the seventy records I was working on yesterday
<ul>
<li>Also, I ran the <code>check-duplicates.py</code> script on them and found that they might ALL be duplicates!!!</li>
<li>I tweaked the script a bit more to use the issue dates as a third criterion and now there are fewer duplicates, but it’s still at least twenty or so…</li>
<li>The script now checks if the issue date of the item in the CSV and the issue date of the item in the database are less than 365 days apart (by default)</li>
<li>For example, many items like “Annual Report 2020” can have similar title and type to previous annual reports, but are not duplicates</li>
</ul>
</li>
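<li>The issue date criterion described above can be sketched roughly like this (hypothetical function names, not the actual code in <code>check-duplicates.py</code>; I am assuming issue dates look like <code>YYYY</code>, <code>YYYY-MM</code>, or <code>YYYY-MM-DD</code>, with a missing month or day treated as 1):

```python
from datetime import date


def parse_issue_date(value):
    """Parse YYYY, YYYY-MM, or YYYY-MM-DD; a missing month or day becomes 1."""
    parts = [int(part) for part in value.strip().split("-")]
    parts += [1] * (3 - len(parts))
    return date(*parts[:3])


def within_threshold(csv_date, db_date, max_days=365):
    """Return True if the two issue dates are less than max_days apart."""
    delta = abs((parse_issue_date(csv_date) - parse_issue_date(db_date)).days)
    return max_days > delta


# Annual reports from different years should not count as duplicates
print(within_threshold("2019", "2021"))
# Dates within the same year are still potential duplicates
print(within_threshold("2021", "2021-12-31"))
```

With a check like this, two items with similar title and type are only flagged when their issue dates are also close together.</li>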
<li>I noticed a strange user agent in the XMLUI logs on CGSpace:</li>
<div><pre><code>20.84.225.129 - - [07/Dec/2021:11:51:27 +0100] "GET /handle/10568/33203 HTTP/2.0" 200 6315 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36"
</code></pre></div><ul>
<li>I looked into it more and I see a dozen other IPs using that user agent, and they are all owned by Microsoft
<ul>
<li>It could be someone on Azure?</li>
<li>I opened <a href="https://github.com/atmire/COUNTER-Robots/pull/49">a pull request to COUNTER-Robots</a> and I’ll add this user agent to our local override until they decide to include it or not</li>
</ul>
</li>
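<li>For reference, a small sketch of how the IPs using such a user agent could be tallied from an access log in Python (the function name and regular expression are my own, and I am assuming a combined-style log format like the line quoted above):

```python
import re
from collections import defaultdict

# Capture the client IP (group 1) and the user agent (group 2)
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*"\s*"([^"]*)"'
)


def ips_for_agent(lines, needle="HeadlessChrome"):
    """Count requests per IP for user agents containing the needle."""
    counts = defaultdict(int)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and needle in match.group(2):
            counts[match.group(1)] += 1
    return dict(counts)


sample = (
    '20.84.225.129 - - [07/Dec/2021:11:51:27 +0100] '
    '"GET /handle/10568/33203 HTTP/2.0" 200 6315 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36"'
)
print(ips_for_agent([sample]))
```

The resulting IPs could then be looked up to see who owns the networks.</li>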
<li>I purged 34,000 hits from this user agent in our Solr statistics:</li>
<li>Finalize country/region changes in csv-metadata-quality checker and release v0.5.0: <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.5.0">https://github.com/ilri/csv-metadata-quality/releases/tag/v0.5.0</a>
<ul>
<li>This also includes the mojibake fixes and title/citation checks and some bug fixes</li>