<li>Last month I enabled the <code>log_lock_waits</code> setting in PostgreSQL, so I checked the log and was surprised to find only a few lock waits since I restarted PostgreSQL three days ago:</li>
<li>I think you could analyze the locks for the <code>dspaceWeb</code> user (XMLUI) and find out which queries were holding locks… but it’s so much information and I don’t know where to start
<ul>
<li>For now I just restarted PostgreSQL…</li>
<li>Francesca was able to do her submission immediately…</li>
</ul>
</li>
<li>On a related note, I want to enable the <code>pg_stat_statements</code> extension to see which queries get run the most, so I created the extension on the CGSpace database</li>
<li>I was doing some research on PostgreSQL locks and found some interesting things to consider
<ul>
<li>The default <code>lock_timeout</code> is 0, aka disabled</li>
<li>The default <code>statement_timeout</code> is 0, aka disabled</li>
<li>It seems to be recommended to start by setting <code>statement_timeout</code> first; a rule of thumb is <a href="https://github.com/jberkus/annotated.conf/blob/master/postgresql.10.simple.conf#L211">ten times longer than your longest query</a></li>
</ul>
</li>
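<li>As an illustrative sketch only (these are not values we actually set), the relevant <code>postgresql.conf</code> lines might look like:</li>

```ini
# postgresql.conf -- illustrative values, not our production settings
# Rule of thumb: ten times longer than your longest legitimate query
statement_timeout = 300s
# lock_timeout left at the default (0, aka disabled) for now
lock_timeout = 0
```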
<li>Mark Wood mentioned the <code>checker</code> cron job that apparently runs in one transaction and might be an issue
<ul>
<li>I definitely saw it holding a bunch of locks for ~30 minutes during the first part of its execution, then it dropped them and did some other less-intensive things without locks</li>
</ul>
</li>
<li>Bizuwork was still not receiving emails even after we fixed the SMTP access on CGSpace
<ul>
<li>After some troubleshooting it turns out that the emails from CGSpace were going to her Junk folder!</li>
</ul>
</li>
</ul>
<h2 id="2021-12-03">2021-12-03</h2>
<ul>
<li>I see GARDIAN is finally using a “GARDIAN” user agent</li>
<li>Proofed fifty records that Abenet sent me from the Africa Rice Center (“AfricaRice 1st batch Import”)
<ul>
<li>Fixed forty-six incorrect collections</li>
<li>Cleaned up and normalized affiliations</li>
<li>Cleaned up dates (extra <code>*</code> character in all?)</li>
<li>Cleaned up citation format</li>
<li>Fixed some encoding issues in abstracts</li>
<li>Removed empty columns</li>
<li>Removed one duplicate: Enhancing Rice Productivity and Soil Nitrogen Using Dual-Purpose Cowpea-NERICA® Rice Sequence in Degraded Savanna</li>
<li>Added volume and issue metadata by extracting it from the citations</li>
<li>All PDFs hosted on davidpublishing.com are dead…</li>
<li>All DOIs linking to African Journal of Agricultural Research are dead…</li>
<li>Fixed a handful of items marked as “Open Access” that are actually closed</li>
<li>Added many missing ISSNs</li>
<li>Added many missing countries/regions</li>
<li>Fixed invalid AGROVOC terms and added some more based on article subjects</li>
</ul>
</li>
<li>I also made some minor changes to the <a href="https://github.com/ilri/csv-metadata-quality">CSV Metadata Quality Checker</a>
<ul>
<li>Added the ability to check if the item’s title exists in the citation</li>
<li>Updated to only run the mojibake check if we’re not running in unsafe mode (so we don’t print the same warning during both the check and fix steps)</li>
<li>Some minor work on the <code>check-duplicates.py</code> script I wrote last month
<ul>
<li>I found some corner cases where items matched in the database, but they were <code>in_archive=f</code> and/or <code>withdrawn=t</code>, so now I check for that before trying to resolve the handles of potential duplicates</li>
</ul>
</li>
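<li>The title-in-citation check can be sketched roughly like this (my own minimal approximation, not the exact csv-metadata-quality implementation):</li>

```python
def title_in_citation(title: str, citation: str) -> bool:
    """Return True if the item's title appears in its citation.

    A minimal approximation: compare case-insensitively after
    collapsing whitespace, since citations often reflow titles.
    """
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())

    return normalize(title) in normalize(citation)

if __name__ == "__main__":
    citation = "Oikeh, S. 2021. Enhancing rice productivity and soil nitrogen. J. Agric. 1(2)."
    print(title_in_citation("Enhancing Rice Productivity and Soil Nitrogen", citation))  # True
    print(title_in_citation("Some Other Title", citation))  # False
```

Normalizing case and whitespace first avoids false negatives when a citation reflows or re-capitalizes the title.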
<li>More work on the Africa Rice Center 1st batch import
<ul>
<li>I merged the metadata for three duplicates in Africa Rice’s items and mapped them on CGSpace</li>
<li>I did a bit more work to add missing AGROVOC subjects, countries, regions, extents, etc and then uploaded the forty-six items to CGSpace</li>
</ul>
</li>
<li>I started looking at the seventy CAS records that Abenet has been working on for the past few months</li>
<li>I sent Vini from CGIAR CAS some questions about the seventy records I was working on yesterday
<ul>
<li>Also, I ran the <code>check-duplicates.py</code> script on them and found that they might ALL be duplicates!!!</li>
<li>I tweaked the script a bit more to use the issue dates as a third criterion and now there are fewer duplicates, but it’s still at least twenty or so…</li>
<li>The script now checks if the issue date of the item in the CSV and the issue date of the item in the database are less than 365 days apart (by default)</li>
<li>For example, many items like “Annual Report 2020” can have similar title and type to previous annual reports, but are not duplicates</li>
</ul>
</li>
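<li>The date criterion can be sketched as follows (a minimal sketch of my own; 365 days is the script’s default window, and the function name here is hypothetical):</li>

```python
from datetime import date

def issue_dates_match(csv_date: date, db_date: date, window_days: int = 365) -> bool:
    """Treat two items as potential duplicates (date-wise) only if their
    issue dates are less than `window_days` apart (365 by default)."""
    return abs((csv_date - db_date).days) < window_days

# "Annual Report 2020" vs "Annual Report 2019": similar title and type,
# but issue dates more than a year apart, so not flagged as duplicates.
print(issue_dates_match(date(2021, 3, 1), date(2020, 2, 1)))   # False
print(issue_dates_match(date(2021, 3, 1), date(2021, 1, 15)))  # True
```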
<li>I noticed a strange user agent in the XMLUI logs on CGSpace:</li>
<li>I looked into it more and I see a dozen other IPs using that user agent, and they are all owned by Microsoft
<ul>
<li>It could be someone on Azure?</li>
<li>I opened <a href="https://github.com/atmire/COUNTER-Robots/pull/49">a pull request to COUNTER-Robots</a> and I’ll add this user agent to our local override until they decide whether to include it</li>
</ul>
</li>
<li>I purged 34,000 hits from this user agent in our Solr statistics:</li>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Total number of bot hits purged: 34458
</span></span></code></pre></div>
<li>Finalize country/region changes in the csv-metadata-quality checker and release v0.5.0: <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.5.0">https://github.com/ilri/csv-metadata-quality/releases/tag/v0.5.0</a>
<ul>
<li>This also includes the mojibake fixes and title/citation checks and some bug fixes</li>
<li>Help Francesca upload the dataset for one CIAT publication (it has like 100 authors so we did it via CSV)</li>
</ul>
</li>
</ul>
<h2 id="2021-12-12">2021-12-12</h2>
<ul>
<li>Patch OpenRXV’s Elasticsearch for the CVE-2021-44228 log4j vulnerability and re-deploy AReS
<ul>
<li>I added <code>-Dlog4j2.formatMsgNoLookups=true</code> to the Elasticsearch Java environment</li>
</ul>
</li>
<li>Run AReS harvesting</li>
</ul>
<h2 id="2021-12-13">2021-12-13</h2>
<ul>
<li>I ran the <code>check-duplicates.py</code> script on the 1,000 items from the CGIAR System Office TAC/ICW/Green Cover archives and found hundreds or thousands of potential duplicates
<ul>
<li>I sent feedback to Gaia</li>
</ul>
</li>
<li>Help Jacquie from WorldFish try to find all outputs for the Fish CRP because there are a few different formats for that name</li>
<li>Create a temporary account for Rafael Rodriguez on DSpace Test so he can investigate the submission workflow
<ul>
<li>I added him to the admin group on the Alliance community…</li>
</ul>
</li>
</ul>
<h2 id="2021-12-14">2021-12-14</h2>
<ul>
<li>I finally caught some stuck locks on CGSpace after checking several times per day for the last week:</li>
<li>The oldest locks are 9 hours and 26 minutes old and the time on the server is <code>Tue Dec 14 18:41:58 CET 2021</code>, so it seems something happened around 9:15 this morning
<ul>
<li>I looked at the maintenance tasks and there is nothing running around then (only the sitemap update that runs at 8AM, and should be quick)</li>
<li>I looked at the DSpace log, but didn’t see anything interesting there: only editors making edits…</li>
<li>I looked at the nginx REST API logs and saw lots of GET action there from Drupal sites harvesting us…</li>
<li>So I’m not sure what is causing this… perhaps something in the XMLUI submission / task workflow</li>
<li>For now I just ran all system updates and rebooted the server</li>
<li>I also enabled Atmire’s <code>log-db-activity.sh</code> script to run every four hours (in the DSpace user’s crontab) so perhaps that will be better than me checking manually</li>
</ul>
</li>
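<li>The crontab entry might look like this (the script path is an assumption; only the every-four-hours schedule is from my notes):</li>

```
# DSpace user's crontab; path to Atmire's script is assumed
0 */4 * * * /home/dspace/bin/log-db-activity.sh > /dev/null 2>&1
```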
<li>Regarding Gaia’s 1,000 items to upload to CGSpace, I checked the eighteen Green Cover records and there are no duplicates, so that’s at least a starting point!
<ul>
<li>I sent her a spreadsheet with the eighteen items with a new collection column to indicate where they should go</li>
</ul>
</li>
</ul>
<h2 id="2021-12-16">2021-12-16</h2>
<ul>
<li>Working on the CGIAR CAS Green Cover records for Gaia
<ul>
<li>Add months to dcterms.issued from PDFs</li>
<li>Add languages</li>
<li>Format and fix several authors</li>
</ul>
</li>
<li>I created a SAF archive with SAFBuilder and then imported it to DSpace Test:</li>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console">node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias 'AggregationsAggregate' circularly references itself.
node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias 'AggregationsSingleBucketAggregate' circularly references itself.
</code></pre></div>
<li>It seems to be related to <a href="https://github.com/elastic/elasticsearch-js"><code>@elastic/elasticsearch-js</code></a>, which our <code>package.json</code> pins with version <code>^7.13.0</code></li>
<li>I see that AReS is currently using 7.15.0 in its <code>package-lock.json</code>, and 7.16.0 was released four days ago so perhaps it’s that…</li>
<li>Pinning <code>~7.15.0</code> allows Nest to build fine…</li>
<li>I made a pull request</li>
</ul>
</li>
<li>But since software sucks, now I get an error in the frontend while starting nginx:</li>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>nginx: [emerg] host not found in upstream "backend:3000" in /etc/nginx/conf.d/default.conf:2
</span></span></code></pre></div>
<li>In other news, looking at updating our Redis from version 5 to 6 (which is slightly less old, but still old!) and I’m happy to see that the <a href="https://raw.githubusercontent.com/redis/redis/6.0/00-RELEASENOTES">release notes for version 6</a> say that it is compatible with 5 except for one minor thing that we don’t seem to be using (SPOP?)</li>
<li>For reference I see that our Redis 5 container is based on Debian 11, which I didn’t expect… but I still want to try to upgrade to Redis 6 eventually:</li>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console">1:C 19 Dec 2021 19:27:15.583 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 19 Dec 2021 19:27:15.583 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 19 Dec 2021 19:27:15.583 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
1:M 19 Dec 2021 19:27:15.584 * monotonic clock: POSIX clock_gettime
1:M 19 Dec 2021 19:27:15.584 * Running mode=standalone, port=6379.
1:M 19 Dec 2021 19:27:15.584 # Server initialized
1:M 19 Dec 2021 19:27:15.585 * Loading RDB produced by version 5.0.14
1:M 19 Dec 2021 19:27:15.585 * RDB age 33 seconds
1:M 19 Dec 2021 19:27:15.585 * RDB memory usage when created 3.17 Mb
</code></pre></div>
<li>I also fixed the weird “unsafe” issue in the links on AReS that Abenet told me about last week
<ul>
<li>When testing my local instance I realized that the <code>thumbnail</code> field was missing on the production AReS, and that somehow breaks the links</li>
</ul>
</li>
</ul>
<h2 id="2021-12-22">2021-12-22</h2>
<ul>
<li>Fix apt error on DSpace servers due to updated <code>/etc/java-8-openjdk/security/java.security</code> file</li>
</ul>
<h2 id="2021-12-23">2021-12-23</h2>
<ul>
<li>Add support for dropping invalid AGROVOC subjects to csv-metadata-quality</li>
<li>Move invalid AGROVOC subjects in Gaia’s eighteen green cover items on DSpace Test to <code>cg.subject.system</code></li>
<li>I created an “approve” user for Rafael from CIAT to do tests on DSpace Test:</li>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace user -a -m rafael-approve@cgiar.org -g Rafael -s Rodriguez -p <span style="color:#e6db74">'fuuuuuu'</span>
<li>Most of the time it has a real-looking user agent, but sometimes it uses <code>Apache-HttpClient/4.3.4 (java 1.5)</code></li>
<li>Another IP, 82.65.26.228, is doing SQL injection attempts from France</li>
<li>216.213.28.138 is some scrape-as-a-service bot from Sprious</li>
<li>I used my <code>resolve-addresses-geoip2.py</code> script to get the ASNs for all the IPs in Solr stats this month, then extracted the ASNs that were responsible for more than one IP:</li>
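<li>The post-processing to find ASNs responsible for more than one IP can be sketched like this (my own sketch; the CSV column names are assumptions about the script’s output):</li>

```python
import csv
import io
from collections import defaultdict

# Sample rows in the shape I assume resolve-addresses-geoip2.py emits: ip, asn
sample_csv = """ip,asn
216.213.28.136,64267
207.182.27.191,64267
45.146.166.173,49505
192.0.2.1,12345
45.134.26.171,49505
"""

# Group the distinct IPs seen under each ASN
ips_by_asn = defaultdict(set)
for row in csv.DictReader(io.StringIO(sample_csv)):
    ips_by_asn[row["asn"]].add(row["ip"])

# Keep only the ASNs responsible for more than one IP
multi_ip_asns = {asn for asn, ips in ips_by_asn.items() if len(ips) > 1}
print(sorted(multi_ip_asns))  # ['49505', '64267']
```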
<li>AS 64267 is Sprious, and it has used these IPs this month:
<ul>
<li>216.213.28.136</li>
<li>207.182.27.191</li>
<li>216.41.235.187</li>
<li>216.41.232.169</li>
<li>216.41.235.186</li>
<li>52.124.19.190</li>
<li>216.213.28.138</li>
<li>216.41.234.163</li>
</ul>
</li>
<li>To be honest I want to ban all their networks but I’m afraid it’s too many IPs… hmmm</li>
<li>AS 24940 is Hetzner, but I don’t feel like going through all the IPs to see… they always pretend to be normal users and make semi-sane requests so it might be a proxy or something</li>
<li>I’m going to purge all of the Sprious IPs for sure, as they are a scraping-as-a-service company and don’t use proper user agents or request robots.txt</li>
<li>AS 24757 is Ethiopian Telecom</li>
<li>AS 49505 is the Russian Selectel, and it has used these IPs this month:
<ul>
<li>45.146.166.173</li>
<li>45.134.26.171</li>
<li>45.146.164.123</li>
<li>45.155.205.231</li>
<li>195.54.167.122</li>
</ul>
</li>
<li>I will purge them all too because they are up to no good, as I already saw earlier today (SQL injections)</li>
<li>AS 16509 is Amazon, and it has used these IPs this month:
<ul>
<li>18.135.23.223 (made requests using the <code>Mozilla/5.0 (compatible; U; Koha checkurl)</code> user agent, so I will purge it and add it to our DSpace user agent override and <a href="https://github.com/atmire/COUNTER-Robots/pull/51">submit to COUNTER-Robots</a>)</li>
<li>54.76.137.83 (made hundreds of requests to “/” with a normal user agent)</li>
<li>34.253.119.85 (made hundreds of requests to “/” with a normal user agent)</li>
<li>34.216.201.131 (made hundreds of requests to “/” with a normal user agent)</li>
<li>54.203.193.46 (made hundreds of requests to “/” with a normal user agent)</li>
</ul>
</li>
<li>I ran the script to purge spider agents with the latest updates:</li>
</span></span><span style="display:flex;"><span>Total number of bot hits purged: 16785
</span></span><span style="display:flex;"><span>Total number of bot hits purged: 37959
</span></span></code></pre></div>