<li>The current time on the server is 08:52 and I see the dspaceCli locks were started at 04:00 and 05:00… so I need to check which cron jobs those belong to, as I think I noticed this last month too
<ul>
<li>I’m going to wait and see if they finish, but if they are still running tomorrow I will kill them</li>
<li>The load on the server is now very low and there are no more locks from dspaceCli
<ul>
<li>So there <em>was</em> some long-running process that simply had to finish!</li>
<li>That finally sheds some light on the “high load on Sunday” problem where I couldn’t find any other distinct pattern in the nginx or Tomcat requests</li>
</ul>
</li>
</ul>
</li>
</ul>
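<p>The quick way to check which pool the locks belong to is a <code>pg_locks</code> join piped through grep, sort, and uniq. A rough Python equivalent of that tally, assuming you already have the raw psql output as text (the sample lines here are hypothetical):</p>

```python
from collections import Counter
import re

def count_locks_by_app(psql_output: str) -> Counter:
    """Tally PostgreSQL locks by DSpace pool (dspaceWeb/dspaceApi/dspaceCli)
    from the raw text of a pg_locks / pg_stat_activity join."""
    return Counter(re.findall(r"dspace(?:Web|Api|Cli)", psql_output))

# Hypothetical fragment of psql output:
sample = "5432 dspaceCli ...\n5433 dspaceCli ...\n5434 dspaceWeb ...\n"
print(count_locks_by_app(sample))
```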
<h2 id="2023-01-03">2023-01-03</h2>
<ul>
<li>The high load on the server on Sundays, which I have noticed for a long time, seems to come from the DSpace checker cron job
<ul>
<li>This checks the checksums of all bitstreams to see if they match the ones in the database</li>
</ul>
</li>
<li>I exported the entire CGSpace metadata to do country/region checks with <code>csv-metadata-quality</code>
<ul>
<li>I extracted only the items with countries, which was about 48,000, then split the file into parts of 10,000 items, but the upload found 2,000 changes in the first one and took several hours to complete…</li>
</ul>
</li>
<li>IWMI sent me ORCID identifiers for new scientists, bringing our total to 2,010</li>
</ul>
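<p>I didn’t keep the exact commands I used to split the export, but the idea is simple: chunk the CSV into parts of 10,000 rows, repeating the header in each part so every file can be uploaded on its own. A minimal sketch (file names are hypothetical):</p>

```python
import csv

def split_csv(path, chunk_size=10000):
    """Split a metadata CSV into numbered parts of at most chunk_size rows,
    repeating the header row in every part so each file imports independently."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    part_names = []
    for i in range(0, len(rows), chunk_size):
        part = f"{path}.part{i // chunk_size}.csv"
        with open(part, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows[i:i + chunk_size])
        part_names.append(part)
    return part_names
```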
<h2 id="2023-01-04">2023-01-04</h2>
<ul>
<li>I finally finished applying the region imports (in five batches of 10,000)
<ul>
<li>It was about 7,500 missing regions in total…</li>
</ul>
</li>
<li>Now I will move on to doing the Initiative mappings
<ul>
<li>I modified my <code>fix-initiative-mappings.py</code> script to only write out the items that have updated mappings</li>
<li>This makes it way easier to apply fixes to the entire CGSpace because we don’t try to import 100,000 items with no changes in mappings</li>
</ul>
</li>
<li>More dspaceCli locks from 04:00 this morning (current time on server is 07:33) and today is a Wednesday
<ul>
<li>The checker cron job runs on days-of-week <code>0,3</code>, which are Sunday and Wednesday, so this is from that…</li>
<li>Finally at 16:30 I decided to kill the PIDs associated with those locks…</li>
<li>I am going to disable that cron job for now and watch the server load for a few weeks</li>
<li>It’s Sunday and I see some PostgreSQL locks belonging to dspaceCli that started at 05:00
<ul>
<li>That’s strange because I disabled the <code>dspace checker</code> one last week, so I’m not sure which this is…</li>
<li>It’s currently 2:30 PM on the server, so these locks have been there for almost ten hours</li>
</ul>
</li>
<li>I exported the entire CGSpace to update the Initiative mappings
<ul>
<li>Items were mapped to ~58 new Initiative collections</li>
</ul>
</li>
<li>Then I ran the ORCID import to catch any new ones that might not have been tagged</li>
<li>Then I started a harvest on AReS</li>
</ul>
</li>
</ul>
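<p>The script itself isn’t shown here, but the core trick in <code>fix-initiative-mappings.py</code> (only writing out items whose mappings actually changed) can be sketched roughly like this, with assumed column names:</p>

```python
def changed_rows(old_rows, new_rows, key="id", field="collection"):
    """Keep only the rows whose mapping field differs from the original export,
    so the metadata import doesn't churn through thousands of unchanged items."""
    old = {row[key]: row[field] for row in old_rows}
    return [row for row in new_rows if old.get(row[key]) != row[field]]

# Hypothetical before/after exports (DSpace separates multiple values with ||):
before = [
    {"id": "1", "collection": "10568/111"},
    {"id": "2", "collection": "10568/222"},
]
after = [
    {"id": "1", "collection": "10568/111"},
    {"id": "2", "collection": "10568/222||10568/333"},
]
print(changed_rows(before, after))  # only item 2 needs to be re-imported
```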
<h2 id="2023-01-09">2023-01-09</h2>
<ul>
<li>Fix some invalid Initiative names on CGSpace and then check for missing mappings</li>
<li>Check for missing regions in the Initiatives collection</li>
<li>Export a list of author affiliations from the Initiatives community for Peter to check
<ul>
<li>This was slightly hacky: I did it from a CSV export of the Initiatives community, imported that into OpenRefine to split the multi-value fields, then did some sed nonsense to handle the quoting</li>
<li>While following the <a href="https://github.com/DSpace/RestContract/blob/main/authentication.md">DSpace 7 REST API authentication docs</a> I still cannot log in via curl on the command line because I get an <code>Access is denied. Invalid CSRF token.</code> message</li>
<li>Logging in via the HAL Browser works…</li>
<li>Someone on the DSpace Slack mentioned that the <a href="https://github.com/DSpace/RestContract/issues/209">authentication documentation is out of date</a> and we need to specify the cookie too</li>
<li>Batch import four IFPRI items for CGIAR Initiative on Low-Emission Food Systems</li>
<li>Batch import another twenty-eight items for IFPRI across several Initiatives
<ul>
<li>On this one I did quite a bit of extra work to check for CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc</li>
<li>I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts</li>
<li>Then I checked for duplicates and ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values</li>
</ul>
</li>
</ul>
</li>
</ul>
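<p>For the record, the login flow that the out-of-date docs miss is: make any request to get a CSRF token, then send that token back as both the <code>X-XSRF-TOKEN</code> header and the accompanying cookie when POSTing the credentials. A hedged Python sketch of that dance, using the endpoint paths from the DSpace 7 REST contract and a session object that carries cookies automatically (like <code>requests.Session</code>):</p>

```python
def dspace7_login(session, base_url, email, password):
    """Log into the DSpace 7 REST API.

    DSpace returns a CSRF token in the DSPACE-XSRF-TOKEN response header (and
    sets a matching DSPACE-XSRF-COOKIE); the login POST must echo the token in
    the X-XSRF-TOKEN header while the session re-sends the cookie.
    """
    # Step 1: any GET will do; we only want the CSRF token it returns
    status = session.get(f"{base_url}/api/authn/status")
    token = status.headers["DSPACE-XSRF-TOKEN"]

    # Step 2: POST the credentials with the token header (the cookie rides
    # along automatically in a cookie-aware session)
    login = session.post(
        f"{base_url}/api/authn/login",
        headers={"X-XSRF-TOKEN": token},
        data={"user": email, "password": password},
    )
    # On success DSpace returns a JWT in the Authorization header
    return login.headers.get("Authorization")
```

<p>With curl the equivalent is saving the cookie jar from the first request with <code>-c</code> and replaying it with <code>-b</code> on the login request.</p>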
<h2 id="2023-01-17">2023-01-17</h2>
<ul>
<li>Batch import another twenty-three items for IFPRI across several Initiatives
<ul>
<li>I checked the IFPRI eBrary for extra CRPs and data/code URLs in the acknowledgements, licenses, volume/issue/extent, etc</li>
<li>I fixed some authors, an ISBN, and added extra AGROVOC keywords from the abstracts</li>
<li>Then I found and removed one duplicate in these items, as well as another on CGSpace already (!): 10568/126669</li>
<li>Then I ran it through csv-metadata-quality to make sure the countries/regions matched and there were no duplicate metadata values</li>
</ul>
</li>
<li>I exported the Initiatives collection to check the mappings, regions, and other metadata with csv-metadata-quality</li>
<li>I also added a bunch of ORCID identifiers to my list and tagged 837 new metadata values on CGSpace</li>
<li>There is a high load on CGSpace pretty regularly
<ul>
<li>Looking at Munin, I see a marked increase in DSpace sessions over the last few weeks:</li>
<li>I ran the IPs through my <code>resolve-addresses-geoip2.py</code> script to resolve their ASNs/networks, then extracted some lists of data center ISPs by eyeballing them (Amazon, Google, Microsoft, Apple, DigitalOcean, HostRoyale, and a dozen others):</li>
<li>But still, it’s suspicious that there are so many PostgreSQL locks</li>
<li>Looking at the Solr stats to check the hits over the last month (I actually skipped December because I was so busy)
<ul>
<li>I see 31.148.223.10 is on ALFA TELECOM s.r.o. in Russia and it made 43,000 requests this month (and 400,000 more last month!)</li>
<li>I see 18.203.245.60 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 3.249.192.212 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 34.244.160.145 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 52.213.59.101 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 91.209.8.29 is in Bulgaria on DGM EOOD and is low risk according to Scamlytics, but their user agent is all lower case and it’s a data center ISP so nope</li>
<li>I see 54.78.176.127 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 54.246.128.111 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 54.74.197.53 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 52.16.103.133 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 63.32.99.252 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 176.34.141.181 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 34.243.17.80 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 34.240.206.16 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 18.203.81.120 is on Amazon and it uses weird user agents, different with each request</li>
<li>I see 176.97.210.106 is on Tube Hosting and is rated VERY BAD, malicious, and scammy by everything I checked</li>
<li>I see 79.110.73.54 is on ALFA TELCOM / Serverel and is using a different, weird user agent with each request</li>
<li>There are too many to count… so I will purge these and then move on to user agents</li>
</ul>
</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Total number of bot hits purged: 466826
</span></span></code></pre></div><ul>
<li>Looking at user agents in Solr statistics from 2022-12 and 2023-01 I see some weird ones:
<ul>
<li><code>azure-logic-apps/1.0 (workflow e1f855704d6543f48be6205c40f4083f; version 08585300079823949478) microsoft-flow/1.0</code></li>
<li><code>Gov employment data scraper ([[your email]])</code></li>
<li>I’m looking at all the ORCID identifiers in the database, which seem to be way more than I realized:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2023-01-18-orcid-identifiers.txt;
</span></span></code></pre></div>
</li>
<li>In the DSpace logs I see some weird SQL messages, so I decided to restart PostgreSQL and Tomcat 7…
<ul>
<li>I hope this doesn’t cause some issue with in-progress workflows…</li>
</ul>
</li>
<li>I see another user on Cox in the US (98.186.216.144) crawling and scraping XMLUI with Python
<ul>
<li>I will add python to the list of bad bot user agents in nginx</li>
</ul>
</li>
<li>While looking into the locks I see some potential Java heap issues
<ul>
<li>Indeed, I see two out of memory errors in Tomcat’s journal:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>tomcat7[310996]: java.lang.OutOfMemoryError: Java heap space
</span></span><span style="display:flex;"><span>tomcat7[310996]: Jan 18, 2023 1:37:03 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
</span></span></code></pre></div><ul>
<li>Which explains why the locks went down to normal numbers as I was watching… (because Java crashed)</li>
</ul>
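<p>I purge these hits with a helper script rather than by hand, but conceptually it boils down to a Solr delete-by-query over the offending IPs. A hedged sketch that only builds the query body (the statistics core name and the lack of value escaping are assumptions):</p>

```python
def build_ip_purge_query(ips):
    """Build a Solr delete-by-query body matching any of the given client IPs
    in the statistics core; POSTing it to the core's update handler (with
    commit=true) removes those hits."""
    clause = " OR ".join(f"ip:{ip}" for ip in ips)
    return f"<delete><query>{clause}</query></delete>"

print(build_ip_purge_query(["31.148.223.10", "18.203.245.60"]))
# <delete><query>ip:31.148.223.10 OR ip:18.203.245.60</query></delete>
```

<p>The body would then be POSTed to something like <code>/solr/statistics/update?commit=true</code> with an XML content type.</p>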
<h2 id="2023-01-19">2023-01-19</h2>
<ul>
<li>Update a bunch of ORCID identifiers, Initiative mappings, and regions on CGSpace</li>
<li>So it seems an IFPRI user got caught up in the blocking I did yesterday
<ul>
<li>Their ISP is Comcast…</li>
<li>I need to re-work the ASN blocking on nginx, but for now I will just get the ASNs again minus Comcast:</li>
<li>A lot of work on CGSpace metadata (ORCID identifiers, regions, and Initiatives)</li>
<li>I noticed that MEL and CGSpace are using slightly different vocabularies for SDGs so I sent an email to Salem and Sara</li>
</ul>
</li>
</ul>
<h2 id="2023-01-21">2023-01-21</h2>
<ul>
<li>Export the Initiatives community again to perform collection mappings and country/region fixes</li>
</ul>
<h2 id="2023-01-22">2023-01-22</h2>
<ul>
<li>There has been a high load on the server for a few days, currently 8.0… and I’ve been seeing some PostgreSQL locks stuck all day:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;'</span> | grep -o -E <span style="color:#e6db74">'(dspaceWeb|dspaceApi|dspaceCli)'</span> | sort | uniq -c
</span></span></code></pre></div><ul>
<li>Looking at the locks I see they are from this morning at 5:00 AM, which is the <code>dspace checker-email</code> script
<ul>
<li>Last week I disabled the one that runs at 4:00 AM, but I guess I will experiment with disabling this one too…</li>
<li>Then I killed the PIDs of the locks</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql -c <span style="color:#e6db74">"SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';"</span> | less -S
</span></span></code></pre></div><ul>
<li>Also, I checked the age of the locks and killed anything over 1 day:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ psql &lt; locks-age.sql | grep days | less -S
</span></span></code></pre></div><ul>
<li>Then I ran all updates on the server and restarted it…</li>
<li>Salem responded to my question about the SDG mismatch between MEL and CGSpace
<ul>
<li>We agreed to use a version based on the text of <ahref="http://metadata.un.org/sdg/?lang=en">this site</a></li>
</ul>
</li>
<li>Salem is having issues with some REST API submission / updates
<ul>
<li>I updated DSpace Test with a recent CGSpace backup and created a super admin user for him to test</li>
</ul>
</li>
<li>Clean and normalize fifty-eight IFPRI records for batch import to CGSpace
<ul>
<li>I did a duplicate check and found six, so that’s good!</li>
</ul>
</li>
<li>I exported the entire CGSpace to check for missing Initiative mappings
<ul>
<li>Then I exported the Initiatives community to check for missing regions</li>
<li>Then I ran the script to check for missing ORCID identifiers</li>
<li>Then <em>finally</em>, I started a harvest on AReS</li>