<li>I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
<ul>
<li>18.207.136.176</li>
<li>185.189.36.248</li>
<li>50.118.223.78</li>
<li>52.70.76.123</li>
<li>3.236.10.11</li>
</ul>
</li>
<li>Looking at the Solr statistics for 2022-04
<ul>
<li>52.191.137.59 is Microsoft, but they are using a normal user agent and making tens of thousands of requests</li>
<li>64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc</li>
<li>185.192.69.15 is in the Netherlands and is using a normal user agent, but making excessive automated HTTP requests to paths forbidden in robots.txt</li>
<li>157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr</li>
<li>52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests</li>
<li>157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests</li>
<li>207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr</li>
<li>If I query Solr for <code>time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.</code> I see a handful of IPs that made 41,000 requests</li>
</ul>
</li>
<li>I purged 93,974 hits from these IPs using my <code>check-spider-ip-hits.sh</code> script</li>
</ul>
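<p>The per-IP counts above come from faceting the Solr statistics. As a rough sketch only, a facet query URL could be built like this in Python. The host, port, and core name in <code>SOLR_SELECT_URL</code> and the <code>ip</code> facet field are assumptions for illustration; only the <code>time</code> and <code>dns</code> fields appear in the query quoted above.</p>

```python
from urllib.parse import urlencode

# Assumption: Solr is reachable locally and the usage statistics live in a
# core named "statistics". Adjust the URL for the real deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/statistics/select"


def build_facet_url(query: str, facet_field: str, limit: int = 20) -> str:
    """Return a Solr select URL that counts hits matching `query` per `facet_field`."""
    params = {
        "q": query,
        "rows": 0,  # only facet counts are needed, not the documents themselves
        "facet": "true",
        "facet.field": facet_field,
        "facet.limit": limit,
        "facet.mincount": 1,  # skip values with no matching hits
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # The query string is the one from the notes; "ip" is an assumed field name.
    print(build_facet_url("time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.", "ip"))
```

<p>Passing a different facet field (for example a user agent field, if the core has one) would give the same kind of breakdown by user agent instead of by IP.</p>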
<ul>
<li>Now looking at the Solr statistics by user agent I see:</li>
<li>Update PostgreSQL JDBC driver to 42.3.5 in the Ansible infrastructure playbooks and deploy on DSpace Test</li>
<li>Peter asked me how many items we add to CGSpace every year
<ul>
<li>I wrote a SQL query to check the number of items grouped by their accession dates since 2009:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT EXTRACT(year from text_value::date) AS YYYY, COUNT(*) FROM metadatavalue WHERE metadata_field_id=11 GROUP BY YYYY ORDER BY YYYY DESC LIMIT 14;
</span></span></code></pre></div><ul>
<li>Note that I had an issue with casting <code>text_value</code> to date because one item had an accession date of <code>2016</code> instead of <code>2016-09-29T20:14:47Z</code>
<ul>
<li>Once I fixed that PostgreSQL was able to <a href="https://www.postgresql.org/docs/12/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT">extract() the year</a></li>
<li>There were some other methods I tried that worked also, for example <code>TO_DATE()</code>:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ SELECT EXTRACT(year from TO_DATE(text_value, 'YYYY-MM-DD"T"HH24:MI:SS"Z"')) AS YYYY, COUNT(*) FROM metadatavalue WHERE metadata_field_id=11 GROUP BY YYYY ORDER BY YYYY DESC LIMIT 14;
</span></span></code></pre></div><ul>
<li>But it seems PostgreSQL is smart enough to recognize date formatting in strings automatically when we cast so we don’t need to convert to date first</li>
<li>Another thing I noticed is that a few hundred items have accession dates from decades ago; perhaps this is due to importing items from the CGIAR Library?</li>
<li>I also submitted a <a href="https://github.com/DSpace/DSpace/pull/8288">pull request to migrate Mirage 2’s build from bower and compass to yarn and node-sass</a></li>
<li>Submit an issue to Atmire’s bug tracker inquiring about DSpace 6.4 support</li>
</ul>
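<p>The casting problem with the accession date noted above (a bare <code>2016</code> instead of <code>2016-09-29T20:14:47Z</code>) could also be caught before running the SQL. The following is a minimal Python sketch, assuming the accession dates have already been exported as plain strings; the function names and the sample values are hypothetical and not part of any existing script.</p>

```python
from collections import Counter
from datetime import datetime

# Format of a well-formed accession date such as 2016-09-29T20:14:47Z
ACCESSION_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def split_accession_dates(values):
    """Separate strings into (parsed datetimes, malformed strings)."""
    parsed, malformed = [], []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, ACCESSION_FORMAT))
        except ValueError:
            # Anything that is not a full timestamp, e.g. a bare year like "2016"
            malformed.append(value)
    return parsed, malformed


def items_per_year(values):
    """Count well-formed accession dates per year, ignoring malformed values."""
    parsed, _ = split_accession_dates(values)
    return Counter(date.year for date in parsed)


if __name__ == "__main__":
    # Hypothetical sample values for illustration only
    sample = ["2016-09-29T20:14:47Z", "2016", "2021-01-05T08:00:00Z"]
    _, malformed = split_accession_dates(sample)
    print("Malformed:", malformed)
    print("Per year:", dict(items_per_year(sample)))
```

<p>This check is stricter than the PostgreSQL cast, since it accepts only the exact timestamp format, so it would also flag date-only values that PostgreSQL can cast without complaint.</p>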
<h2 id="2022-05-10">2022-05-10</h2>
<ul>
<li>Submit an updated <a href="https://github.com/DSpace/DSpace/pull/8292">pull request to migrate Mirage 2’s build from bower and compass to npm and node-sass</a>
<ul>
<li>This one is better than the previous one because it uses npm directly, which comes with the Node.js distribution, rather than requiring the user to install yarn</li>
<li>I also updated a bunch of grunt build deps</li>
</ul>
</li>
</ul>
<h2 id="2022-05-12">2022-05-12</h2>
<ul>
<li>CGSpace meeting with Abenet and Peter
<ul>
<li>We discussed the future of CGSpace and DSpace in general in the new One CGIAR</li>
<li>We discussed how to prepare for bringing in content from the Initiatives, whether we need new metadata fields to support people from IFPRI etc</li>
<li>We discussed the need for good quality Drupal and WordPress modules so sites can harvest content from the repository</li>
<li>Peter asked me to send him a list of investors/funders/donors so he can clean it up, but also to try to align it with ROR and eventually do something like we do with country codes, adding the ROR IDs and potentially showing the badge on item views</li>
<li>We also discussed removing some Mirage 2 themes for old programs and CRPs that don’t have custom branding, i.e. only Google Analytics</li>
</ul>
</li>
<li>Export a list of donors for Peter to clean up:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-05-12-donors.csv WITH CSV HEADER;
</span></span></code></pre></div><ul>
<li>Then I created a CSV from our <code>cg-creator-identifier.xml</code> controlled vocabulary and ran it against our database with <code>add-orcid-identifiers-csv.py</code> to see if any author names by chance matched that are missing ORCIDs in CGSpace</li>