<li>He was checking the OAI base URL on OpenArchives and I had to verify the results in order to proceed to Step 2</li>
<li>Now it seems to be verified (all green): <ahref="https://www.openarchives.org/Register/ValidateSite?log=R23ZWX85">https://www.openarchives.org/Register/ValidateSite?log=R23ZWX85</a></li>
<li>We are listed in the OpenArchives list of databases conforming to OAI 2.0</li>
<li>Advise IWMI colleagues on best practices for thumbnails</li>
<li>Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
<ul>
<li>I sent a message to Jacquie from WorldFish to ask if I can help her clean up the incorrect countries and regions in their repository, for example:</li>
<li>WorldFish countries: Aegean, Euboea, Caribbean Sea, Caspian Sea, Chilwa Lake, Imo River, Indian Ocean, Indo-pacific</li>
<li>WorldFish regions: Black Sea, Arabian Sea, Caribbean Sea, California Gulf, Mediterranean Sea, North Sea, Red Sea</li>
</ul>
</li>
<li>Looking at the July Solr statistics to find the top IP and user agents, looking for anything strange
<ul>
<li>35.174.144.154 made 11,000 requests last month with the following user agent:</li>
</ul>
</li>
</ul>
<pre><codeclass="language-console"data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
</code></pre><ul>
<li>That IP is on Amazon, and from looking at the DSpace logs I don’t see them logging in at all, only scraping… so I will purge hits from that IP</li>
<li>I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, I will purge their hits too
<ul>
<li>Same deal with 130.255.162.173, which is also in Sweden and makes requests every five seconds or so</li>
<li>Same deal with 93.158.90.91, also in Sweden</li>
</ul>
</li>
<li>3.225.28.105 uses a normal-looking user agent but makes thousands of request to the REST API a few seconds apart</li>
<li>61.143.40.50 is in China and uses this hilarious user agent:</li>
</ul>
<pre><codeclass="language-console"data-lang="console">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
</code></pre><ul>
<li>47.252.80.214 is owned by Alibaba in the US and has the same user agent</li>
<li>159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours</li>
<li>95.87.154.12 seems to be a new bot with the following user agent:</li>
<li>129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.</li>
<li>217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day</li>
<li>103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human</li>
<li>There are probably more but that’s most of them over 1,000 hits last month, so I will purge them:</li>
<li>Then in OpenRefine I merged all null, blank, and en fields into the <code>en_US</code> one for each, removed all spaces, fixed invalid multi-value separators, removed everything other than ISSN/ISBNs themselves
<ul>
<li>In total it was a few thousand metadata entries or so so I had to split the CSV with <code>xsv split</code> in order to process it</li>
<li>I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes…</li>
<li>In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:</li>
<li>Then I exported the list of journals that differ and sent it to Peter for comments and corrections
<ul>
<li>I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it</li>
</ul>
</li>
<li>Convert my <code>generate-thumbnails.py</code> script to use libvips instead of Graphicsmagick
<ul>
<li>It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs</li>
<li>One drawback is that libvips uses Poppler instead of Graphicsmagick, which apparently means that it can’t work in CMYK</li>
<li>I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I’m not sure…</li>