Update notes

This commit is contained in:
Alan Orth 2022-07-21 10:03:16 +03:00
parent daf209efb9
commit a0456cd0f7
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
29 changed files with 148 additions and 35 deletions

View File

@ -354,4 +354,53 @@ $ wc -l /tmp/bot-ips.txt
1946968 /tmp/bot-ips.txt
```
- I started running `check-spider-ip-hits.sh` with the 1946968 IPs and left it running in dry run mode
## 2022-07-19
- Patrizio and Fabio emailed me to ask if their IP was banned from CGSpace
- It's one of the Hetzner ones so I said yes definitely, and asked more about how they are using the API
- Add ORCID identifer for Ram Dhulipala, Lilian Wambua, and Dan Masiga to CGSpace and tag them and some other existing items:
```console
dc.contributor.author,cg.creator.identifier
"Dhulipala, Ram K","Ram Dhulipala: 0000-0002-9720-3247"
"Dhulipala, Ram","Ram Dhulipala: 0000-0002-9720-3247"
"Dhulipala, R.","Ram Dhulipala: 0000-0002-9720-3247"
"Wambua, Lillian","Lillian Wambua: 0000-0003-3632-7411"
"Wambua, Lilian","Lillian Wambua: 0000-0003-3632-7411"
"Masiga, D.K.","Daniel Masiga: 0000-0001-7513-0887"
"Masiga, Daniel K.","Daniel Masiga: 0000-0001-7513-0887"
"Jores, Joerg","Joerg Jores: 0000-0003-3790-5746"
"Schieck, Elise","Elise Schieck: 0000-0003-1756-6337"
"Schieck, Elise G.","Elise Schieck: 0000-0003-1756-6337"
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-07-19-add-orcids.csv -db dspace -u dspace -p 'fuuu'
```
- Review the AfricaRice records from earlier this month again
- I found one more duplicate and one more suspicious item, so the total after removing those is now forty-two
- I took all the ~560 IPs that had hits so far in `check-spider-ip-hits.sh` above (about 270,000 into the list of 1946968 above) and ran them directly on CGSpace
- This purged 199,032 hits from Solr, very many of which were from Qualys, but also that Chinese bot on 124.17.34.0/24 that was grabbing PDFs a few years ago which I blocked in nginx, but never purged the hits from
- Then I deleted all IPs up to the last one where I found hits in the large file of 1946968 IPs and re-started the script
## 2022-07-20
- Did a few more minor edits to the forty-two AfricaRice records (including generating thumbnails for the handful that are Creative Commons licensed) then did a test import on my local instance
- Once it worked well I did an import to CGSpace:
```console
$ dspace import -a -e fuuu@example.com -m 2022-07-20-africarice.map -s /tmp/SimpleArchiveFormat
```
- Also make edits to ~62 affiliations on CGSpace because I noticed they were messed up
- Extract another ~1,600 IPs that had hits since I started the second round of `check-spider-ip-hits.sh` yesterday and purge another 303,594 hits
- This is about 999846 into the original list of 1946968 from yesterday
- A metric fuck ton of the IPs in this batch were from Hetzner
## 2022-07-21
- Extract another ~2,100 IPs that had hits since I started the third round of `check-spider-ip-hits.sh` last night and purge another 763,843 hits
- This is about 1441221 into the original list of 1946968 from two days ago
- Again these are overwhelmingly Hetzner (not surprising since my bot-networks.conf file in nginx is mostly Hetzner)
<!-- vim: set sw=2 ts=2: -->

View File

@ -19,7 +19,7 @@ Also, the trgm functions I&rsquo;ve used before are case insensitive, but Levens
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-07/" />
<meta property="article:published_time" content="2022-07-02T14:07:36+03:00" />
<meta property="article:modified_time" content="2022-07-18T12:32:23+03:00" />
<meta property="article:modified_time" content="2022-07-18T16:45:55+03:00" />
@ -44,9 +44,9 @@ Also, the trgm functions I&rsquo;ve used before are case insensitive, but Levens
"@type": "BlogPosting",
"headline": "July, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-07/",
"wordCount": "2266",
"wordCount": "2679",
"datePublished": "2022-07-02T14:07:36+03:00",
"dateModified": "2022-07-18T12:32:23+03:00",
"dateModified": "2022-07-18T16:45:55+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -521,7 +521,71 @@ Also, the trgm functions I&rsquo;ve used before are case insensitive, but Levens
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">while</span> read -r line; <span style="color:#66d9ef">do</span> prips <span style="color:#e6db74">&#34;</span>$line<span style="color:#e6db74">&#34;</span> | sed -e <span style="color:#e6db74">&#39;1d; $d&#39;</span>; <span style="color:#66d9ef">done</span> &lt; /tmp/bot-networks.conf &gt; /tmp/bot-ips.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/bot-ips.txt
</span></span><span style="display:flex;"><span>1946968 /tmp/bot-ips.txt
</span></span></code></pre></div><!-- raw HTML omitted -->
</span></span></code></pre></div><ul>
<li>I started running <code>check-spider-ip-hits.sh</code> with the 1946968 IPs and left it running in dry run mode</li>
</ul>
<h2 id="2022-07-19">2022-07-19</h2>
<ul>
<li>Patrizio and Fabio emailed me to ask if their IP was banned from CGSpace
<ul>
<li>It&rsquo;s one of the Hetzner ones so I said yes definitely, and asked more about how they are using the API</li>
</ul>
</li>
<li>Add ORCID identifer for Ram Dhulipala, Lilian Wambua, and Dan Masiga to CGSpace and tag them and some other existing items:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier
</span></span><span style="display:flex;"><span>&#34;Dhulipala, Ram K&#34;,&#34;Ram Dhulipala: 0000-0002-9720-3247&#34;
</span></span><span style="display:flex;"><span>&#34;Dhulipala, Ram&#34;,&#34;Ram Dhulipala: 0000-0002-9720-3247&#34;
</span></span><span style="display:flex;"><span>&#34;Dhulipala, R.&#34;,&#34;Ram Dhulipala: 0000-0002-9720-3247&#34;
</span></span><span style="display:flex;"><span>&#34;Wambua, Lillian&#34;,&#34;Lillian Wambua: 0000-0003-3632-7411&#34;
</span></span><span style="display:flex;"><span>&#34;Wambua, Lilian&#34;,&#34;Lillian Wambua: 0000-0003-3632-7411&#34;
</span></span><span style="display:flex;"><span>&#34;Masiga, D.K.&#34;,&#34;Daniel Masiga: 0000-0001-7513-0887&#34;
</span></span><span style="display:flex;"><span>&#34;Masiga, Daniel K.&#34;,&#34;Daniel Masiga: 0000-0001-7513-0887&#34;
</span></span><span style="display:flex;"><span>&#34;Jores, Joerg&#34;,&#34;Joerg Jores: 0000-0003-3790-5746&#34;
</span></span><span style="display:flex;"><span>&#34;Schieck, Elise&#34;,&#34;Elise Schieck: 0000-0003-1756-6337&#34;
</span></span><span style="display:flex;"><span>&#34;Schieck, Elise G.&#34;,&#34;Elise Schieck: 0000-0003-1756-6337&#34;
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-07-19-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span>
</span></span></code></pre></div><ul>
<li>Review the AfricaRice records from earlier this month again
<ul>
<li>I found one more duplicate and one more suspicious item, so the total after removing those is now forty-two</li>
</ul>
</li>
<li>I took all the ~560 IPs that had hits so far in <code>check-spider-ip-hits.sh</code> above (about 270,000 into the list of 1946968 above) and ran them directly on CGSpace
<ul>
<li>This purged 199,032 hits from Solr, very many of which were from Qualys, but also that Chinese bot on 124.17.34.0/24 that was grabbing PDFs a few years ago which I blocked in nginx, but never purged the hits from</li>
<li>Then I deleted all IPs up to the last one where I found hits in the large file of 1946968 IPs and re-started the script</li>
</ul>
</li>
</ul>
<h2 id="2022-07-20">2022-07-20</h2>
<ul>
<li>Did a few more minor edits to the forty-two AfricaRice records (including generating thumbnails for the handful that are Creative Commons licensed) then did a test import on my local instance
<ul>
<li>Once it worked well I did an import to CGSpace:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace import -a -e fuuu@example.com -m 2022-07-20-africarice.map -s /tmp/SimpleArchiveFormat
</span></span></code></pre></div><ul>
<li>Also make edits to ~62 affiliations on CGSpace because I noticed they were messed up</li>
<li>Extract another ~1,600 IPs that had hits since I started the second round of <code>check-spider-ip-hits.sh</code> yesterday and purge another 303,594 hits
<ul>
<li>This is about 999846 into the original list of 1946968 from yesterday</li>
<li>A metric fuck ton of the IPs in this batch were from Hetzner</li>
</ul>
</li>
</ul>
<h2 id="2022-07-21">2022-07-21</h2>
<ul>
<li>Extract another ~2,100 IPs that had hits since I started the third round of <code>check-spider-ip-hits.sh</code> last night and purge another 763,843 hits
<ul>
<li>This is about 1441221 into the original list of 1946968 from two days ago</li>
<li>Again these are overwhelmingly Hetzner (not surprising since my bot-networks.conf file in nginx is mostly Hetzner)</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-07-18T12:32:23+03:00" />
<meta property="og:updated_time" content="2022-07-18T16:45:55+03:00" />

View File

@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2022-07-18T12:32:23+03:00</lastmod>
<lastmod>2022-07-18T16:45:55+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2022-07-18T12:32:23+03:00</lastmod>
<lastmod>2022-07-18T16:45:55+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-07/</loc>
<lastmod>2022-07-18T12:32:23+03:00</lastmod>
<lastmod>2022-07-18T16:45:55+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2022-07-18T12:32:23+03:00</lastmod>
<lastmod>2022-07-18T16:45:55+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2022-07-18T12:32:23+03:00</lastmod>
<lastmod>2022-07-18T16:45:55+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-06/</loc>
<lastmod>2022-07-04T09:25:14+03:00</lastmod>