Compare commits

No commits in common. "61a012edeedc8f698e921990e6e9ef5d05c01bf8" and "55c22a0d10f89334638474138a8a1fbad80ba1df" have entirely different histories.

26 changed files with 31 additions and 135 deletions

@ -203,52 +203,4 @@ Total number of bot hits purged: 10893
- According to my notes we actually completed this in 2021-08, but for some reason we are no longer on the list and I can't validate again
- There seems to be a problem with their website because every link I try to validate says it received an HTTP 500 response from CGSpace
## 2021-11-23
- Help RTB colleagues with thumbnail issues on their [2020 Annual Report](https://hdl.handle.net/10568/114576)
- The PDF seems to be in landscape mode or something and the first page is half width, so the thumbnail renders with the left half being white
- I generated a new one manually with libvips and it is better:
```console
$ vipsthumbnail AR\ RTB\ 2020.pdf -s 600 -o '%s.jpg[Q=85,optimize_coding,strip]'
```
- I sent an email to the OpenArchives.org contact to ask for help with the OAI validator
- Someone responded to say that there have been a number of complaints about this on the oai-pmh mailing list recently...
- I sent an email to Pythagoras from GARDIAN to ask if they can use a more specific user agent than "Microsoft Internet Explorer" for their scraper
- He said he will change the user agent
## 2021-11-24
- I had an idea to check our Solr statistics for hits from all the IPs that I have listed in nginx as being bots
- Other than a few that I ruled out that *may* be humans, these are all making requests within one month or with no user agent, which is highly suspicious:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
Found 3 hits from 188.134.31.88 in statistics
Total number of hits from bots: 295492
```
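As a cross-check on the total above, here is a minimal Python sketch that parses output in the `Found N hits from IP in statistics` format and sums the counts. It is not part of `check-spider-ip-hits.sh`; the function name `parse_hits` and the regular expression are assumptions made for illustration, and the sample lines are copied from the output above.

```python
import re

# Sample lines copied from the check-spider-ip-hits.sh output above
OUTPUT = """\
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
Found 3 hits from 188.134.31.88 in statistics
"""

# Assumed line format: "Found <count> hits from <ip> in statistics"
HIT_LINE = re.compile(r"^Found (\d+) hits from (\S+) in statistics$")


def parse_hits(text):
    """Return a dict mapping each IP address to its hit count."""
    hits = {}
    for line in text.splitlines():
        match = HIT_LINE.match(line.strip())
        if match:
            hits[match.group(2)] = int(match.group(1))
    return hits


hits = parse_hits(OUTPUT)
total = sum(hits.values())
print(f"Total number of hits from bots: {total}")
```

Sorting `hits.items()` by count would show that most of the traffic comes from a handful of addresses.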
## 2021-11-27
- Peter sent me corrections for the authors that I had sent him back in 2021-09
- I did a quick sanity check on them with OpenRefine, filtering out all the metadata with no replacements, then ran them through my csv-metadata-quality script
- Then I imported them into my local instance as a test:
```console
$ ./ilri/fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
```
- Then I imported to CGSpace and started a full Discovery re-index
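The OpenRefine filtering step could also be approximated in Python. This is a minimal sketch, assuming the CSV has a `dc.contributor.author` column and a `correct` column (inferred from the `-f` and `-t` arguments above). The function name and the sample rows are hypothetical, and this is not what was actually run:

```python
import csv
import io


def rows_with_replacements(csv_text,
                           from_field="dc.contributor.author",
                           to_field="correct"):
    """Keep only rows whose replacement is non-empty and differs from the original."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        original = (row.get(from_field) or "").strip()
        replacement = (row.get(to_field) or "").strip()
        # Skip rows with no replacement or with an unchanged value
        if replacement and replacement != original:
            kept.append({from_field: original, to_field: replacement})
    return kept


# Hypothetical sample data, not taken from the real corrections file
SAMPLE = """\
dc.contributor.author,correct
"Doe, J.","Doe, Jane"
"Roe, Richard",
"Poe, Pat","Poe, Pat"
"""

kept = rows_with_replacements(SAMPLE)
for row in kept:
    print(row["dc.contributor.author"], "->", row["correct"])
```

Only the first sample row survives, since the second has no replacement and the third is unchanged.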
<!-- vim: set sw=2 ts=2: -->

@ -18,7 +18,7 @@ $ zstd statistics-2019.json
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-11/" />
<meta property="article:published_time" content="2021-11-02T22:27:07+02:00" />
<meta property="article:modified_time" content="2021-11-27T12:18:52+02:00" />
<meta property="article:modified_time" content="2021-11-21T13:45:30+02:00" />
@ -42,9 +42,9 @@ $ zstd statistics-2019.json
"@type": "BlogPosting",
"headline": "November, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-11/",
"wordCount": "1682",
"wordCount": "1339",
"datePublished": "2021-11-02T22:27:07+02:00",
"dateModified": "2021-11-27T12:18:52+02:00",
"dateModified": "2021-11-21T13:45:30+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -335,62 +335,6 @@ Purging 10893 hits from 87.203.87.141 in statistics
</ul>
</li>
</ul>
<h2 id="2021-11-23">2021-11-23</h2>
<ul>
<li>Help RTB colleagues with thumbnail issues on their <a href="https://hdl.handle.net/10568/114576">2020 Annual Report</a>
<ul>
<li>The PDF seems to be in landscape mode or something and the first page is half width, so the thumbnail renders with the left half being white</li>
<li>I generated a new one manually with libvips and it is better:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ vipsthumbnail AR<span style="color:#ae81ff">\ </span>RTB<span style="color:#ae81ff">\ </span>2020.pdf -s <span style="color:#ae81ff">600</span> -o <span style="color:#e6db74">&#39;%s.jpg[Q=85,optimize_coding,strip]&#39;</span>
</code></pre></div><ul>
<li>I sent an email to the OpenArchives.org contact to ask for help with the OAI validator
<ul>
<li>Someone responded to say that there have been a number of complaints about this on the oai-pmh mailing list recently&hellip;</li>
</ul>
</li>
<li>I sent an email to Pythagoras from GARDIAN to ask if they can use a more specific user agent than &ldquo;Microsoft Internet Explorer&rdquo; for their scraper
<ul>
<li>He said he will change the user agent</li>
</ul>
</li>
</ul>
<h2 id="2021-11-24">2021-11-24</h2>
<ul>
<li>I had an idea to check our Solr statistics for hits from all the IPs that I have listed in nginx as being bots
<ul>
<li>Other than a few that I ruled out that <em>may</em> be humans, these are all making requests within one month or with no user agent, which is highly suspicious:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
Found 3 hits from 188.134.31.88 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 295492
</code></pre></div><h2 id="2021-11-27">2021-11-27</h2>
<ul>
<li>Peter sent me corrections for the authors that I had sent him back in 2021-09
<ul>
<li>I did a quick sanity check on them with OpenRefine, filtering out all the metadata with no replacements, then ran them through my csv-metadata-quality script</li>
<li>Then I imported them into my local instance as a test:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f dc.contributor.author -t <span style="color:#e6db74">&#39;correct&#39;</span> -m <span style="color:#ae81ff">3</span>
</code></pre></div><ul>
<li>Then I imported to CGSpace and started a full Discovery re-index</li>
</ul>
<!-- raw HTML omitted -->

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2021-11-27T12:18:52+02:00</lastmod>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2021-11-27T12:18:52+02:00</lastmod>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2021-11-27T12:18:52+02:00</lastmod>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-11/</loc>
<lastmod>2021-11-27T12:18:52+02:00</lastmod>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2021-11-27T12:18:52+02:00</lastmod>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-10/</loc>
<lastmod>2021-11-01T10:48:13+02:00</lastmod>