Add notes for 2021-11-26

Alan Orth 2021-11-27 12:18:52 +02:00
parent 55c22a0d10
commit 103b6548fa
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
26 changed files with 111 additions and 32 deletions

@ -203,4 +203,40 @@ Total number of bot hits purged: 10893
- According to my notes we actually completed this in 2021-08, but for some reason we are no longer on the list and I can't validate again
- There seems to be a problem with their website because every link I try to validate says it received an HTTP 500 response from CGSpace
## 2021-11-23
- Help RTB colleagues with thumbnail issues on their [2020 Annual Report](https://hdl.handle.net/10568/114576)
- The PDF appears to be in landscape orientation and its first page is half width, so the left half of the thumbnail renders white
- I generated a new one manually with libvips and it is better:
```console
$ vipsthumbnail AR\ RTB\ 2020.pdf -s 600 -o '%s.jpg[Q=85,optimize_coding,strip]'
```
- I sent an email to the OpenArchives.org contact to ask for help with the OAI validator
- Someone responded to say that there have been a number of complaints about this on the oai-pmh mailing list recently...
- I sent an email to Pythagoras from GARDIAN to ask if they can use a more specific user agent than "Microsoft Internet Explorer" for their scraper
- He said he will change the user agent
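
A minimal sketch of how the manual `vipsthumbnail` command from the thumbnail fix above could be wrapped to handle several PDFs in one go. The `make_thumbnails` function name and the skip-on-missing-file behavior are assumptions for illustration and are not part of the original notes; the `vipsthumbnail` options are the same ones used in the manual command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original notes): regenerate JPEG
# thumbnails for every PDF passed as an argument.
# Options mirror the manual command: fit within 600 pixels, JPEG quality 85,
# optimized Huffman coding, metadata stripped. "%s" in the output pattern is
# replaced by the input file's base name without its extension.
make_thumbnails() {
    local pdf
    for pdf in "$@"; do
        if [ ! -f "$pdf" ]; then
            echo "skipping missing file: $pdf" >&2
            continue
        fi
        vipsthumbnail "$pdf" -s 600 -o '%s.jpg[Q=85,optimize_coding,strip]'
    done
}
```

For example, `make_thumbnails AR\ RTB\ 2020.pdf` would run the same command as above on that one file.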
## 2021-11-24
- I had an idea to check our Solr statistics for hits from all the IPs that I have listed in nginx as being bots
- Other than a few that I ruled out because they *may* be humans, these all made their requests within a single month or sent no user agent, which is highly suspicious:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
Found 3 hits from 188.134.31.88 in statistics
Total number of hits from bots: 295492
```
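
As a quick sanity check, the per-IP counts in the output above can be summed to confirm the reported total. This is only a sketch that parses lines shaped like `Found N hits from IP in statistics`; the `sum_hits` helper is a hypothetical name and is not part of `check-spider-ip-hits.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add up the counts on "Found N hits ..." lines read
# from standard input and ignore every other line (such as the total line).
sum_hits() {
    awk '$1 == "Found" && $3 == "hits" { total += $2 } END { print total + 0 }'
}

# The per-IP lines copied from the output above.
sum_hits <<'EOF'
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
Found 3 hits from 188.134.31.88 in statistics
EOF
# prints 295492
```

The sum agrees with the `Total number of hits from bots` line printed by the script.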
<!-- vim: set sw=2 ts=2: -->

@ -18,7 +18,7 @@ $ zstd statistics-2019.json
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-11/" />
<meta property="article:published_time" content="2021-11-02T22:27:07+02:00" />
<meta property="article:modified_time" content="2021-11-21T13:45:30+02:00" />
<meta property="article:modified_time" content="2021-11-22T16:47:50+02:00" />
@ -42,9 +42,9 @@ $ zstd statistics-2019.json
"@type": "BlogPosting",
"headline": "November, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-11/",
"wordCount": "1339",
"wordCount": "1604",
"datePublished": "2021-11-02T22:27:07+02:00",
"dateModified": "2021-11-21T13:45:30+02:00",
"dateModified": "2021-11-22T16:47:50+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -335,7 +335,50 @@ Purging 10893 hits from 87.203.87.141 in statistics
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
<h2 id="2021-11-23">2021-11-23</h2>
<ul>
<li>Help RTB colleagues with thumbnail issues on their <a href="https://hdl.handle.net/10568/114576">2020 Annual Report</a>
<ul>
<li>The PDF appears to be in landscape orientation and its first page is half width, so the left half of the thumbnail renders white</li>
<li>I generated a new one manually with libvips and it is better:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ vipsthumbnail AR<span style="color:#ae81ff">\ </span>RTB<span style="color:#ae81ff">\ </span>2020.pdf -s <span style="color:#ae81ff">600</span> -o <span style="color:#e6db74">&#39;%s.jpg[Q=85,optimize_coding,strip]&#39;</span>
</code></pre></div><ul>
<li>I sent an email to the OpenArchives.org contact to ask for help with the OAI validator
<ul>
<li>Someone responded to say that there have been a number of complaints about this on the oai-pmh mailing list recently&hellip;</li>
</ul>
</li>
<li>I sent an email to Pythagoras from GARDIAN to ask if they can use a more specific user agent than &ldquo;Microsoft Internet Explorer&rdquo; for their scraper
<ul>
<li>He said he will change the user agent</li>
</ul>
</li>
</ul>
<h2 id="2021-11-24">2021-11-24</h2>
<ul>
<li>I had an idea to check our Solr statistics for hits from all the IPs that I have listed in nginx as being bots
<ul>
<li>Other than a few that I ruled out because they <em>may</em> be humans, these all made their requests within a single month or sent no user agent, which is highly suspicious:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
Found 3 hits from 188.134.31.88 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 295492
</code></pre></div><!-- raw HTML omitted -->

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />
<meta property="og:updated_time" content="2021-11-22T16:47:50+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />
<meta property="og:updated_time" content="2021-11-22T16:47:50+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />
<meta property="og:updated_time" content="2021-11-22T16:47:50+02:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />
<meta property="og:updated_time" content="2021-11-22T16:47:50+02:00" />

@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
<lastmod>2021-11-22T16:47:50+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
<lastmod>2021-11-22T16:47:50+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
<lastmod>2021-11-22T16:47:50+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-11/</loc>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
<lastmod>2021-11-22T16:47:50+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
<lastmod>2021-11-22T16:47:50+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-10/</loc>
<lastmod>2021-11-01T10:48:13+02:00</lastmod>