mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-29 01:48:19 +01:00
Compare commits
No commits in common. "61a012edeedc8f698e921990e6e9ef5d05c01bf8" and "55c22a0d10f89334638474138a8a1fbad80ba1df" have entirely different histories.
61a012edee
...
55c22a0d10
@@ -203,52 +203,4 @@ Total number of bot hits purged: 10893
- According to my notes we actually completed this in 2021-08, but for some reason we are no longer on the list and I can't validate again
- There seems to be a problem with their website because every link I try to validate says it received an HTTP 500 response from CGSpace

## 2021-11-23

- Help RTB colleagues with thumbnail issues on their [2020 Annual Report](https://hdl.handle.net/10568/114576)
  - The PDF seems to be in landscape mode or something and the first page is half width, so the thumbnail renders with the left half being white
  - I generated a new one manually with libvips and it is better:

```console
$ vipsthumbnail AR\ RTB\ 2020.pdf -s 600 -o '%s.jpg[Q=85,optimize_coding,strip]'
```
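
If more PDFs ever need the same treatment, the invocation above could be wrapped in a small script. This is only a minimal sketch in Python, assuming `vipsthumbnail` is on the `PATH`; the helper names `thumbnail_command` and `make_thumbnail` are hypothetical, and the options are simply the ones from the command above:

```python
import shlex
import subprocess


def thumbnail_command(pdf_path, size=600, quality=85):
    """Build the vipsthumbnail argument list used above for one PDF."""
    # vipsthumbnail replaces %s with the input file name minus its suffix.
    output = f"%s.jpg[Q={quality},optimize_coding,strip]"
    return ["vipsthumbnail", pdf_path, "-s", str(size), "-o", output]


def make_thumbnail(pdf_path):
    """Run vipsthumbnail for one PDF; raises CalledProcessError on failure."""
    # Passing a list (no shell) avoids quoting problems with spaces in names.
    subprocess.run(thumbnail_command(pdf_path), check=True)


command = thumbnail_command("AR RTB 2020.pdf")
# Print the equivalent shell command for inspection without running it.
print(shlex.join(command))
```

Calling `make_thumbnail("AR RTB 2020.pdf")` would then run the same command as above without any manual escaping of the spaces in the file name.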

- I sent an email to the OpenArchives.org contact to ask for help with the OAI validator
  - Someone responded to say that there have been a number of complaints about this on the oai-pmh mailing list recently...
- I sent an email to Pythagoras from GARDIAN to ask if they can use a more specific user agent than "Microsoft Internet Explorer" for their scraper
  - He said he will change the user agent

## 2021-11-24

- I had an idea to check our Solr statistics for hits from all the IPs that I have listed in nginx as being bots
  - Other than a few that I ruled out that *may* be humans, these are all making requests within one month or with no user agent, which is highly suspicious:

```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
Found 3 hits from 188.134.31.88 in statistics

Total number of hits from bots: 295492
```
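
The per-IP counts above can be cross-checked against the reported total. This is a minimal sketch in Python, assuming the script's output has been captured as text; the `Found N hits from IP in statistics` line format is taken from the output above, while `parse_bot_hits` is a hypothetical helper and not part of `check-spider-ip-hits.sh`:

```python
import re

# Matches lines such as "Found 8352 hits from 138.201.49.199 in statistics".
HIT_LINE = re.compile(r"^Found (\d+) hits from (\S+) in statistics$")


def parse_bot_hits(output):
    """Return a dict mapping each IP address to its hit count."""
    hits = {}
    for line in output.splitlines():
        match = HIT_LINE.match(line.strip())
        if match:
            hits[match.group(2)] = int(match.group(1))
    return hits


OUTPUT = """\
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
Found 3 hits from 188.134.31.88 in statistics

Total number of hits from bots: 295492
"""

hits = parse_bot_hits(OUTPUT)
total = sum(hits.values())
# List the IPs from most to fewest hits, then the recomputed total.
for ip, count in sorted(hits.items(), key=lambda item: item[1], reverse=True):
    print(f"{count:>7} {ip}")
print(f"{total:>7} total")
```

The recomputed sum matches the 295492 reported by the script, and sorting makes it obvious that two addresses (104.154.216.0 and 163.172.68.99) account for about 88% of the hits.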

## 2021-11-27

- Peter sent me corrections for the authors that I had sent him back in 2021-09
  - I did a quick sanity check on them with OpenRefine, filtering out all the metadata with no replacements, then ran through my csv-metadata-quality script
  - Then I imported them into my local instance as a test:

```console
$ ./ilri/fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
```

- Then I imported to CGSpace and started a full Discovery re-index

<!-- vim: set sw=2 ts=2: -->

@@ -18,7 +18,7 @@ $ zstd statistics-2019.json
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-11/" />
<meta property="article:published_time" content="2021-11-02T22:27:07+02:00" />
<meta property="article:modified_time" content="2021-11-27T12:18:52+02:00" />
<meta property="article:modified_time" content="2021-11-21T13:45:30+02:00" />
@@ -42,9 +42,9 @@ $ zstd statistics-2019.json
"@type": "BlogPosting",
"headline": "November, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-11/",
"wordCount": "1682",
"wordCount": "1339",
"datePublished": "2021-11-02T22:27:07+02:00",
"dateModified": "2021-11-27T12:18:52+02:00",
"dateModified": "2021-11-21T13:45:30+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -335,62 +335,6 @@ Purging 10893 hits from 87.203.87.141 in statistics
</ul>
</li>
</ul>
<h2 id="2021-11-23">2021-11-23</h2>
<ul>
<li>Help RTB colleagues with thumbnail issues on their <a href="https://hdl.handle.net/10568/114576">2020 Annual Report</a>
<ul>
<li>The PDF seems to be in landscape mode or something and the first page is half width, so the thumbnail renders with the left half being white</li>
<li>I generated a new one manually with libvips and it is better:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ vipsthumbnail AR<span style="color:#ae81ff">\ </span>RTB<span style="color:#ae81ff">\ </span>2020.pdf -s <span style="color:#ae81ff">600</span> -o <span style="color:#e6db74">'%s.jpg[Q=85,optimize_coding,strip]'</span>
</code></pre></div><ul>
<li>I sent an email to the OpenArchives.org contact to ask for help with the OAI validator
<ul>
<li>Someone responded to say that there have been a number of complaints about this on the oai-pmh mailing list recently…</li>
</ul>
</li>
<li>I sent an email to Pythagoras from GARDIAN to ask if they can use a more specific user agent than “Microsoft Internet Explorer” for their scraper
<ul>
<li>He said he will change the user agent</li>
</ul>
</li>
</ul>
<h2 id="2021-11-24">2021-11-24</h2>
<ul>
<li>I had an idea to check our Solr statistics for hits from all the IPs that I have listed in nginx as being bots
<ul>
<li>Other than a few that I ruled out that <em>may</em> be humans, these are all making requests within one month or with no user agent, which is highly suspicious:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt
Found 8352 hits from 138.201.49.199 in statistics
Found 9374 hits from 78.46.89.18 in statistics
Found 2112 hits from 93.179.69.74 in statistics
Found 1 hits from 31.6.77.23 in statistics
Found 5 hits from 34.209.213.122 in statistics
Found 86772 hits from 163.172.68.99 in statistics
Found 77 hits from 163.172.70.248 in statistics
Found 15842 hits from 163.172.71.24 in statistics
Found 172954 hits from 104.154.216.0 in statistics
Found 3 hits from 188.134.31.88 in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 295492
</code></pre></div><h2 id="2021-11-27">2021-11-27</h2>
<ul>
<li>Peter sent me corrections for the authors that I had sent him back in 2021-09
<ul>
<li>I did a quick sanity check on them with OpenRefine, filtering out all the metadata with no replacements, then ran through my csv-metadata-quality script</li>
<li>Then I imported them into my local instance as a test:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/fix-metadata-values.py -i /tmp/authors.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span> -f dc.contributor.author -t <span style="color:#e6db74">'correct'</span> -m <span style="color:#ae81ff">3</span>
</code></pre></div><ul>
<li>Then I imported to CGSpace and started a full Discovery re-index</li>
</ul>
<!-- raw HTML omitted -->

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-11-27T12:18:52+02:00" />
<meta property="og:updated_time" content="2021-11-21T13:45:30+02:00" />

@@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2021-11-27T12:18:52+02:00</lastmod>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2021-11-27T12:18:52+02:00</lastmod>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2021-11-27T12:18:52+02:00</lastmod>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-11/</loc>
<lastmod>2021-11-27T12:18:52+02:00</lastmod>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2021-11-27T12:18:52+02:00</lastmod>
<lastmod>2021-11-21T13:45:30+02:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-10/</loc>
<lastmod>2021-11-01T10:48:13+02:00</lastmod>