mirror of https://github.com/alanorth/cgspace-notes.git (synced 2024-11-25 16:08:19 +01:00)
Add notes for 2020-07-20
parent 49d08e2db9
commit 501c282ecb
@ -540,4 +540,34 @@ $ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dsp
- I said I would try to do a migration on DSpace Test with more of CGSpace's Solr data to try to approximate how much of our data would be affected
- I also asked them about the Tomcat 8.5 issue with CUA, as well as the CUA group name issue that I had originally asked about in April

## 2020-07-20

- Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:

```
217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] "GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 "-" "ILRI Livestock Website Publications importer BOT"
```
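
As a quick sanity check, the user agent in the log line above can be pulled out and tested against a case-insensitive `bot` pattern. This is only a sketch: the regular expression for the log format and the `parse_line` and `looks_like_bot` helpers are assumptions made up for illustration, not how nginx, DSpace, or Solr actually classify requests.

```python
import re

# Assumed pattern for nginx's "combined" access log format; group names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields from one access log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def looks_like_bot(agent):
    """Hypothetical check: True if the user agent contains "bot" in any letter case."""
    return re.search(r"bot", agent, re.IGNORECASE) is not None


line = (
    '217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] '
    '"GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 '
    '"-" "ILRI Livestock Website Publications importer BOT"'
)
fields = parse_line(line)
print(fields["ip"], fields["agent"], looks_like_bot(fields["agent"]))
```

A check like this only shows that the string would match a simple pattern; it says nothing about why hits from this agent were still recorded.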

- I still see 12,000 records in Solr from this user agent, though.
  - I wonder why the DSpace bot list didn't catch those, because the user agent contains "bot", which should cause the hit not to be logged in Solr
- I purged ~30,000 hits from Solr statistics based on the IPs above, but also for some agents like Drupal (which isn't in the list yet) and OgScrper (which has been in the list since 2020-03)
- Some of my user agent patterns had been incorporated into COUNTER-Robots in 2020-07, but not all
  - I closed the [old pull request](https://github.com/atmire/COUNTER-Robots/pull/34) and created a [new one](https://github.com/atmire/COUNTER-Robots/pull/36)
  - Then I updated the lists in the `5_x-prod` and 6.x branches
- I re-ran the `check-spider-hits.sh` script with the new lists and purged about 14,000 more stats hits for each of several years (2020, 2019, 2018, 2017, 2016), around 70,000 total
- I looked at the [CLARISA](https://clarisa.cgiar.org/) institutions list again, since I hadn't looked at it in over six months:

```
$ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
```

- The API still needs a key unless you query from the Swagger web interface
  - They currently have 3,469 institutions...
  - Also, they still combine multiple text names into one string along with acronyms and countries:
    - Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung / Federal Ministry of Economic Cooperation and Development (Germany)
    - Ministerio del Ambiente / Ministry of Environment (Peru)
    - Carthage University / Université de Carthage
    - Sweet Potato Research Institute (SPRI) of Chinese Academy of Agricultural Sciences (CAAS)
  - I think the ROR is much better in every possible way
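
The jq, awk, sed and csvcut pipeline above could also be sketched in Python. This assumes, as the jq filter implies, that the saved response is a JSON array of objects that each have a `name` key; the `institutions_to_csv` function and the sample input are hypothetical and only reuse two of the names quoted above.

```python
import csv
import io
import json


def institutions_to_csv(json_text, out):
    """Write id,name rows for every object in a JSON array of institutions.

    Assumes each object has a "name" key, as the jq filter above implies.
    Returns the number of institutions written.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "name"])
    count = 0
    # Number the rows starting at 1
    for count, institution in enumerate(json.loads(json_text), start=1):
        writer.writerow([count, institution["name"]])
    return count


# Sample in the assumed shape of the API response
sample = json.dumps([
    {"name": "Ministerio del Ambiente / Ministry of Environment (Peru)"},
    {"name": "Carthage University / Université de Carthage"},
])
buffer = io.StringIO()
total = institutions_to_csv(sample, buffer)
print(total)
print(buffer.getvalue())
```

One difference from the shell pipeline: names that contain a colon are written whole here, whereas the `awk -F:` step splits each line on colons and prints only the second field.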

<!-- vim: set sw=2 ts=2: -->
@ -20,7 +20,7 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-07/" />
<meta property="article:published_time" content="2020-07-01T10:53:54+03:00" />
<meta property="article:modified_time" content="2020-07-14T10:57:49+03:00" />
<meta property="article:modified_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="July, 2020"/>
@ -45,9 +45,9 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
"@type": "BlogPosting",
"headline": "July, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-07/",
"wordCount": "3347",
"wordCount": "3664",
"datePublished": "2020-07-01T10:53:54+03:00",
"dateModified": "2020-07-14T10:57:49+03:00",
"dateModified": "2020-07-15T15:42:23+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -659,6 +659,44 @@ COPY 186
</ul>
</li>
</ul>
<h2 id="2020-07-20">2020-07-20</h2>
<ul>
<li>Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:</li>
</ul>
<pre><code>217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] "GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 "-" "ILRI Livestock Website Publications importer BOT"
</code></pre><ul>
<li>I still see 12,000 records in Solr from this user agent, though.
<ul>
<li>I wonder why the DSpace bot list didn’t catch those, because the user agent contains “bot”, which should cause the hit not to be logged in Solr</li>
</ul>
</li>
<li>I purged ~30,000 hits from Solr statistics based on the IPs above, but also for some agents like Drupal (which isn’t in the list yet) and OgScrper (which has been in the list since 2020-03)</li>
<li>Some of my user agent patterns had been incorporated into COUNTER-Robots in 2020-07, but not all
<ul>
<li>I closed the <a href="https://github.com/atmire/COUNTER-Robots/pull/34">old pull request</a> and created a <a href="https://github.com/atmire/COUNTER-Robots/pull/36">new one</a></li>
<li>Then I updated the lists in the <code>5_x-prod</code> and 6.x branches</li>
</ul>
</li>
<li>I re-ran the <code>check-spider-hits.sh</code> script with the new lists and purged about 14,000 more stats hits for each of several years (2020, 2019, 2018, 2017, 2016), around 70,000 total</li>
<li>I looked at the <a href="https://clarisa.cgiar.org/">CLARISA</a> institutions list again, since I hadn’t looked at it in over six months:</li>
</ul>
<pre><code>$ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
</code></pre><ul>
<li>The API still needs a key unless you query from the Swagger web interface
<ul>
<li>They currently have 3,469 institutions…</li>
<li>Also, they still combine multiple text names into one string along with acronyms and countries:
<ul>
<li>Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung / Federal Ministry of Economic Cooperation and Development (Germany)</li>
<li>Ministerio del Ambiente / Ministry of Environment (Peru)</li>
<li>Carthage University / Université de Carthage</li>
<li>Sweet Potato Research Institute (SPRI) of Chinese Academy of Agricultural Sciences (CAAS)</li>
</ul>
</li>
<li>I think the ROR is much better in every possible way</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Categories"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-07-14T10:57:49+03:00" />
<meta property="og:updated_time" content="2020-07-15T15:42:23+03:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
@ -4,27 +4,27 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-07-14T10:57:49+03:00</lastmod>
<lastmod>2020-07-15T15:42:23+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-07-14T10:57:49+03:00</lastmod>
<lastmod>2020-07-15T15:42:23+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-07/</loc>
<lastmod>2020-07-14T10:57:49+03:00</lastmod>
<lastmod>2020-07-15T15:42:23+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-07-14T10:57:49+03:00</lastmod>
<lastmod>2020-07-15T15:42:23+03:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-07-14T10:57:49+03:00</lastmod>
<lastmod>2020-07-15T15:42:23+03:00</lastmod>
</url>

<url>