Add notes for 2020-08-11

Alan Orth 2020-08-11 11:35:05 +03:00
parent cb03863647
commit ccecd63eb0
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
20 changed files with 62 additions and 25 deletions

@@ -367,5 +367,22 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H
- In Twitter's case they were also getting lumped in with the bad bots too, but really they only make ~50 or so requests a day when someone posts a CGSpace link on Twitter
- I tagged the ISO 3166-1 Alpha2 country codes on all items on CGSpace using my [CountryCodeTagger](https://github.com/ilri/cgspace-java-helpers) curation task
  - I still need to set up a cron job for it...
  - This tagged over 50,000 country codes!
```
dspace=# SELECT count(text_value) FROM metadatavalue WHERE metadata_field_id = 243 AND resource_type_id = 2;
count
-------
50812
(1 row)
```
## 2020-08-11
- I noticed some more hits from Macaroni's WordPress harvester that I hadn't caught last week
  - 104.198.13.34 made many requests without a user agent, with a "WordPress" user agent, and with their new "RTB website BOT" user agent, about 100,000 in total in 2020, and maybe another 70,000 in the other years
  - I will purge them and add them to the Tomcat Crawler Session Manager and the DSpace bots list so they don't get logged in Solr
- I noticed a bunch of user agents with "Crawl" in the Solr stats, which is strange because the DSpace spider agents file has had "crawl" for a long time (and it is case insensitive)
  - In any case I will purge them and add them to the Tomcat Crawler Session Manager Valve so that at least their sessions get re-used
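- A minimal sketch of the two checks above, under some assumptions: that the spider agent patterns are treated as regular expressions searched anywhere in the user agent string, and that the Solr statistics core stores the client address in a field named `ip`. The helper names and the sample user agents are hypothetical, and the resulting XML body would still need to be posted to a Solr update URL like the one in the `curl` command earlier in these notes.

```python
import re
from xml.sax.saxutils import escape


def matches_spider_pattern(user_agent, patterns):
    """Return True if any pattern matches anywhere in the user agent, ignoring case."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


def build_delete_query(field, value):
    """Build a Solr delete-by-query XML body for an exact match on one field."""
    # Quote the value so spaces and punctuation are treated literally by Solr
    quoted = '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
    # Escape &, < and > so the query is safe inside the XML element
    return "<delete><query>" + escape(f"{field}:{quoted}") + "</query></delete>"


if __name__ == "__main__":
    # Hypothetical user agents; "crawl" should match "Crawl" when case is ignored
    for agent in ["ExampleCrawler/1.0", "Mozilla/5.0"]:
        print(agent, matches_spider_pattern(agent, ["crawl"]))

    # The field name "ip" is an assumption about the statistics core schema
    print(build_delete_query("ip", "104.198.13.34"))
```

- If a pattern like "crawl" matches here but the hits still show up in the Solr stats, then the mismatch is probably somewhere other than letter case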
<!-- vim: set sw=2 ts=2: -->

@@ -19,7 +19,7 @@ It is class based so I can easily add support for other vocabularies, and the te
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-08/" />
<meta property="article:published_time" content="2020-08-02T15:35:54+03:00" />
<meta property="article:modified_time" content="2020-08-10T09:27:50+03:00" />
<meta property="article:modified_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2020"/>
@@ -43,9 +43,9 @@ It is class based so I can easily add support for other vocabularies, and the te
"@type": "BlogPosting",
"headline": "August, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-08/",
"wordCount": "2285",
"wordCount": "2443",
"datePublished": "2020-08-02T15:35:54+03:00",
"dateModified": "2020-08-10T09:27:50+03:00",
"dateModified": "2020-08-10T15:59:22+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -527,6 +527,26 @@ $ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=tru
<li>I tagged the ISO 3166-1 Alpha2 country codes on all items on CGSpace using my <a href="https://github.com/ilri/cgspace-java-helpers">CountryCodeTagger</a> curation task
<ul>
<li>I still need to set up a cron job for it&hellip;</li>
<li>This tagged over 50,000 country codes!</li>
</ul>
</li>
</ul>
<pre><code>dspace=# SELECT count(text_value) FROM metadatavalue WHERE metadata_field_id = 243 AND resource_type_id = 2;
count
-------
50812
(1 row)
</code></pre><h2 id="2020-08-11">2020-08-11</h2>
<ul>
<li>I noticed some more hits from Macaroni&rsquo;s WordPress harvester that I hadn&rsquo;t caught last week
<ul>
<li>104.198.13.34 made many requests without a user agent, with a &ldquo;WordPress&rdquo; user agent, and with their new &ldquo;RTB website BOT&rdquo; user agent, about 100,000 in total in 2020, and maybe another 70,000 in the other years</li>
<li>I will purge them and add them to the Tomcat Crawler Session Manager and the DSpace bots list so they don&rsquo;t get logged in Solr</li>
</ul>
</li>
<li>I noticed a bunch of user agents with &ldquo;Crawl&rdquo; in the Solr stats, which is strange because the DSpace spider agents file has had &ldquo;crawl&rdquo; for a long time (and it is case insensitive)
<ul>
<li>In any case I will purge them and add them to the Tomcat Crawler Session Manager Valve so that at least their sessions get re-used</li>
</ul>
</li>
</ul>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Categories"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta property="og:updated_time" content="2020-08-10T15:59:22+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@@ -4,27 +4,27 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-08/</loc>
<lastmod>2020-08-10T09:27:50+03:00</lastmod>
<lastmod>2020-08-10T15:59:22+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-08-10T09:27:50+03:00</lastmod>
<lastmod>2020-08-10T15:59:22+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-08-10T09:27:50+03:00</lastmod>
<lastmod>2020-08-10T15:59:22+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-08-10T09:27:50+03:00</lastmod>
<lastmod>2020-08-10T15:59:22+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-08-10T09:27:50+03:00</lastmod>
<lastmod>2020-08-10T15:59:22+03:00</lastmod>
</url>
<url>