Update notes for 2020-08-10

This commit is contained in:
Alan Orth 2020-08-10 15:59:22 +03:00
parent 14f239ccaa
commit cb03863647
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
20 changed files with 70 additions and 26 deletions

View File

@ -350,4 +350,22 @@ $ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
1949 /tmp/2020-08-09-orcid-identifiers-uniq.csv
```
- I looked into the strange Solr record above that had "{set=830}" in the communities and collections
- There are exactly 11724 records like this in the current CGSpace (DSpace 5.8) statistics-2018 Solr core
- None of them have an `id` or `type` field!
- I see 242,000 of them in the statistics-2017 core, 185,063 in the statistics-2016 core... all the way to 2010, but not in 2019 or the current statistics core
- I decided to purge all of these records from CGSpace right now so they don't even have a chance at being an issue on the real migration:
```
$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
...
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
```
- I added `Googlebot` and `Twitterbot` to the list of explicitly allowed bots
- In Google's case, they were getting lumped in with all the other bad bots and then important links like the sitemaps were returning HTTP 503, but they generally respect `robots.txt` so we should just allow them (perhaps we can control the crawl rate in the webmaster console)
- In Twitter's case they were also getting lumped in with the bad bots too, but really they only make ~50 or so requests a day when someone posts a CGSpace link on Twitter
- I tagged the ISO 3166-1 Alpha2 country codes on all items on CGSpace using my [CountryCodeTagger](https://github.com/ilri/cgspace-java-helpers) curation task
- I still need to set up a cron job for it...
<!-- vim: set sw=2 ts=2: -->

View File

@ -19,7 +19,7 @@ It is class based so I can easily add support for other vocabularies, and the te
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-08/" />
<meta property="article:published_time" content="2020-08-02T15:35:54+03:00" />
<meta property="article:modified_time" content="2020-08-07T19:55:21+03:00" />
<meta property="article:modified_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2020"/>
@ -43,9 +43,9 @@ It is class based so I can easily add support for other vocabularies, and the te
"@type": "BlogPosting",
"headline": "August, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-08/",
"wordCount": "2049",
"wordCount": "2285",
"datePublished": "2020-08-02T15:35:54+03:00",
"dateModified": "2020-08-07T19:55:21+03:00",
"dateModified": "2020-08-10T09:27:50+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -504,7 +504,33 @@ dspace=# \q
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' /tmp/2020-08-09-orcid-identifiers.csv | sort | uniq &gt; /tmp/2020-08-09-orcid-identifiers-uniq.csv
$ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
1949 /tmp/2020-08-09-orcid-identifiers-uniq.csv
</code></pre><!-- raw HTML omitted -->
</code></pre><ul>
<li>I looked into the strange Solr record above that had &ldquo;{set=830}&rdquo; in the communities and collections
<ul>
<li>There are exactly 11724 records like this in the current CGSpace (DSpace 5.8) statistics-2018 Solr core</li>
<li>None of them have an <code>id</code> or <code>type</code> field!</li>
<li>I see 242,000 of them in the statistics-2017 core, 185,063 in the statistics-2016 core&hellip; all the way to 2010, but not in 2019 or the current statistics core</li>
<li>I decided to purge all of these records from CGSpace right now so they don&rsquo;t even have a chance at being an issue on the real migration:</li>
</ul>
</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2018/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
...
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;owningColl:/.*set.*/&lt;/query&gt;&lt;/delete&gt;'
</code></pre><ul>
<li>I added <code>Googlebot</code> and <code>Twitterbot</code> to the list of explicitly allowed bots
<ul>
<li>In Google&rsquo;s case, they were getting lumped in with all the other bad bots and then important links like the sitemaps were returning HTTP 503, but they generally respect <code>robots.txt</code> so we should just allow them (perhaps we can control the crawl rate in the webmaster console)</li>
<li>In Twitter&rsquo;s case they were also getting lumped in with the bad bots too, but really they only make ~50 or so requests a day when someone posts a CGSpace link on Twitter</li>
</ul>
</li>
<li>I tagged the ISO 3166-1 Alpha2 country codes on all items on CGSpace using my <a href="https://github.com/ilri/cgspace-java-helpers">CountryCodeTagger</a> curation task
<ul>
<li>I still need to set up a cron job for it&hellip;</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Categories"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-07T19:55:21+03:00" />
<meta property="og:updated_time" content="2020-08-10T09:27:50+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -4,27 +4,27 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-08/</loc>
<lastmod>2020-08-07T19:55:21+03:00</lastmod>
<lastmod>2020-08-10T09:27:50+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-08-07T19:55:21+03:00</lastmod>
<lastmod>2020-08-10T09:27:50+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-08-07T19:55:21+03:00</lastmod>
<lastmod>2020-08-10T09:27:50+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-08-07T19:55:21+03:00</lastmod>
<lastmod>2020-08-10T09:27:50+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-08-07T19:55:21+03:00</lastmod>
<lastmod>2020-08-10T09:27:50+03:00</lastmod>
</url>
<url>