Add notes for 2020-08-19

Alan Orth 2020-08-19 22:08:33 +03:00
parent 3252567208
commit d2c037d0de
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
20 changed files with 166 additions and 35 deletions

@ -432,12 +432,6 @@ $ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
```
- Then I see there are 849,000 docs with `id: -1` and `type: 5` so I should purge those too probably:
```
$ curl -s "http://localhost:8081/solr/statistics-2017/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:\-1</query></delete>'
```
- Altmetric asked for a dump of CGSpace's OAI "sets" so they can update their affiliation mappings
- I did it in a kinda ghetto way:
@ -450,4 +444,55 @@ $ for num in {0..1300..100}; do cat /tmp/$num.xml >> /tmp/cgspace-oai-sets.xml;
- This produces one file that has all the sets, albeit with 14 pages of responses concatenated into one document, but that's how theirs was in the first place...
- Help Bizu with a restricted item for CIAT
## 2020-08-16
- The com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI script that was processing 2015 records last night started spitting shit tons of errors and created 120GB of logs...
- I looked at a few of the UIDs that it was having problems with and they were unmigrated ones... so I purged them in 2015 and all the rest of the statistics cores
```
$ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
...
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
```
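- A rough sketch of how the same purge could be looped over every core instead of being typed out per core; the core list (`statistics` plus yearly shards 2010 through 2019) and the helper names `list_cores` and `update_url` are assumptions for illustration, and the loop only prints the commands rather than running them:

```shell
#!/bin/sh
# Sketch only: print (not run) the purge command for each statistics core.
solr_base="http://localhost:8081/solr"
delete_query='<delete><query>id:/.*unmigrated.*/</query></delete>'

# Hypothetical helper: the main core plus yearly shards.
# The 2010..2019 range is an assumption, adjust it to the cores that exist.
list_cores() {
    echo "statistics"
    year=2010
    while [ "$year" -le 2019 ]; do
        echo "statistics-$year"
        year=$((year + 1))
    done
}

# Hypothetical helper: build the update URL for one core.
update_url() {
    echo "$solr_base/$1/update?softCommit=true"
}

for core in $(list_cores); do
    # Printed for review; paste a line into the shell to actually send the delete.
    printf "curl -s \"%s\" -H \"Content-Type: text/xml\" --data-binary '%s'\n" \
        "$(update_url "$core")" "$delete_query"
done
```

- Printing first makes it easy to check the list of cores before deleting anything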
## 2020-08-19
- I tested the DSpace 5 and DSpace 6 versions of the [country code tagger curation task](https://github.com/ilri/cgspace-java-helpers) and noticed a few things
- The DSpace 5.8 version finishes in 2 hours and 1 minute
- The DSpace 6.3 version ran for over 12 hours and didn't even finish (I killed it)
- Furthermore, it seems that each item is curated once for each collection it appears in, causing about 115,000 items to be processed, even though we only have about 87,000
- I had been running the tasks on the entire repository with `-i 10568/0`, but I think I might need to try again with the special `all` option before writing to the dspace-tech mailing list for help
- Actually I just tested the `all` option on DSpace 5.8 and it still does many of the items multiple times, once for each of their mappings
- I finished the Atmire stats processing on all cores on DSpace Test:
- statistics:
- 2,040,385 docs: 2h 28m 49s
- statistics-2019:
- 8,960,000 docs: 12h 7s
- 1,780,575 docs: 2h 7m 29s
- statistics-2018:
- 2,200,000 docs: 12h 1m 11s
- 2,100,000 docs: 12h 4m 19s
- ?
- statistics-2017:
- 1,970,000 docs: 12h 5m 45s
- 2,000,000 docs: 12h 5m 38s
- 1,312,674 docs: 4h 14m 23s
- statistics-2016:
- 1,669,020 docs: 12h 4m 3s
- 1,650,000 docs: 12h 7m 40s
- 850,611 docs: 44m 52s
- statistics-2014:
- 4,832,334 docs: 3h 53m 41s
- statistics-2013:
- 4,509,891 docs: 3h 18m 44s
- statistics-2012:
- 3,716,857 docs: 2h 36m 21s
- statistics-2011:
- 1,645,426 docs: 1h 11m 41s
- As far as I can tell, the processing became much faster once I purged all the unmigrated records
- It took about six days for the processing according to the times above, though 2015 is missing... hmm
- Now I am testing the Atmire Listings and Reports
- On both my local test and DSpace Test I get no results when searching for "Orth, A." and "Orth, Alan" or even Delia Grace, but the Discovery index is up to date and I have eighteen items...
- I sent a message to Atmire...
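- A small sketch for adding up the run times listed above, to sanity check the total; the helper names `to_seconds` and `format_duration` are made up for this example, and the `?` run for 2018 and the missing 2015 core are left out, so the real total would be higher than what this prints:

```shell
#!/bin/sh
# Convert a duration such as "2h 28m 49s" or "12h 7s" to seconds.
to_seconds() {
    total=0
    for part in $1; do
        n=${part%[hms]}   # strip the unit suffix
        case "$part" in
            *h) total=$((total + n * 3600)) ;;
            *m) total=$((total + n * 60)) ;;
            *s) total=$((total + n)) ;;
        esac
    done
    echo "$total"
}

# Format a number of seconds as days, hours, minutes and seconds.
format_duration() {
    s=$1
    printf '%dd %dh %dm %ds\n' \
        $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Run times copied from the list above, one per line.
sum=0
while IFS= read -r duration; do
    sum=$((sum + $(to_seconds "$duration")))
done <<EOF
2h 28m 49s
12h 7s
2h 7m 29s
12h 1m 11s
12h 4m 19s
12h 5m 45s
12h 5m 38s
4h 14m 23s
12h 4m 3s
12h 7m 40s
44m 52s
3h 53m 41s
3h 18m 44s
2h 36m 21s
1h 11m 41s
EOF

format_duration "$sum"
```

- The recorded runs alone come to a bit under four and a half days, so the rest of the six days would have to be the unrecorded 2018 run and 2015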
<!-- vim: set sw=2 ts=2: -->

@ -19,7 +19,7 @@ It is class based so I can easily add support for other vocabularies, and the te
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-08/" />
<meta property="article:published_time" content="2020-08-02T15:35:54+03:00" />
<meta property="article:modified_time" content="2020-08-13T17:56:39+03:00" />
<meta property="article:modified_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="August, 2020"/>
@ -43,9 +43,9 @@ It is class based so I can easily add support for other vocabularies, and the te
"@type": "BlogPosting",
"headline": "August, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-08/",
"wordCount": "2800",
"wordCount": "3168",
"datePublished": "2020-08-02T15:35:54+03:00",
"dateModified": "2020-08-13T17:56:39+03:00",
"dateModified": "2020-08-14T11:22:16+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -599,10 +599,6 @@ Caused by: java.lang.NullPointerException
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
</code></pre><ul>
<li>Then I see there are 849,000 docs with <code>id: -1</code> and <code>type: 5</code> so I should purge those too probably:</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2017/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:\-1&lt;/query&gt;&lt;/delete&gt;'
</code></pre><ul>
<li>Altmetric asked for a dump of CGSpace&rsquo;s OAI &ldquo;sets&rdquo; so they can update their affiliation mappings
<ul>
<li>I did it in a kinda ghetto way:</li>
@ -616,6 +612,96 @@ $ for num in {0..1300..100}; do cat /tmp/$num.xml &gt;&gt; /tmp/cgspace-oai-sets
<li>This produces one file that has all the sets, albeit with 14 pages of responses concatenated into one document, but that&rsquo;s how theirs was in the first place&hellip;</li>
<li>Help Bizu with a restricted item for CIAT</li>
</ul>
<h2 id="2020-08-16">2020-08-16</h2>
<ul>
<li>The com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI script that was processing 2015 records last night started spitting shit tons of errors and created 120GB of logs&hellip;</li>
<li>I looked at a few of the UIDs that it was having problems with and they were unmigrated ones&hellip; so I purged them in 2015 and all the rest of the statistics cores</li>
</ul>
<pre><code>$ curl -s &quot;http://localhost:8081/solr/statistics-2015/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
...
$ curl -s &quot;http://localhost:8081/solr/statistics-2010/update?softCommit=true&quot; -H &quot;Content-Type: text/xml&quot; --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
</code></pre><h2 id="2020-08-19">2020-08-19</h2>
<ul>
<li>I tested the DSpace 5 and DSpace 6 versions of the <a href="https://github.com/ilri/cgspace-java-helpers">country code tagger curation task</a> and noticed a few things
<ul>
<li>The DSpace 5.8 version finishes in 2 hours and 1 minute</li>
<li>The DSpace 6.3 version ran for over 12 hours and didn&rsquo;t even finish (I killed it)</li>
<li>Furthermore, it seems that each item is curated once for each collection it appears in, causing about 115,000 items to be processed, even though we only have about 87,000</li>
</ul>
</li>
<li>I had been running the tasks on the entire repository with <code>-i 10568/0</code>, but I think I might need to try again with the special <code>all</code> option before writing to the dspace-tech mailing list for help
<ul>
<li>Actually I just tested the <code>all</code> option on DSpace 5.8 and it still does many of the items multiple times, once for each of their mappings</li>
</ul>
</li>
<li>I finished the Atmire stats processing on all cores on DSpace Test:
<ul>
<li>statistics:
<ul>
<li>2,040,385 docs: 2h 28m 49s</li>
</ul>
</li>
<li>statistics-2019:
<ul>
<li>8,960,000 docs: 12h 7s</li>
<li>1,780,575 docs: 2h 7m 29s</li>
</ul>
</li>
<li>statistics-2018:
<ul>
<li>2,200,000 docs: 12h 1m 11s</li>
<li>2,100,000 docs: 12h 4m 19s</li>
<li>?</li>
</ul>
</li>
<li>statistics-2017:
<ul>
<li>1,970,000 docs: 12h 5m 45s</li>
<li>2,000,000 docs: 12h 5m 38s</li>
<li>1,312,674 docs: 4h 14m 23s</li>
</ul>
</li>
<li>statistics-2016:
<ul>
<li>1,669,020 docs: 12h 4m 3s</li>
<li>1,650,000 docs: 12h 7m 40s</li>
<li>850,611 docs: 44m 52s</li>
</ul>
</li>
<li>statistics-2014:
<ul>
<li>4,832,334 docs: 3h 53m 41s</li>
</ul>
</li>
<li>statistics-2013:
<ul>
<li>4,509,891 docs: 3h 18m 44s</li>
</ul>
</li>
<li>statistics-2012:
<ul>
<li>3,716,857 docs: 2h 36m 21s</li>
</ul>
</li>
<li>statistics-2011:
<ul>
<li>1,645,426 docs: 1h 11m 41s</li>
</ul>
</li>
</ul>
</li>
<li>As far as I can tell, the processing became much faster once I purged all the unmigrated records
<ul>
<li>It took about six days for the processing according to the times above, though 2015 is missing&hellip; hmm</li>
</ul>
</li>
<li>Now I am testing the Atmire Listings and Reports
<ul>
<li>On both my local test and DSpace Test I get no results when searching for &ldquo;Orth, A.&rdquo; and &ldquo;Orth, Alan&rdquo; or even Delia Grace, but the Discovery index is up to date and I have eighteen items&hellip;</li>
<li>I sent a message to Atmire&hellip;</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Categories"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-13T17:56:39+03:00" />
<meta property="og:updated_time" content="2020-08-14T11:22:16+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

@ -4,27 +4,27 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-08/</loc>
<lastmod>2020-08-13T17:56:39+03:00</lastmod>
<lastmod>2020-08-14T11:22:16+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-08-13T17:56:39+03:00</lastmod>
<lastmod>2020-08-14T11:22:16+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-08-13T17:56:39+03:00</lastmod>
<lastmod>2020-08-14T11:22:16+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-08-13T17:56:39+03:00</lastmod>
<lastmod>2020-08-14T11:22:16+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-08-13T17:56:39+03:00</lastmod>
<lastmod>2020-08-14T11:22:16+03:00</lastmod>
</url>
<url>