mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-25 16:08:19 +01:00
Add notes for 2020-02-26
parent
b990e4da33
commit
56ce326b80
@ -258,7 +258,7 @@ ReactorNetty/0.9.2.RELEASE
|
||||
```
|
||||
|
||||
- I made [an issue on the COUNTER-Robots repository](https://github.com/atmire/COUNTER-Robots/issues/31)
|
||||
- I found a [nice tool for exporting and importing Solr records](https://github.com/freedev/solr-import-export-json) and it seems to workfor exporting our 2019 stats from the large statistics core!
|
||||
- I found a [nice tool for exporting and importing Solr records](https://github.com/freedev/solr-import-export-json) and it seems to work for exporting our 2019 stats from the large statistics core!
|
||||
|
||||
```
|
||||
$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
|
||||
@ -1027,7 +1027,69 @@ Total number of bot hits purged: 159
|
||||
- Make pull requests for issues with user agents in the COUNTER-Robots repository:
|
||||
- [Fix okhttp](https://github.com/atmire/COUNTER-Robots/pull/33)
|
||||
- [Add new bots](https://github.com/atmire/COUNTER-Robots/pull/34)
|
||||
- One benefit of all this is that the size of the statistics Solr core has reduced by 6GB since yesterday, though I can't remember how big it was before that
|
||||
- One benefit of all this is that the size of the statistics Solr core has reduced by 6GiB since yesterday, though I can't remember how big it was before that
|
||||
- According to my notes it was 43GiB in January when it failed the first time
|
||||
- I wonder if the sharding process would work now...
|
||||
|
||||
## 2020-02-26
|
||||
|
||||
- Bosede finally got back to me about the IITA records from earlier last month ([IITA_201907_Jan13](https://dspacetest.cgiar.org/handle/10568/106567))
|
||||
- She said she has added more information to fifty-three of the journal articles, as I had requested
|
||||
- I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month:
|
||||
|
||||
```
|
||||
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
|
||||
$ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-util.log.$(date --iso-8601)
|
||||
```
|
||||
|
||||
- Interestingly I saw this in the Solr log:
|
||||
|
||||
```
|
||||
2020-02-26 08:55:47,433 INFO org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/
|
||||
2020-02-26 08:55:47,511 INFO org.apache.solr.servlet.SolrDispatchFilter @ [admin] webapp=null path=/admin/cores params={dataDir=[dspace]/solr/statistics-2019/data&name=statistics-2019&action=CREATE&instanceDir=statistics&wt=javabin&version=2} status=0 QTime=590
|
||||
```
|
||||
|
||||
- The process has been going for several hours now and I suspect it will fail eventually
|
||||
- I want to explore manually creating and migrating the core
|
||||
- Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment:
|
||||
|
||||
```
|
||||
$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
|
||||
```
|
||||
- After that the `statistics-2019` core was immediately available in the Solr UI, but after restarting Tomcat it was gone
|
||||
- I wonder whether importing some old statistics into the current `statistics` core and then letting DSpace create the `statistics-2019` core itself using `dspace stats-util -s` will work...
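- A minimal sketch for checking whether Solr still knows about the core after a restart, using the CoreAdmin `STATUS` action; the `core_status_url` helper is hypothetical, and the `curl` call is left commented out because it assumes a running Solr at the same local URL as above:

```shell
# core_status_url BASE_URL CORE_NAME
# Print the CoreAdmin STATUS URL for one core (helper name is made up for this sketch).
core_status_url() {
    printf '%s/admin/cores?action=STATUS&core=%s&wt=json\n' "$1" "$2"
}

# With a running Solr, compare the response before and after restarting Tomcat:
#   curl -s "$(core_status_url http://localhost:8080/solr statistics-2019)"
core_status_url http://localhost:8080/solr statistics-2019
```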
|
||||
- First export a small slice of 2019 stats from the main CGSpace `statistics` core, skipping Atmire schema additions:
|
||||
|
||||
```
|
||||
$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
|
||||
```
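- A small sketch of how the long `-S` skip list could be kept readable by joining field names with commas; `join_by_comma` is a hypothetical helper and only a subset of the fields above is shown:

```shell
# join_by_comma ARG...
# Print all arguments joined with commas; the subshell keeps the IFS change local.
join_by_comma() (
    IFS=,
    printf '%s\n' "$*"
)

# Subset of the skipped fields from the export command above, for illustration only.
join_by_comma author_mtdt author_mtdt_search cua_version dateYear dateYearMonth
```

- The output could then be passed to the export as `-S "$(join_by_comma ...)"`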
|
||||
|
||||
- Then import into my local `statistics` core:
|
||||
|
||||
```
|
||||
$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid
|
||||
$ ~/dspace63/bin/dspace stats-util -s
|
||||
Moving: 21993 into core statistics-2019
|
||||
```
|
||||
|
||||
- To my surprise, the `statistics-2019` core is created and the documents are immediately visible in the Solr UI!
|
||||
- Also, I am able to see the stats in DSpace's default "View Usage Statistics" screen
|
||||
- Items appear with "(legacy)" appended to the title, e.g. "Improving farming practices in flood-prone areas in the Solomon Islands(legacy)"
|
||||
- Interestingly, if I make a bunch of requests for that item they will not be recognized as the same item, showing up as "Improving farming practices in flood-prone areas in the Solomon Islands" without the legacy identifier
|
||||
- I need to remember to test out the [SolrUpgradePre6xStatistics tool](https://wiki.lyrasis.org/display/DSDOC6x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-UpgradeLegacyDSpaceObjectIdentifiers(pre-6xstatistics)toDSpace6xUUIDIdentifiers)
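- A rough sketch for confirming the document count in the new core without opening the Solr UI, assuming the standard `select` handler with `rows=0&wt=json`; `extract_num_found` is a hypothetical helper and the sample response below is hand-written, not real Solr output:

```shell
# extract_num_found
# Read a Solr JSON response on stdin and print the value of "numFound".
extract_num_found() {
    sed -n 's/.*"numFound": *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# With a running Solr:
#   curl -s 'http://localhost:8080/solr/statistics-2019/select?q=*:*&rows=0&wt=json' | extract_num_found

# Offline demonstration with a hand-written sample response:
sample='{"responseHeader":{"status":0,"QTime":1},"response":{"numFound":21993,"start":0,"docs":[]}}'
printf '%s\n' "$sample" | extract_num_found
```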
|
||||
- After restarting my local Tomcat on DSpace 6.4-SNAPSHOT the `statistics-2019` core loaded up...
|
||||
- I wonder what the difference is between the core I created vs the one created by `stats-util`?
|
||||
- I'm honestly considering just moving everything back into one core...
|
||||
- Or perhaps I can export all the stats for 2019 by month, then delete everything, re-import each month, and migrate them with stats-util
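- A minimal sketch of the month-by-month export idea, printing the commands as a dry run rather than executing them; it reuses the `run.sh` flags and the `dateYearMonth` filter from the earlier export, while `print_monthly_exports`, `SOLR_URL`, and `OUT_DIR` are names made up for this example:

```shell
# Defaults mirror the values used earlier in these notes; override as needed.
SOLR_URL="${SOLR_URL:-http://localhost:8081/solr/statistics}"
OUT_DIR="${OUT_DIR:-/tmp}"

# print_monthly_exports YEAR
# Print one "./run.sh ... -a export" command per month, filtering on dateYearMonth.
print_monthly_exports() {
    year="$1"
    for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        ym="${year}-${month}"
        printf "./run.sh -s %s -a export -o %s/statistics-%s.json -f 'dateYearMonth:%s' -k uid\n" \
            "$SOLR_URL" "$OUT_DIR" "$ym" "$ym"
    done
}

# Dry run: review the commands first, then pipe them to sh if they look right.
print_monthly_exports 2019
```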
|
||||
- Testing some [proposed patches for 6.4](https://wiki.lyrasis.org/display/DSPACE/DSpace+6.4+Release+Status) in my local `6_x-dev64` branch
|
||||
- [DS-4135 (citation author UTF-8)](https://jira.lyrasis.org/browse/DS-4135)
|
||||
- Testing [item 10568/106959](https://hdl.handle.net/10568/106959) before and after:
|
||||
|
||||
```
|
||||
<meta content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" name="citation_title">
|
||||
<meta name="citation_title" content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" />
|
||||
```
|
||||
|
||||
- [DS-4397 controlled vocabulary loading speedup](https://jira.lyrasis.org/browse/DS-4397)
|
||||
|
||||
<!-- vim: set sw=2 ts=2: -->
|
||||
|
@ -20,7 +20,7 @@ The code finally builds and runs with a fresh install
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-02/" />
<meta property="article:published_time" content="2020-02-02T11:56:30+02:00" />
<meta property="article:modified_time" content="2020-02-25T09:14:30+02:00" />
<meta property="article:modified_time" content="2020-02-25T21:05:22+02:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="February, 2020"/>
@ -45,9 +45,9 @@ The code finally builds and runs with a fresh install
"@type": "BlogPosting",
"headline": "February, 2020",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-02\/",
"wordCount": "6499",
"wordCount": "6988",
"datePublished": "2020-02-02T11:56:30+02:00",
"dateModified": "2020-02-25T09:14:30+02:00",
"dateModified": "2020-02-25T21:05:22+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -384,7 +384,7 @@ $ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jerse
|
||||
<pre><code>ReactorNetty/0.9.2.RELEASE
|
||||
</code></pre><ul>
|
||||
<li>I made <a href="https://github.com/atmire/COUNTER-Robots/issues/31">an issue on the COUNTER-Robots repository</a></li>
|
||||
<li>I found a <a href="https://github.com/freedev/solr-import-export-json">nice tool for exporting and importing Solr records</a> and it seems to workfor exporting our 2019 stats from the large statistics core!</li>
|
||||
<li>I found a <a href="https://github.com/freedev/solr-import-export-json">nice tool for exporting and importing Solr records</a> and it seems to work for exporting our 2019 stats from the large statistics core!</li>
|
||||
</ul>
|
||||
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
|
||||
$ ls -lh /tmp/statistics-2019-01.json
|
||||
@ -1152,12 +1152,81 @@ Total number of bot hits purged: 159
|
||||
<li><a href="https://github.com/atmire/COUNTER-Robots/pull/34">Add new bots</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>One benefit of all this is that the size of the statistics Solr core has reduced by 6GB since yesterday, though I can’t remember how big it was before that
|
||||
<li>One benefit of all this is that the size of the statistics Solr core has reduced by 6GiB since yesterday, though I can’t remember how big it was before that
|
||||
<ul>
|
||||
<li>According to my notes it was 43GiB in January when it failed the first time</li>
|
||||
<li>I wonder if the sharding process would work now…</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<h2 id="2020-02-26">2020-02-26</h2>
|
||||
<ul>
|
||||
<li>Bosede finally got back to me about the IITA records from earlier last month (<a href="https://dspacetest.cgiar.org/handle/10568/106567">IITA_201907_Jan13</a>)
|
||||
<ul>
|
||||
<li>She said she has added more information to fifty-three of the journal articles, as I had requested</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month:</li>
|
||||
</ul>
|
||||
<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
|
||||
$ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-util.log.$(date --iso-8601)
|
||||
</code></pre><ul>
|
||||
<li>Interestingly I saw this in the Solr log:</li>
|
||||
</ul>
|
||||
<pre><code>2020-02-26 08:55:47,433 INFO org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/
|
||||
2020-02-26 08:55:47,511 INFO org.apache.solr.servlet.SolrDispatchFilter @ [admin] webapp=null path=/admin/cores params={dataDir=[dspace]/solr/statistics-2019/data&name=statistics-2019&action=CREATE&instanceDir=statistics&wt=javabin&version=2} status=0 QTime=590
|
||||
</code></pre><ul>
|
||||
<li>The process has been going for several hours now and I suspect it will fail eventually
|
||||
<ul>
|
||||
<li>I want to explore manually creating and migrating the core</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment:</li>
|
||||
</ul>
|
||||
<pre><code>$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
|
||||
</code></pre><ul>
|
||||
<li>After that the <code>statistics-2019</code> core was immediately available in the Solr UI, but after restarting Tomcat it was gone
|
||||
<ul>
|
||||
<li>I wonder whether importing some old statistics into the current <code>statistics</code> core and then letting DSpace create the <code>statistics-2019</code> core itself using <code>dspace stats-util -s</code> will work…</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>First export a small slice of 2019 stats from the main CGSpace <code>statistics</code> core, skipping Atmire schema additions:</li>
|
||||
</ul>
|
||||
<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
|
||||
</code></pre><ul>
|
||||
<li>Then import into my local <code>statistics</code> core:</li>
|
||||
</ul>
|
||||
<pre><code>$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid
|
||||
$ ~/dspace63/bin/dspace stats-util -s
|
||||
Moving: 21993 into core statistics-2019
|
||||
</code></pre><ul>
|
||||
<li>To my surprise, the <code>statistics-2019</code> core is created and the documents are immediately visible in the Solr UI!
|
||||
<ul>
|
||||
<li>Also, I am able to see the stats in DSpace’s default “View Usage Statistics” screen</li>
|
||||
<li>Items appear with “(legacy)” appended to the title, e.g. “Improving farming practices in flood-prone areas in the Solomon Islands(legacy)”</li>
|
||||
<li>Interestingly, if I make a bunch of requests for that item they will not be recognized as the same item, showing up as “Improving farming practices in flood-prone areas in the Solomon Islands” without the legacy identifier</li>
|
||||
<li>I need to remember to test out the <a href="https://wiki.lyrasis.org/display/DSDOC6x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-UpgradeLegacyDSpaceObjectIdentifiers(pre-6xstatistics)toDSpace6xUUIDIdentifiers">SolrUpgradePre6xStatistics tool</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>After restarting my local Tomcat on DSpace 6.4-SNAPSHOT the <code>statistics-2019</code> core loaded up…
|
||||
<ul>
|
||||
<li>I wonder what the difference is between the core I created vs the one created by <code>stats-util</code>?</li>
|
||||
<li>I’m honestly considering just moving everything back into one core…</li>
|
||||
<li>Or perhaps I can export all the stats for 2019 by month, then delete everything, re-import each month, and migrate them with stats-util</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>Testing some <a href="https://wiki.lyrasis.org/display/DSPACE/DSpace+6.4+Release+Status">proposed patches for 6.4</a> in my local <code>6_x-dev64</code> branch</li>
|
||||
<li><a href="https://jira.lyrasis.org/browse/DS-4135">DS-4135 (citation author UTF-8)</a>
|
||||
<ul>
|
||||
<li>Testing <a href="https://hdl.handle.net/10568/106959">item 10568/106959</a> before and after:</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre><code><meta content="Thu hoạch v&agrave; bảo quản c&agrave; ph&ecirc; ch&egrave; đ&uacute;ng kỹ thuật (Harvesting and storing Arabica coffee)" name="citation_title">
|
||||
<meta name="citation_title" content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" />
|
||||
</code></pre><ul>
|
||||
<li><a href="https://jira.lyrasis.org/browse/DS-4397">DS-4397 controlled vocabulary loading speedup</a></li>
|
||||
</ul>
|
||||
<!-- raw HTML omitted -->
|
||||
|
||||
|
||||
|
@ -4,27 +4,27 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
<lastmod>2020-02-25T21:05:22+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
<lastmod>2020-02-25T21:05:22+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-02/</loc>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
<lastmod>2020-02-25T21:05:22+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
<lastmod>2020-02-25T21:05:22+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-02-25T09:14:30+02:00</lastmod>
<lastmod>2020-02-25T21:05:22+02:00</lastmod>
</url>

<url>