diff --git a/content/posts/2020-02.md b/content/posts/2020-02.md
index 3e7d76862..d2364cce9 100644
--- a/content/posts/2020-02.md
+++ b/content/posts/2020-02.md
@@ -258,7 +258,7 @@ ReactorNetty/0.9.2.RELEASE
 ```

 - I made [an issue on the COUNTER-Robots repository](https://github.com/atmire/COUNTER-Robots/issues/31)
-- I found a [nice tool for exporting and importing Solr records](https://github.com/freedev/solr-import-export-json) and it seems to workfor exporting our 2019 stats from the large statistics core!
+- I found a [nice tool for exporting and importing Solr records](https://github.com/freedev/solr-import-export-json) and it seems to work for exporting our 2019 stats from the large statistics core!

 ```
 $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
@@ -1027,7 +1027,69 @@ Total number of bot hits purged: 159
 - Make pull requests for issues with user agents in the COUNTER-Robots repository:
   - [Fix okhttp](https://github.com/atmire/COUNTER-Robots/pull/33)
   - [Add new bots](https://github.com/atmire/COUNTER-Robots/pull/34)
-- One benefit of all this is that the size of the statistics Solr core has reduced by 6GB since yesterday, though I can't remember how big it was before that
+- One benefit of all this is that the size of the statistics Solr core has reduced by 6GiB since yesterday, though I can't remember how big it was before that
+  - According to my notes it was 43GiB in January when it failed the first time
   - I wonder if the sharding process would work now...

+## 2020-02-26
+
+- Bosede finally got back to me about the IITA records from earlier last month ([IITA_201907_Jan13](https://dspacetest.cgiar.org/handle/10568/106567))
+  - She said she has added more information to fifty-three of the journal articles, as I had requested
+- I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month:
+
+```
+$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
+$ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-util.log.$(date --iso-8601)
+```
+
+- Interestingly I saw this in the Solr log:
+
+```
+2020-02-26 08:55:47,433 INFO org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/
+2020-02-26 08:55:47,511 INFO org.apache.solr.servlet.SolrDispatchFilter @ [admin] webapp=null path=/admin/cores params={dataDir=[dspace]/solr/statistics-2019/data&name=statistics-2019&action=CREATE&instanceDir=statistics&wt=javabin&version=2} status=0 QTime=590
+```
+
+- The process has been going for several hours now and I suspect it will fail eventually
+  - I want to explore manually creating and migrating the core
+- Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment:
+
+```
+$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
+```
+- After that the `statistics-2019` core was immediately available in the Solr UI, but after restarting Tomcat it was gone
+  - I wonder if importing some old statistics into the current `statistics` core and then letting DSpace create the `statistics-2019` core itself using `dspace stats-util -s` will work...
+- First export a small slice of 2019 stats from the main CGSpace `statistics` core, skipping Atmire schema additions:
+
+```
+$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
+```
+
+- Then import into my local `statistics` core:
+
+```
+$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid
+$ ~/dspace63/bin/dspace stats-util -s
+Moving: 21993 into core statistics-2019
+```
+
+- To my surprise, the `statistics-2019` core is created and the documents are immediately visible in the Solr UI!
+  - Also, I am able to see the stats in DSpace's default "View Usage Statistics" screen
+  - Items appear with the words "(legacy)" at the end, ie "Improving farming practices in flood-prone areas in the Solomon Islands(legacy)"
+  - Interestingly, if I make a bunch of requests for that item they will not be recognized as the same item, showing up as "Improving farming practices in flood-prone areas in the Solomon Islands" without the legacy identifier
+  - I need to remember to test out the [SolrUpgradePre6xStatistics tool](https://wiki.lyrasis.org/display/DSDOC6x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-UpgradeLegacyDSpaceObjectIdentifiers(pre-6xstatistics)toDSpace6xUUIDIdentifiers)
+- After restarting my local Tomcat on DSpace 6.4-SNAPSHOT the `statistics-2019` core loaded up...
+  - I wonder what the difference is between the core I created vs the one created by `stats-util`?
+  - I'm honestly considering just moving everything back into one core...
+  - Or perhaps I can export all the stats for 2019 by month, then delete everything, re-import each month, and migrate them with stats-util
+- Testing some [proposed patches for 6.4](https://wiki.lyrasis.org/display/DSPACE/DSpace+6.4+Release+Status) in my local `6_x-dev64` branch
+- [DS-4135 (citation author UTF-8)](https://jira.lyrasis.org/browse/DS-4135)
+  - Testing [item 10568/106959](https://hdl.handle.net/10568/106959) before and after:
+
+```
+<meta content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" name="citation_title">
+<meta name="citation_title" content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" />
+```
+
+- [DS-4397 controlled vocabulary loading speedup](https://jira.lyrasis.org/browse/DS-4397)
+
diff --git a/docs/2020-02/index.html b/docs/2020-02/index.html
index 4bcaff87d..e944c565b 100644
--- a/docs/2020-02/index.html
+++ b/docs/2020-02/index.html
@@ -20,7 +20,7 @@ The code finally builds and runs with a fresh install
-
+
@@ -45,9 +45,9 @@ The code finally builds and runs with a fresh install
   "@type": "BlogPosting",
   "headline": "February, 2020",
   "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-02\/",
-  "wordCount": "6499",
+  "wordCount": "6988",
   "datePublished": "2020-02-02T11:56:30+02:00",
-  "dateModified": "2020-02-25T09:14:30+02:00",
+  "dateModified": "2020-02-25T21:05:22+02:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -384,7 +384,7 @@ $ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jerse
ReactorNetty/0.9.2.RELEASE
$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
$ ls -lh /tmp/statistics-2019-01.json
@@ -1152,12 +1152,81 @@ Total number of bot hits purged: 159
Add new bots
-One benefit of all this is that the size of the statistics Solr core has reduced by 6GB since yesterday, though I can’t remember how big it was before that
+ One benefit of all this is that the size of the statistics Solr core has reduced by 6GiB since yesterday, though I can’t remember how big it was before that
+- According to my notes it was 43GiB in January when it failed the first time
- I wonder if the sharding process would work now…
+2020-02-26
+
+- Bosede finally got back to me about the IITA records from earlier last month (IITA_201907_Jan13)
+
+- She said she has added more information to fifty-three of the journal articles, as I had requested
+
+
+- I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month:
+
+$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
+$ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-util.log.$(date --iso-8601)
+
+- Interestingly I saw this in the Solr log:
+
+2020-02-26 08:55:47,433 INFO org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/
+2020-02-26 08:55:47,511 INFO org.apache.solr.servlet.SolrDispatchFilter @ [admin] webapp=null path=/admin/cores params={dataDir=[dspace]/solr/statistics-2019/data&name=statistics-2019&action=CREATE&instanceDir=statistics&wt=javabin&version=2} status=0 QTime=590
+
+- The process has been going for several hours now and I suspect it will fail eventually
+
+- I want to explore manually creating and migrating the core
+
+
+- Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment:
+
+$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
+
+- After that the `statistics-2019` core was immediately available in the Solr UI, but after restarting Tomcat it was gone
+
+- I wonder if importing some old statistics into the current `statistics` core and then letting DSpace create the `statistics-2019` core itself using `dspace stats-util -s` will work…
+
+
+- First export a small slice of 2019 stats from the main CGSpace `statistics` core, skipping Atmire schema additions:
+
+$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
+
+- Then import into my local `statistics` core:
+
+$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid
+$ ~/dspace63/bin/dspace stats-util -s
+Moving: 21993 into core statistics-2019
+
+- To my surprise, the `statistics-2019` core is created and the documents are immediately visible in the Solr UI!
+
+- Also, I am able to see the stats in DSpace’s default “View Usage Statistics” screen
+- Items appear with the words “(legacy)” at the end, ie “Improving farming practices in flood-prone areas in the Solomon Islands(legacy)”
+- Interestingly, if I make a bunch of requests for that item they will not be recognized as the same item, showing up as “Improving farming practices in flood-prone areas in the Solomon Islands” without the legacy identifier
+- I need to remember to test out the SolrUpgradePre6xStatistics tool
+
+
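One way to double check the move is to ask Solr how many documents ended up in the new core and compare that with the `Moving: 21993` line printed by `stats-util`. This is only a sketch: `count_url` is a hypothetical helper name, and the base URL assumes the local Solr from the commands above. It builds a standard `select` query with `rows=0`, whose JSON response carries the total in `response.numFound`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a Solr select URL that returns only the
# document count (rows=0) for the given core.
# $1 = Solr base URL (assumed: http://localhost:8080/solr), $2 = core name
count_url() {
    printf '%s\n' "$1/$2/select?q=*:*&rows=0&wt=json"
}

count_url http://localhost:8080/solr statistics-2019

# With Solr running, fetch it and read response.numFound:
#   curl -s "$(count_url http://localhost:8080/solr statistics-2019)"
```

If `numFound` does not match the number reported by `stats-util`, some documents were probably left behind in the main `statistics` core.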
+- After restarting my local Tomcat on DSpace 6.4-SNAPSHOT the `statistics-2019` core loaded up…
+
+- I wonder what the difference is between the core I created vs the one created by `stats-util`?
+- I’m honestly considering just moving everything back into one core…
+- Or perhaps I can export all the stats for 2019 by month, then delete everything, re-import each month, and migrate them with stats-util
+
+
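The month-by-month export idea could be sketched as a dry run that only prints the commands, reusing the `run.sh` flags from the earlier export (`-s`, `-a export`, `-o`, `-f`, `-k`). The function name `print_monthly_exports` is hypothetical, and the output paths under `/tmp` simply follow the pattern used above. The `-S` skip list for the Atmire fields would presumably need to be appended for a real run.

```shell
#!/usr/bin/env bash
# Dry run: print one solr-import-export-json export command per month.
# Nothing is executed against Solr; review the output before running it.
# $1 = year, $2 = URL of the source statistics core
print_monthly_exports() {
    local year="$1"
    local solr_url="$2"
    local month
    for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        printf "./run.sh -s %s -a export -o /tmp/statistics-%s-%s.json -f 'dateYearMonth:%s-%s' -k uid\n" \
            "$solr_url" "$year" "$month" "$year" "$month"
    done
}

print_monthly_exports 2019 http://localhost:8081/solr/statistics
```

Printing the commands first makes it easy to check the filters and file names before anything is deleted from the core.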
+- Testing some proposed patches for 6.4 in my local `6_x-dev64` branch
+- DS-4135 (citation author UTF-8)
+
+- Testing item 10568/106959 before and after:
+
+
+
+<meta content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" name="citation_title">
+<meta name="citation_title" content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" />
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 0d8065249..0f3cb92a2 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
https://alanorth.github.io/cgspace-notes/categories/
- 2020-02-25T09:14:30+02:00
+ 2020-02-25T21:05:22+02:00
https://alanorth.github.io/cgspace-notes/
- 2020-02-25T09:14:30+02:00
+ 2020-02-25T21:05:22+02:00
https://alanorth.github.io/cgspace-notes/2020-02/
- 2020-02-25T09:14:30+02:00
+ 2020-02-25T21:05:22+02:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2020-02-25T09:14:30+02:00
+ 2020-02-25T21:05:22+02:00
https://alanorth.github.io/cgspace-notes/posts/
- 2020-02-25T09:14:30+02:00
+ 2020-02-25T21:05:22+02:00