From 56ce326b80017a216dc9c0bbbd9015cbc3f7b5e2 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Wed, 26 Feb 2020 16:39:55 +0200 Subject: [PATCH] Add notes for 2020-02-26 --- content/posts/2020-02.md | 66 ++++++++++++++++++++++++++++++++- docs/2020-02/index.html | 79 +++++++++++++++++++++++++++++++++++++--- docs/sitemap.xml | 10 ++--- 3 files changed, 143 insertions(+), 12 deletions(-) diff --git a/content/posts/2020-02.md b/content/posts/2020-02.md index 3e7d76862..d2364cce9 100644 --- a/content/posts/2020-02.md +++ b/content/posts/2020-02.md @@ -258,7 +258,7 @@ ReactorNetty/0.9.2.RELEASE ``` - I made [an issue on the COUNTER-Robots repository](https://github.com/atmire/COUNTER-Robots/issues/31) -- I found a [nice tool for exporting and importing Solr records](https://github.com/freedev/solr-import-export-json) and it seems to workfor exporting our 2019 stats from the large statistics core! +- I found a [nice tool for exporting and importing Solr records](https://github.com/freedev/solr-import-export-json) and it seems to work for exporting our 2019 stats from the large statistics core! ``` $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid @@ -1027,7 +1027,69 @@ Total number of bot hits purged: 159 - Make pull requests for issues with user agents in the COUNTER-Robots repository: - [Fix okhttp](https://github.com/atmire/COUNTER-Robots/pull/33) - [Add new bots](https://github.com/atmire/COUNTER-Robots/pull/34) -- One benefit of all this is that the size of the statistics Solr core has reduced by 6GB since yesterday, though I can't remember how big it was before that +- One benefit of all this is that the size of the statistics Solr core has reduced by 6GiB since yesterday, though I can't remember how big it was before that + - According to my notes it was 43GiB in January when it failed the first time - I wonder if the sharding process would work now... 
+## 2020-02-26 + +- Bosede finally got back to me about the IITA records from earlier last month ([IITA_201907_Jan13](https://dspacetest.cgiar.org/handle/10568/106567)) + - She said she has added more information to fifty-three of the journal articles, as I had requested +- I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month: + +``` +$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m" +$ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-util.log.$(date --iso-8601) +``` + +- Interestingly I saw this in the Solr log: + +``` +2020-02-26 08:55:47,433 INFO org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/ +2020-02-26 08:55:47,511 INFO org.apache.solr.servlet.SolrDispatchFilter @ [admin] webapp=null path=/admin/cores params={dataDir=[dspace]/solr/statistics-2019/data&name=statistics-2019&action=CREATE&instanceDir=statistics&wt=javabin&version=2} status=0 QTime=590 +``` + +- The process has been going for several hours now and I suspect it will fail eventually + - I want to explore manually creating and migrating the core +- Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment: + +``` +$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data' +``` +- After that the `statistics-2019` core was immediately available in the Solr UI, but after restarting Tomcat it was gone + - I wonder if importing some old statistics into the current `statistics` core and then letting DSpace create the `statistics-2019` core itself using `dspace stats-util -s` will work... 
+- First export a small slice of 2019 stats from the main CGSpace `statistics` core, skipping Atmire schema additions: + +``` +$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId +``` + +- Then import into my local `statistics` core: + +``` +$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid +$ ~/dspace63/bin/dspace stats-util -s +Moving: 21993 into core statistics-2019 +``` + +- To my surprise, the `statistics-2019` core is created and the documents are immediately visible in the Solr UI! 
+ - Also, I am able to see the stats in DSpace's default "View Usage Statistics" screen + - Items appear with the words "(legacy)" at the end, ie "Improving farming practices in flood-prone areas in the Solomon Islands(legacy)" + - Interestingly, if I make a bunch of requests for that item they will not be recognized as the same item, showing up as "Improving farming practices in flood-prone areas in the Solomon Islands" without the legacy identifier + - I need to remember to test out the [SolrUpgradePre6xStatistics tool](https://wiki.lyrasis.org/display/DSDOC6x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-UpgradeLegacyDSpaceObjectIdentifiers(pre-6xstatistics)toDSpace6xUUIDIdentifiers) +- After restarting my local Tomcat on DSpace 6.4-SNAPSHOT the `statistics-2019` core loaded up... + - I wonder what the difference is between the core I created vs the one created by `stats-util`? + - I'm honestly considering just moving everything back into one core... + - Or perhaps I can export all the stats for 2019 by month, then delete everything, re-import each month, and migrate them with stats-util +- Testing some [proposed patches for 6.4](https://wiki.lyrasis.org/display/DSPACE/DSpace+6.4+Release+Status) in my local `6_x-dev64` branch +- [DS-4135 (citation author UTF-8)](https://jira.lyrasis.org/browse/DS-4135) + - Testing [item 10568/106959](https://hdl.handle.net/10568/106959) before and after: + +``` + + +``` + +- [DS-4397 controlled vocabulary loading speedup](https://jira.lyrasis.org/browse/DS-4397) + diff --git a/docs/2020-02/index.html b/docs/2020-02/index.html index 4bcaff87d..e944c565b 100644 --- a/docs/2020-02/index.html +++ b/docs/2020-02/index.html @@ -20,7 +20,7 @@ The code finally builds and runs with a fresh install - + @@ -45,9 +45,9 @@ The code finally builds and runs with a fresh install "@type": "BlogPosting", "headline": "February, 2020", "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-02\/", - "wordCount": "6499", + "wordCount": 
"6988", "datePublished": "2020-02-02T11:56:30+02:00", - "dateModified": "2020-02-25T09:14:30+02:00", + "dateModified": "2020-02-25T21:05:22+02:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -384,7 +384,7 @@ $ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jerse
ReactorNetty/0.9.2.RELEASE
 
$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
 $ ls -lh /tmp/statistics-2019-01.json
@@ -1152,12 +1152,81 @@ Total number of bot hits purged: 159
 
  • Add new bots
  • -
  • One benefit of all this is that the size of the statistics Solr core has reduced by 6GB since yesterday, though I can’t remember how big it was before that +
  • One benefit of all this is that the size of the statistics Solr core has reduced by 6GiB since yesterday, though I can’t remember how big it was before that
      +
    • According to my notes it was 43GiB in January when it failed the first time
    • I wonder if the sharding process would work now…
  • +

    2020-02-26

    +
      +
    • Bosede finally got back to me about the IITA records from earlier last month (IITA_201907_Jan13) +
        +
      • She said she has added more information to fifty-three of the journal articles, as I had requested
      • +
      +
    • +
    • I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month:
    • +
    +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
    +$ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-util.log.$(date --iso-8601)
    +
      +
    • Interestingly I saw this in the Solr log:
    • +
    +
    2020-02-26 08:55:47,433 INFO  org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/
    +2020-02-26 08:55:47,511 INFO  org.apache.solr.servlet.SolrDispatchFilter @ [admin] webapp=null path=/admin/cores params={dataDir=[dspace]/solr/statistics-2019/data&name=statistics-2019&action=CREATE&instanceDir=statistics&wt=javabin&version=2} status=0 QTime=590
    +
      +
    • The process has been going for several hours now and I suspect it will fail eventually +
        +
      • I want to explore manually creating and migrating the core
      • +
      +
    • +
    • Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment:
    • +
    +
    $ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
    +
      +
    • After that the statistics-2019 core was immediately available in the Solr UI, but after restarting Tomcat it was gone +
        +
      • I wonder if importing some old statistics into the current statistics core and then letting DSpace create the statistics-2019 core itself using dspace stats-util -s will work…
      • +
      +
    • +
    • First export a small slice of 2019 stats from the main CGSpace statistics core, skipping Atmire schema additions:
    • +
    +
    $ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
    +
      +
    • Then import into my local statistics core:
    • +
    +
    $ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid
    +$ ~/dspace63/bin/dspace stats-util -s
    +Moving: 21993 into core statistics-2019
    +
      +
    • To my surprise, the statistics-2019 core is created and the documents are immediately visible in the Solr UI! +
        +
      • Also, I am able to see the stats in DSpace’s default “View Usage Statistics” screen
      • +
      • Items appear with the words “(legacy)” at the end, ie “Improving farming practices in flood-prone areas in the Solomon Islands(legacy)”
      • +
      • Interestingly, if I make a bunch of requests for that item they will not be recognized as the same item, showing up as “Improving farming practices in flood-prone areas in the Solomon Islands” without the legacy identifier
      • +
      • I need to remember to test out the SolrUpgradePre6xStatistics tool
      • +
      +
    • +
    • After restarting my local Tomcat on DSpace 6.4-SNAPSHOT the statistics-2019 core loaded up… +
        +
      • I wonder what the difference is between the core I created vs the one created by stats-util?
      • +
      • I’m honestly considering just moving everything back into one core…
      • +
      • Or perhaps I can export all the stats for 2019 by month, then delete everything, re-import each month, and migrate them with stats-util
      • +
      +
    • +
    • Testing some proposed patches for 6.4 in my local 6_x-dev64 branch
    • +
    • DS-4135 (citation author UTF-8) + +
    • +
    +
    <meta content="Thu hoạch v&agrave; bảo quản c&agrave; ph&ecirc; ch&egrave; đ&uacute;ng kỹ thuật (Harvesting and storing Arabica coffee)" name="citation_title">
    +<meta name="citation_title" content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" />
    +
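The idea in the 2020-02-26 notes of exporting the 2019 stats month by month could be sketched roughly as below. This is only a dry run that prints one `run.sh` export command per month rather than executing anything. The `-s`, `-a`, `-o`, `-f` and `-k` flags and the `dateYearMonth` filter mirror the invocations already shown in these notes; the `export_commands` function name, the `SOLR_URL` variable and the `/tmp` output file names are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: print (do not run) one export command per month of a year.
# Flags mirror the run.sh calls in the notes; SOLR_URL and the /tmp output
# paths are assumptions, not settings taken from the real server.
set -euo pipefail

SOLR_URL="${SOLR_URL:-http://localhost:8081/solr/statistics}"

# Emit the export command for every month of the given year.
export_commands() {
    local year="$1"
    local month
    for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
        printf "./run.sh -s %s -a export -o /tmp/statistics-%s-%s.json -f 'dateYearMonth:%s-%s' -k uid\n" \
            "$SOLR_URL" "$year" "$month" "$year" "$month"
    done
}

export_commands 2019
```

The printed commands can be reviewed before running any of them. If the Atmire schema additions should be left out, as in the single-day export in the notes, the same `-S` field list would need to be appended to each command.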
    diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 0d8065249..0f3cb92a2 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -4,27 +4,27 @@ https://alanorth.github.io/cgspace-notes/categories/ - 2020-02-25T09:14:30+02:00 + 2020-02-25T21:05:22+02:00 https://alanorth.github.io/cgspace-notes/ - 2020-02-25T09:14:30+02:00 + 2020-02-25T21:05:22+02:00 https://alanorth.github.io/cgspace-notes/2020-02/ - 2020-02-25T09:14:30+02:00 + 2020-02-25T21:05:22+02:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2020-02-25T09:14:30+02:00 + 2020-02-25T21:05:22+02:00 https://alanorth.github.io/cgspace-notes/posts/ - 2020-02-25T09:14:30+02:00 + 2020-02-25T21:05:22+02:00