Add notes for 2020-10-26

2020-10-26 16:34:45 +03:00
parent da88f0e7a9
commit 5f76797488
22 changed files with 249 additions and 28 deletions

@ -656,4 +656,111 @@ $ http 'http://localhost:9200/openrxv-items-final/_search?_source_includes=affil
- replace: International Livestock Research Institute
- I re-uploaded the mappings to Elasticsearch like I did yesterday and restarted the harvesting
## 2020-10-24
- Atmire sent a small version bump to CUA (6.x-4.1.10-ilri-RC5) to fix the logging of bot requests when `usage-statistics.logBots` is false
- I tested it by making several requests to DSpace Test with the `RTB website BOT` and `Delphi 2009` user agents and can verify that they are no longer logged
- I spent a few hours working on mappings on AReS
- I decided to do a full re-harvest on AReS with *no mappings* so I could extract the CRPs and affiliations to see how much work they needed
- I worked on my Python script to process some cleanups of the values to create find/replace mappings for common scenarios:
- Removing acronyms from the end of strings
- Removing "CRP on " from strings
- The problem is that the mappings are applied to all fields, and we want to keep "CGIAR Research Program on ..." in the authors, but not in the CRPs field
- Really the best solution is to have each repository use the same controlled vocabularies
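- A rough sketch of what such a cleanup could look like (this is not the actual script: the function names are hypothetical, and I am assuming acronyms appear in parentheses at the end and that "CRP on " appears at the start of the string):

```python
import re

# Assumed patterns, for illustration only
ACRONYM_SUFFIX = re.compile(r"\s*\(([A-Z][A-Za-z0-9&-]*)\)\s*$")
CRP_PREFIX = re.compile(r"^CRP on\s+")


def clean_value(value: str) -> str:
    """Strip a trailing parenthesized acronym and a leading 'CRP on '."""
    value = ACRONYM_SUFFIX.sub("", value.strip())
    value = CRP_PREFIX.sub("", value)
    return value


def build_mappings(values):
    """Return {original: cleaned} only for values that the cleanup changes."""
    mappings = {}
    for original in values:
        cleaned = clean_value(original)
        if cleaned != original:
            mappings[original] = cleaned
    return mappings


if __name__ == "__main__":
    sample = [
        "International Livestock Research Institute (ILRI)",
        "CRP on Livestock",
        "Livestock",
    ]
    for find, replace in build_mappings(sample).items():
        print(f"find: {find} -> replace: {replace}")
```

- Note that a sketch like this only generates the find/replace pairs; it does not address the problem of the mappings being applied to every field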
## 2020-10-25
- I re-installed DSpace Test with a fresh snapshot of CGSpace to test the DSpace 6 upgrade (the last time was in 2020-05, and we've fixed a lot of issues since then):
```
$ cp dspace/etc/postgres/update-sequences.sql /tmp/dspace5-update-sequences.sql
$ git checkout origin/6_x-dev-atmire-modules
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
$ sudo su - postgres
$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;'
$ psql dspacetest -c "DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');"
$ exit
$ sudo systemctl stop tomcat7
$ cd dspace/target/dspace-installer
$ rm -rf /blah/dspacetest/config/spring
$ ant update
$ dspace database migrate
(10 minutes)
$ sudo systemctl start tomcat7
(discovery indexing starts)
```
- Then I started processing the Solr stats one core and 1 million records at a time:
```
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
```
- After the fifth or so run I got this error:
```
Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- So basically, as I saw at this same step in 2020-05, there are some documents that have IDs that have *not* been converted to UUID, and have *not* been labeled as "unmigrated" either...
- I see there are about 217,000 of them, 99% of which are of `type: 5` which is "search"
- I purged them:
```
$ curl -s "http://localhost:8083/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
```
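- To illustrate which documents that delete query selects, here is a small Python analogue of its two regular expressions (a sketch only: the helper name is hypothetical, and I am assuming `id` is indexed as a single untokenized term, in which case a Lucene regexp query has to match the whole value, like `fullmatch()` does):

```python
import re

# Same patterns as in the Solr delete query above
UUID_LENGTH = re.compile(r".{36}")        # any id that is exactly 36 characters long
UNMIGRATED = re.compile(r".+-unmigrated")  # ids labeled with the -unmigrated suffix


def would_be_purged(doc_id: str) -> bool:
    """True if the id is neither 36 characters long nor suffixed with -unmigrated."""
    return not UUID_LENGTH.fullmatch(doc_id) and not UNMIGRATED.fullmatch(doc_id)


if __name__ == "__main__":
    # Example ids are made up for illustration
    for doc_id in ["10", "10-unmigrated", "0a1b2c3d-4e5f-6789-abcd-ef0123456789"]:
        print(doc_id, would_be_purged(doc_id))
```

- The `.{36}` pattern only checks the length, so any 36-character id is kept whether or not it is a well-formed UUID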
- Then I restarted the `solr-upgrade-statistics-6x` process, which apparently had no records left to process
- I started processing the statistics-2019 core...
- I managed to process 7.5 million records in 7 hours without any errors!
## 2020-10-26
- The statistics processing on the statistics-2018 core errored after 1.8 million records:
```
Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
```
- I had the same problem when I processed the statistics-2018 core in 2020-07 and 2020-08
- I will try to purge some unmigrated records (around 460,000), most of which are of `type: 5` (search) and so not relevant to our views and downloads anyway:
```console
$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>id:/.+-unmigrated/</query></delete>"
```
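- For reference, a per-type breakdown of records like these can be requested from Solr with a facet query on the `type` field; a minimal sketch that only builds the URL (the helper name is hypothetical, and the host and port are assumed to be the same as in the `curl` commands here):

```python
from urllib.parse import urlencode


def facet_by_type_url(core: str, query: str, base: str = "http://localhost:8083/solr") -> str:
    """Build a Solr select URL that returns no documents, only counts per type."""
    params = {
        "q": query,
        "rows": 0,            # only the counts are needed, not the documents
        "wt": "json",
        "facet": "true",
        "facet.field": "type",
    }
    return f"{base}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Same regular expression as in the delete query above
    print(facet_by_type_url("statistics-2018", "id:/.+-unmigrated/"))
```

- The printed URL can be passed to `curl` or `http`; `urlencode` takes care of escaping the slashes and the plus sign in the regular expression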
- I restarted the process and it crashed again a few minutes later
- I increased the memory to 4096m and tried again
- It eventually completed, after which I purged all 350,000 remaining unmigrated records (99% of which were `type: 5`):
```
$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
```
- Then I started processing the statistics-2017 core...
- I filed an issue with Atmire about the duplicate values in the `owningComm` and `containerCommunity` fields in Solr: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839
- Add new ORCID identifier for [Perle LATRE DE LATE](https://orcid.org/0000-0003-3871-6277) to controlled vocabulary
<!-- vim: set sw=2 ts=2: -->