- I re-uploaded the mappings to Elasticsearch like I did yesterday and restarted the harvesting

## 2020-10-24

- Atmire sent a small version bump to CUA (6.x-4.1.10-ilri-RC5) to fix the logging of bot requests when `usage-statistics.logBots` is false
- I tested it by making several requests to DSpace Test with the `RTB website BOT` and `Delphi 2009` user agents and can verify that they are no longer logged
- I spent a few hours working on mappings on AReS
- I decided to do a full re-harvest on AReS with *no mappings* so I could extract the CRPs and affiliations to see how much work they needed
- I worked on my Python script to process some cleanups of the values to create find/replace mappings for common scenarios:
  - Removing acronyms from the end of strings
  - Removing "CRP on " from strings
- The problem is that the mappings are applied to all fields, and we want to keep "CGIAR Research Program on ..." in the authors, but not in the CRPs field
- Really the best solution is to have each repository use the same controlled vocabularies

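The cleanup heuristics above could be sketched roughly like this (not the actual script; `clean_value`, `build_mappings`, and the regexes are my own illustration):

```python
import re

def clean_value(value):
    """Apply the cleanups described above to a single metadata value."""
    # Remove a trailing acronym in parentheses, e.g. "... (ILRI)"
    value = re.sub(r"\s+\([A-Z]+\)$", "", value)
    # Remove "CRP on " from the start of strings
    value = re.sub(r"^CRP on ", "", value)
    return value

def build_mappings(values):
    """Create find/replace mappings for values that change after cleanup."""
    return {v: clean_value(v) for v in values if clean_value(v) != v}
```

As noted above, such blanket find/replace mappings apply to every field, which is exactly why they cause trouble for author values.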
## 2020-10-25

- I re-installed DSpace Test with a fresh snapshot of CGSpace to test the DSpace 6 upgrade (the last time was in 2020-05, and we've fixed a lot of issues since then):

```console
$ cp dspace/etc/postgres/update-sequences.sql /tmp/dspace5-update-sequences.sql
$ git checkout origin/6_x-dev-atmire-modules
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
$ sudo su - postgres
$ psql dspacetest -c 'CREATE EXTENSION pgcrypto;'
$ psql dspacetest -c "DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');"
$ exit
$ sudo systemctl stop tomcat7
$ cd dspace/target/dspace-installer
$ rm -rf /blah/dspacetest/config/spring
$ ant update
$ dspace database migrate
(10 minutes)
$ sudo systemctl start tomcat7
(discovery indexing starts)
```

- Then I started processing the Solr stats one core and 1 million records at a time:

```console
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics
```

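Rather than re-running the same command by hand, the batching could be scripted, for example with a small Python wrapper (a sketch under my own naming; the command line is the one shown above, and the `runner` parameter exists only so the loop can be exercised without a DSpace installation):

```python
import subprocess

def upgrade_statistics(core, batches, batch_size=1_000_000, runner=subprocess.run):
    """Invoke solr-upgrade-statistics-6x repeatedly, one batch of
    records per run, as done by hand above."""
    for _ in range(batches):
        cmd = [
            "chrt", "-b", "0",
            "dspace", "solr-upgrade-statistics-6x",
            "-n", str(batch_size),
            "-i", core,
        ]
        runner(cmd, check=True)
```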
- After the fifth or so run I got this error:

```
Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
        at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
        at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```

- So basically, as I saw at this same step in 2020-05, there are some documents that have IDs that have *not* been converted to UUID, and have *not* been labeled as "unmigrated" either...
- I see there are about 217,000 of them, 99% of which are of `type: 5`, which is "search"
- I purged them:

```console
$ curl -s "http://localhost:8083/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
```

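For reference, the delete query above removes every document whose `id` is neither a 36-character UUID nor suffixed with `-unmigrated`; the logic can be sanity-checked in Python (a sketch mirroring the two Solr regexes; the function name is my own):

```python
import re

def would_be_purged(doc_id):
    """True if the Solr delete query above would remove this document:
    its id is not a 36-character string (i.e. not a UUID) and does not
    end in '-unmigrated'."""
    is_uuid_shaped = re.fullmatch(r".{36}", doc_id) is not None
    is_unmigrated = re.fullmatch(r".+-unmigrated", doc_id) is not None
    return not is_uuid_shaped and not is_unmigrated
```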
- Then I restarted the `solr-upgrade-statistics-6x` process, which apparently had no records left to process
- I started processing the statistics-2019 core...
- I managed to process 7.5 million records in 7 hours without any errors!

## 2020-10-26

- The statistics processing on the statistics-2018 core errored after 1.8 million records:

```
Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
```

- I had the same problem when I processed the statistics-2018 core in 2020-07 and 2020-08
- I will try to purge some unmigrated records (around 460,000), most of which are of `type: 5` (search) and so not relevant to our views and downloads anyway:

```console
$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>id:/.+-unmigrated/</query></delete>"
```

- I restarted the process and it crashed again a few minutes later
- I increased the memory to 4096m and tried again
- It eventually completed, after which I purged all 350,000 remaining unmigrated records (99% of which were `type: 5`):

```console
$ curl -s "http://localhost:8083/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>(*:* NOT id:/.{36}/) AND (*:* NOT id:/.+-unmigrated/)</query></delete>"
```

- Then I started processing the statistics-2017 core...
- I filed an issue with Atmire about the duplicate values in the `owningComm` and `containerCommunity` fields in Solr: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839
- Add new ORCID identifier for [Perle LATRE DE LATE](https://orcid.org/0000-0003-3871-6277) to the controlled vocabulary

<!-- vim: set sw=2 ts=2: -->