CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

March, 2021

2021-03-01

  • Discuss some OpenRXV issues with Abdullah from CodeObia
    • He’s trying to work on the DSpace 6+ metadata schema autoimport using the DSpace 6+ REST API
    • Also, we found some issues building and running OpenRXV currently due to an ecosystem shift in the Node.js dependencies
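    • For reference, a rough sketch in Python of fetching the metadata registry from the DSpace 6 REST API (the endpoint path and hostname here are assumptions on my part, not details of Abdullah’s implementation):
import requests

# Hypothetical sketch: fetch the metadata schema registry from a DSpace 6 REST API.
# The endpoint path and hostname are assumptions, not necessarily what OpenRXV's
# autoimport will use.
url = 'https://cgspace.cgiar.org/rest/registries/schema'
r = requests.get(url, headers={'Accept': 'application/json'})
r.raise_for_status()

# Print the raw schema objects; OpenRXV would map these to its harvester config
for schema in r.json():
    print(schema)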

2021-03-02

2021-03-03

2021-03-04

  • Peter is having issues with the workflow since yesterday
    • I looked at the Munin stats and see a high number of database locks since yesterday

Munin graphs: PostgreSQL locks (week) and PostgreSQL connections (week)

  • I looked at the number of connections in PostgreSQL and it’s definitely high again:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1020
  • I reported it to Atmire to take a look, in the same issue where we have been tracking this before
  • Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell
  • I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my add-orcid-identifiers-csv.py script:
$ cat 2021-03-04-add-zoe-campbell-orcid.csv 
dc.contributor.author,cg.creator.identifier
"Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
"Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p 'fuuu'
  • I still need to do cleanup on the journal articles metadata
    • Peter sent me some cleanups but I can’t use them in the search/replace format he gave
    • I think it’s better to export the metadata values with IDs and import cleaned up ones as CSV
localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 32087
  • I used OpenRefine to remove all journal values that didn’t contain one of these characters: “;”, “(”, or “)”
    • Then I cloned the cg.journal field to cg.volume and cg.issue
    • I used some GREL expressions like these to extract the journal name, volume, and issue:
value.partition(';')[0].trim() # to get journal names
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues
  • Then I uploaded the changes to CGSpace using dspace metadata-import
  • Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
  • Yesterday Abenet added me to the approver/editor steps of a WLE collection so we can try to figure out why Niroshini is having issues adding metadata to Udana’s submissions
    • I edited Udana’s submission to CGSpace:
      • corrected the title
      • added language English
      • changed the link to the external item page instead of PDF
      • added SDGs from the external item page
      • added AGROVOC subjects from the external item page
      • added pagination (extent)
      • changed the license to “other” because CC-BY-NC-ND is not printed anywhere in the PDF or external item page

2021-03-05

  • I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:
$ docker-compose -f docker/docker-compose.yml down
$ docker volume create docker_esData_7
$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
$ docker rm es_dummy
# edit docker/docker-compose.yml to switch from bind mount to volume
$ docker-compose -f docker/docker-compose.yml up -d
  • The trick is that when you create a volume like “myvolume” from a docker-compose.yml file, Docker will create it with the name “docker_myvolume”
    • If you create it manually on the command line with docker volume create myvolume then the name is literally “myvolume”
  • I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit
  • Delete the openrxv-items-temp index to test a fresh harvesting:
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'

2021-03-06

  • Check the results of the AReS harvesting from last night:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
  "count" : 101761,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
  • Set the current items index to read only and make a backup:
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
  • Delete the current items index and clone the temp one to it:
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
  • Then delete the temp and backup:
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{"acknowledged":true}%
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'

2021-03-07

  • I realized there is something wrong with the Elasticsearch indexes on AReS
    • On a new test environment I see openrxv-items is correctly an alias of openrxv-items-final:
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    },
  • But on AReS production openrxv-items has somehow become an index:
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items": {
        "aliases": {}
    },
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {}
    },
  • I fixed the issue on production by cloning the openrxv-items index to openrxv-items-final, deleting openrxv-items, and then re-creating it as an alias:
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
  • Delete backups and remove read-only mode on openrxv-items:
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
  • Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
    • Looking in the logs I see a few IPs making heavy use of the REST API and XMLUI:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
  • I see the usual IPs for CCAFS and ILRI importer bots, but also 143.233.242.132 which appears to be for GARDIAN:
# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
6237
# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
6418

2021-03-08

  • I approved the WLE item that I edited last week, and all the metadata is there: https://hdl.handle.net/10568/111810
    • So I’m not sure what Niroshini’s issue with metadata is…
  • Peter sent a message yesterday saying that his item finally got committed
    • I looked at the Munin graphs and there was a MASSIVE spike in database activity two days ago, and now database locks are back down to normal levels (from 1000+):
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
  • On 2021-03-03 the PostgreSQL transactions started rising:

Munin graph: PostgreSQL query length (week)

  • After that the connections and locks started going up, peaking on 2021-03-06:

Munin graphs: PostgreSQL locks (week) and PostgreSQL connections (week)

  • I sent another message to Atmire to ask if they have time to look into this
  • CIFOR is pressuring me to upload the batch items from last week
    • Vika sent me a final file, with the duplicates that Peter identified now removed
    • I extracted and re-applied my basic corrections from last week in OpenRefine, then ran the items through csv-metadata-quality checker and uploaded them to CGSpace
    • In total there are 1,088 items
  • Udana from IWMI emailed to ask about CGSpace thumbnails
  • Udana from IWMI emailed to ask about an item uploaded recently that does not appear in AReS
    • The item was added to the archive on 2021-03-05, and I last harvested on 2021-03-06, so it should have been included; this might be a case of the missing items issue
  • Abenet got a quote from Atmire to buy 125 credits for 3750€
  • Maria at Bioversity sent some feedback about duplicate items on AReS
  • I’m wondering if the issue of the openrxv-items-final index not getting cleared after a successful harvest (which results in having 200,000, then 300,000, etc. items) has to do with the alias issue I fixed yesterday
    • I will start a fresh harvest on AReS now to check, but first back up the current index just in case:
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
# start harvesting on AReS
  • As I saw on my local test instance, even when you cancel a harvesting, it replaces the openrxv-items-final index with whatever is in openrxv-items-temp automatically, so I assume it will do the same now

2021-03-09

  • The harvesting on AReS finished last night and everything worked as expected, with no manual intervention
    • This means that the issue we were facing for a few months was due to the openrxv-items index being deleted and re-created as a standalone index instead of an alias of openrxv-items-final
  • Talk to Moayad about OpenRXV development
    • We realized that the missing/duplicate items issue is probably due to the long harvesting time on the REST API: in the time between starting the harvest on page 0 and finishing it on page 900 (in the CGSpace example), some items will have been added to the repository, which causes the pages to shift
    • I proposed a solution in the GitHub issue, where we consult the site’s XML sitemap after harvesting to see if we missed any items, and then harvest them individually (a sketch of the idea is below)
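    • A minimal sketch of that idea in Python (OpenRXV itself is Node.js; the sitemap URL and the source of harvested handles are assumptions for illustration):
import requests
import xml.etree.ElementTree as ET

# Sketch of the proposed reconciliation step (not OpenRXV's actual code): compare
# the item URLs in a DSpace sitemap file with what we harvested, then re-harvest
# anything missing. In reality the top-level sitemap is an index that points to
# numbered sub-sitemaps, each of which would need to be fetched.
sitemap_url = 'https://cgspace.cgiar.org/sitemap'  # assumption
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

r = requests.get(sitemap_url)
r.raise_for_status()
root = ET.fromstring(r.content)
sitemap_handles = {
    loc.text.split('/handle/')[1]
    for loc in root.findall('.//sm:loc', ns)
    if loc.text and '/handle/' in loc.text
}

harvested_handles = set()  # placeholder: would come from the Elasticsearch index

missing = sitemap_handles - harvested_handles
print(f'{len(missing)} items in the sitemap were not harvested')
# ...then fetch each missing item individually from the REST API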
  • Peter sent me a list of 356 DOIs from Altmetric that don’t have our Handles, so we need to Tweet them
    • I used my doi-to-handle.py script to generate a list of handles and titles for him:
$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'

2021-03-10

  • Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses cg.isijournal and MELSpace uses mel.impact-factor
    • I filed an issue on the cg-core project to ask colleagues for ideas
  • Peter said he doesn’t see “Source Code” or “Software” in the output type facet on the ILRI community, but I see it on the home page, so I will try to do a full Discovery re-index:
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    318m20.485s
user    215m15.196s
sys     2m51.529s
  • Now I see ten items for “Source Code” in the facets…
  • Add GPL and MIT licenses to the list of licenses on CGSpace input form since we will start capturing more software and source code
  • Added the ability to check dcterms.license values against the SPDX licenses in the csv-metadata-quality tool
    • Also, I made some other minor fixes and released version 0.4.6 on GitHub
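    • For reference, a simplified sketch of such an SPDX check in Python (not the tool’s actual implementation; the license list and column name here are just examples):
import pandas as pd

# Simplified sketch of checking dcterms.license values against SPDX identifiers.
# The license set is hard-coded for illustration; a real check would load the
# full SPDX license list from a data source.
SPDX_LICENSES = {'CC-BY-4.0', 'CC-BY-NC-ND-4.0', 'GPL-3.0-only', 'MIT'}

df = pd.read_csv('items.csv')  # hypothetical input file
for value in df['dcterms.license'].dropna():
    if value not in SPDX_LICENSES:
        print(f'Invalid SPDX license identifier: {value}')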
  • Proof and upload twenty-seven items to CGSpace for Peter Ballantyne
    • Mostly Ugandan outputs for CRP Livestock and Livestock and Fish

2021-03-14

  • Switch to linux-kvm kernel on linode20 and linode18:
# apt update && apt full-upgrade
# apt install linux-kvm
# apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
# apt autoremove && apt autoclean
# reboot
  • Deploy latest changes from 6_x-prod branch on CGSpace
  • Deploy latest changes from OpenRXV master branch on AReS
  • Last week Peter added OpenRXV to CGSpace: https://hdl.handle.net/10568/112982
  • Back up the current openrxv-items-final index on AReS to start a new harvest:
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
  • After the harvesting finished it seems the indexes got messed up again, as openrxv-items is an alias of openrxv-items-temp instead of openrxv-items-final:
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },
  • Anyways, the number of items in openrxv-items seems OK and the AReS Explorer UI is working fine
    • I will have to manually fix the indexes before the next harvesting
  • Publish the web version of the DSpace CSV Metadata Quality checker tool that I wrote this weekend on GitHub: https://github.com/ilri/csv-metadata-quality-web

2021-03-16

  • Review ten items for Livestock and Fish and Dryland Systems from Peter
    • I told him to try the new web-based CSV Metadata Quality checker and he thought it was cool
    • I found one exact duplicate item and it gave me an idea to try to detect this in the tool
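    • A simple way to flag exact duplicates in a metadata CSV with pandas (a sketch of the idea, not necessarily how I will implement it in the tool; column names are examples):
import pandas as pd

# Sketch: treat rows with the same title, type, and issue date as exact duplicates.
# The column names are examples and may differ from the real input file.
df = pd.read_csv('items.csv')
key = ['dc.title', 'dcterms.type', 'dcterms.issued']
duplicates = df[df.duplicated(subset=key, keep='first')]
for _, row in duplicates.iterrows():
    print(f"Possible duplicate: {row['dc.title']}")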

2021-03-17

2021-03-18

  • I added the ability to check for, and fix, “mojibake” characters in csv-metadata-quality
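    • A minimal sketch of the general idea using the ftfy library (one way to do it, not necessarily the exact code in csv-metadata-quality):
from ftfy import fix_text

# Sketch: if "fixing" the text changes it, the value probably contains mojibake
value = 'CIAT PublicaÃ§Ã£o'  # should be 'CIAT Publicação'
fixed = fix_text(value)
if fixed != value:
    print(f'Possible mojibake: {value!r} -> {fixed!r}')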

2021-03-21

  • Last week Atmire asked me which browser I was using to test the duplicate checker, which I had reported as not loading
    • I tried to load it in Chrome and it works… hmmm
  • Back up the current openrxv-items-final index to start a fresh AReS Harvest:
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
  • Then start harvesting in the AReS Explorer admin UI

2021-03-22

  • The harvesting on AReS yesterday completed, but somehow I have twice the number of items:
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
  "count" : 206204,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
  • Hmmm and even my backup index has a strange number of items:
$ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&pretty'
{
  "count" : 844,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
  • I deleted all indexes and re-created the openrxv-items alias:
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items-temp": {
        "aliases": {}
    },
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    }
  • Then I started a new harvesting
  • I switched the Node.js version in the Ansible infrastructure scripts to v12 since v10 will cease to be supported soon
    • I re-deployed DSpace Test (linode26) with Node.js 12 and restarted the server
  • The AReS harvest finally finished, with 1047 pages of items, but the openrxv-items-final index is empty and the openrxv-items-temp index has 103,000 items:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
  "count" : 103162,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
  • I tried to clone the temp index to the final, but got an error:
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"}],"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"},"status":400}% 
  • I looked in the Docker logs for Elasticsearch and saw a few memory errors:
java.lang.OutOfMemoryError: Java heap space
  • According to /usr/share/elasticsearch/config/jvm.options in the Elasticsearch container the default JVM heap is 1g
    • I see the running Java process has -Xms1g -Xmx1g in its invocation, so I guess it must indeed be using 1g
    • We can change the heap size with the ES_JAVA_OPTS environment variable
    • Or perhaps better, we should use a jvm.options.d file because if you use the environment variable it overrides all other JVM options from the default jvm.options
    • I tried to set memory to 1536m by binding an options file and restarting the container, but it didn’t seem to work
    • Nevertheless, after restarting I see 103,000 items in the Explorer…
    • But the indexes are still kinda messed up… the openrxv-items index is an alias of the wrong index!
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },

2021-03-23

  • For reference you can also get the Elasticsearch JVM stats from the API:
$ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
  • Also, re-create the openrxv-items-temp index:
$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'

2021-03-24

  • Atmire responded to the ticket about the Duplicate Checker
    • He says it works for him in Firefox, so I checked and it seems to have been an issue with my LocalCDN addon
  • I re-deployed DSpace Test (linode26) from the latest CGSpace (linode18) data
    • I want to try to finish up processing the duplicates in Solr that Atmire advised on last month
    • The current statistics core is 57861236 kilobytes (about 55 GB):
# du -s /home/dspacetest.cgiar.org/solr/statistics
57861236        /home/dspacetest.cgiar.org/solr/statistics
  • I applied their changes to config/spring/api/atmire-cua-update.xml and started the duplicate processor:
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 1000 -c statistics -t 12
  • The default number of records per query is 10,000, which caused memory issues, so I will try with 1000 (Atmire used 100, but that seems too low!)
  • Hah, I still got a memory error after only a few minutes:
...
Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s                                      
Exception: GC overhead limit exceeded                                                                          
java.lang.OutOfMemoryError: GC overhead limit exceeded 
  • I guess we really do have to use -r 100
  • Now the thing runs for a few minutes and “finishes”:
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
Loading @mire database changes for module MQM
Changes have been processed


*************************
* Update Script Started *
*************************

Run 1
Start updating Solr Storage Reports | Wed Mar 24 14:42:17 CET 2021
Deleting old storage docs from Solr... | Wed Mar 24 14:42:17 CET 2021
Done. | Wed Mar 24 14:42:17 CET 2021
Processing storage reports for type: eperson | Wed Mar 24 14:42:17 CET 2021
Done. | Wed Mar 24 14:42:41 CET 2021
Processing storage reports for type: group | Wed Mar 24 14:42:41 CET 2021
Done. | Wed Mar 24 14:45:46 CET 2021
Processing storage reports for type: collection | Wed Mar 24 14:45:46 CET 2021
Done. | Wed Mar 24 14:45:54 CET 2021
Processing storage reports for type: community | Wed Mar 24 14:45:54 CET 2021
Done. | Wed Mar 24 14:45:58 CET 2021
Committing to Solr... | Wed Mar 24 14:45:58 CET 2021
Done. | Wed Mar 24 14:45:59 CET 2021
Successfully finished updating Solr Storage Reports | Wed Mar 24 14:45:59 CET 2021
Run 1 —   2% — 100/4,824 docs — 3m 47s — 3m 47s
Run 1 —   4% — 200/4,824 docs — 2s — 3m 50s
Run 1 —   6% — 300/4,824 docs — 2s — 3m 53s
Run 1 —   8% — 400/4,824 docs — 2s — 3m 55s
Run 1 —  10% — 500/4,824 docs — 2s — 3m 58s
Run 1 —  12% — 600/4,824 docs — 2s — 4m 1s
Run 1 —  15% — 700/4,824 docs — 2s — 4m 3s
Run 1 —  17% — 800/4,824 docs — 2s — 4m 6s
Run 1 —  19% — 900/4,824 docs — 2s — 4m 9s
Run 1 —  21% — 1,000/4,824 docs — 2s — 4m 11s
Run 1 —  23% — 1,100/4,824 docs — 2s — 4m 14s
Run 1 —  25% — 1,200/4,824 docs — 2s — 4m 16s
Run 1 —  27% — 1,300/4,824 docs — 2s — 4m 19s
Run 1 —  29% — 1,400/4,824 docs — 2s — 4m 22s
Run 1 —  31% — 1,500/4,824 docs — 2s — 4m 24s
Run 1 —  33% — 1,600/4,824 docs — 2s — 4m 27s
Run 1 —  35% — 1,700/4,824 docs — 2s — 4m 29s
Run 1 —  37% — 1,800/4,824 docs — 2s — 4m 32s
Run 1 —  39% — 1,900/4,824 docs — 2s — 4m 35s
Run 1 —  41% — 2,000/4,824 docs — 2s — 4m 37s
Run 1 —  44% — 2,100/4,824 docs — 2s — 4m 40s
Run 1 —  46% — 2,200/4,824 docs — 2s — 4m 42s
Run 1 —  48% — 2,300/4,824 docs — 2s — 4m 45s
Run 1 —  50% — 2,400/4,824 docs — 2s — 4m 48s
Run 1 —  52% — 2,500/4,824 docs — 2s — 4m 50s
Run 1 —  54% — 2,600/4,824 docs — 2s — 4m 53s
Run 1 —  56% — 2,700/4,824 docs — 2s — 4m 55s
Run 1 —  58% — 2,800/4,824 docs — 2s — 4m 58s
Run 1 —  60% — 2,900/4,824 docs — 2s — 5m 1s
Run 1 —  62% — 3,000/4,824 docs — 2s — 5m 3s
Run 1 —  64% — 3,100/4,824 docs — 2s — 5m 6s
Run 1 —  66% — 3,200/4,824 docs — 3s — 5m 9s
Run 1 —  68% — 3,300/4,824 docs — 2s — 5m 12s
Run 1 —  70% — 3,400/4,824 docs — 2s — 5m 14s
Run 1 —  73% — 3,500/4,824 docs — 2s — 5m 17s
Run 1 —  75% — 3,600/4,824 docs — 2s — 5m 20s
Run 1 —  77% — 3,700/4,824 docs — 2s — 5m 22s
Run 1 —  79% — 3,800/4,824 docs — 2s — 5m 25s
Run 1 —  81% — 3,900/4,824 docs — 2s — 5m 27s
Run 1 —  83% — 4,000/4,824 docs — 2s — 5m 30s
Run 1 —  85% — 4,100/4,824 docs — 2s — 5m 33s
Run 1 —  87% — 4,200/4,824 docs — 2s — 5m 35s
Run 1 —  89% — 4,300/4,824 docs — 2s — 5m 38s
Run 1 —  91% — 4,400/4,824 docs — 2s — 5m 41s
Run 1 —  93% — 4,500/4,824 docs — 2s — 5m 43s
Run 1 —  95% — 4,600/4,824 docs — 2s — 5m 46s
Run 1 —  97% — 4,700/4,824 docs — 2s — 5m 49s
Run 1 — 100% — 4,800/4,824 docs — 2s — 5m 51s
Run 1 — 100% — 4,824/4,824 docs — 2s — 5m 53s
Run 1 took 5m 53s


**************************
* Update Script Finished *
**************************

2021-03-25

  • Niroshini from IWMI is still having problems adding metadata during the edit step of the workflow on CGSpace
    • I told her to try to register using a private email account and we’ll add her to the WLE group so she can try that way

2021-03-28

  • Make a backup of the openrxv-items-final index on AReS Explorer and start a new harvest

2021-03-29

  • The AReS harvesting that I started yesterday finished successfully and all indexes look OK:
    • openrxv-items is an alias of openrxv-items-final and has a correct number of items
  • Last week Bosede from IITA said she was trying to move an item from one collection to another and the system was “rolling” and never finished
    • I looked in Munin and I don’t see anything particularly wrong that day, so I told her to try again
  • Marianne Gadeberg asked about mapping an item last week
    • I searched the item mapper for the item’s handle, the title, the title in quotes, the UUID, the title with pluses instead of spaces, etc… but I could never find it in the results
    • I see someone has reported this issue on Jira in DSpace 5.x’s XMLUI item mapper: https://jira.lyrasis.org/browse/DS-2761
    • The Solr log shows that my query (with and without quotes, etc) has 143 results:
2021-03-29 08:55:40,073 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&wt=javabin&version=2} hits=143 status=0 QTime=0
  • But the item mapper only displays ten items, with no pagination
    • There is no way to search by handle or ID
    • I mapped the item manually using a CSV
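    • For reference, mapping via CSV works by adding the target collection’s handle to the item’s “collection” column in a metadata export and re-importing it; a sketch with pandas (handles and file names here are hypothetical):
import pandas as pd

# Sketch of mapping an item to an additional collection via the batch metadata
# editing CSV: append the target collection handle to the 'collection' column
# (multiple collections are separated by '||'; the first is the owning collection).
# Handles and file names are made up for illustration.
df = pd.read_csv('item-export.csv')
target_collection = '10568/00000'  # hypothetical target collection handle
df['collection'] = df['collection'] + '||' + target_collection
df.to_csv('item-mapped.csv', index=False)
# then: dspace metadata-import -f item-mapped.csv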

2021-03-30

  • I realized I never finished deleting all the old fields after our CG Core migration a few months ago
    • I found a few occurrences of old metadata so I had to move them where possible and delete them where not
  • I updated the CG Core v2 migration page
  • Marianne Gadeberg wrote to ask why the item she wanted to map a few days ago still doesn’t appear in the mapped collection
    • I looked at the item page itself and it lists the collection, but the item doesn’t appear in the collection’s item list
    • I tried to forcibly reindex the collection and the item, but it didn’t seem to work
    • Now I will try a complete Discovery re-index

2021-03-31

  • The Discovery re-index finished, but the CIP item still does not appear in the GENDER Platform grants collection
    • The item page itself DOES list the grants collection! WTF
    • I sent a message to the dspace-tech mailing list to see if someone can comment
    • I even tried unmapping and re-mapping, but it doesn’t change anything: the item still doesn’t appear in the collection, but I can see that it is mapped
  • I signed up for a SHERPA API key so I can try to write something to get journal names from ISSN
    • This code seems to get a journal title, though I only tried it with a few ISSNs:
import requests

query_params = {'item-type': 'publication', 'format': 'Json', 'limit': 10, 'offset': 0, 'api-key': 'blahhhahahah', 'filter': '[["issn","equals","0011-183X"]]'}
r = requests.get('https://v2.sherpa.ac.uk/cgi/retrieve', params=query_params)
if r.status_code == 200 and len(r.json()['items']) > 0:
    print(r.json()['items'][0]['title'][0]['title'])
  • I exported a list of all our ISSNs from CGSpace:
localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
COPY 3081
  • I wrote a script to check the ISSNs against Crossref’s API: crossref-issn-lookup.py
    • I suspect Crossref might have better data actually…
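    • For reference, a minimal sketch of the kind of request the script makes against Crossref (not necessarily the script’s exact code):
import requests

# Sketch: look up a journal title by ISSN using Crossref's journals endpoint
issn = '0011-183X'
r = requests.get(f'https://api.crossref.org/journals/{issn}')
if r.status_code == 200:
    print(r.json()['message']['title'])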