CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

February, 2021

2021-02-01

  • Abenet said that CIP found more duplicate records in their export from AReS
  • I had a call with CodeObia to discuss the work on OpenRXV
  • Check the results of the AReS harvesting from last night:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
  "count" : 100875,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
  • Set the current items index to read only and make a backup:
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'
{"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-01
  • Delete the current items index and clone the temp one to it:
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
  • Then delete the temp and backup:
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{"acknowledged":true}
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-01'
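The rotation above always follows the same order, so it can be written down as data. This is a minimal sketch, not part of OpenRXV: the function names, the default index names, and the date suffix are assumptions taken from the commands in these notes. It only builds and prints the requests and does not contact Elasticsearch.

```python
import json


def rotation_plan(live="openrxv-items", temp="openrxv-items-temp", backup_suffix="2021-02-01"):
    """Return the ordered (method, path, body) requests for one index rotation."""
    backup = f"{live}-{backup_suffix}"
    # An index must block writes before it can be cloned
    read_only = json.dumps({"settings": {"index.blocks.write": True}})
    return [
        ("PUT", f"/{live}/_settings", read_only),    # make the live index read only
        ("POST", f"/{live}/_clone/{backup}", None),  # back it up
        ("DELETE", f"/{live}", None),                # drop the live index
        ("PUT", f"/{temp}/_settings", read_only),    # make the temp index read only
        ("POST", f"/{temp}/_clone/{live}", None),    # promote temp to live
        ("DELETE", f"/{temp}", None),                # clean up temp
        ("DELETE", f"/{backup}", None),              # clean up the backup
    ]


def as_curl(step, base="http://localhost:9200"):
    """Render one step as a curl command line for review before running it."""
    method, path, body = step
    command = f"curl -s -X {method} '{base}{path}'"
    if body is not None:
        command += f" -H 'Content-Type: application/json' -d '{body}'"
    return command


for step in rotation_plan():
    print(as_curl(step))
```

Printing the commands instead of running them keeps the destructive DELETE steps under manual control, which matters because the backup is removed at the end.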
  • Meeting with Peter and Abenet about CGSpace goals and progress
  • Test submission to DSpace via REST API to see if Abenet can fix / reject it (submit workflow?)
  • Get Peter a list of all users who have ever submitted or approved items on DSpace, so he can remove some
  • Ask MEL for a dump of their types to reconcile with ours and CG Core
  • Need to tag the ILRI collection with licenses! For pre-2010 items use “Other” unless a license is already there; for 2010-2020 do the ILRI content in batches (2010-2015: CC-BY-NC-SA; 2016 onwards: CC-BY)
    • ONLY if ILRI / International Livestock Research Institute is the publisher, no journal articles, no book chapters…
  • I tried to export the ILRI community from CGSpace but I got an error:
$ dspace metadata-export -i 10568/1 -f /tmp/2021-02-01-ILRI.csv
Loading @mire database changes for module MQM
Changes have been processed
Exporting community 'International Livestock Research Institute (ILRI)' (10568/1)
           Exception: null
java.lang.NullPointerException
        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:212)
        at com.google.common.collect.Iterators.concat(Iterators.java:464)
        at org.dspace.app.bulkedit.MetadataExport.addItemsToResult(MetadataExport.java:136)
        at org.dspace.app.bulkedit.MetadataExport.buildFromCommunity(MetadataExport.java:125)
        at org.dspace.app.bulkedit.MetadataExport.<init>(MetadataExport.java:77)
        at org.dspace.app.bulkedit.MetadataExport.main(MetadataExport.java:282)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
  • I imported the production database to my local development environment and I get the same error… WTF is this?
    • I was able to export another smaller community
    • I filed an issue with Atmire to see if it is likely something of theirs, or if I need to ask on the dspace-tech mailing list
  • CodeObia sent a pull request with fixes for several issues we highlighted in OpenRXV
    • I deployed the fixes on production, as they only affect minor parts of the frontend, and two of the four are working
    • I sent feedback to CodeObia

2021-02-02

  • Communicate more with CodeObia about some fixes for OpenRXV
  • Maria Garruccio sent me some new ORCID iDs for Bioversity authors, as well as a correction for Stefan Burkart’s iD
  • I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using resolve-orcids.py:
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-02-02-combined-orcids.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-02-02-combined-orcids.txt -o /tmp/2021-02-02-combined-orcid-names.txt
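For reference, the extraction step (grep -oE, sort, uniq) could be sketched in Python as follows. The function name and the sample strings are hypothetical, and the XML attribute layout shown is only an assumption about what the vocabulary file looks like. The regular expression is the one from the grep command.

```python
import re

# Same pattern as the grep -oE expression: four groups of four characters
ORCID_RE = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(*texts):
    """Return the sorted, de-duplicated ORCID iDs found in the given strings."""
    found = set()
    for text in texts:
        found.update(ORCID_RE.findall(text))
    return sorted(found)


# Hypothetical inputs standing in for the vocabulary XML and the new text file
sample_xml = '<node id="Stefan Burkart: 0000-0001-5297-2184" label="Stefan Burkart: 0000-0001-5297-2184"/>'
sample_txt = "https://orcid.org/0000-0001-7874-178X\n0000-0001-5297-2184\n"

print(extract_orcids(sample_xml, sample_txt))
```

The set removes the iDs that appear in both sources, which is what sort and uniq do in the shell pipeline.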
  • I sorted the names and added the XML formatting in vim, then ran it through tidy:
$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
  • Then I added all the changed names plus Stefan’s incorrect ones to a CSV and processed them with fix-metadata-values.py:
$ cat 2021-02-02-fix-orcid-ids.csv 
cg.creator.id,correct
Burkart Stefan: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
Burkart Stefan: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
Stefan  Burkart: 0000-0001-5297-2184,Stefan Burkart: 0000-0001-5297-2184
Stefan Burkart: 0000-0002-7558-9177,Stefan Burkart: 0000-0001-5297-2184
Adina Chain Guadarrama: 0000-0002-6944-2064,Adina Chain-Guadarrama: 0000-0002-6944-2064
Bedru: 0000-0002-7344-5743,Bedru B. Balana: 0000-0002-7344-5743
Leigh Winowiecki: 0000-0001-5572-1284,Leigh Ann Winowiecki: 0000-0001-5572-1284
Sander J. Zwart: 0000-0002-5091-1801,Sander Zwart: 0000-0002-5091-1801
saul lozano-fuentes: 0000-0003-1517-6853,Saul Lozano: 0000-0003-1517-6853
$ ./ilri/fix-metadata-values.py -i 2021-02-02-fix-orcid-ids.csv -db dspace63 -u dspace -p 'fuuu' -f cg.creator.id -t 'correct' -m 240
  • I also looked up which of these new authors might have existing items that are missing ORCID iDs
  • I had to port my add-orcid-identifiers-csv.py to DSpace 6 UUIDs and I think it’s working but I want to do a few more tests because it uses a sequence for the metadata_value_id

2021-02-03

  • Tag forty-three items from Bioversity’s new authors with ORCID iDs using add-orcid-identifiers-csv.py:
$ cat /tmp/2021-02-02-add-orcid-ids.csv
dc.contributor.author,cg.creator.id
"Nchanji, E.",Eileen Bogweh Nchanji: 0000-0002-6859-0962
"Nchanji, Eileen",Eileen Bogweh Nchanji: 0000-0002-6859-0962
"Nchanji, Eileen Bogweh",Eileen Bogweh Nchanji: 0000-0002-6859-0962
"Machida, Lewis",Lewis Machida: 0000-0002-0012-3997
"Mockshell, Jonathan",Jonathan Mockshell: 0000-0003-1990-6657
"Aubert, C.",Celine Aubert: 0000-0001-6284-4821
"Aubert, Céline",Celine Aubert: 0000-0001-6284-4821
"Devare, M.",Medha Devare: 0000-0003-0041-4812
"Devare, Medha",Medha Devare: 0000-0003-0041-4812
"Benites-Alfaro, O.E.",Omar E. Benites-Alfaro: 0000-0002-6852-9598
"Benites-Alfaro, Omar Eduardo",Omar E. Benites-Alfaro: 0000-0002-6852-9598
"Johnson, Vincent",VINCENT JOHNSON: 0000-0001-7874-178X
"Lesueur, Didier",didier lesueur: 0000-0002-6694-0869
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-02-02-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -d

2021-02-04

  • Re-sync CGSpace database and Solr to DSpace Test to start a public test of CG Core v2
    • Afterwards I updated Discovery and OAI:
$ time chrt -b 0 dspace index-discovery -b
$ dspace oai import -c
  • Attend Accenture meeting for repository managers
    • Not clear what the SMO wants to get out of us
  • Enrico asked for some notes about our work on AReS in 2020 for CRP Livestock reporting
    • Abenet and I came up with the following:

In 2020 we funded the third phase of development on the OpenRXV platform that powers AReS. This phase focused mainly on improving the search filtering, graphical visualizations, and reporting capabilities. It is now possible to create custom reports in Excel, Word, and PDF formats using a templating system. We also concentrated on making the vanilla OpenRXV platform easier to deploy and administer in hopes that other organizations would begin using it. Lastly, we identified and fixed a handful of bugs in the system. All development takes place publicly on GitHub: https://github.com/ilri/OpenRXV.

In the last quarter of 2020, ILRI conducted a briefing for nearly 100 scientists and communications staff on how to use AReS as a visualization tool for repository outputs and as a reporting tool (https://hdl.handle.net/10568/110527). Staff will begin using AReS to generate lists of their outputs to upload to the performance evaluation system in support of their evaluations. The list of publications they upload from AReS to Performax will indicate the open access status of each publication, to help start a discussion about why some outputs are not open access given the open access policies of the CGIAR.

  • Call Moayad to discuss OpenRXV development
    • We talked about the “reporting period” (date-based statistics) and some of the GitHub issues Abdullah is working on
    • I suggested that we offer the date-range statistics in a modal dialog with other sorting and grouping options during report generation
  • Peter sent me the cleaned up series that I had originally sent him in 2020-10
    • I quickly applied all the deletions on CGSpace:
$ ./ilri/delete-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -m 43
  • The corrected versions have a lot of encoding issues so I asked Peter to give me the correct ones so I can search/replace them:
    • CIAT Publicaçao
    • CIAT Publicación
    • CIAT Série
    • CIAT Séries
    • Colección investigación y desarrollo
    • CTA Guias práticos
    • CTA Guias técnicas
    • Curso de adiestramiento en producción y utilización de pastos tropicales
    • Folheto Técnico
    • ILRI Nota Informativa de Investigação
    • Influencia de los actores sociales en América Central
    • Institutionalization of quality assurance mechanism and dissemination of top quality commercial products to increase crop yields and improve food security of smallholder farmers in sub-Saharan Africa – COMPRO-II
    • Manuel pour les Banques de Gènes;1
    • Sistematización de experiencias Proyecto ACORDAR
    • Strüngmann Forum
    • Unité de Recherche
  • I ended up using python-ftfy to fix those very easily, then replaced them in the CSV
  • Then I trimmed whitespace at the beginning, end, and around the “;”, and applied the 1,600 fixes using fix-metadata-values.py:
$ ./ilri/fix-metadata-values.py -i /tmp/2020-10-28-Series-PB.csv -db dspace -u dspace -p 'fuuu' -f dc.relation.ispartofseries -t 'correct' -m 43
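The two cleanup steps above can be illustrated with the standard library alone. This is a rough sketch under stated assumptions: the function names are hypothetical, and the mojibake repair handles only the common case of UTF-8 text that was decoded as Latin-1 (ftfy, which I actually used, detects many more cases, such as Windows-1252). The sample value is invented to show the idea and is not copied from Peter's file.

```python
import re


def fix_mojibake(value):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1; leave other text alone."""
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as Latin-1, or not valid UTF-8 afterwards: already fine
        return value


def normalize_series(value):
    """Trim whitespace at the beginning, the end, and around the ';' separator."""
    return re.sub(r"\s*;\s*", ";", value.strip())


# Hypothetical damaged value: "ó" stored as the two characters "Ã³"
damaged = "  CIAT Publicaci\u00c3\u00b3n ; 12 "
print(normalize_series(fix_mojibake(damaged)))
```

Values that are already correct, such as “Strüngmann Forum”, fail the UTF-8 decode and are returned unchanged, so the repair is safe to run over a whole column.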
  • Help Peter debug an issue with one of Alan Duncan’s new FEAST Data reports on CGSpace
    • For some reason the default policy for the item was the “COLLECTION_492_DEFAULT_READ” group, which had zero members
    • I changed them all to Anonymous and the item was accessible

2021-02-07

  • Run system updates on CGSpace (linode18), deploy latest 6_x-prod branch, and reboot the server
  • After the server came back up I started a full Discovery re-indexing:
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    247m30.850s
user    160m36.657s
sys     2m26.050s
  • Regarding the CG Core v2 migration, Fabio wrote to tell me that he is not using CGSpace directly, instead harvesting via GARDIAN
    • He gave me the contact of Sotiris Konstantinidis, who is the CTO at SCIO Systems and works on the GARDIAN platform
  • Delete the old Elasticsearch temp index to prepare for starting an AReS re-harvest:
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS

2021-02-08

  • Finish rotating the AReS indexes after the harvesting last night:
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
  "count" : 100983,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-08
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-08'

2021-02-10

  • Talk to Abdullah from CodeObia about a few of the issues we filed on OpenRXV
  • Atmire responded to a few issues today:
    • First, the one about a crash while exporting a community CSV, which appears to be a vanilla DSpace issue with a patch in DSpace 6.4
    • Second, the MQM batch consumer issue, which appears to be harmless log spam in most cases and they have sent a patch that adjusts the logging as such
    • Third, a version bump for CUA to fix the java.lang.UnsupportedOperationException: Multiple update components target the same field:solr_update_time_stamp error
  • I cherry-picked the patches for DS-4111 and was able to export the ILRI community finally, but the results are almost twice as many items as in the community!
    • Investigating with csvcut I see there are some ids that appear up to five, six, or seven times!
$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | wc -l
30354
$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort -u | wc -l
18555
$ csvcut -c id /tmp/2021-02-10-ILRI.csv | sed '1d' | sort | uniq -c | sort -h | tail
      5 c21a79e5-e24e-4861-aa07-e06703d1deb7
      5 c2460aa1-ae28-4003-9a99-2d7c5cd7fd38
      5 d73fb3ae-9fac-4f7e-990f-e394f344246c
      5 dc0e24fa-b7f5-437e-ac09-e15c0704be00
      5 dc50bcca-0abf-473f-8770-69d5ab95cc33
      5 e714bdf9-cc0f-4d9a-a808-d572e25c9238
      6 7dfd1c61-9e8c-4677-8d41-e1c4b11d867d
      6 fb76888c-03ae-4d53-b27d-87d7ca91371a
      6 ff42d1e6-c489-492c-a40a-803cabd901ed
      7 094e9e1d-09ff-40ca-a6b9-eca580936147
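The same duplicate check can be done with Python's csv module, which is convenient when the next step is to drop the repeats. This is a small sketch with hypothetical function names and invented titles. It assumes that repeated rows for an id are identical, so keeping the first one loses nothing. I have not verified that for the real export.

```python
import csv
import io
from collections import Counter


def id_counts(rows):
    """Count how many rows share each value of the 'id' column."""
    return Counter(row["id"] for row in rows)


def unique_rows(rows):
    """Yield only the first row seen for each id."""
    seen = set()
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            yield row


# Hypothetical three-row export in which one item appears twice
sample = (
    "id,dc.title[en_US]\n"
    "094e9e1d-09ff-40ca-a6b9-eca580936147,Hypothetical title A\n"
    "c21a79e5-e24e-4861-aa07-e06703d1deb7,Hypothetical title B\n"
    "094e9e1d-09ff-40ca-a6b9-eca580936147,Hypothetical title A\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))
counts = id_counts(rows)
print(len(rows), "rows,", len(counts), "unique ids")
print(len(list(unique_rows(rows))), "rows after removing duplicates")
```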
  • I added a comment to that bug to ask if this is a side effect of the patch
  • I started working on tagging pre-2010 ILRI items with license information, like we talked about with Peter and Abenet last week
    • Due to the export bug I had to sort and remove duplicates first, then use csvgrep to filter out books and journal articles:
$ csvcut -c 'id,dc.date.issued,dc.date.issued[],dc.date.issued[en_US],dc.rights,dc.rights[],dc.rights[en],dc.rights[en_US],dc.publisher,dc.publisher[],dc.publisher[en_US],dc.type[en_US]' /tmp/2021-02-10-ILRI.csv | csvgrep -c 'dc.type[en_US]' -i -r '^(Journal Item|Journal Article|Book|Book Chapter)$'
  • I imported the CSV into OpenRefine and converted the date text values to date types so I could facet by dates before 2010:
if(diff(value,"01/01/2010".toDate(),"days")<0, true, false)
  • Then I filtered by publisher to make sure they were only ours:
or(
  value.contains("International Livestock Research Institute"),
  value.contains("ILRI"),
  value.contains("International Livestock Centre for Africa"),
  value.contains("ILCA"),
  value.contains("ILRAD"),
  value.contains("International Laboratory for Research on Animal Diseases")
)
  • I tagged these pre-2010 items with “Other” if they didn’t already have a license
  • I checked 2010 to 2015, and 2016 to date, but they were all tagged already!
  • In the end I added the “Other” license to 1,523 items from before 2010
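Taken together, the selection rules used above (issued before 2010, not a book or journal article, published by ILRI or a predecessor, no license yet) can be written as a single predicate. This is only a sketch of the logic, not what I ran in OpenRefine. The function name is hypothetical, and it assumes each item is a dict with a single merged column per field and a date that starts with a four-digit year.

```python
EXCLUDED_TYPES = {"Journal Item", "Journal Article", "Book", "Book Chapter"}

# Publisher names checked with substring matching, like value.contains() in GREL
OUR_PUBLISHERS = (
    "International Livestock Research Institute",
    "ILRI",
    "International Livestock Centre for Africa",
    "ILCA",
    "ILRAD",
    "International Laboratory for Research on Animal Diseases",
)


def needs_other_license(item):
    """Return True if the item should be tagged with the "Other" license."""
    year = item.get("dc.date.issued", "")[:4]
    if not year.isdigit() or int(year) >= 2010:
        return False  # undated or issued in 2010 or later
    if item.get("dc.type") in EXCLUDED_TYPES:
        return False  # no books, chapters, or journal articles
    if item.get("dc.rights"):
        return False  # already has a license
    publisher = item.get("dc.publisher", "")
    return any(name in publisher for name in OUR_PUBLISHERS)


# Hypothetical item for illustration
example = {
    "dc.date.issued": "1998-05",
    "dc.type": "Report",
    "dc.publisher": "International Livestock Research Institute",
}
print(needs_other_license(example))
```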

2021-02-11

  • CodeObia keeps working on a few more small issues on OpenRXV
    • Abdullah sent fixes for two issues but I couldn’t verify them myself so I asked him to check again
    • Call with Abdullah and Yousef to discuss some issues
    • We got the Angular expressions parser working…

2021-02-13

  • Run system updates, deploy latest 6_x-prod branch, and reboot CGSpace (linode18)
  • Normalize text_lang of DSpace item metadata on CGSpace:
dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
 text_lang |  count  
-----------+---------
 en_US     | 2567413
           |    8050
 en        |    7601
           |       0
(4 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item);
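The row with a blank text_lang and a count of 0 in the output above is presumably the NULL group: count(text_lang) ignores NULLs, while the other blank row is the empty string. The sketch below reproduces that behaviour using SQLite as a stand-in for PostgreSQL, with a simplified metadatavalue table (no dspace_object_id) and invented rows, so it only illustrates the SQL semantics and says nothing about the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for DSpace's metadatavalue table
conn.execute("CREATE TABLE metadatavalue (text_value TEXT, text_lang TEXT)")
conn.executemany(
    "INSERT INTO metadatavalue VALUES (?, ?)",
    [("a", "en_US"), ("b", "en_US"), ("c", "en"), ("d", ""), ("e", None)],
)


def lang_counts(connection):
    """Return {text_lang: (count(text_lang), count(*))} for every group."""
    query = "SELECT text_lang, count(text_lang), count(*) FROM metadatavalue GROUP BY text_lang"
    return {lang: (non_null, total) for lang, non_null, total in connection.execute(query)}


before = lang_counts(conn)
# Same normalization as above, minus the item subquery
conn.execute("UPDATE metadatavalue SET text_lang = 'en_US'")
after = lang_counts(conn)

print(before)
print(after)
```

Using count(*) in the original query would show how many values actually have a NULL language.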
  • Start a full Discovery re-indexing on CGSpace

2021-02-14

  • Clear the OpenRXV temp items index:
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
  • Then start a full harvesting of CGSpace in the AReS Explorer admin dashboard
  • Peter asked me about a few other recently submitted FEAST items that are restricted
    • I checked the collection and there was an empty group there for the “default read” authorization
    • I deleted the group and fixed the authorization policies for two new items manually
  • Upload fifteen items to CGSpace for Peter Ballantyne
  • Move 313 journals from series, which Peter had indicated when we were cleaning up the series last week
    • I re-purposed one of my Python metadata scripts to create move-metadata-values.py
    • The script reads a text file with one metadata value per line and moves them from one metadata field id to another:
$ ./ilri/move-metadata-values.py -i /tmp/move.txt -db dspace -u dspace -p 'fuuu' -f 43 -t 55
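The core of such a move is a single UPDATE per value. The sketch below is not the actual move-metadata-values.py. It is an illustration using SQLite in memory, a cut-down metadatavalue table, and invented values, with a hypothetical function name. The field ids 43 and 55 are taken from the command above.

```python
import sqlite3


def move_values(connection, values, from_field_id, to_field_id):
    """Move exact-match metadata values between field ids; return rows changed."""
    moved = 0
    for value in values:
        cursor = connection.execute(
            "UPDATE metadatavalue SET metadata_field_id = ? "
            "WHERE metadata_field_id = ? AND text_value = ?",
            (to_field_id, from_field_id, value),
        )
        moved += cursor.rowcount
    connection.commit()
    return moved


conn = sqlite3.connect(":memory:")
# Cut-down stand-in for DSpace's metadatavalue table
conn.execute("CREATE TABLE metadatavalue (metadata_field_id INTEGER, text_value TEXT)")
conn.executemany(
    "INSERT INTO metadatavalue VALUES (?, ?)",
    [(43, "Hypothetical Journal"), (43, "Hypothetical Series;1"), (55, "Other Journal")],
)

# One value per line, as in the input text file; blank lines are skipped
lines = "Hypothetical Journal\n\n".splitlines()
values = [line.strip() for line in lines if line.strip()]
print(move_values(conn, values, 43, 55))
```

Passing the values as query parameters avoids quoting problems with names that contain apostrophes or semicolons.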