- Then I ran the script to check for missing ORCID identifiers
- Then *finally*, I started a harvest on AReS
## 2023-01-23
- Salem found that you can actually harvest everything in DSpace 7 using the [`discover/browses` endpoint](https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=1&size=100)
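- For example, fetching one page of the title browse with curl (the jq path for the item handles is my assumption about the HAL response layout):

```console
$ curl -s 'https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=1&size=100' \
| jq '._embedded.items[].handle'
```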
- Exported CGSpace again to examine and clean up a bunch of stuff like ISBNs in the ISSN field, DOIs in the URL field, dataset URLs in the DOI field, normalized a bunch of publisher places, fixed some countries and regions, fixed some licenses, etc
- I noticed that we still have "North America" as a region, but according to UN M.49 that is the continent, which contains the region "Northern America", so I will update our controlled vocabularies and all existing entries
- I imported changes to 1,800 items
- When it finished five hours later I started a harvest on AReS
## 2023-01-24
- Proof and upload seven items for the Rethinking Food Markets Initiative for IFPRI
- Export CGSpace to do some minor cleanups, Initiative collection mappings, and region fixes
- I also added "CGIAR Trust Fund" to all items with an Initiative in `cg.contributor.initiative`
## 2023-01-25
- Oh shit, the import last night ran for twelve hours and then died:
```console
Error committing changes to database: could not execute statement
Aborting most recent changes.
```
- I re-submitted a smaller version without the CGIAR Trust Fund changes for now just so we get the regions and other fixes
- Do some work on SAPLING issues for CGSpace, sending a large list of issues we found to the MEL team for items they submitted
- Abenet noticed that the number of items in the Initiatives community appears to have dropped by about 2,000 in the XMLUI
- We looked on AReS and all the items are still there
- I looked in the DSpace log and see around 2,000 messages like this:
```console
2023-01-25 07:14:59,529 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: c9fac1f2-6b2b-4941-8077-40b7b5c936b6 message:missing required field: epersonID
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
        at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
        at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
        at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
        at org.dspace.core.Context.dispatchEvents(Context.java:455)
        at org.dspace.core.Context.commit(Context.java:424)
        at org.dspace.core.Context.complete(Context.java:380)
        at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- I filed a ticket with Atmire to ask them about the error
- For now I just did a light Discovery reindex (not the full one) and all the items appeared again
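- The light reindex is just `dspace index-discovery` without the `-b` flag that forces a full rebuild from scratch:

```console
$ dspace index-discovery
```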
- Submit an issue to MEL GitHub regarding the capitalization of CRPs: https://github.com/CodeObia/MEL/issues/11133
- I talked to Salem and he said that this is a legacy thing from when CGSpace was using ALL CAPS for most of its metadata. I provided him with [our current controlled vocabulary for CRPs](https://ilri.github.io/cgspace-submission-guidelines/cg-contributor-crp/cg-contributor-crp.txt) and he will update it in MEL.
- On that note, Peter and Abenet and I realized that we still have an old field `cg.subject.crp` with about 450 values in it, but it has not been used for a few years (the values are the old ALL CAPS CRPs)
- I exported this list of values to lowercase them and move them to `cg.contributor.crp`
- Even if some items end up with multiple CRPs, they will get de-duplicated when I remove duplicate values soon
```console
$ ./ilri/fix-metadata-values.py -i /tmp/2023-01-25-fix-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t correct
$ ./ilri/move-metadata-values.py -i /tmp/2023-01-25-move-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t cg.contributor.crp
```
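- The input CSV for `fix-metadata-values.py` is just the old value and its correction; presumably something like this, with the correction column named by `-t` (the row here is a hypothetical example):

```console
$ head -n2 /tmp/2023-01-25-fix-crp-subjects.csv
cg.subject.crp,correct
"WATER, LAND AND ECOSYSTEMS","Water, Land and Ecosystems"
```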
- After fixing and moving them all, I deleted the `cg.subject.crp` field from the metadata registry
- I realized a smarter way to update the text lang attributes of metadata would be to restrict the query to items that are in the archive and not withdrawn:
```sql
UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn) AND (text_lang IS NULL OR text_lang IN ('en', ''));
```
- I tried that in a transaction and it hung, so I canceled it and rolled back
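- Testing in a transaction just means wrapping the statement in `BEGIN` and `ROLLBACK` in psql so nothing is committed:

```console
dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn) AND (text_lang IS NULL OR text_lang IN ('en', ''));
dspace=# ROLLBACK;
```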
- I see some PostgreSQL locks attributed to `dspaceApi` that were started at `2023-01-25 13:40:04.529087+01` and haven't changed since then (that's eight hours ago)
- I killed the pid...
- There were also some locks owned by `dspaceWeb` that were nine and four hours old, so I killed those too...
- Now Maria was able to archive one submission of hers that was hanging all afternoon, but I still can't run the update on the text langs...
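- For the record, this is the kind of thing I run to find and then kill those stale backends (the pid here is hypothetical):

```console
$ psql -c "SELECT DISTINCT psa.pid, psa.usename, age(now(), psa.query_start) AS age FROM pg_stat_activity psa JOIN pg_locks pl ON psa.pid = pl.pid WHERE psa.usename IN ('dspaceApi', 'dspaceWeb') ORDER BY age DESC;" dspace
# then kill the offending backend by its pid
$ psql -c 'SELECT pg_terminate_backend(1234);' dspace
```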
- Export entire CGSpace to do Initiative mappings again
- Started a harvest on AReS
## 2023-01-26
- Export entire CGSpace to do some metadata cleanup on various fields
- I also added "CGIAR Trust Fund" to all items in the Initiatives community
## 2023-01-27
- Export a list of affiliations in the Initiatives community for Peter, trying a new method to avoid exporting *everything* from PostgreSQL:
```console
$ dspace metadata-export -i 10568/115087 -f /tmp/2023-01-27-initiatives.csv
$ csvcut -c 'cg.contributor.affiliation[en_US]' /tmp/2023-01-27-initiatives.csv \
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
| sort | uniq -c | sort -h \
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
> /tmp/2023-01-27-initiatives-affiliations.csv
```
- The first sed command strips the header and quotes, splits multiple values on "||", and deletes empty lines
- The awk command sets the field separator to a regular expression matching the leading whitespace and count, so we can extract the affiliation name as the second "field" of the sort output, ie:
```console
...
309 International Center for Agricultural Research in the Dry Areas
412 International Livestock Research Institute
```
- The second sed command adds the CSV header and quotes back
- I did the same for authors and donors and sent them to Peter to make corrections
## 2023-01-28
- Daniel from the Alliance said they are getting an HTTP 401 when trying to submit items to CGSpace via the REST API
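- The first thing to check will be whether a plain login against the legacy REST API works at all (the credentials here are hypothetical):

```console
$ curl -v -X POST -d 'email=user@example.org&password=fuuu' 'https://cgspace.cgiar.org/rest/login'
```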
## 2023-01-29
- Export the entire CGSpace to do Initiatives collection mappings
- I was thinking about a way to use Crossref's API to enrich our data, for example checking registered DOIs for license information, publishers, etc
- Turns out I had already written `crossref-doi-lookup.py` last year, and it works
- I exported a list of all DOIs without licenses from CGSpace, minus the CIFOR ones since I know they aren't registered on Crossref, about 11,800 DOIs in total:
```console
$ csvcut -c 'cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
| csvgrep -c 'cg.identifier.doi[en_US]' -r '.*cifor.*' -i \
| sed 1d > /tmp/2023-01-29-dois.txt
$ wc -l /tmp/2023-01-29-dois.txt
11819 /tmp/2023-01-29-dois.txt
$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-01-29-dois.txt -o /tmp/crossref-results.csv
$ csvcut -c 'id,cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
| sed -e 's_https://doi.org/__g' -e 's_https://dx.doi.org/__g' -e 's/cg.identifier.doi\[en_US\]/doi/' \
> /tmp/cgspace-temp.csv
$ csvjoin -c doi /tmp/cgspace-temp.csv /tmp/crossref-results.csv \
| csvgrep -c license -r 'creative' \
| csvcut -c id,license \
| sed '1s/license/dcterms.license[en_US]/' > /tmp/2023-01-29-new-licenses.csv
```
- The above was done with just 5,000 of the DOIs because it was taking a long time; after the last step I imported the results into OpenRefine to clean up the license URLs
- Then I imported 635 new licenses to CGSpace woooo
- After checking the remaining 6,500 DOIs there were another 852 new licenses, woooo
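- For the record, `crossref-doi-lookup.py` is presumably just making one Crossref REST API request per DOI, so you can spot check a single one with curl and jq (this DOI is a hypothetical example):

```console
$ curl -s 'https://api.crossref.org/works/10.1016/j.gfs.2022.100651?mailto=a.orth@cgiar.org' \
| jq '.message.license[]?.URL'
```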
<!-- vim: set sw=2 ts=2: -->