- Then I ran the script to check for missing ORCID identifiers
- Then *finally*, I started a harvest on AReS
## 2023-01-23
- Salem found that you can actually harvest everything in DSpace 7 using the [`discover/browses` endpoint](https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=1&size=100)
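- For example, fetching one page of the title browse with curl (the jq path for the item handles is my assumption about the HAL response layout):

```console
$ curl -s 'https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=1&size=100' \
| jq '._embedded.items[].handle'
```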
- Exported CGSpace again to examine and clean up a bunch of stuff like ISBNs in the ISSN field, DOIs in the URL field, dataset URLs in the DOI field, normalized a bunch of publisher places, fixed some countries and regions, fixed some licenses, etc
- I noticed that we still have "North America" as a region, but according to UN M.49 that is the continent, which contains the region "Northern America", so I will update our controlled vocabularies and all existing entries
- I imported changes to 1,800 items
- When it finished five hours later I started a harvest on AReS
## 2023-01-24
- Proof and upload seven items for the Rethinking Food Markets Initiative for IFPRI
- Export CGSpace to do some minor cleanups, Initiative collection mappings, and region fixes
- I also added "CGIAR Trust Fund" to all items with an Initiative in `cg.contributor.initiative`
## 2023-01-25
- Oh shit, the import last night ran for twelve hours and then died:
```console
Error committing changes to database: could not execute statement
Aborting most recent changes.
```
- I re-submitted a smaller version without the CGIAR Trust Fund changes for now just so we get the regions and other fixes
- Do some work on SAPLING issues for CGSpace, sending a large list of issues we found to the MEL team for items they submitted
- Abenet noticed that the number of items in the Initiatives community appears to have dropped by about 2,000 in the XMLUI
- We looked on AReS and all the items are still there
- I looked in the DSpace log and see around 2,000 messages like this:
```console
2023-01-25 07:14:59,529 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: c9fac1f2-6b2b-4941-8077-40b7b5c936b6 message:missing required field: epersonID
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
        at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
        at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
        at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
        at org.dspace.core.Context.dispatchEvents(Context.java:455)
        at org.dspace.core.Context.commit(Context.java:424)
        at org.dspace.core.Context.complete(Context.java:380)
        at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- I filed a ticket with Atmire to ask them about the error
- For now I just did a light Discovery reindex (not the full one) and all the items appeared again
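- The light reindex is just `dspace index-discovery` without the `-b` flag that forces a full rebuild from scratch:

```console
$ dspace index-discovery
```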
- Submit an issue to MEL GitHub regarding the capitalization of CRPs: https://github.com/CodeObia/MEL/issues/11133
- I talked to Salem and he said that this is a legacy thing from when CGSpace was using ALL CAPS for most of its metadata. I provided him with [our current controlled vocabulary for CRPs](https://ilri.github.io/cgspace-submission-guidelines/cg-contributor-crp/cg-contributor-crp.txt) and he will update it in MEL.
- On that note, Peter and Abenet and I realized that we still have an old field `cg.subject.crp` with about 450 values in it, but it has not been used for a few years (the values are the old ALL CAPS CRPs)
- I exported this list of values to lowercase them and move them to `cg.contributor.crp`
- Even if some items end up with multiple CRPs, they will get de-duplicated when I remove duplicate values soon
```console
$ ./ilri/fix-metadata-values.py -i /tmp/2023-01-25-fix-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t correct
$ ./ilri/move-metadata-values.py -i /tmp/2023-01-25-move-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t cg.contributor.crp
```
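- The input CSV for `fix-metadata-values.py` is just the old value and its correction; presumably something like this, with the correction column named by `-t` (the row here is a hypothetical example):

```console
$ head -n2 /tmp/2023-01-25-fix-crp-subjects.csv
cg.subject.crp,correct
"WATER, LAND AND ECOSYSTEMS","Water, Land and Ecosystems"
```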
- After fixing and moving them all, I deleted the `cg.subject.crp` field from the metadata registry
- I realized a smarter way to update the text lang attributes of metadata would be to restrict the query to items that are in the archive and not withdrawn:
```sql
UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn) AND (text_lang IS NULL OR text_lang IN ('en', ''));
```
- I tried that in a transaction and it hung, so I canceled it and rolled back
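- Testing in a transaction just means wrapping the statement in `BEGIN` and `ROLLBACK` in psql so nothing is committed:

```console
dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn) AND (text_lang IS NULL OR text_lang IN ('en', ''));
dspace=# ROLLBACK;
```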
- I see some PostgreSQL locks attributed to `dspaceApi` that were started at `2023-01-25 13:40:04.529087+01` and haven't changed since then (that's eight hours ago)
- I killed the pid...
- There were also some locks owned by `dspaceWeb` that were nine and four hours old, so I killed those too...
- Now Maria was able to archive one submission of hers that was hanging all afternoon, but I still can't run the update on the text langs...
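- For the record, this is the kind of thing I run to find and then kill those stale backends (the pid here is hypothetical):

```console
$ psql -c "SELECT DISTINCT psa.pid, psa.usename, age(now(), psa.query_start) AS age FROM pg_stat_activity psa JOIN pg_locks pl ON psa.pid = pl.pid WHERE psa.usename IN ('dspaceApi', 'dspaceWeb') ORDER BY age DESC;" dspace
# then kill the offending backend by its pid
$ psql -c 'SELECT pg_terminate_backend(1234);' dspace
```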
- Export entire CGSpace to do Initiative mappings again
- Started a harvest on AReS
## 2023-01-26
- Export entire CGSpace to do some metadata cleanup on various fields
- I also added "CGIAR Trust Fund" to all items in the Initiatives community
## 2023-01-27
- Export a list of affiliations in the Initiatives community for Peter, trying a new method to avoid exporting *everything* from PostgreSQL:
```console
$ dspace metadata-export -i 10568/115087 -f /tmp/2023-01-27-initiatives.csv
$ csvcut -c 'cg.contributor.affiliation[en_US]' /tmp/2023-01-27-initiatives.csv \
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
| sort | uniq -c | sort -h \
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
> /tmp/2023-01-27-initiatives-affiliations.csv
```
- The first sed command strips the header and quotes, splits multiple values on "||", and deletes empty lines
- The awk command sets the field separator to a regular expression matching the leading whitespace and count, so we can extract the affiliation name as the second "field" of the sort output, ie:
```console
...
309 International Center for Agricultural Research in the Dry Areas
412 International Livestock Research Institute
```
- The second sed command adds the CSV header and quotes back
- I did the same for authors and donors and sent them to Peter to make corrections
## 2023-01-28
- Daniel from the Alliance said they are getting an HTTP 401 when trying to submit items to CGSpace via the REST API
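- The first thing to check will be whether a plain login against the legacy REST API works at all (the credentials here are hypothetical):

```console
$ curl -v -X POST -d 'email=user@example.org&password=fuuu' 'https://cgspace.cgiar.org/rest/login'
```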
## 2023-01-29
- Export the entire CGSpace to do Initiatives collection mappings
- I was thinking about a way to use Crossref's API to enrich our data, for example checking registered DOIs for license information, publishers, etc
- Turns out I had already written `crossref-doi-lookup.py` last year, and it works
- I exported a list of all DOIs without licenses from CGSpace, minus the CIFOR ones since I know they aren't registered on Crossref, about 11,800 DOIs in total:
```console
$ csvcut -c 'cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
| csvgrep -c 'cg.identifier.doi[en_US]' -r '.*cifor.*' -i \
| sed 1d > /tmp/2023-01-29-dois.txt
$ wc -l /tmp/2023-01-29-dois.txt
11819 /tmp/2023-01-29-dois.txt
$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-01-29-dois.txt -o /tmp/crossref-results.csv
$ csvcut -c 'id,cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
| sed -e 's_https://doi.org/__g' -e 's_https://dx.doi.org/__g' -e 's/cg.identifier.doi\[en_US\]/doi/' \
> /tmp/cgspace-temp.csv
$ csvjoin -c doi /tmp/cgspace-temp.csv /tmp/crossref-results.csv \
| csvgrep -c license -r 'creative' \
| csvcut -c id,license \
| sed '1s/license/dcterms.license[en_US]/' > /tmp/2023-01-29-new-licenses.csv
```
- The above was done with just 5,000 of the DOIs because it was taking a long time; after the last step I imported the results into OpenRefine to clean up the license URLs
- Then I imported 635 new licenses to CGSpace woooo
- After checking the remaining 6,500 DOIs there were another 852 new licenses, woooo
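- For the record, `crossref-doi-lookup.py` is presumably just making one Crossref REST API request per DOI, so you can spot check a single one with curl and jq (this DOI is a hypothetical example):

```console
$ curl -s 'https://api.crossref.org/works/10.1016/j.gfs.2022.100651?mailto=a.orth@cgiar.org' \
| jq '.message.license[]?.URL'
```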
<!-- vim: set sw=2 ts=2: -->