CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

June, 2020

2020-06-01

  • I tried to run the AtomicStatisticsUpdateCLI CUA migration script on DSpace Test (linode26) again and it is still going very slowly and has tons of errors like I noticed yesterday
    • I sent Atmire the dspace.log from today and told them to log into the server to debug the process
  • In other news, I checked the statistics API on DSpace 6 and it’s working
  • I tried to build the OAI registry on the freshly migrated DSpace 6 on DSpace Test and I get an error:
$ dspace oai import -c
OAI 2.0 manager action started
Loading @mire database changes for module MQM
Changes have been processed
Clearing index
Index cleared
Using full import.
Full import
java.lang.NullPointerException
        at org.dspace.xoai.app.XOAI.willChangeStatus(XOAI.java:438)
        at org.dspace.xoai.app.XOAI.index(XOAI.java:368)
        at org.dspace.xoai.app.XOAI.index(XOAI.java:280)
        at org.dspace.xoai.app.XOAI.indexAll(XOAI.java:227)
        at org.dspace.xoai.app.XOAI.index(XOAI.java:134)
        at org.dspace.xoai.app.XOAI.main(XOAI.java:560)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)

2020-06-02

  • I noticed that I was able to do a partial OAI import (ie, without -c)
    • Then I tried to clear the OAI Solr core and import, but I get the same error:
$ curl http://localhost:8080/solr/oai/update -H "Content-type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
$ curl http://localhost:8080/solr/oai/update -H "Content-type: text/xml" --data-binary '<commit />'
$ ~/dspace63/bin/dspace oai import
OAI 2.0 manager action started
...
There are no indexed documents, using full import.
Full import
java.lang.NullPointerException
        at org.dspace.xoai.app.XOAI.willChangeStatus(XOAI.java:438)
        at org.dspace.xoai.app.XOAI.index(XOAI.java:368)
        at org.dspace.xoai.app.XOAI.index(XOAI.java:280)
        at org.dspace.xoai.app.XOAI.indexAll(XOAI.java:227)
        at org.dspace.xoai.app.XOAI.index(XOAI.java:143)
        at org.dspace.xoai.app.XOAI.main(XOAI.java:560)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
  • I found a bug report on DSpace Jira describing this issue affecting someone else running DSpace 6.3
    • They suspect it has to do with the item having some missing group names in its authorization policies
    • I added some debugging to dspace-oai/src/main/java/org/dspace/xoai/app/XOAI.java to print the Handle of the item that causes the crash and then I looked at its authorization policies
    • Indeed there are some blank group names:

Missing group names in DSpace 6.3 item authorization policy

  • The same item on CGSpace (DSpace 5.8) also has groups with no name:

Missing group names in DSpace 5.8 item authorization policy

  • I added some debugging and found exactly where this happens
    • As it turns out we can just check if the group policy is null there and it allows the OAI import to proceed
    • Aaaaand as it turns out, this was fixed in dspace-6_x in 2018 after DSpace 6.3 was released (see DS-4019), so that was a waste of three hours.
    • I cherry picked 150e83558103ed7f50e8f323b6407b9cbdf33717 into our current 6_x-dev-atmire-modules branch

2020-06-04

  • Maria was asking about some items they are trying to map from the CGIAR Big Data collection into their Alliance of Bioversity and CIAT journal articles collection, but for some reason the items don’t show up in the item mapper
    • The items don’t even show up in the XMLUI Discover advanced search, and actually I don’t even see any recent items on the recently submitted part of the collection (but the item pages exist of course)
    • Perhaps I need to try a full Discovery re-index:
$ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    125m37.423s
user    11m20.312s
sys     3m19.965s
  • Still I don’t see the item in XMLUI search or in the item mapper (and I made sure to clear the Cocoon cache)
    • I’m starting to think it’s something related to the database transaction issue…
    • I removed our custom JDBC driver from /usr/local/apache-tomcat... so that DSpace will use its own much older one, version 9.1-901-1.jdbc4
    • I ran all system updates on the server (linode18) and rebooted it
    • After it came back up I had to restart Tomcat five times before all Solr statistics cores came up properly
    • Unfortunately this means that the Tomcat JDBC pooling via JNDI doesn’t work, so we’re using only the 30 connections reserved for the DSpace CLI from DSpace’s own internal pool
    • Perhaps our previous issues with the database pool from a few years ago will be less now that we have much more aggressive blocking and rate limiting of bots in nginx
  • I will also import a fresh database snapshot from CGSpace and check if I can map the item in my local environment
    • After importing and forcing a full reindex locally I can see the item in search and in the item mapper
  • Abenet sent another message about two users who are having issues with submission, and I see the number of locks in PostgreSQL has sky rocketed again as of a few days ago:

PostgreSQL locks week

  • As far as I can tell this started happening for the first time in April, connections and locks:

PostgreSQL connections year PostgreSQL locks year

  • I think I need to just leave this as is with the DSpace default JDBC driver for now, but perhaps I could also downgrade the Tomcat version (I deployed Tomcat 7.0.103 in March, so perhaps that’s relevant)
  • Also, I’ll start another full reindexing to see if the issue with mapping is somehow also resolved now that the database connections are working better
    • Perhaps related, but this one finished much faster:
$ time chrt -i 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    101m41.195s
user    10m9.569s
sys     3m13.929s
  • Unfortunately the item is still not showing up in the item mapper…
  • Something happened to AReS Explorer (linode20) so I ran all system updates and rebooted it

2020-06-07

  • Peter said he was annoyed with a CSV export from CGSpace because of the different text_lang attributes and asked if we can fix it
  • The last time I normalized these was in 2019-06, and currently it looks like this:
dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE resource_type_id=2 GROUP BY text_lang ORDER BY count DESC;
  text_lang  |  count
-------------+---------
 en_US       | 2158377
 en          |  149540
             |   49206
 es_ES       |      18
 fr          |       4
 Peer Review |       1
             |       0
(7 rows)
  • In theory we can have different languages for metadata fields but in practice we don’t do that, so we might as well normalize everything to “en_US” (and perhaps I should make a curation task to do this)
  • For now I will do it manually on CGSpace and DSpace Test:
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE resource_type_id=2;
UPDATE 2414738
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item);
  • Peter asked if it was possible to find all ILRI items that have “zoonoses” or “zoonotic” in their titles and check if they have the ILRI subject “ZOONOTIC DISEASES” (and add it if not)
    • Unfortunately the only way we have currently would be to export the entire ILRI community as a CSV and filter/edit it in OpenRefine

2020-06-08

  • I manually mapped the two Big Data items that Maria had asked about last week by exporting their metadata to CSV and re-importing it
    • I still need to look into the underlying issue there, seems to be something in Solr
    • Something strang is that, when I search for part of the title in Discovery I get 2,000 results on CGSpace, while on my local DSpace 5.8 environment I get 2!

CGSpace Discovery search results

CGSpace Discovery search results

  • On DSpace Test, which is currently running DSpace 6, I get 2,000 results but the top one is the correct match and the item does show up in the item mapper
    • Interestingly, if I search directly in the Solr search core on CGSpace with a query like handle:10568/108315 I don’t see the item, but on my local Solr I see them!
  • Peter asked if it was easy for me to add ILRI subject “ZOONOTIC DISEASES” to any items in the ILRI community that had “zoonotic” or “zoonoses” in their title, but were missing the ILRI subject
    • I exported the ILRI community metadata, cut the three fields I needed, and then filtered and edited the CSV in OpenRefine:
$ dspace metadata-export -i 10568/1 -f /tmp/2020-06-08-ILRI.csv
$ csvcut -c 'id,cg.subject.ilri[en_US],dc.title[en_US]' ~/Downloads/2020-06-08-ILRI.csv > /tmp/ilri.csv
  • Moayad asked why he’s getting HTTP 500 errors on CGSpace
    • I looked in the Nginx logs and I see some HTTP 500 responses, but nothing in nginx’s error.log
    • Looking in Tomcat’s log I see there are many:
# journalctl --since=today -u tomcat7  | grep -c 'Internal Server Error'
482
  • They are all related to the REST API, like:
Jun 07 02:00:27 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
Jun 07 02:00:27 linode18 tomcat7[6286]: javax.ws.rs.WebApplicationException
Jun 07 02:00:27 linode18 tomcat7[6286]:         at org.dspace.rest.Resource.processException(Resource.java:151)
Jun 07 02:00:27 linode18 tomcat7[6286]:         at org.dspace.rest.ItemsResource.getItems(ItemsResource.java:195)
Jun 07 02:00:27 linode18 tomcat7[6286]:         at sun.reflect.GeneratedMethodAccessor548.invoke(Unknown Source)
Jun 07 02:00:27 linode18 tomcat7[6286]:         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Jun 07 02:00:27 linode18 tomcat7[6286]:         at java.lang.reflect.Method.invoke(Method.java:498)
Jun 07 02:00:27 linode18 tomcat7[6286]:         at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
...
  • And:
Jun 08 09:28:29 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
Jun 08 09:28:29 linode18 tomcat7[6286]: javax.ws.rs.WebApplicationException
Jun 08 09:28:29 linode18 tomcat7[6286]:         at org.dspace.rest.Resource.processFinally(Resource.java:169)
Jun 08 09:28:29 linode18 tomcat7[6286]:         at org.dspace.rest.HandleResource.getObject(HandleResource.java:81)
Jun 08 09:28:29 linode18 tomcat7[6286]:         at sun.reflect.GeneratedMethodAccessor360.invoke(Unknown Source)
Jun 08 09:28:29 linode18 tomcat7[6286]:         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Jun 08 09:28:29 linode18 tomcat7[6286]:         at java.lang.reflect.Method.invoke(Method.java:498)
  • And:
Jun 06 08:19:54 linode18 tomcat7[6286]: SEVERE: Mapped exception to response: 500 (Internal Server Error)
Jun 06 08:19:54 linode18 tomcat7[6286]: javax.ws.rs.WebApplicationException
Jun 06 08:19:54 linode18 tomcat7[6286]:         at org.dspace.rest.Resource.processException(Resource.java:151)
Jun 06 08:19:54 linode18 tomcat7[6286]:         at org.dspace.rest.CollectionsResource.getCollectionItems(CollectionsResource.java:289)
Jun 06 08:19:54 linode18 tomcat7[6286]:         at sun.reflect.GeneratedMethodAccessor598.invoke(Unknown Source)
Jun 06 08:19:54 linode18 tomcat7[6286]:         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Jun 06 08:19:54 linode18 tomcat7[6286]:         at java.lang.reflect.Method.invoke(Method.java:498)
  • Looking back, I see ~800 of these errors since I changed the database configuration last week:
# journalctl --since=2020-06-04 --until=today -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
795
  • And only ~280 in the entire month before that…
# journalctl --since=2020-05-01 --until=2020-06-04 -u tomcat7 | grep -c 'javax.ws.rs.WebApplicationException'
286
  • So it seems to be related to the database, perhaps that there are less connections in the pool?
    • … and on that note, working without the JDBC driver and DSpace’s built-in connection pool since 2020-06-04 hasn’t actually solved anything, the issue with locks and idle in transaction connections is creeping up again!

PostgreSQL connections day PostgreSQL connections week

  • It seems to have started today around 10:00 AM… I need to pore over the logs to see if there is a correlation
    • I think there is some kind of attack going on because I see a bunch of requests for sequential Handles from a similar IP range in a datacenter in Sweden where the user does not re-use their DSpace session_id
    • Looking in the nginx logs I see most (all?) of these requests are using the following user agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
  • Looking at the nginx access logs I see that, other than something that seems like Google Feedburner, all hosts using this user agent are all in Sweden!
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.*.gz | grep 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36' | grep -v '/feed' | awk '{print $1}' | sort | uniq -c | sort -n
   1624 192.36.136.246
   1627 192.36.241.95
   1629 192.165.45.204
   1636 192.36.119.28
   1641 192.36.217.7
   1648 192.121.146.160
   1648 192.36.23.35
   1651 192.36.109.94
   1659 192.36.24.93
   1665 192.36.154.13
   1679 192.36.137.125
   1682 192.176.249.42
   1682 192.36.166.120
   1683 192.36.172.86
   1683 192.36.198.145
   1689 192.36.226.212
   1702 192.121.136.49
   1704 192.36.207.54
   1740 192.36.121.98
   1774 192.36.173.93
  • The earliest I see any of these hosts is 2020-06-05 (three days ago)
  • I will purge them from the Solr statistics and add them to abusive IPs ipset in the Ansible deployment scripts
$ ./check-spider-ip-hits.sh -f /tmp/ips -s statistics -p
Purging 1423 hits from 192.36.136.246 in statistics
Purging 1387 hits from 192.36.241.95 in statistics
Purging 1398 hits from 192.165.45.204 in statistics
Purging 1413 hits from 192.36.119.28 in statistics
Purging 1418 hits from 192.36.217.7 in statistics
Purging 1418 hits from 192.121.146.160 in statistics
Purging 1416 hits from 192.36.23.35 in statistics
Purging 1449 hits from 192.36.109.94 in statistics
Purging 1440 hits from 192.36.24.93 in statistics
Purging 1465 hits from 192.36.154.13 in statistics
Purging 1447 hits from 192.36.137.125 in statistics
Purging 1453 hits from 192.176.249.42 in statistics
Purging 1462 hits from 192.36.166.120 in statistics
Purging 1499 hits from 192.36.172.86 in statistics
Purging 1457 hits from 192.36.198.145 in statistics
Purging 1467 hits from 192.36.226.212 in statistics
Purging 1489 hits from 192.121.136.49 in statistics
Purging 1478 hits from 192.36.207.54 in statistics
Purging 1502 hits from 192.36.121.98 in statistics
Purging 1544 hits from 192.36.173.93 in statistics

Total number of bot hits purged: 29025
  • Skype with Enrico, Moayad, Jane, Peter, and Abenet to see the latest OpenRXV/AReS developments
    • One thing Enrico mentioned to me during the call was that they had issues with Altmetric’s user agents, and he said they are apparently using Altmetribot and Postgenomic V2
    • I looked in our logs and indeed we have those, so I will add them to the nginx rate limit bypass
    • I checked the Solr stats and it seems there are only a few thousand in 2016 and a few hundred in other years so I won’t bother adding it to the DSpace robot user agents list
  • Atmire sent an updated pull request for the Font Awesome 5 update for CUA (#445) so I filed feedback on their tracker

2020-06-09

  • I’m still thinking about the issue with PostgreSQL “idle in transaction” and “waiting for lock” connections
    • As far as I can see from the Munin graphs this issue started in late April or early May
    • I don’t see any PostgreSQL updates around then, though I did update Tomcat to version 7.0.103 in March
    • I will try to downgrade Tomcat to 7.0.99, which was the version I was using until early February, before we had seen any issues
    • Also, I will use the PostgreSQL JDBC driver version 42.2.9, which is what we were using back then as well
    • After deploying Tomcat 7.0.99 I had to restart Tomcat three times before all the Solr statistics cores came up OK
  • Well look at that, the “idle in transaction” and locking issues started in April on DSpace Test too…

PostgreSQL connections year DSpace Test

2020-06-13

  • Atmire sent some questions about DSpace Test related to our ongoing CUA indexing issue
    • I had to clarify a few build steps and directories on the test server
  • I notice that the PostgreSQL connection issues have not come back since 2020-06-09 when I downgraded Tomcat to 7.0.99… fingers crossed that it was something related to that!
    • On that note I notice that the AReS explorer is still not harvesting CGSpace properly…
    • I looked at the REST API logs on CGSpace (linode18) and saw that the AReS harvester is being denied due to not having a user agent, oops:
172.104.229.92 - - [13/Jun/2020:02:00:00 +0200] "GET /rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=0 HTTP/1.1" 403 260 "-" "-"
  • I created an nginx map based on the host’s IP address that sets a temporary user agent (ua) and then changed the conditional in the REST API location block so that it checks this mapped ua instead of the default one
    • That should allow AReS to harvest for now until they update their user agent
    • I restarted the AReS server’s docker containers with docker-compose down and docker-compose up -d and the next day I saw CGSpace was in AReS again finally

2020-06-14

  • Abenet asked for a list of authors from CIP’s community so that Gabriela can make some corrections
    • I generated a list of collections in CIPs two communities using the REST API:
$ curl -s 'https://cgspace.cgiar.org/rest/handle/10568/51671?expand=collections' 'https://cgspace.cgiar.org/rest/handle/10568/89346?expand=collections' | grep -oE '10568/[0-9]+' | sort | uniq > /tmp/cip-collections.txt
  • Then I formatted it into a SQL query and exported a CSV:
dspace=# \COPY (SELECT DISTINCT text_value AS author, COUNT(*) FROM metadatavalue WHERE metadata_field_id = (SELECT metadata_field_id FROM metadatafieldregistry WHERE element = 'contributor' AND qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (SELECT item_id FROM collection2item WHERE collection_id IN (SELECT resource_id FROM hANDle WHERE hANDle IN ('10568/100533', '10568/100653', '10568/101955', '10568/106580', '10568/108469', '10568/51671', '10568/53085', '10568/53086', '10568/53087', '10568/53088', '10568/53089', '10568/53090', '10568/53091', '10568/53092', '10568/53093', '10568/53094', '10568/64874', '10568/69069', '10568/70150', '10568/88229', '10568/89346', '10568/89347', '10568/99301', '10568/99302', '10568/99303', '10568/99304', '10568/99428'))) GROUP BY text_value ORDER BY count DESC) TO /tmp/cip-authors.csv WITH CSV;
COPY 3917

2020-06-15

  • Macaroni Bros emailed me to say that they are having issues with thumbnail links on the REST API
    • For some reason all the bitstream retrieveLink links are wrong because they use /bitstreams/ instead of /rest/bitstreams/, which leads to an HTTP 404
    • I looked on DSpace Test, which is running DSpace 6 dev branch right now, and the links are OK there!
    • Looks like someone reported this issue on DSpace 6
    • Other large DSpace 5 sites have this same issue: https://openknowledge.worldbank.org/handle/10986/30568
    • I can’t believe nobody ever noticed this before…
    • I tried to port the patch from DS-3193 to DSpace 5.x and it builds, but causes an HTTP 500 Internal Server error when generating bitstream links
    • Well the correct URL should have /rest/ anyways, and that’s how the URLs are in DSpace 6 anyways, so I will tell Macaroni to just make sure that those links use /rest/

2020-06-16

  • Looks like the PostgreSQL connection/lock issue might be fixed because it’s been six days with no reoccurrence:

PostgreSQL connections week

  • And CGSpace is being harvested successfully by AReS every day still
  • Fix some CIP subjects that had two different styles of dashes, causing them to show up differently in Discovery
    • SWEETPOTATO AGRI‐FOOD SYSTEMSSWEETPOTATO AGRI-FOOD SYSTEMS
    • POTATO AGRI‐FOOD SYSTEMSPOTATO AGRI-FOOD SYSTEMS
  • They also asked me to update INCLUSIVE VALUE CHAINS to INCLUSIVE GROWTH, both in the existing items on CGSpace and the submission form

2020-06-18

  • I guess Atmire fixed the CUA download issue after updating the version for Font Awesome 5, but now I get an error during ant update
    • I tried to remove the config/spring directory, but it still fails
    • The same issue happens on my local environment and on the DSpace Test server
    • I raised the issue with Atmire

2020-06-21

  • The PostgreSQL connections and locks still look OK on both CGSpace (linode18) and DSpace Test (linode26) after ten days or so
    • I decided to upgrade the JDBC driver on DSpace Test from 42.2.9 to 42.2.14 to leave it for a few weeks and see if the issue comes back, as it was present on the test server with DSpace 6 as well
    • As far as I can tell the issue is related to something in Tomcat > 7.0.99
  • Run system updates and reboot DSpace Test
  • Re-deploy 5_x-prod branch on CGSpace, run system updates, and reboot the server
    • I had to restart Tomcat 7 once after the server came back up because some of the Solr statistics cores didn’t come up the first time
    • Unfortunately I realized that I had forgotten to update the 5_x-prod branch to include the REST API fix from last week so I had to rebuild and restart Tomcat again

2020-06-22

  • Skype with Peter, Abenet, Enrico, Jane, and Moayad about the latest OpenRXV developments
    • I will go visit CodeObia later this week to run through the list of issues and close the ones that are finished

2020-06-23

  • Peter said he’s having problems submitting an item on CGSpace and shit, it seems to be the same fucking PostgreSQL “idle in transaction” and “waiting for lock” issue we’ve been having sporadically the last few months

PostgreSQL connections year CGSpace

  • The issue hadn’t occurred in almost two weeks after I downgraded Tomcat to 7.0.99 with the PostgreSQL JDBC driver version 42.2.9 so I thought it was fixed
    • Apparently it’s not related to the Tomcat or JDBC driver version, as even when I reverted back to DSpace’s really old built-in JDBC driver it still did the same thing!
    • Could it be a memory leak or something? Why now?
    • For now I will revert to the latest Tomcat 7.0.104 and PostgreSQL JDBC 42.2.14

2020-06-24

  • Spend some time with Moayad looking over issues for OpenRXV
    • We updated all the labels, tooltips, and filters
    • The next step is to go through the GitHub issues and close them if they are done
  • I also discussed the dspace-statistics-api with Mohammed Salem
    • He made some new features to add the harvesting of twelve months of item statistics
    • We talked about extending it to be “x” amount of months and years with some sensible defaults
    • The item and items endpoints could then have ?months=12 and ?years=2 to show stats for the past “x” months or years
    • We thought other arbitrary date ranges could be added with queries like ?date_from=2020-05 etc that would query Solr on the fly and obviously be slower…

2020-06-28

  • Email GRID.ac to ask them about where old names for institutes are stores, as I see them in the “Disambiguate” search function online, but not in the standalone data
    • For example, both “International Laboratory for Research on Animal Diseases” (ILRAD) and “International Livestock Centre for Africa” (ILCA) correctly return a hit for “International Livestock Research Institute”, but it’s nowhere in the data
  • I discovered two interesting OpenRefine reconciliation services:

2020-06-29

Because coverage is so broadly defined, it is preferable to use the more specific subproperties Temporal Coverage and Spatial Coverage.

  • So I guess we should be using this for countries… but then all regions, countries, etc get merged together into this when you use DCTERMS
    • Perhaps better to use cg.coverage.country and crosswalk to dcterms.spatial
    • Another thing is that these values are not literals—you are supposed to embed classes…
  • I also notice that there is a CrossRef funders registry with 23,000+ funders that you can download as RDF or access via an API
$ http 'https://api.crossref.org/funders?query=Bill+and+Melinda+Gates&mailto=a.orth@cgiar.org'
  • Searching for “Bill and Melinda Gates” we can see the name literal and a list of alt-names literals
  • See the CrossRef API docs (specifically the parameters and filters)
  • I made a pull request on CG Core v2 to recommend using persistent identifiers for DOIs and ORCID iDs (#26)
  • I exported sponsors/funders from CGSpace and wrote a script to query the CrossRef API for matches:
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29) TO /tmp/2020-06-29-sponsors.csv;
COPY 682
  • The script is crossref-funders-lookup.py and it is based on agrovoc-lookup.py
    • On that note, I realized I need to URL encode the funder before making the search request with requests because, while the requests library does do URL encoding, it seems that it interprets characters like & as indicative query parameters and this causes searches for funders like Bill & Melinda Gates Foundation to get misinterpreted
    • So then I noticed that I had worked around this in agrovoc-lookup.py a few years ago by just ignoring subjects with special characters like apostrophes and accents!
  • I tested the script on our funders:
$ ./crossref-funders-lookup.py -i /tmp/2020-06-29-sponsors.csv -om /tmp/sponsors-matched.txt -or /tmp/sponsors-rejected.txt -d -e blah@blah.com
$ wc -l /tmp/2020-06-29-sponsors.csv 
682 /tmp/2020-06-29-sponsors.csv
$ wc -l /tmp/sponsors-*
  180 /tmp/sponsors-matched.txt
  502 /tmp/sponsors-rejected.txt
  682 total
  • It seems that 35% of our funders already match… I bet a few more will match if I check for simple errors
    • Interesting, I found a few funders that we have correct, but can’t figure out how to match them in the API:
      • Claussen-Simon-Stiftung
      • H2020 Marie Skłodowska-Curie Actions

2020-06-30

  • GRID responded to my question about historical names
    • They said the information is not part of the public GRID or ROR lists, but you can access it with a license to the Dimensions API
  • Gabriela from CIP sent me a list of erroneously added CIP subjects to remove from CGSpace:
$ cat /tmp/2020-06-30-remove-cip-subjects.csv 
cg.subject.cip
INTEGRATED PEST MANAGEMENT
ORANGE FLESH SWEET POTATOES
AEROPONICS
FOOD SUPPLY
SASHA
SPHI
INSECT LIFE CYCLE MODELLING
SUSTAIN
AGRICULTURAL INNOVATIONS
NATIVE VARIETIES
PHYTOPHTHORA INFESTANS
$ ./delete-metadata-values.py -i /tmp/2020-06-30-remove-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -m 127 -d
  • She also wants to change their SWEET POTATOES term to SWEETPOTATOES, both in the CIP subject list and existing items so I updated those too:
$ cat /tmp/2020-06-30-fix-cip-subjects.csv 
cg.subject.cip,correct
SWEET POTATOES,SWEETPOTATOES
$ ./fix-metadata-values.py -i /tmp/2020-06-30-fix-cip-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.cip -t correct -m 127 -d
  • She also finished doing all the corrections to authors that I had sent her last week, but many of the changes are removing Spanish accents from authors names so I asked if she’s really should she wants to do that
  • I ran the fixes and deletes on CGSpace, but not on DSpace Test yet because those scripts need updating for DSpace 6 UUIDs
  • I spent about two hours manually checking our sponsors that were rejected from CrossRef and found about fifty-five corrections that I ran on CGSpace:
$ cat 2020-06-29-fix-sponsors.csv
dc.description.sponsorship,correct
"Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil","Conselho Nacional de Desenvolvimento Científico e Tecnológico"
"Claussen Simon Stiftung","Claussen-Simon-Stiftung"
"Fonds pour la formation á la Recherche dans l'Industrie et dans l'Agriculture, Belgium","Fonds pour la Formation à la Recherche dans l’Industrie et dans l’Agriculture"
"Fundação de Amparo à Pesquisa do Estado de São Paulo, Brazil","Fundação de Amparo à Pesquisa do Estado de São Paulo"
"Schlumberger Foundation Faculty for the Future","Schlumberger Foundation"
"Wildlife Conservation Society, United States","Wildlife Conservation Society"
"Portuguese Foundation for Science and Technology","Portuguese Science and Technology Foundation"
"Wageningen University and Research","Wageningen University and Research Centre"
"Leverhulme Centre for Integrative Research in Agriculture and Health","Leverhulme Centre for Integrative Research on Agriculture and Health"
"Natural Science and Engineering Research Council of Canada","Natural Sciences and Engineering Research Council of Canada"
"Biotechnology and Biological Sciences Research Council, United Kingdom","Biotechnology and Biological Sciences Research Council"
"Home Grown Ceraels Authority United Kingdom","Home-Grown Cereals Authority"
"Fiat Panis Foundation","Foundation fiat panis"
"Defence Science and Technology Laboratory, United Kingdom","Defence Science and Technology Laboratory"
"African Development Bank","African Development Bank Group"
"Ministry of Health, Labour, and Welfare, Japan","Ministry of Health, Labour and Welfare"
"World Academy of Sciences","The World Academy of Sciences"
"Agricultural Research Council, South Africa","Agricultural Research Council"
"Department of Homeland Security, USA","U.S. Department of Homeland Security"
"Quadram Institute","Quadram Institute Bioscience"
"Google.org","Google"
"Department for Environment, Food and Rural Affairs, United Kingdom","Department for Environment, Food and Rural Affairs, UK Government"
"National Commission for Science, Technology and Innovation, Kenya","National Commission for Science, Technology and Innovation"
"Hainan Province Natural Science Foundation of China","Natural Science Foundation of Hainan Province"
"German Society for International Cooperation (GIZ)","GIZ"
"German Federal Ministry of Food and Agriculture","Federal Ministry of Food and Agriculture"
"State Key Laboratory of Environmental Geochemistry, China","State Key Laboratory of Environmental Geochemistry"
"QUT student scholarship","Queensland University of Technology"
"Australia Centre for International Agricultural Research","Australian Centre for International Agricultural Research"
"Belgian Science Policy","Belgian Federal Science Policy Office"
"U.S. Department of Agriculture USDA","U.S. Department of Agriculture"
"U.S.. Department of Agriculture (USDA)","U.S. Department of Agriculture"
"Fundação de Amparo à Pesquisa do Estado de São Paulo ( FAPESP)","Fundação de Amparo à Pesquisa do Estado de São Paulo"
"Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul, Brazil","Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul"
"Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro, Brazil","Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro"
"Swedish University of Agricultural Sciences (SLU)","Swedish University of Agricultural Sciences"
"U.S. Department of Agriculture (USDA)","U.S. Department of Agriculture"
"Swedish International Development Cooperation Agency (Sida)","Sida"
"Swedish International Development Agency","Sida"
"Federal Ministry for Economic Cooperation and Development, Germany","Federal Ministry for Economic Cooperation and Development"
"Natural Environment Research Council, United Kingdom","Natural Environment Research Council"
"Economic and Social Research Council, United Kingdom","Economic and Social Research Council"
"Medical Research Council, United Kingdom","Medical Research Council"
"Federal Ministry for Education and Research, Germany","Federal Ministry for Education, Science, Research and Technology"
"UK Government’s Department for International Development","Department for International Development, UK Government"
"Department for International Development, United Kingdom","Department for International Development, UK Government"
"United Nations Children's Fund","United Nations Children's Emergency Fund"
"Swedish Research Council for Environment, Agricultural Science and Spatial Planning","Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning"
"Agence Nationale de la Recherche, France","French National Research Agency"
"Fondation pour la recherche sur la biodiversité","Foundation for Research on Biodiversity"
"Programa Nacional de Innovacion Agraria, Peru","Programa Nacional de Innovación Agraria, Peru"
"United States Agency for International Development (USAID)","United States Agency for International Development"
"West Africa Agricultural Productivity Programme","West Africa Agricultural Productivity Program"
"West African Agricultural Productivity Project","West Africa Agricultural Productivity Program"
"Rural Development Administration, Republic of Korea","Rural Development Administration"
"UK’s Biotechnology and Biological Sciences Research Council (BBSRC)","Biotechnology and Biological Sciences Research Council"
$ ./fix-metadata-values.py -i /tmp/2020-06-29-fix-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t correct -m 29
  • Then I started a full re-index at batch CPU priority:
$ time chrt --batch 0 dspace index-discovery -b

real    99m16.230s
user    11m23.245s
sys     2m56.635s
  • Peter wants me to add “CORONAVIRUS DISEASE” to all ILRI items that have ILRI subject “COVID19”
    • I exported the ILRI community and cut the columns I needed, then opened the file in OpenRefine:
$ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
$ dspace metadata-export -i 10568/1 -f /tmp/ilri.cs
$ csvcut -c 'id,cg.subject.ilri[],cg.subject.ilri[en_US],dc.subject[en_US]' /tmp/ilri.csv > /tmp/ilri-covid19.csv
  • I see that all items with “COVID19” already have “CORONAVIRUS DISEASE” so I don’t need to do anything