CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

September, 2020

2020-09-02

  • Replace Marissa van Epp for Rhys Bucknall in the CCAFS groups on CGSpace because Marissa no longer works at CCAFS
  • The AReS Explorer hasn’t updated its index since 2020-08-22 when I last forced it
    • I restarted it again now and told Moayad that the automatic indexing isn’t working
  • Add Alliance of Bioversity International and CIAT to affiliations on CGSpace
  • Abenet told me that the general search text on AReS doesn’t get reset when you use the “Reset Filters” button
  • I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40
  • I ran the country code tagger on CGSpace:
$ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-09-02-countrycodetagger.log
...
real    2m10.516s
user    1m43.953s
sys     0m15.192s
$ grep -c added /tmp/2020-09-02-countrycodetagger.log
39
  • I still need to create a cron job for this…
  • Sisay and Abenet said they can’t log in with LDAP on DSpace Test (DSpace 6)
    • I tried and I can’t either… but it is working on CGSpace
    • The error on DSpace 6 is:
2020-09-02 12:03:10,666 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A629116488DCC467E1EA2062A2E2EFD7:ip_addr=92.220.02.201:failed_login:no DN found for user aorth
  • I tried to query LDAP directly using the application credentials with ldapsearch and it works:
$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "applicationaccount@cgiarad.org" -W "(sAMAccountName=me)"
  • According to the DSpace 6 docs we need to escape commas in our LDAP parameters due to the new configuration system
    • I added the commas and restarted DSpace (though technically we shouldn’t need to restart due to the new config system hot reloading configs)
    • Run all system updates on DSpace Test (linode26) and reboot it
    • After the restart LDAP login works…

2020-09-03

  • Fix some erroneous “review status” fields that Abenet noticed on AReS
    • I used my fix-metadata-values.py and delete-metadata-values.py scripts with the following input files:
$ cat 2020-09-03-fix-review-status.csv
dc.description.version,correct
Externally Peer Reviewed,Peer Review
Peer Reviewed,Peer Review
Peer review,Peer Review
Peer reviewed,Peer Review
Peer-Reviewed,Peer Review
Peer-reviewed,Peer Review
peer Review,Peer Review
$ cat 2020-09-03-delete-review-status.csv
dc.description.version
Report
Formally Published
Poster
Unrefereed reprint
$ ./delete-metadata-values.py -i 2020-09-03-delete-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -m 68
$ ./fix-metadata-values.py -i 2020-09-03-fix-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -t 'correct' -m 68
  • Start reviewing 95 items for IITA (20201stbatch)
    • I used my csv-metadata-quality tool to check and fix some low-hanging fruit first
    • This fixed a few unnecessary Unicode, excessive whitespace, invalid multi-value separator, and duplicate metadata values
    • Then I looked at the data in OpenRefine and noticed some things:
      • All issue dates use year only, but some have months in the citation so they could be more specific
      • I normalized all the DOIs to use “https://doi.org” format
      • I fixed a few AGROVOC subjects with a simple GREL: value.replace("GRAINS","GRAIN").replace("SOILS","SOIL").replace("CORN","MAIZE")
      • But there are a few more that are invalid that she will have to look at
      • I uploaded the items to DSpace Test and it was apparently successful but I get these errors to the console:
Thu Sep 03 12:26:33 CEST 2020 | Query:containerItem:ea7a2648-180d-4fce-bdc5-c3aa2304fc58
Error while updating
java.lang.NullPointerException
        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1131)
        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:212)
        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1104)
        at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1093)
        at org.dspace.statistics.StatisticsLoggingConsumer.consume(SourceFile:104)
        at org.dspace.event.BasicDispatcher.consume(BasicDispatcher.java:177)
        at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:123)
        at org.dspace.core.Context.dispatchEvents(Context.java:455)
        at org.dspace.core.Context.commit(Context.java:424)
        at org.dspace.core.Context.complete(Context.java:380)
        at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
  • There are more in the DSpace log so I will raise it with Atmire immediately

2020-09-04

dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/library/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/library/[[:digit:]]+/?';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/pid/[[:digit:]]+';
  • I did some cleanup on the author affiliations of the IITA data our 2019-04 list using reconcile-csv and OpenRefine:
    • $ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id
    • I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: if(cell.recon.matched, cell.recon.match.name, value)
  • I mapped one duplicated from the CIFOR Archives and re-uploaded the 94 IITA items to a new collection on DSpace Test

2020-09-08

  • I noticed that the “share” link in AReS wasn’t working properly because it excludes the “explorer” part of the URI

AReS share link broken

  • I filed an issue on GitHub: https://github.com/ilri/OpenRXV/issues/41
  • I uploaded the 94 IITA items that I had been working on last week to CGSpace
  • RTB emailed to ask why they are getting HTTP 503 errors during harvesting to the RTB WordPress website
    • From the screenshot I can see they are requesting URLs like this:
https://cgspace.cgiar.org/bitstream/handle/10568/82745/Characteristics-Silage.JPG
  • So they end up getting rate limited due to the XMLUI rate limits
    • I told them to use the REST API bitstream retrieve links, because we don’t have any rate limits there

2020-09-09

  • Wire up the systemd service/timer for the CGSpace Country Code Tagger curation task in the Ansible infrastructure scripts
    • For now it won’t work on DSpace 6 because the curation task invocation needs to be slightly different (minus the -l parameter) and for some reason the task isn’t working on DSpace Test (version 6) right now
    • I added DSpace 6 support to the playbook templates…
  • Run system updates on DSpace Test (linode26), re-deploy the DSpace 6 test branch, and reboot the server
    • After rebooting I deleted old copies of the cgspace-java-helpers JAR in the DSpace lib directory and then the curation worked
    • To my great surprise the curation worked (and completed, albeit a few times slower) on my local DSpace 6 environment as well:
$ ~/dspace63/bin/dspace curate -t countrycodetagger -i all -s object

2020-09-10

  • I checked the country code tagger on CGSpace and DSpace Test and it ran fine from the systemd timer last night… w00t
  • I started looking at Peter’s changes to the CGSpace regions that were proposed in 2020-07
    • The changes will be:
$ cat 2020-09-10-fix-cgspace-regions.csv
cg.coverage.region,correct
EAST AFRICA,EASTERN AFRICA
WEST AFRICA,WESTERN AFRICA
SOUTHEAST ASIA,SOUTHEASTERN ASIA
SOUTH ASIA,SOUTHERN ASIA
AFRICA SOUTH OF SAHARA,SUB-SAHARAN AFRICA
NORTH AFRICA,NORTHERN AFRICA
WEST ASIA,WESTERN ASIA
SOUTHWEST ASIA,SOUTHWESTERN ASIA
$ ./fix-metadata-values.py -i 2020-09-10-fix-cgspace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d -n
Connected to database.
Would fix 12227 occurences of: EAST AFRICA
Would fix 7996 occurences of: WEST AFRICA
Would fix 3515 occurences of: SOUTHEAST ASIA
Would fix 3443 occurences of: SOUTH ASIA
Would fix 1134 occurences of: AFRICA SOUTH OF SAHARA
Would fix 357 occurences of: NORTH AFRICA
Would fix 81 occurences of: WEST ASIA
Would fix 3 occurences of: SOUTHWEST ASIA
  • I think we need to wait for the web team, though, as they need to update their mappings
    • Not to mention that we’ll need to give WLE and CCAFS time to update their harvesters as well… hmmm
  • Looking at the top user agents active on CGSpace in 2020-08 and I see:
    • Delphi 2009: 235353 (this is GARDIAN harvester I guess, as the IP is in Greece)
    • Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98): 57004 (IP is 18.196.100.94, and the requests seem to be for CTA’s content)
    • RTB website BOT: 12282
    • ILRI Livestock Website Publications importer BOT: 9393
  • Shit, I meant to add Delphi to the DSpace spider agents list last month but I guess I didn’t commit the change
  • HTTrack is in the agents list so I’m not sure why DSpace registers a hit from that request
  • Also, I am surprised to see the RTB and ILRI bots here because they have “BOT” in the name and that should also be dropped
  • I also see hits from curl and Java/1.8.0_66 and Apache-HttpClient so WTF… those are supposed to be dropped by the default agents list
  • Some IP 2607:f298:5:101d:f816:3eff:fed9:a484 made 9,000 requests with the RI/1.0 user agent this year…
    • That’s on DreamHost…?
  • I purged 448658 hits from these agents and added Delphi to our local agents overload for Solr as well as Tomcat’s Crawler Session Manager Valve so that it forces them to re-use a single session
  • I made a pull request on the COUNTER-Robots project for the Daum robot: https://github.com/atmire/COUNTER-Robots/pull/38
    • This bot made 8,000 requests to CGSpace this year
    • I purged about 20,000 total requests from this bot from our Solr stats for the last few years

2020-09-11

  • Peter noticed that an export from AReS shows some items with zero views and others with zero views/downloads, but on CGSpace and in the statistics API there are views/downloads
    • I need to ask Moayad…

2020-09-12

  • Carlos Tejo from the LandPortal emailed to ask for advice about integrating their LandVoc vocabulary, which is a subset of AGROVOC, into DSpace
  • Redeploy the latest 5_x-prod branch on CGSpace, re-run the latest Ansible DSpace playbook, run all system updates, and reboot the server (linode18)
    • This will bring the latest bot lists for Solr and Tomcat
    • I had to restart Tomcat 7 three times before all Solr statistics cores came up OK
  • Leroy and Carol from CIAT/Bioversity were asking for information about posting to the CGSpace REST API from Sharepoint
    • I told them that we don’t allow this yet, but that we need to check in the future whether content can be posted to a workflow

2020-09-15

  • Charlotte from Altmetric said they had issues parsing the XML file I sent them last month
    • I told them that it was mimicking the same format that they had sent me (fourteen pages of XML responses concatenated together)!
  • A few days ago IWMI asked us if we can add a new field on CGSpace for their library identifier
    • The IDs look like this: H049940
    • I suggested that we use cg.identifier.iwmilibrary
    • I added it to the input forms and push it to the 5_x-prod and 6.x branches and will re-deploy it in the next few days
  • Abenet asked me to import sixty-nine (69) CIP Annual Reports to CGSpace
    • I looked at the data in OpenRefine and it is very good quality
    • I only added descriptions to the filename field so that SAFBuilder will add them to the bitstreams on import:
value + "__description:" + cells["dc.type"].value
  • Then I created a SAF bundle with SAFBuilder:
$ ./safbuilder.sh -c ~/Downloads/cip-annual-reports/cip-reports.csv
  • And imported them into my local test instance of CGSpace:
$ ~/dspace/bin/dspace import -a -e y.arrr@cgiar.org -m /tmp/2020-09-15-cip-annual-reports.map -s ~/Downloads/cip-annual-reports/SimpleArchiveFormat
  • Then I uploaded them to CGSpace

2020-09-16

  • Looking further into Carlos Tejos’s question about integrating LandVoc (the AGROVOC subset) into DSpace
    • I see that you can actually get LandVoc concepts directly from AGROVOC’s SPARQL, for example with this query

AGROVOC LandVoc SPARQL

  • So maybe we can query AGROVOC directly using a similar method to DSpace-CRIS’s GettyAuthority
  • I wired up DSpace-CRIS’s VIAFAuthority to see how authorities for auto suggested names get stored
    • After submission you can see the item’s VIAF identifier:

VIAF authority

  • And this identifier is the ID on VIAF, pretty cool!

VIAF entry for Charles Darwin

  • I did a similar test with the Getty Thesaurus of Geographic Names (TGN) and it stores the concept URI in the authority:

TGNAuthority

  • But the authority values are not exposed anywhere as metadata…
    • I need to play with it a bit more I guess…
  • The nice thing is that the Getty example from DSpace-CRIS uses SPARQL as well, and the TGN authority extends it
    • We could use a similar model for AGROVOC/LandVoc very easily

2020-09-17

  • Maria from Bioveristy asked about the ORCID identifier for one of her colleagues that seems to have been removed from our list
    • I re-added it to our controlled vocabulary and added the identifier to fifty-one of his existing items on CGSpace using my script:
$ cat 2020-09-17-add-bioversity-orcids.csv
dc.contributor.author,cg.creator.id
"Etten, Jacob van","Jacob van Etten: 0000-0001-7554-2558"
"van Etten, Jacob","Jacob van Etten: 0000-0001-7554-2558"
$ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dspace -u dspace -p 'dom@in34sniper'
  • I sent a follow-up message to Atmire to look into the two remaining issues with the DSpace 6 upgrade
    • First is the fact that we have zero results in our Listings and Reports, for any search
    • Second is the error we get during CSV imports
  • Help Natalia and Cathy from Bioversity-CIAT with their OpenSearch query on “trade offs” again
    • They wanted to build a search query with multiple filters (type, crpsubject, status) and the general query “trade offs”
    • I found a great reference for DSpace’s OpenSearch syntax (albeit in Finnish, but the example URLs show the syntax clearly)
    • We can use quotes and AND and OR and even group search parameters with parenthesis!
    • So now I built a query for Natalia which uses these (showing without URL encoding so you can see the syntax):
https://cgspace.cgiar.org/open-search/discover?query=type:"Journal Article" AND status:"Open Access" AND crpsubject:"Water, Land and Ecosystems" AND "tradeoffs"&rpp=100
  • I noticed that my move-collections.sh script didn’t work on DSpace 6 because of the change from IDs to UUIDs, so I modified it to quote the collection resource_id parameters in the PostgreSQL query

2020-09-18

  • Help Natalia with her WLE “tradeoffs” search query again…

2020-09-20

  • Deploy latest 5_x-prod branch on CGSpace, run all system updates, and reboot the server
    • To my great surprise, all the Solr statistics cores came up correctly after reboot
  • Deploy latest 6_x-dev branch on DSpace Test, run all system updates and reboot the server

2020-09-22

  • Abenet sent some feedback about AReS
    • The item views and downloads are still incorrect
    • I looked in the server’s API logs and there are no errors, and the database has many more views/downloads:
dspacestatistics=# SELECT SUM(views) FROM items;
   sum
----------
 15714024
(1 row)

dspacestatistics=# SELECT SUM(downloads) FROM items;
   sum
----------
 13979911
(1 row)
  • I deleted “Report” from twelve items that had it in their peer review field:
dspace=# BEGIN;
BEGIN
dspace=# DELETE FROM metadatavalue WHERE text_value='Report' AND resource_type_id=2 AND metadata_field_id=68;
DELETE 12
dspace=# COMMIT;
  • I added all CG center- and CRP-specific subject fields and mapped them to dc.subject in AReS
  • After forcing a re-harvesting now the review status is much cleaner and the missing subjects are available
  • Last week Natalia from CIAT had asked me to download all the PDFs for a certain query:
    • items with status “Open Access”
    • items with type “Journal Article”
    • items containing any of the following words: water land and ecosystems & trade offs
    • The resulting OpenSearch query is: https://cgspace.cgiar.org/open-search/discover?query=type:"Journal Article" AND status:“Open Access” AND Water Land Ecosystems trade offs&rpp=1
    • There were 241 results with a total of 208 PDFs, which I downloaded with my get-wle-pdfs.py script and shared to her via bashupload.com

2020-09-23

  • Peter said he was having problems submitting items to CGSpace
    • On a hunch I looked at the PostgreSQL locks in Munin and indeed the normal issue with locks is back (though I haven’t seen it in a few months?)

PostgreSQL connections day

  • Instead of restarting Tomcat I restarted the PostgreSQL service and then Peter said he was able to submit the item…
  • Experiment with doing direct queries for items in the dspace-statistics-api
    • I tested querying a handful of item UUIDs with a date range and returning their hits faceted by id
    • Assuming a list of item UUIDs was posted to the REST API we could prepare them for a Solr query by joining them into a string with “OR” and escaping the hyphens:
...
item_ids = ['0079470a-87a1-4373-beb1-b16e3f0c4d81', '007a9df1-0871-4612-8b28-5335982198cb']
item_ids_str = ' OR '.join(item_ids).replace('-', '\-')
...
solr_query_params = {
    "q": f"id:({item_ids_str})",
    "fq": "type:2 AND isBot:false AND statistics_type:view AND time:[2020-01-01T00:00:00Z TO 2020-09-02T00:00:00Z]",
    "facet": "true",
    "facet.field": "id",
    "facet.mincount": 1,
    "facet.limit": 1,
    "facet.offset": 0,
    "stats": "true",
    "stats.field": "id",
    "stats.calcdistinct": "true",
    "shards": shards,
    "rows": 0,
    "wt": "json",
}
  • The date range format for Solr is important, but it seems we only need to add T00:00:00Z to the normal ISO 8601 YYYY-MM-DD strings