title | date | author | categories |
---|---|---|---|
September, 2020 | 2020-09-02T15:35:54+03:00 | Alan Orth | |
2020-09-02
- Replace Marissa van Epp with Rhys Bucknall in the CCAFS groups on CGSpace because Marissa no longer works at CCAFS
- The AReS Explorer hasn't updated its index since 2020-08-22 when I last forced it
- I restarted it again now and told Moayad that the automatic indexing isn't working
- Add "Alliance of Bioversity International and CIAT" to affiliations on CGSpace
- Abenet told me that the general search text on AReS doesn't get reset when you use the "Reset Filters" button
- I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39
- I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40
- I ran the country code tagger on CGSpace:
$ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-09-02-countrycodetagger.log
...
real 2m10.516s
user 1m43.953s
sys 0m15.192s
$ grep -c added /tmp/2020-09-02-countrycodetagger.log
39
- I still need to create a cron job for this...
- Sisay and Abenet said they can't log in with LDAP on DSpace Test (DSpace 6)
- I tried and I can't either... but it is working on CGSpace
- The error on DSpace 6 is:
2020-09-02 12:03:10,666 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A629116488DCC467E1EA2062A2E2EFD7:ip_addr=92.220.02.201:failed_login:no DN found for user aorth
- I tried to query LDAP directly using the application credentials with ldapsearch and it works:
$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "applicationaccount@cgiarad.org" -W "(sAMAccountName=me)"
- According to the DSpace 6 docs we need to escape commas in our LDAP parameters due to the new configuration system
- I escaped the commas and restarted DSpace (though technically we shouldn't need to restart because the new config system hot-reloads configs)
- Run all system updates on DSpace Test (linode26) and reboot it
- After the restart LDAP login works...
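A minimal sketch of the comma escaping, assuming the escape is a backslash placed before each comma (the usual convention for Apache Commons Configuration-style property files). The helper name `escape_commas` is hypothetical; only the base DN is taken from the ldapsearch command above.

```python
import re


def escape_commas(value: str) -> str:
    """Put a backslash before every comma that is not already escaped."""
    # The negative lookbehind skips commas already preceded by a backslash,
    # so running the helper twice does not double-escape.
    return re.sub(r"(?<!\\),", r"\\,", value)


base_dn = "dc=cgiarad,dc=org"  # base DN from the ldapsearch command
print(escape_commas(base_dn))
```

The escaped string is what would go into the LDAP settings in the DSpace 6 configuration; the exact property names are not shown here because they are not part of these notes.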
2020-09-03
- Fix some erroneous "review status" fields that Abenet noticed on AReS
- I used my fix-metadata-values.py and delete-metadata-values.py scripts with the following input files:
$ cat 2020-09-03-fix-review-status.csv
dc.description.version,correct
Externally Peer Reviewed,Peer Review
Peer Reviewed,Peer Review
Peer review,Peer Review
Peer reviewed,Peer Review
Peer-Reviewed,Peer Review
Peer-reviewed,Peer Review
peer Review,Peer Review
$ cat 2020-09-03-delete-review-status.csv
dc.description.version
Report
Formally Published
Poster
Unrefereed reprint
$ ./delete-metadata-values.py -i 2020-09-03-delete-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -m 68
$ ./fix-metadata-values.py -i 2020-09-03-fix-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -t 'correct' -m 68
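The combined effect of the two input files can be sketched in memory. This is only an illustration of the mapping, not how fix-metadata-values.py or delete-metadata-values.py work internally (they operate on the database). The function names and the sample values, including "Internal Review", are hypothetical.

```python
import csv
import io

FIX_CSV = """dc.description.version,correct
Externally Peer Reviewed,Peer Review
Peer Reviewed,Peer Review
Peer review,Peer Review
Peer reviewed,Peer Review
Peer-Reviewed,Peer Review
Peer-reviewed,Peer Review
peer Review,Peer Review
"""

DELETE_CSV = """dc.description.version
Report
Formally Published
Poster
Unrefereed reprint
"""


def load_fixes(text, field, correct_column):
    """Build an {incorrect value: corrected value} mapping from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[field]: row[correct_column] for row in reader}


def load_deletes(text, field):
    """Build the set of values that should be removed entirely."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[field] for row in reader}


def clean(values, fixes, deletes):
    """Drop values marked for deletion, then map the rest to their corrected form."""
    return [fixes.get(value, value) for value in values if value not in deletes]


fixes = load_fixes(FIX_CSV, "dc.description.version", "correct")
deletes = load_deletes(DELETE_CSV, "dc.description.version")

# Hypothetical sample of review status values
sample = ["Peer-reviewed", "Poster", "Peer Review", "Internal Review"]
print(clean(sample, fixes, deletes))
```

Values that appear in neither file pass through unchanged, which matches the intent of only touching the known-bad variants.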
- Start reviewing 95 items for IITA (20201stbatch)
- I used my csv-metadata-quality tool to check and fix some low-hanging fruit first
- This fixed a few cases of unnecessary Unicode characters, excessive whitespace, invalid multi-value separators, and duplicate metadata values
- Then I looked at the data in OpenRefine and noticed some things:
- All issue dates use year only, but some have months in the citation so they could be more specific
- I normalized all the DOIs to use "https://doi.org" format
- I fixed a few AGROVOC subjects with a simple GREL:
value.replace("GRAINS","GRAIN").replace("SOILS","SOIL").replace("CORN","MAIZE")
- But there are a few more that are invalid that she will have to look at
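For reference, the GREL replacement chain above behaves like chained substring replacement in Python. The sketch below shows that behavior next to a whole-term variant. The "||" multi-value separator and the "POPCORN" term are assumptions used only for illustration.

```python
# Same three replacements as the GREL expression
REPLACEMENTS = {"GRAINS": "GRAIN", "SOILS": "SOIL", "CORN": "MAIZE"}


def fix_subjects_substring(value: str) -> str:
    """Mirror the GREL chain: plain substring replacement, applied in order."""
    for old, new in REPLACEMENTS.items():
        value = value.replace(old, new)
    return value


def fix_subjects_whole_terms(value: str, separator: str = "||") -> str:
    """Replace only terms that match exactly, leaving longer terms alone."""
    terms = value.split(separator)
    return separator.join(REPLACEMENTS.get(term, term) for term in terms)


print(fix_subjects_substring("GRAINS||SOILS||CORN"))
print(fix_subjects_substring("POPCORN"))          # substring match also rewrites this
print(fix_subjects_whole_terms("POPCORN||CORN"))  # only the exact term changes
```

Substring replacement is fine for a quick pass over 95 items, but the whole-term variant is safer if a subject happens to contain one of the strings inside a longer term.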
- I uploaded the items to DSpace Test and it was apparently successful, but I see these errors on the console:
Thu Sep 03 12:26:33 CEST 2020 | Query:containerItem:ea7a2648-180d-4fce-bdc5-c3aa2304fc58
Error while updating
java.lang.NullPointerException
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1131)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:212)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1104)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1093)
at org.dspace.statistics.StatisticsLoggingConsumer.consume(SourceFile:104)
at org.dspace.event.BasicDispatcher.consume(BasicDispatcher.java:177)
at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:123)
at org.dspace.core.Context.dispatchEvents(Context.java:455)
at org.dspace.core.Context.commit(Context.java:424)
at org.dspace.core.Context.complete(Context.java:380)
at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
- There are more in the DSpace log so I will raise it with Atmire immediately
2020-09-04
- I was checking the recent IITA data for duplicates when I noticed one in the CIFOR Archive and saw that CIFOR has updated a bunch of their website URLs, for example:
- http://www.cifor.org/nc/online-library/browse/view-publication/publication/151.html → https://www.cifor.org/knowledge/publication/151
- https://www.cifor.org/library/4033 → https://www.cifor.org/knowledge/publication/4033
- https://www.cifor.org/pid/5087 → https://www.cifor.org/knowledge/publication/5087
- I will update our nearly 6,000 metadata values for CIFOR in the database accordingly:
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/library/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/library/[[:digit:]]+/?';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/pid/[[:digit:]]+';
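The three rewrites can be checked against the example URLs before touching the database. This is a sketch using Python's `re` module, which does not support the POSIX `[[:digit:]]` class, so `[0-9]` is used instead. The `rewrite` function and the named group `id` are hypothetical.

```python
import re

# Replacement target shared by all three patterns
TARGET = r"https://www.cifor.org/knowledge/publication/\g<id>"

PATTERNS = [
    # Old online-library URLs (http:// or no scheme, optional "nc/" prefix)
    re.compile(
        r"^(http://)?www\.cifor\.org/(nc/)?online-library/browse/"
        r"view-publication/publication/(?P<id>[0-9]+)\.html$"
    ),
    # /library/<id> URLs
    re.compile(r"^https?://www\.cifor\.org/library/(?P<id>[0-9]+)/?$"),
    # /pid/<id> URLs
    re.compile(r"^https?://www\.cifor\.org/pid/(?P<id>[0-9]+)/?$"),
]


def rewrite(url: str) -> str:
    """Return the new-style CIFOR URL, or the input unchanged if nothing matches."""
    for pattern in PATTERNS:
        new_url, count = pattern.subn(TARGET, url)
        if count:
            return new_url
    return url


examples = [
    "http://www.cifor.org/nc/online-library/browse/view-publication/publication/151.html",
    "https://www.cifor.org/library/4033",
    "https://www.cifor.org/pid/5087",
]
for url in examples:
    print(url, "->", rewrite(url))
```

Note that the first pattern, like the SQL it mirrors, only accepts `http://` or no scheme, so an old-style URL that already uses `https://` would be left unchanged.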
- I did some cleanup on the author affiliations of the IITA data against our 2019-04 list using reconcile-csv and OpenRefine:
$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id
- I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL:
if(cell.recon.matched, cell.recon.match.name, value)
- I mapped one duplicate from the CIFOR Archive and re-uploaded the 94 IITA items to a new collection on DSpace Test
2020-09-08
- I noticed that the "share" link in AReS wasn't working properly because it excludes the "explorer" part of the URI
- I filed an issue on GitHub: https://github.com/ilri/OpenRXV/issues/41
- I uploaded the 94 IITA items that I had been working on last week to CGSpace
- RTB emailed to ask why they are getting HTTP 503 errors during harvesting to the RTB WordPress website
- From the screenshot I can see they are requesting URLs like this:
https://cgspace.cgiar.org/bitstream/handle/10568/82745/Characteristics-Silage.JPG
- So they end up getting rate limited due to the XMLUI rate limits
- I told them to use the REST API bitstream retrieve links, because we don't have any rate limits there
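A minimal sketch of building a REST API retrieve link, assuming the REST API is served under `/rest/` on the same host and that bitstreams are exposed at `bitstreams/<id>/retrieve`. The function name and the bitstream ID 12345 are hypothetical.

```python
from urllib.parse import urljoin

# Assumed base path of the REST API
REST_BASE = "https://cgspace.cgiar.org/rest/"


def bitstream_retrieve_url(bitstream_id, base: str = REST_BASE) -> str:
    """Build the retrieve link for one bitstream ID."""
    return urljoin(base, f"bitstreams/{bitstream_id}/retrieve")


print(bitstream_retrieve_url(12345))  # 12345 is a made-up ID
```

The harvester would need the bitstream IDs for each item first, rather than guessing file names under the XMLUI `/bitstream/handle/` path.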
2020-09-09
- Wire up the systemd service/timer for the CGSpace Country Code Tagger curation task in the Ansible infrastructure scripts
- For now it won't work on DSpace 6 because the curation task invocation needs to be slightly different (minus the -l parameter) and for some reason the task isn't working on DSpace Test (version 6) right now
- I added DSpace 6 support to the playbook templates...
- Run system updates on DSpace Test (linode26), re-deploy the DSpace 6 test branch, and reboot the server
- After rebooting I deleted old copies of the cgspace-java-helpers JAR in the DSpace lib directory and then the curation worked
- To my great surprise the curation worked (and completed, albeit a few times slower) on my local DSpace 6 environment as well:
$ ~/dspace63/bin/dspace curate -t countrycodetagger -i all -s object
2020-09-10
- I checked the country code tagger on CGSpace and DSpace Test and it ran fine from the systemd timer last night... w00t
- I started looking at Peter's changes to the CGSpace regions that were proposed in 2020-07
- The changes will be:
$ cat 2020-09-10-fix-cgspace-regions.csv
cg.coverage.region,correct
EAST AFRICA,EASTERN AFRICA
WEST AFRICA,WESTERN AFRICA
SOUTHEAST ASIA,SOUTHEASTERN ASIA
SOUTH ASIA,SOUTHERN ASIA
AFRICA SOUTH OF SAHARA,SUB-SAHARAN AFRICA
NORTH AFRICA,NORTHERN AFRICA
WEST ASIA,WESTERN ASIA
SOUTHWEST ASIA,SOUTHWESTERN ASIA
$ ./fix-metadata-values.py -i 2020-09-10-fix-cgspace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d -n
Connected to database.
Would fix 12227 occurences of: EAST AFRICA
Would fix 7996 occurences of: WEST AFRICA
Would fix 3515 occurences of: SOUTHEAST ASIA
Would fix 3443 occurences of: SOUTH ASIA
Would fix 1134 occurences of: AFRICA SOUTH OF SAHARA
Would fix 357 occurences of: NORTH AFRICA
Would fix 81 occurences of: WEST ASIA
Would fix 3 occurences of: SOUTHWEST ASIA
- I think we need to wait for the web team, though, as they need to update their mappings
- Not to mention that we'll need to give WLE and CCAFS time to update their harvesters as well... hmmm
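To gauge how many metadata values the region change would touch in total, the dry-run output above can be summed. This is a small sketch that parses the lines exactly as the script printed them (including its spelling of "occurences"); the function name is hypothetical.

```python
import re

DRY_RUN_OUTPUT = """\
Connected to database.
Would fix 12227 occurences of: EAST AFRICA
Would fix 7996 occurences of: WEST AFRICA
Would fix 3515 occurences of: SOUTHEAST ASIA
Would fix 3443 occurences of: SOUTH ASIA
Would fix 1134 occurences of: AFRICA SOUTH OF SAHARA
Would fix 357 occurences of: NORTH AFRICA
Would fix 81 occurences of: WEST ASIA
Would fix 3 occurences of: SOUTHWEST ASIA
"""

# Matches the dry-run lines as printed by the script
LINE = re.compile(r"^Would fix (\d+) occurences of: (.+)$")


def parse_counts(output: str) -> dict:
    """Return {region: number of values that would be changed}."""
    counts = {}
    for line in output.splitlines():
        match = LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


counts = parse_counts(DRY_RUN_OUTPUT)
print(sum(counts.values()), "values across", len(counts), "regions")
```

The total is a useful number to share with the web team, WLE, and CCAFS when asking them to update their mappings and harvesters.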
- Looking at the top user agents active on CGSpace in 2020-08, I see:
- Delphi 2009: 235353 (this is GARDIAN harvester I guess, as the IP is in Greece)
- Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98): 57004 (IP is 18.196.100.94, and the requests seem to be for CTA's content)
- RTB website BOT: 12282
- ILRI Livestock Website Publications importer BOT: 9393
- Shit, I meant to add Delphi to the DSpace spider agents list last month but I guess I didn't commit the change
- HTTrack is in the agents list so I'm not sure why DSpace registers a hit from that request
- Also, I am surprised to see the RTB and ILRI bots here because they have "BOT" in the name and that should also be dropped
- I also see hits from curl and Java/1.8.0_66 and Apache-HttpClient so WTF... those are supposed to be dropped by the default agents list
- Some IP 2607:f298:5:101d:f816:3eff:fed9:a484 made 9,000 requests with the RI/1.0 user agent this year...
- That's on DreamHost...?
- I purged 448658 hits from these agents and added Delphi to our local agents overload for Solr as well as Tomcat's Crawler Session Manager Valve so that it forces them to re-use a single session
- I made a pull request on the COUNTER-Robots project for the Daum robot: https://github.com/atmire/COUNTER-Robots/pull/38
- This bot made 8,000 requests to CGSpace this year
- I purged about 20,000 total requests from this bot from our Solr stats for the last few years
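A rough way to test which of the user agents above a pattern list would catch is sketched here. The pattern list is illustrative only; it is not the DSpace spider agents list or the COUNTER-Robots list, and the Firefox string is a made-up control value.

```python
import re

# Illustrative patterns only, not the real agents list
PATTERNS = ["bot", "curl", "Java/", "Apache-HttpClient", "Delphi", "HTTrack"]

USER_AGENTS = [
    "Delphi 2009",
    "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)",
    "RTB website BOT",
    "ILRI Livestock Website Publications importer BOT",
    "curl",
    "Java/1.8.0_66",
    "Apache-HttpClient",
    "RI/1.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0",
]


def compile_patterns(patterns, ignore_case=True):
    """Compile each pattern, optionally case-insensitively."""
    flags = re.IGNORECASE if ignore_case else 0
    return [re.compile(pattern, flags) for pattern in patterns]


def is_bot(user_agent, compiled):
    """True if any pattern matches anywhere in the user agent string."""
    return any(pattern.search(user_agent) for pattern in compiled)


insensitive = compile_patterns(PATTERNS)
sensitive = compile_patterns(PATTERNS, ignore_case=False)

for agent in USER_AGENTS:
    print(f"{agent!r}: insensitive={is_bot(agent, insensitive)} sensitive={is_bot(agent, sensitive)}")
```

With these patterns, a case-sensitive match on "bot" misses the two agents ending in "BOT", and "RI/1.0" is not caught either way unless a pattern is added for it.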
2020-09-11
- Peter noticed that an export from AReS shows some items with zero views and others with zero views/downloads, but on CGSpace and in the statistics API there are views/downloads
- I need to ask Moayad...
2020-09-12
- Carlos Tejo from the LandPortal emailed to ask for advice about integrating their LandVoc vocabulary, which is a subset of AGROVOC, into DSpace
- I told him that they could use the DSpace authority control framework and sent an example of the VIAFAuthority from the DSpace-CRIS project: https://github.com/4Science/DSpace/blob/dspace-6_x_x-cris/dspace-api/src/main/java/org/dspace/content/authority/VIAFAuthority.java
- Redeploy the latest 5_x-prod branch on CGSpace, re-run the latest Ansible DSpace playbook, run all system updates, and reboot the server (linode18)
- This will bring the latest bot lists for Solr and Tomcat
- I had to restart Tomcat 7 three times before all Solr statistics cores came up OK
- Leroy and Carol from CIAT/Bioversity were asking for information about posting to the CGSpace REST API from Sharepoint
- I told them that we don't allow this yet, but that we need to check in the future whether content can be posted to a workflow
2020-09-15
- Charlotte from Altmetric said they had issues parsing the XML file I sent them last month
- I told them that it was mimicking the same format that they had sent me (fourteen pages of XML responses concatenated together)!
- A few days ago IWMI asked us if we can add a new field on CGSpace for their library identifier
- The IDs look like this: H049940
- I suggested that we use cg.identifier.iwmilibrary
- I added it to the input forms and pushed it to the 5_x-prod and 6.x branches and will re-deploy it in the next few days
- Abenet asked me to import sixty-nine (69) CIP Annual Reports to CGSpace
- I looked at the data in OpenRefine and it is very good quality
- I only added descriptions to the filename field so that SAFBuilder will add them to the bitstreams on import:
value + "__description:" + cells["dc.type"].value
- Then I created a SAF bundle with SAFBuilder:
$ ./safbuilder.sh -c ~/Downloads/cip-annual-reports/cip-reports.csv
- And imported them into my local test instance of CGSpace:
$ ~/dspace/bin/dspace import -a -e y.arrr@cgiar.org -m /tmp/2020-09-15-cip-annual-reports.map -s ~/Downloads/cip-annual-reports/SimpleArchiveFormat
- Then I uploaded them to CGSpace
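The OpenRefine step that appends descriptions to the filename field for the CIP Annual Reports can also be sketched outside OpenRefine. This assumes the CSV has columns named `filename` and `dc.type`; the function name and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; the real CSV has many more columns
SAMPLE_CSV = """filename,dc.type,dc.title
cip-annual-report-1990.pdf,Report,CIP Annual Report 1990
cip-annual-report-1991.pdf,Report,CIP Annual Report 1991
"""


def add_description(row, filename_field="filename", type_field="dc.type"):
    """Append '__description:<type>' to the filename, like the GREL expression."""
    updated = dict(row)
    updated[filename_field] = f"{row[filename_field]}__description:{row[type_field]}"
    return updated


rows = [add_description(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for row in rows:
    print(row["filename"])
```

Only the filename column changes; the other columns are copied through so the rows can be written back out for SAFBuilder.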