CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

October, 2018

2018-10-01

  • Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
  • I created a GitHub issue to track this #389, because I’m super busy in Nairobi right now

2018-10-03

  • I see Moayad was busy collecting item views and downloads from CGSpace yesterday:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    933 40.77.167.90
    971 95.108.181.88
   1043 41.204.190.40
   1454 157.55.39.54
   1538 207.46.13.69
   1719 66.249.64.61
   2048 50.116.102.77
   4639 66.249.64.59
   4736 35.237.175.180
 150362 34.218.226.147
  • Of those, about 20% were HTTP 500 responses (!):
$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
 118927 200
  31435 500
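  • A quick way to see which URLs are actually returning the 500s would be something like this (field positions assume nginx’s default combined log format):
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '$9 == 500 {print $7}' | sort | uniq -c | sort -rn | head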
  • I added Phil Thornton and Sonal Henson’s ORCID identifiers to the controlled vocabulary for cg.creator.orcid and then re-generated the names using my resolve-orcids.py script:
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
$ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
  • I found a new corner case error that I need to check: given and family names deactivated:
Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
  • It appears to be Jim Lorenzen… I need to check that later!
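  • To double check what ORCID actually has for that record I could query their public API directly (a sketch using v2.1; the response layout may have changed since):
$ curl -s -H 'Accept: application/json' 'https://pub.orcid.org/v2.1/0000-0001-7930-5752/person'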
  • I merged the changes to the 5_x-prod branch (#390)
  • Linode sent another alert about CPU usage on CGSpace (linode18) this evening
  • It seems that Moayad is making quite a lot of requests today:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1594 157.55.39.160
   1627 157.55.39.173
   1774 136.243.6.84
   4228 35.237.175.180
   4497 70.32.83.92
   4856 66.249.64.59
   7120 50.116.102.77
  12518 138.201.49.199
  87646 34.218.226.147
 111729 213.139.53.62
  • But in super positive news, he says they are using my new dspace-statistics-api and it’s MUCH faster than using Atmire CUA’s internal “restlet” API
  • I don’t recognize the 138.201.49.199 IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:
# grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
   8324 GET /bitstream
   4193 GET /handle
  • Suspiciously, it’s only grabbing the CGIAR System Office community (handle prefix 10947):
# grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
      7 GET /handle/10568
   4186 GET /handle/10947
  • The user agent is suspicious too:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
  • It’s clearly a bot and it’s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list
  • I looked in Solr’s statistics core and these hits were actually all counted as isBot:false (of course)… hmmm
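  • Something like this facet query shows the isBot breakdown for that IP in the statistics core (assuming Solr is listening on localhost:8081; adjust the host and port as needed):
$ curl -s 'http://localhost:8081/solr/statistics/select?q=ip:138.201.49.199&rows=0&facet=true&facet.field=isBot&wt=json&indent=true'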
  • I tagged all of Sonal and Phil’s items with their ORCID identifiers on CGSpace using my add-orcid-identifiers-csv.py script:
$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
  • Where 2018-10-03-add-orcids.csv contained:
dc.contributor.author,cg.creator.id
"Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462
"Henson, S.",Sonal Henson: 0000-0002-2002-5462
"Thornton, P.K.",Philip Thornton: 0000-0002-1854-0182
"Thornton, Philip K",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phil",Philip Thornton: 0000-0002-1854-0182
"Thornton, Philip K.",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip K.",Philip Thornton: 0000-0002-1854-0182

2018-10-04

  • Salem raised an issue that the dspace-statistics-api reports downloads for some items that have no bitstreams (like many limited access items)
  • Every item has at least a LICENSE bundle, and some have a THUMBNAIL bundle, but the indexing code is specifically checking for downloads from the ORIGINAL bundle
  • I see there are other bundles we might need to pay attention to: TEXT, LOGO-COLLECTION, LOGO-COMMUNITY, etc…
  • On a hunch I dropped the statistics table and re-indexed, and now the items Salem reported have no downloads
  • So it’s fixed, but I’m not sure why!
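  • For reference, downloads in the Solr statistics core are bitstream hits (type:0) and the bundle is recorded in bundleName, so a facet query along these lines shows which bundles are actually being counted (the indexer’s exact query may differ):
$ curl -s 'http://localhost:8081/solr/statistics/select?q=type:0&rows=0&facet=true&facet.field=bundleName&wt=json&indent=true'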
  • Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):
# zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
251226
  • I found a logic error in the dspace-statistics-api indexer.py script that was causing item views to be inserted into downloads
  • I tagged version 0.4.2 of the tool and redeployed it on CGSpace

2018-10-05

  • Meet with Peter, Abenet, and Sisay to discuss CGSpace meeting in Nairobi and Sisay’s work plan
  • We agreed that he would do monthly updates of the controlled vocabularies and generate a new one for the top 1,000 AGROVOC terms
  • Add a link to AReS explorer to the CGSpace homepage introduction text

2018-10-06

  • Follow up with AgriKnowledge about including Handle links (dc.identifier.uri) on their item pages
  • In July, 2018 they had said their programmers would include the field in the next update of their website software
  • CIMMYT’s DSpace repository is now running DSpace 5.x!
  • It’s running OAI, but not REST, so I need to talk to Richard about that!

2018-10-08

  • AgriKnowledge says they’re going to add the dc.identifier.uri to their item view in November when they update their website software

2018-10-10

  • Peter noticed that some recently added PDFs don’t have thumbnails
  • When I tried to force them to be generated I got an error that I’ve never seen before:
$ dspace filter-media -v -f -i 10568/97613
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
  • I see there was an update to Ubuntu’s ImageMagick on 2018-10-05, so maybe something changed or broke?
  • I get the same error when forcing filter-media to run on DSpace Test too, so it’s gotta be an ImageMagick bug
  • The ImageMagick version is currently 8:6.8.9.9-7ubuntu5.13, and there is an Ubuntu Security Notice from 2018-10-04
  • Wow, someone on Twitter posted about this breaking his web application (and it was retweeted by the ImageMagick account!)
  • I commented out the line that disables PDF thumbnails in /etc/ImageMagick-6/policy.xml:
  <!--<policy domain="coder" rights="none" pattern="PDF" />-->
  • This works, but I’m not sure what ImageMagick’s long-term plan is if they are going to disable ALL image formats…
  • I suppose I need to enable a workaround for this in Ansible?
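  • In the meantime the manual workaround is just to comment out that policy line and re-run the filter, roughly like this (the sed assumes the line matches exactly, so double check the file first):
$ sudo sed -i 's|<policy domain="coder" rights="none" pattern="PDF" />|<!-- & -->|' /etc/ImageMagick-6/policy.xml
$ dspace filter-media -v -f -i 10568/97613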

2018-10-11

  • I emailed DuraSpace to update our entry in their DSpace registry (the data was still on DSpace 3, JSPUI, etc)
  • Generate a list of the top 1500 values for dc.subject so Sisay can start making a controlled vocabulary for it:
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
  • Give WorldFish advice about Handles because they are talking to some company called KnowledgeArc who recommends they do not use Handles!
  • Last week I emailed Altmetric to ask if their software would notice mentions of our Handle in the format “handle:10568/80775” because I noticed that the Land Portal does this
  • Altmetric support responded to say no, but the reason is that Land Portal is doing even more strange stuff by not using <meta> tags in their page header, and using the “dct:identifier” property instead of “dc:identifier”
  • I re-created my local DSpace database container using podman instead of Docker:
$ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
$ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ sudo podman start dspacedb
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
  • I tried to run an Artifactory container in podman, but it seems to have problems because Artifactory is distributed on the Bintray repository
  • I can pull the docker.bintray.io/jfrog/artifactory-oss:latest image, but not start it
  • I decided to use a Sonatype Nexus repository instead:
$ mkdir -p ~/.local/lib/containers/volumes/nexus_data
$ sudo podman run --name nexus -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
  • With a few changes to my local Maven settings.xml it is working well
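  • The relevant change is just a mirror entry pointing Maven at the local Nexus (this assumes Nexus 3’s default maven-public group repository):
<mirror>
  <id>nexus</id>
  <name>Local Nexus</name>
  <mirrorOf>*</mirrorOf>
  <url>http://localhost:8081/repository/maven-public/</url>
</mirror>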
  • Generate a list of the top 10,000 authors for Peter Ballantyne to look through:
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
COPY 10000
  • CTA uploaded some infographics that are very tall and their thumbnails disrupt the item lists on the front page and in their communities and collections
  • I decided to constrain the max height of these to 200px using CSS (#392)

2018-10-13

  • Run all system updates on DSpace Test (linode19) and reboot it
  • Look through Peter’s list of 746 author corrections in OpenRefine
  • I first facet by blank, trim whitespace, and then check for weird characters that might be indicative of encoding issues with this GREL:
or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/))
)
  • Then I exported and applied them on my local test server:
$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
  • I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay’s author controlled vocabulary

2018-10-14

  • Merge the authors controlled vocabulary (#393), usage rights (#394), and the upstream DSpace 5.x cherry-picks (#394) into our 5_x-prod branch
  • Switch to new CGIAR LDAP server on CGSpace, as it’s been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)
  • Apply Peter’s 746 author corrections on CGSpace and DSpace Test using my fix-metadata-values.py script:
$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
  • Run all system updates on CGSpace (linode18) and reboot the server
  • After rebooting the server I noticed that Handles are not resolving, and the dspace-handle-server systemd service is not running (or rather, it exited with success)
  • Restarting the service with systemd works for a few seconds, then the java process quits
  • I suspect that the systemd service type needs to be forking rather than simple, because the service calls the default DSpace start-handle-server shell script, which uses nohup and & to background the java process
  • It would be nice if there was a cleaner way to start the service and then just log to the systemd journal rather than all this hiding and log redirecting
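  • Until there is a cleaner solution, a drop-in override along these lines should at least let systemd track the forked java process (untested sketch; the unit name is the one above):
$ sudo mkdir -p /etc/systemd/system/dspace-handle-server.service.d
$ sudo tee /etc/systemd/system/dspace-handle-server.service.d/override.conf <<'EOF'
[Service]
Type=forking
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart dspace-handle-server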
  • Email the Landportal.org people to ask if they would consider Dublin Core metadata tags in their page’s header, rather than the HTML properties they are using in their body
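  • For comparison, DSpace item pages already expose Dublin Core <meta> tags in the page header, which is the kind of thing Altmetric scrapes; a quick way to see them:
$ curl -s 'https://cgspace.cgiar.org/handle/10568/97613' | grep -o -E '<meta name="DC[^>]+>' | head -n 5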
  • Peter pointed out that some thumbnails were still not getting generated
    • When I tried to generate them manually I noticed that the path to the CMYK profile had changed because Ubuntu upgraded Ghostscript from 9.18 to 9.25 last week… WTF?
    • Looks like I can use /usr/share/ghostscript/current instead of /usr/share/ghostscript/9.25
  • I limited the tall thumbnails even further to 170px because Peter said CTA’s were still too tall at 200px (#396)

2018-10-15

  • Tomcat on DSpace Test (linode19) has somehow stopped running all the DSpace applications
  • I don’t see anything in the Catalina logs or dmesg, and the Tomcat manager shows XMLUI, REST, OAI, etc all “Running: false”
  • Actually, now I remember that yesterday when I deployed the latest changes from git on DSpace Test I noticed a syntax error in one XML file when I was doing the discovery reindexing
  • I fixed it so that I could reindex, but I guess the rest of DSpace actually didn’t start up…
  • Create an account on DSpace Test for Felix from Earlham so he can test COPO submission
    • I created a new collection and added him as the administrator so he can test submission
    • He said he actually wants to test creation of communities, collections, etc, so I had to make him a super admin for now
    • I told him we need to think about the workflow more seriously in the future
  • I ended up having some issues with podman and went back to Docker, so I had to re-create my containers:
$ sudo docker run --name nexus --network dspace-build -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
$ sudo docker run --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'

2018-10-16

  • Generate a list of the schema on CGSpace so CodeObia can compare with MELSpace:
dspace=# \copy (SELECT (CASE when metadata_schema_id=1 THEN 'dc' WHEN metadata_schema_id=2 THEN 'cg' END) AS schema, element, qualifier, scope_note FROM metadatafieldregistry where metadata_schema_id IN (1,2)) TO /tmp/cgspace-schema.csv WITH CSV HEADER;
  • Talking to the CodeObia guys about the REST API, I started to wonder why it’s so slow and how I can quantify it in order to ask the dspace-tech mailing list for help profiling it
  • Interestingly, the speed doesn’t get better after you request the same thing multiple times–it’s consistently bad on both CGSpace and DSpace Test!
$ time http --print h 'https://cgspace.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
...
0.35s user 0.06s system 1% cpu 25.133 total
0.31s user 0.04s system 1% cpu 25.223 total
0.27s user 0.06s system 1% cpu 27.858 total
0.20s user 0.05s system 1% cpu 23.838 total
0.30s user 0.05s system 1% cpu 24.301 total

$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
...
0.22s user 0.03s system 1% cpu 17.248 total
0.23s user 0.02s system 1% cpu 16.856 total
0.23s user 0.04s system 1% cpu 16.460 total
0.24s user 0.04s system 1% cpu 21.043 total
0.22s user 0.04s system 1% cpu 17.132 total
  • I should note that at this time CGSpace is using Oracle Java and DSpace Test is using OpenJDK (both version 8)
  • I wonder if the Java garbage collector is important here, or if there are missing indexes in PostgreSQL?
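  • To check the PostgreSQL side, pg_stat_user_tables shows which tables are getting lots of sequential scans, which is usually a hint about missing indexes (database name may differ):
$ psql -h localhost -U postgres dspace -c 'SELECT relname, seq_scan, idx_scan, n_live_tup FROM pg_stat_user_tables ORDER BY seq_scan DESC LIMIT 10;'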
  • I switched DSpace Test to the G1GC garbage collector and tried again and now the results are worse!
$ time http --print h 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0'
...
0.20s user 0.03s system 0% cpu 25.017 total
0.23s user 0.02s system 1% cpu 23.299 total
0.24s user 0.02s system 1% cpu 22.496 total
0.22s user 0.03s system 1% cpu 22.720 total
0.23s user 0.03s system 1% cpu 22.632 total
  • If I make a request without the expands it is ten times faster:
$ time http --print h 'https://dspacetest.cgiar.org/rest/items?limit=100&offset=0'
...
0.20s user 0.03s system 7% cpu 3.098 total
0.22s user 0.03s system 8% cpu 2.896 total
0.21s user 0.05s system 9% cpu 2.787 total
0.23s user 0.02s system 8% cpu 2.896 total
  • I sent a mail to dspace-tech to ask how to profile this…