---
title: "October, 2018"
date: 2018-10-01T22:31:54+03:00
author: "Alan Orth"
tags: ["Notes"]
---

## 2018-10-01

- Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
- I created a GitHub issue to track this [#389](https://github.com/ilri/DSpace/issues/389), because I'm super busy in Nairobi right now

## 2018-10-03

- I see Moayad was busy collecting item views and downloads from CGSpace yesterday:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    933 40.77.167.90
    971 95.108.181.88
   1043 41.204.190.40
   1454 157.55.39.54
   1538 207.46.13.69
   1719 66.249.64.61
   2048 50.116.102.77
   4639 66.249.64.59
   4736 35.237.175.180
 150362 34.218.226.147
```

- Of the requests from `34.218.226.147`, about 21% were HTTP 500 responses (!):

```
$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
 118927 200
  31435 500
```

- I added Phil Thornton and Sonal Henson's ORCID identifiers to the controlled vocabulary for `cg.creator.orcid` and then re-generated the names using my [resolve-orcids.py](https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b) script:

```
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
$ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
```

- I found a new corner case error that I need to check, where both the given *and* family names are deactivated:

```
Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
```

- It appears to be Jim Lorenzen... I need to check that later!
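
- For reference, the `grep | sort | uniq` extraction step above could be sketched in Python roughly like this; the function name `extract_orcids` and the sample XML are made up for illustration and are not taken from `resolve-orcids.py`:

```python
import re

# Same pattern as the grep above: four groups of four characters, where
# [A-Z0-9] also covers the "X" that can end an ORCID iD.
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def extract_orcids(text):
    """Return the unique ORCID iDs found in text, sorted (like sort | uniq)."""
    return sorted(set(ORCID_PATTERN.findall(text)))


# Hypothetical excerpt of a controlled vocabulary file, for illustration only
sample = """
<node id="Sonal Henson: 0000-0002-2002-5462" label="Sonal Henson: 0000-0002-2002-5462"/>
<node id="Philip Thornton: 0000-0002-1854-0182" label="Philip Thornton: 0000-0002-1854-0182"/>
"""

for orcid in extract_orcids(sample):
    print(orcid)
```

- Each iD appears twice in the sample (once in `id`, once in `label`), so the `set()` is what plays the role of `uniq` here
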
- I merged the changes to the `5_x-prod` branch ([#390](https://github.com/ilri/DSpace/pull/390))
- Linode sent another alert about CPU usage on CGSpace (linode18) this evening
- It seems that Moayad is making quite a lot of requests today:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1594 157.55.39.160
   1627 157.55.39.173
   1774 136.243.6.84
   4228 35.237.175.180
   4497 70.32.83.92
   4856 66.249.64.59
   7120 50.116.102.77
  12518 138.201.49.199
  87646 34.218.226.147
 111729 213.139.53.62
```

- But in super positive news, he says they are using my new [dspace-statistics-api](https://github.com/alanorth/dspace-statistics-api) and it's MUCH faster than using Atmire CUA's internal "restlet" API
- I don't recognize the `138.201.49.199` IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:

```
# grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
   8324 GET /bitstream
   4193 GET /handle
```

- Suspiciously, it's almost exclusively grabbing the CGIAR System Office community (handle prefix 10947):

```
# grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
      7 GET /handle/10568
   4186 GET /handle/10947
```

- The user agent is suspicious too:

```
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
```

- It's clearly a bot and it's not re-using its Tomcat session, so I will add its IP to the nginx bad bot list
- I looked in Solr's statistics core and these hits were actually all counted as `isBot:false` (of course)... hmmm
- I tagged all of Sonal and Phil's items with their ORCID identifiers on CGSpace using my [add-orcid-identifiers.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script:

```
$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
```

- Where `2018-10-03-add-orcids.csv` contained:

```
dc.contributor.author,cg.creator.id
"Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462
"Henson, S.",Sonal Henson: 0000-0002-2002-5462
"Thornton, P.K.",Philip Thornton: 0000-0002-1854-0182
"Thornton, Philip K",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phil",Philip Thornton: 0000-0002-1854-0182
"Thornton, Philip K.",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip K.",Philip Thornton: 0000-0002-1854-0182
```

## 2018-10-04

- Salem raised an issue that the dspace-statistics-api reports downloads for some items that have no bitstreams (like many limited access items)
- Every item has at least a `LICENSE` bundle, and some have a `THUMBNAIL` bundle, but the indexing code is specifically checking for downloads from the `ORIGINAL` bundle
- [10568/97460](https://cgspace.cgiar.org/handle/10568/97460) (100550): has a thumbnail bitstream
- [10568/96112](https://cgspace.cgiar.org/handle/10568/96112) (96736): has only a LICENSE bitstream
- I see there are other bundles we might need to pay attention to: `TEXT`, `@_LOGO-COLLECTION_@`, `@_LOGO-COMMUNITY_@`, etc...
- On a hunch I dropped the statistics table and re-indexed and now those two items above have no downloads
- So it's fixed, but I'm not sure why!
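
- To illustrate why the bundle matters, here is a minimal sketch (not the actual `indexer.py` code) of counting downloads only for bitstreams in the `ORIGINAL` bundle; the event layout, the function name `count_downloads`, and item `12345` are assumptions made up for this example:

```python
from collections import Counter


def count_downloads(events, bundle="ORIGINAL"):
    """Count download events per item, ignoring bitstreams outside `bundle`."""
    counts = Counter()
    for event in events:
        if event["bundle"] == bundle:
            counts[event["item_id"]] += 1
    return counts


# Hypothetical bitstream hits: the first two mimic the items above that only
# have THUMBNAIL or LICENSE bitstreams, the last two are real downloads.
events = [
    {"item_id": 100550, "bundle": "THUMBNAIL"},
    {"item_id": 96736, "bundle": "LICENSE"},
    {"item_id": 12345, "bundle": "ORIGINAL"},
    {"item_id": 12345, "bundle": "ORIGINAL"},
]

downloads = count_downloads(events)
print(downloads[100550], downloads[96736], downloads[12345])
```

- With a filter like this, hits on thumbnails and licenses never show up as downloads, which is the behavior Salem expected
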
- Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):

```
# zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
251226
```

- I found a logic error in the dspace-statistics-api `indexer.py` script that was causing item views to be inserted into downloads
- I tagged version 0.4.2 of the tool and redeployed it on CGSpace

## 2018-10-05

- Meet with Peter, Abenet, and Sisay to discuss CGSpace meeting in Nairobi and Sisay's work plan
- We agreed that he would do monthly updates of the controlled vocabularies and generate a new one for the top 1,000 AGROVOC terms
- Add a link to [AReS explorer](https://cgspace.cgiar.org/explorer/) to the CGSpace homepage introduction text

## 2018-10-06

- Follow up with AgriKnowledge about including Handle links (`dc.identifier.uri`) on their item pages
- In July, 2018 they had said their programmers would include the field in the next update of their website software
- [CIMMYT's DSpace repository](https://repository.cimmyt.org/) is now running DSpace 5.x!
- It's running OAI, but not REST, so I need to talk to Richard about that!

## 2018-10-08

- AgriKnowledge says they're going to add the `dc.identifier.uri` to their item view in November when they update their website software

## 2018-10-10

- Peter noticed that some recently added PDFs don't have thumbnails
- When I tried to force them to be generated I got an error that I've never seen before:

```
$ dspace filter-media -v -f -i 10568/97613
org.im4java.core.InfoException: org.im4java.core.CommandException: org.im4java.core.CommandException: identify: not authorized `/tmp/impdfthumb5039464037201498062.pdf' @ error/constitute.c/ReadImage/412.
```

- I see there was an update to Ubuntu's ImageMagick on 2018-10-05, so maybe something changed or broke?
- I get the same error when forcing `filter-media` to run on DSpace Test too, so it's gotta be an ImageMagick bug
- The ImageMagick version is currently 8:6.8.9.9-7ubuntu5.13, and there is an [Ubuntu Security Notice from 2018-10-04](https://usn.ubuntu.com/3785-1/)
- Wow, someone on [Twitter posted about this breaking his web application](https://twitter.com/rosscampbell/status/1048268966819319808) (and it was retweeted by the ImageMagick account!)
- I commented out the line that disables PDF thumbnails in `/etc/ImageMagick-6/policy.xml`:

```
<!--<policy domain="coder" rights="none" pattern="PDF" />-->
```

- This works, but I'm not sure what ImageMagick's long-term plan is if they are going to disable ALL image formats...
- I suppose I need to enable a workaround for this in Ansible?

## 2018-10-11

- I emailed DuraSpace to update [our entry in their DSpace registry](https://duraspace.org/registry/entry/4188/?gvid=178) (the data was still on DSpace 3, JSPUI, etc)
- Generate a list of the top 1500 values for `dc.subject` so Sisay can start making a controlled vocabulary for it:

```
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-10-11-top-1500-subject.csv WITH CSV HEADER;
COPY 1500
```

- Give WorldFish advice about Handles because they are talking to some company called KnowledgeArc who recommends they do not use Handles!
- Last week I emailed Altmetric to ask if their software would notice mentions of our Handle in the format "handle:10568/80775" because I noticed that the [Land Portal does this](https://landportal.org/library/resources/handle1056880775/unlocking-farming-potential-bangladesh%E2%80%99-polders)
- Altmetric support responded to say no, but the reason is that Land Portal is doing even more strange stuff by not using `<meta>` tags in their page header, and using "dct:identifier" property instead of "dc:identifier"
- I re-created my local DSpace database container using [podman](https://github.com/containers/libpod) instead of Docker:

```
$ mkdir -p ~/.local/lib/containers/volumes/dspacedb_data
$ sudo podman create --name dspacedb -v /home/aorth/.local/lib/containers/volumes/dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:9.6-alpine
$ sudo podman start dspacedb
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest ~/Downloads/cgspace_2018-10-11.backup
$ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
```

- I tried to make an Artifactory in podman, but it seems to have problems because Artifactory is distributed on the Bintray repository
- I can pull the `docker.bintray.io/jfrog/artifactory-oss:latest` image, but not start it
- I decided to use a Sonatype Nexus repository instead:

```
$ mkdir -p ~/.local/lib/containers/volumes/nexus_data
$ sudo podman run --name nexus -d -v /home/aorth/.local/lib/containers/volumes/nexus_data:/nexus_data -p 8081:8081 sonatype/nexus3
```

- With a few changes to my local Maven `settings.xml` it is working well
- Generate a list of the top 10,000 authors for Peter Ballantyne to look through:

```
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 3 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 10000) to /tmp/2018-10-11-top-10000-authors.csv WITH CSV HEADER;
COPY 10000
```

- CTA uploaded some infographics that are very tall and their thumbnails disrupt the item lists on the front page and in their communities and collections
- I decided to constrain the max height of these to 200px using CSS ([#392](https://github.com/ilri/DSpace/pull/392))

## 2018-10-13

- Run all system updates on DSpace Test (linode19) and reboot it
- Look through Peter's list of 746 author corrections in OpenRefine
- I first facet by blank, trim whitespace, and then check for weird characters that might be indicative of encoding issues with this GREL:

```
or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/))
)
```

- Then I exported and applied them on my local test server:

```
$ ./fix-metadata-values.py -i 2018-10-11-top-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t CORRECT -m 3
```

- I will apply these on CGSpace when I do the other updates tomorrow, as well as double check the high scoring ones to see if they are correct in Sisay's author controlled vocabulary

## 2018-10-14

- Merge the authors controlled vocabulary ([#393](https://github.com/ilri/DSpace/pull/393)), usage rights ([#394](https://github.com/ilri/DSpace/pull/394)), and the upstream DSpace 5.x cherry-picks ([#395](https://github.com/ilri/DSpace/pull/395)) into our `5_x-prod` branch
- Switch to new CGIAR LDAP server on CGSpace, as it's been running (at least for authentication) on DSpace Test for the last few weeks, and I think the old one will be deprecated soon (today?)
- Apply Peter's 746 author corrections on CGSpace and DSpace Test using my [fix-metadata-values.py](https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897) script:

```
$ ./fix-metadata-values.py -i /tmp/2018-10-11-top-authors.csv -f dc.contributor.author -t CORRECT -m 3 -db dspace -u dspace -p 'fuuu'
```

- Run all system updates on CGSpace (linode18) and reboot the server
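
- The OpenRefine character check from 2018-10-13 could also be scripted and run over a corrections file before applying it; this is only a minimal Python sketch of the same five-character check, and `find_suspicious_chars` as well as the sample names are hypothetical:

```python
# Code points flagged by the GREL expression, with their Unicode names
SUSPICIOUS = {
    "\uFFFD": "REPLACEMENT CHARACTER",
    "\u00A0": "NO-BREAK SPACE",
    "\u200A": "HAIR SPACE",
    "\u2019": "RIGHT SINGLE QUOTATION MARK",
    "\u00B4": "ACUTE ACCENT",
}


def find_suspicious_chars(value):
    """Return the names of any flagged characters present in value."""
    return [name for char, name in SUSPICIOUS.items() if char in value]


# Made-up author names: one clean, one with a no-break space after the
# comma, and one with a curly apostrophe.
names = ["Thornton, Philip K.", "Thornton,\u00a0Philip", "O\u2019Brien, J."]

for name in names:
    hits = find_suspicious_chars(name)
    if hits:
        print(f"{name!r}: {', '.join(hits)}")
```

- Unlike the GREL, which only says whether a value matches, this reports which character was found, which might make the cause of an encoding problem easier to spot
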