
title: "September, 2021"
date: 2021-09-01T09:14:07+03:00
author: "Alan Orth"
categories: ["Notes"]
## 2021-09-02
- Troubleshooting the missing Altmetric scores on AReS
- Turns out that I didn't actually fix them last month because the check for `content.altmetric` still exists, and I can't access the DOIs using `_h.source.DOI` for some reason
- I can access all other kinds of item metadata using the Elasticsearch label, but not DOI!!!
- I will change `DOI` to `tomato` in the repository setup and start a re-harvest... I need to see if this is some kind of reserved word or something...
- Even as `tomato` I can't access that field as `_h.source.tomato` in Angular, but it does work as a filter source... sigh
- I'm having problems using the OpenRXV API
- The syntax Moayad showed me last month doesn't seem to honor the search query properly...
## 2021-09-05
- Update Docker images on AReS server (linode20) and rebuild OpenRXV:
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
- Then run system updates and reboot the server
- After the system came back up I started a fresh re-harvesting
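The image update pipeline above can be sketched in Python. This is only an illustration of what the `grep`/`sed`/`cut` stages do to the `docker images` output before `docker pull` runs; the sample output and the function name `image_refs` are hypothetical and not taken from the AReS server.

```python
def image_refs(docker_images_output):
    """Return repository:tag references parsed from `docker images` output."""
    refs = []
    for line in docker_images_output.splitlines():
        # Skip blank lines and the header row, like `grep -v ^REPO`
        if not line.strip() or line.startswith("REPO"):
            continue
        # Columns are separated by runs of spaces; the first two are
        # REPOSITORY and TAG, like `sed 's/ \+/:/g' | cut -d: -f1,2`
        repository, tag = line.split()[:2]
        refs.append(f"{repository}:{tag}")
    return refs


# Hypothetical sample output for illustration only
sample = """REPOSITORY    TAG       IMAGE ID       CREATED        SIZE
redis         6         aaaaaaaaaaaa   2 weeks ago    105MB
nginx         stable    bbbbbbbbbbbb   3 weeks ago    133MB
"""

for ref in image_refs(sample):
    print(ref)
```

Splitting on whitespace rather than on `:` means a repository name that itself contains a colon (for example a registry with a port) would keep its tag, which the `cut -d: -f1,2` stage would drop.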
## 2021-09-07
- Checking last month's Solr statistics to see if there are any new bots that I need to purge and add to the list
- made 50,000 requests on one day in August, and it is using this user agent: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36`
- It's a fixed line ISP in Montpellier according to, and has not been flagged as abusive, so it must be some CGIAR SMO person doing some web application harvesting from the browser
- is in Sweden and made 46,000 requests in August and it is using this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
- is on Amazon and made 28,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36`
- is in Sweden and made 9,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
- is in Germany and made 6,000 requests with this user agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0 BoldBrains SC/`
- is on Amazon and made 3,000 requests with this user agent: `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36`
- I also noticed that we still have tons (25,000) of requests by MSNbot using this normal-looking user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
- I can identify them by their reverse DNS:
- I had already purged a bunch of these by their IPs in 2021-06, so it looks like I have to do that again
- While looking at the MSN requests I noticed tons of requests from another strange host using reverse IP DNS:,, and many others
- They must be related, because I see them all using the exact same user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
- So this DNS is some Bing bot also...
- I extracted all the IPs and purged them using my `` script
- In total I purged 225,000 hits...
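The kind of tally used above to spot heavy clients can be sketched as follows. This is a minimal illustration, assuming the statistics have already been exported as `(ip, user_agent)` pairs; the function name `heavy_hitters`, the threshold, and the sample records (which use documentation-only IP addresses) are all hypothetical.

```python
from collections import Counter


def heavy_hitters(requests, threshold):
    """Return (ip, hits) pairs with at least `threshold` requests, busiest first."""
    hits = Counter(ip for ip, _user_agent in requests)
    return [(ip, n) for ip, n in hits.most_common() if n >= threshold]


# Hypothetical records using documentation-only IP ranges
ua = "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
records = (
    [("192.0.2.10", ua)] * 5
    + [("192.0.2.11", ua)] * 3
    + [("198.51.100.7", "curl/7.68.0")]
)

for ip, n in heavy_hitters(records, threshold=3):
    print(ip, n)
```

The resulting IP list would still need a manual look (reverse DNS, ISP, user agent) before anything is purged, as in the notes above.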
## 2021-09-12
- Start a harvest on AReS
## 2021-09-13
- Mishell Portilla asked me about thumbnails on CGSpace being small
- For example, 10568/114576 has a lot of white space on the left side
- I created a new thumbnail with vipsthumbnail:
$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
- Looking at the PDF's metadata I see:
- Producer: iLovePDF
- Creator: Adobe InDesign 15.0 (Windows)
- Format: PDF-1.7
- Eventually I should do more tests on this and perhaps file a bug with DSpace...
- Some Alliance people contacted me about getting access to the CGSpace API to deposit with their TIP tool
- I told them I can give them access to DSpace Test and that we should have a meeting soon
- We need to figure out what controlled vocabularies they should use
## 2021-09-14
- Some people from the Alliance contacted me last week about AICCRA metadata
- They have internal things called Components and Clusters, so they were asking how to store these in CGSpace
- I suggested adding new metadata values: `cg.subject.aiccraComponent` and `cg.subject.aiccraCluster`
- On second thought, these are identifiers so perhaps this is better: `cg.identifier.aiccraComponent` and `cg.identifier.aiccraCluster`
## 2021-09-15
- Add ORCID identifier for new ILRI staff to our controlled vocabulary
- Also tag their twenty-five existing items on CGSpace:
$ cat 2021-09-15-add-orcids.csv
,cg.creator.identifier
"Kotchofa, Pacem","Pacem Kotchofa: 0000-0002-1640-8807"
$ ./ilri/ -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p 'fuuuu'
- Meeting with Leroy Mwanzia and some other Alliance people about depositing to CGSpace via API
- I gave them some technical information about the CGSpace API and links to the controlled vocabularies and metadata registries we are using
- I also told them that I would create some documentation listing the metadata fields, which of them are mandatory, and their respective controlled vocabularies
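The ORCID values tagged earlier in this entry use the `Name: 0000-0000-0000-0000` form. A minimal sketch of checking such a value before loading it, assuming the standard ORCID check character (ISO 7064 MOD 11-2); the function names are hypothetical and this is not the script used above.

```python
import re

# Four groups of four characters; the last character may be a digit or X
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def extract_valid_orcid(value):
    """Return the ORCID iD found in a 'Name: 0000-...' value, or None if invalid."""
    match = ORCID_PATTERN.search(value)
    if match is None:
        return None
    orcid = match.group(0)
    digits = orcid.replace("-", "")
    if orcid_check_digit(digits[:15]) != digits[15]:
        return None
    return orcid


print(extract_valid_orcid("Pacem Kotchofa: 0000-0002-1640-8807"))
```

A check like this only catches typos in the identifier; it does not confirm that the iD belongs to the named person.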
## 2021-09-16
- Start writing a Python script to parse `input-forms.xml` to create documentation for submissions
- Found a bug in the DSpace 6.3 REST API: it returns HTTP 500 for `dc.title` even though the field exists in the registry:
- Seems to be with any field that does not have a qualifier
- I filed an issue:
- I decided to update all the metadata field descriptions in our registry so I can use that instead of the "hint" for each field in the input form
- I will include examples as well so that it becomes a better resource
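A rough sketch of the parsing step mentioned above, using only the standard library. The XML excerpt is a hypothetical, trimmed-down sample in the general shape of a DSpace 6 `input-forms.xml` (an assumption, not copied from CGSpace), and `describe_fields` is an illustrative name rather than the script in progress.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the shape of a DSpace 6 input-forms.xml
SAMPLE = """<input-forms>
  <form-definitions>
    <form name="traditional">
      <page number="1">
        <field>
          <dc-schema>dc</dc-schema>
          <dc-element>title</dc-element>
          <dc-qualifier></dc-qualifier>
          <label>Title</label>
          <hint>Enter the main title of the item.</hint>
          <required>You must enter a title.</required>
        </field>
        <field>
          <dc-schema>dc</dc-schema>
          <dc-element>contributor</dc-element>
          <dc-qualifier>author</dc-qualifier>
          <label>Authors</label>
          <hint>Enter the names of the authors.</hint>
          <required></required>
        </field>
      </page>
    </form>
  </form-definitions>
</input-forms>"""


def describe_fields(xml_text):
    """Return one dict per <field> with its dotted name, label, and required flag."""
    root = ET.fromstring(xml_text)
    fields = []
    for field in root.iter("field"):
        parts = [
            field.findtext("dc-schema"),
            field.findtext("dc-element"),
            field.findtext("dc-qualifier"),
        ]
        # Drop empty parts so an unqualified field becomes e.g. "dc.title"
        name = ".".join(p.strip() for p in parts if p and p.strip())
        fields.append(
            {
                "name": name,
                "label": (field.findtext("label") or "").strip(),
                # Assumption: a non-empty <required> message marks a mandatory field
                "required": bool((field.findtext("required") or "").strip()),
            }
        )
    return fields


for entry in describe_fields(SAMPLE):
    print(entry)
```

The resulting list could then be joined with the registry descriptions to produce the documentation pages.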
## 2021-09-17
- I filed an issue about using SPDX License Identifiers in CG Core v2
- Peter Ballantyne emailed me to say that CGSpace was very slow
- The front page was returning a blank white page
- I looked at the database and the connections look low:
$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
- Load on the server is under 1.0, and there are only about 1,000 XMLUI sessions, which seems to be normal for this time of day according to Munin
- But the DSpace log file shows tons of database issues:
$ grep -c "Timeout waiting for idle object" dspace.log.2021-09-17
- The earliest one I see is around midnight (it is 2PM now):
2021-09-17 00:01:49,572 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
2021-09-17 00:01:49,572 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
- But I was definitely logged into the site this morning so there were no issues then...
- It seems that a few errors are normal, but there's obviously something wrong today:
$ grep -c "Timeout waiting for idle object" dspace.log.2021-09-*
- I restarted the server and DSpace came up fine... so it must have been some kind of fluke
- Continue working on cleaning up and annotating the metadata registry on CGSpace
- I removed two old metadata fields that we stopped using earlier this year with the CG Core v2 migration: `cg.targetaudience` and `cg.title.journal`
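For the connection pool timeouts investigated earlier in this entry, a per-hour tally can show when the errors started more precisely than a per-file `grep -c`. A minimal sketch, assuming each log line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp as in the excerpt above; the third sample line and the function name are hypothetical.

```python
from collections import Counter

NEEDLE = "Timeout waiting for idle object"


def timeouts_per_hour(lines):
    """Count matching log lines, keyed by 'YYYY-MM-DD HH'."""
    counts = Counter()
    for line in lines:
        if NEEDLE in line:
            # The first 13 characters of the timestamp are the date and hour
            counts[line[:13]] += 1
    return counts


sample = [
    "2021-09-17 00:01:49,572 WARN  org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null",
    "2021-09-17 00:01:49,572 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object",
    # Hypothetical extra line for illustration
    "2021-09-17 13:45:10,001 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object",
]

for hour, n in sorted(timeouts_per_hour(sample).items()):
    print(hour, n)
```

To run it against a real log, pass an open file object instead of the sample list.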
## 2021-09-18
- Make more progress on parsing and documenting the CGSpace submission form
- Publish on GitHub:
## 2021-09-19
- Improve CGSpace Submission Guidelines metadata parsing and documentation
- GitHub Pages is live now:
- Start a full harvest on AReS
- The harvest completed successfully, but for some reason there were only 92,000 items...
- I updated all Docker images, rebuilt the application, then ran all system updates and rebooted the system:
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
## 2021-09-20
- I synchronized the production CGSpace PostgreSQL, Solr, and Assetstore data with DSpace Test
- Over the weekend a few users reported that they could not log into CGSpace
- I checked LDAP and it seems there is something wrong:
$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "" -W "(sAMAccountName=someaccountnametocheck)"
Enter LDAP Password:
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
- I sent a message to CGNET to ask about the server settings and see if our IP is still whitelisted
- It turns out that CGNET created a new Active Directory server and decommissioned the old one last week
- I updated the configuration on CGSpace and confirmed that it is working
- Create another test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:
$ dspace user -a -m -g CIAT -s Submit -p 'fuuuuuuuu'
- I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
- According to my notes from 2020-10 the account must be in the admin group in order to submit via the REST API
- Run `dspace cleanup -v` process on CGSpace to clean up old bitstreams
- Export lists of authors, donors, and affiliations for Peter Ballantyne to clean up:
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-authors.csv WITH CSV HEADER;
COPY 80901
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-donors.csv WITH CSV HEADER;
COPY 1274
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-09-20-affiliations.csv WITH CSV HEADER;
COPY 8091
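The three exports above share one query shape: group the values of a single metadata field and order them by frequency. A self-contained sketch of that shape using an in-memory SQLite database; the table is a simplified, hypothetical stand-in for `metadatavalue` (it omits the `dspace_object_id` filter on items), and the affiliation names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the DSpace metadatavalue table
conn.execute("CREATE TABLE metadatavalue (metadata_field_id INTEGER, text_value TEXT)")
conn.executemany(
    "INSERT INTO metadatavalue VALUES (?, ?)",
    [
        (211, "Hypothetical University"),
        (211, "Hypothetical University"),
        (211, "Example Research Institute"),
        (248, "Example Donor"),
    ],
)


def value_counts(connection, field_id):
    """Return (text_value, count) rows for one metadata field, most used first."""
    query = """
        SELECT text_value, count(*) AS n
        FROM metadatavalue
        WHERE metadata_field_id = ?
        GROUP BY text_value
        ORDER BY n DESC, text_value
    """
    return connection.execute(query, (field_id,)).fetchall()


for value, n in value_counts(conn, 211):
    print(n, value)
```

Because `GROUP BY text_value` already yields one row per value, the `DISTINCT` in the original queries is harmless but not needed.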
## 2021-09-23
- Peter sent me back the corrections for the affiliations
- It is about 1,280 corrections and fourteen deletions
- I cleaned them up in csv-metadata-quality and then extracted the deletes and fixes to separate files to run with `` and ``:
$ csv-metadata-quality -i ~/Downloads/2021-09-20-affiliations.csv -o /tmp/affiliations.csv -x cg.contributor.affiliation
$ csvgrep -c 'correct' -m 'DELETE' /tmp/affiliations.csv > /tmp/affiliations-delete.csv
$ csvgrep -c 'correct' -r '^.+$' /tmp/affiliations.csv | csvgrep -i -c 'correct' -m 'DELETE' > /tmp/affiliations-fix.csv
$ ./ilri/ -i /tmp/affiliations-fix.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
$ ./ilri/ -i /tmp/affiliations-delete.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
- Then I updated the controlled vocabulary for affiliations by exporting the top 1,000 used terms:
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1000) to /tmp/2021-09-23-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-09-23-affiliations.csv | sed 1d > /tmp/affiliations.txt
- Peter also sent me 310 corrections and 234 deletions for donors so I applied those and updated the controlled vocabularies too
- Move some One CGIAR-related collections around the CGSpace hierarchy for Peter Ballantyne
- Mohammed Salem asked me for an ID to UUID mapping for CGSpace collections, so I generated one similar to the ID one I sent him in 2020-11:
localhost/dspace63= > \COPY (SELECT collection_id,uuid FROM collection WHERE collection_id IS NOT NULL) TO /tmp/2021-09-23-collection-id2uuid.csv WITH CSV HEADER;
COPY 1139
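The split of Peter's affiliation corrections into deletions and fixes, done with `csvgrep` earlier in this entry, can be sketched with the standard `csv` module. This is an illustration only: the function name and the sample rows are hypothetical, and it mirrors the `csvgrep` logic (rows whose `correct` column contains `DELETE` versus rows with any other non-empty value).

```python
import csv
import io


def split_corrections(csv_text, column="correct"):
    """Split rows into (deletes, fixes) based on the value of `column`."""
    deletes, fixes = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(column) or "").strip()
        if "DELETE" in value:
            # Like `csvgrep -c 'correct' -m 'DELETE'`
            deletes.append(row)
        elif value:
            # Like `csvgrep -r '^.+$'` followed by `csvgrep -i -m 'DELETE'`
            fixes.append(row)
    return deletes, fixes


# Hypothetical rows for illustration
sample = """cg.contributor.affiliation,correct
Hypothetical Univeristy,Hypothetical University
Not An Affiliation,DELETE
Example Research Institute,
"""

deletes, fixes = split_corrections(sample)
print(len(deletes), "deletions,", len(fixes), "fixes")
```

Rows with an empty `correct` column fall into neither list, so unchanged values are left alone.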
<!-- vim: set sw=2 ts=2: -->