2021-09-02
- Troubleshooting the missing Altmetric scores on AReS
- Turns out that I didn’t actually fix them last month because the check for `content.altmetric` still exists, and I can’t access the DOIs using `_h.source.DOI` for some reason
- I can access all other kinds of item metadata using the Elasticsearch label, but not `DOI`!!!
- I will change `DOI` to `tomato` in the repository setup and start a re-harvest… I need to see if this is some kind of reserved word or something…
- Even as `tomato` I can’t access that field as `_h.source.tomato` in Angular, but it does work as a filter source… sigh (see the mapping check sketch below)
- I’m having problems using the OpenRXV API
- The syntax Moayad showed me last month doesn’t seem to honor the search query properly…
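- One way to rule out a mapping problem is to ask Elasticsearch directly which fields exist and whether the DOI values are really there (a minimal sketch, assuming the OpenRXV index is named `openrxv-items` and Elasticsearch listens on localhost:9200):

```console
# list the top-level fields in the index mapping
$ curl -s 'http://localhost:9200/openrxv-items/_mapping' | jq '.[].mappings.properties | keys'
# fetch one item that has a DOI and print it
$ curl -s 'http://localhost:9200/openrxv-items/_search?q=DOI:*&size=1' | jq '.hits.hits[0]._source.DOI'
```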
2021-09-05
- Update Docker images on AReS server (linode20) and rebuild OpenRXV:
```console
# list every repo:tag pair on the system and pull the latest version of each
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
```
- Then run system updates and reboot the server
- After the system came back up I started a fresh re-harvesting
2021-09-07
- Checking last month’s Solr statistics to see if there are any new bots that I need to purge and add to the list
- 78.203.225.68 made 50,000 requests on one day in August, and it is using this user agent:
    - `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36`
- It’s a fixed line ISP in Montpellier according to AbuseIPDB.com, and has not been flagged as abusive, so it must be some CGIAR SMO person doing some web application harvesting from the browser
- 130.255.162.154 is in Sweden and made 46,000 requests in August and it is using this user agent:
    - `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
- 35.174.144.154 is on Amazon and made 28,000 requests with this user agent:
    - `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36`
- 192.121.135.6 is in Sweden and made 9,000 requests with this user agent:
    - `Mozilla/5.0 (Macintosh; Intel Mac OS X 11.1; rv:84.0) Gecko/20100101 Firefox/84.0`
- 185.38.40.66 is in Germany and made 6,000 requests with this user agent:
    - `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0 BoldBrains SC/1.10.2.4`
- 3.225.28.105 is in Amazon and made 3,000 requests with this user agent:
    - `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36`
- I also noticed that we still have tons (25,000) of requests by MSNbot using this normal-looking user agent:
    - `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
- I can identify them by their reverse DNS: `msnbot-40-77-167-105.search.msn.com.`
- I had already purged a bunch of these by their IPs in 2021-06, so it looks like I have to do that again
- While looking at the MSN requests I noticed tons of requests from other strange hosts with reverse DNS names like `malta2095.startdedicated.com.`, `astra5139.startdedicated.com.`, and many others
- They must be related, because I see them all using the exact same user agent:
    - `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko`
- So these startdedicated.com hosts are some kind of Bing bot as well… (a quick reverse DNS spot check is sketched below)
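- Reverse DNS is easy to spot-check from the shell with `host` (a minimal sketch; the IP is the one encoded in the msnbot hostname above):

```console
$ host 40.77.167.105
105.167.77.40.in-addr.arpa domain name pointer msnbot-40-77-167-105.search.msn.com.
```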
- I extracted all the IPs and purged them using my `check-spider-ip-hits.sh` script (roughly sketched below)
- In total I purged 225,000 hits…
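- For reference, the purge boils down to a Solr delete-by-query per IP against the statistics core (a minimal sketch of what the script does, assuming Solr listens on localhost:8081; the IP is illustrative):

```console
$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>ip:35.174.144.154</query></delete>'
```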
2021-09-13
- Mishell Portilla asked me about thumbnails on CGSpace being small
- For example, 10568/114576 has a lot of white space on the left side
- I created a new thumbnail with vipsthumbnail:

```console
$ vipsthumbnail ARRTB2020ST.pdf -s x600 -o '%s.jpg[Q=85,optimize_coding,strip]'
```
- Looking at the PDF’s metadata I see:
    - Producer: iLovePDF
    - Creator: Adobe InDesign 15.0 (Windows)
    - Format: PDF-1.7
- Eventually I should do more tests on this and perhaps file a bug with DSpace…
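- Something like `pdfinfo` from poppler-utils will dump that kind of PDF metadata (a sketch; not necessarily the tool I used):

```console
$ pdfinfo ARRTB2020ST.pdf
```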
- Some Alliance people contacted me about getting access to the CGSpace API to deposit with their TIP tool
- I told them I can give them access to DSpace Test and that we should have a meeting soon
- We need to figure out what controlled vocabularies they should use
2021-09-14
- Some people from the Alliance contacted me last week about AICCRA metadata
- They have internal things called Components and Clusters, so they were asking how to store these in CGSpace
- I suggested adding new metadata fields: `cg.subject.aiccraComponent` and `cg.subject.aiccraCluster`
- On second thought, these are identifiers, so perhaps these are better: `cg.identifier.aiccraComponent` and `cg.identifier.aiccraCluster`
2021-09-15
- Add the ORCID identifier for a new ILRI staff member to our controlled vocabulary
- Also tag their twenty-five existing items on CGSpace:
```console
$ cat 2021-09-15-add-orcids.csv
dc.contributor.author,cg.creator.identifier
"Kotchofa, Pacem","Pacem Kotchofa: 0000-0002-1640-8807"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-09-15-add-orcids.csv -db dspace -u dspace -p 'fuuuu'
```
- Meeting with Leroy Mwanzia and some other Alliance people about depositing to CGSpace via API
- I gave them some technical information about the CGSpace API and links to the controlled vocabularies and metadata registries we are using
- I also told them that I would create some documentation listing the metadata fields, which are mandatory, and the respective controlled vocabularies
2021-09-16
- Start writing a Python script to parse `input-forms.xml` to create documentation for submissions (see the sketch below)
- I decided to update all the metadata field descriptions in our registry so I can use that instead of the “hint” for each field in the input form
- I will include examples as well so that it becomes a better resource
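- The core of the parsing is simple with the standard library (a minimal sketch, assuming the DSpace 6 `input-forms.xml` layout where each `field` element contains `dc-schema`, `dc-element`, `dc-qualifier`, `label`, and `hint`):

```python
# Sketch: extract field names, labels, and hints from a DSpace 6 input-forms.xml
import xml.etree.ElementTree as ET

tree = ET.parse("input-forms.xml")

for field in tree.iter("field"):
    schema = field.findtext("dc-schema", default="dc")
    element = field.findtext("dc-element", default="")
    qualifier = field.findtext("dc-qualifier", default="")
    label = field.findtext("label", default="")
    hint = field.findtext("hint", default="")

    # assemble the full field name, e.g. dc.title or cg.identifier.doi
    name = ".".join(part for part in (schema, element, qualifier) if part)
    print(f"{name}: {label} ({hint})")
```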
2021-09-17
- Something is up with CGSpace today; first I checked the number of PostgreSQL connections:

```console
$ psql -c 'SELECT * FROM pg_stat_activity' | wc -l
63
```
- Load on the server is under 1.0, and there are only about 1,000 XMLUI sessions, which seems to be normal for this time of day according to Munin
- But the DSpace log file shows tons of database issues:
```console
$ grep -c "Timeout waiting for idle object" dspace.log.2021-09-17
14779
```
- The earliest one I see is around midnight (now is 2PM):
```console
2021-09-17 00:01:49,572 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ SQL Error: 0, SQLState: null
2021-09-17 00:01:49,572 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
```
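- To see who is actually holding the connections it helps to group `pg_stat_activity` by application and state (a minimal sketch):

```console
$ psql -c 'SELECT application_name, state, COUNT(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY COUNT(*) DESC;'
```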
- But I was definitely logged into the site this morning so there were no issues then…
- It seems that a few errors are normal, but there’s obviously something wrong today:
```console
$ grep -c "Timeout waiting for idle object" dspace.log.2021-09-*
dspace.log.2021-09-01:116
dspace.log.2021-09-02:163
dspace.log.2021-09-03:77
dspace.log.2021-09-04:13
dspace.log.2021-09-05:310
dspace.log.2021-09-06:0
dspace.log.2021-09-07:29
dspace.log.2021-09-08:86
dspace.log.2021-09-09:24
dspace.log.2021-09-10:26
dspace.log.2021-09-11:12
dspace.log.2021-09-12:5
dspace.log.2021-09-13:10
dspace.log.2021-09-14:102
dspace.log.2021-09-15:542
dspace.log.2021-09-16:368
dspace.log.2021-09-17:15235
```
- I restarted the server and DSpace came up fine… so it must have been some kind of fluke
- Continue working on cleaning up and annotating the metadata registry on CGSpace
- I removed two old metadata fields that we stopped using earlier this year with the CG Core v2 migration: `cg.targetaudience` and `cg.title.journal`
2021-09-18
- Make more progress on parsing and documenting the CGSpace submission form
2021-09-19
- Improve CGSpace Submission Guidelines metadata parsing and documentation
- Start a full harvest on AReS
- The harvest completed successfully, but for some reason there were only 92,000 items…
- I updated all Docker images, rebuilt the application, then ran all system updates and rebooted the system:
```console
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose build
```