mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-29 09:58:22 +01:00
436 lines
24 KiB
Markdown
436 lines
24 KiB
Markdown
---
|
|
title: "August, 2021"
|
|
date: 2021-08-01T09:01:07+03:00
|
|
author: "Alan Orth"
|
|
categories: ["Notes"]
|
|
---
|
|
|
|
## 2021-08-01
|
|
|
|
- Update Docker images on AReS server (linode20) and reboot the server:
|
|
|
|
```console
|
|
# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
|
|
```
|
|
|
|
- I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
|
|
|
|
<!--more-->
|
|
|
|
- First running all existing updates, taking some backups, checking for broken packages, and then rebooting:
|
|
|
|
```console
|
|
# apt update && apt dist-upgrade
|
|
# apt autoremove && apt autoclean
|
|
# check for any packages with residual configs we can purge
|
|
# dpkg -l | grep -E '^rc' | awk '{print $2}'
|
|
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
|
|
# dpkg -C
|
|
# dpkg -l > 2021-08-01-linode20-dpkg.txt
|
|
# tar -I zstd -cvf 2021-08-01-etc.tar.zst /etc
|
|
# reboot
|
|
# sed -i 's/bionic/focal/' /etc/apt/sources.list.d/*.list
|
|
# do-release-upgrade
|
|
```
|
|
- ... but of course it hit [the libxcrypt bug](https://bugs.launchpad.net/ubuntu/+source/libxcrypt/+bug/1903838)
|
|
- I had to get a copy of libcrypt.so.1.1.0 from a working Ubuntu 20.04 system and finish the upgrade manually
|
|
|
|
```console
|
|
# apt install -f
|
|
# apt dist-upgrade
|
|
# reboot
|
|
```
|
|
|
|
- After rebooting I purged all packages with residual configs and cleaned up again:
|
|
|
|
```console
|
|
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
|
|
# apt autoremove && apt autoclean
|
|
```
|
|
|
|
- Then I cleared my local Ansible fact cache and re-ran the [infrastructure playbooks](https://github.com/ilri/rmg-ansible-public)
|
|
- Open [an issue for the value mappings global replacement bug in OpenRXV](https://github.com/ilri/OpenRXV/issues/111)
|
|
- Advise Peter and Abenet on expected CGSpace budget for 2022
|
|
- Start a fresh harvesting on AReS (linode20)
|
|
|
|
## 2021-08-02
|
|
|
|
- Help Udana with OAI validation on CGSpace
|
|
- He was checking the OAI base URL on OpenArchives and I had to verify the results in order to proceed to Step 2
|
|
- Now it seems to be verified (all green): https://www.openarchives.org/Register/ValidateSite?log=R23ZWX85
|
|
- We are listed in the OpenArchives list of databases conforming to OAI 2.0
|
|
|
|
## 2021-08-03
|
|
|
|
- Run fresh re-harvest on AReS
|
|
|
|
## 2021-08-05
|
|
|
|
- Have a quick call with Mishell Portilla from CIP about a journal article that was flagged as being in a predatory journal (Beall's List)
|
|
- We agreed to unmap it from RTB's collection for now, and I asked for advice from Peter and Abenet for what to do in the future
|
|
- A developer from the Alliance asked for access to the CGSpace database so they can make some integration with PowerBI
|
|
- I told them we don't allow direct database access, and that it would be tricky anyways (that's what APIs are for!)
|
|
- I'm curious if there are still any requests coming in to CGSpace from the abusive Russian networks
|
|
- I extracted all the unique IPs that nginx processed in the last week:
|
|
|
|
```console
|
|
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.6 /var/log/nginx/access.log.7 /var/log/nginx/access.log.8 | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/2021-08-05-all-ips.txt
|
|
# wc -l /tmp/2021-08-05-all-ips.txt
|
|
43428 /tmp/2021-08-05-all-ips.txt
|
|
```
|
|
|
|
- Already I can see that the total is much less than during the attack on one weekend last month (over 50,000!)
|
|
- Indeed, now I see that there are no IPs from those networks coming in now:
|
|
|
|
```console
|
|
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2021-08-05-all-ips.txt -o /tmp/2021-08-05-all-ips.csv
|
|
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/2021-08-05-all-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/2021-08-05-all-ips-to-purge.csv
|
|
$ wc -l /tmp/2021-08-05-all-ips-to-purge.csv
|
|
0 /tmp/2021-08-05-all-ips-to-purge.csv
|
|
```
|
|
|
|
## 2021-08-08
|
|
|
|
- Advise IWMI colleagues on best practices for thumbnails
|
|
- Add a handful of mappings for incorrect countries, regions, and licenses on AReS and start a new harvest
|
|
- I sent a message to Jacquie from WorldFish to ask if I can help her clean up the incorrect countries and regions in their repository, for example:
|
|
- WorldFish countries: Aegean, Euboea, Caribbean Sea, Caspian Sea, Chilwa Lake, Imo River, Indian Ocean, Indo-pacific
|
|
- WorldFish regions: Black Sea, Arabian Sea, Caribbean Sea, California Gulf, Mediterranean Sea, North Sea, Red Sea
|
|
- Looking at the July Solr statistics to find the top IP and user agents, looking for anything strange
|
|
- 35.174.144.154 made 11,000 requests last month with the following user agent:
|
|
|
|
```console
|
|
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36
|
|
```
|
|
|
|
- That IP is on Amazon, and from looking at the DSpace logs I don't see them logging in at all, only scraping... so I will purge hits from that IP
|
|
- I see 93.158.90.30 is some Swedish IP that also has a normal-looking user agent, but never logs in and requests thousands of XMLUI pages, I will purge their hits too
|
|
- Same deal with 130.255.162.173, which is also in Sweden and makes requests every five seconds or so
|
|
- Same deal with 93.158.90.91, also in Sweden
|
|
- 3.225.28.105 uses a normal-looking user agent but makes thousands of request to the REST API a few seconds apart
|
|
- 61.143.40.50 is in China and uses this hilarious user agent:
|
|
|
|
```console
|
|
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}"
|
|
```
|
|
|
|
- 47.252.80.214 is owned by Alibaba in the US and has the same user agent
|
|
- 159.138.131.15 is in Hong Kong and also seems to be a bot because I never see it log in and it downloads 4,300 PDFs over the course of a few hours
|
|
- 95.87.154.12 seems to be a new bot with the following user agent:
|
|
|
|
```console
|
|
Mozilla/5.0 (compatible; MaCoCu; +https://www.clarin.si/info/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data/
|
|
```
|
|
|
|
- They have a legitimate EU-funded project to enrich data for under-resourced languages in the EU
|
|
- I will purge the hits and add them to our list of bot overrides in the mean time before I submit it to COUNTER-Robots
|
|
- I see a new bot using this user agent:
|
|
|
|
```console
|
|
nettle (+https://www.nettle.sk)
|
|
```
|
|
|
|
- 129.0.211.251 is in Cameroon and uses a normal-looking user agent, but seems to be a bot of some sort, as it downloaded 900 PDFs over a short period.
|
|
- 217.182.21.193 is on OVH in France and uses a Linux user agent, but never logs in and makes several requests per minute, over 1,000 in a day
|
|
- 103.135.104.139 is in Hong Kong and also seems to be making real requests, but makes way too many to be a human
|
|
- There are probably more but that's most of them over 1,000 hits last month, so I will purge them:
|
|
|
|
```console
|
|
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
|
|
Purging 10796 hits from 35.174.144.154 in statistics
|
|
Purging 9993 hits from 93.158.90.30 in statistics
|
|
Purging 6092 hits from 130.255.162.173 in statistics
|
|
Purging 24863 hits from 3.225.28.105 in statistics
|
|
Purging 2988 hits from 93.158.90.91 in statistics
|
|
Purging 2497 hits from 61.143.40.50 in statistics
|
|
Purging 13866 hits from 159.138.131.15 in statistics
|
|
Purging 2721 hits from 95.87.154.12 in statistics
|
|
Purging 2786 hits from 47.252.80.214 in statistics
|
|
Purging 1485 hits from 129.0.211.251 in statistics
|
|
Purging 8952 hits from 217.182.21.193 in statistics
|
|
Purging 3446 hits from 103.135.104.139 in statistics
|
|
|
|
Total number of bot hits purged: 90485
|
|
```
|
|
|
|
- Then I purged a few thousand more by user agent:
|
|
|
|
```console
|
|
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri
|
|
Found 2707 hits from MaCoCu in statistics
|
|
Found 1785 hits from nettle in statistics
|
|
|
|
Total number of hits from bots: 4492
|
|
```
|
|
|
|
- I found some CGSpace metadata in the wrong fields
|
|
- Seven metadata in dc.subject (57) should be in dcterms.subject (187)
|
|
- Twelve metadata in cg.title.journal (202) should be in cg.journal (251)
|
|
- Three dc.identifier.isbn (20) should be in cg.isbn (252)
|
|
- Three dc.identifier.issn (21) should be in cg.issn (253)
|
|
- I re-ran the `migrate-fields.sh` script on CGSpace
|
|
- I exported the entire CGSpace repository as a CSV to do some work on ISSNs and ISBNs:
|
|
|
|
```console
|
|
$ csvcut -c 'id,cg.issn,cg.issn[],cg.issn[en],cg.issn[en_US],cg.isbn,cg.isbn[],cg.isbn[en_US]' /tmp/2021-08-08-cgspace.csv > /tmp/2021-08-08-issn-isbn.csv
|
|
```
|
|
|
|
- Then in OpenRefine I merged all null, blank, and en fields into the `en_US` one for each, removed all spaces, fixed invalid multi-value separators, removed everything other than ISSN/ISBNs themselves
|
|
- In total it was a few thousand metadata entries or so so I had to split the CSV with `xsv split` in order to process it
|
|
- I was reminded again how DSpace 6 is very fucking slow when it comes to any database-related operations, as it takes over an hour to process 200 metadata changes...
|
|
- In total it was 1,195 changes to ISSN and ISBN metadata fields
|
|
|
|
## 2021-08-09
|
|
|
|
- Extract all unique ISSNs to look up on Sherpa Romeo and Crossref
|
|
|
|
```console
|
|
$ csvcut -c 'cg.issn[en_US]' ~/Downloads/2021-08-08-CGSpace-ISBN-ISSN.csv | csvgrep -c 1 -r '^[0-9]{4}' | sed 1d | sort | uniq > /tmp/2021-08-09-issns.txt
|
|
$ ./ilri/sherpa-issn-lookup.py -a mehhhhhhhhhhhhh -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-sherpa-romeo.csv
|
|
$ ./ilri/crossref-issn-lookup.py -e me@cgiar.org -i /tmp/2021-08-09-issns.txt -o /tmp/2021-08-09-journals-crossref.csv
|
|
```
|
|
|
|
- Then I updated the CSV headers for each and joined the CSVs on the issn column:
|
|
|
|
```console
|
|
$ sed -i '1s/journal title/sherpa romeo journal title/' /tmp/2021-08-09-journals-sherpa-romeo.csv
|
|
$ sed -i '1s/journal title/crossref journal title/' /tmp/2021-08-09-journals-crossref.csv
|
|
$ csvjoin -c issn /tmp/2021-08-09-journals-sherpa-romeo.csv /tmp/2021-08-09-journals-crossref.csv > /tmp/2021-08-09-journals-all.csv
|
|
```
|
|
|
|
- In OpenRefine I faceted by blank in each column and copied the values from the other, then created a new column to indicate whether the values were the same with this GREL:
|
|
|
|
```console
|
|
if(cells['sherpa romeo journal title'].value == cells['crossref journal title'].value,"same","different")
|
|
```
|
|
|
|
- Then I exported the list of journals that differ and sent it to Peter for comments and corrections
|
|
- I want to build an updated controlled vocabulary so I can update CGSpace and reconcile our existing metadata against it
|
|
- Convert my `generate-thumbnails.py` script to use libvips instead of Graphicsmagick
|
|
- It is faster and uses less memory than GraphicsMagick (and ImageMagick), and produces nice thumbnails from PDFs
|
|
- One drawback is that libvips uses Poppler instead of Graphicsmagick, which apparently means that it can't work in CMYK
|
|
- I tested one item (10568/51999) that uses CMYK and the thumbnail looked OK (closer to the original than GraphicsMagick), so I'm not sure...
|
|
- Perhaps this is not a problem after all, see this PR from 2019: https://github.com/libvips/libvips/pull/1196
|
|
- I did some tests of the memory used and time elapsed with libvips, GraphicsMagick, and ImageMagick:
|
|
|
|
```console
|
|
$ /usr/bin/time -f %M:%e vipsthumbnail IPCC.pdf -s 600 -o '%s-vips.jpg[Q=85,optimize_coding,strip]'
|
|
39004:0.08
|
|
$ /usr/bin/time -f %M:%e gm convert IPCC.pdf\[0\] -quality 85 -thumbnail x600 -flatten IPCC-gm.jpg
|
|
40932:0.53
|
|
$ /usr/bin/time -f %M:%e convert IPCC.pdf\[0\] -flatten -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_cmyk.icc -profile /usr/share/ghostscript/9.54.0/iccprofiles/default_rgb.icc /tmp/impdfthumb2862933674765647409.pdf.jpg
|
|
41724:0.59
|
|
$ /usr/bin/time -f %M:%e convert -auto-orient /tmp/impdfthumb2862933674765647409.pdf.jpg -quality 85 -thumbnail 600x600 IPCC-im.jpg
|
|
24736:0.04
|
|
```
|
|
|
|
- The ImageMagick way is the same as how DSpace does it (first creating an intermediary image, then getting a thumbnail)
|
|
- libvips does use less time and memory... I should do more tests!
|
|
- I wonder if I can try to use these [unofficial Java bindings](https://github.com/criteo/JVips/blob/master/src/test/java/com/criteo/vips/example/SimpleExample.java) in DSpace
|
|
- The authors of the JVips project wrote a nice blog post about libvips performance: https://medium.com/criteo-engineering/boosting-image-processing-performance-from-imagemagick-to-libvips-268cc3451d55
|
|
- Ouch, JVips is Java 8 only as far as I can tell... that works now, but it's a non-starter going forward
|
|
|
|
## 2021-08-11
|
|
|
|
- Peter got back to me about the journal title cleanup
|
|
- From his corrections it seems an overwhelming majority of his choices match the Sherpa Romeo version of the titles rather than Crossref's
|
|
- Anyways, I exported the originals that were the same in Sherpa Romeo and Crossref as well as Peter's selections for where Sherpa Romeo and Crossref differred:
|
|
|
|
```console
|
|
$ csvcut -c cgspace ~/Downloads/2021-08-09-CGSpace-Journals-PB.csv | sort -u | sed 1d > /tmp/journals1.txt
|
|
$ csvcut -c 'sherpa romeo journal title' ~/Downloads/2021-08-09-CGSpace-Journals-All.csv | sort -u | sed 1d > /tmp/journals2.txt
|
|
$ cat /tmp/journals1.txt /tmp/journals2.txt | sort -u | wc -l
|
|
1911
|
|
```
|
|
|
|
- Now I will create a controlled vocabulary out of this list and reconcile our existing journal title metadata with it in OpenRefine
|
|
- I exported a list of all the journal titles we have in the `cg.journal` field:
|
|
|
|
```console
|
|
localhost/dspace63= > \COPY (SELECT DISTINCT(text_value) AS "cg.journal" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (251)) to /tmp/2021-08-11-journals.csv WITH CSV;
|
|
COPY 3245
|
|
```
|
|
|
|
- I started looking at reconciling them with reconcile-csv in OpenRefine, but ouch, there are 1,600 journal titles that don't match, so I'd have to go check many of them manually before selecting a match or fixing them...
|
|
- I think it's better if I try to write a Python script to fetch the ISSNs for each journal article and update them that way
|
|
- Or instead of doing it via SQL I could use CSV and parse the values there...
|
|
- A few more issues:
|
|
- Some ISSNs are non-existent in Sherpa Romeo and Crossref, but appear on issn.org's web search (their API is invite only)
|
|
- Some titles are different across all three datasets, for example ISSN 0003-1305:
|
|
- [According to ISSN.org](https://portal.issn.org/resource/ISSN/0003-1305) this is "The American statistician"
|
|
- [According to Sherpa Romeo](https://v2.sherpa.ac.uk/id/publication/20807) this is "American Statistician"
|
|
- [According to Crossref](https://search.crossref.org/?q=0003-1305&from_ui=yes&container-title=The+American+Statistician) this is "The American Statistician"
|
|
- I also realized that our previous controlled vocabulary came from CGSpace's top 500 journals, so when I replaced it with the generated list earlier today we lost some journals
|
|
- Now I went back and merged the previous with the new, and manually removed duplicates (sigh)
|
|
- I requested access to the issn.org OAI-PMH API so I can use their registry...
|
|
|
|
## 2021-08-12
|
|
|
|
- I sent an email to Sherpa Romeo's help contact to ask about missing ISSNs
|
|
- They pointed me to their [inclusion criteria](https://v2.sherpa.ac.uk/romeo/about.html) and said that missing journals should submit their open access policies to be included
|
|
- The contact from issn.org got back to me and said I should pay 1,000/year EUR for 100,000 requests to their API... no thanks
|
|
- Submit a pull request to COUNTER-Robots for the httpx bot ([#45](https://github.com/atmire/COUNTER-Robots/pull/45))
|
|
- In the mean time I added it to our local ILRI overrides
|
|
|
|
## 2021-08-15
|
|
|
|
- Start a fresh reindex on AReS
|
|
|
|
## 2021-08-16
|
|
|
|
- Meeting with Abenet and Peter about CGSpace actions and future
|
|
- We agreed to move three top-level Feed the Future projects into one community, so I created one and moved them:
|
|
|
|
```console
|
|
$ dspace community-filiator --set --parent=10568/114644 --child=10568/72600
|
|
$ dspace community-filiator --set --parent=10568/114644 --child=10568/35730
|
|
$ dspace community-filiator --set --parent=10568/114644 --child=10568/76451
|
|
```
|
|
|
|
- I made a minor fix to OpenRXV to prefix all image names with `docker.io` so it works with less changes on podman
|
|
- Docker assumes the `docker.io` registry by default, but we should be explicit
|
|
|
|
## 2021-08-17
|
|
|
|
- I made an initial attempt on the policy statements page on DSpace Test
|
|
- It is modeled on Sherpa Romeo's OpenDOAR policy statements advice
|
|
- Sit with Moayad and discuss the future of AReS
|
|
- We specifically discussed formalizing the API and documenting its use to allow as an alternative to harvesting directly from CGSpace
|
|
- We also discussed allowing linking to search results to enable something like "Explore this collection" links on CGSpace collection pages
|
|
- Lower case all AGROVOC metadata, as I had noticed a few in sentence case:
|
|
|
|
```console
|
|
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
|
|
UPDATE 484
|
|
```
|
|
|
|
- Also update some DOIs using the `dx.doi.org` format, just to keep things uniform:
|
|
|
|
```console
|
|
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, 'https://dx.doi.org', 'https://doi.org') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 220 AND text_value LIKE 'https://dx.doi.org%';
|
|
UPDATE 469
|
|
```
|
|
|
|
- Then start a full Discovery re-indexing to update the Feed the Future community item counts that have been stuck at 0 since we moved the three projects to be a subcommunity a few days ago:
|
|
|
|
```console
|
|
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
|
|
|
|
real 322m16.917s
|
|
user 226m43.121s
|
|
sys 3m17.469s
|
|
```
|
|
|
|
- I learned how to use the OpenRXV API, which is just a thin wrapper around Elasticsearch:
|
|
|
|
```console
|
|
$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search?scroll=1d' \
|
|
-H 'Content-Type: application/json' \
|
|
-d '{
|
|
"size": 10,
|
|
"query": {
|
|
"bool": {
|
|
"filter": {
|
|
"term": {
|
|
"repo.keyword": "CGSpace"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}'
|
|
$ curl -X POST 'https://cgspace.cgiar.org/explorer/api/search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAASekWMTRwZ3lEMkVRYUtKZjgyMno4dV9CUQ=='
|
|
```
|
|
|
|
- This uses the Elasticsearch scroll ID to page through results
|
|
- The second query doesn't need the request body because it is saved for 1 day as part of the first request
|
|
- Attempt to re-do my tests with VisualVM from 2019-04
|
|
- I found that I can't connect to the Tomcat JMX port using SSH forwarding (visualvm gives an error about localhost already being monitored)
|
|
- Instead, I had to create a SOCKS proxy with SSH (ssh -D 8096), then set that up as a proxy in the VisualVM network settings, and then add the JMX connection
|
|
- See: https://dzone.com/articles/visualvm-monitoring-remote-jvm
|
|
- I have to spend more time on this...
|
|
- I fixed a bug in the Altmetric donuts on OpenRXV
|
|
- We now [try to show the donut for the DOI first if it exists, then fall back to the Handle](https://github.com/ilri/OpenRXV/pull/113)
|
|
- This is working on my local test, but not on the live site... sigh
|
|
- I started a fresh harvest, maybe it's something to do with the metadata in Elasticsearch
|
|
- I improved the quality of the "no thumbnail" placeholder image on AReS: https://github.com/ilri/OpenRXV/pull/114
|
|
- I sent some feedback to some ILRI and CCAFS colleagues about how to use better thumbnails for publications
|
|
|
|
## 2021-08-24
|
|
|
|
- In the last few days I did a lot of work on OpenRXV
|
|
- I started exploring the Angular 9.0 to 9.1 update
|
|
- I tested some updates to dependencies for Angular 9 that we somehow missed, like @tinymce/tinymce-angular, @nicky-lenaers/ngx-scroll-to, and @ng-select/ng-select
|
|
- I changed the default target from ES5 to ES2015 because ES5 was released in 2009 and the only thing we lose by moving to ES2015 is IE11 support
|
|
- I fixed a handful of issues in the Docker build and deployment process
|
|
- I started exploring changing the Docker configuration from using volumes to `COPY` instructions in the `Dockerfile` because we are having sporadic issues with permissions in containers caused by copying the host's frontend/backend directories and not being able to write to them
|
|
- I tested moving from node-sass to sass, as it has been [supported since Angular 8 apparently](https://blog.ninja-squad.com/2019/05/29/angular-cli-8.0/) and will allow us to avoid stupid node-gyp issues
|
|
|
|
## 2021-08-25
|
|
|
|
- I did a bunch of tests of the OpenRXV Angular 9.1 update and merged it to master ([#115](https://github.com/ilri/OpenRXV/pull/115))
|
|
- Last week Maria Garruccio sent me a handful of new ORCID identifiers for Bioversity staff
|
|
- We currently have 1320 unique identifiers, so this adds eleven new ones:
|
|
|
|
```console
|
|
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/bioversity-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-08-25-combined-orcids.txt
|
|
$ wc -l /tmp/2021-08-25-combined-orcids.txt
|
|
1331
|
|
```
|
|
|
|
- After I combined them and removed duplicates, I resolved all the names using my `resolve-orcids.py` script:
|
|
|
|
```console
|
|
$ ./ilri/resolve-orcids.py -i /tmp/2021-08-25-combined-orcids.txt -o /tmp/2021-08-25-combined-orcids-names.txt
|
|
```
|
|
|
|
- Tag existing items from the Alliance's new authors with ORCID iDs using `add-orcid-identifiers-csv.py` (181 new metadata fields added):
|
|
|
|
```console
|
|
$ cat 2021-08-25-add-orcids.csv
|
|
dc.contributor.author,cg.creator.identifier
|
|
"Chege, Christine G. Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
|
|
"Chege, Christine Kiria","Christine G.Kiria Chege: 0000-0001-8360-0279"
|
|
"Kiria, C.","Christine G.Kiria Chege: 0000-0001-8360-0279"
|
|
"Kinyua, Ivy","Ivy Kinyua :0000-0002-1978-8833"
|
|
"Rahn, E.","Eric Rahn: 0000-0001-6280-7430"
|
|
"Rahn, Eric","Eric Rahn: 0000-0001-6280-7430"
|
|
"Jager M.","Matthias Jager: 0000-0003-1059-3949"
|
|
"Jager, M.","Matthias Jager: 0000-0003-1059-3949"
|
|
"Jager, Matthias","Matthias Jager: 0000-0003-1059-3949"
|
|
"Waswa, Boaz","Boaz Waswa: 0000-0002-0066-0215"
|
|
"Waswa, Boaz S.","Boaz Waswa: 0000-0002-0066-0215"
|
|
"Rivera, Tatiana","Tatiana Rivera: 0000-0003-4876-5873"
|
|
"Andrade, Robert","Robert Andrade: 0000-0002-5764-3854"
|
|
"Ceccarelli, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
|
|
"Ceccarellia, Viviana","Viviana Ceccarelli: 0000-0003-2160-9483"
|
|
"Nyawira, Sylvia","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
|
|
"Nyawira, Sylvia S.","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
|
|
"Nyawira, Sylvia Sarah","Sylvia Sarah Nyawira: 0000-0003-4913-1389"
|
|
"Groot, J.C.","Groot, J.C.J.: 0000-0001-6516-5170"
|
|
"Groot, J.C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
|
|
"Groot, Jeroen C.J.","Groot, J.C.J.: 0000-0001-6516-5170"
|
|
"Groot, Jeroen CJ","Groot, J.C.J.: 0000-0001-6516-5170"
|
|
"Abera, W.","Wuletawu Abera: 0000-0002-3657-5223"
|
|
"Abera, Wuletawu","Wuletawu Abera: 0000-0002-3657-5223"
|
|
"Kanyenga Lubobo, Antoine","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
|
|
"Lubobo Antoine, Kanyenga","Antoine Lubobo Kanyenga: 0000-0003-0806-9304"
|
|
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-08-25-add-orcids.csv -db dspace -u dspace -p 'fuuu'
|
|
```
|
|
|
|
## 2021-08-29
|
|
|
|
- Run a full harvest on AReS
|
|
- Also do more work the past few days on OpenRXV
|
|
- I switched the backend target from ES2017 to ES2019
|
|
- I did a proof of concept with multi-stage builds and simplifying the Docker configuration
|
|
- Update the list of ORCID identifiers on CGSpace
|
|
- Run system updates and reboot CGSpace (linode18)
|
|
|
|
## 2021-08-31
|
|
|
|
- Yesterday I finished the work to make OpenRXV use a new multi-stage Docker build system and use smarter `COPY` instructions instead of runtime volumes
|
|
- Today I merged the changes to the master branch and re-deployed AReS on linode20
|
|
- Because the `docker-compose.yml` moved to the root the Docker volume prefix changed from `docker_` to `openrxv_` so I had to stop the containers and rsync the data from the old volume to the new one in /var/lib/docker
|
|
|
|
<!-- vim: set sw=2 ts=2: -->
|