CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

March, 2022

2022-03-01

  • Send Gaia the last batch of potential duplicates for items 701 to 980:
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
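The last two commands attach the `filename` column back onto the duplicate-check results by joining on `id`. A minimal Python sketch of that inner join is below, assuming `id` values are unique in the filenames file; the function name `join_on_id` and the sample rows are hypothetical and not taken from the real batch.

```python
import csv
import io


def join_on_id(left_csv: str, right_csv: str, key: str = "id") -> list[dict]:
    """Inner-join two CSV documents on a shared key column."""
    # Index the right-hand rows by key (assumes keys are unique there).
    right_by_key = {row[key]: row for row in csv.DictReader(io.StringIO(right_csv))}
    joined = []
    for row in csv.DictReader(io.StringIO(left_csv)):
        match = right_by_key.get(row[key])
        if match is not None:
            # Append the right-hand columns without repeating the key column.
            extra = {k: v for k, v in match.items() if k != key}
            joined.append({**row, **extra})
    return joined


# Hypothetical sample data for illustration only.
duplicates = "id,dc.title\n701,Example report\n702,Another report\n"
filenames = "id,filename\n701,report-701.pdf\n702,report-702.pdf\n"

for joined_row in join_on_id(duplicates, filenames):
    print(joined_row)
```

Rows whose `id` is missing from either file are dropped, which is also what an inner join with `csvjoin -c id` does by default.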

February, 2022

2022-02-01

  • Meeting with Peter and Abenet about CGSpace in the One CGIAR
    • We agreed to buy $5,000 worth of credits from Atmire for future upgrades
    • We agreed to move CRPs and non-CGIAR communities off the home page, as well as some other things for the CGIAR System Organization
    • We agreed to make a Discovery facet for CGIAR Action Areas above the existing CGIAR Impact Areas one
    • We agreed to try to do more alignment of affiliations/funders with ROR

December, 2021

2021-12-01

  • Atmire merged some changes I had submitted to the COUNTER-Robots project
  • I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics

Total number of bot hits purged: 3679
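As a quick sanity check, the per-agent counts in the output above can be parsed and summed to confirm the reported total. This is only a sketch that assumes every purge line follows the "Purging N hits from AGENT in statistics" pattern shown above; the helper name `purged_hits` is hypothetical.

```python
import re

# Matches lines such as "Purging 455 hits from WhatsApp in statistics".
PURGE_LINE = re.compile(r"^Purging (\d+) hits from (.+) in statistics$")


def purged_hits(output: str) -> dict[str, int]:
    """Map each user agent to the number of hits purged for it."""
    hits = {}
    for line in output.splitlines():
        match = PURGE_LINE.match(line.strip())
        if match:
            hits[match.group(2)] = int(match.group(1))
    return hits


output = """\
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
"""

hits = purged_hits(output)
print(sum(hits.values()))  # 3679, matching the script's reported total
```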

November, 2021

2021-11-02

  • I experimented with manually sharding the Solr statistics on DSpace Test
  • First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json

October, 2021

2021-10-01

  • Export all affiliations on CGSpace and run them against the latest ROR data dump:
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
1879
$ wc -l /tmp/2021-10-01-affiliations.txt 
7100 /tmp/2021-10-01-affiliations.txt
  • So we have 1879/7100 (26.46%) matching already
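The percentage comes straight from the two counts above (matched rows over total affiliations). A small sketch of the arithmetic, with the hypothetical helper name `match_rate`:

```python
def match_rate(matched: int, total: int) -> float:
    """Return the share of matched entries as a percentage, rounded to two places."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * matched / total, 2)


matched, total = 1879, 7100
print(match_rate(matched, total))  # 26.46
print(total - matched)             # 5221 affiliations still unmatched
```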

September, 2021

2021-09-02

  • Troubleshooting the missing Altmetric scores on AReS
    • Turns out that I didn’t actually fix them last month because the check for content.altmetric still exists, and I can’t access the DOIs using _h.source.DOI for some reason
    • I can access all other kinds of item metadata using the Elasticsearch label, but not DOI!!!
    • I will change DOI to tomato in the repository setup and start a re-harvest… I need to see if this is some kind of reserved word or something…
    • Even as tomato I can’t access that field as _h.source.tomato in Angular, but it does work as a filter source… sigh
  • I’m having problems using the OpenRXV API
    • The syntax Moayad showed me last month doesn’t seem to honor the search query properly…

August, 2021

2021-08-01

  • Update Docker images on AReS server (linode20) and reboot the server:
# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
  • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
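The `docker images` pipeline above drops the header row, squeezes runs of spaces into colons, keeps the first two fields as `repository:tag`, skips untagged images, and pulls each remaining reference. The Python sketch below mirrors the parsing step only, under the assumption that the default tabular `docker images` output is used. The function name `pullable_images` and the sample image names are hypothetical. Unlike `grep -v none`, it only skips rows whose repository or tag is literally `<none>`.

```python
def pullable_images(docker_images_output: str) -> list[str]:
    """Extract repository:tag references from `docker images` tabular output."""
    references = []
    for line in docker_images_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "REPOSITORY":
            continue
        repository, tag = fields[0], fields[1]
        # Untagged (dangling) images cannot be pulled by name.
        if "<none>" in (repository, tag):
            continue
        references.append(f"{repository}:{tag}")
    return references


# Hypothetical sample output for illustration only.
sample = """\
REPOSITORY   TAG      IMAGE ID       CREATED       SIZE
nginx        stable   abcdef123456   2 weeks ago   133MB
<none>       <none>   123456abcdef   3 weeks ago   1.2GB
redis        6        fedcba654321   4 weeks ago   105MB
"""

print(pullable_images(sample))
```

Splitting on whitespace rather than on colons also keeps repositories that include a registry port (for example `localhost:5000/app`) intact, which the `cut -d: -f1,2` step would truncate.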

July, 2021

2021-07-01

  • Export another list of ALL subjects on CGSpace, both AGROVOC and non-AGROVOC, for Enrico:
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994

June, 2021

2021-06-01

  • IWMI notified me that AReS was down with an HTTP 502 error
    • Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
    • I don’t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the angular_nginx container isn’t running
    • I simply started it and AReS was running again: