CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

December, 2021

2021-12-01

  • Atmire merged some changes I had submitted to the COUNTER-Robots project
  • I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics

Total number of bot hits purged: 3679
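
As a quick sanity check on the total, the per-agent lines above can be parsed and summed. This is a minimal sketch in Python that assumes the output always follows the `Purging N hits from AGENT in statistics` pattern shown; `parse_purged_hits` is a hypothetical helper and not part of check-spider-hits.sh.

```python
import re

# Matches lines like "Purging 1989 hits from The Knowledge AI in statistics"
PURGE_LINE = re.compile(r"^Purging (\d+) hits from (.+) in statistics$")


def parse_purged_hits(output: str) -> dict[str, int]:
    """Map each agent name to the number of hits reported as purged."""
    hits = {}
    for line in output.splitlines():
        match = PURGE_LINE.match(line.strip())
        if match:
            hits[match.group(2)] = int(match.group(1))
    return hits


# Output copied from the run above
output = """\
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
"""

hits = parse_purged_hits(output)
print(sum(hits.values()))  # agrees with the total reported by the script
```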

November, 2021

2021-11-02

  • I experimented with manually sharding the Solr statistics on DSpace Test
  • First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json

October, 2021

2021-10-01

  • Export all affiliations on CGSpace and run them against the latest RoR data dump:
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
1879
$ wc -l /tmp/2021-10-01-affiliations.txt 
7100 /tmp/2021-10-01-affiliations.txt
  • So we have 1879/7100 (26.46%) matching already
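
The match rate above can also be computed directly from the matching CSV instead of combining `csvgrep` and `wc`. The following is a rough sketch that only assumes the file has a `matched` column containing `true` for matches, as the `csvgrep` call implies; the `organization` column name and the sample rows are hypothetical stand-ins for the real file.

```python
import csv
import io


def count_matched(csv_file) -> tuple[int, int]:
    """Return (matched, total) for a CSV that has a `matched` column."""
    matched = total = 0
    for row in csv.DictReader(csv_file):
        total += 1
        if row["matched"].strip().lower() == "true":
            matched += 1
    return matched, total


# Hypothetical sample data; the real input would be the matching CSV on disk
sample = io.StringIO(
    "organization,matched\n"
    "Example University,true\n"
    "Example Institute,false\n"
    "Another Example Org,true\n"
    "Unmatched Example,false\n"
)
matched, total = count_matched(sample)
sample_rate = f"{matched}/{total} ({matched / total:.2%})"
print(sample_rate)

# The same formatting applied to the figures from the notes
notes_rate = f"1879/7100 ({1879 / 7100:.2%})"
print(notes_rate)
```

Note that the 7100 in the notes comes from counting lines in the input text file rather than rows in the CSV, so the two totals only agree if the lookup script writes one row per input line.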

September, 2021

2021-09-02

  • Troubleshooting the missing Altmetric scores on AReS
    • Turns out that I didn’t actually fix them last month because the check for content.altmetric still exists, and I can’t access the DOIs using _h.source.DOI for some reason
    • I can access all other kinds of item metadata using the Elasticsearch label, but not DOI!!!
    • I will change DOI to tomato in the repository setup and start a re-harvest… I need to see if this is some kind of reserved word or something…
    • Even as tomato I can’t access that field as _h.source.tomato in Angular, but it does work as a filter source… sigh
  • I’m having problems using the OpenRXV API
    • The syntax Moayad showed me last month doesn’t seem to honor the search query properly…

August, 2021

2021-08-01

  • Update Docker images on AReS server (linode20) and reboot the server:
# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
  • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
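
The image-update one-liner above turns each row of `docker images` into a `repository:tag` reference and skips untagged images before handing the references to `docker pull`. The sketch below illustrates only that parsing step in Python. The sample listing is made up, `image_refs` is a hypothetical helper, and unlike `grep -v none` it only skips rows whose repository or tag is `<none>`. Splitting on whitespace instead of `cut -d:` also keeps repository names that contain a colon (for example a registry host with a port) intact.

```python
def image_refs(docker_images_output: str) -> list[str]:
    """Build repository:tag references from `docker images` text output."""
    refs = []
    for line in docker_images_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row
        if len(fields) < 2 or fields[0] == "REPOSITORY":
            continue
        repo, tag = fields[0], fields[1]
        # Untagged or dangling images cannot be pulled by name
        if "<none>" in (repo, tag):
            continue
        refs.append(f"{repo}:{tag}")
    return refs


# Hypothetical listing in the usual `docker images` column layout
listing = """\
REPOSITORY    TAG       IMAGE ID       CREATED        SIZE
example/app   latest    0123456789ab   2 weeks ago    120MB
<none>        <none>    ba9876543210   3 weeks ago    118MB
example/db    7.6.2     aabbccddeeff   2 months ago   800MB
"""

refs = image_refs(listing)
for ref in refs:
    print(ref)
```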

July, 2021

2021-07-01

  • Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
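
For readers less familiar with SQL, the aggregation in the query above (lowercase each value, count occurrences, sort by count descending) can be sketched in Python. The input values below are made up for illustration, `count_subjects` is a hypothetical helper, and Python's `str.lower` may not agree with PostgreSQL's `LOWER` for every non-ASCII character.

```python
from collections import Counter


def count_subjects(values: list[str]) -> list[tuple[str, int]]:
    """Lowercase each subject and return (subject, count), most frequent first."""
    counts = Counter(value.lower() for value in values)
    return counts.most_common()


# Hypothetical metadata values mixing different capitalizations
values = ["MAIZE", "maize", "Climate change", "CLIMATE CHANGE", "maize", "gender"]

subjects = count_subjects(values)
for subject, count in subjects:
    print(f"{subject},{count}")
```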

June, 2021

2021-06-01

  • IWMI notified me that AReS was down with an HTTP 502 error
    • Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
    • I don’t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the angular_nginx container isn’t running
    • I simply started it and AReS was running again:

May, 2021

2021-05-01

  • I looked at the top user agents and IPs in the Solr statistics for last month and I see these user agents:
    • “RI/1.0”, 1337
    • “Microsoft Office Word 2014”, 941
  • I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one… as that’s an actual user…
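
The decision above amounts to checking each user agent against a list of bot patterns and purging only the hits that match. Here is a minimal sketch of that check in Python, assuming the patterns are treated as case-sensitive regular expressions. The escaped form `RI/1\.0` is my assumption about how the pattern would be written, and `is_bot` is a hypothetical helper, not DSpace's actual matching code.

```python
import re


def is_bot(user_agent: str, patterns: list[str]) -> bool:
    """Return True if any pattern matches somewhere in the user agent string."""
    return any(re.search(pattern, user_agent) for pattern in patterns)


# Assumed regex form of the new pattern (the dot is escaped to match literally)
patterns = [r"RI/1\.0"]

# User agents and hit counts from the notes above
agent_hits = {
    "RI/1.0": 1337,
    "Microsoft Office Word 2014": 941,
}

to_purge = {agent: n for agent, n in agent_hits.items() if is_bot(agent, patterns)}
print(to_purge)
```

With only the one pattern added, the Microsoft Word hits are left in place, which matches the intent of keeping real users in the statistics.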

April, 2021

2021-04-01

  • I wrote a script to query Sherpa’s API for our ISSNs: sherpa-issn-lookup.py
    • I’m curious to see how the results compare with the results from Crossref yesterday
  • AReS Explorer had been down since this morning, but I didn’t see anything in the systemd journal
    • I simply took everything down with docker-compose and brought it back up, and then it was OK
    • Perhaps one of the containers crashed; I should have looked closer, but I was in a hurry