CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

April, 2022

2022-04-01

  • I did G1GC tests on DSpace Test (linode26) to complement the CMS tests I did yesterday
    • The Discovery indexing took this long:
real    334m33.625s
user    227m51.331s
sys     3m43.037s

2022-04-04

  • Start a full harvest on AReS
  • Help Marianne with submit/approve access on a new collection on CGSpace
  • Go back in Gaia’s batch reports to find records that she indicated for replacing on CGSpace (i.e., those with better new copies, new versions, etc.)
  • Looking at the Solr statistics for 2022-03 on CGSpace I see 54.
Read more →

March, 2022

2022-03-01

  • Send Gaia the last batch of potential duplicates for items 701 to 980:
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
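
The last step above joins the duplicate-check results back to the filenames on the shared id column. As a rough illustration of what that join does, here is a minimal Python sketch. The function name join_on_id and the sample rows are hypothetical (not from the real batch files), and it assumes each id appears only once in the filenames file.

```python
import csv
import io


def join_on_id(left_csv: str, right_csv: str, key: str = "id") -> list[dict]:
    """Inner-join two CSV texts on a shared key column.

    Assumption: each key value occurs at most once in right_csv.
    """
    # Index the right-hand rows by key for constant-time lookups
    right_rows = {row[key]: row for row in csv.DictReader(io.StringIO(right_csv))}
    joined = []
    for row in csv.DictReader(io.StringIO(left_csv)):
        match = right_rows.get(row[key])
        if match is None:
            continue  # inner join: skip rows with no partner
        merged = dict(row)
        # Append the right-hand columns, without repeating the key
        merged.update({k: v for k, v in match.items() if k != key})
        joined.append(merged)
    return joined


# Hypothetical sample data, only to show the shape of the result
duplicates = "id,dc.title\n701,Example title A\n702,Example title B\n703,Example title C\n"
filenames = "id,filename\n701,a.pdf\n702,b.pdf\n"
rows = join_on_id(duplicates, filenames)
for row in rows:
    print(row)
```

Row 703 is dropped in this sample because it has no partner in the filenames data, which is how an inner join behaves.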
Read more →

February, 2022

2022-02-01

  • Meeting with Peter and Abenet about CGSpace in the One CGIAR
    • We agreed to buy $5,000 worth of credits from Atmire for future upgrades
    • We agreed to move CRPs and non-CGIAR communities off the home page, as well as some other things for the CGIAR System Organization
    • We agreed to make a Discovery facet for CGIAR Action Areas above the existing CGIAR Impact Areas one
    • We agreed to try to do more alignment of affiliations/funders with ROR
Read more →

December, 2021

2021-12-01

  • Atmire merged some changes I had submitted to the COUNTER-Robots project
  • I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics

Total number of bot hits purged: 3679
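
As a sanity check on the total, the per-agent lines can be summed independently. This is a small Python sketch, not part of check-spider-hits.sh; the regular expression simply assumes the "Purging N hits from AGENT in statistics" wording shown in the output above, and the function name total_purged is made up for illustration.

```python
import re

# Matches the per-agent lines printed above; the pattern is an assumption
# based on that output, not on the script's source.
PURGE_LINE = re.compile(r"^Purging (\d+) hits from (.+) in statistics$")


def total_purged(log_text: str) -> int:
    """Sum the hit counts from all 'Purging N hits from ...' lines."""
    total = 0
    for line in log_text.splitlines():
        match = PURGE_LINE.match(line.strip())
        if match:
            total += int(match.group(1))
    return total


log = """Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics"""

print(total_purged(log))  # 3679
```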
Read more →

November, 2021

2021-11-02

  • I experimented with manually sharding the Solr statistics on DSpace Test
  • First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
Read more →

October, 2021

2021-10-01

  • Export all affiliations on CGSpace and run them against the latest ROR data dump:
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
1879
$ wc -l /tmp/2021-10-01-affiliations.txt 
7100 /tmp/2021-10-01-affiliations.txt
  • So we have 1879/7100 (26.46%) matching already
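
The percentage is the matched count divided by the total number of affiliations. A tiny Python sketch of that arithmetic follows; the helper name match_rate is hypothetical, and the inputs are the two counts reported above.

```python
def match_rate(matched: int, total: int) -> float:
    """Return the share of matched entries as a percentage, to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * matched / total, 2)


# Counts taken from the csvgrep and wc output above
print(match_rate(1879, 7100))  # 26.46
```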
Read more →

September, 2021

2021-09-02

  • Troubleshooting the missing Altmetric scores on AReS
    • Turns out that I didn’t actually fix them last month because the check for content.altmetric still exists, and I can’t access the DOIs using _h.source.DOI for some reason
    • I can access all other kinds of item metadata using the Elasticsearch label, but not DOI!!!
    • I will change DOI to tomato in the repository setup and start a re-harvest… I need to see if this is some kind of reserved word or something…
    • Even as tomato I can’t access that field as _h.source.tomato in Angular, but it does work as a filter source… sigh
  • I’m having problems using the OpenRXV API
    • The syntax Moayad showed me last month doesn’t seem to honor the search query properly…
Read more →

August, 2021

2021-08-01

  • Update Docker images on AReS server (linode20) and reboot the server:
# docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
  • I decided to upgrade linode20 from Ubuntu 18.04 to 20.04
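
The one-liner above turns the docker images listing into repository:tag references and pulls each one. For readers less used to sed and cut, here is a rough Python sketch of just the parsing step. The function name image_refs and the sample listing are hypothetical. It skips entries whose repository or tag is exactly <none>, which is slightly narrower than the grep -v none in the shell version.

```python
def image_refs(docker_images_output: str) -> list[str]:
    """Extract repository:tag references from `docker images` text output."""
    refs = []
    for line in docker_images_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row
        if len(fields) < 2 or fields[0] == "REPOSITORY":
            continue
        repo, tag = fields[0], fields[1]
        # Dangling images are listed as <none> and cannot be pulled
        if "<none>" in (repo, tag):
            continue
        refs.append(f"{repo}:{tag}")
    return refs


# Hypothetical listing, only to illustrate the expected input format
sample = """REPOSITORY    TAG       IMAGE ID       CREATED        SIZE
example/app   latest    0123456789ab   2 weeks ago    100MB
<none>        <none>    ba9876543210   3 weeks ago    90MB
example/db    13        aabbccddeeff   4 weeks ago    300MB"""

for ref in image_refs(sample):
    print(ref)
```

Each printed reference could then be passed to docker pull, as xargs does in the original command.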
Read more →

July, 2021

2021-07-01

  • Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
Read more →