CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

June, 2022

2022-06-06

  • Look at the Solr statistics on CGSpace
    • I see 167,000 hits from a bunch of Microsoft IPs with reverse DNS “msnbot-” using the Solr query dns:*msnbot* AND dns:*.msn.com
    • I purged these first so I could see the other “real” IPs in the Solr facets
  • I see 47,500 hits from 80.248.237.167 on a data center ISP in Sweden, using a normal user agent
  • I see 13,000 hits from 163.237.216.11 on a data center ISP in Australia, using a normal user agent
  • I see 7,300 hits from 208.185.238.57 from Britannica, using a normal user agent
    • There seem to be many more of these:
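The per-IP view described above can be approximated with a plain Solr select request that facets on the client address. A minimal sketch, assuming the statistics core is reachable at http://localhost:8081/solr/statistics and that each hit has an `ip` field alongside `dns`; the function name and defaults are illustrative and not part of any CGSpace script:

```python
from urllib.parse import urlencode

# Assumed location of the statistics core; adjust for the real deployment.
SOLR_STATISTICS = "http://localhost:8081/solr/statistics"


def facet_by_ip_url(query, limit=10, base=SOLR_STATISTICS):
    """Build a Solr URL that counts hits per IP for the given query."""
    params = {
        "q": query,
        "rows": 0,  # only the facet counts are needed, not the documents
        "facet": "true",
        "facet.field": "ip",  # assumed field name for the client IP
        "facet.limit": limit,
        "facet.mincount": 1,
        "wt": "json",
    }
    return f"{base}/select?{urlencode(params)}"


if __name__ == "__main__":
    # The msnbot query from the notes; the URL can be fetched with curl.
    print(facet_by_ip_url("dns:*msnbot* AND dns:*.msn.com"))
```

Setting `rows` to 0 keeps the response small, since only the facet counts matter when hunting for heavy hitters.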

May, 2022

2022-05-04

  • I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
    • 18.207.136.176
    • 185.189.36.248
    • 50.118.223.78
    • 52.70.76.123
    • 3.236.10.11
  • Looking at the Solr statistics for 2022-04
    • 52.191.137.59 is Microsoft, but they are using a normal user agent and making tens of thousands of requests
    • 64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc
    • 185.192.69.15 is in the Netherlands and is using a normal user agent, but making excessive automated HTTP requests to paths forbidden in robots.txt
    • 157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
    • 52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
    • 157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
    • 207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
    • If I query Solr for time:2022-04* AND dns:*msnbot* AND dns:*.msn.com, I see a handful of IPs that made 41,000 requests
  • I purged 93,974 hits from these IPs using my check-spider-ip-hits.sh script
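Purging hits for a list of IPs comes down to a Solr delete-by-query. The following is a minimal sketch of how such a request body could be built, not the check-spider-ip-hits.sh script itself; the `ip` field name and the helper name are assumptions, while the `time:2022-04*` style filter follows the query used in the notes:

```python
import json


def ip_purge_payload(ips, month=None):
    """Return (query, JSON body) for a Solr delete-by-query on the given IPs."""
    if not ips:
        raise ValueError("refusing to build a purge query without any IPs")
    # Quote each address so Solr treats it as a literal term.
    clause = " OR ".join(f'"{ip}"' for ip in ips)
    query = f"ip:({clause})"
    if month:
        # Restrict to one month, e.g. "2022-04", like the time:2022-04* filter.
        query = f"time:{month}* AND {query}"
    return query, json.dumps({"delete": {"query": query}})


if __name__ == "__main__":
    query, body = ip_purge_payload(["64.39.98.62", "185.192.69.15"], month="2022-04")
    print(query)
    print(body)  # would be POSTed as JSON to <core>/update?commit=true
```

Running the same query as a normal select first, before sending the delete, gives a hit count to compare against what the purge reports.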

April, 2022

2022-04-01

  • I did G1GC tests on DSpace Test (linode26) to complement the CMS tests I did yesterday
    • The Discovery indexing took this long:
real    334m33.625s
user    227m51.331s
sys     3m43.037s

2022-04-04

  • Start a full harvest on AReS
  • Help Marianne with submit/approve access on a new collection on CGSpace
  • Go back in Gaia’s batch reports to find records that she indicated for replacing on CGSpace (ie, those with better new copies, new versions, etc)
  • Looking at the Solr statistics for 2022-03 on CGSpace
    • I see 54.

March, 2022

2022-03-01

  • Send Gaia the last batch of potential duplicates for items 701 to 980:
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
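The last `csvjoin -c id` step is an inner join of the duplicate-check output with the filenames on the shared `id` column. For cases where csvkit is not at hand, the same idea can be sketched with Python's standard `csv` module; the sample rows below are made up and the function name is illustrative:

```python
import csv
import io


def inner_join_on(column, left_rows, right_rows):
    """Inner-join two lists of dict rows on a shared column."""
    index = {}
    for row in right_rows:
        index.setdefault(row[column], []).append(row)
    joined = []
    for left in left_rows:
        for right in index.get(left[column], []):
            merged = dict(left)
            # Add the right-hand columns, skipping the join key itself.
            merged.update({k: v for k, v in right.items() if k != column})
            joined.append(merged)
    return joined


if __name__ == "__main__":
    # Made-up sample data standing in for the /tmp/*.csv files.
    matches = io.StringIO("id,dc.title\n701,Example title\n702,Another title\n")
    filenames = io.StringIO("id,filename\n701,example.pdf\n703,unused.pdf\n")
    rows = inner_join_on(
        "id", list(csv.DictReader(matches)), list(csv.DictReader(filenames))
    )
    for row in rows:
        print(row)
```

Rows whose `id` appears in only one of the two inputs are dropped, which is what an inner join does.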

February, 2022

2022-02-01

  • Meeting with Peter and Abenet about CGSpace in the One CGIAR
    • We agreed to buy $5,000 worth of credits from Atmire for future upgrades
    • We agreed to move CRPs and non-CGIAR communities off the home page, as well as some other things for the CGIAR System Organization
    • We agreed to make a Discovery facet for CGIAR Action Areas above the existing CGIAR Impact Areas one
    • We agreed to try to do more alignment of affiliations/funders with ROR

December, 2021

2021-12-01

  • Atmire merged some changes I had submitted to the COUNTER-Robots project
  • I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics

Total number of bot hits purged: 3679
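The total in the script output can be cross-checked by summing the per-agent lines. A small sketch that parses output in the format shown above; the regular expression is inferred from these three lines only, so treat it as an assumption about the script's output format:

```python
import re

# Matches lines such as "Purging 455 hits from WhatsApp in statistics".
PURGE_LINE = re.compile(r"^Purging (\d+) hits from (.+) in statistics$")


def sum_purged(log_text):
    """Return (hits per agent, total hits) parsed from purge output."""
    counts = {}
    for line in log_text.splitlines():
        match = PURGE_LINE.match(line.strip())
        if match:
            agent = match.group(2)
            counts[agent] = counts.get(agent, 0) + int(match.group(1))
    return counts, sum(counts.values())


if __name__ == "__main__":
    output = """\
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
"""
    counts, total = sum_purged(output)
    print(counts)
    print(total)  # 1989 + 1235 + 455 = 3679, matching the reported total
```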

November, 2021

2021-11-02

  • I experimented with manually sharding the Solr statistics on DSpace Test
  • First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json

October, 2021

2021-10-01

  • Export all affiliations on CGSpace and run them against the latest RoR data dump:
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l 
1879
$ wc -l /tmp/2021-10-01-affiliations.txt 
7100 /tmp/2021-10-01-affiliations.txt
  • So we have 1879/7100 (26.46%) matching already
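The matching step can be pictured as a case-insensitive lookup of each affiliation against the names in the ROR dump. This is a rough sketch of that idea and not the logic of ror-lookup.py; the `name` and `aliases` keys are assumptions about the dump's record layout, and the sample records are made up:

```python
def build_ror_names(records):
    """Collect lower-cased names and aliases from ROR-style records."""
    names = set()
    for record in records:
        names.add(record["name"].strip().lower())
        for alias in record.get("aliases", []):
            names.add(alias.strip().lower())
    return names


def match_affiliations(affiliations, ror_names):
    """Return the affiliations whose normalised text is a known ROR name."""
    return [a for a in affiliations if a.strip().lower() in ror_names]


if __name__ == "__main__":
    # Made-up records shaped like entries in a ROR data dump (assumed keys).
    records = [
        {"name": "Example Research Institute", "aliases": ["ERI"]},
        {"name": "Sample University", "aliases": []},
    ]
    affiliations = ["Example Research Institute", "Unknown Institute"]
    matched = match_affiliations(affiliations, build_ror_names(records))
    print(f"{len(matched)}/{len(affiliations)} ({len(matched) / len(affiliations):.2%})")
    # The same formatting applied to the counts from the notes:
    print(f"1879/7100 ({1879 / 7100:.2%})")
```

An exact lookup like this misses spelling variants, which may be part of why only about a quarter of the affiliations match.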

September, 2021

2021-09-02

  • Troubleshooting the missing Altmetric scores on AReS
    • Turns out that I didn’t actually fix them last month because the check for content.altmetric still exists, and I can’t access the DOIs using _h.source.DOI for some reason
    • I can access all other kinds of item metadata using the Elasticsearch label, but not DOI!!!
    • I will change DOI to tomato in the repository setup and start a re-harvest… I need to see if this is some kind of reserved word or something…
    • Even as tomato I can’t access that field as _h.source.tomato in Angular, but it does work as a filter source… sigh
  • I’m having problems using the OpenRXV API
    • The syntax Moayad showed me last month doesn’t seem to honor the search query properly…