2022-07-02
- I learned how to use the Levenshtein functions in PostgreSQL
- These functions have a limit of 255 characters in PostgreSQL, so you need to truncate the strings before comparing them
- Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to lowercase both strings first
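To illustrate why both steps matter, here is a minimal Python sketch of a case-sensitive edit distance with the same normalization applied first. It is a stand-in for illustration, not PostgreSQL's implementation; the helper names `normalize` and `normalized_distance` are hypothetical. In SQL the equivalent idea would be something like `levenshtein(lower(left(a, 255)), lower(left(b, 255)))`, assuming the fuzzystrmatch extension is installed.

```python
MAX_LENGTH = 255  # the limit mentioned above


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (case sensitive)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def normalize(value: str) -> str:
    """Truncate to the limit and lowercase (hypothetical helper)."""
    return value[:MAX_LENGTH].lower()


def normalized_distance(a: str, b: str) -> int:
    """Edit distance after truncating and lowercasing both strings."""
    return levenshtein(normalize(a), normalize(b))


print(levenshtein("CGSpace", "cgspace"))          # case differences count as edits
print(normalized_distance("CGSpace", "cgspace"))  # identical after lowercasing
```

Without the lowercasing step, two titles that differ only in capitalization look like they are several edits apart.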
Read more →
2022-06-06
- Look at the Solr statistics on CGSpace
- I see 167,000 hits from a bunch of Microsoft IPs with reverse DNS “msnbot-” using the Solr query
dns:*msnbot* AND dns:*.msn.com
- I purged these first so I could see the other “real” IPs in the Solr facets
- I see 47,500 hits from 80.248.237.167 on a data center ISP in Sweden, using a normal user agent
- I see 13,000 hits from 163.237.216.11 on a data center ISP in Australia, using a normal user agent
- I see 7,300 hits from 208.185.238.57 from Britannica, using a normal user agent
- There seem to be many more of these:
Read more →
2022-05-04
- I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
- 18.207.136.176
- 185.189.36.248
- 50.118.223.78
- 52.70.76.123
- 3.236.10.11
- Looking at the Solr statistics for 2022-04
- 52.191.137.59 is Microsoft, but they are using a normal user agent and making tens of thousands of requests
- 64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc
- 185.192.69.15 is in the Netherlands and is using a normal user agent, but making excessive automated HTTP requests to paths forbidden in robots.txt
- 157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
- 52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
- 157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
- 207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
- If I query Solr for
time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.
I see a handful of IPs that made 41,000 requests
- I purged 93,974 hits from these IPs using my
check-spider-ip-hits.sh
script
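A query like the one above can be turned into a per-IP breakdown with Solr's standard facet parameters. This is a minimal sketch that only builds the request URL. The base URL is an assumption (a local Solr statistics core), and so is the `ip` field name; only `time` and `dns` appear in the query above.

```python
from urllib.parse import urlencode

# Assumed address of a local Solr statistics core; adjust for the real host.
SOLR_SELECT_URL = "http://localhost:8081/solr/statistics/select"


def build_facet_url(query: str, facet_field: str = "ip", limit: int = 10) -> str:
    """Build a Solr select URL that returns facet counts only (no documents)."""
    params = {
        "q": query,
        "rows": 0,                   # skip the documents themselves
        "facet": "true",
        "facet.field": facet_field,  # "ip" is an assumed field name
        "facet.limit": limit,
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_facet_url("time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.")
print(url)
```

Setting `rows` to 0 keeps the response small, since only the facet counts are of interest here.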
Read more →
2022-04-01
- I did G1GC tests on DSpace Test (linode26) to complement the CMS tests I did yesterday
- The Discovery indexing took this long:
real 334m33.625s
user 227m51.331s
sys 3m43.037s
2022-04-04
- Start a full harvest on AReS
- Help Marianne with submit/approve access on a new collection on CGSpace
- Go back in Gaia’s batch reports to find records that she indicated for replacing on CGSpace (ie, those with better new copies, new versions, etc)
- Looking at the Solr statistics for 2022-03 on CGSpace
- I see 54.
Read more →
2022-03-01
- Send Gaia the last batch of potential duplicates for items 701 to 980:
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
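For reference, the final `csvjoin -c id` step is an inner join on the `id` column. A rough Python equivalent using only the standard library might look like this. The sample rows and the `join_on_id` helper are hypothetical, and the sketch assumes each `id` appears at most once in the second file.

```python
import csv
import io


def join_on_id(left_csv: str, right_csv: str, key: str = "id") -> list:
    """Inner-join two CSV documents on a shared key column."""
    # Index the right-hand rows by key (a later duplicate key would win).
    right_rows = {row[key]: row for row in csv.DictReader(io.StringIO(right_csv))}
    joined = []
    for row in csv.DictReader(io.StringIO(left_csv)):
        match = right_rows.get(row[key])
        if match is not None:  # inner join: drop rows without a partner
            joined.append({**row, **match})
    return joined


# Hypothetical sample data, not taken from the real batch files.
duplicates = "id,dc.title\n701,Example title A\n702,Example title B\n"
filenames = "id,filename\n701,a.pdf\n702,b.pdf\n703,c.pdf\n"

rows = join_on_id(duplicates, filenames)
for row in rows:
    print(row)
```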
Read more →
2022-02-01
- Meeting with Peter and Abenet about CGSpace in the One CGIAR
- We agreed to buy $5,000 worth of credits from Atmire for future upgrades
- We agreed to move CRPs and non-CGIAR communities off the home page, as well as some other things for the CGIAR System Organization
- We agreed to make a Discovery facet for CGIAR Action Areas above the existing CGIAR Impact Areas one
- We agreed to try to do more alignment of affiliations/funders with ROR
Read more →
2022-01-01
- Start a full harvest on AReS
Read more →
2021-12-01
- Atmire merged some changes I had submitted to the COUNTER-Robots project
- I updated our local spider user agents and then re-ran the list with my
check-spider-hits.sh
script on CGSpace:
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
Total number of bot hits purged: 3679
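The script's internals are not shown here, but the general idea of matching user agents against a list of patterns can be sketched as follows. This assumes the patterns are treated as case-insensitive regular expressions and that each hit is attributed to the first pattern that matches. The sample user agents are made up, and the pattern strings are simply the names from the log above.

```python
import re
from collections import Counter


def count_bot_hits(patterns, user_agents):
    """Count hits per pattern, attributing each hit to the first match."""
    compiled = [(pattern, re.compile(pattern, re.IGNORECASE)) for pattern in patterns]
    counts = Counter()
    for agent in user_agents:
        for pattern, regex in compiled:
            if regex.search(agent):
                counts[pattern] += 1
                break  # do not count the same hit twice
    return counts


# Pattern names from the log above; the exact pattern strings are an assumption.
patterns = ["The Knowledge AI", "MaCoCu", "WhatsApp"]

# Hypothetical user agents for illustration only.
sample_agents = [
    "Mozilla/5.0 (compatible; MaCoCu)",
    "WhatsApp/2.21",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "whatsapp/2.19",
]

counts = count_bot_hits(patterns, sample_agents)
print(counts)

# Sanity check on the totals reported in the log above.
purged = {"The Knowledge AI": 1989, "MaCoCu": 1235, "WhatsApp": 455}
print(sum(purged.values()))
```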
Read more →
2021-11-02
- I experimented with manually sharding the Solr statistics on DSpace Test
- First I exported all the 2019 stats from CGSpace:
$ ./run.sh -s http://localhost:8081/solr/statistics -f 'time:2019-*' -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
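If the same export had to be repeated for other years, the command line could be generated rather than typed. This sketch reuses only the flags shown above and quotes the arguments with `shlex.join` (Python 3.8+). Exporting any year other than 2019 this way is an assumption, not something described in these notes.

```python
import shlex

SOLR_URL = "http://localhost:8081/solr/statistics"


def export_command(year: int) -> str:
    """Build the export command for one year of statistics."""
    args = [
        "./run.sh",
        "-s", SOLR_URL,
        "-f", f"time:{year}-*",           # filter on the year prefix
        "-a", "export",
        "-o", f"statistics-{year}.json",
        "-k", "uid",
    ]
    return shlex.join(args)  # quotes the wildcard so the shell leaves it alone


print(export_command(2019))
```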
Read more →
2021-10-01
- Export all affiliations on CGSpace and run them against the latest RoR data dump:
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affiliations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
- So we have 1879/7100 (26.46%) matching already
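The percentage is just matched over total. A small sketch of the same arithmetic, plus a way to count `true` values in a `matched` column, is below. The sample CSV and its `organization` column name are hypothetical, and `count_matched` uses an exact comparison rather than csvgrep's substring match.

```python
import csv
import io


def match_percentage(matched: int, total: int) -> float:
    """Share of matched affiliations as a percentage, rounded to two places."""
    if total == 0:
        raise ValueError("total must be greater than zero")
    return round(matched / total * 100, 2)


def count_matched(csv_text: str, column: str = "matched") -> int:
    """Count rows whose `matched` column is exactly "true"."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(1 for row in reader if row[column].strip().lower() == "true")


# Hypothetical sample; the real matching file is not reproduced here.
sample = "organization,matched\nExample University,true\nUnknown Institute,false\n"

print(count_matched(sample))
print(match_percentage(1879, 7100))
```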
Read more →