CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

February, 2023

2023-02-01

  • Export CGSpace to cross check the DOI metadata with Crossref
    • I want to try to expand my use of their data to journals, publishers, volumes, issues, etc…
Read more →

January, 2023

2023-01-01

  • Apply some more ORCID identifiers to items on CGSpace using my 2022-09-22-add-orcids.csv file
    • I want to update all ORCID names and refresh them in the database
    • I see we have some new ones that aren’t in our list if I combine with this file:
Read more →

December, 2022

2022-12-01

  • Fix some incorrect regions on CGSpace
    • I exported the CCAFS and IITA communities, extracted just the country and region columns, then ran them through csv-metadata-quality to fix the regions
  • Add a few more authors to my CSV with author names and ORCID identifiers and tag 283 items!
  • Replace “East Asia” with “Eastern Asia” region on CGSpace (UN M.49 region)
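The region rename above can be sketched as a small CSV pass. This is only an illustration, not how csv-metadata-quality does it: the column name `cg.coverage.region`, the `||` multi-value separator, and the `fix_regions` helper are assumptions, and the only mapping taken from these notes is “East Asia” to “Eastern Asia”.

```python
import csv
import io

# Only this one mapping comes from the notes; extend it as needed.
REGION_FIXES = {"East Asia": "Eastern Asia"}


def fix_regions(csv_text, column="cg.coverage.region", separator="||"):
    """Return CSV text with region names in `column` replaced per REGION_FIXES.

    The column name and the multi-value separator are assumptions about the
    export format, not something confirmed by the notes.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # A cell may hold several regions joined by the separator.
        values = [v.strip() for v in (row[column] or "").split(separator)]
        row[column] = separator.join(REGION_FIXES.get(v, v) for v in values)
        writer.writerow(row)
    return out.getvalue()


sample = "id,cg.coverage.region\n1,East Asia||Western Africa\n2,Eastern Asia\n"
fixed = fix_regions(sample)
print(fixed)
```

Values that are already correct pass through unchanged, so the pass is safe to run more than once on the same export.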
Read more →

November, 2022

2022-11-01

  • Last night I re-synced DSpace 7 Test from CGSpace
    • I also updated all my local 7_x-dev branches on the latest upstreams
  • I spent some time updating the authorizations in Alliance collections
    • I want to make sure they use groups instead of individuals where possible!
  • I reverted the Cocoon autosave change because it was more of a nuisance than a fix: with it Peter can’t upload CSVs from the web interface, and the underlying security issue is very low severity
Read more →

September, 2022

2022-09-01

  • A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet
  • I tested an item submission on DSpace Test with the Cocoon org.apache.cocoon.uploads.autosave=false change
    • The submission works as expected
  • Start debugging some region-related issues with csv-metadata-quality
    • I created a new test file test-geography.csv with some different scenarios
    • I also fixed a few bugs and improved the region-matching logic
Read more →

July, 2022

2022-07-02

  • I learned how to use the Levenshtein functions in PostgreSQL
    • These functions are limited to 255 characters in PostgreSQL, so you need to truncate the strings before comparing
    • Also, the trgm functions I’ve used before are case-insensitive, but Levenshtein is not, so you need to make sure to lowercase both strings first
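To illustrate why the lowercasing and truncation matter, here is a minimal pure-Python sketch of the same preparation steps. It is not the PostgreSQL implementation: `levenshtein` is a textbook dynamic-programming edit distance, and `normalized_distance` and `MAX_LEN` are hypothetical names used only for this example.

```python
MAX_LEN = 255  # the PostgreSQL limit mentioned in the notes above


def levenshtein(a, b):
    """Textbook edit distance: insertions, deletions, and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def normalized_distance(a, b, max_len=MAX_LEN):
    """Lowercase both strings and truncate them before comparing."""
    return levenshtein(a.lower()[:max_len], b.lower()[:max_len])


print(levenshtein("CGSpace", "cgspace"))          # case-sensitive comparison
print(normalized_distance("CGSpace", "cgspace"))  # case differences ignored
```

Truncating means two long strings that differ only after the cut-off compare as identical, which is worth keeping in mind when matching long titles.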
Read more →

June, 2022

2022-06-06

  • Look at the Solr statistics on CGSpace
    • I see 167,000 hits from a bunch of Microsoft IPs with reverse DNS “msnbot-” using the Solr query dns:*msnbot* AND dns:*.msn.com
    • I purged these first so I could see the other “real” IPs in the Solr facets
  • I see 47,500 hits from 80.248.237.167 on a data center ISP in Sweden, using a normal user agent
  • I see 13,000 hits from 163.237.216.11 on a data center ISP in Australia, using a normal user agent
  • I see 7,300 hits from 208.185.238.57 from Britannica, using a normal user agent
    • There seem to be many more of these:
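The two-wildcard filter from the Solr query above (dns:*msnbot* AND dns:*.msn.com) can be approximated offline to separate bot hits from the rest. This is a rough sketch and does not talk to Solr: the `(ip, dns)` pair layout, the `is_msnbot` and `split_hits` helpers, and the sample hostnames are made up for illustration.

```python
from collections import Counter
from fnmatch import fnmatchcase


def is_msnbot(dns_name):
    """True when the reverse DNS name matches both wildcard patterns."""
    name = dns_name.lower()
    return fnmatchcase(name, "*msnbot*") and fnmatchcase(name, "*.msn.com")


def split_hits(hits):
    """Split (ip, dns) pairs into a bot hit count and per-IP counts of the rest."""
    bot_hits = 0
    other_ips = Counter()
    for ip, dns_name in hits:
        if is_msnbot(dns_name):
            bot_hits += 1
        else:
            other_ips[ip] += 1
    return bot_hits, other_ips


# Hypothetical sample data, not taken from the real statistics core.
sample_hits = [
    ("192.0.2.1", "msnbot-192-0-2-1.search.msn.com"),
    ("192.0.2.1", "msnbot-192-0-2-1.search.msn.com"),
    ("198.51.100.7", "host7.example.net"),
    ("203.0.113.9", "msnbot.example.org"),  # matches only one pattern
]
bot_hits, other_ips = split_hits(sample_hits)
print(bot_hits, other_ips.most_common())
```

Requiring both patterns mirrors the AND in the query, so a host that merely contains “msnbot” but is not under msn.com stays in the “real” bucket.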
Read more →

May, 2022

2022-05-04

  • I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days so I will add them to the block list too:
    • 18.207.136.176
    • 185.189.36.248
    • 50.118.223.78
    • 52.70.76.123
    • 3.236.10.11
  • Looking at the Solr statistics for 2022-04
    • 52.191.137.59 is Microsoft, but they are using a normal user agent and making tens of thousands of requests
    • 64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc
    • 185.192.69.15 is in the Netherlands and is using a normal user agent, but making excessive automated HTTP requests to paths forbidden in robots.txt
    • 157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
    • 52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
    • 157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
    • 207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
    • Querying Solr for time:2022-04* AND dns:*msnbot* AND dns:*.msn.com. shows a handful of IPs that made 41,000 requests
  • I purged 93,974 hits from these IPs using my check-spider-ip-hits.sh script
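As a rough illustration of checking traffic against a block list like the one above, the sketch below counts hits per blocked address. It is not the check-spider-ip-hits.sh script: the `is_blocked` and `blocked_hit_counts` helpers and the sample request list are hypothetical, and only the five blocked addresses come from these notes.

```python
import ipaddress
from collections import Counter

# The five addresses listed in the notes above.
BLOCKED = frozenset(
    ipaddress.ip_address(ip)
    for ip in (
        "18.207.136.176",
        "185.189.36.248",
        "50.118.223.78",
        "52.70.76.123",
        "3.236.10.11",
    )
)


def is_blocked(address):
    """True when the address string parses to an IP on the block list."""
    return ipaddress.ip_address(address) in BLOCKED


def blocked_hit_counts(addresses):
    """Count hits per blocked IP in an iterable of address strings."""
    return Counter(a for a in addresses if is_blocked(a))


# Hypothetical request log: one client address per request.
requests = ["3.236.10.11", "192.0.2.10", "3.236.10.11", "52.70.76.123"]
counts = blocked_hit_counts(requests)
print(counts.most_common())
```

Parsing with `ipaddress` rather than comparing raw strings rejects malformed input early, since an invalid address raises `ValueError`.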
Read more →