CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

May, 2022

2022-05-04

  • I found a few more IPs making requests using the shady Chrome 44 user agent in the last few days, so I will add them to the block list too:
    • 18.207.136.176
    • 185.189.36.248
    • 50.118.223.78
    • 52.70.76.123
    • 3.236.10.11
  • Looking at the Solr statistics for 2022-04
    • 52.191.137.59 is Microsoft, but they are using a normal user agent and making tens of thousands of requests
    • 64.39.98.62 is owned by Qualys, and all their requests are probing for /etc/passwd etc
    • 185.192.69.15 is in the Netherlands and is using a normal user agent, but making excessive automated HTTP requests to paths forbidden in robots.txt
    • 157.55.39.159 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
    • 52.233.67.176 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
    • 157.55.39.144 is owned by Microsoft and uses a normal user agent, but is making excessive automated HTTP requests
    • 207.46.13.177 is owned by Microsoft and identifies as bingbot so I don’t know why its requests were logged in Solr
    • If I query Solr for time:2022-04* AND dns:*msnbot* AND dns:*.msn.com., I see a handful of IPs that together made 41,000 requests
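    • For reference, a facet query against the statistics core shows which IPs are behind those hits (the Solr host, port, and core name here are illustrative; adjust for the environment):
$ curl -s 'http://localhost:8081/solr/statistics/select' \
    --data-urlencode 'q=time:2022-04* AND dns:*msnbot* AND dns:*.msn.com.' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.field=ip' \
    --data-urlencode 'wt=json' \
    --data-urlencode 'indent=true'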
  • I purged 93,974 hits from these IPs using my check-spider-ip-hits.sh script
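    • Under the hood each purge boils down to a Solr delete-by-query; a minimal sketch of the underlying Solr operation for a single IP (not the script itself, and using the same illustrative Solr URL as above) would be:
$ curl -s 'http://localhost:8081/solr/statistics/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>ip:52.191.137.59</query></delete>'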
  • Now looking at the Solr statistics by user agent I see:
    • SomeRandomText
    • RestSharp/106.11.7.0
    • MetaInspector/5.7.0 (+https://github.com/jaimeiniesta/metainspector)
    • wp_is_mobile
    • Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"
    • insomnia/2022.2.1
    • ZoteroTranslationServer
    • omgili/0.5 +http://omgili.com
    • curb
    • Sprout Social (Link Attachment)
  • I purged 2,900 hits from these user agents in Solr using my check-spider-hits.sh script
  • I made a pull request to COUNTER-Robots for some of these agents
    • In the meantime I will add them to our local overrides in DSpace
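    • DSpace reads spider user agent patterns, one regex per line, from the files under dspace/config/spiders/agents/, so the local overrides are just another file there; a sketch with an illustrative filename:
$ cat dspace/config/spiders/agents/local-overrides
RestSharp
MetaInspector
wp_is_mobile
insomnia
ZoteroTranslationServer
omgili
Sprout Social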
  • Run all system updates on AReS server, update all Docker containers, and restart the server
    • Start a harvest on AReS

2022-05-05

  • Update PostgreSQL JDBC driver to 42.3.5 in the Ansible infrastructure playbooks and deploy on DSpace Test
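    • The driver is just a single JAR; for reference, 42.3.5 is available from Maven Central (the playbooks may fetch it from elsewhere):
$ wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.3.5/postgresql-42.3.5.jar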
  • Peter asked me how many items we add to CGSpace every year
    • I wrote a SQL query to check the number of items grouped by their accession dates since 2009:
localhost/dspacetest= ☘ SELECT EXTRACT(year from text_value::date) AS YYYY, COUNT(*) FROM metadatavalue WHERE metadata_field_id=11 GROUP BY YYYY ORDER BY YYYY DESC LIMIT 14;
 yyyy │ count 
──────┼───────
 2022 │  2073
 2021 │  6471
 2020 │  4074
 2019 │  7330
 2018 │  8899
 2017 │  6860
 2016 │  8451
 2015 │ 15692
 2014 │ 16479
 2013 │  4388
 2012 │  6472
 2011 │  2694
 2010 │  2457
 2009 │   293
  • Note that I had an issue with casting text_value to date because one item had a bare accession date of 2016 instead of a full timestamp like 2016-09-29T20:14:47Z
    • Once I fixed that, PostgreSQL was able to extract() the year
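    • A query like this finds any remaining values that don't start with a full YYYY-MM-DD date (just a sketch):
localhost/dspacetest= ☘ SELECT text_value FROM metadatavalue WHERE metadata_field_id=11 AND text_value !~ '^\d{4}-\d{2}-\d{2}';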
    • There were some other methods I tried that worked also, for example TO_DATE():
localhost/dspacetest= ☘ SELECT EXTRACT(year from TO_DATE(text_value, 'YYYY-MM-DD"T"HH24:MI:SS"Z"')) AS YYYY, COUNT(*) FROM metadatavalue WHERE metadata_field_id=11 GROUP BY YYYY ORDER BY YYYY DESC LIMIT 14;
  • But it seems PostgreSQL is smart enough to recognize ISO date formatting in strings automatically when we cast, so we don’t need to use TO_DATE() first
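    • For example, casting a full ISO timestamp string directly to date works fine:
localhost/dspacetest= ☘ SELECT '2016-09-29T20:14:47Z'::date;
    date    
────────────
 2016-09-29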
  • Another thing I noticed is that a few hundred items have accession dates from decades ago; perhaps this is due to importing items from the CGIAR Library?
  • I spent some time merging a few pull requests for DSpace 6.4 and porting one to main for DSpace 7.x