CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

May, 2021

2021-05-01

  • I looked at the top user agents and IPs in the Solr statistics for last month and I see these user agents:
    • “RI/1.0”, 1337
    • “Microsoft Office Word 2014”, 941
  • I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one… as that’s an actual user…
  • I should probably add the RI/1.0 pattern to COUNTER-Robots project
  • As well as these IPs:
    • 193.169.254.178, 21648
    • 181.62.166.177, 20323
    • 45.146.166.180, 19376
  • The first IP seems to be in Estonia and their requests to the REST API change user agents from curl to Mac OS X to Windows and more
    • Also, they seem to be trying to exploit something:
193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata-21%2B21*01 HTTP/1.1" 200 458201 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'||lower('')||' HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'%2Brtrim('')%2B' HTTP/1.1" 200 458209 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
  • I will report the IP on abuseipdb.com and purge their hits from Solr
  • The second IP is in Colombia and is making thousands of requests for what looks like some test site:
181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
  • But this site does not exist (yet?)
    • I will purge them from Solr
  • The third IP is in Russia apparently, and the user agent has the pl-PL locale with thousands of requests like this:
45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] "GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&isAllowed=y HTTP/1.1" 200 918998 "http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf" "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15"
  • I will purge these all with my check-spider-ip-hits.sh script:
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 21648 hits from 193.169.254.178 in statistics
Purging 20323 hits from 181.62.166.177 in statistics
Purging 19376 hits from 45.146.166.180 in statistics

Total number of bot hits purged: 61347

2021-05-02

  • Check the AReS Harvester indexes:
$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1      0 0    283b    283b
yellow open openrxv-items-final      ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0   254mb   254mb
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
...
    "openrxv-items-temp": {
        "aliases": {}
    },
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    },
  • I think they look OK (openrxv-items is an alias of openrxv-items-final), but I took a backup just in case:
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
  • Then I started an indexing in the AReS Explorer admin dashboard
  • The indexing finished, but it looks like the aliases are messed up again:
$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final      d0tbMM_SRWimirxr_gm9YA 1 1    937      0   2.2mb   2.2mb

2021-05-05

  • Peter noticed that we no longer display cg.link.reference on the item view
    • It seems that this got dropped accidentally when we migrated to dcterms.relation in CG Core v2
    • I fixed it in the 6_x-prod branch and told him it will be live soon