---
title: "April, 2022"
date: 2022-03-01T10:53:39+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2022-04-01

- I did G1GC tests on DSpace Test (linode26) to complement the CMS tests I did yesterday
  - The Discovery indexing took this long:

```console
real    334m33.625s
user    227m51.331s
sys     3m43.037s
```

## 2022-04-04

- Start a full harvest on AReS
- Help Marianne with submit/approve access on a new collection on CGSpace
- Go back in Gaia's batch reports to find records that she indicated for replacing on CGSpace (ie, those with better new copies, new versions, etc)
- Looking at the Solr statistics for 2022-03 on CGSpace
  - I see 54.229.218.204 on Amazon AWS made 49,000 requests, some of them with this user agent: `Apache-HttpClient/4.5.9 (Java/1.8.0_322)`, and many others with a normal browser agent, so that's fishy!
  - The DSpace agent pattern `http.?agent` seems to have caught the first ones, but I'll purge the IP ones
  - I see 40.77.167.80 is Bing or MSN Bot, but using a normal browser user agent, and if I search Solr for `dns:*msnbot* AND dns:*.msn.com.` I see over 100,000, which is a problem I noticed a few months ago too...
  - I extracted the MSN Bot IPs from Solr using an IP facet, then used the `check-spider-ip-hits.sh` script to purge them

## 2022-04-10

- Start a full harvest on AReS

## 2022-04-13

- UptimeRobot mailed to say that CGSpace was down
  - I looked and found the load at 44...
- There seem to be a lot of locks from the XMLUI:

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
   3173 dspaceWeb
```

- Looking at the top IPs in nginx's access log one IP in particular stands out:

```console
    941 66.249.66.222
   1224 95.108.213.28
   2074 157.90.209.76
   3064 66.249.66.221
  95743 185.192.69.15
```

- 185.192.69.15 is in the UK
- I added a block for that IP in nginx and the load went down...

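
The per-IP tally above has the shape of the usual `awk '{print $1}' | sort | uniq -c | sort -n` pipeline. A minimal Python sketch of the same count, assuming nginx's default combined log format where the client address is the first whitespace-separated field. `top_client_ips` and the sample log lines are hypothetical and not part of the ILRI scripts:

```python
from collections import Counter


def top_client_ips(lines, n=5):
    """Return the n busiest client IPs as (ip, hits) pairs, busiest first.

    Assumes the client address is the first whitespace-separated field,
    as in nginx's default "combined" log format.
    """
    hits = Counter()
    for line in lines:
        fields = line.split(maxsplit=1)
        if fields:  # skip blank lines
            hits[fields[0]] += 1
    return hits.most_common(n)


if __name__ == "__main__":
    # Made-up sample lines using documentation addresses; in practice
    # iterate over an open log file such as /var/log/nginx/access.log
    sample = [
        '192.0.2.10 - - [13/Apr/2022:10:00:00 +0200] "GET / HTTP/1.1" 200 512 "-" "-"',
        '192.0.2.10 - - [13/Apr/2022:10:00:01 +0200] "GET /discover HTTP/1.1" 200 512 "-" "-"',
        '198.51.100.7 - - [13/Apr/2022:10:00:02 +0200] "GET / HTTP/1.1" 200 512 "-" "-"',
    ]
    for ip, count in top_client_ips(sample):
        print(f"{count:>7} {ip}")
```

Unlike `sort -n`, this lists the busiest address first, which is usually the one worth checking before adding a block.
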
## 2022-04-16

- Start harvest on AReS

## 2022-04-18

- I woke up to several notices from UptimeRobot that CGSpace had gone down and up in the night (of course I'm on holiday out of the country for Easter)
- I see there are many locks in use from the XMLUI:

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c
   8932 dspaceWeb
```

- Looking at the top IPs making requests it seems they are Yandex, bingbot, and Googlebot:

```console
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | awk '{print $1}' | sort | uniq -c | sort -h
    752 69.162.124.231
    759 66.249.64.213
    864 66.249.66.222
    905 2a01:4f8:221:f::2
   1013 84.33.2.97
   1201 157.55.39.159
   1204 157.55.39.144
   1209 157.55.39.102
   1217 157.55.39.161
   1252 207.46.13.177
   1274 157.55.39.162
   2553 66.249.66.221
   2941 95.108.213.28
```

- One IP is using a strange user agent though:

```console
84.33.2.97 - - [18/Apr/2022:00:20:38 +0200] "GET /bitstream/handle/10568/109581/Banana_Blomme%20_2020.pdf.jpg HTTP/1.1" 404 10890 "-" "SomeRandomText"
```

- Overall, it seems we had 17,000 unique IPs connecting in the last nine hours (currently 9:14AM and log file rolled over at 00:00):

```console
# cat /var/log/nginx/access.log | awk '{print $1}' | sort | uniq | wc -l
17314
```

- That's a lot of unique IPs, and I see some patterns of IPs in China making ten to twenty requests each
  - The ISPs I've seen so far are ChinaNet and China Unicom
- I extracted all the IPs from today and resolved them:

```console
# cat /var/log/nginx/access.log | awk '{print $1}' | sort | uniq > /tmp/2022-04-18-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/2022-04-18-ips.txt -o /tmp/2022-04-18-ips.csv
```

- The top ASNs by IP are:

```console
$ csvcut -c 2 /tmp/2022-04-18-ips.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
    102 GOOGLE
    139 Maxihost LTDA
    165 AMAZON-02
    393 "China Mobile Communications Group Co., Ltd."
    473 AMAZON-AES
    616 China Mobile communications corporation
    642 M247 Ltd
   2336 HostRoyale Technologies Pvt Ltd
   4556 Chinanet
   5527 CHINA UNICOM China169 Backbone
$ csvcut -c 4 /tmp/2022-04-18-ips.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
    139 262287
    165 16509
    180 204287
    393 9808
    473 14618
    615 56041
    642 9009
   2156 203020
   4556 4134
   5527 4837
```

- I spot checked a few IPs from each of these and they are definitely just making bullshit requests to Discovery and HTML sitemap etc
- I will download the IP blocks for each ASN except Google and Amazon and ban them

```console
$ wget https://asn.ipinfo.app/api/text/nginx/AS4837 https://asn.ipinfo.app/api/text/nginx/AS4134 https://asn.ipinfo.app/api/text/nginx/AS203020 https://asn.ipinfo.app/api/text/nginx/AS9009 https://asn.ipinfo.app/api/text/nginx/AS56041 https://asn.ipinfo.app/api/text/nginx/AS9808
$ cat AS* | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | sort | uniq | wc -l
20296
```

- I extracted the IPv4 and IPv6 networks:

```console
$ cat AS* | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | grep ":" | sort > /tmp/ipv6-networks.txt
$ cat AS* | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | grep -v ":" | sort > /tmp/ipv4-networks.txt
```

- I suspect we need to aggregate these networks since they are so many and nftables doesn't like it when they overlap:

```console
$ wc -l /tmp/ipv4-networks.txt
15464 /tmp/ipv4-networks.txt
$ aggregate6 /tmp/ipv4-networks.txt | wc -l
2781
$ wc -l /tmp/ipv6-networks.txt
4833 /tmp/ipv6-networks.txt
$ aggregate6 /tmp/ipv6-networks.txt | wc -l
338
```

- I deployed these lists on CGSpace, ran all updates, and rebooted the server
  - This list is SURELY too broad because we will block legitimate users in China... but right now how can I discern?
- Also, I need to purge the hits from these 14,000 IPs in Solr when I get time
- Looking back at the Munin graphs a few hours later I see this was indeed some kind of spike that was out of the ordinary:

![PostgreSQL connections day](/cgspace-notes/2022/04/postgres_connections_ALL-day.png)
![DSpace sessions day](/cgspace-notes/2022/04/jmx_dspace_sessions-day.png)

- I used `grepcidr` with the aggregated network lists to extract IPs matching those networks from the nginx logs for the past day:

```console
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | awk '{print $1}' | sort -u > /tmp/ips.log
# while read -r network; do grepcidr $network /tmp/ips.log >> /tmp/ipv4-ips.txt; done < /tmp/ipv4-networks-aggregated.txt
# while read -r network; do grepcidr $network /tmp/ips.log >> /tmp/ipv6-ips.txt; done < /tmp/ipv6-networks-aggregated.txt
# wc -l /tmp/ipv4-ips.txt
15313 /tmp/ipv4-ips.txt
# wc -l /tmp/ipv6-ips.txt
19 /tmp/ipv6-ips.txt
```

- Then I purged them from Solr using `check-spider-ip-hits.sh`:

```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ipv4-ips.txt -p
```

## 2022-04-23

- A handful of spider user agents that I identified were merged into COUNTER-Robots so I updated the ILRI override in our DSpace and regenerated the `example` file that contains most patterns
- I updated CGSpace, then ran all system updates and rebooted the host
- I also ran `dspace cleanup -v` to prune the database

## 2022-04-24

- Start a harvest on AReS

## 2022-04-25

- Looking at the countries on AReS I decided to collect a list to remind Jacquie at WorldFish again about how many incorrect ones they have
  - There are about sixty incorrect ones, some of which I can correct via the value mappings on AReS, but most I can't
  - I set up value mappings for seventeen countries, then sent another sixty or so to Jacquie and Salem to hopefully delete
- I notice we have over 1,000 items with region `Africa South of Sahara`
  - I am surprised to see these because we did a mass migration to
`Sub-Saharan Africa` in 2020-10 when we aligned to UN M.49
  - Oh! It seems I used a capital O in `Of`!
- This is curious, I see we missed `East Asia` and `North America`, because those are still in our list, but UN M.49 uses `Eastern Asia` and `Northern America`... I will have to raise that with Peter and Abenet later
- For now I will just re-run my fixes:

```console
$ cat /tmp/regions.csv
cg.coverage.region,correct
East Africa,Eastern Africa
West Africa,Western Africa
Southeast Asia,South-eastern Asia
South Asia,Southern Asia
Africa South of Sahara,Sub-Saharan Africa
North Africa,Northern Africa
West Asia,Western Asia
$ ./ilri/fix-metadata-values.py -i /tmp/regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 227 -t correct
```

- Then I started a new harvest on AReS
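
The `Of`/`of` slip is a reminder that a value mapping like `/tmp/regions.csv` only helps when the old value matches exactly. A rough Python sketch of that behaviour, assuming exact, case-sensitive string matching. This is not how `fix-metadata-values.py` is implemented, and `load_mapping` and `apply_mapping` are hypothetical names:

```python
import csv
import io


def load_mapping(csv_text, from_column, to_column):
    """Build an {old value: new value} dict from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[from_column]: row[to_column] for row in reader}


def apply_mapping(values, mapping):
    """Replace values that match a key exactly (case-sensitive); leave the rest alone."""
    return [mapping.get(value, value) for value in values]


if __name__ == "__main__":
    # Two rows taken from the regions.csv shown in the notes
    regions_csv = (
        "cg.coverage.region,correct\n"
        "East Africa,Eastern Africa\n"
        "Africa South of Sahara,Sub-Saharan Africa\n"
    )
    mapping = load_mapping(regions_csv, "cg.coverage.region", "correct")
    # The variant with a capital O is not a key, so it passes through unchanged
    print(apply_mapping(["Africa South of Sahara", "Africa South Of Sahara"], mapping))
```

Listing the distinct values before and after a batch fix is a cheap way to catch a near-miss like this one.
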