CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

July, 2021

2021-07-01

  • Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994

2021-07-04

  • Update all Docker containers on the AReS server (linode20) and rebuild OpenRXV:
$ cd OpenRXV
$ docker-compose -f docker/docker-compose.yml down
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml build
  • Then run all system updates and reboot the server
  • After the server came back up I cloned the openrxv-items-final index to openrxv-items-temp and started the plugins
    • This will hopefully be faster than a full re-harvest…
  • I opened a few GitHub issues for OpenRXV bugs:
  • Rebuild DSpace Test (linode26) from a fresh Ubuntu 20.04 image on Linode
  • The start plugins on AReS had seventy-five errors from the dspace_add_missing_items plugin for some reason so I had to start a fresh indexing
  • I noticed that the WorldFish data has dozens of incorrect countries so I should talk to Salem about that because they manage it
    • Also I noticed that we weren’t using the Country formatter in OpenRXV for the WorldFish country field, so some values don’t get mapped properly
    • I added some value mappings to fix some issues with WorldFish data and added a few more fields to the repository harvesting config and started a fresh re-indexing

2021-07-05

  • The AReS harvesting last night succeeded and I started the plugins
  • Margarita from CCAFS asked if we can create a new field for AICCRA publications
    • I asked her to clarify what they want
    • AICCRA is an initiative so it might be better to create new field for that, for example cg.contributor.initiative

2021-07-06

  • Atmire merged my spider user agent changes from last month so I will update the example list we use in DSpace and remove the new ones from my ilri override file
    • Also, I concatenated all our user agents into one file and purged all hits:
$ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
Purging 95 hits from Drupal in statistics
Purging 38 hits from DTS Agent in statistics
Purging 601 hits from Microsoft Office Existence Discovery in statistics
Purging 51 hits from Site24x7 in statistics
Purging 62 hits from Trello in statistics
Purging 13574 hits from WhatsApp in statistics
Purging 144 hits from FlipboardProxy in statistics
Purging 37 hits from LinkWalker in statistics
Purging 1 hits from [Ll]ink.?[Cc]heck.? in statistics
Purging 427 hits from WordPress in statistics

Total number of bot hits purged: 15030
  • Meet with the CGIAR–AGROVOC task group to discuss how we want to do the workflow for submitting new terms to AGROVOC
  • I extracted another list of all subjects to check against AGROVOC:
\COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-07-06-all-subjects.csv | sed 1d > /tmp/2021-07-06-all-subjects.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-06-agrovoc-results-all-subjects.csv -d
  • Test Hrafn Malmquist’s proposed DBCP2 changes for DSpace 6.4 (DS-4574)
    • His changes reminded me that we can perhaps switch back to using this pooling instead of Tomcat 7’s JDBC pooling via JNDI
    • Tomcat 8 has DBCP2 built in, but we are stuck on Tomcat 7 for now
  • Looking into the database issues we had last month on 2021-06-23
    • I think it might have been some kind of attack because the number of XMLUI sessions was through the roof at one point (10,000!) and the number of unique IPs accessing the server that day is much higher than any other day:
# for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
2021-06-10
10693
2021-06-11
10587
2021-06-12
7958
2021-06-13
7681
2021-06-14
12639
2021-06-15
15388
2021-06-16
12245
2021-06-17
11187
2021-06-18
9684
2021-06-19
7835
2021-06-20
7198
2021-06-21
10380
2021-06-22
10255
2021-06-23
15878
2021-06-24
9963
2021-06-25
9439
2021-06-26
7930
  • Similarly, the number of connections to the REST API was around the average for the recent weeks before:
# for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/rest.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
2021-06-10
1183
2021-06-11
1074
2021-06-12
911
2021-06-13
892
2021-06-14
1320
2021-06-15
1257
2021-06-16
1208
2021-06-17
1119
2021-06-18
965
2021-06-19
985
2021-06-20
854
2021-06-21
1098
2021-06-22
1028
2021-06-23
1375
2021-06-24
1135
2021-06-25
969
2021-06-26
904
  • According to goaccess, the traffic spike started at 2AM (remember that the first “Pool empty” error in dspace.log was at 4:01AM):
# zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
  • Moayad sent a fix for the add missing items plugins issue (#107)
    • It works MUCH faster because it correctly identifies the missing handles in each repository
    • Also it adds better debug messages to the api logs

2021-07-08

  • Atmire plans to debug the database connection issues on CGSpace (linode18) today so they asked me to make the REST API inaccessible for today and tomorrow
    • I adjusted nginx to give an HTTP 403 as well as a an error message to contact me

2021-07-11

  • Start an indexing on AReS

2021-07-17

  • I’m in Cyprus mostly offline, but I noticed that CGSpace was down
    • I checked and there was a blank white page with HTTP 200
    • There are thousands of locks in PostgreSQL:
postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2302
postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2564
postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2530
  • The locks are held by XMLUI, not REST API or OAI:
postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
     57 dspaceApi
   2671 dspaceWeb
  • I ran all updates on the server (linode18) and restarted it, then DSpace came back up
  • I sent a message to Atmire, as I never heard from them last week when we blocked access to the REST API for two days for them to investigate the server issues
  • Clone the openrxv-items-temp index on AReS and re-run all the plugins, but most of the “dspace_add_missing_items” tasks failed so I will just run a full re-harvest
  • The load on CGSpace is 45.00… the nginx access.log is going so fast I can’t even read it
    • I see lots of IPs from AS206485 that are changing their user agents (Linux, Windows, macOS…)
    • This is finegroupservers.com aka “UGB - UGB Hosting OU”
    • I will get a list of their IP blocks from ipinfo.app and block them in nginx
    • There is another group of IPs that are owned by an ISP called “TrafficTransitSolution LLC” that does not have its own ASN unfortunately
    • “TrafficTransitSolution LLC” seems to be affiliated with AS206485 (UGB Hosting OU) anyways, but they sometimes use AS49453 Global Layer B.V.G also
    • I found a tool that lets you grep a file by CIDR, so I can use that to purge hits from Solr eventually:
# grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
     32 91.243.191.124
     33 91.243.191.129
     33 91.243.191.200
     34 91.243.191.115
     34 91.243.191.154
     34 91.243.191.234
     34 91.243.191.56
     35 91.243.191.187
     35 91.243.191.91
     36 91.243.191.58
     37 91.243.191.209
     39 91.243.191.119
     39 91.243.191.144
     39 91.243.191.55
     40 91.243.191.112
     40 91.243.191.182
     40 91.243.191.57
     40 91.243.191.98
     41 91.243.191.106
     44 91.243.191.79
     45 91.243.191.151
     46 91.243.191.103
     56 91.243.191.172
  • I found a few people complaining about these Russian attacks too:
  • According to AbuseIPDB.com and whois data provided by the asn tool, I see these organizations, networks, and ISPs all seem to be related:
    • Sharktech
    • LIR LLC / lir.am
    • TrafficTransitSolution LLC / traffictransitsolution.us
    • Fine Group Servers Solutions LLC / finegroupservers.com
    • UGB
    • Bulgakov Alexey Yurievich
    • Dmitry Vorozhtsov / fitz-isp.uk / UGB
    • Auction LLC / dauction.ru / UGB / traffictransitsolution.us
    • Alax LLC / alaxona.com / finegroupservers.com
    • Sysoev Aleksey Anatolevich / jobbuzzactiv.com / traffictransitsolution.us
    • Bulgakov Alexey Yurievich / UGB / blockchainnetworksolutions.co.uk / info@finegroupservers.com
    • UAB Rakrejus
  • I looked in the nginx log and copied a few IP addresses that were suspicious
    • First I looked them up in AbuseIPDB.com to get the ISP name and website
    • Then I looked them up with the asn tool, ie:
$ ./asn -n 45.80.217.235  

╭──────────────────────────────╮
│ ASN lookup for 45.80.217.235 │
╰──────────────────────────────╯

 45.80.217.235 ┌PTR -
               ├ASN 46844 (ST-BGP, US)
               ├ORG Sharktech
               ├NET 45.80.217.0/24 (TrafficTransitSolutionNet)
               ├ABU info@traffictransitsolution.us
               ├ROA ✓ VALID (1 ROA found)
               ├TYP  Proxy host   Hosting/DC 
               ├GEO Los Angeles, California (US)
               └REP ✓ NONE
  • Slowly slowly I manually built up a list of the IPs, ISP names, and network blocks, for example:
IP, Organization, Website, Network
45.148.126.246, TrafficTransitSolution LLC, traffictransitsolution.us, 45.148.126.0/24 (Net-traffictransitsolution-15)
45.138.102.253, TrafficTransitSolution LLC, traffictransitsolution.us, 45.138.102.0/24 (Net-traffictransitsolution-11)
45.140.205.104, Bulgakov Alexey Yurievich, finegroupservers.com, 45.140.204.0/23 (CHINA_NETWORK)
185.68.247.63, Fine Group Servers Solutions LLC, finegroupservers.com, 185.68.247.0/24 (Net-finegroupservers-18)
213.232.123.188, Fine Group Servers Solutions LLC, finegroupservers.com, 213.232.123.0/24 (Net-finegroupservers-12)
45.80.217.235, TrafficTransitSolution LLC, traffictransitsolution.us, 45.80.217.0/24 (TrafficTransitSolutionNet)
185.81.144.202, Fine Group Servers Solutions LLC, finegroupservers.com, 185.81.144.0/24 (Net-finegroupservers-19)
109.106.22.114, TrafficTransitSolution LLC, traffictransitsolution.us, 109.106.22.0/24 (TrafficTransitSolutionNet)
185.68.247.200, Fine Group Servers Solutions LLC, finegroupservers.com, 185.68.247.0/24 (Net-finegroupservers-18)
45.80.105.252, Bulgakov Alexey Yurievich, finegroupservers.com, 45.80.104.0/23 (NET-BNSL2-1)
185.233.187.156, Dmitry Vorozhtsov, mgn-host.ru, 185.233.187.0/24 (GB-FITZISP-20181106)
185.88.100.75, TrafficTransitSolution LLC, traffictransitsolution.us, 185.88.100.0/24 (Net-traffictransitsolution-17)
194.104.8.154, TrafficTransitSolution LLC, traffictransitsolution.us, 194.104.8.0/24 (Net-traffictransitsolution-37)
185.102.112.46, Fine Group Servers Solutions LLC, finegroupservers.com, 185.102.112.0/24 (Net-finegroupservers-13)
212.193.12.64, Fine Group Servers Solutions LLC, finegroupservers.com, 212.193.12.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
91.243.191.129, Auction LLC, dauction.ru, 91.243.191.0/24 (TR-QN-20180917)
45.148.232.161, Nikolaeva Ekaterina Sergeevna, blockchainnetworksolutions.co.uk, 45.148.232.0/23 (LONDON_NETWORK)
147.78.181.191, TrafficTransitSolution LLC, traffictransitsolution.us, 147.78.181.0/24 (Net-traffictransitsolution-58)
77.83.27.90, Alax LLC, alaxona.com, 77.83.27.0/24 (FINEGROUPSERVERS-LEASE)
185.250.46.119, Dmitry Vorozhtsov, mgn-host.ru, 185.250.46.0/23 (GB-FITZISP-20181106)
94.231.219.106, LIR LLC, lir.am, 94.231.219.0/24 (CN-NET-219)
45.12.65.56, Sysoev Aleksey Anatolevich, jobbuzzactiv.com / traffictransitsolution.us, 45.12.65.0/24 (TrafficTransitSolutionNet)
45.140.206.31, Bulgakov Alexey Yurievich, blockchainnetworksolutions.co.uk / info@finegroupservers.com, 45.140.206.0/23 (FRANKFURT_NETWORK)
84.54.57.130, LIR LLC, lir.am / traffictransitsolution.us, 84.54.56.0/23 (CN-FTNET-5456)
178.20.214.235, Alaxona Internet Inc., alaxona.com / finegroupservers.com, 178.20.214.0/24 (FINEGROUPSERVERS-LEASE)
37.44.253.204, Atex LLC, atex.ru / blockchainnetworksolutions.co.uk, 37.44.252.0/23 (NL-FTN-44252)
46.161.61.242, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru, 46.161.61.0/24 (FineTransitDE)
194.87.113.141, Fine Group Servers Solutions LLC, finegroupservers.com, 194.87.113.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
109.94.223.217, LIR LLC, lir.am / traffictransitsolution.us, 109.94.223.0/24 (CN-NET-223)
94.231.217.115, LIR LLC, lir.am / traffictransitsolution.us, 94.231.217.0/24 (TR-NET-217)
146.185.202.214, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru / abuse@ripe.net, 146.185.202.0/24 (FineTransitRU)
194.58.68.110, Fine Group Servers Solutions LLC, finegroupservers.com, 194.58.68.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
94.154.131.237, TrafficTransitSolution LLC, traffictransitsolution.us, 94.154.131.0/24 (TrafficTransitSolutionNet)
193.202.8.245, Fine Group Servers Solutions LLC, finegroupservers.com, 193.202.8.0/21 (FTL5)
212.192.27.33, Fine Group Servers Solutions LLC, finegroupservers.com, 212.192.27.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
193.202.87.218, Fine Group Servers Solutions LLC, finegroupservers.com, 193.202.84.0/22 (FTEL-2)
146.185.200.52, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru / abuse@ripe.net, 146.185.200.0/24 (FineTransitRU)
194.104.11.11, TrafficTransitSolution LLC, traffictransitsolution.us, 194.104.11.0/24 (Net-traffictransitsolution-40)
185.50.250.145, ATOMOHOST LLC, atomohost.com, 185.50.250.0/24 (Silverstar_Invest_Limited)
37.9.46.68, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru / abuse@ripe.net / , 37.9.44.0/22 (QUALITYNETWORK)
185.81.145.14, Fine Group Servers Solutions LLC, finegroupservers.com, 185.81.145.0/24 (Net-finegroupservers-20)
5.183.255.72, TrafficTransitSolution LLC, traffictransitsolution.us, 5.183.255.0/24 (Net-traffictransitsolution-32)
84.54.58.204, LIR LLC, lir.am / traffictransitsolution.us, 84.54.58.0/24 (GB-BTTGROUP-20181119)
109.236.55.175, Mosnet LLC, mosnetworks.ru / info@traffictransitsolution.us, 109.236.55.0/24 (CN-NET-55)
5.133.123.184, Mosnet LLC, mosnet.ru / abuse@blockchainnetworksolutions.co.uk, 5.133.123.0/24 (DE-NET5133123)
5.181.168.90, Fine Group Servers Solutions LLC, finegroupservers.com, 5.181.168.0/24 (Net-finegroupservers-5)
185.61.217.86, TrafficTransitSolution LLC, traffictransitsolution.us, 185.61.217.0/24 (Net-traffictransitsolution-46)
217.145.227.84, TrafficTransitSolution LLC, traffictransitsolution.us, 217.145.227.0/24 (Net-traffictransitsolution-64)
193.56.75.29, Auction LLC, dauction.ru / abuse@blockchainnetworksolutions.co.uk, 193.56.75.0/24 (CN-NET-75)
45.132.184.212, TrafficTransitSolution LLC, traffictransitsolution.us, 45.132.184.0/24 (Net-traffictransitsolution-5)
45.10.167.239, TrafficTransitSolution LLC, traffictransitsolution.us, 45.10.167.0/24 (Net-traffictransitsolution-28)
109.94.222.106, Express Courier LLC, expcourier.ru / info@traffictransitsolution.us, 109.94.222.0/24 (IN-NET-222)
62.76.232.218, Fine Group Servers Solutions LLC, finegroupservers.com, 62.76.232.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
147.78.183.221, TrafficTransitSolution LLC, traffictransitsolution.us, 147.78.183.0/24 (Net-traffictransitsolution-60)
94.158.22.202, Auction LLC, dauction.ru / info@traffictransitsolution.us, 94.158.22.0/24 (FR-QN-20180917)
85.202.194.33, Mosnet LLC, mosnet.ru / info@traffictransitsolution.us, 85.202.194.0/24 (DE-QN-20180917)
193.187.93.150, Fine Group Servers Solutions LLC, finegroupservers.com, 193.187.92.0/22 (FTL3)
185.250.45.149, Dmitry Vorozhtsov, mgn-host.ru / abuse@fitz-isp.uk, 185.250.44.0/23 (GB-FITZISP-20181106)
185.50.251.75, ATOMOHOST LLC, atomohost.com, 185.50.251.0/24 (Silverstar_Invest_Limited)
5.183.254.117, TrafficTransitSolution LLC, traffictransitsolution.us, 5.183.254.0/24 (Net-traffictransitsolution-31)
45.132.186.187, TrafficTransitSolution LLC, traffictransitsolution.us, 45.132.186.0/24 (Net-traffictransitsolution-7)
83.171.252.105, Teleport LLC, teleport.az / abuse@blockchainnetworksolutions.co.uk, 83.171.252.0/23 (DE-FTNET-252)
45.148.127.37, TrafficTransitSolution LLC, traffictransitsolution.us, 45.148.127.0/24 (Net-traffictransitsolution-16)
194.87.115.133, Fine Group Servers Solutions LLC, finegroupservers.com, 194.87.115.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
193.233.250.100, OOO Freenet Group, free.net / abuse@vmage.ru, 193.233.250.0/24 (TrafficTransitSolutionNet)
194.87.116.246, Fine Group Servers Solutions LLC, finegroupservers.com, 194.87.116.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
195.133.25.244, Fine Group Servers Solutions LLC, finegroupservers.com, 195.133.25.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
77.220.194.159, Fine Group Servers Solutions LLC, finegroupservers.com, 77.220.194.0/24 (Net-finegroupservers-3)
185.89.101.177, ATOMOHOST LLC, atomohost.com, 185.89.100.0/23 (QUALITYNETWORK)
193.151.191.133, Alax LLC, alaxona.com / info@finegroupservers.com, 193.151.191.0/24 (FINEGROUPSERVERS-LEASE)
5.181.170.147, Fine Group Servers Solutions LLC, finegroupservers.com, 5.181.170.0/24 (Net-finegroupservers-7)
193.233.249.167, OOO Freenet Group, free.net / abuse@vmage.ru, 193.233.249.0/24 (TrafficTransitSolutionNet)
46.161.59.90, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru, 46.161.59.0/24 (FineTransitJP)
213.108.3.74, TrafficTransitSolution LLC, traffictransitsolution.us, 213.108.3.0/24 (Net-traffictransitsolution-24)
193.233.251.238, OOO Freenet Group, free.net / abuse@vmage.ru, 193.233.251.0/24 (TrafficTransitSolutionNet)
178.20.215.224, Alaxona Internet Inc., alaxona.com / info@finegroupservers.com, 178.20.215.0/24 (FINEGROUPSERVERS-LEASE)
45.159.22.199, Server LLC, ixserv.ru / info@finegroupservers.com, 45.159.22.0/24 (FINEGROUPSERVERS-LEASE)
109.236.53.244, Mosnet LLC, mosnet.ru, info@traffictransitsolution.us, 109.236.53.0/24 (TR-NET-53)
  • I found a better way to get the ASNs using my resolve-addresses-geoip2.py script
    • First, get a list of all IPs making requests to nginx today:
# grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq  > /tmp/ips-sorted.txt
# wc -l /tmp/ips-sorted.txt 
10776 /tmp/ips-sorted.txt
  • Then resolve them all:
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips-sorted.txt -o /tmp/out.csv
  • Then get the top 10 organizations and top ten ASNs:
$ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
    213 AMAZON-AES
    218 ASN-QUADRANET-GLOBAL
    246 Silverstar Invest Limited
    347 Ethiopian Telecommunication Corporation
    475 DEDIPATH-LLC
    504 AS-COLOCROSSING
    598 UAB Rakrejus
    814 UGB Hosting OU
   1010 ST-BGP
   1757 Global Layer B.V.
$ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
    213 14618
    218 8100
    246 35624
    347 24757
    475 35913
    504 36352
    598 62282
    814 206485
   1010 46844
   1757 49453
  • I will download blocklists for all these except Ethiopian Telecom, Quadranet, and Amazon, though I’m concerned about Global Layer because it’s a huge ASN that seems to have legit hosts too…?
$ wget https://asn.ipinfo.app/api/text/nginx/AS49453
$ wget https://asn.ipinfo.app/api/text/nginx/AS46844
$ wget https://asn.ipinfo.app/api/text/nginx/AS206485
$ wget https://asn.ipinfo.app/api/text/nginx/AS62282
$ wget https://asn.ipinfo.app/api/text/nginx/AS36352
$ wget https://asn.ipinfo.app/api/text/nginx/AS35624
$ cat AS* | sort | uniq > /tmp/abusive-networks.txt
$ wc -l /tmp/abusive-networks.txt 
2276 /tmp/abusive-networks.txt
  • Combining with my existing rules and filtering uniques:
$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
2298
  • According to Scamalytics all these are high risk ISPs (as recently as 2021-06) so I will just keep blocking them
  • I deployed the block list on CGSpace (linode18) and the load is down to 1.0 but I see there are still some DDoS IPs getting through… sigh
  • The next thing I need to do is purge all the IPs from Solr using grepcidr…

2021-07-18

  • After blocking all the ASN network blocks yesterday I still see requests getting through from these abusive networks, so the ASN lists must be out of date
    • I decided to get a lit of all the IPs that made requests on the server in the last two days, resolve them, and then filter out those from these ASNs: 206485, 35624, 36352, 46844, 49453, 62282
$ sudo zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 | grep -E " (200|499) " | awk '{print $1}' | sort | uniq > /tmp/all-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips.txt -o /tmp/all-ips-out.csv
$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/all-ips-to-block.txt
$ wc -l /tmp/all-ips-to-block.txt 
5095 /tmp/all-ips-to-block.txt
  • Then I added them to the normal ipset we are already using with firewalld
    • I will check again in a few hours and ban more
  • I decided to extract the networks from the GeoIP database with resolve-addresses-geoip2.py so I can block them more efficiently than using the 5,000 IPs in an ipset:
$ csvgrep -c asn -r '^(206485|35624|36352|46844|49453|62282)$' /tmp/all-ips-out.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/all-networks-to-block.txt
$ grep deny roles/dspace/templates/nginx/abusive-networks.conf.j2 | sort | uniq | wc -l
2354
  • Combined with the previous networks this brings about 200 more for a total of 2,354 networks
    • I think I need to re-work the ipset stuff in my common Ansible role so that I can add such abusive networks as an iptables ipset / nftables set, and have a cron job to update them daily (from Spamhaus’s DROP and EDROP lists, for example)
  • Then I got a list of all the 5,095 IPs from above and used check-spider-ip-hits.sh to purge them from Solr:
$ ilri/check-spider-ip-hits.sh -f /tmp/all-ips-to-block.txt -p
...
Total number of bot hits purged: 197116
  • I started a harvest on AReS and it finished in a few hours now that the load on CGSpace is back to a normal level

2021-07-20

  • Looking again at the IPs making connections to CGSpace over the last few days from these seven ASNs, it’s much higher than I noticed yesterday:
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l
5643
  • I purged 27,000 more hits from the Solr stats using this new list of IPs with my check-spider-ip-hits.sh script
  • Surprise surprise, I checked the nginx logs from 2021-06-23 when we last had issues with thousands of XMLUI sessions and PostgreSQL connections and I see IPs from the same ASNs!
$ sudo zcat --force /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/all-ips-june-23.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/all-ips-june-23.txt -o /tmp/out.csv
$ csvcut -c 2,4 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 15
    265 GOOGLE,15169
    277 Silverstar Invest Limited,35624
    280 FACEBOOK,32934
    288 SAFARICOM-LIMITED,33771
    399 AMAZON-AES,14618
    427 MICROSOFT-CORP-MSN-AS-BLOCK,8075
    455 Opera Software AS,39832
    481 MTN NIGERIA Communication limited,29465
    502 DEDIPATH-LLC,35913
    506 AS-COLOCROSSING,36352
    602 UAB Rakrejus,62282
    822 ST-BGP,46844
    874 Ethiopian Telecommunication Corporation,24757
    912 UGB Hosting OU,206485
   1607 Global Layer B.V.,49453
  • Again it was over 5,000 IPs:
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624)$' /tmp/out.csv | csvcut -c ip | sed 1d | sort | uniq | wc -l         
5228
  • Interestingly, it seems these are five thousand different IP addresses than the attack from last weekend, as there are over 10,000 unique ones if I combine them!
$ cat /tmp/ips-june23.txt /tmp/ips-jul16.txt | sort | uniq | wc -l
10458
  • I purged all the (26,000) hits from these new IP addresses from Solr as well
  • Looking back at my notes for the 2019-05 attack I see that I had already identified most of these network providers (!)…
    • Also, I took a closer look at QuadraNet (AS8100) and found some association with ATOMOHOST LLC and finegroupservers.com and traffictransitsolution.us, so now I need to block/purge that ASN too!
    • I saw it on the Scamalytics 2021-06 list anyways, so at this point I have no doubt
  • Adding QuadraNet brings the total networks seen during these two attacks to 262, and the number of unique IPs to 10900:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2 /var/log/nginx/access.log.3 /var/log/nginx/access.log.4 /var/log/nginx/access.log.5 /var/log/nginx/access.log.27.gz /var/log/nginx/access.log.28.gz | grep -E " (200|499) " | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/ddos-ips.txt
# wc -l /tmp/ddos-ips.txt 
54002 /tmp/ddos-ips.txt
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ddos-ips.txt -o /tmp/ddos-ips.csv
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c ip | sed 1d | sort | uniq > /tmp/ddos-ips-to-purge.txt
$ wc -l /tmp/ddos-ips-to-purge.txt
10900 /tmp/ddos-ips-to-purge.txt
$ csvgrep -c asn -r '^(49453|46844|206485|62282|36352|35913|35624|8100)$' /tmp/ddos-ips.csv | csvcut -c network | sed 1d | sort | uniq > /tmp/ddos-networks-to-block.txt
$ wc -l /tmp/ddos-networks-to-block.txt
262 /tmp/ddos-networks-to-block.txt
  • The new total number of networks to block, including the network prefixes for these ASNs downloaded from asn.ipinfo.app, is 4,007:
$ wget https://asn.ipinfo.app/api/text/nginx/AS49453 \
https://asn.ipinfo.app/api/text/nginx/AS46844 \
https://asn.ipinfo.app/api/text/nginx/AS206485 \
https://asn.ipinfo.app/api/text/nginx/AS62282 \
https://asn.ipinfo.app/api/text/nginx/AS36352 \
https://asn.ipinfo.app/api/text/nginx/AS35913 \
https://asn.ipinfo.app/api/text/nginx/AS35624 \
https://asn.ipinfo.app/api/text/nginx/AS8100
$ cat AS* /tmp/ddos-networks-to-block.txt | sed -e '/^$/d' -e '/^#/d' -e '/^{/d' -e 's/deny //' -e 's/;//' | sort | uniq | wc -l
4007
  • I re-applied these networks to nginx on CGSpace (linode18) and DSpace Test (linode26), and purged 14,000 more Solr statistics hits from these IPs