From 8934e3f79189a5324c3549c5ea301ade94b27545 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Sun, 18 Jul 2021 00:16:49 +0300 Subject: [PATCH] Add notes for 2021-07-17 --- content/posts/2021-07.md | 250 +++++++ docs/2021-06/index.html | 906 ++++++++++-------------- docs/categories/index.html | 2 +- docs/categories/notes/index.html | 2 +- docs/categories/notes/page/2/index.html | 2 +- docs/categories/notes/page/3/index.html | 2 +- docs/categories/notes/page/4/index.html | 2 +- docs/categories/notes/page/5/index.html | 2 +- docs/index.html | 2 +- docs/page/2/index.html | 2 +- docs/page/3/index.html | 2 +- docs/page/4/index.html | 2 +- docs/page/5/index.html | 2 +- docs/page/6/index.html | 2 +- docs/page/7/index.html | 2 +- docs/page/8/index.html | 2 +- docs/posts/index.html | 2 +- docs/posts/page/2/index.html | 2 +- docs/posts/page/3/index.html | 2 +- docs/posts/page/4/index.html | 2 +- docs/posts/page/5/index.html | 2 +- docs/posts/page/6/index.html | 2 +- docs/posts/page/7/index.html | 2 +- docs/posts/page/8/index.html | 2 +- docs/sitemap.xml | 10 +- 25 files changed, 669 insertions(+), 541 deletions(-) diff --git a/content/posts/2021-07.md b/content/posts/2021-07.md index d70d59187..8d09d68de 100644 --- a/content/posts/2021-07.md +++ b/content/posts/2021-07.md @@ -181,4 +181,254 @@ $ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-0 - Start an indexing on AReS +## 2021-07-17 + +- I'm in Cyprus mostly offline, but I noticed that CGSpace was down + - I checked and there was a blank white page with HTTP 200 + - There are thousands of locks in PostgreSQL: + +```console +postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l +2302 +postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l +2564 +postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l +2530 +``` + +- The locks are held by XMLUI, not REST API or OAI: + +```console +postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n + 57 dspaceApi + 2671 dspaceWeb +``` + +- I ran all updates on the server (linode18) and restarted it, then DSpace came back up +- I sent a message to Atmire, as I never heard from them last week when we blocked access to the REST API for two days for them to investigate the server issues +- Clone the `openrxv-items-temp` index on AReS and re-run all the plugins, but most of the "dspace_add_missing_items" tasks failed so I will just run a full re-harvest +- The load on CGSpace is 45.00... the nginx access.log is going so fast I can't even read it + - I see lots of IPs from AS206485 that are changing their user agents (Linux, Windows, macOS...) 
+ - This is finegroupservers.com aka "UGB - UGB Hosting OU" + - I will get a list of their IP blocks from [ipinfo.app](https://asn.ipinfo.app/AS206485) and block them in nginx + - There is another group of IPs that are owned by an ISP called "TrafficTransitSolution LLC" that does not have its own ASN unfortunately + - "TrafficTransitSolution LLC" seems to be affiliated with AS206485 (UGB Hosting OU) anyways, but they sometimes use AS49453 Global Layer B.V.G also + - I found a tool that lets you grep a file by CIDR, so I can use that to purge hits from Solr eventually: + +```console +# grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n + 32 91.243.191.124 + 33 91.243.191.129 + 33 91.243.191.200 + 34 91.243.191.115 + 34 91.243.191.154 + 34 91.243.191.234 + 34 91.243.191.56 + 35 91.243.191.187 + 35 91.243.191.91 + 36 91.243.191.58 + 37 91.243.191.209 + 39 91.243.191.119 + 39 91.243.191.144 + 39 91.243.191.55 + 40 91.243.191.112 + 40 91.243.191.182 + 40 91.243.191.57 + 40 91.243.191.98 + 41 91.243.191.106 + 44 91.243.191.79 + 45 91.243.191.151 + 46 91.243.191.103 + 56 91.243.191.172 +``` + +- I found a few people complaining about these Russian attacks too: + - https://community.cloudflare.com/t/russian-ddos-completley-unmitigated-by-cloudflare/284578 + - https://vklader.com/ddos-2020-may/ +- According to AbuseIPDB.com and whois data provided by the asn tool, I see these organizations, networks, and ISPs all seem to be related: + - Sharktech + - LIR LLC / lir.am + - TrafficTransitSolution LLC / traffictransitsolution.us + - Fine Group Servers Solutions LLC / finegroupservers.com + - UGB + - Bulgakov Alexey Yurievich + - Dmitry Vorozhtsov / fitz-isp.uk / UGB + - Auction LLC / dauction.ru / UGB / traffictransitsolution.us + - Alax LLC / alaxona.com / finegroupservers.com + - Sysoev Aleksey Anatolevich / jobbuzzactiv.com / traffictransitsolution.us + - Bulgakov Alexey Yurievich / UGB / blockchainnetworksolutions.co.uk / info@finegroupservers.com + - UAB Rakrejus +- I looked in the nginx log and copied a few IP addresses that were suspicious + - First I looked them up in AbuseIPDB.com to get the ISP name and website + - Then I looked them up with the [asn](https://github.com/nitefood/asn) tool, ie: + +```console +$ ./asn -n 45.80.217.235 + +╭──────────────────────────────╮ +│ ASN lookup for 45.80.217.235 │ +╰──────────────────────────────╯ + + 45.80.217.235 ┌PTR - + ├ASN 46844 (ST-BGP, US) + ├ORG Sharktech + ├NET 45.80.217.0/24 (TrafficTransitSolutionNet) + ├ABU info@traffictransitsolution.us + ├ROA ✓ VALID (1 ROA found) + ├TYP Proxy host Hosting/DC + ├GEO Los Angeles, California (US) + └REP ✓ NONE +``` + +- Slowly slowly I manually built up a list of the IPs, ISP names, and network blocks, for example: + +```csv +IP, Organization, Website, Network +45.148.126.246, TrafficTransitSolution LLC, traffictransitsolution.us, 45.148.126.0/24 (Net-traffictransitsolution-15) +45.138.102.253, TrafficTransitSolution LLC, traffictransitsolution.us, 45.138.102.0/24 (Net-traffictransitsolution-11) +45.140.205.104, Bulgakov Alexey Yurievich, finegroupservers.com, 45.140.204.0/23 (CHINA_NETWORK) +185.68.247.63, Fine Group Servers Solutions LLC, finegroupservers.com, 185.68.247.0/24 (Net-finegroupservers-18) +213.232.123.188, Fine Group Servers Solutions LLC, finegroupservers.com, 213.232.123.0/24 (Net-finegroupservers-12) +45.80.217.235, TrafficTransitSolution LLC, traffictransitsolution.us, 45.80.217.0/24 (TrafficTransitSolutionNet) +185.81.144.202, Fine Group 
Servers Solutions LLC, finegroupservers.com, 185.81.144.0/24 (Net-finegroupservers-19) +109.106.22.114, TrafficTransitSolution LLC, traffictransitsolution.us, 109.106.22.0/24 (TrafficTransitSolutionNet) +185.68.247.200, Fine Group Servers Solutions LLC, finegroupservers.com, 185.68.247.0/24 (Net-finegroupservers-18) +45.80.105.252, Bulgakov Alexey Yurievich, finegroupservers.com, 45.80.104.0/23 (NET-BNSL2-1) +185.233.187.156, Dmitry Vorozhtsov, mgn-host.ru, 185.233.187.0/24 (GB-FITZISP-20181106) +185.88.100.75, TrafficTransitSolution LLC, traffictransitsolution.us, 185.88.100.0/24 (Net-traffictransitsolution-17) +194.104.8.154, TrafficTransitSolution LLC, traffictransitsolution.us, 194.104.8.0/24 (Net-traffictransitsolution-37) +185.102.112.46, Fine Group Servers Solutions LLC, finegroupservers.com, 185.102.112.0/24 (Net-finegroupservers-13) +212.193.12.64, Fine Group Servers Solutions LLC, finegroupservers.com, 212.193.12.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC) +91.243.191.129, Auction LLC, dauction.ru, 91.243.191.0/24 (TR-QN-20180917) +45.148.232.161, Nikolaeva Ekaterina Sergeevna, blockchainnetworksolutions.co.uk, 45.148.232.0/23 (LONDON_NETWORK) +147.78.181.191, TrafficTransitSolution LLC, traffictransitsolution.us, 147.78.181.0/24 (Net-traffictransitsolution-58) +77.83.27.90, Alax LLC, alaxona.com, 77.83.27.0/24 (FINEGROUPSERVERS-LEASE) +185.250.46.119, Dmitry Vorozhtsov, mgn-host.ru, 185.250.46.0/23 (GB-FITZISP-20181106) +94.231.219.106, LIR LLC, lir.am, 94.231.219.0/24 (CN-NET-219) +45.12.65.56, Sysoev Aleksey Anatolevich, jobbuzzactiv.com / traffictransitsolution.us, 45.12.65.0/24 (TrafficTransitSolutionNet) +45.140.206.31, Bulgakov Alexey Yurievich, blockchainnetworksolutions.co.uk / info@finegroupservers.com, 45.140.206.0/23 (FRANKFURT_NETWORK) +84.54.57.130, LIR LLC, lir.am / traffictransitsolution.us, 84.54.56.0/23 (CN-FTNET-5456) +178.20.214.235, Alaxona Internet Inc., alaxona.com / finegroupservers.com, 178.20.214.0/24 (FINEGROUPSERVERS-LEASE) +37.44.253.204, Atex LLC, atex.ru / blockchainnetworksolutions.co.uk, 37.44.252.0/23 (NL-FTN-44252) +46.161.61.242, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru, 46.161.61.0/24 (FineTransitDE) +194.87.113.141, Fine Group Servers Solutions LLC, finegroupservers.com, 194.87.113.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC) +109.94.223.217, LIR LLC, lir.am / traffictransitsolution.us, 109.94.223.0/24 (CN-NET-223) +94.231.217.115, LIR LLC, lir.am / traffictransitsolution.us, 94.231.217.0/24 (TR-NET-217) +146.185.202.214, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru / abuse@ripe.net, 146.185.202.0/24 (FineTransitRU) +194.58.68.110, Fine Group Servers Solutions LLC, finegroupservers.com, 194.58.68.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC) +94.154.131.237, TrafficTransitSolution LLC, traffictransitsolution.us, 94.154.131.0/24 (TrafficTransitSolutionNet) +193.202.8.245, Fine Group Servers Solutions LLC, finegroupservers.com, 193.202.8.0/21 (FTL5) +212.192.27.33, Fine Group Servers Solutions LLC, finegroupservers.com, 212.192.27.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC) +193.202.87.218, Fine Group Servers Solutions LLC, finegroupservers.com, 193.202.84.0/22 (FTEL-2) +146.185.200.52, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru / abuse@ripe.net, 146.185.200.0/24 (FineTransitRU) +194.104.11.11, TrafficTransitSolution LLC, traffictransitsolution.us, 194.104.11.0/24 (Net-traffictransitsolution-40) +185.50.250.145, ATOMOHOST LLC, atomohost.com, 185.50.250.0/24 (Silverstar_Invest_Limited) 
+37.9.46.68, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru / abuse@ripe.net / , 37.9.44.0/22 (QUALITYNETWORK) +185.81.145.14, Fine Group Servers Solutions LLC, finegroupservers.com, 185.81.145.0/24 (Net-finegroupservers-20) +5.183.255.72, TrafficTransitSolution LLC, traffictransitsolution.us, 5.183.255.0/24 (Net-traffictransitsolution-32) +84.54.58.204, LIR LLC, lir.am / traffictransitsolution.us, 84.54.58.0/24 (GB-BTTGROUP-20181119) +109.236.55.175, Mosnet LLC, mosnetworks.ru / info@traffictransitsolution.us, 109.236.55.0/24 (CN-NET-55) +5.133.123.184, Mosnet LLC, mosnet.ru / abuse@blockchainnetworksolutions.co.uk, 5.133.123.0/24 (DE-NET5133123) +5.181.168.90, Fine Group Servers Solutions LLC, finegroupservers.com, 5.181.168.0/24 (Net-finegroupservers-5) +185.61.217.86, TrafficTransitSolution LLC, traffictransitsolution.us, 185.61.217.0/24 (Net-traffictransitsolution-46) +217.145.227.84, TrafficTransitSolution LLC, traffictransitsolution.us, 217.145.227.0/24 (Net-traffictransitsolution-64) +193.56.75.29, Auction LLC, dauction.ru / abuse@blockchainnetworksolutions.co.uk, 193.56.75.0/24 (CN-NET-75) +45.132.184.212, TrafficTransitSolution LLC, traffictransitsolution.us, 45.132.184.0/24 (Net-traffictransitsolution-5) +45.10.167.239, TrafficTransitSolution LLC, traffictransitsolution.us, 45.10.167.0/24 (Net-traffictransitsolution-28) +109.94.222.106, Express Courier LLC, expcourier.ru / info@traffictransitsolution.us, 109.94.222.0/24 (IN-NET-222) +62.76.232.218, Fine Group Servers Solutions LLC, finegroupservers.com, 62.76.232.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC) +147.78.183.221, TrafficTransitSolution LLC, traffictransitsolution.us, 147.78.183.0/24 (Net-traffictransitsolution-60) +94.158.22.202, Auction LLC, dauction.ru / info@traffictransitsolution.us, 94.158.22.0/24 (FR-QN-20180917) +85.202.194.33, Mosnet LLC, mosnet.ru / info@traffictransitsolution.us, 85.202.194.0/24 (DE-QN-20180917) +193.187.93.150, Fine Group Servers Solutions LLC, finegroupservers.com, 193.187.92.0/22 (FTL3) +185.250.45.149, Dmitry Vorozhtsov, mgn-host.ru / abuse@fitz-isp.uk, 185.250.44.0/23 (GB-FITZISP-20181106) +185.50.251.75, ATOMOHOST LLC, atomohost.com, 185.50.251.0/24 (Silverstar_Invest_Limited) +5.183.254.117, TrafficTransitSolution LLC, traffictransitsolution.us, 5.183.254.0/24 (Net-traffictransitsolution-31) +45.132.186.187, TrafficTransitSolution LLC, traffictransitsolution.us, 45.132.186.0/24 (Net-traffictransitsolution-7) +83.171.252.105, Teleport LLC, teleport.az / abuse@blockchainnetworksolutions.co.uk, 83.171.252.0/23 (DE-FTNET-252) +45.148.127.37, TrafficTransitSolution LLC, traffictransitsolution.us, 45.148.127.0/24 (Net-traffictransitsolution-16) +194.87.115.133, Fine Group Servers Solutions LLC, finegroupservers.com, 194.87.115.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC) +193.233.250.100, OOO Freenet Group, free.net / abuse@vmage.ru, 193.233.250.0/24 (TrafficTransitSolutionNet) +194.87.116.246, Fine Group Servers Solutions LLC, finegroupservers.com, 194.87.116.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC) +195.133.25.244, Fine Group Servers Solutions LLC, finegroupservers.com, 195.133.25.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC) +77.220.194.159, Fine Group Servers Solutions LLC, finegroupservers.com, 77.220.194.0/24 (Net-finegroupservers-3) +185.89.101.177, ATOMOHOST LLC, atomohost.com, 185.89.100.0/23 (QUALITYNETWORK) +193.151.191.133, Alax LLC, alaxona.com / info@finegroupservers.com, 193.151.191.0/24 (FINEGROUPSERVERS-LEASE) +5.181.170.147, Fine Group Servers Solutions LLC, 
finegroupservers.com, 5.181.170.0/24 (Net-finegroupservers-7) +193.233.249.167, OOO Freenet Group, free.net / abuse@vmage.ru, 193.233.249.0/24 (TrafficTransitSolutionNet) +46.161.59.90, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru, 46.161.59.0/24 (FineTransitJP) +213.108.3.74, TrafficTransitSolution LLC, traffictransitsolution.us, 213.108.3.0/24 (Net-traffictransitsolution-24) +193.233.251.238, OOO Freenet Group, free.net / abuse@vmage.ru, 193.233.251.0/24 (TrafficTransitSolutionNet) +178.20.215.224, Alaxona Internet Inc., alaxona.com / info@finegroupservers.com, 178.20.215.0/24 (FINEGROUPSERVERS-LEASE) +45.159.22.199, Server LLC, ixserv.ru / info@finegroupservers.com, 45.159.22.0/24 (FINEGROUPSERVERS-LEASE) +109.236.53.244, Mosnet LLC, mosnet.ru, info@traffictransitsolution.us, 109.236.53.0/24 (TR-NET-53) +``` + +- I found a better way to get the ASNs using my `resolve-addresses-geoip2.py` script + - First, get a list of all IPs making requests to nginx today: + +```console +# grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq > /tmp/ips-sorted.txt +# wc -l /tmp/ips-sorted.txt +10776 /tmp/ips-sorted.txt +``` + +- Then resolve them all: + +```console: +$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips-sorted.txt -o /tmp/out.csv +``` + +- Then get the top 10 organizations and top ten ASNs: + +```console +$ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10 + 213 AMAZON-AES + 218 ASN-QUADRANET-GLOBAL + 246 Silverstar Invest Limited + 347 Ethiopian Telecommunication Corporation + 475 DEDIPATH-LLC + 504 AS-COLOCROSSING + 598 UAB Rakrejus + 814 UGB Hosting OU + 1010 ST-BGP + 1757 Global Layer B.V. +$ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10 + 213 14618 + 218 8100 + 246 35624 + 347 24757 + 475 35913 + 504 36352 + 598 62282 + 814 206485 + 1010 46844 + 1757 49453 +``` + +- I will download blocklists for all these except Ethiopian Telecom, Quadranet, and Amazon, though I'm concerned about Global Layer because it's a huge ASN that seems to have legit hosts too...? + +```console +$ wget https://asn.ipinfo.app/api/text/nginx/AS49453 +$ wget https://asn.ipinfo.app/api/text/nginx/AS46844 +$ wget https://asn.ipinfo.app/api/text/nginx/AS206485 +$ wget https://asn.ipinfo.app/api/text/nginx/AS62282 +$ wget https://asn.ipinfo.app/api/text/nginx/AS36352 +$ wget https://asn.ipinfo.app/api/text/nginx/AS35624 +$ cat AS* | sort | uniq > /tmp/abusive-networks.txt +$ wc -l /tmp/abusive-networks.txt +2276 /tmp/abusive-networks.txt +``` + +- Combining with my existing rules and filtering uniques: + +```console +$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l +2298 +``` + +- [According to Scamlytics all these are high risk ISPs](https://scamalytics.com/ip/isp) (as recently as 2021-06) so I will just keep blocking them +- I deployed the block list on CGSpace (linode18) and the load is down to 1.0 but I see there are still some DDoS IPs getting through... sigh +- The next thing I need to do is purge all the IPs from Solr using grepcidr... 
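+- A rough sketch of how that might look, assuming the downloaded blocklists are plain nginx `deny` rules and that `check-spider-ip-hits.sh` accepts a file of IPs with `-f` and purges with `-p` like its user-agent counterpart:
+
+```console
+# Strip the nginx deny syntax down to a plain list of CIDRs for grepcidr
+$ grep deny /tmp/abusive-networks.txt | sed -e 's/deny //' -e 's/;//' > /tmp/abusive-cidrs.txt
+# Find which client IPs in today's nginx log fall inside those networks
+$ grepcidr -f /tmp/abusive-cidrs.txt /var/log/nginx/access.log | awk '{print $1}' | sort -u > /tmp/ddos-ips.txt
+# Then purge their hits from the Solr statistics (flags assumed to match check-spider-hits.sh)
+$ ./ilri/check-spider-ip-hits.sh -f /tmp/ddos-ips.txt -p
+```
+
+- That only covers today's access.log, so the same would have to be repeated against the rotated logs for the earlier days of the attack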
+ diff --git a/docs/2021-06/index.html b/docs/2021-06/index.html index 4fb8a73d0..5764ddded 100644 --- a/docs/2021-06/index.html +++ b/docs/2021-06/index.html @@ -6,35 +6,29 @@ - - + - - + + - - + @@ -44,11 +38,11 @@ I simply started it and AReS was running again: { "@context": "http://schema.org", "@type": "BlogPosting", - "headline": "June, 2021", + "headline": "July, 2021", "url": "https://alanorth.github.io/cgspace-notes/2021-06/", - "wordCount": "3505", - "datePublished": "2021-06-01T10:51:07+03:00", - "dateModified": "2021-07-01T08:53:21+03:00", + "wordCount": "2439", + "datePublished": "2021-06-01T08:53:07+03:00", + "dateModified": "2021-07-11T15:59:16+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -61,7 +55,7 @@ I simply started it and AReS was running again: - June, 2021 | CGSpace Notes + July, 2021 | CGSpace Notes @@ -113,564 +107,448 @@ I simply started it and AReS was running again:
-

June, 2021

+

July, 2021

-

2021-06-01

+

2021-07-01

+
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
+COPY 20994
+

2021-07-04

- - -
$ docker-compose -f docker/docker-compose.yml start angular_nginx
+
$ cd OpenRXV
+$ docker-compose -f docker/docker-compose.yml down
+$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
+$ docker-compose -f docker/docker-compose.yml build
 
    -
  • Margarita from CCAFS emailed me to say that workflow alerts haven’t been working lately +
  • Then run all system updates and reboot the server
  • +
  • After the server came back up I cloned the openrxv-items-final index to openrxv-items-temp and started the plugins
      -
    • I guess this is related to the SMTP issues last week
    • -
    • I had fixed the config, but didn’t restart Tomcat so DSpace didn’t load the new variables
    • -
    • I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers
    • +
    • This will hopefully be faster than a full re-harvest…
    • +
    +
  • +
  • I opened a few GitHub issues for OpenRXV bugs: + +
  • +
  • Rebuild DSpace Test (linode26) from a fresh Ubuntu 20.04 image on Linode
  • +
  • The “start plugins” step on AReS had seventy-five errors from the dspace_add_missing_items plugin for some reason, so I had to start a fresh indexing
  • +
  • I noticed that the WorldFish data has dozens of incorrect countries so I should talk to Salem about that because they manage it +
      +
    • Also I noticed that we weren’t using the Country formatter in OpenRXV for the WorldFish country field, so some values don’t get mapped properly
    • +
    • I added some value mappings to fix some issues with WorldFish data and added a few more fields to the repository harvesting config and started a fresh re-indexing
-

2021-06-03

+

2021-07-05

    -
  • Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI’s content into the new AMCOW Knowledge Hub +
  • The AReS harvesting last night succeeded and I started the plugins
  • +
  • Margarita from CCAFS asked if we can create a new field for AICCRA publications
      -
    • At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time
    • -
    • That’s when I thought they could perhaps use the OpenSearch, but I can’t remember if it’s possible to limit by community, or only collection…
    • -
    • Looking now, I see there is a “scope” parameter that can be used for community or collection, for example:
    • +
    • I asked her to clarify what they want
    • +
    • AICCRA is an initiative so it might be better to create a new field for that, for example cg.contributor.initiative
-
https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&scope=10568/16814&order=DESC&rpp=100&sort_by=2&start=1
-
    -
  • That will sort by date issued (see: webui.itemlist.sort-option.2 in dspace.cfg), give 100 results per page, and start on item 1
  • -
  • Otherwise, another alternative would be to use the IWMI CSV that we are already exporting every week
  • -
  • Fill out the CGIAR-AGROVOC Task Group: Survey on the current CGIAR use of AGROVOC survey on behalf of CGSpace
  • -
-

2021-06-06

+

2021-07-06

    -
  • The Elasticsearch indexes are messed up so I dumped and re-created them correctly:
  • -
-
curl -XDELETE 'http://localhost:9200/openrxv-items-final'
-curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
-curl -XPUT 'http://localhost:9200/openrxv-items-final'
-curl -XPUT 'http://localhost:9200/openrxv-items-temp'
-curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
-elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
-elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
-
    -
  • Then I started a harvesting on AReS
  • -
-

2021-06-07

+
  • Atmire merged my spider user agent changes from last month so I will update the example list we use in DSpace and remove the new ones from my ilri override file
      -
    • The harvesting on AReS completed successfully
    • -
    • Provide feedback to FAO on how we use AGROVOC for their “AGROVOC call for use cases”
    • -
    -

    2021-06-10

    -
      -
    • Skype with Moayad to discuss AReS harvesting improvements -
        -
      • He will work on a plugin that reads the XML sitemap to get all item IDs and checks whether we have them or not
      • +
      • Also, I concatenated all our user agents into one file and purged all hits:
    -

    2021-06-14

    -
      -
    • Dump and re-create indexes on AReS (as above) so I can do a harvest
    • -
    -

    2021-06-16

    -
      -
    • Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot’s DNS -
        -
      • For example, user agent Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0) with DNS msnbot-131-253-25-91.search.msn.com.
      • -
      • I queried Solr for all hits using the MSN bot DNS (dns:*msnbot* AND dns:*.msn.com.) and found 457,706
      • -
      • I extracted their IPs using Solr’s CSV format and ran them through my resolve-addresses.py script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)
      • -
      • Note that Microsoft’s docs say that reverse lookups on Bingbot IPs will always have “search.msn.com” so it is safe to purge these as non-human traffic
      • -
      • I purged the hits with ilri/check-spider-ip-hits.sh (though I had to do it in 3 batches because I forgot to increase the facet.limit so I was only getting them 100 at a time)
      • -
      -
    • -
    • Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV -
        -
      • It will hopefully also fix the duplicate and missing items issues
      • -
      • I had a Skype with him to discuss
      • -
      • I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:
      • -
      -
    • -
    -
    $ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
    -
      -
    • The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it’s much faster -
        -
      • I harvested 90,000+ items from DSpace Test in ~3 hours
      • -
      • There seem to be some issues with the health check step though, as I see it is requesting one restricted item 600,000+ times…
      • -
      -
    • -
    -

    2021-06-17

    -
      -
    • I ported my ilri/resolve-addresses.py script that uses IPAPI.co to use the local GeoIP2 databases -
        -
      • The new script is ilri/resolve-addresses-geoip2.py and it is much faster and works offline with no API rate limits
      • -
      -
    • -
    • Teams meeting with the CGIAR Metadata Working group to discuss CGSpace and open repositories and the way forward
    • -
    • More work with Moayad on OpenRXV harvesting issues -
        -
      • Using a JSON export from elasticdump we debugged the duplicate checker plugin and found that there are indeed duplicates:
      • -
      -
    • -
    -
    $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
    -90459
    -$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
    -90380
    -$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
    -...
    -      2 "10568/99409"
    -      2 "10568/99410"
    -      2 "10568/99411"
    -      2 "10568/99516"
    -      3 "10568/102093"
    -      3 "10568/103524"
    -      3 "10568/106664"
    -      3 "10568/106940"
    -      3 "10568/107195"
    -      3 "10568/96546"
    -

    2021-06-20

    -
      -
    • Udana asked me to update their IWMI subjects from farmer managed irrigation systems to farmer-led irrigation -
        -
      • First I extracted the IWMI community from CGSpace:
      • -
      -
    • -
    -
    $ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
    -
      -
    • Then I used csvcut to extract just the columns I needed and do the replacement into a new CSV:
    • -
    -
    $ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv
    -
      -
    • Then I uploaded the resulting CSV to CGSpace, updating 161 items
    • -
    • Start a harvest on AReS
    • -
    • I found a bug and a patch for the private items showing up in the DSpace sitemap bug -
        -
      • The fix is super simple, I should try to apply it
      • -
      -
    • -
    -

    2021-06-21

    -
      -
    • The AReS harvesting finished, but the indexes got messed up again
    • -
    • I was looking at the JSON export I made yesterday and trying to understand the situation with duplicates -
        -
      • We have 90,000+ items, but only 85,000 unique:
      • -
      -
    • -
    -
    $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l
    -90937
    -$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort -u | wc -l
    -85709
    -
      -
    • So those could be duplicates from the way we harvest pages, but they could also be from mappings… -
        -
      • Manually inspecting the duplicates where handles appear more than once:
      • -
      -
    • -
    -
    $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h
    -
      -
    • Unfortunately I found no pattern: -
        -
      • Some appear twice in the Elasticsearch index, but appear in only one collection
      • -
      • Some appear twice in the Elasticsearch index, and appear in two collections
      • -
      • Some appear twice in the Elasticsearch index, but appear in three collections (!)
      • -
      -
    • -
    • So really we need to just check whether a handle exists before we insert it
    • -
    • I tested the pull request for DS-1977 that adjusts the sitemap generation code to exclude private items -
        -
      • It applies cleanly and seems to work, but we don’t actually have any private items
      • -
      • The issue we are having with AReS hitting restricted items in the sitemap is that the items have restricted metadata, not that they are private
      • -
      -
    • -
    • Testing the pull request for DS-4065 where the REST API’s /rest/items endpoint is not aware of private items and returns an incorrect number of items -
        -
      • This is most easily seen by setting a low limit in /rest/items, making one of the items private, and requesting items again with the same limit
      • -
      • I confirmed the issue on the current DSpace 6 Demo:
      • -
      -
    • -
    -
    $ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
    -5
    -$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
    -"10673/4"
    -"10673/3"
    -"10673/6"
    -"10673/5"
    -"10673/7"
    -# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
    -$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length       
    -4
    -$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle' 
    -"10673/4"
    -"10673/3"
    -"10673/5"
    -"10673/7"
    -
      -
    • I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira
    • -
    • Last week I noticed that the Gender Platform website is using “cgspace.cgiar.org” links for CGSpace, instead of handles -
        -
      • I emailed Fabio and Marianne to ask them to please use the Handle links
      • -
      -
    • -
    • I tested the pull request for DS-4271 where Discovery filters of type “contains” don’t work as expected when the user’s search term has spaces -
        -
      • I tested with filter “farmer managed irrigation systems” on DSpace Test
      • -
      • Before the patch I got 293 results, and the few I checked didn’t have the expected metadata value
      • -
      • After the patch I got 162 results, and all the items I checked had the exact metadata value I was expecting
      • -
      -
    • -
    • I tested a fresh harvest from my local AReS on DSpace Test with the DS-4065 REST API patch and here are my results: -
        -
      • 90459 in final from last harvesting
      • -
      • 90307 in temp after new harvest
      • -
      • 90327 in temp after start plugins
      • -
      -
    • -
    • The 90327 number seems closer to the “real” number of items on CGSpace… -
        -
      • Seems close, but not entirely correct yet:
      • -
      -
    • -
    -
    $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | wc -l
    -90327
    -$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
    -90317
    -

    2021-06-22

    -
      -
    • Make a pull request to the COUNTER-Robots project to add two new user agents: crusty and newspaper -
        -
      • These two bots have made ~3,000 requests on CGSpace
      • -
      • Then I added them to our local bot override in CGSpace (until the above pull request is merged) and ran my bot checking script:
      • -
      -
    • -
    -
    $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p   
    -Purging 1339 hits from RI\/1\.0 in statistics
    -Purging 447 hits from crusty in statistics
    -Purging 3736 hits from newspaper in statistics
    +
    $ ./ilri/check-spider-hits.sh -f /tmp/spiders -p
    +Purging 95 hits from Drupal in statistics
    +Purging 38 hits from DTS Agent in statistics
    +Purging 601 hits from Microsoft Office Existence Discovery in statistics
    +Purging 51 hits from Site24x7 in statistics
    +Purging 62 hits from Trello in statistics
    +Purging 13574 hits from WhatsApp in statistics
    +Purging 144 hits from FlipboardProxy in statistics
    +Purging 37 hits from LinkWalker in statistics
    +Purging 1 hits from [Ll]ink.?[Cc]heck.? in statistics
    +Purging 427 hits from WordPress in statistics
     
    -Total number of bot hits purged: 5522
    +Total number of bot hits purged: 15030
     
      -
    • Surprised to see RI/1.0 in there because it’s been in the override file for a while
    • -
    • Looking at the 2021 statistics in Solr I see a few more suspicious user agents: -
        -
      • PostmanRuntime/7.26.8
      • -
      • node-fetch/1.0 (+https://github.com/bitinn/node-fetch)
      • -
      • Photon/1.0
      • -
      • StatusCake_Pagespeed_indev
      • -
      • node-superagent/3.8.3
      • -
      • cortex/1.0
      • +
      • Meet with the CGIAR–AGROVOC task group to discuss how we want to do the workflow for submitting new terms to AGROVOC
      • +
      • I extracted another list of all subjects to check against AGROVOC:
      -
    • -
    • These bots account for ~42,000 hits in our statistics… I will just purge them and add them to our local override, but I can’t be bothered to submit them to COUNTER-Robots since I’d have to look up the information for each one
    • -
    • I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18)
    • -
    -

    2021-06-23

    -
      -
    • I woke up this morning to find CGSpace down -
        -
      • The logs show a high number of abandoned PostgreSQL connections and locks:
      • -
      -
    • -
    -
    # journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
    -978
    -$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    -10100
    +
    \COPY (SELECT DISTINCT(LOWER(text_value)) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-06-all-subjects.csv WITH CSV HEADER;
    +$ csvcut -c 1 /tmp/2021-07-06-all-subjects.csv | sed 1d > /tmp/2021-07-06-all-subjects.txt
    +$ ./ilri/agrovoc-lookup.py -i /tmp/2021-07-06-all-subjects.txt -o /tmp/2021-07-06-agrovoc-results-all-subjects.csv -d
     
      -
    • I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now
    • -
    • In the mean time, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)
    • -
    • I also applied the following patches from the 6.4 milestone to our 6_x-prod branch: +
    • Test Hrafn Malmquist’s proposed DBCP2 changes for DSpace 6.4 (DS-4574)
        -
      • DS-4065: resource policy aware REST API hibernate queries
      • -
      • DS-4271: Replaced brackets with double quotes in SolrServiceImpl
      • +
      • His changes reminded me that we can perhaps switch back to using this pooling instead of Tomcat 7’s JDBC pooling via JNDI
      • +
      • Tomcat 8 has DBCP2 built in, but we are stuck on Tomcat 7 for now
    • -
    • After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:
    • +
    • Looking into the database issues we had last month on 2021-06-23 +
        +
      • I think it might have been some kind of attack because the number of XMLUI sessions was through the roof at one point (10,000!), and the number of unique IPs accessing the server that day was much higher than on any other day:
      -
      $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
      -63
      +
    • +
    +
    # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/access.log.*.gz /var/log/nginx/library-access.log.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
    +2021-06-10
    +10693
    +2021-06-11
    +10587
    +2021-06-12
    +7958
    +2021-06-13
    +7681
    +2021-06-14
    +12639
    +2021-06-15
    +15388
    +2021-06-16
    +12245
    +2021-06-17
    +11187
    +2021-06-18
    +9684
    +2021-06-19
    +7835
    +2021-06-20
    +7198
    +2021-06-21
    +10380
    +2021-06-22
    +10255
    +2021-06-23
    +15878
    +2021-06-24
    +9963
    +2021-06-25
    +9439
    +2021-06-26
    +7930
     
      -
    • Looking in the DSpace log, the first “pool empty” message I saw this morning was at 4AM:
    • +
    • Similarly, the number of connections to the REST API was around the average of the preceding weeks:
    -
    2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
    +
    # for num in {10..26}; do echo "2021-06-$num"; zcat /var/log/nginx/rest.*.gz | grep "$num/Jun/2021" | awk '{print $1}' | sort | uniq | wc -l; done
    +2021-06-10
    +1183
    +2021-06-11
    +1074
    +2021-06-12
    +911
    +2021-06-13
    +892
    +2021-06-14
    +1320
    +2021-06-15
    +1257
    +2021-06-16
    +1208
    +2021-06-17
    +1119
    +2021-06-18
    +965
    +2021-06-19
    +985
    +2021-06-20
    +854
    +2021-06-21
    +1098
    +2021-06-22
    +1028
    +2021-06-23
    +1375
    +2021-06-24
    +1135
    +2021-06-25
    +969
    +2021-06-26
    +904
     
      -
    • Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:
    • +
    • According to goaccess, the traffic spike started at 2AM (remember that the first “Pool empty” error in dspace.log was at 4:01AM):
    -
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
    +
    # zcat /var/log/nginx/access.log.1[45].gz /var/log/nginx/library-access.log.1[45].gz | grep -E '23/Jun/2021' | goaccess --log-format=COMBINED -
     
    -

    2021-06-24

    -
      -
    • I deployed the new OpenRXV code on CGSpace but I’m having problems with the indexing, something about missing the mappings on the openrxv-items-temp index -
        -
      • I extracted the mappings from my local instance using elasticdump and after putting them on CGSpace I was able to harvest…
      • -
      • But still, there are way too many duplicates and I’m not sure what the actual number of items should be
      • -
      • According to the OAI ListRecords for each of our repositories, we should have about: -
          -
        • MELSpace: 9537
        • -
        • WorldFish: 4483
        • -
        • CGSpace: 91305
        • -
        • Total: 105325
        • -
        -
      • -
      • Looking at the last backup I have from harvesting before these changes we have 104,000 total handles, but only 99186 unique:
      • +
      • It works MUCH faster because it correctly identifies the missing handles in each repository
      • +
      • Also it adds better debug messages to the api logs
    -
    $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
    -104797
    -$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
    -99186
    +

    2021-07-08

    +
      +
    • Atmire plans to debug the database connection issues on CGSpace (linode18) today so they asked me to make the REST API inaccessible for today and tomorrow +
        +
      • I adjusted nginx to give an HTTP 403 as well as an error message to contact me
      • +
      +
    • +
    +

    2021-07-11

    +
      +
    • Start an indexing on AReS
    • +
    +

    2021-07-17

    +
      +
    • I’m in Cyprus mostly offline, but I noticed that CGSpace was down +
        +
      • I checked and there was a blank white page with HTTP 200
      • +
      • There are thousands of locks in PostgreSQL:
      • +
      +
    • +
    +
    postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +2302
    +postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +2564
    +postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +2530
     
      -
    • This number is probably unique for that particular harvest, but I don’t think it represents the true number of items…
    • -
    • The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:
    • +
    • The locks are held by XMLUI, not REST API or OAI:
    -
    $ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
    -90990
    +
    postgres@linode18:~$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
    +     57 dspaceApi
    +   2671 dspaceWeb
     
      -
    • So the harvest on the live site is missing items, then why didn’t the add missing items plugin find them?! +
    • I ran all updates on the server (linode18) and restarted it, then DSpace came back up
    • +
    • I sent a message to Atmire, as I never heard from them last week when we blocked access to the REST API for two days for them to investigate the server issues
    • +
    • Clone the openrxv-items-temp index on AReS and re-run all the plugins, but most of the “dspace_add_missing_items” tasks failed so I will just run a full re-harvest
    • +
    • The load on CGSpace is 45.00… the nginx access.log is going so fast I can’t even read it
        -
      • I notice that we are missing the type in the metadata structure config for each repository on the production site, and we are using type for item type in the actual schema… so maybe there is a conflict there
      • -
      • I will rename type to item_type and add it back to the metadata structure
      • -
      • The add missing items definitely checks this field…
      • -
      • I modified my local backup to add type: item and uploaded it to the temp index on production
      • -
      • Oh! nginx is blocking OpenRXV’s attempt to read the sitemap:
      • +
      • I see lots of IPs from AS206485 that are changing their user agents (Linux, Windows, macOS…)
      • +
      • This is finegroupservers.com aka “UGB - UGB Hosting OU”
      • +
      • I will get a list of their IP blocks from ipinfo.app and block them in nginx
      • +
      • There is another group of IPs that are owned by an ISP called “TrafficTransitSolution LLC” that does not have its own ASN unfortunately
      • +
      • “TrafficTransitSolution LLC” seems to be affiliated with AS206485 (UGB Hosting OU) anyways, but they sometimes use AS49453 Global Layer B.V.G also
      • +
      • I found a tool that lets you grep a file by CIDR, so I can use that to purge hits from Solr eventually:
    -
    172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
    +
    # grepcidr 91.243.191.0/24 /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -n
    +     32 91.243.191.124
    +     33 91.243.191.129
    +     33 91.243.191.200
    +     34 91.243.191.115
    +     34 91.243.191.154
    +     34 91.243.191.234
    +     34 91.243.191.56
    +     35 91.243.191.187
    +     35 91.243.191.91
    +     36 91.243.191.58
    +     37 91.243.191.209
    +     39 91.243.191.119
    +     39 91.243.191.144
    +     39 91.243.191.55
    +     40 91.243.191.112
    +     40 91.243.191.182
    +     40 91.243.191.57
    +     40 91.243.191.98
    +     41 91.243.191.106
    +     44 91.243.191.79
    +     45 91.243.191.151
    +     46 91.243.191.103
    +     56 91.243.191.172
     
      -
    • I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins… now it’s checking 180,000+ handles to see if they are collections or items… +
    • I found a few people complaining about these Russian attacks too:
    • -
    • According to the api logs we will be adding 5,697 items:
    • +
    • According to AbuseIPDB.com and whois data provided by the asn tool, I see these organizations, networks, and ISPs all seem to be related: +
        +
      • Sharktech
      • +
      • LIR LLC / lir.am
      • +
      • TrafficTransitSolution LLC / traffictransitsolution.us
      • +
      • Fine Group Servers Solutions LLC / finegroupservers.com
      • +
      • UGB
      • +
      • Bulgakov Alexey Yurievich
      • +
      • Dmitry Vorozhtsov / fitz-isp.uk / UGB
      • +
      • Auction LLC / dauction.ru / UGB / traffictransitsolution.us
      • +
      • Alax LLC / alaxona.com / finegroupservers.com
      • +
      • Sysoev Aleksey Anatolevich / jobbuzzactiv.com / traffictransitsolution.us
      • +
      • Bulgakov Alexey Yurievich / UGB / blockchainnetworksolutions.co.uk / info@finegroupservers.com
      • +
      • UAB Rakrejus
      -
      $ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
      -5697
      +
    • +
    • I looked in the nginx log and copied a few IP addresses that were suspicious +
        +
      • First I looked them up in AbuseIPDB.com to get the ISP name and website
      • +
      • Then I looked them up with the asn tool, ie:
      • +
      +
    • +
    +
    $ ./asn -n 45.80.217.235  
    +
    +╭──────────────────────────────╮
    +│ ASN lookup for 45.80.217.235 │
    +╰──────────────────────────────╯
    +
    + 45.80.217.235 ┌PTR -
    +               ├ASN 46844 (ST-BGP, US)
    +               ├ORG Sharktech
    +               ├NET 45.80.217.0/24 (TrafficTransitSolutionNet)
    +               ├ABU info@traffictransitsolution.us
    +               ├ROA ✓ VALID (1 ROA found)
    +               ├TYP  Proxy host   Hosting/DC 
    +               ├GEO Los Angeles, California (US)
    +               └REP ✓ NONE
     
      -
    • Spent a few hours with Moayad troubleshooting and improving OpenRXV -
        -
      • We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs
      • +
      • Slowly slowly I manually built up a list of the IPs, ISP names, and network blocks, for example:
      -
    • -
    -

    2021-06-25

    -
      -
    • The new OpenRXV code creates almost 200,000 jobs when the plugins start -
        -
      • I figured out how to use bee-queue/arena to view our Bull job queue
      • -
      • Also, we can see the jobs directly using redis-cli:
      • -
      -
    • -
    -
    $ redis-cli
    -127.0.0.1:6379> SCAN 0 COUNT 5
    -1) "49152"
    -2) 1) "bull:plugins:476595"
    -   2) "bull:plugins:367382"
    -   3) "bull:plugins:369228"
    -   4) "bull:plugins:438986"
    -   5) "bull:plugins:366215"
    +
    IP, Organization, Website, Network
    +45.148.126.246, TrafficTransitSolution LLC, traffictransitsolution.us, 45.148.126.0/24 (Net-traffictransitsolution-15)
    +45.138.102.253, TrafficTransitSolution LLC, traffictransitsolution.us, 45.138.102.0/24 (Net-traffictransitsolution-11)
    +45.140.205.104, Bulgakov Alexey Yurievich, finegroupservers.com, 45.140.204.0/23 (CHINA_NETWORK)
    +185.68.247.63, Fine Group Servers Solutions LLC, finegroupservers.com, 185.68.247.0/24 (Net-finegroupservers-18)
    +213.232.123.188, Fine Group Servers Solutions LLC, finegroupservers.com, 213.232.123.0/24 (Net-finegroupservers-12)
    +45.80.217.235, TrafficTransitSolution LLC, traffictransitsolution.us, 45.80.217.0/24 (TrafficTransitSolutionNet)
    +185.81.144.202, Fine Group Servers Solutions LLC, finegroupservers.com, 185.81.144.0/24 (Net-finegroupservers-19)
    +109.106.22.114, TrafficTransitSolution LLC, traffictransitsolution.us, 109.106.22.0/24 (TrafficTransitSolutionNet)
    +185.68.247.200, Fine Group Servers Solutions LLC, finegroupservers.com, 185.68.247.0/24 (Net-finegroupservers-18)
    +45.80.105.252, Bulgakov Alexey Yurievich, finegroupservers.com, 45.80.104.0/23 (NET-BNSL2-1)
    +185.233.187.156, Dmitry Vorozhtsov, mgn-host.ru, 185.233.187.0/24 (GB-FITZISP-20181106)
    +185.88.100.75, TrafficTransitSolution LLC, traffictransitsolution.us, 185.88.100.0/24 (Net-traffictransitsolution-17)
    +194.104.8.154, TrafficTransitSolution LLC, traffictransitsolution.us, 194.104.8.0/24 (Net-traffictransitsolution-37)
    +185.102.112.46, Fine Group Servers Solutions LLC, finegroupservers.com, 185.102.112.0/24 (Net-finegroupservers-13)
    +212.193.12.64, Fine Group Servers Solutions LLC, finegroupservers.com, 212.193.12.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
    +91.243.191.129, Auction LLC, dauction.ru, 91.243.191.0/24 (TR-QN-20180917)
    +45.148.232.161, Nikolaeva Ekaterina Sergeevna, blockchainnetworksolutions.co.uk, 45.148.232.0/23 (LONDON_NETWORK)
    +147.78.181.191, TrafficTransitSolution LLC, traffictransitsolution.us, 147.78.181.0/24 (Net-traffictransitsolution-58)
    +77.83.27.90, Alax LLC, alaxona.com, 77.83.27.0/24 (FINEGROUPSERVERS-LEASE)
    +185.250.46.119, Dmitry Vorozhtsov, mgn-host.ru, 185.250.46.0/23 (GB-FITZISP-20181106)
    +94.231.219.106, LIR LLC, lir.am, 94.231.219.0/24 (CN-NET-219)
    +45.12.65.56, Sysoev Aleksey Anatolevich, jobbuzzactiv.com / traffictransitsolution.us, 45.12.65.0/24 (TrafficTransitSolutionNet)
    +45.140.206.31, Bulgakov Alexey Yurievich, blockchainnetworksolutions.co.uk / info@finegroupservers.com, 45.140.206.0/23 (FRANKFURT_NETWORK)
    +84.54.57.130, LIR LLC, lir.am / traffictransitsolution.us, 84.54.56.0/23 (CN-FTNET-5456)
    +178.20.214.235, Alaxona Internet Inc., alaxona.com / finegroupservers.com, 178.20.214.0/24 (FINEGROUPSERVERS-LEASE)
    +37.44.253.204, Atex LLC, atex.ru / blockchainnetworksolutions.co.uk, 37.44.252.0/23 (NL-FTN-44252)
    +46.161.61.242, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru, 46.161.61.0/24 (FineTransitDE)
    +194.87.113.141, Fine Group Servers Solutions LLC, finegroupservers.com, 194.87.113.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
    +109.94.223.217, LIR LLC, lir.am / traffictransitsolution.us, 109.94.223.0/24 (CN-NET-223)
    +94.231.217.115, LIR LLC, lir.am / traffictransitsolution.us, 94.231.217.0/24 (TR-NET-217)
    +146.185.202.214, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru / abuse@ripe.net, 146.185.202.0/24 (FineTransitRU)
    +194.58.68.110, Fine Group Servers Solutions LLC, finegroupservers.com, 194.58.68.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
    +94.154.131.237, TrafficTransitSolution LLC, traffictransitsolution.us, 94.154.131.0/24 (TrafficTransitSolutionNet)
    +193.202.8.245, Fine Group Servers Solutions LLC, finegroupservers.com, 193.202.8.0/21 (FTL5)
    +212.192.27.33, Fine Group Servers Solutions LLC, finegroupservers.com, 212.192.27.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
    +193.202.87.218, Fine Group Servers Solutions LLC, finegroupservers.com, 193.202.84.0/22 (FTEL-2)
    +146.185.200.52, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru / abuse@ripe.net, 146.185.200.0/24 (FineTransitRU)
    +194.104.11.11, TrafficTransitSolution LLC, traffictransitsolution.us, 194.104.11.0/24 (Net-traffictransitsolution-40)
    +185.50.250.145, ATOMOHOST LLC, atomohost.com, 185.50.250.0/24 (Silverstar_Invest_Limited)
    +37.9.46.68, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru / abuse@ripe.net / , 37.9.44.0/22 (QUALITYNETWORK)
    +185.81.145.14, Fine Group Servers Solutions LLC, finegroupservers.com, 185.81.145.0/24 (Net-finegroupservers-20)
    +5.183.255.72, TrafficTransitSolution LLC, traffictransitsolution.us, 5.183.255.0/24 (Net-traffictransitsolution-32)
    +84.54.58.204, LIR LLC, lir.am / traffictransitsolution.us, 84.54.58.0/24 (GB-BTTGROUP-20181119)
    +109.236.55.175, Mosnet LLC, mosnetworks.ru / info@traffictransitsolution.us, 109.236.55.0/24 (CN-NET-55)
    +5.133.123.184, Mosnet LLC, mosnet.ru / abuse@blockchainnetworksolutions.co.uk, 5.133.123.0/24 (DE-NET5133123)
    +5.181.168.90, Fine Group Servers Solutions LLC, finegroupservers.com, 5.181.168.0/24 (Net-finegroupservers-5)
    +185.61.217.86, TrafficTransitSolution LLC, traffictransitsolution.us, 185.61.217.0/24 (Net-traffictransitsolution-46)
    +217.145.227.84, TrafficTransitSolution LLC, traffictransitsolution.us, 217.145.227.0/24 (Net-traffictransitsolution-64)
    +193.56.75.29, Auction LLC, dauction.ru / abuse@blockchainnetworksolutions.co.uk, 193.56.75.0/24 (CN-NET-75)
    +45.132.184.212, TrafficTransitSolution LLC, traffictransitsolution.us, 45.132.184.0/24 (Net-traffictransitsolution-5)
    +45.10.167.239, TrafficTransitSolution LLC, traffictransitsolution.us, 45.10.167.0/24 (Net-traffictransitsolution-28)
    +109.94.222.106, Express Courier LLC, expcourier.ru / info@traffictransitsolution.us, 109.94.222.0/24 (IN-NET-222)
    +62.76.232.218, Fine Group Servers Solutions LLC, finegroupservers.com, 62.76.232.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
    +147.78.183.221, TrafficTransitSolution LLC, traffictransitsolution.us, 147.78.183.0/24 (Net-traffictransitsolution-60)
    +94.158.22.202, Auction LLC, dauction.ru / info@traffictransitsolution.us, 94.158.22.0/24 (FR-QN-20180917)
    +85.202.194.33, Mosnet LLC, mosnet.ru / info@traffictransitsolution.us, 85.202.194.0/24 (DE-QN-20180917)
    +193.187.93.150, Fine Group Servers Solutions LLC, finegroupservers.com, 193.187.92.0/22 (FTL3)
    +185.250.45.149, Dmitry Vorozhtsov, mgn-host.ru / abuse@fitz-isp.uk, 185.250.44.0/23 (GB-FITZISP-20181106)
    +185.50.251.75, ATOMOHOST LLC, atomohost.com, 185.50.251.0/24 (Silverstar_Invest_Limited)
    +5.183.254.117, TrafficTransitSolution LLC, traffictransitsolution.us, 5.183.254.0/24 (Net-traffictransitsolution-31)
    +45.132.186.187, TrafficTransitSolution LLC, traffictransitsolution.us, 45.132.186.0/24 (Net-traffictransitsolution-7)
    +83.171.252.105, Teleport LLC, teleport.az / abuse@blockchainnetworksolutions.co.uk, 83.171.252.0/23 (DE-FTNET-252)
    +45.148.127.37, TrafficTransitSolution LLC, traffictransitsolution.us, 45.148.127.0/24 (Net-traffictransitsolution-16)
    +194.87.115.133, Fine Group Servers Solutions LLC, finegroupservers.com, 194.87.115.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
    +193.233.250.100, OOO Freenet Group, free.net / abuse@vmage.ru, 193.233.250.0/24 (TrafficTransitSolutionNet)
    +194.87.116.246, Fine Group Servers Solutions LLC, finegroupservers.com, 194.87.116.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
    +195.133.25.244, Fine Group Servers Solutions LLC, finegroupservers.com, 195.133.25.0/24 (FINE_GROUP_SERVERS_SOLUTIONS_LLC)
    +77.220.194.159, Fine Group Servers Solutions LLC, finegroupservers.com, 77.220.194.0/24 (Net-finegroupservers-3)
    +185.89.101.177, ATOMOHOST LLC, atomohost.com, 185.89.100.0/23 (QUALITYNETWORK)
    +193.151.191.133, Alax LLC, alaxona.com / info@finegroupservers.com, 193.151.191.0/24 (FINEGROUPSERVERS-LEASE)
    +5.181.170.147, Fine Group Servers Solutions LLC, finegroupservers.com, 5.181.170.0/24 (Net-finegroupservers-7)
    +193.233.249.167, OOO Freenet Group, free.net / abuse@vmage.ru, 193.233.249.0/24 (TrafficTransitSolutionNet)
    +46.161.59.90, Petersburg Internet Network Ltd., pinspb.ru / abusemail@depo40.ru, 46.161.59.0/24 (FineTransitJP)
    +213.108.3.74, TrafficTransitSolution LLC, traffictransitsolution.us, 213.108.3.0/24 (Net-traffictransitsolution-24)
    +193.233.251.238, OOO Freenet Group, free.net / abuse@vmage.ru, 193.233.251.0/24 (TrafficTransitSolutionNet)
    +178.20.215.224, Alaxona Internet Inc., alaxona.com / info@finegroupservers.com, 178.20.215.0/24 (FINEGROUPSERVERS-LEASE)
    +45.159.22.199, Server LLC, ixserv.ru / info@finegroupservers.com, 45.159.22.0/24 (FINEGROUPSERVERS-LEASE)
    +109.236.53.244, Mosnet LLC, mosnet.ru, info@traffictransitsolution.us, 109.236.53.0/24 (TR-NET-53)
     
+```
+
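+- One possible way to turn the Network column of a list like this into nginx deny rules (a rough sketch, assuming the CSV above is saved as /tmp/abusive-ips.csv and the CIDR is the first token of the fourth column):
+
+```console
+$ csvcut -c 4 /tmp/abusive-ips.csv | sed 1d | awk '{print "deny " $1 ";"}' | sort -u > /tmp/deny-networks.conf  # file names are hypothetical
+```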
+- I found a better way to get the ASNs using my resolve-addresses-geoip2.py script
+  - First, get a list of all IPs making requests to nginx today:
+
+```console
+# grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq > /tmp/ips-sorted.txt
    +# wc -l /tmp/ips-sorted.txt 
    +10776 /tmp/ips-sorted.txt
     
+```
+
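+- If the log happens to span more than one day, the same extraction could be limited to today's date first (a sketch, assuming the default combined log format with [17/Jul/2021:...] timestamps):
+
+```console
+# grep "$(date +%d/%b/%Y)" /var/log/nginx/access.log | grep -v -E "(mahider|Googlebot|Turnitin|Grammarly|Unpaywall|UptimeRobot|bot)" | awk '{print $1}' | sort | uniq > /tmp/ips-sorted.txt
+```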
+- Then resolve them all:
+
+```console
+$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips-sorted.txt -o /tmp/out.csv
     
+```
+
+- Then get the top 10 organizations and top ten ASNs:
+
+```console
+$ csvcut -c 2 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
    +    213 AMAZON-AES
    +    218 ASN-QUADRANET-GLOBAL
    +    246 Silverstar Invest Limited
    +    347 Ethiopian Telecommunication Corporation
    +    475 DEDIPATH-LLC
    +    504 AS-COLOCROSSING
    +    598 UAB Rakrejus
    +    814 UGB Hosting OU
    +   1010 ST-BGP
    +   1757 Global Layer B.V.
    +$ csvcut -c 3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
    +    213 14618
    +    218 8100
    +    246 35624
    +    347 24757
    +    475 35913
    +    504 36352
    +    598 62282
    +    814 206485
    +   1010 46844
    +   1757 49453
     
+```
+
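+- The organization names and ASNs can also be listed side by side, assuming the same column layout in /tmp/out.csv:
+
+```console
+$ csvcut -c 2,3 /tmp/out.csv | sed 1d | sort | uniq -c | sort -n | tail -n 10
+```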
+- I will download blocklists for all these except Ethiopian Telecom, Quadranet, and Amazon, though I'm concerned about Global Layer because it's a huge ASN that seems to have legit hosts too...?
+
+```console
+$ wget https://asn.ipinfo.app/api/text/nginx/AS49453
    +$ wget https://asn.ipinfo.app/api/text/nginx/AS46844
    +$ wget https://asn.ipinfo.app/api/text/nginx/AS206485
    +$ wget https://asn.ipinfo.app/api/text/nginx/AS62282
    +$ wget https://asn.ipinfo.app/api/text/nginx/AS36352
    +$ wget https://asn.ipinfo.app/api/text/nginx/AS35624
    +$ cat AS* | sort | uniq > /tmp/abusive-networks.txt
    +$ wc -l /tmp/abusive-networks.txt 
    +2276 /tmp/abusive-networks.txt
     
+```
+
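+- A rough sketch of how such a combined list of deny rules could be wired into nginx by hand (in practice this goes through the Ansible nginx template shown below, so the path here is only illustrative):
+
+```console
+# cp /tmp/abusive-networks.txt /etc/nginx/conf.d/abusive-networks.conf  # illustrative path, assuming conf.d is included in the http block
+# nginx -t && systemctl reload nginx
+```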
+- Combining with my existing rules and filtering uniques:
+
+```console
+$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq | wc -l
    +2298
     
+```
+
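+- The same pipeline, minus the count, could write the merged list out for review before it goes back into the template (the output file name is arbitrary):
+
+```console
+$ cat roles/dspace/templates/nginx/abusive-networks.conf.j2 /tmp/abusive-networks.txt | grep deny | sort | uniq > /tmp/abusive-networks-merged.conf
+```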
+- According to Scamalytics all these are high risk ISPs (as recently as 2021-06) so I will just keep blocking them
+- I deployed the block list on CGSpace (linode18) and the load is down to 1.0 but I see there are still some DDoS IPs getting through... sigh
+- The next thing I need to do is purge all the IPs from Solr using grepcidr...
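+- A sketch of how grepcidr could pull the offending IPs out of the nginx logs first, assuming the deny rules are converted back to bare CIDRs (the purge from Solr itself would still be a separate step):
+
+```console
+# sed -e 's/^deny //' -e 's/;$//' /tmp/abusive-networks.txt > /tmp/abusive-cidrs.txt
+# awk '{print $1}' /var/log/nginx/access.log | sort -u | grepcidr -f /tmp/abusive-cidrs.txt > /tmp/ips-to-purge.txt  # output file name is hypothetical
+```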