diff --git a/content/posts/2019-05.md b/content/posts/2019-05.md index a6a2653ca..b253c258d 100644 --- a/content/posts/2019-05.md +++ b/content/posts/2019-05.md @@ -322,11 +322,21 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata - So this was definitely an attack of some sort... only God knows why - I noticed a few new bots that don't use the word "bot" in their user agent and therefore don't match Tomcat's Crawler Session Manager Valve: - `Blackboard Safeassign` + - `Unpaywall` ## 2019-05-12 +- I see that the Unpaywall bot is resonsible for a few thousand XMLUI sessions every day (IP addresses come from nginx access.log): + +``` +$ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l +2206 +``` + +- I added "Unpaywall" to the list of bots in the Tomcat Crawler Session Manager Valve - Set up nginx to use TLS and proxy pass to NodeJS on the AReS development server (linode20) - Run all system updates on linode20 and reboot it - Also, there is 10 to 20% CPU steal on that VM, so I will ask Linode to move it to another host +- Commit changes to the `resolve-addresses.py` script to add proper CSV output support diff --git a/docs/2019-05/index.html b/docs/2019-05/index.html index 4e8d814d2..596535604 100644 --- a/docs/2019-05/index.html +++ b/docs/2019-05/index.html @@ -28,7 +28,7 @@ But after this I tried to delete the item from the XMLUI and it is still present - + @@ -61,9 +61,9 @@ But after this I tried to delete the item from the XMLUI and it is still present "@type": "BlogPosting", "headline": "May, 2019", "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-05\/", - "wordCount": "2254", + "wordCount": "2323", "datePublished": "2019-05-01T07:37:43\x2b03:00", - "dateModified": "2019-05-10T17:27:11\x2b03:00", + "dateModified": "2019-05-12T10:39:10\x2b03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -518,15 +518,28 @@ UPDATE metadatavalue SET text_lang='es_ES' WHERE resource_type_id=2 AND metadata
Blackboard Safeassign
Unpaywall
I see that the Unpaywall bot is resonsible for a few thousand XMLUI sessions every day (IP addresses come from nginx access.log):
+ +$ cat dspace.log.2019-05-11 | grep -E 'ip_addr=(100.26.206.188|100.27.19.233|107.22.98.199|174.129.156.41|18.205.243.110|18.205.245.200|18.207.176.164|18.207.209.186|18.212.126.89|18.212.5.59|18.213.4.150|18.232.120.6|18.234.180.224|18.234.81.13|3.208.23.222|34.201.121.183|34.201.241.214|34.201.39.122|34.203.188.39|34.207.197.154|34.207.232.63|34.207.91.147|34.224.86.47|34.227.205.181|34.228.220.218|34.229.223.120|35.171.160.166|35.175.175.202|3.80.201.39|3.81.120.70|3.81.43.53|3.84.152.19|3.85.113.253|3.85.237.139|3.85.56.100|3.87.23.95|3.87.248.240|3.87.250.3|3.87.62.129|3.88.13.9|3.88.57.237|3.89.71.15|3.90.17.242|3.90.68.247|3.91.44.91|3.92.138.47|3.94.250.180|52.200.78.128|52.201.223.200|52.90.114.186|52.90.48.73|54.145.91.243|54.160.246.228|54.165.66.180|54.166.219.216|54.166.238.172|54.167.89.152|54.174.94.223|54.196.18.211|54.198.234.175|54.208.8.172|54.224.146.147|54.234.169.91|54.235.29.216|54.237.196.147|54.242.68.231|54.82.6.96|54.87.12.181|54.89.217.141|54.89.234.182|54.90.81.216|54.91.104.162)' | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
+2206
+
I added “Unpaywall” to the list of bots in the Tomcat Crawler Session Manager Valve
Set up nginx to use TLS and proxy pass to NodeJS on the AReS development server (linode20)
Run all system updates on linode20 and reboot it
Also, there is 10 to 20% CPU steal on that VM, so I will ask Linode to move it to another host
Commit changes to the resolve-addresses.py
script to add proper CSV output support