diff --git a/content/posts/2020-03.md b/content/posts/2020-03.md
index 096bf495d..3974ebc86 100644
--- a/content/posts/2020-03.md
+++ b/content/posts/2020-03.md
@@ -86,4 +86,121 @@ $ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/stati
# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r
```
+## 2020-03-09
+
+- Peter noticed that the Solr stats were not showing anything before 2020
+ - I had to restart Tomcat three times before all cores loaded properly...
+
+## 2020-03-10
+
+- Fix some logic issues in the nginx config
+ - Use generic blocking of `[Bb]ot` and `[Cc]rawl` and `[Ss]pider` in the "badbots" rate limiting logic instead of trying to list them all one by one (bots should not be trying to index dynamic pages *no matter what* so we punish hard here)
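+ - We were not properly forwarding the remote IP address to Tomcat in all nginx location blocks, so hits to some locations were logged as coming from 127.0.0.1 (nginx does not inherit header directives such as `proxy_set_header` from the enclosing level once a location block sets any of its own, so we need to explicitly include the global proxy params in every location block that sets other headers)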
+ - Unfortunately this affected the REST API and there are a few hundred thousand requests from this user agent:
+
+```
+Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)
+```
+
+- It seems to only be a problem in the last week:
+
+```
+# zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}
+/var/log/nginx/rest.log.1:0
+/var/log/nginx/rest.log.2:0
+/var/log/nginx/rest.log.3:0
+/var/log/nginx/rest.log.4:3625
+/var/log/nginx/rest.log.5:27458
+/var/log/nginx/rest.log.6:0
+/var/log/nginx/rest.log.7:0
+/var/log/nginx/rest.log.8:0
+/var/log/nginx/rest.log.9:0
+```
+
+- In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on DigitalOcean
+- I will purge them from Solr statistics:
+
+```
+$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'
+```
+
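The delete-by-query body used above can be generated for any user agent string. This is only a minimal sketch: the function name `solr_delete_ua_payload` is hypothetical (it is not part of DSpace or Solr), and it assumes that backslash-escaping quotes and XML-escaping `&`, `<` and `>` is enough for the user agents in question.

```shell
# Hypothetical helper (an assumption, not part of DSpace or Solr): print the
# XML body of a Solr delete-by-query matching one exact userAgent value.
solr_delete_ua_payload() {
    local escaped
    # First escape backslashes and double quotes for the quoted Solr phrase,
    # then escape the characters that are special inside XML element content.
    escaped=$(printf '%s' "$1" | sed \
        -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
        -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')
    printf '<delete><query>userAgent:"%s"</query></delete>\n' "$escaped"
}

# Possible usage (not executed here), sending the body to the same endpoint:
# solr_delete_ua_payload "$ua" | curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary @-
```

Building the body separately makes it easy to inspect what will be deleted before anything is sent to Solr.
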
+- Another user agent that seems to be a bot is:
+
+```
+Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
+```
+
+- In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx's logs I see it belongs to three IPs on Online.net in France:
+
+```
+# zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
+ 63090 163.172.68.99
+ 183428 163.172.70.248
+ 147608 163.172.71.24
+```
+
+- It is making 10,000 to 40,000 requests to XMLUI per day...
+
+```
+# zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
+/var/log/nginx/access.log.30.gz:18687
+/var/log/nginx/access.log.31.gz:28936
+/var/log/nginx/access.log.32.gz:36402
+/var/log/nginx/access.log.33.gz:38886
+/var/log/nginx/access.log.34.gz:30607
+/var/log/nginx/access.log.35.gz:19040
+/var/log/nginx/access.log.36.gz:10780
+/var/log/nginx/access.log.37.gz:5808
+/var/log/nginx/access.log.38.gz:3100
+/var/log/nginx/access.log.39.gz:1485
+/var/log/nginx/access.log.3.gz:2898
+/var/log/nginx/access.log.40.gz:373
+/var/log/nginx/access.log.41.gz:3909
+/var/log/nginx/access.log.42.gz:4729
+/var/log/nginx/access.log.43.gz:3906
+```
+
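The per-IP tally for a user agent can be wrapped in a small function. A sketch under stated assumptions: `count_hits_by_ip` is a hypothetical name, and the client IP is assumed to be the first whitespace-separated field of each log line, as in the `awk '{print $1}'` pipeline earlier. It uses `grep -F` so the user agent is matched as a fixed string and a `.` only matches a literal dot.

```shell
# Hypothetical helper: read nginx access log lines on stdin and print
# "count ip" for every client IP whose line contains the exact user agent
# given as the first argument. Output is sorted by IP (as text) so that it
# is stable between runs.
count_hits_by_ip() {
    grep -F -- "$1" \
        | awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' \
        | sort -k2,2
}

# Possible usage (not executed here):
# zcat /var/log/nginx/access.log.*.gz | count_hits_by_ip 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)'
```
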
+- I will purge those hits too!
+
+```
+$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'
+```
+
+- Shit, something happened and a few thousand hits from user agents containing "Bot" got through
+ - I need to re-run the `check-spider-hits.sh` script with the standard COUNTER-Robots list again, but with my own case variants of a few patterns (like `[Bb]ot`) added, because the script/Solr doesn't support case-insensitive regular expressions:
+
+```
+$ ./check-spider-hits.sh -f /tmp/bots -d -p
+(DEBUG) Using spiders pattern file: /tmp/bots
+(DEBUG) Checking for hits from spider: Citoid
+Purging 11 hits from Citoid in statistics
+(DEBUG) Checking for hits from spider: ecointernet
+Purging 375 hits from ecointernet in statistics
+(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
+Purging 1 hits from ^Pattern\/[0-9] in statistics
+(DEBUG) Checking for hits from spider: sqlmap
+(DEBUG) Checking for hits from spider: Typhoeus
+Purging 6 hits from Typhoeus in statistics
+(DEBUG) Checking for hits from spider: 7siters
+(DEBUG) Checking for hits from spider: Apache-HttpClient
+Purging 3178 hits from Apache-HttpClient in statistics
+
+Total number of bot hits purged: 3571
+
+$ ./check-spider-hits.sh -f /tmp/bots -d -p
+(DEBUG) Using spiders pattern file: /tmp/bots
+(DEBUG) Checking for hits from spider: [Bb]ot
+Purging 8317 hits from [Bb]ot in statistics
+(DEBUG) Checking for hits from spider: [Cc]rawl
+Purging 1314 hits from [Cc]rawl in statistics
+(DEBUG) Checking for hits from spider: [Ss]pider
+Purging 62 hits from [Ss]pider in statistics
+(DEBUG) Checking for hits from spider: Citoid
+(DEBUG) Checking for hits from spider: ecointernet
+(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
+(DEBUG) Checking for hits from spider: sqlmap
+(DEBUG) Checking for hits from spider: Typhoeus
+(DEBUG) Checking for hits from spider: 7siters
+(DEBUG) Checking for hits from spider: Apache-HttpClient
+```
+
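The `[Bb]ot`-style variants can be generated rather than typed by hand. A minimal sketch, assuming only the first letter needs to match in either case (as in the patterns above); `bracket_first_letter` is a hypothetical helper and not part of `check-spider-hits.sh`. A fully case-insensitive pattern would need every letter bracketed.

```shell
# Hypothetical helper: turn a plain word into a pattern whose first letter
# matches in upper or lower case, for regex engines that lack a
# case-insensitive flag. Words that start with a non-letter simply get that
# character repeated inside the brackets, which is harmless.
bracket_first_letter() {
    local word=$1 first rest upper lower
    first=$(printf '%s' "$word" | cut -c1)
    rest=$(printf '%s' "$word" | cut -c2-)
    upper=$(printf '%s' "$first" | tr '[:lower:]' '[:upper:]')
    lower=$(printf '%s' "$first" | tr '[:upper:]' '[:lower:]')
    printf '[%s%s]%s\n' "$upper" "$lower" "$rest"
}

# Possible usage (not executed here), appending to the pattern file:
# for w in bot crawl spider; do bracket_first_letter "$w"; done >> /tmp/bots
```
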
diff --git a/docs/2020-03/index.html b/docs/2020-03/index.html
index 5777441b9..2b7012c48 100644
--- a/docs/2020-03/index.html
+++ b/docs/2020-03/index.html
@@ -22,7 +22,7 @@ You need to download this into the DSpace 6.x source and compile it
-
+
@@ -49,9 +49,9 @@ You need to download this into the DSpace 6.x source and compile it
"@type": "BlogPosting",
"headline": "March, 2020",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-03\/",
- "wordCount": "480",
+ "wordCount": "1102",
"datePublished": "2020-03-02T12:31:30+02:00",
- "dateModified": "2020-03-08T14:28:39+02:00",
+ "dateModified": "2020-03-08T15:53:34+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -211,6 +211,116 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
# pg_upgradecluster 9.6 main
# pg_dropcluster 9.6 main
# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r
+
+2020-03-09
+
+- Peter noticed that the Solr stats were not showing anything before 2020
+
+- I had to restart Tomcat three times before all cores loaded properly…
+
+
+
+2020-03-10
+
+- Fix some logic issues in the nginx config
+
+- Use generic blocking of [Bb]ot and [Cc]rawl and [Ss]pider in the “badbots” rate limiting logic instead of trying to list them all one by one (bots should not be trying to index dynamic pages no matter what so we punish hard here)
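+- We were not properly forwarding the remote IP address to Tomcat in all nginx location blocks, so hits to some locations were logged as coming from 127.0.0.1 (nginx does not inherit header directives such as proxy_set_header from the enclosing level once a location block sets any of its own, so we need to explicitly include the global proxy params in every location block that sets other headers)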
+- Unfortunately this affected the REST API and there are a few hundred thousand requests from this user agent:
+
+
+
+Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)
+
+- It seems to only be a problem in the last week:
+
+# zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}
+/var/log/nginx/rest.log.1:0
+/var/log/nginx/rest.log.2:0
+/var/log/nginx/rest.log.3:0
+/var/log/nginx/rest.log.4:3625
+/var/log/nginx/rest.log.5:27458
+/var/log/nginx/rest.log.6:0
+/var/log/nginx/rest.log.7:0
+/var/log/nginx/rest.log.8:0
+/var/log/nginx/rest.log.9:0
+
+- In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on DigitalOcean
+- I will purge them from Solr statistics:
+
+$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'
+
+- Another user agent that seems to be a bot is:
+
+Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
+
+- In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx’s logs I see it belongs to three IPs on Online.net in France:
+
+# zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
+ 63090 163.172.68.99
+ 183428 163.172.70.248
+ 147608 163.172.71.24
+
+- It is making 10,000 to 40,000 requests to XMLUI per day…
+
+# zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
+/var/log/nginx/access.log.30.gz:18687
+/var/log/nginx/access.log.31.gz:28936
+/var/log/nginx/access.log.32.gz:36402
+/var/log/nginx/access.log.33.gz:38886
+/var/log/nginx/access.log.34.gz:30607
+/var/log/nginx/access.log.35.gz:19040
+/var/log/nginx/access.log.36.gz:10780
+/var/log/nginx/access.log.37.gz:5808
+/var/log/nginx/access.log.38.gz:3100
+/var/log/nginx/access.log.39.gz:1485
+/var/log/nginx/access.log.3.gz:2898
+/var/log/nginx/access.log.40.gz:373
+/var/log/nginx/access.log.41.gz:3909
+/var/log/nginx/access.log.42.gz:4729
+/var/log/nginx/access.log.43.gz:3906
+
+- I will purge those hits too!
+
+$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'
+
+- Shit, something happened and a few thousand hits from user agents containing “Bot” got through
+
+- I need to re-run the check-spider-hits.sh script with the standard COUNTER-Robots list again, but with my own case variants of a few patterns (like [Bb]ot) added, because the script/Solr doesn’t support case-insensitive regular expressions:
+
+
+
+$ ./check-spider-hits.sh -f /tmp/bots -d -p
+(DEBUG) Using spiders pattern file: /tmp/bots
+(DEBUG) Checking for hits from spider: Citoid
+Purging 11 hits from Citoid in statistics
+(DEBUG) Checking for hits from spider: ecointernet
+Purging 375 hits from ecointernet in statistics
+(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
+Purging 1 hits from ^Pattern\/[0-9] in statistics
+(DEBUG) Checking for hits from spider: sqlmap
+(DEBUG) Checking for hits from spider: Typhoeus
+Purging 6 hits from Typhoeus in statistics
+(DEBUG) Checking for hits from spider: 7siters
+(DEBUG) Checking for hits from spider: Apache-HttpClient
+Purging 3178 hits from Apache-HttpClient in statistics
+
+Total number of bot hits purged: 3571
+
+$ ./check-spider-hits.sh -f /tmp/bots -d -p
+(DEBUG) Using spiders pattern file: /tmp/bots
+(DEBUG) Checking for hits from spider: [Bb]ot
+Purging 8317 hits from [Bb]ot in statistics
+(DEBUG) Checking for hits from spider: [Cc]rawl
+Purging 1314 hits from [Cc]rawl in statistics
+(DEBUG) Checking for hits from spider: [Ss]pider
+Purging 62 hits from [Ss]pider in statistics
+(DEBUG) Checking for hits from spider: Citoid
+(DEBUG) Checking for hits from spider: ecointernet
+(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
+(DEBUG) Checking for hits from spider: sqlmap
+(DEBUG) Checking for hits from spider: Typhoeus
+(DEBUG) Checking for hits from spider: 7siters
+(DEBUG) Checking for hits from spider: Apache-HttpClient
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 307ad34e4..222869ee5 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
https://alanorth.github.io/cgspace-notes/categories/
- 2020-03-08T14:28:39+02:00
+ 2020-03-08T15:53:34+02:00
https://alanorth.github.io/cgspace-notes/
- 2020-03-08T14:28:39+02:00
+ 2020-03-08T15:53:34+02:00
https://alanorth.github.io/cgspace-notes/2020-03/
- 2020-03-08T14:28:39+02:00
+ 2020-03-08T15:53:34+02:00
https://alanorth.github.io/cgspace-notes/categories/notes/
- 2020-03-08T14:28:39+02:00
+ 2020-03-08T15:53:34+02:00
https://alanorth.github.io/cgspace-notes/posts/
- 2020-03-08T14:28:39+02:00
+ 2020-03-08T15:53:34+02:00