diff --git a/content/posts/2022-03.md b/content/posts/2022-03.md
index 4e1bdf986..2f5568d1b 100644
--- a/content/posts/2022-03.md
+++ b/content/posts/2022-03.md
@@ -146,5 +146,47 @@ $ csvjoin -c id /tmp/2022-03-22-tac-duplicates.csv /tmp/tac-filenames.csv > /tmp
 ```
 - I sent the resulting 76 items to Gaia to check
+- UptimeRobot said that CGSpace was down
+  - I looked and found many locks belonging to the REST API application:
+
+```console
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi)' | sort | uniq -c | sort -n
+    301 dspaceWeb
+   2390 dspaceApi
+```
+
+- Looking at nginx's logs, I found the top addresses making requests today:
+
+```console
+# awk '{print $1}' /var/log/nginx/rest.log | sort | uniq -c | sort -h
+   1977 45.5.184.2
+   3167 70.32.90.172
+   4754 54.195.118.125
+   5411 205.186.128.185
+   6826 137.184.159.211
+```
+
+- 137.184.159.211 is on DigitalOcean using this user agent: `GuzzleHttp/6.3.3 curl/7.81.0 PHP/7.4.28`
+  - I blocked this IP in nginx and the load went down immediately
+- 205.186.128.185 is on Media Temple, but it's OK because it's the CCAFS publications importer bot
+- 54.195.118.125 is on Amazon, but is also a CCAFS publications importer bot apparently (perhaps a test server)
+- 70.32.90.172 is on Media Temple and has no user agent
+- What is surprising to me is that we already have an nginx rule to return HTTP 403 for requests without a user agent
+  - I verified it works as expected with an empty user agent:
+
+```console
+$ curl -H User-Agent:'' 'https://dspacetest.cgiar.org/rest/handle/10568/34799?expand=all'
+Due to abuse we no longer permit requests without a user agent. Please specify a descriptive user agent, for example containing the word 'bot', if you are accessing the site programmatically. For more information see here: https://dspacetest.cgiar.org/page/about.
+```
+
+- I note that the nginx log shows '-' for a request with an empty user agent, which would be indistinguishable from a request with a '-', for example these were successful:
+
+```console
+70.32.90.172 - - [22/Mar/2022:11:59:10 +0100] "GET /rest/handle/10568/34374?expand=all HTTP/1.0" 200 10671 "-" "-"
+70.32.90.172 - - [22/Mar/2022:11:59:14 +0100] "GET /rest/handle/10568/34795?expand=all HTTP/1.0" 200 11394 "-" "-"
+```
+
+- I can only assume that these requests used a literal '-' so I will have to add an nginx rule to block those too
+- Otherwise, I see from my notes that 70.32.90.172 is the wle.cgiar.org REST API harvester... I should ask Macaroni Bros about that
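
Before adding a rule for a literal '-' user agent, it may help to see which clients it would affect. The following is a minimal sketch, not part of the patch above: it assumes the log uses nginx's "combined" layout, as the two sample lines suggest, so that the user agent is the sixth field when a line is split on double quotes. The function name `count_dash_ua` is hypothetical. Because nginx logs an empty user agent as '-' as well, the counts cover both cases.

```shell
#!/bin/sh
# Hypothetical helper: count requests per client address whose logged
# user agent is exactly "-".
# Assumption: nginx "combined" log format, so splitting a line on double
# quotes gives $2 = request line, $4 = referer, $6 = user agent.
count_dash_ua() {
    # $1 of the quote-split line starts with the client address, so split
    # it again on spaces and keep the first piece.
    awk -F'"' '$6 == "-" { split($1, a, " "); print a[1] }' "$1" \
        | sort | uniq -c | sort -n
}
```

Running `count_dash_ua /var/log/nginx/rest.log` would list each such address with its request count, in the same shape as the per-address totals in the notes.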