Update notes for 2019-03-26

Alan Orth 2019-03-26 19:41:33 +02:00
parent 63ed2d91bb
commit 9f7556a803
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
4 changed files with 111 additions and 14 deletions


@ -872,4 +872,50 @@ $ ./fix-metadata-values.py -i /tmp/2019-03-26-AGROVOC-89-corrections.csv -db dsp
$ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
```
- UptimeRobot says CGSpace is down again, but it seems to just be slow, as the load is over 10.0
- Looking at the nginx logs I don't see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:
```
# grep SemrushBot /var/log/nginx/access.log | grep -E "26/Mar/2019" | grep -E '(discover|browse)' | wc -l
2931
```
- So I'm adding it to the badbot rate limiting in nginx. I'm also tempted to block all user agents with "bot" in the name for a few days to see if things calm down, but maybe not just yet
- Otherwise, these are the top users in the web and API logs over the last hour or so (hours 18 and 19):
```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
54 41.216.228.158
65 199.47.87.140
75 157.55.39.238
77 157.55.39.237
89 157.55.39.236
100 18.196.196.108
128 18.195.78.144
277 2a01:4f8:13b:1296::2
291 66.249.66.80
328 35.174.184.209
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E "26/Mar/2019:(18|19)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
2 2409:4066:211:2caf:3c31:3fae:2212:19cc
2 35.10.204.140
2 45.251.231.45
2 95.108.181.88
2 95.137.190.2
3 104.198.9.108
3 107.167.109.88
6 66.249.66.80
13 41.89.230.156
1860 45.5.184.2
```
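The two pipelines above count requests per client for a given day and set of hours. A minimal Python sketch of the same idea follows, assuming the logs use nginx's default "combined" format with the client address as the first field. The function name `top_clients` and the sample lines are made up for illustration (the addresses are from the reserved documentation ranges), and unlike the `grep` above it only matches the date inside the timestamp field and skips lines that do not parse, such as error log entries.

```python
import re
from collections import Counter

# Client address at the start of the line, then the day and hour from the
# bracketed timestamp, e.g. [26/Mar/2019:18:05:01 +0200]
LINE_RE = re.compile(r'^(\S+) .*?\[(\d{2}/\w{3}/\d{4}):(\d{2}):')


def top_clients(lines, day, hours, n=10):
    """Return the n busiest client addresses for the given day and hours."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(2) == day and m.group(3) in hours:
            counts[m.group(1)] += 1
    return counts.most_common(n)


# Synthetic sample lines, not taken from the real logs
sample = [
    '192.0.2.1 - - [26/Mar/2019:18:05:01 +0200] "GET /rest/items HTTP/1.1" 200 512 "-" "-"',
    '192.0.2.1 - - [26/Mar/2019:19:10:44 +0200] "GET /rest/items HTTP/1.1" 200 512 "-" "-"',
    '198.51.100.7 - - [26/Mar/2019:18:30:12 +0200] "GET /oai/request HTTP/1.1" 200 128 "-" "-"',
    '198.51.100.7 - - [26/Mar/2019:17:59:59 +0200] "GET /oai/request HTTP/1.1" 200 128 "-" "-"',
]

print(top_clients(sample, "26/Mar/2019", {"18", "19"}))
```

The result is ordered busiest first, whereas `sort -n | tail` lists the busiest client last.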
- For the XMLUI I see `18.195.78.144` and `18.196.196.108` requesting only CTA items and with no user agent
- They are responsible for almost 1,000 XMLUI sessions today:
```
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18\.195\.78\.144|18\.196\.196\.108)' dspace.log.2019-03-26 | sort -u | wc -l
937
```
- I will add their IPs to the list of bot IPs in nginx so I can tag them as bots, which lets Tomcat's Crawler Session Manager Valve force them to re-use their session
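
The session count above can also be broken down per address. This is a rough Python sketch, assuming DSpace log lines contain `session_id=<32 characters>:ip_addr=<IPv4 address>` as in the `grep` pattern. The function name `sessions_per_ip`, the sample lines, and the handles in them are hypothetical, and the pattern only handles IPv4 addresses.

```python
import re
from collections import defaultdict

# Same shape as the grep pattern, but capturing the session ID and an IPv4 address
SESSION_RE = re.compile(r'session_id=([A-Z0-9]{32}):ip_addr=(\d{1,3}(?:\.\d{1,3}){3})')


def sessions_per_ip(lines, ips):
    """Count distinct session IDs seen for each address in ips."""
    seen = defaultdict(set)
    for line in lines:
        for session_id, ip in SESSION_RE.findall(line):
            if ip in ips:
                seen[ip].add(session_id)
    return {ip: len(ids) for ip, ids in seen.items()}


def fake_line(session_char, ip):
    """Build a synthetic log line; real DSpace lines carry more fields."""
    return ("2019-03-26 18:01:02,003 INFO example @ anonymous:session_id="
            + session_char * 32 + ":ip_addr=" + ip + ":view_item:handle=10568/1")


sample = [
    fake_line("A", "192.0.2.10"),
    fake_line("A", "192.0.2.10"),   # same session again, counted once
    fake_line("B", "192.0.2.10"),
    fake_line("C", "198.51.100.7"),
    fake_line("D", "203.0.113.5"),  # not in the watch list, ignored
]

print(sessions_per_ip(sample, {"192.0.2.10", "198.51.100.7"}))
```

A client that opens a new session for nearly every request shows up here with a session count close to its request count, which is the behaviour the Crawler Session Manager Valve is meant to curb.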
<!-- vim: set sw=2 ts=2: -->


@ -25,7 +25,7 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-03/" />
<meta property="article:published_time" content="2019-03-01T12:16:30&#43;01:00"/>
<meta property="article:modified_time" content="2019-03-26T09:09:19&#43;02:00"/>
<meta property="article:modified_time" content="2019-03-26T18:25:05&#43;02:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="March, 2019"/>
@ -55,9 +55,9 @@ I think I will need to ask Udana to re-copy and paste the abstracts with more ca
"@type": "BlogPosting",
"headline": "March, 2019",
"url": "https://alanorth.github.io/cgspace-notes/2019-03/",
"wordCount": "5506",
"wordCount": "5785",
"datePublished": "2019-03-01T12:16:30&#43;01:00",
"dateModified": "2019-03-26T09:09:19&#43;02:00",
"dateModified": "2019-03-26T18:25:05&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -1152,6 +1152,57 @@ $ grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2019-03-22 | sort -u | wc -l
$ ./delete-metadata-values.py -i /tmp/2019-03-26-AGROVOC-79-deletions.csv -db dspace -u dspace -p 'fuuu' -m 57 -f dc.subject -d -n
</code></pre>
<ul>
<li>UptimeRobot says CGSpace is down again, but it seems to just be slow, as the load is over 10.0</li>
<li>Looking at the nginx logs I don&rsquo;t see anything terribly abusive, but SemrushBot has made ~3,000 requests to Discovery and Browse pages today:</li>
</ul>
<pre><code># grep SemrushBot /var/log/nginx/access.log | grep -E &quot;26/Mar/2019&quot; | grep -E '(discover|browse)' | wc -l
2931
</code></pre>
<ul>
<li>So I&rsquo;m adding it to the badbot rate limiting in nginx. I&rsquo;m also tempted to block all user agents with &ldquo;bot&rdquo; in the name for a few days to see if things calm down, but maybe not just yet</li>
<li>Otherwise, these are the top users in the web and API logs over the last hour or so (hours 18 and 19):</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;26/Mar/2019:(18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
54 41.216.228.158
65 199.47.87.140
75 157.55.39.238
77 157.55.39.237
89 157.55.39.236
100 18.196.196.108
128 18.195.78.144
277 2a01:4f8:13b:1296::2
291 66.249.66.80
328 35.174.184.209
# zcat --force /var/log/nginx/{oai,rest,statistics}.log /var/log/nginx/{oai,rest,statistics}.log.1 | grep -E &quot;26/Mar/2019:(18|19)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
2 2409:4066:211:2caf:3c31:3fae:2212:19cc
2 35.10.204.140
2 45.251.231.45
2 95.108.181.88
2 95.137.190.2
3 104.198.9.108
3 107.167.109.88
6 66.249.66.80
13 41.89.230.156
1860 45.5.184.2
</code></pre>
<ul>
<li>For the XMLUI I see <code>18.195.78.144</code> and <code>18.196.196.108</code> requesting only CTA items and with no user agent</li>
<li>They are responsible for almost 1,000 XMLUI sessions today:</li>
</ul>
<pre><code>$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=(18\.195\.78\.144|18\.196\.196\.108)' dspace.log.2019-03-26 | sort -u | wc -l
937
</code></pre>
<ul>
<li>I will add their IPs to the list of bot IPs in nginx so I can tag them as bots, which lets Tomcat&rsquo;s Crawler Session Manager Valve force them to re-use their session</li>
</ul>
<!-- vim: set sw=2 ts=2: -->


@ -45,7 +45,7 @@ Disallow: /cgspace-notes/2015-12/
Disallow: /cgspace-notes/2015-11/
Disallow: /cgspace-notes/
Disallow: /cgspace-notes/categories/
Disallow: /cgspace-notes/tags/notes/
Disallow: /cgspace-notes/categories/notes/
Disallow: /cgspace-notes/tags/notes/
Disallow: /cgspace-notes/posts/
Disallow: /cgspace-notes/tags/


@ -4,7 +4,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2019-03/</loc>
<lastmod>2019-03-26T09:09:19+02:00</lastmod>
<lastmod>2019-03-26T18:25:05+02:00</lastmod>
</url>
<url>
@ -214,7 +214,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2019-03-26T09:09:19+02:00</lastmod>
<lastmod>2019-03-26T18:25:05+02:00</lastmod>
<priority>0</priority>
</url>
@ -223,27 +223,27 @@
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2019-03-26T09:09:19+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2018-03-09T22:10:33+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2019-03-26T18:25:05+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2019-03-26T09:09:19+02:00</lastmod>
<lastmod>2019-03-26T18:25:05+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2019-03-26T09:09:19+02:00</lastmod>
<lastmod>2019-03-26T18:25:05+02:00</lastmod>
<priority>0</priority>
</url>