Mirror of https://github.com/alanorth/cgspace-notes.git (synced 2024-11-22 06:35:03 +01:00)

Add notes for 2018-12-04 and regenerate

Parent: 5f051ca9ee
Commit: 0d1f50665a
@@ -231,4 +231,77 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg

- This has got to be part Ubuntu Tomcat packaging, and part DSpace 5.x Tomcat 8.5 readiness...?

## 2018-12-04

- Last night Linode sent a message that the load on CGSpace (linode18) was too high, here's a list of the top users at the time and throughout the day:

```
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
225 40.77.167.142
226 66.249.64.63
232 46.101.86.248
285 45.5.186.2
333 54.70.40.11
411 193.29.13.85
476 34.218.226.147
962 66.249.70.27
1193 35.237.175.180
1450 2a01:4f8:140:3192::2
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1141 207.46.13.57
1299 197.210.168.174
1341 54.70.40.11
1429 40.77.167.142
1528 34.218.226.147
1973 66.249.70.27
2079 50.116.102.77
2494 78.46.79.71
3210 2a01:4f8:140:3192::2
4190 35.237.175.180
```
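
As a rough sketch, the pipeline above could be wrapped in a function so that the date filter and the number of results become parameters. The name `top_ips` and its argument order are hypothetical (they are not in the notes), and it assumes the client address is the first field of each nginx log line, as the commands above do.

```shell
# Hypothetical helper, not part of the original notes: print the N busiest
# client addresses among log lines matching an extended regex.
# Assumes the client address is the first whitespace-separated field.
top_ips() {
    pattern="$1"    # e.g. '03/Dec/2018:1(5|6|7|8)'
    count="$2"      # number of addresses to print, e.g. 10
    shift 2         # the remaining arguments are log files
    # gzip -cdf reads both compressed and plain files (what zcat --force does)
    gzip -cdf "$@" \
        | grep -E "$pattern" \
        | awk '{print $1}' \
        | sort | uniq -c | sort -n \
        | tail -n "$count"
}
```

It would be called as `top_ips '03/Dec/2018' 10 /var/log/nginx/*.log /var/log/nginx/*.log.1`, which should be equivalent to the second command above.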

- `35.237.175.180` is known to us (CCAFS?), and I've already added it to the list of bot IPs in nginx, which appears to be working:

```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
4772
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
630
```
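
The two commands above answer different questions: `grep -c` counts matching log lines (which these notes treat as requests), while `grep -o ... | sort | uniq | wc -l` counts distinct session IDs. Piping `grep -c` into `sort | uniq` changes nothing, because `grep -c` already prints a single number. A minimal sketch that reports both figures at once follows; the function name `session_stats` is hypothetical, and the address is inserted into the pattern verbatim, just like in the commands above.

```shell
# Hypothetical helper: report matching-line and unique-session counts for one
# client address in a DSpace log. The address is used as a raw regex fragment,
# so unescaped dots match any character, as in the commands in the notes.
session_stats() {
    ip="$1"
    logfile="$2"
    regex="session_id=[A-Z0-9]{32}:ip_addr=$ip"
    # grep -c prints the number of matching lines (it exits 1 when that is 0)
    requests=$(grep -c -E "$regex" "$logfile" || true)
    # grep -o prints only the matched text, so sort -u leaves one line per session
    sessions=$(grep -o -E "$regex" "$logfile" | sort -u | wc -l | tr -d ' ')
    echo "requests=$requests sessions=$sessions"
}
```

For example, `session_stats 35.237.175.180 dspace.log.2018-12-03` would print both counts on one line.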

- I haven't seen `2a01:4f8:140:3192::2` before. Its user agent is some new bot:

```
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
```

- At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:

```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
5111
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
419
```

- `78.46.79.71` is another host on Hetzner with the following user agent:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
```

- This is not the first time a host on Hetzner has used a "normal" user agent to make thousands of requests
- At least it is re-using its Tomcat sessions somehow:

```
$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
2044
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
1
```
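
One caveat about the patterns above: in an extended regular expression an unescaped `.` matches any character, so `ip_addr=78.46.79.71` would also match a string such as `ip_addr=78x46.79.71`. That is unlikely to matter for real log lines, but a stricter variant can be sketched by escaping the dots first. The helper name `escape_ip` is hypothetical and only handles the dots of an IPv4 address.

```shell
# Hypothetical helper: escape the dots in an IPv4 address so that it matches
# literally when embedded in a grep -E pattern.
escape_ip() {
    printf '%s\n' "$1" | sed 's/\./\\./g'
}
```

It could then be used as `grep -c -E "session_id=[A-Z0-9]{32}:ip_addr=$(escape_ip 78.46.79.71)" dspace.log.2018-12-03`.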

- In other news, it's good to see my re-work of the database connectivity in the [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api) actually caused a reduction of persistent database connections (from 1 to 0, but still!):

![PostgreSQL connections day](/cgspace-notes/2018/12/postgres_connections_db-month.png)

<!-- vim: set sw=2 ts=2: -->
@@ -21,7 +21,7 @@ Today these are the top 10 IPs:
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-11/" /><meta property="article:published_time" content="2018-11-01T16:41:30+02:00"/>
<meta property="article:modified_time" content="2018-11-28T09:32:04+02:00"/>
<meta property="article:modified_time" content="2018-12-04T09:50:36+02:00"/>

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="November, 2018"/>
@@ -48,9 +48,9 @@ Today these are the top 10 IPs:
"@type": "BlogPosting",
"headline": "November, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-11/",
"wordCount": "2698",
"wordCount": "2823",
"datePublished": "2018-11-01T16:41:30+02:00",
"dateModified": "2018-11-28T09:32:04+02:00",
"dateModified": "2018-12-04T09:50:36+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -156,7 +156,7 @@ Today these are the top 10 IPs:
<li>They at least seem to be re-using their Tomcat sessions:</li>
</ul>

<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03 | sort | uniq
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=84.38.130.177' dspace.log.2018-11-03
342
</code></pre>

@@ -173,7 +173,7 @@ Today these are the top 10 IPs:
<li>And it doesn’t seem they are re-using their Tomcat sessions:</li>
</ul>

<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03 | sort | uniq
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=138.201.52.218' dspace.log.2018-11-03
1243
</code></pre>

@@ -205,14 +205,17 @@ Today these are the top 10 IPs:
</code></pre>

<ul>
<li>It’s making lots of requests and using quite a number of Tomcat sessions:</li>
<li>It’s making lots of requests, though actually it does seem to be re-using its Tomcat sessions:</li>
</ul>

<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' /home/cgspace.cgiar.org/log/dspace.log.2018-11-03 | sort | uniq
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03
8449
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-03 | sort | uniq | wc -l
1
</code></pre>

<ul>
<li><em>Updated on 2018-12-04 to correct the grep command above, as it was inaccurate and it seems the bot was actually already re-using its Tomcat sessions</em></li>
<li>I could add this IP to the list of bot IPs in nginx, but it seems like a futile effort when some new IP could come along and do the same thing</li>
<li>Perhaps I should think about adding rate limits to dynamic pages like <code>/discover</code> and <code>/browse</code></li>
<li>I think it’s reasonable for a human to click one of those links five or ten times a minute…</li>
@@ -270,14 +273,17 @@ Today these are the top 10 IPs:
</code></pre>

<ul>
<li><code>78.46.89.18</code> is back… and still making tons of Tomcat sessions:</li>
<li><code>78.46.89.18</code> is back… and it is still actually re-using its Tomcat sessions:</li>
</ul>

<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04
8765
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.89.18' dspace.log.2018-11-04 | sort | uniq | wc -l
1
</code></pre>

<ul>
<li><em>Updated on 2018-12-04 to correct the grep command and point out that the bot was actually re-using its Tomcat sessions properly</em></li>
<li>Also, now we have a ton of Facebook crawlers:</li>
</ul>

@@ -302,14 +308,15 @@ Today these are the top 10 IPs:
</code></pre>

<ul>
<li>They are really making shit tons of Tomcat sessions:</li>
<li>They are really making shit tons of requests:</li>
</ul>

<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq
14368
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
37721
</code></pre>

<ul>
<li><em>Updated on 2018-12-04 to correct the grep command to accurately show the number of requests</em></li>
<li>Their user agent is:</li>
</ul>

@@ -342,16 +349,17 @@ Today these are the top 10 IPs:
</code></pre>

<ul>
<li>And still making shit tons of Tomcat sessions:</li>
<li>Now at least the Tomcat Crawler Session Manager Valve seems to be forcing it to re-use some Tomcat sessions:</li>
</ul>

<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq
28470
<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04
37721
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-04 | sort | uniq | wc -l
15206
</code></pre>

<ul>
<li>And that’s even using the Tomcat Crawler Session Manager valve!</li>
<li>Maybe we need to limit more dynamic pages, like the “most popular” country, item, and author pages</li>
<li>I think we still need to limit more of the dynamic pages, like the “most popular” country, item, and author pages</li>
<li>It seems these are popular too, and there is no fucking way Facebook needs that information, yet they are requesting thousands of them!</li>
</ul>

@@ -377,19 +385,22 @@ Today these are the top 10 IPs:
<li>The file <code>marlo.csv</code> was cleaned up and formatted in Open Refine</li>
<li>165 of the items in their 2017 data are from CGSpace!</li>
<li>I will add the data to CGSpace this week (done!)</li>
<li>Jesus, is Facebook <em>trying</em> to be annoying?</li>
<li>Jesus, is Facebook <em>trying</em> to be annoying? At least the Tomcat Crawler Session Manager Valve is working to force the bot to re-use its Tomcat sessions:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep -c "2a03:2880:11ff:"
29889
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq
29156
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05
29763
# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a03:2880:11ff' dspace.log.2018-11-05 | sort | uniq | wc -l
1057
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "05/Nov/2018" | grep "2a03:2880:11ff:" | grep -c -E "(handle|bitstream)"
29896
</code></pre>

<ul>
<li>29,000 requests from Facebook, 29,000 Tomcat sessions, and none of the requests are to the dynamic pages I rate limited yesterday!</li>
<li>29,000 requests from Facebook and none of the requests are to the dynamic pages I rate limited yesterday!</li>
<li>At least the Tomcat Crawler Session Manager Valve is working now…</li>
</ul>

<h2 id="2018-11-06">2018-11-06</h2>

@@ -21,7 +21,7 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2018-12/" /><meta property="article:published_time" content="2018-12-02T02:09:30+02:00"/>
<meta property="article:modified_time" content="2018-12-03T13:16:42+02:00"/>
<meta property="article:modified_time" content="2018-12-03T18:28:21+02:00"/>

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="December, 2018"/>
@@ -48,9 +48,9 @@ I noticed that there is another issue with PDF thumbnails on CGSpace, and I see
"@type": "BlogPosting",
"headline": "December, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-12/",
"wordCount": "1503",
"wordCount": "1826",
"datePublished": "2018-12-02T02:09:30+02:00",
"dateModified": "2018-12-03T13:16:42+02:00",
"dateModified": "2018-12-03T18:28:21+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -370,6 +370,87 @@ $ gm convert -resize x600 -flatten -quality 85 cover.png cover.jpg
<li>This has got to be part Ubuntu Tomcat packaging, and part DSpace 5.x Tomcat 8.5 readiness…?</li>
</ul>

<h2 id="2018-12-04">2018-12-04</h2>

<ul>
<li>Last night Linode sent a message that the load on CGSpace (linode18) was too high, here’s a list of the top users at the time and throughout the day:</li>
</ul>

<pre><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018:1(5|6|7|8)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
225 40.77.167.142
226 66.249.64.63
232 46.101.86.248
285 45.5.186.2
333 54.70.40.11
411 193.29.13.85
476 34.218.226.147
962 66.249.70.27
1193 35.237.175.180
1450 2a01:4f8:140:3192::2
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Dec/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
1141 207.46.13.57
1299 197.210.168.174
1341 54.70.40.11
1429 40.77.167.142
1528 34.218.226.147
1973 66.249.70.27
2079 50.116.102.77
2494 78.46.79.71
3210 2a01:4f8:140:3192::2
4190 35.237.175.180
</code></pre>

<ul>
<li><code>35.237.175.180</code> is known to us (CCAFS?), and I’ve already added it to the list of bot IPs in nginx, which appears to be working:</li>
</ul>

<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03
4772
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=35.237.175.180' dspace.log.2018-12-03 | sort | uniq | wc -l
630
</code></pre>

<ul>
<li>I haven’t seen <code>2a01:4f8:140:3192::2</code> before. Its user agent is some new bot:</li>
</ul>

<pre><code>Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
</code></pre>

<ul>
<li>At least it seems the Tomcat Crawler Session Manager Valve is working to re-use the common bot XMLUI sessions:</li>
</ul>

<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03
5111
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=2a01:4f8:140:3192::2' dspace.log.2018-12-03 | sort | uniq | wc -l
419
</code></pre>

<ul>
<li><code>78.46.79.71</code> is another host on Hetzner with the following user agent:</li>
</ul>

<pre><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
</code></pre>

<ul>
<li>This is not the first time a host on Hetzner has used a “normal” user agent to make thousands of requests</li>
<li>At least it is re-using its Tomcat sessions somehow:</li>
</ul>

<pre><code>$ grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03
2044
$ grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=78.46.79.71' dspace.log.2018-12-03 | sort | uniq | wc -l
1
</code></pre>

<ul>
<li>In other news, it’s good to see my re-work of the database connectivity in the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a> actually caused a reduction of persistent database connections (from 1 to 0, but still!):</li>
</ul>

<p><img src="/cgspace-notes/2018/12/postgres_connections_db-month.png" alt="PostgreSQL connections day" /></p>

<!-- vim: set sw=2 ts=2: -->

BIN docs/2018/12/postgres_connections_db-month.png (normal file)
Binary file not shown. Size after: 11 KiB
@@ -4,12 +4,12 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-12/</loc>
<lastmod>2018-12-03T13:16:42+02:00</lastmod>
<lastmod>2018-12-03T18:28:21+02:00</lastmod>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-11/</loc>
<lastmod>2018-11-28T09:32:04+02:00</lastmod>
<lastmod>2018-12-04T09:50:36+02:00</lastmod>
</url>

<url>
@@ -199,7 +199,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2018-12-03T13:16:42+02:00</lastmod>
<lastmod>2018-12-03T18:28:21+02:00</lastmod>
<priority>0</priority>
</url>

@@ -210,7 +210,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2018-12-03T13:16:42+02:00</lastmod>
<lastmod>2018-12-03T18:28:21+02:00</lastmod>
<priority>0</priority>
</url>

@@ -222,13 +222,13 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2018-12-03T13:16:42+02:00</lastmod>
<lastmod>2018-12-03T18:28:21+02:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2018-12-03T13:16:42+02:00</lastmod>
<lastmod>2018-12-03T18:28:21+02:00</lastmod>
<priority>0</priority>
</url>

BIN static/2018/12/postgres_connections_db-month.png (normal file)
Binary file not shown. Size after: 11 KiB