Add notes for 2022-03-04
@@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
-<meta name="generator" content="Hugo 0.92.2" />
+<meta name="generator" content="Hugo 0.93.1" />
@@ -140,7 +140,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today</li>
<li>The logs for yesterday show fourteen errors related to LDAP auth failures:</li>
</ul>
<pre tabindex="0"><code>$ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
14
</code></pre><ul>
<li>For what it’s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET’s LDAP server</li>
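<li>A quick way to double check that, as a sketch reusing the same pattern over all of the month’s daily logs:</li>
</ul>
<pre tabindex="0"><code>$ for log in dspace.log.2017-10-*; do echo -n "$log: "; grep -c "ldap_authentication:type=failed_auth" "$log"; done
</code></pre><ul>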
@@ -165,7 +165,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Twice in the past twenty-four hours Linode has warned that CGSpace’s outbound traffic rate was exceeding the notification threshold</li>
<li>I had a look at yesterday’s OAI and REST logs in <code>/var/log/nginx</code> but didn’t see anything unusual:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
141 157.55.39.240
145 40.77.167.85
162 66.249.66.92
@@ -176,7 +176,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
1495 50.116.102.77
3904 70.32.83.92
9904 45.5.184.196
# awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
5 66.249.66.71
6 66.249.66.67
6 68.180.229.31
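# Not part of the original logs, just a rough sketch: since the Linode alert concerns
# outbound traffic, sum the response bytes per client IP instead of counting requests
# (assuming nginx's default combined log format, where the tenth field is bytes sent)
# awk '{bytes[$1]+=$10} END {for (ip in bytes) print bytes[ip], ip}' /var/log/nginx/rest.log.1 | sort -rn | head -n 10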
@@ -270,14 +270,14 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again</li>
<li>Still not sure where the load is coming from right now, but it’s clear why there were so many alerts yesterday on the 25th!</li>
</ul>
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
18022
</code></pre><ul>
<li>Compared to other days there were two or three times the number of requests yesterday!</li>
</ul>
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
3141
# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
7851
</code></pre><ul>
<li>I still have no idea what was causing the load to go up today</li>
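<li>One way to narrow that down, sketched here on the assumption that the <code>ip_addr=</code> field accompanies the session id in these logs as it does in the examples further below, is to count distinct sessions per client IP:</li>
</ul>
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=[0-9.]+' dspace.log.2017-10-26 | sort | uniq | awk -F'ip_addr=' '{print $2}' | sort | uniq -c | sort -rn | head
</code></pre><ul>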
@@ -302,7 +302,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I’m still not sure why this started causing alerts so repeatedly the past week</li>
<li>I don’t see any telltale signs in the REST or OAI logs, so I’m trying to do rudimentary analysis in the DSpace logs:</li>
</ul>
<pre tabindex="0"><code># grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2049
</code></pre><ul>
<li>So there were 2049 unique sessions during the hour of 2AM</li>
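<li>The same check can be repeated for every hour of the day to see when the spike started (a sketch of the loop):</li>
</ul>
<pre tabindex="0"><code># for hour in $(seq -w 0 23); do echo -n "2017-10-29 $hour: "; grep "2017-10-29 $hour:" dspace.log.2017-10-29 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
</code></pre><ul>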
@@ -310,7 +310,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I think I’ll need to enable access logging in nginx to figure out what’s going on</li>
<li>After enabling logging on requests to XMLUI on <code>/</code> I see some new bot I’ve never seen before:</li>
</ul>
<pre tabindex="0"><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
</code></pre><ul>
<li>CORE seems to be some bot that is “Aggregating the world’s open access research papers”</li>
<li>The contact address listed in their bot’s user agent is incorrect; the correct page is simply: <a href="https://core.ac.uk/contact">https://core.ac.uk/contact</a></li>
@@ -329,20 +329,20 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
</code></pre><ul>
<li>Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:</li>
</ul>
<pre tabindex="0"><code># grep -c "CORE/0.6" /var/log/nginx/access.log
26475
# grep -c "CORE/0.6" /var/log/nginx/access.log.1
135083
</code></pre><ul>
<li>IP addresses for this bot currently seem to be:</li>
</ul>
<pre tabindex="0"><code># grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
137.108.70.6
137.108.70.7
</code></pre><ul>
<li>I will add their user agent to the Tomcat Session Crawler Valve but it won’t help much because they are only using two sessions:</li>
</ul>
<pre tabindex="0"><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
session_id=5771742CABA3D0780860B8DA81E0551B
session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
</code></pre><ul>
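<li>For reference, after adding their user agent to the crawler session valve one can double check the entry in Tomcat’s config (a sketch, assuming the stock Ubuntu path for <code>server.xml</code>):</li>
</ul>
<pre tabindex="0"><code># grep -A1 'CrawlerSessionManagerValve' /etc/tomcat7/server.xml
</code></pre><ul>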
@@ -350,12 +350,12 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
</ul>
<pre tabindex="0"><code># grep -c 137.108.70 /var/log/nginx/access.log
26622
# grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
24055
</code></pre><ul>
<li>Just because I’m curious who the top IPs are:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
496 62.210.247.93
571 46.4.94.226
651 40.77.167.39
@@ -371,9 +371,9 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Compute Engine</li>
<li>Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don’t reuse their session variable, creating thousands of new sessions!</li>
</ul>
<pre tabindex="0"><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1419
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2811
</code></pre><ul>
<li>From looking at the requests, it appears these are from CIAT and CCAFS</li>
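<li>A sketch of how one might eyeball that, pulling the most common request lines for one of the IPs out of the nginx log:</li>
</ul>
<pre tabindex="0"><code># grep 104.196.152.243 /var/log/nginx/access.log | awk -F'"' '{print $2}' | sort | uniq -c | sort -rn | head
</code></pre><ul>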
@@ -382,11 +382,11 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>Ah, wait, it looks like <code>crawlerIps</code> only came in 2017-06, so probably isn’t in Ubuntu 16.04’s 7.0.68 build!</li>
<li>That would explain the errors I was getting when trying to set it:</li>
</ul>
<pre tabindex="0"><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
</code></pre><ul>
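<li>One way to confirm which Tomcat build is actually installed before trying newer valve properties, sketched here on the assumption that the distribution package is in use:</li>
</ul>
<pre tabindex="0"><code>$ dpkg -s tomcat7 | grep -i '^version'
</code></pre><ul>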
<li>As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:</li>
</ul>
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
@@ -399,7 +399,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>Ask on the dspace-tech mailing list if it’s possible to use an existing item as a template for a new item</li>
<li>To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:</li>
</ul>
<pre tabindex="0"><code># grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
139109 137.108.70.6
139253 137.108.70.7
</code></pre><ul>
@@ -416,7 +416,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>I’m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable</li>
<li>Actually, come to think of it, they aren’t even obeying <code>robots.txt</code>, because we actually disallow <code>/discover</code> and <code>/search-filter</code> URLs but they are hitting those massively:</li>
</ul>
<pre tabindex="0"><code># grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn
158058 GET /discover
14260 GET /search-filter
</code></pre><ul>
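<li>For reference, a quick way to verify the disallow rules they are ignoring (a sketch, assuming the production hostname):</li>
</ul>
<pre tabindex="0"><code>$ curl -s https://cgspace.cgiar.org/robots.txt | grep -E '^Disallow: /(discover|search-filter)'
</code></pre><ul>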