Add notes for 2022-03-04

2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions

@@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@@ -140,7 +140,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I thought maybe his account had expired (seeing as it was the first of the month), but he says he was finally able to log in today</li>
<li>The logs for yesterday show fourteen errors related to LDAP auth failures:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;ldap_authentication:type=failed_auth&quot; dspace.log.2017-10-01
<pre tabindex="0"><code>$ grep -c &#34;ldap_authentication:type=failed_auth&#34; dspace.log.2017-10-01
14
</code></pre><ul>
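<li>A quick way to confirm that the failures are isolated to that one day is to let grep count matches across all of the recent daily logs (a rough sketch, assuming they keep the same <code>dspace.log.YYYY-MM-DD</code> naming):</li>
</ul>
<pre tabindex="0"><code>$ grep -c 'ldap_authentication:type=failed_auth' dspace.log.2017-09-* dspace.log.2017-10-*   # prints one filename:count line per log
</code></pre><ul>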
<li>For what it&rsquo;s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET&rsquo;s LDAP server</li>
@@ -165,7 +165,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Twice in the past twenty-four hours Linode has warned that CGSpace&rsquo;s outbound traffic rate was exceeding the notification threshold</li>
<li>I had a look at yesterday&rsquo;s OAI and REST logs in <code>/var/log/nginx</code> but didn&rsquo;t see anything unusual:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
141 157.55.39.240
145 40.77.167.85
162 66.249.66.92
@@ -176,7 +176,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
1495 50.116.102.77
3904 70.32.83.92
9904 45.5.184.196
# awk '{print $1}' /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
# awk &#39;{print $1}&#39; /var/log/nginx/oai.log.1 | sort -n | uniq -c | sort -h | tail -n 10
5 66.249.66.71
6 66.249.66.67
6 68.180.229.31
@@ -270,14 +270,14 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again</li>
<li>Still not sure where the load is coming from right now, but it&rsquo;s clear why there were so many alerts yesterday on the 25th!</li>
</ul>
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-25 | sort -n | uniq | wc -l
18022
</code></pre><ul>
<li>Compared to other days there were two or three times the number of requests yesterday!</li>
</ul>
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-23 | sort -n | uniq | wc -l
3141
# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
# grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; dspace.log.2017-10-26 | sort -n | uniq | wc -l
7851
</code></pre><ul>
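<li>A small loop makes the same comparison across a whole run of days at once (a sketch, with the same assumption about the daily log naming):</li>
</ul>
<pre tabindex="0"><code>$ for log in dspace.log.2017-10-2?; do printf '%s: ' $log; grep -o -E 'session_id=[A-Z0-9]{32}' $log | sort -n | uniq | wc -l; done   # unique sessions per day
</code></pre><ul>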
<li>I still have no idea what was causing the load to go up today</li>
@@ -302,7 +302,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I&rsquo;m still not sure why this started causing alerts so repeatedly over the past week</li>
<li>I don&rsquo;t see any telltale signs in the REST or OAI logs, so I&rsquo;m trying to do some rudimentary analysis of the DSpace logs:</li>
</ul>
<pre tabindex="0"><code># grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep &#39;2017-10-29 02:&#39; dspace.log.2017-10-29 | grep -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2049
</code></pre><ul>
<li>So there were 2049 unique sessions during the hour of 2AM</li>
@@ -310,7 +310,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
<li>I think I&rsquo;ll need to enable access logging in nginx to figure out what&rsquo;s going on</li>
<li>After enabling logging on requests to XMLUI on <code>/</code> I see some new bot I&rsquo;ve never seen before:</li>
</ul>
<pre tabindex="0"><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] &quot;GET /discover?filtertype_0=type&amp;filter_relational_operator_0=equals&amp;filter_0=Internal+Document&amp;filtertype=author&amp;filter_relational_operator=equals&amp;filter=CGIAR+Secretariat HTTP/1.1&quot; 200 7776 &quot;-&quot; &quot;Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)&quot;
<pre tabindex="0"><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] &#34;GET /discover?filtertype_0=type&amp;filter_relational_operator_0=equals&amp;filter_0=Internal+Document&amp;filtertype=author&amp;filter_relational_operator=equals&amp;filter=CGIAR+Secretariat HTTP/1.1&#34; 200 7776 &#34;-&#34; &#34;Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)&#34;
</code></pre><ul>
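<li>To check whether this agent dominates the new access log or is just one of many, the user agent field can be tallied directly (a rough sketch, assuming nginx&rsquo;s default combined log format):</li>
</ul>
<pre tabindex="0"><code>$ awk -F'&quot;' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head   # field 6 of the combined format is the user agent
</code></pre><ul>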
<li>CORE seems to be some bot that is &ldquo;Aggregating the world&rsquo;s open access research papers&rdquo;</li>
<li>The contact address listed in their bot&rsquo;s user agent is incorrect; the correct page is simply: <a href="https://core.ac.uk/contact">https://core.ac.uk/contact</a></li>
@@ -329,20 +329,20 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
</code></pre><ul>
<li>Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;CORE/0.6&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &#34;CORE/0.6&#34; /var/log/nginx/access.log
26475
# grep -c &quot;CORE/0.6&quot; /var/log/nginx/access.log.1
# grep -c &#34;CORE/0.6&#34; /var/log/nginx/access.log.1
135083
</code></pre><ul>
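<li>To see how that traffic is spread over the day, the timestamps can be bucketed by hour (a sketch based on the request line format shown above):</li>
</ul>
<pre tabindex="0"><code>$ grep 'CORE/0.6' /var/log/nginx/access.log | awk '{print $4}' | cut -d: -f1,2 | sort | uniq -c | tail   # requests per hour
</code></pre><ul>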
<li>IP addresses for this bot currently seem to be:</li>
</ul>
<pre tabindex="0"><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log | awk &#39;{print $1}&#39; | sort -n | uniq
137.108.70.6
137.108.70.7
</code></pre><ul>
<li>I will add their user agent to the Tomcat Crawler Session Manager Valve (a quick config check is sketched below), but it won&rsquo;t help much because they are only using two sessions:</li>
</ul>
<pre tabindex="0"><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
<pre tabindex="0"><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq
session_id=5771742CABA3D0780860B8DA81E0551B
session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
</code></pre><ul>
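<li>After editing the valve it is worth double checking that the new token actually landed in its <code>crawlerUserAgents</code> attribute (a sketch; the <code>server.xml</code> path is a guess for Ubuntu&rsquo;s tomcat7 package):</li>
</ul>
<pre tabindex="0"><code>$ grep -o 'crawlerUserAgents=&quot;[^&quot;]*&quot;' /etc/tomcat7/server.xml   # path is a guess
</code></pre><ul>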
@@ -350,12 +350,12 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
</ul>
<pre tabindex="0"><code># grep -c 137.108.70 /var/log/nginx/access.log
26622
# grep 137.108.70 /var/log/nginx/access.log | grep -c &quot;GET /discover&quot;
# grep 137.108.70 /var/log/nginx/access.log | grep -c &#34;GET /discover&#34;
24055
</code></pre><ul>
<li>Just because I&rsquo;m curious who the top IPs are:</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
496 62.210.247.93
571 46.4.94.226
651 40.77.167.39
@@ -371,9 +371,9 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud</li>
<li>Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don&rsquo;t reuse their session variable, creating thousands of new sessions!</li>
</ul>
<pre tabindex="0"><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
<pre tabindex="0"><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
1419
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E &#39;session_id=[A-Z0-9]{32}&#39; | sort -n | uniq | wc -l
2811
</code></pre><ul>
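<li>One way to see what those two addresses are actually asking for is to pull their request lines out of the nginx log (a rough sketch, again assuming the combined log format):</li>
</ul>
<pre tabindex="0"><code>$ grep -E '^(190\.19\.92\.5|104\.196\.152\.243) ' /var/log/nginx/access.log | awk -F'&quot;' '{print $2}' | sort | uniq -c | sort -rn | head   # field 2 is the request line
</code></pre><ul>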
<li>From looking at the requests, it appears these are from CIAT and CCAFS</li>
@@ -382,11 +382,11 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>Ah, wait, it looks like <code>crawlerIps</code> only came in 2017-06, so probably isn&rsquo;t in Ubuntu 16.04&rsquo;s 7.0.68 build!</li>
<li>That would explain the errors I was getting when trying to set it:</li>
</ul>
<pre tabindex="0"><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
<pre tabindex="0"><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property &#39;crawlerIps&#39; to &#39;190\.19\.92\.5|104\.196\.152\.243&#39; did not find a matching property.
</code></pre><ul>
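<li>One way to confirm that the valve shipped with this Tomcat really lacks the property is to ask <code>javap</code> whether the class has a <code>setCrawlerIps()</code> setter (a sketch; the catalina.jar path is a guess for Ubuntu&rsquo;s tomcat7 package):</li>
</ul>
<pre tabindex="0"><code>$ javap -classpath /usr/share/tomcat7/lib/catalina.jar org.apache.catalina.valves.CrawlerSessionManagerValve | grep -i crawlerips   # jar path is a guess; no output means no such setter
</code></pre><ul>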
<li>As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:</li>
</ul>
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
<pre tabindex="0"><code># grep -o -E &#39;session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)&#39; dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
@@ -399,7 +399,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>Ask on the dspace-tech mailing list if it&rsquo;s possible to use an existing item as a template for a new item</li>
<li>To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:</li>
</ul>
<pre tabindex="0"><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log.1 | awk &#39;{print $1}&#39; | sort -n | uniq -c | sort -h
139109 137.108.70.6
139253 137.108.70.7
</code></pre><ul>
@@ -416,7 +416,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
<li>I&rsquo;m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable</li>
<li>Actually, come to think of it, they aren&rsquo;t even obeying <code>robots.txt</code>, because we actually disallow <code>/discover</code> and <code>/search-filter</code> URLs but they are hitting those massively:</li>
</ul>
<pre tabindex="0"><code># grep &quot;CORE/0.6&quot; /var/log/nginx/access.log | grep -o -E &quot;GET /(discover|search-filter)&quot; | sort -n | uniq -c | sort -rn
<pre tabindex="0"><code># grep &#34;CORE/0.6&#34; /var/log/nginx/access.log | grep -o -E &#34;GET /(discover|search-filter)&#34; | sort -n | uniq -c | sort -rn
158058 GET /discover
14260 GET /search-filter
</code></pre><ul>
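<li>A quick way to double check that those paths really are disallowed in the live robots.txt (a sketch; the hostname is assumed to be the public CGSpace address):</li>
</ul>
<pre tabindex="0"><code>$ curl -s https://cgspace.cgiar.org/robots.txt | grep -E 'Disallow: /(discover|search-filter)'   # hostname assumed
</code></pre><ul>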