mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2021-09-13
@ -34,7 +34,7 @@ http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
|
||||
There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
|
||||
Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.87.0" />
|
||||
<meta name="generator" content="Hugo 0.88.1" />
|
||||
|
||||
|
||||
|
||||
@ -124,7 +124,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
|
||||
<ul>
|
||||
<li>Peter emailed to point out that many items in the <a href="https://cgspace.cgiar.org/handle/10568/2703">ILRI archive collection</a> have multiple handles:</li>
|
||||
</ul>
|
||||
<pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
|
||||
<pre tabindex="0"><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
|
||||
</code></pre><ul>
|
||||
<li>There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
|
||||
<li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
|
||||
@ -134,13 +134,13 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
|
||||
<li>Peter Ballantyne said he was having problems logging into CGSpace with “both” of his accounts (CGIAR LDAP and personal, apparently)</li>
|
||||
<li>I looked in the logs and saw some LDAP lookup failures due to timeout but also strangely a “no DN found” error:</li>
|
||||
</ul>
|
||||
<pre><code>2017-10-01 20:24:57,928 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException: svcgroot2.cgiarad.org:3269 [Root exception is java.net.ConnectException: Connection timed out (Connection timed out)]
|
||||
<pre tabindex="0"><code>2017-10-01 20:24:57,928 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException: svcgroot2.cgiarad.org:3269 [Root exception is java.net.ConnectException: Connection timed out (Connection timed out)]
|
||||
2017-10-01 20:22:37,982 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
|
||||
</code></pre><ul>
|
||||
<li>I thought maybe his account had expired (seeing as it was the first of the month) but he says he was finally able to log in today</li>
|
||||
<li>The logs for yesterday show fourteen errors related to LDAP auth failures:</li>
|
||||
</ul>
|
||||
<pre><code>$ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
|
||||
<pre tabindex="0"><code>$ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
|
||||
14
|
||||
</code></pre><ul>
|
||||
<li>For what it’s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET’s LDAP server</li>
|
||||
@ -152,7 +152,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
|
||||
<li>Communicate with Sam from the CGIAR System Organization about some broken links coming from their CGIAR Library domain to CGSpace</li>
|
||||
<li>The first is a link to a browse page that should be handled better in nginx:</li>
|
||||
</ul>
|
||||
<pre><code>http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject
|
||||
<pre tabindex="0"><code>http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject
|
||||
</code></pre><ul>
|
||||
<li>We’ll need to check for browse links and handle them properly, including swapping the <code>subject</code> parameter for <code>systemsubject</code> (which doesn’t exist in Discovery yet, but we’ll need to add it) as we have moved their poorly curated subjects from <code>dc.subject</code> to <code>cg.subject.system</code></li>
|
||||
<li>The second link was a direct link to a bitstream which has broken due to the sequence being updated, so I told him he should link to the handle of the item instead</li>
|
||||
@ -165,7 +165,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
|
||||
<li>Twice in the past twenty-four hours Linode has warned that CGSpace’s outbound traffic rate was exceeding the notification threshold</li>
|
||||
<li>I had a look at yesterday’s OAI and REST logs in <code>/var/log/nginx</code> but didn’t see anything unusual:</li>
|
||||
</ul>
|
||||
<pre><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
|
||||
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
|
||||
141 157.55.39.240
|
||||
145 40.77.167.85
|
||||
162 66.249.66.92
|
||||
@ -225,7 +225,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
|
||||
<li>Delete Community 10568/102 (ILRI Research and Development Issues)</li>
|
||||
<li>Move five collections to 10568/27629 (ILRI Projects) using <code>move-collections.sh</code> with the following configuration:</li>
|
||||
</ul>
|
||||
<pre><code>10568/1637 10568/174 10568/27629
|
||||
<pre tabindex="0"><code>10568/1637 10568/174 10568/27629
|
||||
10568/1642 10568/174 10568/27629
|
||||
10568/1614 10568/174 10568/27629
|
||||
10568/75561 10568/150 10568/27629
|
||||
@ -270,12 +270,12 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
|
||||
<li>Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again</li>
|
||||
<li>Still not sure where the load is coming from right now, but it’s clear why there were so many alerts yesterday on the 25th!</li>
|
||||
</ul>
|
||||
<pre><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
|
||||
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
|
||||
18022
|
||||
</code></pre><ul>
|
||||
<li>Compared to other days there were two or three times the number of requests yesterday!</li>
|
||||
</ul>
|
||||
<pre><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
|
||||
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-23 | sort -n | uniq | wc -l
|
||||
3141
|
||||
# grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-26 | sort -n | uniq | wc -l
|
||||
7851
|
||||
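The per-day session counts above can also be reproduced without the shell pipeline. The following is a minimal Python sketch, assuming the same `session_id=` token format that the `grep` pattern matches; the function name and the sample lines are hypothetical and are not real log data.

```python
import re

# Same pattern as the grep commands above: a 32-character session token.
SESSION_RE = re.compile(r"session_id=[A-Z0-9]{32}")


def count_unique_sessions(lines):
    """Return the number of distinct session_id tokens in the given log lines."""
    seen = set()
    for line in lines:
        seen.update(SESSION_RE.findall(line))
    return len(seen)


# Hypothetical sample lines shaped like DSpace log entries (not real data).
SAMPLE_LINES = [
    "2017-10-25 00:00:01,000 INFO x @ anonymous:session_id=" + "A" * 32 + ":ip_addr=192.0.2.1:view_item",
    "2017-10-25 00:00:02,000 INFO x @ anonymous:session_id=" + "A" * 32 + ":ip_addr=192.0.2.1:view_item",
    "2017-10-25 00:00:03,000 INFO x @ anonymous:session_id=" + "B" * 32 + ":ip_addr=192.0.2.2:view_item",
    "2017-10-25 00:00:04,000 INFO x @ a line without any session token",
]

if __name__ == "__main__":
    print(count_unique_sessions(SAMPLE_LINES))
```

To run it against a real log, an open file object can be passed in place of `SAMPLE_LINES`, since the function only iterates over lines.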
@ -302,7 +302,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
|
||||
<li>I’m still not sure why this started causing alerts so repeatedly over the past week</li>
|
||||
<li>I don’t see any telltale signs in the REST or OAI logs, so trying to do rudimentary analysis in DSpace logs:</li>
|
||||
</ul>
|
||||
<pre><code># grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
|
||||
<pre tabindex="0"><code># grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
|
||||
2049
|
||||
</code></pre><ul>
|
||||
<li>So there were 2049 unique sessions during the hour of 2AM</li>
|
||||
@ -310,7 +310,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
|
||||
<li>I think I’ll need to enable access logging in nginx to figure out what’s going on</li>
|
||||
<li>After enabling logging on requests to XMLUI on <code>/</code> I see some new bot I’ve never seen before:</li>
|
||||
</ul>
|
||||
<pre><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
|
||||
<pre tabindex="0"><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
|
||||
</code></pre><ul>
|
||||
<li>CORE seems to be some bot that is “Aggregating the world’s open access research papers”</li>
|
||||
<li>The contact address listed in their bot’s user agent is incorrect; the correct page is simply: <a href="https://core.ac.uk/contact">https://core.ac.uk/contact</a></li>
|
||||
@ -323,39 +323,39 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
|
||||
<li>Like clockwork, Linode alerted about high CPU usage on CGSpace again this morning (this time at 8:13 AM)</li>
|
||||
<li>Uptime Robot noticed that CGSpace went down around 10:15 AM, and I saw that there were 93 PostgreSQL connections:</li>
|
||||
</ul>
|
||||
<pre><code>dspace=# SELECT * FROM pg_stat_activity;
|
||||
<pre tabindex="0"><code>dspace=# SELECT * FROM pg_stat_activity;
|
||||
...
|
||||
(93 rows)
|
||||
</code></pre><ul>
|
||||
<li>Surprise surprise, the CORE bot is likely responsible for the recent load issues, making hundreds of thousands of requests yesterday and today:</li>
|
||||
</ul>
|
||||
<pre><code># grep -c "CORE/0.6" /var/log/nginx/access.log
|
||||
<pre tabindex="0"><code># grep -c "CORE/0.6" /var/log/nginx/access.log
|
||||
26475
|
||||
# grep -c "CORE/0.6" /var/log/nginx/access.log.1
|
||||
135083
|
||||
</code></pre><ul>
|
||||
<li>IP addresses for this bot currently seem to be:</li>
|
||||
</ul>
|
||||
<pre><code># grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
|
||||
<pre tabindex="0"><code># grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
|
||||
137.108.70.6
|
||||
137.108.70.7
|
||||
</code></pre><ul>
|
||||
<li>I will add their user agent to the Tomcat Session Crawler Valve but it won’t help much because they are only using two sessions:</li>
|
||||
</ul>
|
||||
<pre><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
|
||||
<pre tabindex="0"><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
|
||||
session_id=5771742CABA3D0780860B8DA81E0551B
|
||||
session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
|
||||
</code></pre><ul>
|
||||
<li>… and most of their requests are for dynamic discover pages:</li>
|
||||
</ul>
|
||||
<pre><code># grep -c 137.108.70 /var/log/nginx/access.log
|
||||
<pre tabindex="0"><code># grep -c 137.108.70 /var/log/nginx/access.log
|
||||
26622
|
||||
# grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
|
||||
24055
|
||||
</code></pre><ul>
|
||||
<li>Just because I’m curious who the top IPs are:</li>
|
||||
</ul>
|
||||
<pre><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
|
||||
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
|
||||
496 62.210.247.93
|
||||
571 46.4.94.226
|
||||
651 40.77.167.39
|
||||
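The `awk | sort | uniq -c | sort -h | tail` pipeline above ranks clients by request count. A rough Python equivalent is sketched here, assuming the client address is the first whitespace-separated field of each access log line (as in the combined log format); the function name and sample entries are hypothetical.

```python
from collections import Counter


def top_clients(lines, n=10):
    """Return the n most frequent first fields (client IPs) with their counts."""
    counts = Counter()
    for line in lines:
        fields = line.split(maxsplit=1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(n)


# Hypothetical access log entries using documentation-only IP ranges.
SAMPLE_ACCESS_LOG = [
    '192.0.2.10 - - [30/Oct/2017:07:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "example"',
    '198.51.100.7 - - [30/Oct/2017:07:00:01 +0000] "GET /discover HTTP/1.1" 200 512 "-" "example"',
    '192.0.2.10 - - [30/Oct/2017:07:00:02 +0000] "GET /handle/1 HTTP/1.1" 200 512 "-" "example"',
    '203.0.113.5 - - [30/Oct/2017:07:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "example"',
    '198.51.100.7 - - [30/Oct/2017:07:00:04 +0000] "GET /discover HTTP/1.1" 200 512 "-" "example"',
    '192.0.2.10 - - [30/Oct/2017:07:00:05 +0000] "GET / HTTP/1.1" 200 512 "-" "example"',
]

if __name__ == "__main__":
    for ip, hits in top_clients(SAMPLE_ACCESS_LOG, n=3):
        print(hits, ip)
```

Unlike the shell pipeline, which lists the busiest clients last, `most_common` returns them in descending order.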
@ -371,7 +371,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
|
||||
<li>190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine</li>
|
||||
<li>Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don’t reuse their session variable, creating thousands of new sessions!</li>
|
||||
</ul>
|
||||
<pre><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
|
||||
<pre tabindex="0"><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
|
||||
1419
|
||||
# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
|
||||
2811
|
||||
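The comparison above runs one `grep` per IP address. As a sketch of the same idea in a single pass, the Python below groups session tokens by client address, assuming each log line carries `session_id=...:ip_addr=...` back to back as in the DSpace log excerpts in these notes; the function name and sample data are hypothetical.

```python
import re
from collections import defaultdict

# Assumes the session token is immediately followed by ":ip_addr=<IPv4>".
PAIR_RE = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=(\d{1,3}(?:\.\d{1,3}){3})")


def sessions_per_ip(lines):
    """Map each client IP to the number of distinct sessions it used."""
    sessions = defaultdict(set)
    for line in lines:
        for session_id, ip in PAIR_RE.findall(line):
            sessions[ip].add(session_id)
    return {ip: len(ids) for ip, ids in sessions.items()}


def _entry(token_char, ip):
    """Build one hypothetical log line (not real data)."""
    return "2017-10-30 00:00:00,000 INFO x @ anonymous:session_id=" + token_char * 32 + ":ip_addr=" + ip + ":view_item"


# 192.0.2.1 opens a new session on every request; 192.0.2.2 reuses one.
SAMPLE_LOG = [
    _entry("A", "192.0.2.1"),
    _entry("B", "192.0.2.1"),
    _entry("C", "192.0.2.1"),
    _entry("D", "192.0.2.2"),
    _entry("D", "192.0.2.2"),
    _entry("D", "192.0.2.2"),
]

if __name__ == "__main__":
    for ip, count in sorted(sessions_per_ip(SAMPLE_LOG).items()):
        print(ip, count)
```

A client whose session count is close to its request count is not reusing its session, which is the pattern attributed to the two scrapers above.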
@ -382,11 +382,11 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
|
||||
<li>Ah, wait, it looks like <code>crawlerIps</code> only came in 2017-06, so probably isn’t in Ubuntu 16.04’s 7.0.68 build!</li>
|
||||
<li>That would explain the errors I was getting when trying to set it:</li>
|
||||
</ul>
|
||||
<pre><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
|
||||
<pre tabindex="0"><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
|
||||
</code></pre><ul>
|
||||
<li>As for now, it actually seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:</li>
|
||||
</ul>
|
||||
<pre><code># grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
|
||||
<pre tabindex="0"><code># grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
|
||||
410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
|
||||
574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
|
||||
1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
|
||||
@ -399,7 +399,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
|
||||
<li>Ask on the dspace-tech mailing list if it’s possible to use an existing item as a template for a new item</li>
|
||||
<li>To follow up on the CORE bot traffic, there were almost 300,000 requests yesterday:</li>
|
||||
</ul>
|
||||
<pre><code># grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
|
||||
<pre tabindex="0"><code># grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
|
||||
139109 137.108.70.6
|
||||
139253 137.108.70.7
|
||||
</code></pre><ul>
|
||||
@ -408,7 +408,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
|
||||
<li>I added <a href="https://goaccess.io/">GoAccess</a> to the list of packages to install in the DSpace role of the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a></li>
|
||||
<li>It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:</li>
|
||||
</ul>
|
||||
<pre><code># goaccess /var/log/nginx/access.log --log-format=COMBINED
|
||||
<pre tabindex="0"><code># goaccess /var/log/nginx/access.log --log-format=COMBINED
|
||||
</code></pre><ul>
|
||||
<li>According to Uptime Robot CGSpace went down and up a few times</li>
|
||||
<li>I had a look at goaccess and I saw that CORE was actively indexing</li>
|
||||
@ -416,7 +416,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
|
||||
<li>I’m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable</li>
|
||||
<li>Actually, come to think of it, they aren’t even obeying <code>robots.txt</code>, because we actually disallow <code>/discover</code> and <code>/search-filter</code> URLs but they are hitting those massively:</li>
|
||||
</ul>
|
||||
<pre><code># grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn
|
||||
<pre tabindex="0"><code># grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn
|
||||
158058 GET /discover
|
||||
14260 GET /search-filter
|
||||
</code></pre><ul>
|
||||
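Whether a given URL is covered by the `robots.txt` rules mentioned above can be checked with Python's standard `urllib.robotparser`. This is only a sketch: the `ROBOTS_TXT` excerpt is an assumption based on the statement that `/discover` and `/search-filter` are disallowed, not a copy of the real file, and the `is_allowed` helper is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt; the real robots.txt is not reproduced in these notes.
ROBOTS_TXT = """\
User-agent: *
Disallow: /discover
Disallow: /search-filter
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://cgspace.cgiar.org/discover?filtertype=author",
        "https://cgspace.cgiar.org/search-filter",
        "https://cgspace.cgiar.org/handle/10568/78495",
    ):
        print(url, is_allowed("CORE", url))
```

Under these assumed rules a compliant crawler would skip the first two URLs and fetch only the item handle.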