mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-01-27
@@ -12,7 +12,7 @@ Peter emailed to point out that many items in the ILRI archive collection have m
 http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
-There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
+There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
 Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
 " />
 <meta property="og:type" content="article" />
@@ -28,10 +28,10 @@ Peter emailed to point out that many items in the ILRI archive collection have m
 http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
-There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
+There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine
 Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections
 "/>
-<meta name="generator" content="Hugo 0.62.2" />
+<meta name="generator" content="Hugo 0.63.1" />
@@ -61,7 +61,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 <!-- combined, minified CSS -->
-<link href="https://alanorth.github.io/cgspace-notes/css/style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css" rel="stylesheet" integrity="sha256-ogwaQ2djljLNs0HSPCfKRP7cx1sPizy+piAwENoVPTw=" crossorigin="anonymous">
+<link href="https://alanorth.github.io/cgspace-notes/css/style.23e2c3298bcc8c1136c19aba330c211ec94c36f7c4454ea15cf4d3548370042a.css" rel="stylesheet" integrity="sha256-I+LDKYvMjBE2wZq6MwwhHslMNvfERU6hXPTTVINwBCo=" crossorigin="anonymous">
 <!-- RSS 2.0 feed -->
@@ -108,7 +108,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 <header>
 <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2017-10/">October, 2017</a></h2>
 <p class="blog-post-meta"><time datetime="2017-10-01T08:07:54+03:00">Sun Oct 01, 2017</time> by Alan Orth in
-<i class="fa fa-folder" aria-hidden="true"></i> <a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
+<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes" rel="category tag">Notes</a>
 </p>
@@ -119,7 +119,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 </ul>
 <pre><code>http://hdl.handle.net/10568/78495||http://hdl.handle.net/10568/79336
 </code></pre><ul>
-<li>There appears to be a pattern but I'll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
+<li>There appears to be a pattern but I’ll have to look a bit closer and try to clean them up automatically, either in SQL or in OpenRefine</li>
 <li>Add Katherine Lutz to the groups for content submission and edit steps of the CGIAR System collections</li>
 </ul>
 <h2 id="2017-10-02">2017-10-02</h2>
@@ -130,13 +130,13 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 <pre><code>2017-10-01 20:24:57,928 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:ldap_attribute_lookup:type=failed_search javax.naming.CommunicationException\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is java.net.ConnectException\colon; Connection timed out (Connection timed out)]
 2017-10-01 20:22:37,982 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=CA0AA5FEAEA8805645489404CDCE9594:ip_addr=41.204.190.40:failed_login:no DN found for user pballantyne
 </code></pre><ul>
-<li>I thought maybe his account had expired (seeing as it's was the first of the month) but he says he was finally able to log in today</li>
+<li>I thought maybe his account had expired (seeing as it’s was the first of the month) but he says he was finally able to log in today</li>
 <li>The logs for yesterday show fourteen errors related to LDAP auth failures:</li>
 </ul>
 <pre><code>$ grep -c "ldap_authentication:type=failed_auth" dspace.log.2017-10-01
 14
 </code></pre><ul>
-<li>For what it's worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET's LDAP server</li>
+<li>For what it’s worth, there are no errors on any other recent days, so it must have been some network issue on Linode or CGNET’s LDAP server</li>
 <li>Linode emailed to say that linode578611 (DSpace Test) needs to migrate to a new host for a security update so I initiated the migration immediately rather than waiting for the scheduled time in two weeks</li>
 </ul>
 <h2 id="2017-10-04">2017-10-04</h2>
@@ -147,7 +147,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 </ul>
 <pre><code>http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject → https://cgspace.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subject
 </code></pre><ul>
-<li>We'll need to check for browse links and handle them properly, including swapping the <code>subject</code> parameter for <code>systemsubject</code> (which doesn't exist in Discovery yet, but we'll need to add it) as we have moved their poorly curated subjects from <code>dc.subject</code> to <code>cg.subject.system</code></li>
+<li>We’ll need to check for browse links and handle them properly, including swapping the <code>subject</code> parameter for <code>systemsubject</code> (which doesn’t exist in Discovery yet, but we’ll need to add it) as we have moved their poorly curated subjects from <code>dc.subject</code> to <code>cg.subject.system</code></li>
 <li>The second link was a direct link to a bitstream which has broken due to the sequence being updated, so I told him he should link to the handle of the item instead</li>
 <li>Help Sisay proof sixty-two IITA records on DSpace Test</li>
 <li>Lots of inconsistencies and errors in subjects, dc.format.extent, regions, countries</li>
@@ -155,8 +155,8 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 </ul>
 <h2 id="2017-10-05">2017-10-05</h2>
 <ul>
-<li>Twice in the past twenty-four hours Linode has warned that CGSpace's outbound traffic rate was exceeding the notification threshold</li>
-<li>I had a look at yesterday's OAI and REST logs in <code>/var/log/nginx</code> but didn't see anything unusual:</li>
+<li>Twice in the past twenty-four hours Linode has warned that CGSpace’s outbound traffic rate was exceeding the notification threshold</li>
+<li>I had a look at yesterday’s OAI and REST logs in <code>/var/log/nginx</code> but didn’t see anything unusual:</li>
 </ul>
 <pre><code># awk '{print $1}' /var/log/nginx/rest.log.1 | sort -n | uniq -c | sort -h | tail -n 10
 141 157.55.39.240
@@ -183,7 +183,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 </code></pre><ul>
 <li>Working on the nginx redirects for CGIAR Library</li>
 <li>We should start using 301 redirects and also allow for <code>/sitemap</code> to work on the library.cgiar.org domain so the CGIAR System Organization people can update their Google Search Console and allow Google to find their content in a structured way</li>
-<li>Remove eleven occurrences of <code>ACP</code> in IITA's <code>cg.coverage.region</code> using the Atmire batch edit module from Discovery</li>
+<li>Remove eleven occurrences of <code>ACP</code> in IITA’s <code>cg.coverage.region</code> using the Atmire batch edit module from Discovery</li>
 <li>Need to investigate how we can verify the library.cgiar.org using the HTML or DNS methods</li>
 <li>Run corrections on 143 ILRI Archive items that had two <code>dc.identifier.uri</code> values (Handle) that Peter had pointed out earlier this week</li>
 <li>I used OpenRefine to isolate them and then fixed and re-imported them into CGSpace</li>
@@ -197,7 +197,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 <p><img src="/cgspace-notes/2017/10/dspace-thumbnail-original.png" alt="Original flat thumbnails">
 <img src="/cgspace-notes/2017/10/dspace-thumbnail-box-shadow.png" alt="Tweaked with border and box shadow"></p>
 <ul>
-<li>I'll post it to the Yammer group to see what people think</li>
+<li>I’ll post it to the Yammer group to see what people think</li>
 <li>I figured out at way to do the HTML verification for Google Search console for library.cgiar.org</li>
 <li>We can drop the HTML file in their XMLUI theme folder and it will get copied to the webapps directory during build/install</li>
 <li>Then we add an nginx alias for that URL in the library.cgiar.org vhost</li>
@@ -213,7 +213,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 <img src="/cgspace-notes/2017/10/google-search-console-2.png" alt="Google Search Console 2">
 <img src="/cgspace-notes/2017/10/google-search-results.png" alt="Google Search results"></p>
 <ul>
-<li>I tried to submit a “Change of Address” request in the Google Search Console but I need to be an owner on CGSpace's console (currently I'm just a user) in order to do that</li>
+<li>I tried to submit a “Change of Address” request in the Google Search Console but I need to be an owner on CGSpace’s console (currently I’m just a user) in order to do that</li>
 <li>Manually clean up some communities and collections that Peter had requested a few weeks ago</li>
 <li>Delete Community 10568/102 (ILRI Research and Development Issues)</li>
 <li>Move five collections to 10568/27629 (ILRI Projects) using <code>move-collections.sh</code> with the following configuration:</li>
@@ -233,8 +233,8 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 </ul>
 <p><img src="/cgspace-notes/2017/10/search-console-change-address-error.png" alt="Change of Address error"></p>
 <ul>
-<li>We are sending top-level CGIAR Library traffic to their specific community hierarchy in CGSpace so this type of change of address won't work—we'll just need to wait for Google to slowly index everything and take note of the HTTP 301 redirects</li>
-<li>Also the Google Search Console doesn't work very well with Google Analytics being blocked, so I had to turn off my ad blocker to get the “Change of Address” tool to work!</li>
+<li>We are sending top-level CGIAR Library traffic to their specific community hierarchy in CGSpace so this type of change of address won’t work—we’ll just need to wait for Google to slowly index everything and take note of the HTTP 301 redirects</li>
+<li>Also the Google Search Console doesn’t work very well with Google Analytics being blocked, so I had to turn off my ad blocker to get the “Change of Address” tool to work!</li>
 </ul>
 <h2 id="2017-10-12">2017-10-12</h2>
 <ul>
@@ -245,7 +245,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 <ul>
 <li>Run system updates on DSpace Test and reboot server</li>
 <li>Merge changes adding a search/browse index for CGIAR System subject to <code>5_x-prod</code> (<a href="https://github.com/ilri/DSpace/pull/344">#344</a>)</li>
-<li>I checked the top browse links in Google's search results for <code>site:library.cgiar.org inurl:browse</code> and they are all redirected appropriately by the nginx rewrites I worked on last week</li>
+<li>I checked the top browse links in Google’s search results for <code>site:library.cgiar.org inurl:browse</code> and they are all redirected appropriately by the nginx rewrites I worked on last week</li>
 </ul>
 <h2 id="2017-10-22">2017-10-22</h2>
 <ul>
@@ -256,12 +256,12 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 </ul>
 <h2 id="2017-10-26">2017-10-26</h2>
 <ul>
-<li>In the last 24 hours we've gotten a few alerts from Linode that there was high CPU and outgoing traffic on CGSpace</li>
+<li>In the last 24 hours we’ve gotten a few alerts from Linode that there was high CPU and outgoing traffic on CGSpace</li>
 <li>Uptime Robot even noticed CGSpace go “down” for a few minutes</li>
 <li>In other news, I was trying to look at a question about stats raised by Magdalena and then CGSpace went down due to SQL connection pool</li>
 <li>Looking at the PostgreSQL activity I see there are 93 connections, but after a minute or two they went down and CGSpace came back up</li>
 <li>Annnd I reloaded the Atmire Usage Stats module and the connections shot back up and CGSpace went down again</li>
-<li>Still not sure where the load is coming from right now, but it's clear why there were so many alerts yesterday on the 25th!</li>
+<li>Still not sure where the load is coming from right now, but it’s clear why there were so many alerts yesterday on the 25th!</li>
 </ul>
 <pre><code># grep -o -E 'session_id=[A-Z0-9]{32}' dspace.log.2017-10-25 | sort -n | uniq | wc -l
 18022
@@ -274,12 +274,12 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 7851
 </code></pre><ul>
 <li>I still have no idea what was causing the load to go up today</li>
-<li>I finally investigated Magdalena's issue with the item download stats and now I can't reproduce it: I get the same number of downloads reported in the stats widget on the item page, the “Most Popular Items” page, and in Usage Stats</li>
+<li>I finally investigated Magdalena’s issue with the item download stats and now I can’t reproduce it: I get the same number of downloads reported in the stats widget on the item page, the “Most Popular Items” page, and in Usage Stats</li>
 <li>I think it might have been an issue with the statistics not being fresh</li>
 <li>I added the admin group for the systems organization to the admin role of the top-level community of CGSpace because I guess Sisay had forgotten</li>
 <li>Magdalena asked if there was a way to reuse data in item submissions where items have a lot of similar data</li>
 <li>I told her about the possibility to use per-collection item templates, and asked if her items in question were all from a single collection</li>
-<li>We've never used it but it could be worth looking at</li>
+<li>We’ve never used it but it could be worth looking at</li>
 </ul>
 <h2 id="2017-10-27">2017-10-27</h2>
 <ul>
@@ -292,24 +292,24 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 <h2 id="2017-10-29">2017-10-29</h2>
 <ul>
 <li>Linode alerted about high CPU usage again on CGSpace around 2AM and 4AM</li>
-<li>I'm still not sure why this started causing alerts so repeatadely the past week</li>
-<li>I don't see any tell tale signs in the REST or OAI logs, so trying to do rudimentary analysis in DSpace logs:</li>
+<li>I’m still not sure why this started causing alerts so repeatadely the past week</li>
+<li>I don’t see any tell tale signs in the REST or OAI logs, so trying to do rudimentary analysis in DSpace logs:</li>
 </ul>
 <pre><code># grep '2017-10-29 02:' dspace.log.2017-10-29 | grep -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
 2049
 </code></pre><ul>
 <li>So there were 2049 unique sessions during the hour of 2AM</li>
 <li>Looking at my notes, the number of unique sessions was about the same during the same hour on other days when there were no alerts</li>
-<li>I think I'll need to enable access logging in nginx to figure out what's going on</li>
-<li>After enabling logging on requests to XMLUI on <code>/</code> I see some new bot I've never seen before:</li>
+<li>I think I’ll need to enable access logging in nginx to figure out what’s going on</li>
+<li>After enabling logging on requests to XMLUI on <code>/</code> I see some new bot I’ve never seen before:</li>
 </ul>
 <pre><code>137.108.70.6 - - [29/Oct/2017:07:39:49 +0000] "GET /discover?filtertype_0=type&filter_relational_operator_0=equals&filter_0=Internal+Document&filtertype=author&filter_relational_operator=equals&filter=CGIAR+Secretariat HTTP/1.1" 200 7776 "-" "Mozilla/5.0 (compatible; CORE/0.6; +http://core.ac.uk; http://core.ac.uk/intro/contact)"
 </code></pre><ul>
 <li>CORE seems to be some bot that is “Aggregating the world’s open access research papers”</li>
-<li>The contact address listed in their bot's user agent is incorrect, correct page is simply: <a href="https://core.ac.uk/contact">https://core.ac.uk/contact</a></li>
-<li>I will check the logs in a few days to see if they are harvesting us regularly, then add their bot's user agent to the Tomcat Crawler Session Valve</li>
+<li>The contact address listed in their bot’s user agent is incorrect, correct page is simply: <a href="https://core.ac.uk/contact">https://core.ac.uk/contact</a></li>
+<li>I will check the logs in a few days to see if they are harvesting us regularly, then add their bot’s user agent to the Tomcat Crawler Session Valve</li>
 <li>After browsing the CORE site it seems that the CGIAR Library is somehow a member of CORE, so they have probably only been harvesting CGSpace since we did the migration, as library.cgiar.org directs to us now</li>
-<li>For now I will just contact them to have them update their contact info in the bot's user agent, but eventually I think I'll tell them to swap out the CGIAR Library entry for CGSpace</li>
+<li>For now I will just contact them to have them update their contact info in the bot’s user agent, but eventually I think I’ll tell them to swap out the CGIAR Library entry for CGSpace</li>
 </ul>
 <h2 id="2017-10-30">2017-10-30</h2>
 <ul>
@@ -333,7 +333,7 @@ Add Katherine Lutz to the groups for content submission and edit steps of the CG
 137.108.70.6
 137.108.70.7
 </code></pre><ul>
-<li>I will add their user agent to the Tomcat Session Crawler Valve but it won't help much because they are only using two sessions:</li>
+<li>I will add their user agent to the Tomcat Session Crawler Valve but it won’t help much because they are only using two sessions:</li>
 </ul>
 <pre><code># grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
 session_id=5771742CABA3D0780860B8DA81E0551B
@@ -346,7 +346,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
 # grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
 24055
 </code></pre><ul>
-<li>Just because I'm curious who the top IPs are:</li>
+<li>Just because I’m curious who the top IPs are:</li>
 </ul>
 <pre><code># awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
 496 62.210.247.93
@@ -362,7 +362,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
 </code></pre><ul>
 <li>At least we know the top two are CORE, but who are the others?</li>
 <li>190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud Engine</li>
-<li>Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don't reuse their session variable, creating thousands of new sessions!</li>
+<li>Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don’t reuse their session variable, creating thousands of new sessions!</li>
 </ul>
 <pre><code># grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
 1419
@@ -372,7 +372,7 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
 <li>From looking at the requests, it appears these are from CIAT and CCAFS</li>
 <li>I wonder if I could somehow instruct them to use a user agent so that we could apply a crawler session manager valve to them</li>
 <li>Actually, according to the Tomcat docs, we could use an IP with <code>crawlerIps</code>: <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve</a></li>
-<li>Ah, wait, it looks like <code>crawlerIps</code> only came in 2017-06, so probably isn't in Ubuntu 16.04's 7.0.68 build!</li>
+<li>Ah, wait, it looks like <code>crawlerIps</code> only came in 2017-06, so probably isn’t in Ubuntu 16.04’s 7.0.68 build!</li>
 <li>That would explain the errors I was getting when trying to set it:</li>
 </ul>
 <pre><code>WARNING: [SetPropertiesRule]{Server/Service/Engine/Host/Valve} Setting property 'crawlerIps' to '190\.19\.92\.5|104\.196\.152\.243' did not find a matching property.
@@ -389,14 +389,14 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
 <h2 id="2017-10-31">2017-10-31</h2>
 <ul>
 <li>Very nice, Linode alerted that CGSpace had high CPU usage at 2AM again</li>
-<li>Ask on the dspace-tech mailing list if it's possible to use an existing item as a template for a new item</li>
+<li>Ask on the dspace-tech mailing list if it’s possible to use an existing item as a template for a new item</li>
 <li>To follow up on the CORE bot traffic, there were almost 300,000 request yesterday:</li>
 </ul>
 <pre><code># grep "CORE/0.6" /var/log/nginx/access.log.1 | awk '{print $1}' | sort -n | uniq -c | sort -h
 139109 137.108.70.6
 139253 137.108.70.7
 </code></pre><ul>
-<li>I've emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace</li>
+<li>I’ve emailed the CORE people to ask if they can update the repository information from CGIAR Library to CGSpace</li>
 <li>Also, I asked if they could perhaps use the <code>sitemap.xml</code>, OAI-PMH, or REST APIs to index us more efficiently, because they mostly seem to be crawling the nearly endless Discovery facets</li>
 <li>I added <a href="https://goaccess.io/">GoAccess</a> to the list of package to install in the DSpace role of the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a></li>
 <li>It makes it very easy to analyze nginx logs from the command line, to see where traffic is coming from:</li>
@@ -406,14 +406,14 @@ session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
 <li>According to Uptime Robot CGSpace went down and up a few times</li>
 <li>I had a look at goaccess and I saw that CORE was actively indexing</li>
 <li>Also, PostgreSQL connections were at 91 (with the max being 60 per web app, hmmm)</li>
-<li>I'm really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable</li>
-<li>Actually, come to think of it, they aren't even obeying <code>robots.txt</code>, because we actually disallow <code>/discover</code> and <code>/search-filter</code> URLs but they are hitting those massively:</li>
+<li>I’m really starting to get annoyed with these guys, and thinking about blocking their IP address for a few days to see if CGSpace becomes more stable</li>
+<li>Actually, come to think of it, they aren’t even obeying <code>robots.txt</code>, because we actually disallow <code>/discover</code> and <code>/search-filter</code> URLs but they are hitting those massively:</li>
 </ul>
 <pre><code># grep "CORE/0.6" /var/log/nginx/access.log | grep -o -E "GET /(discover|search-filter)" | sort -n | uniq -c | sort -rn
 158058 GET /discover
 14260 GET /search-filter
 </code></pre><ul>
-<li>I tested a URL of pattern <code>/discover</code> in Google's webmaster tools and it was indeed identified as blocked</li>
+<li>I tested a URL of pattern <code>/discover</code> in Google’s webmaster tools and it was indeed identified as blocked</li>
 <li>I will send feedback to the CORE bot team</li>
 </ul>