Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:
/handle/10568/3353/discover
/handle/10568/16510/browse
The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header…
Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours
I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)
The good thing is that, according to dspace.log.2017-08-01, they are all using the same Tomcat session
This means our Tomcat Crawler Session Valve is working
But many of the bots are browsing dynamic URLs like:
/handle/10568/3353/discover
/handle/10568/16510/browse
The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header…
<li>Linode sent an alert that CGSpace (linode18) was using 350% CPU for the past two hours</li>
<li>I looked in the Activity pane of the Admin Control Panel and it seems that Google, Baidu, Yahoo, and Bing are all crawling with massive numbers of bots concurrently (~100 total, mostly Baidu and Google)</li>
<li>The good thing is that, according to <code>dspace.log.2017-08-01</code>, they are all using the same Tomcat session</li>
<li>This means our Tomcat Crawler Session Valve is working</li>
<li>But many of the bots are browsing dynamic URLs like:
<ul>
<li>/handle/10568/3353/discover</li>
<li>/handle/10568/16510/browse</li>
</ul></li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs… we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <ahref="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
<li>Magdalena from CCAFS asked if there was a way to get the top ten items published in 2016 (note: not the top items in 2016!)</li>
<li>I think Atmire’s Content and Usage Analysis module should be able to do this but I will have to look at the configuration and maybe email Atmire if I can’t figure it out</li>
<li>I had a look at the moduel configuration and couldn’t figure out a way to do this, so I <ahref="https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets">opened a ticket on the Atmire tracker</a></li>
<li>Atmire responded about the <ahref="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=500">missing workflow statistics issue</a> a few weeks ago but I didn’t see it for some reason</li>
<li>They said they added a publication and saw the workflow stat for the user, so I should try again and let them know</li>
<li>Usman from CIFOR emailed to ask about the status of our OAI tests for harvesting their DSpace repository</li>
<li>I told him that the OAI appears to not be harvesting properly after the first sync, and that the control panel shows an “Internal error” for that collection:</li>
<li>I don’t see anything related in our logs, so I asked him to check for our server’s IP in their logs</li>
<li>Also, in the mean time I stopped the harvesting process, reset the status, and restarted the process via the Admin control panel (note: I didn’t reset the collection, just the harvester status!)</li>
<li>Apply Abenet’s corrections for the CGIAR Library’s Consortium subcommunity (697 records)</li>
<li>I had to fix a few small things, like moving the <code>dc.title</code> column away from the beginning of the row, delete blank spaces in the abstract in vim using <code>:g/^$/d</code>, add the <code>dc.subject[en_US]</code> column back, as she had deleted it and DSpace didn’t detect the changes made there (we needed to blank the values instead)</li>
<li>Apply last updates to the CGIAR Library’s Fund community (812 items)</li>
<li>Had to do some quality checks and column renames before importing, as either Sisay or Abenet renamed a few columns and the metadata importer wanted to remove/add new metadata for title, abstract, etc.</li>
<li>Also I applied the HTML entities unescape transform on the abstract column in Open Refine</li>
<li>I need to get an author list from the database for only the CGIAR Library community to send to Peter</li>
<li>Alan to follow up with ICARDA about depositing in CGSpace, we want ICARD and Drylands legacy content but not duplicates</li>
<li>Alan to follow up on dc.rights, where are we?</li>
<li>Alan to follow up with Atmire about a dedicated field for ORCIDs, based on the discussion in the <ahref="https://wiki.duraspace.org/display/cmtygp/DCAT+Meeting+June+2017">June, 2017 DCAT meeting</a></li>
<li>Alan to ask about how to query external services like AGROVOC in the DSpace submission form</li>
<li>Follow up with Atmire on the <ahref="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=510">ticket about ORCID metadata in DSpace</a></li>
<li>Follow up with Lili and Andrea about the pending CCAFS metadata and flagship updates</li>
<li>CGSpace had load issues and was throwing errors related to PostgreSQL</li>
<li>I told Tsega to reduce the max connections from 70 to 40 because actually each web application gets that limit and so for xmlui, oai, jspui, rest, etc it could be 70 x 4 = 280 connections depending on the load, and the PostgreSQL config itself is only 100!</li>
<li>I learned this on a recent discussion on the DSpace wiki</li>
<li>I need to either look into setting up a database pool through JNDI or increase the PostgreSQL max connections</li>
<li>Also, I need to find out where the load is coming from (rest?) and possibly block bots from accessing dynamic pages like Browse and Discover instead of just sending an X-Robots-Tag HTTP header</li>
<li>I noticed that Google has bitstreams from the <code>rest</code> interface in the search index. I need to ask on the dspace-tech mailing list to see what other people are doing about this, and maybe start issuing an <code>X-Robots-Tag: none</code> there!</li>
<li>The top offender is 70.32.83.92 which is actually the same IP as ccafs.cgiar.org, so I will email the Macaroni Bros to see if they can test on DSpace Test instead</li>
<li>I’ve enabled logging of <code>/oai</code> requests on nginx as well so we can potentially determine bad actors here (also to see if anyone is actually using OAI!)</li>
<li>Thinking about resource limits for PostgreSQL again after last week’s CGSpace crash and related to a recently discussion I had in the comments of the <ahref="https://wiki.duraspace.org/display/cmtygp/DCAT+Meeting+April+2017">April, 2017 DCAT meeting notes</a></li>
<li>In that thread Chris Wilper suggests a new default of 35 max connections for <code>db.maxconnections</code> (from the current default of 30), knowing that <em>each DSpace web application</em> gets to use up to this many on its own</li>
<li>It would be good to approximate what the theoretical maximum number of connections on a busy server would be, perhaps by looking to see which apps use SQL:</li>
<li>Of those five applications we’re running, only <code>solr</code> appears not to use the database directly</li>
<li>And JSPUI is only used internally (so it doesn’t really count), leaving us with OAI, REST, and XMLUI</li>
<li>Assuming each takes a theoretical maximum of 35 connections during a heavy load (35 * 3 = 105), that would put the connections well above PostgreSQL’s default max of 100 connections (remember a handful of connections are reserved for the PostgreSQL super user, see <code>superuser_reserved_connections</code>)</li>
<li>So we should adjust PostgreSQL’s max connections to be DSpace’s <code>db.maxconnections</code> * 3 + 3</li>
<li>This would allow each application to use up to <code>db.maxconnections</code> and not to go over the system’s PostgreSQL limit</li>
<li>Perhaps since CGSpace is a busy site with lots of resources we could actually use something like 40 for <code>db.maxconnections</code></li>
<li>Also worth looking into is to set up a database pool using JNDI, as apparently DSpace’s <code>db.poolname</code> hasn’t been used since around DSpace 1.7 (according to Chris Wilper’s comments in the thread)</li>
<li>Need to go check the PostgreSQL connection stats in Munin on CGSpace from the past week to get an idea if 40 is appropriate</li>
<li>Unfortunately I don’t have the breakdown of which DSpace apps are making those connections (I’ll assume XMLUI)</li>
<li>So I guess a limit of 30 (DSpace default) is too low, but 70 causes problems when the load increases and the system’s PostgreSQL <code>max_connections</code> is too low</li>
<li>For now I think maybe setting DSpace’s <code>db.maxconnections</code> to 40 and adjusting the system’s <code>max_connections</code> might be a good starting point: 40 * 3 + 4 = 123</li>
<li>Do some last minute cleanups and de-duplications of the CGIAR Library data, as I need to send it to Peter this week</li>
<li>Metadata fields like <code>dc.contributor.author</code>, <code>dc.publisher</code>, <code>dc.type</code>, and a few others had somehow been duplicated along the line</li>
<li>Also, a few dozen <code>dc.description.abstract</code> fields still had various HTML tags and entities in them</li>
<li>Also, a bunch of <code>dc.subject</code> fields that were not AGROVOC had not been moved properly to <code>cg.system.subject</code></li>
<li>I wanted to merge the various field variations like <code>cg.subject.system</code> and <code>cg.subject.system[en_US]</code> in OpenRefine but I realized it would be easier in PostgreSQL:</li>
</ul>
<pre><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=254;
</code></pre>
<ul>
<li>And actually, we can do it for other generic fields for items in those collections, for example <code>dc.description.abstract</code>:</li>
<li>And on others like <code>dc.language.iso</code>, <code>dc.relation.ispartofseries</code>, <code>dc.type</code>, <code>dc.title</code>, etc…</li>
<li>Also, to move fields from <code>dc.identifier.url</code> to <code>cg.identifier.url[en_US]</code> (because we don’t use the Dublin Core one for some reason):</li>
</ul>
<pre><code>dspace=# update metadatavalue set metadata_field_id = 219, text_lang = 'en_US' where resource_type_id = 2 AND metadata_field_id = 237;
UPDATE 15
</code></pre>
<ul>
<li>Set the text_lang of all <code>dc.identifier.uri</code> (Handle) fields to be NULL, just like default DSpace does:</li>
</ul>
<pre><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id = 2 and metadata_field_id = 25 and text_value like 'http://hdl.handle.net/10947/%';
UPDATE 4248
</code></pre>
<ul>
<li>Also update the text_lang of <code>dc.contributor.author</code> fields for metadata in these collections:</li>
<li>This would be true if the authors were like <code>CGIAR System Management Office||CGIAR System Management Office</code>, which some of the CGIAR Library’s were</li>
<li>Unfortunately when you fix these in OpenRefine and then submit the metadata to DSpace it doesn’t detect any changes, so you have to edit them all manually via DSpace’s “Edit Item”</li>
<li>Ooh! And an even more interesting regex would match <em>any</em> duplicated author:</li>
</ul>
<pre><code>isNotNull(value.match(/(.+?)\|\|\1/))
</code></pre>
<ul>
<li>Which means it can also be used to find items with duplicate <code>dc.subject</code> fields…</li>
<li>Finally sent Peter the final dump of the CGIAR System Organization community so he can have a last look at it</li>
<li>Post a message to the dspace-tech mailing list to ask about querying the AGROVOC API from the submission form</li>
<li>Abenet was asking if there was some way to hide certain internal items from the “ILRI Research Outputs” RSS feed (which is the top-level ILRI community feed), because Shirley was complaining</li>
<li>I think we could use <code>harvest.includerestricted.rss = false</code> but the items might need to be 100% restricted, not just the metadata</li>
<li>Adjust Ansible postgres role to use <code>max_connections</code> from a template variable and deploy a new limit of 123 on CGSpace</li>