mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-10 22:13:21 +01:00
<!DOCTYPE html>
|
||
<html lang="en">
|
||
|
||
<head>
|
||
<meta charset="utf-8">
|
||
<meta http-equiv="X-UA-Compatible" content="IE=edge">
|
||
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
||
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
|
||
|
||
<meta property="og:title" content="November, 2016" />
|
||
<meta property="og:description" content="2016-11-01 Add dc.type to the output options for Atmire’s Listings and Reports module (#286) 2016-11-02 Migrate DSpace Test to DSpace 5.5 (notes) Run all updates on DSpace Test and reboot the server Looks like the OAI bug from DSpace 5.1 that caused validation at Base Search to fail is now fixed and DSpace Test passes validation! (#63) Indexing Discovery on DSpace Test took 332 minutes, which is like five times as long as it usually takes At the end it appeared to finish correctly but there were lots of errors right after it finished: 2016-11-02 15:09:48,578 INFO com." />
|
||
<meta property="og:type" content="article" />
|
||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2016-11/" />
|
||
|
||
|
||
<meta property="og:updated_time" content="2016-11-01T09:21:00+03:00"/>
<meta itemprop="name" content="November, 2016">
|
||
<meta itemprop="description" content="2016-11-01 Add dc.type to the output options for Atmire’s Listings and Reports module (#286) 2016-11-02 Migrate DSpace Test to DSpace 5.5 (notes) Run all updates on DSpace Test and reboot the server Looks like the OAI bug from DSpace 5.1 that caused validation at Base Search to fail is now fixed and DSpace Test passes validation! (#63) Indexing Discovery on DSpace Test took 332 minutes, which is like five times as long as it usually takes At the end it appeared to finish correctly but there were lots of errors right after it finished: 2016-11-02 15:09:48,578 INFO com.">
|
||
|
||
|
||
<meta itemprop="dateModified" content="2016-11-01T09:21:00+03:00" />
|
||
<meta itemprop="wordCount" content="1561">
|
||
|
||
|
||
|
||
<meta itemprop="keywords" content="notes," />
|
||
|
||
|
||
|
||
<meta name="twitter:card" content="summary"/>
|
||
|
||
|
||
|
||
<meta name="twitter:title" content="November, 2016"/>
|
||
<meta name="twitter:description" content="2016-11-01 Add dc.type to the output options for Atmire’s Listings and Reports module (#286) 2016-11-02 Migrate DSpace Test to DSpace 5.5 (notes) Run all updates on DSpace Test and reboot the server Looks like the OAI bug from DSpace 5.1 that caused validation at Base Search to fail is now fixed and DSpace Test passes validation! (#63) Indexing Discovery on DSpace Test took 332 minutes, which is like five times as long as it usually takes At the end it appeared to finish correctly but there were lots of errors right after it finished: 2016-11-02 15:09:48,578 INFO com."/>
<meta name="generator" content="Hugo 0.17" />
|
||
|
||
|
||
<base href="https://alanorth.github.io/cgspace-notes/">
|
||
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2016-11/">
|
||
|
||
<title>November, 2016 | CGSpace Notes</title>
|
||
|
||
<!-- combined, minified CSS -->
|
||
<link href="https://alanorth.github.io/cgspace-notes/css/style.css" rel="stylesheet">
</head>
|
||
|
||
<body>
|
||
|
||
<div class="blog-masthead">
|
||
<div class="container">
|
||
<nav class="nav blog-nav">
|
||
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
||
|
||
</nav>
|
||
</div>
|
||
</div>
|
||
|
||
<header class="blog-header">
|
||
<div class="container">
|
||
<h1 class="blog-title"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
||
|
||
</div>
|
||
</header>
|
||
|
||
<div class="container">
|
||
<div class="row">
|
||
<div class="col-sm-8 blog-main">
|
||
|
||
|
||
|
||
|
||
<article class="blog-post">
|
||
<header>
|
||
<h2 class="blog-post-title"><a href="https://alanorth.github.io/cgspace-notes/2016-11/">November, 2016</a></h2>
|
||
<p class="blog-post-meta"><time datetime="2016-11-01T09:21:00+03:00">Tue Nov 01, 2016</time> by Alan Orth in
|
||
|
||
<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
|
||
|
||
</p>
|
||
</header>
|
||
|
||
|
||
<h2 id="2016-11-01">2016-11-01</h2>
|
||
|
||
<ul>
|
||
<li>Add <code>dc.type</code> to the output options for Atmire’s Listings and Reports module (<a href="https://github.com/ilri/DSpace/pull/286">#286</a>)</li>
|
||
</ul>
|
||
|
||
<p><img src="2016/11/listings-and-reports.png" alt="Listings and Reports with output type" /></p>
|
||
|
||
<h2 id="2016-11-02">2016-11-02</h2>
|
||
|
||
<ul>
|
||
<li>Migrate DSpace Test to DSpace 5.5 (<a href="https://gist.github.com/alanorth/61013895c6efe7095d7f81000953d1cf">notes</a>)</li>
|
||
<li>Run all updates on DSpace Test and reboot the server</li>
|
||
<li>Looks like the OAI bug from DSpace 5.1 that caused validation at Base Search to fail is now fixed and DSpace Test passes validation! (<a href="https://github.com/ilri/DSpace/issues/63">#63</a>)</li>
|
||
<li>Indexing Discovery on DSpace Test took 332 minutes, which is roughly four times as long as it usually takes</li>
|
||
<li>At the end it appeared to finish correctly but there were lots of errors right after it finished:</li>
|
||
</ul>
|
||
|
||
<pre><code>2016-11-02 15:09:48,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76454 to Index
2016-11-02 15:09:48,584 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/3202 to Index
2016-11-02 15:09:48,589 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76455 to Index
2016-11-02 15:09:48,590 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Community: 10568/51693 to Index
2016-11-02 15:09:48,590 INFO org.dspace.discovery.IndexClient @ Done with indexing
2016-11-02 15:09:48,600 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76456 to Index
2016-11-02 15:09:48,613 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/55536 to Index
2016-11-02 15:09:48,616 INFO com.atmire.dspace.discovery.AtmireSolrService @ Wrote Collection: 10568/76457 to Index
2016-11-02 15:09:48,634 ERROR com.atmire.dspace.discovery.AtmireSolrService @
java.lang.NullPointerException
    at org.dspace.discovery.SearchUtils.getDiscoveryConfiguration(SourceFile:57)
    at org.dspace.discovery.SolrServiceImpl.buildDocument(SolrServiceImpl.java:824)
    at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:821)
    at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:898)
    at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
    at org.dspace.storage.rdbms.DatabaseUtils$ReindexerThread.run(DatabaseUtils.java:945)
</code></pre>
|
||
|
||
<ul>
|
||
<li>DSpace is still up, and a few minutes later I see the default DSpace indexer is still running</li>
|
||
<li>Sure enough, looking back before the first one finished, I see output from both indexers interleaved in the log:</li>
|
||
</ul>
|
||
|
||
<pre><code>2016-11-02 15:09:28,545 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/47242 to Index
2016-11-02 15:09:28,633 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/60785 to Index
2016-11-02 15:09:28,678 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55695 of 55722): 43557
2016-11-02 15:09:28,688 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55703 of 55722): 34476
</code></pre>
|
||
|
||
<ul>
|
||
<li>I will raise a ticket with Atmire to ask them about this</li>
|
||
</ul>
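<p>The interleaving seen in the log can also be checked mechanically. The following is a minimal sketch, assuming the log line layout shown in these notes (timestamp, level, logger class, then <code>@</code>); the function names are hypothetical and not part of DSpace. It reports the seconds in which both indexer classes wrote at least one line.</p>

```python
import re
from collections import Counter

# Matches lines such as:
# 2016-11-02 15:09:28,545 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: ...
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+@"
)


def loggers_by_second(lines):
    """Map each timestamp (truncated to the second) to a Counter of logger names."""
    seen = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # e.g. stack trace lines, which carry no timestamp
        second = match.group("ts").split(",")[0]
        seen.setdefault(second, Counter())[match.group("logger")] += 1
    return seen


def interleaved_seconds(lines, indexers):
    """Return the seconds in which every logger in `indexers` wrote a line."""
    return sorted(
        second
        for second, counts in loggers_by_second(lines).items()
        if all(counts[name] > 0 for name in indexers)
    )


# The two logger classes that appear in the log excerpt from these notes.
INDEXERS = (
    "org.dspace.discovery.SolrServiceImpl",
    "com.atmire.dspace.discovery.AtmireSolrService",
)

sample = [
    "2016-11-02 15:09:28,545 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/47242 to Index",
    "2016-11-02 15:09:28,633 INFO org.dspace.discovery.SolrServiceImpl @ Wrote Item: 10568/60785 to Index",
    "2016-11-02 15:09:28,678 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55695 of 55722): 43557",
    "2016-11-02 15:09:28,688 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55703 of 55722): 34476",
]

print(interleaved_seconds(sample, INDEXERS))  # ['2016-11-02 15:09:28']
```

<p>Grouping by second is a coarse choice made for readability; a real check might instead look at the whole indexing window.</p>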
|
||
|
||
<h2 id="2016-11-06">2016-11-06</h2>
|
||
|
||
<ul>
|
||
<li>After re-deploying and re-indexing I didn’t see the same issue, and the indexing completed in 85 minutes, which is about how long it is supposed to take</li>
|
||
</ul>
|
||
|
||
<h2 id="2016-11-07">2016-11-07</h2>
|
||
|
||
<ul>
|
||
<li>Horrible one-liner to get the Linode ID from certain Ansible host vars:</li>
|
||
</ul>
|
||
|
||
<pre><code>$ grep -A 3 contact_info * | grep -E "(Orth|Sisay|Peter|Daniel|Tsega)" | awk -F'-' '{print $1}' | grep linode | uniq | xargs grep linode_id
</code></pre>
|
||
|
||
<ul>
|
||
<li>I noticed some weird CRPs in the database, and they don’t show up in Discovery for some reason, perhaps because of the <code>:</code></li>
|
||
<li>I’ll export these and fix them in batch:</li>
|
||
</ul>
|
||
|
||
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=230 group by text_value order by count desc) to /tmp/crp.csv with csv;
COPY 22
</code></pre>
|
||
|
||
<ul>
|
||
<li>Test running the replacements:</li>
|
||
</ul>
|
||
|
||
<pre><code>$ ./fix-metadata-values.py -i /tmp/CRPs.csv -f cg.contributor.crp -t correct -m 230 -d dspace -u dspace -p 'fuuu'
</code></pre>
|
||
|
||
<ul>
|
||
<li>Add <code>AMR</code> to ILRI subjects and remove one duplicate instance of IITA in author affiliations controlled vocabulary (<a href="https://github.com/ilri/DSpace/pull/288">#288</a>)</li>
|
||
</ul>
|
||
|
||
<h2 id="2016-11-08">2016-11-08</h2>
|
||
|
||
<ul>
|
||
<li>Atmire’s Listings and Reports module seems to be broken on DSpace 5.5</li>
|
||
</ul>
|
||
|
||
<p><img src="2016/11/listings-and-reports-55.png" alt="Listings and Reports broken in DSpace 5.5" /></p>
|
||
|
||
<ul>
|
||
<li>I’ve filed a ticket with Atmire</li>
|
||
<li>Thinking about batch updates for ORCIDs and authors</li>
|
||
<li>Playing with <a href="https://github.com/moonlitesolutions/SolrClient">SolrClient</a> in Python to query Solr</li>
|
||
<li>All records in the authority core are either <code>authority_type:orcid</code> or <code>authority_type:person</code></li>
|
||
<li>There is a <code>deleted</code> field and all items seem to be <code>false</code>, but it might be an important sanity check to remember</li>
|
||
<li>The way to go is probably to have a CSV of author names and authority IDs, then to batch update them in PostgreSQL</li>
|
||
<li>Dump of the top ~200 authors in CGSpace:</li>
|
||
</ul>
|
||
|
||
<pre><code>dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=3 group by text_value order by count desc limit 210) to /tmp/210-authors.csv with csv;
</code></pre>
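<p>The batch update idea (a CSV of author names and authority IDs applied to the database) could look roughly like the sketch below. It is only an illustration: it uses an in-memory SQLite table as a stand-in for PostgreSQL, it assumes <code>metadatavalue</code> has an <code>authority</code> column, and the author names and authority ID are made-up placeholders. Field ID 3 is the author field used in the query above.</p>

```python
import csv
import io
import sqlite3


def apply_authorities(conn, csv_text, field_id):
    """Set the authority for every value of `field_id` whose text matches a CSV row.

    Each CSV row is expected to be: author name, authority ID.
    Returns the number of rows changed.
    """
    changed = 0
    for name, authority_id in csv.reader(io.StringIO(csv_text)):
        cursor = conn.execute(
            "UPDATE metadatavalue SET authority = ? "
            "WHERE metadata_field_id = ? AND text_value = ?",
            (authority_id, field_id, name),
        )
        changed += cursor.rowcount
    conn.commit()
    return changed


# Stand-in table with only the columns this sketch needs (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadatavalue "
    "(metadata_field_id INTEGER, text_value TEXT, authority TEXT)"
)
conn.executemany(
    "INSERT INTO metadatavalue VALUES (?, ?, NULL)",
    [(3, "Doe, Jane"), (3, "Doe, Jane"), (3, "Roe, Richard"), (64, "Doe, Jane")],
)

# Placeholder CSV; names containing commas must be quoted.
csv_text = '"Doe, Jane",hypothetical-authority-id-1\n'
print(apply_authorities(conn, csv_text, field_id=3))  # 2
```

<p>Matching on the exact <code>text_value</code> keeps the update narrow, so name variants would each need their own CSV row.</p>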
|
||
|
||
<h2 id="2016-11-09">2016-11-09</h2>
|
||
|
||
<ul>
|
||
<li>CGSpace crashed so I quickly ran system updates, applied one or two of the waiting changes from the <code>5_x-prod</code> branch, and rebooted the server</li>
|
||
<li>The error was <code>Timeout waiting for idle object</code> but I haven’t looked into the Tomcat logs to see what happened</li>
|
||
<li>Also, I ran the corrections for CRPs from earlier this week</li>
|
||
</ul>
|
||
|
||
<h2 id="2016-11-10">2016-11-10</h2>
|
||
|
||
<ul>
|
||
<li>Helping Megan Zandstra and CIAT with some questions about the REST API</li>
|
||
<li>Playing with <code>find-by-metadata-field</code>, this works:</li>
|
||
</ul>
|
||
|
||
<pre><code>$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}'
</code></pre>
|
||
|
||
<ul>
|
||
<li>But the results are deceiving because metadata fields can have text languages and your query must match exactly!</li>
|
||
</ul>
|
||
|
||
<pre><code>dspace=# select distinct text_value, text_lang from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
 text_value | text_lang
------------+-----------
 SEEDS      |
 SEEDS      |
 SEEDS      | en_US
(3 rows)
</code></pre>
|
||
|
||
<ul>
|
||
<li>So basically, the text language here could be null, blank, or en_US</li>
|
||
<li>To query metadata with these properties, you can do:</li>
|
||
</ul>
|
||
|
||
<pre><code>$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
55
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
34
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
</code></pre>
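<p>For scripting these three variants, a minimal Python sketch follows. It only builds the requests and does not send them; the URL, headers, and payload keys are copied from the <code>curl</code> calls above, while <code>build_query</code> and the <code>_OMIT</code> sentinel are hypothetical names introduced here.</p>

```python
import json
import urllib.request

REST_URL = "http://localhost:8080/rest/items/find-by-metadata-field"

# Sentinel meaning "leave the language key out of the payload entirely",
# which is different from sending an empty string.
_OMIT = object()


def build_query(key, value, language=_OMIT):
    """Build (but do not send) a POST request for find-by-metadata-field."""
    payload = {"key": key, "value": value}
    if language is not _OMIT:
        payload["language"] = language
    return urllib.request.Request(
        REST_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


requests_to_send = [
    build_query("cg.subject.ilri", "SEEDS"),                    # no language key
    build_query("cg.subject.ilri", "SEEDS", language=""),       # blank language
    build_query("cg.subject.ilri", "SEEDS", language="en_US"),  # explicit language
]

for request in requests_to_send:
    print(request.get_method(), request.data.decode("utf-8"))
    # To actually run a query against a live server:
    # items = json.load(urllib.request.urlopen(request)); print(len(items))
```

<p>Keeping "no language" and "blank language" as separate cases matters here because the API treats them as different queries.</p>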
|
||
|
||
<ul>
|
||
<li>The results (55+34=89) don’t seem to match those from the database:</li>
|
||
</ul>
|
||
|
||
<pre><code>dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang is null;
 count
-------
    15
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='';
 count
-------
     4
dspace=# select count(text_value) from metadatavalue where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS' and text_lang='en_US';
 count
-------
    66
</code></pre>
|
||
|
||
<ul>
|
||
<li>So, querying from the API I get 55 + 34 = 89 results, but the database actually only has 85…</li>
|
||
<li>And the <code>find-by-metadata-field</code> endpoint doesn’t seem to have a way to get all items with the field, or a wildcard value</li>
|
||
<li>I’ll ask a question on the dspace-tech mailing list</li>
|
||
<li>And speaking of <code>text_lang</code>, this is interesting:</li>
|
||
</ul>
|
||
|
||
<pre><code>dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
 text_lang
-----------

 ethnob
 en
 spa
 EN
 es
 frn
 en_
 en_US

 EN_US
 eng
 en_U
 fr
(14 rows)
</code></pre>
|
||
|
||
<ul>
|
||
<li>Generate a list of all these so I can maybe fix them in batch:</li>
|
||
</ul>
|
||
|
||
<pre><code>dspace=# \copy (select distinct text_lang, count(*) from metadatavalue where resource_type_id=2 group by text_lang order by count desc) to /tmp/text-langs.csv with csv;
COPY 14
</code></pre>
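<p>One way to plan a batch fix is to map each observed value to a suggested replacement first and review the result. The sketch below is only a guess at a sensible mapping, not a decision recorded in these notes: it assumes English variants should become <code>en_US</code>, blank should become NULL, and anything unrecognized (such as <code>ethnob</code>) should be left alone for manual review. The function name is hypothetical.</p>

```python
# Lower-cased variants, with trailing underscores stripped, grouped by language.
ENGLISH = {"en", "en_u", "en_us", "eng"}
SPANISH = {"es", "spa"}
FRENCH = {"fr", "frn"}


def suggest_text_lang(value):
    """Suggest a cleaned-up text_lang; None means the value should be NULL.

    Unrecognized values are returned unchanged so they can be reviewed by hand.
    """
    if value is None or value.strip() == "":
        return None
    key = value.strip().lower().rstrip("_")
    if key in ENGLISH:
        return "en_US"
    if key in SPANISH:
        return "es"
    if key in FRENCH:
        return "fr"
    return value


# The 14 distinct values from the query output above (None stands for NULL).
observed = [None, "", "ethnob", "en", "spa", "EN", "es", "frn",
            "en_", "en_US", "EN_US", "eng", "en_U", "fr"]

for value in observed:
    print(repr(value), "->", repr(suggest_text_lang(value)))
```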
|
||
|
||
<ul>
|
||
<li>Perhaps we need to fix them all in batch, or experiment with fixing only certain metadatavalues:</li>
|
||
</ul>
|
||
|
||
<pre><code>dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and metadata_field_id=203 and text_value='SEEDS';
UPDATE 85
</code></pre>
|
||
|
||
<ul>
|
||
<li>The <code>fix-metadata-values.py</code> script I have is meant for specific metadata values, so if I want to update some <code>text_lang</code> values I should just do it directly in the database</li>
|
||
<li>For example, on a limited set:</li>
|
||
</ul>
|
||
|
||
<pre><code>dspace=# update metadatavalue set text_lang=NULL where resource_type_id=2 and metadata_field_id=203 and text_value='LIVESTOCK' and text_lang='';
UPDATE 420
</code></pre>
|
||
|
||
<ul>
|
||
<li>And assuming I want to do it for all fields:</li>
|
||
</ul>
|
||
|
||
<pre><code>dspacetest=# update metadatavalue set text_lang=NULL where resource_type_id=2 and text_lang='';
UPDATE 183726
</code></pre>
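<p>The reason blank and NULL need separate handling is that <code>text_lang = ''</code> never matches NULL rows. The sketch below shows this on a toy table, using SQLite in memory as a stand-in for PostgreSQL (both follow the same rule here) and the SEEDS counts from the earlier queries (15 NULL, 4 blank, 66 <code>en_US</code>) as sample data. The reduced two-column table is an assumption for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadatavalue (text_value TEXT, text_lang TEXT)")

# Toy data mirroring the counts seen earlier: 15 NULL, 4 blank, 66 en_US.
rows = [("SEEDS", None)] * 15 + [("SEEDS", "")] * 4 + [("SEEDS", "en_US")] * 66
conn.executemany("INSERT INTO metadatavalue VALUES (?, ?)", rows)


def counts(conn):
    """Return (null, blank, en_US) row counts for the value SEEDS."""
    query = "SELECT count(*) FROM metadatavalue WHERE text_value = 'SEEDS' AND "
    return tuple(
        conn.execute(query + condition).fetchone()[0]
        for condition in ("text_lang IS NULL", "text_lang = ''", "text_lang = 'en_US'")
    )


before = counts(conn)
# Same shape as the update above: fold blank languages into NULL.
conn.execute("UPDATE metadatavalue SET text_lang = NULL WHERE text_lang = ''")
after = counts(conn)

print(before)  # (15, 4, 66)
print(after)   # (19, 0, 66)
```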
|
||
|
||
<ul>
|
||
<li>After that I restarted Tomcat and PostgreSQL (because I’m superstitious about caches) and now I see the following from the REST API queries:</li>
|
||
</ul>
|
||
|
||
<pre><code>$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS"}' | jq length
71
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":""}' | jq length
0
$ curl -s -H "accept: application/json" -H "Content-Type: application/json" -X POST "http://localhost:8080/rest/items/find-by-metadata-field" -d '{"key": "cg.subject.ilri","value": "SEEDS", "language":"en_US"}' | jq length
</code></pre>
|
||
|
||
<ul>
|
||
<li>Not sure what’s going on, but Discovery shows 83 values and the database shows 85, so I’m going to reindex Discovery just in case</li>
|
||
</ul>
|
||
|
||
<h2 id="2016-11-14">2016-11-14</h2>
|
||
|
||
<ul>
|
||
<li>I applied Atmire’s suggestions to fix Listings and Reports for DSpace 5.5 and now it works</li>
|
||
<li>There were some issues with the <code>dspace/modules/jspui/pom.xml</code>, which is annoying because all I did was rebase our working 5.1 code on top of 5.5, meaning Atmire’s installation procedure must have changed</li>
|
||
<li>So there is apparently this Tomcat native way to limit web crawlers to one session: <a href="https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve">Crawler Session Manager</a></li>
|
||
<li>After adding that to <code>server.xml</code>, bots matching the pattern in the configuration will all use ONE session, just like normal users:</li>
|
||
</ul>
|
||
|
||
<pre><code>$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Mon, 14 Nov 2016 19:47:29 GMT
Server: nginx
Set-Cookie: JSESSIONID=323694E079A53D5D024F839290EDD7E8; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Robots-Tag: none

$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Mon, 14 Nov 2016 19:47:35 GMT
Server: nginx
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
</code></pre>
|
||
|
||
<ul>
|
||
<li>The first one gets a session, and any after that — within 60 seconds — will be internally mapped to the same session by Tomcat</li>
|
||
<li>This means that when Google or Baidu slam you with tens of concurrent connections they will all map to ONE internal session, which saves RAM!</li>
|
||
</ul>
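<p>The behaviour described above can be pictured with a small model. This is a deliberately simplified sketch and not Tomcat’s code: it assumes the valve keys crawler sessions by client IP and forgets a mapping after 60 seconds of inactivity, and the class name, session ID format, and IP address are made up for illustration.</p>

```python
import itertools
import re


class CrawlerSessionModel:
    """Toy model: crawler requests from one IP share a session while it stays active."""

    def __init__(self, crawler_user_agents, session_inactive_interval=60):
        self.pattern = re.compile(crawler_user_agents)
        self.interval = session_inactive_interval
        self.sessions_by_ip = {}  # client IP -> (session ID, time last seen)
        self._ids = itertools.count(1)

    def session_for(self, client_ip, user_agent, now):
        """Return the session ID a cookie-less request would end up using."""
        if not self.pattern.fullmatch(user_agent):
            # Not a crawler: a request without a cookie simply gets a new session.
            return f"session-{next(self._ids)}"
        known = self.sessions_by_ip.get(client_ip)
        if known and now - known[1] <= self.interval:
            session_id = known[0]  # reuse the crawler's existing session
        else:
            session_id = f"session-{next(self._ids)}"  # first visit, or mapping expired
        self.sessions_by_ip[client_ip] = (session_id, now)
        return session_id


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
model = CrawlerSessionModel(".*[bB]ot.*")

first = model.session_for("192.0.2.10", GOOGLEBOT, now=0)
second = model.session_for("192.0.2.10", GOOGLEBOT, now=6)    # 6 s later: same session
third = model.session_for("192.0.2.10", GOOGLEBOT, now=100)   # idle > 60 s: new session
print(first, second, third)  # session-1 session-1 session-2
```

<p>This matches what the two requests above show: only the first response carries a <code>Set-Cookie</code> header, because the second request is mapped onto the existing session.</p>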
|
||
|
||
<h2 id="2016-11-15">2016-11-15</h2>
|
||
|
||
<ul>
|
||
<li>The Tomcat JVM heap looks really good after applying the Crawler Session Manager fix on DSpace Test last night:</li>
|
||
</ul>
|
||
|
||
<p><img src="2016/11/dspacetest-tomcat-jvm-day.png" alt="Tomcat JVM heap (day) after setting up the Crawler Session Manager" />
|
||
<img src="2016/11/dspacetest-tomcat-jvm-week.png" alt="Tomcat JVM heap (week) after setting up the Crawler Session Manager" /></p>
|
||
|
||
<ul>
|
||
<li>Seems the default regex doesn’t catch Baidu, though:</li>
|
||
</ul>
|
||
|
||
<pre><code>$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:54 GMT
Server: nginx
Set-Cookie: JSESSIONID=131409D143E8C01DE145C50FC748256E; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0

$ http --print h https://dspacetest.cgiar.org 'User-Agent:Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Tue, 15 Nov 2016 08:49:59 GMT
Server: nginx
Set-Cookie: JSESSIONID=F6403C084480F765ED787E41D2521903; Path=/; Secure; HttpOnly
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
</code></pre>
|
||
|
||
<ul>
|
||
<li>Adding Baiduspider to the list of user agents seems to work, and the final configuration should be:</li>
|
||
</ul>
|
||
|
||
<pre><code>&lt;!-- Crawler Session Manager Valve helps mitigate damage done by web crawlers --&gt;
&lt;Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*Baiduspider.*" /&gt;
</code></pre>
|
||
|
||
<ul>
|
||
<li>Looking at the bots that were active yesterday it seems the above regex should be sufficient:</li>
|
||
</ul>
|
||
|
||
<pre><code>$ grep -o -E 'Mozilla/5\.0 \(compatible;.*\"' /var/log/nginx/access.log | sort | uniq
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "-"
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" "-"
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)" "-"
</code></pre>
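<p>The regex can be checked against these user agents directly. The sketch below uses Python’s <code>re.fullmatch</code> as a stand-in for the whole-string match the valve is assumed to perform; the <code>DEFAULT</code> pattern is what I understand the Tomcat 7 default to be, which is an assumption worth verifying against the Tomcat documentation. The user agent strings are taken from the log output above with the trailing quote characters removed.</p>

```python
import re

# Assumed Tomcat 7 default for crawlerUserAgents (verify against the Tomcat docs).
DEFAULT = r".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*"
# The extended pattern from the configuration in these notes.
EXTENDED = DEFAULT + r"|.*Baiduspider.*"

USER_AGENTS = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
    "Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)",
]


def unmatched(pattern, user_agents):
    """Return the user agents that the pattern does not match in full."""
    compiled = re.compile(pattern)
    return [ua for ua in user_agents if not compiled.fullmatch(ua)]


print(unmatched(DEFAULT, USER_AGENTS))   # only the Baiduspider user agent
print(unmatched(EXTENDED, USER_AGENTS))  # []
```

<p>Note that YandexImages only matches because its URL happens to contain <code>bots</code>, so a crawler without "bot" anywhere in its user agent would still need its own entry.</p>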
</article>
|
||
|
||
|
||
|
||
|
||
|
||
</div> <!-- /.blog-main -->
|
||
|
||
<aside class="col-sm-3 offset-sm-1 blog-sidebar">
|
||
|
||
|
||
|
||
|
||
<section class="sidebar-module">
|
||
<h4>Recent Posts</h4>
|
||
<ol class="list-unstyled">
|
||
|
||
<li><a href="/cgspace-notes/2016-11/">November, 2016</a></li>
|
||
|
||
<li><a href="/cgspace-notes/2016-10/">October, 2016</a></li>
|
||
|
||
<li><a href="/cgspace-notes/2016-09/">September, 2016</a></li>
|
||
|
||
<li><a href="/cgspace-notes/2016-08/">August, 2016</a></li>
|
||
|
||
<li><a href="/cgspace-notes/2016-07/">July, 2016</a></li>
|
||
|
||
</ol>
|
||
</section>
|
||
|
||
|
||
<section class="sidebar-module">
|
||
<h4>Links</h4>
|
||
<ol class="list-unstyled">
|
||
|
||
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
|
||
|
||
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
|
||
|
||
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
|
||
|
||
</ol>
|
||
</section>
|
||
|
||
</aside>
|
||
|
||
|
||
</div> <!-- /.row -->
|
||
</div> <!-- /.container -->
|
||
|
||
<footer class="blog-footer">
|
||
<p>
|
||
|
||
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
|
||
|
||
</p>
|
||
<p>
|
||
<a href="#">Back to top</a>
|
||
</p>
|
||
</footer>
|
||
|
||
</body>
|
||
|
||
</html>
|