Update notes for 2019-04-06

Alan Orth 2019-04-06 11:47:45 +03:00
parent 2d768f6486
commit 682a2c2194
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
3 changed files with 119 additions and 8 deletions


@ -109,4 +109,54 @@ statistics-2017: org.apache.solr.common.SolrException:org.apache.solr.common.Sol
- I restarted it again and all the Solr cores came up properly...
## 2019-04-06
- Udana asked why item [10568/91278](https://cgspace.cgiar.org/handle/10568/91278) didn't have an Altmetric badge on CGSpace even though it has one on the [WLE website](https://wle.cgiar.org/food-and-agricultural-innovation-pathways-prosperity)
- I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all
- I tweeted the item and I assume this will link the Handle with the DOI in the system
- Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:
```
# zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
222 18.195.78.144
245 207.46.13.58
303 207.46.13.194
328 66.249.79.33
564 207.46.13.210
566 66.249.79.62
575 40.77.167.66
1803 66.249.79.59
2834 2a01:4f8:140:3192::2
9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E "06/Apr/2019:(06|07|08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
31 66.249.79.62
41 207.46.13.210
42 40.77.167.66
54 42.113.50.219
132 66.249.79.59
785 2001:41d0:d:1990::
1164 45.5.184.72
2014 50.116.102.77
4267 45.5.186.2
4893 205.186.128.185
```
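The two pipelines above differ only in which log files they read, so the counting step could be factored into a small helper. This is a minimal sketch, assuming the client IP is the first field of each log line as in the commands above; the function name `top_ips` and its arguments are hypothetical and not something that exists on the server:

```shell
# top_ips PATTERN [N]: read nginx log lines on stdin, keep those matching
# the extended regex PATTERN, and print the N busiest client IPs (default 10)
# as "count ip", least busy first.
top_ips() {
  grep -E "$1" \
    | awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' \
    | sort -n \
    | tail -n "${2:-10}"
}
```

It could be fed the same way as before, for example `zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | top_ips '06/Apr/2019:(06|07|08|09)'`. Unlike `uniq -c`, the counts are printed without leading padding.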
- `45.5.184.72` is in Colombia so it's probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT's datasets collection:
```
GET /handle/10568/72970/discover?filtertype_0=type&filtertype_1=author&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Dataset&filtertype=dateIssued&filter_relational_operator=equals&filter=2014
```
- Their user agent is the one I added to the badbots list in nginx last week: "GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1"
- They have made more than 22,000 requests to Discover on this collection today alone (and it's only 11AM):
```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep "06/Apr/2019" | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
22077 /handle/10568/72970/discover
```
- I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover
- Maria from Bioversity recommended that we use the phrase "AGROVOC subject" instead of "Subject" in Listings and Reports
- I made a pull request to update this and merged it to the `5_x-prod` branch ([#418](https://github.com/ilri/DSpace/pull/418))
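
For the CIAT crawler noted above, paging through the collection over REST should need far fewer requests than crawling Discover. This is a minimal sketch, assuming the DSpace 5 REST layout `/rest/collections/{id}/items` with `limit` and `offset` query parameters; the helper name `collection_items_url` and the collection ID `1234` are hypothetical placeholders (the REST API takes the collection's internal ID, not its Handle):

```shell
# collection_items_url BASE ID LIMIT OFFSET: print the URL for one page of
# items in a collection. Assumes DSpace 5 style REST paths; ID is the
# collection's internal ID (1234 below is a made-up placeholder).
collection_items_url() {
  printf '%s/rest/collections/%s/items?limit=%s&offset=%s\n' "$1" "$2" "$3" "$4"
}

# Print the URLs for the first three pages of 100 items each.
for offset in 0 100 200; do
  collection_items_url "https://cgspace.cgiar.org" 1234 100 "$offset"
done
```

Each URL could then be fetched with something like `curl -s -H 'Accept: application/json' URL`.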
<!-- vim: set sw=2 ts=2: -->


@ -38,7 +38,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2019-04/" />
<meta property="article:published_time" content="2019-04-01T09:00:43&#43;03:00"/>
<meta property="article:modified_time" content="2019-04-05T22:22:41&#43;03:00"/> <meta property="article:modified_time" content="2019-04-05T23:07:30&#43;03:00"/>
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="April, 2019"/>
@ -81,9 +81,9 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
"@type": "BlogPosting", "@type": "BlogPosting",
"headline": "April, 2019", "headline": "April, 2019",
"url": "https://alanorth.github.io/cgspace-notes/2019-04/", "url": "https://alanorth.github.io/cgspace-notes/2019-04/",
"wordCount": "661", "wordCount": "980",
"datePublished": "2019-04-01T09:00:43&#43;03:00", "datePublished": "2019-04-01T09:00:43&#43;03:00",
"dateModified": "2019-04-05T22:22:41&#43;03:00", "dateModified": "2019-04-05T23:07:30&#43;03:00",
"author": { "author": {
"@type": "Person", "@type": "Person",
"name": "Alan Orth" "name": "Alan Orth"
@ -286,6 +286,67 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>I restarted it again and all the Solr cores came up properly&hellip;</li>
</ul>
<h2 id="2019-04-06">2019-04-06</h2>
<ul>
<li>Udana asked why item <a href="https://cgspace.cgiar.org/handle/10568/91278"><sup>10568</sup>&frasl;<sub>91278</sub></a> didn&rsquo;t have an Altmetric badge on CGSpace even though it has one on the <a href="https://wle.cgiar.org/food-and-agricultural-innovation-pathways-prosperity">WLE website</a>
<ul>
<li>I looked and saw that the WLE website is using the Altmetric score associated with the DOI, and that the Handle has no score at all</li>
<li>I tweeted the item and I assume this will link the Handle with the DOI in the system</li>
</ul></li>
<li>Linode sent an alert that there was high CPU usage this morning on CGSpace (linode18) and these were the top IPs in the webserver access logs around the time:</li>
</ul>
<pre><code># zcat --force /var/log/nginx/{access,error,library-access}.log /var/log/nginx/{access,error,library-access}.log.1 | grep -E &quot;06/Apr/2019:(06|07|08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
222 18.195.78.144
245 207.46.13.58
303 207.46.13.194
328 66.249.79.33
564 207.46.13.210
566 66.249.79.62
575 40.77.167.66
1803 66.249.79.59
2834 2a01:4f8:140:3192::2
9623 45.5.184.72
# zcat --force /var/log/nginx/{rest,oai}.log /var/log/nginx/{rest,oai}.log.1 | grep -E &quot;06/Apr/2019:(06|07|08|09)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
31 66.249.79.62
41 207.46.13.210
42 40.77.167.66
54 42.113.50.219
132 66.249.79.59
785 2001:41d0:d:1990::
1164 45.5.184.72
2014 50.116.102.77
4267 45.5.186.2
4893 205.186.128.185
</code></pre>
<ul>
<li><code>45.5.184.72</code> is in Colombia so it&rsquo;s probably CIAT, and I see they are indeed trying to crawl the Discover pages on CIAT&rsquo;s datasets collection:</li>
</ul>
<pre><code>GET /handle/10568/72970/discover?filtertype_0=type&amp;filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_relational_operator_0=equals&amp;filter_1=&amp;filter_0=Dataset&amp;filtertype=dateIssued&amp;filter_relational_operator=equals&amp;filter=2014
</code></pre>
<ul>
<li>Their user agent is the one I added to the badbots list in nginx last week: &ldquo;GuzzleHttp/6.3.3 curl/7.47.0 PHP/7.0.30-0ubuntu0.16.04.1&rdquo;</li>
<li>They have made more than 22,000 requests to Discover on this collection today alone (and it&rsquo;s only 11AM):</li>
</ul>
<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &quot;06/Apr/2019&quot; | grep 45.5.184.72 | grep -oE '/handle/[0-9]+/[0-9]+/discover' | sort | uniq -c
22077 /handle/10568/72970/discover
</code></pre>
<ul>
<li>I need to find a contact at CIAT to tell them to use the REST API rather than crawling Discover</li>
<li>Maria from Bioversity recommended that we use the phrase &ldquo;AGROVOC subject&rdquo; instead of &ldquo;Subject&rdquo; in Listings and Reports
<ul>
<li>I made a pull request to update this and merged it to the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/418">#418</a>)</li>
</ul></li>
</ul>
<!-- vim: set sw=2 ts=2: -->


@ -4,7 +4,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2019-04/</loc>
<lastmod>2019-04-05T22:22:41+03:00</lastmod> <lastmod>2019-04-05T23:07:30+03:00</lastmod>
</url>
<url>
@ -219,7 +219,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2019-04-05T22:22:41+03:00</lastmod> <lastmod>2019-04-05T23:07:30+03:00</lastmod>
<priority>0</priority>
</url>
@ -230,7 +230,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2019-04-05T22:22:41+03:00</lastmod> <lastmod>2019-04-05T23:07:30+03:00</lastmod>
<priority>0</priority>
</url>
@ -242,13 +242,13 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2019-04-05T22:22:41+03:00</lastmod> <lastmod>2019-04-05T23:07:30+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2019-04-05T22:22:41+03:00</lastmod> <lastmod>2019-04-05T23:07:30+03:00</lastmod>
<priority>0</priority>
</url>