<!DOCTYPE html>
|
|
<html lang="en" >
|
|
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
|
|
|
<meta property="og:title" content="August, 2020" />
|
|
<meta property="og:description" content="2020-08-02
|
|
|
|
I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their cg.coverage.country text values
|
|
|
|
It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter’s preferred “display” country names)
|
|
It implements a “force” mode too that will clear existing country codes and re-tag everything
|
|
It is class based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa…
|
|
|
|
|
|
" />
|
|
<meta property="og:type" content="article" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-08/" />
|
|
<meta property="article:published_time" content="2020-08-02T15:35:54+03:00" />
|
|
<meta property="article:modified_time" content="2020-08-06T16:24:01+03:00" />
|
|
|
|
<meta name="twitter:card" content="summary"/>
|
|
<meta name="twitter:title" content="August, 2020"/>
|
|
<meta name="twitter:description" content="2020-08-02
|
|
|
|
I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their cg.coverage.country text values
|
|
|
|
It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter’s preferred “display” country names)
|
|
It implements a “force” mode too that will clear existing country codes and re-tag everything
|
|
It is class based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa…
|
|
|
|
|
|
"/>
|
|
<meta name="generator" content="Hugo 0.74.3" />
|
|
|
|
|
|
|
|
<script type="application/ld+json">
|
|
{
|
|
"@context": "http://schema.org",
|
|
"@type": "BlogPosting",
|
|
"headline": "August, 2020",
|
|
"url": "https://alanorth.github.io/cgspace-notes/2020-08/",
|
|
"wordCount": "1421",
|
|
"datePublished": "2020-08-02T15:35:54+03:00",
|
|
"dateModified": "2020-08-06T16:24:01+03:00",
|
|
"author": {
|
|
"@type": "Person",
|
|
"name": "Alan Orth"
|
|
},
|
|
"keywords": "Notes"
|
|
}
|
|
</script>
|
|
|
|
|
|
|
|
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-08/">
|
|
|
|
<title>August, 2020 | CGSpace Notes</title>
|
|
|
|
|
|
<!-- combined, minified CSS -->
|
|
|
|
<link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous">
|
|
|
|
|
|
<!-- minified Font Awesome for SVG icons -->
|
|
|
|
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f3d2a1f5980bab30ddd0d8cadbd496475309fc48e2b1d052c5c09e6facffcb0f.js" integrity="sha256-89Kh9ZgLqzDd0NjK29SWR1MJ/EjisdBSxcCeb6z/yw8=" crossorigin="anonymous"></script>
|
|
|
|
<!-- RSS 2.0 feed -->
|
|
|
|
|
|
|
|
|
|
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
|
|
<div class="blog-masthead">
|
|
<div class="container">
|
|
<nav class="nav blog-nav">
|
|
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
|
</nav>
|
|
</div>
|
|
</div>
|
|
|
|
|
|
|
|
|
|
<header class="blog-header">
|
|
<div class="container">
|
|
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
|
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
|
</div>
|
|
</header>
|
|
|
|
|
|
|
|
|
|
<div class="container">
|
|
<div class="row">
|
|
<div class="col-sm-8 blog-main">
|
|
|
|
|
|
|
|
|
|
<article class="blog-post">
|
|
<header>
|
|
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-08/">August, 2020</a></h2>
|
|
<p class="blog-post-meta"><time datetime="2020-08-02T15:35:54+03:00">Sun Aug 02, 2020</time> by Alan Orth in
|
|
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
|
|
|
|
|
|
</p>
|
|
</header>
|
|
<h2 id="2020-08-02">2020-08-02</h2>
|
|
<ul>
|
|
<li>I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their <code>cg.coverage.country</code> text values
|
|
<ul>
|
|
<li>It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter’s preferred “display” country names)</li>
|
|
<li>It implements a “force” mode too that will clear existing country codes and re-tag everything</li>
|
|
<li>It is class-based, so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa…</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
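<p>A minimal sketch of the lookup order and the “force” mode described above, written in Python rather than the Java of the actual curation task. The two lookup tables are truncated, illustrative samples, and the <code>country_codes</code> key is a hypothetical stand-in for whichever metadata field the task really writes to.</p>

```python
# Illustrative, truncated tables; the real task loads complete vocabularies.
ISO_3166_1 = {
    "Kenya": "KE",
    "Viet Nam": "VN",
    "Tanzania, United Republic of": "TZ",
}
# Hypothetical examples of preferred "display" names (assumed, not from the notes).
CGSPACE_COUNTRIES = {
    "Vietnam": "VN",
    "Tanzania": "TZ",
}


def lookup_alpha2(name):
    """Look the name up in ISO 3166-1 first, then in the CGSpace mapping."""
    code = ISO_3166_1.get(name)
    if code is None:
        code = CGSPACE_COUNTRIES.get(name)
    return code


def tag_item(item, force=False):
    """Tag an item (a plain dict here) with Alpha2 codes for its countries.

    Without force, items that already carry codes are left untouched.
    With force, existing codes are cleared and everything is re-tagged.
    """
    if item.get("country_codes") and not force:
        return item
    codes = []
    for name in item.get("cg.coverage.country", []):
        code = lookup_alpha2(name)
        if code is not None and code not in codes:
            codes.append(code)
    item["country_codes"] = codes
    return item


item = {"cg.coverage.country": ["Kenya", "Tanzania"], "country_codes": ["XX"]}
print(tag_item(dict(item)))              # existing codes are kept
print(tag_item(dict(item), force=True))  # codes are cleared and rebuilt
```

<p>Keeping the two tables separate means the ISO names always win, and the CGSpace mapping only has to list the handful of display names that differ.</p>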
|
|
<ul>
|
|
<li>The code is currently on my personal GitHub: <a href="https://github.com/alanorth/dspace-curation-tasks">https://github.com/alanorth/dspace-curation-tasks</a>
|
|
<ul>
|
|
<li>I still need to figure out how to integrate this with the DSpace build because currently you have to package it and copy the JAR to the <code>dspace/lib</code> directory (not to mention the config)</li>
|
|
</ul>
|
|
</li>
|
|
<li>I forked the <a href="https://github.com/ilri/dspace-curation-tasks">dspace-curation-tasks to ILRI’s GitHub</a> and <a href="https://issues.sonatype.org/browse/OSSRH-59650">submitted the project to Maven Central</a> so I can integrate it more easily with our DSpace build via dependencies</li>
|
|
</ul>
|
|
<h2 id="2020-08-03">2020-08-03</h2>
|
|
<ul>
|
|
<li>Atmire responded to the ticket about the ongoing upgrade issues
|
|
<ul>
|
|
<li>They pushed an RC2 version of the CUA module that fixes the FontAwesome issue: they now use classes instead of Unicode hex characters, so our JS + SVG works!</li>
|
|
<li>They also said they have never experienced the <code>type: 5</code> site statistics issue, so I need to try to purge those and continue with the stats processing</li>
|
|
</ul>
|
|
</li>
|
|
<li>I purged all unmigrated stats in a few cores and then restarted processing:</li>
|
|
</ul>
|
|
<pre><code>$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '&lt;delete&gt;&lt;query&gt;id:/.*unmigrated.*/&lt;/query&gt;&lt;/delete&gt;'
|
|
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
|
|
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
|
|
</code></pre><ul>
|
|
<li>Andrea from Macaroni Bros emailed me a few days ago to say he’s having issues with the CGSpace REST API
|
|
<ul>
|
|
<li>He said he noticed the issues when they were developing the WordPress plugin to harvest CGSpace for the RTB website: <a href="https://www.rtb.cgiar.org/publications/">https://www.rtb.cgiar.org/publications/</a></li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2020-08-04">2020-08-04</h2>
|
|
<ul>
|
|
<li>Look into the REST API issues that Macaroni Bros raised last week:
|
|
<ul>
|
|
<li>The first one was about the <code>collections</code> endpoint returning empty items:
|
|
<ul>
|
|
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=2">https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=2</a> (offset=2 is correct)</li>
|
|
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=3">https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=3</a> (offset=3 is empty)</li>
|
|
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=4">https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=4</a> (offset=4 is correct again)</li>
|
|
</ul>
|
|
</li>
|
|
<li>I confirm that the second link returns zero items on CGSpace…
|
|
<ul>
|
|
<li>I tested on my local development instance and it returns one item correctly…</li>
|
|
<li>I tested on DSpace Test (currently DSpace 6 with UUIDs) and it returns one item correctly…</li>
|
|
<li>Perhaps an indexing issue?</li>
|
|
</ul>
|
|
</li>
|
|
<li>The second issue is the <code>collections</code> endpoint returning the wrong number of items:
|
|
<ul>
|
|
<li><a href="https://cgspace.cgiar.org/rest/collections/1445">https://cgspace.cgiar.org/rest/collections/1445</a> (numberItems: 63)</li>
|
|
<li><a href="https://cgspace.cgiar.org/rest/collections/1445/items">https://cgspace.cgiar.org/rest/collections/1445/items</a> (real number of items: 61)</li>
|
|
</ul>
|
|
</li>
|
|
<li>I confirm that it is indeed happening on CGSpace…
|
|
<ul>
|
|
<li>And actually I can replicate the same issue on my local CGSpace 5.8 instance:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre><code>$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
|
|
"numberItems" : 63,
|
|
$ http 'http://localhost:8080/rest/collections/1445/items' | jq '. | length'
|
|
61
|
|
</code></pre><ul>
|
|
<li>Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:</li>
|
|
</ul>
|
|
<pre><code>$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
|
|
"numberItems" : 61,
|
|
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
|
|
59
|
|
</code></pre><ul>
|
|
<li>Ah! I exported that collection’s metadata and checked it in OpenRefine, where I noticed that two items are mapped twice
|
|
<ul>
|
|
<li>I dealt with this problem in 2017-01 and the solution is to check the <code>collection2item</code> table:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre><code>dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
|
|
id | collection_id | item_id
|
|
--------+---------------+---------
|
|
133698 | 966 | 107687
|
|
134685 | 1445 | 107687
|
|
134686 | 1445 | 107687
|
|
(3 rows)
|
|
</code></pre><ul>
|
|
<li>So for each id you can delete one duplicate mapping:</li>
|
|
</ul>
|
|
<pre><code>dspace=# DELETE FROM collection2item WHERE id='134686';
|
|
dspace=# DELETE FROM collection2item WHERE id='128819';
|
|
</code></pre><ul>
|
|
<li>Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter’s preferred display names</li>
|
|
</ul>
|
|
<pre><code>$ cat 2020-08-04-PB-new-countries.csv
|
|
cg.coverage.country,correct
|
|
CAPE VERDE,CABO VERDE
|
|
COCOS ISLANDS,COCOS (KEELING) ISLANDS
|
|
"CONGO, DR","CONGO, DEMOCRATIC REPUBLIC OF"
|
|
COTE D'IVOIRE,CÔTE D'IVOIRE
|
|
"KOREA, REPUBLIC","KOREA, REPUBLIC OF"
|
|
PALESTINE,"PALESTINE, STATE OF"
|
|
$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
|
|
</code></pre><ul>
|
|
<li>I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly
|
|
<ul>
|
|
<li>I started a full Discovery re-indexing</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
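<p>The duplicate-mapping cleanup from earlier in the day can be sketched as follows. This is only an illustration in Python, working on rows already fetched from <code>collection2item</code>; the function name is made up, and keeping the lowest <code>id</code> of each pair is an assumption that happens to agree with the row deleted above.</p>

```python
from collections import defaultdict


def duplicate_mapping_ids(rows):
    """Return the ids of redundant collection2item rows.

    rows is an iterable of (id, collection_id, item_id) tuples. For every
    (collection_id, item_id) pair that appears more than once, the lowest
    id is kept and the remaining ids are reported for deletion.
    """
    ids_by_pair = defaultdict(list)
    for row_id, collection_id, item_id in rows:
        ids_by_pair[(collection_id, item_id)].append(row_id)

    extras = []
    for ids in ids_by_pair.values():
        extras.extend(sorted(ids)[1:])
    return sorted(extras)


# The three rows shown in the psql output for item 107687
rows = [
    (133698, 966, 107687),
    (134685, 1445, 107687),
    (134686, 1445, 107687),
]
for row_id in duplicate_mapping_ids(rows):
    print(f"DELETE FROM collection2item WHERE id='{row_id}';")
```

<p>Two items mapped twice would also account for the difference between <code>numberItems</code> (63) and the 61 items actually returned.</p>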
|
|
<h2 id="2020-08-05">2020-08-05</h2>
|
|
<ul>
|
|
<li>Port my <a href="https://github.com/ilri/dspace-curation-tasks">dspace-curation-tasks</a> to DSpace 6 and tag version <code>6.0-SNAPSHOT</code></li>
|
|
<li>I downloaded the <a href="https://unstats.un.org/unsd/methodology/m49/overview/">UN M.49</a> CSV file to start working on updating the CGSpace regions
|
|
<ul>
|
|
<li>First issue is they don’t version the file so you have no idea when it was released</li>
|
|
<li>Second issue is that three rows have errors due to not using quotes around “China, Macao Special Administrative Region”</li>
|
|
</ul>
|
|
</li>
|
|
<li>Bizu said she was having problems approving tasks on CGSpace
|
|
<ul>
|
|
<li>I looked at the PostgreSQL locks and they have skyrocketed since yesterday:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<p><img src="/cgspace-notes/2020/08/postgres_locks_ALL-day.png" alt="PostgreSQL locks day"></p>
|
|
<p><img src="/cgspace-notes/2020/08/postgres_querylength_ALL-day.png" alt="PostgreSQL query length day"></p>
|
|
<ul>
|
|
<li>Seems that something happened yesterday afternoon at around 5PM…
|
|
<ul>
|
|
<li>For now I will just run all updates on the server and reboot it, as I have no idea what causes this issue</li>
|
|
<li>I had to restart Tomcat 7 three times after the server came back up before all Solr statistics cores came up properly</li>
|
|
</ul>
|
|
</li>
|
|
<li>I checked the nginx logs around 5PM yesterday to see who was accessing the server:</li>
|
|
</ul>
|
|
<pre><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
|
|
</code></pre><ul>
|
|
<li>I see the Macaroni Bros are using their new user agent for harvesting: <code>RTB website BOT</code>
|
|
<ul>
|
|
<li>But that pattern doesn’t match in the nginx bot list or Tomcat’s crawler session manager valve because we’re only checking for <code>[Bb]ot</code>!</li>
|
|
<li>So they have created thousands of Tomcat sessions:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre><code>$ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
|
|
5693
|
|
</code></pre><ul>
|
|
<li>DSpace itself uses a case-insensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don’t misuse the resources
|
|
<ul>
|
|
<li>Perhaps <code>[Bb][Oo][Tt]</code>…</li>
|
|
</ul>
|
|
</li>
|
|
<li>I see another IP, 104.198.96.245, which is also using the “RTB website BOT” user agent, but there are 70,000 hits from it in Solr from earlier this year, before they started using that user agent
|
|
<ul>
|
|
<li>I purged all the hits from Solr, including a few thousand from 64.62.202.71</li>
|
|
</ul>
|
|
</li>
|
|
<li>A few more IPs causing lots of Tomcat sessions yesterday:</li>
|
|
</ul>
|
|
<pre><code>$ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
|
|
1585
|
|
$ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
|
|
5691
|
|
</code></pre><ul>
|
|
<li>38.128.66.10 isn’t creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:</li>
|
|
</ul>
|
|
<pre><code>Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
|
|
</code></pre><ul>
|
|
<li>64.62.202.71 is using a user agent I’ve never seen before:</li>
|
|
</ul>
|
|
<pre><code>Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
|
|
</code></pre><ul>
|
|
<li>So now our “bot” regex can’t even match that…
|
|
<ul>
|
|
<li>Unless we change it to <code>[Bb]\.?[Oo]\.?[Tt]\.?</code>… which seems to match all variations of “bot” I can think of right now, according to <a href="https://regexr.com/59lpt">regexr.com</a>:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<pre><code>RTB website BOT
|
|
Altmetribot
|
|
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
|
|
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
|
|
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
|
|
</code></pre><ul>
|
|
<li>And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):</li>
|
|
</ul>
|
|
<pre><code>$ cat dspace.log.2020-08-04 | grep "199.47.87.145" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
|
|
2777
|
|
</code></pre><ul>
|
|
<li>I will add <code>Turnitin</code> to the Tomcat Crawler Session Manager Valve regex as well…</li>
|
|
</ul>
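<p>As a quick cross-check of the regexr.com test, here is a small Python sketch comparing the old <code>[Bb]ot</code> pattern with the proposed <code>[Bb]\.?[Oo]\.?[Tt]\.?</code> against the user agents quoted above. It uses Python's <code>re</code> module, so it only approximates how nginx and the Tomcat valve will evaluate the pattern.</p>

```python
import re

OLD_PATTERN = re.compile(r"[Bb]ot")
NEW_PATTERN = re.compile(r"[Bb]\.?[Oo]\.?[Tt]\.?")

# User agents copied from the notes above
USER_AGENTS = [
    "RTB website BOT",
    "Altmetribot",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)",
    "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)",
    "Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2",
]


def matches(pattern, user_agent):
    """True if the pattern occurs anywhere in the user agent string."""
    return pattern.search(user_agent) is not None


for user_agent in USER_AGENTS:
    old = matches(OLD_PATTERN, user_agent)
    new = matches(NEW_PATTERN, user_agent)
    print(f"old={old!s:5} new={new!s:5} {user_agent}")
```

<p>The brokenlinkcheck.com agent contains no form of “bot” at all, so like Turnitin it would still need its own entry in the pattern.</p>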
|
|
<h2 id="2020-08-06">2020-08-06</h2>
|
|
<ul>
|
|
<li>I have been working on processing the Solr statistics with the Atmire tool on DSpace Test over the last few days:
|
|
<ul>
|
|
<li>statistics:
|
|
<ul>
|
|
<li>2,040,385 docs: 2h 28m 49s</li>
|
|
</ul>
|
|
</li>
|
|
<li>statistics-2019:
|
|
<ul>
|
|
<li>8,960,000 docs: 12h 7s</li>
|
|
<li>1,780,575 docs: 2h 7m 29s</li>
|
|
</ul>
|
|
</li>
|
|
<li>statistics-2018:
|
|
<ul>
|
|
<li>1,970,000 docs: 12h 1m 28s</li>
|
|
<li>360,000 docs: 2h 54m 56s (Linode rebooted)</li>
|
|
<li>1,110,000 docs: 7h 1m 44s (Restarted Tomcat, oops)</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
<li>I decided to start the 2018 core over again, so I re-synced it from CGSpace and started again from the solr-upgrade-statistics-6x tool, but now I’m having the same issues with Java heap space that I had last month
|
|
<ul>
|
|
<li>The process kept crashing due to memory, so I increased the memory to 3072m and finally 4096m…</li>
|
|
<li>Also, I decided to try to purge all the <code>-unmigrated</code> docs that it had found so far to see if that helps…</li>
|
|
<li>There were about 466,000 records unmigrated so far, most of which were <code>type: 5</code> (SITE statistics)</li>
|
|
<li>Now it is processing again…</li>
|
|
</ul>
|
|
</li>
|
|
<li>I developed a small Java class called <code>FixJpgJpgThumbnails</code> to remove “.jpg.jpg” thumbnails from the <code>THUMBNAIL</code> bundle and replace them with their originals from the <code>ORIGINAL</code> bundle
|
|
<ul>
|
|
<li>The code is based on <a href="https://github.com/UoW-IRRs/DSpace-Scripts/blob/master/src/main/java/nz/ac/waikato/its/irr/scripts/RemovePNGThumbnailsForPDFs.java">RemovePNGThumbnailsForPDFs.java</a> by Andrea Schweer</li>
|
|
<li>I incorporated it into my dspace-curation-tasks repository, then renamed it to <a href="https://github.com/ilri/cgspace-java-helpers">cgspace-java-helpers</a></li>
|
|
<li>In testing I found that I can replace ~4,000 thumbnails on CGSpace!</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
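<p>To compare the statistics processing runs listed above, the timings can be converted to a rough throughput. This is just a back-of-the-envelope sketch in Python; the helper names are made up and the figures are the ones from the list.</p>

```python
import re

SECONDS_PER_UNIT = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text):
    """Convert a string like '2h 28m 49s' to seconds; missing units count as zero."""
    parts = re.findall(r"(\d+)([hms])", text)
    return sum(int(number) * SECONDS_PER_UNIT[unit] for number, unit in parts)


def docs_per_second(docs, duration):
    """Average number of documents processed per second for one run."""
    return docs / parse_duration(duration)


# (core, documents processed, elapsed time) as recorded in the notes
RUNS = [
    ("statistics", 2040385, "2h 28m 49s"),
    ("statistics-2019", 8960000, "12h 7s"),
    ("statistics-2019", 1780575, "2h 7m 29s"),
    ("statistics-2018", 1970000, "12h 1m 28s"),
    ("statistics-2018", 360000, "2h 54m 56s"),
    ("statistics-2018", 1110000, "7h 1m 44s"),
]

for core, docs, duration in RUNS:
    rate = docs_per_second(docs, duration)
    print(f"{core}: {docs} docs in {duration} ({rate:.1f} docs/s)")
```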
|
|
<h2 id="2020-08-07">2020-08-07</h2>
|
|
<ul>
|
|
<li>I improved the <code>RemovePNGThumbnailsForPDFs.java</code> a bit more to exclude infographics and original bitstreams larger than 100KiB
|
|
<ul>
|
|
<li>I ran it on CGSpace and it cleaned up 3,769 thumbnails!</li>
|
|
<li>Afterwards I ran <code>dspace cleanup -v</code> to remove the deleted thumbnails</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
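<p>The selection rule described above can be sketched roughly like this. It is a Python illustration, not the actual Java class: bundles are reduced to plain names and sizes, the <code>is_infographic</code> flag is a hypothetical stand-in for however the real code recognizes infographics, and matching a “.jpg.jpg” thumbnail to its original by stripping the last “.jpg” is an assumption.</p>

```python
MAX_ORIGINAL_BYTES = 100 * 1024  # originals larger than 100KiB are skipped


def replaceable_thumbnails(originals, thumbnails, is_infographic=False):
    """Pick the ".jpg.jpg" thumbnails that can be replaced by their originals.

    originals maps bitstream names in the ORIGINAL bundle to sizes in bytes;
    thumbnails lists bitstream names in the THUMBNAIL bundle. Returns a list
    of (thumbnail name, original name) pairs.
    """
    if is_infographic:
        return []
    selected = []
    for thumbnail in thumbnails:
        if not thumbnail.lower().endswith(".jpg.jpg"):
            continue
        original = thumbnail[: -len(".jpg")]
        size = originals.get(original)
        if size is None or size > MAX_ORIGINAL_BYTES:
            continue
        selected.append((thumbnail, original))
    return selected


originals = {"cover.jpg": 50_000, "poster.jpg": 500_000, "report.pdf": 1_000_000}
thumbnails = ["cover.jpg.jpg", "poster.jpg.jpg", "report.pdf.jpg"]
print(replaceable_thumbnails(originals, thumbnails))
```

<p>Only the small JPEG original qualifies here: the poster is over the size limit and the PDF thumbnail does not end in “.jpg.jpg”.</p>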
|
|
<!-- raw HTML omitted -->
|
|
|
|
|
|
|
|
|
|
|
|
</article>
|
|
|
|
|
|
|
|
</div> <!-- /.blog-main -->
|
|
|
|
<aside class="col-sm-3 ml-auto blog-sidebar">
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Recent Posts</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
|
|
<li><a href="/cgspace-notes/2020-08/">August, 2020</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2020-07/">July, 2020</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2020-06/">June, 2020</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2020-05/">May, 2020</a></li>
|
|
|
|
<li><a href="/cgspace-notes/2020-04/">April, 2020</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
|
|
|
|
|
|
<section class="sidebar-module">
|
|
<h4>Links</h4>
|
|
<ol class="list-unstyled">
|
|
|
|
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
|
|
|
|
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
|
|
|
|
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
|
|
|
|
</ol>
|
|
</section>
|
|
|
|
</aside>
|
|
|
|
|
|
</div> <!-- /.row -->
|
|
</div> <!-- /.container -->
|
|
|
|
|
|
|
|
<footer class="blog-footer">
|
|
<p dir="auto">
|
|
|
|
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
|
|
|
|
</p>
|
|
<p>
|
|
<a href="#">Back to top</a>
|
|
</p>
|
|
</footer>
|
|
|
|
|
|
</body>
|
|
|
|
</html>
|