<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="September, 2020" />
<meta property="og:description" content="2020-09-02
Replace Marissa van Epp with Rhys Bucknall in the CCAFS groups on CGSpace because Marissa no longer works at CCAFS
The AReS Explorer hasn&rsquo;t updated its index since 2020-08-22 when I last forced it
I restarted it again now and told Moayad that the automatic indexing isn&rsquo;t working
Add Alliance of Bioversity International and CIAT to affiliations on CGSpace
Abenet told me that the general search text on AReS doesn&rsquo;t get reset when you use the &ldquo;Reset Filters&rdquo; button
I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39
I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-09/" />
<meta property="article:published_time" content="2020-09-02T15:35:54+03:00" />
<meta property="article:modified_time" content="2020-10-01T10:47:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2020"/>
<meta name="twitter:description" content="2020-09-02
Replace Marissa van Epp with Rhys Bucknall in the CCAFS groups on CGSpace because Marissa no longer works at CCAFS
The AReS Explorer hasn&rsquo;t updated its index since 2020-08-22 when I last forced it
I restarted it again now and told Moayad that the automatic indexing isn&rsquo;t working
Add Alliance of Bioversity International and CIAT to affiliations on CGSpace
Abenet told me that the general search text on AReS doesn&rsquo;t get reset when you use the &ldquo;Reset Filters&rdquo; button
I filed a bug on OpenRXV: https://github.com/ilri/OpenRXV/issues/39
I filed an issue on OpenRXV to make some minor edits to the admin UI: https://github.com/ilri/OpenRXV/issues/40
"/>
<meta name="generator" content="Hugo 0.76.3" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "September, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-09/",
"wordCount": "2970",
"datePublished": "2020-09-02T15:35:54+03:00",
"dateModified": "2020-10-01T10:47:40+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2020-09/">
<title>September, 2020 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.6da5c906cc7a8fbb93f31cd2316c5dbe3f19ac4aa6bfb066f1243045b8f6061e.css" rel="stylesheet" integrity="sha256-baXJBsx6j7uT8xzSMWxdvj8ZrEqmv7Bm8SQwRbj2Bh4=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f3d2a1f5980bab30ddd0d8cadbd496475309fc48e2b1d052c5c09e6facffcb0f.js" integrity="sha256-89Kh9ZgLqzDd0NjK29SWR1MJ/EjisdBSxcCeb6z/yw8=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2020-09/">September, 2020</a></h2>
<p class="blog-post-meta"><time datetime="2020-09-02T15:35:54+03:00">Wed Sep 02, 2020</time> by Alan Orth in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/cgspace-notes/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2020-09-02">2020-09-02</h2>
<ul>
<li>Replace Marissa van Epp with Rhys Bucknall in the CCAFS groups on CGSpace because Marissa no longer works at CCAFS</li>
<li>The AReS Explorer hasn&rsquo;t updated its index since 2020-08-22 when I last forced it
<ul>
<li>I restarted it again now and told Moayad that the automatic indexing isn&rsquo;t working</li>
</ul>
</li>
<li>Add <code>Alliance of Bioversity International and CIAT</code> to affiliations on CGSpace</li>
<li>Abenet told me that the general search text on AReS doesn&rsquo;t get reset when you use the &ldquo;Reset Filters&rdquo; button
<ul>
<li>I filed a bug on OpenRXV: <a href="https://github.com/ilri/OpenRXV/issues/39">https://github.com/ilri/OpenRXV/issues/39</a></li>
</ul>
</li>
<li>I filed an issue on OpenRXV to make some minor edits to the admin UI: <a href="https://github.com/ilri/OpenRXV/issues/40">https://github.com/ilri/OpenRXV/issues/40</a></li>
</ul>
<ul>
<li>I ran the country code tagger on CGSpace:</li>
</ul>
<pre><code>$ time chrt -b 0 dspace curate -t countrycodetagger -i all -r - -l 500 -s object | tee /tmp/2020-09-02-countrycodetagger.log
...
real 2m10.516s
user 1m43.953s
sys 0m15.192s
$ grep -c added /tmp/2020-09-02-countrycodetagger.log
39
</code></pre><ul>
<li>I still need to create a cron job for this&hellip;</li>
<li>Sisay and Abenet said they can&rsquo;t log in with LDAP on DSpace Test (DSpace 6)
<ul>
<li>I tried and I can&rsquo;t either&hellip; but it is working on CGSpace</li>
<li>The error on DSpace 6 is:</li>
</ul>
</li>
</ul>
<pre><code>2020-09-02 12:03:10,666 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A629116488DCC467E1EA2062A2E2EFD7:ip_addr=92.220.02.201:failed_login:no DN found for user aorth
</code></pre><ul>
<li>I tried to query LDAP directly using the application credentials with ldapsearch and it works:</li>
</ul>
<pre><code>$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;applicationaccount@cgiarad.org&quot; -W &quot;(sAMAccountName=me)&quot;
</code></pre><ul>
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC6x/Authentication+Plugins#AuthenticationPlugins-LDAPAuthentication">DSpace 6 docs</a> we need to escape commas in our LDAP parameters due to the new configuration system
<ul>
<li>I escaped the commas and restarted DSpace (though technically we shouldn&rsquo;t need to restart because the new config system hot reloads configs)</li>
<li>Run all system updates on DSpace Test (linode26) and reboot it</li>
<li>After the restart LDAP login works&hellip;</li>
</ul>
</li>
</ul>
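<ul>
<li>A minimal sketch of the comma escaping described above, assuming a hypothetical helper named <code>escape_commas</code> (not part of DSpace) and using the search base from the <code>ldapsearch</code> test as sample input; which properties actually need it is covered in the DSpace 6 docs linked above:</li>
</ul>
<pre><code>def escape_commas(value):
    # Prefix every comma with a backslash so that the value is kept as a
    # single string instead of being split on commas
    return value.replace(",", "\\,")


print(escape_commas("dc=cgiarad,dc=org"))
</code></pre>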
<h2 id="2020-09-03">2020-09-03</h2>
<ul>
<li>Fix some erroneous &ldquo;review status&rdquo; fields that Abenet noticed on AReS
<ul>
<li>I used my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts with the following input files:</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-09-03-fix-review-status.csv
dc.description.version,correct
Externally Peer Reviewed,Peer Review
Peer Reviewed,Peer Review
Peer review,Peer Review
Peer reviewed,Peer Review
Peer-Reviewed,Peer Review
Peer-reviewed,Peer Review
peer Review,Peer Review
$ cat 2020-09-03-delete-review-status.csv
dc.description.version
Report
Formally Published
Poster
Unrefereed reprint
$ ./delete-metadata-values.py -i 2020-09-03-delete-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -m 68
$ ./fix-metadata-values.py -i 2020-09-03-fix-review-status.csv -db dspace -u dspace -p 'fuuu' -f dc.description.version -t 'correct' -m 68
</code></pre><ul>
<li>Start reviewing 95 items for IITA (20201stbatch)
<ul>
<li>I used my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> tool to check and fix some low-hanging fruit first</li>
<li>This fixed a few unnecessary Unicode characters, excessive whitespace, invalid multi-value separators, and duplicate metadata values</li>
<li>Then I looked at the data in OpenRefine and noticed some things:
<ul>
<li>All issue dates use year only, but some have months in the citation so they could be more specific</li>
<li>I normalized all the DOIs to use &ldquo;<a href="https://doi.org">https://doi.org</a>&rdquo; format</li>
<li>I fixed a few AGROVOC subjects with a simple GREL: <code>value.replace(&quot;GRAINS&quot;,&quot;GRAIN&quot;).replace(&quot;SOILS&quot;,&quot;SOIL&quot;).replace(&quot;CORN&quot;,&quot;MAIZE&quot;)</code></li>
<li>But there are a few more that are invalid that she will have to look at</li>
<li>I uploaded the items to <a href="https://dspacetest.cgiar.org/handle/10568/108357">DSpace Test</a> and it was apparently successful but I get these errors to the console:</li>
</ul>
</li>
</ul>
</li>
</ul>
<pre><code>Thu Sep 03 12:26:33 CEST 2020 | Query:containerItem:ea7a2648-180d-4fce-bdc5-c3aa2304fc58
Error while updating
java.lang.NullPointerException
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1131)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:212)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1104)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1093)
at org.dspace.statistics.StatisticsLoggingConsumer.consume(SourceFile:104)
at org.dspace.event.BasicDispatcher.consume(BasicDispatcher.java:177)
at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:123)
at org.dspace.core.Context.dispatchEvents(Context.java:455)
at org.dspace.core.Context.commit(Context.java:424)
at org.dspace.core.Context.complete(Context.java:380)
at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</code></pre><ul>
<li>There are more in the DSpace log so I will raise it with Atmire immediately</li>
</ul>
<h2 id="2020-09-04">2020-09-04</h2>
<ul>
<li>I was checking the recent IITA data for duplicates when I noticed one in CIFOR&rsquo;s Archive and saw that CIFOR has updated a bunch of their website URLs, for example:
<ul>
<li><a href="http://www.cifor.org/nc/online-library/browse/view-publication/publication/151.html">http://www.cifor.org/nc/online-library/browse/view-publication/publication/151.html</a> → <a href="https://www.cifor.org/knowledge/publication/151">https://www.cifor.org/knowledge/publication/151</a></li>
<li><a href="https://www.cifor.org/library/4033">https://www.cifor.org/library/4033</a> → <a href="https://www.cifor.org/knowledge/publication/4033">https://www.cifor.org/knowledge/publication/4033</a></li>
<li><a href="https://www.cifor.org/pid/5087">https://www.cifor.org/pid/5087</a> → <a href="https://www.cifor.org/knowledge/publication/5087">https://www.cifor.org/knowledge/publication/5087</a></li>
</ul>
</li>
<li>I will update our nearly 6,000 metadata values for CIFOR in the database accordingly:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/library/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/library/[[:digit:]]+/?';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/pid/[[:digit:]]+';
</code></pre><ul>
<li>I did some cleanup on the author affiliations of the IITA data against our 2019-04 list using reconcile-csv and OpenRefine:
<ul>
<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
</ul>
</li>
<li>I mapped one duplicate from the CIFOR Archives and re-uploaded the 94 IITA items to a new collection on <a href="https://dspacetest.cgiar.org/handle/10568/108453">DSpace Test</a></li>
</ul>
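<ul>
<li>A rough way to preview the CIFOR URL rewrites before touching the database, assuming the three regular expressions from the <code>UPDATE</code> statements above translate directly to Python once <code>[[:digit:]]</code> is written as <code>[0-9]</code>; the function name <code>rewrite_cifor_url</code> is hypothetical:</li>
</ul>
<pre><code>import re

NEW_PREFIX = "https://www.cifor.org/knowledge/publication/"

# (pattern, number of the capture group holding the publication ID)
REWRITES = [
    (r"^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([0-9]+)\.html$", 3),
    (r"^https?://www\.cifor\.org/library/([0-9]+)/?$", 1),
    (r"^https?://www\.cifor\.org/pid/([0-9]+)/?$", 1),
]


def rewrite_cifor_url(url):
    for pattern, group in REWRITES:
        match = re.match(pattern, url)
        if match:
            return NEW_PREFIX + match.group(group)
    # Anything that matches none of the old URL styles is left untouched
    return url


for old in [
    "http://www.cifor.org/nc/online-library/browse/view-publication/publication/151.html",
    "https://www.cifor.org/library/4033",
    "https://www.cifor.org/pid/5087",
]:
    print(old, rewrite_cifor_url(old))
</code></pre>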
<h2 id="2020-09-08">2020-09-08</h2>
<ul>
<li>I noticed that the &ldquo;share&rdquo; link in AReS wasn&rsquo;t working properly because it excludes the &ldquo;explorer&rdquo; part of the URI</li>
</ul>
<p><img src="/cgspace-notes/2020/09/ares-share-link.png" alt="AReS share link broken"></p>
<ul>
<li>I filed an issue on GitHub: <a href="https://github.com/ilri/OpenRXV/issues/41">https://github.com/ilri/OpenRXV/issues/41</a></li>
<li>I uploaded the 94 IITA items that I had been working on last week to CGSpace</li>
<li>RTB emailed to ask why they are getting HTTP 503 errors during harvesting to the RTB WordPress website
<ul>
<li>From the screenshot I can see they are requesting URLs like this:</li>
</ul>
</li>
</ul>
<pre><code>https://cgspace.cgiar.org/bitstream/handle/10568/82745/Characteristics-Silage.JPG
</code></pre><ul>
<li>So they end up getting rate limited due to the XMLUI rate limits
<ul>
<li>I told them to use the REST API bitstream retrieve links, because we don&rsquo;t have any rate limits there</li>
</ul>
</li>
</ul>
<h2 id="2020-09-09">2020-09-09</h2>
<ul>
<li>Wire up the systemd service/timer for the CGSpace Country Code Tagger curation task in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a>
<ul>
<li><del>For now it won&rsquo;t work on DSpace 6 because the curation task invocation needs to be slightly different (minus the <code>-l</code> parameter) and for some reason the task isn&rsquo;t working on DSpace Test (version 6) right now</del></li>
<li>I added DSpace 6 support to the playbook templates&hellip;</li>
</ul>
</li>
<li>Run system updates on DSpace Test (linode26), re-deploy the DSpace 6 test branch, and reboot the server
<ul>
<li>After rebooting I deleted old copies of the cgspace-java-helpers JAR in the DSpace lib directory and then the curation worked</li>
<li>To my great surprise the curation worked (and completed, albeit a few times slower) on my local DSpace 6 environment as well:</li>
</ul>
</li>
</ul>
<pre><code>$ ~/dspace63/bin/dspace curate -t countrycodetagger -i all -s object
</code></pre><h2 id="2020-09-10">2020-09-10</h2>
<ul>
<li>I checked the country code tagger on CGSpace and DSpace Test and it ran fine from the systemd timer last night&hellip; w00t</li>
<li>I started looking at Peter&rsquo;s changes to the CGSpace regions that were proposed in 2020-07
<ul>
<li>The changes will be:</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-09-10-fix-cgspace-regions.csv
cg.coverage.region,correct
EAST AFRICA,EASTERN AFRICA
WEST AFRICA,WESTERN AFRICA
SOUTHEAST ASIA,SOUTHEASTERN ASIA
SOUTH ASIA,SOUTHERN ASIA
AFRICA SOUTH OF SAHARA,SUB-SAHARAN AFRICA
NORTH AFRICA,NORTHERN AFRICA
WEST ASIA,WESTERN ASIA
SOUTHWEST ASIA,SOUTHWESTERN ASIA
$ ./fix-metadata-values.py -i 2020-09-10-fix-cgspace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227 -d -n
Connected to database.
Would fix 12227 occurences of: EAST AFRICA
Would fix 7996 occurences of: WEST AFRICA
Would fix 3515 occurences of: SOUTHEAST ASIA
Would fix 3443 occurences of: SOUTH ASIA
Would fix 1134 occurences of: AFRICA SOUTH OF SAHARA
Would fix 357 occurences of: NORTH AFRICA
Would fix 81 occurences of: WEST ASIA
Would fix 3 occurences of: SOUTHWEST ASIA
</code></pre><ul>
<li>I think we need to wait for the web team, though, as they need to update their mappings
<ul>
<li>Not to mention that we&rsquo;ll need to give WLE and CCAFS time to update their harvesters as well&hellip; hmmm</li>
</ul>
</li>
<li>Looking at the top user agents active on CGSpace in 2020-08, I see:
<ul>
<li><code>Delphi 2009</code>: 235353 (this is the GARDIAN harvester I guess, as the IP is in Greece)</li>
<li><code>Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)</code>: 57004 (IP is 18.196.100.94, and the requests seem to be for CTA&rsquo;s content)</li>
<li><code>RTB website BOT</code>: 12282</li>
<li><code>ILRI Livestock Website Publications importer BOT</code>: 9393</li>
</ul>
</li>
<li>Shit, I meant to add Delphi to the DSpace spider agents list last month but I guess I didn&rsquo;t commit the change</li>
<li>HTTrack is in the agents list so I&rsquo;m not sure why DSpace registers a hit from that request</li>
<li>Also, I am surprised to see the RTB and ILRI bots here because they have &ldquo;BOT&rdquo; in the name and that should also be dropped</li>
<li>I also see hits from <code>curl</code> and <code>Java/1.8.0_66</code> and <code>Apache-HttpClient</code> so WTF&hellip; those are supposed to be dropped by the default agents list</li>
<li>Some IP <code>2607:f298:5:101d:f816:3eff:fed9:a484</code> made 9,000 requests with the <code>RI/1.0</code> user agent this year&hellip;
<ul>
<li>That&rsquo;s on DreamHost&hellip;?</li>
</ul>
</li>
<li>I purged 448658 hits from these agents and added <code>Delphi</code> to our local agents overload for Solr as well as Tomcat&rsquo;s Crawler Session Manager Valve so that it forces them to re-use a single session</li>
<li>I made a pull request on the COUNTER-Robots project for the Daum robot: <a href="https://github.com/atmire/COUNTER-Robots/pull/38">https://github.com/atmire/COUNTER-Robots/pull/38</a>
<ul>
<li>This bot made 8,000 requests to CGSpace this year</li>
<li>I purged about 20,000 total requests from this bot from our Solr stats for the last few years</li>
</ul>
</li>
</ul>
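<ul>
<li>A quick sketch for checking which user agents a pattern list would catch, assuming case-insensitive regular expression matching anywhere in the string (DSpace may evaluate its lists differently) and a small hypothetical sample of patterns rather than the real spider agents files:</li>
</ul>
<pre><code>import re

# Hypothetical sample; the real patterns live in the DSpace spider agents
# lists and the COUNTER-Robots project
AGENT_PATTERNS = ["bot", "curl", "Delphi", "HTTrack", "^Java/", "Apache-HttpClient"]


def is_spider(user_agent, patterns=AGENT_PATTERNS):
    # True if any pattern matches somewhere in the user agent string
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


for agent in ["Delphi 2009", "RTB website BOT", "Mozilla/5.0 (X11; Linux x86_64)"]:
    print(agent, is_spider(agent))
</code></pre>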
<h2 id="2020-09-11">2020-09-11</h2>
<ul>
<li>Peter noticed that an export from AReS shows some items with zero views and others with zero views/downloads, but on CGSpace and in the statistics API there are views/downloads
<ul>
<li>I need to ask Moayad&hellip;</li>
</ul>
</li>
</ul>
<h2 id="2020-09-12">2020-09-12</h2>
<ul>
<li>Carlos Tejo from the LandPortal emailed to ask for advice about integrating their <a href="https://landvoc.org/">LandVoc</a> vocabulary, which is a subset of AGROVOC, into DSpace
<ul>
<li>I told him that they could use the DSpace authority control framework and sent an example of the VIAFAuthority from the DSpace-CRIS project: <a href="https://github.com/4Science/DSpace/blob/dspace-6_x_x-cris/dspace-api/src/main/java/org/dspace/content/authority/VIAFAuthority.java">https://github.com/4Science/DSpace/blob/dspace-6_x_x-cris/dspace-api/src/main/java/org/dspace/content/authority/VIAFAuthority.java</a></li>
</ul>
</li>
<li>Redeploy the latest <code>5_x-prod</code> branch on CGSpace, re-run the latest Ansible DSpace playbook, run all system updates, and reboot the server (linode18)
<ul>
<li>This will bring the latest bot lists for Solr and Tomcat</li>
<li>I had to restart Tomcat 7 three times before all Solr statistics cores came up OK</li>
</ul>
</li>
<li>Leroy and Carol from CIAT/Bioversity were asking for information about posting to the CGSpace REST API from Sharepoint
<ul>
<li>I told them that we don&rsquo;t allow this yet, but that we need to check in the future whether content can be posted to a workflow</li>
</ul>
</li>
</ul>
<h2 id="2020-09-15">2020-09-15</h2>
<ul>
<li>Charlotte from Altmetric said they had issues parsing the XML file I sent them last month
<ul>
<li>I told them that it was mimicking the same format that they had sent me (fourteen pages of XML responses concatenated together)!</li>
</ul>
</li>
<li>A few days ago IWMI asked us if we can add a new field on CGSpace for their library identifier
<ul>
<li>The IDs look like this: H049940</li>
<li>I suggested that we use <code>cg.identifier.iwmilibrary</code></li>
<li>I added it to the input forms and pushed it to the <code>5_x-prod</code> and 6.x branches and will re-deploy it in the next few days</li>
</ul>
</li>
<li>Abenet asked me to import sixty-nine (69) CIP Annual Reports to CGSpace
<ul>
<li>I looked at the data in OpenRefine and it is very good quality</li>
<li>I only added descriptions to the filename field so that SAFBuilder will add them to the bitstreams on import:</li>
</ul>
</li>
</ul>
<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
</code></pre><ul>
<li>Then I created a SAF bundle with SAFBuilder:</li>
</ul>
<pre><code>$ ./safbuilder.sh -c ~/Downloads/cip-annual-reports/cip-reports.csv
</code></pre><ul>
<li>And imported them into my local test instance of CGSpace:</li>
</ul>
<pre><code>$ ~/dspace/bin/dspace import -a -e y.arrr@cgiar.org -m /tmp/2020-09-15-cip-annual-reports.map -s ~/Downloads/cip-annual-reports/SimpleArchiveFormat
</code></pre><ul>
<li>Then I uploaded them to CGSpace</li>
</ul>
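<ul>
<li>For reference, the filename step above can also be done outside OpenRefine; this is only a sketch, assuming a CSV with <code>filename</code> and <code>dc.type</code> columns and made-up sample rows:</li>
</ul>
<pre><code>import csv
import io

# Hypothetical sample data standing in for the real CSV
SAMPLE = "filename,dc.type\nreport-2018.pdf,Annual Report\nreport-2019.pdf,Annual Report\n"


def describe_filenames(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        # Same result as the GREL expression: filename, separator, type
        row["filename"] = row["filename"] + "__description:" + row["dc.type"]
    return rows


for row in describe_filenames(SAMPLE):
    print(row["filename"])
</code></pre>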
<h2 id="2020-09-16">2020-09-16</h2>
<ul>
<li>Looking further into Carlos Tejo&rsquo;s question about integrating LandVoc (the AGROVOC subset) into DSpace
<ul>
<li>I see that you can actually get LandVoc concepts directly from AGROVOC&rsquo;s SPARQL, for example with <a href="http://agrovoc.uniroma2.it/sparql#query=PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0A%0ASELECT+%3Fconcept%0AWHERE+%7B%0A++%3Fconcept+a+skos%3AConcept+%3B%0A+++++++++++skos%3AinScheme+%3Chttp%3A%2F%2Flandvoc.org%2Flandvoc%3E+.%0A%0A%7D+ORDER+BY+%3Fconcept&amp;contentTypeConstruct=text%2Fturtle&amp;contentTypeSelect=application%2Fsparql-results%2Bjson&amp;endpoint=http%3A%2F%2Fagrovoc.uniroma2.it%2Fsparql&amp;requestMethod=POST&amp;tabTitle=Query&amp;headers=%7B%7D&amp;outputFormat=table">this query</a></li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2020/09/agrovoc-landvoc-sparql.png" alt="AGROVOC LandVoc SPARQL"></p>
<ul>
<li>So maybe we can query AGROVOC directly using a similar method to <a href="https://github.com/4Science/DSpace/blob/dspace-5_x_x-cris/dspace-api/src/main/java/org/dspace/content/authority/TGNAuthority.java">DSpace-CRIS&rsquo;s GettyAuthority</a></li>
<li>I wired up DSpace-CRIS&rsquo;s VIAFAuthority to see how authorities for auto suggested names get stored
<ul>
<li>After submission you can see the item&rsquo;s VIAF identifier:</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2020/09/viaf-authority.png" alt="VIAF authority"></p>
<ul>
<li>And this identifier is the ID on VIAF, pretty cool!</li>
</ul>
<p><img src="/cgspace-notes/2020/09/viaf-darwin.png" alt="VIAF entry for Charles Darwin"></p>
<ul>
<li>I did a similar test with the Getty Thesaurus of Geographic Names (TGN) and it stores the concept URI in the authority:</li>
</ul>
<p><img src="/cgspace-notes/2020/09/tgn-concept-uri.png" alt="TGNAuthority"></p>
<ul>
<li>But the authority values are not exposed anywhere as metadata&hellip;
<ul>
<li>I need to play with it a bit more I guess&hellip;</li>
</ul>
</li>
<li>The nice thing is that the Getty example from DSpace-CRIS uses SPARQL as well, and the TGN authority extends it
<ul>
<li>We could use a similar model for AGROVOC/LandVoc very easily</li>
</ul>
</li>
</ul>
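<ul>
<li>To read the LandVoc SPARQL query without opening the web form, the <code>query</code> value from the link above can be decoded locally; a small sketch using only the Python standard library, with the encoded string copied from that link:</li>
</ul>
<pre><code>from urllib.parse import unquote_plus

# The "query" value copied from the fragment of the AGROVOC link above
ENCODED_QUERY = (
    "PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0A"
    "PREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0A%0A"
    "SELECT+%3Fconcept%0AWHERE+%7B%0A++%3Fconcept+a+skos%3AConcept+%3B%0A"
    "+++++++++++skos%3AinScheme+%3Chttp%3A%2F%2Flandvoc.org%2Flandvoc%3E+.%0A%0A"
    "%7D+ORDER+BY+%3Fconcept"
)

# unquote_plus turns "+" into spaces and %XX escapes into characters
sparql = unquote_plus(ENCODED_QUERY)
print(sparql)
</code></pre>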
<h2 id="2020-09-17">2020-09-17</h2>
<ul>
<li>Maria from Bioversity asked about the ORCID identifier for one of her colleagues that seems to have been removed from our list
<ul>
<li>I re-added it to our controlled vocabulary and added the identifier to fifty-one of his existing items on CGSpace using my script:</li>
</ul>
</li>
</ul>
<pre><code>$ cat 2020-09-17-add-bioversity-orcids.csv
dc.contributor.author,cg.creator.id
&quot;Etten, Jacob van&quot;,&quot;Jacob van Etten: 0000-0001-7554-2558&quot;
&quot;van Etten, Jacob&quot;,&quot;Jacob van Etten: 0000-0001-7554-2558&quot;
$ ./add-orcid-identifiers-csv.py -i 2020-09-17-add-bioversity-orcids.csv -db dspace -u dspace -p 'dom@in34sniper'
</code></pre><ul>
<li>I sent a follow-up message to Atmire to look into the two remaining issues with the DSpace 6 upgrade
<ul>
<li>First is the fact that we have zero results in our Listings and Reports, for any search</li>
<li>Second is the error we get during CSV imports</li>
</ul>
</li>
<li>Help Natalia and Cathy from Bioversity-CIAT with their OpenSearch query on &ldquo;trade offs&rdquo; again
<ul>
<li>They wanted to build a search query with multiple filters (type, crpsubject, status) and the general query &ldquo;trade offs&rdquo;</li>
<li>I found a great <a href="https://www.kiwi.fi/pages/viewpage.action?pageId=45782169">reference for DSpace&rsquo;s OpenSearch syntax</a> (albeit in Finnish, but the example URLs show the syntax clearly)</li>
<li>We can use quotes and <code>AND</code> and <code>OR</code> and even group search parameters with parentheses!</li>
<li>So now I built a query for Natalia which uses these (showing without URL encoding so you can see the syntax):</li>
</ul>
</li>
</ul>
<pre><code>https://cgspace.cgiar.org/open-search/discover?query=type:&quot;Journal Article&quot; AND status:&quot;Open Access&quot; AND crpsubject:&quot;Water, Land and Ecosystems&quot; AND &quot;tradeoffs&quot;&amp;rpp=100
</code></pre><ul>
<li>I noticed that my <code>move-collections.sh</code> script didn&rsquo;t work on DSpace 6 because of the change from IDs to UUIDs, so I modified it to quote the collection <code>resource_id</code> parameters in the PostgreSQL query</li>
</ul>
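<ul>
<li>The OpenSearch query for Natalia is shown above without URL encoding; a minimal sketch of producing the encoded form with the Python standard library, where the helper name <code>build_opensearch_url</code> is hypothetical:</li>
</ul>
<pre><code>from urllib.parse import quote, urlencode

BASE_URL = "https://cgspace.cgiar.org/open-search/discover"


def build_opensearch_url(query, rpp=100):
    # quote_via=quote encodes spaces as %20 instead of "+"
    params = urlencode({"query": query, "rpp": rpp}, quote_via=quote)
    return BASE_URL + "?" + params


query = 'type:"Journal Article" AND status:"Open Access" AND crpsubject:"Water, Land and Ecosystems" AND "tradeoffs"'
print(build_opensearch_url(query))
</code></pre>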
<h2 id="2020-09-18">2020-09-18</h2>
<ul>
<li>Help Natalia with her WLE &ldquo;tradeoffs&rdquo; search query again&hellip;</li>
</ul>
<h2 id="2020-09-20">2020-09-20</h2>
<ul>
<li>Deploy latest 5_x-prod branch on CGSpace, run all system updates, and reboot the server
<ul>
<li>To my great surprise, all the Solr statistics cores came up correctly after reboot</li>
</ul>
</li>
<li>Deploy latest 6_x-dev branch on DSpace Test, run all system updates and reboot the server</li>
</ul>
<h2 id="2020-09-22">2020-09-22</h2>
<ul>
<li>Abenet sent some feedback about AReS
<ul>
<li>The item views and downloads are still incorrect</li>
<li>I looked in the server&rsquo;s API logs and there are no errors, and the database has many more views/downloads:</li>
</ul>
</li>
</ul>
<pre><code>dspacestatistics=# SELECT SUM(views) FROM items;
sum
----------
15714024
(1 row)
dspacestatistics=# SELECT SUM(downloads) FROM items;
sum
----------
13979911
(1 row)
</code></pre><ul>
<li>I deleted &ldquo;Report&rdquo; from twelve items that had it in their peer review field:</li>
</ul>
<pre><code>dspace=# BEGIN;
BEGIN
dspace=# DELETE FROM metadatavalue WHERE text_value='Report' AND resource_type_id=2 AND metadata_field_id=68;
DELETE 12
dspace=# COMMIT;
</code></pre><ul>
<li>I added all CG center- and CRP-specific subject fields and mapped them to <code>dc.subject</code> in AReS</li>
<li>After forcing a re-harvesting now the review status is much cleaner and the missing subjects are available</li>
<li>Last week Natalia from CIAT had asked me to download all the PDFs for a certain query:
<ul>
<li>items with status &ldquo;Open Access&rdquo;</li>
<li>items with type &ldquo;Journal Article&rdquo;</li>
<li>items containing any of the following words: water land and ecosystems &amp; trade offs</li>
<li>The resulting OpenSearch query is: <code>https://cgspace.cgiar.org/open-search/discover?query=type:&quot;Journal Article&quot; AND status:&quot;Open Access&quot; AND Water Land Ecosystems trade offs&amp;rpp=1</code></li>
<li>There were 241 results with a total of 208 PDFs, which I downloaded with my <code>get-wle-pdfs.py</code> script and shared to her via bashupload.com</li>
</ul>
</li>
</ul>
<h2 id="2020-09-23">2020-09-23</h2>
<ul>
<li>Peter said he was having problems submitting items to CGSpace
<ul>
<li>On a hunch I looked at the PostgreSQL locks in Munin and indeed the normal issue with locks is back (though I haven&rsquo;t seen it in a few months?)</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2020/09/postgres_connections_ALL-day.png" alt="PostgreSQL connections day"></p>
<ul>
<li>Instead of restarting Tomcat I restarted the PostgreSQL service and then Peter said he was able to submit the item&hellip;</li>
<li>Experiment with doing direct queries for items in the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a>
<ul>
<li>I tested querying a handful of item UUIDs with a date range and returning their hits faceted by <code>id</code></li>
<li>Assuming a list of item UUIDs was posted to the REST API we could prepare them for a Solr query by joining them into a string with &ldquo;OR&rdquo; and escaping the hyphens:</li>
</ul>
</li>
</ul>
<pre><code>...
item_ids = ['0079470a-87a1-4373-beb1-b16e3f0c4d81', '007a9df1-0871-4612-8b28-5335982198cb']
item_ids_str = ' OR '.join(item_ids).replace('-', r'\-')
...
solr_query_params = {
&quot;q&quot;: f&quot;id:({item_ids_str})&quot;,
&quot;fq&quot;: &quot;type:2 AND isBot:false AND statistics_type:view AND time:[2020-01-01T00:00:00Z TO 2020-09-02T00:00:00Z]&quot;,
&quot;facet&quot;: &quot;true&quot;,
&quot;facet.field&quot;: &quot;id&quot;,
&quot;facet.mincount&quot;: 1,
&quot;facet.limit&quot;: 1,
&quot;facet.offset&quot;: 0,
&quot;stats&quot;: &quot;true&quot;,
&quot;stats.field&quot;: &quot;id&quot;,
&quot;stats.calcdistinct&quot;: &quot;true&quot;,
&quot;shards&quot;: shards,
&quot;rows&quot;: 0,
&quot;wt&quot;: &quot;json&quot;,
}
</code></pre><ul>
<li>The date range format for Solr is important, but it seems we only need to add <code>T00:00:00Z</code> to the normal ISO 8601 YYYY-MM-DD strings</li>
</ul>
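<ul>
<li>A small sketch of the date handling described above, assuming plain <code>YYYY-MM-DD</code> inputs and the <code>time</code> field from the filter query; the helper name <code>solr_date_range</code> is hypothetical:</li>
</ul>
<pre><code>from datetime import date


def solr_date_range(date_from, date_to):
    # fromisoformat raises ValueError on anything that is not YYYY-MM-DD,
    # then we append the midnight UTC suffix that Solr expects
    start = date.fromisoformat(date_from).isoformat() + "T00:00:00Z"
    end = date.fromisoformat(date_to).isoformat() + "T00:00:00Z"
    return "time:[" + start + " TO " + end + "]"


print(solr_date_range("2020-01-01", "2020-09-02"))
</code></pre>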
<h2 id="2020-09-25">2020-09-25</h2>
<ul>
<li>I did some more work on the dspace-statistics-api and finalized the support for sending a POST to <code>/items</code>:</li>
</ul>
<pre><code>$ curl -s -d @request.json https://dspacetest.cgiar.org/rest/statistics/items | json_pp
{
&quot;currentPage&quot; : 0,
&quot;limit&quot; : 10,
&quot;statistics&quot; : [
{
&quot;downloads&quot; : 3329,
&quot;id&quot; : &quot;b2c1bbfd-65b0-438c-9e49-d271c49b2696&quot;,
&quot;views&quot; : 1565
},
{
&quot;downloads&quot; : 3797,
&quot;id&quot; : &quot;f44cf173-2344-4eb2-8f00-ee55df32c76f&quot;,
&quot;views&quot; : 48
},
{
&quot;downloads&quot; : 11064,
&quot;id&quot; : &quot;8542f9da-9ce1-4614-abf4-f2e3fdb4b305&quot;,
&quot;views&quot; : 26
},
{
&quot;downloads&quot; : 6782,
&quot;id&quot; : &quot;2324aa41-e9de-4a2b-bc36-16241464683e&quot;,
&quot;views&quot; : 19
},
{
&quot;downloads&quot; : 48,
&quot;id&quot; : &quot;0fe573e7-042a-4240-a4d9-753b61233908&quot;,
&quot;views&quot; : 12
},
{
&quot;downloads&quot; : 0,
&quot;id&quot; : &quot;000e61ca-695d-43e5-9ab8-1f3fd7a67a32&quot;,
&quot;views&quot; : 4
},
{
&quot;downloads&quot; : 0,
&quot;id&quot; : &quot;000dc7cd-9485-424b-8ecf-78002613cc87&quot;,
&quot;views&quot; : 1
},
{
&quot;downloads&quot; : 0,
&quot;id&quot; : &quot;000e1616-3901-4431-80b1-c6bc67312d8c&quot;,
&quot;views&quot; : 1
},
{
&quot;downloads&quot; : 0,
&quot;id&quot; : &quot;000ea897-5557-49c7-9f54-9fa192c0f83b&quot;,
&quot;views&quot; : 1
},
{
&quot;downloads&quot; : 0,
&quot;id&quot; : &quot;000ec427-97e5-4766-85a5-e8dd62199ab5&quot;,
&quot;views&quot; : 1
}
],
&quot;totalPages&quot; : 13
}
</code></pre><ul>
<li>I deployed it on DSpace Test and sent a note to Salem so he can test it</li>
<li>I still need to add tests&hellip;</li>
<li>After that I will probably tag it as version 1.3.0</li>
</ul>
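<ul>
<li>A sketch of how a client might consume the response above, using only the first three entries of that output as sample data and assuming <code>currentPage</code> is zero-based (it is 0 in the response shown); the function name <code>summarize</code> is hypothetical:</li>
</ul>
<pre><code>import json

# First three entries of the response shown above
RESPONSE = """
{"currentPage": 0, "limit": 10, "totalPages": 13, "statistics": [
  {"downloads": 3329, "id": "b2c1bbfd-65b0-438c-9e49-d271c49b2696", "views": 1565},
  {"downloads": 3797, "id": "f44cf173-2344-4eb2-8f00-ee55df32c76f", "views": 48},
  {"downloads": 11064, "id": "8542f9da-9ce1-4614-abf4-f2e3fdb4b305", "views": 26}
]}
"""


def summarize(response_text):
    data = json.loads(response_text)
    views = sum(item["views"] for item in data["statistics"])
    downloads = sum(item["downloads"] for item in data["statistics"])
    # With zero-based pages the last page is totalPages - 1
    has_more = data["currentPage"] + 1 != data["totalPages"]
    return views, downloads, has_more


print(summarize(RESPONSE))
</code></pre>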
<h2 id="2020-09-25-1">2020-09-25</h2>
<ul>
<li>Atmire responded with some notes about the issues we&rsquo;re having with CUA and L&amp;R on DSpace Test
<ul>
<li>They think they have found the reason the issues are happening&hellip;</li>
</ul>
</li>
</ul>
<h2 id="2020-09-29">2020-09-29</h2>
<ul>
<li>Atmire sent a pull request yesterday with a potential fix for the Listings and Reports (L&amp;R) issue
<ul>
<li>I tried to build it on DSpace Test but I got an HTTP 401 Unauthorized for the artifact</li>
<li>I sent them a message&hellip;</li>
</ul>
</li>
</ul>
<h2 id="2020-09-30">2020-09-30</h2>
<ul>
<li>Experiment with re-creating IWMI&rsquo;s &ldquo;Monthly Abstract&rdquo; type report with an AReS template
<ul>
<li>The template library for reports is: <a href="https://docxtemplater.com">https://docxtemplater.com</a></li>
<li>Conditions start with a pound and end with a slash: {#items} {/items}</li>
<li>An inverted section begins with a caret (hat) and ends with a slash: {^citation} No citation{/citation}</li>
<li>I found a bug: templates with a space in the file name don&rsquo;t download</li>
<li>It would be nice if we could use <a href="https://docxtemplater.readthedocs.io/en/latest/angular_parse.html">angular expressions</a> to make more complex templates
<ul>
<li>Ability to iterate over authors (to change the separator)</li>
<li>Ability to get item number in a loop (for a list)</li>
<li>To do things like checking if a CRP is &ldquo;WLE&rdquo;</li>
</ul>
</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2020-10/">October, 2020</a></li>
<li><a href="/cgspace-notes/2020-09/">September, 2020</a></li>
<li><a href="/cgspace-notes/2020-08/">August, 2020</a></li>
<li><a href="/cgspace-notes/2020-07/">July, 2020</a></li>
<li><a href="/cgspace-notes/2020-06/">June, 2020</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>