<li>Replace Marissa van Epp with Rhys Bucknall in the CCAFS groups on CGSpace because Marissa no longer works at CCAFS</li>
<li>The AReS Explorer hasn’t updated its index since 2020-08-22 when I last forced it
<ul>
<li>I restarted it again now and told Moayad that the automatic indexing isn’t working</li>
</ul>
</li>
<li>Add <code>Alliance of Bioversity International and CIAT</code> to affiliations on CGSpace</li>
<li>Abenet told me that the general search text on AReS doesn’t get reset when you use the “Reset Filters” button
<ul>
<li>I filed a bug on OpenRXV: <a href="https://github.com/ilri/OpenRXV/issues/39">https://github.com/ilri/OpenRXV/issues/39</a></li>
</ul>
</li>
<li>I filed an issue on OpenRXV to make some minor edits to the admin UI: <a href="https://github.com/ilri/OpenRXV/issues/40">https://github.com/ilri/OpenRXV/issues/40</a></li>
</ul>
<ul>
<li>I ran the country code tagger on CGSpace:</li>
<pre tabindex="0"><code>2020-09-02 12:03:10,666 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A629116488DCC467E1EA2062A2E2EFD7:ip_addr=92.220.02.201:failed_login:no DN found for user aorth
</code></pre>
<li>According to the <a href="https://wiki.lyrasis.org/display/DSDOC6x/Authentication+Plugins#AuthenticationPlugins-LDAPAuthentication">DSpace 6 docs</a> we need to escape commas in our LDAP parameters due to the new configuration system
<ul>
<li>I escaped the commas and restarted DSpace (though technically we shouldn’t need to restart, because the new config system hot-reloads configs)</li>
<li>Run all system updates on DSpace Test (linode26) and reboot it</li>
<li>After the restart LDAP login works…</li>
</ul>
</li>
</ul>
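<p>As a minimal sketch of the comma escaping described above, the Python helper below backslash-escapes any comma that is not already escaped. The function name <code>escape_commas</code> and the example DN are hypothetical and are not taken from our configuration; the working assumption is that the DSpace 6 configuration system treats an unescaped comma as a list separator.</p>

```python
import re


def escape_commas(value: str) -> str:
    """Backslash-escape every comma that is not already preceded by a backslash."""
    # (?<!\\), matches a comma only when the previous character is not a backslash,
    # so running the function twice does not double-escape.
    return re.sub(r"(?<!\\),", r"\\,", value)


if __name__ == "__main__":
    # Hypothetical DN, for illustration only.
    print(escape_commas("ou=people,dc=example,dc=org"))
```

<p>The negative lookbehind keeps the helper idempotent, so it is safe to run over a value that has already been partly escaped by hand.</p>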
<h2 id="2020-09-03">2020-09-03</h2>
<ul>
<li>Fix some erroneous “review status” fields that Abenet noticed on AReS
<ul>
<li>I used my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts with the following input files:</li>
<li>Start reviewing 95 items for IITA (20201stbatch)
<ul>
<li>I used my <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> tool to check and fix some low-hanging fruit first</li>
<li>This fixed a few instances of unnecessary Unicode characters, excessive whitespace, invalid multi-value separators, and duplicate metadata values</li>
<li>Then I looked at the data in OpenRefine and noticed some things:
<ul>
<li>All issue dates use year only, but some have months in the citation so they could be more specific</li>
<li>I normalized all the DOIs to use “<a href="https://doi.org">https://doi.org</a>” format</li>
<li>I fixed a few AGROVOC subjects with a simple GREL: <code>value.replace("GRAINS","GRAIN").replace("SOILS","SOIL").replace("CORN","MAIZE")</code></li>
<li>But there are a few more that are invalid that she will have to look at</li>
<li>I uploaded the items to <a href="https://dspacetest.cgiar.org/handle/10568/108357">DSpace Test</a> and it was apparently successful, but I get these errors on the console:</li>
<li>I was checking the recent IITA data for duplicates when I noticed that one item was already in CIFOR’s Archive, and saw that CIFOR has updated a bunch of their website URLs, for example:
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/library/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/library/[[:digit:]]+/?';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/pid/[[:digit:]]+';
</code></pre><ul>
<li>I did some cleanup on the author affiliations of the IITA data against our 2019-04 list using reconcile-csv and OpenRefine:
<ul>
<li><code>$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id</code></li>
<li>I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: <code>if(cell.recon.matched, cell.recon.match.name, value)</code></li>
</ul>
</li>
<li>I mapped one duplicate from the CIFOR Archives and re-uploaded the 94 IITA items to a new collection on <a href="https://dspacetest.cgiar.org/handle/10568/108453">DSpace Test</a></li>
<li>I noticed that the “share” link in AReS wasn’t working properly because it excludes the “explorer” part of the URI</li>
</ul>
<p><img src="/cgspace-notes/2020/09/ares-share-link.png" alt="AReS share link broken"></p>
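<p>Going back to the CIFOR URL rewrites in the SQL earlier in this entry: a rough Python sketch of the same three patterns can be handy for previewing the result on a few values before touching the database. The function name <code>normalize_cifor_url</code> and the publication numbers in the example are made up for illustration, and Python's <code>re</code> syntax stands in for PostgreSQL's POSIX classes here.</p>

```python
import re

# Python equivalents of the three PostgreSQL patterns, with the numeric ID as group 1.
CIFOR_PATTERNS = [
    re.compile(
        r"^(?:http://)?www\.cifor\.org/(?:nc/)?online-library/browse/"
        r"view-publication/publication/([0-9]+)\.html$"
    ),
    re.compile(r"^https?://www\.cifor\.org/library/([0-9]+)/?$"),
    re.compile(r"^https?://www\.cifor\.org/pid/([0-9]+)/?$"),
]

NEW_URL = r"https://www.cifor.org/knowledge/publication/\1"


def normalize_cifor_url(url: str) -> str:
    """Return the new-style CIFOR URL, or the input unchanged if no pattern matches."""
    for pattern in CIFOR_PATTERNS:
        rewritten, count = pattern.subn(NEW_URL, url)
        if count:
            return rewritten
    return url


if __name__ == "__main__":
    # Hypothetical publication number, for illustration only.
    print(normalize_cifor_url("http://www.cifor.org/library/1234/"))
```

<p>Values that match none of the patterns are returned untouched, which mirrors the <code>WHERE ... ~</code> guard in the SQL.</p>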
<ul>
<li>I filed an issue on GitHub: <a href="https://github.com/ilri/OpenRXV/issues/41">https://github.com/ilri/OpenRXV/issues/41</a></li>
<li>I uploaded the 94 IITA items that I had been working on last week to CGSpace</li>
<li>RTB emailed to ask why they are getting HTTP 503 errors during harvesting to the RTB WordPress website
<ul>
<li>From the screenshot I can see they are requesting URLs like this:</li>
<li>Wire up the systemd service/timer for the CGSpace Country Code Tagger curation task in the <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure scripts</a>
<ul>
<li><del>For now it won’t work on DSpace 6 because the curation task invocation needs to be slightly different (minus the <code>-l</code> parameter) and for some reason the task isn’t working on DSpace Test (version 6) right now</del></li>
<li>I added DSpace 6 support to the playbook templates…</li>
</ul>
</li>
<li>Run system updates on DSpace Test (linode26), re-deploy the DSpace 6 test branch, and reboot the server
<ul>
<li>After rebooting I deleted old copies of the cgspace-java-helpers JAR in the DSpace lib directory and then the curation worked</li>
<li>To my great surprise the curation worked (and completed, albeit a few times slower) on my local DSpace 6 environment as well:</li>
<li>Looking at the top user agents active on CGSpace in 2020-08, I see:
<ul>
<li><code>Delphi 2009</code>: 235353 (this is GARDIAN harvester I guess, as the IP is in Greece)</li>
<li><code>Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)</code>: 57004 (IP is 18.196.100.94, and the requests seem to be for CTA’s content)</li>
<li>Shit, I meant to add Delphi to the DSpace spider agents list last month but I guess I didn’t commit the change</li>
<li>HTTrack is in the agents list so I’m not sure why DSpace registers a hit from that request</li>
<li>Also, I am surprised to see the RTB and ILRI bots here because they have “BOT” in the name and that should also be dropped</li>
<li>I also see hits from <code>curl</code> and <code>Java/1.8.0_66</code> and <code>Apache-HttpClient</code> so WTF… those are supposed to be dropped by the default agents list</li>
<li>Some IP <code>2607:f298:5:101d:f816:3eff:fed9:a484</code> made 9,000 requests with the <code>RI/1.0</code> user agent this year…
<ul>
<li>That’s on DreamHost…?</li>
</ul>
</li>
<li>I purged 448658 hits from these agents and added <code>Delphi</code> to our local agents overload for Solr as well as Tomcat’s Crawler Session Manager Valve so that it forces them to re-use a single session</li>
<li>I made a pull request on the COUNTER-Robots project for the Daum robot: <a href="https://github.com/atmire/COUNTER-Robots/pull/38">https://github.com/atmire/COUNTER-Robots/pull/38</a>
<ul>
<li>This bot made 8,000 requests to CGSpace this year</li>
<li>I purged about 20,000 total requests from this bot from our Solr stats for the last few years</li>
<li>Peter noticed that an export from AReS shows some items with zero views and others with zero views/downloads, but on CGSpace and in the statistics API there are views/downloads
<ul>
<li>I need to ask Moayad…</li>
</ul>
</li>
</ul>
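<p>For the user agent cleanup above, here is a small Python sketch of how one might estimate the number of hits a pattern list would catch before purging. The pattern list is illustrative only and is not the contents of DSpace's spider agents file or of COUNTER-Robots; case-insensitive substring matching is likewise an assumption, not a description of how DSpace matches agents. The two counts are the 2020-08 numbers from the list above.</p>

```python
import re

# Illustrative patterns only (an assumption), loosely based on the agents named above.
ILLUSTRATIVE_PATTERNS = ["Delphi", "HTTrack", "bot", "^curl", "^Java/", "Apache-HttpClient"]
COMPILED = [re.compile(p, re.IGNORECASE) for p in ILLUSTRATIVE_PATTERNS]


def is_bot(user_agent: str) -> bool:
    """True if any illustrative pattern is found in the user agent string."""
    return any(p.search(user_agent) for p in COMPILED)


def purgeable_hits(hits_by_agent: dict[str, int]) -> int:
    """Sum the hit counts of all user agents that match a pattern."""
    return sum(count for agent, count in hits_by_agent.items() if is_bot(agent))


if __name__ == "__main__":
    hits = {
        "Delphi 2009": 235353,
        "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)": 57004,
    }
    print(purgeable_hits(hits))
```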
<h2 id="2020-09-12">2020-09-12</h2>
<ul>
<li>Carlos Tejo from the LandPortal emailed to ask for advice about integrating their <a href="https://landvoc.org/">LandVoc</a> vocabulary, which is a subset of AGROVOC, into DSpace
<ul>
<li>I told him that they could use the DSpace authority control framework and sent an example of the VIAFAuthority from the DSpace-CRIS project: <a href="https://github.com/4Science/DSpace/blob/dspace-6_x_x-cris/dspace-api/src/main/java/org/dspace/content/authority/VIAFAuthority.java">https://github.com/4Science/DSpace/blob/dspace-6_x_x-cris/dspace-api/src/main/java/org/dspace/content/authority/VIAFAuthority.java</a></li>
</ul>
</li>
<li>Redeploy the latest <code>5_x-prod</code> branch on CGSpace, re-run the latest Ansible DSpace playbook, run all system updates, and reboot the server (linode18)
<ul>
<li>This will bring the latest bot lists for Solr and Tomcat</li>
<li>I had to restart Tomcat 7 three times before all Solr statistics cores came up OK</li>
</ul>
</li>
<li>Leroy and Carol from CIAT/Bioversity were asking for information about posting to the CGSpace REST API from Sharepoint
<ul>
<li>I told them that we don’t allow this yet, but that we need to check in the future whether content can be posted to a workflow</li>
<li>Looking further into Carlos Tejo’s question about integrating LandVoc (the AGROVOC subset) into DSpace
<ul>
<li>I see that you can actually get LandVoc concepts directly from AGROVOC’s SPARQL, for example with <a href="http://agrovoc.uniroma2.it/sparql#query=PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0A%0ASELECT+%3Fconcept%0AWHERE+%7B%0A++%3Fconcept+a+skos%3AConcept+%3B%0A+++++++++++skos%3AinScheme+%3Chttp%3A%2F%2Flandvoc.org%2Flandvoc%3E+.%0A%0A%7D+ORDER+BY+%3Fconcept&contentTypeConstruct=text%2Fturtle&contentTypeSelect=application%2Fsparql-results%2Bjson&endpoint=http%3A%2F%2Fagrovoc.uniroma2.it%2Fsparql&requestMethod=POST&tabTitle=Query&headers=%7B%7D&outputFormat=table">this query</a></li>
<li>So maybe we can query AGROVOC directly using a similar method to <a href="https://github.com/4Science/DSpace/blob/dspace-5_x_x-cris/dspace-api/src/main/java/org/dspace/content/authority/TGNAuthority.java">DSpace-CRIS’s GettyAuthority</a></li>
<li>I wired up DSpace-CRIS’s VIAFAuthority to see how authorities for auto suggested names get stored
<ul>
<li>After submission you can see the item’s VIAF identifier:</li>
<li>I sent a follow-up message to Atmire to look into the two remaining issues with the DSpace 6 upgrade
<ul>
<li>First is the fact that we have zero results in our Listings and Reports, for any search</li>
<li>Second is the error we get during CSV imports</li>
</ul>
</li>
<li>Help Natalia and Cathy from Bioversity-CIAT with their OpenSearch query on “trade offs” again
<ul>
<li>They wanted to build a search query with multiple filters (type, crpsubject, status) and the general query “trade offs”</li>
<li>I found a great <a href="https://www.kiwi.fi/pages/viewpage.action?pageId=45782169">reference for DSpace’s OpenSearch syntax</a> (albeit in Finnish, but the example URLs show the syntax clearly)</li>
<li>We can use quotes and <code>AND</code> and <code>OR</code> and even group search parameters with parentheses!</li>
<li>So now I built a query for Natalia which uses these (showing without URL encoding so you can see the syntax):</li>
<pre tabindex="0"><code>https://cgspace.cgiar.org/open-search/discover?query=type:"Journal Article" AND status:"Open Access" AND crpsubject:"Water, Land and Ecosystems" AND "tradeoffs"&rpp=100
<li>I noticed that my <code>move-collections.sh</code> script didn’t work on DSpace 6 because of the change from IDs to UUIDs, so I modified it to quote the collection <code>resource_id</code> parameters in the PostgreSQL query</li>
<li>Last week Natalia from CIAT had asked me to download all the PDFs for a certain query:
<ul>
<li>items with status “Open Access”</li>
<li>items with type “Journal Article”</li>
<li>items containing any of the following words: water land and ecosystems & trade offs</li>
<li>The resulting OpenSearch query is: <code>https://cgspace.cgiar.org/open-search/discover?query=type:"Journal Article" AND status:"Open Access" AND Water Land Ecosystems trade offs&rpp=1</code></li>
<li>There were 241 results with a total of 208 PDFs, which I downloaded with my <code>get-wle-pdfs.py</code> script and shared to her via bashupload.com</li>
<li>Peter said he was having problems submitting items to CGSpace
<ul>
<li>On a hunch I looked at the PostgreSQL locks in Munin and indeed the normal issue with locks is back (though I haven’t seen it in a few months?)</li>
<li>Instead of restarting Tomcat I restarted the PostgreSQL service and then Peter said he was able to submit the item…</li>
<li>Experiment with doing direct queries for items in the <a href="https://github.com/ilri/dspace-statistics-api">dspace-statistics-api</a>
<ul>
<li>I tested querying a handful of item UUIDs with a date range and returning their hits faceted by <code>id</code></li>
<li>Assuming a list of item UUIDs was posted to the REST API we could prepare them for a Solr query by joining them into a string with “OR” and escaping the hyphens:</li>
item_ids_str = ' OR '.join(item_ids).replace('-', r'\-')
...
solr_query_params = {
"q": f"id:({item_ids_str})",
"fq": "type:2 AND isBot:false AND statistics_type:view AND time:[2020-01-01T00:00:00Z TO 2020-09-02T00:00:00Z]",
"facet": "true",
"facet.field": "id",
"facet.mincount": 1,
"facet.limit": 1,
"facet.offset": 0,
"stats": "true",
"stats.field": "id",
"stats.calcdistinct": "true",
"shards": shards,
"rows": 0,
"wt": "json",
}
</code></pre><ul>
<li>The date range format for Solr is important, but it seems we only need to add <code>T00:00:00Z</code> to the normal ISO 8601 YYYY-MM-DD strings</li>
<li>Experiment with re-creating IWMI’s “Monthly Abstract” type report with an AReS template
<ul>
<li>The template library for reports is: <a href="https://docxtemplater.com">https://docxtemplater.com</a></li>
<li>Conditions start with a pound and end with a slash: {#items} {/items}</li>
<li>An inverted section begins with a caret (hat) and ends with a slash: {^citation} No citation{/citation}</li>
<li>I found a bug: templates with a space in the file name don’t download</li>
<li>It would be nice if we could use <a href="https://docxtemplater.readthedocs.io/en/latest/angular_parse.html">angular expressions</a> to make more complex templates
<ul>
<li>Ability to iterate over authors (to change the separator)</li>
<li>Ability to get item number in a loop (for a list)</li>
<li>Ability to do things like check whether a CRP is “WLE”</li>