- Sisay and Abenet said they can't log in with LDAP on DSpace Test (DSpace 6)
- I tried and I can't either... but it is working on CGSpace
- The error on DSpace 6 is:
```
2020-09-02 12:03:10,666 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=A629116488DCC467E1EA2062A2E2EFD7:ip_addr=92.220.02.201:failed_login:no DN found for user aorth
```
- I tried to query LDAP directly using the application credentials with ldapsearch and it works:
- According to the [DSpace 6 docs](https://wiki.lyrasis.org/display/DSDOC6x/Authentication+Plugins#AuthenticationPlugins-LDAPAuthentication) we need to escape commas in our LDAP parameters due to the new configuration system
- I escaped the commas and restarted DSpace (though technically we shouldn't need to restart because the new config system hot-reloads configs)
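As a rough illustration of the escaping mentioned above, this sketch backslash-escapes any commas in an LDAP value before it goes into the DSpace 6 configuration. The DN is a made-up placeholder rather than our real search context, and the helper name is hypothetical:

```python
import re


def escape_commas(value: str) -> str:
    """Backslash-escape commas that are not already escaped."""
    # (?<!\\) skips commas that already have a backslash in front of them
    return re.sub(r"(?<!\\),", r"\\,", value)


# Hypothetical DN, for illustration only
dn = "ou=People,dc=example,dc=org"
print(escape_commas(dn))  # ou=People\,dc=example\,dc=org
```

Running it twice on the same value is harmless because already-escaped commas are left alone.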
- Run all system updates on DSpace Test (linode26) and reboot it
- Start reviewing 95 items for IITA (20201stbatch)
- I used my [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) tool to check and fix some low-hanging fruit first
- This fixed a few unnecessary Unicode characters, excessive whitespace, invalid multi-value separators, and duplicate metadata values
- Then I looked at the data in OpenRefine and noticed some things:
- All issue dates use year only, but some have months in the citation so they could be more specific
- I normalized all the DOIs to use "https://doi.org" format
- I fixed a few AGROVOC subjects with a simple GREL: `value.replace("GRAINS","GRAIN").replace("SOILS","SOIL").replace("CORN","MAIZE")`
- But there are a few more that are invalid that she will have to look at
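The DOI normalization above was done in OpenRefine, but the same idea can be sketched in Python. This is only an approximation of what I did by hand: the function name is hypothetical and the list of prefixes it strips is an assumption about the variants that typically show up, not an exhaustive list:

```python
import re

DOI_PREFIX = "https://doi.org/"


def normalize_doi(value: str) -> str:
    """Rewrite common DOI spellings to the https://doi.org/ form."""
    value = value.strip()
    # Strip a leading resolver URL or "doi:" label, if there is one
    bare = re.sub(
        r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", value, flags=re.IGNORECASE
    )
    # Only rewrite values that look like a DOI (they start with "10.")
    if bare.startswith("10."):
        return DOI_PREFIX + bare
    return value


# Made-up DOIs, for illustration only
for doi in ["http://dx.doi.org/10.1000/xyz", "doi:10.1000/xyz", "10.1000/xyz"]:
    print(normalize_doi(doi))
```

Values that don't look like a DOI are returned unchanged so they can be reviewed manually.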
- I uploaded the items to [DSpace Test](https://dspacetest.cgiar.org/handle/10568/108357) and it was apparently successful, but I get these errors in the console:
```
Thu Sep 03 12:26:33 CEST 2020 | Query:containerItem:ea7a2648-180d-4fce-bdc5-c3aa2304fc58
Error while updating
java.lang.NullPointerException
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl$5.visit(SourceFile:1131)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.visitEachStatisticShard(SourceFile:212)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1104)
at com.atmire.dspace.cua.CUASolrLoggerServiceImpl.update(SourceFile:1093)
at org.dspace.statistics.StatisticsLoggingConsumer.consume(SourceFile:104)
at org.dspace.event.BasicDispatcher.consume(BasicDispatcher.java:177)
at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:123)
at org.dspace.core.Context.dispatchEvents(Context.java:455)
at org.dspace.core.Context.commit(Context.java:424)
at org.dspace.core.Context.complete(Context.java:380)
at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- There are more in the DSpace log so I will raise it with Atmire immediately
- I was checking the recent IITA data for duplicates when I noticed that one was already in CIFOR's Archive, and saw that CIFOR has updated a bunch of their website URLs, for example:
- I will update our nearly 6,000 metadata values for CIFOR in the database accordingly:
```
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^(http://)?www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/([[:digit:]]+)\.html$', 'https://www.cifor.org/knowledge/publication/\3') WHERE metadata_field_id=219 AND text_value ~ 'www\.cifor\.org/(nc/)?online-library/browse/view-publication/publication/[[:digit:]]+';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/library/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/library/[[:digit:]]+/?';
dspace=# UPDATE metadatavalue SET text_value = regexp_replace(text_value, '^https?://www\.cifor\.org/pid/([[:digit:]]+)/?$', 'https://www.cifor.org/knowledge/publication/\1') WHERE metadata_field_id=219 AND text_value ~ 'https?://www\.cifor\.org/pid/[[:digit:]]+';
```
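Before running updates like these it can be useful to dry-run the patterns outside the database. The following sketch ports the three `regexp_replace` patterns to Python's `re` module (with `[0-9]` standing in for `[[:digit:]]`); the publication ID `1234` is made up for illustration and the function name is hypothetical:

```python
import re

NEW_URL = r"https://www.cifor.org/knowledge/publication/\g<id>"

# Python ports of the three anchored patterns used in the UPDATE statements
PATTERNS = [
    re.compile(
        r"^(http://)?www\.cifor\.org/(nc/)?online-library/browse/"
        r"view-publication/publication/(?P<id>[0-9]+)\.html$"
    ),
    re.compile(r"^https?://www\.cifor\.org/library/(?P<id>[0-9]+)/?$"),
    re.compile(r"^https?://www\.cifor\.org/pid/(?P<id>[0-9]+)/?$"),
]


def rewrite_cifor_url(url: str) -> str:
    """Return the new-style URL, or the input unchanged if nothing matches."""
    for pattern in PATTERNS:
        if pattern.search(url):
            return pattern.sub(NEW_URL, url)
    return url


# Made-up example values
for old in [
    "http://www.cifor.org/library/1234/",
    "www.cifor.org/nc/online-library/browse/view-publication/publication/1234.html",
    "https://www.cifor.org/pid/1234",
]:
    print(old, "->", rewrite_cifor_url(old))
```

Note that the `WHERE` clauses in the SQL are not anchored while the replacement patterns are, so a value with extra text around the URL would be selected but left unchanged.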
- I did some cleanup on the author affiliations of the IITA data with our 2019-04 list using reconcile-csv and OpenRefine:
- `$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id`
- I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: `if(cell.recon.matched, cell.recon.match.name, value)`
- I mapped one duplicate from the CIFOR Archives and re-uploaded the 94 IITA items to a new collection on [DSpace Test](https://dspacetest.cgiar.org/handle/10568/108453)
- Wire up the systemd service/timer for the CGSpace Country Code Tagger curation task in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public)
- ~~For now it won't work on DSpace 6 because the curation task invocation needs to be slightly different (minus the `-l` parameter) and for some reason the task isn't working on DSpace Test (version 6) right now~~
- I added DSpace 6 support to the playbook templates...
- Run system updates on DSpace Test (linode26), re-deploy the DSpace 6 test branch, and reboot the server
- After rebooting I deleted old copies of the cgspace-java-helpers JAR in the DSpace lib directory and then the curation worked
- To my great surprise the curation worked (and completed, albeit a few times slower) on my local DSpace 6 environment as well:
```
$ ~/dspace63/bin/dspace curate -t countrycodetagger -i all -s object
```
## 2020-09-10
- I checked the country code tagger on CGSpace and DSpace Test and it ran fine from the systemd timer last night... w00t
- I started looking at Peter's changes to the CGSpace regions that were proposed in 2020-07
- Shit, I meant to add Delphi to the DSpace spider agents list last month but I guess I didn't commit the change
- HTTrack is in the agents list so I'm not sure why DSpace registers a hit from that request
- Also, I am surprised to see the RTB and ILRI bots here because they have "BOT" in the name and that should also be dropped
- I also see hits from `curl` and `Java/1.8.0_66` and `Apache-HttpClient` so WTF... those are supposed to be dropped by the default agents list
- Some IP `2607:f298:5:101d:f816:3eff:fed9:a484` made 9,000 requests with the `RI/1.0` user agent this year...
- That's on DreamHost...?
- I purged 448658 hits from these agents and added `Delphi` to our local agents override for Solr as well as Tomcat's Crawler Session Manager Valve so that it forces them to re-use a single session
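To sanity-check which user agents a pattern list would catch, a small sketch like this can help. It assumes the agents lists contain one regular expression per line and that matching is case-insensitive anywhere in the user agent string; whether DSpace actually matches case-insensitively depends on its configuration, so treat that as an assumption. The patterns and user agent strings are a hand-picked sample for illustration, not our full list:

```python
import re


def matching_pattern(user_agent, patterns):
    """Return the first pattern that matches the user agent, or None."""
    for pattern in patterns:
        # Assumption: patterns are regexes matched case-insensitively
        if re.search(pattern, user_agent, flags=re.IGNORECASE):
            return pattern
    return None


# Sample patterns and agents, for illustration only
patterns = ["bot", "curl", "^Java/", "Apache-HttpClient", "HTTrack", "Delphi"]
agents = [
    "RTB website BOT",
    "curl/7.58.0",
    "Java/1.8.0_66",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

for agent in agents:
    print(agent, "->", matching_pattern(agent, patterns))
```

If an agent that should match still shows up in the statistics, the next thing to check is whether the pattern file in use on the server is the one I think it is.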
- I made a pull request on the COUNTER-Robots project for the Daum robot: https://github.com/atmire/COUNTER-Robots/pull/38
- This bot made 8,000 requests to CGSpace this year
- I purged about 20,000 total requests from this bot from our Solr stats for the last few years
- Peter noticed that an export from AReS shows some items with zero views and others with zero views/downloads, but on CGSpace and in the statistics API there are views/downloads
- I need to ask Moayad...
## 2020-09-12
- Carlos Tejo from the LandPortal emailed to ask for advice about integrating their [LandVoc](https://landvoc.org/) vocabulary, which is a subset of AGROVOC, into DSpace
- I told him that they could use the DSpace authority control framework and sent an example of the VIAFAuthority from the DSpace-CRIS project: https://github.com/4Science/DSpace/blob/dspace-6_x_x-cris/dspace-api/src/main/java/org/dspace/content/authority/VIAFAuthority.java
- Redeploy the latest `5_x-prod` branch on CGSpace, re-run the latest Ansible DSpace playbook, run all system updates, and reboot the server (linode18)
- This will bring the latest bot lists for Solr and Tomcat
- I had to restart Tomcat 7 three times before all Solr statistics cores came up OK
- Leroy and Carol from CIAT/Bioversity were asking for information about posting to the CGSpace REST API from Sharepoint
- I told them that we don't allow this yet, but that we need to check in the future whether content can be posted to a workflow
- Looking further into Carlos Tejo's question about integrating LandVoc (the AGROVOC subset) into DSpace
- I see that you can actually get LandVoc concepts directly from AGROVOC's SPARQL, for example with [this query](http://agrovoc.uniroma2.it/sparql#query=PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX+skos%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0A%0ASELECT+%3Fconcept%0AWHERE+%7B%0A++%3Fconcept+a+skos%3AConcept+%3B%0A+++++++++++skos%3AinScheme+%3Chttp%3A%2F%2Flandvoc.org%2Flandvoc%3E+.%0A%0A%7D+ORDER+BY+%3Fconcept&contentTypeConstruct=text%2Fturtle&contentTypeSelect=application%2Fsparql-results%2Bjson&endpoint=http%3A%2F%2Fagrovoc.uniroma2.it%2Fsparql&requestMethod=POST&tabTitle=Query&headers=%7B%7D&outputFormat=table)
- So maybe we can query AGROVOC directly using a similar method to [DSpace-CRIS's GettyAuthority](https://github.com/4Science/DSpace/blob/dspace-5_x_x-cris/dspace-api/src/main/java/org/dspace/content/authority/TGNAuthority.java)
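For reference, the query embedded in the link above can also be sent programmatically. This is a minimal sketch using only the standard library; the endpoint and the query come from that link, while the function names are hypothetical and I am assuming the endpoint accepts a standard SPARQL protocol POST with a form-encoded `query` parameter:

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://agrovoc.uniroma2.it/sparql"

QUERY = """\
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept
WHERE {
  ?concept a skos:Concept ;
           skos:inScheme <http://landvoc.org/landvoc> .
} ORDER BY ?concept
"""


def build_request(query: str = QUERY) -> urllib.request.Request:
    """Prepare a form-encoded POST that asks for JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
        method="POST",
    )


def fetch_concepts(limit: int = 10):
    """Run the query (needs network access) and return concept URIs."""
    with urllib.request.urlopen(build_request(), timeout=30) as response:
        results = json.load(response)
    bindings = results["results"]["bindings"]
    return [b["concept"]["value"] for b in bindings[:limit]]
```

Calling `fetch_concepts()` should return the first few LandVoc concept URIs if the endpoint behaves as assumed. An authority plugin would do something similar in Java, but with a filter on the label the user is typing.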
- I wired up DSpace-CRIS's VIAFAuthority to see how authorities for auto suggested names get stored
- After submission you can see the item's VIAF identifier:
- I sent a follow-up message to Atmire to look into the two remaining issues with the DSpace 6 upgrade
- First is the fact that we have zero results in our Listings and Reports, for any search
- Second is the error we get during CSV imports
- Help Natalia and Cathy from Bioversity-CIAT with their OpenSearch query on "trade offs" again
- They wanted to build a search query with multiple filters (type, crpsubject, status) and the general query "trade offs"
- I found a great [reference for DSpace's OpenSearch syntax](https://www.kiwi.fi/pages/viewpage.action?pageId=45782169) (albeit in Finnish, but the example URLs show the syntax clearly)
- We can use quotes and `AND` and `OR` and even group search parameters with parentheses!
- So now I built a query for Natalia which uses these (showing without URL encoding so you can see the syntax):
```
https://cgspace.cgiar.org/open-search/discover?query=type:"Journal Article" AND status:"Open Access" AND crpsubject:"Water, Land and Ecosystems" AND "tradeoffs"&rpp=100
```
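Since the query above is shown without URL encoding, here is a small sketch of how it could be encoded before sharing or scripting it. The helper name is hypothetical; the base URL and query string are the ones above:

```python
from urllib.parse import quote, urlencode

BASE = "https://cgspace.cgiar.org/open-search/discover"


def opensearch_url(query: str, rpp: int = 100) -> str:
    """Build an encoded OpenSearch URL for a Discovery query."""
    # quote_via=quote encodes spaces as %20 rather than "+"
    return BASE + "?" + urlencode({"query": query, "rpp": rpp}, quote_via=quote)


query = (
    'type:"Journal Article" AND status:"Open Access" '
    'AND crpsubject:"Water, Land and Ecosystems" AND "tradeoffs"'
)
print(opensearch_url(query))
```

Using `urlencode` keeps the quotes, colons, and commas in the filters from being mangled when the URL is pasted into an email or a script.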
- I noticed that my `move-collections.sh` script didn't work on DSpace 6 because of the change from IDs to UUIDs, so I modified it to quote the collection `resource_id` parameters in the PostgreSQL query
- Last week Natalia from CIAT had asked me to download all the PDFs for a certain query:
- items with status "Open Access"
- items with type "Journal Article"
- items containing any of the following words: water land and ecosystems & trade offs
- The resulting OpenSearch query is: https://cgspace.cgiar.org/open-search/discover?query=type:"Journal Article" AND status:"Open Access" AND Water Land Ecosystems trade offs&rpp=1
- There were 241 results with a total of 208 PDFs, which I downloaded with my `get-wle-pdfs.py` script and shared with her via bashupload.com
- Instead of restarting Tomcat I restarted the PostgreSQL service and then Peter said he was able to submit the item...
- Experiment with doing direct queries for items in the [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api)
- I tested querying a handful of item UUIDs with a date range and returning their hits faceted by `id`
- Assuming a list of item UUIDs was posted to the REST API we could prepare them for a Solr query by joining them into a string with "OR" and escaping the hyphens:
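A minimal sketch of that preparation step might look like the following. The function name is hypothetical, and wrapping the joined values in an `id:(...)` clause is my assumption about how the query would be built. The first UUID is the one from the import error earlier in these notes and the second is made up:

```python
import re

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def solr_id_query(uuids):
    """Build an 'id:(a OR b ...)' clause with hyphens backslash-escaped."""
    for uuid in uuids:
        # Reject anything that is not a UUID before it reaches Solr
        if not UUID_RE.match(uuid):
            raise ValueError(f"not a UUID: {uuid!r}")
    escaped = [uuid.replace("-", r"\-") for uuid in uuids]
    return "id:(" + " OR ".join(escaped) + ")"


uuids = [
    "ea7a2648-180d-4fce-bdc5-c3aa2304fc58",
    "00000000-0000-4000-8000-000000000000",  # made-up UUID
]
print(solr_id_query(uuids))
```

Validating the posted values first matters because they would otherwise go straight from the REST API into a Solr query string.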
- Experiment with re-creating IWMI's "Monthly Abstract" type report with an AReS template
- The template library for reports is: https://docxtemplater.com
- Conditions start with a pound and end with a slash: `{#items} {/items}`
- An inverted section begins with a caret (hat) and ends with a slash: `{^citation} No citation{/citation}`
- I found a bug: templates with a space in the file name don't download
- It would be nice if we could use [angular expressions](https://docxtemplater.readthedocs.io/en/latest/angular_parse.html) to make more complex templates
- Ability to iterate over authors (to change the separator)
- Ability to get item number in a loop (for a list)