- I added the "most-popular" pages to the list that return `X-Robots-Tag: none` to try to inform bots not to index or follow those pages
- Also, I implemented an nginx rate limit of twelve requests per minute on all dynamic pages... I figure a human user might legitimately request one every five seconds
- I wrote a small Python script [add-dc-rights.py](https://gist.github.com/alanorth/4ff81d5f65613814a66cb6f84fdf1fc5) to add usage rights (`dc.rights`) to CGSpace items based on the CSV Hector gave me from MARLO:
- While I was updating the [rest-find-collections.py](https://gist.github.com/alanorth/ddd7f555f0e487fe0e9d3eb4ff26ce50) script I noticed it was using `expand=all` to get the collection and community IDs
- I realized I actually only need `expand=collections,subCommunities`, and I wanted to see how much overhead the extra expands created so I did three runs of each:
```
$ time ./rest-find-collections.py 10568/27629 --rest-url https://dspacetest.cgiar.org/rest
```
- Average time with all expands was 14.3 seconds, and 12.8 seconds with `collections,subCommunities`, so **1.5 seconds difference**!
- Update my [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api) to use a database management class with Python contexts so that connections and cursors are automatically opened and closed
- I deployed verison 0.7.0 of the dspace-statistics-api on DSpace Test (linode19) so I can test it for a few days (and check the Munin stats to see the change in database connections) before deploying on CGSpace
- I also enabled systemd's persistent journal by setting [`Storage=persistent` in *journald.conf*](https://www.freedesktop.org/software/systemd/man/journald.conf.html)
- Apparently [Ubuntu 16.04 defaulted to using rsyslog for boot records until early 2018](https://www.freedesktop.org/software/systemd/man/journald.conf.html), so I removed `rsyslog` too
- Sisay changed his leave to be full days until December so I need to finish the IITA records that he was working on ([IITA_ ALIZZY1802-csv_oct23](https://dspacetest.cgiar.org/handle/10568/107871)
- Sisay had said there were a few PDFs missing and Bosede sent them this week, so I had to find those items on DSpace Test and add the bitstreams to the items manually
- As for the collection mappings I think I need to export the CSV from DSpace Test, add mappings for each type (ie Books go to IITA books collection, etc), then re-import to DSpace Test, then export from DSpace command line in "migrate" mode...
- From there I should be able to script the removal of the old DSpace Test collection so they just go to the correct IITA collections on import into CGSpace
- Finally import the 277 IITA (ALIZZY1802) records to CGSpace
- I had to export them from DSpace Test and import them into a temporary collection on CGSpace first, then export the collection as CSV to map them to new owning collections (IITA books, IITA posters, etc) with OpenRefine because DSpace's `dspace export` command doesn't include the collections for the items!
- Delete all old IITA collections on DSpace Test and run `dspace cleanup` to get rid of all the bitstreams
- Generate a new list of the top 1500 AGROVOC subjects on CGSpace to send to Peter and Sisay:
```
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;
- The Discovery re-indexing on CGSpace never finished yesterday... the command died after six minutes
- The `dspace.log.2018-11-19` shows this at the time:
```
2018-11-19 15:23:04,221 ERROR com.atmire.dspace.discovery.AtmireSolrService @ DSpace kernel cannot be null
java.lang.IllegalStateException: DSpace kernel cannot be null
at org.dspace.utils.DSpace.getServiceManager(DSpace.java:63)
at org.dspace.utils.DSpace.getSingletonService(DSpace.java:87)
at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:102)
at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:815)
at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:884)
at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
2018-11-19 15:23:04,223 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (4629 of 76007): 72731
```
- I looked in the Solr log around that time and I don't see anything...
- Working on Udana's WLE records from last month, first the sixteen records in [2018-11-20 RDL Temp](https://dspacetest.cgiar.org/handle/10568/108254)
- these items will go to the [Restoring Degraded Landscapes collection](https://dspacetest.cgiar.org/handle/10568/81592)
- a few items missing DOIs, but they are easily available on the publication page
- clean up DOIs to use "https://doi.org" format
- clean up some cg.identifier.url to remove unneccessary query strings
- remove columns with no metadata (river basin, place, target audience, isbn, uri, publisher, ispartofseries, subject)
- fix column with invalid spaces in metadata field name (cg. subject. wle)
- trim and collapse whitespace in all fields
- remove some weird Unicode characters (0xfffd) from abstracts, citations, and titles using Open Refine: `value.replace('<27>','')`
- add dc.rights to some fields that I noticed while checking DOIs
- Then the 24 records in [2018-11-20 VRC Temp](https://dspacetest.cgiar.org/handle/10568/108271)
- these items will go to the [Variability, Risks and Competing Uses collection](https://dspacetest.cgiar.org/handle/10568/81589)
- trim and collapse whitespace in all fields (lots in WLE subject!)
- clean up some cg.identifier.url fields that had unneccessary anchors in their links
- clean up DOIs to use "https://doi.org" format
- fix column with invalid spaces in metadata field name (cg. subject. wle)
- remove columns with no metadata (place, target audience, isbn, uri, publisher, ispartofseries, subject)
- remove some weird Unicode characters (0xfffd) from abstracts, citations, and titles using Open Refine: `value.replace('<27>','')`
- I notice a few items using DOIs pointing at ICARDA's DSpace like: https://doi.org/20.500.11766/8178, which then points at the "real" DOI on the publisher's site... these should be using the real DOI instead of ICARDA's "fake" Handle DOI
- Some items missing DOIs, but they clearly have them if you look at the publisher's site
- Judy Kimani was having issues resuming submissions in another ILRI collection recently, and the issue there was due to an empty group defined for the "accept/reject" step (aka workflow step 1)
- The error then was "authorization denied for workflow step 1" where "workflow step 1" was the "accept/reject" step, which had a group defined, but was empty
- Tezira says she's also getting the same "authorization denied" error for workflow step 1 when resuming submissions, so I told Abenet to delete the empty group
- [This WLE item](https://cgspace.cgiar.org/handle/10568/97709) is issued on 2018-10 and accessioned on 2018-10-22 but does not show up in the [WLE R4D Learning Series](https://cgspace.cgiar.org/handle/10568/41888) collection on CGSpace for some reason, and therefore does not show up on the WLE publication website
- I tried to remove that collection from Discovery and do a simple re-index: