- *Updated on 2018-12-04 to correct the grep command above, as it was inaccurate and it seems the bot was actually already re-using its Tomcat sessions*
- I added the "most-popular" pages to the list of pages that return `X-Robots-Tag: none` to try to inform bots not to index or follow those pages
- Also, I implemented an nginx rate limit of twelve requests per minute on all dynamic pages... I figure a human user might legitimately request one every five seconds
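- Roughly what that looks like in the nginx configuration (a sketch; the zone name, size, and location patterns here are illustrative, not our exact config):

```
# in the http {} context: allow at most twelve requests per minute per client address
limit_req_zone $binary_remote_addr zone=dynamicpages:16m rate=12r/m;

# in the server {} block (location patterns are only examples)
location ~ /most-popular/ {
    add_header X-Robots-Tag "none";
}

location ~ /(discover|search-filter) {
    limit_req zone=dynamicpages burst=5;
}
```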
- I wrote a small Python script [add-dc-rights.py](https://gist.github.com/alanorth/4ff81d5f65613814a66cb6f84fdf1fc5) to add usage rights (`dc.rights`) to CGSpace items based on the CSV Hector gave me from MARLO:
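- The real logic is in the gist above, but conceptually it does something like this (a sketch that assumes a CSV with `id` and `rights` columns, a DSpace 5.x database, and that the `metadata_field_id` for `dc.rights` has been looked up in the metadata registry):

```
#!/usr/bin/env python3
# Sketch only -- see the gist linked above for the actual script.
import csv

import psycopg2

DC_RIGHTS_FIELD_ID = 53  # assumption: check metadatafieldregistry for the real ID

with psycopg2.connect("dbname=dspace user=dspace") as connection:
    with connection.cursor() as cursor, open("rights.csv") as csvfile:
        for row in csv.DictReader(csvfile):
            # resource_type_id 2 is an item in DSpace 5.x
            cursor.execute(
                "INSERT INTO metadatavalue (metadata_value_id, resource_id, resource_type_id, metadata_field_id, text_value, place) "
                "VALUES (nextval('metadatavalue_seq'), %s, 2, %s, %s, 1)",
                (row["id"], DC_RIGHTS_FIELD_ID, row["rights"]),
            )
# the connection context manager commits the transaction on successful exit
```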
- Jesus, is Facebook *trying* to be annoying? At least the Tomcat Crawler Session Manager Valve is working to force the bot to re-use its Tomcat sessions:
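- For reference, the valve is enabled in Tomcat's `server.xml` with something like this (the user agent regex shown here is just an example):

```
<!-- inside the <Host> element: force matching user agents to share one session -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*facebookexternalhit.*"
       sessionInactiveInterval="60"/>
```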
- While I was updating the [rest-find-collections.py](https://gist.github.com/alanorth/ddd7f555f0e487fe0e9d3eb4ff26ce50) script I noticed it was using `expand=all` to get the collection and community IDs
- I realized I actually only need `expand=collections,subCommunities`, and I wanted to see how much overhead the extra expands created so I did three runs of each:
```
$ time ./rest-find-collections.py 10568/27629 --rest-url https://dspacetest.cgiar.org/rest
```
- Average time with all expands was 14.3 seconds, and 12.8 seconds with `collections,subCommunities`, so **1.5 seconds difference**!
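- The underlying REST request looks something like this, with only the `expand` parameter changing between runs:

```
$ curl -s 'https://dspacetest.cgiar.org/rest/handle/10568/27629?expand=collections,subCommunities'
```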
- Update my [dspace-statistics-api](https://github.com/ilri/dspace-statistics-api) to use a database management class with Python contexts so that connections and cursors are automatically opened and closed
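- The idea is roughly this (a simplified sketch of the pattern, not the exact class in the API; the DSN and the `items` table here are illustrative):

```
import psycopg2

class DatabaseManager:
    '''Open a PostgreSQL connection for the duration of a "with" block.'''

    def __init__(self, dsn="dbname=dspacestatistics user=dspacestatistics"):
        self.dsn = dsn

    def __enter__(self):
        self.connection = psycopg2.connect(self.dsn)
        return self.connection

    def __exit__(self, exc_type, exc_value, exc_traceback):
        # always close the connection, even if the block raised an exception
        self.connection.close()

# usage: the connection and cursor are closed automatically when the blocks exit
with DatabaseManager() as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT id, views, downloads FROM items LIMIT 1")
        print(cursor.fetchone())
```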
- I deployed version 0.7.0 of the dspace-statistics-api on DSpace Test (linode19) so I can test it for a few days (and check the Munin stats to see the change in database connections) before deploying it on CGSpace
- I also enabled systemd's persistent journal by setting [`Storage=persistent` in *journald.conf*](https://www.freedesktop.org/software/systemd/man/journald.conf.html)
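- The change itself is tiny:

```
# /etc/systemd/journald.conf
[Journal]
Storage=persistent
```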
- Apparently [Ubuntu 16.04 defaulted to using rsyslog for boot records until early 2018](https://www.freedesktop.org/software/systemd/man/journald.conf.html), so I removed `rsyslog` too
- Help troubleshoot an issue with Judy Kimani submitting to the [ILRI project reports, papers and documents](https://cgspace.cgiar.org/handle/10568/78) collection on CGSpace
- Sisay changed his leave to full days until December, so I need to finish the IITA records he was working on ([IITA_ ALIZZY1802-csv_oct23](https://dspacetest.cgiar.org/handle/10568/107871))
- Sisay had said there were a few PDFs missing and Bosede sent them this week, so I had to find those items on DSpace Test and add the bitstreams to the items manually
- As for the collection mappings, I think I need to export the CSV from DSpace Test, add mappings for each type (i.e., Books go to the IITA books collection, etc.), re-import to DSpace Test, then export from the DSpace command line in "migrate" mode...
- From there I should be able to script the removal of the old DSpace Test collection so the items just go to the correct IITA collections on import into CGSpace
- Finally import the 277 IITA (ALIZZY1802) records to CGSpace
- I had to export them from DSpace Test and import them into a temporary collection on CGSpace first, then export the collection as CSV to map them to their new owning collections (IITA books, IITA posters, etc.) with OpenRefine, because DSpace's `dspace export` command doesn't include the collections for the items!
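- The commands involved are roughly these (the handles, paths, and eperson email below are placeholders):

```
# on DSpace Test: export the items in "migrate" mode
$ dspace export -t COLLECTION -i 10568/TEST-COLLECTION -d /tmp/iita-export -n 0 -m
# on CGSpace: import into a temporary collection
$ dspace import -a -e user@example.com -c 10568/TEMP-COLLECTION -s /tmp/iita-export -m /tmp/iita.map
# export that collection's metadata as CSV, fix the owning collections in OpenRefine, then re-import
$ dspace metadata-export -i 10568/TEMP-COLLECTION -f /tmp/iita.csv
$ dspace metadata-import -f /tmp/iita-mapped.csv -e user@example.com
```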
- Delete all old IITA collections on DSpace Test and run `dspace cleanup` to get rid of all the bitstreams
- Generate a new list of the top 1500 AGROVOC subjects on CGSpace to send to Peter and Sisay:
```
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE metadata_field_id = 57 AND resource_type_id = 2 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2018-11-19-top-1500-subject.csv WITH CSV HEADER;
```
- The Discovery re-indexing on CGSpace never finished yesterday... the command died after six minutes
- The `dspace.log.2018-11-19` shows this at the time:
```
2018-11-19 15:23:04,221 ERROR com.atmire.dspace.discovery.AtmireSolrService @ DSpace kernel cannot be null
java.lang.IllegalStateException: DSpace kernel cannot be null
at org.dspace.utils.DSpace.getServiceManager(DSpace.java:63)
at org.dspace.utils.DSpace.getSingletonService(DSpace.java:87)
at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:102)
at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:815)
at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:884)
at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
2018-11-19 15:23:04,223 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (4629 of 76007): 72731
```
- I looked in the Solr log around that time and I don't see anything...
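- Roughly how I checked (the log path depends on the installation):

```
$ grep '2018-11-19 15:2' [dspace]/log/solr.log
```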
- Working on Udana's WLE records from last month, first the sixteen records in [2018-11-20 RDL Temp](https://dspacetest.cgiar.org/handle/10568/108254)
- these items will go to the [Restoring Degraded Landscapes collection](https://dspacetest.cgiar.org/handle/10568/81592)
- a few items missing DOIs, but they are easily available on the publication page
- clean up DOIs to use "https://doi.org" format
- clean up some cg.identifier.url fields to remove unnecessary query strings
- remove columns with no metadata (river basin, place, target audience, isbn, uri, publisher, ispartofseries, subject)
- fix column with invalid spaces in metadata field name (cg. subject. wle)
- trim and collapse whitespace in all fields
- remove some weird Unicode characters (0xfffd) from abstracts, citations, and titles using OpenRefine: `value.replace('�','')`
- add dc.rights to some fields that I noticed while checking DOIs
- Then the 24 records in [2018-11-20 VRC Temp](https://dspacetest.cgiar.org/handle/10568/108271)
- these items will go to the [Variability, Risks and Competing Uses collection](https://dspacetest.cgiar.org/handle/10568/81589)
- trim and collapse whitespace in all fields (lots in WLE subject!)
- clean up some cg.identifier.url fields that had unnecessary anchors in their links
- clean up DOIs to use "https://doi.org" format
- fix column with invalid spaces in metadata field name (cg. subject. wle)
- remove columns with no metadata (place, target audience, isbn, uri, publisher, ispartofseries, subject)
- remove some weird Unicode characters (0xfffd) from abstracts, citations, and titles using OpenRefine: `value.replace('�','')`
- I noticed a few items using DOIs that point at ICARDA's DSpace, like https://doi.org/20.500.11766/8178, which then points to the "real" DOI on the publisher's site... these should use the real DOI instead of ICARDA's "fake" Handle DOI
- Some items missing DOIs, but they clearly have them if you look at the publisher's site
- Judy Kimani was having issues resuming submissions in another ILRI collection recently; the error was "authorization denied for workflow step 1"
- It turned out that the "accept/reject" step (aka workflow step 1) for that collection had a group defined, but the group was empty
- Tezira says she's also getting the same "authorization denied" error for workflow step 1 when resuming submissions, so I told Abenet to delete the empty group
- [This WLE item](https://cgspace.cgiar.org/handle/10568/97709) is issued on 2018-10 and accessioned on 2018-10-22 but does not show up in the [WLE R4D Learning Series](https://cgspace.cgiar.org/handle/10568/41888) collection on CGSpace for some reason, and therefore does not show up on the WLE publication website
- I tried to remove that collection from Discovery and do a simple re-index:
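- That is, something like this (a sketch; `-r` removes an object from the Discovery index, and running `index-discovery` with no options updates the index):

```
$ dspace index-discovery -r 10568/41888
$ dspace index-discovery
```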
- Erica from AgriKnowledge emailed me to say that they have implemented the changes in their item page UI so that they include the permanent identifier on items harvested from CGSpace, for example: https://www.agriknowledge.org/concern/generics/wd375w33s
- I think we might want to prune some old accounts from CGSpace; perhaps users who haven't logged in within the last two years would be a conservative bunch:
```
$ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 | wc -l
409
$ dspace dsrun org.dspace.eperson.Groomer -a -b 11/27/2016 -d
```
- This deleted about 380 users, skipping those who have submissions in the repository
- Judy Kimani was having problems taking tasks in the [ILRI project reports, papers and documents](https://cgspace.cgiar.org/handle/10568/78) collection again
- The workflow step 1 (accept/reject) is now undefined for some reason
- Last week the group was defined, but empty, so we added her to the group and she was able to take the tasks
- Since then it looks like the group was deleted, so she no longer had permission to take or leave the tasks in her pool
- We added her back to the group, then she was able to take the tasks, and then we removed the group again, as we generally don't use this step in CGSpace