- I looked at the top user agents and IPs in the Solr statistics for last month and I see these user agents:
- "RI/1.0", 1337
- "Microsoft Office Word 2014", 941
- I will add the RI/1.0 pattern to our DSpace agents override and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one... as that's an actual user...
<!--more-->
- I should probably add the `RI/1.0` pattern to the COUNTER-Robots project
- As well as these IPs:
- 193.169.254.178, 21648
- 181.62.166.177, 20323
- 45.146.166.180, 19376
- The first IP seems to be in Estonia and their requests to the REST API change user agents from curl to Mac OS X to Windows and more
- Also, they seem to be trying to exploit something:
- I set up a clean DSpace 6.4 instance locally to test against, for example to rule out whether some issues are caused by Atmire modules or are already fixed in the as-yet-unreleased DSpace 6.4
- I had to delete all the Atmire schemas, then it worked fine on Tomcat 8.5 with Mirage (I didn't want to bother with npm and ruby for Mirage 2)
- Then I tried to see if I could reproduce the mapping issue that Marianne raised last month
- I tried unmapping and remapping to the CGIAR Gender grants collection and the collection appears in the item view's list of mapped collections, but not on the collection browse itself
- Then I tried mapping to a new collection and it was the same as above
- So this issue is really just a DSpace bug: it has nothing to do with Atmire and it is not fixed in the unreleased DSpace 6.4
- I will try one more time after updating the Discovery index (I'm also curious how fast it is on vanilla DSpace 6.4, though I think I tried that when I did the flame graphs in 2019 and it was miserable)
```console
$ time ~/dspace64/bin/dspace index-discovery -b
~/dspace64/bin/dspace index-discovery -b 4053.24s user 53.17s system 38% cpu 2:58:53.83 total
```
- Nope! Still slow, and still no mapped item...
- I even tried unmapping it from all collections, and adding it to a single new owning collection...
- Ah hah! Actually, I was inspecting the item's authorization policies when I noticed that someone had made the item private!
- After making it public again I was able to see it in the target collection
- The indexes on AReS Explorer are messed up after last week's harvesting:
- The AReS harvesting from yesterday finished, but the indexes are messed up again so I will have to fix them again before I harvest next time
- I also spent some time looking at IWMI's reports again
- On AReS we don't have a way to group by peer reviewed or item type other than doing "if type is Journal Article"
- Also, we don't have a way to check the IWMI Strategic Priorities because those are communities, not metadata...
- We can get the collections an item is in from the `parentCollectionList` metadata, but it is saved in Elasticsearch as a string instead of a list...
- I told them it won't be possible to replicate their reports exactly
- I decided to look at the CLARISA controlled vocabularies again
- They now have 6,200 institutions (was around 3,400 when I last looked in 2020-07)
- They have updated their Swagger interface but it still requires an API key if you want to use it from curl
- They have ISO 3166 countries and UN M.49 regions, but I notice they have some weird names like "Russian Federation (the)", which is not in ISO 3166 as far as I can see
- I exported a list of the institutions to look closer
- I found twelve items with whitespace issues
- There are some weird entries like `Research Institute for Aquaculture No1` and `Research Institute for Aquaculture No2`
- A few items have weird Unicode characters like U+00AD, U+200B, and U+00A0
- I found 100+ items with multiple languages in their name like `Ministère de l’Agriculture, de la pêche et des ressources hydrauliques / Ministry of Agriculture, Hydraulic Resources and Fisheries`
- Over 600 institutions have the country in their name like `Ministry of Coordination of Environmental Affairs (Mozambique)`
- For URLs they have `null` in some places... which is weird... why not just leave it blank?
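- A minimal Python sketch of how checks like the ones above could be scripted (this is not necessarily how the list was inspected here; the function name `check_institution`, the issue labels, and the heuristics such as "a ` / ` separator suggests two languages" are assumptions for illustration):

```python
import re

# Invisible or look-alike characters mentioned above: soft hyphen (U+00AD),
# zero-width space (U+200B), and no-break space (U+00A0)
SUSPICIOUS_CHARS = {"\u00ad", "\u200b", "\u00a0"}

# Assumed heuristic: a parenthetical at the very end of the name,
# e.g. "... (Mozambique)", which is often (not always) a country
TRAILING_PARENS = re.compile(r"\(([^()]+)\)\s*$")


def check_institution(name, url=None):
    """Return a list of issue labels for one institution record."""
    issues = []

    # Leading/trailing whitespace or doubled spaces
    if name != name.strip() or "  " in name:
        issues.append("whitespace")

    # Report each suspicious character once, by code point
    found = sorted(c for c in set(name) if c in SUSPICIOUS_CHARS)
    if found:
        issues.append("unicode:" + ",".join(f"U+{ord(c):04X}" for c in found))

    # Assumed heuristic: " / " separates the same name in two languages
    if " / " in name:
        issues.append("possibly multilingual")

    if TRAILING_PARENS.search(name):
        issues.append("trailing parenthetical")

    # The literal string "null" used in place of an empty URL
    if url is not None and url.strip().lower() == "null":
        issues.append("null url")

    return issues


if __name__ == "__main__":
    samples = [
        ("Ministry of Coordination of Environmental Affairs (Mozambique)", None),
        ("Research Institute for Aquaculture No1", "null"),
    ]
    for name, url in samples:
        print(name, "->", check_institution(name, url))
```

- The trailing-parenthetical check will also match acronyms and other notes in parentheses, so its results would still need a manual look or a comparison against a country list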
- I checked the CLARISA list against ROR's April, 2020 release ("Version 9", on figshare, though it is version 8 in the dump):