- I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their `cg.coverage.country` text values
- It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter's preferred "display" country names)
- It implements a "force" mode too that will clear existing country codes and re-tag everything
- It is class based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa...
<!--more-->
- The code is currently on my personal GitHub: https://github.com/alanorth/dspace-curation-tasks
- I still need to figure out how to integrate this with the DSpace build because currently you have to package it and copy the JAR to the `dspace/lib` directory (not to mention the config)
- I forked the [dspace-curation-tasks to ILRI's GitHub](https://github.com/ilri/dspace-curation-tasks) and [submitted the project to Maven Central](https://issues.sonatype.org/browse/OSSRH-59650) so I can integrate it more easily with our DSpace build via dependencies
- Atmire responded to the ticket about the ongoing upgrade issues
- They pushed an RC2 version of the CUA module that fixes the FontAwesome issue so that they now use classes instead of Unicode hex characters so our JS + SVG works!
- They also said they have never experienced the `type: 5` site statistics issue, so I need to try to purge those and continue with the stats processing
- I purged all unmigrated stats in a few cores and then restarted processing:
- Andrea from Macaroni Bros emailed me a few days ago to say he's having issues with the CGSpace REST API
- He said he noticed the issues when they were developing the WordPress plugin to harvest CGSpace for the RTB website: https://www.rtb.cgiar.org/publications/
- DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don't misuse the resources
- Perhaps `[Bb][Oo][Tt]`...
- I see another IP 104.198.96.245, which is also using the "RTB website BOT" but there are 70,000 hits in Solr from earlier this year before they started using the user agent
- I purged all the hits from Solr, including a few thousand from 64.62.202.71
- A few more IPs causing lots of Tomcat sessions yesterday:
- 38.128.66.10 isn't creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:
```
Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
```
- 64.62.202.71 is using a user agent I've never seen before:
- Unless we change it to `[Bb]\.?[Oo]\.?[Tt]\.?`... which seems to match all variations of "bot" I can think of right now, according to [regexr.com](https://regexr.com/59lpt):