- I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their `cg.coverage.country` text values
- It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter's preferred "display" country names)
- It implements a "force" mode too that will clear existing country codes and re-tag everything
- It is class-based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa...
<!--more-->
- The code is currently on my personal GitHub: https://github.com/alanorth/dspace-curation-tasks
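- For reference, here is a minimal sketch of what such a class-based curation task looks like against the DSpace 5.x curation API (the lookup and tagging logic and the exact metadata handling are placeholders, not the actual implementation):

```
import java.io.IOException;

import org.dspace.content.DSpaceObject;
import org.dspace.content.Item;
import org.dspace.content.Metadatum;
import org.dspace.core.Constants;
import org.dspace.curate.AbstractCurationTask;
import org.dspace.curate.Curator;

public class CountryCodeTagger extends AbstractCurationTask {
    @Override
    public int perform(DSpaceObject dso) throws IOException {
        // Only items carry cg.coverage.country metadata
        if (dso.getType() != Constants.ITEM) {
            return Curator.CURATE_SKIP;
        }

        Item item = (Item) dso;
        Metadatum[] countries = item.getMetadataByMetadataString("cg.coverage.country");

        for (Metadatum country : countries) {
            // Placeholder: look up country.value in ISO 3166-1 first, then in the
            // CGSpace countries mapping, and add the Alpha2 code to the item
        }

        setResult("Tagged " + item.getHandle());
        return Curator.CURATE_SUCCESS;
    }
}
```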
- I still need to figure out how to integrate this with the DSpace build because currently you have to package it and copy the JAR to the `dspace/lib` directory (not to mention the config)
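- For now the manual install is roughly like this (the JAR name and DSpace path are illustrative, and you still have to register the task in `dspace/config/modules/curate.cfg`):

```
$ mvn package
$ cp target/dspace-curation-tasks-*.jar ~/dspace/lib/
```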
- I forked the [dspace-curation-tasks to ILRI's GitHub](https://github.com/ilri/dspace-curation-tasks) and [submitted the project to Maven Central](https://issues.sonatype.org/browse/OSSRH-59650) so I can integrate it more easily with our DSpace build via dependencies
- Atmire responded to the ticket about the ongoing upgrade issues
- They pushed an RC2 version of the CUA module that fixes the FontAwesome issue: they now use classes instead of Unicode hex characters, so our JS + SVG approach works!
- They also said they have never experienced the `type: 5` site statistics issue, so I need to try to purge those and continue with the stats processing
- I purged all unmigrated stats in a few cores and then restarted processing:
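- The purge itself was a delete-by-query against each core's update handler, something like this per statistics core (the host, port, and core name are placeholders, and this is a reconstruction rather than the exact command):

```
$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*-unmigrated/</query></delete>'
```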
- Andrea from Macaroni Bros emailed me a few days ago to say he's having issues with the CGSpace REST API
- He said he noticed the issues when they were developing the WordPress plugin to harvest CGSpace for the RTB website: https://www.rtb.cgiar.org/publications/
- DSpace itself uses a case-sensitive regex for user agents, so there are no hits from those IPs in Solr, but I need to tweak the other regexes so these bots don't waste resources
- Perhaps `[Bb][Oo][Tt]`...
- I see another IP, 104.198.96.245, which is also using the "RTB website BOT" user agent, but there are 70,000 hits in Solr from earlier this year, before they started using that user agent
- I purged all the hits from Solr, including a few thousand from 64.62.202.71
- A few more IPs causing lots of Tomcat sessions yesterday:
- 38.128.66.10 isn't creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:
```
Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
```
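- One way to do that is Tomcat's Crawler Session Manager Valve in `server.xml`, which makes clients matching a user agent regex share a single session; something like this (the regex here is only an example):

```
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*brokenlinkcheck\.com.*" />
```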
- 64.62.202.71 is using a user agent I've never seen before:
- Unless we change it to `[Bb]\.?[Oo]\.?[Tt]\.?`... which seems to match all variations of "bot" I can think of right now, according to [regexr.com](https://regexr.com/59lpt):
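- A quick way to sanity-check that pattern is with Java's own regex engine (just an illustration of the pattern itself, not of how DSpace applies its agent regexes):

```
import java.util.regex.Pattern;

public class BotPatternTest {
    public static void main(String[] args) {
        // case- and punctuation-tolerant "bot" pattern from above
        Pattern bot = Pattern.compile("[Bb]\\.?[Oo]\\.?[Tt]\\.?");

        String[] agents = {"RTB website BOT", "Googlebot/2.1", "something b.o.t. something"};
        for (String agent : agents) {
            // find() looks for the pattern anywhere in the user agent string
            System.out.println(agent + " -> " + bot.matcher(agent).find());
        }
    }
}
```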
- I decided to start the 2018 core over again, so I re-synced it from CGSpace and started over with the solr-upgrade-statistics-6x tool, and now I'm having the same issues with Java heap space that I had last month
- The process kept crashing because it ran out of memory, so I increased the heap to 3072m and finally 4096m...
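- The heap for the `dspace` launcher can be raised via `JAVA_OPTS`, roughly like this (the exact options and any core-selection flags for the tool are omitted here):

```
$ export JAVA_OPTS="-Xmx4096m"
$ dspace solr-upgrade-statistics-6x
```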
- Also, I decided to try to purge all the `-unmigrated` docs that it had found so far to see if that helps...
- There were about 466,000 records unmigrated so far, most of which were `type: 5` (SITE statistics)
- Now it is processing again...
- I developed a small Java class called `FixJpgJpgThumbnails` to remove ".jpg.jpg" thumbnails from the `THUMBNAIL` bundle and replace them with their originals from the `ORIGINAL` bundle
- The code is based on [RemovePNGThumbnailsForPDFs.java](https://github.com/UoW-IRRs/DSpace-Scripts/blob/master/src/main/java/nz/ac/waikato/its/irr/scripts/RemovePNGThumbnailsForPDFs.java) by Andrea Schweer
- I incorporated it into my dspace-curation-tasks repository, then renamed it to [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers)
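- The core of the approach is roughly this (a sketch against the DSpace 5.x API; the real class also copies the corresponding bitstream from the `ORIGINAL` bundle back into `THUMBNAIL`, which is omitted here):

```
import org.dspace.content.Bitstream;
import org.dspace.content.Bundle;
import org.dspace.content.Item;
import org.dspace.content.ItemIterator;
import org.dspace.core.Context;

public class FixJpgJpgThumbnails {
    public static void main(String[] args) throws Exception {
        Context context = new Context();
        context.turnOffAuthorisationSystem();

        // iterate all items and look for ".jpg.jpg" names in the THUMBNAIL bundle
        ItemIterator items = Item.findAll(context);
        while (items.hasNext()) {
            Item item = items.next();
            for (Bundle bundle : item.getBundles("THUMBNAIL")) {
                for (Bitstream bitstream : bundle.getBitstreams()) {
                    if (bitstream.getName().endsWith(".jpg.jpg")) {
                        // remove the bad thumbnail; replacing it with the
                        // corresponding ORIGINAL bitstream is left out of this sketch
                        bundle.removeBitstream(bitstream);
                        item.update();
                    }
                }
            }
        }
        context.complete();
    }
}
```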
- The Atmire stats processing for the statistics-2018 Solr core keeps stopping with this error:
```
Exception: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.storeOnServer(SourceFile:317)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:177)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.update(SourceFile:128)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI.main(SourceFile:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- It lists a few of the records that it is having issues with, and they all have integer IDs
- When I checked Solr I saw 8,000 of them, some of which have `type: 0` and some with no type at all...
- The Atmire script did something to the server and created 132GB of log files, so the root partition ran out of space...
- I removed the log file and tried to re-run the process but it seems to be looping over 11,000 records and failing, creating millions of lines in the logs again:
- I was going to compare the CUA stats for 2018 and 2019 on CGSpace and DSpace Test, but after Linode rebooted CGSpace (linode18) for maintenance yesterday, the Solr cores didn't all come back up OK
- I had to restart Tomcat five times before they all came up!
- After that I generated a report for 2018 and 2019 on each server and found that the difference is about 10,000–20,000 per month, which is much less than I was expecting
- I noticed some authors who should have ORCID identifiers but didn't (perhaps on older items from before we started tagging ORCID metadata)
- With the simple list below I added 1,341 identifiers!
- That got me curious, so I generated a list of all the unique ORCID identifiers we have in the database:
```
dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240) TO /tmp/2020-08-09-orcid-identifiers.csv;
```
- I looked into the strange Solr record above that had "{set=830}" in the communities and collections
- There are exactly 11,724 records like this in the current CGSpace (DSpace 5.8) statistics-2018 Solr core
- None of them have an `id` or `type` field!
- I see 242,000 of them in the statistics-2017 core, 185,063 in the statistics-2016 core... all the way to 2010, but not in 2019 or the current statistics core
- I decided to purge all of these records from CGSpace right now so they don't even have a chance at being an issue on the real migration:
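- Since the records have no `id` field at all, the purge can be a delete-by-query matching documents where `id` is missing, along these lines for each affected core (host, port, and core names are placeholders):

```
$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>-id:[* TO *]</query></delete>'
```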
- I added `Googlebot` and `Twitterbot` to the list of explicitly allowed bots
- In Google's case, they were getting lumped in with all the other bad bots, and then important links like the sitemaps were returning HTTP 503, but they generally respect `robots.txt`, so we should just allow them (perhaps we can control the crawl rate in the webmaster console)
- In Twitter's case they were getting lumped in with the bad bots too, but really they only make ~50 requests a day, when someone posts a CGSpace link on Twitter
- I tagged the ISO 3166-1 Alpha2 country codes on all items on CGSpace using my [CountryCodeTagger](https://github.com/ilri/cgspace-java-helpers) curation task
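- For the record, once the module is installed and the task is registered in `curate.cfg`, it can be run from the command line like this (the task name here is how I'd expect it to be registered; adjust to the actual configuration):

```
$ dspace curate -t countrycodetagger -i all
```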