Add notes for 2020-08-02

2025-01-27 05:49:12 +01:00 · 2020-08-02 22:14:16 +03:00
parent 99ec5f167f
commit 8fcfb9821d
90 changed files with 878 additions and 619 deletions
--- a/content/posts/2020-07.md
+++ b/content/posts/2020-07.md
@@ -857,4 +857,46 @@ Fixed 13 occurences of: Muloi, D.
 Fixed 4 occurences of: Muloi, D.M.
 ```

+## 2020-07-28
+
+- I started analyzing the cases I've seen where a Solr record fails to be migrated:
+  - `id: 0-unmigrated` are mostly (all?) `type: 5` aka site view
+  - `id: -1-unmigrated` are mostly (all?) `type: 5` aka site view
+  - `id: -1` are mostly (all?) `type: 5` aka site view
+  - `id: 59184-unmigrated` where "59184" is the id of an item or bitstream that no longer exists
+- Why doesn't Atmire's code ignore any id with "-unmigrated"?
+- I sent feedback to Atmire since they had responded to my previous question yesterday
+  - They said that the DSpace 6 version of CUA does not work with Tomcat 8.5...
+- I spent a few hours trying to write a [Jython-based curation task](https://wiki.lyrasis.org/display/DSDOC5x/Curation+tasks+in+Jython) to update ISO 3166-1 Alpha2 country codes based on each item's ISO 3166-1 country
+  - Peter doesn't want to use the ISO 3166-1 list because he objects to a few names, so I thought we might be able to use country codes or numeric codes and update the names with a curation task
+  - The work is very rough but kinda works: [mytask.py](https://gist.github.com/alanorth/6a31af592b3467f7b63ac8aea7c75d52)
+  - What is nice is that the `dso.update()` method updates the data the "DSpace way" so we don't need to re-index Solr
+  - I had a clever idea to "vendor" the pycountry code using `pip install pycountry -t`, but pycountry dropped support for Python 2 in 2019 so we can only use an outdated version
+  - In the end it's really limiting to do this particular task in Jython because we are stuck with Python 2, we can't use virtual environments, and there is a lot of code we'd need to write to be able to handle the ISO 3166 country lists
+  - Python 2 is no longer supported by the Python community anyway, so it's probably better to figure out how to do this in Java
+
+## 2020-07-29
+
+- The Atmire stats tool (`com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI`) created 150GB of log files due to errors and the disk got full on DSpace Test (linode26)
+  - This morning I noticed that the run I started last night said that 54,000,000 (54 million!) records failed to process, but the core only had 6 million or so documents to process...!
+  - I removed the large log files and optimized the Solr core
+
+## 2020-07-30
+
+- Looking into ISO 3166-1 from the iso-codes package
+  - I see that all 249 current countries have names, 173 have official names, and 6 have common names:
+
+```
+# grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
+249
+# grep -c -E '"name":' /usr/share/iso-codes/json/iso_3166-1.json
+249
+# grep -c -E '"official_name":' /usr/share/iso-codes/json/iso_3166-1.json
+173
+# grep -c -E '"common_name":' /usr/share/iso-codes/json/iso_3166-1.json
+6
+```
+
+- Wow, the `CC-BY-NC-ND-3.0-IGO` license that I had [requested in 2019-02](https://github.com/spdx/license-list-XML/issues/767) was finally merged into SPDX...
+
 <!-- vim: set sw=2 ts=2: -->
--- a/content/posts/2020-08.md
+++ b/content/posts/2020-08.md
@@ -0,0 +1,21 @@
+---
+title: "August, 2020"
+date: 2020-07-02T15:35:54+03:00
+author: "Alan Orth"
+categories: ["Notes"]
+---
+
+## 2020-08-02
+
+- I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their `cg.coverage.country` text values
+  - It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter's preferred "display" country names)
+  - It implements a "force" mode too that will clear existing country codes and re-tag everything
+  - It is class-based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa...
+
+<!--more-->
+
+- The code is currently on my personal GitHub: https://github.com/alanorth/dspace-curation-tasks
+  - I still need to figure out how to integrate this with the DSpace build because currently you have to package it and copy the JAR to the `dspace/lib` directory (not to mention the config)
+
+
+<!-- vim: set sw=2 ts=2: -->