Add notes for 2020-08-02

2020-08-02 22:14:16 +03:00
parent 99ec5f167f
commit 8fcfb9821d
90 changed files with 878 additions and 619 deletions

@ -857,4 +857,46 @@ Fixed 13 occurences of: Muloi, D.
Fixed 4 occurences of: Muloi, D.M.
```
## 2020-07-28
- I started analyzing the cases I've seen where a Solr record fails to migrate:
- `id: 0-unmigrated` are mostly (all?) `type: 5` aka site view
- `id: -1-unmigrated` are mostly (all?) `type: 5` aka site view
- `id: -1` are mostly (all?) `type: 5` aka site view
- `id: 59184-unmigrated` where "59184" is the id of an item or bitstream that no longer exists
- Why doesn't Atmire's code ignore any id with "-unmigrated"?
- I sent feedback to Atmire since they had responded to my previous question yesterday
- They said that the DSpace 6 version of CUA does not work with Tomcat 8.5...
- I spent a few hours trying to write a [Jython-based curation task](https://wiki.lyrasis.org/display/DSDOC5x/Curation+tasks+in+Jython) to update ISO 3166-1 Alpha2 country codes based on each item's ISO 3166-1 country name
- Peter doesn't want to use the ISO 3166-1 list because he objects to a few names, so I thought we might be able to use country codes or numeric codes and update the names with a curation task
- The work is very rough but kinda works: [mytask.py](https://gist.github.com/alanorth/6a31af592b3467f7b63ac8aea7c75d52)
- What is nice is that the `dso.update()` method updates the data the "DSpace way" so we don't need to re-index Solr
- I had a clever idea to "vendor" the pycountry code using `pip install pycountry -t`, but pycountry dropped support for Python 2 in 2019 so we can only use an outdated version
- In the end, doing this particular task in Jython is really limiting because we are stuck with Python 2, we can't use virtual environments, and we'd need to write a lot of code to handle the ISO 3166 country lists
- Python 2 is no longer supported by the Python community anyway, so it's probably better to figure out how to do this in Java
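
The unmigrated id patterns listed at the top of this day's notes can be told apart with a simple suffix check. This is only a minimal sketch in plain Python, not Atmire's code: the function name `classify_solr_id` is hypothetical, and it assumes the ids arrive as strings exactly as shown above (for example `59184-unmigrated` or `-1`).

```python
# Hypothetical helper, not part of DSpace or Atmire's CUA module.
UNMIGRATED_SUFFIX = "-unmigrated"


def classify_solr_id(raw_id):
    """Split a Solr statistics id into (base_id, is_unmigrated).

    Assumes the id is a string such as "59184-unmigrated" or "-1".
    """
    is_unmigrated = raw_id.endswith(UNMIGRATED_SUFFIX)
    # Strip the suffix only when it is present, so "-1" stays "-1"
    base_id = raw_id[: -len(UNMIGRATED_SUFFIX)] if is_unmigrated else raw_id
    return base_id, is_unmigrated


if __name__ == "__main__":
    for raw in ["0-unmigrated", "-1-unmigrated", "-1", "59184-unmigrated"]:
        print(raw, classify_solr_id(raw))
```

A migration tool could use a check like this to skip records that were already marked as unmigrated instead of retrying them, which is the behavior I was asking Atmire about.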
## 2020-07-29
- The Atmire stats tool (com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI) created 150GB of log files due to errors and the disk got full on DSpace Test (linode26)
- This morning I noticed that the run I started last night said that 54,000,000 (54 million!) records failed to process, but the core only had 6 million or so documents to process...!
- I removed the large log files and optimized the Solr core
## 2020-07-30
- Looking into ISO 3166-1 from the iso-codes package
- I see that all 249 current countries have names, 173 have official names, and 6 have common names:
```
# grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
249
# grep -c -E '"name":' /usr/share/iso-codes/json/iso_3166-1.json
249
# grep -c -E '"official_name":' /usr/share/iso-codes/json/iso_3166-1.json
173
# grep -c -E '"common_name":' /usr/share/iso-codes/json/iso_3166-1.json
6
```
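
The grep counts above depend on how the JSON file happens to be laid out on disk, since `grep -c` counts matching lines. As a cross-check, the same numbers could be computed by parsing the JSON. This is a rough sketch: it assumes the file has a top-level `"3166-1"` key holding the list of country entries, and the names `count_fields`, `load_entries` and `SAMPLE` are my own.

```python
import json
from collections import Counter


def count_fields(entries):
    """Count how many entries carry each key (name, official_name, ...)."""
    counts = Counter()
    for entry in entries:
        counts.update(entry.keys())
    return counts


def load_entries(path="/usr/share/iso-codes/json/iso_3166-1.json"):
    """Load the country list; assumes a top-level "3166-1" key."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["3166-1"]


# Small inline sample so the sketch runs without the iso-codes package
SAMPLE = [
    {"alpha_2": "AW", "alpha_3": "ABW", "name": "Aruba", "numeric": "533"},
    {"alpha_2": "KE", "alpha_3": "KEN", "name": "Kenya", "numeric": "404",
     "official_name": "Republic of Kenya"},
    {"alpha_2": "BO", "alpha_3": "BOL", "name": "Bolivia, Plurinational State of",
     "numeric": "068", "official_name": "Plurinational State of Bolivia",
     "common_name": "Bolivia"},
]

if __name__ == "__main__":
    counts = count_fields(SAMPLE)
    for key in ("numeric", "name", "official_name", "common_name"):
        print(key, counts[key])
```

On a machine with the iso-codes package installed, `count_fields(load_entries())` should give figures comparable to the grep output, provided the file layout matches the assumption above.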
- Wow, the `CC-BY-NC-ND-3.0-IGO` license that I had [requested in 2019-02](https://github.com/spdx/license-list-XML/issues/767) was finally merged into SPDX...
<!-- vim: set sw=2 ts=2: -->

content/posts/2020-08.md Normal file

@ -0,0 +1,21 @@
---
title: "August, 2020"
date: 2020-08-02T15:35:54+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2020-08-02
- I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their `cg.coverage.country` text values
- It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter's preferred "display" country names)
- It implements a "force" mode too that will clear existing country codes and re-tag everything
- It is class-based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa...
<!--more-->
- The code is currently on my personal GitHub: https://github.com/alanorth/dspace-curation-tasks
- I still need to figure out how to integrate this with the DSpace build because currently you have to package it and copy the JAR to the `dspace/lib` directory (not to mention the config)
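
The lookup order and the "force" mode described above can be sketched roughly as follows. This is Python for brevity, not the actual Java task. The ISO names and codes are real, but the `DISPLAY_NAME_MAPPING` entries, the `country_codes` key and the function names are hypothetical placeholders. Skipping items that already have codes when "force" is off is also my assumption.

```python
# A few real ISO 3166-1 names and their Alpha2 codes
ISO_3166_1_NAMES = {
    "Kenya": "KE",
    "Tanzania, United Republic of": "TZ",
    "Viet Nam": "VN",
}

# Hypothetical stand-in for the CGSpace "display" names mapping
DISPLAY_NAME_MAPPING = {
    "Tanzania": "TZ",
    "Vietnam": "VN",
}


def lookup_alpha2(name):
    """Try ISO 3166-1 names first, then fall back to the display mapping."""
    return ISO_3166_1_NAMES.get(name) or DISPLAY_NAME_MAPPING.get(name)


def tag_item(item, force=False):
    """Set item["country_codes"] from item["cg.coverage.country"].

    Assumption: without force, items that already have codes are left alone.
    """
    if item.get("country_codes") and not force:
        return item
    codes = []
    for name in item.get("cg.coverage.country", []):
        code = lookup_alpha2(name)
        # Skip unknown names and avoid duplicate codes
        if code and code not in codes:
            codes.append(code)
    item["country_codes"] = codes
    return item


if __name__ == "__main__":
    item = {"cg.coverage.country": ["Kenya", "Tanzania"], "country_codes": []}
    print(tag_item(item))
```

Keeping the two dictionaries separate mirrors the idea of swapping in other vocabularies later, since only the lookup tables would need to change.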
<!-- vim: set sw=2 ts=2: -->