mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-08-02
@ -857,4 +857,46 @@ Fixed 13 occurences of: Muloi, D.
Fixed 4 occurences of: Muloi, D.M.
```

## 2020-07-28

- I started analyzing the cases I've seen where a Solr record fails to be migrated:
  - `id: 0-unmigrated` are mostly (all?) `type: 5` aka site view
  - `id: -1-unmigrated` are mostly (all?) `type: 5` aka site view
  - `id: -1` are mostly (all?) `type: 5` aka site view
  - `id: 59184-unmigrated` where "59184" is the id of an item or bitstream that no longer exists
  - Why doesn't Atmire's code ignore any id with "-unmigrated"?
- I sent feedback to Atmire since they had responded to my previous question yesterday
  - They said that the DSpace 6 version of CUA does not work with Tomcat 8.5...
- I spent a few hours trying to write a [Jython-based curation task](https://wiki.lyrasis.org/display/DSDOC5x/Curation+tasks+in+Jython) to update ISO 3166-1 Alpha2 country codes based on each item's ISO 3166-1 country name
  - Peter doesn't want to use the ISO 3166-1 list because he objects to a few names, so I thought we might be able to use country codes or numeric codes and update the names with a curation task
  - The work is very rough but kinda works: [mytask.py](https://gist.github.com/alanorth/6a31af592b3467f7b63ac8aea7c75d52)
  - What is nice is that the `dso.update()` method updates the data the "DSpace way" so we don't need to re-index Solr
  - I had a clever idea to "vendor" the pycountry code using `pip install pycountry -t`, but pycountry dropped support for Python 2 in 2019 so we can only use an outdated version
  - In the end it's really limiting to do this particular task in Jython because we are stuck with Python 2, we can't use virtual environments, and there is a lot of code we'd need to write to be able to handle the ISO 3166 country lists
  - Python 2 is no longer supported by the Python community anyways so it's probably better to figure out how to do this in Java

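
Going back to the unmigrated Solr records at the top of this entry, a rough way to bucket the `id` patterns listed there might look like the sketch below. It is plain Python and not part of Atmire's tooling; the regular expression, the label names, and the sample UUID are all my own assumptions for illustration.

```python
import re
from collections import Counter

# Assumed id shape: optional minus sign, digits, then the "-unmigrated" suffix
UNMIGRATED = re.compile(r"^(-?\d+)-unmigrated$")


def classify(doc_id):
    """Return a coarse, made-up label for a Solr statistics document id."""
    match = UNMIGRATED.match(doc_id)
    if match:
        legacy_id = int(match.group(1))
        if legacy_id <= 0:
            # 0 and -1 are the ids seen on site views in the notes above
            return "unmigrated-placeholder"
        return "unmigrated-legacy-id"
    if doc_id == "-1":
        return "placeholder"
    return "other"


sample_ids = [
    "0-unmigrated",
    "-1-unmigrated",
    "-1",
    "59184-unmigrated",
    # Made-up UUID standing in for a record that migrated normally
    "9f1c2d3e-0000-4000-8000-000000000000",
]

summary = Counter(classify(doc_id) for doc_id in sample_ids)
for label, count in sorted(summary.items()):
    print(label, count)
```

Run against a real export of ids, a tally like this would show how many failures are just site-view placeholders versus ids that point at deleted objects.
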

## 2020-07-29

- The Atmire stats tool (`com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI`) created 150GB of log files due to errors and the disk got full on DSpace Test (linode26)
  - This morning I noticed that the run I started last night said that 54,000,000 (54 million!) records failed to process, but the core only had 6 million or so documents to process...!
  - I removed the large log files and optimized the Solr core

## 2020-07-30

- Looking into ISO 3166-1 from the iso-codes package
  - I see that all 249 current countries have names, 173 have official names, and 6 have common names:

```
# grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
249
# grep -c -E '"name":' /usr/share/iso-codes/json/iso_3166-1.json
249
# grep -c -E '"official_name":' /usr/share/iso-codes/json/iso_3166-1.json
173
# grep -c -E '"common_name":' /usr/share/iso-codes/json/iso_3166-1.json
6
```
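
The same counts could probably be taken by parsing the JSON instead of grepping it. This is a minimal sketch, assuming the file keeps its entries in a list under a top-level `"3166-1"` key; the `count_fields` helper and the two-entry inline sample are mine, and the sample is only used when the iso-codes file is not installed.

```python
import json
from collections import Counter
from pathlib import Path

ISO_3166_1 = Path("/usr/share/iso-codes/json/iso_3166-1.json")


def count_fields(entries):
    """Count how many country entries carry each key."""
    counts = Counter()
    for entry in entries:
        counts.update(entry.keys())
    return counts


# Two-entry sample in the shape I expect the iso-codes entries to have
sample = [
    {"alpha_2": "AW", "alpha_3": "ABW", "name": "Aruba", "numeric": "533"},
    {
        "alpha_2": "BO",
        "alpha_3": "BOL",
        "name": "Bolivia, Plurinational State of",
        "official_name": "Plurinational State of Bolivia",
        "common_name": "Bolivia",
        "numeric": "068",
    },
]

if ISO_3166_1.exists():
    # Assumption: entries live under the "3166-1" key
    entries = json.loads(ISO_3166_1.read_text(encoding="utf-8"))["3166-1"]
else:
    entries = sample

counts = count_fields(entries)
for field in ("numeric", "name", "official_name", "common_name"):
    print(field, counts[field])
```

Counting keys rather than matching lines avoids accidental hits if a value ever contains one of the field names.
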

- Wow, the `CC-BY-NC-ND-3.0-IGO` license that I had [requested in 2019-02](https://github.com/spdx/license-list-XML/issues/767) was finally merged into SPDX...

<!-- vim: set sw=2 ts=2: -->
21
content/posts/2020-08.md
Normal file
@ -0,0 +1,21 @@
---
title: "August, 2020"
date: 2020-07-02T15:35:54+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2020-08-02

- I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their `cg.coverage.country` text values
  - It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter's preferred "display" country names)
  - It implements a "force" mode too that will clear existing country codes and re-tag everything
  - It is class-based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa...

<!--more-->

- The code is currently on my personal GitHub: https://github.com/alanorth/dspace-curation-tasks
  - I still need to figure out how to integrate this with the DSpace build because currently you have to package it and copy the JAR to the `dspace/lib` directory (not to mention the config)

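
The lookup order and the "force" mode described above can be illustrated roughly as follows. The real task is written in Java and lives in the repository linked above; this Python sketch only mirrors the idea. The two dictionaries, the `lookup_alpha2` and `tag_item` helpers, the dict-based item, and the choice to skip already-tagged items when not forcing are all assumptions of mine, and the "display" names are made-up examples rather than Peter's actual list.

```python
# Hypothetical stand-in for the ISO 3166-1 names (name -> Alpha2 code)
ISO_NAMES = {
    "Kenya": "KE",
    "Viet Nam": "VN",
    "Tanzania, United Republic of": "TZ",
}

# Hypothetical stand-in for the CGSpace "display" names mapping
CGSPACE_NAMES = {
    "Vietnam": "VN",
    "Tanzania": "TZ",
}


def lookup_alpha2(name):
    """Try ISO 3166-1 first, then fall back to the CGSpace mapping."""
    return ISO_NAMES.get(name) or CGSPACE_NAMES.get(name)


def tag_item(item, force=False):
    """Fill item["codes"] from item["countries"].

    Assumption: without force, an item that already has codes is left alone.
    With force, existing codes are cleared and everything is re-tagged.
    """
    if item["codes"] and not force:
        return item

    codes = []
    for name in item["countries"]:
        code = lookup_alpha2(name)
        if code and code not in codes:
            codes.append(code)

    item["codes"] = codes
    return item


item = {"countries": ["Kenya", "Vietnam", "Atlantis"], "codes": []}
print(tag_item(item)["codes"])
```

Names that match neither vocabulary (like "Atlantis" here) are simply skipped in this sketch; the real task may well log or report them instead.
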

<!-- vim: set sw=2 ts=2: -->