---
title: "August, 2020"
date: 2020-07-02T15:35:54+03:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2020-08-02
- I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their `cg.coverage.country` text values
- It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter's preferred "display" country names)
- It implements a "force" mode too that will clear existing country codes and re-tag everything
- It is class-based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa...
<!--more-->
- The code is currently on my personal GitHub: https://github.com/alanorth/dspace-curation-tasks
- I still need to figure out how to integrate this with the DSpace build because currently you have to package it and copy the JAR to the `dspace/lib` directory (not to mention the config)
- I forked the [dspace-curation-tasks to ILRI's GitHub](https://github.com/ilri/dspace-curation-tasks) and [submitted the project to Maven Central](https://issues.sonatype.org/browse/OSSRH-59650) so I can integrate it more easily with our DSpace build via dependencies
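
The lookup order described above can be sketched roughly as follows. This is an illustrative Python sketch, not the Java curation task itself: the two tables are tiny hypothetical samples, the `item` dictionary layout is made up, and skipping already-tagged items unless `force` is set is an assumption about how the non-force mode behaves.

```python
# Illustrative sketch only; the real curation task is written in Java.
# Both tables are small hypothetical samples, not the project's data.
ISO_3166_1 = {
    "KENYA": "KE",
    "TANZANIA, UNITED REPUBLIC OF": "TZ",
}

# Hypothetical stand-in for the CGSpace "display name" mapping.
CGSPACE_DISPLAY_NAMES = {
    "TANZANIA": "TZ",
}


def lookup_alpha2(name):
    """Try ISO 3166-1 first, then fall back to the display-name mapping."""
    key = name.strip().upper()
    if key in ISO_3166_1:
        return ISO_3166_1[key]
    return CGSPACE_DISPLAY_NAMES.get(key)  # None when the name is unknown


def tag_item(item, force=False):
    """Tag a dict with 'countries' (names) and 'alpha2' (codes) lists.

    Assumption: without force, items that already have codes are left alone.
    With force, existing codes are discarded and everything is re-tagged.
    """
    if item["alpha2"] and not force:
        return item["alpha2"]

    codes = []
    for name in item["countries"]:
        code = lookup_alpha2(name)
        if code is not None and code not in codes:
            codes.append(code)
    item["alpha2"] = codes
    return codes
```

Unknown names are simply skipped here; whether the real task logs or reports them is not covered by this sketch.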
## 2020-08-03
- Atmire responded to the ticket about the ongoing upgrade issues
- They pushed an RC2 version of the CUA module that fixes the FontAwesome issue: they now use classes instead of Unicode hex characters, so our JS + SVG works!
- They also said they have never experienced the `type: 5` site statistics issue, so I need to try to purge those and continue with the stats processing
- I purged all unmigrated stats in a few cores and then restarted processing:
```
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
```
- Andrea from Macaroni Bros emailed me a few days ago to say he's having issues with the CGSpace REST API
- He said he noticed the issues when they were developing the WordPress plugin to harvest CGSpace for the RTB website: https://www.rtb.cgiar.org/publications/
## 2020-08-04
- Look into the REST API issues that Macaroni Bros raised last week:
- The first one was about the `collections` endpoint returning empty items:
- https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=2 (offset=2 is correct)
- https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=3 (offset=3 is empty)
- https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=4 (offset=4 is correct again)
- I confirm that the second link returns zero items on CGSpace...
- I tested on my local development instance and it returns one item correctly...
- I tested on DSpace Test (currently DSpace 6 with UUIDs) and it returns one item correctly...
- Perhaps an indexing issue?
- The second issue is the `collections` endpoint returning the wrong number of items:
- https://cgspace.cgiar.org/rest/collections/1445 (numberItems: 63)
- https://cgspace.cgiar.org/rest/collections/1445/items (real number of items: 61)
- I confirm that it is indeed happening on CGSpace...
- And actually I can replicate the same issue on my local CGSpace 5.8 instance:
```
$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
"numberItems" : 63,
$ http 'http://localhost:8080/rest/collections/1445/items' | jq '. | length'
61
```
- Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:
```
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
"numberItems" : 61,
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
59
```
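
Both symptoms (an empty page at a single offset, and `numberItems` disagreeing with the item listing) could be checked in one go with something like the following. This is a minimal sketch, assuming the REST paths and query parameters shown in the URLs above; the function names and the injectable `fetch` parameter are my own, and the plain `/items` request only sees the first page if the collection is larger than the API's default limit.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)


def count_mismatch(base_url, collection_id, fetch=fetch_json):
    """Return (numberItems reported by the collection, items actually listed)."""
    collection = fetch(f"{base_url}/collections/{collection_id}")
    items = fetch(f"{base_url}/collections/{collection_id}/items")
    return collection["numberItems"], len(items)


def empty_offsets(base_url, collection_id, total, fetch=fetch_json):
    """Request one item per offset and collect the offsets that come back empty."""
    empty = []
    for offset in range(total):
        url = (
            f"{base_url}/collections/{collection_id}/items"
            f"?limit=1&offset={offset}"
        )
        if not fetch(url):
            empty.append(offset)
    return empty
```

Passing a different `fetch` callable makes it possible to try the logic against canned responses before pointing it at a live server.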
- Ah! I exported that collection's metadata and checked it in OpenRefine, where I noticed that two items are mapped twice
- I dealt with this problem in 2017-01 and the solution is to check the `collection2item` table:
```
dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
id | collection_id | item_id
--------+---------------+---------
133698 | 966 | 107687
134685 | 1445 | 107687
134686 | 1445 | 107687
(3 rows)
```
- So for each affected item you can delete one of the duplicate mappings by its `id`:
```
dspace=# DELETE FROM collection2item WHERE id='134686';
dspace=# DELETE FROM collection2item WHERE id='128819';
```
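
Rather than checking items one at a time, the duplicate mappings could be listed with a `GROUP BY ... HAVING` query. Here is a sketch that demonstrates the query against an in-memory SQLite table as a stand-in for PostgreSQL; the table and column names come from the output above, while the function names and the demo data loading are only for illustration.

```python
import sqlite3

# One row per (collection, item) pair that is mapped more than once.
# extra_id is the highest id of the group, i.e. a candidate row to delete.
DUPLICATES_SQL = """
SELECT collection_id, item_id, COUNT(*) AS copies, MAX(id) AS extra_id
FROM collection2item
GROUP BY collection_id, item_id
HAVING COUNT(*) > 1
"""


def find_duplicate_mappings(conn):
    """Return (collection_id, item_id, copies, extra_id) tuples."""
    return conn.execute(DUPLICATES_SQL).fetchall()


def demo():
    """Load the three rows from the psql output above into SQLite and query them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE collection2item ("
        "id INTEGER PRIMARY KEY, collection_id INTEGER, item_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO collection2item VALUES (?, ?, ?)",
        [(133698, 966, 107687), (134685, 1445, 107687), (134686, 1445, 107687)],
    )
    return find_duplicate_mappings(conn)
```

With those three rows, only the collection 1445 / item 107687 pair is reported, with two copies and 134686 as the higher `id`. The same `SELECT` should also run unchanged in `psql`.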
- Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter's preferred display names:
```
$ cat 2020-08-04-PB-new-countries.csv
cg.coverage.country,correct
CAPE VERDE,CABO VERDE
COCOS ISLANDS,COCOS (KEELING) ISLANDS
"CONGO, DR","CONGO, DEMOCRATIC REPUBLIC OF"
COTE D'IVOIRE,CÔTE D'IVOIRE
"KOREA, REPUBLIC","KOREA, REPUBLIC OF"
PALESTINE,"PALESTINE, STATE OF"
$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
```
- I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly
- I started a full Discovery re-indexing
## 2020-08-05
- Port my [dspace-curation-tasks](https://github.com/ilri/dspace-curation-tasks) to DSpace 6 and tag version `6.0-SNAPSHOT`
- I downloaded the [UN M.49](https://unstats.un.org/unsd/methodology/m49/overview/) CSV file to start working on updating the CGSpace regions
- First issue is they don't version the file so you have no idea when it was released
- Second issue is that three rows have errors due to not using quotes around "China, Macao Special Administrative Region"
- Bizu said she was having problems approving tasks on CGSpace
- I looked at the PostgreSQL locks and they have skyrocketed since yesterday:
![PostgreSQL locks day](/cgspace-notes/2020/08/postgres_locks_ALL-day.png)
![PostgreSQL query length day](/cgspace-notes/2020/08/postgres_querylength_ALL-day.png)
- Seems that something happened yesterday afternoon at around 5PM...
- For now I will just run all updates on the server and reboot it, as I have no idea what causes this issue
- I had to restart Tomcat 7 three times after the server came back up before all Solr statistics cores came up properly
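
The unquoted-comma problem in the UN M.49 CSV mentioned above is the kind of thing a quick field-count check can catch. A minimal sketch, assuming a comma-delimited file with a header row; the function name is hypothetical and the three-column sample is made up for illustration (the real file has a different layout).

```python
import csv
import io


def rows_with_wrong_field_count(text, delimiter=","):
    """Return (line number, row) for every row whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    bad = []
    for row in reader:
        if len(row) != len(header):
            bad.append((reader.line_num, row))
    return bad


# Made-up sample: the second data row has an unquoted comma in the name,
# so it parses as four fields instead of three.
SAMPLE = (
    "Code,Name,Region\n"
    "156,China,Asia\n"
    "446,China, Macao Special Administrative Region,Asia\n"
)

for line_num, row in rows_with_wrong_field_count(SAMPLE):
    print(f"line {line_num}: expected 3 fields, got {len(row)}")
```

Quoting the name field in the source file would make the flagged row parse with the expected number of fields.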
<!-- vim: set sw=2 ts=2: -->