cgspace-notes/content/posts/2021-03.md

280 lines
13 KiB
Markdown
Raw Normal View History

2021-03-04 21:46:05 +01:00
---
title: "March, 2021"
date: 2021-03-01T10:13:54+02:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2021-03-01
- Discuss some OpenRXV issues with Abdullah from CodeObia
- He's trying to work on the DSpace 6+ metadata schema autoimport using the DSpace 6+ REST API
- Also, we found some issues building and running OpenRXV currently due to ecosystem shift in the Node.js dependencies
<!--more-->
## 2021-03-02
- I fixed three build and runtime issues in OpenRXV:
- [fix highcharts-angular and ngx-tour-core build](https://github.com/ilri/OpenRXV/pull/80)
- [frontend/package.json: Pin @types/ramda at 0.27.34](https://github.com/ilri/OpenRXV/pull/82)
- Then I merged a few fixes that Abdullah had worked on last week
## 2021-03-03
- I [fixed another frontend build warning on OpenRXV](https://github.com/ilri/OpenRXV/issues/83)
- Then I [updated the frontend container to use Node.js 12 and Ubuntu 20.04](https://github.com/ilri/OpenRXV/pull/84)
- Also, I [added a GitHub Actions workflow to build the frontend](https://github.com/ilri/OpenRXV/pull/85)
- I did some testing of Abdullah's patch for the values mapping search on OpenRXV
- It still doesn't work with multi-word values, so I recorded a video with wf-recorder and uploaded it to [the issue](https://github.com/ilri/OpenRXV/issues/43) for him to investigate
## 2021-03-04
- Peter is having issues with the workflow since yesterday
- I looked at the Munin stats and see a high number of database locks since yesterday
![PostgreSQL locks week](/cgspace-notes/2021/03/postgres_locks_ALL-week.png)
![PostgreSQL connections week](/cgspace-notes/2021/03/postgres_connections_cgspace-week.png)
- I looked at the number of connections in PostgreSQL and it's definitely high again:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1020
```
- I reported it to Atmire to take a look, on the [same issue](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=851) we had been tracking this before
- Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell
- I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my `add-orcid-identifier.py` script:
```console
$ cat 2021-03-04-add-zoe-campbell-orcid.csv
dc.contributor.author,cg.creator.identifier
"Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
"Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p 'fuuu'
```
- I still need to do cleanup on the journal articles metadata
- Peter sent me some cleanups but I can't use them in the search/replace format he gave
- I think it's better to export the metadata values with IDs and import cleaned up ones as CSV
```console
localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 32087
```
- I used OpenRefine to remove all journal values that didn't have one of these values: ; ( )
- Then I cloned the `cg.journal` field to `cg.volume` and `cg.issue`
- I used some GREL expressions like these to extract the journal name, volume, and issue:
```console
value.partition(';')[0].trim() # to get journal names
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues
```
- Then I uploaded the changes to CGSpace using `dspace metadata-import`
- Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
- The error was "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:bd157345-448e ..."
- I searched the DSpace issue tracker and found several issues reporting this:
- [DS-3985 Delete item fails](https://jira.lyrasis.org/browse/DS-3985)
- [DS-4004 Authorization denied Exception when trying to delete permanently an item, collection or community as a non-Admin user](https://jira.lyrasis.org/browse/DS-4004)
- [DS-4297 Authorization error when trying to delete item by submitter/administrator](https://jira.lyrasis.org/browse/DS-4297)
- The issue is apparently with non-admin users who are in the admin and submit groups of the owning collection...
- In this case the item was uploaded to the CCAFS Reports collection, and Margarita is a non-admin user who is a member of the collection's admin and submit groups, exactly as the issue described
- I added a comment about our issue to [DS-4297](https://jira.lyrasis.org/browse/DS-4297)
- Yesterday Abenet added me to a WLE collection approver/editer steps so we can try to figure out why Niroshini is having issues adding metadata to Udana's submissions
- I edited Udana's submission to CGSpace:
- corrected the title
- added language English
- changed the link to the external item page instead of PDF
- added SDGs from the external item page
- added AGROVOC subjects from the external item page
- added pagination (extent)
- changed the license to "other" because CC-BY-NC-ND is not printed anywhere in the PDF or external item page
2021-03-05 19:52:36 +01:00
## 2021-03-05
- I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:
```console
$ docker-compose -f docker/docker-compose.yml down
$ docker volume create docker_esData_7
$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
2021-03-06 12:35:20 +01:00
$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
2021-03-05 19:52:36 +01:00
$ docker rm es_dummy
# edit docker/docker-compose.yml to switch from bind mount to volume
$ docker-compose -f docker/docker-compose.yml up -d
```
- The trick is that when you create a volume like "myvolume" from a `docker-compose.yml` file, Docker will create it with the name "docker_myvolume"
- If you create it manually on the command line with `docker volume create myvolume` then the name is literally "myvolume"
- I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit
- Delete the `openrxv-items-temp` index to test a fresh harvesting:
```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
```
2021-03-06 12:35:20 +01:00
## 2021-03-05
- Check the results of the AReS harvesting from last night:
```console
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
"count" : 101761,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
```
- Set the current items index to read only and make a backup:
```console
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
```
- Delete the current items index and clone the temp one to it:
```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
```
- Then delete the temp and backup:
```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{"acknowledged":true}%
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
```
- I made some pull requests to OpenRXV:
- [docker/docker-compose.yml: Use docker volumes](https://github.com/ilri/OpenRXV/pull/86)
- [docker/docker-compose.yml: Pin Redis to version 5](https://github.com/ilri/OpenRXV/pull/87)
- I deployed the latest changes from the last few days on AReS production
2021-03-07 14:51:12 +01:00
## 2021-03-07
- I realized there is something wrong with the Elasticsearch indexes on AReS
- On a new test environment I see `openrxv-items` is correctly an alias of `openrxv-items-final`:
```console
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
"openrxv-items-final": {
"aliases": {
"openrxv-items": {}
}
},
```
- But on AReS production `openrxv-items` has somehow become an index:
```console
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
"openrxv-items": {
"aliases": {}
},
"openrxv-items-final": {
"aliases": {}
},
"openrxv-items-temp": {
"aliases": {}
},
```
- I fixed the issue on production by cloning the `openrxv-items` index to `openrxv-items-final`, deleting `openrxv-items`, and then re-creating it as an alias:
```console
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
```
- Delete backups and remove read-only mode on `openrxv-items`:
```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
```
- Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
- Looking in the logs I see a few IPs making heavy usage on the REST API and XMLUI:
```console
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
```
- I see the usual IPs for CCAFS and ILRI importer bots, but also `143.233.242.132` which appears to be for GARDIAN:
```console
# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
6237
# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
6418
```
- They seem to make requests twice, once with the Delphi user agent that we know and already mark as a bot, and once with a "normal" user agent
- Looking in Solr I see they have been using this IP for awhile, as they have 100,000 hits going back into 2020
- I will add this IP to the list of bots in nginx and purge it from Solr with my `check-spider-ip-hits.sh` script
- I made a few changes to OpenRXV:
- [Migrated away from links to use networks](https://github.com/ilri/OpenRXV/issues/89)
- [Converted the backend container to use a custom image that includes `unoconv`](https://github.com/ilri/OpenRXV/issues/68) so we don't have to manually install it anymore
2021-03-08 19:13:40 +01:00
## 2021-03-08
- I approved the WLE item that I edited last week, and all the metadata is there: https://hdl.handle.net/10568/111810
- So I'm not sure what Niroshini's issue with metadata is...
- Peter sent a message yesterday saying that his item finally got committed
- I looked at the Munin graphs and there was a MASSIVE spike in database activity two days ago, and now database locks are back down to normal levels (from 1000+):
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
```
- On 2021-03-03 the PostgreSQL transactions started rising:
![PostgreSQL query length week](/cgspace-notes/2021/03/postgres_querylength_ALL-week.png)
- After that the connections and locks started going up, peaking on 2021-03-06:
![PostgreSQL locks week](/cgspace-notes/2021/03/postgres_locks_ALL-week.png)
![PostgreSQL connections week](/cgspace-notes/2021/03/postgres_connections_ALL-week.png)
- I sent another message to Atmire to ask if they have time to look into this
- CIFOR is pressuring me to upload the batch items from last week
- Vika sent me a final file with some duplicates that Peter identified removed
- I extracted and re-applied my basic corrections from last week in OpenRefine, then ran the items through `csv-metadata-quality` checker and uploaded them to CGSpace
- In total there are 1,088 items
- Udana from IWMI emailed to ask about CGSpace thumbnails
- Udana from IWMI emailed to ask about an item uploaded recently that does not appear in AReS
- [The item](https://hdl.handle.net/10568/111794) was added to the archive on 2021-03-05, and I last harvested on 2021-03-06, so this might be an issue of a missing item
- Abenet got a quote from Atmire to buy 125 credits for 3750€
- Maria at Bioversity sent some feedback about duplicate items on AReS
- I'm wondering if the issue of the `openrxv-items-final` index not getting cleared after a successful harvest (which results in having 200,000, then 300,000, etc items) has to do with the alias issue I fixed yesterday
- I will start a fresh harvest on AReS without now to check, but first back up the current index just in case:
```console
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
# start harvesting on AReS
```
- As I saw on my local test instance, even when you cancel a harvesting, it replaces the `openrxv-items-final` index with whatever is in `openrxv-items-temp` automatically, so I assume it will do the same now
2021-03-04 21:46:05 +01:00
<!-- vim: set sw=2 ts=2: -->