cgspace-notes/content/posts/2021-03.md

---
title: "March, 2021"
date: 2021-03-01T10:13:54+02:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2021-03-01

- Discuss some OpenRXV issues with Abdullah from CodeObia
  - He's trying to work on the DSpace 6+ metadata schema autoimport using the DSpace 6+ REST API
  - Also, we found some issues building and running OpenRXV currently due to ecosystem shift in the Node.js dependencies

<!--more-->

## 2021-03-02

- I fixed three build and runtime issues in OpenRXV:
  - [fix highcharts-angular and ngx-tour-core build](https://github.com/ilri/OpenRXV/pull/80)
  - [frontend/package.json: Pin @types/ramda at 0.27.34](https://github.com/ilri/OpenRXV/pull/82)
- Then I merged a few fixes that Abdullah had worked on last week

## 2021-03-03

- I [fixed another frontend build warning on OpenRXV](https://github.com/ilri/OpenRXV/issues/83)
- Then I [updated the frontend container to use Node.js 12 and Ubuntu 20.04](https://github.com/ilri/OpenRXV/pull/84)
- Also, I [added a GitHub Actions workflow to build the frontend](https://github.com/ilri/OpenRXV/pull/85)
- I did some testing of Abdullah's patch for the values mapping search on OpenRXV
  - It still doesn't work with multi-word values, so I recorded a video with wf-recorder and uploaded it to [the issue](https://github.com/ilri/OpenRXV/issues/43) for him to investigate

## 2021-03-04

- Peter is having issues with the workflow since yesterday
  - I looked at the Munin stats and see a high number of database locks since yesterday

![PostgreSQL locks week](/cgspace-notes/2021/03/postgres_locks_ALL-week.png)
![PostgreSQL connections week](/cgspace-notes/2021/03/postgres_connections_cgspace-week.png)

- I looked at the number of connections in PostgreSQL and it's definitely high again:

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1020
```

- I reported it to Atmire to take a look, on the [same issue](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=851) we had been tracking this before
- Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell
- I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my `add-orcid-identifier.py` script:

```console
$ cat 2021-03-04-add-zoe-campbell-orcid.csv 
dc.contributor.author,cg.creator.identifier
"Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
"Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p 'fuuu'
```

- I still need to do cleanup on the journal articles metadata
  - Peter sent me some cleanups but I can't use them in the search/replace format he gave
  - I think it's better to export the metadata values with IDs and import cleaned up ones as CSV

```console
localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 32087
```

- I used OpenRefine to remove all journal values that didn't have one of these values: ; ( )
  - Then I cloned the `cg.journal` field to `cg.volume` and `cg.issue`
  - I used some GREL expressions like these to extract the journal name, volume, and issue:

```console
value.partition(';')[0].trim() # to get journal names
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues
```

- Then I uploaded the changes to CGSpace using `dspace metadata-import`
- Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
  - The error was "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:bd157345-448e ..."
  - I searched the DSpace issue tracker and found several issues reporting this:
    - [DS-3985 Delete item fails](https://jira.lyrasis.org/browse/DS-3985)
    - [DS-4004 Authorization denied Exception when trying to delete permanently an item, collection or community as a non-Admin user](https://jira.lyrasis.org/browse/DS-4004)
    - [DS-4297 Authorization error when trying to delete item by submitter/administrator](https://jira.lyrasis.org/browse/DS-4297)
  - The issue is apparently with non-admin users who are in the admin and submit groups of the owning collection...
  - In this case the item was uploaded to the CCAFS Reports collection, and Margarita is a non-admin user who is a member of the collection's admin and submit groups, exactly as the issue described
  - I added a comment about our issue to [DS-4297](https://jira.lyrasis.org/browse/DS-4297)
- Yesterday Abenet added me to a WLE collection approver/editer steps so we can try to figure out why Niroshini is having issues adding metadata to Udana's submissions
  - I edited Udana's submission to CGSpace:
    - corrected the title
    - added language English
    - changed the link to the external item page instead of PDF
    - added SDGs from the external item page
    - added AGROVOC subjects from the external item page
    - added pagination (extent)
    - changed the license to "other" because CC-BY-NC-ND is not printed anywhere in the PDF or external item page

## 2021-03-05

- I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:

```console
$ docker-compose -f docker/docker-compose.yml down
$ docker volume create docker_esData_7
$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
$ docker rm es_dummy
# edit docker/docker-compose.yml to switch from bind mount to volume
$ docker-compose -f docker/docker-compose.yml up -d
```

- The trick is that when you create a volume like "myvolume" from a `docker-compose.yml` file, Docker will create it with the name "docker_myvolume"
  - If you create it manually on the command line with `docker volume create myvolume` then the name is literally "myvolume"
- I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit
- Delete the `openrxv-items-temp` index to test a fresh harvesting:

```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
```

## 2021-03-05

- Check the results of the AReS harvesting from last night:

```console
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
  "count" : 101761,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
```

- Set the current items index to read only and make a backup:

```console
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
```

- Delete the current items index and clone the temp one to it:

```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
```

- Then delete the temp and backup:

```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{"acknowledged":true}%
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
```

- I made some pull requests to OpenRXV:
  - [docker/docker-compose.yml: Use docker volumes](https://github.com/ilri/OpenRXV/pull/86)
  - [docker/docker-compose.yml: Pin Redis to version 5](https://github.com/ilri/OpenRXV/pull/87)
- I deployed the latest changes from the last few days on AReS production

## 2021-03-07

- I realized there is something wrong with the Elasticsearch indexes on AReS
  - On a new test environment I see `openrxv-items` is correctly an alias of `openrxv-items-final`:

```console
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    },
```

- But on AReS production `openrxv-items` has somehow become an index:

```console
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items": {
        "aliases": {}
    },
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {}
    },
```

- I fixed the issue on production by cloning the `openrxv-items` index to `openrxv-items-final`, deleting `openrxv-items`, and then re-creating it as an alias:

```console
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
```

- Delete backups and remove read-only mode on `openrxv-items`:

```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
```

- Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
  - Looking in the logs I see a few IPs making heavy usage on the REST API and XMLUI:

```console
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
```

- I see the usual IPs for CCAFS and ILRI importer bots, but also `143.233.242.132` which appears to be for GARDIAN:

```console
# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
6237
# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
6418
```

- They seem to make requests twice, once with the Delphi user agent that we know and already mark as a bot, and once with a "normal" user agent
  - Looking in Solr I see they have been using this IP for awhile, as they have 100,000 hits going back into 2020
  - I will add this IP to the list of bots in nginx and purge it from Solr with my `check-spider-ip-hits.sh` script
- I made a few changes to OpenRXV:
  - [Migrated away from links to use networks](https://github.com/ilri/OpenRXV/issues/89)
  - [Converted the backend container to use a custom image that includes `unoconv`](https://github.com/ilri/OpenRXV/issues/68) so we don't have to manually install it anymore

## 2021-03-08

- I approved the WLE item that I edited last week, and all the metadata is there: https://hdl.handle.net/10568/111810
  - So I'm not sure what Niroshini's issue with metadata is...
- Peter sent a message yesterday saying that his item finally got committed
  - I looked at the Munin graphs and there was a MASSIVE spike in database activity two days ago, and now database locks are back down to normal levels (from 1000+):

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
```

- On 2021-03-03 the PostgreSQL transactions started rising:

![PostgreSQL query length week](/cgspace-notes/2021/03/postgres_querylength_ALL-week.png)

- After that the connections and locks started going up, peaking on 2021-03-06:

![PostgreSQL locks week](/cgspace-notes/2021/03/postgres_locks_ALL-week.png)
![PostgreSQL connections week](/cgspace-notes/2021/03/postgres_connections_ALL-week.png)

- I sent another message to Atmire to ask if they have time to look into this
- CIFOR is pressuring me to upload the batch items from last week
  - Vika sent me a final file with some duplicates that Peter identified removed
  - I extracted and re-applied my basic corrections from last week in OpenRefine, then ran the items through `csv-metadata-quality` checker and uploaded them to CGSpace
  - In total there are 1,088 items
- Udana from IWMI emailed to ask about CGSpace thumbnails
- Udana from IWMI emailed to ask about an item uploaded recently that does not appear in AReS
  - [The item](https://hdl.handle.net/10568/111794) was added to the archive on 2021-03-05, and I last harvested on 2021-03-06, so this might be an issue of a missing item
- Abenet got a quote from Atmire to buy 125 credits for 3750€
- Maria at Bioversity sent some feedback about duplicate items on AReS
- I'm wondering if the issue of the `openrxv-items-final` index not getting cleared after a successful harvest (which results in having 200,000, then 300,000, etc items) has to do with the alias issue I fixed yesterday
  - I will start a fresh harvest on AReS without now to check, but first back up the current index just in case:

```console
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
# start harvesting on AReS
```

- As I saw on my local test instance, even when you cancel a harvesting, it replaces the `openrxv-items-final` index with whatever is in `openrxv-items-temp` automatically, so I assume it will do the same now

## 2021-03-09

- The harvesting on AReS finished last night and everything worked as expected, with no manual intervention
  - This means that [the issue](https://github.com/ilri/OpenRXV/issues/64) we were facing for a few months was due to the `openrxv-items` index being deleted and re-created as a standalone index instead of an alias of `openrxv-items-final`
- Talk to Moayad about OpenRXV development
  - We realized that the missing/duplicate items issue is probably due to the long harvesting time on the REST API, as the time between starting the harvesting on page 0 and finishing the harvesting on page 900 (in the CGSpace example), some items will have been added to the repository, which causes the pages to shift
  - I proposed a solution in the [GitHub issue](https://github.com/ilri/OpenRXV/issues/67), where we consult the site's XML sitemap after harvesting to see if we missed any items, and then we harvest them individually
- Peter sent me a list of 356 DOIs from Altmetric that don't have our Handles, so we need to Tweet them
  - I used my `doi-to-handle.py` script to generate a list of handles and titles for him:

```console
$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'
```

## 2021-03-10

- Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses `cg.isijournal` and MELSpace uses `mel.impact-factor`
  - I filed [an issue](https://github.com/AgriculturalSemantics/cg-core/issues/39) on the cg-core project to ask colleagues for ideas
- Peter said he doesn't see "Source Code" or "Software" in the [output type facet on the ILRI community](https://cgspace.cgiar.org/handle/10568/1/search-filter?field=type), but I see it on the home page, so I will try to do a full Discovery re-index:

```console
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    318m20.485s
user    215m15.196s
sys     2m51.529s
```

- Now I see ten items for "Source Code" in the facets...
- Add GPL and MIT licenses to the list of licenses on CGSpace input form since we will start capturing more software and source code
- Added the ability to check `dcterms.license` values against the SPDX licenses in the csv-metadata-quality tool
  - Also, I made some other minor fixes and released [version 0.4.6](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.6) on GitHub
- Proof and upload twenty-seven items to CGSpace for Peter Ballantyne
  - Mostly Ugandan outputs for CRP Livestock and Livestock and Fish

## 2021-03-14

- Switch to linux-kvm kernel on linode20 and linode18:

```console
# apt update && apt full-upgrade
# apt install linux-kvm
# apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
# apt autoremove && apt autoclean
# reboot
```

- Deploy latest changes from `6_x-prod` branch on CGSpace
- Deploy latest changes from OpenRXV `master` branch on AReS
- Last week Peter added OpenRXV to CGSpace: https://hdl.handle.net/10568/112982
- Back up the current `openrxv-items-final` index on AReS to start a new harvest:

```console
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
```

- After the harvesting finished it seems the indexes got messed up again, as `openrxv-items` is an alias of `openrxv-items-temp` instead of `openrxv-items-final`:

```console
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },
```

- Anyways, the number of items in `openrxv-items` seems OK and the AReS Explorer UI is working fine
  - I will have to manually fix the indexes before the next harvesting
- Publish the web version of the DSpace CSV Metadata Quality checker tool that I wrote this weekend on GitHub: https://github.com/ilri/csv-metadata-quality-web
  - Also, it is deployed on Heroku: https://fierce-ocean-30836.herokuapp.com/
  - I was running it on Google App Engine originally, but they have *way* too aggressive caching of static assets

## 2021-03-16

- Review ten items for Livestock and Fish and Dryland Systems from Peter
  - I told him to try the new web-based CSV Metadata Qualiter checker and he thought it was cool
  - I found one exact duplicate item and it gave me an idea to try to detect this in the tool

## 2021-03-17

- I added the ability to check for duplicate items to csv-metadata-quality
- I also made some minor optimizations in the Pandas code
- I [tagged version 0.4.7 of csv-metadata-quality on GitHub](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.7)

## 2021-03-18

- I added the ability to check for, and fix, "mojibake" characters in csv-metadata-quality

## 2021-03-21

- Last week Atmire asked me which browser I was using to test the duplicate checker, which I had [reported](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934) as not loading
  - I tried to load it in Chrome and it works... hmmm
- Back up the current `openrxv-items-final` index to start a fresh AReS Harvest:

```console
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
```

- Then start harvesting in the AReS Explorer admin UI

## 2021-03-22

- The harvesting on AReS yesterday completed, but somehow I have twice the number of items:

```console
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
  "count" : 206204,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
```

- Hmmm and even my backup index has a strange number of items:

```console
$ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&pretty'
{
  "count" : 844,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
```

- I deleted all indexes and re-created the openrxv-items alias:

```console
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items-temp": {
        "aliases": {}
    },
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    }
```

- Then I started a new harvesting
- I switched the Node.js in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public) to v12 since v10 will cease to be supported soon
  - I re-deployed DSpace Test (linode26) with Node.js 12 and restarted the server
- The AReS harvest finally finished, with 1047 pages of items, but the `openrxv-items-final` index is empty and the `openrxv-items-temp` index has a 103,000 items:

```console
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
  "count" : 103162,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
```

- I tried to clone the temp index to the final, but got an error:

```console
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"}],"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"},"status":400}% 
```

- I looked in the Docker logs for Elasticsearch and saw a few memory errors:

```console
java.lang.OutOfMemoryError: Java heap space
```

- According to `/usr/share/elasticsearch/config/jvm.options` in the Elasticsearch container the default JVM heap is 1g
  - I see the running Java process has `-Xms 1g -Xmx 1g` in its process invocation so I guess that it must be indeed using 1g
  - We can [change the heap size with the ES_JAVA_OPTS environment variable](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html)
  - Or perhaps better, we should [use a jvm.options.d file](https://www.elastic.co/guide/en/elasticsearch/reference/master/jvm-options.html) because if you use the environment variable it overrides all other JVM options from the default `jvm.options`
  - I tried to set memory to 1536m by binding an options file and restarting the container, but it didn't seem to work
  - Nevertheless, after restarting I see 103,000 items in the Explorer...
  - But the indexes are still kinda messed up... the `openrxv-items` index is an alias of the wrong index!

```console
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },
```

## 2021-03-23

- For reference you can also get the Elasticsearch JVM stats from the API:

```console
$ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
```

- I re-deployed AReS with 1.5GB of heap using the `ES_JAVA_OPTS` environment variable
  - It turns out that this *is* the recommended way to set the heap: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html
- Then I fixed the aliases to make sure `openrxv-items` was an alias of `openrxv-items-final`, similar to how I did a few weeks ago
- I re-created the temp index:

```console
$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
```

## 2021-03-24

- Atmire responded to the [ticket about the Duplicate Checker](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934)
  - He says it works for him in Firefox, so I checked and it seems to have been an issue with my LocalCDN addon
- I re-deployed DSpace Test (linode26) from the latest CGSpace (linode18) data
  - I want to try to finish up processing the duplicates in Solr that [Atmire advised on last month](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839)
  - The current statistics core is 57861236 kilobytes:

```console
# du -s /home/dspacetest.cgiar.org/solr/statistics
57861236        /home/dspacetest.cgiar.org/solr/statistics
```

- I applied their changes to `config/spring/api/atmire-cua-update.xml` and started the duplicate processor:

```console
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 1000 -c statistics -t 12
```

- The default number of records per query is 10,000, which caused memory issues, so I will try with 1000 (Atmire used 100, but that seems too low!)
- Hah, I still got a memory error after only a few minutes:

```console
...
Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s                                      
Exception: GC overhead limit exceeded                                                                          
java.lang.OutOfMemoryError: GC overhead limit exceeded 
```

- I guess we really do have to use `-r 100`
- Now the thing runs for a few minutes and "finishes":

```console
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
Loading @mire database changes for module MQM
Changes have been processed


*************************
* Update Script Started *
*************************

Run 1
Start updating Solr Storage Reports | Wed Mar 24 14:42:17 CET 2021
Deleting old storage docs from Solr... | Wed Mar 24 14:42:17 CET 2021
Done. | Wed Mar 24 14:42:17 CET 2021
Processing storage reports for type: eperson | Wed Mar 24 14:42:17 CET 2021
Done. | Wed Mar 24 14:42:41 CET 2021
Processing storage reports for type: group | Wed Mar 24 14:42:41 CET 2021
Done. | Wed Mar 24 14:45:46 CET 2021
Processing storage reports for type: collection | Wed Mar 24 14:45:46 CET 2021
Done. | Wed Mar 24 14:45:54 CET 2021
Processing storage reports for type: community | Wed Mar 24 14:45:54 CET 2021
Done. | Wed Mar 24 14:45:58 CET 2021
Committing to Solr... | Wed Mar 24 14:45:58 CET 2021
Done. | Wed Mar 24 14:45:59 CET 2021
Successfully finished updating Solr Storage Reports | Wed Mar 24 14:45:59 CET 2021
Run 1 —   2% — 100/4,824 docs — 3m 47s — 3m 47s
Run 1 —   4% — 200/4,824 docs — 2s — 3m 50s
Run 1 —   6% — 300/4,824 docs — 2s — 3m 53s
Run 1 —   8% — 400/4,824 docs — 2s — 3m 55s
Run 1 —  10% — 500/4,824 docs — 2s — 3m 58s
Run 1 —  12% — 600/4,824 docs — 2s — 4m 1s
Run 1 —  15% — 700/4,824 docs — 2s — 4m 3s
Run 1 —  17% — 800/4,824 docs — 2s — 4m 6s
Run 1 —  19% — 900/4,824 docs — 2s — 4m 9s
Run 1 —  21% — 1,000/4,824 docs — 2s — 4m 11s
Run 1 —  23% — 1,100/4,824 docs — 2s — 4m 14s
Run 1 —  25% — 1,200/4,824 docs — 2s — 4m 16s
Run 1 —  27% — 1,300/4,824 docs — 2s — 4m 19s
Run 1 —  29% — 1,400/4,824 docs — 2s — 4m 22s
Run 1 —  31% — 1,500/4,824 docs — 2s — 4m 24s
Run 1 —  33% — 1,600/4,824 docs — 2s — 4m 27s
Run 1 —  35% — 1,700/4,824 docs — 2s — 4m 29s
Run 1 —  37% — 1,800/4,824 docs — 2s — 4m 32s
Run 1 —  39% — 1,900/4,824 docs — 2s — 4m 35s
Run 1 —  41% — 2,000/4,824 docs — 2s — 4m 37s
Run 1 —  44% — 2,100/4,824 docs — 2s — 4m 40s
Run 1 —  46% — 2,200/4,824 docs — 2s — 4m 42s
Run 1 —  48% — 2,300/4,824 docs — 2s — 4m 45s
Run 1 —  50% — 2,400/4,824 docs — 2s — 4m 48s
Run 1 —  52% — 2,500/4,824 docs — 2s — 4m 50s
Run 1 —  54% — 2,600/4,824 docs — 2s — 4m 53s
Run 1 —  56% — 2,700/4,824 docs — 2s — 4m 55s
Run 1 —  58% — 2,800/4,824 docs — 2s — 4m 58s
Run 1 —  60% — 2,900/4,824 docs — 2s — 5m 1s
Run 1 —  62% — 3,000/4,824 docs — 2s — 5m 3s
Run 1 —  64% — 3,100/4,824 docs — 2s — 5m 6s
Run 1 —  66% — 3,200/4,824 docs — 3s — 5m 9s
Run 1 —  68% — 3,300/4,824 docs — 2s — 5m 12s
Run 1 —  70% — 3,400/4,824 docs — 2s — 5m 14s
Run 1 —  73% — 3,500/4,824 docs — 2s — 5m 17s
Run 1 —  75% — 3,600/4,824 docs — 2s — 5m 20s
Run 1 —  77% — 3,700/4,824 docs — 2s — 5m 22s
Run 1 —  79% — 3,800/4,824 docs — 2s — 5m 25s
Run 1 —  81% — 3,900/4,824 docs — 2s — 5m 27s
Run 1 —  83% — 4,000/4,824 docs — 2s — 5m 30s
Run 1 —  85% — 4,100/4,824 docs — 2s — 5m 33s
Run 1 —  87% — 4,200/4,824 docs — 2s — 5m 35s
Run 1 —  89% — 4,300/4,824 docs — 2s — 5m 38s
Run 1 —  91% — 4,400/4,824 docs — 2s — 5m 41s
Run 1 —  93% — 4,500/4,824 docs — 2s — 5m 43s
Run 1 —  95% — 4,600/4,824 docs — 2s — 5m 46s
Run 1 —  97% — 4,700/4,824 docs — 2s — 5m 49s
Run 1 — 100% — 4,800/4,824 docs — 2s — 5m 51s
Run 1 — 100% — 4,824/4,824 docs — 2s — 5m 53s
Run 1 took 5m 53s


**************************
* Update Script Finished *
**************************
```

- If I run it again it finds the same 4,824 docs and processes them...
  - I asked Atmire for feedback on this: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839

## 2021-03-25

- Niroshini from IWMI is still having problems adding metadata during the edit step of the workflow on CGSpace
  - I told her to try to register using a private email account and we'll add her to the WLE group so she can try that way

## 2021-03-28

- Make a backup of the `openrxv-items-final` index on AReS Explorer and start a new harvest

<!-- vim: set sw=2 ts=2: -->
-												Add notes

											
										
										
											2021-03-04 21:46:05 +01:00
+								---
 								title: "March, 2021"
 								date: 2021-03-01T10:13:54+02:00
 								author: "Alan Orth"
 								categories: ["Notes"]
 								---
 								## 2021-03-01
 								- Discuss some OpenRXV issues with Abdullah from CodeObia
 								  - He's trying to work on the DSpace 6+ metadata schema autoimport using the DSpace 6+ REST API
 								  - Also, we found some issues building and running OpenRXV currently due to ecosystem shift in the Node.js dependencies
 								<!--more-->
 								## 2021-03-02
 								- I fixed three build and runtime issues in OpenRXV:
 								  - [fix highcharts-angular and ngx-tour-core build](https://github.com/ilri/OpenRXV/pull/80)
 								  - [frontend/package.json: Pin @types/ramda at 0.27.34](https://github.com/ilri/OpenRXV/pull/82)
 								- Then I merged a few fixes that Abdullah had worked on last week
 								## 2021-03-03
 								- I [fixed another frontend build warning on OpenRXV](https://github.com/ilri/OpenRXV/issues/83)
 								- Then I [updated the frontend container to use Node.js 12 and Ubuntu 20.04](https://github.com/ilri/OpenRXV/pull/84)
 								- Also, I [added a GitHub Actions workflow to build the frontend](https://github.com/ilri/OpenRXV/pull/85)
 								- I did some testing of Abdullah's patch for the values mapping search on OpenRXV
 								  - It still doesn't work with multi-word values, so I recorded a video with wf-recorder and uploaded it to [the issue](https://github.com/ilri/OpenRXV/issues/43) for him to investigate
 								## 2021-03-04
 								- Peter is having issues with the workflow since yesterday
 								  - I looked at the Munin stats and see a high number of database locks since yesterday
 								![PostgreSQL locks week](/cgspace-notes/2021/03/postgres_locks_ALL-week.png)
 								![PostgreSQL connections week](/cgspace-notes/2021/03/postgres_connections_cgspace-week.png)
 								- I looked at the number of connections in PostgreSQL and it's definitely high again:
 								```console
 								$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
 
 								```
 								- I reported it to Atmire to take a look, on the [same issue](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=851) we had been tracking this before
 								- Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell
 								- I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my `add-orcid-identifier.py` script:
 								```console
 								$ cat 2021-03-04-add-zoe-campbell-orcid.csv
 								dc.contributor.author,cg.creator.identifier
 								"Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
 								"Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
 								$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p 'fuuu'
 								```
 								- I still need to do cleanup on the journal articles metadata
 								  - Peter sent me some cleanups but I can't use them in the search/replace format he gave
 								  - I think it's better to export the metadata values with IDs and import cleaned up ones as CSV
 								```console
 								localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
 								COPY 32087
 								```
 								- I used OpenRefine to remove all journal values that didn't have one of these values: ; ( )
 								  - Then I cloned the `cg.journal` field to `cg.volume` and `cg.issue`
 								  - I used some GREL expressions like these to extract the journal name, volume, and issue:
 								```console
 								value.partition(';')[0].trim() # to get journal names
 								value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
 								value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues
 								```
 								- Then I uploaded the changes to CGSpace using `dspace metadata-import`
 								- Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
 								  - The error was "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:bd157345-448e ..."
 								  - I searched the DSpace issue tracker and found several issues reporting this:
 								    - [DS-3985 Delete item fails](https://jira.lyrasis.org/browse/DS-3985)
 								    - [DS-4004 Authorization denied Exception when trying to delete permanently an item, collection or community as a non-Admin user](https://jira.lyrasis.org/browse/DS-4004)
 								    - [DS-4297 Authorization error when trying to delete item by submitter/administrator](https://jira.lyrasis.org/browse/DS-4297)
 								  - The issue is apparently with non-admin users who are in the admin and submit groups of the owning collection...
 								  - In this case the item was uploaded to the CCAFS Reports collection, and Margarita is a non-admin user who is a member of the collection's admin and submit groups, exactly as the issue described
 								  - I added a comment about our issue to [DS-4297](https://jira.lyrasis.org/browse/DS-4297)
 								- Yesterday Abenet added me to a WLE collection approver/editer steps so we can try to figure out why Niroshini is having issues adding metadata to Udana's submissions
 								  - I edited Udana's submission to CGSpace:
 								    - corrected the title
 								    - added language English
 								    - changed the link to the external item page instead of PDF
 								    - added SDGs from the external item page
 								    - added AGROVOC subjects from the external item page
 								    - added pagination (extent)
 								    - changed the license to "other" because CC-BY-NC-ND is not printed anywhere in the PDF or external item page
-												Add notes for 2021-03-05

											
										
										
											2021-03-05 19:52:36 +01:00
+								## 2021-03-05
 								- I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:
 								```console
 								$ docker-compose -f docker/docker-compose.yml down
 								$ docker volume create docker_esData_7
 								$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
-												Add notes for 2021-03-06

											
										
										
											2021-03-06 12:35:20 +01:00
+								$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
-												Add notes for 2021-03-05

											
										
										
											2021-03-05 19:52:36 +01:00
+								$ docker rm es_dummy
 								# edit docker/docker-compose.yml to switch from bind mount to volume
 								$ docker-compose -f docker/docker-compose.yml up -d
 								```
 								- The trick is that when you create a volume like "myvolume" from a `docker-compose.yml` file, Docker will create it with the name "docker_myvolume"
 								  - If you create it manually on the command line with `docker volume create myvolume` then the name is literally "myvolume"
 								- I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit
 								- Delete the `openrxv-items-temp` index to test a fresh harvesting:
 								```console
 								$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
 								```
-												Add notes for 2021-03-06

											
										
										
											2021-03-06 12:35:20 +01:00
+								## 2021-03-05
 								- Check the results of the AReS harvesting from last night:
 								```console
 								$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
 								{
 								  "count" : 101761,
 								  "_shards" : {
 								    "total" : 1,
 								    "successful" : 1,
 								    "skipped" : 0,
 								    "failed" : 0
 								  }
 								}
 								```
 								- Set the current items index to read only and make a backup:
 								```console
 								$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
 								$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05
 								```
 								- Delete the current items index and clone the temp one to it:
 								```console
 								$ curl -XDELETE 'http://localhost:9200/openrxv-items'
 								$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
 								$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
 								```
 								- Then delete the temp and backup:
 								```console
 								$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
 								{"acknowledged":true}%
 								$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'
 								```
 								- I made some pull requests to OpenRXV:
 								  - [docker/docker-compose.yml: Use docker volumes](https://github.com/ilri/OpenRXV/pull/86)
 								  - [docker/docker-compose.yml: Pin Redis to version 5](https://github.com/ilri/OpenRXV/pull/87)
 								- I deployed the latest changes from the last few days on AReS production
-												Add notes for 2021-03-07

											
										
										
											2021-03-07 14:51:12 +01:00
+								## 2021-03-07
 								- I realized there is something wrong with the Elasticsearch indexes on AReS
 								  - On a new test environment I see `openrxv-items` is correctly an alias of `openrxv-items-final`:
 								```console
 								$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
 								...
 								    "openrxv-items-final": {
 								        "aliases": {
 								            "openrxv-items": {}
 								        }
 								    },
 								```
 								- But on AReS production `openrxv-items` has somehow become an index:
 								```console
 								$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
 								...
 								    "openrxv-items": {
 								        "aliases": {}
 								    },
 								    "openrxv-items-final": {
 								        "aliases": {}
 								    },
 								    "openrxv-items-temp": {
 								        "aliases": {}
 								    },
 								```
 								- I fixed the issue on production by cloning the `openrxv-items` index to `openrxv-items-final`, deleting `openrxv-items`, and then re-creating it as an alias:
 								```console
 								$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
 								$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
 								$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
 								$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
 								$ curl -XDELETE 'http://localhost:9200/openrxv-items'
 								$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
 								```
 								- Delete backups and remove read-only mode on `openrxv-items`:
 								```console
 								$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
 								$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
 								```
 								- Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
 								  - Looking in the logs I see a few IPs making heavy usage on the REST API and XMLUI:
 								```console
 								# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -
 								```
 								- I see the usual IPs for CCAFS and ILRI importer bots, but also `143.233.242.132` which appears to be for GARDIAN:
 								```console
 								# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
 
 								# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
 
 								```
 								- They seem to make requests twice, once with the Delphi user agent that we know and already mark as a bot, and once with a "normal" user agent
 								  - Looking in Solr I see they have been using this IP for awhile, as they have 100,000 hits going back into 2020
 								  - I will add this IP to the list of bots in nginx and purge it from Solr with my `check-spider-ip-hits.sh` script
 								- I made a few changes to OpenRXV:
 								  - [Migrated away from links to use networks](https://github.com/ilri/OpenRXV/issues/89)
 								  - [Converted the backend container to use a custom image that includes `unoconv`](https://github.com/ilri/OpenRXV/issues/68) so we don't have to manually install it anymore
-												Add notes for 2021-03-08

											
										
										
											2021-03-08 19:13:40 +01:00
+								## 2021-03-08
 								- I approved the WLE item that I edited last week, and all the metadata is there: https://hdl.handle.net/10568/111810
 								  - So I'm not sure what Niroshini's issue with metadata is...
 								- Peter sent a message yesterday saying that his item finally got committed
 								  - I looked at the Munin graphs and there was a MASSIVE spike in database activity two days ago, and now database locks are back down to normal levels (from 1000+):
 								```console
 								$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
 
 								```
 								- On 2021-03-03 the PostgreSQL transactions started rising:
 								![PostgreSQL query length week](/cgspace-notes/2021/03/postgres_querylength_ALL-week.png)
 								- After that the connections and locks started going up, peaking on 2021-03-06:
 								![PostgreSQL locks week](/cgspace-notes/2021/03/postgres_locks_ALL-week.png)
 								![PostgreSQL connections week](/cgspace-notes/2021/03/postgres_connections_ALL-week.png)
 								- I sent another message to Atmire to ask if they have time to look into this
 								- CIFOR is pressuring me to upload the batch items from last week
 								  - Vika sent me a final file with some duplicates that Peter identified removed
 								  - I extracted and re-applied my basic corrections from last week in OpenRefine, then ran the items through `csv-metadata-quality` checker and uploaded them to CGSpace
 								  - In total there are 1,088 items
 								- Udana from IWMI emailed to ask about CGSpace thumbnails
 								- Udana from IWMI emailed to ask about an item uploaded recently that does not appear in AReS
 								  - [The item](https://hdl.handle.net/10568/111794) was added to the archive on 2021-03-05, and I last harvested on 2021-03-06, so this might be an issue of a missing item
 								- Abenet got a quote from Atmire to buy 125 credits for 3750€
 								- Maria at Bioversity sent some feedback about duplicate items on AReS
 								- I'm wondering if the issue of the `openrxv-items-final` index not getting cleared after a successful harvest (which results in having 200,000, then 300,000, etc items) has to do with the alias issue I fixed yesterday
 								  - I will start a fresh harvest on AReS without now to check, but first back up the current index just in case:
 								```console
 								$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
 								$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
 								# start harvesting on AReS
 								```
 								- As I saw on my local test instance, even when you cancel a harvesting, it replaces the `openrxv-items-final` index with whatever is in `openrxv-items-temp` automatically, so I assume it will do the same now
-												Add notes for 2021-03-09

											
										
										
											2021-03-09 14:48:47 +01:00
+								## 2021-03-09
 								- The harvesting on AReS finished last night and everything worked as expected, with no manual intervention
 								  - This means that [the issue](https://github.com/ilri/OpenRXV/issues/64) we were facing for a few months was due to the `openrxv-items` index being deleted and re-created as a standalone index instead of an alias of `openrxv-items-final`
 								- Talk to Moayad about OpenRXV development
 								  - We realized that the missing/duplicate items issue is probably due to the long harvesting time on the REST API, as the time between starting the harvesting on page 0 and finishing the harvesting on page 900 (in the CGSpace example), some items will have been added to the repository, which causes the pages to shift
 								  - I proposed a solution in the [GitHub issue](https://github.com/ilri/OpenRXV/issues/67), where we consult the site's XML sitemap after harvesting to see if we missed any items, and then we harvest them individually
 								- Peter sent me a list of 356 DOIs from Altmetric that don't have our Handles, so we need to Tweet them
 								  - I used my `doi-to-handle.py` script to generate a list of handles and titles for him:
 								```console
 								$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'
 								```
-												Add notes for 2021-03-10

											
										
										
											2021-03-10 14:04:01 +01:00
+								## 2021-03-10
 								- Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses `cg.isijournal` and MELSpace uses `mel.impact-factor`
 								  - I filed [an issue](https://github.com/AgriculturalSemantics/cg-core/issues/39) on the cg-core project to ask colleagues for ideas
-												Add notes for 2021-03-11

											
										
										
											2021-03-11 07:46:33 +01:00
+								- Peter said he doesn't see "Source Code" or "Software" in the [output type facet on the ILRI community](https://cgspace.cgiar.org/handle/10568/1/search-filter?field=type), but I see it on the home page, so I will try to do a full Discovery re-index:
 								```console
 								$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
 								real    318m20.485s
 								user    215m15.196s
 								sys     2m51.529s
 								```
 								- Now I see ten items for "Source Code" in the facets...
-												Add notes for 2021-03-14

											
										
										
											2021-03-14 09:50:02 +01:00
+								- Add GPL and MIT licenses to the list of licenses on CGSpace input form since we will start capturing more software and source code
 								- Added the ability to check `dcterms.license` values against the SPDX licenses in the csv-metadata-quality tool
 								  - Also, I made some other minor fixes and released [version 0.4.6](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.6) on GitHub
 								- Proof and upload twenty-seven items to CGSpace for Peter Ballantyne
 								  - Mostly Ugandan outputs for CRP Livestock and Livestock and Fish
 								## 2021-03-14
 								- Switch to linux-kvm kernel on linode20 and linode18:
 								```console
 								# apt update && apt full-upgrade
 								# apt install linux-kvm
 								# apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
 								# apt autoremove && apt autoclean
 								# reboot
 								```
 								- Deploy latest changes from `6_x-prod` branch on CGSpace
 								- Deploy latest changes from OpenRXV `master` branch on AReS
 								- Last week Peter added OpenRXV to CGSpace: https://hdl.handle.net/10568/112982
 								- Back up the current `openrxv-items-final` index on AReS to start a new harvest:
 								```console
 								$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
 								$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
 								$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
 								```
-												Add notes for 2021-03-10

											
										
										
											2021-03-10 14:04:01 +01:00
-												Add notes for 2021-03-14

											
										
										
											2021-03-14 20:34:07 +01:00
+								- After the harvesting finished it seems the indexes got messed up again, as `openrxv-items` is an alias of `openrxv-items-temp` instead of `openrxv-items-final`:
 								```console
 								$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
 								...
 								    "openrxv-items-final": {
 								        "aliases": {}
 								    },
 								    "openrxv-items-temp": {
 								        "aliases": {
 								            "openrxv-items": {}
 								        }
 								    },
 								```
 								- Anyways, the number of items in `openrxv-items` seems OK and the AReS Explorer UI is working fine
 								  - I will have to manually fix the indexes before the next harvesting
 								- Publish the web version of the DSpace CSV Metadata Quality checker tool that I wrote this weekend on GitHub: https://github.com/ilri/csv-metadata-quality-web
 								  - Also, it is deployed on Heroku: https://fierce-ocean-30836.herokuapp.com/
 								  - I was running it on Google App Engine originally, but they have *way* too aggressive caching of static assets
-												Add notes for 2021-03-17

											
										
										
											2021-03-17 13:57:45 +01:00
+								## 2021-03-16
 								- Review ten items for Livestock and Fish and Dryland Systems from Peter
 								  - I told him to try the new web-based CSV Metadata Qualiter checker and he thought it was cool
 								  - I found one exact duplicate item and it gave me an idea to try to detect this in the tool
 								## 2021-03-17
 								- I added the ability to check for duplicate items to csv-metadata-quality
 								- I also made some minor optimizations in the Pandas code
 								- I [tagged version 0.4.7 of csv-metadata-quality on GitHub](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.7)
-												Add notes for 2021-03-22

											
										
										
											2021-03-23 08:34:40 +01:00
+								## 2021-03-18
 								- I added the ability to check for, and fix, "mojibake" characters in csv-metadata-quality
 								## 2021-03-21
 								- Last week Atmire asked me which browser I was using to test the duplicate checker, which I had [reported](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934) as not loading
 								  - I tried to load it in Chrome and it works... hmmm
 								- Back up the current `openrxv-items-final` index to start a fresh AReS Harvest:
 								```console
 								$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
 								$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
 								$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
 								```
 								- Then start harvesting in the AReS Explorer admin UI
 								## 2021-03-22
 								- The harvesting on AReS yesterday completed, but somehow I have twice the number of items:
 								```console
 								$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
 								{
 								  "count" : 206204,
 								  "_shards" : {
 								    "total" : 1,
 								    "successful" : 1,
 								    "skipped" : 0,
 								    "failed" : 0
 								  }
 								}
 								```
 								- Hmmm and even my backup index has a strange number of items:
 								```console
 								$ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&pretty'
 								{
 								  "count" : 844,
 								  "_shards" : {
 								    "total" : 1,
 								    "successful" : 1,
 								    "skipped" : 0,
 								    "failed" : 0
 								  }
 								}
 								```
 								- I deleted all indexes and re-created the openrxv-items alias:
 								```console
 								$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
 								$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
 								...
 								    "openrxv-items-temp": {
 								        "aliases": {}
 								    },
 								    "openrxv-items-final": {
 								        "aliases": {
 								            "openrxv-items": {}
 								        }
 								    }
 								```
 								- Then I started a new harvesting
 								- I switched the Node.js in the [Ansible infrastructure scripts](https://github.com/ilri/rmg-ansible-public) to v12 since v10 will cease to be supported soon
 								  - I re-deployed DSpace Test (linode26) with Node.js 12 and restarted the server
 								- The AReS harvest finally finished, with 1047 pages of items, but the `openrxv-items-final` index is empty and the `openrxv-items-temp` index has a 103,000 items:
 								```console
 								$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
 								{
 								  "count" : 103162,
 								  "_shards" : {
 								    "total" : 1,
 								    "successful" : 1,
 								    "skipped" : 0,
 								    "failed" : 0
 								  }
 								}
 								```
 								- I tried to clone the temp index to the final, but got an error:
 								```console
 								$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
 								{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"}],"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"},"status":400}%
 								```
 								- I looked in the Docker logs for Elasticsearch and saw a few memory errors:
 								```console
 								java.lang.OutOfMemoryError: Java heap space
 								```
 								- According to `/usr/share/elasticsearch/config/jvm.options` in the Elasticsearch container the default JVM heap is 1g
 								  - I see the running Java process has `-Xms 1g -Xmx 1g` in its process invocation so I guess that it must be indeed using 1g
 								  - We can [change the heap size with the ES_JAVA_OPTS environment variable](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html)
 								  - Or perhaps better, we should [use a jvm.options.d file](https://www.elastic.co/guide/en/elasticsearch/reference/master/jvm-options.html) because if you use the environment variable it overrides all other JVM options from the default `jvm.options`
 								  - I tried to set memory to 1536m by binding an options file and restarting the container, but it didn't seem to work
 								  - Nevertheless, after restarting I see 103,000 items in the Explorer...
 								  - But the indexes are still kinda messed up... the `openrxv-items` index is an alias of the wrong index!
 								```console
 								    "openrxv-items-final": {
 								        "aliases": {}
 								    },
 								    "openrxv-items-temp": {
 								        "aliases": {
 								            "openrxv-items": {}
 								        }
 								    },
 								```
 								## 2021-03-23
 								- For reference you can also get the Elasticsearch JVM stats from the API:
 								```console
 								$ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool
 								```
 								- I re-deployed AReS with 1.5GB of heap using the `ES_JAVA_OPTS` environment variable
 								  - It turns out that this *is* the recommended way to set the heap: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html
 								- Then I fixed the aliases to make sure `openrxv-items` was an alias of `openrxv-items-final`, similar to how I did a few weeks ago
 								- I re-created the temp index:
 								```console
 								$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
 								```
-												Add notes for 2021-03-28

											
										
										
											2021-03-28 14:47:19 +02:00
+								## 2021-03-24
 								- Atmire responded to the [ticket about the Duplicate Checker](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=934)
 								  - He says it works for him in Firefox, so I checked and it seems to have been an issue with my LocalCDN addon
 								- I re-deployed DSpace Test (linode26) from the latest CGSpace (linode18) data
 								  - I want to try to finish up processing the duplicates in Solr that [Atmire advised on last month](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839)
 								  - The current statistics core is 57861236 kilobytes:
 								```console
 								# du -s /home/dspacetest.cgiar.org/solr/statistics
 								57861236        /home/dspacetest.cgiar.org/solr/statistics
 								```
 								- I applied their changes to `config/spring/api/atmire-cua-update.xml` and started the duplicate processor:
 								```console
 								$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
 								$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 1000 -c statistics -t 12
 								```
 								- The default number of records per query is 10,000, which caused memory issues, so I will try with 1000 (Atmire used 100, but that seems too low!)
 								- Hah, I still got a memory error after only a few minutes:
 								```console
 								...
 								Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s
 								Exception: GC overhead limit exceeded
 								java.lang.OutOfMemoryError: GC overhead limit exceeded
 								```
 								- I guess we really do have to use `-r 100`
 								- Now the thing runs for a few minutes and "finishes":
 								```console
 								$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
 								Loading @mire database changes for module MQM
 								Changes have been processed
 								*************************
 								* Update Script Started *
 								*************************
 								Run 1
 								Start updating Solr Storage Reports | Wed Mar 24 14:42:17 CET 2021
 								Deleting old storage docs from Solr... | Wed Mar 24 14:42:17 CET 2021
 								Done. | Wed Mar 24 14:42:17 CET 2021
 								Processing storage reports for type: eperson | Wed Mar 24 14:42:17 CET 2021
 								Done. | Wed Mar 24 14:42:41 CET 2021
 								Processing storage reports for type: group | Wed Mar 24 14:42:41 CET 2021
 								Done. | Wed Mar 24 14:45:46 CET 2021
 								Processing storage reports for type: collection | Wed Mar 24 14:45:46 CET 2021
 								Done. | Wed Mar 24 14:45:54 CET 2021
 								Processing storage reports for type: community | Wed Mar 24 14:45:54 CET 2021
 								Done. | Wed Mar 24 14:45:58 CET 2021
 								Committing to Solr... | Wed Mar 24 14:45:58 CET 2021
 								Done. | Wed Mar 24 14:45:59 CET 2021
 								Successfully finished updating Solr Storage Reports | Wed Mar 24 14:45:59 CET 2021
 								Run 1 —   2% — 100/4,824 docs — 3m 47s — 3m 47s
 								Run 1 —   4% — 200/4,824 docs — 2s — 3m 50s
 								Run 1 —   6% — 300/4,824 docs — 2s — 3m 53s
 								Run 1 —   8% — 400/4,824 docs — 2s — 3m 55s
 								Run 1 —  10% — 500/4,824 docs — 2s — 3m 58s
 								Run 1 —  12% — 600/4,824 docs — 2s — 4m 1s
 								Run 1 —  15% — 700/4,824 docs — 2s — 4m 3s
 								Run 1 —  17% — 800/4,824 docs — 2s — 4m 6s
 								Run 1 —  19% — 900/4,824 docs — 2s — 4m 9s
 								Run 1 —  21% — 1,000/4,824 docs — 2s — 4m 11s
 								Run 1 —  23% — 1,100/4,824 docs — 2s — 4m 14s
 								Run 1 —  25% — 1,200/4,824 docs — 2s — 4m 16s
 								Run 1 —  27% — 1,300/4,824 docs — 2s — 4m 19s
 								Run 1 —  29% — 1,400/4,824 docs — 2s — 4m 22s
 								Run 1 —  31% — 1,500/4,824 docs — 2s — 4m 24s
 								Run 1 —  33% — 1,600/4,824 docs — 2s — 4m 27s
 								Run 1 —  35% — 1,700/4,824 docs — 2s — 4m 29s
 								Run 1 —  37% — 1,800/4,824 docs — 2s — 4m 32s
 								Run 1 —  39% — 1,900/4,824 docs — 2s — 4m 35s
 								Run 1 —  41% — 2,000/4,824 docs — 2s — 4m 37s
 								Run 1 —  44% — 2,100/4,824 docs — 2s — 4m 40s
 								Run 1 —  46% — 2,200/4,824 docs — 2s — 4m 42s
 								Run 1 —  48% — 2,300/4,824 docs — 2s — 4m 45s
 								Run 1 —  50% — 2,400/4,824 docs — 2s — 4m 48s
 								Run 1 —  52% — 2,500/4,824 docs — 2s — 4m 50s
 								Run 1 —  54% — 2,600/4,824 docs — 2s — 4m 53s
 								Run 1 —  56% — 2,700/4,824 docs — 2s — 4m 55s
 								Run 1 —  58% — 2,800/4,824 docs — 2s — 4m 58s
 								Run 1 —  60% — 2,900/4,824 docs — 2s — 5m 1s
 								Run 1 —  62% — 3,000/4,824 docs — 2s — 5m 3s
 								Run 1 —  64% — 3,100/4,824 docs — 2s — 5m 6s
 								Run 1 —  66% — 3,200/4,824 docs — 3s — 5m 9s
 								Run 1 —  68% — 3,300/4,824 docs — 2s — 5m 12s
 								Run 1 —  70% — 3,400/4,824 docs — 2s — 5m 14s
 								Run 1 —  73% — 3,500/4,824 docs — 2s — 5m 17s
 								Run 1 —  75% — 3,600/4,824 docs — 2s — 5m 20s
 								Run 1 —  77% — 3,700/4,824 docs — 2s — 5m 22s
 								Run 1 —  79% — 3,800/4,824 docs — 2s — 5m 25s
 								Run 1 —  81% — 3,900/4,824 docs — 2s — 5m 27s
 								Run 1 —  83% — 4,000/4,824 docs — 2s — 5m 30s
 								Run 1 —  85% — 4,100/4,824 docs — 2s — 5m 33s
 								Run 1 —  87% — 4,200/4,824 docs — 2s — 5m 35s
 								Run 1 —  89% — 4,300/4,824 docs — 2s — 5m 38s
 								Run 1 —  91% — 4,400/4,824 docs — 2s — 5m 41s
 								Run 1 —  93% — 4,500/4,824 docs — 2s — 5m 43s
 								Run 1 —  95% — 4,600/4,824 docs — 2s — 5m 46s
 								Run 1 —  97% — 4,700/4,824 docs — 2s — 5m 49s
 								Run 1 — 100% — 4,800/4,824 docs — 2s — 5m 51s
 								Run 1 — 100% — 4,824/4,824 docs — 2s — 5m 53s
 								Run 1 took 5m 53s
 								**************************
 								* Update Script Finished *
 								**************************
 								```
 								- If I run it again it finds the same 4,824 docs and processes them...
 								  - I asked Atmire for feedback on this: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839
 								## 2021-03-25
 								- Niroshini from IWMI is still having problems adding metadata during the edit step of the workflow on CGSpace
 								  - I told her to try to register using a private email account and we'll add her to the WLE group so she can try that way
 								## 2021-03-28
 								- Make a backup of the `openrxv-items-final` index on AReS Explorer and start a new harvest
-												Add notes

											
										
										
											2021-03-04 21:46:05 +01:00
+								<!-- vim: set sw=2 ts=2: -->