cgspace-notes/2021-03.md at 4f60b2aff3a9512c22498e2638b6c94a2423ddc1

alanorth/cgspace-notes

Fork 0

mirror of https://github.com/alanorth/cgspace-notes.git synced 2024-11-17 20:27:05 +01:00

Alan Orth 4f60b2aff3

Add notes for 2021-03-09

2021-03-09 15:48:47 +02:00

15 KiB

Raw Blame History

title

date

author

2021-03-01

Discuss some OpenRXV issues with Abdullah from CodeObia
- He's trying to work on the DSpace 6+ metadata schema autoimport using the DSpace 6+ REST API
- Also, we found some issues building and running OpenRXV currently due to ecosystem shift in the Node.js dependencies

2021-03-02

I fixed three build and runtime issues in OpenRXV:
- fix highcharts-angular and ngx-tour-core build
- frontend/package.json: Pin @types/ramda at 0.27.34
Then I merged a few fixes that Abdullah had worked on last week

2021-03-03

I fixed another frontend build warning on OpenRXV
Then I updated the frontend container to use Node.js 12 and Ubuntu 20.04
Also, I added a GitHub Actions workflow to build the frontend
I did some testing of Abdullah's patch for the values mapping search on OpenRXV
- It still doesn't work with multi-word values, so I recorded a video with wf-recorder and uploaded it to the issue for him to investigate

2021-03-04

Peter is having issues with the workflow since yesterday
- I looked at the Munin stats and see a high number of database locks since yesterday

I looked at the number of connections in PostgreSQL and it's definitely high again:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1020

I reported it to Atmire to take a look, on the same issue we had been tracking this before
Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell
I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my add-orcid-identifier.py script:

$ cat 2021-03-04-add-zoe-campbell-orcid.csv 
dc.contributor.author,cg.creator.identifier
"Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
"Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p 'fuuu'

I still need to do cleanup on the journal articles metadata
- Peter sent me some cleanups but I can't use them in the search/replace format he gave
- I think it's better to export the metadata values with IDs and import cleaned up ones as CSV

localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 32087

I used OpenRefine to remove all journal values that didn't have one of these values: ; ( )
- Then I cloned the cg.journal field to cg.volume and cg.issue
- I used some GREL expressions like these to extract the journal name, volume, and issue:

value.partition(';')[0].trim() # to get journal names
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues

Then I uploaded the changes to CGSpace using dspace metadata-import
Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
- The error was "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:bd157345-448e ..."
- I searched the DSpace issue tracker and found several issues reporting this:
- The issue is apparently with non-admin users who are in the admin and submit groups of the owning collection...
- In this case the item was uploaded to the CCAFS Reports collection, and Margarita is a non-admin user who is a member of the collection's admin and submit groups, exactly as the issue described
- I added a comment about our issue to DS-4297
Yesterday Abenet added me to a WLE collection approver/editer steps so we can try to figure out why Niroshini is having issues adding metadata to Udana's submissions
- I edited Udana's submission to CGSpace:
  - corrected the title
  - added language English
  - changed the link to the external item page instead of PDF
  - added SDGs from the external item page
  - added AGROVOC subjects from the external item page
  - added pagination (extent)
  - changed the license to "other" because CC-BY-NC-ND is not printed anywhere in the PDF or external item page

2021-03-05

I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:

$ docker-compose -f docker/docker-compose.yml down
$ docker volume create docker_esData_7
$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
$ docker rm es_dummy
# edit docker/docker-compose.yml to switch from bind mount to volume
$ docker-compose -f docker/docker-compose.yml up -d

The trick is that when you create a volume like "myvolume" from a docker-compose.yml file, Docker will create it with the name "docker_myvolume"
- If you create it manually on the command line with docker volume create myvolume then the name is literally "myvolume"
I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit
Delete the openrxv-items-temp index to test a fresh harvesting:

$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'

2021-03-05

Check the results of the AReS harvesting from last night:

$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
  "count" : 101761,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

Set the current items index to read only and make a backup:

$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05

Delete the current items index and clone the temp one to it:

$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items

Then delete the temp and backup:

$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{"acknowledged":true}%
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'

I made some pull requests to OpenRXV:
- docker/docker-compose.yml: Use docker volumes
- docker/docker-compose.yml: Pin Redis to version 5
I deployed the latest changes from the last few days on AReS production

2021-03-07

I realized there is something wrong with the Elasticsearch indexes on AReS
- On a new test environment I see openrxv-items is correctly an alias of openrxv-items-final:

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    },

But on AReS production openrxv-items has somehow become an index:

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items": {
        "aliases": {}
    },
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {}
    },

I fixed the issue on production by cloning the openrxv-items index to openrxv-items-final, deleting openrxv-items, and then re-creating it as an alias:

$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'

Delete backups and remove read-only mode on openrxv-items:

$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'

Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
- Looking in the logs I see a few IPs making heavy usage on the REST API and XMLUI:

# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -

I see the usual IPs for CCAFS and ILRI importer bots, but also 143.233.242.132 which appears to be for GARDIAN:

# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
6237
# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
6418

They seem to make requests twice, once with the Delphi user agent that we know and already mark as a bot, and once with a "normal" user agent
- Looking in Solr I see they have been using this IP for awhile, as they have 100,000 hits going back into 2020
- I will add this IP to the list of bots in nginx and purge it from Solr with my check-spider-ip-hits.sh script
I made a few changes to OpenRXV:
- Migrated away from links to use networks
- Converted the backend container to use a custom image that includes unoconv so we don't have to manually install it anymore

2021-03-08

I approved the WLE item that I edited last week, and all the metadata is there: https://hdl.handle.net/10568/111810
- So I'm not sure what Niroshini's issue with metadata is...
Peter sent a message yesterday saying that his item finally got committed
- I looked at the Munin graphs and there was a MASSIVE spike in database activity two days ago, and now database locks are back down to normal levels (from 1000+):

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13

On 2021-03-03 the PostgreSQL transactions started rising:

After that the connections and locks started going up, peaking on 2021-03-06:

I sent another message to Atmire to ask if they have time to look into this
CIFOR is pressuring me to upload the batch items from last week
- Vika sent me a final file with some duplicates that Peter identified removed
- I extracted and re-applied my basic corrections from last week in OpenRefine, then ran the items through csv-metadata-quality checker and uploaded them to CGSpace
- In total there are 1,088 items
Udana from IWMI emailed to ask about CGSpace thumbnails
Udana from IWMI emailed to ask about an item uploaded recently that does not appear in AReS
- The item was added to the archive on 2021-03-05, and I last harvested on 2021-03-06, so this might be an issue of a missing item
Abenet got a quote from Atmire to buy 125 credits for 3750€
Maria at Bioversity sent some feedback about duplicate items on AReS
I'm wondering if the issue of the openrxv-items-final index not getting cleared after a successful harvest (which results in having 200,000, then 300,000, etc items) has to do with the alias issue I fixed yesterday
- I will start a fresh harvest on AReS without now to check, but first back up the current index just in case:

$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
# start harvesting on AReS

As I saw on my local test instance, even when you cancel a harvesting, it replaces the openrxv-items-final index with whatever is in openrxv-items-temp automatically, so I assume it will do the same now

2021-03-09

The harvesting on AReS finished last night and everything worked as expected, with no manual intervention
- This means that the issue we were facing for a few months was due to the openrxv-items index being deleted and re-created as a standalone index instead of an alias of openrxv-items-final
Talk to Moayad about OpenRXV development
- We realized that the missing/duplicate items issue is probably due to the long harvesting time on the REST API, as the time between starting the harvesting on page 0 and finishing the harvesting on page 900 (in the CGSpace example), some items will have been added to the repository, which causes the pages to shift
- I proposed a solution in the GitHub issue, where we consult the site's XML sitemap after harvesting to see if we missed any items, and then we harvest them individually
Peter sent me a list of 356 DOIs from Altmetric that don't have our Handles, so we need to Tweet them
- I used my doi-to-handle.py script to generate a list of handles and titles for him:

$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'

15 KiB Raw Blame History

2021-03-01

2021-03-02

2021-03-03

2021-03-04

2021-03-05

2021-03-05

2021-03-07

2021-03-08

2021-03-09

15 KiB

Raw Blame History