cgspace-notes/2021-03.md at a7b90e58ab7cdbb05de20ba6f32dc9eff892f06a

alanorth/cgspace-notes

Fork 0

mirror of https://github.com/alanorth/cgspace-notes.git synced 2024-11-16 11:57:03 +01:00

Alan Orth 5daf2f8c21

Add notes for 2021-04-13

2021-04-13 21:13:08 +03:00

33 KiB

Raw Blame History

title

date

author

2021-03-01

Discuss some OpenRXV issues with Abdullah from CodeObia
- He's trying to work on the DSpace 6+ metadata schema autoimport using the DSpace 6+ REST API
- Also, we found some issues building and running OpenRXV currently due to ecosystem shift in the Node.js dependencies

2021-03-02

I fixed three build and runtime issues in OpenRXV:
- fix highcharts-angular and ngx-tour-core build
- frontend/package.json: Pin @types/ramda at 0.27.34
Then I merged a few fixes that Abdullah had worked on last week

2021-03-03

I fixed another frontend build warning on OpenRXV
Then I updated the frontend container to use Node.js 12 and Ubuntu 20.04
Also, I added a GitHub Actions workflow to build the frontend
I did some testing of Abdullah's patch for the values mapping search on OpenRXV
- It still doesn't work with multi-word values, so I recorded a video with wf-recorder and uploaded it to the issue for him to investigate

2021-03-04

Peter is having issues with the workflow since yesterday
- I looked at the Munin stats and see a high number of database locks since yesterday

I looked at the number of connections in PostgreSQL and it's definitely high again:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1020

I reported it to Atmire to take a look, on the same issue we had been tracking this before
Abenet asked me to add a new ORCID for ILRI staff member Zoe Campbell
I added it to the controlled vocabulary and then tagged her existing items on CGSpace using my add-orcid-identifier.py script:

$ cat 2021-03-04-add-zoe-campbell-orcid.csv 
dc.contributor.author,cg.creator.identifier
"Campbell, Zoë","Zoe Campbell: 0000-0002-4759-9976"
"Campbell, Zoe A.","Zoe Campbell: 0000-0002-4759-9976"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-03-04-add-zoe-campbell-orcid.csv -db dspace -u dspace -p 'fuuu'

I still need to do cleanup on the journal articles metadata
- Peter sent me some cleanups but I can't use them in the search/replace format he gave
- I think it's better to export the metadata values with IDs and import cleaned up ones as CSV

localhost/dspace63= > \COPY (SELECT dspace_object_id AS id, text_value as "cg.journal" FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=251) to /tmp/2021-02-24-journals.csv WITH CSV HEADER;
COPY 32087

I used OpenRefine to remove all journal values that didn't have one of these values: ; ( )
- Then I cloned the cg.journal field to cg.volume and cg.issue
- I used some GREL expressions like these to extract the journal name, volume, and issue:

value.partition(';')[0].trim() # to get journal names
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^(\d+)\(\d+\)/,"$1") # to get journal volumes
value.partition(/[0-9]+\([0-9]+\)/)[1].replace(/^\d+\((\d+)\)/,"$1") # to get journal issues

Then I uploaded the changes to CGSpace using dspace metadata-import
Margarita from CCAFS was asking about an error deleting some items that were showing up in Google and should have been private
- The error was "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:bd157345-448e ..."
- I searched the DSpace issue tracker and found several issues reporting this:
- The issue is apparently with non-admin users who are in the admin and submit groups of the owning collection...
- In this case the item was uploaded to the CCAFS Reports collection, and Margarita is a non-admin user who is a member of the collection's admin and submit groups, exactly as the issue described
- I added a comment about our issue to DS-4297
Yesterday Abenet added me to a WLE collection approver/editer steps so we can try to figure out why Niroshini is having issues adding metadata to Udana's submissions
- I edited Udana's submission to CGSpace:
  - corrected the title
  - added language English
  - changed the link to the external item page instead of PDF
  - added SDGs from the external item page
  - added AGROVOC subjects from the external item page
  - added pagination (extent)
  - changed the license to "other" because CC-BY-NC-ND is not printed anywhere in the PDF or external item page

2021-03-05

I migrated the Docker bind mount for the AReS Elasticsearch container to a Docker volume:

$ docker-compose -f docker/docker-compose.yml down
$ docker volume create docker_esData_7
$ docker container create --name es_dummy -v docker_esData_7:/usr/share/elasticsearch/data:rw elasticsearch:7.6.2
$ docker cp docker/esData_7/nodes es_dummy:/usr/share/elasticsearch/data
$ docker rm es_dummy
# edit docker/docker-compose.yml to switch from bind mount to volume
$ docker-compose -f docker/docker-compose.yml up -d

The trick is that when you create a volume like "myvolume" from a docker-compose.yml file, Docker will create it with the name "docker_myvolume"
- If you create it manually on the command line with docker volume create myvolume then the name is literally "myvolume"
I still need to make the changes to git master and add these notes to the pull request so Moayad and others can benefit
Delete the openrxv-items-temp index to test a fresh harvesting:

$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'

2021-03-05

Check the results of the AReS harvesting from last night:

$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
  "count" : 101761,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

Set the current items index to read only and make a backup:

$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-05

Delete the current items index and clone the temp one to it:

$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items

Then delete the temp and backup:

$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{"acknowledged":true}%
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-05'

I made some pull requests to OpenRXV:
- docker/docker-compose.yml: Use docker volumes
- docker/docker-compose.yml: Pin Redis to version 5
I deployed the latest changes from the last few days on AReS production

2021-03-07

I realized there is something wrong with the Elasticsearch indexes on AReS
- On a new test environment I see openrxv-items is correctly an alias of openrxv-items-final:

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    },

But on AReS production openrxv-items has somehow become a concrete index:

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items": {
        "aliases": {}
    },
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {}
    },

I fixed the issue on production by cloning the openrxv-items index to openrxv-items-final, deleting openrxv-items, and then re-creating it as an alias:

$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-03-07
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-final
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'

Delete backups and remove read-only mode on openrxv-items:

$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-03-07'
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'

Linode sent alerts about the CPU usage on CGSpace yesterday and the day before
- Looking in the logs I see a few IPs making heavy usage on the REST API and XMLUI:

# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep -E '0[56]/Mar/2021' | goaccess --log-format=COMBINED -

I see the usual IPs for CCAFS and ILRI importer bots, but also 143.233.242.132 which appears to be for GARDIAN:

# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c Delphi
6237
# zgrep '143.233.242.132' /var/log/nginx/access.log.1 | grep -c -v Delphi
6418

They seem to make requests twice, once with the Delphi user agent that we know and already mark as a bot, and once with a "normal" user agent
- Looking in Solr I see they have been using this IP for awhile, as they have 100,000 hits going back into 2020
- I will add this IP to the list of bots in nginx and purge it from Solr with my check-spider-ip-hits.sh script
I made a few changes to OpenRXV:
- Migrated away from links to use networks
- Converted the backend container to use a custom image that includes unoconv so we don't have to manually install it anymore

2021-03-08

I approved the WLE item that I edited last week, and all the metadata is there: https://hdl.handle.net/10568/111810
- So I'm not sure what Niroshini's issue with metadata is...
Peter sent a message yesterday saying that his item finally got committed
- I looked at the Munin graphs and there was a MASSIVE spike in database activity two days ago, and now database locks are back down to normal levels (from 1000+):

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13

On 2021-03-03 the PostgreSQL transactions started rising:

After that the connections and locks started going up, peaking on 2021-03-06:

I sent another message to Atmire to ask if they have time to look into this
CIFOR is pressuring me to upload the batch items from last week
- Vika sent me a final file with some duplicates that Peter identified removed
- I extracted and re-applied my basic corrections from last week in OpenRefine, then ran the items through csv-metadata-quality checker and uploaded them to CGSpace
- In total there are 1,088 items
Udana from IWMI emailed to ask about CGSpace thumbnails
Udana from IWMI emailed to ask about an item uploaded recently that does not appear in AReS
- The item was added to the archive on 2021-03-05, and I last harvested on 2021-03-06, so this might be an issue of a missing item
Abenet got a quote from Atmire to buy 125 credits for 3750€
Maria at Bioversity sent some feedback about duplicate items on AReS
I'm wondering if the issue of the openrxv-items-final index not getting cleared after a successful harvest (which results in having 200,000, then 300,000, etc items) has to do with the alias issue I fixed yesterday
- I will start a fresh harvest on AReS without now to check, but first back up the current index just in case:

$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-08
# start harvesting on AReS

As I saw on my local test instance, even when you cancel a harvesting, it replaces the openrxv-items-final index with whatever is in openrxv-items-temp automatically, so I assume it will do the same now

2021-03-09

The harvesting on AReS finished last night and everything worked as expected, with no manual intervention
- This means that the issue we were facing for a few months was due to the openrxv-items index being deleted and re-created as a standalone index instead of an alias of openrxv-items-final
Talk to Moayad about OpenRXV development
- We realized that the missing/duplicate items issue is probably due to the long harvesting time on the REST API, as the time between starting the harvesting on page 0 and finishing the harvesting on page 900 (in the CGSpace example), some items will have been added to the repository, which causes the pages to shift
- I proposed a solution in the GitHub issue, where we consult the site's XML sitemap after harvesting to see if we missed any items, and then we harvest them individually
Peter sent me a list of 356 DOIs from Altmetric that don't have our Handles, so we need to Tweet them
- I used my doi-to-handle.py script to generate a list of handles and titles for him:

$ ./ilri/doi-to-handle.py -i /tmp/dois.txt -o /tmp/handles.txt -db dspace -u dspace -p 'fuuu'

2021-03-10

Colleagues from ICARDA asked about how we should handle ISI journals in CG Core, as CGSpace uses cg.isijournal and MELSpace uses mel.impact-factor
- I filed an issue on the cg-core project to ask colleagues for ideas
Peter said he doesn't see "Source Code" or "Software" in the output type facet on the ILRI community, but I see it on the home page, so I will try to do a full Discovery re-index:

$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    318m20.485s
user    215m15.196s
sys     2m51.529s

Now I see ten items for "Source Code" in the facets...
Add GPL and MIT licenses to the list of licenses on CGSpace input form since we will start capturing more software and source code
Added the ability to check dcterms.license values against the SPDX licenses in the csv-metadata-quality tool
- Also, I made some other minor fixes and released version 0.4.6 on GitHub
Proof and upload twenty-seven items to CGSpace for Peter Ballantyne
- Mostly Ugandan outputs for CRP Livestock and Livestock and Fish

2021-03-14

Switch to linux-kvm kernel on linode20 and linode18:

# apt update && apt full-upgrade
# apt install linux-kvm
# apt remove linux-generic linux-image-generic linux-headers-generic linux-firmware
# apt autoremove && apt autoclean
# reboot

Deploy latest changes from 6_x-prod branch on CGSpace
Deploy latest changes from OpenRXV master branch on AReS
Last week Peter added OpenRXV to CGSpace: https://hdl.handle.net/10568/112982
Back up the current openrxv-items-final index on AReS to start a new harvest:

$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-14
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'

After the harvesting finished it seems the indexes got messed up again, as openrxv-items is an alias of openrxv-items-temp instead of openrxv-items-final:

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },

Anyways, the number of items in openrxv-items seems OK and the AReS Explorer UI is working fine
- I will have to manually fix the indexes before the next harvesting
Publish the web version of the DSpace CSV Metadata Quality checker tool that I wrote this weekend on GitHub: https://github.com/ilri/csv-metadata-quality-web
- Also, it is deployed on Heroku: https://fierce-ocean-30836.herokuapp.com/
- I was running it on Google App Engine originally, but they have way too aggressive caching of static assets

2021-03-16

Review ten items for Livestock and Fish and Dryland Systems from Peter
- I told him to try the new web-based CSV Metadata Qualiter checker and he thought it was cool
- I found one exact duplicate item and it gave me an idea to try to detect this in the tool

2021-03-17

I added the ability to check for duplicate items to csv-metadata-quality
I also made some minor optimizations in the Pandas code
I tagged version 0.4.7 of csv-metadata-quality on GitHub

2021-03-18

I added the ability to check for, and fix, "mojibake" characters in csv-metadata-quality

2021-03-21

Last week Atmire asked me which browser I was using to test the duplicate checker, which I had reported as not loading
- I tried to load it in Chrome and it works... hmmm
Back up the current openrxv-items-final index to start a fresh AReS Harvest:

$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-2021-03-21
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'

Then start harvesting in the AReS Explorer admin UI

2021-03-22

The harvesting on AReS yesterday completed, but somehow I have twice the number of items:

$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
  "count" : 206204,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

Hmmm and even my backup index has a strange number of items:

$ curl -s 'http://localhost:9200/openrxv-items-final-2021-03-21/_count?q=*&pretty'
{
  "count" : 844,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

I deleted all indexes and re-created the openrxv-items alias:

$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
...
    "openrxv-items-temp": {
        "aliases": {}
    },
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    }

Then I started a new harvesting
I switched the Node.js in the Ansible infrastructure scripts to v12 since v10 will cease to be supported soon
- I re-deployed DSpace Test (linode26) with Node.js 12 and restarted the server
The AReS harvest finally finished, with 1047 pages of items, but the openrxv-items-final index is empty and the openrxv-items-temp index has a 103,000 items:

$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
  "count" : 103162,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

I tried to clone the temp index to the final, but got an error:

$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-final
{"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"}],"type":"resource_already_exists_exception","reason":"index [openrxv-items-final/LmxH-rQsTRmTyWex2d8jxw] already exists","index_uuid":"LmxH-rQsTRmTyWex2d8jxw","index":"openrxv-items-final"},"status":400}%

I looked in the Docker logs for Elasticsearch and saw a few memory errors:

java.lang.OutOfMemoryError: Java heap space

According to /usr/share/elasticsearch/config/jvm.options in the Elasticsearch container the default JVM heap is 1g
- I see the running Java process has -Xms 1g -Xmx 1g in its process invocation so I guess that it must be indeed using 1g
- We can change the heap size with the ES_JAVA_OPTS environment variable
- Or perhaps better, we should use a jvm.options.d file because if you use the environment variable it overrides all other JVM options from the default jvm.options
- I tried to set memory to 1536m by binding an options file and restarting the container, but it didn't seem to work
- Nevertheless, after restarting I see 103,000 items in the Explorer...
- But the indexes are still kinda messed up... the openrxv-items index is an alias of the wrong index!

    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },

2021-03-23

For reference you can also get the Elasticsearch JVM stats from the API:

$ curl -s 'http://localhost:9200/_nodes/jvm?human' | python -m json.tool

I re-deployed AReS with 1.5GB of heap using the ES_JAVA_OPTS environment variable
- It turns out that this is the recommended way to set the heap: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/jvm-options.html
Then I fixed the aliases to make sure openrxv-items was an alias of openrxv-items-final, similar to how I did a few weeks ago
I re-created the temp index:

$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'

2021-03-24

Atmire responded to the ticket about the Duplicate Checker
- He says it works for him in Firefox, so I checked and it seems to have been an issue with my LocalCDN addon
I re-deployed DSpace Test (linode26) from the latest CGSpace (linode18) data
- I want to try to finish up processing the duplicates in Solr that Atmire advised on last month
- The current statistics core is 57861236 kilobytes:

# du -s /home/dspacetest.cgiar.org/solr/statistics
57861236        /home/dspacetest.cgiar.org/solr/statistics

I applied their changes to config/spring/api/atmire-cua-update.xml and started the duplicate processor:

$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx4096m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 1000 -c statistics -t 12

The default number of records per query is 10,000, which caused memory issues, so I will try with 1000 (Atmire used 100, but that seems too low!)
Hah, I still got a memory error after only a few minutes:

...
Run 1 —  80% — 5,000/6,263 docs — 25s — 6m 31s                                      
Exception: GC overhead limit exceeded                                                                          
java.lang.OutOfMemoryError: GC overhead limit exceeded

I guess we really do have to use -r 100
Now the thing runs for a few minutes and "finishes":

$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12
Loading @mire database changes for module MQM
Changes have been processed


*************************
* Update Script Started *
*************************

Run 1
Start updating Solr Storage Reports | Wed Mar 24 14:42:17 CET 2021
Deleting old storage docs from Solr... | Wed Mar 24 14:42:17 CET 2021
Done. | Wed Mar 24 14:42:17 CET 2021
Processing storage reports for type: eperson | Wed Mar 24 14:42:17 CET 2021
Done. | Wed Mar 24 14:42:41 CET 2021
Processing storage reports for type: group | Wed Mar 24 14:42:41 CET 2021
Done. | Wed Mar 24 14:45:46 CET 2021
Processing storage reports for type: collection | Wed Mar 24 14:45:46 CET 2021
Done. | Wed Mar 24 14:45:54 CET 2021
Processing storage reports for type: community | Wed Mar 24 14:45:54 CET 2021
Done. | Wed Mar 24 14:45:58 CET 2021
Committing to Solr... | Wed Mar 24 14:45:58 CET 2021
Done. | Wed Mar 24 14:45:59 CET 2021
Successfully finished updating Solr Storage Reports | Wed Mar 24 14:45:59 CET 2021
Run 1 —   2% — 100/4,824 docs — 3m 47s — 3m 47s
Run 1 —   4% — 200/4,824 docs — 2s — 3m 50s
Run 1 —   6% — 300/4,824 docs — 2s — 3m 53s
Run 1 —   8% — 400/4,824 docs — 2s — 3m 55s
Run 1 —  10% — 500/4,824 docs — 2s — 3m 58s
Run 1 —  12% — 600/4,824 docs — 2s — 4m 1s
Run 1 —  15% — 700/4,824 docs — 2s — 4m 3s
Run 1 —  17% — 800/4,824 docs — 2s — 4m 6s
Run 1 —  19% — 900/4,824 docs — 2s — 4m 9s
Run 1 —  21% — 1,000/4,824 docs — 2s — 4m 11s
Run 1 —  23% — 1,100/4,824 docs — 2s — 4m 14s
Run 1 —  25% — 1,200/4,824 docs — 2s — 4m 16s
Run 1 —  27% — 1,300/4,824 docs — 2s — 4m 19s
Run 1 —  29% — 1,400/4,824 docs — 2s — 4m 22s
Run 1 —  31% — 1,500/4,824 docs — 2s — 4m 24s
Run 1 —  33% — 1,600/4,824 docs — 2s — 4m 27s
Run 1 —  35% — 1,700/4,824 docs — 2s — 4m 29s
Run 1 —  37% — 1,800/4,824 docs — 2s — 4m 32s
Run 1 —  39% — 1,900/4,824 docs — 2s — 4m 35s
Run 1 —  41% — 2,000/4,824 docs — 2s — 4m 37s
Run 1 —  44% — 2,100/4,824 docs — 2s — 4m 40s
Run 1 —  46% — 2,200/4,824 docs — 2s — 4m 42s
Run 1 —  48% — 2,300/4,824 docs — 2s — 4m 45s
Run 1 —  50% — 2,400/4,824 docs — 2s — 4m 48s
Run 1 —  52% — 2,500/4,824 docs — 2s — 4m 50s
Run 1 —  54% — 2,600/4,824 docs — 2s — 4m 53s
Run 1 —  56% — 2,700/4,824 docs — 2s — 4m 55s
Run 1 —  58% — 2,800/4,824 docs — 2s — 4m 58s
Run 1 —  60% — 2,900/4,824 docs — 2s — 5m 1s
Run 1 —  62% — 3,000/4,824 docs — 2s — 5m 3s
Run 1 —  64% — 3,100/4,824 docs — 2s — 5m 6s
Run 1 —  66% — 3,200/4,824 docs — 3s — 5m 9s
Run 1 —  68% — 3,300/4,824 docs — 2s — 5m 12s
Run 1 —  70% — 3,400/4,824 docs — 2s — 5m 14s
Run 1 —  73% — 3,500/4,824 docs — 2s — 5m 17s
Run 1 —  75% — 3,600/4,824 docs — 2s — 5m 20s
Run 1 —  77% — 3,700/4,824 docs — 2s — 5m 22s
Run 1 —  79% — 3,800/4,824 docs — 2s — 5m 25s
Run 1 —  81% — 3,900/4,824 docs — 2s — 5m 27s
Run 1 —  83% — 4,000/4,824 docs — 2s — 5m 30s
Run 1 —  85% — 4,100/4,824 docs — 2s — 5m 33s
Run 1 —  87% — 4,200/4,824 docs — 2s — 5m 35s
Run 1 —  89% — 4,300/4,824 docs — 2s — 5m 38s
Run 1 —  91% — 4,400/4,824 docs — 2s — 5m 41s
Run 1 —  93% — 4,500/4,824 docs — 2s — 5m 43s
Run 1 —  95% — 4,600/4,824 docs — 2s — 5m 46s
Run 1 —  97% — 4,700/4,824 docs — 2s — 5m 49s
Run 1 — 100% — 4,800/4,824 docs — 2s — 5m 51s
Run 1 — 100% — 4,824/4,824 docs — 2s — 5m 53s
Run 1 took 5m 53s


**************************
* Update Script Finished *
**************************

If I run it again it finds the same 4,824 docs and processes them...
- I asked Atmire for feedback on this: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=839

2021-03-25

Niroshini from IWMI is still having problems adding metadata during the edit step of the workflow on CGSpace
- I told her to try to register using a private email account and we'll add her to the WLE group so she can try that way

2021-03-28

Make a backup of the openrxv-items-final index on AReS Explorer and start a new harvest

2021-03-29

The AReS harvesting that I started yesterday finished successfully and all indexes look OK:
- openrxv-items is an alias of openrxv-items-final and has a correct number of items
Last week Bosede from IITA said she was trying to move an item from one collection to another and the system was "rolling" and never finished
- I looked in Munin and I don't see anything particularly wrong that day, so I told her to try again
Marianne Gadeberg asked about mapping an item last week
- Searched for the item's handle, the title, the title in quotes, the UUID, with pluses instead of spaces, etc in the item mapper... but I can never find it in the results
- I see someone has reported this issue on Jira in DSpace 5.x's XMLUI item mapper: https://jira.lyrasis.org/browse/DS-2761
- The Solr log shows that my query (with and without quotes, etc) has 143 results:

2021-03-29 08:55:40,073 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={q=Gender+mainstreaming+in+local+potato+seed+system+in+Georgia&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=-location:l5308ea39-7c65-401b-890b-c2b93dad649a&wt=javabin&version=2} hits=143 status=0 QTime=0

But the item mapper only displays ten items, with no pagination
- There is no way to search by handle or ID
- I mapped the item manually using a CSV

2021-03-30

I realized I never finished deleting all the old fields after our CG Core migration a few months ago
- I found a few occurrences of old metadata so I had to move them where possible and delete them where not
I updated the [CG Core v2 migration page]({{< relref "cgspace-cgcorev2-migration.md" >}})
Marianne Gadeberg wrote to ask why the item she wanted to map a few days ago still doesn't appear in the mapped collection
- I looked on the item page itself and it lists the collection, but doesn't appear in the collection list
- I tried to forceably reindex the collection and the item, but it didn't seem to work
- Now I will try a complete Discovery re-index

2021-03-31

The Discovery re-index finished, but the CIP item still does not appear in the GENDER Platform grants collection
- The item page itself DOES list the grants collection! WTF
- I sent a message to the dspace-tech mailing list to see if someone can comment
- I even tried unmapping and re-mapping, but it doesn't change anything: the item still doesn't appear in the collection, but I can see that it is mapped
I signed up for a SHERPA API key so I can try to write something to get journal names from ISSN
- This code seems to get a journal title, though I only tried it with a few ISSNs:

import requests

query_params = {'item-type': 'publication', 'format': 'Json', 'limit': 10, 'offset': 0, 'api-key': 'blahhhahahah', 'filter': '[["issn","equals","0011-183X"]]'}
r = requests.get('https://v2.sherpa.ac.uk/cgi/retrieve')
if r.status_code and len(r.json()['items']) > 0:
    r.json()['items'][0]['title'][0]['title']

I exported a list of all our ISSNs from CGSpace:

localhost/dspace63= > \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=253) to /tmp/2021-03-31-issns.csv;
COPY 3081

I wrote a script to check the ISSNs against Crossref's API: crossref-issn-lookup.py
- I suspect Crossref might have better data actually...

33 KiB Raw Blame History

2021-03-01

2021-03-02

2021-03-03

2021-03-04

2021-03-05

2021-03-05

2021-03-07

2021-03-08

2021-03-09

2021-03-10

2021-03-14

2021-03-16

2021-03-17

2021-03-18

2021-03-21

2021-03-22

2021-03-23

2021-03-24

2021-03-25

2021-03-28

2021-03-29

2021-03-30

2021-03-31

33 KiB

Raw Blame History