Add notes for 2021-06-21

2021-06-21 16:24:40 +03:00
parent a6d606ca0e
commit b787c427ab
101 changed files with 310 additions and 135 deletions


@ -87,7 +87,7 @@ $ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/vol
- The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it's much faster
- I harvested 90,000+ items from DSpace Test in ~3 hours
- There seem to be some issues with the health check step though
- There seem to be some issues with the health check step though, as I see it is requesting one restricted item 600,000+ times...
## 2021-06-17
@ -116,4 +116,83 @@ $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk
3 "10568/96546"
```
## 2021-06-20
- Udana asked me to update their IWMI subjects from `farmer managed irrigation systems` to `farmer-led irrigation`
- First I extracted the IWMI community from CGSpace:
```console
$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
```
- Then I used `csvcut` to extract just the columns I needed and do the replacement into a new CSV:
```console
$ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv
```
- Then I uploaded the resulting CSV to CGSpace, updating 161 items
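
The column extraction and replacement above could also be done in one step in Python. This is only a minimal sketch using the standard `csv` module, not the method used here: the function name `replace_subjects` is hypothetical, and unlike the `sed` one-liner (which touches only the first match anywhere on each line) it replaces every occurrence, and only inside the subject columns.

```python
import csv
import io

OLD = "farmer managed irrigation systems"
NEW = "farmer-led irrigation"
# Same columns as passed to csvcut above
COLUMNS = ["id", "dcterms.subject[]", "dcterms.subject[en_US]"]


def replace_subjects(src, dst, columns=COLUMNS, old=OLD, new=NEW):
    """Copy only `columns` from src to dst, replacing `old` with `new`.

    The `id` column is never modified. Returns the number of rows changed.
    (Hypothetical helper, written for illustration only.)
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    changed = 0
    for row in reader:
        # Missing or empty cells become empty strings
        out = {name: row.get(name) or "" for name in columns}
        before = dict(out)
        for name in columns:
            if name != "id":
                out[name] = out[name].replace(old, new)
        if out != before:
            changed += 1
        writer.writerow(out)
    return changed


# Small in-memory demonstration with made-up rows
demo_in = io.StringIO(
    "id,dc.title,dcterms.subject[],dcterms.subject[en_US]\n"
    "1,A,,farmer managed irrigation systems||water\n"
    "2,B,rice,\n"
)
demo_out = io.StringIO()
print(replace_subjects(demo_in, demo_out), "row(s) changed")
print(demo_out.getvalue())
```

With real files, `src` and `dst` would be opened with `open(path, newline="")` so the `csv` module handles line endings itself.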
- Start a harvest on AReS
- I found [a bug report](https://jira.lyrasis.org/browse/DS-1977) and [a patch](https://github.com/DSpace/DSpace/pull/2584) for the issue of private items showing up in the DSpace sitemap
- The fix is super simple, I should try to apply it
## 2021-06-21
- The AReS harvesting finished, but the indexes got messed up again
- I was looking at the JSON export I made yesterday and trying to understand the situation with duplicates
- We have 90,000+ items, but only 85,000 unique:
```console
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l
90937
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort -u | wc -l
85709
```
- So those could be duplicates from the way we harvest pages, but they could also be from mappings...
- Manually inspecting the duplicates where handles appear more than once:
```console
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h
```
- Unfortunately I found no pattern:
- Some appear twice in the Elasticsearch index, but appear in only one collection
- Some appear twice in the Elasticsearch index, and appear in *two* collections
- Some appear twice in the Elasticsearch index, but appear in three collections (!)
- So really we need to just check whether a handle exists before we insert it
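
The duplicate counting above, and the "check before insert" idea, can be sketched in Python. This is a rough illustration, not the OpenRXV harvester code: it assumes the export has one JSON document per line (as the `grep` commands imply), and the names `count_handles`, `duplicates`, and `unique_by_handle` are hypothetical.

```python
import re
from collections import Counter

# Same pattern as the grep -oE expression above
HANDLE_RE = re.compile(r'"handle":"([0-9]+/[0-9A-Za-z]+)"')


def count_handles(lines, repo="CGSpace"):
    """Count how often each handle appears on lines belonging to `repo`."""
    marker = f'"repo":"{repo}"'
    counts = Counter()
    for line in lines:
        if marker in line:
            counts.update(HANDLE_RE.findall(line))
    return counts


def duplicates(counts):
    """Return only the handles that appear more than once."""
    return {handle: n for handle, n in counts.items() if n > 1}


def unique_by_handle(records):
    """Yield (handle, document) pairs, skipping handles already seen.

    This is the guard described above: check whether a handle exists
    before inserting it.
    """
    seen = set()
    for handle, document in records:
        if handle in seen:
            continue
        seen.add(handle)
        yield handle, document


# Made-up sample lines standing in for openrxv-items_data.json
sample = [
    '{"handle":"10568/1","repo":"CGSpace"}',
    '{"handle":"10568/1","repo":"CGSpace"}',
    '{"handle":"10568/2","repo":"CGSpace"}',
]
counts = count_handles(sample)
print(sum(counts.values()), "total,", len(counts), "unique")
print(duplicates(counts))
```

For the real export, `lines` would be the open file object, so the file is streamed rather than loaded into memory.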
- I tested the [pull request for DS-1977](https://github.com/DSpace/DSpace/pull/2584) that adjusts the sitemap generation code to exclude private items
- It applies cleanly and seems to work, but we don't actually have any private items
- The issue we are having with AReS hitting restricted items in the sitemap is that the items have restricted metadata, not that they are private
- I tested the [pull request for DS-4065](https://github.com/DSpace/DSpace/pull/2275), which addresses the REST API's `/rest/items` endpoint not being aware of private items and returning an incorrect number of items
- This is most easily seen by setting a low limit in `/rest/items`, making one of the items private, and requesting items again with the same limit
- I confirmed the issue on the current DSpace 6 Demo:
```console
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
5
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
"10673/4"
"10673/3"
"10673/6"
"10673/5"
"10673/7"
# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
4
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
"10673/4"
"10673/3"
"10673/5"
"10673/7"
```
- I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira
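
One plausible model of the behavior seen in the `curl` output above is that the page of `limit` items is selected first and private items are dropped afterwards, which leaves a short page. This is an assumption made for illustration, not something taken from the DSpace source or the pull request; the function names and the sixth handle `10673/8` are hypothetical.

```python
def make_items(handles, private=()):
    """Build toy item records; `private` lists the handles marked private."""
    return [{"handle": h, "private": h in private} for h in handles]


def page_then_filter(items, offset, limit):
    """Take the page first, then drop private items (page can come up short)."""
    page = items[offset:offset + limit]
    return [item for item in page if not item["private"]]


def filter_then_page(items, offset, limit):
    """Drop private items first, then take the page (page stays full)."""
    visible = [item for item in items if not item["private"]]
    return visible[offset:offset + limit]


# The first five handles come from the demo output above; 10673/8 is made up
handles = ["10673/4", "10673/3", "10673/6", "10673/5", "10673/7", "10673/8"]
items = make_items(handles, private=("10673/6",))

print(len(page_then_filter(items, 0, 5)))  # 4, like the demo after the change
print(len(filter_then_page(items, 0, 5)))  # 5, the count a client would expect
```

Under this model a client that pages until it receives fewer than `limit` items would stop early, which is why the count matters for harvesters.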
- Last week I noticed that the Gender Platform website is using "cgspace.cgiar.org" links for CGSpace, instead of handles
- I emailed Fabio and Marianne to ask them to please use the Handle links
- I tested the [pull request for DS-4271](https://github.com/DSpace/DSpace/pull/2543), which addresses Discovery filters of type "contains" not working as expected when the user's search term has spaces
- I tested with filter "farmer managed irrigation systems" on DSpace Test
- Before the patch I got 293 results, and the few I checked didn't have the expected metadata value
- After the patch I got 162 results, and all the items I checked had the exact metadata value I was expecting
<!-- vim: set sw=2 ts=2: -->