---
title: "April, 2021"
date: 2021-04-01T09:50:54+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2021-04-01

- I wrote a script to query Sherpa's API for our ISSNs: `sherpa-issn-lookup.py`
  - I'm curious to see how the results compare with the results from Crossref yesterday
- AReS Explorer has been down since this morning; I didn't see anything in the systemd journal
  - I simply took everything down with docker-compose and back up again, and then it was OK
  - Perhaps one of the containers crashed; I should have looked closer, but I was in a hurry

<!--more-->

## 2021-04-03

- Biruk from ICT contacted me to say that some CGSpace users still can't log in
  - I guess the CGSpace LDAP bind account is really still locked after last week's reset
  - He fixed the account and then I was finally able to bind and query:

```console
$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)"
```

## 2021-04-04

- Check the index aliases on AReS Explorer to make sure they are sane before starting a new harvest:

```console
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
```

- Then set the `openrxv-items-final` index to read-only so we can make a backup:

```console
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
{"acknowledged":true}%
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup
{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final-backup"}%
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
```

- Then start a harvest on AReS Explorer
- Help Enrico get some 2020 statistics for the Roots, Tubers and Bananas (RTB) community on CGSpace
  - He was hitting [a bug on AReS](https://github.com/ilri/OpenRXV/issues/66), and also he only needed stats for 2020, while AReS currently only gives all-time stats
- I cleaned up about 230 ISSNs on CGSpace in OpenRefine
  - I had exported them last week, then filtered for anything not looking like an ISSN with this GREL: `isNotNull(value.match(/^\p{Alnum}{4}-\p{Alnum}{4}$/))`
  - Then I applied them on CGSpace with the `fix-metadata-values.py` script:

```console
$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
```

- For now I only fixed obvious errors like "1234-5678." and "e-ISSN: 1234-5678" etc., but there are still lots of invalid ones which need more manual work:
  - Too few characters
  - Too many characters
  - ISBNs
- Create the CGSpace community and collection structure for the new Accelerating Impacts of CGIAR Climate Research for Africa (AICCRA) and assign all workflow steps
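The obvious fixes can also be scripted. Here is a rough Python sketch that mirrors the GREL filter above; the `clean_issn` helper and its cleanup patterns are hypothetical, not part of `fix-metadata-values.py`:

```python
import re

# Same shape as the GREL filter used in OpenRefine: four alphanumeric
# characters, a hyphen, then four more (the last may be an X check digit).
ISSN_PATTERN = re.compile(r"^[0-9A-Za-z]{4}-[0-9A-Za-z]{4}$")

def clean_issn(value):
    """Hypothetical helper: strip obvious cruft and return a plausible
    ISSN, or None if the value still doesn't match the pattern."""
    cleaned = value.strip()
    # Fix obvious errors like an "e-ISSN: " prefix or a trailing period
    cleaned = re.sub(r"(?i)^e?-?issn:?\s*", "", cleaned)
    cleaned = cleaned.rstrip(".")
    return cleaned if ISSN_PATTERN.match(cleaned) else None

print(clean_issn("1234-5678."))         # 1234-5678
print(clean_issn("e-ISSN: 1234-5678"))  # 1234-5678
print(clean_issn("1234-567"))           # None (too few characters)
```

The cleaned values would then go in the `correct` column of the CSV that `fix-metadata-values.py` reads.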

## 2021-04-05

- The AReS Explorer harvesting from yesterday finished, and the results look OK, but actually the Elasticsearch indexes are messed up again:

```console
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
{
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },
...
}
```

- `openrxv-items` should be an alias of `openrxv-items-final`, not `openrxv-items-temp`... I will have to fix that manually
- Enrico asked for more information on the RTB stats I gave him yesterday
  - I remembered (again) that we can't filter Atmire's CUA stats by date issued
  - To show, for example, views/downloads in the year 2020 for RTB items issued in 2020, we would need to use the DSpace Statistics API and POST a list of IDs and a custom date range
  - I tried to do that here by exporting the RTB community and extracting the IDs for items issued in 2020:

```console
$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
  sed '1d' | \
  csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
  csvgrep -c issued -m 2020 | \
  csvcut -c id | \
  sed '1d' | \
  sort | \
  uniq
```

- So I remember in the future, this basically does the following:
  - Use csvcut to extract the id and all date issued columns from the CSV
  - Use sed to remove the header so we can refer to the columns using the default a, b, c instead of their real names (which are tricky to match due to special characters)
  - Use csvsql to concatenate the various date issued columns (coalescing where null)
  - Use csvgrep to filter items by date issued in 2020
  - Use csvcut to extract the id column
  - Use sed to delete the header row
  - Use sort and uniq to filter out any duplicate IDs (there were three)
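The same coalesce-filter-dedupe logic can be sketched in plain Python, which may be easier to remember than the csvkit pipeline. The sample rows here are made up; the real export has the same column names:

```python
import csv
import io

# Toy CSV shaped like the DSpace metadata export: an id column plus several
# language-qualified "date issued" columns, only one populated per row.
sample = """id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]
aaa,2020-01-15,,
bbb,,2019-06-01,
ccc,,,2020-11-30
aaa,2020-01-15,,
"""

ids = set()
for row in csv.DictReader(io.StringIO(sample)):
    # Coalesce the date columns (like COALESCE(b,"")||... in the csvsql step)
    issued = row["dcterms.issued"] or row["dcterms.issued[]"] or row["dcterms.issued[en_US]"]
    # Keep items issued in 2020 (like csvgrep -c issued -m 2020);
    # the set deduplicates (like sort | uniq)
    if issued.startswith("2020"):
        ids.add(row["id"])

print(sorted(ids))  # ['aaa', 'ccc']
```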

- Then I have a list of 296 IDs for RTB items issued in 2020
- I constructed a JSON file to post to the DSpace Statistics API:

```json
{
    "limit": 100,
    "page": 0,
    "dateFrom": "2020-01-01T00:00:00Z",
    "dateTo": "2020-12-31T00:00:00Z",
    "items": [
        "00358715-b70c-4fdd-aa55-730e05ba739e",
        "004b54bb-f16f-4cec-9fbc-ab6c6345c43d",
        "02fb7630-d71a-449e-b65d-32b4ea7d6904",
...
    ]
}
```

- Then I submitted the file three times (changing the page parameter):

```console
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page2.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page3.json
```
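Since the payload's `page` field is zero-based with `limit` 100, 296 IDs need exactly three requests. A small sketch of how the three payloads differ (the date range and limit come from the file above; the loop itself is just illustrative):

```python
import json
import math

# The Statistics API payload above pages with limit=100 and a zero-based
# "page" field, so 296 item IDs require three requests.
limit, total_items = 100, 296
base_payload = {
    "limit": limit,
    "dateFrom": "2020-01-01T00:00:00Z",
    "dateTo": "2020-12-31T00:00:00Z",
    "items": [],  # the 296 item UUIDs would go here
}

# One payload per page, identical except for the "page" field
payloads = [json.dumps({**base_payload, "page": page})
            for page in range(math.ceil(total_items / limit))]

print(len(payloads))  # 3 requests: page 0, 1, and 2
```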

- Then I extracted the views and downloads in the most ridiculous way:

```console
$ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
30364
$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
9100
```
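A less fragile tally would be to parse the saved pages as JSON and sum the fields. Note the per-item shape here (a `statistics` list with `views` and `downloads` keys) is my assumption about the response format, and the sample numbers are made up:

```python
import json

# Stand-ins for the contents of /tmp/page1.json, /tmp/page2.json, ...
pages = [
    '{"statistics": [{"views": 10, "downloads": 2}, {"views": 5, "downloads": 0}]}',
    '{"statistics": [{"views": 7, "downloads": 3}]}',
]

views = downloads = 0
for page in pages:
    # Sum per-item counters across all pages
    for item in json.loads(page)["statistics"]:
        views += item["views"]
        downloads += item["downloads"]

print(views, downloads)  # 22 5
```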

- Out of curiosity I did the same exercise for items issued in 2019 and got the following:
  - Views: 30721
  - Downloads: 10205
<!-- vim: set sw=2 ts=2: -->
|