cgspace-notes/content/posts/2022-01.md

225 lines
9.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: "January, 2022"
date: 2022-01-01T15:20:54+02:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-01-01
- Start a full harvest on AReS
<!--more-->
## 2022-01-06
- Add ORCID identifier for Chris Jones to CGSpace
- Also tag eighty-eight of his items in CGSpace:
```console
$ cat 2022-01-06-add-orcids.csv
dc.contributor.author,cg.creator.identifier
"Jones, Chris","Chris Jones: 0000-0001-9096-9728"
"Jones, Christopher S.","Chris Jones: 0000-0001-9096-9728"
$ ./ilri/add-orcid-identifiers-csv.py -i 2022-01-06-add-orcids.csv -db dspace63 -u dspacetest -p 'dom@in34sniper'
```
## 2022-01-09
- Validate and register CGSpace on [OpenArchives](https://www.openarchives.org/Register/ValidateSite?log=Z2V7WCT7)
- Last month IWMI colleagues were asking me to look into this, and after checking the OpenArchives mailing list it seems there was a problem on the server side
- Now it has worked and the message is "Successfully updated OAI registration database to status COMPLIANT."
- I received an email (as the Admin contact on our OAI) that says:
> Your repository has been registered in the OAI database of conforming repositories.
- Now I'm taking a screenshot of the validation page for posterity, because the logs seem to go away after some time
![OpenArchives.org registration](/cgspace-notes/2022/01/openarchives-registration.png)
- I tried to re-build the Docker image for OpenRXV and got an error in the backend:
```console
...
> openrxv-backend@0.0.1 build
> nest build
node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias 'AggregationsAggregate' circularly references itself.
2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate<any> | AggregationsTermsAggregate<any> | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate<AggregationsBucket> | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
~~~~~~~~~~~~~~~~~~~~~
node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias 'AggregationsSingleBucketAggregate' circularly references itself.
3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Found 2 error(s).
```
- Ah, it seems the code on the server was slightly out of date
- I checked out the latest master branch and it built
## 2022-01-12
- Fix some citation formatting issues in Gaia's [eighteen CAS Green Cover publications on DSpace Test](https://dspacetest.cgiar.org/handle/10568/115230)
## 2022-01-19
- Francesca was having issues with a submission on CGSpace this week
- I checked and see a lot of locks in PostgreSQL:
```console
$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
1
1 ------------------
1 (3506 rows)
1 application_name
9 psql
10
3487 dspaceWeb
```
- As before, I see messages from PostgreSQL about processes waiting for locks since I enabled the `log_lock_waits` setting last month:
```console
$ grep -E '^2022-01*' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
12
```
- I set a system alert on DSpace and then restarted the server
## 2022-01-20
- Abenet gave me a thumbs up for Gaia's eighteen CAS Green Cover items from last month
- I created a SimpleArchiveFormat bundle with SAFBuilder and then imported them on CGSpace:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=aorth@mjanja.ch --source /tmp/SimpleArchiveFormat --mapfile=./2022-01-20-green-covers.map
```
## 2022-01-21
- Start working on the rest of the ~980 CGIAR TAC and ICW documents from Gaia
- I did some cleanups and standardization of author names
- I also noticed that a few dozen items had no dates at all, so I checked the PDFs and found dates for them in the text
- Otherwise all items have only a year, which is not great...
- Proof of concept upgrade of OpenRXV from Angular 9 to Angular 10
- I did some basic tests and created a [pull request](https://github.com/ilri/OpenRXV/pull/128)
## 2022-01-22
- Spend some time adding months to the CGIAR TAC and IWC records from Gaia
- Most of the PDFs have only YYYY, so this is annoying...
## 2022-01-23
- Finalize cleaning up the dates on the CGIAR TAC and IWC records from Gaia
- Rebuild AReS and start a fresh harvest
## 2022-01-25
- Help Udana from IWMI answer some questions about licenses on their journal articles
- I was surprised to see they have 921 total, but only about 200 have a `dcterms.license` field
- I updated about thirty manually, but really Udana should do more...
- Normalize the metadata `text_lang` attributes on CGSpace database:
```console
dspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2803350
en | 6232
| 3200
fr | 2
vn | 2
92 | 1
sp | 1
| 0
(8 rows)
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en', '92', '');
UPDATE 9433
```
- Then export the WLE Journal Articles collection again so there are fewer columns to mess with
## 2022-01-26
- Send Gaia an example of the duplicate report for the first 200 TAC items to see what she thinks
## 2022-01-27
- Work on WLE's Journal Articles a bit more
- I realized that ~130 items have DOIs in their citation, but no `cg.identifier.doi` field
- I used this OpenRefine GREL to copy them:
```
cells['dcterms.bibliographicCitation[en_US]'].value.split("doi: ")[1]
```
- I also spent a bit of time cleaning up ILRI Journal Articles, but I notice that we don't put DOIs in the citation so it's not possible to fix items that are missing DOIs that way
- And I cleaned up and normalized some licenses
- Francesca from Bioversity was having issues with a submission on CGSpace again
- I looked at PostgreSQL and see an increasing number of locks:
```console
$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
1
1 ------------------
1 (537 rows)
1 application_name
9 psql
51 dspaceApi
477 dspaceWeb
$ grep -E '^2022-01*' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
3
```
- I set a system alert on CGSpace and then restarted Tomcat and PostgreSQL
- The issue in Francesca's case was actually that someone had taken the task, not that PostgreSQL transactions were locked!
## 2022-01-28
- Finalize the last ~100 WLE Journal Article items without licensese and DOIs
- I did as many as I could, also updating http links to https for many journal links
- Federica Bottamedi contacted us from the system office to say that she took over for Vini (Abhilasha Vaid)
- She created an account on CGSpace and now we need to see which workflows she should belong to
- Start a fresh harvesting on AReS
- I adjusted the `check-duplicates.py` script to write the output to a CSV file including the id, both titles, both dates, and the handle link
- I included the id because I will need a unique field to join the resulting list of non-duplicates with the original CSV where the rest of the metadata and filenames are
- Since these items are not in DSpace yet, I generated simple numeric IDs in OpenRefine using this GREL transform: `row.index + 1`
- Then I ran `check-duplicates.py` on items 1200 and sent the resulting CSV to Gaia
- Delete one duplicate item I saw in IITA's Journal Articles that was uploaded earlier in WLE
- Also do some general cleanup on IITA's Journal Articles collection in OpenRefine
- Delete one duplicate item I saw in ILRI's Journal Articles collection
- Also do some general cleanup on ILRI's Journal Articles collection in OpenRefine and csv-metadata-quality
## 2022-01-29
- I did some more cleanup on the ILRI Journal Articles
- I added missing journal titles for items that had ISSNs
- Then I added pages for items that had them in the citation
- First, I faceted the citation field based on whether or not the item had something like ": 232-234" present:
```console
value.contains(/:\s?\d+(-|)\d+/)
```
- Then I faceted by blank on `dcterms.extent` and did a transform to extract the page information for over 1,000 items!
```console
'p. ' +
cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*:\s?(\d+)(-|)(\d+).*/)[0] +
'-' +
cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*:\s?(\d+)(-|)(\d+).*/)[2]
```
- Then I did similar for `cg.volume` and `cg.issue`, also based on the citation, for example to extract the "16" from "Journal of Blah 16(1)", where "16" is the second capture group in a zero-based match:
```console
cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*( |;)(\d+)\((\d+)\).*/)[1]
```
- This was 3,000 items so I imported the changes on CGSpace 1,000 at a time...
<!-- vim: set sw=2 ts=2: -->