Add notes for 2021-02-23

2021-02-24 09:21:07 +02:00
parent 3be27df78e
commit 6b348cb3a2
23 changed files with 267 additions and 29 deletions

@ -528,4 +528,129 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
# start indexing in AReS
```
## 2021-02-22
- Start looking at splitting the series name and number in `dcterms.isPartOf` now that we have migrated to CG Core v2
- The numbers will go to `cg.number`
- I notice there are about 100 series without a number, but they still have a semicolon, for example `Esporo 72;`
- I think I will replace those like this:
```console
localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
UPDATE 104
```
- As for splitting the other values, I think I can export the `dspace_object_id` and `text_value` and then upload them as a CSV rather than writing a Python script to create the new metadata values
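To make the regular expression above concrete, here is a minimal Python sketch of the same pattern applied to a few strings. Python's `re.sub` stands in for PostgreSQL's `REGEXP_REPLACE` here; the two regex engines are not identical, but for single-line values like these I would expect the same result. Only `Esporo 72;` comes from the notes, the other sample values and the function name are made up for illustration.

```python
import re

# Same pattern as in the SQL above: capture everything before a trailing semicolon.
TRAILING_SEMICOLON = re.compile(r"^(.+?);$")


def strip_trailing_semicolon(series: str) -> str:
    """Return the series value without a trailing semicolon, if it has one."""
    return TRAILING_SEMICOLON.sub(r"\1", series)


# "Esporo 72;" is from the notes; the other two values are hypothetical.
samples = ["Esporo 72;", "Hypothetical Series;12", "No semicolon here"]
cleaned = [strip_trailing_semicolon(s) for s in samples]
print(cleaned)
```

Values that have a number after the semicolon are left alone, which matches the `text_value ~ ';$'` condition in the `WHERE` clause.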
## 2021-02-22
- Check the results of the AReS harvesting from last night:
```console
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
{
"count" : 101380,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
```
- Set the current items index to read only and make a backup:
```console
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22
```
- Delete the current items index and clone the temp one to it:
```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
```
- Then delete the temp and backup:
```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
{"acknowledged":true}
$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
```
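The index rotation above is a fixed sequence of requests, so it may help to see it as data. This is a rough sketch that only builds the ordered list of requests and prints it; it does not contact Elasticsearch. The function name, the `Step` alias and the constants are hypothetical, and the endpoints are the same ones used in the curl commands.

```python
import json
from typing import List, Optional, Tuple

ES = "http://localhost:9200"  # same Elasticsearch address as the curl commands

# Request body that blocks writes; the notes set this before each clone.
BLOCK_WRITES = json.dumps({"settings": {"index.blocks.write": True}})

Step = Tuple[str, str, Optional[str]]  # (HTTP method, URL, JSON body or None)


def rotation_plan(live: str, temp: str, backup: str) -> List[Step]:
    """List the requests, in order, that swap a freshly harvested index into place."""
    return [
        ("PUT", f"{ES}/{live}/_settings", BLOCK_WRITES),  # make the live index read only
        ("POST", f"{ES}/{live}/_clone/{backup}", None),   # back it up
        ("DELETE", f"{ES}/{live}", None),                 # remove the live index
        ("PUT", f"{ES}/{temp}/_settings", BLOCK_WRITES),  # make the temp index read only
        ("POST", f"{ES}/{temp}/_clone/{live}", None),     # temp becomes live
        ("DELETE", f"{ES}/{temp}", None),                 # drop the temp index
        ("DELETE", f"{ES}/{backup}", None),               # drop the backup
    ]


plan = rotation_plan("openrxv-items", "openrxv-items-temp", "openrxv-items-2021-02-22")
for method, url, body in plan:
    print(method, url, body or "")
```

Keeping the steps in one list makes the ordering explicit: the backup is created before the live index is deleted, and it is only dropped at the very end.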
## 2021-02-23
- CodeObia sent a [pull request for clickable countries on AReS](https://github.com/ilri/OpenRXV/pull/75)
- I deployed it and it seems to work, so I asked Abenet and Peter to test it so we can get feedback
- Remove semicolons from series names without numbers:
```console
dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
UPDATE 104
dspace=# COMMIT;
```
- Set all `text_lang` values on CGSpace to `en_US` to make the series replacements easier (this didn't work, read below):
```console
dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE text_lang !='en_US' AND dspace_object_id IN (SELECT uuid FROM item);
UPDATE 911
cgspace=# COMMIT;
```
- Then export all series with their IDs to CSV:
```console
dspace=# \COPY (SELECT dspace_object_id, text_value as "dcterms.isPartOf[en_US]" FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
```
- In OpenRefine I trimmed and consolidated whitespace, then made some quick cleanups to normalize the fields based on a sanity check
- For example many Spore items are like "Spore, Spore 23"
- Also, "Agritrade, August 2002"
- Then I copied the column to a new one called `cg.number[en_US]` and split the values for each on the semicolon using `value.split(';')[0]` and `value.split(';')[1]`
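As an illustration of the split described above, here is a small Python sketch of the same idea outside OpenRefine. It is not the GREL expression itself: it uses `str.partition`, which splits on the first semicolon only, and also trims whitespace. The function name and the `Hypothetical Series; 12` value are made up; `Agritrade, August 2002` is one of the messy values mentioned above.

```python
from typing import Tuple


def split_series(value: str) -> Tuple[str, str]:
    """Split 'Name; Number' into (name, number); number is '' when there is no semicolon."""
    name, _, number = value.partition(";")
    return name.strip(), number.strip()


# The first value is hypothetical; the second is from the notes and has no semicolon.
print(split_series("Hypothetical Series; 12"))
print(split_series("Agritrade, August 2002"))
```

A value without a semicolon comes back with an empty number, which is a simple way to spot entries that still need manual cleanup before the split.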
- I tried to upload some of the series data to DSpace Test but I'm having an issue where some fields change that shouldn't
- It seems not all fields get updated when I set the `text_lang` globally, but if I update a value manually like this it works:
```console
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845;
UPDATE 1
```
- This also seems to work, using the id for just that one item:
```console
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322';
UPDATE 37
```
- This seems to work better for some reason:
```console
dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
UPDATE 18659
```
- I split the CSV file into batches of 5,000 using xsv, then imported them one by one in CGSpace:
```console
$ dspace metadata-import -f /tmp/0.csv
```
- It took FOREVER to import each file... like several hours. MY GOD DSpace 6 is slow.
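For reference, the batching step above can also be sketched in plain Python. This is not what xsv does internally, just a minimal equivalent using the standard `csv` module that repeats the header row in every batch so each one can be imported on its own. The function name and the sample rows are hypothetical; the column names are the ones from the export above.

```python
import csv
import io
from typing import List


def split_csv(text: str, size: int) -> List[str]:
    """Split CSV text into chunks of at most `size` data rows, repeating the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)
    chunks = []
    for start in range(0, len(rows), size):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows[start:start + size])
        chunks.append(out.getvalue())
    return chunks


# Five hypothetical rows split into batches of two (the notes used batches of 5,000).
sample = "dspace_object_id,dcterms.isPartOf[en_US]\n" + "".join(
    f"{i},Hypothetical Series\n" for i in range(5)
)
chunks = split_csv(sample, 2)
print(len(chunks))
```

Each returned string could then be written to its own file and passed to `dspace metadata-import -f`.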
- Help Dominique Perera debug some issues with the WordPress DSpace importer plugin from Macaroni Bros
- She is not seeing the community list for CGSpace, and I see weird requests like this in the logs:
```console
104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest/communities?limit=1000" "RTB website BOT"
104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] "GET /rest/communities//communities HTTP/1.1" 404 714 "https://cgspace.cgiar.org/rest/communities//communities" "RTB website BOT"
```
- The first request is OK, but the second one is malformed for sure
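To find other requests like the second one, a quick check for an empty path segment (`//`) in the requested path might look like the sketch below. The regular expression assumes log lines in the same format as the two shown above, and the function name is hypothetical. It says nothing about why the plugin builds the URL that way.

```python
import re

# Matches the quoted request part of a log line: "METHOD path PROTOCOL"
REQUEST = re.compile(r'"(?:GET|POST|PUT|DELETE|HEAD) (\S+) HTTP/[\d.]+"')


def has_empty_path_segment(log_line: str) -> bool:
    """True if the requested path contains '//' (an empty path segment)."""
    match = REQUEST.search(log_line)
    if match is None:
        return False
    path = match.group(1).split("?", 1)[0]  # ignore the query string
    return "//" in path


# The two log lines from the notes above.
lines = [
    '104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest/communities?limit=1000" "RTB website BOT"',
    '104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] "GET /rest/communities//communities HTTP/1.1" 404 714 "https://cgspace.cgiar.org/rest/communities//communities" "RTB website BOT"',
]
flags = [has_empty_path_segment(line) for line in lines]
print(flags)
```

Only the request path is checked, so the `https://` in the referrer field does not trigger a false positive.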
<!-- vim: set sw=2 ts=2: -->