From 6b348cb3a20baf5197f89dbded91cb582440ce82 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Wed, 24 Feb 2021 09:21:07 +0200 Subject: [PATCH] Add notes for 2021-02-23 --- content/posts/2021-02.md | 125 ++++++++++++++++++++++++ docs/2021-02/index.html | 121 ++++++++++++++++++++++- docs/categories/index.html | 2 +- docs/categories/notes/index.html | 2 +- docs/categories/notes/page/2/index.html | 2 +- docs/categories/notes/page/3/index.html | 2 +- docs/categories/notes/page/4/index.html | 2 +- docs/categories/notes/page/5/index.html | 2 +- docs/index.html | 2 +- docs/page/2/index.html | 2 +- docs/page/3/index.html | 2 +- docs/page/4/index.html | 2 +- docs/page/5/index.html | 2 +- docs/page/6/index.html | 2 +- docs/page/7/index.html | 2 +- docs/posts/index.html | 2 +- docs/posts/page/2/index.html | 2 +- docs/posts/page/3/index.html | 2 +- docs/posts/page/4/index.html | 2 +- docs/posts/page/5/index.html | 2 +- docs/posts/page/6/index.html | 2 +- docs/posts/page/7/index.html | 2 +- docs/sitemap.xml | 10 +- 23 files changed, 267 insertions(+), 29 deletions(-) diff --git a/content/posts/2021-02.md b/content/posts/2021-02.md index 6c0a0a43a..7134bd0af 100644 --- a/content/posts/2021-02.md +++ b/content/posts/2021-02.md @@ -528,4 +528,129 @@ $ curl -XDELETE 'http://localhost:9200/openrxv-items-temp' # start indexing in AReS ``` +## 2021-02-22 + +- Start looking at splitting the series name and number in `dcterms.isPartOf` now that we have migrated to CG Core v2 + - The numbers will go to `cg.number` + - I notice there are about 100 series without a number, but they still have a semicolon, for example `Esporo 72;` + - I think I will replace those like this: + +```console +localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$'; +UPDATE 104 +``` + +- As for splitting the other values, I think I can export the `dspace_object_id` 
and `text_value` and then upload it as a CSV rather than writing a Python script to create the new metadata values + +## 2021-02-22 + +- Check the results of the AReS harvesting from last night: + +```console +$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty' +{ + "count" : 101380, + "_shards" : { + "total" : 1, + "successful" : 1, + "skipped" : 0, + "failed" : 0 + } +} +``` + +- Set the current items index to read only and make a backup: + +```console +$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}' +$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22 +``` + +- Delete the current items index and clone the temp one to it: + +```console +$ curl -XDELETE 'http://localhost:9200/openrxv-items' +$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' +$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items +``` + +- Then delete the temp and backup: + +```console +$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp' +{"acknowledged":true}% +$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22' +``` + +## 2021-02-23 + +- CodeObia sent a [pull request for clickable countries on AReS](https://github.com/ilri/OpenRXV/pull/75) + - I deployed it and it seems to work, so I asked Abenet and Peter to test it so we can get feedback +- Remove semicolons from series names without numbers: + +```console +dspace=# BEGIN; +dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$'; +UPDATE 104 +dspace=# COMMIT; +``` + +- Set all `text_lang` values on CGSpace to `en_US` to make the series replacements easier (this didn't work, read below): + +```console +dspace=# BEGIN; +dspace=# UPDATE 
metadatavalue SET text_lang='en_US' WHERE text_lang !='en_US' AND dspace_object_id IN (SELECT uuid FROM item); +UPDATE 911 +cgspace=# COMMIT; +``` + +- Then export all series with their IDs to CSV: + +```console +dspace=# \COPY (SELECT dspace_object_id, text_value as "dcterms.isPartOf[en_US]" FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER; +``` + +- In OpenRefine I trimmed and consolidated whitespace, then made some quick cleanups to normalize the fields based on a sanity check + - For example, many Spore items are like "Spore, Spore 23" + - Also, "Agritrade, August 2002" +- Then I copied the column to a new one called `cg.number[en_US]` and split the values for each on the semicolon using `value.split(';')[0]` and `value.split(';')[1]` +- I tried to upload some of the series data to DSpace Test but I'm having an issue where some fields that shouldn't change are changing + - It seems not all fields get updated when I set the `text_lang` globally, but if I update it manually like this, it works: + +```console +dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845; +UPDATE 1 +``` + +- This also seems to work, using the ID for just that one item: + +```console +dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322'; +UPDATE 37 +``` + +- This seems to work better for some reason: + +```console +dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item); +UPDATE 18659 +``` + +- I split the CSV file into batches of 5,000 using xsv, then imported them one by one in CGSpace: + +```console +$ dspace metadata-import -f /tmp/0.csv +``` + +- It took FOREVER to import each file... like several hours. MY GOD DSpace 6 is slow. 
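For reference, the series/number split and the batching described above could also be sketched in plain Python. This is only a rough illustration, not what was actually run (the notes used OpenRefine and xsv); the function names `split_series` and `batches` are hypothetical, and it assumes at most one meaningful semicolon per value:

```python
def split_series(value):
    """Split a 'Series name; number' string into (name, number).

    Hypothetical helper: mirrors the OpenRefine value.split(';') step,
    but returns an empty number when there is no semicolon instead of failing.
    """
    name, _, number = value.partition(";")
    # Trim and consolidate whitespace, like the OpenRefine cleanup step
    return " ".join(name.split()), " ".join(number.split())


def batches(rows, size=5000):
    """Yield consecutive slices of rows, each with at most `size` entries."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


if __name__ == "__main__":
    # A series with a trailing semicolon but no number, as seen in the data
    print(split_series("Esporo 72;"))
    # Small demonstration of batching with a batch size of 5
    print(list(batches(list(range(12)), size=5)))
```

Using `partition` rather than indexing into `split(';')` means values without a semicolon, such as "Agritrade, August 2002", pass through unchanged with an empty number.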
+- Help Dominique Perera debug some issues with the WordPress DSpace importer plugin from Macaroni Bros + - She is not seeing the community list for CGSpace, and I see weird requests like this in the logs: + +```console +104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest /communities?limit=1000" "RTB website BOT" +104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] "GET /rest/communities//communities HTTP/1.1" 404 714 "https://cgspace.cgiar.org/rest/communities//communities" "RTB website BOT" +``` + +- The first request is OK, but the second one is malformed for sure + diff --git a/docs/2021-02/index.html b/docs/2021-02/index.html index b0e504d8b..3ce913b4b 100644 --- a/docs/2021-02/index.html +++ b/docs/2021-02/index.html @@ -32,7 +32,7 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty - + @@ -70,9 +70,9 @@ $ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty "@type": "BlogPosting", "headline": "February, 2021", "url": "https://alanorth.github.io/cgspace-notes/2021-02/", - "wordCount": "3111", + "wordCount": "3754", "datePublished": "2021-02-01T10:13:54+02:00", - "dateModified": "2021-02-21T18:20:37+02:00", + "dateModified": "2021-02-21T20:37:27+02:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -678,7 +678,120 @@ $ dspace metadata-import -e aorth@mjanja.ch -f /tmp/cifor.csv
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
 # start indexing in AReS
-
+

2021-02-22

+ +
localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
+UPDATE 104
+
+

2021-02-22

+ +
$ curl -s 'http://localhost:9200/openrxv-items-temp/_count?q=*&pretty'
+{
+  "count" : 101380,
+  "_shards" : {
+    "total" : 1,
+    "successful" : 1,
+    "skipped" : 0,
+    "failed" : 0
+  }
+}
+
+
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d' {"settings": {"index.blocks.write":true}}'
+$ curl -s -X POST http://localhost:9200/openrxv-items/_clone/openrxv-items-2021-02-22
+
+
$ curl -XDELETE 'http://localhost:9200/openrxv-items'
+$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
+$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items
+
+
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
+{"acknowledged":true}%
+$ curl -XDELETE 'http://localhost:9200/openrxv-items-2021-02-22'
+

2021-02-23

+ +
dspace=# BEGIN;
+dspace=# UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '^(.+?);$','\1', 'g') WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item) AND text_value ~ ';$';
+UPDATE 104
+dspace=# COMMIT;
+
+
dspace=# BEGIN;
+dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE text_lang !='en_US' AND dspace_object_id IN (SELECT uuid FROM item);
+UPDATE 911
+cgspace=# COMMIT;
+
+
dspace=# \COPY (SELECT dspace_object_id, text_value as "dcterms.isPartOf[en_US]" FROM metadatavalue WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item)) TO /tmp/2021-02-23-series.csv WITH CSV HEADER;
+
+
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_value_id=5355845;
+UPDATE 1
+
+
dspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id='9840d19b-a6ae-4352-a087-6d74d2629322';
+UPDATE 37
+
+
dspacetest=# UPDATE metadatavalue SET text_lang='en_US' WHERE metadata_field_id=166 AND dspace_object_id IN (SELECT uuid FROM item);
+UPDATE 18659
+
+
$ dspace metadata-import -f /tmp/0.csv
+
+
104.198.97.97 - - [23/Feb/2021:11:41:17 +0100] "GET /rest/communities?limit=1000 HTTP/1.1" 200 188779 "https://cgspace.cgiar.org/rest /communities?limit=1000" "RTB website BOT"
+104.198.97.97 - - [23/Feb/2021:11:41:18 +0100] "GET /rest/communities//communities HTTP/1.1" 404 714 "https://cgspace.cgiar.org/rest/communities//communities" "RTB website BOT"
+
+ diff --git a/docs/categories/index.html b/docs/categories/index.html index cdfa7ed0c..734a44b6e 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index 467832fc7..7285381a7 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index 1e8a5d217..b1693ec31 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index 78b93f4e7..4d138a488 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 00be8858d..c711f3876 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html index 7b5d83215..8487e5aac 100644 --- a/docs/categories/notes/page/5/index.html +++ b/docs/categories/notes/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/index.html b/docs/index.html index d54a5fe77..4f2477aeb 100644 --- a/docs/index.html +++ b/docs/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index cb6717d22..7a8f0ccbe 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 7a6536fc2..8e826ddf4 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index ba03e779c..6996c6da6 100644 --- 
a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index 7d36e34cb..9e29feb88 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/6/index.html b/docs/page/6/index.html index 3c083bc95..facec3217 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/7/index.html b/docs/page/7/index.html index 2b3bdf0e8..ad86b70ce 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index 05f9eb443..610f33b7f 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 09c033af4..d89bb3f6f 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index 3c18528f8..f0fc880d4 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index 96183fb74..e386d6c44 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index cb713d849..0581c0334 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index 3650ae008..5915334c1 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html index 6cb0aeab2..a1716de14 100644 --- a/docs/posts/page/7/index.html +++ b/docs/posts/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/sitemap.xml 
b/docs/sitemap.xml index c6f6e7fc7..3e7e08725 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -4,27 +4,27 @@ https://alanorth.github.io/cgspace-notes/categories/ - 2021-02-21T18:20:37+02:00 + 2021-02-21T20:37:27+02:00 https://alanorth.github.io/cgspace-notes/ - 2021-02-21T18:20:37+02:00 + 2021-02-21T20:37:27+02:00 https://alanorth.github.io/cgspace-notes/2021-02/ - 2021-02-21T18:20:37+02:00 + 2021-02-21T20:37:27+02:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2021-02-21T18:20:37+02:00 + 2021-02-21T20:37:27+02:00 https://alanorth.github.io/cgspace-notes/posts/ - 2021-02-21T18:20:37+02:00 + 2021-02-21T20:37:27+02:00