diff --git a/content/posts/2020-10.md b/content/posts/2020-10.md
index 7d9a53f4a..b26c69f8a 100644
--- a/content/posts/2020-10.md
+++ b/content/posts/2020-10.md
@@ -410,4 +410,183 @@ user    7m59.182s
 sys     2m22.713s
 ```
+## 2020-10-18
+
+- Macaroni Bros wrote to me to ask why some of their CCAFS harvesting is failing
+  - They are scraping HTML from /browse responses like this:
+
+https://cgspace.cgiar.org/browse?type=crpsubject&value=Climate+Change%2C+Agriculture+and+Food+Security&XML&rpp=5000
+
+- They are using the user agent "CCAFS Website Publications importer BOT" so they are getting rate limited by nginx
+- Ideally they would use the REST `find-by-metadata-field` endpoint, but it is *really* slow for large result sets (like twenty minutes!):
+
+```
+$ curl -f -H "User-Agent: CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
+```
+
+- For now I will whitelist their user agent so that they can continue scraping /browse
+- I figured out that the mappings for AReS are stored in Elasticsearch
+  - There is a Kibana interface running on port 5601 that can help explore the values in the index
+  - I can interact with Elasticsearch by sending requests, for example to delete an item by its `_id`:
+
+```
+$ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
+{
+  "query": {
+    "match": {
+      "_id": "64j_THMBiwiQ-PKfCSlI"
+    }
+  }
+}
+'
+```
+
+- I added a new find/replace:
+
+```
+$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
+{
+  "find": "ALAN1",
+  "replace": "ALAN2"
+}
+'
+```
+
+- I see it in Kibana, and I can search it in Elasticsearch, but I don't see it in OpenRXV's mapping values dashboard
+- Now I deleted everything in the `openrxv-values` index:
+
+```
+$ curl -XDELETE http://localhost:9200/openrxv-values
+```
+
+- Then I tried posting it again:
+
+```
+$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
+{
+  "find": "ALAN1",
+  "replace": "ALAN2"
+}
+'
+```
+
+- But I still don't see it in AReS
+- Interesting! I added a find/replace manually in AReS and now I see the one I POSTed...
+- I fixed a few bugs in the Simple and Extended PDF reports on AReS
+  - Add missing ISI Journal and Type to Simple PDF report
+  - Fix DOIs in Simple PDF report
+  - Add missing "https://hdl.handle.net" to Handles in Extended PDF report
+- Testing Atmire CUA and L&R based on their feedback from a few days ago
+  - I no longer get the NullPointerException from CUA when importing metadata on the command line (!)
+  - Listings and Reports now shows results for simple queries that I tested (!), though it seems that there are some new JavaScript libraries I need to allow in nginx
+- I sent a mail to the dspace-tech mailing list asking about the error with DSpace 6's "Export Search Metadata" function
+  - If I search for an author like "Orth, Alan" it gives an HTTP 400, but if I search for "Orth" alone it exports a CSV
+  - I replicated the same issue on demo.dspace.org
+
+## 2020-10-19
+
+- Last night I learned how to POST mappings to Elasticsearch for AReS:
+
+```
+$ curl -XDELETE http://localhost:9200/openrxv-values
+$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @./mapping.json
+```
+
+- The JSON file looks like this, with one instruction on each line:
+
+```
+{"index":{}}
+{ "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
+{"index":{}}
+{ "find": "FISH", "replace": "Fish" }
+```
+
+- Adjust the report templates on AReS based on some of Peter's feedback
+- I wrote a quick Python script to filter and convert the old AReS mappings to [Elasticsearch's Bulk API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html) format:
+
+```python
+#!/usr/bin/env python3
+
+import json
+import re
+
+# Read the old mapping file, which is in the format "find": "replace", i.e.:
+#
+#   "alan": "ALAN"
+#
+# And convert to proper dictionaries for import into Elasticsearch's Bulk API:
+#
+#   { "find": "alan", "replace": "ALAN" }
+#
+with open('/tmp/mapping.json', 'r') as f:
+    data = json.load(f)
+
+# Matches replacements that end in an acronym, like:
+#
+#   International Livestock Research Institute - ILRI
+#
+acronym_pattern = re.compile(r"[A-Z]+$")
+
+for find, replace in data.items():
+    # Skip all upper and all lower case strings because they are indicative of
+    # some AGROVOC or other mappings we no longer want to do
+    if find.isupper() or find.islower() or replace.isupper() or replace.islower():
+        continue
+
+    # Skip replacements that end in an acronym
+    if acronym_pattern.search(replace) is not None:
+        continue
+
+    mapping = { "find": find, "replace": replace }
+
+    # Print the Bulk API action and document lines for Elasticsearch
+    print('{"index":{}}')
+    print(json.dumps(mapping))
+```
+
+- It filters all upper and lower case strings as well as any replacements that end in an acronym like "- ILRI", reducing the number of mappings from around 4,000 to about 900
+- I deleted the existing `openrxv-values` Elasticsearch index and then POSTed the new mappings:
+
+```
+$ ./convert-mapping.py > /tmp/elastic-mappings.txt
+$ curl -XDELETE http://localhost:9200/openrxv-values
+$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elastic-mappings.txt
+```
+
+- Then in AReS I didn't see the mappings in the dashboard until I added a new one manually, after which they all appeared
+  - I started a new harvest
+- I checked the CIMMYT DSpace repository and I see they have [the REST API enabled](https://repository.cimmyt.org/rest)
+  - The data doesn't look too bad actually: they have countries in title case, AGROVOC in upper case, CRPs, etc.
+  - According to [their OAI](https://repository.cimmyt.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc) they have 6,500 items in the repository
+  - I would be interested to explore the possibility of harvesting them...
+- Bosede said they were having problems with the "Access" step during item submission
+  - I looked at the Munin graphs for PostgreSQL and both connections and locks look normal so I'm not sure what it could be
+  - I restarted the PostgreSQL service just to see if that would help
+- I ran the `dspace cleanup -v` process on CGSpace and got an error:
+
+```
+Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
+  Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".
+```
+
+- The solution is, as always:
+
+```
+$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
+UPDATE 1
+```
+
+- After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:
+
+```
+$ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
+
+Purging 2474 hits from ShortLinkTranslate in statistics
+Purging 2568 hits from RI\/1\.0 in statistics
+Purging 1851 hits from ILRI Livestock Website Publications importer BOT in statistics
+Purging 1282 hits from curl in statistics
+
+Total number of bot hits purged: 8174
+```
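The bulk `mapping.json` file above follows the Bulk API's NDJSON convention: an action line before each document (a bare `{"index":{}}` lets Elasticsearch auto-generate the `_id`), plus a trailing newline at the end of the body. A minimal sketch of generating such a payload from find/replace pairs; the `to_bulk_body` helper is my own illustration, not part of OpenRXV:

```python
import json

def to_bulk_body(mappings: dict) -> str:
    """Render find/replace pairs as Elasticsearch Bulk API NDJSON."""
    lines = []
    for find, replace in mappings.items():
        # Action line: index into the target of the request URL,
        # letting Elasticsearch assign the _id
        lines.append(json.dumps({"index": {}}))
        # Document line: the actual find/replace mapping
        lines.append(json.dumps({"find": find, "replace": replace}))
    # The Bulk API requires a newline after the last line of data
    return "\n".join(lines) + "\n"

print(to_bulk_body({"FISH": "Fish"}), end="")
```

A body like this can then be sent with `--data-binary` exactly as in the curl commands above.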
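The filtering in the convert-mapping script boils down to a single predicate, which makes the roughly 4,000-to-900 reduction easier to reason about. A standalone restatement of that logic (the `keep_mapping` helper name is mine):

```python
import re

# Replacements ending in an all-caps acronym, e.g.
# "International Livestock Research Institute - ILRI"
ACRONYM_RE = re.compile(r"[A-Z]+$")

def keep_mapping(find: str, replace: str) -> bool:
    """Return True if a find/replace pair should be imported."""
    # All upper or all lower case strings are indicative of AGROVOC
    # or other case-only mappings we no longer want to do
    if find.isupper() or find.islower() or replace.isupper() or replace.islower():
        return False
    # Skip replacements that end in an acronym
    if ACRONYM_RE.search(replace):
        return False
    return True

print(keep_mapping("CRP on Dryland Systems - DS", "Dryland Systems"))  # True
print(keep_mapping("fish", "FISH"))                                    # False: case-only mapping
```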
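I haven't confirmed exactly where OpenRXV applies these mapping documents during harvesting, but conceptually each one is a whole-value substitution over harvested metadata. A sketch of that interpretation, with the function and the whole-value-match behavior being my assumptions rather than OpenRXV code:

```python
def apply_mappings(values, mappings):
    """Replace harvested metadata values using find/replace pairs.

    Assumes each mapping matches a whole value, which is how the
    AReS mappings above appear to be used (e.g. "FISH" -> "Fish").
    """
    table = {m["find"]: m["replace"] for m in mappings}
    # Values without a mapping pass through unchanged
    return [table.get(v, v) for v in values]

mappings = [
    {"find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems"},
    {"find": "FISH", "replace": "Fish"},
]
print(apply_mappings(["FISH", "Rice"], mappings))  # ['Fish', 'Rice']
```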
"dateModified": "2020-10-19T15:23:30+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -791,7 +791,7 @@ sys 0m2.396s
sort_by=3
is accession date (as configured in `dspace.cfg)sort_by=3
is accession date (as configured in dspace.cfg
)Livestock and Fish
CRP as well… hmm.find-by-metadata-field
endpoint, but it is really slow for large result sets (like twenty minutes!):$ curl -f -H "CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
+
_id
:$ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
+{
+ "query": {
+ "match": {
+ "_id": "64j_THMBiwiQ-PKfCSlI"
+ }
+ }
+}
+
$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
+{
+ "find": "ALAN1",
+ "replace": "ALAN2",
+}
+'
+
openrxv-values
index:$ curl -XDELETE http://localhost:9200/openrxv-values
+
$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
+{
+ "find": "ALAN1",
+ "replace": "ALAN2",
+}
+'
+
$ curl -XDELETE http://localhost:9200/openrxv-values
+$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @./mapping.json
+
{"index":{}}
+{ "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
+{"index":{}}
+{ "find": "FISH", "replace": "Fish" }
+
#!/usr/bin/env python3
+
+import json
+import re
+
+f = open('/tmp/mapping.json', 'r')
+data = json.load(f)
+
+# Iterate over old mapping file, which is in format "find": "replace", ie:
+#
+# "alan": "ALAN"
+#
+# And convert to proper dictionaries for import into Elasticsearch's Bulk API:
+#
+# { "find": "alan", "replace": "ALAN" }
+#
+for find, replace in data.items():
+ # Skip all upper and all lower case strings because they are indicative of
+ # some AGROVOC or other mappings we no longer want to do
+ if find.isupper() or find.islower() or replace.isupper() or replace.islower():
+ continue
+
+ # Skip replacements with acronyms like:
+ #
+ # International Livestock Research Institute - ILRI
+ #
+ acronym_pattern = re.compile(r"[A-Z]+$")
+ acronym_pattern_match = acronym_pattern.search(replace)
+ if acronym_pattern_match is not None:
+ continue
+
+ mapping = { "find": find, "replace": replace }
+
+ # Print command for Elasticsearch
+ print('{"index":{}}')
+ print(json.dumps(mapping))
+
+f.close()
+
openrxv-values
Elasticsearch core and then POSTed it:$ ./convert-mapping.py > /tmp/elastic-mappings.txt
+$ curl -XDELETE http://localhost:9200/openrxv-values
+$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elastic-mappings.txt
+
dspace cleanup -v
process on CGSpace and got an error:Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
+ Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".
+
$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
+UPDATE 1
+
$ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
+
+Purging 2474 hits from ShortLinkTranslate in statistics
+Purging 2568 hits from RI\/1\.0 in statistics
+Purging 1851 hits from ILRI Livestock Website Publications importer BOT in statistics
+Purging 1282 hits from curl in statistics
+
+Total number of bot hits purged: 8174
diff --git a/docs/categories/index.html b/docs/categories/index.html
index e096a24f9..ee7c9d038 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 3c1f85b50..680d83ec3 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index ff1a79473..f2d4d8e6d 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 27625e880..45d2722d4 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 6fc3c7b2a..11ec1e73d 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index eff6d5540..2aa959819 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 85d6199db..67c12a019 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 8bdb75626..38610caee 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 2833756bc..fe3e8fb25 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 713b123d1..cfa7b1774 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 79edd2199..1d3a91740 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 8ad89a52c..0c3ff519d 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 525fe91c6..d0fdd6188 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index e9177ae99..cab1ada97 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index d5aa8b89e..865cee404 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 2deb354b1..62aff9f44 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 9a98147b0..c6e100bc9 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 5152784f9..917f1ef1b 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 68df87e7f..fee7f1f3e 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 77eba8b3e..a7b4bebdd 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@