From 28d25cdac0d2e261adea5c95f69314956e214749 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Mon, 19 Oct 2020 15:47:59 +0300
Subject: [PATCH] Add notes for 2020-10-19

---
 content/posts/2020-10.md                | 179 +++++++++++++++++++++++
 docs/2019-01/index.html                 |   6 +-
 docs/2020-10/index.html                 | 184 +++++++++++++++++++++++-
 docs/categories/index.html              |   2 +-
 docs/categories/notes/index.html        |   2 +-
 docs/categories/notes/page/2/index.html |   2 +-
 docs/categories/notes/page/3/index.html |   2 +-
 docs/categories/notes/page/4/index.html |   2 +-
 docs/index.html                         |   2 +-
 docs/page/2/index.html                  |   2 +-
 docs/page/3/index.html                  |   2 +-
 docs/page/4/index.html                  |   2 +-
 docs/page/5/index.html                  |   2 +-
 docs/page/6/index.html                  |   2 +-
 docs/page/7/index.html                  |   2 +-
 docs/posts/index.html                   |   2 +-
 docs/posts/page/2/index.html            |   2 +-
 docs/posts/page/3/index.html            |   2 +-
 docs/posts/page/4/index.html            |   2 +-
 docs/posts/page/5/index.html            |   2 +-
 docs/posts/page/6/index.html            |   2 +-
 docs/posts/page/7/index.html            |   2 +-
 docs/sitemap.xml                        |  12 +-
 23 files changed, 388 insertions(+), 31 deletions(-)

diff --git a/content/posts/2020-10.md b/content/posts/2020-10.md
index 7d9a53f4a..b26c69f8a 100644
--- a/content/posts/2020-10.md
+++ b/content/posts/2020-10.md
@@ -410,4 +410,183 @@ user 7m59.182s
 sys 2m22.713s
 ```
 
+## 2020-10-18
+
+- Macaroni Bros wrote to me to ask why some of their CCAFS harvesting is failing
+  - They are scraping HTML from /browse responses like this:
+
+https://cgspace.cgiar.org/browse?type=crpsubject&value=Climate+Change%2C+Agriculture+and+Food+Security&XML&rpp=5000
+
+- They are using the user agent "CCAFS Website Publications importer BOT" so they are getting rate limited by nginx
+- Ideally they would use the REST `find-by-metadata-field` endpoint, but it is *really* slow for large result sets (like twenty minutes!):
+
+```
+$ curl -f -H "User-Agent: CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
+```
+
+- For now I will whitelist their user agent so that they can continue scraping /browse
+- I figured out that the mappings for AReS are stored in Elasticsearch
+  - There is a Kibana interface running on port 5601 that can help explore the values in the index
+  - I can interact with Elasticsearch by sending requests, for example to delete an item by its `_id`:
+
+```
+$ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
+{
+  "query": {
+    "match": {
+      "_id": "64j_THMBiwiQ-PKfCSlI"
+    }
+  }
+}
+'
+```
+
+- I added a new find/replace:
+
+```
+$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
+{
+  "find": "ALAN1",
+  "replace": "ALAN2"
+}
+'
+```
+
+- I see it in Kibana, and I can search it in Elasticsearch, but I don't see it in OpenRXV's mapping values dashboard
+- Now I deleted everything in the `openrxv-values` index:
+
+```
+$ curl -XDELETE http://localhost:9200/openrxv-values
+```
+
+- Then I tried posting it again:
+
+```
+$ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
+{
+  "find": "ALAN1",
+  "replace": "ALAN2"
+}
+'
+```
+
+- But I still don't see it in AReS
+- Interesting! I added a find/replace manually in AReS and now I see the one I POSTed...
+- I fixed a few bugs in the Simple and Extended PDF reports on AReS
+  - Add missing ISI Journal and Type to Simple PDF report
+  - Fix DOIs in Simple PDF report
+  - Add missing "https://hdl.handle.net" to Handles in Extended PDF report
+- Testing Atmire CUA and L&R based on their feedback from a few days ago
+  - I no longer get the NullPointerException from CUA when importing metadata on the command line (!)
+  - Listings and Reports now shows results for simple queries that I tested (!), though it seems that there are some new JavaScript libraries I need to allow in nginx
+- I sent a mail to the dspace-tech mailing list asking about the error with DSpace 6's "Export Search Metadata" function
+  - If I search for an author like "Orth, Alan" it gives an HTTP 400, but if I search for "Orth" alone it exports a CSV
+  - I replicated the same issue on demo.dspace.org
+
+## 2020-10-19
+
+- Last night I learned how to POST mappings to Elasticsearch for AReS:
+
+```
+$ curl -XDELETE http://localhost:9200/openrxv-values
+$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @./mapping.json
+```
+
+- The JSON file looks like this, with one instruction on each line:
+
+```
+{"index":{}}
+{ "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
+{"index":{}}
+{ "find": "FISH", "replace": "Fish" }
+```
+
+- Adjust the report templates on AReS based on some of Peter's feedback
+- I wrote a quick Python script to filter and convert the old AReS mappings to [Elasticsearch's Bulk API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html) format:
+
+```python
+#!/usr/bin/env python3
+
+import json
+import re
+
+f = open('/tmp/mapping.json', 'r')
+data = json.load(f)
+
+# Iterate over old mapping file, which is in format "find": "replace", ie:
+#
+#   "alan": "ALAN"
+#
+# And convert to proper dictionaries for import into Elasticsearch's Bulk API:
+#
+#   { "find": "alan", "replace": "ALAN" }
+#
+for find, replace in data.items():
+    # Skip all upper and all lower case strings because they are indicative of
+    # some AGROVOC or other mappings we no longer want to do
+    if find.isupper() or find.islower() or replace.isupper() or replace.islower():
+        continue
+
+    # Skip replacements with acronyms like:
+    #
+    #   International Livestock Research Institute - ILRI
+    #
+    acronym_pattern = re.compile(r"[A-Z]+$")
+    acronym_pattern_match = acronym_pattern.search(replace)
+    if acronym_pattern_match is not None:
+        continue
+
+    mapping = { "find": find, "replace": replace }
+
+    # Print command for Elasticsearch
+    print('{"index":{}}')
+    print(json.dumps(mapping))
+
+f.close()
+```
+
+- It filters all upper and lower case strings as well as any replacements that end in an acronym like "- ILRI", reducing the number of mappings from around 4,000 to about 900
+- I deleted the existing `openrxv-values` Elasticsearch index and then POSTed it:
+
+```
+$ ./convert-mapping.py > /tmp/elastic-mappings.txt
+$ curl -XDELETE http://localhost:9200/openrxv-values
+$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elastic-mappings.txt
+```
+
+- Then in AReS I didn't see the mappings in the dashboard until I added a new one manually, after which they all appeared
+  - I started a new harvest
+- I checked the CIMMYT DSpace repository and I see they have [the REST API enabled](https://repository.cimmyt.org/rest)
+  - The data doesn't look too bad actually: they have countries in title case, AGROVOC in upper case, CRPs, etc
+  - According to [their OAI](https://repository.cimmyt.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc) they have 6,500 items in the repository
+  - I would be interested to explore the possibility of harvesting them...
+- Bosede said they were having problems with the "Access" step during item submission
+  - I looked at the Munin graphs for PostgreSQL and both connections and locks look normal so I'm not sure what it could be
+  - I restarted the PostgreSQL service just to see if that would help
+- I ran the `dspace cleanup -v` process on CGSpace and got an error:
+
+```
+Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
+  Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".
+```
+
+- The solution is, as always:
+
+```
+$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
+UPDATE 1
+```
+
+- After looking at the CGSpace Solr stats for 2020-10 I found some hits to purge:
+
+```
+$ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
+
+Purging 2474 hits from ShortLinkTranslate in statistics
+Purging 2568 hits from RI\/1\.0 in statistics
+Purging 1851 hits from ILRI Livestock Website Publications importer BOT in statistics
+Purging 1282 hits from curl in statistics
+
+Total number of bot hits purged: 8174
+```
+
diff --git a/docs/2019-01/index.html b/docs/2019-01/index.html
index c014515b6..6ebd71b3c 100644
--- a/docs/2019-01/index.html
+++ b/docs/2019-01/index.html
@@ -26,7 +26,7 @@ I don’t see anything interesting in the web server logs around that time t
-
+
@@ -59,7 +59,7 @@ I don’t see anything interesting in the web server logs around that time t
   "url": "https://alanorth.github.io/cgspace-notes/2019-01/",
   "wordCount": "5532",
   "datePublished": "2019-01-02T09:48:30+02:00",
-  "dateModified": "2019-10-28T13:39:25+02:00",
+  "dateModified": "2020-10-19T15:23:30+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -791,7 +791,7 @@ sys 0m2.396s
  • After rebooting I notice that the Linode kernel went down from 4.19.8 to 4.18.16…
  • Atmire sent a quote on our ticket about purchasing the Metadata Quality Module (MQM) for DSpace 5.8
  • Abenet asked me for an OpenSearch query that could generate an RSS feed for items in the Livestock CRP
-  • According to my notes, sort_by=3 is accession date (as configured in `dspace.cfg)
+  • According to my notes, sort_by=3 is accession date (as configured in dspace.cfg)
  • The query currently shows 3023 items, but a Discovery search for Livestock CRP only returns 858 items
  • That query seems to return items tagged with Livestock and Fish CRP as well… hmm.
diff --git a/docs/2020-10/index.html b/docs/2020-10/index.html
index bccfbdce1..09b386f21 100644
--- a/docs/2020-10/index.html
+++ b/docs/2020-10/index.html
@@ -23,7 +23,7 @@ During the FlywayDB migration I got an error:
-
+
@@ -51,9 +51,9 @@ During the FlywayDB migration I got an error:
   "@type": "BlogPosting",
   "headline": "October, 2020",
   "url": "https://alanorth.github.io/cgspace-notes/2020-10/",
-  "wordCount": "2831",
+  "wordCount": "3789",
   "datePublished": "2020-10-06T16:55:54+03:00",
-  "dateModified": "2020-10-14T22:21:03+03:00",
+  "dateModified": "2020-10-15T18:11:00+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -598,6 +598,184 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/dc-subject.x
 real 88m21.678s
 user 7m59.182s
 sys 2m22.713s
+

    2020-10-18

    + +

    https://cgspace.cgiar.org/browse?type=crpsubject&value=Climate+Change%2C+Agriculture+and+Food+Security&XML&rpp=5000

    + +
    $ curl -f -H "User-Agent: CCAFS Website Publications importer BOT" -H "Content-Type: application/json" -X POST "https://dspacetest.cgiar.org/rest/items/find-by-metadata-field?limit=100" -d '{"key":"cg.contributor.crp", "value":"Climate Change, Agriculture and Food Security","language": "en_US"}'
    +
    +
    $ curl -XPOST "localhost:9200/openrxv-values/_delete_by_query" -H 'Content-Type: application/json' -d'
    +{
    +  "query": {
    +    "match": {
    +      "_id": "64j_THMBiwiQ-PKfCSlI"
    +    }
    +  }
    +}
    +'
    +
    +
    $ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
    +{
    +  "find": "ALAN1",
    +  "replace": "ALAN2"
    +}
    +'
    +
    +
    $ curl -XDELETE http://localhost:9200/openrxv-values
    +
    +
    $ curl -XPOST "localhost:9200/openrxv-values/_doc?pretty" -H 'Content-Type: application/json' -d'
    +{
    +  "find": "ALAN1",
    +  "replace": "ALAN2"
    +}
    +'
    +
    +
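Hand-typed JSON bodies are easy to get subtly wrong; strict JSON, for instance, does not allow a comma after the last member of an object. A minimal sketch, assuming Python 3 is available on the host, that builds the find/replace document with `json.dumps` rather than typing it by hand. The helper name `build_mapping_doc` is hypothetical and is not part of OpenRXV.

```python
import json


def build_mapping_doc(find: str, replace: str) -> str:
    """Serialize one find/replace pair as a strict JSON object."""
    return json.dumps({"find": find, "replace": replace})


body = build_mapping_doc("ALAN1", "ALAN2")
print(body)

# Python's strict parser rejects a comma after the last member of an object
try:
    json.loads('{"find": "ALAN1", "replace": "ALAN2",}')
except json.JSONDecodeError as e:
    print(f"invalid JSON: {e.msg}")
```

The resulting string could be used as the request body in place of a hand-typed one.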

    2020-10-19

    + +
    $ curl -XDELETE http://localhost:9200/openrxv-values
    +$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @./mapping.json
    +
    +
    {"index":{}}
    +{ "find": "CRP on Dryland Systems - DS", "replace": "Dryland Systems" }
    +{"index":{}}
    +{ "find": "FISH", "replace": "Fish" }
    +
    +
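The bulk format shown above pairs an action line with a document line, and the Elasticsearch Bulk API documentation requires the body to end with a newline. A rough sketch of producing such a body, assuming the mappings are held as `(find, replace)` tuples; `to_bulk_body` is a hypothetical helper, not something provided by OpenRXV or Elasticsearch.

```python
import json


def to_bulk_body(pairs):
    """Build a newline-delimited Bulk API body from (find, replace) pairs."""
    lines = []
    for find, replace in pairs:
        # Action line: index a new document and let Elasticsearch assign the _id
        lines.append(json.dumps({"index": {}}))
        # Source line: the document itself
        lines.append(json.dumps({"find": find, "replace": replace}))
    # Terminate every line, including the last one, with a newline
    return "".join(line + "\n" for line in lines)


body = to_bulk_body([
    ("CRP on Dryland Systems - DS", "Dryland Systems"),
    ("FISH", "Fish"),
])
print(body, end="")
```

Writing this string to a file would give something suitable for `--data-binary @file`, which, unlike `-d`, sends the file without stripping newlines.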
    #!/usr/bin/env python3
    +
    +import json
    +import re
    +
    +f = open('/tmp/mapping.json', 'r')
    +data = json.load(f)
    +
    +# Iterate over old mapping file, which is in format "find": "replace", ie:
    +#
    +#   "alan": "ALAN"
    +#
    +# And convert to proper dictionaries for import into Elasticsearch's Bulk API:
    +#
    +#   { "find": "alan", "replace": "ALAN" }
    +#
    +for find, replace in data.items():
    +    # Skip all upper and all lower case strings because they are indicative of
    +    # some AGROVOC or other mappings we no longer want to do
    +    if find.isupper() or find.islower() or replace.isupper() or replace.islower():
    +        continue
    +
    +    # Skip replacements with acronyms like:
    +    #
    +    #   International Livestock Research Institute - ILRI
    +    #
    +    acronym_pattern = re.compile(r"[A-Z]+$")
    +    acronym_pattern_match = acronym_pattern.search(replace)
    +    if acronym_pattern_match is not None:
    +        continue
    +
    +    mapping = { "find": find, "replace": replace }
    +
    +    # Print command for Elasticsearch
    +    print('{"index":{}}')
    +    print(json.dumps(mapping))
    +
    +f.close()
    +
    +
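The filtering rules in the script above can also be expressed as a small predicate, which makes them easier to try against individual pairs. This is only a sketch of the same two rules; the function name `keep_mapping` is hypothetical, and the last two sample pairs are illustrative rather than taken from the real mapping file.

```python
import re

# Matches replacements that end in upper-case letters, e.g. "... - ILRI"
ACRONYM_AT_END = re.compile(r"[A-Z]+$")


def keep_mapping(find: str, replace: str) -> bool:
    """Return True if a find/replace pair survives both filters."""
    # Drop pairs where either side is entirely upper or entirely lower case
    if find.isupper() or find.islower() or replace.isupper() or replace.islower():
        return False
    # Drop replacements that end in an acronym
    if ACRONYM_AT_END.search(replace) is not None:
        return False
    return True


samples = [
    ("CRP on Dryland Systems - DS", "Dryland Systems"),
    ("FISH", "Fish"),
    ("Intl Livestock Research Inst", "International Livestock Research Institute - ILRI"),
    ("Univ. of Nairobi", "University of Nairobi"),
]
for find, replace in samples:
    print(f"{find!r} -> {replace!r}: keep={keep_mapping(find, replace)}")
```

One thing this makes visible: the pattern `[A-Z]+$` matches any replacement whose final character is an upper-case letter, so a value ending in a single capital would be skipped as well.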
    $ ./convert-mapping.py > /tmp/elastic-mappings.txt
    +$ curl -XDELETE http://localhost:9200/openrxv-values
    +$ curl -XPOST http://localhost:9200/openrxv-values/_doc/_bulk -H "Content-Type: application/json" --data-binary @/tmp/elastic-mappings.txt
    +
    +
    Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
    +  Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".
    +
    +
    $ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (192921);'
    +UPDATE 1
    +
    +
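Since this cleanup error comes up repeatedly, the fix can be derived from the error text itself. A rough sketch, assuming the message keeps the format shown above; `build_fix_sql` is a hypothetical helper, and it only builds the statement as a string rather than running anything against the database.

```python
import re

# Matches the "Detail:" line of the foreign key error shown above
DETAIL_RE = re.compile(r"Key \(bitstream_id\)=\((\d+)\) is still referenced")


def build_fix_sql(error_text: str) -> str:
    """Build the UPDATE that clears primary_bitstream_id for the reported IDs."""
    ids = DETAIL_RE.findall(error_text)
    if not ids:
        raise ValueError("no bitstream_id found in error text")
    # The regex only captures digits, so joining the IDs into SQL is safe here
    return (
        "update bundle set primary_bitstream_id=NULL "
        f"where primary_bitstream_id in ({', '.join(ids)});"
    )


error = '''Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(192921) is still referenced from table "bundle".'''

sql = build_fix_sql(error)
print(sql)
```

The printed statement is the same one passed to `psql -c` above; reviewing it before running it keeps the manual step in the loop.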
    $ ./check-spider-hits.sh -f /tmp/agents -s statistics -u http://localhost:8083/solr -p
    +
    +Purging 2474 hits from ShortLinkTranslate in statistics
    +Purging 2568 hits from RI\/1\.0 in statistics
    +Purging 1851 hits from ILRI Livestock Website Publications importer BOT in statistics
    +Purging 1282 hits from curl in statistics
    +
    +Total number of bot hits purged: 8174
     
diff --git a/docs/categories/index.html b/docs/categories/index.html
index e096a24f9..ee7c9d038 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index 3c1f85b50..680d83ec3 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index ff1a79473..f2d4d8e6d 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 27625e880..45d2722d4 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 6fc3c7b2a..11ec1e73d 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index eff6d5540..2aa959819 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 85d6199db..67c12a019 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 8bdb75626..38610caee 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index 2833756bc..fe3e8fb25 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index 713b123d1..cfa7b1774 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 79edd2199..1d3a91740 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/7/index.html b/docs/page/7/index.html
index 8ad89a52c..0c3ff519d 100644
--- a/docs/page/7/index.html
+++ b/docs/page/7/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 525fe91c6..d0fdd6188 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index e9177ae99..cab1ada97 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index d5aa8b89e..865cee404 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index 2deb354b1..62aff9f44 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 9a98147b0..c6e100bc9 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index 5152784f9..917f1ef1b 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html
index 68df87e7f..fee7f1f3e 100644
--- a/docs/posts/page/7/index.html
+++ b/docs/posts/page/7/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 77eba8b3e..a7b4bebdd 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
 https://alanorth.github.io/cgspace-notes/categories/
-2020-10-14T22:21:03+03:00
+2020-10-19T15:23:30+03:00
 https://alanorth.github.io/cgspace-notes/
-2020-10-14T22:21:03+03:00
+2020-10-19T15:23:30+03:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
-2020-10-14T22:21:03+03:00
+2020-10-19T15:23:30+03:00
 https://alanorth.github.io/cgspace-notes/2020-10/
-2020-10-14T22:21:03+03:00
+2020-10-15T18:11:00+03:00
 https://alanorth.github.io/cgspace-notes/posts/
-2020-10-14T22:21:03+03:00
+2020-10-19T15:23:30+03:00
@@ -144,7 +144,7 @@
 https://alanorth.github.io/cgspace-notes/2019-01/
-2019-10-28T13:39:25+02:00
+2020-10-19T15:23:30+03:00