From ebe0cea35b631a41376ec29b1ce3db7005951894 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Thu, 20 Aug 2020 17:35:11 +0300
Subject: [PATCH] Update notes for 2020-08-20

---
 content/posts/2020-08.md                | 29 ++++++++++++++++++
 docs/2020-08/index.html                 | 40 +++++++++++++++++++++++--
 docs/categories/index.html              |  2 +-
 docs/categories/notes/index.html        |  2 +-
 docs/categories/notes/page/2/index.html |  2 +-
 docs/categories/notes/page/3/index.html |  2 +-
 docs/categories/notes/page/4/index.html |  2 +-
 docs/index.html                         |  2 +-
 docs/page/2/index.html                  |  2 +-
 docs/page/3/index.html                  |  2 +-
 docs/page/4/index.html                  |  2 +-
 docs/page/5/index.html                  |  2 +-
 docs/page/6/index.html                  |  2 +-
 docs/posts/index.html                   |  2 +-
 docs/posts/page/2/index.html            |  2 +-
 docs/posts/page/3/index.html            |  2 +-
 docs/posts/page/4/index.html            |  2 +-
 docs/posts/page/5/index.html            |  2 +-
 docs/posts/page/6/index.html            |  2 +-
 docs/sitemap.xml                        | 10 +++----
 20 files changed, 88 insertions(+), 25 deletions(-)

diff --git a/content/posts/2020-08.md b/content/posts/2020-08.md
index 4842f9e7f..a4638e756 100644
--- a/content/posts/2020-08.md
+++ b/content/posts/2020-08.md
@@ -463,6 +463,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H
   - Furthermore, it seems that each item is curated once for each collection it appears in, causing about 115,000 items to be processed, even though we only have about 87,000
   - I had been running the tasks on the entire repository with `-i 10568/0`, but I think I might need to try again with the special `all` option before writing to the dspace-tech mailing list for help
   - Actually I just tested the `all` option on DSpace 5.8 and it still does many of the items multiple times, once for each of their mappings
+  - I sent a message to the dspace-tech mailing list
 - I finished the Atmire stats processing on all cores on DSpace Test:
   - statistics:
     - 2,040,385 docs: 2h 28m 49s
@@ -495,4 +496,32 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H
 - On both my local test and DSpace Test I get no results when searching for "Orth, A." and "Orth, Alan" or even Delia Grace, but the Discovery index is up to date and I have eighteen items...
 - I sent a message to Atmire...
+## 2020-08-20
+
+- Natalia from CIAT was asking how she can download all the PDFs for the items in a search result
+  - The search result is for the keyword "trade off" in the WLE community
+  - I converted the Discovery search to an open-search query to extract the XML, but we can't get all the results on one page so I had to change the `rpp` to 100 and request a few times to get them all:
+
+```
+$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=0' User-Agent:'curl' > /tmp/wle-trade-off-page1.xml
+$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=100' User-Agent:'curl' > /tmp/wle-trade-off-page2.xml
+$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=200' User-Agent:'curl' > /tmp/wle-trade-off-page3.xml
+```
+
+- Ugh, and to extract the `<id>` from each `<entry>` we have to use an XPath query, but use a [hack to ignore the default namespace by setting each element's local name](http://blog.powered-up-games.com/wordpress/archives/70):
+
+```
+$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page1.xml >> /tmp/ids.txt
+$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page2.xml >> /tmp/ids.txt
+$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page3.xml >> /tmp/ids.txt
+$ sort -u /tmp/ids.txt > /tmp/ids-sorted.txt
+$ grep -oE '[0-9]+/[0-9]+' /tmp/ids.txt > /tmp/handles.txt
+```
+
+- Now I have all the handles for the matching items and I can use the REST API to get each item's PDFs...
+  - I wrote `get-wle-pdfs.py` to read the handles from a text file and get all PDFs: https://github.com/ilri/DSpace/blob/5_x-prod/get-wle-pdfs.py
+- Add `Foreign, Commonwealth and Development Office, United Kingdom` to the controlled vocabulary for sponsors on CGSpace
+  - This is the new name for DFID as of 2020-09-01
+  - We will continue using DFID for older items
+
diff --git a/docs/2020-08/index.html b/docs/2020-08/index.html
index 12081ce43..640092822 100644
--- a/docs/2020-08/index.html
+++ b/docs/2020-08/index.html
@@ -19,7 +19,7 @@ It is class based so I can easily add support for other vocabularies, and the te
-
+
@@ -43,9 +43,9 @@ It is class based so I can easily add support for other vocabularies, and the te
   "@type": "BlogPosting",
   "headline": "August, 2020",
   "url": "https://alanorth.github.io/cgspace-notes/2020-08/",
-  "wordCount": "3168",
+  "wordCount": "3406",
   "datePublished": "2020-08-02T15:35:54+03:00",
-  "dateModified": "2020-08-14T11:22:16+03:00",
+  "dateModified": "2020-08-19T22:08:33+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -632,6 +632,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
 • I had been running the tasks on the entire repository with -i 10568/0, but I think I might need to try again with the special all option before writing to the dspace-tech mailing list for help
 • Actually I just tested the all option on DSpace 5.8 and it still does many of the items multiple times, once for each of their mappings
+• I sent a message to the dspace-tech mailing list
 • I finished the Atmire stats processing on all cores on DSpace Test:
@@ -702,6 +703,39 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
+2020-08-20
+• Natalia from CIAT was asking how she can download all the PDFs for the items in a search result
+  • The search result is for the keyword “trade off” in the WLE community
+  • I converted the Discovery search to an open-search query to extract the XML, but we can’t get all the results on one page so I had to change the rpp to 100 and request a few times to get them all:
+$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=0' User-Agent:'curl' > /tmp/wle-trade-off-page1.xml
+$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=100' User-Agent:'curl' > /tmp/wle-trade-off-page2.xml
+$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=200' User-Agent:'curl' > /tmp/wle-trade-off-page3.xml
+• Ugh, and to extract the <id> from each <entry> we have to use an XPath query, but use a hack to ignore the default namespace by setting each element’s local name:
+$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page1.xml >> /tmp/ids.txt
+$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page2.xml >> /tmp/ids.txt
+$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page3.xml >> /tmp/ids.txt
+$ sort -u /tmp/ids.txt > /tmp/ids-sorted.txt
+$ grep -oE '[0-9]+/[0-9]+' /tmp/ids.txt > /tmp/handles.txt
+• Now I have all the handles for the matching items and I can use the REST API to get each item’s PDFs…
+  • I wrote get-wle-pdfs.py to read the handles from a text file and get all PDFs: https://github.com/ilri/DSpace/blob/5_x-prod/get-wle-pdfs.py
+• Add Foreign, Commonwealth and Development Office, United Kingdom to the controlled vocabulary for sponsors on CGSpace
+  • This is the new name for DFID as of 2020-09-01
+  • We will continue using DFID for older items
diff --git a/docs/categories/index.html b/docs/categories/index.html
index e04042c96..75900c458 100644
--- a/docs/categories/index.html
+++ b/docs/categories/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html
index c60a9ac71..7d1bcc27c 100644
--- a/docs/categories/notes/index.html
+++ b/docs/categories/notes/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html
index 4dbcf07ee..39c7352fe 100644
--- a/docs/categories/notes/page/2/index.html
+++ b/docs/categories/notes/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html
index 5fe9e526b..49ca1a2ae 100644
--- a/docs/categories/notes/page/3/index.html
+++ b/docs/categories/notes/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html
index 0702e011e..147c96190 100644
--- a/docs/categories/notes/page/4/index.html
+++ b/docs/categories/notes/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/index.html b/docs/index.html
index ab4c20224..902fc9c4e 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/2/index.html b/docs/page/2/index.html
index 45311881b..d246ebd6f 100644
--- a/docs/page/2/index.html
+++ b/docs/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/3/index.html b/docs/page/3/index.html
index 6b6292df3..0b433459a 100644
--- a/docs/page/3/index.html
+++ b/docs/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/4/index.html b/docs/page/4/index.html
index fcd9a69c2..28dd5087a 100644
--- a/docs/page/4/index.html
+++ b/docs/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/5/index.html b/docs/page/5/index.html
index ed9730c4c..b39c94a19 100644
--- a/docs/page/5/index.html
+++ b/docs/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/page/6/index.html b/docs/page/6/index.html
index 764829627..083ce682a 100644
--- a/docs/page/6/index.html
+++ b/docs/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/index.html b/docs/posts/index.html
index 3838e4141..00486069e 100644
--- a/docs/posts/index.html
+++ b/docs/posts/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html
index 33bf699f3..8a92f1d2d 100644
--- a/docs/posts/page/2/index.html
+++ b/docs/posts/page/2/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html
index 324cc9862..7b13b3d73 100644
--- a/docs/posts/page/3/index.html
+++ b/docs/posts/page/3/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html
index fc5581cec..c5335d05a 100644
--- a/docs/posts/page/4/index.html
+++ b/docs/posts/page/4/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html
index 00359c4de..b0f60ea14 100644
--- a/docs/posts/page/5/index.html
+++ b/docs/posts/page/5/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html
index b110f8cda..9e76aa1fb 100644
--- a/docs/posts/page/6/index.html
+++ b/docs/posts/page/6/index.html
@@ -9,7 +9,7 @@
-
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 1464d06fa..702e3f72f 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
 https://alanorth.github.io/cgspace-notes/2020-08/
-2020-08-14T11:22:16+03:00
+2020-08-19T22:08:33+03:00
 https://alanorth.github.io/cgspace-notes/categories/
-2020-08-14T11:22:16+03:00
+2020-08-19T22:08:33+03:00
 https://alanorth.github.io/cgspace-notes/
-2020-08-14T11:22:16+03:00
+2020-08-19T22:08:33+03:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
-2020-08-14T11:22:16+03:00
+2020-08-19T22:08:33+03:00
 https://alanorth.github.io/cgspace-notes/posts/
-2020-08-14T11:22:16+03:00
+2020-08-19T22:08:33+03:00
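
The 2020-08-20 notes page through the open-search results with `rpp` and `start`, then pull a handle out of the id of each entry. A minimal Python sketch of those two steps follows, in case it is useful to script them in one go. The helper names (`page_urls`, `local_name`, `extract_handles`) and the `total` argument are hypothetical and not taken from `get-wle-pdfs.py`. The result count of 300 is only an assumption based on the three pages requested above. Matching on local names mirrors the `local-name()` XPath trick rather than assuming a particular namespace URI.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint as used in the notes; the helpers below are illustrative only.
OPEN_SEARCH_URL = "https://cgspace.cgiar.org/open-search/discover"


def page_urls(scope, query, total, rpp=100):
    """Return one open-search URL per page of `rpp` results (hypothetical helper)."""
    return [
        OPEN_SEARCH_URL + "?" + urlencode(
            {"scope": scope, "query": query, "rpp": rpp, "start": start}
        )
        for start in range(0, total, rpp)
    ]


def local_name(tag):
    """Drop ElementTree's '{namespace}' prefix, similar to XPath local-name()."""
    return tag.rsplit("}", 1)[-1]


def extract_handles(xml_text):
    """Collect unique handles from the id child of each entry, in document order."""
    handles = []
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "entry":
            continue
        for child in element:
            if local_name(child.tag) != "id":
                continue
            # Same pattern as the grep in the notes: prefix/suffix digits.
            match = re.search(r"[0-9]+/[0-9]+", child.text or "")
            if match and match.group(0) not in handles:
                handles.append(match.group(0))
    return handles


if __name__ == "__main__":
    # total=300 is an assumed result count, not a figure from the notes.
    for url in page_urls("10568/34494", "trade off", total=300):
        print(url)
```

Each saved page could then be read as bytes and passed to `extract_handles`, which also removes duplicates, so the separate `sort -u` step would not be needed in this variant.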