Update notes for 2020-08-20

2025-01-27 05:49:12 +01:00 · 2020-08-20 17:35:11 +03:00
parent d2c037d0de
commit ebe0cea35b
20 changed files with 88 additions and 25 deletions
--- a/content/posts/2020-08.md
+++ b/content/posts/2020-08.md
@@ -463,6 +463,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H
  - Furthermore, it seems that each item is curated once for each collection it appears in, causing about 115,000 items to be processed, even though we only have about 87,000
 - I had been running the tasks on the entire repository with `-i 10568/0`, but I think I might need to try again with the special `all` option before writing to the dspace-tech mailing list for help
  - Actually I just tested the `all` option on DSpace 5.8 and it still does many of the items multiple times, once for each of their mappings
+  - I sent a message to the dspace-tech mailing list
 - I finished the Atmire stats processing on all cores on DSpace Test:
  - statistics:
    - 2,040,385 docs: 2h 28m 49s
@@ -495,4 +496,32 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H
  - On both my local test and DSpace Test I get no results when searching for "Orth, A." and "Orth, Alan" or even Delia Grace, but the Discovery index is up to date and I have eighteen items...
  - I sent a message to Atmire...

+## 2020-08-20
+
+- Natalia from CIAT was asking how she can download all the PDFs for the items in a search result
+  - The search result is for the keyword "trade off" in the WLE community
+  - I converted the Discovery search to an OpenSearch query to get the results as XML, but they don't all fit on one page, so I set `rpp` to 100 and made several requests with increasing `start` values to get them all:
+
+```
+$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=0' User-Agent:'curl' > /tmp/wle-trade-off-page1.xml
+$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=100' User-Agent:'curl' > /tmp/wle-trade-off-page2.xml
+$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=200' User-Agent:'curl' > /tmp/wle-trade-off-page3.xml
+```
+
+- Ugh, and to extract the `<id>` from each `<entry>` we have to use an XPath query with a [hack that ignores the default namespace by matching on each element's local name](http://blog.powered-up-games.com/wordpress/archives/70):
+
+```
+$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page1.xml >> /tmp/ids.txt
+$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page2.xml >> /tmp/ids.txt
+$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page3.xml >> /tmp/ids.txt
+$ sort -u /tmp/ids.txt > /tmp/ids-sorted.txt
+$ grep -oE '[0-9]+/[0-9]+' /tmp/ids-sorted.txt > /tmp/handles.txt
+```
+
+- Now I have all the handles for the matching items and I can use the REST API to get each item's PDFs...
+  - I wrote `get-wle-pdfs.py` to read the handles from a text file and get all PDFs: https://github.com/ilri/DSpace/blob/5_x-prod/get-wle-pdfs.py
+- Add `Foreign, Commonwealth and Development Office, United Kingdom` to the controlled vocabulary for sponsors on CGSpace
+  - This is the new name for DFID as of 2020-09-01
+  - We will continue using DFID for older items
+
 <!-- vim: set sw=2 ts=2: -->
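
The paging and handle-extraction steps in the 2020-08-20 notes above could also be done from Python. This is only a minimal sketch, assuming Python 3.8+ (for the `{*}` namespace wildcard in ElementTree) and that each `<id>` contains the item's handle somewhere in its text. The names `page_urls` and `extract_handles` are hypothetical and are not taken from `get-wle-pdfs.py`; fetching the pages over HTTP is left out on purpose.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL as used in the notes above.
BASE = "https://cgspace.cgiar.org/open-search/discover"


def page_urls(base, scope, query, pages, rpp=100):
    """Build one open-search URL per page of `rpp` results.

    `pages` is supplied by the caller; this sketch does not try to
    discover the total number of results on its own.
    """
    urls = []
    for start in range(0, pages * rpp, rpp):
        params = {"scope": scope, "query": query, "rpp": rpp, "start": start}
        # urlencode() escapes "/" as %2F and spaces as "+".
        urls.append(f"{base}?{urlencode(params)}")
    return urls


def extract_handles(xml_text):
    """Return the unique handles found in <entry>/<id>, in document order.

    "{*}" matches an element in any (or no) namespace, which plays the
    same role as the local-name() trick used with xmllint.
    """
    root = ET.fromstring(xml_text)
    seen = {}
    for id_el in root.findall(".//{*}entry/{*}id"):
        match = re.search(r"[0-9]+/[0-9]+", id_el.text or "")
        if match:
            # dict keys keep insertion order and drop duplicates.
            seen.setdefault(match.group(0), None)
    return list(seen)


if __name__ == "__main__":
    for url in page_urls(BASE, "10568/34494", "trade off", pages=3):
        print(url)
```

De-duplicating while extracting removes the need for a separate `sort -u` step, and matching on `{*}` avoids hard-coding the feed's namespace URI.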