June, 2017

Thu Jun 01, 2017 by Alan Orth in Notes

2017-06-01

After discussion with WLE and CGSpace content people, we decided to just add one metadata field for the WLE Research Themes
The cg.identifier.wletheme field will be used for both Phase I and Phase II Research Themes
Then we’ll create a new sub-community for Phase II and create collections for the research themes there
The current “Research Themes” community will be renamed to “WLE Phase I Research Themes”
Tagged all items in the current Phase I collections with their appropriate themes
Create pull request to add Phase II research themes to the submission form: #328
Add cg.subject.system to CGSpace metadata registry, for subjects from the upcoming CGIAR Library migration

2017-06-04

After adding cg.identifier.wletheme to 1106 WLE items I can see the field on XMLUI but not in REST!
Strangely it happens on DSpace Test AND on CGSpace!
I tried to re-index Discovery but it didn’t fix it
Run all system updates on DSpace Test and reboot the server
After rebooting the server (and therefore restarting Tomcat) the new metadata field is available
I’ve sent a message to the dspace-tech mailing list to ask if this is a bug and whether I should file a Jira ticket
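
One way to confirm whether the field is visible over REST is to look for its key in an item's metadata. This is a minimal sketch, assuming the REST response has already been parsed into a list of objects that each carry a `key` and a `value`; that shape, the helper name `has_metadata_field`, and the sample values are assumptions for illustration, not captured output:

```python
import json

# Hypothetical sample of an item's metadata as parsed JSON (not real output).
SAMPLE_RESPONSE = """
[
  {"key": "dc.title", "value": "Example title", "language": "en"},
  {"key": "cg.identifier.wletheme", "value": "Example theme", "language": null}
]
"""


def has_metadata_field(metadata_entries, field):
    """Return True if any metadata entry uses the given field key."""
    return any(entry.get("key") == field for entry in metadata_entries)


entries = json.loads(SAMPLE_RESPONSE)
field_visible = has_metadata_field(entries, "cg.identifier.wletheme")
```

Running the same check before and after the Tomcat restart would show whether the field only appears once the application has been restarted.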

2017-06-05

Rename WLE’s “Research Themes” sub-community to “WLE Phase I Research Themes” on DSpace Test so Macaroni Bros can continue their testing
Macaroni Bros tested it and said it’s fine, so I renamed it on CGSpace as well
Working on how to automate the extraction of the CIAT Book chapters, doing some magic in OpenRefine to extract the first page and the number of pages from cg.identifier.url and dc.format.extent, respectively:
- cg.identifier.url: value.split("page=", "")[1]
- dc.format.extent: value.replace("p. ", "").split("-")[1].toNumber() - value.replace("p. ", "").split("-")[0].toNumber()
Finally, after some filtering to see which small outliers there were (based on dc.format.extent using “p. 1-14” vs “29 p.”), create a new column with last page number:
- cells["dc.page.from"].value.toNumber() + cells["dc.format.pages"].value.toNumber()
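
The page arithmetic above can be sketched outside OpenRefine too. This is a rough Python equivalent, not the expressions actually used: the function names and the example URL are hypothetical, and it only handles the "p. 1-14" style of dc.format.extent (the "29 p." style raises a ValueError and would need separate handling):

```python
import re


def first_page(url):
    """Return the number following 'page=' in the URL."""
    match = re.search(r"page=(\d+)", url)
    if match is None:
        raise ValueError(f"no page= parameter in {url!r}")
    return int(match.group(1))


def page_span(extent):
    """Return end minus start for an extent like 'p. 1-14' (here 13)."""
    start, end = extent.replace("p. ", "").split("-")
    return int(end) - int(start)


def last_page(url, extent):
    """Return the last page in the source PDF: first page plus the span."""
    return first_page(url) + page_span(extent)
```

For example, with a hypothetical URL ending in `page=14` and an extent of `p. 1-14`, `last_page` gives 14 + 13 = 27.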
Then create a new, unique file name to be used in the output, based on a SHA1 of the dc.title and with a description:
- dc.page.to: value.split(" ")[0].replace(",","").toLowercase() + "-" + sha1(value).get(1,9) + ".pdf__description:" + cells["dc.type"].value
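
The file name scheme can be mirrored in Python as a sanity check. This is a minimal sketch under two assumptions: that OpenRefine's `sha1()` hashes the UTF-8 bytes of the string and returns a hex digest, and that `get(1,9)` takes characters 1 through 8 of that digest. The function name `output_filename` is made up for the example:

```python
import hashlib


def output_filename(title, item_type):
    """Build '<first word>-<8 hex chars>.pdf__description:<type>' from a title."""
    # First word of the title, commas removed, lowercased.
    first_word = title.split(" ")[0].replace(",", "").lower()
    # Characters 1..8 of the SHA1 hex digest keep names unique per title.
    digest = hashlib.sha1(title.encode("utf-8")).hexdigest()
    return f"{first_word}-{digest[1:9]}.pdf__description:{item_type}"
```

Two chapters whose titles start with the same word still get different names, because the digest fragment depends on the whole title.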
Start processing 769 records after filtering on the following (another 159 records use some other format, or for example have their own PDF, and I will process those later), using a modified generate-thumbnails.py script to read certain fields and then pass them to GhostScript:
- cg.identifier.url: value.contains("page=")
- dc.format.extent: or(value.contains("p. "),value.contains(" p."))
- Command like: $ gs -dNOPAUSE -dBATCH -dFirstPage=14 -dLastPage=27 -sDEVICE=pdfwrite -sOutputFile=beans.pdf -f 12605-1.pdf
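
The GhostScript call above could be assembled from a script along these lines. This is only a sketch of the idea, not the modified generate-thumbnails.py itself; the function names are hypothetical and the flags are copied from the command shown above:

```python
import subprocess


def build_gs_command(source_pdf, output_pdf, first_page, last_page):
    """Return the argument list for extracting a page range with gs."""
    return [
        "gs",
        "-dNOPAUSE",
        "-dBATCH",
        f"-dFirstPage={first_page}",
        f"-dLastPage={last_page}",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={output_pdf}",
        "-f",
        source_pdf,
    ]


def extract_pages(source_pdf, output_pdf, first_page, last_page):
    """Run gs and raise CalledProcessError if it exits with an error."""
    command = build_gs_command(source_pdf, output_pdf, first_page, last_page)
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a file name contains spaces.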
17 of the items have issues with incorrect page number ranges, and upon closer inspection they do not appear in the referenced PDF
I’ve flagged them and proceeded without them (752 total) on DSpace Test:

$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" [dspace]/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/93843 --source /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ --mapfile=/tmp/ciat-books.map &> /tmp/ciat-books.log

I went and did some basic sanity checks on the remaining items in the CIAT Book Chapters and decided they are mostly fine (except one duplicate and the flagged ones), so I imported them to DSpace Test too (162 items)
The CIAT Book Chapters collection now has 914 items in total; the remaining records were flagged for one reason or another, and we should send those back to CIAT
Restart Tomcat on CGSpace so that the cg.identifier.wletheme field is available on REST API for Macaroni Bros