Add notes for 2022-08-24

2025-01-27 05:49:12 +01:00 · 2022-08-24 21:24:07 -07:00
parent 64d5b998f9
commit 5084b5ca5e
29 changed files with 131 additions and 34 deletions
--- a/content/posts/2022-08.md
+++ b/content/posts/2022-08.md
@@ -248,4 +248,53 @@ $ dspace import --add --eperson=fuu@fuu.com --source /tmp/SimpleArchiveFormat --

 - I created a [GitHub issue for OpenRXV compatibility issues with DSpace 7](https://github.com/ilri/OpenRXV/issues/133)

+## 2022-08-24
+
+- Start working on the MARLO OICRs
+  - First I extracted the filenames and IDs from the v2 metadata file, then joined them with the UTF-8 version of the metadata:
+
+```console
+$ xsv select 'cg.number (series/report No.),File' OICRS\ Metadata\ v2.csv > /tmp/OICR-files.csv
+$ xsv join --left 'cg.number (series/report No.)' OICRS\ metadata\ utf8\ 20220816_JM.csv 'cg.number (series/report No.)' /tmp/OICR-files.csv > OICRs-UTF-8-with-files.csv
+```
+
+- After that I imported it into OpenRefine for data cleaning
+  - To enrich the metadata I combined the title and abstract into a new field and then checked my list of 11,000 AGROVOC terms against it
+  - First, create a new column with this GREL:
+
+```text
+cells["dc.title"].value + " " + cells["dcterms.abstract"].value
+```
+
+- Then use this Jython on the new column to list the terms that match:
+
+```python
+import re
+
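+with open(r"/tmp/agrovoc-subjects.txt", "r") as f:
+    # One term per line; skip blank lines, which would otherwise match almost any text
+    terms = [name.strip().lower() for name in f if name.strip()]
+
+# Escape each term and search for it as a whole word; unlike re.match with ".*",
+# re.search also finds terms that appear after a line break
+text = value.lower()
+return "||".join([term for term in terms if re.search(r"\b" + re.escape(term) + r"\b", text)])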
+```
+
+- After that I de-duplicated the terms using this Jython:
+
+```python
+res = []
+
+# Keep the first occurrence of each term, preserving order
+for x in value.split("||"):
+    if x not in res:
+        res.append(x)
+
+return "||".join(res)
+```
+
+- Then I split the multi-values on "||" and used a text facet to remove some countries and other spurious terms that matched, like "gates", "al", and "s"
+  - Then I did the same for countries
+- Then I exported the CSV and started searching for duplicates so that I could add them as relations:
+
+```console
+$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-24-OICRs.csv -u dspace -db dspace -p 'omg' -o /tmp/oicrs-matches.csv
+```
+
+- Oh wow, I actually found one OICR already uploaded to CGSpace... I have to ask Jose about that
+
 <!-- vim: set sw=2 ts=2: -->
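
For reference, the AGROVOC matching and de-duplication steps from the 2022-08-24 notes could probably be combined into one standalone Python 3 function outside OpenRefine. This is only a minimal sketch: `match_terms`, `to_multivalue`, and the sample terms and text are hypothetical names and values for illustration, not part of the original workflow.

```python
import re


def match_terms(text, terms):
    """Return the terms found in text as whole words, de-duplicated, in first-seen order."""
    haystack = text.lower()
    seen = {}  # dict keys keep insertion order, so this doubles as an ordered set
    for term in terms:
        term = term.strip().lower()
        if not term:
            continue  # a blank term would match almost any text
        # re.escape guards against terms containing regex metacharacters
        if re.search(r"\b" + re.escape(term) + r"\b", haystack):
            seen.setdefault(term, None)
    return list(seen)


def to_multivalue(values, separator="||"):
    """Join values with the multi-value separator used in the notes."""
    return separator.join(values)


# Hypothetical sample data, not taken from the real AGROVOC list or OICR metadata
sample_terms = ["maize", "Maize", "rice", "al", "climate change"]
sample_text = "Drought-tolerant maize and climate change adaptation in Kenya"

print(to_multivalue(match_terms(sample_text, sample_terms)))
```

Because the terms are normalized before matching, a duplicate such as "Maize" collapses into "maize" in the same pass, so a separate de-duplication step is not needed in this variant.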