Add notes for 2022-08-24

2022-08-24 21:24:07 -07:00
parent 64d5b998f9
commit 5084b5ca5e
29 changed files with 131 additions and 34 deletions

@@ -248,4 +248,53 @@ $ dspace import --add --eperson=fuu@fuu.com --source /tmp/SimpleArchiveFormat --
- I created a [GitHub issue for OpenRXV compatibility issues with DSpace 7](https://github.com/ilri/OpenRXV/issues/133)
## 2022-08-24
- Start working on the MARLO OICRs
- First I extracted the filenames and IDs from the v2 metadata file, then joined it with the UTF-8 version:
```console
$ xsv select 'cg.number (series/report No.),File' OICRS\ Metadata\ v2.csv > /tmp/OICR-files.csv
$ xsv join --left 'cg.number (series/report No.)' OICRS\ metadata\ utf8\ 20220816_JM.csv 'cg.number (series/report No.)' /tmp/OICR-files.csv > OICRs-UTF-8-with-files.csv
```
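- For reference, a minimal Python sketch of what a left join on a shared key column does. The `left_join` helper and the sample rows are hypothetical and only illustrate the idea; unlike the `xsv` output, this sketch keeps a single copy of the key column:

```python
import csv
import io


def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on `key`; unmatched left rows get empty right fields."""
    right_fields = [f for f in (right_rows[0].keys() if right_rows else []) if f != key]

    # Index the right-hand rows by key so each lookup is a dict access
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left_rows:
        # Fall back to one blank right-hand row when there is no match
        matches = index.get(row[key]) or [dict.fromkeys(right_fields, "")]
        for match in matches:
            merged = dict(row)
            merged.update({f: match[f] for f in right_fields})
            joined.append(merged)
    return joined


# Hypothetical sample data, not the real OICR metadata
left = list(csv.DictReader(io.StringIO("id,title\n1,A\n2,B\n")))
right = list(csv.DictReader(io.StringIO("id,File\n1,a.pdf\n")))
for row in left_join(left, right, "id"):
    print(row)
```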
- After that I imported it into OpenRefine for data cleaning
- To enrich the metadata I combined the title and abstract into a new field and then checked my list of 11,000 AGROVOC terms against it
- First, create a new column with this GREL:
```console
cells["dc.title"].value + " " + cells["dcterms.abstract"].value
```
- Then use this Jython:
```python
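import re

# Load the AGROVOC terms, one per line, lowercased
with open(r"/tmp/agrovoc-subjects.txt", "r") as f:
    terms = [name.rstrip().lower() for name in f]

# Return every term that occurs as a whole word in the cell value
return "||".join([term for term in terms if re.search(r"\b" + re.escape(term) + r"\b", value.lower())])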
```
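- To test the matching outside OpenRefine, a minimal sketch of the same whole-word check as a standalone Python 3 function. The name `match_terms` and the sample terms are hypothetical; the sketch escapes each term with `re.escape` so that punctuation in a term is not read as regex syntax:

```python
import re


def match_terms(text, terms):
    """Return the terms that occur as whole words in text, joined by '||'."""
    lowered = text.lower()
    return "||".join(
        term for term in terms
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


# Hypothetical sample data, not the real AGROVOC list
sample_terms = ["maize", "rice", "al"]
print(match_terms("Maize yields in rice-based systems", sample_terms))
print(match_terms("Smith et al. report higher yields", sample_terms))
```

- The second call matches "al" through "et al.", which is probably how short terms like that end up in the results and need manual cleanup afterwards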
- After that I de-duplicated the terms using this Jython:
```python
# De-duplicate the terms while keeping their first-seen order
res = []
for x in value.split("||"):
    if x not in res:
        res.append(x)
return "||".join(res)
```
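- Outside OpenRefine, the same order-preserving de-duplication can be sketched more compactly in Python 3.7+, where dicts keep insertion order. This relies on that guarantee, so it is an assumption that would not hold in OpenRefine's Jython 2.7; the `dedupe` name is hypothetical:

```python
def dedupe(value, sep="||"):
    """Drop repeated items from a separator-joined string, keeping first-seen order."""
    # dict.fromkeys keeps one entry per item, in insertion order (Python 3.7+)
    return sep.join(dict.fromkeys(value.split(sep)))


print(dedupe("maize||rice||maize"))
```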
- Then I split the multi-values on "||" and used a text facet to remove some countries and other nonsense terms that matched, like "gates", "al", and "s"
- Then I did the same for countries
- Then I exported the CSV and started searching for duplicates so that I can add them as relations:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-24-OICRs.csv -u dspace -db dspace -p 'omg' -o /tmp/oicrs-matches.csv
```
- Oh wow, I actually found one OICR already uploaded to CGSpace... I have to ask Jose about that
<!-- vim: set sw=2 ts=2: -->