mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-08-24
@ -248,4 +248,53 @@ $ dspace import --add --eperson=fuu@fuu.com --source /tmp/SimpleArchiveFormat --
- I created a [GitHub issue for OpenRXV compatibility issues with DSpace 7](https://github.com/ilri/OpenRXV/issues/133)

## 2022-08-24

- Start working on the MARLO OICRs
- First I extracted the filenames and IDs from the v2 metadata file, then joined it with the UTF-8 version:

```console
$ xsv select 'cg.number (series/report No.),File' OICRS\ Metadata\ v2.csv > /tmp/OICR-files.csv
$ xsv join --left 'cg.number (series/report No.)' OICRS\ metadata\ utf8\ 20220816_JM.csv 'cg.number (series/report No.)' /tmp/OICR-files.csv > OICRs-UTF-8-with-files.csv
```
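For readers without `xsv`, the left join above can be approximated with Python's `csv` module. This is only a sketch: the `left_join` helper, the in-memory sample rows, and the `OICR-1`/`OICR-2` values are made up for illustration, and unlike `xsv join` it emits the key column once rather than once per input.

```python
import csv
import io


def left_join(left_rows, right_rows, key):
    """Left-join two lists of dict rows on `key`, keeping every left row."""
    # Index the right-hand rows by key; a key may occur more than once
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    # Columns contributed by the right-hand side, in their original order
    right_only = [c for c in (right_rows[0] if right_rows else {}) if c != key]

    joined = []
    for row in left_rows:
        # Unmatched left rows get empty strings for the right-hand columns
        matches = index.get(row[key]) or [dict.fromkeys(right_only, "")]
        for match in matches:
            merged = dict(row)
            merged.update({c: match[c] for c in right_only})
            joined.append(merged)
    return joined


KEY = "cg.number (series/report No.)"

# Tiny in-memory stand-ins for the two CSV files (hypothetical values)
metadata = list(csv.DictReader(io.StringIO(KEY + ",dc.title\nOICR-1,First\nOICR-2,Second\n")))
files = list(csv.DictReader(io.StringIO(KEY + ",File\nOICR-1,first.pdf\n")))

rows = left_join(metadata, files, KEY)
```

Every row of the metadata file survives the join, and rows without a matching file end up with an empty `File` value, which makes the gaps easy to facet on later.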

- After that I imported it into OpenRefine for data cleaning
- To enrich the metadata I combined the title and abstract into a new field and then checked my list of 11,000 AGROVOC terms against it
- First, create a new column with this GREL:

```console
cells["dc.title"].value + " " + cells["dcterms.abstract"].value
```
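
- Then use this Jython:

```python
import re

# Load the AGROVOC terms, one per line, lowercased for case-insensitive matching
with open(r"/tmp/agrovoc-subjects.txt", 'r') as f:
    terms = [name.rstrip().lower() for name in f]

# Keep every term that occurs as a whole word in the combined title and abstract;
# re.escape() treats any regex metacharacters in a term literally
return "||".join([term for term in terms if re.search(r"\b" + re.escape(term) + r"\b", value.lower())])
```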
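The whole-word matching idea can also be tried outside OpenRefine with plain Python. The sketch below is illustrative only: `match_terms`, the four-term vocabulary, and the sample sentence are hypothetical and not taken from the real AGROVOC list. It uses `re.search` with `re.escape` so that terms containing regex metacharacters are treated literally.

```python
import re


def match_terms(text, terms):
    """Return the terms that occur as whole words in text, joined with "||"."""
    text = text.lower()
    found = []
    for term in terms:
        # \b anchors the term at word boundaries, so "rice" does not match "price"
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, text):
            found.append(term.lower())
    return "||".join(found)


# Hypothetical vocabulary and text
terms = ["maize", "rice", "al", "soil"]
text = "Maize yields and soil health, after Smith et al."

matched = match_terms(text, terms)
```

Short vocabulary entries such as "al" still match here because "et al." contains them as a standalone word, which is why a manual clean-up pass over the matches is still needed afterwards.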

- After that I de-duplicated the terms using this Jython:

```python
res = []

# Keep only the first occurrence of each term, preserving order
for x in value.split("||"):
    if x not in res:
        res.append(x)

return "||".join(res)
```
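Outside OpenRefine, on Python 3.7 or newer, the same order-preserving de-duplication can be written more compactly. This is a sketch under that assumption; the `dedupe_multivalue` name is made up, and it would not be safe in OpenRefine's Jython, which is based on Python 2.7 where plain dicts do not guarantee insertion order.

```python
def dedupe_multivalue(value, separator="||"):
    """Drop repeated entries from a separator-joined string, keeping first-seen order."""
    # dict keys are unique and keep insertion order in Python 3.7+
    return separator.join(dict.fromkeys(value.split(separator)))


deduped = dedupe_multivalue("maize||soil||maize||rice||soil")
```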

- Then I split the multi-values on "||" and used a text facet to remove some countries and other nonsense terms that matched, like "gates" and "al" and "s"
- Then I did the same for countries
- Then I exported the CSV and started searching for duplicates so that I can add them as relations:

```console
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-24-OICRs.csv -u dspace -db dspace -p 'omg' -o /tmp/oicrs-matches.csv
```

- Oh wow, I actually found one OICR already uploaded to CGSpace... I have to ask Jose about that

<!-- vim: set sw=2 ts=2: -->