Add notes for 2022-08-19

2022-08-19 21:55:36 -07:00
parent fc0a9ad944
commit daf4a646ed
29 changed files with 105 additions and 34 deletions


@ -122,4 +122,38 @@ $ csvjoin -c 'cg.number (series/report No.)' MELIAs\ metadata\ utf8\ 20220816_JM
- Dedupe value pairs and controlled vocabularies before writing them
- Sort the controlled vocabularies before writing them (we don't do this for value pairs because some are added in specific order, like CRPs)
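- A minimal Python sketch of the two steps above, assuming the value pairs and vocabulary terms are plain lists of strings (the function names are hypothetical and not taken from the actual script):

```python
def dedupe_keep_order(values):
    """Remove duplicates but keep first-seen order (for value pairs)."""
    # dicts preserve insertion order, so the first occurrence of each value wins
    return list(dict.fromkeys(values))


def dedupe_and_sort(terms):
    """Remove duplicates and sort (for controlled vocabularies)."""
    return sorted(set(terms))
```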
## 2022-08-19
- Peter Ballantyne sent me metadata for 311 Gender items that need to be duplicate checked on CGSpace before uploading
- I spent half an hour in OpenRefine fixing the dates because they only had YYYY, but most abstracts and titles had more specific information about the date
- Then I checked for duplicates:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/gender-ppts-xlsx.csv -u dspace -db dspace -p 'fuuu' -o /tmp/gender-duplicates.csv
```
- I sent the list of ~130 possible duplicates to Peter to check
- Jose sent new versions of the MARLO Innovation/MELIA/OICR/Policy PDFs
- The idea was to replace tinyurl links pointing to MARLO, but I still see many tinyurl links, some of which point to CGIAR Sharepoint and require a login
- I asked them why they don't just use the original links in the first place in case tinyurl.com disappears
- I continued working on the MARLO MELIA v2 UTF-8 metadata
- I did the same metadata enrichment exercise to extract countries and AGROVOC subjects from the abstract field that I did earlier this month, using a Jython expression to match terms in copies of the abstract field
- It helps to first replace some characters with spaces using this GREL: `value.replace(/[.\/;(),]/, " ")`
- This caught some extra AGROVOC terms, but unfortunately we only check for single-word terms
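- A rough plain-Python sketch of this kind of matching, assuming the vocabulary is a list of terms and that matching is case-insensitive on whole words (this is not the Jython expression I actually used, and `clean` and `match_terms` are hypothetical names):

```python
import re


def clean(text):
    # Python equivalent of the GREL above: replace punctuation with spaces
    return re.sub(r"[./;(),]", " ", text)


def match_terms(abstract, vocabulary):
    """Return vocabulary terms that appear as whole words in the abstract."""
    words = {word.lower() for word in clean(abstract).split()}
    # Splitting on whitespace means multi-word terms can never match,
    # which is the single-word limitation mentioned above
    return sorted(term for term in vocabulary if term.lower() in words)
```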
- Then I checked for existing items on CGSpace matching these MELIAs using my duplicate checker:
```console
$ ./ilri/check-duplicates.py -i ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv -u dspace -db dspace -p 'fuuu' -o /tmp/melia-matches.csv
```
- Then I did some minor processing and checking of the duplicates file (for example, some titles appear more than once in both files), and joined with the other file (left join):
```console
$ xsv join --left id ~/Downloads/2022-08-18-MELIAs-UTF-8-With-Files.csv id ~/Downloads/melia-matches-csv.csv > /tmp/melias-with-relations.csv
```
- I had to use `xsv` because `csvcut` was throwing an error detecting the dialect of the input CSVs (?)
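- For reference, a left join can be sketched in plain Python, assuming each CSV has been read into a list of dicts (for example with `csv.DictReader`); unlike `xsv join`, this hypothetical `left_join` merges columns instead of repeating the key column:

```python
def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on key, keeping unmatched left rows."""
    # Index the right-hand rows by key; a key can have several rows
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left_rows:
        matches = index.get(row[key])
        if not matches:
            # No match on the right: keep the left row unchanged
            joined.append(dict(row))
            continue
        for match in matches:
            # Only add right-hand columns that the left row lacks
            extra = {k: v for k, v in match.items() if k not in row}
            joined.append({**row, **extra})
    return joined
```

- Note that a key appearing more than once on the right produces one output row per match, which is why repeated entries are worth checking before joining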
- I created a SAF bundle and imported the 749 MELIAs to DSpace Test
- I found thirteen items on CGSpace with dates in the format "DD/MM/YYYY", so I fixed those
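- A small Python sketch of normalizing such dates, assuming the target is ISO 8601 (YYYY-MM-DD) and that values in any other format should be left untouched (`to_iso_date` is a hypothetical helper, not how the items were actually fixed):

```python
from datetime import datetime


def to_iso_date(value):
    """Convert DD/MM/YYYY to YYYY-MM-DD; return other values unchanged."""
    try:
        return datetime.strptime(value, "%d/%m/%Y").date().isoformat()
    except ValueError:
        # Not in DD/MM/YYYY format (for example YYYY or YYYY-MM), leave as-is
        return value
```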
<!-- vim: set sw=2 ts=2: -->