mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-01-30
@@ -188,5 +188,37 @@ $ grep -E '^2022-01*' /var/log/postgresql/postgresql-10-main.log | grep -c 'stil
- I included the id because I will need a unique field to join the resulting list of non-duplicates with the original CSV where the rest of the metadata and filenames are
- Since these items are not in DSpace yet, I generated simple numeric IDs in OpenRefine using this GREL transform: `row.index + 1`
- Then I ran `check-duplicates.py` on items 1–200 and sent the resulting CSV to Gaia
- Delete one duplicate item I saw in IITA's Journal Articles that was uploaded earlier in WLE
- Also do some general cleanup on IITA's Journal Articles collection in OpenRefine
- Delete one duplicate item I saw in ILRI's Journal Articles collection
- Also do some general cleanup on ILRI's Journal Articles collection in OpenRefine and csv-metadata-quality
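
The id-based join mentioned above could look roughly like the following Python sketch. It is only an illustration: the column names (`id`, `dc.title`, `filename`) and the in-memory sample data are hypothetical stand-ins, not the real CSV files.

```python
import csv
import io

# Hypothetical stand-ins for the two files: the original CSV with all the
# metadata and filenames, and the list of non-duplicates that came back.
original_csv = io.StringIO(
    "id,dc.title,filename\n"
    "1,First item,one.pdf\n"
    "2,Second item,two.pdf\n"
    "3,Third item,three.pdf\n"
)
non_duplicates_csv = io.StringIO("id,dc.title\n1,First item\n3,Third item\n")


def join_on_id(original, non_duplicates):
    """Return the rows of the original CSV whose id is in the non-duplicates list."""
    keep_ids = {row["id"] for row in csv.DictReader(non_duplicates)}
    return [row for row in csv.DictReader(original) if row["id"] in keep_ids]


joined = join_on_id(original_csv, non_duplicates_csv)
for row in joined:
    # Each surviving row still carries the full metadata, including the filename
    print(row["id"], row["filename"])
```

Because the ids are simple row numbers generated before the duplicate check, they only stay valid as long as the original CSV is not re-sorted or re-exported with different row order in between.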

## 2022-01-29

- I did some more cleanup on the ILRI Journal Articles
- I added missing journal titles for items that had ISSNs
- Then I added pages for items that had them in the citation
- First, I faceted the citation field based on whether or not the item had something like ": 232-234" present:

```console
value.contains(/:\s?\d+(-|–)\d+/)
```

- Then I faceted by blank on `dcterms.extent` and did a transform to extract the page information for over 1,000 items!

```console
'p. ' +
cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*:\s?(\d+)(-|–)(\d+).*/)[0] +
'-' +
cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*:\s?(\d+)(-|–)(\d+).*/)[2]
```
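
For testing the pattern outside OpenRefine, the same facet and transform can be approximated in Python. This is a sketch rather than an exact port: it assumes `re.fullmatch` is a fair stand-in for GREL's `match` (which has to match the whole string), and the function names and sample citations are made up. Note that GREL indexes the returned capture groups from zero, while Python's `group(0)` is the whole match, so GREL `[0]` and `[2]` become `group(1)` and `group(3)`.

```python
import re

# Same patterns as in the GREL expressions above
HAS_PAGES = re.compile(r":\s?\d+(-|–)\d+")
PAGES = re.compile(r".*:\s?(\d+)(-|–)(\d+).*")


def has_pages(citation):
    """Facet test: does the citation contain something like ': 232-234'?"""
    return HAS_PAGES.search(citation) is not None


def extract_extent(citation):
    """Build a 'p. first-last' string, or None when no page range is found."""
    m = PAGES.fullmatch(citation)
    if m is None:
        return None
    # group(1) is the first page and group(3) the last page; group(2) is the
    # hyphen or en dash, which gets normalized to a plain hyphen here
    return "p. " + m.group(1) + "-" + m.group(3)


print(extract_extent("Journal of Blah 16(1): 232-234"))
```

One thing to keep in mind is that the leading `.*` is greedy, so when a citation contains more than one colon followed by a number range, the last one wins.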

- Then I did similar for `cg.volume` and `cg.issue`, also based on the citation, for example to extract the "16" from "Journal of Blah 16(1)", where "16" is the second capture group in a zero-based match:

```console
cells['dcterms.bibliographicCitation[en_US]'].value.match(/.*( |;)(\d+)\((\d+)\).*/)[1]
```
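
A rough Python equivalent of the volume and issue extraction, under the same assumption that `re.fullmatch` approximates GREL's `match`. Only the volume expression (index `1`) appears above; taking the issue from the third capture group is inferred from the pattern, and the function name is hypothetical.

```python
import re

# Same pattern as in the GREL expression above
VOLUME_ISSUE = re.compile(r".*( |;)(\d+)\((\d+)\).*")


def extract_volume_issue(citation):
    """Return (volume, issue) as strings, or None if the citation has no 'NN(N)' part."""
    m = VOLUME_ISSUE.fullmatch(citation)
    if m is None:
        return None
    # GREL index 1 (the volume) is Python group(2); the issue is assumed to be
    # GREL index 2, which is Python group(3)
    return m.group(2), m.group(3)


print(extract_volume_issue("Journal of Blah 16(1)"))
```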

- This was 3,000 items so I imported the changes on CGSpace 1,000 at a time...
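
The notes do not say how the changes were split up, so the following is only one possible way to cut a list of changed rows into batches of 1,000 in Python. The helper name and the placeholder rows are hypothetical.

```python
def batches(rows, size=1000):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Placeholder rows standing in for the 3,000 changed items
rows = [{"id": str(n)} for n in range(1, 3001)]
sizes = [len(batch) for batch in batches(rows)]
print(sizes)
```

Each batch could then be written to its own CSV with `csv.DictWriter`, repeating the header row in every file so that each one can be imported on its own.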

<!-- vim: set sw=2 ts=2: -->