Add notes for 2021-10-06

2021-10-07 08:27:39 +03:00
parent 63500a8837
commit b55e6c9efe
26 changed files with 144 additions and 32 deletions

@@ -120,4 +120,52 @@ $ ./ilri/check-spider-ip-hits.sh -f /tmp/robot-ips.txt -p
Total number of bot hits purged: 465119
```
## 2021-10-06
- Thinking about how we could check for duplicates before importing
- I found out that [PostgreSQL has a similarity function](https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/), provided by the `pg_trgm` extension:
```console
localhost/dspace63= > CREATE EXTENSION pg_trgm;
localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
metadata_value_id │ text_value │ dspace_object_id
───────────────────┼────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────
3652624 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ b7f0bf12-b183-4b2f-bbd2-7a5697b0c467
3677663 │ Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines │ fb62f551-f4a5-4407-8cdc-6bff6dac399e
(2 rows)
```
- I was able to find an exact duplicate for an IITA item by searching for its title (I already knew that these existed)
- I started working on a basic Python script to do this and managed to find an actual duplicate in the recent IWMI items
- I think I will check for similar titles, and if I find them I will print out the handles for verification
- I could also proceed to check other metadata like type because those shouldn't vary too much
- I ran my new `check-duplicates.py` script on the 292 non-IWMI publications from Udana and found twelve potential duplicates
- Upon checking them manually, I found that 7/12 were indeed already present on CGSpace!
- This is with the similarity threshold at 0.5. I wonder if raising it would make the script run faster and eliminate some false positives
- I re-ran it with higher thresholds, which eliminated all false positives, but it still took about 24 minutes to run for 292 items!
- 0.6: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 24:40.42 total
- 0.7: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.12s user 0.03s system 0% cpu 24:29.15 total
- 0.8: ./ilri/check-duplicates.py -i ~/Downloads/2021-10-03-non-IWMI-publications.cs 0.09s user 0.03s system 0% cpu 25:44.13 total
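
To make the threshold concrete, here is a minimal pure-Python sketch of how a trigram similarity score like the one from `SIMILARITY()` can be computed. It is an approximation of what `pg_trgm` does (lowercase, split into alphanumeric words, pad each word, compare trigram sets), not the code in `check-duplicates.py`, and the function names `trigrams`, `similarity`, and `find_similar` are made up for illustration:

```python
import re


def trigrams(text):
    """Approximate pg_trgm trigram extraction.

    Lowercase the text, split it into alphanumeric words, pad each word
    with two leading spaces and one trailing space, then collect every
    three-character window.
    """
    grams = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def find_similar(title, candidates, threshold=0.5):
    """Return (candidate, score) pairs scoring above the threshold."""
    results = []
    for candidate in candidates:
        score = similarity(title, candidate)
        if score > threshold:
            results.append((candidate, score))
    return results
```

Titles that differ only in case or punctuation score 1.0 here, which is why exact duplicates survive any threshold. The nearly identical run times at 0.6, 0.7, and 0.8 would be consistent with the score being computed for every row regardless of the cutoff; `pg_trgm` also has a `%` operator that can use a GIN or GiST trigram index, though I have not tested whether that would help here.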
- Some minor updates to csv-metadata-quality
- Fix two issues with regular expressions in the duplicate items and experimental language checks
- Add a check for items that have a DOI listed in their citation, but are missing a standalone DOI field
- Then I ran this new version of csv-metadata-quality on an export of IWMI's community, minus some fields I don't want to check:
```console
$ csvcut -C 'dc.date.accessioned,dc.date.accessioned[],dc.date.accessioned[en_US],dc.date.available,dc.date.available[],dc.date.available[en_US],dcterms.issued[en_US],dcterms.issued[],dcterms.issued,dc.description.provenance[en],dc.description.provenance[en_US],dc.identifier.uri,dc.identifier.uri[],dc.identifier.uri[en_US],dcterms.abstract[en_US],dcterms.bibliographicCitation[en_US],collection' ~/Downloads/iwmi.csv > /tmp/iwmi-to-check.csv
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi.csv
```
- I noticed each CSV only had 10 or 20 corrections, and in particular none of the duplicate metadata values were removed in the CSVs...
- I cut a subset of the fields from the main CSV and tried again, but DSpace said "no changes detected"
- The duplicates are definitely removed from the CSV, but DSpace doesn't detect them
- I realized this is an issue I've had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!
- I found a comment in a thread on the dspace-tech mailing list from helix84 in 2015 ("No changes were detected" when importing metadata via XMLUI) where he says:
> It's very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
> Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.
- Shit, so that's worth looking into...
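
A small Python sketch of what helix84 describes, assuming the comparison really is set-based (I have not confirmed this in the DSpace code). It splits a multi-value cell on `||`, the separator DSpace uses in metadata CSVs, and compares old and new values both ways. The helper names and the author values are hypothetical:

```python
SEPARATOR = "||"  # multi-value separator in DSpace metadata CSVs


def split_values(cell):
    """Split a CSV cell into its individual metadata values."""
    return [value for value in cell.split(SEPARATOR) if value]


def changed_as_list(old, new):
    """Ordered comparison: duplicates and ordering both matter."""
    return split_values(old) != split_values(new)


def changed_as_set(old, new):
    """Unordered comparison: duplicates and ordering are ignored."""
    return set(split_values(old)) != set(split_values(new))


# Hypothetical values: the old cell has a duplicated author
old = "Doe, J.||Doe, J.||Roe, R."
new = "Doe, J.||Roe, R."

print(changed_as_list(old, new))  # True
print(changed_as_set(old, new))   # False
```

If the import compares values the second way, removing a duplicate or reordering values looks like "no changes", which would match what I saw and would explain why the two-import workaround helps.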
<!-- vim: set sw=2 ts=2: -->