Add notes for 2024-05-20

2024-05-20 17:34:14 +03:00
parent 28a0c82e96
commit 39d8d0876c
150 changed files with 239 additions and 193 deletions

@ -83,4 +83,27 @@ $ csvgrep -c 'dc.description.provenance[en_US]' -m 'CONTENTdm' cgspace.csv \
https://dspace7test.ilri.org/server/api/pid/find?id=10568/118424
```
## 2024-05-15
- I used Crossref to get journal titles for 2,900 journal articles that were missing them
## 2024-05-16
Helping IFPRI with some DSpace 7 API support; these are two queries for items issued in 2024:
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued:2024
- https://dspace7test.ilri.org/server/api/discover/search/objects?query=dcterms.issued_dt%3A%5B2024-01-01T00%3A00%3A00Z%20TO%20%2A%5D (note that the Lucene search syntax here is the URL-encoded version of `:[2024-01-01T00:00:00Z TO *]`)
Both of them return the same number of results and seem identical as far as I can see, but the second one uses Solr date indexes and requires the full Lucene datetime and range syntax
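For reference, the URL encoding of the second query can be reproduced with Python's standard library. This is only a sketch: `BASE_URL` and `build_search_url` are names I made up for illustration, and it only builds the URL rather than sending a request.

```python
from urllib.parse import quote

# Discovery search endpoint from the queries above
BASE_URL = "https://dspace7test.ilri.org/server/api/discover/search/objects"


def build_search_url(query: str) -> str:
    """Return the search URL with the Lucene query percent-encoded.

    safe="" makes quote() encode everything except unreserved characters,
    so ":", "[", "]", "*" and spaces are all escaped.
    """
    return f"{BASE_URL}?query={quote(query, safe='')}"


# Range query against the Solr date index: 2024-01-01 onwards
range_query = "dcterms.issued_dt:[2024-01-01T00:00:00Z TO *]"
range_url = build_search_url(range_query)
print(range_url)
```

Note that with `safe=""` the colon in a simple query like `dcterms.issued:2024` is also encoded as `%3A`, which should be equivalent to the unencoded form in the first URL.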
I wrote a new version of the `check_duplicates.py` script to help identify duplicates with different types
- Initially I called it `check_duplicates_fast.py` but it's actually not faster
- I need to find a way to deal with duplicates from IFPRI's repository because there are some mismatched types...
## 2024-05-20
Continue working through alternative duplicate matching for IFPRI
- Their item types are sometimes different than ours...
- The default similarity factor in my script is 0.6, but I rarely see legitimate duplicates with a similarity that low, so I might increase it to 0.7 to reduce the number of items I have to check
- Also, the maximum difference in issue dates is currently 365 days, but I should reduce that a bit, perhaps to 270 days (9 months)
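To make the effect of those two thresholds concrete, here is a rough sketch of the filtering logic. It is not the actual `check_duplicates.py` code: the function name, the parameters, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions for illustration, and the real script may compute similarity differently.

```python
from datetime import date
from difflib import SequenceMatcher

# Proposed values from the notes above (currently 0.6 and 365)
SIMILARITY_THRESHOLD = 0.7
MAX_DAYS_APART = 270


def is_candidate_duplicate(
    title_a: str,
    issued_a: date,
    title_b: str,
    issued_b: date,
    threshold: float = SIMILARITY_THRESHOLD,
    max_days: int = MAX_DAYS_APART,
) -> bool:
    """Return True if two items are similar enough to check by hand.

    Item types are deliberately not compared, since they are sometimes
    mismatched between the two repositories.
    """
    # Stand-in similarity measure: ratio of matching characters, 0.0 to 1.0
    similarity = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    # Absolute difference between the two issue dates, in days
    days_apart = abs((issued_a - issued_b).days)
    return similarity >= threshold and days_apart <= max_days
```

Raising the threshold and narrowing the date window both shrink the list of candidates, at the risk of missing a real duplicate whose title or issue date differs more than expected.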
<!-- vim: set sw=2 ts=2: -->