$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
```

- The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100, and it's much faster
- I harvested 90,000+ items from DSpace Test in ~3 hours
- There seem to be some issues with the health check step, though

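For a sense of scale, the figures above work out to roughly 500 items per minute (simple integer arithmetic from the numbers in the notes, not a measured rate):

```shell
# Back-of-envelope rate implied by the figures above:
# ~90,000 items in ~3 hours is about 500 items per minute
echo "$((90000 / (3 * 60))) items/minute"
```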
## 2021-06-17

- I ported my `ilri/resolve-addresses.py` script, which uses IPAPI.co, to use the local GeoIP2 databases
- The new script is `ilri/resolve-addresses-geoip2.py` and it is much faster and works offline with no API rate limits
- Teams meeting with the CGIAR Metadata Working Group to discuss CGSpace and open repositories and the way forward
- More work with Moayad on OpenRXV harvesting issues
- Using a JSON export from elasticdump we debugged the duplicate checker plugin and found that there are indeed duplicates:

```console
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
90459
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
90380
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
...
      2 "10568/99409"
      2 "10568/99410"
      2 "10568/99411"
      2 "10568/99516"
      3 "10568/102093"
      3 "10568/103524"
      3 "10568/106664"
      3 "10568/106940"
      3 "10568/107195"
      3 "10568/96546"
```
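The two counts differ by 79 (90459 − 90380), so there are seventy-nine extra copies spread across the duplicated handles. To list only the duplicated handles themselves rather than counting everything, `uniq -d` prints just the repeated lines; a quick sketch against a made-up sample file (the handles and file path here are hypothetical):

```shell
# Hypothetical sample mimicking the structure of openrxv-items_data.json,
# with one handle appearing twice
cat > /tmp/sample-items.json <<'EOF'
{"handle":"10568/1","title":"first item"}
{"handle":"10568/2","title":"second item"}
{"handle":"10568/2","title":"second item, harvested again"}
EOF

# Same extraction pipeline as above, but uniq -d prints only the
# lines that occur more than once
grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' /tmp/sample-items.json \
  | awk -F: '{print $2}' \
  | sort | uniq -d
```

Here only `"10568/2"` is printed, since it is the only repeated handle in the sample.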
<!-- vim: set sw=2 ts=2: -->