Add notes for 2021-06-17

2025-01-27 05:49:12 +01:00 · 2021-06-17 18:16:18 +03:00
parent fb2bd040a7
commit a6d606ca0e
25 changed files with 99 additions and 30 deletions
--- a/content/posts/2021-06.md
+++ b/content/posts/2021-06.md
@@ -85,4 +85,35 @@ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhos
 $ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
 ```

+- The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it's much faster
+  - I harvested 90,000+ items from DSpace Test in ~3 hours
+  - There seem to be some issues with the health check step though
+
+## 2021-06-17
+
+- I ported my ilri/resolve-addresses.py script, which uses IPAPI.co, to use the local GeoIP2 databases instead
+  - The new script is ilri/resolve-addresses-geoip2.py; it is much faster and works offline, with no API rate limits
+- Teams meeting with the CGIAR Metadata Working Group to discuss CGSpace, open repositories, and the way forward
+- More work with Moayad on OpenRXV harvesting issues
+  - Using a JSON export from elasticdump, we debugged the duplicate checker plugin and found that there are indeed duplicates (90,459 handles in total but only 90,380 unique, with some appearing up to three times):
+
+```console
+$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
+90459
+$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
+90380
+$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
+...
+      2 "10568/99409"
+      2 "10568/99410"
+      2 "10568/99411"
+      2 "10568/99516"
+      3 "10568/102093"
+      3 "10568/103524"
+      3 "10568/106664"
+      3 "10568/106940"
+      3 "10568/107195"
+      3 "10568/96546"
+```
+
 <!-- vim: set sw=2 ts=2: -->
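
The duplicate check in the console session added by this commit could also be sketched in Python. This is only an illustration, not part of the commit: the function names `count_handles`, `duplicated`, and `report` are hypothetical, and the sketch assumes the export contains handles in the same `"handle":"<digits>/<digits>"` form that the `grep -oE` pattern matches.

```python
import re
from collections import Counter

# Same pattern as the grep -oE call: "handle":"<digits>/<digits>"
HANDLE_RE = re.compile(r'"handle":"([0-9]+/[0-9]+)"')


def count_handles(lines):
    """Count how often each handle appears in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(HANDLE_RE.findall(line))
    return counts


def duplicated(counts):
    """Return (handle, count) pairs seen more than once, lowest count first."""
    dupes = [(handle, n) for handle, n in counts.items() if n > 1]
    return sorted(dupes, key=lambda pair: (pair[1], pair[0]))


def report(path):
    """Print total and unique handle counts, then each duplicated handle.

    `path` is assumed to point at an elasticdump JSON export.
    """
    with open(path, encoding="utf-8") as f:
        counts = count_handles(f)
    print(sum(counts.values()), "handles,", len(counts), "unique")
    for handle, n in duplicated(counts):
        print(n, handle)
```

Calling `report("openrxv-items_data.json")` would print the two totals that the `wc -l` commands produce, followed by the duplicated handles in roughly the same order as `sort | uniq -c | sort -h`, with a single pass over the file instead of three.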