CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

June, 2021

2021-06-01

  • IWMI notified me that AReS was down with an HTTP 502 error
    • Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
    • I don’t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the angular_nginx container isn’t running
    • I simply started it and AReS was running again:
$ docker-compose -f docker/docker-compose.yml start angular_nginx
  • Margarita from CCAFS emailed me to say that workflow alerts haven’t been working lately
    • I guess this is related to the SMTP issues last week
    • I had fixed the config, but didn’t restart Tomcat, so DSpace didn’t load the new variables
    • I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers

2021-06-03

  • Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI’s content into the new AMCOW Knowledge Hub
    • At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time
    • That’s when I thought they could perhaps use the OpenSearch endpoint, but I can’t remember if it’s possible to limit by community, or only by collection…
    • Looking now, I see there is a “scope” parameter that can be used for community or collection, for example:
https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&scope=10568/16814&order=DESC&rpp=100&sort_by=2&start=1
  • That will sort by date issued (see: webui.itemlist.sort-option.2 in dspace.cfg), give 100 results per page, and start on item 1
  • Otherwise, another alternative would be to use the IWMI CSV that we are already exporting every week
  • Fill out the CGIAR-AGROVOC Task Group’s “Survey on the current CGIAR use of AGROVOC” on behalf of CGSpace
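
For reference, the OpenSearch query parameters described above could be assembled like this (a quick sketch; the base URL and community handle are the real ones from the example, everything else is just the documented parameters):

```python
from urllib.parse import urlencode

# Build the same scoped OpenSearch query as in the example above
base = "https://cgspace.cgiar.org/open-search/discover"
params = {
    "query": "subject:water scarcity",
    "scope": "10568/16814",  # community (or collection) handle
    "order": "DESC",
    "rpp": 100,              # results per page
    "sort_by": 2,            # date issued (webui.itemlist.sort-option.2)
    "start": 1,
}
url = f"{base}?{urlencode(params)}"
print(url)
```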

2021-06-06

  • The Elasticsearch indexes are messed up, so I dumped and re-created them correctly:
curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
  • Then I started a harvesting on AReS

2021-06-07

  • The harvesting on AReS completed successfully
  • Provide feedback to FAO on how we use AGROVOC for their “AGROVOC call for use cases”

2021-06-10

  • Skype with Moayad to discuss AReS harvesting improvements
    • He will work on a plugin that reads the XML sitemap to get all item IDs and checks whether we have them or not
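
The idea behind that plugin could look something like this (a rough sketch, not Moayad’s actual implementation; the inline sitemap and handles are made up):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def item_ids_from_sitemap(xml_text):
    """Extract item handles from a DSpace sitemap (URLs like .../handle/10568/1234)."""
    root = ET.fromstring(xml_text)
    handles = set()
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = loc.text or ""
        if "/handle/" in url:
            handles.add(url.split("/handle/", 1)[1])
    return handles

# A tiny inline sitemap for illustration
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://cgspace.cgiar.org/handle/10568/1</loc></url>
  <url><loc>https://cgspace.cgiar.org/handle/10568/2</loc></url>
</urlset>"""

already_harvested = {"10568/1"}
# The IDs the harvester still needs to fetch
missing = item_ids_from_sitemap(sample) - already_harvested
print(missing)
```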

2021-06-14

  • Dump and re-create indexes on AReS (as above) so I can do a harvest

2021-06-16

  • Looking at the Solr statistics on CGSpace for last month I see many requests from hosts with seemingly normal Windows browser user agents, but with the MSN bot’s reverse DNS
    • For example, user agent Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0) with DNS msnbot-131-253-25-91.search.msn.com.
    • I queried Solr for all hits using the MSN bot DNS (dns:*msnbot* AND dns:*.msn.com.) and found 457,706 of them
    • I extracted their IPs using Solr’s CSV format and ran them through my resolve-addresses.py script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)
    • Note that Microsoft’s docs say that reverse lookups on Bingbot IPs will always have “search.msn.com”, so it is safe to purge these as non-human traffic
    • I purged the hits with ilri/check-spider-ip-hits.sh (though I had to do it in 3 batches because I forgot to increase the facet.limit so I was only getting them 100 at a time)
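
Microsoft’s rule above reduces to a suffix check on the reverse DNS hostname. A minimal sketch (the is_bingbot_dns helper is hypothetical, not part of the ilri scripts):

```python
def is_bingbot_dns(hostname):
    """Per Microsoft's Bingbot verification docs, genuine Bingbot IPs
    reverse-resolve to hostnames under search.msn.com."""
    # Solr stores the reverse DNS with a trailing dot, so strip it first
    return hostname.rstrip(".").endswith(".search.msn.com")

print(is_bingbot_dns("msnbot-131-253-25-91.search.msn.com."))  # True
print(is_bingbot_dns("crawl-66-249-66-1.googlebot.com."))      # False
```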
  • Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV
    • It will hopefully also fix the duplicate and missing items issues
    • I had a Skype with him to discuss
    • I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tried (and failed) to run:
$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
  • The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it’s much faster
    • I harvested 90,000+ items from DSpace Test in ~3 hours
    • There seem to be some issues with the health check step though

2021-06-17

  • I ported my ilri/resolve-addresses.py script from the IPAPI.co web service to the local GeoIP2 databases
    • The new script is ilri/resolve-addresses-geoip2.py and it is much faster and works offline with no API rate limits
  • Teams meeting with the CGIAR Metadata Working group to discuss CGSpace and open repositories and the way forward
  • More work with Moayad on OpenRXV harvesting issues
    • Using a JSON export from elasticdump, we debugged the duplicate checker plugin and found that there are indeed duplicates:
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
90459
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
90380
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
...
      2 "10568/99409"
      2 "10568/99410"
      2 "10568/99411"
      2 "10568/99516"
      3 "10568/102093"
      3 "10568/103524"
      3 "10568/106664"
      3 "10568/106940"
      3 "10568/107195"
      3 "10568/96546"
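
The same duplicate check could be done in Python instead of grep/sort/uniq (a sketch against a small inline sample rather than the real openrxv-items_data.json export):

```python
import re
from collections import Counter

# Count duplicate handles the same way as the shell pipeline above;
# the sample data here is made up for illustration
sample = '[{"handle":"10568/99409"},{"handle":"10568/99409"},{"handle":"10568/1"}]'
handles = re.findall(r'"handle":"(\d+/\d+)"', sample)
counts = Counter(handles)
dupes = {h: n for h, n in counts.items() if n > 1}
print(len(handles), len(counts), dupes)
```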