CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

June, 2021

2021-06-01

  • IWMI notified me that AReS was down with an HTTP 502 error
    • Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
    • I don’t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the angular_nginx container isn’t running
    • I simply started it and AReS was running again:
$ docker-compose -f docker/docker-compose.yml start angular_nginx
  • Margarita from CCAFS emailed me to say that workflow alerts haven’t been working lately
    • I guess this is related to the SMTP issues last week
    • I had fixed the config, but didn’t restart Tomcat so DSpace didn’t load the new variables
    • I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers

2021-06-03

  • Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI’s content into the new AMCOW Knowledge Hub
    • At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time
    • That’s when I thought they could perhaps use DSpace’s OpenSearch endpoint, but I can’t remember if it’s possible to limit by community or only by collection…
    • Looking now, I see there is a “scope” parameter that can be used for community or collection, for example:
https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&scope=10568/16814&order=DESC&rpp=100&sort_by=2&start=1
  • That will sort by date issued (see: webui.itemlist.sort-option.2 in dspace.cfg), give 100 results per page, and start on item 1
  • Otherwise, another alternative would be to use the IWMI CSV that we are already exporting every week
  • Fill out the CGIAR-AGROVOC Task Group’s “Survey on the current CGIAR use of AGROVOC” on behalf of CGSpace

2021-06-06

  • The Elasticsearch indexes are messed up so I dumped and re-created them correctly:
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XPUT 'http://localhost:9200/openrxv-items-final'
$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
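  • For completeness, the dumps restored above would have been created beforehand with elasticdump as well, roughly like this (dumping from the openrxv-items alias is an assumption):
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000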
  • Then I started a harvesting on AReS

2021-06-07

  • The harvesting on AReS completed successfully
  • Provide feedback to FAO on how we use AGROVOC for their “AGROVOC call for use cases”

2021-06-10

  • Skype with Moayad to discuss AReS harvesting improvements
    • He will work on a plugin that reads the XML sitemap to get all item IDs and checks whether we have them or not
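    • The gist of that approach, sketched from the shell (the sitemap URLs below are the ones DSpace normally exposes, but the exact paths and page parameter are assumptions):
$ curl -s 'https://cgspace.cgiar.org/sitemap' | grep -oE 'https?://[^<]+'                    # list the per-page sitemaps from the index
$ curl -s 'https://cgspace.cgiar.org/sitemap?map=0' | grep -oE '10568/[[:digit:]]+' | sort -u # extract the handles from one page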

2021-06-14

  • Dump and re-create indexes on AReS (as above) so I can do a harvest

2021-06-16

  • Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot’s DNS
    • For example, user agent Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0) with DNS msnbot-131-253-25-91.search.msn.com.
    • I queried Solr for all hits using the MSN bot DNS (dns:*msnbot* AND dns:*.msn.com.) and found 457,706 hits
    • I extracted their IPs using Solr’s CSV format and ran them through my resolve-addresses.py script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)
    • Note that Microsoft’s docs say that reverse lookups on Bingbot IPs will always have “search.msn.com” so it is safe to purge these as non-human traffic
    • I purged the hits with ilri/check-spider-ip-hits.sh (though I had to do it in 3 batches because I forgot to increase the facet.limit so I was only getting them 100 at a time)
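    • For reference, the Solr queries involved look roughly like this as curl commands (the Solr URL and statistics core name are assumptions, and the output path is just an example):
$ curl -s 'http://localhost:8081/solr/statistics/select' --data-urlencode 'q=dns:*msnbot* AND dns:*.msn.com.' --data-urlencode 'rows=0'
$ curl -s 'http://localhost:8081/solr/statistics/select' --data-urlencode 'q=dns:*msnbot* AND dns:*.msn.com.' --data-urlencode 'fl=ip' --data-urlencode 'wt=csv' --data-urlencode 'rows=500000' > /tmp/msnbot-ips.csv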
  • Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV
    • It will hopefully also fix the duplicate and missing items issues
    • I had a Skype with him to discuss
    • I got it running with podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tried (and failed) to run:
$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
  • The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it’s much faster
    • I harvested 90,000+ items from DSpace Test in ~3 hours
    • There seem to be some issues with the health check step though, as I see it is requesting one restricted item 600,000+ times…

2021-06-17

  • I ported my ilri/resolve-addresses.py script, which uses IPAPI.co, to use the local GeoIP2 databases instead
    • The new script is ilri/resolve-addresses-geoip2.py and it is much faster and works offline with no API rate limits
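    • The same kind of offline lookup can also be done ad hoc with the mmdblookup tool from libmaxminddb, for example (the database path is an assumption, and the IP is one of the msnbot addresses from above):
$ mmdblookup --file /var/lib/GeoIP/GeoLite2-ASN.mmdb --ip 131.253.25.91 autonomous_system_organization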
  • Teams meeting with the CGIAR Metadata Working group to discuss CGSpace and open repositories and the way forward
  • More work with Moayad on OpenRXV harvesting issues
    • Using a JSON export from elasticdump we debugged the duplicate checker plugin and found that there are indeed duplicates:
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
90459
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
90380
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
...
      2 "10568/99409"
      2 "10568/99410"
      2 "10568/99411"
      2 "10568/99516"
      3 "10568/102093"
      3 "10568/103524"
      3 "10568/106664"
      3 "10568/106940"
      3 "10568/107195"
      3 "10568/96546"

2021-06-20

  • Udana asked me to update their IWMI subjects from “farmer managed irrigation systems” to “farmer-led irrigation”
    • First I extracted the IWMI community from CGSpace:
$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
  • Then I used csvcut to extract just the columns I needed and do the replacement into a new CSV:
$ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv
  • Then I uploaded the resulting CSV to CGSpace, updating 161 items
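  • For the record, the same upload can also be done from the command line with DSpace’s batch metadata import, roughly like this (the eperson address is a placeholder):
$ dspace metadata-import -f /tmp/2021-06-20-IWMI-new-subjects.csv -e user@example.com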
  • Start a harvest on AReS
  • I found an existing bug report and a patch for the issue of private items showing up in the DSpace sitemap
    • The fix is super simple, I should try to apply it

2021-06-21

  • The AReS harvesting finished, but the indexes got messed up again
  • I was looking at the JSON export I made yesterday and trying to understand the situation with duplicates
    • We have 90,000+ item records in the index, but only about 85,700 unique handles:
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l
90937
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort -u | wc -l
85709
  • So those could be duplicates from the way we harvest pages, but they could also be from mappings…
    • Manually inspecting the duplicates where handles appear more than once:
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h
  • Unfortunately I found no pattern:
    • Some appear twice in the Elasticsearch index, but appear in only one collection
    • Some appear twice in the Elasticsearch index, and appear in two collections
    • Some appear twice in the Elasticsearch index, but appear in three collections (!)
  • So really we need to just check whether a handle exists before we insert it
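    • A simple way to do that would be to count matches for the handle in Elasticsearch before inserting, something like this (the index name is taken from above, and this assumes the handle field is indexed as a keyword):
$ curl -s -X POST 'http://localhost:9200/openrxv-items-temp/_count' -H 'Content-Type: application/json' -d'{"query":{"term":{"handle":"10568/103524"}}}'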
  • I tested the pull request for DS-1977 that adjusts the sitemap generation code to exclude private items
    • It applies cleanly and seems to work, but we don’t actually have any private items
    • The issue we are having with AReS hitting restricted items in the sitemap is that the items have restricted metadata, not that they are private
  • Testing the pull request for DS-4065 where the REST API’s /rest/items endpoint is not aware of private items and returns an incorrect number of items
    • This is most easily seen by setting a low limit in /rest/items, making one of the items private, and requesting items again with the same limit
    • I confirmed the issue on the current DSpace 6 Demo:
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
5
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
"10673/4"
"10673/3"
"10673/6"
"10673/5"
"10673/7"
# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length       
4
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle' 
"10673/4"
"10673/3"
"10673/5"
"10673/7"
  • I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira
  • Last week I noticed that the Gender Platform website is using “cgspace.cgiar.org” links for CGSpace, instead of handles
    • I emailed Fabio and Marianne to ask them to please use the Handle links
  • I tested the pull request for DS-4271 where Discovery filters of type “contains” don’t work as expected when the user’s search term has spaces
    • I tested with filter “farmer managed irrigation systems” on DSpace Test
    • Before the patch I got 293 results, and the few I checked didn’t have the expected metadata value
    • After the patch I got 162 results, and all the items I checked had the exact metadata value I was expecting
  • I tested a fresh harvest from my local AReS on DSpace Test with the DS-4065 REST API patch and here are my results:
    • 90,459 items in the final index from the last harvest
    • 90,307 items in the temp index after the new harvest
    • 90,327 items in the temp index after the start plugins ran
  • The 90327 number seems closer to the “real” number of items on CGSpace…
    • Seems close, but not entirely correct yet:
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | wc -l
90327
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
90317

2021-06-22

  • Make a pull request to the COUNTER-Robots project to add two new user agents: crusty and newspaper
    • These two bots have made ~3,000 requests on CGSpace
    • Then I added them to our local bot override in CGSpace (until the above pull request is merged) and ran my bot checking script:
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p   
Purging 1339 hits from RI\/1\.0 in statistics
Purging 447 hits from crusty in statistics
Purging 3736 hits from newspaper in statistics

Total number of bot hits purged: 5522
  • Surprised to see RI/1.0 in there because it’s been in the override file for a while
  • Looking at the 2021 statistics in Solr I see a few more suspicious user agents:
    • PostmanRuntime/7.26.8
    • node-fetch/1.0 (+https://github.com/bitinn/node-fetch)
    • Photon/1.0
    • StatusCake_Pagespeed_indev
    • node-superagent/3.8.3
    • cortex/1.0
  • These bots account for ~42,000 hits in our statistics… I will just purge them and add them to our local override, but I can’t be bothered to submit them to COUNTER-Robots since I’d have to look up the information for each one
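  • For future reference, a facet query along these lines can surface such user agents (the Solr URL, core name, and mincount threshold are assumptions):
$ curl -s 'http://localhost:8081/solr/statistics/select' --data-urlencode 'q=time:[2021-01-01T00:00:00Z TO *]' --data-urlencode 'rows=0' --data-urlencode 'facet=true' --data-urlencode 'facet.field=userAgent' --data-urlencode 'facet.limit=-1' --data-urlencode 'facet.mincount=100'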