CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

June, 2021

2021-06-01

  • IWMI notified me that AReS was down with an HTTP 502 error
    • Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
    • I don’t see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the angular_nginx container isn’t running
    • I simply started it and AReS was running again:
$ docker-compose -f docker/docker-compose.yml start angular_nginx
  • Margarita from CCAFS emailed me to say that workflow alerts haven’t been working lately
    • I guess this is related to the SMTP issues last week
    • I had fixed the config, but didn’t restart Tomcat so DSpace didn’t load the new variables
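    • For reference, DSpace only loads those settings on startup, so a Tomcat restart on the host is enough (a sketch; tomcat7 is the service name used on this server):
# restart Tomcat so DSpace picks up the corrected mail configuration
$ sudo systemctl restart tomcat7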
    • I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers

2021-06-03

  • Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI’s content into the new AMCOW Knowledge Hub
    • At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they would actually prefer to query sites on the fly and cache the results in Redis for some time
    • That’s when I thought they could perhaps use DSpace’s OpenSearch, but I can’t remember if it’s possible to limit by community, or only by collection…
    • Looking now, I see there is a “scope” parameter that can be used for community or collection, for example:
https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&scope=10568/16814&order=DESC&rpp=100&sort_by=2&start=1
  • That will sort by date issued (see: webui.itemlist.sort-option.2 in dspace.cfg), give 100 results per page, and start on item 1
  • Another alternative would be to use the IWMI CSV that we are already exporting every week
  • Fill out the CGIAR-AGROVOC Task Group’s “Survey on the current CGIAR use of AGROVOC” on behalf of CGSpace

2021-06-06

  • The Elasticsearch indexes are messed up so I dumped and re-created them correctly:
curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
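  • For reference, the mapping and data files restored above came from an elasticdump export along these lines (a sketch; reading from the openrxv-items alias is my assumption):
# export the mapping and data from the current index before deleting it
elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000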
  • Then I started a harvest on AReS

2021-06-07

  • The harvesting on AReS completed successfully
  • Provide feedback to FAO on how we use AGROVOC for their “AGROVOC call for use cases”

2021-06-10

  • Skype with Moayad to discuss AReS harvesting improvements
    • He will work on a plugin that reads the XML sitemap to get all item IDs and checks whether we have them or not
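    • The idea is roughly what you can do by hand with curl (a sketch; the ?map=N URLs are an assumption about how DSpace serves the sitemap parts):
# list the sitemap parts referenced by the sitemap index
$ curl -s 'https://cgspace.cgiar.org/sitemap' | grep -oE '<loc>[^<]+</loc>'
# pull the item handles out of one of the parts
$ curl -s 'https://cgspace.cgiar.org/sitemap?map=0' | grep -oE '/handle/[[:digit:]]+/[[:alnum:]]+' | sort -u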

2021-06-14

  • Dump and re-create indexes on AReS (as above) so I can do a harvest

2021-06-16

  • Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot’s DNS
    • For example, user agent Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0) with DNS msnbot-131-253-25-91.search.msn.com.
    • I queried Solr for all hits using the MSN bot DNS (dns:*msnbot* AND dns:*.msn.com.) and found 457,706 of them
    • I extracted their IPs using Solr’s CSV format and ran them through my resolve-addresses.py script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)
    • Note that Microsoft’s docs say that reverse lookups on Bingbot IPs will always have “search.msn.com” so it is safe to purge these as non-human traffic
    • I purged the hits with ilri/check-spider-ip-hits.sh (though I had to do it in 3 batches because I forgot to increase the facet.limit so I was only getting them 100 at a time)
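    • The queries look roughly like this (a sketch; the Solr port and statistics core name are assumptions about the local setup):
# count the hits matching the MSN bot DNS
$ curl -s 'http://localhost:8081/solr/statistics/select?q=dns:*msnbot*+AND+dns:*.msn.com.&rows=0&wt=json'
# export just the IPs in CSV format to feed to resolve-addresses.py
$ curl -s 'http://localhost:8081/solr/statistics/select?q=dns:*msnbot*+AND+dns:*.msn.com.&rows=500000&fl=ip&wt=csv' -o /tmp/msnbot-ips.csv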
  • Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV
    • It will hopefully also fix the duplicate and missing items issues
    • I had a Skype with him to discuss
    • I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:
$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
  • The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it’s much faster
    • I harvested 90,000+ items from DSpace Test in ~3 hours
    • There seem to be some issues with the health check step though, as I see it is requesting one restricted item 600,000+ times…

2021-06-17

  • I ported my ilri/resolve-addresses.py script that uses IPAPI.co to use the local GeoIP2 databases
    • The new script is ilri/resolve-addresses-geoip2.py and it is much faster and works offline with no API rate limits
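    • Usage is the same idea as before: feed it a list of IP addresses and get the network names back (the flags below are hypothetical placeholders, not necessarily the script’s real options):
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/msnbot-ips.csv -o /tmp/msnbot-ips-resolved.csv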
  • Teams meeting with the CGIAR Metadata Working group to discuss CGSpace and open repositories and the way forward
  • More work with Moayad on OpenRXV harvesting issues
    • Using a JSON export from elasticdump we debugged the duplicate checker plugin and found that there are indeed duplicates:
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
90459
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
90380
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
...
      2 "10568/99409"
      2 "10568/99410"
      2 "10568/99411"
      2 "10568/99516"
      3 "10568/102093"
      3 "10568/103524"
      3 "10568/106664"
      3 "10568/106940"
      3 "10568/107195"
      3 "10568/96546"

2021-06-20

  • Udana asked me to update the IWMI subject term from “farmer managed irrigation systems” to “farmer-led irrigation”
    • First I extracted the IWMI community from CGSpace:
$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
  • Then I used csvcut to extract just the columns I needed and sed to do the replacement, writing the result to a new CSV:
$ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv
  • Then I uploaded the resulting CSV to CGSpace, updating 161 items
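    • Equivalently, this could be done from the command line with DSpace’s metadata import tool (a sketch):
$ dspace metadata-import -f /tmp/2021-06-20-IWMI-new-subjects.csv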
  • Start a harvest on AReS
  • I found a bug report and a patch for the issue of private items showing up in the DSpace sitemap
    • The fix is super simple, so I should try to apply it

2021-06-21

  • The AReS harvesting finished, but the indexes got messed up again
  • I was looking at the JSON export I made yesterday and trying to understand the situation with duplicates
    • We have 90,000+ items, but only 85,000 unique:
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l
90937
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort -u | wc -l
85709
  • So those could be duplicates from the way we harvest pages, but they could also be from mappings…
    • Manually inspecting the duplicates where handles appear more than once:
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h
  • Unfortunately I found no pattern:
    • Some appear twice in the Elasticsearch index, but appear in only one collection
    • Some appear twice in the Elasticsearch index, and appear in two collections
    • Some appear twice in the Elasticsearch index, but appear in three collections (!)
  • So really we need to just check whether a handle exists before we insert it
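    • A quick manual version of that check against the openrxv-items alias (a sketch; the field name is taken from the JSON dumps above):
$ curl -s 'http://localhost:9200/openrxv-items/_count' -H 'Content-Type: application/json' -d'{"query":{"match":{"handle":"10568/103524"}}}'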
  • I tested the pull request for DS-1977 that adjusts the sitemap generation code to exclude private items
    • It applies cleanly and seems to work, but we don’t actually have any private items
    • The issue we are having with AReS hitting restricted items in the sitemap is that the items have restricted metadata, not that they are private
  • Testing the pull request for DS-4065 where the REST API’s /rest/items endpoint is not aware of private items and returns an incorrect number of items
    • This is most easily seen by setting a low limit in /rest/items, making one of the items private, and requesting items again with the same limit
    • I confirmed the issue on the current DSpace 6 Demo:
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
5
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
"10673/4"
"10673/3"
"10673/6"
"10673/5"
"10673/7"
# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length       
4
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle' 
"10673/4"
"10673/3"
"10673/5"
"10673/7"
  • I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira
  • Last week I noticed that the Gender Platform website is using “cgspace.cgiar.org” links for CGSpace, instead of handles
    • I emailed Fabio and Marianne to ask them to please use the Handle links
  • I tested the pull request for DS-4271 where Discovery filters of type “contains” don’t work as expected when the user’s search term has spaces
    • I tested with filter “farmer managed irrigation systems” on DSpace Test
    • Before the patch I got 293 results, and the few I checked didn’t have the expected metadata value
    • After the patch I got 162 results, and all the items I checked had the exact metadata value I was expecting
  • I tested a fresh harvest from my local AReS on DSpace Test with the DS-4065 REST API patch and here are my results:
    • 90459 in final from last harvesting
    • 90307 in temp after new harvest
    • 90327 in temp after start plugins
  • The 90327 number seems closer to the “real” number of items on CGSpace…
    • Seems close, but not entirely correct yet:
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | wc -l
90327
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
90317

2021-06-22

  • Make a pull request to the COUNTER-Robots project to add two new user agents: crusty and newspaper
    • These two bots have made ~3,000 requests on CGSpace
    • Then I added them to our local bot override in CGSpace (until the above pull request is merged) and ran my bot checking script:
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p   
Purging 1339 hits from RI\/1\.0 in statistics
Purging 447 hits from crusty in statistics
Purging 3736 hits from newspaper in statistics

Total number of bot hits purged: 5522
  • Surprised to see RI/1.0 in there because it’s been in the override file for a while
  • Looking at the 2021 statistics in Solr I see a few more suspicious user agents:
    • PostmanRuntime/7.26.8
    • node-fetch/1.0 (+https://github.com/bitinn/node-fetch)
    • Photon/1.0
    • StatusCake_Pagespeed_indev
    • node-superagent/3.8.3
    • cortex/1.0
  • These bots account for ~42,000 hits in our statistics… I will just purge them and add them to our local override, but I can’t be bothered to submit them to COUNTER-Robots since I’d have to look up the information for each one
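    • Purging them is the same process as above: append patterns to the local override file and re-run the script (a sketch; the exact regexes are my guess at what would match safely):
$ cat >> dspace/config/spiders/agents/ilri << 'EOF'
PostmanRuntime
node-fetch
Photon
StatusCake_Pagespeed
node-superagent
cortex\/1\.0
EOF
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p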
  • I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18)

2021-06-23

  • I woke up this morning to find CGSpace down
    • The logs show a high number of abandoned PostgreSQL connections and locks:
# journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
978
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
10100
  • I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now
  • In the meantime, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)
  • I also applied the following patches from the 6.4 milestone to our 6_x-prod branch:
    • DS-4065: resource policy aware REST API hibernate queries
    • DS-4271: Replaced brackets with double quotes in SolrServiceImpl
  • After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
63
  • Looking in the DSpace log, the first “pool empty” message I saw this morning was at 4AM:
2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
  • Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)

2021-06-24

  • I deployed the new OpenRXV code on CGSpace but I’m having problems with the indexing, something about missing the mappings on the openrxv-items-temp index
    • I extracted the mappings from my local instance using elasticdump and after putting them on CGSpace I was able to harvest…
    • But still, there are way too many duplicates and I’m not sure what the actual number of items should be
    • According to the OAI ListRecords for each of our repositories, we should have about:
      • MELSpace: 9537
      • WorldFish: 4483
      • CGSpace: 91305
      • Total: 105325
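    • One way to get such counts is from the completeListSize attribute on the OAI resumption token, for example (a sketch; whether each repository’s OAI exposes that attribute is my assumption):
$ curl -s 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc' | grep -oE 'completeListSize="[[:digit:]]+"'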
    • Looking at the last backup I have from harvesting before these changes, we have about 104,000 total handles, but only 99,186 unique:
$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
104797
$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
99186
  • This number is probably unique for that particular harvest, but I don’t think it represents the true number of items…
  • The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:
$ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
90990
  • So the harvest on the live site is missing items… why didn’t the “add missing items” plugin find them?!
    • I notice that we are missing the “type” field in the metadata structure config for each repository on the production site, and we are using “type” for the item type in the actual schema… so maybe there is a conflict there
    • I will rename “type” to “item_type” and add it back to the metadata structure
    • The “add missing items” plugin definitely checks this field…
    • I modified my local backup to add type: item and uploaded it to the temp index on production
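    • The upload itself was just another elasticdump restore of the modified backup into the temp index (a sketch, following the same pattern as above):
elasticdump --input=cgspace-openrxv-items-temp-backup.json --output=http://localhost:9200/openrxv-items-temp --type=data --limit=1000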
    • Oh! nginx is blocking OpenRXV’s attempt to read the sitemap:
172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
  • I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins… now it’s checking 180,000+ handles to see if they are collections or items…
    • I see it fetched the sitemap three times; we need to make sure it only does that once for each repository
  • According to the API logs we will be adding 5,697 items:
$ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
5697
  • Spent a few hours with Moayad troubleshooting and improving OpenRXV
    • We found a bug in the harvesting code that can occur when harvesting both DSpace 5 and DSpace 6 instances, since DSpace 5 uses numeric (long) item IDs while DSpace 6 uses UUIDs