--- title: "June, 2021" date: 2021-06-01T10:51:07+03:00 author: "Alan Orth" categories: ["Notes"] --- ## 2021-06-01 - IWMI notified me that AReS was down with an HTTP 502 error - Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification - I don't see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the `angular_nginx` container isn't running - I simply started it and AReS was running again: ```console $ docker-compose -f docker/docker-compose.yml start angular_nginx ``` - Margarita from CCAFS emailed me to say that workflow alerts haven't been working lately - I guess this is related to the SMTP issues last week - I had fixed the config, but didn't restart Tomcat so DSpace didn't load the new variables - I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers ## 2021-06-03 - Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI's content into the new AMCOW Knowledge Hub - At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time - That's when I thought they could perhaps use the OpenSearch, but I can't remember if it's possible to limit by community, or only collection... - Looking now, I see there is a "scope" parameter that can be used for community or collection, for example: ``` https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&scope=10568/16814&order=DESC&rpp=100&sort_by=2&start=1 ``` - That will sort by date issued (see: `webui.itemlist.sort-option.2` in dspace.cfg), give 100 results per page, and start on item 1 - Otherwise, another alternative would be to use the IWMI CSV that we are already exporting every week - Fill out the *CGIAR-AGROVOC Task Group: Survey on the current CGIAR use of AGROVOC* survey on behalf of CGSpace ## 2021-06-06 - The Elasticsearch indexes are messed up so I dumped and re-created them correctly: ```console curl -XDELETE 'http://localhost:9200/openrxv-items-final' curl -XDELETE 'http://localhost:9200/openrxv-items-temp' curl -XPUT 'http://localhost:9200/openrxv-items-final' curl -XPUT 'http://localhost:9200/openrxv-items-temp' curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}' elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000 ``` - Then I started a harvesting on AReS ## 2021-06-07 - The harvesting on AReS completed successfully - Provide feedback to FAO on how we use AGROVOC for their "AGROVOC call for use cases" ## 2021-06-10 - Skype with Moayad to discuss AReS harvesting improvements - He will work on a plugin that reads the XML sitemap to get all item IDs and checks whether we have them or not ## 2021-06-14 - Dump and re-create indexes on AReS (as above) so I can do a harvest ## 2021-06-16 - Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot's DNS - For example, user agent `Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)` with DNS `msnbot-131-253-25-91.search.msn.com.` - I queried Solr for all hits using the MSN bot DNS (`dns:*msnbot* AND dns:*.msn.com.`) and found 457,706 - I extracted their IPs using Solr's CSV format and ran them through my `resolve-addresses.py` script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075) - Note that [Microsoft's docs say that reverse lookups on Bingbot IPs will always have "search.msn.com"](https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26) so it is safe to purge these as non-human traffic - I purged the hits with `ilri/check-spider-ip-hits.sh` (though I had to do it in 3 batches because I forgot to increase the `facet.limit` so I was only getting them 100 at a time) - Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV - It will hopefully also fix the duplicate and missing items issues - I had a Skype with him to discuss - I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run: ```console $ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data ``` - The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it's much faster - I harvested 90,000+ items from DSpace Test in ~3 hours - There seem to be some issues with the health check step though, as I see it is requesting one restricted item 600,000+ times... ## 2021-06-17 - I ported my ilri/resolve-addresses.py script that uses IPAPI.co to use the local GeoIP2 databases - The new script is ilri/resolve-addresses-geoip2.py and it is much faster and works offline with no API rate limits - Teams meeting with the CGIAR Metadata Working group to discuss CGSpace and open repositories and the way forward - More work with Moayad on OpenRXV harvesting issues - Using a JSON export from elasticdump we debugged the duplicate checker plugin and found that there are indeed duplicates: ```console $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l 90459 $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l 90380 $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h ... 2 "10568/99409" 2 "10568/99410" 2 "10568/99411" 2 "10568/99516" 3 "10568/102093" 3 "10568/103524" 3 "10568/106664" 3 "10568/106940" 3 "10568/107195" 3 "10568/96546" ``` ## 2021-06-20 - Udana asked me to update their IWMI subjects from `farmer managed irrigation systems` to `farmer-led irrigation` - First I extracted the IWMI community from CGSpace: ```console $ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv ``` - Then I used `csvcut` to extract just the columns I needed and do the replacement into a new CSV: ```console $ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv ``` - Then I uploaded the resulting CSV to CGSpace, updating 161 items - Start a harvest on AReS - I found [a bug](https://jira.lyrasis.org/browse/DS-1977) and [a patch](https://github.com/DSpace/DSpace/pull/2584) for the private items showing up in the DSpace sitemap bug - The fix is super simple, I should try to apply it ## 2021-06-21 - The AReS harvesting finished, but the indexes got messed up again - I was looking at the JSON export I made yesterday and trying to understand the situation with duplicates - We have 90,000+ items, but only 85,000 unique: ```console $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l 90937 $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort -u | wc -l 85709 ``` - So those could be duplicates from the way we harvest pages, but they could also be from mappings... - Manually inspecting the duplicates where handles appear more than once: ```console $ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h ``` - Unfortunately I found no pattern: - Some appear twice in the Elasticsearch index, but appear in only one collection - Some appear twice in the Elasticsearch index, and appear in *two* collections - Some appear twice in the Elasticsearch index, but appear in three collections (!) - So really we need to just check whether a handle exists before we insert it - I tested the [pull request for DS-1977](https://github.com/DSpace/DSpace/pull/2584) that adjusts the sitemap generation code to exclude private items - It applies cleanly and seems to work, but we don't actually have any private items - The issue we are having with AReS hitting restricted items in the sitemap is that the items have restricted metadata, not that they are private - Testing the [pull request for DS-4065](https://github.com/DSpace/DSpace/pull/2275) where the REST API's `/rest/items` endpoint is not aware of private items and returns an incorrect number of items - This is most easily seen by setting a low limit in `/rest/items`, making one of the items private, and requesting items again with the same limit - I confirmed the issue on the current DSpace 6 Demo: ```console $ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length 5 $ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle' "10673/4" "10673/3" "10673/6" "10673/5" "10673/7" # log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6) $ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length 4 $ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle' "10673/4" "10673/3" "10673/5" "10673/7" ``` - I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira - Last week I noticed that the Gender Platform website is using "cgspace.cgiar.org" links for CGSpace, instead of handles - I emailed Fabio and Marianne to ask them to please use the Handle links - I tested the [pull request for DS-4271](https://github.com/DSpace/DSpace/pull/2543) where Discovery filters of type "contains" don't work as expected when the user's search term has spaces - I tested with filter "farmer managed irrigation systems" on DSpace Test - Before the patch I got 293 results, and the few I checked didn't have the expected metadata value - After the patch I got 162 results, and all the items I checked had the exact metadata value I was expecting - I tested a fresh harvest from my local AReS on DSpace Test with the DS-4065 REST API patch and here are my results: - 90459 in final from last harvesting - 90307 in temp after new harvest - 90327 in temp after start plugins - The 90327 number seems closer to the "real" number of items on CGSpace... - Seems close, but not entirely correct yet: ```console $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | wc -l 90327 $ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | sort -u | wc -l 90317 ``` ## 2021-06-22 - Make a [pull request](https://github.com/atmire/COUNTER-Robots/pull/43) to the COUNTER-Robots project to add two new user agents: crusty and newspaper - These two bots have made ~3,000 requests on CGSpace - Then I added them to our local bot override in CGSpace (until the above pull request is merged) and ran my bot checking script: ```console $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p Purging 1339 hits from RI\/1\.0 in statistics Purging 447 hits from crusty in statistics Purging 3736 hits from newspaper in statistics Total number of bot hits purged: 5522 ``` - Surprised to see RI/1.0 in there because it's been in the override file for a while - Looking at the 2021 statistics in Solr I see a few more suspicious user agents: - `PostmanRuntime/7.26.8` - `node-fetch/1.0 (+https://github.com/bitinn/node-fetch)` - `Photon/1.0` - `StatusCake_Pagespeed_indev` - `node-superagent/3.8.3` - `cortex/1.0` - These bots account for ~42,000 hits in our statistics... I will just purge them and add them to our local override, but I can't be bothered to submit them to COUNTER-Robots since I'd have to look up the information for each one - I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18) ## 2021-06-23 - I woke up this morning to find CGSpace down - The logs show a high number of abandoned PostgreSQL connections and locks: ```console # journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned' 978 $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l 10100 ``` - I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now - In the mean time, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test) - I also applied the following patches from the 6.4 milestone to our `6_x-prod` branch: - DS-4065: resource policy aware REST API hibernate queries - DS-4271: Replaced brackets with double quotes in SolrServiceImpl - After upgrading and restarting Tomcat the database connections and locks were back down to normal levels: ```console $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l 63 ``` - Looking in the DSpace log, the first "pool empty" message I saw this morning was at 4AM: ```console 2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000]. ``` - Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent: ```console Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy) ``` - We can purge them, as this is not user traffic: https://about.flipboard.com/browserproxy/ - I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots - I merged [Moayad's health check pull request in AReS](https://github.com/ilri/OpenRXV/pull/96) and I will deploy it on the production server soon ## 2021-06-24 - I deployed the new OpenRXV code on CGSpace but I'm having problems with the indexing, something about missing the mappings on the `openrxv-items-temp` index - I extracted the mappings from my local instance using `elasticdump` and after putting them on CGSpace I was able to harvest... - But still, there are way too many duplicates and I'm not sure what the actual number of items should be - According to the OAI ListRecords for each of our repositories, we should have about: - MELSpace: 9537 - WorldFish: 4483 - CGSpace: 91305 - Total: 105325 - Looking at the last backup I have from harvesting before these changes we have 104,000 total handles, but only 99186 unique: ```console $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l 104797 $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l 99186 ``` - This number is probably unique for that particular harvest, but I don't think it represents the true number of items... - The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items: ```console $ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l 90990 ``` - So the harvest on the live site is missing items, then why didn't the add missing items plugin find them?! - I notice that we are missing the `type` in the metadata structure config for each repository on the production site, and we are using `type` for item type in the actual schema... so maybe there is a conflict there - I will rename type to `item_type` and add it back to the metadata structure - The add missing items definitely checks this field... - I modified my local backup to add `type: item` and uploaded it to the temp index on production - Oh! nginx is blocking OpenRXV's attempt to read the sitemap: ```console 172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV" ``` - I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins... now it's checking 180,000+ handles to see if they are collections or items... - I see it fetched the sitemap three times, we need to make sure it's only doing it once for each repository - According to the api logs we will be adding 5,697 items: ```console $ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l 5697 ``` - Spent a few hours with Moayad troubleshooting and improving OpenRXV - We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs ## 2021-06-25 - The new OpenRXV code creates almost 200,000 jobs when the plugins start - I figured out how to use [bee-queue/arena](https://github.com/bee-queue/arena/tree/master/example) to view our Bull job queue - Also, we can see the jobs directly using redis-cli: ```console $ redis-cli 127.0.0.1:6379> SCAN 0 COUNT 5 1) "49152" 2) 1) "bull:plugins:476595" 2) "bull:plugins:367382" 3) "bull:plugins:369228" 4) "bull:plugins:438986" 5) "bull:plugins:366215" ``` - We can apparently get the names of the jobs in each hash using `hget`: ```console 127.0.0.1:6379> TYPE bull:plugins:401827 hash 127.0.0.1:6379> HGET bull:plugins:401827 name "dspace_add_missing_items" ``` - I whipped up a one liner to get the keys for all plugin jobs, convert to redis `HGET` commands to extract the value of the name field, and then sort them by their counts: ```console $ redis-cli KEYS "bull:plugins:*" \ | sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \ | ncat -w 3 localhost 6379 \ | grep -v -E '^\$' | sort | uniq -c | sort -h 3 dspace_health_check 4 -ERR wrong number of arguments for 'hget' command 12 mel_downloads_and_views 129 dspace_altmetrics 932 dspace_downloads_and_views 186428 dspace_add_missing_items ``` - Note that this uses `ncat` to send commands directly to redis all at once instead of one at a time (`netcat` didn't work here, as it doesn't know when our input is finished and never quits) - I thought of using `redis-cli --pipe` but then you have to construct the commands in the redis protocol format with the number of args and length of each command - There is clearly something wrong with the new DSpace health check plugin, as it creates WAY too many jobs every time we run the plugins ## 2021-06-27 - Looking into the spike in PostgreSQL connections last week - I see the same things that I always see (large number of connections waiting for lock, large number of threads, high CPU usage, etc), but I also see almost 10,000 DSpace sessions on 2021-06-25 ![DSpace sessions](/cgspace-notes/2021/06/dspace-sessions-week.png) - Looking at the DSpace log I see there was definitely a higher number of sessions that day, perhaps twice the normal: ```console $ for file in dspace.log.2021-06-[12]*; do echo "$file"; grep -oE 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done dspace.log.2021-06-10 19072 dspace.log.2021-06-11 19224 dspace.log.2021-06-12 19215 dspace.log.2021-06-13 16721 dspace.log.2021-06-14 17880 dspace.log.2021-06-15 12103 dspace.log.2021-06-16 4651 dspace.log.2021-06-17 22785 dspace.log.2021-06-18 21406 dspace.log.2021-06-19 25967 dspace.log.2021-06-20 20850 dspace.log.2021-06-21 6388 dspace.log.2021-06-22 5945 dspace.log.2021-06-23 46371 dspace.log.2021-06-24 9024 dspace.log.2021-06-25 12521 dspace.log.2021-06-26 16163 dspace.log.2021-06-27 5886 ``` - I see 15,000 unique IPs in the XMLUI logs alone on that day: ```console # zcat /var/log/nginx/access.log.5.gz /var/log/nginx/access.log.4.gz | grep '23/Jun/2021' | awk '{print $1}' | sort | uniq | wc -l 15835 ``` - Annoyingly I found 37,000 more hits from Bing using `dns:*msnbot* AND dns:*.msn.com.` as a Solr filter - WTF, they are using a normal user agent: `Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko` - I will purge the IPs and add this user agent to the nginx config so that we can rate limit it - I signed up for Bing Webmaster Tools and verified cgspace.cgiar.org with the BingSiteAuth.xml file - Also I adjusted the nginx config to explicitly allow access to `robots.txt` even when bots are rate limited - Also I found that Bing was auto discovering all our RSS and Atom feeds as "sitemaps" so I deleted 750 of them and submitted the real sitemap - I need to see if I can adjust the nginx config further to map the `bot` user agent to DNS like msnbot... - Review Abdullah's filter on click pull request - I rebased his code on the latest master branch and tested adding filter on click to the map and list components, and it works fine - There seems to be a bug that breaks scrolling on the page though... - Abdullah fixed the bug in the filter on click branch ## 2021-06-28 - Some work on OpenRXV - Integrate `prettier` into the frontend and backend and format everything on the `master` branch - Re-work the GitHub Actions workflow for frontend and add one for backend - The workflows run `npm install` to test dependencies, and `npm ci` with `prettier` to check formatting - Also I merged Abdallah's filter on click pull request ## 2021-06-30 - CGSpace is showing a blank white page... - The status is HTTP 200, but it's blank white... so UptimeRobot didn't send a notification! - The DSpace log shows: ```console 2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object ``` - The first one of these I see is from last night at 2021-06-29 at 10:47 PM - I restarted Tomcat 7 and CGSpace came back up... - I didn't see that Atmire had responded last week (on 2021-06-23) about the issues we had - He said they had to do the same thing that they did last time: switch to the postgres user and kill all activity - He said they found tons of connections to the REST API, like 3-4 per second, and asked if that was normal - I pointed him to our Tomcat server.xml configuration, saying that we purposefully isolated the Tomcat connection pools between the API and XMLUI for this purpose... - Export a list of all CGSpace's AGROVOC keywords with counts for Enrico and Elizabeth Arnaud to discuss with AGROVOC: ```console localhost/dspace63= > \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER; COPY 20780 ``` - Actually Enrico wanted NON AGROVOC, so I extracted all the center and CRP subjects (ignoring system office and themes): ```console localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER; COPY 1710 ``` - Fix an issue in the Ansible infrastructure playbooks for the DSpace role - It was causing the template module to fail when setting up the npm environment - We needed to install `acl` so that Ansible can use `setfacl` on the target file before becoming an unprivileged user - I saw a strange message in the Tomcat 7 journal on DSpace Test (linode26): ```console Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds. ``` - What's even crazier is that it is twice that on CGSpace (linode18)! - Apparently OpenJDK defaults to using `/dev/random` (see `/etc/java-8-openjdk/security/java.security`): ```console securerandom.source=file:/dev/urandom ``` - `/dev/random` blocks and can take a long time to get entropy, and urandom on modern Linux is a cryptographically secure pseudorandom number generator - Now Tomcat starts much faster and no warning is printed so I'm going to add this to our Ansible infrastructure playbooks - Interesting resource about the lore behind the `/dev/./urandom` workaround that is posted all over the Internet, apparently due to a bug in early JVMs: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6202721 - I'm experimenting with using PgBouncer for pooling instead of Tomcat's JDBC