- IWMI notified me that AReS was down with an HTTP 502 error
- Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
- I don't see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the `angular_nginx` container isn't running
- Margarita from CCAFS emailed me to say that workflow alerts haven't been working lately
- I guess this is related to the SMTP issues last week
- I had fixed the config, but didn't restart Tomcat so DSpace didn't load the new variables
- I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers
## 2021-06-03
- Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI's content into the new AMCOW Knowledge Hub
- At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time
- That's when I thought they could perhaps use the OpenSearch, but I can't remember if it's possible to limit by community, or only collection...
- Looking now, I see there is a "scope" parameter that can be used for community or collection, for example:
- Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot's DNS
- For example, user agent `Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)` with DNS `msnbot-131-253-25-91.search.msn.com.`
- I queried Solr for all hits using the MSN bot DNS (`dns:*msnbot* AND dns:*.msn.com.`) and found 457,706
- I extracted their IPs using Solr's CSV format and ran them through my `resolve-addresses.py` script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)
- Note that [Microsoft's docs say that reverse lookups on Bingbot IPs will always have "search.msn.com"](https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26) so it is safe to purge these as non-human traffic
- I purged the hits with `ilri/check-spider-ip-hits.sh` (though I had to do it in 3 batches because I forgot to increase the `facet.limit` so I was only getting them 100 at a time)
- Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV
- It will hopefully also fix the duplicate and missing items issues
- I had a Skype with him to discuss
- I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:
- Then I uploaded the resulting CSV to CGSpace, updating 161 items
- Start a harvest on AReS
- I found [a bug](https://jira.lyrasis.org/browse/DS-1977) and [a patch](https://github.com/DSpace/DSpace/pull/2584) for the private items showing up in the DSpace sitemap bug
- The fix is super simple, I should try to apply it
## 2021-06-21
- The AReS harvesting finished, but the indexes got messed up again
- I was looking at the JSON export I made yesterday and trying to understand the situation with duplicates
- Some appear twice in the Elasticsearch index, but appear in only one collection
- Some appear twice in the Elasticsearch index, and appear in *two* collections
- Some appear twice in the Elasticsearch index, but appear in three collections (!)
- So really we need to just check whether a handle exists before we insert it
- I tested the [pull request for DS-1977](https://github.com/DSpace/DSpace/pull/2584) that adjusts the sitemap generation code to exclude private items
- It applies cleanly and seems to work, but we don't actually have any private items
- The issue we are having with AReS hitting restricted items in the sitemap is that the items have restricted metadata, not that they are private
- Testing the [pull request for DS-4065](https://github.com/DSpace/DSpace/pull/2275) where the REST API's `/rest/items` endpoint is not aware of private items and returns an incorrect number of items
- This is most easily seen by setting a low limit in `/rest/items`, making one of the items private, and requesting items again with the same limit
- I confirmed the issue on the current DSpace 6 Demo:
- I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira
- Last week I noticed that the Gender Platform website is using "cgspace.cgiar.org" links for CGSpace, instead of handles
- I emailed Fabio and Marianne to ask them to please use the Handle links
- I tested the [pull request for DS-4271](https://github.com/DSpace/DSpace/pull/2543) where Discovery filters of type "contains" don't work as expected when the user's search term has spaces
- I tested with filter "farmer managed irrigation systems" on DSpace Test
- Before the patch I got 293 results, and the few I checked didn't have the expected metadata value
- After the patch I got 162 results, and all the items I checked had the exact metadata value I was expecting
- Make a [pull request](https://github.com/atmire/COUNTER-Robots/pull/43) to the COUNTER-Robots project to add two new user agents: crusty and newspaper
- These two bots have made ~3,000 requests on CGSpace
- Then I added them to our local bot override in CGSpace (until the above pull request is merged) and ran my bot checking script:
- These bots account for ~42,000 hits in our statistics... I will just purge them and add them to our local override, but I can't be bothered to submit them to COUNTER-Robots since I'd have to look up the information for each one
- I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18)
## 2021-06-23
- I woke up this morning to find CGSpace down
- The logs show a high number of abandoned PostgreSQL connections and locks:
```console
# journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
978
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
10100
```
- I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now
- In the mean time, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)
- I also applied the following patches from the 6.4 milestone to our `6_x-prod` branch:
- DS-4065: resource policy aware REST API hibernate queries
- DS-4271: Replaced brackets with double quotes in SolrServiceImpl
- After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
63
```
- Looking in the DSpace log, the first "pool empty" message I saw this morning was at 4AM:
```console
2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
```
- Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:
```console
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
```
- We can purge them, as this is not user traffic: https://about.flipboard.com/browserproxy/
- I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots
- I merged [Moayad's health check pull request in AReS](https://github.com/ilri/OpenRXV/pull/96) and I will deploy it on the production server soon
## 2021-06-24
- I deployed the new OpenRXV code on CGSpace but I'm having problems with the indexing, something about missing the mappings on the `openrxv-items-temp` index
- I extracted the mappings from my local instance using `elasticdump` and after putting them on CGSpace I was able to harvest...
- But still, there are way too many duplicates and I'm not sure what the actual number of items should be
- According to the OAI ListRecords for each of our repositories, we should have about:
- MELSpace: 9537
- WorldFish: 4483
- CGSpace: 91305
- Total: 105325
- Looking at the last backup I have from harvesting before these changes we have 104,000 total handles, but only 99186 unique:
- So the harvest on the live site is missing items, then why didn't the add missing items plugin find them?!
- I notice that we are missing the `type` in the metadata structure config for each repository on the production site, and we are using `type` for item type in the actual schema... so maybe there is a conflict there
- I will rename type to `item_type` and add it back to the metadata structure
- The add missing items definitely checks this field...
- I modified my local backup to add `type: item` and uploaded it to the temp index on production
- Oh! nginx is blocking OpenRXV's attempt to read the sitemap:
- I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins... now it's checking 180,000+ handles to see if they are collections or items...
- I see it fetched the sitemap three times, we need to make sure it's only doing it once for each repository
- According to the api logs we will be adding 5,697 items:
- Spent a few hours with Moayad troubleshooting and improving OpenRXV
- We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs
- The new OpenRXV code creates almost 200,000 jobs when the plugins start
- I figured out how to use [bee-queue/arena](https://github.com/bee-queue/arena/tree/master/example) to view our Bull job queue
- Also, we can see the jobs directly using redis-cli:
```console
$ redis-cli
127.0.0.1:6379> SCAN 0 COUNT 5
1) "49152"
2) 1) "bull:plugins:476595"
2) "bull:plugins:367382"
3) "bull:plugins:369228"
4) "bull:plugins:438986"
5) "bull:plugins:366215"
```
- We can apparently get the names of the jobs in each hash using `hget`:
```console
127.0.0.1:6379> TYPE bull:plugins:401827
hash
127.0.0.1:6379> HGET bull:plugins:401827 name
"dspace_add_missing_items"
```
- I whipped up a one liner to get the keys for all plugin jobs, convert to redis `HGET` commands to extract the value of the name field, and then sort them by their counts:
```console
$ redis-cli KEYS "bull:plugins:*" \
| sed -e 's/^bull/HGET bull/' -e 's/\([[:digit:]]\)$/\1 name/' \
| ncat -w 3 localhost 6379 \
| grep -v -E '^\$' | sort | uniq -c | sort -h
3 dspace_health_check
4 -ERR wrong number of arguments for 'hget' command
12 mel_downloads_and_views
129 dspace_altmetrics
932 dspace_downloads_and_views
186428 dspace_add_missing_items
```
- Note that this uses `ncat` to send commands directly to redis all at once instead of one at a time (`netcat` didn't work here, as it doesn't know when our input is finished and never quits)
- I thought of using `redis-cli --pipe` but then you have to construct the commands in the redis protocol format with the number of args and length of each command
- There is clearly something wrong with the new DSpace health check plugin, as it creates WAY too many jobs every time we run the plugins
- Looking into the spike in PostgreSQL connections last week
- I see the same things that I always see (large number of connections waiting for lock, large number of threads, high CPU usage, etc), but I also see almost 10,000 DSpace sessions on 2021-06-25
- Integrate `prettier` into the frontend and backend and format everything on the `master` branch
- Re-work the GitHub Actions workflow for frontend and add one for backend
- The workflows run `npm install` to test dependencies, and `npm ci` with `prettier` to check formatting
- Also I merged Abdallah's filter on click pull request
## 2021-06-30
- CGSpace is showing a blank white page...
- The status is HTTP 200, but it's blank white... so UptimeRobot didn't send a notification!
- The DSpace log shows:
```console
2021-06-30 08:19:15,874 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ Cannot get a connection, pool error Timeout waiting for idle object
```
- The first one of these I see is from last night at 2021-06-29 at 10:47 PM
- I restarted Tomcat 7 and CGSpace came back up...
- I didn't see that Atmire had responded last week (on 2021-06-23) about the issues we had
- He said they had to do the same thing that they did last time: switch to the postgres user and kill all activity
- He said they found tons of connections to the REST API, like 3-4 per second, and asked if that was normal
- I pointed him to our Tomcat server.xml configuration, saying that we purposefully isolated the Tomcat connection pools between the API and XMLUI for this purpose...
- Export a list of all CGSpace's AGROVOC keywords with counts for Enrico and Elizabeth Arnaud to discuss with AGROVOC:
```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value AS "dcterms.subject", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY "dcterms.subject" ORDER BY count DESC) to /tmp/2021-06-30-agrovoc.csv WITH CSV HEADER;
COPY 20780
```
- Actually Enrico wanted NON AGROVOC, so I extracted all the center and CRP subjects (ignoring system office and themes):
```console
localhost/dspace63= > \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242) GROUP BY subject ORDER BY count DESC) to /tmp/2021-06-30-non-agrovoc.csv WITH CSV HEADER;
COPY 1710
```
- Fix an issue in the Ansible infrastructure playbooks for the DSpace role
- It was causing the template module to fail when setting up the npm environment
- We needed to install `acl` so that Ansible can use `setfacl` on the target file before becoming an unprivileged user
- I saw a strange message in the Tomcat 7 journal on DSpace Test (linode26):
```console
Jun 30 16:00:09 linode26 tomcat7[30294]: WARNING: Creation of SecureRandom instance for session ID generation using [SHA1PRNG] took [111,733] milliseconds.
```
- What's even crazier is that it is twice that on CGSpace (linode18)!
- Apparently OpenJDK defaults to using `/dev/random` (see `/etc/java-8-openjdk/security/java.security`):
```console
securerandom.source=file:/dev/urandom
```
-`/dev/random` blocks and can take a long time to get entropy, and urandom on modern Linux is a cryptographically secure pseudorandom number generator
- Now Tomcat starts much faster and no warning is printed so I'm going to add this to our Ansible infrastructure playbooks
- Interesting resource about the lore behind the `/dev/./urandom` workaround that is posted all over the Internet, apparently due to a bug in early JVMs: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6202721
- I'm experimenting with using PgBouncer for pooling instead of Tomcat's JDBC