- `node-superagent/3.8.3`
- `cortex/1.0`
- These bots account for ~42,000 hits in our statistics... I will just purge them and add them to our local override, but I can't be bothered to submit them to COUNTER-Robots since I'd have to look up the information for each one
- I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18)
## 2021-06-23
- I woke up this morning to find CGSpace down
- The logs show a high number of abandoned PostgreSQL connections and locks:
```console
# journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
978
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
10100
```
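- A rough way to see which pool is holding the locks is to group them by `application_name` from `pg_stat_activity`, something like:
```console
$ psql -c 'SELECT psa.application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | sort | uniq -c | sort -rn
```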
- I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now
- In the meantime, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)
- I also applied the following patches from the 6.4 milestone to our `6_x-prod` branch:
- DS-4065: resource policy aware REST API hibernate queries
- DS-4271: Replaced brackets with double quotes in SolrServiceImpl
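- For reference, applying such patches is usually just a matter of cherry-picking the upstream commits onto our `6_x-prod` branch (the commit hashes below are placeholders):
```console
$ git checkout 6_x-prod
$ git cherry-pick <ds-4065-commit> <ds-4271-commit>
```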
- After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
63
```
- Looking in the DSpace log, the first "pool empty" message I saw this morning was at 4AM:
```console
2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
```
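- For reference, finding the first occurrence is just a grep over that day's DSpace log (log file name assumed):
```console
$ grep -m 1 'Pool empty' dspace.log.2021-06-23
```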
- Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:
```console
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
```
- We can purge them, as this is not user traffic: https://about.flipboard.com/browserproxy/
- I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots
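- A sketch of how I would normally purge those hits with our `check-spider-hits.sh` script (agents file and flags shown as an example):
```console
$ cat /tmp/agents.txt
FlipboardProxy
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
```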
- I merged [Moayad's health check pull request in AReS](https://github.com/ilri/OpenRXV/pull/96) and I will deploy it on the production server soon
## 2021-06-24
- I deployed the new OpenRXV code on CGSpace, but I'm having problems with the indexing: something about missing mappings on the `openrxv-items-temp` index
- I extracted the mappings from my local instance using `elasticdump` and after putting them on CGSpace I was able to harvest...
- But still, there are way too many duplicates and I'm not sure what the actual number of items should be
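- For reference, dumping and restoring the index mappings with `elasticdump` looks roughly like this (hosts and file names assumed):
```console
$ elasticdump --input=http://localhost:9200/openrxv-items-temp --output=/tmp/openrxv-items-temp-mappings.json --type=mapping
$ elasticdump --input=/tmp/openrxv-items-temp-mappings.json --output=http://localhost:9200/openrxv-items-temp --type=mapping
```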
- According to the OAI ListRecords for each of our repositories, we should have about:
- MELSpace: 9537
- WorldFish: 4483
- CGSpace: 91305
- Total: 105325
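- One quick way to get such a count is from the `completeListSize` attribute on the OAI resumption token, for example (endpoint shown for CGSpace):
```console
$ curl -s 'https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc' | grep -oE 'completeListSize="[0-9]+"'
```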
- Looking at the last backup I have from harvesting before these changes, we have 104,797 total handles, but only 99,186 unique:
```console
$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
104797
$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
99186
```
- This number is probably accurate for that particular harvest, but I don't think it represents the true number of items...
- The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:
```console
$ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
90990
```
- So the harvest on the live site is missing items, but then why didn't the add missing items plugin find them?!
- I notice that we are missing the `type` in the metadata structure config for each repository on the production site, and we are using `type` for item type in the actual schema... so maybe there is a conflict there
- I will rename `type` to `item_type` and add it back to the metadata structure
- The add missing items definitely checks this field...
- I modified my local backup to add `type: item` and uploaded it to the temp index on production
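- Roughly, that means adding the field to each document in the dump and pushing it back with `elasticdump`, something like this (the dump's field layout and the Elasticsearch host are assumptions):
```console
$ jq -c '._source += {"type": "item"}' 2021-06-23-openrxv-items-final-local.json > /tmp/openrxv-items-fixed.json
$ elasticdump --input=/tmp/openrxv-items-fixed.json --output=http://localhost:9200/openrxv-items-temp --type=data --limit=1000
```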
- Oh! nginx is blocking OpenRXV's attempt to read the sitemap:
```console
172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
```
- I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins... now it's checking 180,000+ handles to see if they are collections or items...
- I see it fetched the sitemap three times; we need to make sure it only does that once per repository
- According to the `api` container logs, we will be adding 5,697 items:
```console
$ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
5697
```
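- Extracting the handles from those log lines would be a quick way to spot-check what will be added (assuming those lines contain the item handles):
```console
$ docker logs api 2>/dev/null | grep dspace_add_missing_items | grep -oE '[0-9]+/[0-9]+' | sort -u | head
```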
- Spent a few hours with Moayad troubleshooting and improving OpenRXV
- We found a bug in the harvesting code that is triggered when you harvest a mix of DSpace 5 and DSpace 6 instances, since DSpace 5 uses numeric (long) IDs while DSpace 6 uses UUIDs
<!-- vim: set sw=2 ts=2: -->