diff --git a/content/posts/2021-06.md b/content/posts/2021-06.md
index 895b3226a..5c84b3e8d 100644
--- a/content/posts/2021-06.md
+++ b/content/posts/2021-06.md
@@ -232,5 +232,96 @@ Total number of bot hits purged: 5522
- `node-superagent/3.8.3`
- `cortex/1.0`
- These bots account for ~42,000 hits in our statistics... I will just purge them and add them to our local override, but I can't be bothered to submit them to COUNTER-Robots since I'd have to look up the information for each one
+- I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18)
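+  - Roughly, the re-sync is an rsync of the assetstore and Solr statistics data plus a PostgreSQL dump and restore; a sketch of the kind of commands involved (the host alias, paths, and database names here are assumptions, not the exact procedure):
+
+```console
+$ rsync -av --delete linode18:/home/cgspace.cgiar.org/assetstore/ /home/dspacetest.cgiar.org/assetstore/
+$ rsync -av --delete linode18:/home/cgspace.cgiar.org/solr/statistics/data/ /home/dspacetest.cgiar.org/solr/statistics/data/
+$ pg_dump -h linode18 -U postgres -F c -f /tmp/cgspace.backup dspace
+$ pg_restore -h localhost -U postgres -d dspacetest --clean --if-exists /tmp/cgspace.backup
+```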
+
+## 2021-06-23
+
+- I woke up this morning to find CGSpace down
+ - The logs show a high number of abandoned PostgreSQL connections and locks:
+
+```console
+# journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
+978
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+10100
+```
+
+- I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now
+- In the meantime, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)
+- I also applied the following patches from the 6.4 milestone to our `6_x-prod` branch (see the sketch below):
+ - DS-4065: resource policy aware REST API hibernate queries
+ - DS-4271: Replaced brackets with double quotes in SolrServiceImpl
+- After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:
+
+```console
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
+63
+```
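+- Applying upstream patches like those to the `6_x-prod` branch is typically just a cherry-pick from the DSpace repository; a rough sketch only (the remote name and commit IDs here are placeholders, not the real ones):
+
+```console
+$ git checkout 6_x-prod
+$ git fetch upstream    # assumes a remote named "upstream" pointing at https://github.com/DSpace/DSpace.git
+$ git cherry-pick <DS-4065-commit> <DS-4271-commit>
+```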
+
+- Looking in the DSpace log, the first "pool empty" message I saw this morning was at 4AM:
+
+```console
+2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
+```
+
+- Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:
+
+```console
+Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
+```
+
+- We can purge them, as this is not user traffic: https://about.flipboard.com/browserproxy/
+  - I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots (a purge sketch follows this list)
+- I merged [Moayad's health check pull request in AReS](https://github.com/ilri/OpenRXV/pull/96) and I will deploy it on the production server soon
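+- Purging the Flipboard hits from the Solr statistics core is basically a delete-by-query on the user agent; a sketch only (the Solr host, port, and core name here are assumptions):
+
+```console
+$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:/.*FlipboardProxy.*/</query></delete>'
+```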
+
+## 2021-06-24
+
+- I deployed the new OpenRXV code on CGSpace but I'm having problems with the indexing, something about missing the mappings on the `openrxv-items-temp` index
+  - I extracted the mappings from my local instance using `elasticdump` and after putting them on CGSpace I was able to harvest... (a rough sketch of the commands is below)
+ - But still, there are way too many duplicates and I'm not sure what the actual number of items should be
+ - According to the OAI ListRecords for each of our repositories, we should have about:
+ - MELSpace: 9537
+ - WorldFish: 4483
+ - CGSpace: 91305
+ - Total: 105325
+  - Looking at the last backup I have from harvesting before these changes, we have 104,797 total handles, but only 99,186 unique:
+
+```console
+$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
+104797
+$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
+99186
+```
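+- For reference, copying the `openrxv-items-temp` mappings with `elasticdump` (as mentioned above) looks roughly like this, assuming Elasticsearch is listening on localhost:9200:
+
+```console
+$ elasticdump --input=http://localhost:9200/openrxv-items-temp --output=/tmp/openrxv-items-temp-mappings.json --type=mapping
+$ elasticdump --input=/tmp/openrxv-items-temp-mappings.json --output=http://localhost:9200/openrxv-items-temp --type=mapping
+```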
+
+- That 99,186 is probably the number of unique handles for that particular harvest, but I don't think it represents the true number of items...
+- The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:
+
+```console
+$ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
+90990
+```
+
+- So the harvest on the live site is missing items... then why didn't the add missing items plugin find them?!
+ - I notice that we are missing the `type` in the metadata structure config for each repository on the production site, and we are using `type` for item type in the actual schema... so maybe there is a conflict there
+  - I will rename `type` to `item_type` and add it back to the metadata structure
+ - The add missing items definitely checks this field...
+ - I modified my local backup to add `type: item` and uploaded it to the temp index on production
+ - Oh! nginx is blocking OpenRXV's attempt to read the sitemap:
+
+```console
+172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
+```
+
+- I fixed nginx so it always allows people to get the sitemap (see the curl check below) and then re-ran the plugins... now it's checking 180,000+ handles to see if they are collections or items...
+  - I see it fetched the sitemap three times; we need to make sure it's only doing it once for each repository
+- According to the api logs we will be adding 5,697 items:
+
+```console
+$ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
+5697
+```
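+- A quick way to verify the nginx change for the sitemap is to request it with the harvester's user agent and check that it now returns HTTP 200 instead of 503; a sketch:
+
+```console
+$ curl -s -o /dev/null -w '%{http_code}\n' -A 'OpenRXV harvesting bot; https://github.com/ilri/OpenRXV' 'https://cgspace.cgiar.org/sitemap'
+```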
+
+- Spent a few hours with Moayad troubleshooting and improving OpenRXV
+  - We found a bug in the harvesting code that shows up when you harvest both DSpace 5 and DSpace 6 instances, since DSpace 5 uses numeric (long) IDs while DSpace 6 uses UUIDs