From b36808718c8ae58e742bc199eb62fe317b0463b2 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Fri, 25 Jun 2021 09:34:29 +0300 Subject: [PATCH] Add notes for 2021-06-24 --- content/posts/2021-06.md | 91 ++++++++++++++++++++ docs/2021-06/index.html | 105 +++++++++++++++++++++++- docs/categories/index.html | 2 +- docs/categories/notes/index.html | 2 +- docs/categories/notes/page/2/index.html | 2 +- docs/categories/notes/page/3/index.html | 2 +- docs/categories/notes/page/4/index.html | 2 +- docs/categories/notes/page/5/index.html | 2 +- docs/index.html | 2 +- docs/page/2/index.html | 2 +- docs/page/3/index.html | 2 +- docs/page/4/index.html | 2 +- docs/page/5/index.html | 2 +- docs/page/6/index.html | 2 +- docs/page/7/index.html | 2 +- docs/page/8/index.html | 2 +- docs/posts/index.html | 2 +- docs/posts/page/2/index.html | 2 +- docs/posts/page/3/index.html | 2 +- docs/posts/page/4/index.html | 2 +- docs/posts/page/5/index.html | 2 +- docs/posts/page/6/index.html | 2 +- docs/posts/page/7/index.html | 2 +- docs/posts/page/8/index.html | 2 +- docs/sitemap.xml | 10 +-- 25 files changed, 220 insertions(+), 30 deletions(-) diff --git a/content/posts/2021-06.md b/content/posts/2021-06.md index 895b3226a..5c84b3e8d 100644 --- a/content/posts/2021-06.md +++ b/content/posts/2021-06.md @@ -232,5 +232,96 @@ Total number of bot hits purged: 5522 - `node-superagent/3.8.3` - `cortex/1.0` - These bots account for ~42,000 hits in our statistics... I will just purge them and add them to our local override, but I can't be bothered to submit them to COUNTER-Robots since I'd have to look up the information for each one +- I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18) + +## 2021-06-23 + +- I woke up this morning to find CGSpace down + - The logs show a high number of abandoned PostgreSQL connections and locks: + +```console +# journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned' +978 +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l +10100 +``` + +- I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now +- In the mean time, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test) +- I also applied the following patches from the 6.4 milestone to our `6_x-prod` branch: + - DS-4065: resource policy aware REST API hibernate queries + - DS-4271: Replaced brackets with double quotes in SolrServiceImpl +- After upgrading and restarting Tomcat the database connections and locks were back down to normal levels: + +```console +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l +63 +``` + +- Looking in the DSpace log, the first "pool empty" message I saw this morning was at 4AM: + +```console +2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000]. +``` + +- Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent: + +```console +Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy) +``` + +- We can purge them, as this is not user traffic: https://about.flipboard.com/browserproxy/ + - I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots +- I merged [Moayad's health check pull request in AReS](https://github.com/ilri/OpenRXV/pull/96) and I will deploy it on the production server soon + +## 2021-06-24 + +- I deployed the new OpenRXV code on CGSpace but I'm having problems with the indexing, something about missing the mappings on the `openrxv-items-temp` index + - I extracted the mappings from my local instance using `elasticdump` and after putting them on CGSpace I was able to harvest... + - But still, there are way too many duplicates and I'm not sure what the actual number of items should be + - According to the OAI ListRecords for each of our repositories, we should have about: + - MELSpace: 9537 + - WorldFish: 4483 + - CGSpace: 91305 + - Total: 105325 + - Looking at the last backup I have from harvesting before these changes we have 104,000 total handles, but only 99186 unique: + +```console +$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l +104797 +$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l +99186 +``` + +- This number is probably unique for that particular harvest, but I don't think it represents the true number of items... +- The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items: + +```console +$ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l +90990 +``` + +- So the harvest on the live site is missing items, then why didn't the add missing items plugin find them?! + - I notice that we are missing the `type` in the metadata structure config for each repository on the production site, and we are using `type` for item type in the actual schema... so maybe there is a conflict there + - I will rename type to `item_type` and add it back to the metadata structure + - The add missing items definitely checks this field... + - I modified my local backup to add `type: item` and uploaded it to the temp index on production + - Oh! nginx is blocking OpenRXV's attempt to read the sitemap: + +```console +172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV" +``` + +- I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins... now it's checking 180,000+ handles to see if they are collections or items... + - I see it fetched the sitemap three times, we need to make sure it's only doing it once for each repository +- According to the api logs we will be adding 5,697 items: + +```console +$ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l +5697 +``` + +- Spent a few hours with Moayad troubleshooting and improving OpenRXV + - We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs diff --git a/docs/2021-06/index.html b/docs/2021-06/index.html index 8983a62e1..50d47d782 100644 --- a/docs/2021-06/index.html +++ b/docs/2021-06/index.html @@ -20,7 +20,7 @@ I simply started it and AReS was running again: - + @@ -46,9 +46,9 @@ I simply started it and AReS was running again: "@type": "BlogPosting", "headline": "June, 2021", "url": "https://alanorth.github.io/cgspace-notes/2021-06/", - "wordCount": "1665", + "wordCount": "2396", "datePublished": "2021-06-01T10:51:07+03:00", - "dateModified": "2021-06-21T16:24:40+03:00", + "dateModified": "2021-06-22T15:22:15+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -387,6 +387,105 @@ Total number of bot hits purged: 5522
  • These bots account for ~42,000 hits in our statistics… I will just purge them and add them to our local override, but I can’t be bothered to submit them to COUNTER-Robots since I’d have to look up the information for each one
  • +
  • I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18)
  • + +

    2021-06-23

    + +
    # journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
    +978
    +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +10100
    +
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
    +63
    +
    +
    2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
    +
    +
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
    +
    +

    2021-06-24

    + +
    $ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
    +104797
    +$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
    +99186
    +
    +
    $ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
    +90990
    +
    +
    172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
    +
    +
    $ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
    +5697
    +
    diff --git a/docs/categories/index.html b/docs/categories/index.html index b818cd054..855e30a6e 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index 1de5dc72f..5672a7dfd 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index ca14076a0..8caa3602c 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index c03168b20..e606bdbee 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 6b6264987..66269fbfc 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html index e27af1461..34bdc5d34 100644 --- a/docs/categories/notes/page/5/index.html +++ b/docs/categories/notes/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/index.html b/docs/index.html index fb314d7dd..80eac58f3 100644 --- a/docs/index.html +++ b/docs/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index da7e9ad3c..e92bf2f4a 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 16001d89c..335dc0cc1 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index 9c3082ad2..ffd1edf7c 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index 95ffaee34..1170fa787 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/6/index.html b/docs/page/6/index.html index 129773daf..8cca8be5c 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/7/index.html b/docs/page/7/index.html index af4d1cebf..c7116b328 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/8/index.html b/docs/page/8/index.html index 9257775ac..bcf6386d6 100644 --- a/docs/page/8/index.html +++ b/docs/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index ad9164f99..414ae7b22 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 9762f6c10..c67142db9 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index b3f22c942..b381a0145 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index 1f6b8eba3..fdc89eb83 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index d695ff236..0f9c83448 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index 795f7695a..c29899106 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html index c737001a0..b7a9fac64 100644 --- a/docs/posts/page/7/index.html +++ b/docs/posts/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html index ff0f63835..5e6fc6722 100644 --- a/docs/posts/page/8/index.html +++ b/docs/posts/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 9f6d2fd01..8f5721606 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml"> https://alanorth.github.io/cgspace-notes/categories/ - 2021-06-21T16:24:40+03:00 + 2021-06-22T15:22:15+03:00 https://alanorth.github.io/cgspace-notes/ - 2021-06-21T16:24:40+03:00 + 2021-06-22T15:22:15+03:00 https://alanorth.github.io/cgspace-notes/2021-06/ - 2021-06-21T16:24:40+03:00 + 2021-06-22T15:22:15+03:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2021-06-21T16:24:40+03:00 + 2021-06-22T15:22:15+03:00 https://alanorth.github.io/cgspace-notes/posts/ - 2021-06-21T16:24:40+03:00 + 2021-06-22T15:22:15+03:00 https://alanorth.github.io/cgspace-notes/2021-05/ 2021-05-30T22:09:06+03:00