Add notes for 2021-06-24

Alan Orth 2021-06-25 09:34:29 +03:00
parent b3577743e0
commit b36808718c
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
25 changed files with 220 additions and 30 deletions


@ -232,5 +232,96 @@ Total number of bot hits purged: 5522
- `node-superagent/3.8.3`
- `cortex/1.0`
- These bots account for ~42,000 hits in our statistics... I will just purge them and add them to our local override, but I can't be bothered to submit them to COUNTER-Robots since I'd have to look up the information for each one
- I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18)
## 2021-06-23
- I woke up this morning to find CGSpace down
- The logs show a high number of abandoned PostgreSQL connections and locks:
```console
# journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
978
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
10100
```
- I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now
- In the meantime, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)
- I also applied the following patches from the 6.4 milestone to our `6_x-prod` branch:
- DS-4065: resource policy aware REST API hibernate queries
- DS-4271: Replaced brackets with double quotes in SolrServiceImpl
- After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
63
```
- Looking in the DSpace log, the first "pool empty" message I saw this morning was at 4AM:
```console
2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
```
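A minimal sketch for pulling the pool counters out of a "Pool empty" line like the one above, so they can be compared over time. The function name `parse_pool_status` is my own, and the regular expression assumes the bracketed `size/busy/idle/lastwait` suffix always has exactly this shape:

```python
import re

# Matches the bracketed status suffix of a "Pool empty" log message.
POOL_STATUS = re.compile(
    r"\[size:(?P<size>\d+); busy:(?P<busy>\d+); "
    r"idle:(?P<idle>\d+); lastwait:(?P<lastwait>\d+)\]"
)


def parse_pool_status(line):
    """Return the pool counters from a log line as ints, or None if absent."""
    match = POOL_STATUS.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


line = (
    "2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper "
    "@ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a "
    "connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000]."
)
status = parse_pool_status(line)
# A pool is exhausted when every connection is busy and none are idle.
exhausted = status["busy"] == status["size"] and status["idle"] == 0
```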
- Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:
```console
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
```
- We can purge them, as this is not user traffic: https://about.flipboard.com/browserproxy/
- I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots
- I merged [Moayad's health check pull request in AReS](https://github.com/ilri/OpenRXV/pull/96) and I will deploy it on the production server soon
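As a rough illustration of how the Flipboard user-agent above would be caught by a local pattern list, here is a sketch. The `LOCAL_PATTERNS` list and the `is_bot` helper are hypothetical, and I am assuming each pattern is treated as a case-insensitive regular expression searched anywhere in the user-agent string:

```python
import re

# Hypothetical local override patterns, one regular expression per entry.
LOCAL_PATTERNS = [r"FlipboardProxy", r"node-superagent", r"cortex/"]


def is_bot(user_agent, patterns=LOCAL_PATTERNS):
    """Return True if any pattern matches somewhere in the user-agent string."""
    return any(re.search(p, user_agent, re.IGNORECASE) for p in patterns)


flipboard = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 "
    "Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)"
)
# An ordinary browser user-agent, made up for comparison.
browser = "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
```

Matching on the `FlipboardProxy` token rather than the whole string keeps the pattern stable if the proxy version or the browser part of the user-agent changes.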
## 2021-06-24
- I deployed the new OpenRXV code on CGSpace but I'm having problems with the indexing, something about missing the mappings on the `openrxv-items-temp` index
- I extracted the mappings from my local instance using `elasticdump` and after putting them on CGSpace I was able to harvest...
- But still, there are way too many duplicates and I'm not sure what the actual number of items should be
- According to the OAI ListRecords for each of our repositories, we should have about:
- MELSpace: 9537
- WorldFish: 4483
- CGSpace: 91305
- Total: 105325
- Looking at the last backup I have from harvesting before these changes, we have about 104,000 total handles, but only 99,186 unique:
```console
$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
104797
$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
99186
```
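The same total-versus-unique count can be sketched in Python, which also makes it easy to see which handles are duplicated. This is only a sketch: `handle_stats` is a name I made up, the sample handles are invented, and it assumes the dump is small enough to read into memory as one string:

```python
import re
from collections import Counter

# Same shape as the grep pattern: "handle":"<prefix>/<suffix>"
HANDLE = re.compile(r'"handle":"([0-9.]+/[0-9]+)"')


def handle_stats(text):
    """Return (total handles, unique handles, {handle: count} for duplicates)."""
    counts = Counter(HANDLE.findall(text))
    total = sum(counts.values())
    duplicates = {handle: n for handle, n in counts.items() if n > 1}
    return total, len(counts), duplicates


# Invented sample data standing in for the real Elasticsearch dump.
sample = '{"handle":"10568/1"}\n{"handle":"10568/2"}\n{"handle":"10568/1"}\n'
total, unique, duplicates = handle_stats(sample)
```

For the real backup, the text would come from reading `cgspace-openrxv-items-temp-backup.json`.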
- This number is probably unique for that particular harvest, but I don't think it represents the true number of items...
- The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:
```console
$ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
90990
```
- So the harvest on the live site is missing items, but then why didn't the add missing items plugin find them?!
- I notice that we are missing the `type` in the metadata structure config for each repository on the production site, and we are using `type` for item type in the actual schema... so maybe there is a conflict there
- I will rename `type` to `item_type` and add it back to the metadata structure
- The add missing items plugin definitely checks this field...
- I modified my local backup to add `type: item` and uploaded it to the temp index on production
- Oh! nginx is blocking OpenRXV's attempt to read the sitemap:
```console
172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
```
- I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins... now it's checking 180,000+ handles to see if they are collections or items...
- I see it fetched the sitemap three times; we need to make sure it only does so once for each repository
- According to the api logs we will be adding 5,697 items:
```console
$ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
5697
```
- Spent a few hours with Moayad troubleshooting and improving OpenRXV
- We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs
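To illustrate the ID difference only (this is not the actual OpenRXV fix, which is not described here), a harvester could tell the two kinds of identifier apart along these lines. The function name and the returned labels are hypothetical:

```python
import uuid


def classify_item_id(raw_id):
    """Classify an item ID as DSpace 5 style (numeric) or DSpace 6 style (UUID)."""
    text = str(raw_id).strip()
    # DSpace 5 style: a plain integer ID.
    if text.isascii() and text.isdigit():
        return "dspace5-numeric"
    # DSpace 6 style: anything the standard library accepts as a UUID.
    try:
        uuid.UUID(text)
    except ValueError:
        return "unknown"
    return "dspace6-uuid"


# Example IDs are made up for illustration.
kinds = [
    classify_item_id(12345),
    classify_item_id("f81d4fae-7dec-11d0-a765-00a0c91e6bf6"),
    classify_item_id("not-an-id"),
]
```

Treating every ID as a string downstream, regardless of its kind, is one way to avoid type mismatches when both repository versions feed the same index.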
<!-- vim: set sw=2 ts=2: -->


@ -20,7 +20,7 @@ I simply started it and AReS was running again:
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2021-06/" />
<meta property="article:published_time" content="2021-06-01T10:51:07+03:00" />
<meta property="article:modified_time" content="2021-06-21T16:24:40+03:00" />
<meta property="article:modified_time" content="2021-06-22T15:22:15+03:00" />
@ -46,9 +46,9 @@ I simply started it and AReS was running again:
"@type": "BlogPosting",
"headline": "June, 2021",
"url": "https://alanorth.github.io/cgspace-notes/2021-06/",
"wordCount": "1665",
"wordCount": "2396",
"datePublished": "2021-06-01T10:51:07+03:00",
"dateModified": "2021-06-21T16:24:40+03:00",
"dateModified": "2021-06-22T15:22:15+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -387,6 +387,105 @@ Total number of bot hits purged: 5522
</ul>
</li>
<li>These bots account for ~42,000 hits in our statistics&hellip; I will just purge them and add them to our local override, but I can&rsquo;t be bothered to submit them to COUNTER-Robots since I&rsquo;d have to look up the information for each one</li>
<li>I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18)</li>
</ul>
<h2 id="2021-06-23">2021-06-23</h2>
<ul>
<li>I woke up this morning to find CGSpace down
<ul>
<li>The logs show a high number of abandoned PostgreSQL connections and locks:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console"># journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
978
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
10100
</code></pre><ul>
<li>I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now</li>
<li>In the meantime, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)</li>
<li>I also applied the following patches from the 6.4 milestone to our <code>6_x-prod</code> branch:
<ul>
<li>DS-4065: resource policy aware REST API hibernate queries</li>
<li>DS-4271: Replaced brackets with double quotes in SolrServiceImpl</li>
</ul>
</li>
<li>After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
63
</code></pre><ul>
<li>Looking in the DSpace log, the first &ldquo;pool empty&rdquo; message I saw this morning was at 4AM:</li>
</ul>
<pre><code class="language-console" data-lang="console">2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
</code></pre><ul>
<li>Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:</li>
</ul>
<pre><code class="language-console" data-lang="console">Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
</code></pre><ul>
<li>We can purge them, as this is not user traffic: <a href="https://about.flipboard.com/browserproxy/">https://about.flipboard.com/browserproxy/</a>
<ul>
<li>I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots</li>
</ul>
</li>
<li>I merged <a href="https://github.com/ilri/OpenRXV/pull/96">Moayad&rsquo;s health check pull request in AReS</a> and I will deploy it on the production server soon</li>
</ul>
<h2 id="2021-06-24">2021-06-24</h2>
<ul>
<li>I deployed the new OpenRXV code on CGSpace but I&rsquo;m having problems with the indexing, something about missing the mappings on the <code>openrxv-items-temp</code> index
<ul>
<li>I extracted the mappings from my local instance using <code>elasticdump</code> and after putting them on CGSpace I was able to harvest&hellip;</li>
<li>But still, there are way too many duplicates and I&rsquo;m not sure what the actual number of items should be</li>
<li>According to the OAI ListRecords for each of our repositories, we should have about:
<ul>
<li>MELSpace: 9537</li>
<li>WorldFish: 4483</li>
<li>CGSpace: 91305</li>
<li>Total: 105325</li>
</ul>
</li>
<li>Looking at the last backup I have from harvesting before these changes, we have about 104,000 total handles, but only 99,186 unique:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspace-openrxv-items-temp-backup.json | wc -l
104797
$ grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
99186
</code></pre><ul>
<li>This number is probably unique for that particular harvest, but I don&rsquo;t think it represents the true number of items&hellip;</li>
<li>The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ grep -E '&quot;repo&quot;:&quot;DSpace Test&quot;' 2021-06-23-openrxv-items-final-local.json | grep -oE '&quot;handle&quot;:&quot;([[:digit:]]|\.)+/[[:digit:]]+&quot;' | sort | uniq | wc -l
90990
</code></pre><ul>
<li>So the harvest on the live site is missing items, but then why didn&rsquo;t the add missing items plugin find them?!
<ul>
<li>I notice that we are missing the <code>type</code> in the metadata structure config for each repository on the production site, and we are using <code>type</code> for item type in the actual schema&hellip; so maybe there is a conflict there</li>
<li>I will rename <code>type</code> to <code>item_type</code> and add it back to the metadata structure</li>
<li>The add missing items plugin definitely checks this field&hellip;</li>
<li>I modified my local backup to add <code>type: item</code> and uploaded it to the temp index on production</li>
<li>Oh! nginx is blocking OpenRXV&rsquo;s attempt to read the sitemap:</li>
</ul>
</li>
</ul>
<pre><code class="language-console" data-lang="console">172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] &quot;GET /sitemap HTTP/1.1&quot; 503 190 &quot;-&quot; &quot;OpenRXV harvesting bot; https://github.com/ilri/OpenRXV&quot;
</code></pre><ul>
<li>I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins&hellip; now it&rsquo;s checking 180,000+ handles to see if they are collections or items&hellip;
<ul>
<li>I see it fetched the sitemap three times; we need to make sure it only does so once for each repository</li>
</ul>
</li>
<li>According to the api logs we will be adding 5,697 items:</li>
</ul>
<pre><code class="language-console" data-lang="console">$ docker logs api 2&gt;/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
5697
</code></pre><ul>
<li>Spent a few hours with Moayad troubleshooting and improving OpenRXV
<ul>
<li>We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->


@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2021-06-21T16:24:40+03:00" />
<meta property="og:updated_time" content="2021-06-22T15:22:15+03:00" />


@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2021-06-21T16:24:40+03:00" />
<meta property="og:updated_time" content="2021-06-22T15:22:15+03:00" />


@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2021-06-21T16:24:40+03:00" />
<meta property="og:updated_time" content="2021-06-22T15:22:15+03:00" />


@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2021-06-21T16:24:40+03:00" />
<meta property="og:updated_time" content="2021-06-22T15:22:15+03:00" />


@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2021-06-21T16:24:40+03:00</lastmod>
<lastmod>2021-06-22T15:22:15+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2021-06-21T16:24:40+03:00</lastmod>
<lastmod>2021-06-22T15:22:15+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-06/</loc>
<lastmod>2021-06-21T16:24:40+03:00</lastmod>
<lastmod>2021-06-22T15:22:15+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2021-06-21T16:24:40+03:00</lastmod>
<lastmod>2021-06-22T15:22:15+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2021-06-21T16:24:40+03:00</lastmod>
<lastmod>2021-06-22T15:22:15+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2021-05/</loc>
<lastmod>2021-05-30T22:09:06+03:00</lastmod>