mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-22 06:35:03 +01:00
Add notes for 2018-08-19
parent 3c5174eb35
commit 545e5ecd78
@@ -1050,3 +1050,4 @@ dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.p
- I took a few snapshots during the process and noticed 500, 800, and even 2000 locks at certain times during the process
- Afterwards I looked a few times and saw only 150 or 200 locks
- On the test server, with the [PostgreSQL indexes from DS-3636](https://jira.duraspace.org/browse/DS-3636) applied, it finished instantly
- Run system updates on DSpace Test and reboot the server
@@ -85,4 +85,124 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
```

## 2018-08-19

- Keep working on the CIAT ORCID identifiers from Elizabeth
- In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (ie "Schultze-Kraft, Rainer" and "Schultze-Kraft, R.") I will just tag them with ORCID identifiers too
- This is less obvious and more error prone with names like "Peters" where there are many more authors
- I see some errors in the variations of names as well, for example:

```
Verchot, Louis
Verchot, L
Verchot, L. V.
Verchot, L.V
Verchot, L.V.
Verchot, LV
Verchot, Louis V.
```

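One way to spot candidate variants like these is to reduce each name to a coarse key. The sketch below is only an illustration and is not part of `add-orcid-identifiers-csv.py`; the helper names `variant_key` and `group_variants` are hypothetical. It keys on surname plus first initial, which is deliberately loose, so common surnames such as "Peters" will collide and still need a manual check.

```python
from collections import defaultdict


def variant_key(name):
    """Reduce "Surname, Given names" to (surname, first initial), lowercased."""
    surname, _, given = name.partition(",")
    initial = given.strip()[:1]
    return surname.strip().lower(), initial.lower()


def group_variants(names):
    """Group names that share the same coarse key."""
    groups = defaultdict(list)
    for name in names:
        groups[variant_key(name)].append(name)
    return dict(groups)


variants = [
    "Verchot, Louis",
    "Verchot, L",
    "Verchot, L. V.",
    "Verchot, L.V",
    "Verchot, L.V.",
    "Verchot, LV",
    "Verchot, Louis V.",
]
groups = group_variants(variants)
for key, names in groups.items():
    print(key, len(names))
```

Every grouping produced this way is only a candidate list: two different authors with the same surname and initial end up under the same key.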
- I'll just tag them all with Louis Verchot's ORCID identifier...
- In the end, I'll run the following CSV with my [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script:

```
dc.contributor.author,cg.creator.id
"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
"Peters, Michael",Michael Peters: 0000-0003-4237-3916
"Peters, M.",Michael Peters: 0000-0003-4237-3916
"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
"Verchot, L",Louis Verchot: 0000-0001-8309-6754
"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
```

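Before running a CSV like this against the database, it may be worth checking that every identifier is well-formed. ORCID identifiers end in an ISO 7064 MOD 11-2 check character, so a transcription slip can be caught offline. The following is a minimal sketch of such a check and is not part of the `add-orcid-identifiers-csv.py` script; the function names and the two-row `sample` are illustrative, and it assumes the `Name: 0000-0000-0000-0000` layout used in the `cg.creator.id` column above.

```python
import csv
import io


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if the identifier has 16 characters and a matching check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


def invalid_rows(csv_text):
    """Return (author, cg.creator.id) pairs whose identifier fails the checksum."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The identifier is whatever follows the last ": " in the column value
        identifier = row["cg.creator.id"].rsplit(": ", 1)[-1]
        if not is_valid_orcid(identifier):
            bad.append((row["dc.contributor.author"], row["cg.creator.id"]))
    return bad


sample = '''dc.contributor.author,cg.creator.id
"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
'''
print(invalid_rows(sample))
```

A passing checksum only shows that the identifier is internally consistent. It says nothing about whether the identifier belongs to the author it is paired with.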
- The invocation would be:

```
$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
```

- I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers
- Looking at the list of author affiliations from Peter one last time
- I notice that I should add the Unicode character 0x00b4 (´) to my list of invalid characters to look for in Open Refine, so the latest version of the GREL expression is:

```
or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/))
)
```

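For checking values outside Open Refine, the same test can be approximated in Python. This is a rough sketch rather than an exact port of the GREL expression: it flags any value containing one of the five code points listed above, and the name `has_suspicious_character` is made up for illustration.

```python
import re

# The five code points from the GREL expression: replacement character,
# no-break space, hair space, right single quotation mark, acute accent
SUSPICIOUS = re.compile("[\uFFFD\u00A0\u200A\u2019\u00B4]")


def has_suspicious_character(value):
    """Return True if the value contains any of the flagged characters."""
    return SUSPICIOUS.search(value) is not None


for value in ["De\u00b4veloppement", "Développement", "Investigacio\u00b4n"]:
    print(repr(value), has_suspicious_character(value))
```

A properly composed "é" is not flagged. Only the stand-alone acute accent (U+00B4) that is left behind by a broken encoding is.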
- This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n
- I will run the following on DSpace Test and CGSpace:

```
$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
```

- Then force an update of the Discovery index on DSpace Test:

```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 72m12.570s
user 6m45.305s
sys 2m2.461s
```

- And then on CGSpace:

```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 79m44.392s
user 8m50.730s
sys 2m20.248s
```

- Run system updates on DSpace Test and reboot the server
- In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:

```
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
1553
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
1724
```

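Note that `grep -c` reports the number of matching log lines, which is not necessarily the number of distinct sessions. The sketch below counts both, assuming the `session_id=…:ip_addr=…` layout implied by the grep pattern above. The function name `distinct_sessions` is hypothetical, and the IP address is escaped so that the dots only match literal dots.

```python
import re


def distinct_sessions(lines, ip):
    """Return (matching line count, distinct session count) for one IP address."""
    pattern = re.compile(
        r"session_id=([A-Z0-9]{32}):ip_addr=" + re.escape(ip) + r"(?!\d)"
    )
    sessions = set()
    matching_lines = 0
    for line in lines:
        match = pattern.search(line)
        if match:
            matching_lines += 1
            sessions.add(match.group(1))
    return matching_lines, len(sessions)


# Synthetic log lines for illustration only, not real DSpace output
example = [
    "INFO session_id=" + "A" * 32 + ":ip_addr=5.9.6.51:view_item",
    "INFO session_id=" + "A" * 32 + ":ip_addr=5.9.6.51:view_bitstream",
    "INFO session_id=" + "B" * 32 + ":ip_addr=5.9.6.51:view_item",
    "INFO session_id=" + "C" * 32 + ":ip_addr=5.9.6.512:view_item",
]
print(distinct_sessions(example, "5.9.6.51"))
```

To run it on a real log, pass an open file object such as `open("dspace.log.2018-08-19")` as `lines`.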
- I don't even know how it's possible for the bot to use MORE sessions than total requests...
- The user agent is:

```
Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
```

- So I'm thinking we should add "crawl" to the Tomcat Crawler Session Manager valve, as we already have "bot" that catches Googlebot, Bingbot, etc.

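The effect of adding "crawl" can be checked offline before touching the Tomcat configuration. The valve's `crawlerUserAgents` attribute is a regular expression that has to match the whole User-Agent header, so the sketch below uses `re.fullmatch` to mimic that. Both patterns are illustrative assumptions and not the values from the actual server configuration, and the Googlebot string is the commonly published one, used here only as a comparison.

```python
import re

# Hypothetical patterns, for illustration only
CURRENT_PATTERN = r".*[bB]ot.*"
PROPOSED_PATTERN = r".*[bB]ot.*|.*[cC]rawl.*"

MEGAINDEX = "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def is_crawler(user_agent, pattern):
    """Return True if the whole User-Agent string matches the pattern."""
    return re.fullmatch(pattern, user_agent) is not None


for name, agent in [("MegaIndex", MEGAINDEX), ("Googlebot", GOOGLEBOT)]:
    print(name, is_crawler(agent, CURRENT_PATTERN), is_crawler(agent, PROPOSED_PATTERN))
```

With the "bot"-only pattern the MegaIndex user agent is not matched, because the string contains "crawler" but not "bot". The extended pattern matches both agents.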
<!-- vim: set sw=2 ts=2: -->
@@ -57,7 +57,7 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-pl
"@type": "BlogPosting",
"headline": "February, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-02/",
"wordCount": "6400",
"wordCount": "6410",
"datePublished": "2018-02-01T16:28:54+02:00",
"dateModified": "2018-03-09T22:10:33+02:00",
"author": {
@@ -1296,6 +1296,7 @@ UPDATE 3
<li>I took a few snapshots during the process and noticed 500, 800, and even 2000 locks at certain times during the process</li>
<li>Afterwards I looked a few times and saw only 150 or 200 locks</li>
<li>On the test server, with the <a href="https://jira.duraspace.org/browse/DS-3636">PostgreSQL indexes from DS-3636</a> applied, it finished instantly</li>
<li>Run system updates on DSpace Test and reboot the server</li>
</ul>

@@ -34,7 +34,7 @@ I ran all system updates on DSpace Test and rebooted it

<meta property="article:published_time" content="2018-08-01T11:52:54+03:00"/>

<meta property="article:modified_time" content="2018-08-16T15:40:38+03:00"/>
<meta property="article:modified_time" content="2018-08-16T18:59:45+03:00"/>

@@ -79,9 +79,9 @@ I ran all system updates on DSpace Test and rebooted it
"@type": "BlogPosting",
"headline": "August, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-08/",
"wordCount": "834",
"wordCount": "1376",
"datePublished": "2018-08-01T11:52:54+03:00",
"dateModified": "2018-08-16T15:40:38+03:00",
"dateModified": "2018-08-16T18:59:45+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -241,6 +241,137 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
</code></pre>

<h2 id="2018-08-19">2018-08-19</h2>

<ul>
<li>Keep working on the CIAT ORCID identifiers from Elizabeth</li>
<li>In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (ie “Schultze-Kraft, Rainer” and “Schultze-Kraft, R.”) I will just tag them with ORCID identifiers too</li>
<li>This is less obvious and more error prone with names like “Peters” where there are many more authors</li>
<li>I see some errors in the variations of names as well, for example:</li>
</ul>

<pre><code>Verchot, Louis
Verchot, L
Verchot, L. V.
Verchot, L.V
Verchot, L.V.
Verchot, LV
Verchot, Louis V.
</code></pre>

<ul>
<li>I’ll just tag them all with Louis Verchot’s ORCID identifier…</li>
<li>In the end, I’ll run the following CSV with my <a href="https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050">add-orcid-identifiers-csv.py</a> script:</li>
</ul>

<pre><code>dc.contributor.author,cg.creator.id
"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
"Peters, Michael",Michael Peters: 0000-0003-4237-3916
"Peters, M.",Michael Peters: 0000-0003-4237-3916
"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
"Verchot, L",Louis Verchot: 0000-0001-8309-6754
"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
</code></pre>

<ul>
<li>The invocation would be:</li>
</ul>

<pre><code>$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
</code></pre>

<ul>
<li>I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers</li>
<li>Looking at the list of author affiliations from Peter one last time</li>
<li>I notice that I should add the Unicode character 0x00b4 (´) to my list of invalid characters to look for in Open Refine, so the latest version of the GREL expression is:</li>
</ul>

<pre><code>or(
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b4.*/))
)
</code></pre>

<ul>
<li>This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n</li>
<li>I will run the following on DSpace Test and CGSpace:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
</code></pre>

<ul>
<li>Then force an update of the Discovery index on DSpace Test:</li>
</ul>

<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 72m12.570s
user 6m45.305s
sys 2m2.461s
</code></pre>

<ul>
<li>And then on CGSpace:</li>
</ul>

<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

real 79m44.392s
user 8m50.730s
sys 2m20.248s
</code></pre>

<ul>
<li>Run system updates on DSpace Test and reboot the server</li>
<li>In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:</li>
</ul>

<pre><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
1553
# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
1724
</code></pre>

<ul>
<li>I don’t even know how it’s possible for the bot to use MORE sessions than total requests…</li>
<li>The user agent is:</li>
</ul>

<pre><code>Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
</code></pre>

<ul>
<li>So I’m thinking we should add “crawl” to the Tomcat Crawler Session Manager valve, as we already have “bot” that catches Googlebot, Bingbot, etc.</li>
</ul>

<!-- vim: set sw=2 ts=2: -->

@@ -4,7 +4,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-08/</loc>
<lastmod>2018-08-16T15:40:38+03:00</lastmod>
<lastmod>2018-08-16T18:59:45+03:00</lastmod>
</url>

<url>
@@ -179,7 +179,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2018-08-16T15:40:38+03:00</lastmod>
<lastmod>2018-08-16T18:59:45+03:00</lastmod>
<priority>0</priority>
</url>

@@ -190,7 +190,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2018-08-16T15:40:38+03:00</lastmod>
<lastmod>2018-08-16T18:59:45+03:00</lastmod>
<priority>0</priority>
</url>

@@ -202,13 +202,13 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2018-08-16T15:40:38+03:00</lastmod>
<lastmod>2018-08-16T18:59:45+03:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2018-08-16T15:40:38+03:00</lastmod>
<lastmod>2018-08-16T18:59:45+03:00</lastmod>
<priority>0</priority>
</url>
