From 545e5ecd7883dd820b3541e7d1105975f7f6ae74 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Sun, 19 Aug 2018 18:42:55 +0300
Subject: [PATCH] Add notes for 2018-08-19

---
 content/posts/2018-02.md |   1 +
 content/posts/2018-08.md | 120 ++++++++++++++++++++++++++++++++++
 docs/2018-02/index.html  |   3 +-
 docs/2018-08/index.html  | 137 ++++++++++++++++++++++++++++++++++++++-
 docs/sitemap.xml         |  10 +--
 5 files changed, 262 insertions(+), 9 deletions(-)

diff --git a/content/posts/2018-02.md b/content/posts/2018-02.md
index d64dd1990..74c10b2a2 100644
--- a/content/posts/2018-02.md
+++ b/content/posts/2018-02.md
@@ -1050,3 +1050,4 @@ dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.p
 - I took a few snapshots during the process and noticed 500, 800, and even 2000 locks at certain times during the process
 - Afterwards I looked a few times and saw only 150 or 200 locks
 - On the test server, with the [PostgreSQL indexes from DS-3636](https://jira.duraspace.org/browse/DS-3636) applied, it finished instantly
+- Run system updates on DSpace Test and reboot the server
diff --git a/content/posts/2018-08.md b/content/posts/2018-08.md
index f81dc587a..0db4a7741 100644
--- a/content/posts/2018-08.md
+++ b/content/posts/2018-08.md
@@ -85,4 +85,124 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
 $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
 ```
 
+## 2018-08-19
+
+- Keep working on the CIAT ORCID identifiers from Elizabeth
+- In the spreadsheet she sent me there are some names with other versions in the database, so when it is obviously the same one (ie "Schultze-Kraft, Rainer" and "Schultze-Kraft, R.") I will just tag them with ORCID identifiers too
+- This is less obvious and more error prone with names like "Peters" where there are many more authors
+- I see some errors in the variations of names as well, for example:
+
+```
+Verchot, Louis
+Verchot, L
+Verchot, L. V.
+Verchot, L.V
+Verchot, L.V.
+Verchot, LV
+Verchot, Louis V.
+```
+
+- I'll just tag them all with Louis Verchot's ORCID identifier...
+- In the end, I'll run the following CSV with my [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script:
+
+```
+dc.contributor.author,cg.creator.id
+"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
+"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
+"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
+"Peters, Michael",Michael Peters: 0000-0003-4237-3916
+"Peters, M.",Michael Peters: 0000-0003-4237-3916
+"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
+"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
+"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
+"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
+"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
+"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
+"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
+"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
+"Verchot, L",Louis Verchot: 0000-0001-8309-6754
+"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
+"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
+"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
+"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
+"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
+"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
+"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
+"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
+"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
+"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
+"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
+"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
+"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
+```
+
+- The invocation would be:
+
+```
+$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
+```
+
+- I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers
+- Looking at the list of author affiliations from Peter one last time
+- I notice that I should add the Unicode character 0x00b4 (´) to my list of invalid characters to look for in Open Refine, so the latest version of the GREL expression is:
+
+```
+or(
+  isNotNull(value.match(/.*\uFFFD.*/)),
+  isNotNull(value.match(/.*\u00A0.*/)),
+  isNotNull(value.match(/.*\u200A.*/)),
+  isNotNull(value.match(/.*\u2019.*/)),
+  isNotNull(value.match(/.*\u00b4.*/))
+)
+```
+
+- This character all by itself is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n
+- I will run the following on DSpace Test and CGSpace:
+
+```
+$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
+$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
+```
+
+- Then force an update of the Discovery index on DSpace Test:
+
+```
+$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
+$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
+
+real    72m12.570s
+user    6m45.305s
+sys     2m2.461s
+```
+
+- And then on CGSpace:
+
+```
+$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
+$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
+
+real    79m44.392s
+user    8m50.730s
+sys     2m20.248s
+```
+
+- Run system updates on DSpace Test and reboot the server
+- In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:
+
+```
+# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
+1553
+# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19
+1724
+```
+
+- I don't even know how it's possible for the bot to use MORE sessions than total requests...
+- The user agent is:
+
+```
+Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
+```
+
+- So I'm thinking we should add "crawl" to the Tomcat Crawler Session Manager valve, as we already have "bot" that catches Googlebot, Bingbot, etc.
+
diff --git a/docs/2018-02/index.html b/docs/2018-02/index.html
index da303fa36..68241ec8c 100644
--- a/docs/2018-02/index.html
+++ b/docs/2018-02/index.html
@@ -57,7 +57,7 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-pl
 "@type": "BlogPosting",
 "headline": "February, 2018",
 "url": "https://alanorth.github.io/cgspace-notes/2018-02/",
-"wordCount": "6400",
+"wordCount": "6410",
 "datePublished": "2018-02-01T16:28:54+02:00",
 "dateModified": "2018-03-09T22:10:33+02:00",
 "author": {
@@ -1296,6 +1296,7 @@ UPDATE 3
  • I took a few snapshots during the process and noticed 500, 800, and even 2000 locks at certain times during the process
  • Afterwards I looked a few times and saw only 150 or 200 locks
  • On the test server, with the PostgreSQL indexes from DS-3636 applied, it finished instantly
+  • Run system updates on DSpace Test and reboot the server
diff --git a/docs/2018-08/index.html b/docs/2018-08/index.html
index 4c32d7c75..0ff561f8e 100644
--- a/docs/2018-08/index.html
+++ b/docs/2018-08/index.html
@@ -34,7 +34,7 @@ I ran all system updates on DSpace Test and rebooted it
-
+
@@ -79,9 +79,9 @@ I ran all system updates on DSpace Test and rebooted it
 "@type": "BlogPosting",
 "headline": "August, 2018",
 "url": "https://alanorth.github.io/cgspace-notes/2018-08/",
-"wordCount": "834",
+"wordCount": "1376",
 "datePublished": "2018-08-01T11:52:54+03:00",
-"dateModified": "2018-08-16T15:40:38+03:00",
+"dateModified": "2018-08-16T18:59:45+03:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -241,6 +241,137 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
 $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
+

    2018-08-19

    + + + +
    Verchot, Louis
    +Verchot, L
    +Verchot, L. V.
    +Verchot, L.V
    +Verchot, L.V.
    +Verchot, LV
    +Verchot, Louis V.
    +
    + + + +
    dc.contributor.author,cg.creator.id
    +"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
    +"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
    +"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
    +"Peters, Michael",Michael Peters: 0000-0003-4237-3916
    +"Peters, M.",Michael Peters: 0000-0003-4237-3916
    +"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
    +"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
    +"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
    +"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
    +"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
    +"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
    +"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
    +"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, L",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
    +"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
    +"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
    +"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
    +"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
    +"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
    +"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
    +"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
    +"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
    +"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
    +
    + + + +
    $ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
    +
    + + + +
    or(
    +  isNotNull(value.match(/.*\uFFFD.*/)),
    +  isNotNull(value.match(/.*\u00A0.*/)),
    +  isNotNull(value.match(/.*\u200A.*/)),
    +  isNotNull(value.match(/.*\u2019.*/)),
    +  isNotNull(value.match(/.*\u00b4.*/))
    +)
    +
    + + + +
    $ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
    +$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
    +
    + + + +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
    +$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    +real    72m12.570s
    +user    6m45.305s
    +sys     2m2.461s
    +
    + + + +
    $ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
    +$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    +real    79m44.392s
    +user    8m50.730s
    +sys     2m20.248s
    +
    + + + +
    # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
    +1553
    +# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19 
    +1724
    +
    + + + +
    Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
    +
+
+
+
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 52fb1da43..5ff7ee31d 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
 https://alanorth.github.io/cgspace-notes/2018-08/
-2018-08-16T15:40:38+03:00
+2018-08-16T18:59:45+03:00
@@ -179,7 +179,7 @@
 https://alanorth.github.io/cgspace-notes/
-2018-08-16T15:40:38+03:00
+2018-08-16T18:59:45+03:00
 0
@@ -190,7 +190,7 @@
 https://alanorth.github.io/cgspace-notes/tags/notes/
-2018-08-16T15:40:38+03:00
+2018-08-16T18:59:45+03:00
 0
@@ -202,13 +202,13 @@
 https://alanorth.github.io/cgspace-notes/posts/
-2018-08-16T15:40:38+03:00
+2018-08-16T18:59:45+03:00
 0
 https://alanorth.github.io/cgspace-notes/tags/
-2018-08-16T15:40:38+03:00
+2018-08-16T18:59:45+03:00
 0