Add notes for 2018-08-19

2025-01-27 05:49:12 +01:00 · 2018-08-19 18:42:55 +03:00
parent 3c5174eb35
commit 545e5ecd78
5 changed files with 262 additions and 9 deletions
--- a/content/posts/2018-02.md
+++ b/content/posts/2018-02.md
@@ -1050,3 +1050,4 @@ dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.p
 - I took a few snapshots during the process and noticed 500, 800, and even 2000 locks at certain times during the process
 - Afterwards I looked a few times and saw only 150 or 200 locks
 - On the test server, with the [PostgreSQL indexes from DS-3636](https://jira.duraspace.org/browse/DS-3636) applied, it finished instantly
+- Run system updates on DSpace Test and reboot the server
--- a/content/posts/2018-08.md
+++ b/content/posts/2018-08.md
@@ -85,4 +85,124 @@ $ psql -h localhost -U postgres dspacetest -c 'alter user dspacetest nosuperuser
 $ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
 ```

+## 2018-08-19
+
+- Keep working on the CIAT ORCID identifiers from Elizabeth
+- In the spreadsheet she sent me, some names have other variants in the database, so when a variant is obviously the same person (e.g. "Schultze-Kraft, Rainer" and "Schultze-Kraft, R.") I will just tag it with the ORCID identifier too
+- This is less obvious and more error-prone with names like "Peters", where there are many more authors
+- I see some errors in the variations of names as well, for example:
+
+```
+Verchot, Louis
+Verchot, L
+Verchot, L. V.
+Verchot, L.V
+Verchot, L.V.
+Verchot, LV
+Verchot, Louis V.
+```
+
+- I'll just tag them all with Louis Verchot's ORCID identifier...
+- In the end, I'll run the following CSV with my [add-orcid-identifiers-csv.py](https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050) script:
+
+```
+dc.contributor.author,cg.creator.id
+"Campbell, Bruce",Bruce M Campbell: 0000-0002-0123-4859
+"Campbell, Bruce M.",Bruce M Campbell: 0000-0002-0123-4859
+"Campbell, B.M",Bruce M Campbell: 0000-0002-0123-4859
+"Peters, Michael",Michael Peters: 0000-0003-4237-3916
+"Peters, M.",Michael Peters: 0000-0003-4237-3916
+"Peters, M.K.",Michael Peters: 0000-0003-4237-3916
+"Tamene, Lulseged",Lulseged Tamene: 0000-0002-3806-8890
+"Desta, Lulseged Tamene",Lulseged Tamene: 0000-0002-3806-8890
+"Läderach, Peter",Peter Läderach: 0000-0001-8708-6318
+"Lundy, Mark",Mark Lundy: 0000-0002-5241-3777
+"Schultze-Kraft, Rainer",Rainer Schultze-Kraft: 0000-0002-4563-0044
+"Schultze-Kraft, R.",Rainer Schultze-Kraft: 0000-0002-4563-0044
+"Verchot, Louis",Louis Verchot: 0000-0001-8309-6754
+"Verchot, L",Louis Verchot: 0000-0001-8309-6754
+"Verchot, L. V.",Louis Verchot: 0000-0001-8309-6754
+"Verchot, L.V",Louis Verchot: 0000-0001-8309-6754
+"Verchot, L.V.",Louis Verchot: 0000-0001-8309-6754
+"Verchot, LV",Louis Verchot: 0000-0001-8309-6754
+"Verchot, Louis V.",Louis Verchot: 0000-0001-8309-6754
+"Mukankusi, Clare",Clare Mukankusi: 0000-0001-7837-4545
+"Mukankusi, Clare M.",Clare Mukankusi: 0000-0001-7837-4545
+"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
+"Wyckhuys, Kris A. G.",Kris Wyckhuys: 0000-0003-0922-488X
+"Wyckhuys, Kris A.G.",Kris Wyckhuys: 0000-0003-0922-488X
+"Chirinda, Ngonidzashe",Ngonidzashe Chirinda: 0000-0002-4213-6294
+"Chirinda, Ngoni",Ngonidzashe Chirinda: 0000-0002-4213-6294
+"Ngonidzashe, Chirinda",Ngonidzashe Chirinda: 0000-0002-4213-6294
+```
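The variants in the CSV above were matched by hand. As a rough sketch of how candidate variants could be collected for review, the snippet below groups names by surname plus first initial. The helper names `variant_key` and `group_variants` are hypothetical, and "Peters, Marianne" is a made-up name included only to show why a shared key is a hint and not proof of identity.

```python
from collections import defaultdict


def variant_key(name):
    """Return (lowercased surname, first initial) for a "Surname, Given" name."""
    surname, _, given = name.partition(",")
    given = given.strip()
    initial = given[0].upper() if given else ""
    return surname.strip().lower(), initial


def group_variants(names):
    """Group names that share a surname and first initial."""
    groups = defaultdict(list)
    for name in names:
        groups[variant_key(name)].append(name)
    return dict(groups)


names = [
    "Verchot, Louis",
    "Verchot, L",
    "Verchot, L. V.",
    "Verchot, LV",
    "Peters, Michael",
    "Peters, M.",
    "Peters, Marianne",  # made-up name: same key, different person
]

groups = group_variants(names)
for key, variants in groups.items():
    print(key, variants)
```

A key like this would not catch reversed names such as "Ngonidzashe, Chirinda", so every group still needs a manual check before tagging.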
+
+- The invocation of add-orcid-identifiers-csv.py would be:
+
+```
+$ ./add-orcid-identifiers-csv.py -i 2018-08-16-ciat-orcid.csv -db dspace -u dspace -p 'fuuu'
+```
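Before running a CSV like the one above, it may be worth checking that each identifier is well formed. ORCID iDs end in an ISO 7064 MOD 11-2 check character, so a typo in the spreadsheet can often be caught mechanically. This is a minimal sketch, not part of add-orcid-identifiers-csv.py; the function names and the "Example, Author" row are hypothetical, and it assumes the `Name: 0000-0000-0000-0000` value format used in the CSV.

```python
import csv
import io
import re

ORCID_RE = re.compile(r"^(\d{4})-(\d{4})-(\d{4})-(\d{3}[\dX])$")


def orcid_check_character(base15):
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the 0000-0000-0000-000X layout and the final check character."""
    match = ORCID_RE.match(orcid)
    if match is None:
        return False
    digits = "".join(match.groups())
    return orcid_check_character(digits[:15]) == digits[15]


def invalid_rows(csv_text):
    """Return author names whose cg.creator.id fails validation."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        orcid = row["cg.creator.id"].rsplit(": ", 1)[-1]
        if not is_valid_orcid(orcid):
            bad.append(row["dc.contributor.author"])
    return bad


# Second row is a made-up entry with a deliberately wrong check digit
sample = '''dc.contributor.author,cg.creator.id
"Wyckhuys, Kris",Kris Wyckhuys: 0000-0003-0922-488X
"Example, Author",Example Author: 0000-0002-1825-0098
'''

print(invalid_rows(sample))
```

A passing checksum only shows the identifier is internally consistent; it does not confirm that it belongs to the named author.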
+
+- I ran the script on DSpace Test and CGSpace and tagged a total of 986 ORCID identifiers
+- Looking at the list of author affiliations from Peter one last time
+- I notice that I should add the Unicode character U+00B4 (´, the acute accent) to my list of invalid characters to look for in Open Refine, so the latest version of the GREL expression is:
+
+```
+or(
+  isNotNull(value.match(/.*\uFFFD.*/)),
+  isNotNull(value.match(/.*\u00A0.*/)),
+  isNotNull(value.match(/.*\u200A.*/)),
+  isNotNull(value.match(/.*\u2019.*/)),
+  isNotNull(value.match(/.*\u00b4.*/))
+)
+```
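The same check can be approximated outside Open Refine. The sketch below uses a single character class covering the five code points from the GREL expression above; the function name `has_suspicious_char` is hypothetical and the sample strings are illustrative.

```python
import re

# Same code points as the GREL expression:
# U+FFFD replacement character, U+00A0 no-break space, U+200A hair space,
# U+2019 right single quotation mark, U+00B4 acute accent
SUSPICIOUS = re.compile("[\uFFFD\u00A0\u200A\u2019\u00B4]")


def has_suspicious_char(value):
    """Return True if the value contains any of the flagged characters."""
    return SUSPICIOUS.search(value) is not None


samples = [
    "De\u00b4veloppement",   # stray acute accent instead of an accented letter
    "Développement",         # correctly encoded, should not be flagged
    "Investigacio\u00b4n",
    "CIAT\u00a0Colombia",    # made-up value containing a no-break space
]

for value in samples:
    print(repr(value), has_suspicious_char(value))
```

Using one character class keeps the list of code points in a single place, which makes it easier to extend when another bad character turns up.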
+
+- A standalone acute accent (´, U+00B4) is indicative of encoding issues in French, Italian, and Spanish names, for example: De´veloppement and Investigacio´n
+- I will run the following on DSpace Test and CGSpace:
+
+```
+$ ./fix-metadata-values.py -i /tmp/2018-08-15-Correct-1083-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t correct -m 211
+$ ./delete-metadata-values.py -i /tmp/2018-08-15-Remove-11-Affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
+```
+
+- Then force an update of the Discovery index on DSpace Test:
+
+```
+$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
+$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
+
+real    72m12.570s
+user    6m45.305s
+sys     2m2.461s
+```
+
+- And then on CGSpace:
+
+```
+$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
+$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
+
+real    79m44.392s
+user    8m50.730s
+sys     2m20.248s
+```
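In both runs the `real` time is far larger than `user` plus `sys`. As a small worked example, the sketch below converts the `time` output to seconds and computes the share of wall-clock time that the measured process spent on CPU. The helper names are hypothetical.

```python
import re


def to_seconds(stamp):
    """Convert a bash `time` value such as '72m12.570s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_fraction(real, user, system):
    """Return (user + sys) / real as a fraction of wall-clock time."""
    return (to_seconds(user) + to_seconds(system)) / to_seconds(real)


# Values copied from the two index-discovery runs above
runs = {
    "DSpace Test": ("72m12.570s", "6m45.305s", "2m2.461s"),
    "CGSpace": ("79m44.392s", "8m50.730s", "2m20.248s"),
}

for name, (real, user, system) in runs.items():
    share = cpu_fraction(real, user, system)
    print(f"{name}: {share:.1%} of wall-clock time on CPU")
```

The remainder is time this process did not spend on CPU. That could be waiting on the database, Solr, or disk, or being descheduled at idle priority; the numbers alone do not say which.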
+
+- Run system updates on DSpace Test and reboot the server
+- In unrelated news, I see some newish Russian bot making a few thousand requests per day and not re-using its XMLUI session:
+
+```
+# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep '19/Aug/2018' | grep -c 5.9.6.51
+1553
+# grep -c -E 'session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2018-08-19 
+1724
+```
+
+- I don't even know how it's possible for the bot to use MORE sessions than total requests...
+- The user agent is:
+
+```
+Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)
+```
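On the session count above: `grep -c` counts matching log lines, not distinct session IDs, so if DSpace writes more than one log line for a request the figure can exceed the number of requests. That is one possible explanation, not a confirmed one. A minimal sketch for counting distinct sessions, assuming the log format matched by the grep pattern above; the sample lines are made up.

```python
import re

# Same pattern as the grep above, with dots escaped and the ID captured
SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=5\.9\.6\.51")


def session_stats(lines):
    """Return (matching line count, distinct session ID count)."""
    matching = 0
    sessions = set()
    for line in lines:
        match = SESSION_RE.search(line)
        if match:
            matching += 1
            sessions.add(match.group(1))
    return matching, len(sessions)


# Made-up log lines: two lines share one session, one line has another
sample = [
    "INFO ... session_id=" + "A" * 32 + ":ip_addr=5.9.6.51 ...",
    "INFO ... session_id=" + "A" * 32 + ":ip_addr=5.9.6.51 ...",
    "INFO ... session_id=" + "B" * 32 + ":ip_addr=5.9.6.51 ...",
    "INFO ... session_id=" + "C" * 32 + ":ip_addr=10.0.0.1 ...",
]

print(session_stats(sample))
```

On the command line, the rough equivalent would be `grep -o -E` with the same pattern piped through `sort -u | wc -l`.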
+
+- So I'm thinking we should add "crawl" to the Tomcat Crawler Session Manager valve, as we already have "bot" that catches Googlebot, Bingbot, etc.
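A quick way to sanity check that change is to test the user agent against a candidate pattern. The sketch below assumes the valve matches its `crawlerUserAgents` regular expression against the whole User-Agent string; both patterns are assumptions about what the configuration might contain, not copies of the real server config, and `is_crawler` is a hypothetical helper. Python's `re.fullmatch` stands in for Java's whole-string matching here, which should behave the same for patterns this simple.

```python
import re

USER_AGENT = "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"

# Assumed current pattern: catches Googlebot, Bingbot, etc.
current = re.compile(r".*[bB]ot.*")
# Assumed proposed pattern: also catches user agents containing "crawl"
proposed = re.compile(r".*[bB]ot.*|.*[cC]rawl.*")


def is_crawler(pattern, user_agent):
    """Return True if the whole user agent string matches the pattern."""
    return pattern.fullmatch(user_agent) is not None


print("current: ", is_crawler(current, USER_AGENT))
print("proposed:", is_crawler(proposed, USER_AGENT))
```

The MegaIndex user agent contains "crawler" but neither "bot" nor "Bot", which is why the extra alternative is needed.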
+
 <!-- vim: set sw=2 ts=2: -->