February, 2018

Thu Feb 01, 2018 by Alan Orth in Notes

2018-02-01

Peter gave feedback on the dc.rights proof of concept that I had sent him last week
We don’t need to distinguish between internal and external works, so that makes it just a simple list
Yesterday I figured out how to monitor DSpace sessions using JMX
I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01

DSpace Sessions

Run all system updates and reboot DSpace Test
Wow, I packaged up the jmx_dspace_sessions stuff in the Ansible infrastructure scripts and deployed it on CGSpace and it totally works:
```
# munin-run jmx_dspace_sessions
v_.value 223
v_jspui.value 1
v_oai.value 0
```

2018-02-03

Bram from Atmire responded about the high load caused by the Solr updater script and said it will be fixed with the updates to DSpace 5.8 compatibility: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=566
We will close that ticket for now and wait for the 5.8 stuff: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560
I finally took a look at the second round of cleanups Peter had sent me for author affiliations in mid January

After trimming whitespace and quickly scanning for encoding errors I applied them on CGSpace:

$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'

Then I started a full Discovery reindex:

$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b

real    96m39.823s
user    14m10.975s
sys     2m29.088s

Generate a new list of affiliations for Peter to sort through:

dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 3723

Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in December:

# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018"
3126109

real    0m23.839s
user    0m27.225s
sys     0m1.905s

2018-02-05

Toying with correcting authors with trailing spaces via PostgreSQL:

dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
UPDATE 20

I tried the TRIM(TRAILING from text_value) function and it said it changed 20 items but the spaces didn’t go away
This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.

Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:

dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
COPY 55630

2018-02-06

UptimeRobot says CGSpace is down this morning around 9:15
I see 308 PostgreSQL connections in pg_stat_activity

The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:

# date
Tue Feb  6 09:30:32 UTC 2018
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
  2 223.185.41.40
  2 66.249.64.14
  2 77.246.52.40
  4 157.55.39.82
  4 193.205.105.8
  5 207.46.13.63
  5 207.46.13.64
  6 154.68.16.34
  7 207.46.13.66
1548 50.116.102.77
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
 77 213.55.99.121
 86 66.249.64.14
101 104.196.152.243
103 207.46.13.64
118 157.55.39.82
133 207.46.13.66
136 207.46.13.63
156 68.180.228.157
295 197.210.168.174
752 144.76.64.79

I did notice in /var/log/tomcat7/catalina.out that Atmire’s update thing was running though
So I restarted Tomcat and now everything is fine
Next time I see that many database connections I need to save the output so I can analyze it later
I’m going to re-schedule the taskUpdateSolrStatsMetadata task as Bram detailed in ticket 566 to see if it makes CGSpace stop crashing every morning
If I move the task from 3AM to 3PM, deally CGSpace will stop crashing in the morning, or start crashing ~12 hours later
Eventually Atmire has said that there will be a fix for this high load caused by their script, but it will come with the 5.8 compatability they are already working on
I re-deployed CGSpace with the new task time of 3PM, ran all system updates, and restarted the server
Also, I changed the name of the DSpace fallback pool on DSpace Test and CGSpace to be called ‘dspaceCli’ so that I can distinguish it in pg_stat_activity
I implemented some changes to the pooling in the Ansible infrastructure scripts so that each DSpace web application can use its own pool (web, api, and solr)
Each pool uses its own name and hopefully this should help me figure out which one is using too many connections next time CGSpace goes down
Also, this will mean that when a search bot comes along and hammers the XMLUI, the REST and OAI applications will be fine
I’m not actually sure if the Solr web application uses the database though, so I’ll have to check later and remove it if necessary
I deployed the changes on DSpace Test only for now, so I will monitor and make them on CGSpace later this week

2018-02-07

Abenet wrote to ask a question about the ORCiD lookup not working for one CIAT user on CGSpace
I tried on DSpace Test and indeed the lookup just doesn’t work!
The ORCiD code in DSpace appears to be using http://pub.orcid.org/, but when I go there in the browser it redirects me to https://pub.orcid.org/v2.0/
According to the announcement the v1 API was moved from http://pub.orcid.org/ to https://pub.orcid.org/v1.2 until March 1st when it will be discontinued for good
But the old URL is hard coded in DSpace and it doesn’t work anyways, because it currently redirects you to https://pub.orcid.org/v2.0/v1.2
So I guess we have to disable that shit once and for all and switch to a controlled vocabulary
CGSpace crashed again, this time around Wed Feb 7 11:20:28 UTC 2018

I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on and the connections were very high at first but reduced on their own:

$ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
/tmp/pg_stat_activity1.txt:300
/tmp/pg_stat_activity2.txt:272
/tmp/pg_stat_activity3.txt:168
/tmp/pg_stat_activity4.txt:5
/tmp/pg_stat_activity5.txt:6

Interestingly, all of those 751 connections were idle!

$ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
751

Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps
Looking the Munin graphs, I can see that there were almost double the normal number of DSpace sessions at the time of the crash (and also yesterday!):

DSpace Sessions

Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:

$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1828

CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)

What’s interesting is that the DSpace log says the connections are all busy:

org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].

… but in PostgreSQL I see them idle or idle in transaction:

$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
187

What the fuck, does DSpace think all connections are busy?
I suspect these are issues with abandoned connections or maybe a leak, so I’m going to try adding the removeAbandoned='true' parameter which is apparently off by default
I will try testOnReturn='true' too, just to add more validation, because I’m fucking grasping at straws

Also, WTF, there was a heap space error randomly in catalina.out:

Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space

I’m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!

Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:

$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
 34 ip_addr=46.229.168.67
 34 ip_addr=46.229.168.73
 37 ip_addr=46.229.168.76
 40 ip_addr=34.232.65.41
 41 ip_addr=46.229.168.71
 44 ip_addr=197.210.168.174
 55 ip_addr=181.137.2.214
 55 ip_addr=213.55.99.121
 58 ip_addr=46.229.168.65
 64 ip_addr=66.249.66.91
 67 ip_addr=66.249.66.90
 71 ip_addr=207.46.13.54
 78 ip_addr=130.82.1.40
104 ip_addr=40.77.167.36
151 ip_addr=68.180.228.157
174 ip_addr=207.46.13.135
194 ip_addr=54.83.138.123
198 ip_addr=40.77.167.62
210 ip_addr=207.46.13.71
214 ip_addr=104.196.152.243

These IPs made thousands of sessions today:

$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
530
$ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
859
$ grep 40.77.167.62 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
610
$ grep 54.83.138.123 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
8
$ grep 207.46.13.135 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
826
$ grep 68.180.228.157 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
727
$ grep 40.77.167.36 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
181
$ grep 130.82.1.40 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
24
$ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
166
$ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
992

Let’s investigate who these IPs belong to:
- 104.196.152.243 is CIAT, which is already marked as a bot via nginx!
- 207.46.13.71 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
- 40.77.167.62 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
- 207.46.13.135 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
- 68.180.228.157 is Yahoo, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
- 40.77.167.36 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
- 207.46.13.54 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
- 46.229.168.x is Semrush, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
Nice, so these are all known bots that are already crammed into one session by Tomcat’s Crawler Session Manager Valve.
What in the actual fuck, why is our load doing this? It’s gotta be something fucked up with the database pool being “busy” but everything is fucking idle
One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:
```
BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
```

This one makes two thousand requests per day or so recently:

# grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
/var/log/nginx/access.log:1925
/var/log/nginx/access.log.1:2029

And they have 30 IPs, so fuck that shit I’m going to add them to the Tomcat Crawler Session Manager Valve nowwww
Lots of discussions on the dspace-tech mailing list over the last few years about leaky transactions being a known problem with DSpace
Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker

This is how the connections looked when it crashed this afternoon:

$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
  5 dspaceApi
290 dspaceWeb

This is how it is right now:

$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
  5 dspaceApi
  5 dspaceWeb

So is this just some fucked up XMLUI database leaking?
I notice there is an issue (that I’ve probably noticed before) on the Jira tracker about this that was fixed in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551
I seriously doubt this leaking shit is fixed for sure, but I’m gonna cherry-pick all those commits and try them on DSpace Test and probably even CGSpace because I’m fed up with this shit
I cherry-picked all the commits for DS-3551 but it won’t build on our current DSpace 5.5!
I sent a message to the dspace-tech mailing list asking why DSpace thinks these connections are busy when PostgreSQL says they are idle

2018-02-10

I tried to disable ORCID lookups but keep the existing authorities
This item has an ORCID for Ralf Kiese: http://localhost:8080/handle/10568/89897
Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn’t show up on the item
Leave all settings but change choices.presentation to lookup and ORCID badge is there and item submission uses LC Name Authority and it breaks with this error:
```
Field dc_contributor_author has choice presentation of type "select", it may NOT be authority-controlled.
```

If I change choices.presentation to suggest it give this error:

xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError

So I don’t think we can disable the ORCID lookup function and keep the ORCID badges

2018-02-11

Magdalena from CCAFS emailed to ask why one of their items has such a weird thumbnail: ¹⁰⁵⁶⁸⁄₉₀₇₃₅

Weird thumbnail

I downloaded the PDF and manually generated a thumbnail with ImageMagick and it looked better:

$ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg

Manual thumbnail

Peter sent me corrected author names last week but the file encoding is messed up:

$ isutf8 authors-2018-02-05.csv
authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between E1 and EC, expecting the 2nd byte between 80 and BF.

The isutf8 program comes from moreutils
Line 100 contains: Galiè, Alessandra
In other news, psycopg2 is splitting their package in pip, so to install the binary wheel distribution you need to use pip install psycopg2-binary
See: http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/
I updated my fix-metadata-values.py and delete-metadata-values.py scripts on the scripts page: https://github.com/ilri/DSpace/wiki/Scripts

I ran the 342 author corrections (after trimming whitespace and excluding those with || and other syntax errors) on CGSpace:

$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'

Then I ran a full Discovery re-indexing:

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b

That reminds me that Bizu had asked me to fix some of Alan Duncan’s names in December
I see he actually has some variations with “Duncan, Alan J.”: https://cgspace.cgiar.org/discover?filtertype_1=author&filter_relational_operator_1=contains&filter_1=Duncan%2C+Alan&submit_apply_filter=&query=

I will just update those for her too and then restart the indexing:

dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
text_value    |              authority               | confidence 
-----------------+--------------------------------------+------------
Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |        600
Duncan, Alan J. | 62298c84-4d9d-4b83-a932-4a9dd4046db7 |         -1
Duncan, Alan J. |                                      |         -1
Duncan, Alan    | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
Duncan, Alan J. | cd0e03bf-92c3-475f-9589-60c5b042ea60 |         -1
Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |         -1
Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |         -1
Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
(8 rows)

dspace=# begin;
dspace=# update metadatavalue set text_value='Duncan, Alan', authority='a6486522-b08a-4f7a-84f9-3a73ce56034d', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Duncan, Alan%';
UPDATE 216
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
text_value  |              authority               | confidence 
--------------+--------------------------------------+------------
Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
(1 row)
dspace=# commit;

Run all system updates on DSpace Test (linode02) and reboot it
I wrote a Python script (resolve-orcids-from-solr.py) using SolrClient to parse the Solr authority cache for ORCID IDs
We currently have 1562 authority records with ORCID IDs, and 624 unique IDs
We can use this to build a controlled vocabulary of ORCID IDs for new item submissions
I don’t know how to add ORCID IDs to existing items yet… some more querying of PostgreSQL for authority values perhaps?
I added the script to the ILRI DSpace wiki on GitHub

2018-02-12

Follow up with Atmire on the DSpace 5.8 Compatibility ticket to ask again if they want me to send them a DSpace 5.8 branch to work on
Abenet asked if there was a way to get the number of submissions she and Bizuwork did
I said that the Atmire Workflow Statistics module was supposed to be able to do that
We had tried it in June, 2017 and found that it didn’t work
Atmire sent us some fixes but they didn’t work either
I just tried the branch with the fixes again and it indeed does not work:

Atmire Workflow Statistics No Data Available

I see that in April, 2017 I just used a SQL query to get a user’s submissions by checking the dc.description.provenance field

So for Abenet, I can check her submissions in December, 2017 with:

dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*yabowork.*2017-12.*';

I emailed Peter to ask whether we can move DSpace Test to a new Linode server and attach 300 GB of disk space to it
This would be using Linode’s new block storage volumes
I think our current $40/month Linode has enough CPU and memory capacity, but we need more disk space
I think I’d probably just attach the block storage volume and mount it on /home/dspace
Ask Peter about dc.rights on DSpace Test again, if he likes it then we should move it to CGSpace soon

2018-02-13

Peter said he was getting a “socket closed” error on CGSpace

I looked in the dspace.log.2018-02-13 and saw one recent one:

2018-02-13 12:50:13,656 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error - 
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
...
Caused by: java.net.SocketException: Socket closed

Could be because of the removeAbandoned="true" that I enabled in the JDBC connection pool last week?

$ grep -c "java.net.SocketException: Socket closed" dspace.log.2018-02-*
dspace.log.2018-02-01:0
dspace.log.2018-02-02:0
dspace.log.2018-02-03:0
dspace.log.2018-02-04:0
dspace.log.2018-02-05:0
dspace.log.2018-02-06:0
dspace.log.2018-02-07:0
dspace.log.2018-02-08:1
dspace.log.2018-02-09:6
dspace.log.2018-02-10:0
dspace.log.2018-02-11:3
dspace.log.2018-02-12:0
dspace.log.2018-02-13:4

I apparently added that on 2018-02-07 so it could be, as I don’t see any of those socket closed errors in 2018-01’s logs!
I will increase the removeAbandonedTimeout from its default of 60 to 90 and enable logAbandoned

Peter hit this issue one more time, and this is apparently what Tomcat’s catalina.out log says when an abandoned connection is removed:

Feb 13, 2018 2:05:42 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
WARNING: Connection has been abandoned PooledConnection[org.postgresql.jdbc.PgConnection@22e107be]:java.lang.Exception

2018-02-14

Skype with Peter and the Addis team to discuss what we need to do for the ORCIDs in the immediate future
We said we’d start with a controlled vocabulary for cg.creator.id on the DSpace Test submission form, where we store the author name and the ORCID in some format like: Alan S. Orth (0000-0002-1735-7458)
Eventually we need to find a way to print the author names with links to their ORCID profiles
Abenet will send an email to the partners to give us ORCID IDs for their authors and to stress that they update their name format on ORCID.org if they want it in a special way
I sent the Codeobia guys a question to ask how they prefer that we store the IDs, ie one of:
- Alan Orth - 0000-0002-1735-7458
- Alan Orth: 0000-0002-1735-7458
- Alan S. Orth (0000-0002-1735-7458)
Atmire responded on the DSpace 5.8 compatability ticket and said they will let me know if they they want me to give them a clean 5.8 branch

I formatted my list of ORCID IDs as a controlled vocabulary, sorted alphabetically, then ran through XML tidy:

$ sort cgspace-orcids.txt > dspace/config/controlled-vocabularies/cg-creator-id.xml
$ add XML formatting...
$ tidy -xml -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml

It seems the tidy fucks up accents, for example it turns Adriana Tofiño (0000-0001-7115-7169) into Adriana TofiÃ±o (0000-0001-7115-7169)

We need to force UTF-8:

$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml

This preserves special accent characters
I tested the display and store of these in the XMLUI and PostgreSQL and it looks good
Sisay exported all ILRI, CIAT, etc authors from ORCID and sent a list of 600+

Peter combined it with mine and we have 1204 unique ORCIDs!

$ grep -coE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv
1204
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' CGcenter_ORCID_ID_combined.csv | sort | uniq | wc -l
1204

Also, save that regex for the future because it will be very useful!

CIAT sent a list of their authors’ ORCIDs and combined with ours there are now 1227:

$ cat CGcenter_ORCID_ID_combined.csv ciat-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1227

There are some formatting issues with names in Peter’s list, so I should remember to re-generate the list of names from ORCID’s API once we’re done

The dspace cleanup -v currently fails on CGSpace with the following:

- Deleting bitstream record from database (ID: 149473)
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(149473) is still referenced from table "bundle".

The solution is to update the bitstream table, as I’ve discovered several other times in 2016 and 2017:

$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (149473);'
UPDATE 1

Then the cleanup process will continue for awhile and hit another foreign key conflict, and eventually it will complete after you manually resolve them all

2018-02-15

Altmetric seems to be indexing DSpace Test for some reason:
- See this item on DSpace Test: https://dspacetest.cgiar.org/handle/10568/78450
- See the corresponding page on Altmetric: https://www.altmetric.com/details/handle/10568/78450
And this item doesn’t even exist on CGSpace!
Start working on XMLUI item display code for ORCIDs
Send emails to Macaroni Bros and Usman at CIFOR about ORCID metadata
CGSpace crashed while I was driving to Tel Aviv, and was down for four hours!
I only looked quickly in the logs but saw a bunch of database errors

PostgreSQL connections are currently:

$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | uniq -c
  2 dspaceApi
  1 dspaceWeb
  3 dspaceApi

I see shitloads of memory errors in Tomcat’s logs:

# grep -c "Java heap space" /var/log/tomcat7/catalina.out
56

And shit tons of database connections abandoned:

# grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
612

I have no fucking idea why it crashed

The XMLUI activity looks like:

# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "15/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
715 63.143.42.244
746 213.55.99.121
886 68.180.228.157
967 66.249.66.90
1013 216.244.66.245
1177 197.210.168.174
1419 207.46.13.159
1512 207.46.13.59
1554 207.46.13.157
2018 104.196.152.243

2018-02-17

Peter pointed out that we had an incorrect sponsor in the controlled vocabulary: U.S. Agency for International Development → United States Agency for International Development
I made a pull request to fix it ((#354)[https://github.com/ilri/DSpace/pull/354])

I should remember to update existing values in PostgreSQL too:

dspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
UPDATE 2

2018-02-18

ICARDA’s Mohamed Salem pointed out that it would be easiest to format the cg.creator.id field like “Alan Orth: 0000-0002-1735-7458” because no name will have a “:” so it’s easier to split on
I finally figured out a few ways to extract ORCID iDs from metadata using XSLT and display them in the XMLUI:

Displaying ORCID iDs in XMLUI

The one on the bottom left uses a similar format to our author display, and the one in the middle uses the format recommended by ORCID’s branding guidelines
Also, I realized that the Academicons font icon set we’re using includes an ORCID badge so we don’t need to use the PNG image anymore
Run system updates on DSpace Test (linode02) and reboot the server

Looking back at the system errors on 2018-02-15, I wonder what the fuck caused this:

$ wc -l dspace.log.2018-02-1{0..8}
383483 dspace.log.2018-02-10
275022 dspace.log.2018-02-11
249557 dspace.log.2018-02-12
280142 dspace.log.2018-02-13
615119 dspace.log.2018-02-14
4388259 dspace.log.2018-02-15
243496 dspace.log.2018-02-16
209186 dspace.log.2018-02-17
167432 dspace.log.2018-02-18

From an average of a few hundred thousand to over four million lines in DSpace log?

Using grep’s -B1 I can see the line before the heap space error, which has the time, ie:

2018-02-15 16:02:12,748 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space

So these errors happened at hours 16, 18, 19, and 20

Let’s see what was going on in nginx then:

# zcat --force /var/log/nginx/*.log.{3,4}.gz | wc -l
168571
# zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | wc -l
8188

Only 8,000 requests during those four hours, out of 170,000 the whole day!

And the usage of XMLUI, REST, and OAI looks SUPER boring:

# zcat --force /var/log/nginx/*.log.{3,4}.gz | grep -E "15/Feb/2018:(16|18|19|20)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
111 95.108.181.88
158 45.5.184.221
201 104.196.152.243
205 68.180.228.157
236 40.77.167.131 
253 207.46.13.159
293 207.46.13.59
296 63.143.42.242
303 207.46.13.157
416 63.143.42.244

63.143.42.244 is Uptime Robot, and 207.46.x.x is Bing!
The DSpace sessions, PostgreSQL connections, and JVM memory all look normal
I see a lot of AccessShareLock on February 15th…?

PostgreSQL locks

I have no idea what caused this crash
In other news, I adjusted the ORCID badge size on the XMLUI item display and sent it back to Peter for feedback

2018-02-19

Combined list of CGIAR author ORCID iDs is up to 1,500:

$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI-csv.csv CGcenter_ORCID_ID_combined.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l  
1571

I updated my resolve-orcids-from-solr.py script to be able to resolve ORCID identifiers from a text file so I renamed it to resolve-orcids.py

Also, I updated it so it uses several new options:

$ ./resolve-orcids.py -i input.txt -o output.txt
$ cat output.txt 
Ali Ramadhan: 0000-0001-5019-1368
Ahmad Maryudi: 0000-0001-5051-7217

I was running this on the new list of 1571 and found an error:

Looking up the name associated with ORCID iD: 0000-0001-9634-1958
Traceback (most recent call last):
File "./resolve-orcids.py", line 111, in <module>
read_identifiers_from_file()
File "./resolve-orcids.py", line 37, in read_identifiers_from_file
resolve_orcid_identifiers(orcids)
File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
family_name = data['name']['family-name']['value']
TypeError: 'NoneType' object is not subscriptable

According to ORCID that identifier’s family-name is null so that sucks
I fixed the script so that it checks if the family name is null

Now another:

Looking up the name associated with ORCID iD: 0000-0002-1300-3636
Traceback (most recent call last):
File "./resolve-orcids.py", line 117, in <module>
read_identifiers_from_file()
File "./resolve-orcids.py", line 37, in read_identifiers_from_file
resolve_orcid_identifiers(orcids)
File "./resolve-orcids.py", line 65, in resolve_orcid_identifiers
if data['name']['given-names']:
TypeError: 'NoneType' object is not subscriptable

According to ORCID that identifier’s entire name block is null!

2018-02-20

Send Abenet an email about getting a purchase requisition for a new DSpace Test server on Linode
Discuss some of the issues with null values and poor-quality names in some ORCID identifiers with Abenet and I think we’ll now only use ORCID iDs that have been sent to use from partners, not those extracted via keyword searches on orcid.org

This should be the version we use (the existing controlled vocabulary generated from CGSpace’s Solr authority core plus the IDs sent to us so far by partners):

$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml ORCID_ID_CIAT_IITA_IWMI.csv | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > 2018-02-20-combined.txt

I updated the resolve-orcids.py to use the “credit-name” if it exists in a profile, falling back to “given-names” + “family-name”
Also, I added color coded output to the debug messages and added a “quiet” mode that supresses the normal behavior of printing results to the screen

I’m using this as the test input for resolve-orcids.py:

$ cat orcid-test-values.txt 
# valid identifier with 'given-names' and 'family-name'
0000-0001-5019-1368

# duplicate identifier
0000-0001-5019-1368 

# invalid identifier
0000-0001-9634-19580

# has a 'credit-name' value we should prefer
0000-0002-1735-7458

# has a blank 'credit-name' value
0000-0001-5199-5528

# has a null 'name' object
0000-0002-1300-3636

# has a null 'family-name' value
0000-0001-9634-1958

# missing ORCID identifier
0000-0003-4221-3214

Help debug issues with Altmetric badges again, it looks like Altmetric is all kinds of fucked up
Last week I pointed out that they were tracking Handles from our test server
Now, their API is responding with content that is marked as content-type JSON but is not valid JSON
For example, this item: https://cgspace.cgiar.org/handle/10568/83320
The Altmetric JavaScript builds the following API call: https://api.altmetric.com/v1/handle/10568/83320?callback=_altmetric.embed_callback&domain=cgspace.cgiar.org&key=3c130976ca2b8f2e88f8377633751ba1&cache_until=13-20
The response body is not JSON
To contrast, the following bare API call without query parameters is valid JSON: https://api.altmetric.com/v1/handle/10568/83320
I told them that it’s their JavaScript that is fucked up
Remove CPWF project number and Humidtropics subject from submission form (#3)
I accidentally merged it into my own repository, oops

2018-02-22

CGSpace was apparently down today around 13:00 server time and I didn’t get any emails on my phone, but saw them later on the computer
It looks like Sisay restarted Tomcat because I was offline

There was absolutely nothing interesting going on at 13:00 on the server, WTF?

# cat /var/log/nginx/*.log | grep -E "22/Feb/2018:13" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
 55 192.99.39.235
 60 207.46.13.26
 62 40.77.167.38
 65 207.46.13.23
103 41.57.108.208
120 104.196.152.243
133 104.154.216.0
145 68.180.228.117
159 54.92.197.82
231 5.9.6.51

Otherwise there was pretty normal traffic the rest of the day:

# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "22/Feb/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
839 216.244.66.245
1074 68.180.228.117
1114 157.55.39.100
1162 207.46.13.26
1178 207.46.13.23
2749 104.196.152.243
3109 50.116.102.77
4199 70.32.83.92
5208 5.9.6.51
8686 45.5.184.196

So I don’t see any definite cause for this crash, I see a shit ton of abandoned PostgreSQL connections today around 1PM!

# grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' /var/log/tomcat7/catalina.out
729
# grep 'Feb 22, 2018 1' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon' 
519

I think the removeAbandonedTimeout might still be too low (I increased it from 60 to 90 seconds last week)
Abandoned connections is not a cause but a symptom, though perhaps something more like a few minutes is better?

Also, while looking at the logs I see some new bot:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.4.2661.102 Safari/537.36; 360Spider

It seems to re-use its user agent but makes tons of useless requests and I wonder if I should add “.spider.” to the Tomcat Crawler Session Manager valve?

2018-02-23

Atmire got back to us with a quote about their DSpace 5.8 upgrade

2018-02-25

A few days ago Abenet sent me the list of ORCID iDs from CCAFS

We currently have 988 unique identifiers:

$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l          
988

After adding the ones from CCAFS we now have 1004:

$ cat dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ccafs | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq | wc -l
1004

I will add them to DSpace Test but Abenet says she’s still waiting to set us ILRI’s list
I will tell her that we should proceed on sharing our work on DSpace Test with the partners this week anyways and we can update the list later

While regenerating the names for these ORCID identifiers I saw one that has a weird value for its names:

Looking up the names associated with ORCID iD: 0000-0002-2614-426X
Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X

I don’t know if the user accidentally entered this as their name or if that’s how ORCID behaves when the name is private?
I will remove that one from our list for now
Remove Dryland Systems subject from submission form because that CRP closed two years ago (#355)
Run all system updates on DSpace Test
Email ICT to ask how to proceed with the OCS proforma issue for the new DSpace Test server on Linode
Thinking about how to preserve ORCID identifiers attached to existing items in CGSpace

We have over 60,000 unique author + authority combinations on CGSpace:

dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
count 
-------
62464
(1 row)

I know from earlier this month that there are only 624 unique ORCID identifiers in the Solr authority core, so it’s way easier to just fetch the unique ORCID iDs from Solr and then go back to PostgreSQL and do the metadata mapping that way
The query in Solr would simply be orcid_id:*

Assuming I know that authority record with id:d7ef744b-bbd4-4171-b449-00e37e1b776f, then I could query PostgreSQL for all metadata records using that authority:

dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
metadata_value_id | resource_id | metadata_field_id |        text_value         | text_lang | place |              authority               | confidence | resource_type_id 
-------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
       2726830 |       77710 |                 3 | Rodríguez Chalarca, Jairo |           |     2 | d7ef744b-bbd4-4171-b449-00e37e1b776f |        600 |                2
(1 row)

Then I suppose I can use the resource_id to identify the item?
Actually, resource_id is the same id we use in CSV, so I could simply build something like this for a metadata import!
```
id,cg.creator.id
93848,Alan S. Orth: 0000-0002-1735-7458||Peter G. Ballantyne: 0000-0001-9346-2893
```
I just discovered that requests-cache can transparently cache HTTP requests

Running resolve-orcids.py with my test input takes 10.5 seconds the first time, and then 3.0 seconds the second time!

$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
Ali Ramadhan: 0000-0001-5019-1368
Alan S. Orth: 0000-0002-1735-7458
Ibrahim Mohammed: 0000-0001-5199-5528
Nor Azwadi: 0000-0001-9634-1958
./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.32s user 0.07s system 3% cpu 10.530 total
$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
Ali Ramadhan: 0000-0001-5019-1368
Alan S. Orth: 0000-0002-1735-7458
Ibrahim Mohammed: 0000-0001-5199-5528
Nor Azwadi: 0000-0001-9634-1958
./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.23s user 0.05s system 8% cpu 3.046 total

2018-02-26

Peter is having problems with “Socket closed” on his submissions page again
He says his personal account loads much faster than his CGIAR account, which could be because the CGIAR account has potentially thousands of submissions over the last few years
I don’t know why it would take so long, but this logic kinda makes sense
I think I should increase the removeAbandonedTimeout from 90 to something like 180 and continue observing
I also reduced the timeout for the API pool back to 60 because those interfaces are only used by bots

2018-02-27

Peter is still having problems with “Socket closed” on his submissions page
I have disabled removeAbandoned for now because that’s the only thing I changed in the last few weeks since he started having issues
I think the real line of logic to follow here is why the submissions page is so slow for him (presumably because of loading all his submissions?)
I need to see which SQL queries are run during that time

And only a few hours after I disabled the removeAbandoned thing CGSpace went down and lo and behold, there were 264 connections, most of which were idle:

$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
  5 dspaceApi
279 dspaceWeb
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
218

So I’m re-enabling the removeAbandoned setting

I grabbed a snapshot of the active connections in pg_stat_activity for all queries running longer than 2 minutes:

dspace=# \copy (SELECT now() - query_start as "runtime", application_name, usename, datname, waiting, state, query
FROM  pg_stat_activity
WHERE now() - query_start > '2 minutes'::interval
ORDER BY runtime DESC) to /tmp/2018-02-27-postgresql.txt
COPY 263

100 of these idle in transaction connections are the following query:

SELECT * FROM resourcepolicy WHERE resource_type_id= $1 AND resource_id= $2 AND action_id= $3

… but according to the pg_locks documentation I should have done this to correlate the locks with the activity:
```
SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;
```
Tom Desair from Atmire shared some extra JDBC pool parameters that might be useful on my thread on the dspace-tech mailing list:
- abandonWhenPercentageFull: Only start cleaning up abandoned connections if the pool is used for more than X %.
- jdbcInterceptors=‘ResetAbandonedTimer’: Make sure the “abondoned” timer is reset every time there is activity on a connection
I will try with abandonWhenPercentageFull='50'
Also there are some indexes proposed in DS-3636 that he urged me to try
Finally finished the orcid-authority-to-item.py script!
It successfully mapped 2600 ORCID identifiers to items in my tests
I will run it on DSpace Test

2018-02-28

CGSpace crashed today, the first HTTP 499 in nginx’s access.log was around 09:12

There’s nothing interesting going on in nginx’s logs around that time:

# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "28/Feb/2018:09:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
 65 197.210.168.174
 74 213.55.99.121
 74 66.249.66.90
 86 41.204.190.40
102 130.225.98.207
108 192.0.89.192
112 157.55.39.218
129 207.46.13.21
131 207.46.13.115
135 207.46.13.101

Looking in dspace.log-2018-02-28 I see this, though:

2018-02-28 09:19:29,692 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space

Memory issues seem to be common this month:

$ grep -c 'nested exception is java.lang.OutOfMemoryError: Java heap space' dspace.log.2018-02-* 
dspace.log.2018-02-01:0
dspace.log.2018-02-02:0
dspace.log.2018-02-03:0
dspace.log.2018-02-04:0
dspace.log.2018-02-05:0
dspace.log.2018-02-06:0
dspace.log.2018-02-07:0
dspace.log.2018-02-08:0
dspace.log.2018-02-09:0
dspace.log.2018-02-10:0
dspace.log.2018-02-11:0
dspace.log.2018-02-12:0
dspace.log.2018-02-13:0
dspace.log.2018-02-14:0
dspace.log.2018-02-15:10
dspace.log.2018-02-16:0
dspace.log.2018-02-17:0
dspace.log.2018-02-18:0
dspace.log.2018-02-19:0
dspace.log.2018-02-20:0
dspace.log.2018-02-21:0
dspace.log.2018-02-22:0
dspace.log.2018-02-23:0
dspace.log.2018-02-24:0
dspace.log.2018-02-25:0
dspace.log.2018-02-26:0
dspace.log.2018-02-27:6
dspace.log.2018-02-28:1

Top ten users by session during the first twenty minutes of 9AM:

$ grep -E '2018-02-28 09:(0|1)' dspace.log.2018-02-28 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq -c | sort -n | tail -n 10
 18 session_id=F2DFF64D3D707CD66AE3A873CEC80C49
 19 session_id=92E61C64A79F0812BE62A3882DA8F4BA
 21 session_id=57417F5CB2F9E3871E609CEEBF4E001F
 25 session_id=C3CD265AB7AA51A49606C57C069A902A
 26 session_id=E395549F081BA3D7A80F174AE6528750
 26 session_id=FEE38CF9760E787754E4480069F11CEC
 33 session_id=C45C2359AE5CD115FABE997179E35257
 38 session_id=1E9834E918A550C5CD480076BC1B73A4
 40 session_id=8100883DAD00666A655AE8EC571C95AE
 66 session_id=01D9932D6E85E90C2BA9FF5563A76D03

According to the log 01D9932D6E85E90C2BA9FF5563A76D03 is an ILRI editor, doing lots of updating and editing of items
8100883DAD00666A655AE8EC571C95AE is some Indian IP address
1E9834E918A550C5CD480076BC1B73A4 looks to be a session shared by the bots
So maybe it was due to the editor’s uploading of files, perhaps something that was too big or?
I think I’ll increase the JVM heap size on CGSpace from 6144m to 8192m because I’m sick of this random crashing shit and the server has memory and I’d rather eliminate this so I can get back to solving PostgreSQL issues and doing other real work

Run the few corrections from earlier this month for sponsor on CGSpace:

cgspace=# update metadatavalue set text_value='United States Agency for International Development' where resource_type_id=2 and metadata_field_id=29 and text_value like '%U.S. Agency for International Development%';
UPDATE 3

I finally got a CGIAR account so I logged into CGSpace with it and tried to delete my old unfinished submissions (22 of them)

Eventually it succeeded, but it took about five minutes and I noticed LOTS of locks happening with this query:

dspace=# \copy (SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid) to /tmp/locks-aorth.txt;

I took a few snapshots during the process and noticed 500, 800, and even 2000 locks at certain times during the process
Afterwards I looked a few times and saw only 150 or 200 locks
On the test server, with the PostgreSQL indexes from DS-3636 applied, it finished instantly
Run system updates on DSpace Test and reboot the server