CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

February, 2018

2018-02-01

  • Peter gave feedback on the dc.rights proof of concept that I had sent him last week
  • We don’t need to distinguish between internal and external works, so that makes it just a simple list
  • Yesterday I figured out how to monitor DSpace sessions using JMX
  • I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-plugins-java package and used the stuff I discovered about JMX in 2018-01

DSpace Sessions

  • Run all system updates and reboot DSpace Test
  • Wow, I packaged up the jmx_dspace_sessions stuff in the Ansible infrastructure scripts and deployed it on CGSpace and it totally works:
# munin-run jmx_dspace_sessions
v_.value 223
v_jspui.value 1
v_oai.value 0

2018-02-03

$ ./delete-metadata-values.py -i /tmp/2018-02-03-Affiliations-12-deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
$ ./fix-metadata-values.py -i /tmp/2018-02-03-Affiliations-1116-corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
  • Then I started a full Discovery reindex:
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b

real    96m39.823s
user    14m10.975s
sys     2m29.088s
  • Generate a new list of affiliations for Peter to sort through:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 3723
  • Oh, and it looks like we processed over 3.1 million requests in January, up from 2.9 million in December:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2018"
3126109

real    0m23.839s
user    0m27.225s
sys     0m1.905s

2018-02-05

  • Toying with correcting authors with trailing spaces via PostgreSQL:
dspace=# update metadatavalue set text_value=REGEXP_REPLACE(text_value, '\s+$' , '') where resource_type_id=2 and metadata_field_id=3 and text_value ~ '^.*?\s+$';
UPDATE 20
  • I tried the TRIM(TRAILING from text_value) function and it said it changed 20 items but the spaces didn’t go away
  • This is on a fresh import of the CGSpace database, but when I tried to apply it on CGSpace there were no changes detected. Weird.
  • Anyways, Peter wants a new list of authors to clean up, so I exported another CSV:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors-2018-02-05.csv with csv;
COPY 55630

2018-02-06

  • UptimeRobot says CGSpace is down this morning around 9:15
  • I see 308 PostgreSQL connections in pg_stat_activity
  • The usage otherwise seemed low for REST/OAI as well as XMLUI in the last hour:
# date
Tue Feb  6 09:30:32 UTC 2018
# cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      2 223.185.41.40
      2 66.249.64.14
      2 77.246.52.40
      4 157.55.39.82
      4 193.205.105.8
      5 207.46.13.63
      5 207.46.13.64
      6 154.68.16.34
      7 207.46.13.66
   1548 50.116.102.77
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 /var/log/nginx/error.log /var/log/nginx/error.log.1 | grep -E "6/Feb/2018:(08|09)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     77 213.55.99.121
     86 66.249.64.14
    101 104.196.152.243
    103 207.46.13.64
    118 157.55.39.82
    133 207.46.13.66
    136 207.46.13.63
    156 68.180.228.157
    295 197.210.168.174
    752 144.76.64.79
  • I did notice in /var/log/tomcat7/catalina.out that Atmire’s update thing was running though
  • So I restarted Tomcat and now everything is fine
  • Next time I see that many database connections I need to save the output so I can analyze it later
  • I’m going to re-schedule the taskUpdateSolrStatsMetadata task as Bram detailed in ticket 566 to see if it makes CGSpace stop crashing every morning
  • If I move the task from 3AM to 3PM, deally CGSpace will stop crashing in the morning, or start crashing ~12 hours later
  • Eventually Atmire has said that there will be a fix for this high load caused by their script, but it will come with the 5.8 compatability they are already working on
  • I re-deployed CGSpace with the new task time of 3PM, ran all system updates, and restarted the server
  • Also, I changed the name of the DSpace fallback pool on DSpace Test and CGSpace to be called ‘dspaceCli’ so that I can distinguish it in pg_stat_activity
  • I implemented some changes to the pooling in the Ansible infrastructure scripts so that each DSpace web application can use its own pool (web, api, and solr)
  • Each pool uses its own name and hopefully this should help me figure out which one is using too many connections next time CGSpace goes down
  • Also, this will mean that when a search bot comes along and hammers the XMLUI, the REST and OAI applications will be fine
  • I’m not actually sure if the Solr web application uses the database though, so I’ll have to check later and remove it if necessary
  • I deployed the changes on DSpace Test only for now, so I will monitor and make them on CGSpace later this week

2018-02-07

  • Abenet wrote to ask a question about the ORCiD lookup not working for one CIAT user on CGSpace
  • I tried on DSpace Test and indeed the lookup just doesn’t work!
  • The ORCiD code in DSpace appears to be using http://pub.orcid.org/, but when I go there in the browser it redirects me to https://pub.orcid.org/v2.0/
  • According to the announcement the v1 API was moved from http://pub.orcid.org/ to https://pub.orcid.org/v1.2 until March 1st when it will be discontinued for good
  • But the old URL is hard coded in DSpace and it doesn’t work anyways, because it currently redirects you to https://pub.orcid.org/v2.0/v1.2
  • So I guess we have to disable that shit once and for all and switch to a controlled vocabulary
  • CGSpace crashed again, this time around Wed Feb 7 11:20:28 UTC 2018
  • I took a few snapshots of the PostgreSQL activity at the time and as the minutes went on and the connections were very high at first but reduced on their own:
$ psql -c 'select * from pg_stat_activity' > /tmp/pg_stat_activity.txt
$ grep -c 'PostgreSQL JDBC' /tmp/pg_stat_activity*
/tmp/pg_stat_activity1.txt:300
/tmp/pg_stat_activity2.txt:272
/tmp/pg_stat_activity3.txt:168
/tmp/pg_stat_activity4.txt:5
/tmp/pg_stat_activity5.txt:6
  • Interestingly, all of those 751 connections were idle!
$ grep "PostgreSQL JDBC" /tmp/pg_stat_activity* | grep -c idle
751
  • Since I was restarting Tomcat anyways, I decided to deploy the changes to create two different pools for web and API apps
  • Looking the Munin graphs, I can see that there were almost double the normal number of DSpace sessions at the time of the crash (and also yesterday!):

DSpace Sessions

  • Indeed it seems like there were over 1800 sessions today around the hours of 10 and 11 AM:
$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
1828
  • CGSpace went down again a few hours later, and now the connections to the dspaceWeb pool are maxed at 250 (the new limit I imposed with the new separate pool scheme)
  • What’s interesting is that the DSpace log says the connections are all busy:
org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-bio-127.0.0.1-8443-exec-328] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
  • … but in PostgreSQL I see them idle or idle in transaction:
$ psql -c 'select * from pg_stat_activity' | grep -c dspaceWeb
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c idle
250
$ psql -c 'select * from pg_stat_activity' | grep dspaceWeb | grep -c "idle in transaction"
187
  • What the fuck, does DSpace think all connections are busy?
  • I suspect these are issues with abandoned connections or maybe a leak, so I’m going to try adding the removeAbandoned='true' parameter which is apparently off by default
  • I will try testOnReturn='true' too, just to add more validation, because I’m fucking grasping at straws
  • Also, WTF, there was a heap space error randomly in catalina.out:
Wed Feb 07 15:01:54 UTC 2018 | Query:containerItem:91917 AND type:2
Exception in thread "http-bio-127.0.0.1-8081-exec-58" java.lang.OutOfMemoryError: Java heap space
  • I’m trying to find a way to determine what was using all those Tomcat sessions, but parsing the DSpace log is hard because some IPs are IPv6, which contain colons!
  • Looking at the first crash this morning around 11, I see these IPv4 addresses making requests around 10 and 11AM:
$ grep -E '^2018-02-07 (10|11)' dspace.log.2018-02-07 | grep -o -E 'ip_addr=[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | sort -n | uniq -c | sort -n | tail -n 20
     34 ip_addr=46.229.168.67
     34 ip_addr=46.229.168.73
     37 ip_addr=46.229.168.76
     40 ip_addr=34.232.65.41
     41 ip_addr=46.229.168.71
     44 ip_addr=197.210.168.174
     55 ip_addr=181.137.2.214
     55 ip_addr=213.55.99.121
     58 ip_addr=46.229.168.65
     64 ip_addr=66.249.66.91
     67 ip_addr=66.249.66.90
     71 ip_addr=207.46.13.54
     78 ip_addr=130.82.1.40
    104 ip_addr=40.77.167.36
    151 ip_addr=68.180.228.157
    174 ip_addr=207.46.13.135
    194 ip_addr=54.83.138.123
    198 ip_addr=40.77.167.62
    210 ip_addr=207.46.13.71
    214 ip_addr=104.196.152.243
  • These IPs made thousands of sessions today:
$ grep 104.196.152.243 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
530
$ grep 207.46.13.71 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
859
$ grep 40.77.167.62 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
610
$ grep 54.83.138.123 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
8
$ grep 207.46.13.135 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
826
$ grep 68.180.228.157 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
727
$ grep 40.77.167.36 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
181
$ grep 130.82.1.40 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
24
$ grep 207.46.13.54 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
166
$ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
992

  • Let’s investigate who these IPs belong to:
    • 104.196.152.243 is CIAT, which is already marked as a bot via nginx!
    • 207.46.13.71 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
    • 40.77.167.62 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
    • 207.46.13.135 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
    • 68.180.228.157 is Yahoo, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
    • 40.77.167.36 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
    • 207.46.13.54 is Bing, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
    • 46.229.168.x is Semrush, which is already marked as a bot in Tomcat’s Crawler Session Manager Valve!
  • Nice, so these are all known bots that are already crammed into one session by Tomcat’s Crawler Session Manager Valve.
  • What in the actual fuck, why is our load doing this? It’s gotta be something fucked up with the database pool being “busy” but everything is fucking idle
  • One that I should probably add in nginx is 54.83.138.123, which is apparently the following user agent:
BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
  • This one makes two thousand requests per day or so recently:
# grep -c BUbiNG /var/log/nginx/access.log /var/log/nginx/access.log.1
/var/log/nginx/access.log:1925
/var/log/nginx/access.log.1:2029
  • And they have 30 IPs, so fuck that shit I’m going to add them to the Tomcat Crawler Session Manager Valve nowwww
  • Lots of discussions on the dspace-tech mailing list over the last few years about leaky transactions being a known problem with DSpace
  • Helix84 recommends restarting PostgreSQL instead of Tomcat because it restarts quicker
  • This is how the connections looked when it crashed this afternoon:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
    290 dspaceWeb
  • This is how it is right now:
$ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
      5 dspaceApi
      5 dspaceWeb
  • So is this just some fucked up XMLUI database leaking?
  • I notice there is an issue (that I’ve probably noticed before) on the Jira tracker about this that was fixed in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551
  • I seriously doubt this leaking shit is fixed for sure, but I’m gonna cherry-pick all those commits and try them on DSpace Test and probably even CGSpace because I’m fed up with this shit
  • I cherry-picked all the commits for DS-3551 but it won’t build on our current DSpace 5.5!
  • I sent a message to the dspace-tech mailing list asking why DSpace thinks these connections are busy when PostgreSQL says they are idle

2018-02-10

  • I tried to disable ORCID lookups but keep the existing authorities
  • This item has an ORCID for Ralf Kiese: http://localhost:8080/handle/10568/89897
  • Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn’t show up on the item
  • Leave all settings but change choices.presentation to lookup and ORCID badge is there and item submission uses LC Name Authority and it breaks with this error:
Field dc_contributor_author has choice presentation of type "select", it may NOT be authority-controlled.
  • If I change choices.presentation to suggest it give this error:
xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
  • So I don’t think we can disable the ORCID lookup function and keep the ORCID badges

2018-02-11

  • Magdalena from CCAFS emailed to ask why one of their items has such a weird thumbnail: 1056890735

Weird thumbnail

  • I downloaded the PDF and manually generated a thumbnail with ImageMagick and it looked better:
$ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_cmyk.icc -thumbnail 600x600 -flatten -profile /usr/local/share/ghostscript/9.22/iccprofiles/default_rgb.icc CCAFS_WP_223.jpg

Manual thumbnail

  • Peter sent me corrected author names last week but the file encoding is messed up:
$ isutf8 authors-2018-02-05.csv
authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between E1 and EC, expecting the 2nd byte between 80 and BF.
  • The isutf8 program comes from moreutils
  • Line 100 contains: Galiè, Alessandra
  • In other news, psycopg2 is splitting their package in pip, so to install the binary wheel distribution you need to use pip install psycopg2-binary
  • See: http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/
  • I updated my fix-metadata-values.py and delete-metadata-values.py scripts on the scripts page: https://github.com/ilri/DSpace/wiki/Scripts
  • I ran the 342 author corrections (after trimming whitespace and excluding those with || and other syntax errors) on CGSpace:
$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
  • Then I ran a full Discovery re-indexing:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
   text_value    |              authority               | confidence 
-----------------+--------------------------------------+------------
 Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |        600
 Duncan, Alan J. | 62298c84-4d9d-4b83-a932-4a9dd4046db7 |         -1
 Duncan, Alan J. |                                      |         -1
 Duncan, Alan    | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
 Duncan, Alan J. | cd0e03bf-92c3-475f-9589-60c5b042ea60 |         -1
 Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |         -1
 Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 |         -1
 Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
(8 rows)

dspace=# begin;
dspace=# update metadatavalue set text_value='Duncan, Alan', authority='a6486522-b08a-4f7a-84f9-3a73ce56034d', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Duncan, Alan%';
UPDATE 216
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
  text_value  |              authority               | confidence 
--------------+--------------------------------------+------------
 Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d |        600
(1 row)
dspace=# commit;
  • Run all system updates on DSpace Test (linode02) and reboot it
  • I wrote a Python script (resolve-orcids-from-solr.py) using SolrClient to parse the Solr authority cache for ORCID IDs
  • We currently have 1562 authority records with ORCID IDs, and 624 unique IDs
  • We can use this to build a controlled vocabulary of ORCID IDs for new item submissions
  • I don’t know how to add ORCID IDs to existing items yet… some more querying of PostgreSQL for authority values perhaps?
  • I added the script to the ILRI DSpace wiki on GitHub