--- title: "December, 2017" date: 2017-12-01T13:53:54+03:00 author: "Alan Orth" tags: ["Notes"] --- ## 2017-12-01 - Uptime Robot noticed that CGSpace went down - The logs say "Timeout waiting for idle object" - PostgreSQL activity says there are 115 connections currently - The list of connections to XMLUI and REST API for today: ``` # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 763 2.86.122.76 907 207.46.13.94 1018 157.55.39.206 1021 157.55.39.235 1407 66.249.66.70 1411 104.196.152.243 1503 50.116.102.77 1805 66.249.66.90 4007 70.32.83.92 6061 45.5.184.196 ``` - The number of DSpace sessions isn't even that high: ``` $ cat /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l 5815 ``` - Connections in the last two hours: ``` # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017:(09|10)" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 78 93.160.60.22 101 40.77.167.122 113 66.249.66.70 129 157.55.39.206 130 157.55.39.235 135 40.77.167.58 164 68.180.229.254 177 87.100.118.220 188 66.249.66.90 314 2.86.122.76 ``` - What the fuck is going on? - I've never seen this 2.86.122.76 before, it has made quite a few unique Tomcat sessions today: ``` $ grep 2.86.122.76 /home/cgspace.cgiar.org/log/dspace.log.2017-12-01 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l 822 ``` - Appears to be some new bot: ``` 2.86.122.76 - - [01/Dec/2017:09:02:53 +0000] "GET /handle/10568/78444?show=full HTTP/1.1" 200 29307 "-" "Mozilla/3.0 (compatible; Indy Library)" ``` - I restarted Tomcat and everything came back up - I can add Indy Library to the Tomcat crawler session manager valve but it would be nice if I could simply remap the useragent in nginx - I will also add 'Drupal' to the Tomcat crawler session manager valve because there are Drupals out there harvesting and they should be considered as bots ``` # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "1/Dec/2017" | grep Drupal | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 3 54.75.205.145 6 70.32.83.92 14 2a01:7e00::f03c:91ff:fe18:7396 46 2001:4b99:1:1:216:3eff:fe2c:dc6c 319 2001:4b99:1:1:216:3eff:fe76:205b ``` ## 2017-12-03 - Linode alerted that CGSpace's load was 327.5% from 6 to 8 AM again ## 2017-12-04 - Linode alerted that CGSpace's load was 255.5% from 8 to 10 AM again - I looked at the Munin stats on DSpace Test (linode02) again to see how the PostgreSQL tweaks from a few weeks ago were holding up: ![DSpace Test PostgreSQL connections month](/cgspace-notes/2017/12/postgres-connections-month.png) - The results look fantastic! So the `random_page_cost` tweak is massively important for informing the PostgreSQL scheduler that there is no "cost" to accessing random pages, as we're on an SSD! - I guess we could probably even reduce the PostgreSQL connections in DSpace / PostgreSQL after using this - Run system updates on DSpace Test (linode02) and reboot it - I'm going to enable the PostgreSQL `random_page_cost` tweak on CGSpace - For reference, here is the past month's connections: ![CGSpace PostgreSQL connections month](/cgspace-notes/2017/12/postgres-connections-month-cgspace.png) ## 2017-12-05 - Linode alerted again that the CPU usage on CGSpace was high this morning from 8 to 10 AM - CORE updated the entry for CGSpace on their index: https://core.ac.uk/search?q=repositories.id:(1016)&fullTextOnly=false - Linode alerted again that the CPU usage on CGSpace was high this evening from 8 to 10 PM ## 2017-12-06 - Linode alerted again that the CPU usage on CGSpace was high this morning from 6 to 8 AM - Uptime Robot alerted that the server went down and up around 8:53 this morning - Uptime Robot alerted that CGSpace was down and up again a few minutes later - I don't see any errors in the DSpace logs but I see in nginx's access.log that UptimeRobot was returned with HTTP 499 status (Client Closed Request) - Looking at the REST API logs I see some new client IP I haven't noticed before: ``` # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E "6/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 18 95.108.181.88 19 68.180.229.254 30 207.46.13.151 33 207.46.13.110 38 40.77.167.20 41 157.55.39.223 82 104.196.152.243 1529 50.116.102.77 4005 70.32.83.92 6045 45.5.184.196 ``` - 50.116.102.77 is apparently in the US on websitewelcome.com ## 2017-12-07 - Uptime Robot reported a few times today that CGSpace was down and then up - At one point Tsega restarted Tomcat - I never got any alerts about high load from Linode though... - I looked just now and see that there are 121 PostgreSQL connections! - The top users right now are: ``` # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "7/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 838 40.77.167.11 939 66.249.66.223 1149 66.249.66.206 1316 207.46.13.110 1322 207.46.13.151 1323 2001:da8:203:2224:c912:1106:d94f:9189 1414 157.55.39.223 2378 104.196.152.243 2662 66.249.66.219 5110 124.17.34.60 ``` - We've never seen 124.17.34.60 yet, but it's really hammering us! - Apparently it is from China, and here is one of its user agents: ``` Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE) ``` - It is responsible for 4,500 Tomcat sessions today alone: ``` $ grep 124.17.34.60 /home/cgspace.cgiar.org/log/dspace.log.2017-12-07 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l 4574 ``` - I've adjusted the nginx IP mapping that I set up last month to account for 124.17.34.60 and 124.17.34.59 using a regex, as it's the same bot on the same subnet - I was running the DSpace cleanup task manually and it hit an error: ``` $ /home/cgspace.cgiar.org/bin/dspace cleanup -v ... Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle" Detail: Key (bitstream_id)=(144666) is still referenced from table "bundle". ``` - The solution is like I discovered in [2017-04](/cgspace-notes/2017-04), to set the `primary_bitstream_id` to null: ``` dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (144666); UPDATE 1 ``` ## 2017-12-13 - Linode alerted that CGSpace was using high CPU from 10:13 to 12:13 this morning ## 2017-12-16 - Re-work the XMLUI base theme to allow child themes to override the header logo's image and link destination: [#349](https://github.com/ilri/DSpace/pull/349) - This required a little bit of work to restructure the XSL templates - Optimize PNG and SVG image assets in the CGIAR base theme using pngquant and svgo: [#350](https://github.com/ilri/DSpace/pull/350) ## 2017-12-17 - Reboot DSpace Test to get new Linode Linux kernel - Looking at CCAFS bulk import for Magdalena Haman (she originally sent them in November but some of the thumbnails were missing and dates were messed up so she resent them now) - A few issues with the data and thumbnails: - Her thumbnail files all use capital JPG so I had to rename them to lowercase: `rename -fc *.JPG` - thumbnail20.jpg is 1.7MB so I have to resize it - I also had to add the .jpg to the thumbnail string in the CSV - The thumbnail11.jpg is missing - The dates are in super long ISO8601 format (from Excel?) like `2016-02-07T00:00:00Z` so I converted them to simpler forms in GREL: `value.toString("yyyy-MM-dd")` - I trimmed the whitespaces in a few fields but it wasn't many - Rename her thumbnail column to filename, and format it so SAFBuilder adds the files to the thumbnail bundle with this GREL in OpenRefine: `value + "__bundle:THUMBNAIL"` - Rename dc.identifier.status and dc.identifier.url columns to cg.identifier.status and cg.identifier.url - Item 4 has weird characters in citation, ie: Nagoya et de Trait - Some author names need normalization, ie: `Aggarwal, Pramod` and `Aggarwal, Pramod K.` - Something weird going on with duplicate authors that have the same text value, like `Berto, Jayson C.` and `Balmeo, Katherine P.` - I will send her feedback on some author names like UNEP and ICRISAT and ask her for the missing thumbnail11.jpg - I did a test import of the data locally after building with SAFBuilder but for some reason I had to specify the collection (even though the collections were specified in the `collection` field) ``` $ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/89338 --source /Users/aorth/Downloads/2016\ bulk\ upload\ thumbnails/SimpleArchiveFormat --mapfile=/tmp/ccafs.map &> /tmp/ccafs.log ``` - It's the same on DSpace Test, I can't import the SAF bundle without specifying the collection: ``` $ dspace import --add --eperson=aorth@mjanja.ch --mapfile=/tmp/ccafs.map --source=/tmp/ccafs-2016/SimpleArchiveFormat No collections given. Assuming 'collections' file inside item directory Adding items from directory: /tmp/ccafs-2016/SimpleArchiveFormat Generating mapfile: /tmp/ccafs.map Processing collections file: collections Adding item from directory item_1 java.lang.NullPointerException at org.dspace.app.itemimport.ItemImport.addItem(ItemImport.java:865) at org.dspace.app.itemimport.ItemImport.addItems(ItemImport.java:736) at org.dspace.app.itemimport.ItemImport.main(ItemImport.java:498) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226) at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78) java.lang.NullPointerException Started: 1513521856014 Ended: 1513521858573 Elapsed time: 2 secs (2559 msecs) ``` - I even tried to debug it by adding verbose logging to the `JAVA_OPTS`: ``` -Dlog4j.configuration=file:/Users/aorth/dspace/config/log4j-console.properties -Ddspace.log.init.disable=true ``` - ... but the error message was the same, just with more INFO noise around it - For now I'll import into a collection in DSpace Test but I'm really not sure what's up with this! - Linode alerted that CGSpace was using high CPU from 4 to 6 PM - The logs for today show the CORE bot (137.108.70.7) being active in XMLUI: ``` # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 671 66.249.66.70 885 95.108.181.88 904 157.55.39.96 923 157.55.39.179 1159 207.46.13.107 1184 104.196.152.243 1230 66.249.66.91 1414 68.180.229.254 4137 66.249.66.90 46401 137.108.70.7 ``` - And then some CIAT bot (45.5.184.196) is actively hitting API endpoints: ``` # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "17/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 33 68.180.229.254 48 157.55.39.96 51 157.55.39.179 56 207.46.13.107 102 104.196.152.243 102 66.249.66.90 691 137.108.70.7 1531 50.116.102.77 4014 70.32.83.92 11030 45.5.184.196 ``` - That's probably ok, as I don't think the REST API connections use up a Tomcat session... - CIP emailed a few days ago to ask about unique IDs for authors and organizations, and if we can provide them via an API - Regarding the import issue above it seems to be a known issue that has a patch in DSpace 5.7: - https://jira.duraspace.org/browse/DS-2633 - https://jira.duraspace.org/browse/DS-3583 - We're on DSpace 5.5 but there is a one-word fix to the addItem() function here: https://github.com/DSpace/DSpace/pull/1731 - I will apply it on our branch but I need to make a note to NOT cherry-pick it when I rebase on to the latest 5.x upstream later - Pull request: [#351](https://github.com/ilri/DSpace/pull/351) ## 2017-12-18 - Linode alerted this morning that there was high outbound traffic from 6 to 8 AM - The XMLUI logs show that the CORE bot from last night (137.108.70.7) is very active still: ``` # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 190 207.46.13.146 191 197.210.168.174 202 86.101.203.216 268 157.55.39.134 297 66.249.66.91 314 213.55.99.121 402 66.249.66.90 532 68.180.229.254 644 104.196.152.243 32220 137.108.70.7 ``` - On the API side (REST and OAI) there is still the same CIAT bot (45.5.184.196) from last night making quite a number of requests this morning: ``` # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 7 104.198.9.108 8 185.29.8.111 8 40.77.167.176 9 66.249.66.91 9 68.180.229.254 10 157.55.39.134 15 66.249.66.90 59 104.196.152.243 4014 70.32.83.92 8619 45.5.184.196 ``` - I need to keep an eye on this issue because it has nice fixes for reducing the number of database connections in DSpace 5.7: https://jira.duraspace.org/browse/DS-3551 - Update text on CGSpace about page to give some tips to developers about using the resources more wisely ([#352](https://github.com/ilri/DSpace/pull/352)) - Linode alerted that CGSpace was using 396.3% CPU from 12 to 2 PM - The REST and OAI API logs look pretty much the same as earlier this morning, but there's a new IP harvesting XMLUI: ``` # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E "18/Dec/2017" | awk '{print $1}' | sort -n | uniq -c | sort -h | tail 360 95.108.181.88 477 66.249.66.90 526 86.101.203.216 691 207.46.13.13 698 197.210.168.174 819 207.46.13.146 878 68.180.229.254 1965 104.196.152.243 17701 2.86.72.181 52532 137.108.70.7 ``` - 2.86.72.181 appears to be from Greece, and has the following user agent: ``` Mozilla/3.0 (compatible; Indy Library) ``` - Surprisingly it seems they are re-using their Tomcat session for all those 17,000 requests: ``` $ grep 2.86.72.181 dspace.log.2017-12-18 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l 1 ``` - I guess there's nothing I can do to them for now - In other news, I am curious how many PostgreSQL connection pool errors we've had in the last month: ``` $ grep -c "Cannot get a connection, pool error Timeout waiting for idle object" dspace.log.2017-1* | grep -v :0 dspace.log.2017-11-07:15695 dspace.log.2017-11-08:135 dspace.log.2017-11-17:1298 dspace.log.2017-11-26:4160 dspace.log.2017-11-28:107 dspace.log.2017-11-29:3972 dspace.log.2017-12-01:1601 dspace.log.2017-12-02:1274 dspace.log.2017-12-07:2769 ``` - I made a small fix to my `move-collections.sh` script so that it handles the case when a "to" or "from" community doesn't exist - The script lives here: https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515 - Major reorganization of four of CTA's French collections - Basically moving their items into the English ones, then moving the English ones to the top-level of the CTA community, and deleting the old sub-communities - Move collection 10568/51821 from 10568/42212 to 10568/42211 - Move collection 10568/51400 from 10568/42214 to 10568/42211 - Move collection 10568/56992 from 10568/42216 to 10568/42211 - Move collection 10568/42218 from 10568/42217 to 10568/42211 - Export CSV of collection 10568/63484 and move items to collection 10568/51400 - Export CSV of collection 10568/64403 and move items to collection 10568/56992 - Export CSV of collection 10568/56994 and move items to collection 10568/42218 - There are blank lines in this metadata, which causes DSpace to not detect changes in the CSVs - I had to use OpenRefine to remove all columns from the CSV except `id` and `collection`, and then update the `collection` field for the new mappings - Remove empty sub-communities: 10568/42212, 10568/42214, 10568/42216, 10568/42217 - I was in the middle of applying the metadata imports on CGSpace and the system ran out of PostgreSQL connections... - There were 128 PostgreSQL connections at the time... grrrr. - So I restarted Tomcat 7 and restarted the imports - I assume the PostgreSQL transactions were fine but I will remove the Discovery index for their community and re-run the light-weight indexing to hopefully re-construct everything: ``` $ dspace index-discovery -r 10568/42211 $ schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery ``` - The PostgreSQL issues are getting out of control, I need to figure out how to enable connection pools in Tomcat! ## 2017-12-19 - Briefly had PostgreSQL connection issues on CGSpace for the millionth time - I'm fucking sick of this! - The connection graph on CGSpace shows shit tons of connections idle ![Idle PostgreSQL connections on CGSpace](/cgspace-notes/2017/12/postgres-connections-month-cgspace-2.png) - And I only now just realized that DSpace's `db.maxidle` parameter is not seconds, but number of idle connections to allow. - So theoretically, because each webapp has its own pool, this could be 20 per app—so no wonder we have 50 idle connections! - I notice that this number will be set to 10 by default in DSpace 6.1 and 7.0: https://jira.duraspace.org/browse/DS-3564 - So I'm going to reduce ours from 20 to 10 and start trying to figure out how the hell to supply a database pool using Tomcat JNDI - I re-deployed the `5_x-prod` branch on CGSpace, applied all system updates, and restarted the server