--- title: "April, 2021" date: 2021-04-01T09:50:54+03:00 author: "Alan Orth" categories: ["Notes"] --- ## 2021-04-01 - I wrote a script to query Sherpa's API for our ISSNs: `sherpa-issn-lookup.py` - I'm curious to see how the results compare with the results from Crossref yesterday - AReS Explorer was down since this morning, I didn't see anything in the systemd journal - I simply took everything down with docker-compose and then back up, and then it was OK - Perhaps one of the containers crashed, I should have looked closer but I was in a hurry ## 2021-04-03 - Biruk from ICT contacted me to say that some CGSpace users still can't log in - I guess the CGSpace LDAP bind account is really still locked after last week's reset - He fixed the account and then I was finally able to bind and query: ```console $ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)" ``` ## 2021-04-04 - Check the index aliases on AReS Explorer to make sure they are sane before starting a new harvest: ```console $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less ``` - Then set the `openrxv-items-final` index to read-only so we can make a backup: ```console $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' {"acknowledged":true}% $ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup {"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final-backup"}% $ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}' ``` - Then start a harvesting on AReS Explorer - Help Enrico get some 2020 statistics for the Roots, Tubers and Bananas (RTB) community on CGSpace - He was hitting [a bug on AReS](https://github.com/ilri/OpenRXV/issues/66) and also he only needed stats for 2020, and AReS currently only gives all-time stats - I cleaned up about 230 ISSNs on CGSpace in OpenRefine - I had exported them last week, then filtered for anything not looking like an ISSN with this GREL: `isNotNull(value.match(/^\p{Alnum}{4}-\p{Alnum}{4}$/))` - Then I applied them on CGSpace with the `fix-metadata-values.py` script: ```console $ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253 ``` - For now I only fixed obvious errors like "1234-5678." and "e-ISSN: 1234-5678" etc, but there are still lots of invalid ones which need more manual work: - Too few characters - Too many characters - ISBNs - Create the CGSpace community and collection structure for the new Accelerating Impacts of CGIAR Climate Research for Africa (AICCRA) and assign all workflow steps ## 2021-04-05 - The AReS Explorer harvesting from yesterday finished, and the results look OK, but actually the Elasticsearch indexes are messed up again: ```console $ curl -s 'http://localhost:9200/_alias/' | python -m json.tool { "openrxv-items-final": { "aliases": {} }, "openrxv-items-temp": { "aliases": { "openrxv-items": {} } }, ... } ``` - `openrxv-items` should be an alias of `openrxv-items-final`, not `openrxv-temp`... 
- Enrico asked for more information on the RTB stats I gave him yesterday
  - I remembered (again) that we can't filter Atmire's CUA stats by date issued
  - To show, for example, views/downloads in the year 2020 for RTB items issued in 2020, we would need to use the DSpace Statistics API and post a list of IDs and a custom date range
  - I tried to do that here by exporting the RTB community and extracting the IDs for items issued in 2020:

```console
$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
    sed '1d' | \
    csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
    csvgrep -c issued -m 2020 | \
    csvcut -c id | \
    sed '1d' | \
    sort | \
    uniq
```

- So that I remember in the future, this basically does the following:
  - Use csvcut to extract the id and all date issued columns from the CSV
  - Use sed to remove the header so we can refer to the columns using the default a, b, c names instead of their real names (which are tricky to match due to special characters)
  - Use csvsql to concatenate the various date issued columns (coalescing where null)
  - Use csvgrep to filter items by date issued in 2020
  - Use csvcut to extract the id column
  - Use sed to delete the header row
  - Use sort and uniq to filter out any duplicate IDs (there were three)
- Then I have a list of 296 IDs for RTB items issued in 2020
- I constructed a JSON file to post to the DSpace Statistics API:

```json
{
    "limit": 100,
    "page": 0,
    "dateFrom": "2020-01-01T00:00:00Z",
    "dateTo": "2020-12-31T00:00:00Z",
    "items": [
        "00358715-b70c-4fdd-aa55-730e05ba739e",
        "004b54bb-f16f-4cec-9fbc-ab6c6345c43d",
        "02fb7630-d71a-449e-b65d-32b4ea7d6904",
        ...
    ]
}
```

- Then I submitted the file three times (changing the page parameter each time):

```console
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page2.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page3.json
```

- Then I extracted the views and downloads in the most ridiculous way:

```console
$ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
30364
$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
9100
```

- For curiosity I did the same exercise for items issued in 2019 and got the following:
  - Views: 30721
  - Downloads: 10205

## 2021-04-06

- Margarita from CCAFS was having problems deleting an item from CGSpace again
  - The error was "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:bd157345-448e ..."
  - This is the same issue as last month
- Create a new collection on CGSpace for a new CIP project at Mishel Portilla's request
- I got a notice that CGSpace was down
  - I didn't see anything strange at first, but there is an insane number of database connections:

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12413
```

- The system journal shows thousands of these messages; this is the first one:

```console
Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
```
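- To get a rough idea of what is actually holding all of those connections, a breakdown by application name and state might help next time (a sketch against `pg_stat_activity`):

```console
$ psql -c "SELECT psa.application_name, psa.state, COUNT(*) FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid GROUP BY 1, 2 ORDER BY 3 DESC;"
```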
- Around that time in the DSpace log I see nothing unusual, but maybe these?

```console
2021-04-06 07:52:29,409 INFO com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
```

- (BTW what is the deal with the "200/127"? I should send a comment to Atmire)
  - I filed a ticket with Atmire: https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets
- I restarted the PostgreSQL and Tomcat services and now I see fewer connections, but the count is still WAY high:

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3640
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2968
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
```

- After ten minutes or so it went back down...
- And now it's back up in the thousands... I am seeing a lot of stuff in the DSpace log like this:

```console
2021-04-06 11:59:34,364 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717955
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717956
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717957
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717958
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717959
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717960
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717961
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717962
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717963
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717964
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717965
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717966
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717967
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717968
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717969
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717970
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717971
```
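- To check whether that burst of `delete_metadata_value` calls comes from a single session or many, a grep in the same style as the session counts below should work (a sketch):

```console
$ grep -a delete_metadata_value /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq -c | sort -rn | head
```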
- I sent some notes and a log to Atmire on our existing issue about the database stuff
  - I also asked them about the possibility of doing a formal review of Hibernate
- Falcon 3.0.0 was released, so I updated the 3.0.0 branch of dspace-statistics-api and merged it to `v6_x`
  - I also fixed one minor (unrelated) bug in the tests
  - Then I deployed the new version on DSpace Test
- I had a meeting with Peter and Abenet about CGSpace TODOs
- CGSpace went down again and the PostgreSQL locks are through the roof:

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12154
```

- I don't see any activity on the REST API, but in the last four hours there have been 3,500 DSpace sessions:

```console
# grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
3547
```

- I looked at the same time of day for the past few weeks and it seems to be a normal number of sessions:

```console
# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E "2021-0(3|4)-[0-9]{2} (13|14|15|16|17):" "$file" | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
...
3572
4085
3476
3128
2949
2016
1839
4513
3463
4425
3328
2783
3898
3848
7799
255
534
2755
599
4463
3547
```

- What about the total number of sessions per day?

```console
# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo "$file:"; grep -a -o -E 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
...
/home/cgspace.cgiar.org/log/dspace.log.2021-03-28:
11784
/home/cgspace.cgiar.org/log/dspace.log.2021-03-29:
15104
/home/cgspace.cgiar.org/log/dspace.log.2021-03-30:
19396
/home/cgspace.cgiar.org/log/dspace.log.2021-03-31:
32612
/home/cgspace.cgiar.org/log/dspace.log.2021-04-01:
26037
/home/cgspace.cgiar.org/log/dspace.log.2021-04-02:
14315
/home/cgspace.cgiar.org/log/dspace.log.2021-04-03:
12530
/home/cgspace.cgiar.org/log/dspace.log.2021-04-04:
13138
/home/cgspace.cgiar.org/log/dspace.log.2021-04-05:
16756
/home/cgspace.cgiar.org/log/dspace.log.2021-04-06:
12343
```

- So it's not the number of sessions... it's something with the workload...
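- Since the session counts look normal, it is probably the workload itself; next time the locks pile up it might be worth capturing what the busy connections are actually doing (a sketch against `pg_stat_activity`):

```console
$ psql -c "SELECT pid, state, wait_event_type, now() - query_start AS duration, left(query, 60) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY duration DESC NULLS LAST LIMIT 10;"
```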
- I had to step away for an hour or so, and when I came back the site was still down and there were still 12,000 locks
- I restarted postgresql and tomcat7...
- The locks in PostgreSQL shot up again...

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3447
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3527
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
4582
```

- I don't know what the hell is going on, but the PostgreSQL connections and locks are way higher than ever before:

![PostgreSQL connections week](/cgspace-notes/2021/04/postgres_connections_cgspace-week.png)
![PostgreSQL locks week](/cgspace-notes/2021/04/postgres_locks_cgspace-week.png)
![Tomcat database pool](/cgspace-notes/2021/04/jmx_tomcat_dbpools-week.png)

- Otherwise, the number of DSpace sessions is completely normal:

![DSpace sessions](/cgspace-notes/2021/04/jmx_dspace_sessions-week.png)

- While looking at the nginx logs I see that MEL is trying to log into CGSpace's REST API and delete items:

```console
34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] "POST /rest/login HTTP/1.1" 401 727 "-" "MEL"
34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] "DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1" 401 704 "-" "-"
```

- I see a few of these per day going back several months
  - I sent a message to Salem and Enrico to ask if they know anything about it
- Also annoying, I see tons of what look like penetration testing requests from Qualys:

```console
2021-04-04 06:35:17,889 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'>
2021-04-04 06:35:17,889 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user="'>
2021-04-04 06:35:17,890 INFO org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email="'>, realm=null, result=2
2021-04-04 06:35:18,145 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
2021-04-04 06:35:18,519 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user was@qualys.com
2021-04-04 06:35:18,520 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=was@qualys.com
```

- I deleted the ilri/AReS repository on GitHub since we haven't updated it in two years
  - All development is happening in https://github.com/ilri/openRXV now
- 10PM and the server is down again, with locks through the roof:

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12198
```

- I see that there are tons of PostgreSQL connections getting abandoned today, compared to very few in the past few weeks:

```console
$ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
1838
$ journalctl -u tomcat7 --since=2021-03-20 --until=2021-04-05 | grep -c 'ConnectionPool abandon'
3
```
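- To see when the abandoned connections cluster, grouping those journal messages by hour might be useful (a sketch that assumes the default journalctl short output format, where the third field is the timestamp):

```console
$ journalctl -u tomcat7 --since=today | grep 'ConnectionPool abandon' | awk '{print $1, $2, substr($3, 1, 2)":00"}' | sort | uniq -c
```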
- I even restarted the server, and connections were low for a few minutes until they shot back up:

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
8651
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
8940
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
10504
```

- I had to go to bed, and I bet it will crash and be down for hours until I wake up...
- What the hell is this user agent?

```
54.197.119.143 - - [06/Apr/2021:19:18:11 +0200] "GET /handle/10568/16499 HTTP/1.1" 499 0 "-" "GetUrl/1.0 wdestiny@umich.edu (Linux)"
```

## 2021-04-07

- CGSpace was still down from last night of course, with tons of database locks:

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12168
```

- I restarted the server again and the locks came back
- Atmire responded to the message from yesterday
  - They noticed something in the logs about emails failing to be sent
  - There appears to be an issue sending mails on workflow tasks when a user in that group has an invalid email address:

```console
2021-04-01 12:45:11,414 WARN org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085\colon;  Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException\colon; 501 5.1.3 Invalid address
```

- The issue is not the named user above, but a member of the group...
- And the group does have users with invalid email addresses (probably accounts created automatically after authenticating with LDAP):

![DSpace group](/cgspace-notes/2021/04/group-invalid-email.png)

- I extracted all the group IDs from recent logs that had users with invalid email addresses:

```console
$ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6
1769137c-36d4-42b2-8fec-60585e110db7
203c8614-8a97-4ac8-9686-d9d62cb52acc
294603de-3d09-464e-a5b0-09e452c6b5ab
35878555-9623-4679-beb8-bb3395fdf26e
3d8a5efa-5509-4bf9-9374-2bc714aceb99
4238208a-f848-47cb-9dd2-43f9f954a4af
44939b84-1894-41e7-b3e6-8c8d1781057b
49ba087e-75a3-45ce-805c-69eeda0f786b
4a6606ce-0284-421d-bf80-4dafddba2d42
527de6aa-9cd0-4988-bf5f-c9c92ba2ac10
54cd1b16-65bf-4041-9d84-fb2ea3301d6d
58982847-5f7c-4b8b-a7b0-4d4de702136e
5f0b85be-bd23-47de-927d-bca368fa1fbc
646ada17-e4ef-49f6-9378-af7e58596ce1
7e2f4bf8-fbc9-4b2f-97a4-75e5427bef90
8029fd53-f9f5-4107-bfc3-8815507265cf
81faa934-c602-4608-bf45-de91845dfea7
8611a462-210c-4be1-a5bb-f87a065e6113
8855c903-ef86-433c-b0be-c12300eb0f84
8c7ece98-3598-4de7-a885-d61fd033bea8
8c9a0d01-2d12-4a99-84f9-cdc25ac072f9
8f9f888a-b501-41f3-a462-4da16150eebf
94168f0e-9f45-4112-ac8d-3ba9be917842
96998038-f381-47dc-8488-ff7252703627
9768f4a8-3018-44e9-bf58-beba4296327c
9a99e8d2-558e-4fc1-8011-e4411f658414
a34e6400-78ed-45c0-a751-abc039eed2e6
a9da5af3-4ec7-4a9b-becb-6e3d028d594d
abf5201c-8be5-4dee-b461-132203dd51cb
adb5658c-cef3-402f-87b6-b498f580351c
aecf811b-b7e9-4b6f-8776-3d372e6a048b
ba5aae61-ea34-4ac1-9490-4645acf2382f
bf7f3638-c7c6-4a8f-893d-891a6d3dafff
c617ada0-09d1-40ed-b479-1c4860a4f724
cff91d44-a855-458c-89e5-bd48c17d1a54
e65171ae-a2bf-4043-8f54-f8457bc9174b
e7098b40-4701-4ca2-b9a9-3a1282f67044
e904f122-71dc-439b-b877-313ef62486d7
ede59734-adac-4c01-8691-b45f19088d37
f88bd6bb-f93f-41cb-872f-ff26f6237068
f985f5fb-be5c-430b-a8f1-cf86ae4fc49a
fe800006-aaec-4f9e-9ab4-f9475b4cbdc3
```
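- To see which accounts in those groups actually have invalid addresses, querying the database directly is probably quickest; this is a sketch that assumes the DSpace 6 schema, where `epersongroup2eperson` links group and eperson UUIDs, using one of the group IDs from the log above:

```console
$ psql -d dspace -c "SELECT e.email FROM epersongroup2eperson g2e JOIN eperson e ON e.uuid = g2e.eperson_id WHERE g2e.eperson_group_id = 'aecf811b-b7e9-4b6f-8776-3d372e6a048b';"
```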
## 2021-04-08

- I can't believe it, but the server has been down for twelve hours or so
  - The locks have not changed since I went to bed last night:

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12070
```

- I restarted PostgreSQL and Tomcat and the locks went straight back up!

```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
986
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1194
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1212
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1489
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2124
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
5934
```

## 2021-04-09

- Atmire managed to get CGSpace back up by killing all the PostgreSQL connections yesterday
  - I don't know exactly how they did it (one way to do this is sketched at the end of this section)
  - They also think it's weird that restarting PostgreSQL didn't kill the connections
  - They asked some more questions, for example whether there were also issues on DSpace Test
- Strangely enough, I checked DSpace Test and noticed a clear spike in PostgreSQL locks on the morning of April 6th as well!

![PostgreSQL locks week CGSpace](/cgspace-notes/2021/04/postgres_locks_ALL-week-PROD.png)
![PostgreSQL locks week DSpace Test](/cgspace-notes/2021/04/postgres_locks_ALL-week-TEST.png)

- I definitely need to look into that!
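- For future reference, killing all connections to the DSpace database from psql would look something like this (a sketch, assuming the database is named `dspace`; Atmire may well have done it some other way):

```console
$ psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'dspace' AND pid <> pg_backend_pid();"
```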
## 2021-04-11

- I am trying to resolve the AReS Elasticsearch index issues that happened last week
  - I decided to back up the `openrxv-items` index to `openrxv-items-backup` and then delete all the others:

```console
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-backup
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
```

- Then I updated all Docker containers and rebooted the server (linode20) so that the correct indexes would be created again:

```console
$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
```

- Then I realized I have to clone the backup index directly to `openrxv-items-final`, and re-create the `openrxv-items` alias:

```console
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -X PUT "localhost:9200/openrxv-items-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-backup/_clone/openrxv-items-final
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
```

- Now I see that both `openrxv-items-final` and `openrxv-items` have the current number of items:

```console
$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'
{
  "count" : 103373,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
  "count" : 103373,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
```

- Then I started a fresh harvest in the AReS Explorer admin dashboard

## 2021-04-12

- The harvesting on AReS finished last night, but the indexes got messed up again
- I will have to fix them manually next time...
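- Next time, a quicker way to sanity check the aliases and document counts after a harvest might be the `_cat` endpoints (a sketch):

```console
$ curl -s 'http://localhost:9200/_cat/aliases?v'
$ curl -s 'http://localhost:9200/_cat/indices/openrxv-items*?v&h=index,docs.count'
```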
## 2021-04-13

- Looking into the logs on 2021-04-06 on CGSpace and DSpace Test to see if there is anything specific that stands out about the activity that day that would cause the PostgreSQL issues
  - Digging into the Munin graphs for the last week I found a few other things happening on that morning:

![/dev/sda disk latency week](/cgspace-notes/2021/04/sda-week.png)
![JVM classes unloaded week](/cgspace-notes/2021/04/classes_unloaded-week.png)
![Nginx status week](/cgspace-notes/2021/04/nginx_status-week.png)

- 13,000 requests in the last two months from a user with user agent `SomeRandomText`, for example:

```console
84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] "GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1" 404 10890 "-" "SomeRandomText"
```

- I purged them:

```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 13159 hits from SomeRandomText in statistics

Total number of bot hits purged: 13159
```

- I noticed there were 78 items submitted in the hour before CGSpace crashed:

```console
# grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item
78
```

- Of those 78, 77 were from Udana
- Compared to other mornings (0 to 9 AM) this month, that seems to be pretty high:

```console
# for num in {01..13}; do grep -a -E "2021-04-$num 0" /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a add_item; done
32
0
0
2
8
108
4
0
29
0
1
1
2
```
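- To double check who the submissions came from, the e-mail can be extracted from the log lines (it follows the `@` after the logger name, as in the `delete_metadata_value` lines above) and counted (a sketch):

```console
# grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -a add_item | grep -o -E '@ [^:]+' | sort | uniq -c | sort -rn
```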