cgspace-notes/2021-04.md at e96067374e5781ac68cf6f8dba30483bc8f4c9b7

28 KiB

Raw Blame History

title

date

author

2021-04-01

I wrote a script to query Sherpa's API for our ISSNs: sherpa-issn-lookup.py
- I'm curious to see how the results compare with the results from Crossref yesterday
AReS Explorer was down since this morning, I didn't see anything in the systemd journal
- I simply took everything down with docker-compose and then back up, and then it was OK
- Perhaps one of the containers crashed, I should have looked closer but I was in a hurry

2021-04-03

Biruk from ICT contacted me to say that some CGSpace users still can't log in
- I guess the CGSpace LDAP bind account is really still locked after last week's reset
- He fixed the account and then I was finally able to bind and query:

$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)"

2021-04-04

Check the index aliases on AReS Explorer to make sure they are sane before starting a new harvest:

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less

Then set the openrxv-items-final index to read-only so we can make a backup:

$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' 
{"acknowledged":true}%
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup
{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final-backup"}%
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'

Then start a harvesting on AReS Explorer
Help Enrico get some 2020 statistics for the Roots, Tubers and Bananas (RTB) community on CGSpace
- He was hitting a bug on AReS and also he only needed stats for 2020, and AReS currently only gives all-time stats
I cleaned up about 230 ISSNs on CGSpace in OpenRefine
- I had exported them last week, then filtered for anything not looking like an ISSN with this GREL: isNotNull(value.match(/^\p{Alnum}{4}-\p{Alnum}{4}$/))
- Then I applied them on CGSpace with the fix-metadata-values.py script:

$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253

For now I only fixed obvious errors like "1234-5678." and "e-ISSN: 1234-5678" etc, but there are still lots of invalid ones which need more manual work:
- Too few characters
- Too many characters
- ISBNs
Create the CGSpace community and collection structure for the new Accelerating Impacts of CGIAR Climate Research for Africa (AICCRA) and assign all workflow steps

2021-04-05

The AReS Explorer harvesting from yesterday finished, and the results look OK, but actually the Elasticsearch indexes are messed up again:

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
{
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },
...
}

openrxv-items should be an alias of openrxv-items-final, not openrxv-temp... I will have to fix that manually
Enrico asked for more information on the RTB stats I gave him yesterday
- I remembered (again) that we can't filter Atmire's CUA stats by date issued
- To show, for example, views/downloads in the year 2020 for RTB issued in 2020, we would need to use the DSpace statistics API and post a list of IDs and a custom date range
- I tried to do that here by exporting the RTB community and extracting the IDs for items issued in 2020:

$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
  sed '1d' | \
  csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
  csvgrep -c issued -m 2020 | \
  csvcut -c id | \
  sed '1d' | \
  sort | \
  uniq

So I remember in the future, this basically does the following:
- Use csvcut to extract the id and all date issued columns from the CSV
- Use sed to remove the header so we can refer to the columns using default a, b, c instead of their real names (which are tricky to match due to special characters)
- Use csvsql to concatenate the various date issued columns (coalescing where null)
- Use csvgrep to filter items by date issued in 2020
- Use csvcut to extract the id column
- Use sed to delete the header row
- Use sort and uniq to filter out any duplicate IDs (there were three)
Then I have a list of 296 IDs for RTB items issued in 2020
I constructed a JSON file to post to the DSpace Statistics API:

{
  "limit": 100,
  "page": 0,
  "dateFrom": "2020-01-01T00:00:00Z",
  "dateTo": "2020-12-31T00:00:00Z",
  "items": [
"00358715-b70c-4fdd-aa55-730e05ba739e",
"004b54bb-f16f-4cec-9fbc-ab6c6345c43d",
"02fb7630-d71a-449e-b65d-32b4ea7d6904",
...
  ]
}

Then I submitted the file three times (changing the page parameter):

$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page2.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page3.json

Then I extracted the views and downloads in the most ridiculous way:

$ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
30364
$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
9100

For curiousity I did the same exercise for items issued in 2019 and got the following:
- Views: 30721
- Downloads: 10205

2021-04-06

Margarita from CCAFS was having problems deleting an item from CGSpace again
- The error was "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:bd157345-448e ..."
- This is the same issue as last month
Create a new collection on CGSpace for a new CIP project at Mishel Portilla's request
I got a notice that CGSpace was down
- I didn't see anything strange at first, but there are an insane amount of database connections:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12413

The system journal shows thousands of these messages in the system journal, this is the first one:

Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon

Around that time in the dspace log I see nothing unusual, but maybe these?

2021-04-06 07:52:29,409 INFO  com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics

(BTW what is the deal with the "200/127"? I should send a comment to Atmire)
- I file a ticket with Atmire: https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets
I restarted the PostgreSQL and Tomcat services and now I see less connections, but still WAY high:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3640
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2968
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13

After ten minutes or so it went back down...
And now it's back up in the thousands... I am seeing a lot of stuff in dspace log like this:

2021-04-06 11:59:34,364 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717955
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717956
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717957
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717958
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717959
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717960
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717961
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717962
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717963
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717964
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717965
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717966
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717967
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717968
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717969
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717970
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717971

I sent some notes and a log to Atmire on our existing issue about the database stuff
- Also I asked them about the possibility of doing a formal review of Hibernate
Falcon 3.0.0 was released so I updated the 3.0.0 branch for dspace-statistics-api and merged it to v6_x
- I also fixed one minor (unrelated) bug in the tests
- Then I deployed the new version on DSpace Test
I had a meeting with Peter and Abenet about CGSpace TODOs
CGSpace went down again and the PostgreSQL locks are through the roof:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12154

I don't see any activity on REST API, but in the last four hours there have been 3,500 DSpace sessions:

# grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
3547

I looked at the same time of day for the past few weeks and it seems to be a normal number of sessions:

# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E "2021-0(3|4)-[0-9]{2} (13|14|15|16|17):" "$file" | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
...
3572
4085
3476
3128
2949
2016
1839
4513
3463
4425
3328
2783
3898
3848
7799
255
534
2755
599
4463
3547

What about total number of sessions per day?

# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo "$file:"; grep -a -o -E 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
...
/home/cgspace.cgiar.org/log/dspace.log.2021-03-28:
11784
/home/cgspace.cgiar.org/log/dspace.log.2021-03-29:
15104
/home/cgspace.cgiar.org/log/dspace.log.2021-03-30:
19396
/home/cgspace.cgiar.org/log/dspace.log.2021-03-31:
32612
/home/cgspace.cgiar.org/log/dspace.log.2021-04-01:
26037
/home/cgspace.cgiar.org/log/dspace.log.2021-04-02:
14315
/home/cgspace.cgiar.org/log/dspace.log.2021-04-03:
12530
/home/cgspace.cgiar.org/log/dspace.log.2021-04-04:
13138
/home/cgspace.cgiar.org/log/dspace.log.2021-04-05:
16756
/home/cgspace.cgiar.org/log/dspace.log.2021-04-06:
12343

So it's not the number of sessions... it's something with the workload...
I had to step away for an hour or so and when I came back the site was still down and there were still 12,000 locks
- I restarted postgresql and tomcat7...
The locks in PostgreSQL shot up again...

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3447
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3527
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
4582

I don't know what the hell is going on, but the PostgreSQL connections and locks are way higher than ever before:

Otherwise, the number of DSpace sessions is completely normal:

While looking at the nginx logs I see that MEL is trying to log into CGSpace's REST API and delete items:

34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] "POST /rest/login HTTP/1.1" 401 727 "-" "MEL"
34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] "DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1" 401 704 "-" "-"

I see a few of these per day going back several months
- I sent a message to Salem and Enrico to ask if they know
Also annoying, I see tons of what look like penetration testing requests from Qualys:

2021-04-04 06:35:17,889 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'><qss a=X158062356Y1_2Z>
2021-04-04 06:35:17,889 INFO  org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user="'><qss a=X158062356Y1_2Z>
2021-04-04 06:35:17,890 INFO  org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email="'><qss a=X158062356Y1_2Z>, realm=null, result=2
2021-04-04 06:35:18,145 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
2021-04-04 06:35:18,519 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user was@qualys.com
2021-04-04 06:35:18,520 INFO  org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=was@qualys.com

I deleted the ilri/AReS repository on GitHub since we haven't updated it in two years
- All development is happening in https://github.com/ilri/openRXV now
10PM and the server is down again, with locks through the roof:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12198

I see that there are tons of PostgreSQL connections getting abandoned today, compared to very few in the past few weeks:

$ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
1838
$ journalctl -u tomcat7 --since=2021-03-20 --until=2021-04-05 | grep -c 'ConnectionPool abandon'
3

I even restarted the server and connections were low for a few minutes until they shot back up:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
8651
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
8940
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
10504

I had to go to bed and I bet it will crash and be down for hours until I wake up...
What the hell is this user agent?

54.197.119.143 - - [06/Apr/2021:19:18:11 +0200] "GET /handle/10568/16499 HTTP/1.1" 499 0 "-" "GetUrl/1.0 wdestiny@umich.edu (Linux)"

2021-04-07

CGSpace was still down from last night of course, with tons of database locks:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12168

I restarted the server again and the locks came back
Atmire responded to the message from yesterday
- The noticed something in the logs about emails failing to be sent
- There appears to be an issue sending mails on workflow tasks when a user in that group has an invalid email address:

2021-04-01 12:45:11,414 WARN  org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085\colon;  Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException\colon; 501 5.1.3 Invalid address

The issue is not the named user above, but a member of the group...
And the group does have users with invalid email addresses (probably accounts created automatically after authenticating with LDAP):

I extracted all the group IDs from recent logs that had users with invalid email addresses:

$ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6
1769137c-36d4-42b2-8fec-60585e110db7
203c8614-8a97-4ac8-9686-d9d62cb52acc
294603de-3d09-464e-a5b0-09e452c6b5ab
35878555-9623-4679-beb8-bb3395fdf26e
3d8a5efa-5509-4bf9-9374-2bc714aceb99
4238208a-f848-47cb-9dd2-43f9f954a4af
44939b84-1894-41e7-b3e6-8c8d1781057b
49ba087e-75a3-45ce-805c-69eeda0f786b
4a6606ce-0284-421d-bf80-4dafddba2d42
527de6aa-9cd0-4988-bf5f-c9c92ba2ac10
54cd1b16-65bf-4041-9d84-fb2ea3301d6d
58982847-5f7c-4b8b-a7b0-4d4de702136e
5f0b85be-bd23-47de-927d-bca368fa1fbc
646ada17-e4ef-49f6-9378-af7e58596ce1
7e2f4bf8-fbc9-4b2f-97a4-75e5427bef90
8029fd53-f9f5-4107-bfc3-8815507265cf
81faa934-c602-4608-bf45-de91845dfea7
8611a462-210c-4be1-a5bb-f87a065e6113
8855c903-ef86-433c-b0be-c12300eb0f84
8c7ece98-3598-4de7-a885-d61fd033bea8
8c9a0d01-2d12-4a99-84f9-cdc25ac072f9
8f9f888a-b501-41f3-a462-4da16150eebf
94168f0e-9f45-4112-ac8d-3ba9be917842
96998038-f381-47dc-8488-ff7252703627
9768f4a8-3018-44e9-bf58-beba4296327c
9a99e8d2-558e-4fc1-8011-e4411f658414
a34e6400-78ed-45c0-a751-abc039eed2e6
a9da5af3-4ec7-4a9b-becb-6e3d028d594d
abf5201c-8be5-4dee-b461-132203dd51cb
adb5658c-cef3-402f-87b6-b498f580351c
aecf811b-b7e9-4b6f-8776-3d372e6a048b
ba5aae61-ea34-4ac1-9490-4645acf2382f
bf7f3638-c7c6-4a8f-893d-891a6d3dafff
c617ada0-09d1-40ed-b479-1c4860a4f724
cff91d44-a855-458c-89e5-bd48c17d1a54
e65171ae-a2bf-4043-8f54-f8457bc9174b
e7098b40-4701-4ca2-b9a9-3a1282f67044
e904f122-71dc-439b-b877-313ef62486d7
ede59734-adac-4c01-8691-b45f19088d37
f88bd6bb-f93f-41cb-872f-ff26f6237068
f985f5fb-be5c-430b-a8f1-cf86ae4fc49a
fe800006-aaec-4f9e-9ab4-f9475b4cbdc3

2021-04-08

I can't believe it but the server has been down for twelve hours or so
- The locks have not changed since I went to bed last night:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12070

I restarted PostgreSQL and Tomcat and the locks go straight back up!

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
986
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1194
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1212
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1489
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2124
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
5934

2021-04-09

Atmire managed to get CGSpace back up by killing all the PostgreSQL connections yesterday
- I don't know how they did it...
- They also think it's weird that restarting PostgreSQL didn't kill the connections
- They asked some more questions, like for example if there were also issues on DSpace Test
- Strangely enough, I checked DSpace Test and notice a clear spike in PostgreSQL locks on the morning of April 6th as well!

I definitely need to look into that!

2021-04-11

I am trying to resolve the AReS Elasticsearch index issues that happened last week
- I decided to back up the openrxv-items index to openrxv-items-backup and then delete all the others:

$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-backup
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'

Then I updated all Docker containers and rebooted the server (linode20) so that the correct indexes would be created again:

$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull

Then I realized I have to clone the backup index directly to openrxv-items-final, and re-create the openrxv-items alias:

$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -X PUT "localhost:9200/openrxv-items-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-backup/_clone/openrxv-items-final
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'

Now I see both openrxv-items-final and openrxv-items have the current number of items:

$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'     
{
  "count" : 103373,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
  "count" : 103373,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

Then I started a fresh harvesting in the AReS Explorer admin dashboard

2021-04-12

The harvesting on AReS finished last night, but the indexes got messed up again
- I will have to fix them manually next time...

2021-04-13

Looking into the logs on 2021-04-06 on CGSpace and DSpace Test to see if there is anything specific that stands out about the activty on those days that would cause the PostgreSQL issues
- Digging into the Munin graphs for the last week I found a few other things happening on that morning:

13,000 requests in the last two months from a user with user agent SomeRandomText, for example:

84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] "GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1" 404 10890 "-" "SomeRandomText"

I purged them:

$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 13159 hits from SomeRandomText in statistics

Total number of bot hits purged: 13159

I noticed there were 78 items submitted in the hour before CGSpace crashed:

# grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item 
78

Of those 78, 77 of them were from Udana
Compared to other mornings (0 to 9 AM) this month that seems to be pretty high:

# for num in {01..13}; do grep -a -E "2021-04-$num 0" /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
 add_item; done
32
0
0
2
8
108
4
0
29
0
1
1
2

2021-04-15

Release v1.4.2 of the DSpace Statistics API on GitHub: https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.2
- This has been running on DSpace Test for the last week or so, and mostly contains the Falcon 3.0.0 changes
Re-sync DSpace Test with data from CGSpace
- Run system updates on DSpace Test (linode26) and reboot the server
Update the PostgreSQL JDBC driver on DSpace Test (linode26) to 42.2.19
- It has been a few months since we updated this, and there have been a few releases since 42.2.14 that we are currently using
Create a test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:

$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'

I added the account to the Alliance Admins account, which is should allow him to submit to any Alliance collection
- According to my notes from [2020-10]({{< relref "2020-10.md" >}}) the account must be in the admin group in order to submit via the REST API

28 KiB Raw Blame History

2021-04-01

2021-04-03

2021-04-04

2021-04-05

2021-04-06

2021-04-07

2021-04-08

2021-04-09

2021-04-11

2021-04-12

2021-04-13

2021-04-15

28 KiB

Raw Blame History