April, 2021

Thu Apr 01, 2021 in Notes

2021-04-01

I wrote a script to query Sherpa’s API for our ISSNs: sherpa-issn-lookup.py
- I’m curious to see how the results compare with the results from Crossref yesterday
AReS Explorer was down since this morning, I didn’t see anything in the systemd journal
- I simply took everything down with docker-compose and then back up, and then it was OK
- Perhaps one of the containers crashed, I should have looked closer but I was in a hurry

2021-04-03

Biruk from ICT contacted me to say that some CGSpace users still can’t log in
- I guess the CGSpace LDAP bind account is really still locked after last week’s reset
- He fixed the account and then I was finally able to bind and query:

$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)"

2021-04-04

Check the index aliases on AReS Explorer to make sure they are sane before starting a new harvest:

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less

Then set the openrxv-items-final index to read-only so we can make a backup:

$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}' 
{"acknowledged":true}%
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup
{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final-backup"}%
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'

Then start a harvesting on AReS Explorer
Help Enrico get some 2020 statistics for the Roots, Tubers and Bananas (RTB) community on CGSpace
- He was hitting a bug on AReS and also he only needed stats for 2020, and AReS currently only gives all-time stats
I cleaned up about 230 ISSNs on CGSpace in OpenRefine
- I had exported them last week, then filtered for anything not looking like an ISSN with this GREL: isNotNull(value.match(/^\p{Alnum}{4}-\p{Alnum}{4}$/))
- Then I applied them on CGSpace with the fix-metadata-values.py script:

$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253

For now I only fixed obvious errors like “1234-5678.” and “e-ISSN: 1234-5678” etc, but there are still lots of invalid ones which need more manual work:
- Too few characters
- Too many characters
- ISBNs
Create the CGSpace community and collection structure for the new Accelerating Impacts of CGIAR Climate Research for Africa (AICCRA) and assign all workflow steps

2021-04-05

The AReS Explorer harvesting from yesterday finished, and the results look OK, but actually the Elasticsearch indexes are messed up again:

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
{
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },
...
}

openrxv-items should be an alias of openrxv-items-final, not openrxv-temp… I will have to fix that manually
Enrico asked for more information on the RTB stats I gave him yesterday
- I remembered (again) that we can’t filter Atmire’s CUA stats by date issued
- To show, for example, views/downloads in the year 2020 for RTB issued in 2020, we would need to use the DSpace statistics API and post a list of IDs and a custom date range
- I tried to do that here by exporting the RTB community and extracting the IDs for items issued in 2020:

$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
  sed '1d' | \
  csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
  csvgrep -c issued -m 2020 | \
  csvcut -c id | \
  sed '1d' | \
  sort | \
  uniq

So I remember in the future, this basically does the following:
- Use csvcut to extract the id and all date issued columns from the CSV
- Use sed to remove the header so we can refer to the columns using default a, b, c instead of their real names (which are tricky to match due to special characters)
- Use csvsql to concatenate the various date issued columns (coalescing where null)
- Use csvgrep to filter items by date issued in 2020
- Use csvcut to extract the id column
- Use sed to delete the header row
- Use sort and uniq to filter out any duplicate IDs (there were three)
Then I have a list of 296 IDs for RTB items issued in 2020
I constructed a JSON file to post to the DSpace Statistics API:

{
  "limit": 100,
  "page": 0,
  "dateFrom": "2020-01-01T00:00:00Z",
  "dateTo": "2020-12-31T00:00:00Z",
  "items": [
"00358715-b70c-4fdd-aa55-730e05ba739e",
"004b54bb-f16f-4cec-9fbc-ab6c6345c43d",
"02fb7630-d71a-449e-b65d-32b4ea7d6904",
...
  ]
}

Then I submitted the file three times (changing the page parameter):

$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page2.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page3.json

Then I extracted the views and downloads in the most ridiculous way:

$ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
30364
$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
9100

For curiousity I did the same exercise for items issued in 2019 and got the following:
- Views: 30721
- Downloads: 10205

2021-04-06

Margarita from CCAFS was having problems deleting an item from CGSpace again
- The error was “Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:bd157345-448e …”
- This is the same issue as last month
Create a new collection on CGSpace for a new CIP project at Mishel Portilla’s request
I got a notice that CGSpace was down
- I didn’t see anything strange at first, but there are an insane amount of database connections:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12413

The system journal shows thousands of these messages in the system journal, this is the first one:

Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon

Around that time in the dspace log I see nothing unusual, but maybe these?

2021-04-06 07:52:29,409 INFO  com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics

(BTW what is the deal with the “200/127”? I should send a comment to Atmire)
- I file a ticket with Atmire: https://tracker.atmire.com/tickets-cgiar-ilri/view-tickets
I restarted the PostgreSQL and Tomcat services and now I see less connections, but still WAY high:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3640
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2968
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13

After ten minutes or so it went back down…
And now it’s back up in the thousands… I am seeing a lot of stuff in dspace log like this:

2021-04-06 11:59:34,364 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717955
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717956
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717957
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717958
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717959
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717960
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717961
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717962
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717963
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717964
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717965
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717966
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717967
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717968
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717969
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717970
2021-04-06 11:59:34,365 INFO  org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717971

I sent some notes and a log to Atmire on our existing issue about the database stuff
- Also I asked them about the possibility of doing a formal review of Hibernate
Falcon 3.0.0 was released so I updated the 3.0.0 branch for dspace-statistics-api and merged it to v6_x
- I also fixed one minor (unrelated) bug in the tests
- Then I deployed the new version on DSpace Test
I had a meeting with Peter and Abenet about CGSpace TODOs
CGSpace went down again and the PostgreSQL locks are through the roof:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12154

I don’t see any activity on REST API, but in the last four hours there have been 3,500 DSpace sessions:

# grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
3547

I looked at the same time of day for the past few weeks and it seems to be a normal number of sessions:

# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E "2021-0(3|4)-[0-9]{2} (13|14|15|16|17):" "$file" | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
...
3572
4085
3476
3128
2949
2016
1839
4513
3463
4425
3328
2783
3898
3848
7799
255
534
2755
599
4463
3547

What about total number of sessions per day?

# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo "$file:"; grep -a -o -E 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
...
/home/cgspace.cgiar.org/log/dspace.log.2021-03-28:
11784
/home/cgspace.cgiar.org/log/dspace.log.2021-03-29:
15104
/home/cgspace.cgiar.org/log/dspace.log.2021-03-30:
19396
/home/cgspace.cgiar.org/log/dspace.log.2021-03-31:
32612
/home/cgspace.cgiar.org/log/dspace.log.2021-04-01:
26037
/home/cgspace.cgiar.org/log/dspace.log.2021-04-02:
14315
/home/cgspace.cgiar.org/log/dspace.log.2021-04-03:
12530
/home/cgspace.cgiar.org/log/dspace.log.2021-04-04:
13138
/home/cgspace.cgiar.org/log/dspace.log.2021-04-05:
16756
/home/cgspace.cgiar.org/log/dspace.log.2021-04-06:
12343

So it’s not the number of sessions… it’s something with the workload…
I had to step away for an hour or so and when I came back the site was still down and there were still 12,000 locks
- I restarted postgresql and tomcat7…
The locks in PostgreSQL shot up again…

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3447
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3527
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
4582

I don’t know what the hell is going on, but the PostgreSQL connections and locks are way higher than ever before:

PostgreSQL connections week PostgreSQL locks week Tomcat database pool

Otherwise, the number of DSpace sessions is completely normal:

DSpace sessions

While looking at the nginx logs I see that MEL is trying to log into CGSpace’s REST API and delete items:

34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] "POST /rest/login HTTP/1.1" 401 727 "-" "MEL"
34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] "DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1" 401 704 "-" "-"

I see a few of these per day going back several months
- I sent a message to Salem and Enrico to ask if they know
Also annoying, I see tons of what look like penetration testing requests from Qualys:

2021-04-04 06:35:17,889 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'><qss a=X158062356Y1_2Z>
2021-04-04 06:35:17,889 INFO  org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user="'><qss a=X158062356Y1_2Z>
2021-04-04 06:35:17,890 INFO  org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email="'><qss a=X158062356Y1_2Z>, realm=null, result=2
2021-04-04 06:35:18,145 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
2021-04-04 06:35:18,519 INFO  org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user was@qualys.com
2021-04-04 06:35:18,520 INFO  org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=was@qualys.com

I deleted the ilri/AReS repository on GitHub since we haven’t updated it in two years
- All development is happening in https://github.com/ilri/openRXV now
10PM and the server is down again, with locks through the roof:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12198

I see that there are tons of PostgreSQL connections getting abandoned today, compared to very few in the past few weeks:

$ journalctl -u tomcat7 --since=today | grep -c 'ConnectionPool abandon'
1838
$ journalctl -u tomcat7 --since=2021-03-20 --until=2021-04-05 | grep -c 'ConnectionPool abandon'
3

I even restarted the server and connections were low for a few minutes until they shot back up:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
8651
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
8940
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
10504

I had to go to bed and I bet it will crash and be down for hours until I wake up…
What the hell is this user agent?

54.197.119.143 - - [06/Apr/2021:19:18:11 +0200] "GET /handle/10568/16499 HTTP/1.1" 499 0 "-" "GetUrl/1.0 wdestiny@umich.edu (Linux)"

2021-04-07

CGSpace was still down from last night of course, with tons of database locks:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12168

I restarted the server again and the locks came back
Atmire responded to the message from yesterday
- The noticed something in the logs about emails failing to be sent
- There appears to be an issue sending mails on workflow tasks when a user in that group has an invalid email address:

2021-04-01 12:45:11,414 WARN  org.dspace.workflowbasic.BasicWorkflowServiceImpl @ a.akwarandu@cgiar.org:session_id=2F20F20D4A8C36DB53D42DE45DFA3CCE:notifyGroupofTask:cannot email user group_id=aecf811b-b7e9-4b6f-8776-3d372e6a048b workflow_item_id=33085\colon;  Invalid Addresses (com.sun.mail.smtp.SMTPAddressFailedException\colon; 501 5.1.3 Invalid address

The issue is not the named user above, but a member of the group…
And the group does have users with invalid email addresses (probably accounts created automatically after authenticating with LDAP):

DSpace group

I extracted all the group IDs from recent logs that had users with invalid email addresses:

$ grep -a -E 'email user group_id=\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' /home/cgspace.cgiar.org/log/dspace.log.* | grep -o -E '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' | sort | uniq
0a30d6ae-74a6-4eee-a8f5-ee5d15192ee6
1769137c-36d4-42b2-8fec-60585e110db7
203c8614-8a97-4ac8-9686-d9d62cb52acc
294603de-3d09-464e-a5b0-09e452c6b5ab
35878555-9623-4679-beb8-bb3395fdf26e
3d8a5efa-5509-4bf9-9374-2bc714aceb99
4238208a-f848-47cb-9dd2-43f9f954a4af
44939b84-1894-41e7-b3e6-8c8d1781057b
49ba087e-75a3-45ce-805c-69eeda0f786b
4a6606ce-0284-421d-bf80-4dafddba2d42
527de6aa-9cd0-4988-bf5f-c9c92ba2ac10
54cd1b16-65bf-4041-9d84-fb2ea3301d6d
58982847-5f7c-4b8b-a7b0-4d4de702136e
5f0b85be-bd23-47de-927d-bca368fa1fbc
646ada17-e4ef-49f6-9378-af7e58596ce1
7e2f4bf8-fbc9-4b2f-97a4-75e5427bef90
8029fd53-f9f5-4107-bfc3-8815507265cf
81faa934-c602-4608-bf45-de91845dfea7
8611a462-210c-4be1-a5bb-f87a065e6113
8855c903-ef86-433c-b0be-c12300eb0f84
8c7ece98-3598-4de7-a885-d61fd033bea8
8c9a0d01-2d12-4a99-84f9-cdc25ac072f9
8f9f888a-b501-41f3-a462-4da16150eebf
94168f0e-9f45-4112-ac8d-3ba9be917842
96998038-f381-47dc-8488-ff7252703627
9768f4a8-3018-44e9-bf58-beba4296327c
9a99e8d2-558e-4fc1-8011-e4411f658414
a34e6400-78ed-45c0-a751-abc039eed2e6
a9da5af3-4ec7-4a9b-becb-6e3d028d594d
abf5201c-8be5-4dee-b461-132203dd51cb
adb5658c-cef3-402f-87b6-b498f580351c
aecf811b-b7e9-4b6f-8776-3d372e6a048b
ba5aae61-ea34-4ac1-9490-4645acf2382f
bf7f3638-c7c6-4a8f-893d-891a6d3dafff
c617ada0-09d1-40ed-b479-1c4860a4f724
cff91d44-a855-458c-89e5-bd48c17d1a54
e65171ae-a2bf-4043-8f54-f8457bc9174b
e7098b40-4701-4ca2-b9a9-3a1282f67044
e904f122-71dc-439b-b877-313ef62486d7
ede59734-adac-4c01-8691-b45f19088d37
f88bd6bb-f93f-41cb-872f-ff26f6237068
f985f5fb-be5c-430b-a8f1-cf86ae4fc49a
fe800006-aaec-4f9e-9ab4-f9475b4cbdc3

2021-04-08

I can’t believe it but the server has been down for twelve hours or so
- The locks have not changed since I went to bed last night:

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12070

I restarted PostgreSQL and Tomcat and the locks go straight back up!

$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
986
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1194
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1212
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1489
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2124
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
5934

2021-04-09

Atmire managed to get CGSpace back up by killing all the PostgreSQL connections yesterday
- I don’t know how they did it…
- They also think it’s weird that restarting PostgreSQL didn’t kill the connections
- They asked some more questions, like for example if there were also issues on DSpace Test
- Strangely enough, I checked DSpace Test and notice a clear spike in PostgreSQL locks on the morning of April 6th as well!

PostgreSQL locks week CGSpace PostgreSQL locks week DSpace Test

I definitely need to look into that!

2021-04-11

I am trying to resolve the AReS Elasticsearch index issues that happened last week
- I decided to back up the openrxv-items index to openrxv-items-backup and then delete all the others:

$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-backup
$ curl -X PUT "localhost:9200/openrxv-items/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'

Then I updated all Docker containers and rebooted the server (linode20) so that the correct indexes would be created again:

$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull

Then I realized I have to clone the backup index directly to openrxv-items-final, and re-create the openrxv-items alias:

$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -X PUT "localhost:9200/openrxv-items-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-backup/_clone/openrxv-items-final
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'

Now I see both openrxv-items-final and openrxv-items have the current number of items:

$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'     
{
  "count" : 103373,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
$ curl -s 'http://localhost:9200/openrxv-items-final/_count?q=*&pretty'
{
  "count" : 103373,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

Then I started a fresh harvesting in the AReS Explorer admin dashboard

2021-04-12

The harvesting on AReS finished last night, but the indexes got messed up again
- I will have to fix them manually next time…

2021-04-13

Looking into the logs on 2021-04-06 on CGSpace and DSpace Test to see if there is anything specific that stands out about the activty on those days that would cause the PostgreSQL issues
- Digging into the Munin graphs for the last week I found a few other things happening on that morning:

/dev/sda disk latency week JVM classes unloaded week Nginx status week

13,000 requests in the last two months from a user with user agent SomeRandomText, for example:

84.33.2.97 - - [06/Apr/2021:06:25:13 +0200] "GET /bitstream/handle/10568/77776/CROP%20SCIENCE.jpg.jpg HTTP/1.1" 404 10890 "-" "SomeRandomText"

I purged them:

$ ./ilri/check-spider-hits.sh -f /tmp/agents.txt -p
Purging 13159 hits from SomeRandomText in statistics

Total number of bot hits purged: 13159

I noticed there were 78 items submitted in the hour before CGSpace crashed:

# grep -a -E '2021-04-06 0(6|7):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -c -a add_item 
78

Of those 78, 77 of them were from Udana
Compared to other mornings (0 to 9 AM) this month that seems to be pretty high:

# for num in {01..13}; do grep -a -E "2021-04-$num 0" /home/cgspace.cgiar.org/log/dspace.log.2021-04-$num | grep -c -a
 add_item; done
32
0
0
2
8
108
4
0
29
0
1
1
2

2021-04-15

Release v1.4.2 of the DSpace Statistics API on GitHub: https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.2
- This has been running on DSpace Test for the last week or so, and mostly contains the Falcon 3.0.0 changes
Re-sync DSpace Test with data from CGSpace
- Run system updates on DSpace Test (linode26) and reboot the server
Update the PostgreSQL JDBC driver on DSpace Test (linode26) to 42.2.19
- It has been a few months since we updated this, and there have been a few releases since 42.2.14 that we are currently using
Create a test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:

$ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p 'fuuuuuuuu'

I added the account to the Alliance Admins account, which is should allow him to submit to any Alliance collection
- According to my notes from 2020-10 the account must be in the admin group in order to submit via the REST API

2021-04-18

Update all containers on AReS (linode20):

$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull

Then run all system updates and reboot the server
I learned a new command for Elasticsearch:

$ curl http://localhost:9200/_cat/indices
yellow open openrxv-values           ChyhGwMDQpevJtlNWO1vcw 1 1   1579      0 537.6kb 537.6kb
yellow open openrxv-items-temp       PhV5ieuxQsyftByvCxzSIw 1 1 103585 104372 482.7mb 482.7mb
yellow open openrxv-shared           J_8cxIz6QL6XTRZct7UBBQ 1 1    127      0 115.7kb 115.7kb
yellow open openrxv-values-00001     jAoXTLR0R9mzivlDVbQaqA 1 1   3903      0 696.2kb 696.2kb
green  open .kibana_task_manager_1   O1zgJ0YlQhKCFAwJZaNSIA 1 0      2      2  20.6kb  20.6kb
yellow open openrxv-users            1hWGXh9kS_S6YPxAaBN8ew 1 1      5      0  28.6kb  28.6kb
green  open .apm-agent-configuration f3RAkSEBRGaxJZs3ePVxsA 1 0      0      0    283b    283b
yellow open openrxv-items-final      sgk-s8O-RZKdcLRoWt3G8A 1 1    970      0   2.3mb   2.3mb
green  open .kibana_1                HHPN7RD_T7qe0zDj4rauQw 1 0     25      7  36.8kb  36.8kb
yellow open users                    M0t2LaZhSm2NrF5xb64dnw 1 1      2      0  11.6kb  11.6kb

Somehow the openrxv-items-final index only has a few items and the majority are in openrxv-items-temp, via the openrxv-items alias (which is in the temp index):

$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty' 
{
  "count" : 103585,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

I found a cool tool to help with exporting and restoring Elasticsearch indexes:

$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
...
Sun, 18 Apr 2021 06:27:07 GMT | Total Writes: 103585
Sun, 18 Apr 2021 06:27:07 GMT | dump complete

It took only two or three minutes to export everything…
I did a test to restore the index:

$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-test --type=mapping
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-test --limit 1000 --type=data

So that’s pretty cool!
I deleted the openrxv-items-final index and openrxv-items-temp indexes and then restored the mappings to openrxv-items-final, added the openrxv-items alias, and started restoring the data to openrxv-items with elasticdump:

$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data

AReS seems to be working fine аfter that, so I created the openrxv-items-temp index and then started a fresh harvest on AReS Explorer:

$ curl -X PUT "localhost:9200/openrxv-items-temp"

Run system updates on CGSpace (linode18) and run the latest Ansible infrastructure playbook to update the DSpace Statistics API, PostgreSQL JDBC driver, etc, and then reboot the system
I wasted a bit of time trying to get TSLint and then ESLint running for OpenRXV on GitHub Actions

2021-04-19

The AReS harvesting last night seems to have completed successfully, but the number of results is strange:

$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp       kNUlupUyS_i7vlBGiuVxwg 1 1 103741 105553 483.6mb 483.6mb
yellow open openrxv-items-final      HFc3uytTRq2GPpn13vkbmg 1 1    970      0   2.3mb   2.3mb

The indices endpoint doesn’t include the openrxv-items alias, but it is currently in the openrxv-items-temp index so the number of items is the same:

$ curl -s 'http://localhost:9200/openrxv-items/_count?q=*&pretty'     
{
  "count" : 103741,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

A user was having problems resetting their password on CGSpace, with some message about SMTP etc
- I checked and we are indeed locked out of our mailbox:

$ dspace test-email
...
Error sending email:
 - Error: javax.mail.SendFailedException: Send failure (javax.mail.AuthenticationFailedException: 550 5.2.1 Mailbox cannot be accessed [PR0P264CA0280.FRAP264.PROD.OUTLOOK.COM]
)

I have to write to ICT…
I decided to switch back to the G1GC garbage collector on DSpace Test
- Reading Shawn Heisy’s discussion again: https://cwiki.apache.org/confluence/display/SOLR/ShawnHeisey
- I am curious to check the JVM stats in a few days to see if there is a marked change
Work on minor changes to get DSpace working on Ubuntu 20.04 for our Ansible infrastructure scripts

2021-04-21

Send Abdullah feedback on the filter on click pull request for OpenRXV
- I see it adds a new “allow filter on click” checkbox in the layout settings, but it doesn’t modify the filters
- Also, it seems to have broken the existing clicking of the countries on the map
Atmire recently sent feedback about the CUA duplicates processor
- Last month when I ran it it got stuck on the storage reports, apparently, so I will try again (with a fresh Solr statistics core from production) and skip the storage reports (-g):

$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ cp atmire-cua-update.xml-20210124-132112.old /home/dspacetest.cgiar.org/config/spring/api/atmire-cua-update.xml
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -r 100 -c statistics -t 12 -g

The first run processed 1,439 docs, the second run processed 0 docs
- I’m not sure if that means that it worked? I sent feedback to Atmire
Meeting with Moayad to discuss OpenRXV development progress

2021-04-25

The indexes on AReS are messed up again
- I made a backup of the indexes, then deleted the openrxv-items-final and openrxv-items-temp indexes, re-created the openrxv-items alias, and restored the data into openrxv-items:

$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --limit=1000 --type=data
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items --limit 1000 --type=data

Then I started a fresh AReS harvest

2021-04-26

The AReS harvest last night seems to have finished successfully and the number of items looks good:

$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1      0 0    283b    283b
yellow open openrxv-items-final      ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0   254mb   254mb

And the aliases seem correct for once:

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
...
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    },
    "openrxv-items-temp": {
        "aliases": {}
    },
...

That’s 250 new items in the index since the last harvest!
Re-create my local Artifactory container because I’m getting errors starting it and it has been a few months since it was updated:

$ podman rm artifactory
$ podman pull docker.bintray.io/jfrog/artifactory-oss:latest
$ podman create --ulimit nofile=32000:32000 --name artifactory -v artifactory_data:/var/opt/jfrog/artifactory -p 8081-8082:8081-8082 docker.bintray.io/jfrog/artifactory-oss
$ podman start artifactory

Start testing DSpace 7.0 Beta 5 so I can evaluate if it solves some of the problems we are having on DSpace 6, and if it’s missing things like multiple handle resolvers, etc
- I see it needs Java JDK 11, Tomcat 9, Solr 8, and PostgreSQL 11
- Also, according to the installation notes I see you can install the old DSpace 6 REST API, so that’s potentially useful for us
- I see that all web applications on the backend are now rolled into just one “server” application
- The build process took 11 minutes the first time (due to downloading the world with Maven) and ~2 minutes the second time
- The local.cfg content and syntax is very similar DSpace 6
I got the basic fresh_install up and running
- Then I tried to import a DSpace 6 database from production
I tried to delete all the Atmire SQL migrations:

localhost/dspace7b5= > DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';

But I got an error when running dspace database migrate:

$ ~/dspace7b5/bin/dspace database migrate

Database URL: jdbc:postgresql://localhost:5432/dspace7b5
Migrating database to latest version... (Check dspace logs for details)
Migration exception:
java.sql.SQLException: Flyway migration error occurred
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:738)
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:632)
        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:228)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:273)
        at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:129)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:94)
Caused by: org.flywaydb.core.api.FlywayException: Validate failed: 
Detected applied migration not resolved locally: 5.0.2017.09.25
Detected applied migration not resolved locally: 6.0.2017.01.30
Detected applied migration not resolved locally: 6.0.2017.09.25

        at org.flywaydb.core.Flyway.doValidate(Flyway.java:292)
        at org.flywaydb.core.Flyway.access$100(Flyway.java:73)
        at org.flywaydb.core.Flyway$1.execute(Flyway.java:166)
        at org.flywaydb.core.Flyway$1.execute(Flyway.java:158)
        at org.flywaydb.core.Flyway.execute(Flyway.java:527)
        at org.flywaydb.core.Flyway.migrate(Flyway.java:158)
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:729)
        ... 9 more

I deleted those migrations:

localhost/dspace7b5= > DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');

Then when I ran the migration again it failed for a new reason, related to the configurable workflow:

Database URL: jdbc:postgresql://localhost:5432/dspace7b5
Migrating database to latest version... (Check dspace logs for details)
Migration exception:
java.sql.SQLException: Flyway migration error occurred
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:738)
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:632)
        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:228)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:273)
        at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:129)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:94)
Caused by: org.flywaydb.core.internal.command.DbMigrate$FlywayMigrateException:
Migration V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql failed
--------------------------------------------------------------------
SQL State  : 42P01
Error Code : 0
Message    : ERROR: relation "cwf_pooltask" does not exist
  Position: 8
Location   : org/dspace/storage/rdbms/sqlmigration/postgres/V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql (/home/aorth/src/apache-tomcat-9.0.45/file:/home/aorth/dspace7b5/lib/dspace-api-7.0-beta5.jar!/org/dspace/storage/rdbms/sqlmigration/postgres/V7.0_2019.05.02__DS-4239-workflow-xml-migration.sql)
Line       : 16
Statement  : UPDATE cwf_pooltask SET workflow_id='defaultWorkflow' WHERE workflow_id='default'
...

The DSpace 7 upgrade docs say I need to apply these previously optional migrations:

$ ~/dspace7b5/bin/dspace database migrate ignored

Now I see all migrations have completed and DSpace actually starts up fine!
I will try to do a full re-index to see how long it takes:

$ time ~/dspace7b5/bin/dspace index-discovery -b