2021-04-01
- I wrote a script to query Sherpa’s API for our ISSNs: sherpa-issn-lookup.py
- I’m curious to see how the results compare with the results from Crossref yesterday
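- The script just walks our list of ISSNs and queries Sherpa’s v2 API for each one; per ISSN it is roughly something like this (the retrieve endpoint, filter syntax, and api-key parameter here are my assumptions for illustration, not a copy of the script):
$ curl -s 'https://v2.sherpa.ac.uk/cgi/retrieve?item-type=publication&api-key=XXXX&format=Json&filter=[["issn","equals","1234-5678"]]'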
- AReS Explorer had been down since this morning; I didn’t see anything in the systemd journal
- I simply took everything down with docker-compose and then back up, and then it was OK
- Perhaps one of the containers crashed; I should have looked closer, but I was in a hurry
2021-04-03
- Biruk from ICT contacted me to say that some CGSpace users still can’t log in
- I guess the CGSpace LDAP bind account is really still locked after last week’s reset
- He fixed the account and then I was finally able to bind and query:
$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-account" -W "(sAMAccountName=otheraccounttoquery)"
2021-04-04
- Check the index aliases on AReS Explorer to make sure they are sane before starting a new harvest:
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool | less
- Then set the openrxv-items-final index to read-only so we can make a backup:
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
{"acknowledged":true}%
$ curl -s -X POST http://localhost:9200/openrxv-items-final/_clone/openrxv-items-final-backup
{"acknowledged":true,"shards_acknowledged":true,"index":"openrxv-items-final-backup"}%
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
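- If the new harvest goes badly I should be able to restore by deleting the broken index and cloning the backup back over it; roughly like this (untested, just a sketch):
$ curl -X DELETE "localhost:9200/openrxv-items-final"
$ curl -X PUT "localhost:9200/openrxv-items-final-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST "localhost:9200/openrxv-items-final-backup/_clone/openrxv-items-final"
$ curl -X PUT "localhost:9200/openrxv-items-final/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'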
- Then start a fresh harvest on AReS Explorer
- Help Enrico get some 2020 statistics for the Roots, Tubers and Bananas (RTB) community on CGSpace
- He was hitting a bug on AReS, and he only needed stats for 2020, but AReS currently only gives all-time stats
- I cleaned up about 230 ISSNs on CGSpace in OpenRefine
- I had exported them last week, then filtered for anything not looking like an ISSN with this GREL:
isNotNull(value.match(/^\p{Alnum}{4}-\p{Alnum}{4}$/))
- Then I applied them on CGSpace with the fix-metadata-values.py script:
$ ./ilri/fix-metadata-values.py -i /tmp/2021-04-01-ISSNs.csv -db dspace -u dspace -p 'fuuu' -f cg.issn -t 'correct' -m 253
- For now I only fixed obvious errors like “1234-5678.” and “e-ISSN: 1234-5678” etc., but there are still lots of invalid ones which need more manual work:
- Too few characters
- Too many characters
- ISBNs
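- A quick way to eyeball the remaining weird ones is to list any values that aren’t exactly nine characters long (the column name and file here are assumptions based on the export above):
$ csvcut -c cg.issn /tmp/2021-04-01-ISSNs.csv | sed '1d' | awk 'length != 9' | sort | uniq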
- Create the CGSpace community and collection structure for the new Accelerating Impacts of CGIAR Climate Research for Africa (AICCRA) and assign all workflow steps
2021-04-05
- The AReS Explorer harvesting from yesterday finished, and the results look OK, but actually the Elasticsearch indexes are messed up again:
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
{
"openrxv-items-final": {
"aliases": {}
},
"openrxv-items-temp": {
"aliases": {
"openrxv-items": {}
}
},
...
}
- openrxv-items should be an alias of openrxv-items-final, not openrxv-items-temp… I will have to fix that manually
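- Fixing it manually means moving the alias with Elasticsearch’s _aliases API, something like:
$ curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'{"actions": [{"remove": {"index": "openrxv-items-temp", "alias": "openrxv-items"}}, {"add": {"index": "openrxv-items-final", "alias": "openrxv-items"}}]}'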
- Enrico asked for more information on the RTB stats I gave him yesterday
- I remembered (again) that we can’t filter Atmire’s CUA stats by date issued
- To show, for example, views/downloads in the year 2020 for RTB issued in 2020, we would need to use the DSpace statistics API and post a list of IDs and a custom date range
- I tried to do that here by exporting the RTB community and extracting the IDs for items issued in 2020:
$ ~/dspace63/bin/dspace metadata-export -i 10568/80100 -f /tmp/rtb.csv
$ csvcut -c 'id,dcterms.issued,dcterms.issued[],dcterms.issued[en_US]' /tmp/rtb.csv | \
sed '1d' | \
csvsql --no-header --no-inference --query 'SELECT a AS id,COALESCE(b, "")||COALESCE(c, "")||COALESCE(d, "") AS issued FROM stdin' | \
csvgrep -c issued -m 2020 | \
csvcut -c id | \
sed '1d' | \
sort | \
uniq
- So that I remember in the future, this basically does the following:
- Use csvcut to extract the id and all date issued columns from the CSV
- Use sed to remove the header so we can refer to the columns using default a, b, c instead of their real names (which are tricky to match due to special characters)
- Use csvsql to concatenate the various date issued columns (coalescing where null)
- Use csvgrep to filter items by date issued in 2020
- Use csvcut to extract the id column
- Use sed to delete the header row
- Use sort and uniq to filter out any duplicate IDs (there were three)
- Then I have a list of 296 IDs for RTB items issued in 2020
- I constructed a JSON file to post to the DSpace Statistics API:
{
"limit": 100,
"page": 0,
"dateFrom": "2020-01-01T00:00:00Z",
"dateTo": "2020-12-31T00:00:00Z",
"items": [
"00358715-b70c-4fdd-aa55-730e05ba739e",
"004b54bb-f16f-4cec-9fbc-ab6c6345c43d",
"02fb7630-d71a-449e-b65d-32b4ea7d6904",
...
]
}
- Then I submitted the file three times (changing the page parameter):
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page1.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page2.json
$ curl -s -d @/tmp/2020-items.txt https://cgspace.cgiar.org/rest/statistics/items | json_pp > /tmp/page3.json
- Then I extracted the views and downloads in the most ridiculous way:
$ grep views /tmp/page*.json | grep -o -E '[0-9]+$' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
30364
$ grep downloads /tmp/page*.json | grep -o -E '[0-9]+,' | sed 's/,//' | xargs | sed -e 's/ /+/g' | bc
9100
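- A less ridiculous way would be to sum the fields with jq, assuming each response nests the per-item counts in a statistics array with views and downloads keys:
$ jq -s '[.[].statistics[].views] | add' /tmp/page*.json
$ jq -s '[.[].statistics[].downloads] | add' /tmp/page*.json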
- Out of curiosity I did the same exercise for items issued in 2019 and got the following:
- Views: 30721
- Downloads: 10205
2021-04-06
- Margarita from CCAFS was having problems deleting an item from CGSpace again
- The error was “Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:bd157345-448e …”
- This is the same issue as last month
- Create a new collection on CGSpace for a new CIP project at Mishel Portilla’s request
- I got a notice that CGSpace was down
- I didn’t see anything strange at first, but there is an insane number of database connections:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12413
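- To see where the connections come from it helps to group pg_stat_activity by application name and state, for example:
$ psql -c 'SELECT application_name, state, COUNT(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY COUNT(*) DESC;'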
- The system journal shows thousands of these messages; this is the first one:
Apr 06 07:52:13 linode18 tomcat7[556]: Apr 06, 2021 7:52:13 AM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
- Around that time in the dspace log I see nothing unusual, but maybe these?
2021-04-06 07:52:29,409 INFO com.atmire.dspace.cua.CUASolrLoggerServiceImpl @ Updating : 200/127 docs in http://localhost:8081/solr/statistics
- (BTW what is the deal with the “200/127”? I should send a comment to Atmire)
- I restarted the PostgreSQL and Tomcat services and now I see fewer connections, but still WAY too high:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3640
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2968
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
13
- After ten minutes or so it went back down…
- And now it’s back up in the thousands… I am seeing a lot of stuff in the dspace log like this:
2021-04-06 11:59:34,364 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717951
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717952
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717953
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717954
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717955
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717956
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717957
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717958
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717959
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717960
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717961
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717962
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717963
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717964
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717965
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717966
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717967
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717968
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717969
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717970
2021-04-06 11:59:34,365 INFO org.dspace.content.MetadataValueServiceImpl @ user.hidden@cgiar.org:session_id=65F32E67CE8E347F64EFB5EB4E349B9B:delete_metadata_value: metadata_value_id=5717971
- I sent some notes and a log to Atmire on our existing issue about the database stuff
- Also I asked them about the possibility of doing a formal review of Hibernate
- Falcon 3.0.0 was released so I updated the 3.0.0 branch for dspace-statistics-api and merged it to v6_x
- I also fixed one minor (unrelated) bug in the tests
- Then I deployed the new version on DSpace Test
- I had a meeting with Peter and Abenet about CGSpace TODOs
- CGSpace went down again and the PostgreSQL locks are through the roof:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
12154
- I don’t see any activity on the REST API, but in the last four hours there have been 3,500 DSpace sessions:
# grep -a -E '2021-04-06 (13|14|15|16|17):' /home/cgspace.cgiar.org/log/dspace.log.2021-04-06 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
3547
- I looked at the same time of day for the past few weeks and it seems to be a normal number of sessions:
# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do grep -a -E "2021-0(3|4)-[0-9]{2} (13|14|15|16|17):" "$file" | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l; done
...
3572
4085
3476
3128
2949
2016
1839
4513
3463
4425
3328
2783
3898
3848
7799
255
534
2755
599
4463
3547
- What about total number of sessions per day?
# for file in /home/cgspace.cgiar.org/log/dspace.log.2021-0{3,4}-*; do echo "$file:"; grep -a -o -E 'session_id=[A-Z0-9]{32}' "$file" | sort | uniq | wc -l; done
...
/home/cgspace.cgiar.org/log/dspace.log.2021-03-28:
11784
/home/cgspace.cgiar.org/log/dspace.log.2021-03-29:
15104
/home/cgspace.cgiar.org/log/dspace.log.2021-03-30:
19396
/home/cgspace.cgiar.org/log/dspace.log.2021-03-31:
32612
/home/cgspace.cgiar.org/log/dspace.log.2021-04-01:
26037
/home/cgspace.cgiar.org/log/dspace.log.2021-04-02:
14315
/home/cgspace.cgiar.org/log/dspace.log.2021-04-03:
12530
/home/cgspace.cgiar.org/log/dspace.log.2021-04-04:
13138
/home/cgspace.cgiar.org/log/dspace.log.2021-04-05:
16756
/home/cgspace.cgiar.org/log/dspace.log.2021-04-06:
12343
- So it’s not the number of sessions… it’s something with the workload…
- I had to step away for an hour or so and when I came back the site was still down and there were still 12,000 locks
- I restarted postgresql and tomcat7…
- The locks in PostgreSQL shot up again…
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3447
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
3527
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
4582
- I don’t know what the hell is going on
- While looking at the logs I see that MEL is trying to log into CGSpace’s REST API and delete items:
34.209.213.122 - - [06/Apr/2021:03:50:46 +0200] "POST /rest/login HTTP/1.1" 401 727 "-" "MEL"
34.209.213.122 - - [06/Apr/2021:03:50:48 +0200] "DELETE /rest/items/95f52bf1-f082-4e10-ad57-268a76ca18ec/metadata HTTP/1.1" 401 704 "-" "-"
- I see a few of these per day going back several months
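- A rough way to count them is to grep the nginx REST logs for the MEL user agent (the log path here is an assumption for this server):
# zcat -f /var/log/nginx/rest.log* | grep -c '"MEL"'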
- Also annoying, I see tons of what look like penetration testing requests from Qualys:
2021-04-04 06:35:17,889 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user "'><qss a=X158062356Y1_2Z>
2021-04-04 06:35:17,889 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user="'><qss a=X158062356Y1_2Z>
2021-04-04 06:35:17,890 INFO org.dspace.app.xmlui.utils.AuthenticationUtil @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:email="'><qss a=X158062356Y1_2Z>, realm=null, result=2
2021-04-04 06:35:18,145 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:auth:attempting trivial auth of user=was@qualys.com
2021-04-04 06:35:18,519 INFO org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:failed_login:no DN found for user was@qualys.com
2021-04-04 06:35:18,520 INFO org.dspace.authenticate.PasswordAuthentication @ anonymous:session_id=FF1E051BCA7D81CC5A807D85380D81E5:ip_addr=64.39.108.48:authenticate:attempting password auth of user=was@qualys.com
- I deleted the ilri/AReS repository on GitHub since we haven’t updated it in two years