cgspace-notes/2021-05.md at 8383cd466bd9e475c36e9fec05f6ddba99427b65

mirror of https://github.com/alanorth/cgspace-notes.git synced 2024-11-09 16:45:45 +01:00

Alan Orth 1a9194e460

content/posts/2021-05.md: Fix csvgrep command

2021-07-06 17:03:55 +03:00

2021-05-01

I looked at the top user agents and IPs in the Solr statistics for last month and I see these user agents:
- "RI/1.0", 1337
- "Microsoft Office Word 2014", 941
I will add the RI/1.0 pattern to our DSpace agents overload and purge them from Solr (we had previously seen this agent with 9,000 hits or so in 2020-09), but I think I will leave the Microsoft Word one... as that's an actual user...

I should probably add the RI/1.0 pattern to the COUNTER-Robots project
As well as these IPs:
- 193.169.254.178, 21648
- 181.62.166.177, 20323
- 45.146.166.180, 19376
The first IP seems to be in Estonia and their requests to the REST API change user agents from curl to Mac OS X to Windows and more
- Also, they seem to be trying to exploit something:

193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata\x22%20and%20\x2221\x22=\x2221 HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata-21%2B21*01 HTTP/1.1" 200 458201 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'||lower('')||' HTTP/1.1" 400 5 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
193.169.254.178 - - [21/Apr/2021:02:02:10 +0200] "GET /rest/collections/1179/items?limit=812&expand=metadata'%2Brtrim('')%2B' HTTP/1.1" 200 458209 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

I will report the IP on abuseipdb.com and purge their hits from Solr
The second IP is in Colombia and is making thousands of requests for what looks like some test site:

181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
181.62.166.177 - - [20/Apr/2021:22:55:39 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items?expand=metadata HTTP/2.0" 200 123613 "http://cassavalighthousetest.org/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"

But this site does not exist (yet?)
- I will purge them from Solr
The third IP is in Russia apparently, and the user agent has the pl-PL locale with thousands of requests like this:

45.146.166.180 - - [18/Apr/2021:16:28:44 +0200] "GET /bitstream/handle/10947/4153/.AAS%202014%20Annual%20Report.pdf?sequence=1%22%29%29%20AND%201691%3DUTL_INADDR.GET_HOST_ADDRESS%28CHR%28113%29%7C%7CCHR%28118%29%7C%7CCHR%28113%29%7C%7CCHR%28106%29%7C%7CCHR%28113%29%7C%7C%28SELECT%20%28CASE%20WHEN%20%281691%3D1691%29%20THEN%201%20ELSE%200%20END%29%20FROM%20DUAL%29%7C%7CCHR%28113%29%7C%7CCHR%2898%29%7C%7CCHR%28122%29%7C%7CCHR%28120%29%7C%7CCHR%28113%29%29%20AND%20%28%28%22RKbp%22%3D%22RKbp&isAllowed=y HTTP/1.1" 200 918998 "http://cgspace.cgiar.org:80/bitstream/handle/10947/4153/.AAS 2014 Annual Report.pdf" "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl-PL) AppleWebKit/523.15 (KHTML, like Gecko) Version/3.0 Safari/523.15"

I will purge these all with my check-spider-ip-hits.sh script:

$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 21648 hits from 193.169.254.178 in statistics
Purging 20323 hits from 181.62.166.177 in statistics
Purging 19376 hits from 45.146.166.180 in statistics

Total number of bot hits purged: 61347
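
The per-IP totals above come from Solr, but a similar tally can be sketched directly against the web server access log. This is a minimal sketch, assuming the log format shown in the excerpts above (client IP as the first field); the function name `count_hits_by_ip` is hypothetical and is not part of check-spider-ip-hits.sh, and the sample lines are shortened from the excerpts:

```python
from collections import Counter


def count_hits_by_ip(lines):
    """Count requests per client IP, assuming the IP is the first field."""
    hits = Counter()
    for line in lines:
        fields = line.split(maxsplit=1)
        if fields:
            hits[fields[0]] += 1
    return hits


# Shortened from the log excerpts above (query strings and user agents removed)
sample = [
    '193.169.254.178 - - [21/Apr/2021:01:59:01 +0200] "GET /rest/collections/1179/items HTTP/1.1" 400 5',
    '193.169.254.178 - - [21/Apr/2021:02:00:36 +0200] "GET /rest/collections/1179/items HTTP/1.1" 200 458201',
    '181.62.166.177 - - [20/Apr/2021:22:48:42 +0200] "GET /rest/collections/d1e11546-c62a-4aee-af91-fd482b3e7653/items HTTP/2.0" 200 123613',
]
for ip, count in count_hits_by_ip(sample).most_common():
    print(ip, count)
```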

2021-05-02

Check the AReS Harvester indexes:

$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1      0 0    283b    283b
yellow open openrxv-items-final      ul3SKsa7Q9Cd_K7qokBY_w 1 1 103951 0   254mb   254mb
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
...
    "openrxv-items-temp": {
        "aliases": {}
    },
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    },

I think they look OK (openrxv-items is an alias of openrxv-items-final), but I took a backup just in case:

$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000

Then I started an indexing in the AReS Explorer admin dashboard
The indexing finished, but it looks like the aliases are messed up again:

$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final      d0tbMM_SRWimirxr_gm9YA 1 1    937      0   2.2mb   2.2mb

2021-05-05

Peter noticed that we no longer display cg.link.reference on the item view
- It seems that this got dropped accidentally when we migrated to dcterms.relation in CG Core v2
- I fixed it in the 6_x-prod branch and told him it will be live soon

2021-05-09

I set up a clean DSpace 6.4 instance locally to test some things against, for example to be able to rule out whether some issues are due to Atmire modules or are fixed in the as-of-yet-unreleased DSpace 6.4
- I had to delete all the Atmire schemas, then it worked fine on Tomcat 8.5 with Mirage (I didn't want to bother with npm and ruby for Mirage 2)
- Then I tried to see if I could reproduce the mapping issue that Marianne raised last month
  - I tried unmapping and remapping to the CGIAR Gender grants collection and the collection appears in the item view's list of mapped collections, but not on the collection browse itself
  - Then I tried mapping to a new collection and it was the same as above
  - So this issue is really just a DSpace bug, and nothing to do with Atmire and not fixed in the unreleased DSpace 6.4
  - I will try one more time after updating the Discovery index (I'm also curious how fast it is on vanilla DSpace 6.4, though I think I tried that when I did the flame graphs in 2019 and it was miserable)

$ time ~/dspace64/bin/dspace index-discovery -b
~/dspace64/bin/dspace index-discovery -b  4053.24s user 53.17s system 38% cpu 2:58:53.83 total

Nope! Still slow, and still no mapped item...
- I even tried unmapping it from all collections, and adding it to a single new owning collection...
Ah hah! Actually, I was inspecting the item's authorization policies when I noticed that someone had made the item private!
- After making it public again I was able to see it in the target collection
The indexes on AReS Explorer are messed up after last week's harvesting:

$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp       H-CGsyyLTaqAj6-nKXZ-7w 1 1 104165 105024 487.7mb 487.7mb
yellow open openrxv-items-final      d0tbMM_SRWimirxr_gm9YA 1 1    937      0   2.2mb   2.2mb

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
...
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    }

openrxv-items should be an alias of openrxv-items-final...
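
This check can also be scripted rather than read by eye. A minimal sketch, assuming the `_alias` response has the shape shown above; the helper name `indexes_for_alias` is hypothetical:

```python
import json


def indexes_for_alias(alias_response, alias):
    """Return the sorted names of indexes that carry the given alias."""
    return sorted(
        index
        for index, info in alias_response.items()
        if alias in info.get("aliases", {})
    )


# The relevant part of the _alias response shown above
response = json.loads("""
{
    "openrxv-items-final": {"aliases": {}},
    "openrxv-items-temp": {"aliases": {"openrxv-items": {}}}
}
""")

holders = indexes_for_alias(response, "openrxv-items")
print(holders)
if holders != ["openrxv-items-final"]:
    print("openrxv-items does not point at openrxv-items-final")
```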
I made a backup of the temp index and then started indexing on the AReS Explorer admin dashboard:

$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp/_clone/openrxv-items-temp-backup
$ curl -X PUT "localhost:9200/openrxv-items-temp/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": false}}'
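
The three commands follow a fixed pattern: block writes on the source index, clone it, then allow writes again (Elasticsearch only clones an index that is blocked for writes). A minimal sketch of that sequence as data, assuming the same endpoints as the curl commands above; `clone_index_steps` is a hypothetical helper that only builds the requests and does not send them:

```python
import json


def clone_index_steps(source, target):
    """Return (method, path, body) tuples for a block, clone, unblock sequence."""
    settings_path = f"/{source}/_settings"

    def write_block(enabled):
        return json.dumps({"settings": {"index.blocks.write": enabled}})

    return [
        ("PUT", settings_path, write_block(True)),      # block writes on the source
        ("POST", f"/{source}/_clone/{target}", None),   # clone it to the target
        ("PUT", settings_path, write_block(False)),     # allow writes again
    ]


for method, path, body in clone_index_steps("openrxv-items-temp", "openrxv-items-temp-backup"):
    print(method, path, body or "")
```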

2021-05-10

Amazing, the harvesting on AReS finished but it messed up all the indexes and now there are no items in any index!

$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp        8thRX0WVRUeAzmd2hkG6TA 1 1      0     0    283b    283b
yellow open openrxv-items-temp-backup _0tyvctBTg2pjOlcoVP1LA 1 1 104165 20134 305.5mb 305.5mb
yellow open openrxv-items-final       BtvV9kwVQ3yBYCZvJS1QyQ 1 1      0     0    283b    283b

I fixed the indexes manually by re-creating them and cloning from the backup:

$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -X PUT "localhost:9200/openrxv-items-temp-backup/_settings" -H 'Content-Type: application/json' -d'{"settings": {"index.blocks.write": true}}'
$ curl -s -X POST http://localhost:9200/openrxv-items-temp-backup/_clone/openrxv-items-final
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp-backup'

Also I ran all updates on the server and updated all Docker images, then rebooted the server (linode20):

$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
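
The pipeline drops the header row, collapses the column padding, and keeps the first two columns as `repository:tag` for `docker pull`. A minimal sketch of the same transformation, assuming the usual column layout of `docker images`; the function name and the sample output are hypothetical:

```python
def image_references(docker_images_output):
    """Turn `docker images` output into repository:tag strings."""
    refs = []
    for line in docker_images_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (the grep -v ^REPO step)
        if len(fields) < 2 or fields[0] == "REPOSITORY":
            continue
        refs.append(f"{fields[0]}:{fields[1]}")
    return refs


# Hypothetical output for illustration only
sample = """\
REPOSITORY   TAG       IMAGE ID       CREATED       SIZE
example/app  latest    0123456789ab   2 weeks ago   100MB
example/db   1.0       ba9876543210   3 weeks ago   200MB
"""
print(image_references(sample))
```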

I backed up the AReS Elasticsearch data using elasticdump, then started a new harvest:

$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000

Discuss CGSpace statistics with the CIP team
- They were wondering why their numbers for 2020 were so low
- I checked their community using the DSpace Statistics API and found very accurate numbers for 2020 and 2019 for them
- I think they had been using AReS, which actually doesn't even give stats for a time period...

2021-05-11

The AReS harvesting from yesterday finished, but the indexes are messed up again so I will have to fix them again before I harvest next time
I also spent some time looking at IWMI's reports again
- On AReS we don't have a way to group by peer reviewed or item type other than doing "if type is Journal Article"
- Also, we don't have a way to check the IWMI Strategic Priorities because those are communities, not metadata...
- We can get the collections an item is in from the parentCollectionList metadata, but it is saved in Elasticsearch as a string instead of a list...
- I told them it won't be possible to replicate their reports exactly
I decided to look at the CLARISA controlled vocabularies again
- They now have 6,200 institutions (was around 3,400 when I last looked in 2020-07)
- They have updated their Swagger interface but it still requires an API key if you want to use it from curl
- They have ISO 3166 countries and UN M.49 regions, but I notice they have some weird names like "Russian Federation (the)", which is not in ISO 3166 as far as I can see
- I exported a list of the institutions to look closer
  - I found twelve items with whitespace issues
  - There are some weird entries like Research Institute for Aquaculture No1 and Research Institute for Aquaculture No2
  - A few items have weird Unicode characters like U+00AD, U+200B, and U+00A0
  - I found 100+ items with multiple languages in their name like Ministère de l’Agriculture, de la pêche et des ressources hydrauliques / Ministry of Agriculture, Hydraulic Resources and Fisheries
  - Over 600 institutions have the country in their name like Ministry of Coordination of Environmental Affairs (Mozambique)
  - For URLs they have null in some places... which is weird... why not just leave it blank?
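
The whitespace and Unicode problems in the list above are easy to flag mechanically. A minimal sketch, assuming one institution name per string; `name_issues` is a hypothetical helper, not the method used to produce the counts above:

```python
import re

# Characters mentioned above: soft hyphen, zero-width space, no-break space
SUSPECT_CHARS = {"\u00ad": "U+00AD", "\u200b": "U+200B", "\u00a0": "U+00A0"}


def name_issues(name):
    """Return a list of problems found in an institution name."""
    issues = []
    if name != name.strip():
        issues.append("leading or trailing whitespace")
    if re.search(r"\s{2,}", name):
        issues.append("consecutive whitespace")
    for char, label in SUSPECT_CHARS.items():
        if char in name:
            issues.append(f"contains {label}")
    return issues


# A name from the list above with a trailing space added for illustration
print(name_issues("Research Institute for Aquaculture No1 "))
```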
I checked the CLARISA list against ROR's April 2021 release ("Version 9" on figshare, though it is version 8 in the dump):

$ ./ilri/ror-lookup.py -i /tmp/clarisa-institutions.txt -r ror-data-2021-04-06.json -o /tmp/clarisa-ror-matches.csv
$ csvgrep -c matched -m 'true' /tmp/clarisa-ror-matches.csv | sed '1d' | wc -l
1770

With 1770 out of 6230 matched, that's 28.4%...
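
The count and the percentage can be computed in one pass over the matches file. A minimal sketch, assuming the CSV has a `matched` column as in the csvgrep command above; the other column name and the sample rows are hypothetical, and exact equality with `true` is an assumption about the file's contents:

```python
import csv
import io


def match_rate(csv_text, column="matched", value="true"):
    """Return (matched, total, percentage) for rows where column == value."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    matched = sum(1 for row in rows if row[column] == value)
    percent = 100 * matched / len(rows) if rows else 0.0
    return matched, len(rows), percent


# Hypothetical rows for illustration only
sample = (
    "organization,matched\n"
    "Example Institute A,true\n"
    "Example Institute B,false\n"
    "Example Institute C,false\n"
    "Example Institute D,true\n"
)
matched, total, percent = match_rate(sample)
print(f"{matched} of {total} matched ({percent:.1f}%)")
```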
I sent an email to Hector Tobon to point out the issues in CLARISA again and ask him to chat
Meeting with GARDIAN developers about CG Core and how GARDIAN works

2021-05-13

Fix a few thousand IWMI URLs that are using HTTP instead of HTTPS on CGSpace:

localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://www.iwmi.cgiar.org','https://www.iwmi.cgiar.org', 'g') WHERE text_value LIKE 'http://www.iwmi.cgiar.org%' AND metadata_field_id=219;
UPDATE 1132
localhost/dspace63= > UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, 'http://publications.iwmi.org','https://publications.iwmi.org', 'g') WHERE text_value LIKE 'http://publications.iwmi.org%' AND metadata_field_id=219;
UPDATE 1803

In the case of the latter, the HTTP links don't even work! The web server returns HTTP 404 unless the request is HTTPS
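
The same rewrite can be sketched outside the database, for example to preview changes before running the UPDATE. This is a minimal sketch, not the method used above: it anchors the match at the start of the value (like the `LIKE 'http://...%'` clause) and escapes the dots in the host name. The function name and the example paths are hypothetical:

```python
import re

IWMI_HOSTS = ("www.iwmi.cgiar.org", "publications.iwmi.org")


def upgrade_to_https(url):
    """Rewrite http:// to https:// for URLs that start with a listed host."""
    for host in IWMI_HOSTS:
        # Anchor at the start of the value and escape the dots in the host
        pattern = r"^http://" + re.escape(host)
        if re.match(pattern, url):
            return re.sub(pattern, "https://" + host, url, count=1)
    return url


# Hypothetical path for illustration only
print(upgrade_to_https("http://publications.iwmi.org/pdf/example.pdf"))
```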
IWMI also says that their subjects are a subset of AGROVOC so they no longer want to use cg.subject.iwmi for their subjects
- They asked if I can move them to dcterms.subject
Delete two items for Udana because he was getting the "Authorization denied for action OBSOLETE (DELETE) ..." error when trying to delete them (DSpace 6 bug I found a few months ago)
- https://cgspace.cgiar.org/handle/10568/34536
- https://cgspace.cgiar.org/handle/10568/34570

2021-05-14

I updated the PostgreSQL JDBC driver in the Ansible playbooks to version 42.2.20 and deployed it on DSpace Test (linode26)

2021-05-15

I have to fix the Elasticsearch indexes on AReS after last week's harvesting because, as always, the openrxv-items index should be an alias of openrxv-items-final instead of openrxv-items-temp:

$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
    "openrxv-items-final": {
        "aliases": {}
    },
    "openrxv-items-temp": {
        "aliases": {
            "openrxv-items": {}
        }
    },
...

I took a backup of the openrxv-items index with elasticdump so I can re-create them manually before starting a new harvest tomorrow:

$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000

2021-05-16

I deleted and re-created the Elasticsearch indexes on AReS:

$ curl -XDELETE 'http://localhost:9200/openrxv-items-final'
$ curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
$ curl -XPUT 'http://localhost:9200/openrxv-items-final'
$ curl -XPUT 'http://localhost:9200/openrxv-items-temp'
$ curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'

Then I re-imported the backup that I created with elasticdump yesterday:

$ elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
$ elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000

Then I started a new harvest on AReS

2021-05-17

The AReS harvest finished and the Elasticsearch indexes seem OK so I shouldn't have to fix them next time...

$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items
yellow open openrxv-items-temp       o3ijJLcyTtGMOPeWpAJiVA 1 1      0 0    283b    283b
yellow open openrxv-items-final      TrJ1Ict3QZ-vFkj-4VcAzw 1 1 104317 0 259.4mb 259.4mb
$ curl -s 'http://localhost:9200/_alias/' | python -m json.tool
    "openrxv-items-temp": {
        "aliases": {}
    },
    "openrxv-items-final": {
        "aliases": {
            "openrxv-items": {}
        }
    },
...

Abenet said she and some others can't log into CGSpace
- I tried to check the CGSpace LDAP account and it does seem not to be working:

$ ldapsearch -x -H ldaps://AZCGNEROOT2.CGIARAD.ORG:636/ -b "dc=cgiarad,dc=org" -D "cgspace-ldap@cgiarad.org" -W "(sAMAccountName=aorth)"
Enter LDAP Password: 
ldap_bind: Invalid credentials (49)
        additional info: 80090308: LdapErr: DSID-0C090453, comment: AcceptSecurityContext error, data 532, v3839

I sent a message to Biruk so he can check the LDAP account
IWMI confirmed that they do indeed want to move all their subjects to AGROVOC, so I made the changes in the XMLUI and config (#467)
- Then I used the migrate-fields.sh script to move them (46,000 metadata entries!)
I tested Abdullah's latest pull request to add clickable filters to AReS and I think it works
- I had some issues with my local AReS installation so I was only able to do limited testing...
- I will have to wait until I have more time to test before I can merge that
CCAFS asked me to add eight more phase II project tags to the CGSpace input form
- I extracted the existing ones using xmllint, added the new ones, sorted them, and then replaced them in input-forms.xml:

$ xmllint --xpath '//value-pairs[@value-pairs-name="ccafsprojectpii"]/pair/stored-value/node()' dspace/config/input-forms.xml

I formatted the input file with tidy, especially because one of the new project tags has an ampersand character... grrr:

$ tidy -xml -utf8 -m -iq -w 0 dspace/config/input-forms.xml      
line 3658 column 26 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"
line 3659 column 23 - Warning: unescaped & or unknown entity "&WA_EU-IFAD"

After testing whether this escaped value worked during submission, I created and merged a pull request to 6_x-prod (#468)
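
The extract-and-sort step and the ampersand escaping can both be sketched with the standard library. A minimal sketch, assuming the `value-pairs` / `pair` / `stored-value` structure targeted by the xmllint XPath above; the function name and the XML fragment are hypothetical, and the real input-forms.xml is much larger:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def stored_values(xml_text, list_name):
    """Return the stored-value texts of one value-pairs list, sorted."""
    root = ET.fromstring(xml_text)
    xpath = f".//value-pairs[@value-pairs-name='{list_name}']/pair/stored-value"
    return sorted(node.text for node in root.findall(xpath))


# Hypothetical fragment for illustration only
sample = """
<input-forms>
  <form-value-pairs>
    <value-pairs value-pairs-name="ccafsprojectpii">
      <pair><displayed-value>B</displayed-value><stored-value>B</stored-value></pair>
      <pair><displayed-value>A</displayed-value><stored-value>A</stored-value></pair>
    </value-pairs>
  </form-value-pairs>
</input-forms>
"""
print(stored_values(sample, "ccafsprojectpii"))

# A bare ampersand must be written as an entity inside XML text
print(escape("&WA_EU-IFAD"))
```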

2021-05-18

Paola from the Alliance emailed me some new ORCID identifiers to add to CGSpace
I saved the new ones to a text file, combined them with the others, extracted the ORCID iDs themselves, and updated the names using resolve-orcids.py:

$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/new | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2021-05-18-combined.txt
$ ./ilri/resolve-orcids.py -i /tmp/2021-05-18-combined.txt -o /tmp/2021-05-18-combined-names.txt
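
The `grep -oE ... | sort | uniq` step amounts to extracting every identifier that matches the pattern and de-duplicating the result. A minimal sketch using the same regular expression; the function name is hypothetical and the sample text reuses identifiers that appear later in these notes:

```python
import re

# Same pattern as the grep -oE command above
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def unique_orcids(text):
    """Return the distinct ORCID iDs found in text, sorted."""
    return sorted(set(ORCID_PATTERN.findall(text)))


sample = """
Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
Daniel M. Villegas: 0000-0001-6801-3332
Daniel M. Villegas: 0000-0001-6801-3332
"""
print(unique_orcids(sample))
```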

I sorted the names and added the XML formatting in vim, then ran it through tidy:

$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/cg-creator-identifier.xml

Tag fifty-five items from the Alliance's new authors with ORCID iDs using add-orcid-identifiers-csv.py:

$ cat 2021-05-18-add-orcids.csv 
dc.contributor.author,cg.creator.identifier
"Urioste Daza, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
"Urioste, Sergio",Sergio Alejandro Urioste Daza: 0000-0002-3208-032X
"Villegas, Daniel",Daniel M. Villegas: 0000-0001-6801-3332
"Villegas, Daniel M.",Daniel M. Villegas: 0000-0001-6801-3332
"Giles, James",James Giles: 0000-0003-1899-9206
"Simbare,  Alice",Alice Simbare: 0000-0003-2389-0969
"Simbare, Alice",Alice Simbare: 0000-0003-2389-0969
"Simbare, A.",Alice Simbare: 0000-0003-2389-0969
"Dita Rodriguez, Miguel",Miguel Angel Dita Rodriguez: 0000-0002-0496-4267
"Templer, Noel",Noel Templer: 0000-0002-3201-9043
"Jalonen, R.",Riina Jalonen: 0000-0003-1669-9138
"Jalonen, Riina",Riina Jalonen: 0000-0003-1669-9138
"Izquierdo, Paulo",Paulo Izquierdo: 0000-0002-2153-0655
"Reyes, Byron",Byron Reyes: 0000-0003-2672-9636
"Reyes, Byron A.",Byron Reyes: 0000-0003-2672-9636
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2021-05-18-add-orcids.csv -db dspace -u dspace -p 'fuuu' -d

I deployed the latest 6_x-prod branch on CGSpace, ran all system updates, and rebooted the server
- This included the IWMI changes, so I also migrated the cg.subject.iwmi metadata to dcterms.subject and deleted the subject term
- Then I started a full Discovery reindex

2021-05-19

I realized that I need to lower case the IWMI subjects that I just moved to AGROVOC because they were probably mostly uppercase
- To my surprise, when I checked, dcterms.subject had 47,000 metadata values that were upper or mixed case!

dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value ~ '[[:upper:]]';
UPDATE 47405

That's interesting because we lowercased them all a few months ago, so these must all be new... wow
- We have 405,000 total AGROVOC terms, with 20,600 of them being unique
- I will have to start another Discovery re-indexing to pick up these new changes
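
The UPDATE above only touches values that contain at least one uppercase character, which is why it can report how many rows changed. A minimal sketch of the same logic on a plain list; the function name and sample values are hypothetical, and Python's `str.isupper()` is only an approximation of PostgreSQL's `[[:upper:]]` class, which depends on the database locale:

```python
def lowercase_if_needed(values):
    """Lowercase values containing an uppercase character; count the changes."""
    updated = []
    changed = 0
    for value in values:
        if any(char.isupper() for char in value):
            updated.append(value.lower())
            changed += 1
        else:
            updated.append(value)
    return updated, changed


# Hypothetical subject terms for illustration only
print(lowercase_if_needed(["WATER MANAGEMENT", "Irrigation", "soil"]))
```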

2021-05-20

Export the top 5,000 AGROVOC terms to validate them:

localhost/dspace63= > \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 187 GROUP BY text_value ORDER BY count DESC LIMIT 5000) to /tmp/2021-05-20-agrovoc.csv WITH CSV HEADER;
COPY 5000
$ csvcut -c 1 /tmp/2021-05-20-agrovoc.csv| sed 1d > /tmp/2021-05-20-agrovoc.txt
$ ./ilri/agrovoc-lookup.py -i /tmp/2021-05-20-agrovoc.txt -o /tmp/2021-05-20-agrovoc-results.csv
$ csvgrep -c "number of matches" -r '^0$' /tmp/2021-05-20-agrovoc-results.csv > /tmp/2021-05-20-agrovoc-rejected.csv
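
The steps above are: rank terms by frequency, look each one up, then keep the rows whose match count is exactly zero. A minimal sketch of the ranking and the final filter, assuming the results CSV has a `number of matches` column as in the csvgrep command; the function names, the `term` column name, and the sample data are hypothetical:

```python
import csv
import io
from collections import Counter


def top_terms(values, limit):
    """Return the most frequent terms, most common first."""
    return [term for term, _ in Counter(values).most_common(limit)]


def rejected_terms(results_csv, term_column="term", count_column="number of matches"):
    """Return the terms whose match count is exactly 0 (like the ^0$ regex)."""
    reader = csv.DictReader(io.StringIO(results_csv))
    return [row[term_column] for row in reader if row[count_column] == "0"]


# Hypothetical data for illustration only
print(top_terms(["a", "b", "a", "c", "a", "b"], 2))
print(rejected_terms("term,number of matches\nsoil,1\nxyzzy,0\nwater,10\n"))
```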

Meeting with Medha and Pythagoras about the FAIR Workflow tool
- Discussed the need for such a tool, other tools being developed, etc
- I stressed the importance of controlled vocabularies
- No real outcome, except to keep us posted and let us know if they need help testing on DSpace
Meeting with Hector Tobon to discuss issues with CLARISA
- They pushed back a bit, saying they were more focused on the needs of the CG
- They are not against the idea of aligning closer to ROR, but lack the manpower
- They pointed out that their countries come directly from the ISO 3166 online browsing platform on the ISO website
- Indeed the text value for Russia is "Russian Federation (the)" there... I find that strange
- I filed an issue on the iso-codes GitLab repository

2021-05-24

Add ORCID identifiers for missing ILRI authors and tag 550 others based on a few authors I noticed that were missing them:

$ cat 2021-05-24-add-orcids.csv 
dc.contributor.author,cg.creator.identifier
"Patel, Ekta","Ekta Patel: 0000-0001-9400-6988"
"Dessie, Tadelle","Tadelle Dessie: 0000-0002-1630-0417"
"Tadelle, D.","Tadelle Dessie: 0000-0002-1630-0417"
"Dione, Michel M.","Michel Dione: 0000-0001-7812-5776"
"Kiara, Henry K.","Henry Kiara: 0000-0001-9578-1636"
"Naessens, Jan","Jan Naessens: 0000-0002-7075-9915"
"Steinaa, Lucilla","Lucilla Steinaa: 0000-0003-3691-3971"
"Wieland, Barbara","Barbara Wieland: 0000-0003-4020-9186"
"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
"Rao, Idupulapati M.","Idupulapati M. Rao: 0000-0002-8381-9358"
"Cardoso Arango, Juan Andrés","Juan Andrés Cardoso Arango: 0000-0002-0252-4655"
$ ./ilri/add-orcid-identifiers-csv.py -i 2021-05-24-add-orcids.csv -db dspace -u dspace -p 'fuuu'

A few days ago I took a backup of the Elasticsearch indexes on AReS using elasticdump:

$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_data.json --type=data --limit=1000
$ elasticdump --input=http://localhost:9200/openrxv-items --output=/home/aorth/openrxv-items_mapping.json --type=mapping

The indexes look OK so I started a harvesting on AReS

2021-05-25

The AReS harvest got messed up somehow, as I see the number of items in the indexes is the same as before the harvesting:

$ curl -s http://localhost:9200/_cat/indices | grep openrxv-items                                                
yellow open openrxv-items-temp       o3ijJLcyTtGMOPeWpAJiVA 1 1 104373 106455 491.5mb 491.5mb
yellow open openrxv-items-final      soEzAnp3TDClIGZbmVyEIw 1 1    953      0   2.3mb   2.3mb

Update all docker images on the AReS server (linode20):

$ docker images | grep -v ^REPO | sed 's/ \+/:/g' | cut -d: -f1,2 | xargs -L1 docker pull
$ docker-compose -f docker/docker-compose.yml down
$ docker-compose -f docker/docker-compose.yml build

Then run all system updates on the server and reboot it
Oh crap, I deleted everything on AReS and restored the backup and the total items are now 104317... so it was actually correct before!
For reference, this is how I re-created everything:

curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000

I will just start a new harvest... sigh

2021-05-26

The AReS harvest last night got stuck at 3:20AM (UTC+3) at 752 pages for some reason...
- Something seems to have happened on CGSpace this morning around then as I have an alert from UptimeRobot as well
- I re-created everything as above (again) and restarted the harvest
Looking in the DSpace log for this morning I see a big hole in the logs at that time (UTC+2 server time):

2021-05-26 02:17:52,808 INFO  org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/70659 with status: 2. Result: '10568/70659: item has country codes, skipping'
2021-05-26 02:17:52,853 INFO  org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/66761 with status: 2. Result: '10568/66761: item has country codes, skipping'
2021-05-26 03:00:05,772 INFO  org.dspace.statistics.SolrLoggerServiceImpl @ solr-statistics.spidersfile:null
2021-05-26 03:00:05,773 INFO  org.dspace.statistics.SolrLoggerServiceImpl @ solr-statistics.server:http://localhost:8081/solr/statistics

There are no logs between 02:17 and 03:00... hmmm.
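
Gaps like this can be found automatically by comparing consecutive timestamps. A minimal sketch, assuming each log line starts with a timestamp in the format shown above; `largest_gap` is a hypothetical helper and the sample messages are shortened:

```python
from datetime import datetime


def largest_gap(lines):
    """Return (seconds, start, end) for the longest pause between log lines."""
    stamps = []
    for line in lines:
        try:
            # Log lines start with a timestamp like "2021-05-26 02:17:52,808"
            stamps.append(datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f"))
        except ValueError:
            continue  # continuation lines without a timestamp
    best = None
    for start, end in zip(stamps, stamps[1:]):
        seconds = (end - start).total_seconds()
        if best is None or seconds > best[0]:
            best = (seconds, start, end)
    return best


# Timestamps from the DSpace log excerpt above, messages shortened
sample = [
    "2021-05-26 02:17:52,808 INFO  org.dspace.curate.Curator @ ...",
    "2021-05-26 02:17:52,853 INFO  org.dspace.curate.Curator @ ...",
    "2021-05-26 03:00:05,772 INFO  org.dspace.statistics.SolrLoggerServiceImpl @ ...",
    "2021-05-26 03:00:05,773 INFO  org.dspace.statistics.SolrLoggerServiceImpl @ ...",
]
seconds, start, end = largest_gap(sample)
print(f"{seconds / 60:.1f} minutes between {start} and {end}")
```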
I see a similar gap in the Solr log, though it starts at 02:15:

2021-05-26 02:15:07,968 INFO  org.apache.solr.core.SolrCore @ [search] webapp=/solr path=/select params={f.location.coll.facet.sort=count&facet.field=location.comm&facet.field=location.coll&fl=handle,search.resourcetype,search.resourceid,search.uniqueid&start=0&fq=NOT(withdrawn:true)&fq=NOT(discoverable:false)&fq=search.resourcetype:2&fq=NOT(discoverable:false)&rows=0&version=2&q=*:*&f.location.coll.facet.limit=-1&facet.mincount=1&facet=true&f.location.comm.facet.sort=count&wt=javabin&facet.offset=0&f.location.comm.facet.limit=-1} hits=90792 status=0 QTime=6 
2021-05-26 02:15:09,446 INFO  org.apache.solr.core.SolrCore @ [statistics] webapp=/solr path=/update params={wt=javabin&version=2} status=0 QTime=1 
2021-05-26 02:28:03,602 INFO  org.apache.solr.update.UpdateHandler @ start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2021-05-26 02:28:03,630 INFO  org.apache.solr.core.SolrCore @ SolrDeletionPolicy.onCommit: commits: num=2
        commit{dir=NRTCachingDirectory(MMapDirectory@/home/cgspace.cgiar.org/solr/statistics/data/index lockFactory=NativeFSLockFactory@/home/cgspace.cgiar.org/solr/statistics/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_n6ns,generation=1081720}
        commit{dir=NRTCachingDirectory(MMapDirectory@/home/cgspace.cgiar.org/solr/statistics/data/index lockFactory=NativeFSLockFactory@/home/cgspace.cgiar.org/solr/statistics/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_n6nt,generation=1081721}
2021-05-26 02:28:03,630 INFO  org.apache.solr.core.SolrCore @ newest commit generation = 1081721
2021-05-26 02:28:03,632 INFO  org.apache.solr.search.SolrIndexSearcher @ Opening Searcher@34f2c871[statistics] main
2021-05-26 02:28:03,633 INFO  org.apache.solr.core.SolrCore @ QuerySenderListener sending requests to Searcher@34f2c871[statistics] main{StandardDirectoryReader(segments_n5xy:4540675:nrt _1befl(4.10.4):C42054400/925069:delGen=8891 _1bksq(4.10.4):C685090/92211:delGen=1227 _1bx0c(4.10.4):C1069897/49988:delGen=966 _1bxr2(4.10.4):C197387/5860:delGen=485 _1cc1x(4.10.4):C353338/40887:delGen=626 _1ck6k(4.10.4):C1009357/39041:delGen=166 _1celj(4.10.4):C268907/18097:delGen=340 _1clq9(4.10.4):C147453/25003:delGen=68 _1cn3t(4.10.4):C260311/1802:delGen=82 _1cl3c(4.10.4):C47408/2610:delGen=39 _1cnmh(4.10.4):C32851/237:delGen=41 _1cod4(4.10.4):C85915/281:delGen=35 _1coy4(4.10.4):C178367/483:delGen=27 _1cpgs(4.10.4):C25465/81:delGen=13 _1cppf(4.10.4):C101411/154:delGen=15 _1cqc4(4.10.4):C26003/39:delGen=8 _1cpvl(4.10.4):C24160/91:delGen=8 _1cq3n(4.10.4):C18167/39:delGen=4 _1cq15(4.10.4):C9983/13:delGen=2 _1cq79(4.10.4):C13077/19:delGen=4 _1cqhz(4.10.4):C21251/2:delGen=1 _1cqka(4.10.4):C3531 _1cqku(4.10.4):C2597 _1cqkk(4.10.4):C2951 _1cqjq(4.10.4):C2675 _1cql5(4.10.4):C993 _1cql6(4.10.4):C161 _1cql7(4.10.4):C106 _1cql8(4.10.4):C19 _1cql9(4.10.4):C147 _1cqla(4.10.4):C2 _1cqlb(4.10.4):C15)}
2021-05-26 02:28:03,633 INFO  org.apache.solr.core.SolrCore @ QuerySenderListener done.

Ah, it seems to have been a Linode network issue in the Frankfurt region:

May 26, 2021
Connectivity Issue - Frankfurt
Resolved - We haven’t observed any additional connectivity issues in our Frankfurt data center, and will now consider this incident resolved. If you continue to experience problems, please open a Support ticket for assistance.
May 26, 02:57 UTC

While looking in the logs I noticed an error about SMTP:

2021-05-26 02:00:18,015 ERROR org.dspace.eperson.SubscribeCLITool @ Failed to send subscription to eperson_id=934cb92f-2e77-4881-89e2-6f13ad4b1378
2021-05-26 02:00:18,015 ERROR org.dspace.eperson.SubscribeCLITool @ javax.mail.SendFailedException: Send failure (javax.mail.MessagingException: Could not convert socket to TLS (javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)))

And indeed the email seems to be broken:

$ dspace test-email

About to send test email:
 - To: fuuuuuu
 - Subject: DSpace test email
 - Server: smtp.office365.com

Error sending email:
 - Error: javax.mail.SendFailedException: Send failure (javax.mail.MessagingException: Could not convert socket to TLS (javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)))

Please see the DSpace documentation for assistance.

I saw a recent thread on the dspace-tech mailing list about this that makes me wonder if Microsoft changed something on Office 365
- I added mail.smtp.ssl.protocols=TLSv1.2 to the mail.extraproperties in dspace.cfg and the test email sent successfully

2021-05-30

Reset the Elasticsearch indexes on AReS as above and start a fresh harvest
- The indexing finished and the total number of items is now 104504, but I'm sure the indexes are messed up so I will just start by taking a backup and re-creating them manually before every indexing
