mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-09-30 06:04:16 +02:00
328 lines
17 KiB
Markdown
328 lines
17 KiB
Markdown
---
|
|
title: "June, 2021"
|
|
date: 2021-06-01T10:51:07+03:00
|
|
author: "Alan Orth"
|
|
categories: ["Notes"]
|
|
---
|
|
|
|
## 2021-06-01
|
|
|
|
- IWMI notified me that AReS was down with an HTTP 502 error
|
|
- Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
|
|
- I don't see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the `angular_nginx` container isn't running
|
|
- I simply started it and AReS was running again:
|
|
|
|
<!--more-->
|
|
|
|
```console
|
|
$ docker-compose -f docker/docker-compose.yml start angular_nginx
|
|
```
|
|
|
|
- Margarita from CCAFS emailed me to say that workflow alerts haven't been working lately
|
|
- I guess this is related to the SMTP issues last week
|
|
- I had fixed the config, but didn't restart Tomcat so DSpace didn't load the new variables
|
|
- I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers
|
|
|
|
## 2021-06-03
|
|
|
|
- Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI's content into the new AMCOW Knowledge Hub
|
|
- At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time
|
|
- That's when I thought they could perhaps use the OpenSearch, but I can't remember if it's possible to limit by community, or only collection...
|
|
- Looking now, I see there is a "scope" parameter that can be used for community or collection, for example:
|
|
|
|
```
|
|
https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&scope=10568/16814&order=DESC&rpp=100&sort_by=2&start=1
|
|
```
|
|
|
|
- That will sort by date issued (see: `webui.itemlist.sort-option.2` in dspace.cfg), give 100 results per page, and start on item 1
|
|
- Otherwise, another alternative would be to use the IWMI CSV that we are already exporting every week
|
|
- Fill out the *CGIAR-AGROVOC Task Group: Survey on the current CGIAR use of AGROVOC* survey on behalf of CGSpace
|
|
|
|
## 2021-06-06
|
|
|
|
- The Elasticsearch indexes are messed up so I dumped and re-created them correctly:
|
|
|
|
```console
|
|
curl -XDELETE 'http://localhost:9200/openrxv-items-final'
|
|
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
|
|
curl -XPUT 'http://localhost:9200/openrxv-items-final'
|
|
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
|
|
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
|
|
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
|
|
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
|
|
```
|
|
|
|
- Then I started a harvesting on AReS
|
|
|
|
## 2021-06-07
|
|
|
|
- The harvesting on AReS completed successfully
|
|
- Provide feedback to FAO on how we use AGROVOC for their "AGROVOC call for use cases"
|
|
|
|
## 2021-06-10
|
|
|
|
- Skype with Moayad to discuss AReS harvesting improvements
|
|
- He will work on a plugin that reads the XML sitemap to get all item IDs and checks whether we have them or not
|
|
|
|
## 2021-06-14
|
|
|
|
- Dump and re-create indexes on AReS (as above) so I can do a harvest
|
|
|
|
## 2021-06-16
|
|
|
|
- Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot's DNS
|
|
- For example, user agent `Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)` with DNS `msnbot-131-253-25-91.search.msn.com.`
|
|
- I queried Solr for all hits using the MSN bot DNS (`dns:*msnbot* AND dns:*.msn.com.`) and found 457,706
|
|
- I extracted their IPs using Solr's CSV format and ran them through my `resolve-addresses.py` script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)
|
|
- Note that [Microsoft's docs say that reverse lookups on Bingbot IPs will always have "search.msn.com"](https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26) so it is safe to purge these as non-human traffic
|
|
- I purged the hits with `ilri/check-spider-ip-hits.sh` (though I had to do it in 3 batches because I forgot to increase the `facet.limit` so I was only getting them 100 at a time)
|
|
- Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV
|
|
- It will hopefully also fix the duplicate and missing items issues
|
|
- I had a Skype with him to discuss
|
|
- I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:
|
|
|
|
```console
|
|
$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
|
|
```
|
|
|
|
- The new OpenRXV harvesting method by Moayad uses pages of 10 items instead of 100 and it's much faster
|
|
- I harvested 90,000+ items from DSpace Test in ~3 hours
|
|
- There seem to be some issues with the health check step though, as I see it is requesting one restricted item 600,000+ times...
|
|
|
|
## 2021-06-17
|
|
|
|
- I ported my ilri/resolve-addresses.py script that uses IPAPI.co to use the local GeoIP2 databases
|
|
- The new script is ilri/resolve-addresses-geoip2.py and it is much faster and works offline with no API rate limits
|
|
- Teams meeting with the CGIAR Metadata Working group to discuss CGSpace and open repositories and the way forward
|
|
- More work with Moayad on OpenRXV harvesting issues
|
|
- Using a JSON export from elasticdump we debugged the duplicate checker plugin and found that there are indeed duplicates:
|
|
|
|
```console
|
|
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | wc -l
|
|
90459
|
|
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq | wc -l
|
|
90380
|
|
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data.json | awk -F: '{print $2}' | sort | uniq -c | sort -h
|
|
...
|
|
2 "10568/99409"
|
|
2 "10568/99410"
|
|
2 "10568/99411"
|
|
2 "10568/99516"
|
|
3 "10568/102093"
|
|
3 "10568/103524"
|
|
3 "10568/106664"
|
|
3 "10568/106940"
|
|
3 "10568/107195"
|
|
3 "10568/96546"
|
|
```
|
|
|
|
## 2021-06-20
|
|
|
|
- Udana asked me to update their IWMI subjects from `farmer managed irrigation systems` to `farmer-led irrigation`
|
|
- First I extracted the IWMI community from CGSpace:
|
|
|
|
```console
|
|
$ dspace metadata-export -i 10568/16814 -f /tmp/2021-06-20-IWMI.csv
|
|
```
|
|
|
|
- Then I used `csvcut` to extract just the columns I needed and do the replacement into a new CSV:
|
|
|
|
```console
|
|
$ csvcut -c 'id,dcterms.subject[],dcterms.subject[en_US]' /tmp/2021-06-20-IWMI.csv | sed 's/farmer managed irrigation systems/farmer-led irrigation/' > /tmp/2021-06-20-IWMI-new-subjects.csv
|
|
```
|
|
|
|
- Then I uploaded the resulting CSV to CGSpace, updating 161 items
|
|
- Start a harvest on AReS
|
|
- I found [a bug](https://jira.lyrasis.org/browse/DS-1977) and [a patch](https://github.com/DSpace/DSpace/pull/2584) for the private items showing up in the DSpace sitemap bug
|
|
- The fix is super simple, I should try to apply it
|
|
|
|
## 2021-06-21
|
|
|
|
- The AReS harvesting finished, but the indexes got messed up again
|
|
- I was looking at the JSON export I made yesterday and trying to understand the situation with duplicates
|
|
- We have 90,000+ items, but only 85,000 unique:
|
|
|
|
```console
|
|
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | wc -l
|
|
90937
|
|
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort -u | wc -l
|
|
85709
|
|
```
|
|
|
|
- So those could be duplicates from the way we harvest pages, but they could also be from mappings...
|
|
- Manually inspecting the duplicates where handles appear more than once:
|
|
|
|
```console
|
|
$ grep -E '"repo":"CGSpace"' openrxv-items_data.json | grep -oE '"handle":"[[:digit:]]+/[[:alnum:]]+"' | sort | uniq -c | sort -h
|
|
```
|
|
|
|
- Unfortunately I found no pattern:
|
|
- Some appear twice in the Elasticsearch index, but appear in only one collection
|
|
- Some appear twice in the Elasticsearch index, and appear in *two* collections
|
|
- Some appear twice in the Elasticsearch index, but appear in three collections (!)
|
|
- So really we need to just check whether a handle exists before we insert it
|
|
- I tested the [pull request for DS-1977](https://github.com/DSpace/DSpace/pull/2584) that adjusts the sitemap generation code to exclude private items
|
|
- It applies cleanly and seems to work, but we don't actually have any private items
|
|
- The issue we are having with AReS hitting restricted items in the sitemap is that the items have restricted metadata, not that they are private
|
|
- Testing the [pull request for DS-4065](https://github.com/DSpace/DSpace/pull/2275) where the REST API's `/rest/items` endpoint is not aware of private items and returns an incorrect number of items
|
|
- This is most easily seen by setting a low limit in `/rest/items`, making one of the items private, and requesting items again with the same limit
|
|
- I confirmed the issue on the current DSpace 6 Demo:
|
|
|
|
```console
|
|
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
|
|
5
|
|
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
|
|
"10673/4"
|
|
"10673/3"
|
|
"10673/6"
|
|
"10673/5"
|
|
"10673/7"
|
|
# log into DSpace Demo XMLUI as admin and make one item private (for example 10673/6)
|
|
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq length
|
|
4
|
|
$ curl -s -H "Accept: application/json" "https://demo.dspace.org/rest/items?offset=0&limit=5" | jq '.[].handle'
|
|
"10673/4"
|
|
"10673/3"
|
|
"10673/5"
|
|
"10673/7"
|
|
```
|
|
|
|
- I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira
|
|
- Last week I noticed that the Gender Platform website is using "cgspace.cgiar.org" links for CGSpace, instead of handles
|
|
- I emailed Fabio and Marianne to ask them to please use the Handle links
|
|
- I tested the [pull request for DS-4271](https://github.com/DSpace/DSpace/pull/2543) where Discovery filters of type "contains" don't work as expected when the user's search term has spaces
|
|
- I tested with filter "farmer managed irrigation systems" on DSpace Test
|
|
- Before the patch I got 293 results, and the few I checked didn't have the expected metadata value
|
|
- After the patch I got 162 results, and all the items I checked had the exact metadata value I was expecting
|
|
- I tested a fresh harvest from my local AReS on DSpace Test with the DS-4065 REST API patch and here are my results:
|
|
- 90459 in final from last harvesting
|
|
- 90307 in temp after new harvest
|
|
- 90327 in temp after start plugins
|
|
- The 90327 number seems closer to the "real" number of items on CGSpace...
|
|
- Seems close, but not entirely correct yet:
|
|
|
|
```console
|
|
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | wc -l
|
|
90327
|
|
$ grep -oE '"handle":"[[:digit:]]+/[[:digit:]]+"' openrxv-items_data-local-ds-4065.json | sort -u | wc -l
|
|
90317
|
|
```
|
|
|
|
## 2021-06-22
|
|
|
|
- Make a [pull request](https://github.com/atmire/COUNTER-Robots/pull/43) to the COUNTER-Robots project to add two new user agents: crusty and newspaper
|
|
- These two bots have made ~3,000 requests on CGSpace
|
|
- Then I added them to our local bot override in CGSpace (until the above pull request is merged) and ran my bot checking script:
|
|
|
|
```console
|
|
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
|
|
Purging 1339 hits from RI\/1\.0 in statistics
|
|
Purging 447 hits from crusty in statistics
|
|
Purging 3736 hits from newspaper in statistics
|
|
|
|
Total number of bot hits purged: 5522
|
|
```
|
|
|
|
- Surprised to see RI/1.0 in there because it's been in the override file for a while
|
|
- Looking at the 2021 statistics in Solr I see a few more suspicious user agents:
|
|
- `PostmanRuntime/7.26.8`
|
|
- `node-fetch/1.0 (+https://github.com/bitinn/node-fetch)`
|
|
- `Photon/1.0`
|
|
- `StatusCake_Pagespeed_indev`
|
|
- `node-superagent/3.8.3`
|
|
- `cortex/1.0`
|
|
- These bots account for ~42,000 hits in our statistics... I will just purge them and add them to our local override, but I can't be bothered to submit them to COUNTER-Robots since I'd have to look up the information for each one
|
|
- I re-synced DSpace Test (linode26) with the assetstore, Solr statistics, and database from CGSpace (linode18)
|
|
|
|
## 2021-06-23
|
|
|
|
- I woke up this morning to find CGSpace down
|
|
- The logs show a high number of abandoned PostgreSQL connections and locks:
|
|
|
|
```console
|
|
# journalctl --since=today -u tomcat7 | grep -c 'Connection has been abandoned'
|
|
978
|
|
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
|
|
10100
|
|
```
|
|
|
|
- I sent a message to Atmire, hoping that the database logging stuff they put in place last time this happened will be of help now
|
|
- In the mean time, I decided to upgrade Tomcat from 7.0.107 to 7.0.109, and the PostgreSQL JDBC driver from 42.2.20 to 42.2.22 (first on DSpace Test)
|
|
- I also applied the following patches from the 6.4 milestone to our `6_x-prod` branch:
|
|
- DS-4065: resource policy aware REST API hibernate queries
|
|
- DS-4271: Replaced brackets with double quotes in SolrServiceImpl
|
|
- After upgrading and restarting Tomcat the database connections and locks were back down to normal levels:
|
|
|
|
```console
|
|
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
|
|
63
|
|
```
|
|
|
|
- Looking in the DSpace log, the first "pool empty" message I saw this morning was at 4AM:
|
|
|
|
```console
|
|
2021-06-23 04:01:14,596 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper @ [http-bio-127.0.0.1-8443-exec-4323] Timeout: Pool empty. Unable to fetch a connection in 5 seconds, none available[size:250; busy:250; idle:0; lastwait:5000].
|
|
```
|
|
|
|
- Oh, and I notice 8,000 hits from a Flipboard bot using this user-agent:
|
|
|
|
```console
|
|
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
|
|
```
|
|
|
|
- We can purge them, as this is not user traffic: https://about.flipboard.com/browserproxy/
|
|
- I will add it to our local user agent pattern file and eventually submit a pull request to COUNTER-Robots
|
|
- I merged [Moayad's health check pull request in AReS](https://github.com/ilri/OpenRXV/pull/96) and I will deploy it on the production server soon
|
|
|
|
## 2021-06-24
|
|
|
|
- I deployed the new OpenRXV code on CGSpace but I'm having problems with the indexing, something about missing the mappings on the `openrxv-items-temp` index
|
|
- I extracted the mappings from my local instance using `elasticdump` and after putting them on CGSpace I was able to harvest...
|
|
- But still, there are way too many duplicates and I'm not sure what the actual number of items should be
|
|
- According to the OAI ListRecords for each of our repositories, we should have about:
|
|
- MELSpace: 9537
|
|
- WorldFish: 4483
|
|
- CGSpace: 91305
|
|
- Total: 105325
|
|
- Looking at the last backup I have from harvesting before these changes we have 104,000 total handles, but only 99186 unique:
|
|
|
|
```console
|
|
$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | wc -l
|
|
104797
|
|
$ grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' cgspace-openrxv-items-temp-backup.json | sort | uniq | wc -l
|
|
99186
|
|
```
|
|
|
|
- This number is probably unique for that particular harvest, but I don't think it represents the true number of items...
|
|
- The harvest of DSpace Test I did on my local test instance yesterday has about 91,000 items:
|
|
|
|
```console
|
|
$ grep -E '"repo":"DSpace Test"' 2021-06-23-openrxv-items-final-local.json | grep -oE '"handle":"([[:digit:]]|\.)+/[[:digit:]]+"' | sort | uniq | wc -l
|
|
90990
|
|
```
|
|
|
|
- So the harvest on the live site is missing items, then why didn't the add missing items plugin find them?!
|
|
- I notice that we are missing the `type` in the metadata structure config for each repository on the production site, and we are using `type` for item type in the actual schema... so maybe there is a conflict there
|
|
- I will rename type to `item_type` and add it back to the metadata structure
|
|
- The add missing items definitely checks this field...
|
|
- I modified my local backup to add `type: item` and uploaded it to the temp index on production
|
|
- Oh! nginx is blocking OpenRXV's attempt to read the sitemap:
|
|
|
|
```console
|
|
172.104.229.92 - - [24/Jun/2021:07:52:58 +0200] "GET /sitemap HTTP/1.1" 503 190 "-" "OpenRXV harvesting bot; https://github.com/ilri/OpenRXV"
|
|
```
|
|
|
|
- I fixed nginx so it always allows people to get the sitemap and then re-ran the plugins... now it's checking 180,000+ handles to see if they are collections or items...
|
|
- I see it fetched the sitemap three times, we need to make sure it's only doing it once for each repository
|
|
- According to the api logs we will be adding 5,697 items:
|
|
|
|
```console
|
|
$ docker logs api 2>/dev/null | grep dspace_add_missing_items | sort | uniq | wc -l
|
|
5697
|
|
```
|
|
|
|
- Spent a few hours with Moayad troubleshooting and improving OpenRXV
|
|
- We found a bug in the harvesting code that can occur when you are harvesting DSpace 5 and DSpace 6 instances, as DSpace 5 uses numeric (long) IDs, and DSpace 6 uses UUIDs
|
|
|
|
<!-- vim: set sw=2 ts=2: -->
|