- IWMI notified me that AReS was down with an HTTP 502 error
- Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
- I don't see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the `angular_nginx` container isn't running
- Margarita from CCAFS emailed me to say that workflow alerts haven't been working lately
- I guess this is related to the SMTP issues last week
- I had fixed the config, but didn't restart Tomcat so DSpace didn't load the new variables
- I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers
## 2021-06-03
- Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI's content into the new AMCOW Knowledge Hub
- At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time
- That's when I thought they could perhaps use the OpenSearch, but I can't remember if it's possible to limit by community, or only collection...
- Looking now, I see there is a "scope" parameter that can be used for community or collection, for example:
- Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot's DNS
- For example, user agent `Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)` with DNS `msnbot-131-253-25-91.search.msn.com.`
- I queried Solr for all hits using the MSN bot DNS (`dns:*msnbot* AND dns:*.msn.com.`) and found 457,706
- I extracted their IPs using Solr's CSV format and ran them through my `resolve-addresses.py` script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)
- Note that [Microsoft's docs say that reverse lookups on Bingbot IPs will always have "search.msn.com"](https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26) so it is safe to purge these as non-human traffic
- I purged the hits with `ilri/check-spider-ip-hits.sh` (though I had to do it in 3 batches because I forgot to increase the `facet.limit` so I was only getting them 100 at a time)
- Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV
- It will hopefully also fix the duplicate and missing items issues
- I had a Skype with him to discuss
- I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:
- Then I uploaded the resulting CSV to CGSpace, updating 161 items
- Start a harvest on AReS
- I found [a bug](https://jira.lyrasis.org/browse/DS-1977) and [a patch](https://github.com/DSpace/DSpace/pull/2584) for the private items showing up in the DSpace sitemap bug
- The fix is super simple, I should try to apply it
## 2021-06-21
- The AReS harvesting finished, but the indexes got messed up again
- I was looking at the JSON export I made yesterday and trying to understand the situation with duplicates
- Some appear twice in the Elasticsearch index, but appear in only one collection
- Some appear twice in the Elasticsearch index, and appear in *two* collections
- Some appear twice in the Elasticsearch index, but appear in three collections (!)
- So really we need to just check whether a handle exists before we insert it
- I tested the [pull request for DS-1977](https://github.com/DSpace/DSpace/pull/2584) that adjusts the sitemap generation code to exclude private items
- It applies cleanly and seems to work, but we don't actually have any private items
- The issue we are having with AReS hitting restricted items in the sitemap is that the items have restricted metadata, not that they are private
- Testing the [pull request for DS-4065](https://github.com/DSpace/DSpace/pull/2275) where the REST API's `/rest/items` endpoint is not aware of private items and returns an incorrect number of items
- This is most easily seen by setting a low limit in `/rest/items`, making one of the items private, and requesting items again with the same limit
- I confirmed the issue on the current DSpace 6 Demo:
- I tested the pull request on DSpace Test and it works, so I left a note on GitHub and Jira
- Last week I noticed that the Gender Platform website is using "cgspace.cgiar.org" links for CGSpace, instead of handles
- I emailed Fabio and Marianne to ask them to please use the Handle links
- I tested the [pull request for DS-4271](https://github.com/DSpace/DSpace/pull/2543) where Discovery filters of type "contains" don't work as expected when the user's search term has spaces
- I tested with filter "farmer managed irrigation systems" on DSpace Test
- Before the patch I got 293 results, and the few I checked didn't have the expected metadata value
- After the patch I got 162 results, and all the items I checked had the exact metadata value I was expecting