---
title: "June, 2021"
date: 2021-06-01T10:51:07+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2021-06-01

- IWMI notified me that AReS was down with an HTTP 502 error
  - Looking at UptimeRobot I see it has been down for 33 hours, but I never got a notification
  - I don't see anything in the Elasticsearch container logs, or the systemd journal on the host, but I notice that the `angular_nginx` container isn't running
  - I simply started it and AReS was running again:

<!--more-->

```console
$ docker-compose -f docker/docker-compose.yml start angular_nginx
```

- Margarita from CCAFS emailed me to say that workflow alerts haven't been working lately
  - I guess this is related to the SMTP issues last week
  - I had fixed the config, but didn't restart Tomcat so DSpace didn't load the new variables
  - I ran all system updates on CGSpace (linode18) and DSpace Test (linode26) and rebooted the servers

## 2021-06-03

- Meeting with AMCOW and IWMI to discuss AMCOW getting IWMI's content into the new AMCOW Knowledge Hub
  - At first we spent some time talking about DSpace communities/collections and the REST API, but then they said they actually prefer to send queries to sites on the fly and cache them in Redis for some time
  - That's when I thought they could perhaps use the OpenSearch, but I can't remember if it's possible to limit by community, or only collection...
  - Looking now, I see there is a "scope" parameter that can be used for community or collection, for example:

```
https://cgspace.cgiar.org/open-search/discover?query=subject:water%20scarcity&scope=10568/16814&order=DESC&rpp=100&sort_by=2&start=1
```

- That will sort by date issued (see: `webui.itemlist.sort-option.2` in dspace.cfg), give 100 results per page, and start on item 1
- Otherwise, another alternative would be to use the IWMI CSV that we are already exporting every week
- Fill out the *CGIAR-AGROVOC Task Group: Survey on the current CGIAR use of AGROVOC* survey on behalf of CGSpace

## 2021-06-06

- The Elasticsearch indexes are messed up so I dumped and re-created them correctly:

```console
curl -XDELETE 'http://localhost:9200/openrxv-items-final'
curl -XDELETE 'http://localhost:9200/openrxv-items-temp'
curl -XPUT 'http://localhost:9200/openrxv-items-final'
curl -XPUT 'http://localhost:9200/openrxv-items-temp'
curl -s -X POST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d'{"actions" : [{"add" : { "index" : "openrxv-items-final", "alias" : "openrxv-items"}}]}'
elasticdump --input=/home/aorth/openrxv-items_mapping.json --output=http://localhost:9200/openrxv-items-final --type=mapping
elasticdump --input=/home/aorth/openrxv-items_data.json --output=http://localhost:9200/openrxv-items-final --type=data --limit=1000
```

- Then I started a harvesting on AReS

## 2021-06-07

- The harvesting on AReS completed successfully
- Provide feedback to FAO on how we use AGROVOC for their "AGROVOC call for use cases"

## 2021-06-10

- Skype with Moayad to discuss AReS harvesting improvements
  - He will work on a plugin that reads the XML sitemap to get all item IDs and checks whether we have them or not

## 2021-06-14

- Dump and re-create indexes on AReS (as above) so I can do a harvest

## 2021-06-16

- Looking at the Solr statistics on CGSpace for last month I see many requests from hosts using seemingly normal Windows browser user agents, but using the MSN bot's DNS
  - For example, user agent `Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0;  Trident/5.0)` with DNS `msnbot-131-253-25-91.search.msn.com.`
  - I queried Solr for all hits using the MSN bot DNS (`dns:*msnbot* AND dns:*.msn.com.`) and found 457,706
  - I extracted their IPs using Solr's CSV format and ran them through my `resolve-addresses.py` script and found that they all belong to MICROSOFT-CORP-MSN-AS-BLOCK (AS8075)
  - Note that [Microsoft's docs say that reverse lookups on Bingbot IPs will always have "search.msn.com"](https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26) so it is safe to purge these as non-human traffic
  - I purged the hits with `ilri/check-spider-ip-hits.sh` (though I had to do it in 3 batches because I forgot to increase the `facet.limit` so I was only getting them 100 at a time)
- Moayad sent a pull request a few days ago to re-work the harvesting on OpenRXV
  - It will hopefully also fix the duplicate and missing items issues
  - I had a Skype with him to discuss
  - I got it running on podman-compose, but I had to fix the storage permissions on the Elasticsearch volume after the first time it tries (and fails) to run:

```console
$ podman unshare chown 1000:1000 /home/aorth/.local/share/containers/storage/volumes/docker_esData_7/_data
```

<!-- vim: set sw=2 ts=2: -->