---
title: "July, 2022"
date: 2022-07-02T14:07:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2022-07-02

- I learned how to use the Levenshtein functions in PostgreSQL
  - These functions only accept strings of up to 255 characters in PostgreSQL, so you need to truncate longer strings before comparing
  - Also, the trgm functions I've used before are case insensitive, but Levenshtein is not, so you need to lowercase both strings first

<!--more-->

- A working query checking for duplicates in the recent AfricaRice items is:

```console
localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE  dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3;
                                       text_value                                       
────────────────────────────────────────────────────────────────────────────────────────
 International trade and exotic pests: the risks for biodiversity and African economies
(1 row)

Time: 399.751 ms
```
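
- As a rough illustration of what the query above does, here is a minimal Python sketch of the same check: lowercase both strings, truncate them to 255 characters, and accept a match when the edit distance is at most 3 (the `levenshtein` and `is_probable_duplicate` helpers are hypothetical names for this sketch, not PostgreSQL's implementation):

```python
MAX_LEN = 255  # the length limit noted above for the PostgreSQL functions


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete from a
                    current[j - 1] + 1,  # insert into a
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def is_probable_duplicate(candidate: str, existing: str, max_distance: int = 3) -> bool:
    """Lowercase and truncate both titles, then compare with a distance threshold."""
    left = candidate.lower()[:MAX_LEN]
    right = existing.lower()[:MAX_LEN]
    return levenshtein(left, right) <= max_distance


new_title = "International Trade and Exotic Pests: The Risks for Biodiversity and African Economies"
old_title = "International trade and exotic pests: the risks for biodiversity and African economies"
print(levenshtein(new_title, old_title))
print(is_probable_duplicate(new_title, old_title))
```

- With the two titles from the query, seven capital letters differ, so the raw distance is 7 and the match is only found after lowercasing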

- There is a great [blog post discussing Soundex with Levenshtein](https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql) and creating indexes to make them faster
- I want to do some proper checks of accuracy and speed against my trigram method
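
- For comparing against the trigram method, a simplified Python approximation of trigram similarity might look like the sketch below. It assumes the documented pg_trgm behavior of lowercasing, splitting into alphanumeric words, and padding each word with two leading spaces and one trailing space; the function names are hypothetical and the results may differ from PostgreSQL for non-ASCII text:

```python
import re


def trigrams(text: str) -> set:
    """Return the set of 3-character windows over each padded, lowercased word."""
    result = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams in either string."""
    trigrams_a, trigrams_b = trigrams(a), trigrams(b)
    if not trigrams_a and not trigrams_b:
        return 0.0
    return len(trigrams_a & trigrams_b) / len(trigrams_a | trigrams_b)


print(sorted(trigrams("cat")))
print(trigram_similarity("exotic pests", "Exotic Pest"))
```

- Unlike Levenshtein, this score is already case insensitive and is a ratio between 0 and 1 rather than a count of edits, so the two methods need different thresholds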

## 2022-07-03

- Start a harvest on AReS

## 2022-07-04

- Linode told me that CGSpace had high load yesterday
  - I also got some up and down notices from UptimeRobot
  - Looking now, I see there was a very high CPU and database pool load, but a mostly normal DSpace session count

![CPU load day](/cgspace-notes/2022/07/cpu-day.png)
![JDBC pool day](/cgspace-notes/2022/07/jmx_tomcat_dbpools-day.png)

- It seems we have had some old database transactions open since 2022-06-27:

![PostgreSQL locks week](/cgspace-notes/2022/07/postgres_locks_ALL-week.png)
![PostgreSQL query length week](/cgspace-notes/2022/07/postgres_querylength_ALL-week.png)

- Looking at the top connections to nginx yesterday:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq -c | sort -h | tail
   1132 64.124.8.34
   1146 2a01:4f8:1c17:5550::1
   1380 137.184.159.211
   1533 64.124.8.59
   4013 80.248.237.167
   4776 54.195.118.125
  10482 45.5.186.2
  11177 172.104.229.92
  15855 2a01:7e00::f03c:91ff:fe9a:3a37
  22179 64.39.98.251
```

- And the total number of unique IPs:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort -u | wc -l
6952
```
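
- The two pipelines above can also be sketched in Python, which is handy when I want both numbers in one pass. The `client_ip_counts` helper is a hypothetical name and the sample log lines are made up for illustration (only the IPs come from the output above):

```python
from collections import Counter


def client_ip_counts(lines):
    """Count requests per client IP using the first whitespace-separated field, like awk's $1.

    Blank lines are skipped, which differs slightly from the awk version.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


# Made-up sample lines; in practice these would be read from the rotated nginx logs
sample_log = [
    '64.39.98.251 - - [03/Jul/2022:10:00:00 +0200] "GET / HTTP/1.1" 200 512',
    '45.5.186.2 - - [03/Jul/2022:10:00:01 +0200] "GET /rest/items HTTP/1.1" 200 2048',
    '64.39.98.251 - - [03/Jul/2022:10:00:02 +0200] "GET /discover HTTP/1.1" 200 1024',
    '80.248.237.167 - - [03/Jul/2022:10:00:03 +0200] "GET /discover HTTP/1.1" 200 1024',
    '64.39.98.251 - - [03/Jul/2022:10:00:04 +0200] "GET /oai/request HTTP/1.1" 200 256',
    '45.5.186.2 - - [03/Jul/2022:10:00:05 +0200] "GET /rest/items HTTP/1.1" 200 2048',
]

counts = client_ip_counts(sample_log)
top = counts.most_common(2)  # busiest clients first
unique = len(counts)  # equivalent of sort -u | wc -l
print(top, unique)
```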

- A total of 6,952 unique IPs seems low, so the load must have come from the request patterns of certain visitors
  - 64.39.98.251 is Qualys, and I'm debating blocking [all their IPs](https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm) using a geo block in nginx (need to test)
  - The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover
  - 64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo
- I ran all system updates and rebooted the server (could have just restarted PostgreSQL but I thought I might as well do everything)
- I implemented a geo mapping for both the user agent mapping and the nginx `limit_req_zone` by extracting the networks into an external file and including it in two different geo blocks
  - This is clever and relies on the fact that we can use defaults in both cases
  - First, we map the user agent of requests from these networks to "bot" so that Tomcat and Solr handle them accordingly
  - Second, we use the mapped value as the key in a `limit_req_zone`, relying on a default mapping of '' for all other networks (nginx does not count requests with an empty key against the limit)
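- The logic of the two geo mappings can be modeled with a small Python sketch. This is not the nginx configuration itself: the networks are documentation ranges standing in for the real list, and the function names and the shared "bot" rate limit key are assumptions for illustration:

```python
import ipaddress

# Hypothetical stand-in for the shared networks file (documentation ranges, not the real list)
BOT_NETWORKS = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "2001:db8::/32")]


def in_bot_networks(addr: str) -> bool:
    """True if the client address falls inside any listed network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in BOT_NETWORKS)


def effective_user_agent(addr: str, user_agent: str) -> str:
    """First mapping: listed networks become "bot", the default keeps the real user agent."""
    return "bot" if in_bot_networks(addr) else user_agent


def rate_limit_key(addr: str) -> str:
    """Second mapping: listed networks get a non-empty key, the default is ''.

    An empty key models a request that the rate limiter ignores.
    """
    return "bot" if in_bot_networks(addr) else ""


for client in ("192.0.2.10", "198.51.100.7", "2001:db8::1"):
    print(client, effective_user_agent(client, "Mozilla/5.0"), repr(rate_limit_key(client)))
```
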
- I noticed that CIP uploaded a number of Georgian presentations with `dcterms.language` set to English and Other so I changed them to "ka"
  - Perhaps we need to update our list of languages to include all of them instead of only the most common ones
- I wrote a script `ilri/iso-639-value-pairs.py` to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to `input-forms.xml`
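
- A minimal sketch of that kind of extraction is below. It is not the actual `ilri/iso-639-value-pairs.py` script: the function names are hypothetical, it assumes pycountry language records expose `name` and (only when one exists) `alpha_2`, and it assumes the usual `<pair>` layout of DSpace value pairs:

```python
from xml.sax.saxutils import escape


def iso_639_1_languages():
    """Return (name, alpha_2) tuples for languages that have a two-letter code.

    Requires the third-party pycountry package, so it is imported lazily here.
    """
    import pycountry

    return [
        (language.name, language.alpha_2)
        for language in pycountry.languages
        if getattr(language, "alpha_2", None)
    ]


def value_pairs(languages) -> str:
    """Render (name, code) tuples as <pair> elements, sorted by display name."""
    lines = []
    for name, code in sorted(languages):
        lines.append("<pair>")
        lines.append(f"  <displayed-value>{escape(name)}</displayed-value>")
        lines.append(f"  <stored-value>{escape(code)}</stored-value>")
        lines.append("</pair>")
    return "\n".join(lines)


# Small hand-written sample; value_pairs(iso_639_1_languages()) would cover the full list
print(value_pairs([("Georgian", "ka"), ("English", "en")]))
```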

<!-- vim: set sw=2 ts=2: -->