---
title: "July, 2022"
date: 2022-07-02T14:07:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2022-07-02

- I learned how to use the Levenshtein functions in PostgreSQL
  - The catch is that there is a limit of 255 characters for these functions in PostgreSQL, so you need to truncate the strings before comparing
  - Also, the trgm functions I've used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first
  - A working query checking for duplicates in the recent AfricaRice items is:

```console
localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3;
                                        text_value
────────────────────────────────────────────────────────────────────────────────────────
 International trade and exotic pests: the risks for biodiversity and African economies
(1 row)

Time: 399.751 ms
```

- There is a great [blog post discussing Soundex with Levenshtein](https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql) and creating indexes to make them faster
- I want to do some proper checks of accuracy and speed against my trigram method

## 2022-07-03

- Start a harvest on AReS

## 2022-07-04

- Linode told me that CGSpace had high load yesterday
  - I also got some up and down notices from UptimeRobot
  - Looking now, I see there was very high CPU and database pool load, but a mostly normal DSpace session count

![CPU load day](/cgspace-notes/2022/07/cpu-day.png)
![JDBC pool day](/cgspace-notes/2022/07/jmx_tomcat_dbpools-day.png)

- Seems we have some old database transactions since 2022-06-27:

![PostgreSQL locks week](/cgspace-notes/2022/07/postgres_locks_ALL-week.png)
![PostgreSQL query length week](/cgspace-notes/2022/07/postgres_querylength_ALL-week.png)

- Looking at the top connections to nginx yesterday:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq -c | sort -h | tail
   1132 64.124.8.34
   1146 2a01:4f8:1c17:5550::1
   1380 137.184.159.211
   1533 64.124.8.59
   4013 80.248.237.167
   4776 54.195.118.125
  10482 45.5.186.2
  11177 172.104.229.92
  15855 2a01:7e00::f03c:91ff:fe9a:3a37
  22179 64.39.98.251
```

- And the total number of unique IPs:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort -u | wc -l
6952
```

- This seems low, so the load must have come from the request patterns of certain visitors
- 64.39.98.251 is Qualys, and I'm debating blocking [all their IPs](https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm) using a geo block in nginx (need to test)
- The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover
- 64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo
- I ran all system updates and rebooted the server (could have just restarted PostgreSQL, but I thought I might as well do everything)
- I implemented a geo mapping for the user agent mapping AND the nginx `limit_req_zone` by extracting the networks into an external file and including it in two different geo mapping blocks
  - This is clever and relies on the fact that we can use defaults in both cases
  - First, we map the user agent of requests from these networks to "bot" so that Tomcat and Solr handle them accordingly
  - Second, we use this as a key in a `limit_req_zone`, which relies on a default mapping of '' (and nginx doesn't evaluate empty cache keys)
- I noticed that CIP uploaded a number of Georgian presentations with `dcterms.language` set to English and Other, so I changed them to "ka"
  - Perhaps we need to update our list of languages to include all of them instead of only the most common ones
- I wrote a script, `ilri/iso-639-value-pairs.py`, to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to `input-forms.xml`
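The dual geo-mapping trick described under 2022-07-04 might look roughly like the sketch below (the file path, variable names, and zone parameters are all hypothetical, not the actual production config): one shared network list is included by two `geo` blocks, and each block's `default` does the real work.

```nginx
# Hypothetical sketch of the approach described above.

# First geo block: tag requests from the listed networks. The included
# file would contain lines like: 192.0.2.0/24 bot;
geo $ua_is_bot {
    default "";
    include /etc/nginx/bot-networks.conf;  # hypothetical path
}

# Override the user agent for those networks so that Tomcat and Solr
# handle the requests as a bot.
map $ua_is_bot $ua {
    default $http_user_agent;
    bot     "bot";
}

# Second geo block: the same network list reused as a rate-limit key.
# The default of "" matters: nginx does not count requests with an
# empty key against the zone, so only the listed networks are limited.
geo $bot_limit_key {
    default "";
    include /etc/nginx/bot-networks.conf;
}

limit_req_zone $bot_limit_key zone=bot_zone:10m rate=5r/s;
```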
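The duplicate-title check from the 2022-07-02 query can be sketched in plain Python as a rough illustration (the function names here are mine, not from any real script): lower case both titles first because Levenshtein is case sensitive, truncate to 255 characters to mirror PostgreSQL's limit, then compare against a small edit-distance threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def is_probable_duplicate(title_a: str, title_b: str, threshold: int = 3) -> bool:
    """Mirror the SQL above: lower case both strings (Levenshtein is
    case sensitive, unlike the trgm functions) and truncate to 255
    characters (PostgreSQL's limit for the levenshtein functions)."""
    a = title_a.lower()[:255]
    b = title_b.lower()[:255]
    return levenshtein(a, b) <= threshold


print(is_probable_duplicate(
    "International Trade and Exotic Pests: The Risks for Biodiversity and African Economies",
    "International trade and exotic pests: the risks for biodiversity and African economies",
))  # → True, the titles differ only in case
```

Note that pure-Python Levenshtein is slow on large corpora; the point of doing it in PostgreSQL with `levenshtein_less_equal` is that the comparison short-circuits once the distance exceeds the threshold.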
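The pycountry part of a script like `ilri/iso-639-value-pairs.py` could be sketched as follows (I am guessing at the script's contents; the DSpace `<pair>` output format shown is illustrative): languages that carry an `alpha_2` attribute in pycountry are exactly the ISO 639-1 set.

```python
import pycountry

# Languages with an alpha_2 attribute are the ISO 639-1 languages
pairs = sorted(
    (lang.name, lang.alpha_2)
    for lang in pycountry.languages
    if hasattr(lang, "alpha_2")
)

# Emit value-pairs entries in the style of DSpace's input-forms.xml
# (illustrative; the real script's output format may differ)
for name, alpha_2 in pairs:
    print(f"<pair><displayed-value>{name}</displayed-value>"
          f"<stored-value>{alpha_2}</stored-value></pair>")
```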