---
title: "July, 2022"
date: 2022-07-02T14:07:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2022-07-02

- I learned how to use the Levenshtein functions in PostgreSQL
  - The catch is that these functions have a limit of 255 characters in PostgreSQL, so you need to truncate the strings before comparing
  - Also, the trgm functions I've used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first
- A working query checking for duplicates in the recent AfricaRice items is:

```console
localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3;
                                        text_value
──────────────────────────────────────────────────────────────────────────────────────────
 International trade and exotic pests: the risks for biodiversity and African economies
(1 row)

Time: 399.751 ms
```

- There is a great [blog post discussing Soundex with Levenshtein](https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql) and creating indexes to make them faster
  - I want to do some proper checks of accuracy and speed against my trigram method

## 2022-07-03

- Start a harvest on AReS

## 2022-07-04

- Linode told me that CGSpace had high load yesterday
  - I also got some up and down notices from UptimeRobot
  - Looking now, I see there was a very high CPU and database pool load, but a mostly normal DSpace session count

![CPU load day](/cgspace-notes/2022/07/cpu-day.png)
![JDBC pool day](/cgspace-notes/2022/07/jmx_tomcat_dbpools-day.png)

- Seems we have some old database transactions since 2022-06-27:

![PostgreSQL locks week](/cgspace-notes/2022/07/postgres_locks_ALL-week.png)
![PostgreSQL query length week](/cgspace-notes/2022/07/postgres_querylength_ALL-week.png)

- Looking at the top connections to nginx yesterday:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq -c | sort -h | tail
   1132 64.124.8.34
   1146 2a01:4f8:1c17:5550::1
   1380 137.184.159.211
   1533 64.124.8.59
   4013 80.248.237.167
   4776 54.195.118.125
  10482 45.5.186.2
  11177 172.104.229.92
  15855 2a01:7e00::f03c:91ff:fe9a:3a37
  22179 64.39.98.251
```

- And the total number of unique IPs:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort -u | wc -l
6952
```

- This seems low, so the load must have come from the request patterns of certain visitors
  - 64.39.98.251 is Qualys, and I'm debating blocking [all their IPs](https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm) using a geo block in nginx (need to test)
  - The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover
  - 64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo
- I ran all system updates and rebooted the server (could have just restarted PostgreSQL, but I thought I might as well do everything)
- I implemented a geo mapping for the user agent mapping AND the nginx `limit_req_zone` by extracting the networks into an external file and including it in two different geo mapping blocks (sketched below)
  - This is clever and relies on the fact that we can use defaults in both cases
  - First, we map the user agent of requests from these networks to "bot" so that Tomcat and Solr handle them accordingly
  - Second, we use this as a key in a `limit_req_zone`, which relies on a default mapping of '' (and nginx doesn't evaluate empty keys)
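
Roughly, the pattern looks like this (a sketch only: the file name, zone size, and rate here are illustrative, and it uses a helper `map` for the user agent override rather than the exact production config):

```nginx
# bot-networks.conf holds one entry per network, for example:  64.39.98.0/24  'bot';

# Geo block for the user agent override: requests from listed networks get $ua_geo='bot'
geo $ua_geo {
    default '';
    include /etc/nginx/bot-networks.conf;
}

# Pass "bot" upstream for those networks so Tomcat and Solr treat them as bots
map $ua_geo $ua {
    default $http_user_agent;
    'bot'   'bot';
}

# Geo block for rate limiting: the default of '' means unmatched requests have an
# empty key, and nginx does not account requests with empty keys
geo $limit_bots {
    default '';
    include /etc/nginx/bot-networks.conf;
}

limit_req_zone $limit_bots zone=bot_networks:10m rate=5r/s;
```

The server block then applies both, with something like `proxy_set_header User-Agent $ua;` on the proxy pass to Tomcat and a `limit_req zone=bot_networks ...;` wherever the rate limit should apply.
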
- I noticed that CIP uploaded a number of Georgian presentations with `dcterms.language` set to English and Other, so I changed them to "ka"
  - Perhaps we need to update our list of languages to include all of them instead of just the most common ones
- I wrote a script `ilri/iso-639-value-pairs.py` to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to `input-forms.xml`

## 2022-07-06

- CGSpace went down and up a few times due to high load
  - I found one host in Romania making very high speed requests with a normal user agent (`Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C`):

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort | uniq -c | sort -h | tail -n 10
    516 142.132.248.90
    525 157.55.39.234
    587 66.249.66.21
    593 95.108.213.59
   1372 137.184.159.211
   4776 54.195.118.125
   5441 205.186.128.185
   6267 45.5.186.2
  15839 2a01:7e00::f03c:91ff:fe9a:3a37
  36114 146.19.75.141
```

- I added 146.19.75.141 to the list of bot networks in nginx
- While looking at the logs I started thinking about Bing again
  - They apparently [publish a list of all their networks](https://www.bing.com/toolbox/bingbot.json)
  - I wrote a script to use `prips` to [print the IPs for each network](https://stackoverflow.com/a/52501093/1996540), roughly as sketched below
  - The script is `bing-networks-to-ips.sh`
  - From Bing's IPs alone I purged 145,403 hits... sheesh
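
Something along these lines would do it (this is not the actual `bing-networks-to-ips.sh`; the `prefixes` and `ipv4Prefix` field names are assumptions about the JSON layout, modeled on Google's equivalent file, so the jq filter may need adjusting):

```console
$ curl -s https://www.bing.com/toolbox/bingbot.json \
    | jq -r '.prefixes[] | .ipv4Prefix // empty' \
    | while read -r network; do prips "$network"; done > /tmp/bing-ips.txt
```
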
- Delete two items on CGSpace for Margarita because she was getting the "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-..." error
  - This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05
- Update some `cg.audience` metadata to use "Academics" instead of "Academicians":

```console
dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value='Academicians';
UPDATE 104
```

- I will also have to remove "Academicians" from input-forms.xml

## 2022-07-07

- Finalize lists of non-AGROVOC subjects in CGSpace that I started last week
  - I used the [SQL helper functions](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) to find the collections where each term was used:

```console
localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = 'water demand' GROUP BY collection ORDER BY count DESC LIMIT 5;
 collection  │ count
─────────────┼───────
 10568/36178 │    56
 10568/36185 │    46
 10568/36181 │    35
 10568/36188 │    28
 10568/36179 │    21
(5 rows)
```

- For now I only did terms from my list that had 100 or more occurrences in CGSpace
  - This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud to evaluate for possible inclusion in AGROVOC
- Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace
  - We want to remove them from the submission form to create space for new fields
- Update one term I noticed people using that was close to AGROVOC:

```console
dspace=# UPDATE metadatavalue SET text_value='development policies' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value='development policy';
UPDATE 108
```

- After contacting some editors, I removed some old metadata fields from the submission form and browse indexes:
  - Bioversity subject (`cg.subject.bioversity`)
  - CCAFS phase 1 project tag (`cg.identifier.ccafsproject`)
  - CIAT project tag (`cg.identifier.ciatproject`)
  - CIAT subject (`cg.subject.ciat`)
- Work on cleaning and proofing forty-six AfricaRice items for CGSpace
  - Last week we identified some duplicates, so I removed those
  - The data is of mediocre quality
  - I've been fixing citations (nitpick), adding licenses, adding volume/issue/extent, fixing DOIs, and adding some AGROVOC subjects
  - I even found titles with typos that look like OCR errors...

## 2022-07-08

- Finalize the cleaning and proofing of AfricaRice records
  - I found two suspicious items that claim to have been published, but I can't find them in the respective journals, so I removed those
  - I uploaded the forty-four items to [DSpace Test](https://dspacetest.cgiar.org/handle/10568/119135)
- Margarita from CCAFS said they are no longer using the CCAFS subject or CCAFS phase 2 project tag
  - I removed these from the input-forms.xml and Discovery facets:
    - cg.identifier.ccafsprojectpii
    - cg.subject.cifor
  - For now we will keep them in the search filters
- I modified my `check-duplicates.py` script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents), sketched below
  - I want to use this with the MARLO innovation reports, to find related publications and working papers on CGSpace
  - I am curious to see how the similarity scores compare to those from trgm... perhaps we don't need them actually
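
The spacy score is just the cosine similarity of the two documents' vectors, so the core of the comparison looks something like this (a minimal sketch, not the actual `check-duplicates.py` code; it assumes a model with word vectors such as `en_core_web_md` is installed):

```python
# Minimal sketch: compare two titles with spacy
# (assumes: python -m spacy download en_core_web_md)
import spacy

nlp = spacy.load("en_core_web_md")

title1 = nlp("International Trade and Exotic Pests: The Risks for Biodiversity and African Economies")
title2 = nlp("International trade and exotic pests: the risks for biodiversity and African economies")

# similarity() returns the cosine similarity of the averaged token vectors
print(title1.similarity(title2))
```
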
- Deploy latest changes to submission form, Discovery, and browse on CGSpace
  - Also run all system updates and reboot the host
- Fix 152 `dcterms.relation` values that use "cgspace.cgiar.org" links instead of handles:

```console
UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$';
```

## 2022-07-10

- UptimeRobot says that CGSpace is down
  - I see high load around 22 and high CPU around 800%
  - Doesn't seem to be a lot of unique IPs:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l
2243
```

- Looking at the top twenty I see some usual IPs, but also some new ones on Hetzner that are using many DSpace sessions:

```console
$ grep 65.109.2.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1613
$ grep 95.216.174.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1696
$ grep 65.109.15.213 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1708
$ grep 65.108.80.78 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1830
$ grep 65.108.95.23 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1811
```

![DSpace sessions week](/cgspace-notes/2022/07/jmx_dspace_sessions-week.png)

- These IPs are using normal-looking user agents:
  - `Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0.1`
  - `Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/45.0`
  - `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1`
  - `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.85 Safari/537.36`
- I will add the networks I'm seeing now to nginx's bot-networks.conf (not all of Hetzner) and purge the hits later:
  - 65.108.0.0/16
  - 65.21.0.0/16
  - 95.216.0.0/16
  - 135.181.0.0/16
  - 138.201.0.0/16
- I think I'm going to get to a point where I categorize all commercial subnets as bots by default and then whitelist those we need
- Sheesh, there are a bunch more IPv6 addresses also on Hetzner:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access}.log | sort | grep 2a01:4f9 | uniq -c | sort -h
      1 2a01:4f9:6a:1c2b::2
      2 2a01:4f9:2b:5a8::2
      2 2a01:4f9:4b:4495::2
     96 2a01:4f9:c010:518c::1
    137 2a01:4f9:c010:a9bc::1
    142 2a01:4f9:c010:58c9::1
    142 2a01:4f9:c010:58ea::1
    144 2a01:4f9:c010:58eb::1
    145 2a01:4f9:c010:6ff8::1
    148 2a01:4f9:c010:5190::1
    149 2a01:4f9:c010:7d6d::1
    153 2a01:4f9:c010:5226::1
    156 2a01:4f9:c010:7f74::1
    160 2a01:4f9:c010:5188::1
    161 2a01:4f9:c010:58e5::1
    168 2a01:4f9:c010:58ed::1
    170 2a01:4f9:c010:548e::1
    170 2a01:4f9:c010:8c97::1
    175 2a01:4f9:c010:58c8::1
    175 2a01:4f9:c010:aada::1
    182 2a01:4f9:c010:58ec::1
    182 2a01:4f9:c010:ae8c::1
    502 2a01:4f9:c010:ee57::1
    530 2a01:4f9:c011:567a::1
    535 2a01:4f9:c010:d04e::1
    539 2a01:4f9:c010:3d9a::1
    586 2a01:4f9:c010:93db::1
    593 2a01:4f9:c010:a04a::1
    601 2a01:4f9:c011:4166::1
    607 2a01:4f9:c010:9881::1
    640 2a01:4f9:c010:87fb::1
    648 2a01:4f9:c010:e680::1
   1141 2a01:4f9:3a:2696::2
   1146 2a01:4f9:3a:2555::2
   3207 2a01:4f9:3a:2c19::2
```

- Maybe it's time I ban all of Hetzner... sheesh.
- I left for a few hours and the server was going up and down the whole time, with CPU and database load still very high when I got back

![CPU day](/cgspace-notes/2022/07/cpu-day.png)

- I am not sure what's going on
  - I extracted all the IPs and used `resolve-addresses-geoip2.py` to analyze them, extract all the Hetzner networks, and block them
  - It's 181 IPs on Hetzner...
- I rebooted the server to see if it was just some stuck locks in PostgreSQL...
- The load is still higher than I would expect, and after a few more hours I see more Hetzner IPs coming through, so there are two more subnets to block
- Start a harvest on AReS

## 2022-07-12

- Update an incorrect ORCID identifier for Alliance
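
This kind of one-off fix follows the same SQL pattern as the updates above; a hypothetical sketch (the field id, name, and ORCID iDs below are placeholders, not the actual values I changed):

```console
localhost/dspace= ☘ -- look up the field id for cg.creator.identifier first (240 below is just a placeholder)
localhost/dspace= ☘ SELECT metadata_field_id FROM metadatafieldregistry WHERE element='creator' AND qualifier='identifier';
localhost/dspace= ☘ UPDATE metadatavalue SET text_value='Doe, Jane: 0000-0002-0000-0000' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=240 AND text_value='Doe, Jane: 0000-0001-0000-0000';
```
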