---
title: "July, 2022"
date: 2022-07-02T14:07:36+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2022-07-02

- I learned how to use the Levenshtein functions in PostgreSQL
- These functions are limited to 255 characters in PostgreSQL, so you need to truncate the strings before comparing
- Also, the trgm functions I've used before are case insensitive, but Levenshtein is not, so you need to make sure to lowercase both strings first

<!--more-->

- A working query checking for duplicates in the recent AfricaRice items is:

```console
localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3;
text_value
────────────────────────────────────────────────────────────────────────────────────────
International trade and exotic pests: the risks for biodiversity and African economies
(1 row)

Time: 399.751 ms
```

- There is a great [blog post discussing Soundex with Levenshtein](https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql) and creating indexes to make them faster
- I want to do some proper checks of accuracy and speed against my trigram method

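The two caveats above (lowercase first, truncate to 255 characters) can be sketched outside the database. This is a minimal pure-Python illustration of the idea, not PostgreSQL's implementation; the function names and the threshold default are hypothetical, and unlike the query it truncates both strings rather than only the stored value.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute cost 1 each)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


MAX_LENGTH = 255  # the limit noted above for PostgreSQL's Levenshtein functions


def is_probable_duplicate(candidate: str, existing: str, max_distance: int = 3) -> bool:
    """Lowercase and truncate both titles, then compare within max_distance edits."""
    left = candidate.lower()[:MAX_LENGTH]
    right = existing.lower()[:MAX_LENGTH]
    return levenshtein(left, right) <= max_distance


new_title = "International Trade and Exotic Pests: The Risks for Biodiversity and African Economies"
old_title = "International trade and exotic pests: the risks for biodiversity and African economies"
print(is_probable_duplicate(new_title, old_title))  # prints True
```

Without the `lower()` calls the two titles would differ by several capital letters and could miss the threshold of 3, which is the case-sensitivity issue noted above.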
## 2022-07-03

- Start a harvest on AReS

## 2022-07-04

- Linode told me that CGSpace had high load yesterday
- I also got some up and down notices from UptimeRobot
- Looking now, I see there was a very high CPU and database pool load, but a mostly normal DSpace session count

![CPU load day](/cgspace-notes/2022/07/cpu-day.png)
![JDBC pool day](/cgspace-notes/2022/07/jmx_tomcat_dbpools-day.png)

- Seems we have some old database transactions since 2022-06-27:

![PostgreSQL locks week](/cgspace-notes/2022/07/postgres_locks_ALL-week.png)
![PostgreSQL query length week](/cgspace-notes/2022/07/postgres_querylength_ALL-week.png)

- Looking at the top connections to nginx yesterday:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq -c | sort -h | tail
1132 64.124.8.34
1146 2a01:4f8:1c17:5550::1
1380 137.184.159.211
1533 64.124.8.59
4013 80.248.237.167
4776 54.195.118.125
10482 45.5.186.2
11177 172.104.229.92
15855 2a01:7e00::f03c:91ff:fe9a:3a37
22179 64.39.98.251
```

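The same kind of tally can be sketched in Python when the logs need further processing. This is only an illustration under the assumption that the client address is the first whitespace-separated field of each line (as the `awk '{print $1}'` above implies); the sample lines and function names are hypothetical.

```python
from collections import Counter


def client_ips(lines):
    """Yield the first whitespace-separated field of each non-empty log line."""
    for line in lines:
        fields = line.split(None, 1)
        if fields:
            yield fields[0]


def top_clients(lines, limit=10):
    """Return (ip, hits) pairs for the busiest clients, busiest first."""
    return Counter(client_ips(lines)).most_common(limit)


# Hypothetical sample lines standing in for real nginx access log entries
sample_log = [
    '192.0.2.10 - - [03/Jul/2022:00:00:01 +0200] "GET /handle/10568/1 HTTP/1.1" 200 512',
    '192.0.2.10 - - [03/Jul/2022:00:00:02 +0200] "GET /handle/10568/2 HTTP/1.1" 200 512',
    '2001:db8::1 - - [03/Jul/2022:00:00:03 +0200] "GET /discover HTTP/1.1" 200 2048',
]

print(top_clients(sample_log))
print(len(set(client_ips(sample_log))))  # distinct clients, like `sort -u | wc -l`
```

Note that `most_common` lists the busiest client first, whereas the shell pipeline above sorts ascending and shows the busiest last.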
- And the total number of unique IPs:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort -u | wc -l
6952
```

- This seems low, so the load must have come from the request patterns of certain visitors
- 64.39.98.251 is Qualys, and I'm debating blocking [all their IPs](https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm) using a geo block in nginx (need to test)
- The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover
- 64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo
- I ran all system updates and rebooted the server (could have just restarted PostgreSQL but I thought I might as well do everything)
- I implemented a geo mapping for both the user agent mapping and the nginx `limit_req_zone` by extracting the networks into an external file and including it in two different geo mapping blocks
- This is clever and relies on the fact that we can use defaults in both cases
- First, we map the user agent of requests from these networks to "bot" so that Tomcat and Solr handle them accordingly
- Second, we use this as a key in a `limit_req_zone`, which relies on a default mapping of '' (and nginx doesn't account requests with an empty key)
- I noticed that CIP uploaded a number of Georgian presentations with `dcterms.language` set to English and Other so I changed them to "ka"
- Perhaps we need to update our list of languages to include all of them instead of only the most common ones
- I wrote a script `ilri/iso-639-value-pairs.py` to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to `input-forms.xml`

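The geo mapping logic described earlier in this entry (one shared list of networks, two mappings with different defaults) can be modelled in a few lines of Python. This is a sketch of the logic only, not the nginx configuration itself; the networks are documentation ranges, and the assumption that every listed network maps to the single value "bot" in both mappings is mine.

```python
import ipaddress

# Hypothetical contents of the shared networks file (documentation ranges, not real bots)
BOT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_bot_network(client_ip: str) -> bool:
    """True if the client address falls inside any listed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BOT_NETWORKS)


def effective_user_agent(client_ip: str, user_agent: str) -> str:
    """First mapping: listed networks become "bot"; the default keeps the real user agent."""
    return "bot" if is_bot_network(client_ip) else user_agent


def rate_limit_key(client_ip: str) -> str:
    """Second mapping: listed networks get a non-empty key; the default is ''."""
    return "bot" if is_bot_network(client_ip) else ""


def is_rate_limited(client_ip: str) -> bool:
    """Requests whose key is empty are not counted against the limit."""
    return rate_limit_key(client_ip) != ""


for ip in ("192.0.2.55", "198.51.100.7"):
    print(ip, effective_user_agent(ip, "Mozilla/5.0"), is_rate_limited(ip))
```

The point of the two defaults is that ordinary visitors pass through untouched: their user agent is unchanged and their empty key keeps them out of the rate limit zone.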
## 2022-07-06

- CGSpace went down and up a few times due to high load
- I found one host in Romania making very high speed requests with a normal user agent (`Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C`):

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort | uniq -c | sort -h | tail -n 10
516 142.132.248.90
525 157.55.39.234
587 66.249.66.21
593 95.108.213.59
1372 137.184.159.211
4776 54.195.118.125
5441 205.186.128.185
6267 45.5.186.2
15839 2a01:7e00::f03c:91ff:fe9a:3a37
36114 146.19.75.141
```

- I added 146.19.75.141 to the list of bot networks in nginx
- While looking at the logs I started thinking about Bing again
- They apparently [publish a list of all their networks](https://www.bing.com/toolbox/bingbot.json)
- I wrote a script to use `prips` to [print the IPs for each network](https://stackoverflow.com/a/52501093/1996540)
- The script is `bing-networks-to-ips.sh`
- From Bing's IPs alone I purged 145,403 hits... sheesh
- Delete two items on CGSpace for Margarita because she was getting the "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-..." error
- This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05
- Update some `cg.audience` metadata to use "Academics" instead of "Academicians":

```console
dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value='Academicians';
UPDATE 104
```

- I will also have to remove "Academicians" from input-forms.xml

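The step of expanding each Bing network into individual IPs, mentioned earlier in this entry, can also be sketched with Python's standard `ipaddress` module in place of `prips`. This is not the `bing-networks-to-ips.sh` script; the `prefixes` and `ipv4Prefix` key names are an assumption about the layout of the published JSON, and the sample uses documentation ranges.

```python
import ipaddress
import json

# Hypothetical excerpt in the assumed shape of the published network list;
# the "prefixes"/"ipv4Prefix" key names are an assumption, not verified here.
sample_json = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/30"}, {"ipv4Prefix": "198.51.100.8/31"}]}
"""


def expand_networks(document: str):
    """Yield every address in each IPv4 prefix, first to last."""
    for entry in json.loads(document)["prefixes"]:
        prefix = entry.get("ipv4Prefix")
        if prefix is None:
            continue  # skip entries without an IPv4 prefix
        # Iterating a network yields all of its addresses, including the
        # network and broadcast addresses.
        for address in ipaddress.ip_network(prefix):
            yield str(address)


for ip in expand_networks(sample_json):
    print(ip)
```

The resulting one-address-per-line output is the kind of list that can then be fed to whatever tool purges hits by IP.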
## 2022-07-07

- Finalize lists of non-AGROVOC subjects in CGSpace that I started last week
- I used the [SQL helper functions](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) to find the collections where each term was used:

```console
localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = 'water demand' GROUP BY collection ORDER BY count DESC LIMIT 5;
collection │ count
─────────────┼───────
10568/36178 │ 56
10568/36185 │ 46
10568/36181 │ 35
10568/36188 │ 28
10568/36179 │ 21
(5 rows)
```

- For now I only did terms from my list that had 100 or more occurrences in CGSpace
- This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud to evaluate for possible inclusion in AGROVOC
- Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace
- We want to remove them from the submission form to create space for new fields
- Update one term I noticed people using that was close to AGROVOC:

```console
dspace=# UPDATE metadatavalue SET text_value='development policies' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value='development policy';
UPDATE 108
```

- After contacting some editors I removed some old metadata fields from the submission form and browse indexes:
  - Bioversity subject (`cg.subject.bioversity`)
  - CCAFS phase 1 project tag (`cg.identifier.ccafsproject`)
  - CIAT project tag (`cg.identifier.ciatproject`)
  - CIAT subject (`cg.subject.ciat`)
- Work on cleaning and proofing forty-six AfricaRice items for CGSpace
- Last week we identified some duplicates so I removed those
- The data is of mediocre quality
- I've been fixing citations (nitpick), adding licenses, adding volume/issue/extent, fixing DOIs, and adding some AGROVOC subjects
- I even found titles that have typos, looking something like OCR errors...

## 2022-07-08

- Finalize the cleaning and proofing of AfricaRice records
- I found two suspicious items that claim to have been published but I can't find in the respective journals, so I removed those
- I uploaded the forty-four items to [DSpace Test](https://dspacetest.cgiar.org/handle/10568/119135)
- Margarita from CCAFS said they are no longer using the CCAFS subject or CCAFS phase 2 project tag
- I removed these from the `input-forms.xml` and Discovery facets:
  - cg.identifier.ccafsprojectpii
  - cg.subject.ccafs
- For now we will keep them in the search filters