- I learned how to use the Levenshtein functions in PostgreSQL
- The thing is that these functions are limited to 255 characters in PostgreSQL, so you need to truncate the strings before comparing them
- Also, the trgm functions I've used before are case insensitive, but Levenshtein is not, so you need to make sure to lowercase both strings first
<!--more-->
- A working query checking for duplicates in the recent AfricaRice items is:
```console
localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3;
                                        text_value
─────────────────────────────────────────────────────────────────────────────────────────
 International trade and exotic pests: the risks for biodiversity and African economies
(1 row)
Time: 399.751 ms
```
- There is a great [blog post discussing Soundex with Levenshtein](https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql) and creating indexes to make them faster
- I want to do some proper checks of accuracy and speed against my trigram method
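- That post pre-filters candidates with `soundex()` (from the same `fuzzystrmatch` extension as `levenshtein`) and builds an index on it; a minimal sketch of the idea, which I haven't tested yet (the index name and example term are mine):

```console
localhost/dspace= ☘ CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
localhost/dspace= ☘ CREATE INDEX metadatavalue_soundex_idx ON metadatavalue (soundex(text_value));
localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE soundex(text_value) = soundex('Water demand') AND levenshtein_less_equal(LOWER('water demand'), LEFT(LOWER(text_value), 255), 3) <= 3;
```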
- This seems low, so the load must have come from the request patterns of certain visitors
- 64.39.98.251 is Qualys, and I'm debating blocking [all their IPs](https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm) using a geo block in nginx (need to test)
- The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover
- 64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo
- I ran all system updates and rebooted the server (could have just restarted PostgreSQL but I thought I might as well do everything)
- I implemented a geo mapping for both the user agent mapping and the nginx `limit_req_zone` by extracting the networks into an external file and including it in two different geo blocks
- This is clever and relies on the fact that we can use defaults in both cases
- First, we map the user agent of requests from these networks to "bot" so that Tomcat and Solr handle them accordingly
- Second, we use this as the key in a `limit_req_zone`, which relies on a default mapping of '' (requests with an empty key are not rate limited), as in the sketch below
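- A sketch of the two blocks, assuming `bot-networks.conf` contains lines of the form `CIDR bot;` (the zone name and rate here are placeholders, and as I found on 2022-07-14 below, the `$http_user_agent` default turned out not to work):

```console
# Both geo blocks share the same list of networks, with different defaults
geo $ua {
    default $http_user_agent;
    include /etc/nginx/bot-networks.conf;
}

geo $limit_bots {
    default '';
    include /etc/nginx/bot-networks.conf;
}

# Requests with an empty key are not rate limited at all
limit_req_zone $limit_bots zone=bots:10m rate=1r/s;
```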
- I noticed that CIP uploaded a number of Georgian presentations with `dcterms.language` set to English and Other so I changed them to "ka"
- Perhaps we need to update our list of languages to include all of them instead of just the most common ones
- I wrote a script `ilri/iso-639-value-pairs.py` to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to `input-forms.xml`
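- The core of the script is just iterating over pycountry's languages and keeping the ones that have a two-letter code, roughly like this (the exact XML formatting in the real script may differ):

```python
# Rough sketch: print input-forms.xml value-pairs for all ISO 639-1 languages
import pycountry

for language in pycountry.languages:
    # Only ISO 639-1 languages have a two-letter alpha_2 code
    if hasattr(language, "alpha_2"):
        print(f"<pair><displayed-value>{language.name}</displayed-value>"
              f"<stored-value>{language.alpha_2}</stored-value></pair>")
```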
- CGSpace went down and up a few times due to high load
- I found one host in Romania making very high-speed requests with a normal user agent (`Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)`)
- I added 146.19.75.141 to the list of bot networks in nginx
- While looking at the logs I started thinking about Bing again
- They apparently [publish a list of all their networks](https://www.bing.com/toolbox/bingbot.json)
- I wrote a script to use `prips` to [print the IPs for each network](https://stackoverflow.com/a/52501093/1996540)
- The script is `bing-networks-to-ips.sh`
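- Roughly like this, assuming Bing's JSON has the same `prefixes` / `ipv4Prefix` structure as Google's equivalent file:

```console
#!/usr/bin/env bash
# Sketch of bing-networks-to-ips.sh: expand each IPv4 CIDR into individual IPs
curl -s 'https://www.bing.com/toolbox/bingbot.json' \
    | jq -r '.prefixes[].ipv4Prefix | select(. != null)' \
    | while read -r network; do
        prips "$network"
    done
```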
- From Bing's IPs alone I purged 145,403 hits... sheesh
- Delete two items on CGSpace for Margarita because she was getting the "Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-..." error
- This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05
- Update some `cg.audience` metadata to use "Academics" instead of "Academicians":
```console
dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value='Academicians';
UPDATE 104
```
- I will also have to remove "Academicians" from `input-forms.xml`
- Finalize lists of non-AGROVOC subjects in CGSpace that I started last week
- I used the [SQL helper functions](https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6) to find the collections where each term was used:
```console
localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = 'water demand' GROUP BY collection ORDER BY count DESC LIMIT 5;
collection │ count
─────────────┼───────
10568/36178 │ 56
10568/36185 │ 46
10568/36181 │ 35
10568/36188 │ 28
10568/36179 │ 21
(5 rows)
```
- For now I only did terms from my list that had 100 or more occurrences in CGSpace
- This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud to evaluate for possible inclusion in AGROVOC
- Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace
- We want to remove them from the submission form to create space for new fields
- Update one term I noticed people using that was close to AGROVOC:
```console
dspace=# UPDATE metadatavalue SET text_value='development policies' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value='development policy';
UPDATE 108
```
- After contacting some editors I removed some old metadata fields from the submission form and browse indexes:
- Bioversity subject (`cg.subject.bioversity`)
- CCAFS phase 1 project tag (`cg.identifier.ccafsproject`)
- CIAT project tag (`cg.identifier.ciatproject`)
- CIAT subject (`cg.subject.ciat`)
- Work on cleaning and proofing forty-six AfricaRice items for CGSpace
- Last week we identified some duplicates so I removed those
- The data is of mediocre quality
- I've been fixing citations (nitpick), adding licenses, adding volume/issue/extent, fixing DOIs, and adding some AGROVOC subjects
- I even found titles with typos that look like OCR errors...
## 2022-07-08
- Finalize the cleaning and proofing of AfricaRice records
- I found two suspicious items that claim to have been published but I can't find in the respective journals, so I removed those
- I uploaded the forty-four items to [DSpace Test](https://dspacetest.cgiar.org/handle/10568/119135)
- Margarita from CCAFS said they are no longer using the CCAFS subject or CCAFS phase 2 project tag
- I removed these from `input-forms.xml` and the Discovery facets
- I modified my `check-duplicates.py` script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents)
- I want to use this with the MARLO innovation reports, to find related publications and working papers on CGSpace
- I am curious to see how the similarity scores compare to those from trgm... perhaps we don't need them actually
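- For reference, the spacy part of the check is basically this (assuming the `en_core_web_md` model, since the small "sm" models don't ship word vectors):

```python
# Minimal sketch of a spacy document similarity check
import spacy

nlp = spacy.load("en_core_web_md")

doc1 = nlp("International trade and exotic pests: the risks for biodiversity")
doc2 = nlp("International Trade and Exotic Pests: The Risks for Biodiversity")

# Cosine similarity of the averaged word vectors, roughly 0.0 to 1.0
print(doc1.similarity(doc2))
```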
- Deploy latest changes to submission form, Discovery, and browse on CGSpace
- Also run all system updates and reboot the host
- Fix 152 `dcterms.relation` that are using "cgspace.cgiar.org" links instead of handles:
```console
UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$';
```
- Adjust collection permissions on the CIFOR publications collection so Vika can submit without approval
## 2022-07-14
- Someone on the DSpace Slack mentioned having issues with the database configuration in DSpace 7.3
- The reason is apparently that the default `db.dialect` changed from "org.dspace.storage.rdbms.hibernate.postgres.DSpacePostgreSQL82Dialect" to "org.hibernate.dialect.PostgreSQL94Dialect" as a result of a Hibernate update
- Then I was getting more errors starting the backend server in Tomcat, but the issue was that the backend server needs Solr to be up first!
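- Presumably the fix for anyone carrying the old value in their `local.cfg` is to update it to the new default:

```console
# Old Hibernate dialect from before DSpace 7.3:
#db.dialect = org.dspace.storage.rdbms.hibernate.postgres.DSpacePostgreSQL82Dialect
# New default after the Hibernate update in 7.3:
db.dialect = org.hibernate.dialect.PostgreSQL94Dialect
```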
- I see some Hetzner IPs in the top users today, but most of the requests are getting HTTP 503 because of the changes I made last week
- I see 137.184.159.211, which is on Digital Ocean, and the DNS is apparently iitawpsite.iita.org
- I've seen their user agent before, but I don't think I knew it was IITA: "GuzzleHttp/6.3.3 curl/7.84.0 PHP/7.4.30"
- I already have something in nginx to mark Guzzle as a bot, but interestingly its requests show up in Solr with the literal string `$http_user_agent`, so there is a logic error in my nginx config
- Ouch, the logic error seems to be this:
```console
geo $ua {
    default $http_user_agent;
    include /etc/nginx/bot-networks.conf;
}
```
- After some testing on DSpace Test I see that this is actually setting the default user agent to a literal `$http_user_agent`
- The [nginx map docs](http://nginx.org/en/docs/http/ngx_http_map_module.html) say:
> The resulting value can contain text, variable (0.9.0), and their combination (1.11.0).
- But I can't get it to work, neither for the default value nor for matching my IP...
- Reading more about nginx's geo/map and doing some tests on DSpace Test, it appears that the [geo module cannot do dynamic values](https://stackoverflow.com/questions/47011497/nginx-geo-module-wont-use-variables)
- So this issue with the literal `$http_user_agent` is due to the geo block I put in place earlier this month
- I reworked the logic so that the geo block sets "bot" or an empty string depending on whether the network matches, then reused that value in a map that passes the client's real user agent through when geo set it to the empty string
- This allows me to accomplish the original goal while still only using one bot-networks.conf file for the `limit_req_zone` and the user agent mapping that we pass to Tomcat
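- The reworked configuration looks something like this (the variable names, zone name, and rate are mine):

```console
# geo can only set static values, so just flag matching networks with "bot"
geo $bot_network {
    default '';
    include /etc/nginx/bot-networks.conf;
}

# map *can* use a variable in the resulting value, so fall back to the
# client's real user agent when the request is not from a bot network
map $bot_network $ua {
    'bot'    'bot';
    default  $http_user_agent;
}

# An empty key means requests from non-bot networks are not rate limited
limit_req_zone $bot_network zone=bots:10m rate=1r/s;
```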
- Unfortunately this means I will have hundreds of thousands of requests in Solr with a literal `$http_user_agent`
- I might try to purge some by enumerating all the networks in my block file and running them through `check-spider-ip-hits.sh`
- Review the AfricaRice records from earlier this month again
- I found one more duplicate and one more suspicious item, so the total after removing those is now forty-two
- I took all the ~560 IPs that had hits so far in `check-spider-ip-hits.sh` above (about 270,000 into the list of 1,946,968 above) and ran them directly on CGSpace
- This purged 199,032 hits from Solr, very many of which were from Qualys, but also from that Chinese bot on 124.17.34.0/24 that was grabbing PDFs a few years ago, which I had blocked in nginx but never purged the hits from
- Then I deleted all IPs up to the last one where I found hits in the large file of 1,946,968 IPs and restarted the script
## 2022-07-20
- Did a few more minor edits to the forty-two AfricaRice records (including generating thumbnails for the handful that are Creative Commons licensed) then did a test import on my local instance
- Once it worked well I did an import to CGSpace:
```console
$ dspace import -a -e fuuu@example.com -m 2022-07-20-africarice.map -s /tmp/SimpleArchiveFormat
```
- Also make edits to ~62 affiliations on CGSpace because I noticed they were messed up
- Extract another ~1,600 IPs that had hits since I started the second round of `check-spider-ip-hits.sh` yesterday and purge another 303,594 hits
- This is about 999,846 into the original list of 1,946,968 from yesterday
- A metric fuck ton of the IPs in this batch were from Hetzner
## 2022-07-21
- Extract another ~2,100 IPs that had hits since I started the third round of `check-spider-ip-hits.sh` last night and purge another 763,843 hits
- This is about 1,441,221 into the original list of 1,946,968 from two days ago
- Again these are overwhelmingly Hetzner (not surprising since my bot-networks.conf file in nginx is mostly Hetzner)