July, 2022

Sat Jul 02, 2022 in Notes

2022-07-02

I learned how to use the Levenshtein functions in PostgreSQL
- These functions accept at most 255 characters per string in PostgreSQL, so you need to truncate the strings before comparing
- Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lowercase both strings first

A working query checking for duplicates in the recent AfricaRice items is:

localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE  dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3;
                                       text_value                                       
────────────────────────────────────────────────────────────────────────────────────────
 International trade and exotic pests: the risks for biodiversity and African economies
(1 row)

Time: 399.751 ms
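
The two caveats above (the 255-character limit and case sensitivity) can be illustrated outside the database. This is a minimal Python sketch, not the fuzzystrmatch implementation: the names normalize, levenshtein, and is_near_duplicate are hypothetical, and truncating both strings is an assumption (the query above only truncates the stored value).

```python
MAX_LEN = 255  # the input length limit mentioned above


def normalize(title: str) -> str:
    """Lowercase, then truncate, mirroring LEFT(LOWER(text_value), 255)."""
    return title.lower()[:MAX_LEN]


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    """True when the normalized titles are within max_distance edits."""
    return levenshtein(normalize(a), normalize(b)) <= max_distance
```

Without the lowercasing step the two AfricaRice titles above would differ by one edit for every capital letter that changed, which is enough to push them past a threshold of 3.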

There is a great blog post discussing Soundex with Levenshtein and creating indexes to make them faster
I want to do some proper checks of accuracy and speed against my trigram method

2022-07-03

Start a harvest on AReS

2022-07-04

Linode told me that CGSpace had high load yesterday
- I also got some up and down notices from UptimeRobot
- Looking now, I see there was a very high CPU and database pool load, but a mostly normal DSpace session count

[Graphs: CPU load (day) and JDBC pool (day)]

Seems we have some old database transactions since 2022-06-27:

[Graphs: PostgreSQL locks (week) and PostgreSQL query length (week)]

Looking at the top connections to nginx yesterday:

# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq -c | sort -h | tail
   1132 64.124.8.34
   1146 2a01:4f8:1c17:5550::1
   1380 137.184.159.211
   1533 64.124.8.59
   4013 80.248.237.167
   4776 54.195.118.125
  10482 45.5.186.2
  11177 172.104.229.92
  15855 2a01:7e00::f03c:91ff:fe9a:3a37
  22179 64.39.98.251

And the total number of unique IPs:

# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort -u | wc -l
6952
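
For reference, the two awk pipelines above (requests per client and number of unique clients) can be sketched in Python. This is only an illustration: it assumes the client address is the first whitespace-separated field of each log line, and the function names are hypothetical.

```python
from collections import Counter


def client_ips(lines):
    """Yield the first whitespace-separated field of each non-empty line."""
    for line in lines:
        fields = line.split()
        if fields:
            yield fields[0]


def top_clients(lines, n=10):
    """Return (ip, hits) pairs for the n busiest clients, busiest first."""
    return Counter(client_ips(lines)).most_common(n)


def unique_clients(lines):
    """Number of distinct client addresses, like `sort -u | wc -l`."""
    return len(set(client_ips(lines)))
```

Note that most_common lists the busiest client first, while `sort -h | tail` lists it last.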

This seems low, so the load must have come from the request patterns of certain visitors
- 64.39.98.251 is Qualys, and I’m debating blocking all their IPs using a geo block in nginx (need to test)
- The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover
- 64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo
I ran all system updates and rebooted the server (could have just restarted PostgreSQL but I thought I might as well do everything)
I implemented a geo mapping for both the user agent mapping and the nginx limit_req_zone by extracting the networks into an external file and including it in two different geo blocks
- This is clever and relies on the fact that we can use defaults in both cases
- First, we map the user agent of requests from these networks to “bot” so that Tomcat and Solr handle them accordingly
- Second, we use this as a key in a limit_req_zone, which relies on a default mapping of ’’ (nginx does not count requests whose key is empty)
I noticed that CIP uploaded a number of Georgian presentations with dcterms.language set to English and Other so I changed them to “ka”
- Perhaps we need to update our list of languages to include all instead of the most common ones
I wrote a script ilri/iso-639-value-pairs.py to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to input-forms.xml
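
A rough sketch of the idea, not the actual ilri/iso-639-value-pairs.py: it assumes each language entry has a name attribute and only sometimes an alpha_2 attribute (which is how I understand the entries in pycountry.languages to behave), and it emits the pair elements used in input-forms.xml value lists. The function name value_pairs is hypothetical.

```python
from xml.sax.saxutils import escape


def value_pairs(languages):
    """Build <pair> elements for languages that have a two-letter code.

    `languages` is any iterable of objects with a `name` attribute and an
    optional `alpha_2` attribute (assumed shape of pycountry.languages).
    """
    pairs = []
    for language in languages:
        code = getattr(language, "alpha_2", None)
        if code is None:
            continue  # no ISO 639-1 (Alpha 2) code, so skip it
        pairs.append(
            "<pair>"
            f"<displayed-value>{escape(language.name)}</displayed-value>"
            f"<stored-value>{escape(code)}</stored-value>"
            "</pair>"
        )
    return pairs
```

With pycountry installed this would presumably be called as value_pairs(pycountry.languages) and the result pasted into the relevant value-pairs list.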

2022-07-06

CGSpace went down and up a few times due to high load
- I found one host in Romania making very high speed requests with a normal user agent (Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)):

# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort | uniq -c | sort -h | tail -n 10
    516 142.132.248.90
    525 157.55.39.234
    587 66.249.66.21
    593 95.108.213.59
   1372 137.184.159.211
   4776 54.195.118.125
   5441 205.186.128.185
   6267 45.5.186.2
  15839 2a01:7e00::f03c:91ff:fe9a:3a37
  36114 146.19.75.141

I added 146.19.75.141 to the list of bot networks in nginx
While looking at the logs I started thinking about Bing again
- They apparently publish a list of all their networks
- I wrote a script to use prips to print the IPs for each network
- The script is bing-networks-to-ips.sh
- From Bing’s IPs alone I purged 145,403 hits… sheesh
Delete two items on CGSpace for Margarita because she was getting the “Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-…” error
- This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05
Update some cg.audience metadata to use “Academics” instead of “Academicians”:

dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value='Academicians';
UPDATE 104

I will also have to remove “Academicians” from input-forms.xml

2022-07-07

Finalize lists of non-AGROVOC subjects in CGSpace that I started last week
- I used the SQL helper functions to find the collections where each term was used:

localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = 'water demand' GROUP BY collection ORDER BY count DESC LIMIT 5;
 collection  │ count 
─────────────┼───────
 10568/36178 │    56
 10568/36185 │    46
 10568/36181 │    35
 10568/36188 │    28
 10568/36179 │    21
(5 rows)

For now I only did terms from my list that had 100 or more occurrences in CGSpace
- This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud to evaluate for possible inclusion in AGROVOC
Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace
- We want to remove them from the submission form to create space for new fields
Update one term I noticed people using that was close to AGROVOC:

dspace=# UPDATE metadatavalue SET text_value='development policies' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value='development policy';
UPDATE 108

After contacting some editors I removed some old metadata fields from the submission form and browse indexes:
- Bioversity subject (cg.subject.bioversity)
- CCAFS phase 1 project tag (cg.identifier.ccafsproject)
- CIAT project tag (cg.identifier.ciatproject)
- CIAT subject (cg.subject.ciat)
Work on cleaning and proofing forty-six AfricaRice items for CGSpace
- Last week we identified some duplicates so I removed those
- The data is of mediocre quality
- I’ve been fixing citations (nitpick), adding licenses, adding volume/issue/extent, fixing DOIs, and adding some AGROVOC subjects
- I even found titles that have typos, looking something like OCR errors…

2022-07-08

Finalize the cleaning and proofing of AfricaRice records
- I found two suspicious items that claim to have been published but I can’t find in the respective journals, so I removed those
- I uploaded the forty-four items to DSpace Test
Margarita from CCAFS said they are no longer using the CCAFS subject or CCAFS phase 2 project tag
- I removed these from the input-forms.xml and Discovery facets:
  - cg.identifier.ccafsprojectpii
  - cg.subject.cifor
- For now we will keep them in the search filters
I modified my check-duplicates.py script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents)
- I want to use this with the MARLO innovation reports, to find related publications and working papers on CGSpace
- I am curious to see how the similarity scores compare to those from trgm… perhaps we don’t need them actually
Deploy latest changes to submission form, Discovery, and browse on CGSpace
- Also run all system updates and reboot the host
Fix 152 dcterms.relation that are using “cgspace.cgiar.org” links instead of handles:

UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$';
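
To check what the substitution does before running it on the database, the same pattern can be tried in Python. This is a sketch only; the function name to_handle_url and the handle in the example are hypothetical.

```python
import re

# Same pattern as the REGEXP_REPLACE above: capture the handle at the end
HANDLE_LINK = re.compile(r".*cgspace\.cgiar\.org/handle/(\d+/\d+)$")


def to_handle_url(value: str) -> str:
    """Rewrite a CGSpace handle link as an hdl.handle.net URL.

    Values that do not end in a handle are returned unchanged.
    """
    return HANDLE_LINK.sub(r"https://hdl.handle.net/\1", value)


print(to_handle_url("https://cgspace.cgiar.org/handle/10568/12345"))
```

Because the pattern is anchored with $, links with anything after the handle (a trailing slash, a query string) are left alone, which matches the WHERE clause in the SQL.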

2022-07-10

UptimeRobot says that CGSpace is down
- I see high load around 22, high CPU around 800%
- Doesn’t seem to be a lot of unique IPs:

# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l
2243

Looking at the top twenty I see some usual IPs, but some new ones on Hetzner that are using many DSpace sessions:

$ grep 65.109.2.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1613
$ grep 95.216.174.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1696
$ grep 65.109.15.213 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1708
$ grep 65.108.80.78 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1830
$ grep 65.108.95.23 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1811
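
The repeated grep pipelines above can be folded into one small function. A minimal sketch, assuming the log lines contain session_id=…:ip_addr= exactly as in the grep pattern; the function name unique_sessions is hypothetical.

```python
import re

# Same shape as the grep -oE pattern, with the session ID captured
SESSION = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=")


def unique_sessions(lines, ip):
    """Count distinct session IDs on log lines that mention `ip`."""
    sessions = set()
    for line in lines:
        if ip in line:  # literal substring match
            sessions.update(SESSION.findall(line))
    return len(sessions)
```

One difference from the shell version: grep treats the dots in the address as "any character", while the substring test here is literal.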

[Graph: DSpace sessions (week)]

These IPs are using normal-looking user agents:
- Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0.1
- Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/45.0
- Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1
- Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.85 Safari/537.36
I will add networks I’m seeing now to nginx’s bot-networks.conf for now (not all of Hetzner) and purge the hits later:
- 65.108.0.0/16
- 65.21.0.0/16
- 95.216.0.0/16
- 135.181.0.0/16
- 138.201.0.0/16
I think I’m going to get to a point where I categorize all commercial subnets as bots by default and then whitelist those we need
Sheesh, there are a bunch more IPv6 addresses also on Hetzner:

# awk '{print $1}' /var/log/nginx/{access,library-access}.log | sort | grep 2a01:4f9 | uniq -c | sort -h
      1 2a01:4f9:6a:1c2b::2
      2 2a01:4f9:2b:5a8::2
      2 2a01:4f9:4b:4495::2
     96 2a01:4f9:c010:518c::1
    137 2a01:4f9:c010:a9bc::1
    142 2a01:4f9:c010:58c9::1
    142 2a01:4f9:c010:58ea::1
    144 2a01:4f9:c010:58eb::1
    145 2a01:4f9:c010:6ff8::1
    148 2a01:4f9:c010:5190::1
    149 2a01:4f9:c010:7d6d::1
    153 2a01:4f9:c010:5226::1
    156 2a01:4f9:c010:7f74::1
    160 2a01:4f9:c010:5188::1
    161 2a01:4f9:c010:58e5::1
    168 2a01:4f9:c010:58ed::1
    170 2a01:4f9:c010:548e::1
    170 2a01:4f9:c010:8c97::1
    175 2a01:4f9:c010:58c8::1
    175 2a01:4f9:c010:aada::1
    182 2a01:4f9:c010:58ec::1
    182 2a01:4f9:c010:ae8c::1
    502 2a01:4f9:c010:ee57::1
    530 2a01:4f9:c011:567a::1
    535 2a01:4f9:c010:d04e::1
    539 2a01:4f9:c010:3d9a::1
    586 2a01:4f9:c010:93db::1
    593 2a01:4f9:c010:a04a::1
    601 2a01:4f9:c011:4166::1
    607 2a01:4f9:c010:9881::1
    640 2a01:4f9:c010:87fb::1
    648 2a01:4f9:c010:e680::1
   1141 2a01:4f9:3a:2696::2
   1146 2a01:4f9:3a:2555::2
   3207 2a01:4f9:3a:2c19::2

Maybe it’s time I ban all of Hetzner… sheesh.
I left for a few hours and the server was going up and down the whole time, with CPU and database load still very high when I got back

[Graph: CPU (day)]

I am not sure what’s going on
- I extracted all the IPs and used resolve-addresses-geoip2.py to analyze them and extract all the Hetzner networks and block them
- It’s 181 IPs on Hetzner…
I rebooted the server to see if it was just some stuck locks in PostgreSQL…
The load is still higher than I would expect, and after a few more hours I see more Hetzner IPs coming through? Two more subnets to block
Start a harvest on AReS

2022-07-12

Update an incorrect ORCID identifier for Alliance
Adjust collection permissions on CIFOR publications collection so Vika can submit without approval

2022-07-14

Someone on the DSpace Slack mentioned having issues with the database configuration in DSpace 7.3
- The reason is apparently that the default db.dialect changed from “org.dspace.storage.rdbms.hibernate.postgres.DSpacePostgreSQL82Dialect” to “org.hibernate.dialect.PostgreSQL94Dialect” as a result of a Hibernate update
Then I was getting more errors starting the backend server in Tomcat, but the issue was that the backend server needs Solr to be up first!

2022-07-17

Start a harvest on AReS around 3:30PM
Later in the evening I see CGSpace was going down and up (not as bad as last Sunday) with around 18.0 load…
I see very high CPU usage:

[Graph: CPU (day)]

But DSpace sessions are normal (not like last weekend):

[Graph: DSpace sessions (week)]

I see some Hetzner IPs in the top users today, but most of the requests are getting HTTP 503 because of the changes I made last week
I see 137.184.159.211, which is on Digital Ocean, and the DNS is apparently iitawpsite.iita.org
- I’ve seen their user agent before, but I don’t think I knew it was IITA: “GuzzleHttp/6.3.3 curl/7.84.0 PHP/7.4.30”
- I already have something in nginx to mark Guzzle as a bot, but interestingly it shows up in Solr as $http_user_agent so there is a logic error in my nginx config
Ouch, the logic error seems to be this:

geo $ua {
    default          $http_user_agent;

    include /etc/nginx/bot-networks.conf;
}

After some testing on DSpace Test I see that this is actually setting the default user agent to a literal $http_user_agent
The nginx map docs say:

The resulting value can contain text, variable (0.9.0), and their combination (1.11.0).

But I can’t get it to work, neither for the default value nor for matching my IP…
- I will have to ask on the nginx mailing list
The total number of requests and unique hosts was not even very high (I ran the commands below around midnight, so the numbers cover almost the whole day):

# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l
2776
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | wc -l
40325

2022-07-18

Reading more about nginx’s geo/map and doing some tests on DSpace Test, it appears that the geo module cannot do dynamic values
- So this issue with the literal $http_user_agent is due to the geo block I put in place earlier this month
- I reworked the logic so that the geo block sets “bot” when a network matches and an empty string when it does not, and then I re-use that value in a mapping that passes through the host’s user agent when geo has set it to an empty string
- This allows me to accomplish the original goal while still only using one bot-networks.conf file for the limit_req_zone and the user agent mapping that we pass to Tomcat
- Unfortunately this means I will have hundreds of thousands of requests in Solr with a literal $http_user_agent
- I might try to purge some by enumerating all the networks in my block file and running them through check-spider-ip-hits.sh
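
The reworked two-stage logic can be modelled in Python to make the data flow clearer. This is only a model of the behaviour described above, not the nginx configuration itself: the function names are hypothetical, and the two networks are just examples taken from the list I added on 2022-07-10.

```python
import ipaddress

# Stand-in for the networks listed in bot-networks.conf (two examples only)
BOT_NETWORKS = [
    ipaddress.ip_network(network)
    for network in ("65.108.0.0/16", "95.216.0.0/16")
]


def geo_bot(remote_addr: str) -> str:
    """Stage 1 (geo): a constant value per network, '' by default."""
    address = ipaddress.ip_address(remote_addr)
    return "bot" if any(address in network for network in BOT_NETWORKS) else ""


def effective_user_agent(remote_addr: str, http_user_agent: str) -> str:
    """Stage 2 (map): pass the client's own user agent through when geo set ''."""
    return geo_bot(remote_addr) or http_user_agent


def rate_limit_key(remote_addr: str) -> str:
    """Key for limit_req_zone: '' means the request is not rate limited."""
    return geo_bot(remote_addr)
```

The point of the split is that stage 1 only ever returns constants, which is all geo can do, while the dynamic fallback to the real user agent happens in stage 2.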
I extracted all the IPs/subnets from bot-networks.conf and prepared them so I could enumerate their IPs
- I had to add /32 to all single IPs, which I did with this crazy vim invocation:

:g!/\/\d\+$/s/^\(\d\+\.\d\+\.\d\+\.\d\+\)$/\1\/32/

Explanation:
- g!: global, lines not matching (the opposite of g)
- /\/\d\+$/, pattern matching / with one or more digits at the end of the line
- s/^\(\d\+\.\d\+\.\d\+\.\d\+\)$/\1\/32/, for lines not matching above, capture the IPv4 address and add /32 at the end
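
For comparison, the same substitution can be sketched in Python (the function name add_host_prefix is hypothetical). Since the pattern is anchored at both ends, it cannot match a line that already ends in a prefix length, so no separate "lines not matching" guard is needed here.

```python
import re

# A line consisting only of a dotted-quad IPv4 address
BARE_IPV4 = re.compile(r"^(\d+\.\d+\.\d+\.\d+)$")


def add_host_prefix(lines):
    """Append /32 to lines that are a bare IPv4 address; leave others as-is."""
    return [BARE_IPV4.sub(r"\1/32", line) for line in lines]


print(add_host_prefix(["146.19.75.141", "65.108.0.0/16"]))
```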
Then I ran the list through prips to enumerate the IPs:

$ while read -r line; do prips "$line" | sed -e '1d; $d'; done < /tmp/bot-networks.conf > /tmp/bot-ips.txt
$ wc -l /tmp/bot-ips.txt                                                                                        
1946968 /tmp/bot-ips.txt
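
A Python sketch of the same enumeration using the standard ipaddress module instead of prips (the function name usable_addresses is hypothetical, and it assumes IPv4 networks small enough to hold in memory). It differs from the shell pipeline in one deliberate way: with a single line of input, sed '1d; $d' prints nothing, so a /32 produces no output there, whereas this sketch keeps single addresses.

```python
import ipaddress


def usable_addresses(cidr: str):
    """List a network's addresses without the first and last one.

    Networks with one or two addresses (/32 and /31) are returned whole,
    since trimming them would leave nothing.
    """
    network = ipaddress.ip_network(cidr, strict=False)
    addresses = [str(address) for address in network]
    if len(addresses) <= 2:
        return addresses
    return addresses[1:-1]
```

For example, usable_addresses("65.108.0.0/16") gives the 65,534 addresses between the network and broadcast addresses.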

I started running check-spider-ip-hits.sh with the 1946968 IPs and left it running in dry run mode

2022-07-19

Patrizio and Fabio emailed me to ask if their IP was banned from CGSpace
- It’s one of the Hetzner ones so I said yes definitely, and asked more about how they are using the API
Add ORCID identifiers for Ram Dhulipala, Lilian Wambua, and Dan Masiga to CGSpace and tag them and some other existing items:

dc.contributor.author,cg.creator.identifier
"Dhulipala, Ram K","Ram Dhulipala: 0000-0002-9720-3247"
"Dhulipala, Ram","Ram Dhulipala: 0000-0002-9720-3247"
"Dhulipala, R.","Ram Dhulipala: 0000-0002-9720-3247"
"Wambua, Lillian","Lillian Wambua: 0000-0003-3632-7411"
"Wambua, Lilian","Lillian Wambua: 0000-0003-3632-7411"
"Masiga, D.K.","Daniel Masiga: 0000-0001-7513-0887"
"Masiga, Daniel K.","Daniel Masiga: 0000-0001-7513-0887"
"Jores, Joerg","Joerg Jores: 0000-0003-3790-5746"
"Schieck, Elise","Elise Schieck: 0000-0003-1756-6337"
"Schieck, Elise G.","Elise Schieck: 0000-0003-1756-6337"
$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-07-19-add-orcids.csv -db dspace -u dspace -p 'fuuu'

Review the AfricaRice records from earlier this month again
- I found one more duplicate and one more suspicious item, so the total after removing those is now forty-two
I took all the ~560 IPs that had hits so far in check-spider-ip-hits.sh above (about 270,000 into the list of 1946968 above) and ran them directly on CGSpace
- This purged 199,032 hits from Solr, very many of which were from Qualys, but some also from that Chinese bot on 124.17.34.0/24 that was grabbing PDFs a few years ago (I blocked it in nginx back then, but never purged its hits)
- Then I deleted all IPs up to the last one where I found hits in the large file of 1946968 IPs and re-started the script

2022-07-20

Did a few more minor edits to the forty-two AfricaRice records (including generating thumbnails for the handful that are Creative Commons licensed) then did a test import on my local instance
- Once it worked well I did an import to CGSpace:

$ dspace import -a -e fuuu@example.com -m 2022-07-20-africarice.map -s /tmp/SimpleArchiveFormat

Also make edits to ~62 affiliations on CGSpace because I noticed they were messed up
Extract another ~1,600 IPs that had hits since I started the second round of check-spider-ip-hits.sh yesterday and purge another 303,594 hits
- This is about 999846 into the original list of 1946968 from yesterday
- A metric fuck ton of the IPs in this batch were from Hetzner

2022-07-21

Extract another ~2,100 IPs that had hits since I started the third round of check-spider-ip-hits.sh last night and purge another 763,843 hits
- This is about 1441221 into the original list of 1946968 from two days ago
- Again these are overwhelmingly Hetzner (not surprising since my bot-networks.conf file in nginx is mostly Hetzner)
I responded to my original request to Atmire about the log4j to reload4j migration in DSpace 6.4
- I had initially requested a comment from them in 2022-05
Extract another ~1,200 IPs that had hits from the fourth round of check-spider-ip-hits.sh earlier today and purge another 74,591 hits
- Now the list of IPs I enumerated from the nginx bot-networks.conf is finished

2022-07-22

I created a new Linode instance for testing DSpace 7
Jose from the CCAFS team sent me the final versions of 3,500+ Innovations, Policies, MELIAs, and OICRs from MARLO
I re-synced CGSpace with DSpace Test so I can have a newer snapshot of the production data there for testing the CCAFS MELIAs, OICRs, Policies, and Innovations
I re-created the tip-submit and tip-approve DSpace user accounts for Alliance’s new TIP submit tool and added them to the Alliance submitters and Alliance admins accounts respectively
Start working on updating the Ansible infrastructure playbooks for DSpace 7 stuff

2022-07-23

Start a harvest on AReS
More work on DSpace 7 related issues in the Ansible infrastructure playbooks

2022-07-24

More work on DSpace 7 related issues in the Ansible infrastructure playbooks

2022-07-25

More work on DSpace 7 related issues in the Ansible infrastructure playbooks
- I see that, for Solr, we will need to copy the DSpace configsets to the writable data directory rather than the default home dir
- The Taking Solr to production guide recommends keeping the unzipped code separate from the data, which we do in our Solr role already
- So that means we keep the unzipped code in /opt/solr-8.11.2, but the data directory in /var/solr/data, with the DSpace Solr cores here: /var/solr/data/configsets
- I’m not sure how to integrate that into my playbooks yet
Much to my surprise, Discovery indexing on DSpace 7 was really fast when I did it just now, apparently taking 40 minutes of wall clock time?!:

$ /usr/bin/time -v /home/dspace7/bin/dspace index-discovery -b
The script has started
(Re)building index from scratch.
Done with indexing
The script has completed
        Command being timed: "/home/dspace7/bin/dspace index-discovery -b"
        User time (seconds): 588.18
        System time (seconds): 91.26
        Percent of CPU this job got: 28%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 40:05.79
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 635380
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1513
        Minor (reclaiming a frame) page faults: 216412
        Voluntary context switches: 1671092
        Involuntary context switches: 744007
        Swaps: 0
        File system inputs: 4396880
        File system outputs: 74312
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
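
The 28% CPU figure follows from the other numbers in the output: it is user plus system time divided by the elapsed wall clock time. A quick check in Python, with the values copied from above (the function name cpu_percent is hypothetical):

```python
def cpu_percent(user_seconds: float, system_seconds: float, elapsed_seconds: float) -> float:
    """CPU share of a job: (user + system) / elapsed, as a percentage."""
    return 100 * (user_seconds + system_seconds) / elapsed_seconds


elapsed = 40 * 60 + 5.79  # 40:05.79 of wall clock time, in seconds
print(f"{cpu_percent(588.18, 91.26, elapsed):.0f}%")
```

So for most of those forty minutes the indexing process itself was not on the CPU, presumably because it was waiting on something else.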

Leroy from the Alliance wrote to say that the CIAT Library is back up so I might be able to download all the PDFs
- It had been shut down for a security reason a few months ago and we were planning to download them all and attach them to their relevant items on CGSpace
- I noticed one item that had the PDF already on CGSpace so I’ll need to consider that when I eventually do the import
I had to re-create the tip-submit and tip-approve accounts for Alliance on DSpace Test again
- After I created them last week they somehow got deleted…?!… I couldn’t find them or the mel-submit account either!