CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

July, 2022

2022-07-02

  • I learned how to use the Levenshtein functions in PostgreSQL
    • These functions only accept strings of up to 255 characters in PostgreSQL, so longer strings need to be truncated before comparing
    • Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so both strings need to be lowercased first
  • A working query checking for duplicates in the recent AfricaRice items is:
localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE  dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3;
                                       text_value                                       
────────────────────────────────────────────────────────────────────────────────────────
 International trade and exotic pests: the risks for biodiversity and African economies
(1 row)

Time: 399.751 ms
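  • To see why the lowercasing matters, here is a minimal Python sketch of the same comparison. It is only an illustration: the levenshtein and normalized_distance helpers are my own names, a textbook dynamic-programming edit distance and not the PostgreSQL implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance: insert, delete and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


MAX_LENGTH = 255  # the limit mentioned in the notes above


def normalized_distance(a: str, b: str) -> int:
    """Lowercase and truncate both strings before comparing."""
    return levenshtein(a.lower()[:MAX_LENGTH], b.lower()[:MAX_LENGTH])


submitted = "International Trade and Exotic Pests: The Risks for Biodiversity and African Economies"
existing = "International trade and exotic pests: the risks for biodiversity and African economies"

print(levenshtein(submitted, existing))          # 7, one substitution per capitalized letter
print(normalized_distance(submitted, existing))  # 0 once both are lowercased
```

  • With a threshold of 3, the case-sensitive distance of 7 would have missed this duplicate, while the lowercased comparison matches it exactly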

2022-07-03

  • Start a harvest on AReS

2022-07-04

  • Linode told me that CGSpace had high load yesterday
    • I also got some up and down notices from UptimeRobot
    • Looking now, I see there was a very high CPU and database pool load, but a mostly normal DSpace session count

[Figures: CPU load (day); JDBC pool (day)]

  • Seems we have some old database transactions since 2022-06-27:

[Figures: PostgreSQL locks (week); PostgreSQL query length (week)]

  • Looking at the top connections to nginx yesterday:
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort | uniq -c | sort -h | tail
   1132 64.124.8.34
   1146 2a01:4f8:1c17:5550::1
   1380 137.184.159.211
   1533 64.124.8.59
   4013 80.248.237.167
   4776 54.195.118.125
  10482 45.5.186.2
  11177 172.104.229.92
  15855 2a01:7e00::f03c:91ff:fe9a:3a37
  22179 64.39.98.251
  • And the total number of unique IPs:
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log.1 | sort -u | wc -l
6952
  • This seems low, so the load must have come from the request patterns of certain visitors
    • 64.39.98.251 is Qualys, and I’m debating blocking all their IPs using a geo block in nginx (need to test)
    • The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover
    • 64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo
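  • The two awk pipelines above (top client addresses and number of unique addresses) could also be sketched in Python. The function names and the sample log lines are made up for illustration, using documentation address ranges, not real entries from the nginx logs:

```python
from collections import Counter


def client_addresses(lines):
    """First whitespace-separated field of each non-empty line, like awk '{print $1}'."""
    return [line.split()[0] for line in lines if line.strip()]


def top_clients(lines, n=10):
    """The n most frequent client addresses, busiest first."""
    return Counter(client_addresses(lines)).most_common(n)


def unique_clients(lines):
    """Number of distinct client addresses, like sort -u | wc -l."""
    return len(set(client_addresses(lines)))


# Hypothetical log lines in a common nginx-style format
sample_lines = [
    '192.0.2.10 - - [03/Jul/2022:10:00:00 +0000] "GET /discover HTTP/1.1" 200 512',
    '2001:db8::1 - - [03/Jul/2022:10:00:01 +0000] "GET /rest/items HTTP/1.1" 200 128',
    '192.0.2.10 - - [03/Jul/2022:10:00:02 +0000] "GET /discover HTTP/1.1" 200 512',
    '198.51.100.7 - - [03/Jul/2022:10:00:03 +0000] "GET / HTTP/1.1" 200 64',
    '192.0.2.10 - - [03/Jul/2022:10:00:04 +0000] "GET /discover HTTP/1.1" 200 512',
    '2001:db8::1 - - [03/Jul/2022:10:00:05 +0000] "GET /oai/request HTTP/1.1" 200 256',
]

for address, count in top_clients(sample_lines, n=2):
    print(f"{count:7d} {address}")
print(unique_clients(sample_lines))
```

  • Unlike sort -h | tail, most_common lists the busiest clients first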
  • I ran all system updates and rebooted the server (could have just restarted PostgreSQL but I thought I might as well do everything)
  • I implemented a geo mapping for both the user agent mapping and the nginx limit_req_zone by extracting the bot networks into an external file and including that file in two different geo mapping blocks
    • This is clever and relies on the fact that we can use defaults in both cases
    • First, we map the user agent of requests from these networks to “bot” so that Tomcat and Solr handle them accordingly
    • Second, we use this as the key in a limit_req_zone, which relies on a default mapping of '' (nginx does not count requests whose key is empty against the limit)
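  • The logic of the two mappings can be modeled roughly in Python. This is only a sketch of the idea, not the nginx configuration itself: the network list, the function names, and the choice of the client address as the non-empty rate limit key are all assumptions for illustration:

```python
import ipaddress

# Hypothetical stand-in for the external networks file. These are
# documentation ranges, not the networks configured on the server.
BOT_NETWORKS = [
    ipaddress.ip_network(network)
    for network in ("192.0.2.0/24", "2001:db8::/32")
]


def in_bot_network(address: str) -> bool:
    """True if the client address falls inside any listed network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BOT_NETWORKS)


def effective_user_agent(address: str, user_agent: str) -> str:
    """First mapping: default to the real user agent, 'bot' for listed networks."""
    return "bot" if in_bot_network(address) else user_agent


def rate_limit_key(address: str) -> str:
    """Second mapping: default to '' (request not counted), non-empty otherwise.

    Using the client address as the key is an assumption of this sketch.
    """
    return address if in_bot_network(address) else ""


print(effective_user_agent("192.0.2.55", "Mozilla/5.0"))    # bot
print(effective_user_agent("198.51.100.7", "Mozilla/5.0"))  # Mozilla/5.0
print(repr(rate_limit_key("198.51.100.7")))                 # ''
```

  • Both functions share one network list, which mirrors including a single external file in both geo blocks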
  • I noticed that CIP uploaded a number of Georgian presentations with dcterms.language set to English and Other so I changed them to “ka”
    • Perhaps we need to update our list of languages to include all instead of the most common ones
  • I wrote a script ilri/iso-639-value-pairs.py to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to input-forms.xml
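  • A rough sketch of how such an extraction might look, not the actual ilri/iso-639-value-pairs.py script. The value_pairs function is a made-up name, and it assumes each language record exposes a name attribute and, only for ISO 639-1 languages, an alpha_2 attribute (which is my understanding of pycountry's records). Stub records are used here so the example runs without pycountry installed:

```python
from types import SimpleNamespace
from xml.sax.saxutils import escape


def value_pairs(languages) -> str:
    """Render name / two-letter code pairs as <pair> elements, sorted by name."""
    pairs = []
    for language in languages:
        alpha_2 = getattr(language, "alpha_2", None)
        if alpha_2 is None:
            continue  # no two-letter code, so not an ISO 639-1 language
        pairs.append((language.name, alpha_2))

    lines = []
    for name, code in sorted(pairs):
        lines.append("<pair>")
        lines.append(f"  <displayed-value>{escape(name)}</displayed-value>")
        lines.append(f"  <stored-value>{escape(code)}</stored-value>")
        lines.append("</pair>")
    return "\n".join(lines)


# Stub records standing in for the library's language objects
sample_languages = [
    SimpleNamespace(name="Georgian", alpha_2="ka", alpha_3="kat"),
    SimpleNamespace(name="English", alpha_2="en", alpha_3="eng"),
    SimpleNamespace(name="Ancient Greek (to 1453)", alpha_3="grc"),  # no alpha_2
]

print(value_pairs(sample_languages))
```

  • With pycountry installed, passing pycountry.languages instead of the stub list should produce the full set of pairs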

2022-07-06

  • CGSpace went down and up a few times due to high load
    • I found one host in Romania making requests at a very high rate with a normal user agent (Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)):
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort | uniq -c | sort -h | tail -n 10
    516 142.132.248.90
    525 157.55.39.234
    587 66.249.66.21
    593 95.108.213.59
   1372 137.184.159.211
   4776 54.195.118.125
   5441 205.186.128.185
   6267 45.5.186.2
  15839 2a01:7e00::f03c:91ff:fe9a:3a37
  36114 146.19.75.141
  • I added 146.19.75.141 to the list of bot networks in nginx
  • While looking at the logs I started thinking about Bing again
  • Delete two items on CGSpace for Margarita because she was getting the “Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-…” error
    • This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05
  • Update some cg.audience metadata to use “Academics” instead of “Academicians”:
dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value='Academicians';
UPDATE 104
  • I will also have to remove “Academicians” from input-forms.xml

2022-07-07

  • Finalize lists of non-AGROVOC subjects in CGSpace that I started last week
localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = 'water demand' GROUP BY collection ORDER BY count DESC LIMIT 5;
 collection  │ count 
─────────────┼───────
 10568/36178 │    56
 10568/36185 │    46
 10568/36181 │    35
 10568/36188 │    28
 10568/36179 │    21
(5 rows)
  • For now I only did terms from my list that had 100 or more occurrences in CGSpace
    • This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud to evaluate for possible inclusion in AGROVOC
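  • The threshold step can be sketched as a simple case-insensitive count. The frequent_terms helper and the sample values are made up for illustration and are not the real CGSpace data:

```python
from collections import Counter


def frequent_terms(values, minimum=100):
    """Count values case-insensitively and keep those seen at least `minimum` times."""
    counts = Counter(value.strip().lower() for value in values)
    return {term: count for term, count in counts.items() if count >= minimum}


# Hypothetical subject values; a low threshold keeps the example small
sample_values = ["Water demand"] * 3 + ["water demand"] * 2 + ["soil chemistry"]

print(frequent_terms(sample_values, minimum=5))  # {'water demand': 5}
```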
  • Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace
    • We want to remove them from the submission form to create space for new fields
  • Update one term I noticed people using that was close to AGROVOC:
dspace=# UPDATE metadatavalue SET text_value='development policies' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value='development policy';
UPDATE 108
  • After contacting some editors I removed some old metadata fields from the submission form and browse indexes:
    • Bioversity subject (cg.subject.bioversity)
    • CCAFS phase 1 project tag (cg.identifier.ccafsproject)
    • CIAT project tag (cg.identifier.ciatproject)
    • CIAT subject (cg.subject.ciat)
  • Work on cleaning and proofing forty-six AfricaRice items for CGSpace
    • Last week we identified some duplicates so I removed those
    • The data is of mediocre quality
    • I’ve been fixing citations (nitpick), adding licenses, adding volume/issue/extent, fixing DOIs, and adding some AGROVOC subjects
    • I even found titles that have typos, looking something like OCR errors…

2022-07-08

  • Finalize the cleaning and proofing of AfricaRice records
    • I found two suspicious items that claim to have been published but I can’t find in the respective journals, so I removed those
    • I uploaded the forty-four items to DSpace Test
  • Margarita from CCAFS said they are no longer using the CCAFS subject or CCAFS phase 2 project tag
    • I removed these from input-forms.xml and the Discovery facets:
      • cg.identifier.ccafsprojectpii
      • cg.subject.ccafs
    • For now we will keep them in the search filters