<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="July, 2022" /> <meta property="og:description" content="2022-07-02 I learned how to use the Levenshtein functions in PostgreSQL The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-07/" /> <meta property="article:published_time" content="2022-07-02T14:07:36+03:00" /> <meta property="article:modified_time" content="2022-07-07T10:02:04+03:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="July, 2022"/> <meta name="twitter:description" content="2022-07-02 I learned how to use the Levenshtein functions in PostgreSQL The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first "/> <meta name="generator" content="Hugo 0.101.0" /> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "July, 2022", "url": "https://alanorth.github.io/cgspace-notes/2022-07/", "wordCount": "1095", "datePublished": "2022-07-02T14:07:36+03:00", "dateModified": "2022-07-07T10:02:04+03:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-07/"> <title>July, 2022 | CGSpace Notes</title> <!-- combined, minified CSS --> <link 
href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous"> <!-- minified Font Awesome for SVG icons --> <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-07/">July, 2022</a></h2> <p class="blog-post-meta"> <time datetime="2022-07-02T14:07:36+03:00">Sat Jul 02, 2022</time> in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a> </p> </header> <h2 id="2022-07-02">2022-07-02</h2> <ul> <li>I learned how to use the Levenshtein functions in PostgreSQL <ul> <li>The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing</li> <li>Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first</li> </ul> </li> </ul> <ul> <li>A working query checking for 
duplicates in the recent AfricaRice items is:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3; </span></span><span style="display:flex;"><span> text_value </span></span><span style="display:flex;"><span>──────────────────────────────────────────────────────────────────────────────────────── </span></span><span style="display:flex;"><span> International trade and exotic pests: the risks for biodiversity and African economies </span></span><span style="display:flex;"><span>(1 row) </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 399.751 ms </span></span></code></pre></div><ul> <li>There is a great <a href="https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql">blog post discussing Soundex with Levenshtein</a> and creating indexes to make them faster</li> <li>I want to do some proper checks of accuracy and speed against my trigram method</li> </ul> <h2 id="2022-07-03">2022-07-03</h2> <ul> <li>Start a harvest on AReS</li> </ul> <h2 id="2022-07-04">2022-07-04</h2> <ul> <li>Linode told me that CGSpace had high load yesterday <ul> <li>I also got some up and down notices from UptimeRobot</li> <li>Looking now, I see there was a very high CPU and database pool load, but a mostly normal DSpace session count</li> </ul> </li> </ul> <p><img src="/cgspace-notes/2022/07/cpu-day.png" alt="CPU load day"> <img 
src="/cgspace-notes/2022/07/jmx_tomcat_dbpools-day.png" alt="JDBC pool day"></p> <ul> <li>Seems we have some old database transactions since 2022-06-27:</li> </ul> <p><img src="/cgspace-notes/2022/07/postgres_locks_ALL-week.png" alt="PostgreSQL locks week"> <img src="/cgspace-notes/2022/07/postgres_querylength_ALL-week.png" alt="PostgreSQL query length week"></p> <ul> <li>Looking at the top connections to nginx yesterday:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.1 | sort | uniq -c | sort -h | tail </span></span><span style="display:flex;"><span> 1132 64.124.8.34 </span></span><span style="display:flex;"><span> 1146 2a01:4f8:1c17:5550::1 </span></span><span style="display:flex;"><span> 1380 137.184.159.211 </span></span><span style="display:flex;"><span> 1533 64.124.8.59 </span></span><span style="display:flex;"><span> 4013 80.248.237.167 </span></span><span style="display:flex;"><span> 4776 54.195.118.125 </span></span><span style="display:flex;"><span> 10482 45.5.186.2 </span></span><span style="display:flex;"><span> 11177 172.104.229.92 </span></span><span style="display:flex;"><span> 15855 2a01:7e00::f03c:91ff:fe9a:3a37 </span></span><span style="display:flex;"><span> 22179 64.39.98.251 </span></span></code></pre></div><ul> <li>And the total number of unique IPs:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span 
style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.1 | sort -u | wc -l </span></span><span style="display:flex;"><span>6952 </span></span></code></pre></div><ul> <li>This seems low, so the high load must have come from the request patterns of certain visitors <ul> <li>64.39.98.251 is Qualys, and I’m debating blocking <a href="https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm">all their IPs</a> using a geo block in nginx (need to test)</li> <li>The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover</li> <li>64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo</li> </ul> </li> <li>I ran all system updates and rebooted the server (could have just restarted PostgreSQL but I thought I might as well do everything)</li> <li>I implemented a geo mapping for the user agent mapping AND the nginx <code>limit_req_zone</code> by extracting the networks into an external file and including it in two different geo mapping blocks <ul> <li>This is clever and relies on the fact that we can use defaults in both cases</li> <li>First, we map the user agent of requests from these networks to “bot” so that Tomcat and Solr handle them accordingly</li> <li>Second, we use this as a key in a <code>limit_req_zone</code>, which relies on a default mapping of ’’ (nginx does not rate limit requests whose key is empty)</li> </ul> </li> <li>I noticed that CIP uploaded a number of Georgian presentations with <code>dcterms.language</code> set to English and Other so I changed them to “ka” <ul> <li>Perhaps we need to update our list of languages to include all of them instead of only the most common ones</li> </ul> </li> <li>I wrote a script <code>ilri/iso-639-value-pairs.py</code> to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to 
<code>input-forms.xml</code></li> </ul> <h2 id="2022-07-06">2022-07-06</h2> <ul> <li>CGSpace went down and up a few times due to high load <ul> <li>I found one host in Romania making very high speed requests with a normal user agent (<code>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C</code>):</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">10</span> </span></span><span style="display:flex;"><span> 516 142.132.248.90 </span></span><span style="display:flex;"><span> 525 157.55.39.234 </span></span><span style="display:flex;"><span> 587 66.249.66.21 </span></span><span style="display:flex;"><span> 593 95.108.213.59 </span></span><span style="display:flex;"><span> 1372 137.184.159.211 </span></span><span style="display:flex;"><span> 4776 54.195.118.125 </span></span><span style="display:flex;"><span> 5441 205.186.128.185 </span></span><span style="display:flex;"><span> 6267 45.5.186.2 </span></span><span style="display:flex;"><span> 15839 2a01:7e00::f03c:91ff:fe9a:3a37 </span></span><span style="display:flex;"><span> 36114 146.19.75.141 </span></span></code></pre></div><ul> <li>I added 146.19.75.141 to the list of bot networks in nginx</li> <li>While looking at the logs I started thinking about Bing again <ul> <li>They apparently <a href="https://www.bing.com/toolbox/bingbot.json">publish a list of all their networks</a></li> <li>I wrote a script to use <code>prips</code> to <a href="https://stackoverflow.com/a/52501093/1996540">print the IPs for each network</a></li> <li>The script is 
<code>bing-networks-to-ips.sh</code></li> <li>From Bing’s IPs alone I purged 145,403 hits… sheesh</li> </ul> </li> <li>Delete two items on CGSpace for Margarita because she was getting the “Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-…” error <ul> <li>This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05</li> </ul> </li> <li>Update some <code>cg.audience</code> metadata to use “Academics” instead of “Academicians”:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value='Academicians'; </span></span><span style="display:flex;"><span>UPDATE 104 </span></span></code></pre></div><ul> <li>I will also have to remove “Academicians” from input-forms.xml</li> </ul> <h2 id="2022-07-07">2022-07-07</h2> <ul> <li>Finalize lists of non-AGROVOC subjects in CGSpace that I started last week <ul> <li>I used the <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6">SQL helper functions</a> to find the collections where each term was used:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = 'water demand' GROUP BY collection ORDER BY count DESC LIMIT 5; </span></span><span style="display:flex;"><span> collection │ count </span></span><span style="display:flex;"><span>─────────────┼─────── </span></span><span 
style="display:flex;"><span> 10568/36178 │ 56 </span></span><span style="display:flex;"><span> 10568/36185 │ 46 </span></span><span style="display:flex;"><span> 10568/36181 │ 35 </span></span><span style="display:flex;"><span> 10568/36188 │ 28 </span></span><span style="display:flex;"><span> 10568/36179 │ 21 </span></span><span style="display:flex;"><span>(5 rows) </span></span></code></pre></div><ul> <li>For now I only did terms from my list that had 100 or more occurrences in CGSpace <ul> <li>This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud for evaluating possible inclusion to AGROVOC</li> </ul> </li> <li>Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace <ul> <li>We want to remove them from the submission form to create space for new fields</li> </ul> </li> <li>Update one term I noticed people using that was close to AGROVOC:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value='development policies' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value='development policy'; </span></span><span style="display:flex;"><span>UPDATE 108 </span></span></code></pre></div><ul> <li>After contacting some editors I removed some old metadata fields from the submission form and browse indexes: <ul> <li>Bioversity subject (<code>cg.subject.bioversity</code>)</li> <li>CCAFS phase 1 project tag (<code>cg.identifier.ccafsproject</code>)</li> <li>CIAT project tag (<code>cg.identifier.ciatproject</code>)</li> <li>CIAT subject (<code>cg.subject.ciat</code>)</li> </ul> </li> <li>Work on cleaning and proofing forty-six AfricaRice items for CGSpace <ul> <li>Last week we identified some duplicates so I 
removed those</li> <li>The data is of mediocre quality</li> <li>I’ve been fixing citations (nitpick), adding licenses, adding volume/issue/extent, fixing DOIs, and adding some AGROVOC subjects</li> <li>I even found titles with typos that look something like OCR errors…</li> </ul> </li> </ul> <h2 id="2022-07-08">2022-07-08</h2> <ul> <li>Finalize the cleaning and proofing of AfricaRice records <ul> <li>I found two suspicious items that claim to have been published but that I can’t find in the respective journals, so I removed those</li> <li>I uploaded the forty-four items to <a href="https://dspacetest.cgiar.org/handle/10568/119135">DSpace Test</a></li> </ul> </li> <li>Margarita from CCAFS said they are no longer using the CCAFS subject or CCAFS phase 2 project tag <ul> <li>I removed these from the input-forms.xml and Discovery facets: <ul> <li>cg.identifier.ccafsprojectpii</li> <li>cg.subject.cifor</li> </ul> </li> <li>For now we will keep them in the search filters</li> </ul> </li> </ul> </article> </div> <!-- /.blog-main --> <aside class="col-sm-3 ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol class="list-unstyled"> <li><a href="/cgspace-notes/2022-07/">July, 2022</a></li> <li><a href="/cgspace-notes/2022-06/">June, 2022</a></li> <li><a href="/cgspace-notes/2022-05/">May, 2022</a></li> <li><a href="/cgspace-notes/2022-04/">April, 2022</a></li> <li><a href="/cgspace-notes/2022-03/">March, 2022</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a 
href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>