<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="July, 2022" />
<meta property="og:description" content="2022-07-02
I learned how to use the Levenshtein functions in PostgreSQL
The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing
Also, the trgm functions I&rsquo;ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-07/" />
<meta property="article:published_time" content="2022-07-02T14:07:36+03:00" />
<meta property="article:modified_time" content="2022-07-07T10:02:04+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="July, 2022"/>
<meta name="twitter:description" content="2022-07-02
I learned how to use the Levenshtein functions in PostgreSQL
The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing
Also, the trgm functions I&rsquo;ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first
"/>
<meta name="generator" content="Hugo 0.101.0" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "July, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-07/",
"wordCount": "1095",
"datePublished": "2022-07-02T14:07:36+03:00",
"dateModified": "2022-07-07T10:02:04+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-07/">
<title>July, 2022 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-07/">July, 2022</a></h2>
<p class="blog-post-meta">
<time datetime="2022-07-02T14:07:36+03:00">Sat Jul 02, 2022</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2022-07-02">2022-07-02</h2>
<ul>
<li>I learned how to use the Levenshtein functions in PostgreSQL
<ul>
<li>These functions have a limit of 255 characters in PostgreSQL, so you need to truncate the strings before comparing</li>
<li>Also, the trgm functions I&rsquo;ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first</li>
</ul>
</li>
</ul>
<ul>
<li>A working query checking for duplicates in the recent AfricaRice items is:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER(&#39;International Trade and Exotic Pests: The Risks for Biodiversity and African Economies&#39;), LEFT(LOWER(text_value), 255), 3) &lt;= 3;
</span></span><span style="display:flex;"><span> text_value
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────
</span></span><span style="display:flex;"><span> International trade and exotic pests: the risks for biodiversity and African economies
</span></span><span style="display:flex;"><span>(1 row)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 399.751 ms
</span></span></code></pre></div><ul>
<li>There is a great <a href="https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql">blog post discussing Soundex with Levenshtein</a> and creating indexes to make them faster</li>
<li>I want to do some proper checks of accuracy and speed against my trigram method</li>
</ul>
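<ul>
<li>As a rough illustration of the lower-casing and truncation described above, here is a minimal Python sketch of the same comparison done outside the database. The function names <code>levenshtein</code> and <code>is_probable_duplicate</code> are hypothetical, and this is a plain dynamic-programming edit distance, not PostgreSQL&rsquo;s implementation:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            # A substitution costs 1 only when the characters differ
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


def is_probable_duplicate(title_a, title_b, max_distance=3, max_length=255):
    """Lower-case and truncate both titles, then compare the edit distance."""
    # Levenshtein is case sensitive, so normalize the case first
    a = title_a.lower()[:max_length]
    b = title_b.lower()[:max_length]
    return max_distance >= levenshtein(a, b)
</code></pre></div>
<ul>
<li>With these assumptions, two titles that differ only in capitalization compare as duplicates, because the distance is computed after lower-casing</li>
</ul>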
<h2 id="2022-07-03">2022-07-03</h2>
<ul>
<li>Start a harvest on AReS</li>
</ul>
<h2 id="2022-07-04">2022-07-04</h2>
<ul>
<li>Linode told me that CGSpace had high load yesterday
<ul>
<li>I also got some up and down notices from UptimeRobot</li>
<li>Looking now, I see there was a very high CPU and database pool load, but a mostly normal DSpace session count</li>
</ul>
</li>
</ul>
<p><img src="/cgspace-notes/2022/07/cpu-day.png" alt="CPU load day">
<img src="/cgspace-notes/2022/07/jmx_tomcat_dbpools-day.png" alt="JDBC pool day"></p>
<ul>
<li>Seems we have some old database transactions since 2022-06-27:</li>
</ul>
<p><img src="/cgspace-notes/2022/07/postgres_locks_ALL-week.png" alt="PostgreSQL locks week">
<img src="/cgspace-notes/2022/07/postgres_querylength_ALL-week.png" alt="PostgreSQL query length week"></p>
<ul>
<li>Looking at the top connections to nginx yesterday:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.1 | sort | uniq -c | sort -h | tail
</span></span><span style="display:flex;"><span> 1132 64.124.8.34
</span></span><span style="display:flex;"><span> 1146 2a01:4f8:1c17:5550::1
</span></span><span style="display:flex;"><span> 1380 137.184.159.211
</span></span><span style="display:flex;"><span> 1533 64.124.8.59
</span></span><span style="display:flex;"><span> 4013 80.248.237.167
</span></span><span style="display:flex;"><span> 4776 54.195.118.125
</span></span><span style="display:flex;"><span> 10482 45.5.186.2
</span></span><span style="display:flex;"><span> 11177 172.104.229.92
</span></span><span style="display:flex;"><span> 15855 2a01:7e00::f03c:91ff:fe9a:3a37
</span></span><span style="display:flex;"><span> 22179 64.39.98.251
</span></span></code></pre></div><ul>
<li>And the total number of unique IPs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.1 | sort -u | wc -l
</span></span><span style="display:flex;"><span>6952
</span></span></code></pre></div><ul>
<li>This seems low, so the load must have come from the request patterns of certain visitors
<ul>
<li>64.39.98.251 is Qualys, and I&rsquo;m debating blocking <a href="https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm">all their IPs</a> using a geo block in nginx (need to test)</li>
<li>The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover</li>
<li>64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo</li>
</ul>
</li>
<li>I ran all system updates and rebooted the server (could have just restarted PostgreSQL but I thought I might as well do everything)</li>
<li>I implemented a geo mapping for both the user agent mapping and the nginx <code>limit_req_zone</code> by extracting the networks into an external file and including it in two different geo mapping blocks
<ul>
<li>This is clever and relies on the fact that we can use defaults in both cases</li>
<li>First, we map the user agent of requests from these networks to &ldquo;bot&rdquo; so that Tomcat and Solr handle them accordingly</li>
<li>Second, we use this as a key in a <code>limit_req_zone</code>, which relies on a default mapping of <code>&#39;&#39;</code> (and nginx doesn&rsquo;t evaluate empty cache keys)</li>
</ul>
</li>
<li>I noticed that CIP uploaded a number of Georgian presentations with <code>dcterms.language</code> set to English and Other so I changed them to &ldquo;ka&rdquo;
<ul>
<li>Perhaps we need to update our list of languages to include all of them instead of only the most common ones</li>
</ul>
</li>
<li>I wrote a script <code>ilri/iso-639-value-pairs.py</code> to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to <code>input-forms.xml</code></li>
</ul>
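<ul>
<li>The logic of the two geo mappings described above can be sketched in Python. This models the idea only and is not the nginx configuration: the network list uses documentation-only placeholder ranges, the function names are hypothetical, and using the same <code>bot</code> value as the rate limit key is an assumption based on both mappings sharing one file:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import ipaddress

# Placeholder for the external file of networks (documentation-only ranges,
# not the real list)
BOT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

# Assumption: both mappings see the same value for a listed network
BOT_VALUE = "bot"


def in_bot_networks(client_ip):
    """Return True if the client address falls inside any listed network."""
    address = ipaddress.ip_address(client_ip)
    # Membership is False when the IP versions differ, so mixing IPv4 and
    # IPv6 networks in one list is safe
    return any(address in network for network in BOT_NETWORKS)


def mapped_user_agent(client_ip, user_agent):
    """First mapping: the default is the client's real user agent."""
    return BOT_VALUE if in_bot_networks(client_ip) else user_agent


def rate_limit_key(client_ip):
    """Second mapping: the default is an empty key, which is not rate limited."""
    return BOT_VALUE if in_bot_networks(client_ip) else ""
</code></pre></div>
<ul>
<li>The point of the sketch is that one shared list feeds two lookups that differ only in their default value, so adding a network in one place changes both behaviors</li>
</ul>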
<h2 id="2022-07-06">2022-07-06</h2>
<ul>
<li>CGSpace went down and up a few times due to high load
<ul>
<li>I found one host in Romania making very high speed requests with a normal user agent (<code>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C</code>):</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">10</span>
</span></span><span style="display:flex;"><span> 516 142.132.248.90
</span></span><span style="display:flex;"><span> 525 157.55.39.234
</span></span><span style="display:flex;"><span> 587 66.249.66.21
</span></span><span style="display:flex;"><span> 593 95.108.213.59
</span></span><span style="display:flex;"><span> 1372 137.184.159.211
</span></span><span style="display:flex;"><span> 4776 54.195.118.125
</span></span><span style="display:flex;"><span> 5441 205.186.128.185
</span></span><span style="display:flex;"><span> 6267 45.5.186.2
</span></span><span style="display:flex;"><span> 15839 2a01:7e00::f03c:91ff:fe9a:3a37
</span></span><span style="display:flex;"><span> 36114 146.19.75.141
</span></span></code></pre></div><ul>
<li>I added 146.19.75.141 to the list of bot networks in nginx</li>
<li>While looking at the logs I started thinking about Bing again
<ul>
<li>They apparently <a href="https://www.bing.com/toolbox/bingbot.json">publish a list of all their networks</a></li>
<li>I wrote a script to use <code>prips</code> to <a href="https://stackoverflow.com/a/52501093/1996540">print the IPs for each network</a></li>
<li>The script is <code>bing-networks-to-ips.sh</code></li>
<li>From Bing&rsquo;s IPs alone I purged 145,403 hits&hellip; sheesh</li>
</ul>
</li>
<li>Delete two items on CGSpace for Margarita because she was getting the &ldquo;Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-&hellip;&rdquo; error
<ul>
<li>This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05</li>
</ul>
</li>
<li>Update some <code>cg.audience</code> metadata to use &ldquo;Academics&rdquo; instead of &ldquo;Academicians&rdquo;:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value=&#39;Academics&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value=&#39;Academicians&#39;;
</span></span><span style="display:flex;"><span>UPDATE 104
</span></span></code></pre></div><ul>
<li>I will also have to remove &ldquo;Academicians&rdquo; from input-forms.xml</li>
</ul>
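<ul>
<li>For comparison with the <code>prips</code> approach to the Bing networks mentioned above, expanding a network into its individual IPs can also be sketched with Python&rsquo;s standard <code>ipaddress</code> module. This is not the contents of <code>bing-networks-to-ips.sh</code>, the function name is hypothetical, and the networks in the usage note are documentation-only placeholders:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import ipaddress


def network_to_ips(cidr):
    """Return every address in a CIDR block as a list of strings."""
    # Iterating a network yields all addresses, including the network and
    # broadcast addresses (unlike hosts(), which skips them for IPv4)
    return [str(address) for address in ipaddress.ip_network(cidr)]


if __name__ == "__main__":
    # Placeholder networks, not Bing's published list
    for cidr in ["192.0.2.0/30", "198.51.100.0/31"]:
        for ip in network_to_ips(cidr):
            print(ip)
</code></pre></div>
<ul>
<li>Building a full list is fine for small IPv4 networks like these, but a large IPv6 prefix would need to be iterated lazily instead</li>
</ul>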
<h2 id="2022-07-07">2022-07-07</h2>
<ul>
<li>Finalize lists of non-AGROVOC subjects in CGSpace that I started last week
<ul>
<li>I used the <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6">SQL helper functions</a> to find the collections where each term was used:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = &#39;water demand&#39; GROUP BY collection ORDER BY count DESC LIMIT 5;
</span></span><span style="display:flex;"><span> collection │ count
</span></span><span style="display:flex;"><span>─────────────┼───────
</span></span><span style="display:flex;"><span> 10568/36178 │ 56
</span></span><span style="display:flex;"><span> 10568/36185 │ 46
</span></span><span style="display:flex;"><span> 10568/36181 │ 35
</span></span><span style="display:flex;"><span> 10568/36188 │ 28
</span></span><span style="display:flex;"><span> 10568/36179 │ 21
</span></span><span style="display:flex;"><span>(5 rows)
</span></span></code></pre></div><ul>
<li>For now I only did terms from my list that had 100 or more occurrences in CGSpace
<ul>
<li>This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud to evaluate for possible inclusion in AGROVOC</li>
</ul>
</li>
<li>Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace
<ul>
<li>We want to remove them from the submission form to create space for new fields</li>
</ul>
</li>
<li>Update one term I noticed people using that was close to AGROVOC:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value=&#39;development policies&#39; WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value=&#39;development policy&#39;;
</span></span><span style="display:flex;"><span>UPDATE 108
</span></span></code></pre></div><ul>
<li>After contacting some editors I removed some old metadata fields from the submission form and browse indexes:
<ul>
<li>Bioversity subject (<code>cg.subject.bioversity</code>)</li>
<li>CCAFS phase 1 project tag (<code>cg.identifier.ccafsproject</code>)</li>
<li>CIAT project tag (<code>cg.identifier.ciatproject</code>)</li>
<li>CIAT subject (<code>cg.subject.ciat</code>)</li>
</ul>
</li>
<li>Work on cleaning and proofing forty-six AfricaRice items for CGSpace
<ul>
<li>Last week we identified some duplicates so I removed those</li>
<li>The data is of mediocre quality</li>
<li>I&rsquo;ve been fixing citations (nitpick), adding licenses, adding volume/issue/extent, fixing DOIs, and adding some AGROVOC subjects</li>
<li>I even found titles with typos that look like OCR errors&hellip;</li>
</ul>
</li>
</ul>
<h2 id="2022-07-08">2022-07-08</h2>
<ul>
<li>Finalize the cleaning and proofing of AfricaRice records
<ul>
<li>I found two suspicious items that claim to have been published but that I can&rsquo;t find in the respective journals, so I removed those</li>
<li>I uploaded the forty-four items to <a href="https://dspacetest.cgiar.org/handle/10568/119135">DSpace Test</a></li>
</ul>
</li>
<li>Margarita from CCAFS said they are no longer using the CCAFS subject or CCAFS phase 2 project tag
<ul>
<li>I removed these from the input-forms.xml and Discovery facets:
<ul>
<li>cg.identifier.ccafsprojectpii</li>
<li>cg.subject.cifor</li>
</ul>
</li>
<li>For now we will keep them in the search filters</li>
</ul>
</li>
</ul>
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
<li><a href="/cgspace-notes/2022-03/">March, 2022</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>