mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-08 21:24:54 +01:00
661 lines
39 KiB
HTML
<!DOCTYPE html>
|
|
<html lang="en" >
|
|
|
|
<head>
|
|
<meta charset="utf-8">
|
|
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
|
|
|
|
|
|
<meta property="og:title" content="July, 2022" />
|
|
<meta property="og:description" content="2022-07-02
I learned how to use the Levenshtein functions in PostgreSQL
The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing
Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first
" />
|
|
<meta property="og:type" content="article" />
|
|
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-07/" />
|
|
<meta property="article:published_time" content="2022-07-02T14:07:36+03:00" />
|
|
<meta property="article:modified_time" content="2022-07-18T16:45:55+03:00" />
|
|
|
|
|
|
|
|
<meta name="twitter:card" content="summary"/>
|
|
<meta name="twitter:title" content="July, 2022"/>
|
|
<meta name="twitter:description" content="2022-07-02
I learned how to use the Levenshtein functions in PostgreSQL
The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing
Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first
"/>
|
|
<meta name="generator" content="Hugo 0.101.0" />
|
|
|
|
|
|
|
|
<script type="application/ld+json">
|
|
{
|
|
"@context": "http://schema.org",
|
|
"@type": "BlogPosting",
|
|
"headline": "July, 2022",
|
|
"url": "https://alanorth.github.io/cgspace-notes/2022-07/",
|
|
"wordCount": "2679",
|
|
"datePublished": "2022-07-02T14:07:36+03:00",
|
|
"dateModified": "2022-07-18T16:45:55+03:00",
|
|
"author": {
|
|
"@type": "Person",
|
|
"name": "Alan Orth"
|
|
},
|
|
"keywords": "Notes"
|
|
}
|
|
</script>
|
|
|
|
|
|
|
|
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-07/">
|
|
|
|
<title>July, 2022 | CGSpace Notes</title>
|
|
|
|
|
|
<!-- combined, minified CSS -->
|
|
|
|
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
|
|
|
|
|
|
<!-- minified Font Awesome for SVG icons -->
|
|
|
|
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script>
|
|
|
|
<!-- RSS 2.0 feed -->
|
|
|
|
|
|
|
|
|
|
</head>
|
|
|
|
<body>
|
|
|
|
|
|
<div class="blog-masthead">
|
|
<div class="container">
|
|
<nav class="nav blog-nav">
|
|
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
|
|
</nav>
|
|
</div>
|
|
</div>
|
|
|
|
|
|
|
|
|
|
<header class="blog-header">
|
|
<div class="container">
|
|
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
|
|
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
|
|
</div>
|
|
</header>
|
|
|
|
|
|
|
|
|
|
<div class="container">
|
|
<div class="row">
|
|
<div class="col-sm-8 blog-main">
|
|
|
|
|
|
|
|
|
|
<article class="blog-post">
|
|
<header>
|
|
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-07/">July, 2022</a></h2>
|
|
<p class="blog-post-meta">
|
|
<time datetime="2022-07-02T14:07:36+03:00">Sat Jul 02, 2022</time>
|
|
in
|
|
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a>
|
|
|
|
|
|
</p>
|
|
</header>
|
|
<h2 id="2022-07-02">2022-07-02</h2>
|
|
<ul>
|
|
<li>I learned how to use the Levenshtein functions in PostgreSQL
|
|
<ul>
|
|
<li>The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing</li>
|
|
<li>Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<ul>
|
|
<li>A working query checking for duplicates in the recent AfricaRice items is:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) &lt;= 3;
</span></span><span style="display:flex;"><span> text_value
</span></span><span style="display:flex;"><span>────────────────────────────────────────────────────────────────────────────────────────
</span></span><span style="display:flex;"><span> International trade and exotic pests: the risks for biodiversity and African economies
</span></span><span style="display:flex;"><span>(1 row)
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 399.751 ms
</span></span></code></pre></div><ul>
|
|
<li>There is a great <a href="https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql">blog post discussing Soundex with Levenshtein</a> and creating indexes to make them faster</li>
|
|
<li>I want to do some proper checks of accuracy and speed against my trigram method</li>
|
|
</ul>
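<p>A minimal Python sketch of the comparison described above, assuming a plain dynamic-programming edit distance rather than PostgreSQL's own implementation. The helper names (<code>normalize</code>, <code>levenshtein</code>, <code>is_probable_duplicate</code>) are hypothetical; the lower-casing, the 255-character truncation, and the threshold of 3 are taken from the query above:</p>

```python
MAX_LEN = 255  # limit noted above for PostgreSQL's Levenshtein functions


def normalize(title: str) -> str:
    # Lower-case first, then truncate, mirroring LEFT(LOWER(text_value), 255)
    return title.lower()[:MAX_LEN]


def levenshtein(a: str, b: str) -> int:
    # Row-by-row dynamic programming; insert, delete and substitute each cost 1
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_probable_duplicate(title: str, candidate: str, max_distance: int = 3) -> bool:
    # Hypothetical helper: True when the normalized titles are within max_distance edits
    return max_distance >= levenshtein(normalize(title), normalize(candidate))


print(is_probable_duplicate(
    "International Trade and Exotic Pests: The Risks for Biodiversity and African Economies",
    "International trade and exotic pests: the risks for biodiversity and African economies",
))
```

<p>Without the lower-casing step the two titles in the example differ in several characters, which is why the case handling matters for Levenshtein but not for the trigram functions.</p>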
|
|
<h2 id="2022-07-03">2022-07-03</h2>
|
|
<ul>
|
|
<li>Start a harvest on AReS</li>
|
|
</ul>
|
|
<h2 id="2022-07-04">2022-07-04</h2>
|
|
<ul>
|
|
<li>Linode told me that CGSpace had high load yesterday
|
|
<ul>
|
|
<li>I also got some up and down notices from UptimeRobot</li>
|
|
<li>Looking now, I see there was a very high CPU and database pool load, but a mostly normal DSpace session count</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<p><img src="/cgspace-notes/2022/07/cpu-day.png" alt="CPU load day">
|
|
<img src="/cgspace-notes/2022/07/jmx_tomcat_dbpools-day.png" alt="JDBC pool day"></p>
|
|
<ul>
|
|
<li>Seems we have some old database transactions since 2022-06-27:</li>
|
|
</ul>
|
|
<p><img src="/cgspace-notes/2022/07/postgres_locks_ALL-week.png" alt="PostgreSQL locks week">
|
|
<img src="/cgspace-notes/2022/07/postgres_querylength_ALL-week.png" alt="PostgreSQL query length week"></p>
|
|
<ul>
|
|
<li>Looking at the top connections to nginx yesterday:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.1 | sort | uniq -c | sort -h | tail
</span></span><span style="display:flex;"><span> 1132 64.124.8.34
</span></span><span style="display:flex;"><span> 1146 2a01:4f8:1c17:5550::1
</span></span><span style="display:flex;"><span> 1380 137.184.159.211
</span></span><span style="display:flex;"><span> 1533 64.124.8.59
</span></span><span style="display:flex;"><span> 4013 80.248.237.167
</span></span><span style="display:flex;"><span> 4776 54.195.118.125
</span></span><span style="display:flex;"><span> 10482 45.5.186.2
</span></span><span style="display:flex;"><span> 11177 172.104.229.92
</span></span><span style="display:flex;"><span> 15855 2a01:7e00::f03c:91ff:fe9a:3a37
</span></span><span style="display:flex;"><span> 22179 64.39.98.251
</span></span></code></pre></div><ul>
|
|
<li>And the total number of unique IPs:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.1 | sort -u | wc -l
</span></span><span style="display:flex;"><span>6952
</span></span></code></pre></div><ul>
|
|
<li>This seems low, so the load must have come from the request patterns of certain visitors
|
|
<ul>
|
|
<li>64.39.98.251 is Qualys, and I’m debating blocking <a href="https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm">all their IPs</a> using a geo block in nginx (need to test)</li>
|
|
<li>The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover</li>
|
|
<li>64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo</li>
|
|
</ul>
|
|
</li>
|
|
<li>I ran all system updates and rebooted the server (could have just restarted PostgreSQL but I thought I might as well do everything)</li>
|
|
<li>I implemented a geo mapping for the user agent mapping AND the nginx <code>limit_req_zone</code> by extracting the networks into an external file and including it in two different geo mapping blocks
|
|
<ul>
|
|
<li>This is clever and relies on the fact that we can use defaults in both cases</li>
|
|
<li>First, we map the user agent of requests from these networks to “bot” so that Tomcat and Solr handle them accordingly</li>
|
|
<li>Second, we use this as a key in a <code>limit_req_zone</code>, which relies on a default mapping of ’’ (nginx does not account requests with an empty key)</li>
|
|
</ul>
|
|
</li>
|
|
<li>I noticed that CIP uploaded a number of Georgian presentations with <code>dcterms.language</code> set to English and Other so I changed them to “ka”
|
|
<ul>
|
|
<li>Perhaps we need to update our list of languages to include all of them instead of only the most common ones</li>
|
|
</ul>
|
|
</li>
|
|
<li>I wrote a script <code>ilri/iso-639-value-pairs.py</code> to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to <code>input-forms.xml</code></li>
|
|
</ul>
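<p>A rough Python model of the shared network list idea described above, written only to illustrate the intended behaviour. It is not nginx configuration, the function names are hypothetical, and the networks are documentation-only ranges standing in for the real external file:</p>

```python
import ipaddress

# Hypothetical stand-in for the external networks file (documentation-only ranges)
BOT_NETWORKS = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "2001:db8::/32")]


def geo_bot(client_ip: str) -> str:
    # Models a geo lookup with an empty default: "bot" for listed networks, "" otherwise
    address = ipaddress.ip_address(client_ip)
    return "bot" if any(address in network for network in BOT_NETWORKS) else ""


def user_agent_for_backend(client_ip: str, user_agent: str) -> str:
    # First use: requests from listed networks are presented to Tomcat and Solr as "bot",
    # everything else keeps the user agent the client sent
    return geo_bot(client_ip) or user_agent


def rate_limit_key(client_ip: str) -> str:
    # Second use: the same lookup is the rate limit key; an empty key means
    # the request is not counted against the limit
    return geo_bot(client_ip)


print(user_agent_for_backend("192.0.2.10", "Mozilla/5.0"))    # listed network
print(user_agent_for_backend("198.51.100.7", "Mozilla/5.0"))  # unlisted network
```

<p>The point of the design is that one list of networks drives both decisions, so adding a network in one place changes the user agent mapping and the rate limiting together.</p>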
|
|
<h2 id="2022-07-06">2022-07-06</h2>
|
|
<ul>
|
|
<li>CGSpace went down and up a few times due to high load
|
|
<ul>
|
|
<li>I found one host in Romania making very high speed requests with a normal user agent (<code>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)</code>):</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">10</span>
</span></span><span style="display:flex;"><span> 516 142.132.248.90
</span></span><span style="display:flex;"><span> 525 157.55.39.234
</span></span><span style="display:flex;"><span> 587 66.249.66.21
</span></span><span style="display:flex;"><span> 593 95.108.213.59
</span></span><span style="display:flex;"><span> 1372 137.184.159.211
</span></span><span style="display:flex;"><span> 4776 54.195.118.125
</span></span><span style="display:flex;"><span> 5441 205.186.128.185
</span></span><span style="display:flex;"><span> 6267 45.5.186.2
</span></span><span style="display:flex;"><span> 15839 2a01:7e00::f03c:91ff:fe9a:3a37
</span></span><span style="display:flex;"><span> 36114 146.19.75.141
</span></span></code></pre></div><ul>
|
|
<li>I added 146.19.75.141 to the list of bot networks in nginx</li>
|
|
<li>While looking at the logs I started thinking about Bing again
|
|
<ul>
|
|
<li>They apparently <a href="https://www.bing.com/toolbox/bingbot.json">publish a list of all their networks</a></li>
|
|
<li>I wrote a script to use <code>prips</code> to <a href="https://stackoverflow.com/a/52501093/1996540">print the IPs for each network</a></li>
|
|
<li>The script is <code>bing-networks-to-ips.sh</code></li>
|
|
<li>From Bing’s IPs alone I purged 145,403 hits… sheesh</li>
|
|
</ul>
|
|
</li>
|
|
<li>Delete two items on CGSpace for Margarita because she was getting the “Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-…” error
|
|
<ul>
|
|
<li>This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05</li>
|
|
</ul>
|
|
</li>
|
|
<li>Update some <code>cg.audience</code> metadata to use “Academics” instead of “Academicians”:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value='Academicians';
</span></span><span style="display:flex;"><span>UPDATE 104
</span></span></code></pre></div><ul>
|
|
<li>I will also have to remove “Academicians” from input-forms.xml</li>
|
|
</ul>
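<p>The <code>prips</code>-based expansion of networks into individual IPs mentioned above could also be sketched with Python's standard <code>ipaddress</code> module. This is not the <code>bing-networks-to-ips.sh</code> script itself; the function name, the size guard, and the example network are assumptions made for illustration:</p>

```python
import ipaddress


def enumerate_ips(networks, max_addresses=65536):
    """Expand CIDR networks into individual IP address strings."""
    ips = []
    for entry in networks:
        # strict=False tolerates host bits; a bare address becomes a single-host network
        network = ipaddress.ip_network(entry.strip(), strict=False)
        if network.num_addresses > max_addresses:
            # Guard against huge ranges, for example large IPv6 allocations
            raise ValueError(f"{network} is too large to enumerate")
        # Iterating a network yields every address, including the first and last
        ips.extend(str(address) for address in network)
    return ips


# Hypothetical input: one documentation-only network and one single address
for ip in enumerate_ips(["192.0.2.0/30", "146.19.75.141"]):
    print(ip)
```

<p>Each printed address can then be fed to whatever tool purges hits per IP.</p>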
|
|
<h2 id="2022-07-07">2022-07-07</h2>
|
|
<ul>
|
|
<li>Finalize lists of non-AGROVOC subjects in CGSpace that I started last week
|
|
<ul>
|
|
<li>I used the <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6">SQL helper functions</a> to find the collections where each term was used:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = 'water demand' GROUP BY collection ORDER BY count DESC LIMIT 5;
</span></span><span style="display:flex;"><span> collection │ count
</span></span><span style="display:flex;"><span>─────────────┼───────
</span></span><span style="display:flex;"><span> 10568/36178 │ 56
</span></span><span style="display:flex;"><span> 10568/36185 │ 46
</span></span><span style="display:flex;"><span> 10568/36181 │ 35
</span></span><span style="display:flex;"><span> 10568/36188 │ 28
</span></span><span style="display:flex;"><span> 10568/36179 │ 21
</span></span><span style="display:flex;"><span>(5 rows)
</span></span></code></pre></div><ul>
|
|
<li>For now I only did terms from my list that had 100 or more occurrences in CGSpace
|
|
<ul>
|
|
<li>This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud to evaluate for possible inclusion in AGROVOC</li>
|
|
</ul>
|
|
</li>
|
|
<li>Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace
|
|
<ul>
|
|
<li>We want to remove them from the submission form to create space for new fields</li>
|
|
</ul>
|
|
</li>
|
|
<li>Update one term I noticed people using that was close to AGROVOC:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value='development policies' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value='development policy';
</span></span><span style="display:flex;"><span>UPDATE 108
</span></span></code></pre></div><ul>
|
|
<li>After contacting some editors I removed some old metadata fields from the submission form and browse indexes:
|
|
<ul>
|
|
<li>Bioversity subject (<code>cg.subject.bioversity</code>)</li>
|
|
<li>CCAFS phase 1 project tag (<code>cg.identifier.ccafsproject</code>)</li>
|
|
<li>CIAT project tag (<code>cg.identifier.ciatproject</code>)</li>
|
|
<li>CIAT subject (<code>cg.subject.ciat</code>)</li>
|
|
</ul>
|
|
</li>
|
|
<li>Work on cleaning and proofing forty-six AfricaRice items for CGSpace
|
|
<ul>
|
|
<li>Last week we identified some duplicates so I removed those</li>
|
|
<li>The data is of mediocre quality</li>
|
|
<li>I’ve been fixing citations (nitpick), adding licenses, adding volume/issue/extent, fixing DOIs, and adding some AGROVOC subjects</li>
|
|
<li>I even found titles that have typos, looking something like OCR errors…</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2 id="2022-07-08">2022-07-08</h2>
|
|
<ul>
|
|
<li>Finalize the cleaning and proofing of AfricaRice records
|
|
<ul>
|
|
<li>I found two suspicious items that claim to have been published but that I can’t find in the respective journals, so I removed those</li>
|
|
<li>I uploaded the forty-four items to <a href="https://dspacetest.cgiar.org/handle/10568/119135">DSpace Test</a></li>
|
|
</ul>
|
|
</li>
|
|
<li>Margarita from CCAFS said they are no longer using the CCAFS subject or CCAFS phase 2 project tag
|
|
<ul>
|
|
<li>I removed these from the input-forms.xml and Discovery facets:
|
|
<ul>
|
|
<li>cg.identifier.ccafsprojectpii</li>
|
|
<li>cg.subject.ccafs</li>
|
|
</ul>
|
|
</li>
|
|
<li>For now we will keep them in the search filters</li>
|
|
</ul>
|
|
</li>
|
|
<li>I modified my <code>check-duplicates.py</code> script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: <a href="https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents">https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents</a>)
|
|
<ul>
|
|
<li>I want to use this with the MARLO innovation reports, to find related publications and working papers on CGSpace</li>
|
|
<li>I am curious to see how the similarity scores compare to those from trgm… perhaps we don’t need them actually</li>
|
|
</ul>
|
|
</li>
|
|
<li>Deploy latest changes to submission form, Discovery, and browse on CGSpace
|
|
<ul>
|
|
<li>Also run all system updates and reboot the host</li>
|
|
</ul>
|
|
</li>
|
|
<li>Fix 152 <code>dcterms.relation</code> values that are using “cgspace.cgiar.org” links instead of handles:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$';
</span></span></code></pre></div><h2 id="2022-07-10">2022-07-10</h2>
|
|
<ul>
|
|
<li>UptimeRobot says that CGSpace is down
|
|
<ul>
|
|
<li>I see high load around 22, high CPU around 800%</li>
|
|
<li>Doesn’t seem to be a lot of unique IPs:</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log | sort -u | wc -l
</span></span><span style="display:flex;"><span>2243
</span></span></code></pre></div><p>Looking at the top twenty I see some usual IPs, but some new ones on Hetzner that are using many DSpace sessions:</p>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep 65.109.2.97 dspace.log.2022-07-10 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>1613
</span></span><span style="display:flex;"><span>$ grep 95.216.174.97 dspace.log.2022-07-10 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>1696
</span></span><span style="display:flex;"><span>$ grep 65.109.15.213 dspace.log.2022-07-10 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>1708
</span></span><span style="display:flex;"><span>$ grep 65.108.80.78 dspace.log.2022-07-10 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>1830
</span></span><span style="display:flex;"><span>$ grep 65.108.95.23 dspace.log.2022-07-10 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l
</span></span><span style="display:flex;"><span>1811
</span></span></code></pre></div><p><img src="/cgspace-notes/2022/07/jmx_dspace_sessions-week.png" alt="DSpace sessions week"></p>
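<p>The repeated <code>grep</code> pipelines above can be folded into one small Python helper. This is only a sketch: the function name is hypothetical, the sample log lines are invented to match the pattern used in the <code>grep</code> commands, and the real DSpace log format may carry more fields:</p>

```python
import re

# Same pattern as the grep -oE expression above, with the session ID captured
SESSION_RE = re.compile(r"session_id=([A-Z0-9]{32}):ip_addr=")


def unique_sessions(log_lines, ip):
    """Count distinct DSpace session IDs on log lines that mention the given IP."""
    sessions = set()
    for line in log_lines:
        # Plain substring match, similar in spirit to the first grep in each pipeline
        if ip not in line:
            continue
        sessions.update(SESSION_RE.findall(line))
    return len(sessions)


# Invented sample lines for illustration only
sample = [
    "INFO session_id=" + "A" * 32 + ":ip_addr=65.109.2.97:view_item",
    "INFO session_id=" + "A" * 32 + ":ip_addr=65.109.2.97:search",
    "INFO session_id=" + "B" * 32 + ":ip_addr=65.109.2.97:view_item",
    "INFO session_id=" + "C" * 32 + ":ip_addr=95.216.174.97:view_item",
]
print(unique_sessions(sample, "65.109.2.97"))
```

<p>With a real log, the lines would come from iterating over the opened <code>dspace.log.2022-07-10</code> file, and the helper could be called once per suspicious IP.</p>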
|
|
<ul>
|
|
<li>
|
|
<p>These IPs are using normal-looking user agents:</p>
|
|
<ul>
|
|
<li><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0.1</code></li>
|
|
<li><code>Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/45.0</code></li>
|
|
<li><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1</code></li>
|
|
<li><code>Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.85 Safari/537.36</code></li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
<p>I will add the networks I’m seeing to nginx’s bot-networks.conf for now (not all of Hetzner) and purge the hits later:</p>
|
|
<ul>
|
|
<li>65.108.0.0/16</li>
|
|
<li>65.21.0.0/16</li>
|
|
<li>95.216.0.0/16</li>
|
|
<li>135.181.0.0/16</li>
|
|
<li>138.201.0.0/16</li>
|
|
</ul>
|
|
</li>
|
|
<li>
|
|
<p>I think I’m going to get to a point where I categorize all commercial subnets as bots by default and then whitelist those we need</p>
|
|
</li>
|
|
<li>
|
|
<p>Sheesh, there are a bunch more IPv6 addresses also on Hetzner:</p>
|
|
</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access<span style="color:#f92672">}</span>.log | sort | grep 2a01:4f9 | uniq -c | sort -h
</span></span><span style="display:flex;"><span> 1 2a01:4f9:6a:1c2b::2
</span></span><span style="display:flex;"><span> 2 2a01:4f9:2b:5a8::2
</span></span><span style="display:flex;"><span> 2 2a01:4f9:4b:4495::2
</span></span><span style="display:flex;"><span> 96 2a01:4f9:c010:518c::1
</span></span><span style="display:flex;"><span> 137 2a01:4f9:c010:a9bc::1
</span></span><span style="display:flex;"><span> 142 2a01:4f9:c010:58c9::1
</span></span><span style="display:flex;"><span> 142 2a01:4f9:c010:58ea::1
</span></span><span style="display:flex;"><span> 144 2a01:4f9:c010:58eb::1
</span></span><span style="display:flex;"><span> 145 2a01:4f9:c010:6ff8::1
</span></span><span style="display:flex;"><span> 148 2a01:4f9:c010:5190::1
</span></span><span style="display:flex;"><span> 149 2a01:4f9:c010:7d6d::1
</span></span><span style="display:flex;"><span> 153 2a01:4f9:c010:5226::1
</span></span><span style="display:flex;"><span> 156 2a01:4f9:c010:7f74::1
</span></span><span style="display:flex;"><span> 160 2a01:4f9:c010:5188::1
</span></span><span style="display:flex;"><span> 161 2a01:4f9:c010:58e5::1
</span></span><span style="display:flex;"><span> 168 2a01:4f9:c010:58ed::1
</span></span><span style="display:flex;"><span> 170 2a01:4f9:c010:548e::1
</span></span><span style="display:flex;"><span> 170 2a01:4f9:c010:8c97::1
</span></span><span style="display:flex;"><span> 175 2a01:4f9:c010:58c8::1
</span></span><span style="display:flex;"><span> 175 2a01:4f9:c010:aada::1
</span></span><span style="display:flex;"><span> 182 2a01:4f9:c010:58ec::1
</span></span><span style="display:flex;"><span> 182 2a01:4f9:c010:ae8c::1
</span></span><span style="display:flex;"><span> 502 2a01:4f9:c010:ee57::1
</span></span><span style="display:flex;"><span> 530 2a01:4f9:c011:567a::1
</span></span><span style="display:flex;"><span> 535 2a01:4f9:c010:d04e::1
</span></span><span style="display:flex;"><span> 539 2a01:4f9:c010:3d9a::1
</span></span><span style="display:flex;"><span> 586 2a01:4f9:c010:93db::1
</span></span><span style="display:flex;"><span> 593 2a01:4f9:c010:a04a::1
</span></span><span style="display:flex;"><span> 601 2a01:4f9:c011:4166::1
</span></span><span style="display:flex;"><span> 607 2a01:4f9:c010:9881::1
</span></span><span style="display:flex;"><span> 640 2a01:4f9:c010:87fb::1
</span></span><span style="display:flex;"><span> 648 2a01:4f9:c010:e680::1
</span></span><span style="display:flex;"><span> 1141 2a01:4f9:3a:2696::2
</span></span><span style="display:flex;"><span> 1146 2a01:4f9:3a:2555::2
</span></span><span style="display:flex;"><span> 3207 2a01:4f9:3a:2c19::2
</span></span></code></pre></div><ul>
|
|
<li>Maybe it’s time I ban all of Hetzner… sheesh.</li>
|
|
<li>I left for a few hours and the server was going up and down the whole time, with CPU and database load still very high when I got back</li>
|
|
</ul>
|
|
<p><img src="/cgspace-notes/2022/07/cpu-day.png" alt="CPU day"></p>
|
|
<ul>
|
|
<li>I am not sure what’s going on
|
|
<ul>
|
|
<li>I extracted all the IPs and used <code>resolve-addresses-geoip2.py</code> to analyze them and extract all the Hetzner networks and block them</li>
|
|
<li>It’s 181 IPs on Hetzner…</li>
|
|
</ul>
|
|
</li>
|
|
<li>I rebooted the server to see if it was just some stuck locks in PostgreSQL…</li>
|
|
<li>The load is still higher than I would expect, and after a few more hours I see more Hetzner IPs coming through? Two more subnets to block</li>
|
|
<li>Start a harvest on AReS</li>
|
|
</ul>
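<p>One way to see which subnets dominate a list of client IPs, as in the Hetzner investigation above, is to bucket addresses by a covering network. The sketch below is not <code>resolve-addresses-geoip2.py</code> and uses no GeoIP data; the function names are hypothetical, and the /16 (IPv4) and /32 (IPv6) prefix lengths are assumptions chosen to match the size of the networks listed above:</p>

```python
import ipaddress
from collections import Counter


def covering_network(ip, v4_prefix=16, v6_prefix=32):
    """Return the network of the assumed prefix length that contains the address."""
    address = ipaddress.ip_address(ip)
    prefix = v4_prefix if address.version == 4 else v6_prefix
    # strict=False masks off the host bits instead of raising an error
    return ipaddress.ip_network(f"{address}/{prefix}", strict=False)


def requests_per_network(ips):
    """Count how many of the given request IPs fall into each covering network."""
    return Counter(str(covering_network(ip)) for ip in ips)


# A few of the addresses seen in the logs above, one entry per request
seen = ["65.108.80.78", "65.108.95.23", "95.216.174.97", "2a01:4f9:c010:518c::1"]
for network, count in requests_per_network(seen).most_common():
    print(count, network)
```

<p>A fixed prefix length is a blunt tool: it can lump unrelated customers together, so the resulting networks are candidates to review before adding them to <code>bot-networks.conf</code>, not a block list in themselves.</p>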
|
|
<h2 id="2022-07-12">2022-07-12</h2>
|
|
<ul>
|
|
<li>Update an incorrect ORCID identifier for Alliance</li>
|
|
<li>Adjust collection permissions on CIFOR publications collection so Vika can submit without approval</li>
|
|
</ul>
|
|
<h2 id="2022-07-14">2022-07-14</h2>
|
|
<ul>
|
|
<li>Someone on the DSpace Slack mentioned having issues with the database configuration in DSpace 7.3
|
|
<ul>
|
|
<li>The reason is apparently that the default <code>db.dialect</code> changed from “org.dspace.storage.rdbms.hibernate.postgres.DSpacePostgreSQL82Dialect” to “org.hibernate.dialect.PostgreSQL94Dialect” as a result of a Hibernate update</li>
|
|
</ul>
|
|
</li>
|
|
<li>Then I was getting more errors starting the backend server in Tomcat, but the issue was that the backend server needs Solr to be up first!</li>
|
|
</ul>
|
|
<h2 id="2022-07-17">2022-07-17</h2>
|
|
<ul>
|
|
<li>Start a harvest on AReS around 3:30PM</li>
|
|
<li>Later in the evening I see CGSpace was going down and up (not as bad as last Sunday) with around 18.0 load…</li>
|
|
<li>I see very high CPU usage:</li>
|
|
</ul>
|
|
<p><img src="/cgspace-notes/2022/07/cpu-day2.png" alt="CPU day"></p>
|
|
<ul>
|
|
<li>But DSpace sessions are normal (not like last weekend):</li>
|
|
</ul>
|
|
<p><img src="/cgspace-notes/2022/07/jmx_dspace_sessions-week2.png" alt="DSpace sessions week"></p>
|
|
<ul>
|
|
<li>I see some Hetzner IPs in the top users today, but most of the requests are getting HTTP 503 because of the changes I made last week</li>
|
|
<li>I see 137.184.159.211, which is on Digital Ocean, and the DNS is apparently iitawpsite.iita.org
|
|
<ul>
|
|
<li>I’ve seen their user agent before, but I don’t think I knew it was IITA: “GuzzleHttp/6.3.3 curl/7.84.0 PHP/7.4.30”</li>
|
|
<li>I already have something in nginx to mark Guzzle as a bot, but interestingly it shows up in Solr as <code>$http_user_agent</code> so there is a logic error in my nginx config</li>
|
|
</ul>
|
|
</li>
|
|
<li>Ouch, the logic error seems to be this:</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>geo $ua {
</span></span><span style="display:flex;"><span> default $http_user_agent;
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span> include /etc/nginx/bot-networks.conf;
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><ul>
|
|
<li>After some testing on DSpace Test I see that this is actually setting the default user agent to a literal <code>$http_user_agent</code></li>
|
|
<li>The <a href="http://nginx.org/en/docs/http/ngx_http_map_module.html">nginx map docs</a> say:</li>
|
|
</ul>
|
|
<blockquote>
|
|
<p>The resulting value can contain text, variable (0.9.0), and their combination (1.11.0).</p>
|
|
</blockquote>
|
|
<ul>
|
|
<li>But I can’t get it to work, neither for the default value nor for matching my IP…
|
|
<ul>
|
|
<li>I will have to ask on the nginx mailing list</li>
|
|
</ul>
|
|
</li>
|
|
<li>The total number of requests and unique hosts was not even very high (the counts below are from around midnight, so they cover almost the whole day):</li>
|
|
</ul>
|
|
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log | sort -u | wc -l
</span></span><span style="display:flex;"><span>2776
</span></span><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log | wc -l
</span></span><span style="display:flex;"><span>40325
</span></span></code></pre></div><h2 id="2022-07-18">2022-07-18</h2>
|
|
<ul>
|
|
<li>Reading more about nginx’s geo/map and doing some tests on DSpace Test, it appears that the <a href="https://stackoverflow.com/questions/47011497/nginx-geo-module-wont-use-variables">geo module cannot do dynamic values</a>
|
|
<ul>
|
|
<li>So this issue with the literal <code>$http_user_agent</code> is due to the geo block I put in place earlier this month</li>
<li>I reworked the logic so that the geo block sets “bot” when a network matches and an empty string when it does not, and then re-used that value in a mapping that passes through the host’s user agent when geo has set it to an empty string</li>
<li>This allows me to accomplish the original goal while still only using one bot-networks.conf file for the <code>limit_req_zone</code> and the user agent mapping that we pass to Tomcat</li>
<li>Unfortunately this means I will have hundreds of thousands of requests in Solr with a literal <code>$http_user_agent</code></li>
<li>I might try to purge some by enumerating all the networks in my block file and running them through <code>check-spider-ip-hits.sh</code></li>
</ul>
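<p>To make the two-stage logic concrete, here is a small Python model of it. It only models the decision flow and is not nginx configuration. The network list and the function names are made up for illustration; the real networks live in <code>bot-networks.conf</code>.</p>

```python
import ipaddress

# Hypothetical stand-in for the networks listed in bot-networks.conf
# (documentation ranges, not real bot networks).
BOT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def geo_value(client_ip):
    """Model the geo block: "bot" for a listed network, otherwise ""."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in BOT_NETWORKS):
        return "bot"
    return ""


def user_agent_for_tomcat(client_ip, http_user_agent):
    """Model the map: keep the geo value, or fall back to the client's user agent."""
    return geo_value(client_ip) or http_user_agent
```

<p>In this model a client inside a listed network is reported as <code>bot</code>, while any other client keeps whatever user agent it sent.</p>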
</li>
<li>I extracted all the IPs/subnets from <code>bot-networks.conf</code> and prepared them so I could enumerate their IPs
<ul>
<li>I had to add <code>/32</code> to all single IPs, which I did with this crazy vim invocation:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>:g!/\/\d\+$/s/^\(\d\+\.\d\+\.\d\+\.\d\+\)$/\1\/32/
</span></span></code></pre></div><ul>
<li>Explanation:
<ul>
<li><code>g!</code>: global, lines <em>not</em> matching (the opposite of <code>g</code>)</li>
<li><code>/\/\d\+$/</code>: pattern matching <code>/</code> followed by one or more digits at the end of the line</li>
<li><code>s/^\(\d\+\.\d\+\.\d\+\.\d\+\)$/\1\/32/</code>: for the lines not matching the pattern above, capture the IPv4 address and add <code>/32</code> at the end</li>
</ul>
</li>
<li>Then I ran the list through prips to enumerate the IPs:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ <span style="color:#66d9ef">while</span> read -r line; <span style="color:#66d9ef">do</span> prips <span style="color:#e6db74">"</span>$line<span style="color:#e6db74">"</span> | sed -e <span style="color:#e6db74">'1d; $d'</span>; <span style="color:#66d9ef">done</span> < /tmp/bot-networks.conf > /tmp/bot-ips.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/bot-ips.txt
</span></span><span style="display:flex;"><span>1946968 /tmp/bot-ips.txt
</span></span></code></pre></div><ul>
<li>I started running <code>check-spider-ip-hits.sh</code> with the 1946968 IPs and left it running in dry run mode</li>
</ul>
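<p>For reference, the two preparation steps above (adding <code>/32</code> to bare addresses, then enumerating each network) can also be sketched with Python’s standard <code>ipaddress</code> module. This is an illustration, not what was actually run, and the function names are hypothetical. One difference worth noting: the sketch keeps the single address of a <code>/32</code>, whereas <code>sed '1d; $d'</code> would likely remove the only line that prips prints for such a network.</p>

```python
import ipaddress
import re

# Same idea as the vim pattern: a whole line that is a bare dotted-quad address.
BARE_IPV4 = re.compile(r"^\d+\.\d+\.\d+\.\d+$")


def add_prefix(line):
    """Append /32 to a bare IPv4 address and leave CIDR lines untouched."""
    line = line.strip()
    if BARE_IPV4.match(line):
        return line + "/32"
    return line


def usable_addresses(cidr):
    """List the addresses in a network, dropping the first and last address
    (network and broadcast) when the network has more than two addresses."""
    network = ipaddress.ip_network(cidr, strict=False)
    addresses = [str(address) for address in network]
    if network.num_addresses in (1, 2):
        return addresses
    return addresses[1:-1]
```

<p>A list prepared this way can be written out one address per line, in the same shape as <code>/tmp/bot-ips.txt</code>.</p>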
<h2 id="2022-07-19">2022-07-19</h2>
<ul>
<li>Patrizio and Fabio emailed me to ask if their IP was banned from CGSpace
<ul>
<li>It’s one of the Hetzner ones so I said yes definitely, and asked more about how they are using the API</li>
</ul>
</li>
<li>Add ORCID identifiers for Ram Dhulipala, Lilian Wambua, and Dan Masiga to CGSpace, then tag their existing items and those of a few other authors:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dc.contributor.author,cg.creator.identifier
</span></span><span style="display:flex;"><span>"Dhulipala, Ram K","Ram Dhulipala: 0000-0002-9720-3247"
</span></span><span style="display:flex;"><span>"Dhulipala, Ram","Ram Dhulipala: 0000-0002-9720-3247"
</span></span><span style="display:flex;"><span>"Dhulipala, R.","Ram Dhulipala: 0000-0002-9720-3247"
</span></span><span style="display:flex;"><span>"Wambua, Lillian","Lillian Wambua: 0000-0003-3632-7411"
</span></span><span style="display:flex;"><span>"Wambua, Lilian","Lillian Wambua: 0000-0003-3632-7411"
</span></span><span style="display:flex;"><span>"Masiga, D.K.","Daniel Masiga: 0000-0001-7513-0887"
</span></span><span style="display:flex;"><span>"Masiga, Daniel K.","Daniel Masiga: 0000-0001-7513-0887"
</span></span><span style="display:flex;"><span>"Jores, Joerg","Joerg Jores: 0000-0003-3790-5746"
</span></span><span style="display:flex;"><span>"Schieck, Elise","Elise Schieck: 0000-0003-1756-6337"
</span></span><span style="display:flex;"><span>"Schieck, Elise G.","Elise Schieck: 0000-0003-1756-6337"
</span></span><span style="display:flex;"><span>$ ./ilri/add-orcid-identifiers-csv.py -i /tmp/2022-07-19-add-orcids.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span>
</span></span></code></pre></div><ul>
<li>Review the AfricaRice records from earlier this month again
<ul>
<li>I found one more duplicate and one more suspicious item, so the total after removing those is now forty-two</li>
</ul>
</li>
<li>I took the ~560 IPs that had hits so far in the <code>check-spider-ip-hits.sh</code> dry run above (about 270,000 IPs into the list of 1946968) and ran the script with them directly on CGSpace
<ul>
<li>This purged 199,032 hits from Solr. Very many were from Qualys, but some were from the Chinese bot on 124.17.34.0/24 that was grabbing PDFs a few years ago, which I blocked in nginx at the time but whose hits I never purged</li>
<li>Then, in the large file of 1946968 IPs, I deleted everything up to the last IP where I found hits and restarted the script</li>
</ul>
</li>
</ul>
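<p>The “delete up to the last IP with hits, then restart” step can be sketched as a small helper. This is a hypothetical illustration, not part of <code>check-spider-ip-hits.sh</code>, and it assumes that the last IP with hits is itself dropped because its hits were already purged.</p>

```python
def remaining_ips(all_ips, last_ip_with_hits):
    """Return the IPs that come after the last occurrence of last_ip_with_hits."""
    # Search from the end so a repeated address does not cut the list too early.
    for index in range(len(all_ips) - 1, -1, -1):
        if all_ips[index] == last_ip_with_hits:
            return all_ips[index + 1:]
    raise ValueError(f"{last_ip_with_hits} is not in the list")
```

<p>The returned list is what the next round of the script would then work through.</p>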
<h2 id="2022-07-20">2022-07-20</h2>
<ul>
<li>Did a few more minor edits to the forty-two AfricaRice records (including generating thumbnails for the handful that are Creative Commons licensed) then did a test import on my local instance
<ul>
<li>Once it worked well I did an import to CGSpace:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace import -a -e fuuu@example.com -m 2022-07-20-africarice.map -s /tmp/SimpleArchiveFormat
</span></span></code></pre></div><ul>
<li>Also make edits to ~62 affiliations on CGSpace because I noticed they were messed up</li>
<li>Extract another ~1,600 IPs that had hits since I started the second round of <code>check-spider-ip-hits.sh</code> yesterday and purge another 303,594 hits
<ul>
<li>This is about 999846 IPs into the original list of 1946968 from 2022-07-18</li>
<li>A metric fuck ton of the IPs in this batch were from Hetzner</li>
</ul>
</li>
</ul>
<h2 id="2022-07-21">2022-07-21</h2>
<ul>
<li>Extract another ~2,100 IPs that had hits since I started the third round of <code>check-spider-ip-hits.sh</code> last night and purge another 763,843 hits
<ul>
<li>This is about 1441221 IPs into the original list of 1946968 from 2022-07-18</li>
<li>Again these are overwhelmingly Hetzner (not surprising since my bot-networks.conf file in nginx is mostly Hetzner)</li>
</ul>
</li>
</ul>
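<p>As a quick tally of the three purge rounds so far, the positions and hit counts reported above can be added up as follows. The numbers are copied from these notes (the first position is the approximate “about 270,000”); the function and variable names are only for illustration.</p>

```python
TOTAL_IPS = 1946968

# (approximate position in the IP list, hits purged), as reported in the notes
ROUNDS = [
    (270000, 199032),
    (999846, 303594),
    (1441221, 763843),
]


def summarize(rounds, total_ips):
    """Return the total hits purged and the progress through the list in percent."""
    total_purged = sum(hits for _, hits in rounds)
    progress = [round(100 * position / total_ips, 1) for position, _ in rounds]
    return total_purged, progress
```

<p>That works out to 1,266,469 hits purged in total, with the three rounds ending roughly 14%, 51%, and 74% of the way through the list.</p>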
</article>
</div> <!-- /.blog-main -->
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
</footer>
</body>
</html>