<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="July, 2022" /> <meta property="og:description" content="2022-07-02 I learned how to use the Levenshtein functions in PostgreSQL The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-07/" /> <meta property="article:published_time" content="2022-07-02T14:07:36+03:00" /> <meta property="article:modified_time" content="2022-07-08T15:49:45+03:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="July, 2022"/> <meta name="twitter:description" content="2022-07-02 I learned how to use the Levenshtein functions in PostgreSQL The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first "/> <meta name="generator" content="Hugo 0.101.0" /> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "July, 2022", "url": "https://alanorth.github.io/cgspace-notes/2022-07/", "wordCount": "1660", "datePublished": "2022-07-02T14:07:36+03:00", "dateModified": "2022-07-08T15:49:45+03:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-07/"> <title>July, 2022 | CGSpace Notes</title> <!-- combined, minified CSS --> <link 
href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous"> <!-- minified Font Awesome for SVG icons --> <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-07/">July, 2022</a></h2> <p class="blog-post-meta"> <time datetime="2022-07-02T14:07:36+03:00">Sat Jul 02, 2022</time> in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a> </p> </header> <h2 id="2022-07-02">2022-07-02</h2> <ul> <li>I learned how to use the Levenshtein functions in PostgreSQL <ul> <li>The thing is that there is a limit of 255 characters for these functions in PostgreSQL so you need to truncate the strings before comparing</li> <li>Also, the trgm functions I’ve used before are case insensitive, but Levenshtein is not, so you need to make sure to lower case both strings first</li> </ul> </li> </ul> <ul> <li>A working query checking for 
duplicates in the recent AfricaRice items is:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ SELECT text_value FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND levenshtein_less_equal(LOWER('International Trade and Exotic Pests: The Risks for Biodiversity and African Economies'), LEFT(LOWER(text_value), 255), 3) <= 3; </span></span><span style="display:flex;"><span> text_value </span></span><span style="display:flex;"><span>──────────────────────────────────────────────────────────────────────────────────────── </span></span><span style="display:flex;"><span> International trade and exotic pests: the risks for biodiversity and African economies </span></span><span style="display:flex;"><span>(1 row) </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Time: 399.751 ms </span></span></code></pre></div><ul> <li>There is a great <a href="https://www.crunchydata.com/blog/fuzzy-name-matching-in-postgresql">blog post discussing Soundex with Levenshtein</a> and creating indexes to make them faster</li> <li>I want to do some proper checks of accuracy and speed against my trigram method</li> </ul> <h2 id="2022-07-03">2022-07-03</h2> <ul> <li>Start a harvest on AReS</li> </ul> <h2 id="2022-07-04">2022-07-04</h2> <ul> <li>Linode told me that CGSpace had high load yesterday <ul> <li>I also got some up and down notices from UptimeRobot</li> <li>Looking now, I see there was a very high CPU and database pool load, but a mostly normal DSpace session count</li> </ul> </li> </ul> <p><img src="/cgspace-notes/2022/07/cpu-day.png" alt="CPU load day"> <img 
src="/cgspace-notes/2022/07/jmx_tomcat_dbpools-day.png" alt="JDBC pool day"></p> <ul> <li>Seems we have some old database transactions since 2022-06-27:</li> </ul> <p><img src="/cgspace-notes/2022/07/postgres_locks_ALL-week.png" alt="PostgreSQL locks week"> <img src="/cgspace-notes/2022/07/postgres_querylength_ALL-week.png" alt="PostgreSQL query length week"></p> <ul> <li>Looking at the top connections to nginx yesterday:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.1 | sort | uniq -c | sort -h | tail </span></span><span style="display:flex;"><span> 1132 64.124.8.34 </span></span><span style="display:flex;"><span> 1146 2a01:4f8:1c17:5550::1 </span></span><span style="display:flex;"><span> 1380 137.184.159.211 </span></span><span style="display:flex;"><span> 1533 64.124.8.59 </span></span><span style="display:flex;"><span> 4013 80.248.237.167 </span></span><span style="display:flex;"><span> 4776 54.195.118.125 </span></span><span style="display:flex;"><span> 10482 45.5.186.2 </span></span><span style="display:flex;"><span> 11177 172.104.229.92 </span></span><span style="display:flex;"><span> 15855 2a01:7e00::f03c:91ff:fe9a:3a37 </span></span><span style="display:flex;"><span> 22179 64.39.98.251 </span></span></code></pre></div><ul> <li>And the total number of unique IPs:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span 
style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.1 | sort -u | wc -l </span></span><span style="display:flex;"><span>6952 </span></span></code></pre></div><ul> <li>This seems low, so the load must have come from the request patterns of certain visitors <ul> <li>64.39.98.251 is Qualys, and I’m debating blocking <a href="https://pci.qualys.com/static/help/merchant/getting_started/check_scanner_ip_addresses.htm">all their IPs</a> using a geo block in nginx (need to test)</li> <li>The top few are known ILRI and other CGIAR scrapers, but 80.248.237.167 is on InternetVikings in Sweden, using a normal user agent and scraping Discover</li> <li>64.124.8.59 is making requests with a normal user agent and belongs to Castle Global or Zayo</li> </ul> </li> <li>I ran all system updates and rebooted the server (could have just restarted PostgreSQL, but I thought I might as well do everything)</li> <li>I implemented a geo mapping for both the user agent mapping and the nginx <code>limit_req_zone</code> by extracting the networks into an external file and including it in two different geo mapping blocks <ul> <li>This is clever and relies on the fact that we can use defaults in both cases</li> <li>First, we map the user agent of requests from these networks to “bot” so that Tomcat and Solr handle them accordingly</li> <li>Second, we use this as a key in a <code>limit_req_zone</code>, which relies on a default mapping of ’’ (nginx does not count requests with an empty key)</li> </ul> </li> <li>I noticed that CIP uploaded a number of Georgian presentations with <code>dcterms.language</code> set to English and Other, so I changed them to “ka” <ul> <li>Perhaps we need to update our list of languages to include all languages instead of only the most common ones</li> </ul> </li> <li>I wrote a script <code>ilri/iso-639-value-pairs.py</code> to extract the names and Alpha 2 codes for all ISO 639-1 languages from pycountry and added them to 
<code>input-forms.xml</code></li> </ul> <h2 id="2022-07-06">2022-07-06</h2> <ul> <li>CGSpace went down and up a few times due to high load <ul> <li>I found one host in Romania making very high speed requests with a normal user agent (<code>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C</code>):</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">10</span> </span></span><span style="display:flex;"><span> 516 142.132.248.90 </span></span><span style="display:flex;"><span> 525 157.55.39.234 </span></span><span style="display:flex;"><span> 587 66.249.66.21 </span></span><span style="display:flex;"><span> 593 95.108.213.59 </span></span><span style="display:flex;"><span> 1372 137.184.159.211 </span></span><span style="display:flex;"><span> 4776 54.195.118.125 </span></span><span style="display:flex;"><span> 5441 205.186.128.185 </span></span><span style="display:flex;"><span> 6267 45.5.186.2 </span></span><span style="display:flex;"><span> 15839 2a01:7e00::f03c:91ff:fe9a:3a37 </span></span><span style="display:flex;"><span> 36114 146.19.75.141 </span></span></code></pre></div><ul> <li>I added 146.19.75.141 to the list of bot networks in nginx</li> <li>While looking at the logs I started thinking about Bing again <ul> <li>They apparently <a href="https://www.bing.com/toolbox/bingbot.json">publish a list of all their networks</a></li> <li>I wrote a script to use <code>prips</code> to <a href="https://stackoverflow.com/a/52501093/1996540">print the IPs for each network</a></li> <li>The script is 
<code>bing-networks-to-ips.sh</code></li> <li>From Bing’s IPs alone I purged 145,403 hits… sheesh</li> </ul> </li> <li>Delete two items on CGSpace for Margarita because she was getting the “Authorization denied for action OBSOLETE (DELETE) on BITSTREAM:0b26875a-…” error <ul> <li>This is the same DSpace 6 bug I noticed in 2021-03, 2021-04, and 2021-05</li> </ul> </li> <li>Update some <code>cg.audience</code> metadata to use “Academics” instead of “Academicians”:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value='Academics' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=144 AND text_value='Academicians'; </span></span><span style="display:flex;"><span>UPDATE 104 </span></span></code></pre></div><ul> <li>I will also have to remove “Academicians” from input-forms.xml</li> </ul> <h2 id="2022-07-07">2022-07-07</h2> <ul> <li>Finalize lists of non-AGROVOC subjects in CGSpace that I started last week <ul> <li>I used the <a href="https://wiki.lyrasis.org/display/DSPACE/Helper+SQL+functions+for+DSpace+6">SQL helper functions</a> to find the collections where each term was used:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace= ☘ SELECT DISTINCT(ds6_item2collectionhandle(dspace_object_id)) AS collection, COUNT(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND LOWER(text_value) = 'water demand' GROUP BY collection ORDER BY count DESC LIMIT 5; </span></span><span style="display:flex;"><span> collection │ count </span></span><span style="display:flex;"><span>─────────────┼─────── </span></span><span 
style="display:flex;"><span> 10568/36178 │ 56 </span></span><span style="display:flex;"><span> 10568/36185 │ 46 </span></span><span style="display:flex;"><span> 10568/36181 │ 35 </span></span><span style="display:flex;"><span> 10568/36188 │ 28 </span></span><span style="display:flex;"><span> 10568/36179 │ 21 </span></span><span style="display:flex;"><span>(5 rows) </span></span></code></pre></div><ul> <li>For now I only checked terms from my list that had 100 or more occurrences in CGSpace <ul> <li>This leaves us with thirty-six terms that I will send to Sara Jani and Elizabeth Arnaud to evaluate for possible inclusion in AGROVOC</li> </ul> </li> <li>Write to some submitters from CIAT, Bioversity, and CCAFS to ask if they are still uploading new items with their legacy subject fields on CGSpace <ul> <li>We want to remove them from the submission form to create space for new fields</li> </ul> </li> <li>Update one term I noticed people using that was close to AGROVOC:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# UPDATE metadatavalue SET text_value='development policies' WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=187 AND text_value='development policy'; </span></span><span style="display:flex;"><span>UPDATE 108 </span></span></code></pre></div><ul> <li>After contacting some editors I removed some old metadata fields from the submission form and browse indexes: <ul> <li>Bioversity subject (<code>cg.subject.bioversity</code>)</li> <li>CCAFS phase 1 project tag (<code>cg.identifier.ccafsproject</code>)</li> <li>CIAT project tag (<code>cg.identifier.ciatproject</code>)</li> <li>CIAT subject (<code>cg.subject.ciat</code>)</li> </ul> </li> <li>Work on cleaning and proofing forty-six AfricaRice items for CGSpace <ul> <li>Last week we identified some duplicates so I 
removed those</li> <li>The data is of mediocre quality</li> <li>I’ve been fixing citations (nitpick), adding licenses, adding volume/issue/extent, fixing DOIs, and adding some AGROVOC subjects</li> <li>I even found titles that have typos, which look like OCR errors…</li> </ul> </li> </ul> <h2 id="2022-07-08">2022-07-08</h2> <ul> <li>Finalize the cleaning and proofing of AfricaRice records <ul> <li>I found two suspicious items that claim to have been published but that I can’t find in the respective journals, so I removed those</li> <li>I uploaded the forty-four items to <a href="https://dspacetest.cgiar.org/handle/10568/119135">DSpace Test</a></li> </ul> </li> <li>Margarita from CCAFS said they are no longer using the CCAFS subject or CCAFS phase 2 project tag <ul> <li>I removed these from input-forms.xml and the Discovery facets: <ul> <li>cg.identifier.ccafsprojectpii</li> <li>cg.subject.cifor</li> </ul> </li> <li>For now we will keep them in the search filters</li> </ul> </li> <li>I modified my <code>check-duplicates.py</code> script a bit to fix a logic error for deleted items and add similarity scores from spaCy (see: <a href="https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents">https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents</a>) <ul> <li>I want to use this with the MARLO innovation reports, to find related publications and working papers on CGSpace</li> <li>I am curious to see how the similarity scores compare to those from trgm… perhaps we don’t need them actually</li> </ul> </li> <li>Deploy latest changes to submission form, Discovery, and browse on CGSpace <ul> <li>Also run all system updates and reboot the host</li> </ul> </li> <li>Fix 152 <code>dcterms.relation</code> values that use “cgspace.cgiar.org” links instead of handles:</li> </ul> <div class="highlight"><pre tabindex="0" 
style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$'; </span></span></code></pre></div><h2 id="2022-07-10">2022-07-10</h2> <ul> <li>UptimeRobot says that CGSpace is down <ul> <li>I see high load around 22, high CPU around 800%</li> <li>Doesn’t seem to be a lot of unique IPs:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log | sort -u | wc -l </span></span><span style="display:flex;"><span>2243 </span></span></code></pre></div><p>Looking at the top twenty I see some usual IPs, but some new ones on Hetzner that are using many DSpace sessions:</p> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ grep 65.109.2.97 dspace.log.2022-07-10 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l </span></span><span style="display:flex;"><span>1613 </span></span><span style="display:flex;"><span>$ grep 95.216.174.97 dspace.log.2022-07-10 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l </span></span><span style="display:flex;"><span>1696 </span></span><span 
style="display:flex;"><span>$ grep 65.109.15.213 dspace.log.2022-07-10 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l </span></span><span style="display:flex;"><span>1708 </span></span><span style="display:flex;"><span>$ grep 65.108.80.78 dspace.log.2022-07-10 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l </span></span><span style="display:flex;"><span>1830 </span></span><span style="display:flex;"><span>$ grep 65.108.95.23 dspace.log.2022-07-10 | grep -oE <span style="color:#e6db74">'session_id=[A-Z0-9]{32}:ip_addr='</span> | sort | uniq | wc -l </span></span><span style="display:flex;"><span>1811 </span></span></code></pre></div><p><img src="/cgspace-notes/2022/07/jmx_dspace_sessions-week.png" alt="DSpace sessions week"></p> <ul> <li> <p>These IPs are using normal-looking user agents:</p> <ul> <li><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0.1</code></li> <li><code>Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/45.0</code></li> <li><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1</code></li> <li><code>Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.85 Safari/537.36</code></li> </ul> </li> <li> <p>For now I will add the networks I’m seeing to nginx’s <code>bot-networks.conf</code> (not all of Hetzner) and purge the hits later:</p> <ul> <li>65.108.0.0/16</li> <li>65.21.0.0/16</li> <li>95.216.0.0/16</li> <li>135.181.0.0/16</li> <li>138.201.0.0/16</li> </ul> </li> <li> <p>I think I’m going to get to a point where I categorize all commercial subnets as bots by default and then whitelist those we need</p> </li> <li> <p>Sheesh, there are a bunch more IPv6 addresses that are also on Hetzner:</p> </li> </ul> <div class="highlight"><pre tabindex="0" 
style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $1}'</span> /var/log/nginx/<span style="color:#f92672">{</span>access,library-access<span style="color:#f92672">}</span>.log | sort | grep 2a01:4f9 | uniq -c | sort -h </span></span><span style="display:flex;"><span> 1 2a01:4f9:6a:1c2b::2 </span></span><span style="display:flex;"><span> 2 2a01:4f9:2b:5a8::2 </span></span><span style="display:flex;"><span> 2 2a01:4f9:4b:4495::2 </span></span><span style="display:flex;"><span> 96 2a01:4f9:c010:518c::1 </span></span><span style="display:flex;"><span> 137 2a01:4f9:c010:a9bc::1 </span></span><span style="display:flex;"><span> 142 2a01:4f9:c010:58c9::1 </span></span><span style="display:flex;"><span> 142 2a01:4f9:c010:58ea::1 </span></span><span style="display:flex;"><span> 144 2a01:4f9:c010:58eb::1 </span></span><span style="display:flex;"><span> 145 2a01:4f9:c010:6ff8::1 </span></span><span style="display:flex;"><span> 148 2a01:4f9:c010:5190::1 </span></span><span style="display:flex;"><span> 149 2a01:4f9:c010:7d6d::1 </span></span><span style="display:flex;"><span> 153 2a01:4f9:c010:5226::1 </span></span><span style="display:flex;"><span> 156 2a01:4f9:c010:7f74::1 </span></span><span style="display:flex;"><span> 160 2a01:4f9:c010:5188::1 </span></span><span style="display:flex;"><span> 161 2a01:4f9:c010:58e5::1 </span></span><span style="display:flex;"><span> 168 2a01:4f9:c010:58ed::1 </span></span><span style="display:flex;"><span> 170 2a01:4f9:c010:548e::1 </span></span><span style="display:flex;"><span> 170 2a01:4f9:c010:8c97::1 </span></span><span style="display:flex;"><span> 175 2a01:4f9:c010:58c8::1 </span></span><span style="display:flex;"><span> 175 2a01:4f9:c010:aada::1 </span></span><span style="display:flex;"><span> 182 2a01:4f9:c010:58ec::1 </span></span><span style="display:flex;"><span> 182 
2a01:4f9:c010:ae8c::1 </span></span><span style="display:flex;"><span> 502 2a01:4f9:c010:ee57::1 </span></span><span style="display:flex;"><span> 530 2a01:4f9:c011:567a::1 </span></span><span style="display:flex;"><span> 535 2a01:4f9:c010:d04e::1 </span></span><span style="display:flex;"><span> 539 2a01:4f9:c010:3d9a::1 </span></span><span style="display:flex;"><span> 586 2a01:4f9:c010:93db::1 </span></span><span style="display:flex;"><span> 593 2a01:4f9:c010:a04a::1 </span></span><span style="display:flex;"><span> 601 2a01:4f9:c011:4166::1 </span></span><span style="display:flex;"><span> 607 2a01:4f9:c010:9881::1 </span></span><span style="display:flex;"><span> 640 2a01:4f9:c010:87fb::1 </span></span><span style="display:flex;"><span> 648 2a01:4f9:c010:e680::1 </span></span><span style="display:flex;"><span> 1141 2a01:4f9:3a:2696::2 </span></span><span style="display:flex;"><span> 1146 2a01:4f9:3a:2555::2 </span></span><span style="display:flex;"><span> 3207 2a01:4f9:3a:2c19::2 </span></span></code></pre></div><ul> <li>Maybe it’s time I ban all of Hetzner… sheesh.</li> <li>I left for a few hours and the server was going up and down the whole time, still very high CPU and database when I got back</li> </ul> <p><img src="/cgspace-notes/2022/07/cpu-day.png" alt="CPU day"></p> <ul> <li>I am not sure what’s going on <ul> <li>I extracted all the IPs and used <code>resolve-addresses-geoip2.py</code> to analyze them and extract all the Hetzner networks and block them</li> <li>It’s 181 IPs on Hetzner…</li> </ul> </li> <li>I rebooted the server to see if it was just some stuck locks in PostgreSQL…</li> <li>The load is still higher than I would expect, and after a few more hours I see more Hetzner IPs coming through? 
Two more subnets to block</li> <li>Start a harvest on AReS</li> </ul> </article> </div> <!-- /.blog-main --> <aside class="col-sm-3 ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol class="list-unstyled"> <li><a href="/cgspace-notes/2022-07/">July, 2022</a></li> <li><a href="/cgspace-notes/2022-06/">June, 2022</a></li> <li><a href="/cgspace-notes/2022-05/">May, 2022</a></li> <li><a href="/cgspace-notes/2022-04/">April, 2022</a></li> <li><a href="/cgspace-notes/2022-03/">March, 2022</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>