<!DOCTYPE html> <html lang="en" > <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <meta property="og:title" content="September, 2022" /> <meta property="og:description" content="2022-09-01 A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet I tested an item submission on DSpace Test with the Cocoon org.apache.cocoon.uploads.autosave=false change The submission works as expected Start debugging some region-related issues with csv-metadata-quality I created a new test file test-geography.csv with some different scenarios I also fixed a few bugs and improved the region-matching logic " /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-01/" /> <meta property="article:published_time" content="2022-01-01T09:41:36+03:00" /> <meta property="article:modified_time" content="2022-09-09T17:29:51+03:00" /> <meta name="twitter:card" content="summary"/> <meta name="twitter:title" content="September, 2022"/> <meta name="twitter:description" content="2022-09-01 A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet I tested an item submission on DSpace Test with the Cocoon org.apache.cocoon.uploads.autosave=false change The submission works as expected Start debugging some region-related issues with csv-metadata-quality I created a new test file test-geography.csv with some different scenarios I also fixed a few bugs and improved the region-matching logic "/> <meta name="generator" content="Hugo 0.102.3" /> <script type="application/ld+json"> { "@context": "http://schema.org", "@type": "BlogPosting", "headline": "September, 2022", "url": "https://alanorth.github.io/cgspace-notes/2022-01/", "wordCount": "1259", "datePublished": "2022-01-01T09:41:36+03:00", "dateModified": "2022-09-09T17:29:51+03:00", "author": { "@type": "Person", "name": "Alan Orth" }, "keywords": "Notes" } </script> <link 
rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-01/"> <title>September, 2022 | CGSpace Notes</title> <!-- combined, minified CSS --> <link href="https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel="stylesheet" integrity="sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin="anonymous"> <!-- minified Font Awesome for SVG icons --> <script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script> <!-- RSS 2.0 feed --> </head> <body> <div class="blog-masthead"> <div class="container"> <nav class="nav blog-nav"> <a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a> </nav> </div> </div> <header class="blog-header"> <div class="container"> <h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1> <p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p> </div> </header> <div class="container"> <div class="row"> <div class="col-sm-8 blog-main"> <article class="blog-post"> <header> <h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-01/">September, 2022</a></h2> <p class="blog-post-meta"> <time datetime="2022-01-01T09:41:36+03:00">Sat Jan 01, 2022</time> in <span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a> </p> </header> <h2 id="2022-09-01">2022-09-01</h2> <ul> <li>A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet</li> <li>I tested an item submission on DSpace Test with the Cocoon <code>org.apache.cocoon.uploads.autosave=false</code> change <ul> <li>The submission works as expected</li> </ul> 
</li> <li>Start debugging some region-related issues with csv-metadata-quality <ul> <li>I created a new test file <code>test-geography.csv</code> with some different scenarios</li> <li>I also fixed a few bugs and improved the region-matching logic</li> </ul> </li> </ul> <ul> <li>I filed <a href="https://github.com/konstantinstadler/country_converter/issues/115">an issue for the “South-eastern Asia” case mismatch in country_converter</a> on GitHub</li> <li>Meeting with Moayad to discuss OpenRXV developments <ul> <li>He demoed his new multiple dashboards feature and I helped him rebase those changes to master so we can test them more</li> </ul> </li> </ul> <h2 id="2022-09-02">2022-09-02</h2> <ul> <li>I worked a bit more on exclusion and skipping logic in csv-metadata-quality <ul> <li>I also pruned and updated all the Python dependencies</li> <li>Then I released <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.6.0">version 0.6.0</a> now that the excludes and region matching support is working way better</li> </ul> </li> </ul> <h2 id="2022-09-05">2022-09-05</h2> <ul> <li>Started a harvest on AReS last night</li> <li>Looking over the Solr statistics from last month I see many user agents that look suspicious: <ul> <li>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)</li> <li>Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 77.0.3865.90 Safari / 537.36</li> <li>Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0</li> <li>Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre</li> <li>Mozilla/5.0 (Windows NT 10.0; Win64; x64; Xbox; Xbox One) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36 Edge/44.18363.8131</li> <li>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)</li> <li>Mozilla/4.0 (compatible; MSIE 4.5; Windows 98;)</li> <li>curb</li> <li>bitdiscovery</li> <li>omgili/0.5 
+http://omgili.com</li> <li>Mozilla/5.0 (compatible)</li> <li>Vizzit</li> <li>Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0</li> <li>Mozilla/5.0 (Android; Mobile; rv:13.0) Gecko/13.0 Firefox/13.0</li> <li>Java/17-ea</li> <li>AdobeUxTechC4-Async/3.0.12 (win32)</li> <li>ZaloPC-win32-24v473</li> <li>Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com</li> <li>Scoop.it</li> <li>Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0</li> <li>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</li> <li>ows NT 10.0; WOW64; rv: 50.0) Gecko/20100101 Firefox/50.0</li> <li>WebAPIClient</li> <li>Mozilla/5.0 Firefox/26.0</li> <li>Mozilla/5.0 (compatible; woorankreview/2.0; +https://www.woorank.com/)</li> </ul> </li> <li>For example, some are apparently using versions of Firefox that are over ten years old, and some are obviously trying to look like valid user agents, but making typos (<code>Mozilla / 5.0</code>)</li> <li>Tons of hosts are making requests like this:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&amp;isAllowed=\x22&gt;&lt;script%20&gt;alert(String.fromCharCode(88,83,83))&lt;/script&gt; HTTP/1.1" 400 5 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0 </span></span></code></pre></div><ul> <li>I got a list of hosts making requests like that so I can purge their hits:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.<span style="color:#f92672">[</span>123<span 
style="color:#f92672">]</span>*.gz | grep <span style="color:#e6db74">'String.fromCharCode('</span> | awk <span style="color:#e6db74">'{print $1}'</span> | sort -u > /tmp/ips.txt </span></span></code></pre></div><ul> <li>I purged 4,718 hits from IPs</li> <li>I see some new Hetzner ranges that I hadn’t blocked yet apparently? <ul> <li>I got a <a href="https://www.ipqualityscore.com/asn-details/AS24940/hetzner-online-gmbh">list of Hetzner’s IPs from IP Quality Score</a> then added them to the existing ones in my Ansible playbooks:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ awk <span style="color:#e6db74">'{print $1}'</span> /tmp/hetzner.txt | wc -l </span></span><span style="display:flex;"><span>36 </span></span><span style="display:flex;"><span>$ sort -u /tmp/hetzner-combined.txt | wc -l </span></span><span style="display:flex;"><span>49 </span></span></code></pre></div><ul> <li>I will add this new list to nginx’s <code>bot-networks.conf</code> so they get throttled on scraping XMLUI and get classified as bots in Solr statistics</li> <li>Then I purged hits from the following user agents:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents </span></span><span style="display:flex;"><span>Found 374 hits from curb in statistics </span></span><span style="display:flex;"><span>Found 350 hits from bitdiscovery in statistics </span></span><span style="display:flex;"><span>Found 564 hits from omgili in statistics </span></span><span style="display:flex;"><span>Found 390 hits from Vizzit in statistics </span></span><span style="display:flex;"><span>Found 9125 hits 
from AdobeUxTechC4-Async in statistics </span></span><span style="display:flex;"><span>Found 97 hits from ZaloPC-win32-24v473 in statistics </span></span><span style="display:flex;"><span>Found 518 hits from nbertaupete95 in statistics </span></span><span style="display:flex;"><span>Found 218 hits from Scoop.it in statistics </span></span><span style="display:flex;"><span>Found 584 hits from WebAPIClient in statistics </span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"> </span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 12220 </span></span></code></pre></div><ul> <li>Then I will add these user agents to the ILRI spider override in DSpace</li> </ul> <h2 id="2022-09-06">2022-09-06</h2> <ul> <li>I’m testing dspace-statistics-api with our DSpace 7 test server <ul> <li>After setting up the env and the database the <code>python -m dspace_statistics_api.indexer</code> runs without issues</li> <li>While playing with Solr I tried to search for statistics from this month using <code>time:2022-09*</code> but I get this error: “Can’t run prefix queries on numeric fields”</li> <li>I guess that the syntax in Solr changed since 4.10…</li> <li>This works, but is super annoying: <code>time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]</code></li> </ul> </li> </ul> <h2 id="2022-09-07">2022-09-07</h2> <ul> <li>I tested the controlled-vocabulary changes on DSpace 6 and they work fine <ul> <li>Last week I found that DSpace 7 is more strict with controlled vocabularies and requires IDs for all node values</li> <li>This is a pain because it means I have to re-do the IDs in each file every time I update them</li> <li>If I add <code>id="0000"</code> to each, then I can use <a href="https://vim.fandom.com/wiki/Making_a_list_of_numbers#Substitute_with_ascending_numbers">this vim expression</a> <code>let i=0001 | g/0000/s//\=i/ | let i=i+1</code> to 
replace the numbers with increments starting from 1</li> </ul> </li> <li>Meeting with Marie Angelique, Abenet, Sara, and Margarita to continue the discussion about Types from last week <ul> <li>We made progress with concrete actions and will continue next week</li> </ul> </li> </ul> <h2 id="2022-09-08">2022-09-08</h2> <ul> <li>I had a meeting with Nicky from UNEP to discuss issues they are having with their DSpace <ul> <li>I told her about the meeting of DSpace community people that we’re planning at ILRI in the next few weeks</li> </ul> </li> </ul> <h2 id="2022-09-09">2022-09-09</h2> <ul> <li>Add some value mappings to AReS because I see a lot of incorrect regions and countries</li> <li>I also found some values that were blank in CGSpace so I deleted them:</li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>dspace=# BEGIN; </span></span><span style="display:flex;"><span>BEGIN </span></span><span style="display:flex;"><span>dspace=# DELETE FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_value=''; </span></span><span style="display:flex;"><span>DELETE 70 </span></span><span style="display:flex;"><span>dspace=# COMMIT; </span></span><span style="display:flex;"><span>COMMIT </span></span></code></pre></div><ul> <li>Start a full Discovery index on CGSpace so that Discovery picks up these changes</li> </ul> <h2 id="2022-09-11">2022-09-11</h2> <ul> <li>Today is Sunday and I see the load on the server is high <ul> <li>Google and a bunch of other bots have been blocked on XMLUI for the past two weeks so it’s not from them!</li> <li>Looking at the top IPs this morning:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span 
style="display:flex;"><span># cat /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.1 | grep <span style="color:#e6db74">'11/Sep/2022'</span> | awk <span style="color:#e6db74">'{print $1}'</span> | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">40</span> </span></span><span style="display:flex;"><span>... </span></span><span style="display:flex;"><span> 165 64.233.172.79 </span></span><span style="display:flex;"><span> 166 87.250.224.34 </span></span><span style="display:flex;"><span> 200 69.162.124.231 </span></span><span style="display:flex;"><span> 202 216.244.66.198 </span></span><span style="display:flex;"><span> 385 207.46.13.149 </span></span><span style="display:flex;"><span> 398 207.46.13.147 </span></span><span style="display:flex;"><span> 421 66.249.64.185 </span></span><span style="display:flex;"><span> 422 157.55.39.81 </span></span><span style="display:flex;"><span> 442 2a01:4f8:1c17:5550::1 </span></span><span style="display:flex;"><span> 451 64.124.8.36 </span></span><span style="display:flex;"><span> 578 137.184.159.211 </span></span><span style="display:flex;"><span> 597 136.243.228.195 </span></span><span style="display:flex;"><span> 1185 66.249.64.183 </span></span><span style="display:flex;"><span> 1201 157.55.39.80 </span></span><span style="display:flex;"><span> 3135 80.248.237.167 </span></span><span style="display:flex;"><span> 4794 54.195.118.125 </span></span><span style="display:flex;"><span> 5486 45.5.186.2 </span></span><span style="display:flex;"><span> 6322 2a01:7e00::f03c:91ff:fe9a:3a37 </span></span><span style="display:flex;"><span> 9556 66.249.64.181 </span></span></code></pre></div><ul> <li>The top is still Google, but all the requests are HTTP 503 because I classified them as bots for XMLUI at least</li> <li>Then there’s 
80.248.237.167, which is using a normal user agent and scraping Discovery <ul> <li>That IP is on Internet Vikings aka Internetbolaget and we are already marking that subnet as ‘bot’ for XMLUI so most of these requests are HTTP 503</li> </ul> </li> <li>On another note, I’m curious to explore enabling caching of certain REST API responses <ul> <li>For example, where the use is for harvesting rather than actual clients getting bitstreams or thumbnails, it seems there might be a benefit to speeding these up for subsequent requestors:</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $7}'</span> /var/log/nginx/rest.log | grep -v retrieve | sort | uniq -c | sort -h | tail -n <span style="color:#ae81ff">10</span> </span></span><span style="display:flex;"><span> 4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/bitstreams </span></span><span style="display:flex;"><span> 4 /rest/items/3f692ddd-7856-4bf0-a587-99fb3df0688a/metadata </span></span><span style="display:flex;"><span> 4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/bitstreams </span></span><span style="display:flex;"><span> 4 /rest/items/b014e36f-b496-43d8-9148-cc9db8a6efac/metadata </span></span><span style="display:flex;"><span> 5 /rest/handle/10568/110310?expand=all </span></span><span style="display:flex;"><span> 5 /rest/handle/10568/89980?expand=all </span></span><span style="display:flex;"><span> 5 /rest/handle/10568/97614?expand=all </span></span><span style="display:flex;"><span> 6 /rest/handle/10568/107086?expand=all </span></span><span style="display:flex;"><span> 6 /rest/handle/10568/108503?expand=all </span></span><span style="display:flex;"><span> 6 /rest/handle/10568/98424?expand=all </span></span></code></pre></div><ul> <li>I specifically have to not cache things like 
requests for bitstreams because those are from actual users and we need to keep the real requests so we get the statistics hit <ul> <li>Will be interesting to check the results above as the day goes on (now 10AM)</li> <li>To estimate the potential savings from caching I will check how many non-bitstream requests are made versus how many are made more than once (updated the next morning using yesterday’s log):</li> </ul> </li> </ul> <div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $7}'</span> /var/log/nginx/rest.log.1 | grep -v retrieve | sort -u | wc -l </span></span><span style="display:flex;"><span>33733 </span></span><span style="display:flex;"><span># awk <span style="color:#e6db74">'{print $7}'</span> /var/log/nginx/rest.log.1 | grep -v retrieve | sort | uniq -c | awk <span style="color:#e6db74">'$1 > 1'</span> | wc -l </span></span><span style="display:flex;"><span>5637 </span></span></code></pre></div><ul> <li>In the afternoon I started a harvest on AReS (which should affect the numbers above also)</li> <li>I enabled an nginx proxy cache on DSpace Test for this location regex: <code>location ~ /rest/(handle|items|collections|communities)/.+</code></li> </ul> <h2 id="2022-09-12">2022-09-12</h2> <ul> <li>I am testing harvesting DSpace Test via AReS with the nginx proxy cache enabled</li> </ul> <!-- raw HTML omitted --> </article> </div> <!-- /.blog-main --> <aside class="col-sm-3 ml-auto blog-sidebar"> <section class="sidebar-module"> <h4>Recent Posts</h4> <ol class="list-unstyled"> <li><a href="/cgspace-notes/2022-08/">August, 2022</a></li> <li><a href="/cgspace-notes/2022-07/">July, 2022</a></li> <li><a href="/cgspace-notes/2022-06/">June, 2022</a></li> <li><a href="/cgspace-notes/2022-05/">May, 2022</a></li> <li><a href="/cgspace-notes/2022-04/">April, 
2022</a></li> </ol> </section> <section class="sidebar-module"> <h4>Links</h4> <ol class="list-unstyled"> <li><a href="https://cgspace.cgiar.org">CGSpace</a></li> <li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li> <li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li> </ol> </section> </aside> </div> <!-- /.row --> </div> <!-- /.container --> <footer class="blog-footer"> <p dir="auto"> Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>. </p> <p> <a href="#">Back to top</a> </p> </footer> </body> </html>