<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="September, 2022" />
<meta property="og:description" content="2022-09-01
A bit of work on the &ldquo;Mapping CG CoreCGSpaceMELMARLO Types&rdquo; spreadsheet
I tested an item submission on DSpace Test with the Cocoon org.apache.cocoon.uploads.autosave=false change
The submission works as expected
Start debugging some region-related issues with csv-metadata-quality
I created a new test file test-geography.csv with some different scenarios
I also fixed a few bugs and improved the region-matching logic
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-01/" />
<meta property="article:published_time" content="2022-01-01T09:41:36+03:00" />
<meta property="article:modified_time" content="2022-09-05T16:59:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2022"/>
<meta name="twitter:description" content="2022-09-01
A bit of work on the &ldquo;Mapping CG CoreCGSpaceMELMARLO Types&rdquo; spreadsheet
I tested an item submission on DSpace Test with the Cocoon org.apache.cocoon.uploads.autosave=false change
The submission works as expected
Start debugging some region-related issues with csv-metadata-quality
I created a new test file test-geography.csv with some different scenarios
I also fixed a few bugs and improved the region-matching logic
"/>
<meta name="generator" content="Hugo 0.102.3" />
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "September, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-01/",
"wordCount": "629",
"datePublished": "2022-01-01T09:41:36+03:00",
"dateModified": "2022-09-05T16:59:11+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-01/">
<title>September, 2022 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC&#43;AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz&#43;lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-01/">September, 2022</a></h2>
<p class="blog-post-meta">
<time datetime="2022-01-01T09:41:36+03:00">Sat Jan 01, 2022</time>
in
<span class="fas fa-folder" aria-hidden="true"></span>&nbsp;<a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2022-09-01">2022-09-01</h2>
<ul>
<li>A bit of work on the &ldquo;Mapping CG CoreCGSpaceMELMARLO Types&rdquo; spreadsheet</li>
<li>I tested an item submission on DSpace Test with the Cocoon <code>org.apache.cocoon.uploads.autosave=false</code> change
<ul>
<li>The submission works as expected</li>
</ul>
</li>
<li>Start debugging some region-related issues with csv-metadata-quality
<ul>
<li>I created a new test file <code>test-geography.csv</code> with some different scenarios</li>
<li>I also fixed a few bugs and improved the region-matching logic</li>
</ul>
</li>
</ul>
<ul>
<li>I filed <a href="https://github.com/konstantinstadler/country_converter/issues/115">an issue for the &ldquo;South-eastern Asia&rdquo; case mismatch in country_converter</a> on GitHub</li>
<li>Meeting with Moayad to discuss OpenRXV developments
<ul>
<li>He demoed his new multiple dashboards feature and I helped him rebase those changes to master so we can test them more</li>
</ul>
</li>
</ul>
<h2 id="2022-09-02">2022-09-02</h2>
<ul>
<li>I worked a bit more on exclusion and skipping logic in csv-metadata-quality
<ul>
<li>I also pruned and updated all the Python dependencies</li>
<li>Then I released <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.6.0">version 0.6.0</a> now that the exclusion and region-matching support works much better</li>
</ul>
</li>
</ul>
<h2 id="2022-09-05">2022-09-05</h2>
<ul>
<li>Started a harvest on AReS last night</li>
<li>Looking over the Solr statistics from last month I see many user agents that look suspicious:
<ul>
<li>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)</li>
<li>Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 77.0.3865.90 Safari / 537.36</li>
<li>Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0</li>
<li>Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre</li>
<li>Mozilla/5.0 (Windows NT 10.0; Win64; x64; Xbox; Xbox One) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36 Edge/44.18363.8131</li>
<li>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)</li>
<li>Mozilla/4.0 (compatible; MSIE 4.5; Windows 98;)</li>
<li>curb</li>
<li>bitdiscovery</li>
<li>omgili/0.5 +http://omgili.com</li>
<li>Mozilla/5.0 (compatible)</li>
<li>Vizzit</li>
<li>Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0</li>
<li>Mozilla/5.0 (Android; Mobile; rv:13.0) Gecko/13.0 Firefox/13.0</li>
<li>Java/17-ea</li>
<li>AdobeUxTechC4-Async/3.0.12 (win32)</li>
<li>ZaloPC-win32-24v473</li>
<li>Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com</li>
<li>Scoop.it</li>
<li>Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0</li>
<li>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</li>
<li>ows NT 10.0; WOW64; rv: 50.0) Gecko/20100101 Firefox/50.0</li>
<li>WebAPIClient</li>
<li>Mozilla/5.0 Firefox/26.0</li>
<li>Mozilla/5.0 (compatible; woorankreview/2.0; +https://www.woorank.com/)</li>
</ul>
</li>
<li>For example, some are apparently using versions of Firefox that are over ten years old, and others are obviously trying to look like valid user agents but contain typos (<code>Mozilla / 5.0</code>)</li>
<li>Tons of hosts are making requests like this:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&amp;isAllowed=\x22&gt;&lt;script%20&gt;alert(String.fromCharCode(88,83,83))&lt;/script&gt; HTTP/1.1&#34; 400 5 &#34;-&#34; &#34;Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
</span></span></code></pre></div><ul>
<li>I got a list of hosts making requests like that so I can purge their hits:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.<span style="color:#f92672">[</span>123<span style="color:#f92672">]</span>*.gz | grep <span style="color:#e6db74">&#39;String.fromCharCode(&#39;</span> | awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> | sort -u &gt; /tmp/ips.txt
</span></span></code></pre></div><ul>
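<li>A rough Python equivalent of that pipeline, as a sketch only: the function name <code>unique_client_ips</code> and its <code>marker</code> parameter are made up for illustration, and it assumes the client IP is the first whitespace-separated field of each log line, as in the <code>awk</code> command:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import gzip


def unique_client_ips(log_paths, marker="String.fromCharCode("):
    """Return the sorted unique first fields of log lines containing marker."""
    ips = set()
    for path in log_paths:
        # Rotated nginx logs are gzipped; read anything else as plain text
        opener = gzip.open if str(path).endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            for line in handle:
                if marker in line:
                    fields = line.split()
                    if fields:
                        # Same idea as awk '{print $1}'
                        ips.add(fields[0])
    # Same idea as sort -u: de-duplicate, then sort as strings
    return sorted(ips)
</code></pre></div>
For example, <code>unique_client_ips(glob.glob("/var/log/nginx/access.log.[123]*.gz"))</code> would cover one of the four log families in the command above.
</li>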
<li>I purged 4,718 hits from those IPs</li>
<li>I see some new Hetzner ranges that I hadn&rsquo;t blocked yet apparently?
<ul>
<li>I got a <a href="https://www.ipqualityscore.com/asn-details/AS24940/hetzner-online-gmbh">list of Hetzner&rsquo;s IPs from IP Quality Score</a> then added them to the existing ones in my Ansible playbooks:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ awk <span style="color:#e6db74">&#39;{print $1}&#39;</span> /tmp/hetzner.txt | wc -l
</span></span><span style="display:flex;"><span>36
</span></span><span style="display:flex;"><span>$ sort -u /tmp/hetzner-combined.txt | wc -l
</span></span><span style="display:flex;"><span>49
</span></span></code></pre></div><ul>
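<li>Combining the new ranges with the existing ones could also be sketched in Python with the standard <code>ipaddress</code> module. The helper name <code>merge_networks</code> is hypothetical, and unlike <code>sort -u</code> it normalizes each entry, so two spellings of the same network count once. It does not merge overlapping ranges:
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import ipaddress


def merge_networks(existing, new):
    """Combine two iterables of CIDR strings into a sorted, de-duplicated list."""
    networks = set()
    for entry in list(existing) + list(new):
        entry = entry.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments
        # strict=False tolerates entries with host bits set, e.g. 192.0.2.1/24
        networks.add(ipaddress.ip_network(entry, strict=False))
    # Sort by IP version first, because IPv4 and IPv6 networks cannot be
    # compared with each other directly
    ordered = sorted(networks, key=lambda net: (net.version, net))
    return [str(net) for net in ordered]
</code></pre></div>
</li>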
<li>I will add this new list to nginx&rsquo;s <code>bot-networks.conf</code> so they get throttled on scraping XMLUI and get classified as bots in Solr statistics</li>
<li>Then I purged hits from the following user agents:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents
</span></span><span style="display:flex;"><span>Found 374 hits from curb in statistics
</span></span><span style="display:flex;"><span>Found 350 hits from bitdiscovery in statistics
</span></span><span style="display:flex;"><span>Found 564 hits from omgili in statistics
</span></span><span style="display:flex;"><span>Found 390 hits from Vizzit in statistics
</span></span><span style="display:flex;"><span>Found 9125 hits from AdobeUxTechC4-Async in statistics
</span></span><span style="display:flex;"><span>Found 97 hits from ZaloPC-win32-24v473 in statistics
</span></span><span style="display:flex;"><span>Found 518 hits from nbertaupete95 in statistics
</span></span><span style="display:flex;"><span>Found 218 hits from Scoop.it in statistics
</span></span><span style="display:flex;"><span>Found 584 hits from WebAPIClient in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 12220
</span></span></code></pre></div><ul>
<li>Then I will add these user agents to the ILRI spider override in DSpace</li>
</ul>
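<ul>
<li>The two patterns noted earlier in this entry (very old Firefox versions and typos like <code>Mozilla / 5.0</code>) could be flagged with a small script. This is only a sketch and not how <code>check-spider-hits.sh</code> works: the function name, the regular expressions, and the version cut-off of 60 are all assumptions chosen for illustration:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import re

# Hypothetical cut-off: Firefox major versions below this are treated as suspicious
MIN_FIREFOX_MAJOR = 60

# A slash surrounded by spaces, as in "Mozilla / 5.0", which looks like a typo
SPACED_SLASH = re.compile(r"\w\s+/\s+\d")
FIREFOX_VERSION = re.compile(r"Firefox/(\d+)")


def suspicious_reasons(user_agent):
    """Return a list of reasons why a user agent string looks suspicious."""
    reasons = []
    if SPACED_SLASH.search(user_agent):
        reasons.append("spaces around slash")
    match = FIREFOX_VERSION.search(user_agent)
    # range(MIN_FIREFOX_MAJOR) covers 0 up to, but not including, the cut-off
    if match and int(match.group(1)) in range(MIN_FIREFOX_MAJOR):
        reasons.append("old Firefox")
    return reasons
</code></pre></div>
<ul>
<li>An empty list means neither heuristic matched, which says nothing about agents like <code>curb</code> or <code>bitdiscovery</code> that are suspicious for other reasons</li>
</ul>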
<h2 id="2022-09-06">2022-09-06</h2>
<ul>
<li>I&rsquo;m testing dspace-statistics-api with our DSpace 7 test server
<ul>
<li>After setting up the environment and the database, <code>python -m dspace_statistics_api.indexer</code> runs without issues</li>
<li>While playing with Solr I tried to search for statistics from this month using <code>time:2022-09*</code> but I get this error: &ldquo;Can&rsquo;t run prefix queries on numeric fields&rdquo;</li>
<li>I guess that the syntax in Solr changed since 4.10&hellip;</li>
<li>This works, but is super annoying: <code>time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]</code></li>
</ul>
</li>
</ul>
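<ul>
<li>To avoid typing the range by hand, the query for any month can be generated. A minimal sketch, where the helper name <code>month_range_query</code> is hypothetical and the field name <code>time</code> is taken from the query above:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import calendar


def month_range_query(year, month, field="time"):
    """Build an inclusive Solr range query covering one calendar month in UTC."""
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year:04d}-{month:02d}-01T00:00:00Z"
    end = f"{year:04d}-{month:02d}-{last_day:02d}T23:59:59Z"
    return f"{field}:[{start} TO {end}]"
</code></pre></div>
<ul>
<li>Calling <code>month_range_query(2022, 9)</code> produces the same string as the hand-written query above</li>
</ul>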
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>