<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<meta property="og:title" content="September, 2022" />
<meta property="og:description" content="2022-09-01
A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet
I tested an item submission on DSpace Test with the Cocoon org.apache.cocoon.uploads.autosave=false change
The submission works as expected
Start debugging some region-related issues with csv-metadata-quality
I created a new test file test-geography.csv with some different scenarios
I also fixed a few bugs and improved the region-matching logic
" />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2022-01/" />
<meta property="article:published_time" content="2022-01-01T09:41:36+03:00" />
<meta property="article:modified_time" content="2022-09-05T16:59:11+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="September, 2022"/>
<meta name="twitter:description" content="2022-09-01
A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet
I tested an item submission on DSpace Test with the Cocoon org.apache.cocoon.uploads.autosave=false change
The submission works as expected
Start debugging some region-related issues with csv-metadata-quality
I created a new test file test-geography.csv with some different scenarios
I also fixed a few bugs and improved the region-matching logic
"/>
<meta name="generator" content="Hugo 0.102.3" />
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "headline": "September, 2022",
  "url": "https://alanorth.github.io/cgspace-notes/2022-01/",
  "wordCount": "629",
  "datePublished": "2022-01-01T09:41:36+03:00",
  "dateModified": "2022-09-05T16:59:11+03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
  },
  "keywords": "Notes"
}
</script>
<link rel="canonical" href="https://alanorth.github.io/cgspace-notes/2022-01/">
<title>September, 2022 | CGSpace Notes</title>
<!-- combined, minified CSS -->
<link href="https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel="stylesheet" integrity="sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin="anonymous">
<!-- minified Font Awesome for SVG icons -->
<script defer src="https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity="sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin="anonymous"></script>
<!-- RSS 2.0 feed -->
</head>
<body>
<div class="blog-masthead">
<div class="container">
<nav class="nav blog-nav">
<a class="nav-link " href="https://alanorth.github.io/cgspace-notes/">Home</a>
</nav>
</div>
</div>
<header class="blog-header">
<div class="container">
<h1 class="blog-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/" rel="home">CGSpace Notes</a></h1>
<p class="lead blog-description" dir="auto">Documenting day-to-day work on the <a href="https://cgspace.cgiar.org">CGSpace</a> repository.</p>
</div>
</header>
<div class="container">
<div class="row">
<div class="col-sm-8 blog-main">
<article class="blog-post">
<header>
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2022-01/">September, 2022</a></h2>
<p class="blog-post-meta">
<time datetime="2022-01-01T09:41:36+03:00">Sat Jan 01, 2022</time>
in
<span class="fas fa-folder" aria-hidden="true"></span> <a href="/categories/notes/" rel="category tag">Notes</a>
</p>
</header>
<h2 id="2022-09-01">2022-09-01</h2>
<ul>
<li>A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet</li>
<li>I tested an item submission on DSpace Test with the Cocoon <code>org.apache.cocoon.uploads.autosave=false</code> change
<ul>
<li>The submission works as expected</li>
</ul>
</li>
<li>Start debugging some region-related issues with csv-metadata-quality
<ul>
<li>I created a new test file <code>test-geography.csv</code> with some different scenarios</li>
<li>I also fixed a few bugs and improved the region-matching logic</li>
</ul>
</li>
</ul>
<ul>
<li>I filed <a href="https://github.com/konstantinstadler/country_converter/issues/115">an issue for the “South-eastern Asia” case mismatch in country_converter</a> on GitHub</li>
<li>Meeting with Moayad to discuss OpenRXV developments
<ul>
<li>He demoed his new multiple dashboards feature and I helped him rebase those changes to master so we can test them more</li>
</ul>
</li>
</ul>
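<ul>
<li>To illustrate the kind of case mismatch mentioned above, here is a minimal sketch of a case-insensitive region check. It is not the code from csv-metadata-quality: the <code>COUNTRY_REGIONS</code> lookup and both function names are hypothetical, and the region names are assumed to follow the UN M.49 spelling.</li>
</ul>

```python
# Hypothetical country-to-region lookup, for illustration only. The region
# names are assumed to follow the UN M.49 spelling ("South-eastern Asia").
COUNTRY_REGIONS = {
    "Cambodia": "South-eastern Asia",
    "Kenya": "Eastern Africa",
}


def normalize(name):
    """Collapse whitespace and case so spelling variants compare equal."""
    return " ".join(name.split()).casefold()


def missing_regions(countries, regions):
    """Return regions implied by the countries but absent from the record."""
    present = {normalize(region) for region in regions}
    missing = []
    for country in countries:
        region = COUNTRY_REGIONS.get(country)
        if region is None or normalize(region) in present:
            continue
        if region not in missing:
            missing.append(region)
    return missing


# "South-Eastern Asia" differs from the lookup value only by case,
# so nothing is reported as missing here.
print(missing_regions(["Cambodia"], ["South-Eastern Asia"]))
print(missing_regions(["Kenya"], ["South-eastern Asia"]))
```

<ul>
<li>Comparing normalized names keeps a record from being flagged only because a library capitalizes a region name differently.</li>
</ul>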
<h2 id="2022-09-02">2022-09-02</h2>
<ul>
<li>I worked a bit more on exclusion and skipping logic in csv-metadata-quality
<ul>
<li>I also pruned and updated all the Python dependencies</li>
<li>Then I released <a href="https://github.com/ilri/csv-metadata-quality/releases/tag/v0.6.0">version 0.6.0</a> now that the excludes and region matching support is working way better</li>
</ul>
</li>
</ul>
<h2 id="2022-09-05">2022-09-05</h2>
<ul>
<li>Started a harvest on AReS last night</li>
<li>Looking over the Solr statistics from last month I see many user agents that look suspicious:
<ul>
<li>Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)</li>
<li>Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 77.0.3865.90 Safari / 537.36</li>
<li>Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0</li>
<li>Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre</li>
<li>Mozilla/5.0 (Windows NT 10.0; Win64; x64; Xbox; Xbox One) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36 Edge/44.18363.8131</li>
<li>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)</li>
<li>Mozilla/4.0 (compatible; MSIE 4.5; Windows 98;)</li>
<li>curb</li>
<li>bitdiscovery</li>
<li>omgili/0.5 +http://omgili.com</li>
<li>Mozilla/5.0 (compatible)</li>
<li>Vizzit</li>
<li>Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0</li>
<li>Mozilla/5.0 (Android; Mobile; rv:13.0) Gecko/13.0 Firefox/13.0</li>
<li>Java/17-ea</li>
<li>AdobeUxTechC4-Async/3.0.12 (win32)</li>
<li>ZaloPC-win32-24v473</li>
<li>Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com</li>
<li>Scoop.it</li>
<li>Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0</li>
<li>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</li>
<li>ows NT 10.0; WOW64; rv: 50.0) Gecko/20100101 Firefox/50.0</li>
<li>WebAPIClient</li>
<li>Mozilla/5.0 Firefox/26.0</li>
<li>Mozilla/5.0 (compatible; woorankreview/2.0; +https://www.woorank.com/)</li>
</ul>
</li>
<li>For example, some are apparently using versions of Firefox that are over ten years old, and some are obviously trying to look like valid user agents, but making typos (<code>Mozilla / 5.0</code>)</li>
<li>Tons of hosts making requests like this:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&amp;isAllowed=\x22&gt;&lt;script%20&gt;alert(String.fromCharCode(88,83,83))&lt;/script&gt; HTTP/1.1" 400 5 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
</span></span></code></pre></div><ul>
<li>I got a list of hosts making requests like that so I can purge their hits:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># zcat /var/log/nginx/<span style="color:#f92672">{</span>access,library-access,oai,rest<span style="color:#f92672">}</span>.log.<span style="color:#f92672">[</span>123<span style="color:#f92672">]</span>*.gz | grep <span style="color:#e6db74">'String.fromCharCode('</span> | awk <span style="color:#e6db74">'{print $1}'</span> | sort -u &gt; /tmp/ips.txt
</span></span></code></pre></div><ul>
<li>I purged 4,718 hits from those IPs</li>
<li>I see some new Hetzner ranges that I hadn’t blocked yet apparently?
<ul>
<li>I got a <a href="https://www.ipqualityscore.com/asn-details/AS24940/hetzner-online-gmbh">list of Hetzner’s IPs from IP Quality Score</a> then added them to the existing ones in my Ansible playbooks:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ awk <span style="color:#e6db74">'{print $1}'</span> /tmp/hetzner.txt | wc -l
</span></span><span style="display:flex;"><span>36
</span></span><span style="display:flex;"><span>$ sort -u /tmp/hetzner-combined.txt | wc -l
</span></span><span style="display:flex;"><span>49
</span></span></code></pre></div><ul>
<li>I will add this new list to nginx’s <code>bot-networks.conf</code> so they get throttled on scraping XMLUI and get classified as bots in Solr statistics</li>
<li>Then I purged hits from the following user agents:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents
</span></span><span style="display:flex;"><span>Found 374 hits from curb in statistics
</span></span><span style="display:flex;"><span>Found 350 hits from bitdiscovery in statistics
</span></span><span style="display:flex;"><span>Found 564 hits from omgili in statistics
</span></span><span style="display:flex;"><span>Found 390 hits from Vizzit in statistics
</span></span><span style="display:flex;"><span>Found 9125 hits from AdobeUxTechC4-Async in statistics
</span></span><span style="display:flex;"><span>Found 97 hits from ZaloPC-win32-24v473 in statistics
</span></span><span style="display:flex;"><span>Found 518 hits from nbertaupete95 in statistics
</span></span><span style="display:flex;"><span>Found 218 hits from Scoop.it in statistics
</span></span><span style="display:flex;"><span>Found 584 hits from WebAPIClient in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of hits from bots: 12220
</span></span></code></pre></div><ul>
<li>Then I will add these user agents to the ILRI spider override in DSpace</li>
</ul>
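<ul>
<li>As a rough sketch of how the agents purged above could be recognized programmatically, the snippet below matches user agent strings against those names and also flags the whitespace-around-slash typo seen in <code>Mozilla / 5.0</code>. The function names are hypothetical, and treating the names as case-insensitive literal substrings is an assumption of this sketch, not necessarily how the DSpace spider override or <code>check-spider-hits.sh</code> interprets them.</li>
</ul>

```python
import re

# Agent names taken from the purge output above. Escaping them makes each
# one a literal substring match, which is an assumption of this sketch.
AGENT_NAMES = [
    "curb",
    "bitdiscovery",
    "omgili",
    "Vizzit",
    "AdobeUxTechC4-Async",
    "ZaloPC-win32-24v473",
    "nbertaupete95",
    "Scoop.it",
    "WebAPIClient",
]

BOT_RE = re.compile("|".join(re.escape(name) for name in AGENT_NAMES), re.IGNORECASE)

# Whitespace directly before or after a slash, as in "Mozilla / 5.0".
SLASH_TYPO_RE = re.compile(r"\s/|/\s")


def is_listed_bot(user_agent):
    """True if the user agent contains one of the listed agent names."""
    return BOT_RE.search(user_agent) is not None


def has_slash_typo(user_agent):
    """True if the user agent has whitespace around a product/version slash."""
    return SLASH_TYPO_RE.search(user_agent) is not None


for agent in ["omgili/0.5 +http://omgili.com", "Mozilla / 5.0(Windows NT 10.0; Win64; x64)"]:
    print(agent, is_listed_bot(agent), has_slash_typo(agent))
```

<ul>
<li>The slash check is only a heuristic: it catches the malformed strings in the list above, but a deliberately well-formed fake user agent would pass it.</li>
</ul>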
<h2 id="2022-09-06">2022-09-06</h2>
<ul>
<li>I’m testing dspace-statistics-api with our DSpace 7 test server
<ul>
<li>After setting up the environment and the database, the <code>python -m dspace_statistics_api.indexer</code> command runs without issues</li>
<li>While playing with Solr I tried to search for statistics from this month using <code>time:2022-09*</code> but I got this error: “Can’t run prefix queries on numeric fields”</li>
<li>I guess that the syntax in Solr changed since 4.10…</li>
<li>This works, but is super annoying: <code>time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]</code></li>
</ul>
</li>
</ul>
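<ul>
<li>To avoid typing the range by hand, a small helper could build it for any month. This is a minimal sketch: the function name <code>month_range_query</code> is hypothetical, and it assumes the same <code>time</code> field and second-level precision as the query above.</li>
</ul>

```python
import calendar


def month_range_query(year, month, field="time"):
    """Build a Solr range query covering one calendar month in UTC."""
    # monthrange returns (weekday of the first day, number of days in month)
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year:04d}-{month:02d}-01T00:00:00Z"
    end = f"{year:04d}-{month:02d}-{last_day:02d}T23:59:59Z"
    return f"{field}:[{start} TO {end}]"


print(month_range_query(2022, 9))
```

<ul>
<li>For September 2022 this prints the same <code>time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]</code> range used above, and leap years are handled by <code>calendar.monthrange</code>.</li>
</ul>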
<!-- raw HTML omitted -->
</article>
</div> <!-- /.blog-main -->
<aside class="col-sm-3 ml-auto blog-sidebar">
<section class="sidebar-module">
<h4>Recent Posts</h4>
<ol class="list-unstyled">
<li><a href="/cgspace-notes/2022-08/">August, 2022</a></li>
<li><a href="/cgspace-notes/2022-07/">July, 2022</a></li>
<li><a href="/cgspace-notes/2022-06/">June, 2022</a></li>
<li><a href="/cgspace-notes/2022-05/">May, 2022</a></li>
<li><a href="/cgspace-notes/2022-04/">April, 2022</a></li>
</ol>
</section>
<section class="sidebar-module">
<h4>Links</h4>
<ol class="list-unstyled">
<li><a href="https://cgspace.cgiar.org">CGSpace</a></li>
<li><a href="https://dspacetest.cgiar.org">DSpace Test</a></li>
<li><a href="https://github.com/ilri/DSpace">CGSpace @ GitHub</a></li>
</ol>
</section>
</aside>
</div> <!-- /.row -->
</div> <!-- /.container -->
<footer class="blog-footer">
<p dir="auto">
Blog template created by <a href="https://twitter.com/mdo">@mdo</a>, ported to Hugo by <a href='https://twitter.com/mralanorth'>@mralanorth</a>.
</p>
<p>
<a href="#">Back to top</a>
</p>
</footer>
</body>
</html>