<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
< meta property = "og:title" content = "September, 2022" / >
< meta property = "og:description" content = "2022-09-01
A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet
I tested an item submission on DSpace Test with the Cocoon org.apache.cocoon.uploads.autosave=false change
The submission works as expected
Start debugging some region-related issues with csv-metadata-quality
I created a new test file test-geography.csv with some different scenarios
I also fixed a few bugs and improved the region-matching logic
" />
< meta property = "og:type" content = "article" / >
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2022-01/" / >
< meta property = "article:published_time" content = "2022-01-01T09:41:36+03:00" / >
< meta property = "article:modified_time" content = "2022-09-06T17:48:46+03:00" / >
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "September, 2022" / >
< meta name = "twitter:description" content = "2022-09-01
A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet
I tested an item submission on DSpace Test with the Cocoon org.apache.cocoon.uploads.autosave=false change
The submission works as expected
Start debugging some region-related issues with csv-metadata-quality
I created a new test file test-geography.csv with some different scenarios
I also fixed a few bugs and improved the region-matching logic
"/>
< meta name = "generator" content = "Hugo 0.102.3" / >
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "September, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-01/",
"wordCount": "740",
"datePublished": "2022-01-01T09:41:36+03:00",
"dateModified": "2022-09-06T17:48:46+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2022-01/" >
< title > September, 2022 | CGSpace Notes< / title >
<!-- combined, minified CSS -->
< link href = "https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel = "stylesheet" integrity = "sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin = "anonymous" >
<!-- minified Font Awesome for SVG icons -->
< script defer src = "https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity = "sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin = "anonymous" > < / script >
<!-- RSS 2.0 feed -->
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
< header class = "blog-header" >
< div class = "container" >
< h1 class = "blog-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" dir = "auto" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
< / div >
< / header >
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
< h2 class = "blog-post-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/2022-01/" > September, 2022< / a > < / h2 >
< p class = "blog-post-meta" >
< time datetime = "2022-01-01T09:41:36+03:00" > Sat Jan 01, 2022< / time >
in
< span class = "fas fa-folder" aria-hidden = "true" > < / span > < a href = "/categories/notes/" rel = "category tag" > Notes< / a >
< / p >
< / header >
< h2 id = "2022-09-01" > 2022-09-01< / h2 >
< ul >
< li > A bit of work on the “Mapping CG Core–CGSpace–MEL–MARLO Types” spreadsheet< / li >
< li > I tested an item submission on DSpace Test with the Cocoon < code > org.apache.cocoon.uploads.autosave=false< / code > change
< ul >
< li > The submission works as expected< / li >
< / ul >
< / li >
< li > Start debugging some region-related issues with csv-metadata-quality
< ul >
< li > I created a new test file < code > test-geography.csv< / code > with some different scenarios< / li >
< li > I also fixed a few bugs and improved the region-matching logic< / li >
< / ul >
< / li >
< / ul >
< ul >
< li > I filed < a href = "https://github.com/konstantinstadler/country_converter/issues/115" > an issue for the “South-eastern Asia” case mismatch in country_converter< / a > on GitHub< / li >
< li > Meeting with Moayad to discuss OpenRXV developments
< ul >
< li > He demoed his new multiple dashboards feature and I helped him rebase those changes to master so we can test them more< / li >
< / ul >
< / li >
< / ul >
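< ul >
< li > A minimal sketch of case-insensitive region matching, in the spirit of the case mismatch above. This is not the actual csv-metadata-quality or country_converter code: the function name < code > match_region< / code > and the short < code > KNOWN_REGIONS< / code > list are hypothetical and only there for illustration.< / li >
< / ul >

```python
# Hypothetical subset of region names, for illustration only.
KNOWN_REGIONS = ["Eastern Africa", "South-eastern Asia", "Western Europe"]


def match_region(value, known_regions=KNOWN_REGIONS):
    """Return the canonical spelling of a region name, or None if unknown.

    The comparison ignores case and surrounding whitespace, so
    "South-Eastern Asia" still resolves to "South-eastern Asia".
    """
    # Map a case-folded form of each known name to its canonical spelling.
    lookup = {name.casefold(): name for name in known_regions}
    return lookup.get(value.strip().casefold())


if __name__ == "__main__":
    print(match_region("South-Eastern Asia"))
    print(match_region("Atlantis"))
```

< ul >
< li > Returning the canonical spelling rather than a boolean means a checker could both validate a value and suggest the corrected form.< / li >
< / ul >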
< h2 id = "2022-09-02" > 2022-09-02< / h2 >
< ul >
< li > I worked a bit more on exclusion and skipping logic in csv-metadata-quality
< ul >
< li > I also pruned and updated all the Python dependencies< / li >
< li > Then I released < a href = "https://github.com/ilri/csv-metadata-quality/releases/tag/v0.6.0" > version 0.6.0< / a > now that the excludes and region matching support is working way better< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2022-09-05" > 2022-09-05< / h2 >
< ul >
< li > Started a harvest on AReS last night< / li >
< li > Looking over the Solr statistics from last month I see many user agents that look suspicious:
< ul >
< li > Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.2; WOW64; Trident/7.0; .NET4.0E; .NET4.0C)< / li >
< li > Mozilla / 5.0(Windows NT 10.0; Win64; x64) AppleWebKit / 537.36(KHTML, like Gecko) Chrome / 77.0.3865.90 Safari / 537.36< / li >
< li > Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0< / li >
< li > Mozilla/5.0 (X11; Linux i686; rv:2.0b12pre) Gecko/20110204 Firefox/4.0b12pre< / li >
< li > Mozilla/5.0 (Windows NT 10.0; Win64; x64; Xbox; Xbox One) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36 Edge/44.18363.8131< / li >
< li > Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)< / li >
< li > Mozilla/4.0 (compatible; MSIE 4.5; Windows 98;)< / li >
< li > curb< / li >
< li > bitdiscovery< / li >
< li > omgili/0.5 +http://omgili.com< / li >
< li > Mozilla/5.0 (compatible)< / li >
< li > Vizzit< / li >
< li > Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0< / li >
< li > Mozilla/5.0 (Android; Mobile; rv:13.0) Gecko/13.0 Firefox/13.0< / li >
< li > Java/17-ea< / li >
< li > AdobeUxTechC4-Async/3.0.12 (win32)< / li >
< li > ZaloPC-win32-24v473< / li >
< li > Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com< / li >
< li > Scoop.it< / li >
< li > Mozilla/5.0 (Windows NT 6.1; rv:27.0) Gecko/20100101 Firefox/27.0< / li >
< li > Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)< / li >
< li > ows NT 10.0; WOW64; rv: 50.0) Gecko/20100101 Firefox/50.0< / li >
< li > WebAPIClient< / li >
< li > Mozilla/5.0 Firefox/26.0< / li >
< li > Mozilla/5.0 (compatible; woorankreview/2.0; +https://www.woorank.com/)< / li >
< / ul >
< / li >
< li > For example, some are apparently using versions of Firefox that are over ten years old, and some are obviously trying to look like valid user agents, but making typos (< code > Mozilla / 5.0< / code > )< / li >
< li > Tons of hosts making requests like this:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > GET /bitstream/handle/10568/109408/Milk%20testing%20lab%20protocol.pdf?sequence=1&amp;isAllowed=\x22&gt;&lt;script%20&gt;alert(String.fromCharCode(88,83,83))&lt;/script&gt; HTTP/1.1" 400 5 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Rv:50.0) Gecko/20100101 Firefox/50.0
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I got a list of hosts making requests like that so I can purge their hits:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > # zcat /var/log/nginx/< span style = "color:#f92672" >{< / span >access,library-access,oai,rest< span style = "color:#f92672" >}< / span >.log.< span style = "color:#f92672" >[< / span >123< span style = "color:#f92672" >]< / span >*.gz | grep < span style = "color:#e6db74" >'String.fromCharCode('< / span > | awk < span style = "color:#e6db74" >'{print $1}'< / span > | sort -u &gt; /tmp/ips.txt
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I purged 4,718 hits from those IPs< / li >
< li > I see some new Hetzner ranges that I hadn’t blocked yet apparently?
< ul >
< li > I got a < a href = "https://www.ipqualityscore.com/asn-details/AS24940/hetzner-online-gmbh" > list of Hetzner’s IPs from IP Quality Score< / a > then added them to the existing ones in my Ansible playbooks:< / li >
< / ul >
< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ awk < span style = "color:#e6db74" >'{print $1}'< / span > /tmp/hetzner.txt | wc -l
< / span > < / span > < span style = "display:flex;" > < span > 36
< / span > < / span > < span style = "display:flex;" > < span > $ sort -u /tmp/hetzner-combined.txt | wc -l
< / span > < / span > < span style = "display:flex;" > < span > 49
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I will add this new list to nginx’s < code > bot-networks.conf< / code > so they get throttled on scraping XMLUI and get classified as bots in Solr statistics< / li >
< li > Then I purged hits from the following user agents:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ ./ilri/check-spider-hits.sh -f /tmp/agents
< / span > < / span > < span style = "display:flex;" > < span > Found 374 hits from curb in statistics
< / span > < / span > < span style = "display:flex;" > < span > Found 350 hits from bitdiscovery in statistics
< / span > < / span > < span style = "display:flex;" > < span > Found 564 hits from omgili in statistics
< / span > < / span > < span style = "display:flex;" > < span > Found 390 hits from Vizzit in statistics
< / span > < / span > < span style = "display:flex;" > < span > Found 9125 hits from AdobeUxTechC4-Async in statistics
< / span > < / span > < span style = "display:flex;" > < span > Found 97 hits from ZaloPC-win32-24v473 in statistics
< / span > < / span > < span style = "display:flex;" > < span > Found 518 hits from nbertaupete95 in statistics
< / span > < / span > < span style = "display:flex;" > < span > Found 218 hits from Scoop.it in statistics
< / span > < / span > < span style = "display:flex;" > < span > Found 584 hits from WebAPIClient in statistics
< / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" >
< / span > < / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" > < / span > Total number of hits from bots: 12220
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > Then I will add these user agents to the ILRI spider override in DSpace< / li >
< / ul >
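< ul >
< li > For reference, roughly the same host extraction as the zcat, grep, awk and sort pipeline above can be sketched in Python with only the standard library. The function name < code > hosts_matching< / code > is an assumption of mine, and the sketch assumes the client IP is the first whitespace-separated field of each log line, as in the awk command.< / li >
< / ul >

```python
import gzip


def hosts_matching(log_paths, needle="String.fromCharCode("):
    """Return the sorted, de-duplicated first field of matching log lines.

    log_paths is an iterable of gzip-compressed log files. A line matches
    when it contains the literal string in needle.
    """
    hosts = set()
    for path in log_paths:
        # errors="replace" keeps going if a request contains invalid bytes.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
            for line in log:
                if needle in line:
                    fields = line.split()
                    if fields:
                        hosts.add(fields[0])
    return sorted(hosts)


if __name__ == "__main__":
    import glob

    # Hypothetical path pattern; adjust to the log files of interest.
    for host in hosts_matching(glob.glob("/var/log/nginx/access.log.*.gz")):
        print(host)
```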
< h2 id = "2022-09-06" > 2022-09-06< / h2 >
< ul >
< li > I’m testing dspace-statistics-api with our DSpace 7 test server
< ul >
< li > After setting up the env and the database the < code > python -m dspace_statistics_api.indexer< / code > runs without issues< / li >
< li > While playing with Solr I tried to search for statistics from this month using < code > time:2022-09*< / code > but I get this error: “Can’t run prefix queries on numeric fields”< / li >
< li > I guess that the syntax in Solr changed since 4.10… < / li >
< li > This works, but is super annoying: < code > time:[2022-09-01T00:00:00Z TO 2022-09-30T23:59:59Z]< / code > < / li >
< / ul >
< / li >
< / ul >
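< ul >
< li > To avoid typing the explicit range by hand, the query string can be generated. This is a minimal sketch using Python’s standard library; the helper name < code > solr_month_range< / code > is hypothetical, and it only builds the same < code > time:[… TO …]< / code > string shown above.< / li >
< / ul >

```python
import calendar


def solr_month_range(year, month, field="time"):
    """Build a Solr range query covering one calendar month in UTC."""
    # monthrange() returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year:04d}-{month:02d}-01T00:00:00Z"
    end = f"{year:04d}-{month:02d}-{last_day:02d}T23:59:59Z"
    return f"{field}:[{start} TO {end}]"


if __name__ == "__main__":
    print(solr_month_range(2022, 9))
```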
< h2 id = "2022-09-07" > 2022-09-07< / h2 >
< ul >
< li > I tested the controlled-vocabulary changes on DSpace 6 and they work fine
< ul >
< li > Last week I found that DSpace 7 is more strict with controlled vocabularies and requires IDs for all node values< / li >
< li > This is a pain because it means I have to re-do the IDs in each file every time I update them< / li >
< li > If I add < code > id="0000"< / code > to each, then I can use < a href = "https://vim.fandom.com/wiki/Making_a_list_of_numbers#Substitute_with_ascending_numbers" > this vim expression< / a > < code > let i=0001 | g/0000/s//\=i/ | let i=i+1< / code > to replace the numbers with increments starting from 1< / li >
< / ul >
< / li >
< li > Meeting with Marie Angelique, Abenet, Sara, and Margarita to continue the discussion about Types from last week
< ul >
< li > We made progress with concrete actions and will continue next week< / li >
< / ul >
< / li >
< / ul >
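< ul >
< li > The same renumbering could be scripted outside vim. A minimal sketch, assuming every node carries the placeholder < code > id="0000"< / code > as described above; the function name < code > number_placeholder_ids< / code > is hypothetical. Unlike the vim expression, which substitutes the first match on each line, this replaces every placeholder occurrence in the text.< / li >
< / ul >

```python
import re


def number_placeholder_ids(text, placeholder="0000", start=1):
    """Replace each id="<placeholder>" with an ascending integer ID."""
    counter = start

    def replace(match):
        # Hand out the current number, then advance for the next match.
        nonlocal counter
        value = counter
        counter += 1
        return f'id="{value}"'

    pattern = rf'id="{re.escape(placeholder)}"'
    return re.sub(pattern, replace, text)


if __name__ == "__main__":
    sample = '<node id="0000" label="A"/>\n<node id="0000" label="B"/>'
    print(number_placeholder_ids(sample))
```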
< / article >
< / div > <!-- /.blog-main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "/cgspace-notes/2022-08/" > August, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-07/" > July, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-06/" > June, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-05/" > May, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-04/" > April, 2022< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >