<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
< meta property = "og:title" content = "November, 2023" / >
< meta property = "og:description" content = "2023-11-01
Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
I improved the filtering and wrote some Python using pandas to merge my sources more reliably
2023-11-02
Export CGSpace to check missing Initiative collection mappings
Start a harvest on AReS
" />
< meta property = "og:type" content = "article" / >
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2023-11/" / >
< meta property = "article:published_time" content = "2023-11-02T12:59:36+03:00" / >
< meta property = "article:modified_time" content = "2023-11-23T16:15:13+03:00" / >
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "November, 2023" / >
< meta name = "twitter:description" content = "2023-11-01
Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
I improved the filtering and wrote some Python using pandas to merge my sources more reliably
2023-11-02
Export CGSpace to check missing Initiative collection mappings
Start a harvest on AReS
"/>
< meta name = "generator" content = "Hugo 0.120.4" >
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "November, 2023",
"url": "https://alanorth.github.io/cgspace-notes/2023-11/",
"wordCount": "1318",
"datePublished": "2023-11-02T12:59:36+03:00",
"dateModified": "2023-11-23T16:15:13+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2023-11/" >
< title > November, 2023 | CGSpace Notes< / title >
<!-- combined, minified CSS -->
< link href = "https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel = "stylesheet" integrity = "sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin = "anonymous" >
<!-- minified Font Awesome for SVG icons -->
< script defer src = "https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity = "sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin = "anonymous" > < / script >
<!-- RSS 2.0 feed -->
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
< header class = "blog-header" >
< div class = "container" >
< h1 class = "blog-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" dir = "auto" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
< / div >
< / header >
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
< h2 class = "blog-post-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/2023-11/" > November, 2023< / a > < / h2 >
< p class = "blog-post-meta" >
< time datetime = "2023-11-02T12:59:36+03:00" > Thu Nov 02, 2023< / time >
in
< span class = "fas fa-folder" aria-hidden = "true" > < / span > < a href = "/categories/notes/" rel = "category tag" > Notes< / a >
< / p >
< / header >
< h2 id = "2023-11-01" > 2023-11-01< / h2 >
< ul >
< li > Work a bit on the ETL pipeline for the CGIAR Climate Change Synthesis
< ul >
< li > I improved the filtering and wrote some Python using pandas to merge my sources more reliably< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2023-11-02" > 2023-11-02< / h2 >
< ul >
< li > Export CGSpace to check missing Initiative collection mappings< / li >
< li > Start a harvest on AReS< / li >
< / ul >
< ul >
< li > IFPRI contacted us about importing their Slideshare presentations to CGSpace
< ul >
< li > There are ~1,700 of them, dating back as early as 2008< / li >
< li > I did a quick cleanup of the metadata export from Slideshare (including tagging with some AGROVOC in OpenRefine) and uploaded to DSpace Test< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2023-11-03" > 2023-11-03< / h2 >
< ul >
< li > A little bit of work on the CGIAR Climate Change Synthesis< / li >
< li > Discuss some CGSpace migration plans with Leigh from IFPRI
< ul >
< li > For their Slideshare content we agreed:
< ul >
< li > Exclude private< / li >
< li > Exclude deleted< / li >
< li > Exclude non-presentation types< / li >
< li > Exclude duplicates within the collection for now until we can sort them out< / li >
< / ul >
< / li >
< li > That leaves about 1,500 items out of the 1,700< / li >
< / ul >
< / li >
< li > I did a duplicate check against CGSpace and found 44 items with 1.0 similarity so I removed those< / li >
< / ul >
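<p>A similarity of 1.0 means two titles are identical once trivial differences are normalized away. As a rough illustration only, here is a minimal Python sketch that uses <code>difflib.SequenceMatcher</code> from the standard library as a stand-in. The duplicate checker actually used above may compute similarity differently, and the helper names and sample titles are hypothetical.</p>

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a ratio between 0.0 and 1.0, where 1.0 means identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_exact_duplicates(candidates, existing, threshold=1.0):
    """Return candidate titles whose similarity to any existing title meets the threshold."""
    return [
        c for c in candidates
        if any(similarity(c, e) >= threshold for e in existing)
    ]


# Hypothetical titles, for illustration only
existing_titles = ["Climate change and food security"]
new_titles = ["Climate Change  and Food Security", "Gender and nutrition"]
duplicates = find_exact_duplicates(new_titles, existing_titles)
print(duplicates)
```

<p>Lowering <code>threshold</code> would surface near-duplicates as well, which then need a manual pass like the one described for the potential duplicates.</p>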
< h2 id = "2023-11-04" > 2023-11-04< / h2 >
< ul >
< li > Export CGSpace to check for missing Initiative collection mappings< / li >
< li > I ran through the list of potential duplicates on the IFPRI Slideshare presentations< / li >
< / ul >
< h2 id = "2023-11-05" > 2023-11-05< / h2 >
< ul >
< li > Work with Salem to migrate AReS to the new version< / li >
< / ul >
< h2 id = "2023-11-07" > 2023-11-07< / h2 >
< ul >
< li > DSpace 7 Test went down and there is very high load on the server
< ul >
< li > I saw very high load from Java but didn’t have time to check exactly what was wrong, so I just rebooted the host< / li >
< li > A few hours after restarting, the system went down again, with very high load from Java again< / li >
< li > I see lots of messages like this in the Tomcat log:< / li >
< / ul >
< / li >
< / ul >
< pre tabindex = "0" > < code > tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms
tomcat9[732]: [9955.662s][info ][gc] GC(6290) Concurrent Mark Cycle 677.558ms
tomcat9[732]: [9955.666s][info ][gc] GC(6292) To-space exhausted
< / code > < / pre > < ul >
< li > I see some messages in < code > dspace.log< / code > about heap space:< / li >
< / ul >
< pre tabindex = "0" > < code > Caused by: java.lang.OutOfMemoryError: Java heap space
< / code > < / pre > < ul >
< li > I will increase Tomcat’s heap from 4096m to 5120m
< ul >
< li > A few hours later it happened again, so I increased the heap from 5120m to 6144m< / li >
< li > Not sure what’s going on today…< / li >
< / ul >
< / li >
< li > I tested moving the CGIAR Fund Council community to the CGIAR historic archive on DSpace Test:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ dspace community-filiator -r -p 10568/83389 -c 10947/2516
< / span > < / span > < span style = "display:flex;" > < span > $ dspace community-filiator -s -p 10947/2515 -c 10947/2516
< / span > < / span > < span style = "display:flex;" > < span > $ dspace index-discovery -r 10947/2516
< / span > < / span > < span style = "display:flex;" > < span > $ dspace index-discovery -r 10947/2515
< / span > < / span > < span style = "display:flex;" > < span > $ dspace index-discovery -r 10568/83389
< / span > < / span > < span style = "display:flex;" > < span > $ dspace index-discovery
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I think this is the minimum we can do to avoid a full Discovery reindex, which is very expensive< / li >
< li > I helped Maria resize some massive PDFs for upload to CGSpace using GhostScript prepress mode, as I had done before in < a href = "/cgspace-notes/2023-09/" > September, 2023< / a > < / li >
< / ul >
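<p>The <code>Pause Full</code> line from the Tomcat log earlier in this entry already hints at the cause: the heap held 4085M before the full collection and 4080M after it, out of 4096M, so the JVM reclaimed almost nothing. A small Python sketch for pulling those numbers out of such lines, assuming the log format shown above (the function name and regular expression are my own and not part of any tool):</p>

```python
import re

# Matches e.g. "Pause Full (G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms"
FULL_GC = re.compile(
    r"Pause Full \([^)]*\)\s+(\d+)M->\s*(\d+)M\((\d+)M\)\s+([\d.]+)ms"
)


def parse_full_gc(line: str):
    """Return heap usage around a full GC, or None if the line is not a full GC pause."""
    match = FULL_GC.search(line)
    if match is None:
        return None
    before, after, total = (int(match.group(i)) for i in (1, 2, 3))
    return {
        "before_mb": before,
        "after_mb": after,
        "total_mb": total,
        "reclaimed_mb": before - after,
        "used_after_pct": round(100 * after / total, 1),
        "pause_ms": float(match.group(4)),
    }


line = (
    "tomcat9[732]: [9955.662s][info ][gc] GC(6291) Pause Full "
    "(G1 Compaction Pause) 4085M->4080M(4096M) 677.251ms"
)
stats = parse_full_gc(line)
print(stats)
```

<p>A heap that is still nearly full after a full collection suggests that the live data does not fit, or that something is leaking, which would be consistent with the <code>OutOfMemoryError</code> entries in <code>dspace.log</code>.</p>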
< h2 id = "2023-11-08" > 2023-11-08< / h2 >
< ul >
< li > DSpace 7 Test has very high load again and I see more Java heap space errors in the log< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > # grep -c < span style = "color:#e6db74" > 'Caused by: java.lang.OutOfMemoryError: Java heap space'< / span > /home/dspace7/log/dspace.log-2023-11-07
< / span > < / span > < span style = "display:flex;" > < span > 35
< / span > < / span > < span style = "display:flex;" > < span > # grep -c < span style = "color:#e6db74" > 'Caused by: java.lang.OutOfMemoryError: Java heap space'< / span > /home/dspace7/log/dspace.log
< / span > < / span > < span style = "display:flex;" > < span > 7
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I don’t know what is happening… I will increase the heap size again, from 6144m to 7168m…< / li >
< li > I did some work on the value mappings in AReS
< ul >
< li > I wanted to test the import/export feature, and found that I could get a JSON and convert it to CSV for manipulation in OpenRefine< / li >
< li > Importing creates duplicate records, so I deleted and re-created the index in Elasticsearch first< / li >
< li > Then I started a new harvest on AReS to make sure the mappings are applied< / li >
< / ul >
< / li >
< / ul >
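<p>The JSON to CSV step can be done with the Python standard library alone. This is a minimal sketch that assumes the export is a JSON array of flat objects; the <code>find</code> and <code>replace</code> keys and the sample values are hypothetical, since the real AReS export format is not shown here.</p>

```python
import csv
import io
import json


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects to CSV text with a header row."""
    rows = json.loads(json_text)
    # Collect column names in first-seen order so no key is dropped
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return output.getvalue()


# Hypothetical export; the real AReS value mapping export may use other keys
sample = '[{"find": "Kenia", "replace": "Kenya"}, {"find": "ILRI.", "replace": "ILRI"}]'
csv_text = json_to_csv(sample)
print(csv_text)
```

<p>The resulting CSV can be opened directly in OpenRefine; nested JSON values would need flattening first.</p>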
< h2 id = "2023-11-09" > 2023-11-09< / h2 >
< ul >
< li > Ryan asked me for help uploading a large PDF to CGSpace
< ul >
< li > I tried my usual GhostScript prepress invocation and found that the size decreased significantly, but some minor artifacts appeared in the images< / li >
< li > Interestingly, the < a href = "https://ghostscript.com/docs/9.54.0/VectorDevices.htm" > GhostScript docs< / a > mention that < code > prepress< / code > doesn’t give the best results:< / li >
< / ul >
< / li >
< / ul >
< blockquote >
< p > Please be aware that the /prepress setting does not indicate the highest quality conversion. Using any of these presets will involve altering the input, and as such may result in a PDF of poorer quality (compared to the input) than simply using the defaults. The ‘best’ quality (where best means closest to the original input) is obtained by not setting this parameter at all (or by using /default).< / p >
< / blockquote >
< ul >
< li > Also, I found < a href = "https://stackoverflow.com/questions/40849325/ghostscript-pdfwrite-specify-jpeg-quality" > a question on StackOverflow discussing some further techniques for PDFs with images< / a > :< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ gs -sOutputFile< span style = "color:#f92672" > =< / span > 137166-default-dct.pdf -sDEVICE< span style = "color:#f92672" > =< / span > pdfwrite -dCompatibilityLevel< span style = "color:#f92672" > =< / span > 1.4 -dNOPAUSE -dBATCH -dPDFSETTINGS< span style = "color:#f92672" > =< / span > /default -c < span style = "color:#e6db74" > "<< /ColorACSImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.08 /Blend 1 >> /ColorImageDownsampleType /Bicubic /ColorConversionStrategy /LeaveColorUnchanged >> setdistillerparams"< / span > -f 137166.pdf
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > This looks much better, and is still much smaller than the original
< ul >
< li > Also, I used < code > pdfimages< / code > to extract all the images from the original and the one above and found:< / li >
< / ul >
< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ du -sh images-*
< / span > < / span > < span style = "display:flex;" > < span > 886M images-default-dct
< / span > < / span > < span style = "display:flex;" > < span > 1012M images-original
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > And from < a href = "https://www.wecompress.com/en/analyze" > WeCompress’s analysis< / a > I see that the images are 85% of the size of the PDF< / li >
< / ul >
< h2 id = "2023-11-10" > 2023-11-10< / h2 >
< ul >
< li > I finished checking the IFPRI Slideshare records and added some tagging of countries, regions, and CRPs and then uploaded them to CGSpace< / li >
< / ul >
< h2 id = "2023-11-11" > 2023-11-11< / h2 >
< ul >
< li > Salem fixed a bug on OpenRXV that was splitting country values by “,” before matching them with ISO countries< / li >
< li > I exported CGSpace to check for missing Initiative collection mappings< / li >
< li > Start a fresh harvest on AReS< / li >
< / ul >
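<p>To see why splitting on commas first is a problem, consider country names that contain a comma themselves. The Python sketch below is only an illustration of the failure mode: the function names and the tiny country set are hypothetical, and OpenRXV's actual code is not shown here.</p>

```python
# Illustrative subset of country names in the comma-inverted style; not a complete ISO list
KNOWN_COUNTRIES = {"Kenya", "Ethiopia", "Tanzania, United Republic of"}


def match_after_splitting(value: str):
    """Buggy approach: split on commas first, then look each piece up."""
    pieces = [piece.strip() for piece in value.split(",")]
    return [piece for piece in pieces if piece in KNOWN_COUNTRIES]


def match_whole_value(value: str):
    """Safer approach: look the whole value up before any splitting."""
    value = value.strip()
    return [value] if value in KNOWN_COUNTRIES else []


name = "Tanzania, United Republic of"
print(match_after_splitting(name))  # neither half is a known country name
print(match_whole_value(name))
```

<p>With the split-first approach the value is broken into two pieces that match nothing, so the country silently disappears from the results.</p>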
< h2 id = "2023-11-16" > 2023-11-16< / h2 >
< ul >
< li > Discuss mapping ICARDA outputs from Initiatives to ICARDA collections on CGSpace
< ul >
< li > I added MEL’s CGSpace user to the administrator group of a handful of collections< / li >
< li > I also did a batch mapping of 274 existing Initiative outputs from ICARDA to the relevant collections< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2023-11-18" > 2023-11-18< / h2 >
< ul >
< li > Export CGSpace to check for missing Initiative collection mappings< / li >
< li > Start a harvest on AReS< / li >
< / ul >
< h2 id = "2023-11-22" > 2023-11-22< / h2 >
< ul >
< li > I was checking out the < a href = "https://github.com/DSpace/RestContract/blob/main/statistics-reports.md" > DSpace 7 statistics< / a > again and found that we have total visits and total downloads for each DSpace object, for example < a href = "https://dspace7test.ilri.org/items/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748" > this item< / a > :
< ul >
< li > TotalVisits: < a href = "https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalVisits" > https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalVisits< / a > < / li >
< li > TotalDownloads: < a href = "https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalDownloads" > https://dspace7test.ilri.org/server/api/statistics/usagereports/3f1b9605-f5ff-4bbb-8c89-d6fe4157f748_TotalDownloads< / a > < / li >
< / ul >
< / li >
< li > And the numbers match those in my dspace-statistics-api < em > exactly< / em > !< / li >
< li > This can be useful to get an individual DSpace object’s stats, but there is no way to iterate over all objects, such as all items…
< ul >
< li > We can look at using this to draw stats on the community, collection, and item pages< / li >
< / ul >
< / li >
< / ul >
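<p>The usage report URLs above follow a simple pattern: the object's UUID, an underscore, and the report name. Here is a minimal Python sketch for building those URLs and reading a total from a response. The URL pattern comes from the links above; the response shape (a <code>points</code> list whose entries carry <code>values.views</code>) is an assumption on my part and should be checked against the REST contract, and the function names are hypothetical.</p>

```python
import json
from urllib.request import urlopen

BASE_URL = "https://dspace7test.ilri.org/server/api"


def usage_report_url(uuid: str, report_type: str, base_url: str = BASE_URL) -> str:
    """Build the usage report URL for one DSpace object, as in the links above."""
    return f"{base_url}/statistics/usagereports/{uuid}_{report_type}"


def total_views(report: dict) -> int:
    """Sum the 'views' values of a report (ASSUMED shape: points[].values.views)."""
    return sum(point["values"]["views"] for point in report.get("points", []))


def fetch_report(uuid: str, report_type: str) -> dict:
    """Fetch and decode one usage report (needs network access to the server)."""
    with urlopen(usage_report_url(uuid, report_type)) as response:
        return json.load(response)


url = usage_report_url("3f1b9605-f5ff-4bbb-8c89-d6fe4157f748", "TotalVisits")
print(url)
```

<p>Because each request covers a single object, gathering stats for every item would still mean one request per UUID, which is the limitation noted above.</p>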
< h2 id = "2023-11-23" > 2023-11-23< / h2 >
< ul >
< li > Brian King was asking me how many PDFs we had in CGSpace so I got a rough estimate using this SQL query:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > localhost/dspace7= ☘ SELECT COUNT(uuid) FROM bitstream WHERE bitstream_format_id=(SELECT bitstream_format_id FROM bitstreamformatregistry WHERE mimetype='application/pdf');
< / span > < / span > < span style = "display:flex;" > < span > count
< / span > < / span > < span style = "display:flex;" > < span > ───────
< / span > < / span > < span style = "display:flex;" > < span > 47818
< / span > < / span > < span style = "display:flex;" > < span > (1 row)
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > It’s been some time since I looked at our Solr statistics to find new bots
< ul >
< li > I found a few new ones that I < a href = "https://github.com/atmire/COUNTER-Robots/pull/60" > submitted to COUNTER-Robots< / a > and added to our local bot list:
< ul >
< li > GuzzleHttp/7< / li >
< li > < a href = "mailto:Owler@ows.eu" > Owler@ows.eu< / a > /1< / li >
< li > newspaperjs< / li >
< / ul >
< / li >
< / ul >
< / li >
< li > I ran my old < code > check-spider-hits.sh< / code > script with a list of bots from our local overrides to purge hits from Solr:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
< / span > < / span > < span style = "display:flex;" > < span > Purging 30 hits from ubermetrics in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 59 hits from curb in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 36 hits from bitdiscovery in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 87 hits from omgili in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 47 hits from Vizzit in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 109 hits from Java\/17-ea in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 40 hits from AdobeUxTechC4-Async in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 21 hits from ZaloPC-win32-24v473 in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 21 hits from nbertaupete95 in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 52 hits from Scoop\.it in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 16 hits from WebAPIClient in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 241 hits from RStudio in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1255 hits from ^MEL in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 47850 hits from GuzzleHttp in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 8714 hits from Owler in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1083 hits from newspaperjs in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 369 hits from ^Chrome$ in statistics
< / span > < / span > < span style = "display:flex;" > < span > Purging 1474 hits from curl in statistics
< / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" >
< / span > < / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" > < / span > Total number of bot hits purged: 61504
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I also noticed 35,000 requests over the past few years from lowercase user agents, which is < a href = "https://developers.whatismybrowser.com/api/features/user-agent-checks/weird/#all_lower_case" > definitely weird< / a > , for example:
< ul >
< li > < code > mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/89.0.4389.90 safari/537.36< / code > < / li >
< li > < code > mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/90.0.4430.93 safari/537.36< / code > < / li >
< / ul >
< / li >
< li > I’m gonna add those to our overrides and purge them:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
< / span > < / span > < span style = "display:flex;" > < span > Purging 35816 hits from ^mozilla in statistics
< / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" >
< / span > < / span > < / span > < span style = "display:flex;" > < span > < span style = "color:#960050;background-color:#1e0010" > < / span > Total number of bot hits purged: 35816
< / span > < / span > < / code > < / pre > < / div > < h2 id = "2023-11-30" > 2023-11-30< / h2 >
< ul >
< li > Minor updates to our OAI MODS crosswalk
< ul >
< li > Stefano found a minor markup issue with our alternative titles (< code > &lt;titleInfo&gt; < / code > tag)< / li >
< / ul >
< / li >
< li > Very high load on CGSpace since just after lunch
< ul >
< li > I killed some locks that had been stuck for a few hours< / li >
< / ul >
< / li >
< / ul >
< / article >
< / div > <!-- /.blog-main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "/cgspace-notes/2023-12/" > December, 2023< / a > < / li >
< li > < a href = "/cgspace-notes/2023-11/" > November, 2023< / a > < / li >
< li > < a href = "/cgspace-notes/2023-10/" > October, 2023< / a > < / li >
< li > < a href = "/cgspace-notes/2023-09/" > September, 2023< / a > < / li >
< li > < a href = "/cgspace-notes/2023-08/" > August, 2023< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >