2024-10-03 11:51:44 +03:00
<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
< meta property = "og:title" content = "October, 2024" / >
< meta property = "og:description" content = "2024-10-03
I had an idea to get abstracts from OpenAlex
For copyright reasons they don’ t include plain abstracts, but the pyalex library can convert them on the fly
" />
< meta property = "og:type" content = "article" / >
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2024-10/" / >
< meta property = "article:published_time" content = "2024-10-03T11:01:00+03:00" / >
2024-12-04 16:27:49 +03:00
< meta property = "article:modified_time" content = "2024-11-19T10:40:23+03:00" / >
2024-10-03 11:51:44 +03:00
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "October, 2024" / >
< meta name = "twitter:description" content = "2024-10-03
I had an idea to get abstracts from OpenAlex
For copyright reasons they don’ t include plain abstracts, but the pyalex library can convert them on the fly
"/>
< meta name = "generator" content = "Hugo 0.133.1" >
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "October, 2024",
"url": "https://alanorth.github.io/cgspace-notes/2024-10/",
2024-11-19 10:40:23 +03:00
"wordCount": "620",
2024-10-03 11:51:44 +03:00
"datePublished": "2024-10-03T11:01:00+03:00",
2024-12-04 16:27:49 +03:00
"dateModified": "2024-11-19T10:40:23+03:00",
2024-10-03 11:51:44 +03:00
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2024-10/" >
< title > October, 2024 | CGSpace Notes< / title >
<!-- combined, minified CSS -->
< link href = "https://alanorth.github.io/cgspace-notes/css/style.c6ba80bc50669557645abe05f86b73cc5af84408ed20f1551a267bc19ece8228.css" rel = "stylesheet" integrity = "sha256-xrqAvFBmlVdkWr4F+GtzzFr4RAjtIPFVGiZ7wZ7Ogig=" crossorigin = "anonymous" >
<!-- minified Font Awesome for SVG icons -->
< script defer src = "https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity = "sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin = "anonymous" > < / script >
<!-- RSS 2.0 feed -->
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
< header class = "blog-header" >
< div class = "container" >
< h1 class = "blog-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" dir = "auto" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
< / div >
< / header >
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
< h2 class = "blog-post-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/2024-10/" > October, 2024< / a > < / h2 >
< p class = "blog-post-meta" >
< time datetime = "2024-10-03T11:01:00+03:00" > Thu Oct 03, 2024< / time >
in
< span class = "fas fa-folder" aria-hidden = "true" > < / span > < a href = "/categories/notes/" rel = "category tag" > Notes< / a >
< / p >
< / header >
< h2 id = "2024-10-03" > 2024-10-03< / h2 >
< ul >
< li > I had an idea to get abstracts from OpenAlex
< ul >
< li > For < a href = "https://docs.openalex.org/api-entities/works/work-object#abstract_inverted_index" > copyright reasons they don’ t include plain abstracts< / a > , but the < a href = "https://github.com/J535D165/pyalex" > pyalex< / a > library can convert them on the fly< / li >
< / ul >
< / li >
< / ul >
< ul >
< li > I filtered for journal articles that were Creative Commons and missing abstracts:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > $ csvcut -c < span style = "color:#e6db74" > ' id,dc.title[en_US],dcterms.abstract[en_US],cg.identifier.doi[en_US],dcterms.type[en_US],dcterms.language[en_US],dcterms.license[en_US]' < / span > ~/Downloads/2024-09-30-cgspace.csv | csvgrep -c < span style = "color:#e6db74" > ' dcterms.type[en_US]' < / span > -r < span style = "color:#e6db74" > ' ^Journal Article$' < / span > | csvgrep -c < span style = "color:#e6db74" > ' cg.identifier.doi[en_US]' < / span > -r < span style = "color:#e6db74" > ' ^.+$' < / span > | csvgrep -c < span style = "color:#e6db74" > ' dcterms.license[en_US]' < / span > -r < span style = "color:#e6db74" > ' ^CC-' < / span > | csvgrep -c < span style = "color:#e6db74" > ' dcterms.abstract[en_US]' < / span > -r < span style = "color:#e6db74" > ' ^$' < / span > | csvgrep -c < span style = "color:#e6db74" > ' dcterms.language[en_US]' < / span > -r < span style = "color:#e6db74" > ' ^en$' < / span > | grep -v < span style = "color:#e6db74" > " ||" < / span > | grep -v -- < span style = "color:#e6db74" > ' -ND' < / span > | grep -v -E < span style = "color:#e6db74" > ' https://doi.org/10.(2499|4160|17528)/' < / span > > /tmp/missing-abstracts.csv
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > Then wrote a script to get them from OpenAlex
< ul >
< li > After inspecting and cleaning a few dozen up in OpenRefine (removing “ Keywords:” and copyright, and HTML entities, etc) I managed to get about 440< / li >
< / ul >
< / li >
< / ul >
2024-10-08 13:46:23 +03:00
< h2 id = "2024-10-06" > 2024-10-06< / h2 >
< ul >
< li > Since I increase Solr’ s heap from 2 to 3G a few weeks ago it seems like Solr is always using 100% CPU
< ul >
< li > I don’ t understand this because it was running well before, and I only increased it in anticipation of running the dspace-statistics-api-js, though never got around to it< / li >
< li > I just realized that this may be related to the JMX monitoring, as I’ ve seen gaps in the Grafana dashboards and remember that it took surprisingly long to scrape the metrics< / li >
< li > Maybe I need to change the scrape interval< / li >
< / ul >
< / li >
< / ul >
< h2 id = "2024-10-08" > 2024-10-08< / h2 >
< ul >
< li > I checked the VictoriaMetrics vmagent dashboard and saw that there were thousands of errors scraping the < code > jvm_solr< / code > target from Solr
< ul >
< li > So it seems like I do need to change the scrape interval< / li >
< li > I will increase it from 15s (global) to 20s for that job< / li >
< li > Reading some documentation I found < a href = "https://www.robustperception.io/keep-it-simple-scrape_interval-id/" > this reference from Brian Brazil that discusses this very problem< / a > < / li >
< li > He recommends keeping a single scrape interval for all targets, but also checking the slow exporter (< code > jmx_exporter< / code > in this case) and seeing if we can limit the data we scrape< / li >
< li > To keep things simple for now I will increase the global scrape interval to 20s< / li >
< li > Long term I should limit the metrics… < / li >
< li > Oh wow, I found out that < a href = "https://solr.apache.org/guide/8_11/monitoring-solr-with-prometheus-and-grafana.html" > Solr ships with a Prometheus exporter!< / a > and even includes a Grafana dashboard< / li >
< / ul >
< / li >
< li > I’ m trying to run the Solr prometheus-exporter as a one-off systemd unit to test it:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > # cd /opt/solr-8.11.3/contrib/prometheus-exporter
< / span > < / span > < span style = "display:flex;" > < span > # systemd-run --uid< span style = "color:#f92672" > =< / span > victoriametrics --gid< span style = "color:#f92672" > =< / span > victoriametrics --working-directory< span style = "color:#f92672" > =< / span > /opt/solr-8.11.3/contrib/prometheus-exporter ./bin/solr-exporter -p < span style = "color:#ae81ff" > 9854< / span > -b http://localhost:8983/solr -f ./conf/solr-exporter-config.xml -s < span style = "color:#ae81ff" > 20< / span >
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > The default scrape interval is 60 seconds, so if we scrape it more than that the metrics will be stale
< ul >
< li > From what I’ ve seen this returns in less than one second so it should be safe to reduce the scrape interval< / li >
< / ul >
< / li >
< / ul >
2024-11-19 10:40:23 +03:00
< h2 id = "2024-10-19" > 2024-10-19< / h2 >
< ul >
< li > Heavy load on CGSpace today
< ul >
< li > There is a noted increase just before 4PM local time< / li >
< li > I extracted a list of IPs:< / li >
< / ul >
< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > # grep -E < span style = "color:#e6db74" > ' 19/Oct/2024:1[567]' < / span > /var/log/nginx/api-access.log | awk < span style = "color:#e6db74" > ' {print $1}' < / span > | sort -u > /tmp/ips.txt
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > I looked them up and found some data center IPs that were using normal user agents with hundreds of IPs, for example:
< ul >
< li > 154.47.29.168 # 212238 (CDNEXT - Datacamp Limited, GB)< / li >
< li > 91.210.64.12 # 29802 (HVC-AS, US) - HIVELOCITY, Inc.< / li >
< li > 103.221.57.120 # 132817 (DZCRD-AS-AP DZCRD Networks Ltd, BD)< / li >
< li > 109.107.150.136 # 201341 (CENTURION-INTERNET-SERVICES - trafficforce, UAB, LT) - Code200< / li >
< li > 185.210.207.1 # 209709 (CODE200-ISP1 - UAB code200, LT)< / li >
< li > 185.162.119.101 # 207223 (GLOBALCON - Global Connections Network LLC, US)< / li >
< li > 173.244.35.101 # 64286 (LOGICWEB, US) - Tesonet< / li >
< li > 139.28.160.141 # 396319 (US-INTERNET-396319, US) - OxyLabs< / li >
< li > 104.143.89.112 # 62874 (WEB2OBJECTS, US) - Web2Objects LLC< / li >
< / ul >
< / li >
< li > I added some network blocks to the nginx conf< / li >
< li > Interestingly, I see so many IPs using the same user agent today:< / li >
< / ul >
< div class = "highlight" > < pre tabindex = "0" style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;" > < code class = "language-console" data-lang = "console" > < span style = "display:flex;" > < span > # grep < span style = "color:#e6db74" > " Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.3" < / span > /var/log/nginx/api-access.log | awk < span style = "color:#e6db74" > ' {print $1}' < / span > | sort -u | wc -l
< / span > < / span > < span style = "display:flex;" > < span > 767
< / span > < / span > < / code > < / pre > < / div > < ul >
< li > For reference, the current Chrome version is 129 or so…
< ul >
< li > This is definitely worth looking into because it seems like one massive botnet< / li >
< / ul >
< / li >
< / ul >
2024-10-03 11:51:44 +03:00
<!-- raw HTML omitted -->
< / article >
< / div > <!-- /.blog - main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
2025-01-03 12:37:39 +03:00
< li > < a href = "/cgspace-notes/2025-01/" > January, 2025< / a > < / li >
2024-12-04 16:27:49 +03:00
< li > < a href = "/cgspace-notes/2024-12/" > December, 2024< / a > < / li >
2024-11-19 10:40:23 +03:00
< li > < a href = "/cgspace-notes/2024-11/" > November, 2024< / a > < / li >
2024-10-03 11:51:44 +03:00
< li > < a href = "/cgspace-notes/2024-10/" > October, 2024< / a > < / li >
< li > < a href = "/cgspace-notes/2024-09/" > September, 2024< / a > < / li >
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >