2018-02-11 17:28:23 +01:00
<!DOCTYPE html>
2019-10-11 10:19:42 +02:00
< html lang = "en" >
2018-02-11 17:28:23 +01:00
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
2020-12-06 15:53:29 +01:00
2018-02-11 17:28:23 +01:00
< meta property = "og:title" content = "November, 2017" / >
< meta property = "og:description" content = "2017-11-01
The CORE developers responded to say they are looking into their bot not respecting our robots.txt
2017-11-02
Today there have been no hits by CORE and no alerts from Linode (coincidence?)
2022-03-04 13:30:06 +01:00
# grep -c " CORE" /var/log/nginx/access.log
2018-02-11 17:28:23 +01:00
0
Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = ' contributor' and qualifier = ' author' ) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
" />
< meta property = "og:type" content = "article" / >
2019-02-02 13:12:57 +01:00
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2017-11/" / >
2019-08-08 17:10:44 +02:00
< meta property = "article:published_time" content = "2017-11-02T09:37:54+02:00" / >
2019-10-28 12:43:25 +01:00
< meta property = "article:modified_time" content = "2019-10-28T13:39:25+02:00" / >
2018-09-30 07:23:48 +02:00
2020-12-06 15:53:29 +01:00
2018-02-11 17:28:23 +01:00
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "November, 2017" / >
< meta name = "twitter:description" content = "2017-11-01
The CORE developers responded to say they are looking into their bot not respecting our robots.txt
2017-11-02
Today there have been no hits by CORE and no alerts from Linode (coincidence?)
2022-03-04 13:30:06 +01:00
# grep -c " CORE" /var/log/nginx/access.log
2018-02-11 17:28:23 +01:00
0
Generate list of authors on CGSpace for Peter to go through and correct:
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = ' contributor' and qualifier = ' author' ) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
"/>
2022-09-05 15:59:11 +02:00
< meta name = "generator" content = "Hugo 0.102.3" / >
2018-02-11 17:28:23 +01:00
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "November, 2017",
2020-04-02 09:55:42 +02:00
"url": "https://alanorth.github.io/cgspace-notes/2017-11/",
2018-04-30 18:05:39 +02:00
"wordCount": "5428",
2019-10-11 10:19:42 +02:00
"datePublished": "2017-11-02T09:37:54+02:00",
2019-10-28 12:43:25 +01:00
"dateModified": "2019-10-28T13:39:25+02:00",
2018-02-11 17:28:23 +01:00
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2017-11/" >
< title > November, 2017 | CGSpace Notes< / title >
2019-10-11 10:19:42 +02:00
2018-02-11 17:28:23 +01:00
<!-- combined, minified CSS -->
2020-01-23 19:19:38 +01:00
2022-09-09 16:29:51 +02:00
< link href = "https://alanorth.github.io/cgspace-notes/css/style.beb8012edc08ba10be012f079d618dc243812267efe62e11f22fe49618f976a4.css" rel = "stylesheet" integrity = "sha256-vrgBLtwIuhC+AS8HnWGNwkOBImfv5i4R8i/klhj5dqQ=" crossorigin = "anonymous" >
2019-10-11 10:19:42 +02:00
2018-02-11 17:28:23 +01:00
2020-01-28 11:01:42 +01:00
<!-- minified Font Awesome for SVG icons -->
2021-09-28 09:32:32 +02:00
< script defer src = "https://alanorth.github.io/cgspace-notes/js/fontawesome.min.f5072c55a0721857184db93a50561d7dc13975b4de2e19db7f81eb5f3fa57270.js" integrity = "sha256-9QcsVaByGFcYTbk6UFYdfcE5dbTeLhnbf4HrXz+lcnA=" crossorigin = "anonymous" > < / script >
2020-01-28 11:01:42 +01:00
2019-04-14 15:59:47 +02:00
<!-- RSS 2.0 feed -->
2018-02-11 17:28:23 +01:00
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
2018-12-19 12:20:39 +01:00
2018-02-11 17:28:23 +01:00
< header class = "blog-header" >
< div class = "container" >
2019-10-11 10:19:42 +02:00
< h1 class = "blog-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" dir = "auto" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
2018-02-11 17:28:23 +01:00
< / div >
< / header >
2018-12-19 12:20:39 +01:00
2018-02-11 17:28:23 +01:00
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
2019-10-11 10:19:42 +02:00
< h2 class = "blog-post-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/2017-11/" > November, 2017< / a > < / h2 >
2020-11-16 09:54:00 +01:00
< p class = "blog-post-meta" >
< time datetime = "2017-11-02T09:37:54+02:00" > Thu Nov 02, 2017< / time >
in
2022-06-23 07:40:53 +02:00
< span class = "fas fa-folder" aria-hidden = "true" > < / span > < a href = "/categories/notes/" rel = "category tag" > Notes< / a >
2018-02-11 17:28:23 +01:00
< / p >
< / header >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-01" > 2017-11-01< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > The CORE developers responded to say they are looking into their bot not respecting our robots.txt< / li >
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-02" > 2017-11-02< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
2019-11-28 16:30:45 +01:00
< li > Today there have been no hits by CORE and no alerts from Linode (coincidence?)< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # grep -c " CORE" /var/log/nginx/access.log
2018-02-11 17:28:23 +01:00
0
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > Generate list of authors on CGSpace for Peter to go through and correct:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = ' contributor' and qualifier = ' author' ) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
2018-02-11 17:28:23 +01:00
COPY 54701
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2018-02-11 17:28:23 +01:00
< li > Abenet asked if it would be possible to generate a report of items in Listing and Reports that had “ International Fund for Agricultural Development” as the < em > only< / em > investor< / li >
< li > I opened a ticket with Atmire to ask if this was possible: < a href = "https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=540" > https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=540< / a > < / li >
< li > Work on making the thumbnails in the item view clickable< / li >
2019-11-28 16:30:45 +01:00
< li > Basically, once you read the METS XML for an item it becomes easy to trace the structure to find the bitstream link< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > //mets:fileSec/mets:fileGrp[@USE=' CONTENT' ]/mets:file/mets:FLocat[@LOCTYPE=' URL' ]/@xlink:href
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > METS XML is available for all items with this pattern: /metadata/handle/10568/95947/mets.xml< / li >
< li > I whipped up a quick hack to print a clickable link with this URL on the thumbnail but it needs to check a few corner cases, like when there is a thumbnail but no content bitstream!< / li >
< li > Help proof fifty-three CIAT records for Sisay: < a href = "https://dspacetest.cgiar.org/handle/10568/95895" > https://dspacetest.cgiar.org/handle/10568/95895< / a > < / li >
< li > A handful of issues with < code > cg.place< / code > using format like “ Lima, PE” instead of “ Lima, Peru” < / li >
< li > Also, some dates like with completely invalid format like “ 2010- 06” and “ 2011-3-28” < / li >
< li > I also collapsed some consecutive whitespace on a handful of fields< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-03" > 2017-11-03< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Atmire got back to us to say that they estimate it will take two days of labor to implement the change to Listings and Reports< / li >
2020-01-27 15:20:44 +01:00
< li > I said I’ d ask Abenet if she wants that feature< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-04" > 2017-11-04< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
2020-01-27 15:20:44 +01:00
< li > I finished looking through Sisay’ s CIAT records for the “ Alianzas de Aprendizaje” data< / li >
2018-02-11 17:28:23 +01:00
< li > I corrected about half of the authors to standardize them< / li >
< li > Linode emailed this morning to say that the CPU usage was high again, this time at 6:14AM< / li >
2020-01-27 15:20:44 +01:00
< li > It’ s the first time in a few days that this has happened< / li >
< li > I had a look to see what was going on, but it isn’ t the CORE bot:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # awk ' {print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
306 68.180.229.31
323 61.148.244.116
414 66.249.66.91
507 40.77.167.16
618 157.55.39.161
652 207.46.13.103
666 157.55.39.254
1173 104.196.152.243
1737 66.249.66.90
23101 138.201.52.218
< / code > < / pre > < ul >
< li > 138.201.52.218 is from some Hetzner server, and I see it making 40,000 requests yesterday too, but none before that:< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > # zgrep -c 138.201.52.218 /var/log/nginx/access.log*
2018-02-11 17:28:23 +01:00
/var/log/nginx/access.log:24403
/var/log/nginx/access.log.1:45958
/var/log/nginx/access.log.2.gz:0
/var/log/nginx/access.log.3.gz:0
/var/log/nginx/access.log.4.gz:0
/var/log/nginx/access.log.5.gz:0
/var/log/nginx/access.log.6.gz:0
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > It’ s clearly a bot as it’ s making tens of thousands of requests, but it’ s using a “ normal” user agent:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > For now I don’ t know what this user is!< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-05" > 2017-11-05< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Peter asked if I could fix the appearance of “ International Livestock Research Institute” in the author lookup during item submission< / li >
< li > It looks to be just an issue with the user interface expecting authors to have both a first and last name:< / li >
< / ul >
2019-11-28 16:30:45 +01:00
< p > < img src = "/cgspace-notes/2017/11/author-lookup.png" alt = "Author lookup" >
< img src = "/cgspace-notes/2017/11/add-author.png" alt = "Add author" > < / p >
2018-02-11 17:28:23 +01:00
< ul >
2019-11-28 16:30:45 +01:00
< li > But in the database the authors are correct (none with weird < code > , /< / code > characters):< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > dspace=# select distinct text_value, authority, confidence from metadatavalue value where resource_type_id=2 and metadata_field_id=3 and text_value like ' International Livestock Research Institute%' ;
2019-11-28 16:30:45 +01:00
text_value | authority | confidence
2018-02-11 17:28:23 +01:00
--------------------------------------------+--------------------------------------+------------
2019-11-28 16:30:45 +01:00
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 0
International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc | 0
International Livestock Research Institute | | -1
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 600
International Livestock Research Institute | f4db1627-47cd-4699-b394-bab7eba6dadc | -1
International Livestock Research Institute | | 600
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | -1
International Livestock Research Institute | 8f3865dc-d056-4aec-90b7-77f49ab4735c | 500
2018-02-11 17:28:23 +01:00
(8 rows)
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > So I’ m not sure if this is just a graphical glitch or if editors have to edit this metadata field prior to approval< / li >
< li > Looking at monitoring Tomcat’ s JVM heap with Prometheus, it looks like we need to use JMX + < a href = "https://github.com/prometheus/jmx_exporter" > jmx_exporter< / a > < / li >
2019-11-28 16:30:45 +01:00
< li > This guide shows how to < a href = "https://geekflare.com/enable-jmx-tomcat-to-monitor-administer/" > enable JMX in Tomcat< / a > by modifying < code > CATALINA_OPTS< / code > < / li >
< li > I was able to successfully connect to my local Tomcat with jconsole!< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-07" > 2017-11-07< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > CGSpace when down and up a few times this morning, first around 3AM, then around 7< / li >
< li > Tsega had to restart Tomcat 7 to fix it temporarily< / li >
2019-11-28 16:30:45 +01:00
< li > I will start by looking at bot usage (access.log.1 includes usage until 6AM today):< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/access.log.1 | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
619 65.49.68.184
840 65.49.68.199
924 66.249.66.91
1131 68.180.229.254
1583 66.249.66.90
1953 207.46.13.103
1999 207.46.13.80
2021 157.55.39.161
2034 207.46.13.36
4681 104.196.152.243
< / code > < / pre > < ul >
< li > 104.196.152.243 seems to be a top scraper for a few weeks now:< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > # zgrep -c 104.196.152.243 /var/log/nginx/access.log*
2018-02-11 17:28:23 +01:00
/var/log/nginx/access.log:336
/var/log/nginx/access.log.1:4681
/var/log/nginx/access.log.2.gz:3531
/var/log/nginx/access.log.3.gz:3532
/var/log/nginx/access.log.4.gz:5786
/var/log/nginx/access.log.5.gz:8542
/var/log/nginx/access.log.6.gz:6988
/var/log/nginx/access.log.7.gz:7517
/var/log/nginx/access.log.8.gz:7211
/var/log/nginx/access.log.9.gz:2763
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > This user is responsible for hundreds and sometimes thousands of Tomcat sessions:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > $ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E ' session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
954
2022-03-04 13:30:06 +01:00
$ grep 104.196.152.243 dspace.log.2017-11-03 | grep -o -E ' session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
6199
2022-03-04 13:30:06 +01:00
$ grep 104.196.152.243 dspace.log.2017-11-01 | grep -o -E ' session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
7051
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > The worst thing is that this user never specifies a user agent string so we can’ t lump it in with the other bots using the Tomcat Session Crawler Manager Valve< / li >
< li > They don’ t request dynamic URLs like “ /discover” but they seem to be fetching handles from XMLUI instead of REST (and some with < code > //handle< / code > , note the regex below):< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > # grep -c 104.196.152.243 /var/log/nginx/access.log.1
2018-02-11 17:28:23 +01:00
4681
2022-03-04 13:30:06 +01:00
# grep 104.196.152.243 /var/log/nginx/access.log.1 | grep -c -P ' GET //?handle'
2018-02-11 17:28:23 +01:00
4618
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > I just realized that < code > ciat.cgiar.org< / code > points to 104.196.152.243, so I should contact Leroy from CIAT to see if we can change their scraping behavior< / li >
2020-01-27 15:20:44 +01:00
< li > The next IP (207.46.13.36) seem to be Microsoft’ s bingbot, but all its requests specify the “ bingbot” user agent and there are no requests for dynamic URLs that are forbidden, like “ /discover” :< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > $ grep -c 207.46.13.36 /var/log/nginx/access.log.1
2018-02-11 17:28:23 +01:00
2034
2022-03-04 13:30:06 +01:00
# grep 207.46.13.36 /var/log/nginx/access.log.1 | grep -c " GET /discover"
2018-02-11 17:28:23 +01:00
0
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > The next IP (157.55.39.161) also seems to be bingbot, and none of its requests are for URLs forbidden by robots.txt either:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # grep 157.55.39.161 /var/log/nginx/access.log.1 | grep -c " GET /discover"
2018-02-11 17:28:23 +01:00
0
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > The next few seem to be bingbot as well, and they declare a proper user agent and do not request dynamic URLs like “ /discover” :< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # grep -c -E ' 207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1
2018-02-11 17:28:23 +01:00
5997
2022-03-04 13:30:06 +01:00
# grep -E ' 207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c " bingbot"
2018-02-11 17:28:23 +01:00
5988
2022-03-04 13:30:06 +01:00
# grep -E ' 207.46.13.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c " GET /discover"
2018-02-11 17:28:23 +01:00
0
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > The next few seem to be Googlebot, and they declare a proper user agent and do not request dynamic URLs like “ /discover” :< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # grep -c -E ' 66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1
2018-02-11 17:28:23 +01:00
3048
2022-03-04 13:30:06 +01:00
# grep -E ' 66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c Google
2018-02-11 17:28:23 +01:00
3048
2022-03-04 13:30:06 +01:00
# grep -E ' 66.249.66.[0-9]{2,3}' /var/log/nginx/access.log.1 | grep -c " GET /discover"
2018-02-11 17:28:23 +01:00
0
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > The next seems to be Yahoo, which declares a proper user agent and does not request dynamic URLs like “ /discover” :< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > # grep -c 68.180.229.254 /var/log/nginx/access.log.1
2018-02-11 17:28:23 +01:00
1131
2022-03-04 13:30:06 +01:00
# grep 68.180.229.254 /var/log/nginx/access.log.1 | grep -c " GET /discover"
2018-02-11 17:28:23 +01:00
0
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > The last of the top ten IPs seems to be some bot with a weird user agent, but they are not behaving too well:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # grep -c -E ' 65.49.68.[0-9]{3}' /var/log/nginx/access.log.1
2018-02-11 17:28:23 +01:00
2950
2022-03-04 13:30:06 +01:00
# grep -E ' 65.49.68.[0-9]{3}' /var/log/nginx/access.log.1 | grep -c " GET /discover"
2018-02-11 17:28:23 +01:00
330
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > Their user agents vary, ie:
2018-02-11 17:28:23 +01:00
< ul >
< li > < code > Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36< / code > < / li >
< li > < code > Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11< / code > < / li >
< li > < code > Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)< / code > < / li >
2019-11-28 16:30:45 +01:00
< / ul >
< / li >
2020-01-27 15:20:44 +01:00
< li > I’ ll just keep an eye on that one for now, as it only made a few hundred requests to dynamic discovery URLs< / li >
< li > While it’ s not in the top ten, Baidu is one bot that seems to not give a fuck:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep " 7/Nov/2017" | grep -c Baiduspider
2018-02-11 17:28:23 +01:00
8912
2022-03-04 13:30:06 +01:00
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep " 7/Nov/2017" | grep Baiduspider | grep -c -E " GET /(browse|discover|search-filter)"
2018-02-11 17:28:23 +01:00
2521
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > According to their documentation their bot < a href = "http://www.baidu.com/search/robots_english.html" > respects < code > robots.txt< / code > < / a > , but I don’ t see this being the case< / li >
2019-11-28 16:30:45 +01:00
< li > I think I will end up blocking Baidu as well… < / li >
< li > Next is for me to look and see what was happening specifically at 3AM and 7AM when the server crashed< / li >
2020-01-27 15:20:44 +01:00
< li > I should look in nginx access.log, rest.log, oai.log, and DSpace’ s dspace.log.2017-11-07< / li >
2019-11-28 16:30:45 +01:00
< li > Here are the top IPs making requests to XMLUI from 2 to 8 AM:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E ' 07/Nov/2017:0[2-8]' | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
279 66.249.66.91
373 65.49.68.199
446 68.180.229.254
470 104.196.152.243
470 197.210.168.174
598 207.46.13.103
603 157.55.39.161
637 207.46.13.80
703 207.46.13.36
724 66.249.66.90
< / code > < / pre > < ul >
< li > Of those, most are Google, Bing, Yahoo, etc, except 63.143.42.244 and 63.143.42.242 which are Uptime Robot< / li >
< li > Here are the top IPs making requests to REST from 2 to 8 AM:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E ' 07/Nov/2017:0[2-8]' | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
8 207.241.229.237
10 66.249.66.90
16 104.196.152.243
25 41.60.238.61
26 157.55.39.161
27 207.46.13.103
27 207.46.13.80
31 207.46.13.36
1498 50.116.102.77
< / code > < / pre > < ul >
< li > The OAI requests during that same time period are nothing to worry about:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E ' 07/Nov/2017:0[2-8]' | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
1 66.249.66.92
4 66.249.66.90
6 68.180.229.254
< / code > < / pre > < ul >
< li > The top IPs from dspace.log during the 2– 8 AM period:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > $ grep -E ' 2017-11-07 0[2-8]' dspace.log.2017-11-07 | grep -o -E ' ip_addr=[0-9.]+' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
143 ip_addr=213.55.99.121
181 ip_addr=66.249.66.91
223 ip_addr=157.55.39.161
248 ip_addr=207.46.13.80
251 ip_addr=207.46.13.103
291 ip_addr=207.46.13.36
297 ip_addr=197.210.168.174
312 ip_addr=65.49.68.199
462 ip_addr=104.196.152.243
488 ip_addr=66.249.66.90
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > These aren’ t actually very interesting, as the top few are Google, CIAT, Bingbot, and a few other unknown scrapers< / li >
< li > The number of requests isn’ t even that high to be honest< / li >
2019-11-28 16:30:45 +01:00
< li > As I was looking at these logs I noticed another heavy user (124.17.34.59) that was not active during this time period, but made many requests today alone:< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > # zgrep -c 124.17.34.59 /var/log/nginx/access.log*
2018-02-11 17:28:23 +01:00
/var/log/nginx/access.log:22581
/var/log/nginx/access.log.1:0
/var/log/nginx/access.log.2.gz:14
/var/log/nginx/access.log.3.gz:0
/var/log/nginx/access.log.4.gz:0
/var/log/nginx/access.log.5.gz:3
/var/log/nginx/access.log.6.gz:0
/var/log/nginx/access.log.7.gz:0
/var/log/nginx/access.log.8.gz:0
/var/log/nginx/access.log.9.gz:1
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > The whois data shows the IP is from China, but the user agent doesn’ t really give any clues:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # grep 124.17.34.59 /var/log/nginx/access.log | awk -F' " ' ' {print $3}' | sort | uniq -c | sort -h
210 " Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
22610 " Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Win64; x64; Trident/7.0; LCTE)"
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > A Google search for “ LCTE bot” doesn’ t return anything interesting, but this < a href = "https://stackoverflow.com/questions/42500881/what-is-lcte-in-user-agent" > Stack Overflow discussion< / a > references the lack of information< / li >
2019-11-28 16:30:45 +01:00
< li > So basically after a few hours of looking at the log files I am not closer to understanding what is going on!< / li >
< li > I do know that we want to block Baidu, though, as it does not respect < code > robots.txt< / code > < / li >
< li > And as we speak Linode alerted that the outbound traffic rate is very high for the past two hours (about 12– 14 hours)< / li >
< li > At least for now it seems to be that new Chinese IP (124.17.34.59):< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # grep -E " 07/Nov/2017:1[234]:" /var/log/nginx/access.log | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
198 207.46.13.103
203 207.46.13.80
205 207.46.13.36
218 157.55.39.161
249 45.5.184.221
258 45.5.187.130
386 66.249.66.90
410 197.210.168.174
1896 104.196.152.243
11005 124.17.34.59
< / code > < / pre > < ul >
< li > Seems 124.17.34.59 are really downloading all our PDFs, compared to the next top active IPs during this time!< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # grep -E " 07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 124.17.34.59 | grep -c pdf
2018-02-11 17:28:23 +01:00
5948
2022-03-04 13:30:06 +01:00
# grep -E " 07/Nov/2017:1[234]:" /var/log/nginx/access.log | grep 104.196.152.243 | grep -c pdf
2018-02-11 17:28:23 +01:00
0
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > About CIAT, I think I need to encourage them to specify a user agent string for their requests, because they are not reuising their Tomcat session and they are creating thousands of sessions per day< / li >
< li > All CIAT requests vs unique ones:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > $ grep -Io -E ' session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | wc -l
2018-02-11 17:28:23 +01:00
3506
2022-03-04 13:30:06 +01:00
$ grep -Io -E ' session_id=[A-Z0-9]{32}:ip_addr=104.196.152.243' dspace.log.2017-11-07 | sort | uniq | wc -l
2018-02-11 17:28:23 +01:00
3506
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > I emailed CIAT about the session issue, user agent issue, and told them they should not scrape the HTML contents of communities, instead using the REST API< / li >
< li > About Baidu, I found a link to their < a href = "http://ziyuan.baidu.com/robots/" > robots.txt tester tool< / a > < / li >
< li > It seems like our robots.txt file is valid, and they claim to recognize that URLs like < code > /discover< / code > should be forbidden (不允许, aka “ not allowed” ):< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-11-28 16:30:45 +01:00
< p > < img src = "/cgspace-notes/2017/11/baidu-robotstxt.png" alt = "Baidu robots.txt tester" > < / p >
2018-02-11 17:28:23 +01:00
< ul >
2019-11-28 16:30:45 +01:00
< li > But they literally just made this request today:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > 180.76.15.136 - - [07/Nov/2017:06:25:11 +0000] " GET /discover?filtertype_0=crpsubject& filter_relational_operator_0=equals& filter_0=WATER%2C+LAND+AND+ECOSYSTEMS& filtertype=subject& filter_relational_operator=equals& filter=WATER+RESOURCES HTTP/1.1" 200 82265 " -" " Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > Along with another thousand or so requests to URLs that are forbidden in robots.txt today alone:< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > # grep -c Baiduspider /var/log/nginx/access.log
2018-02-11 17:28:23 +01:00
3806
2022-03-04 13:30:06 +01:00
# grep Baiduspider /var/log/nginx/access.log | grep -c -E " GET /(browse|discover|search-filter)"
2018-02-11 17:28:23 +01:00
1085
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > I will think about blocking their IPs but they have 164 of them!< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # grep " Baiduspider/2.0" /var/log/nginx/access.log | awk ' {print $1}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
164
2019-12-17 13:49:24 +01:00
< / code > < / pre > < h2 id = "2017-11-08" > 2017-11-08< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Linode sent several alerts last night about CPU usage and outbound traffic rate at 6:13PM< / li >
< li > Linode sent another alert about CPU usage in the morning at 6:12AM< / li >
2019-11-28 16:30:45 +01:00
< li > Jesus, the new Chinese IP (124.17.34.59) has downloaded 24,000 PDFs in the last 24 hours:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E " 0[78]/Nov/2017:" | grep 124.17.34.59 | grep -v pdf.jpg | grep -c pdf
2018-02-11 17:28:23 +01:00
24981
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > This is about 20,000 Tomcat sessions:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > $ cat dspace.log.2017-11-07 dspace.log.2017-11-08 | grep -Io -E ' session_id=[A-Z0-9]{32}:ip_addr=124.17.34.59' | sort | uniq | wc -l
2018-02-11 17:28:23 +01:00
20733
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > I’ m getting really sick of this< / li >
2019-11-28 16:30:45 +01:00
< li > Sisay re-uploaded the CIAT records that I had already corrected earlier this week, erasing all my corrections< / li >
< li > I had to re-correct all the publishers, places, names, dates, etc and apply the changes on DSpace Test< / li >
< li > Run system updates on DSpace Test and reboot the server< / li >
< li > Magdalena had written to say that two of their Phase II project tags were missing on CGSpace, so I added them (< a href = "https://github.com/ilri/DSpace/pull/346" > #346< / a > )< / li >
2020-01-27 15:20:44 +01:00
< li > I figured out a way to use nginx’ s map function to assign a “ bot” user agent to misbehaving clients who don’ t define a user agent< / li >
< li > Most bots are automatically lumped into one generic session by < a href = "https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve" > Tomcat’ s Crawler Session Manager Valve< / a > but this only works if their user agent matches a pre-defined regular expression like < code > .*[bB]ot.*< / code > < / li >
2019-11-28 16:30:45 +01:00
< li > Some clients send thousands of requests without a user agent which ends up creating thousands of Tomcat sessions, wasting precious memory, CPU, and database resources in the process< / li >
< li > Basically, we modify the nginx config to add a mapping with a modified user agent < code > $ua< / code > :< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > map $remote_addr $ua {
2019-11-28 16:30:45 +01:00
# 2017-11-08 Random Chinese host grabbing 20,000 PDFs
2022-03-04 13:30:06 +01:00
124.17.34.59 ' ChineseBot' ;
2019-11-28 16:30:45 +01:00
default $http_user_agent;
2018-02-11 17:28:23 +01:00
}
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > If the client’ s address matches then the user agent is set, otherwise the default < code > $http_user_agent< / code > variable is used< / li >
< li > Then, in the server’ s < code > /< / code > block we pass this header to Tomcat:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > proxy_pass http://tomcat_http;
2018-02-11 17:28:23 +01:00
proxy_set_header User-Agent $ua;
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > Note to self: the < code > $ua< / code > variable won’ t show up in nginx access logs because the default < code > combined< / code > log format doesn’ t show it, so don’ t run around pulling your hair out wondering with the modified user agents aren’ t showing in the logs!< / li >
2019-11-28 16:30:45 +01:00
< li > If a client matching one of these IPs connects without a session, it will be assigned one by the Crawler Session Manager Valve< / li >
2020-01-27 15:20:44 +01:00
< li > You can verify by cross referencing nginx’ s < code > access.log< / code > and DSpace’ s < code > dspace.log.2017-11-08< / code > , for example< / li >
2019-11-28 16:30:45 +01:00
< li > I will deploy this on CGSpace later this week< / li >
< li > I am interested to check how this affects the number of sessions used by the CIAT and Chinese bots (see above on < a href = "#2017-11-07" > 2017-11-07< / a > for example)< / li >
< li > I merged the clickable thumbnails code to < code > 5_x-prod< / code > (< a href = "https://github.com/ilri/DSpace/pull/347" > #347< / a > ) and will deploy it later along with the new bot mapping stuff (and re-run the Asible < code > nginx< / code > and < code > tomcat< / code > tags)< / li >
< li > I was thinking about Baidu again and decided to see how many requests they have versus Google to URL paths that are explicitly forbidden in < code > robots.txt< / code > :< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # zgrep Baiduspider /var/log/nginx/access.log* | grep -c -E " GET /(browse|discover|search-filter)"
2018-02-11 17:28:23 +01:00
22229
2022-03-04 13:30:06 +01:00
# zgrep Googlebot /var/log/nginx/access.log* | grep -c -E " GET /(browse|discover|search-filter)"
2018-02-11 17:28:23 +01:00
0
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > It seems that they rarely even bother checking < code > robots.txt< / code > , but Google does multiple times per day!< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > # zgrep Baiduspider /var/log/nginx/access.log* | grep -c robots.txt
2018-02-11 17:28:23 +01:00
14
# zgrep Googlebot /var/log/nginx/access.log* | grep -c robots.txt
1134
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > I have been looking for a reason to ban Baidu and this is definitely a good one< / li >
2020-01-27 15:20:44 +01:00
< li > Disallowing < code > Baiduspider< / code > in < code > robots.txt< / code > probably won’ t work because this bot doesn’ t seem to respect the robot exclusion standard anyways!< / li >
2019-11-28 16:30:45 +01:00
< li > I will whip up something in nginx later< / li >
< li > Run system updates on CGSpace and reboot the server< / li >
< li > Re-deploy latest < code > 5_x-prod< / code > branch on CGSpace and DSpace Test (includes the clickable thumbnails, CCAFS phase II project tags, and updated news text)< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-09" > 2017-11-09< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
2019-11-28 16:30:45 +01:00
< li > Awesome, it seems my bot mapping stuff in nginx actually reduced the number of Tomcat sessions used by the CIAT scraper today, total requests and unique sessions:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep ' 09/Nov/2017' | grep -c 104.196.152.243
2018-02-11 17:28:23 +01:00
8956
2022-03-04 13:30:06 +01:00
$ grep 104.196.152.243 dspace.log.2017-11-09 | grep -o -E ' session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
223
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > Versus the same stats for yesterday and the day before:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep ' 08/Nov/2017' | grep -c 104.196.152.243
2018-02-11 17:28:23 +01:00
10216
2022-03-04 13:30:06 +01:00
$ grep 104.196.152.243 dspace.log.2017-11-08 | grep -o -E ' session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
2592
2022-03-04 13:30:06 +01:00
# zcat -f -- /var/log/nginx/access.log.2.gz /var/log/nginx/access.log.3.gz | grep ' 07/Nov/2017' | grep -c 104.196.152.243
2018-02-11 17:28:23 +01:00
8120
2022-03-04 13:30:06 +01:00
$ grep 104.196.152.243 dspace.log.2017-11-07 | grep -o -E ' session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
3506
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > The number of sessions is over < em > ten times less< / em > !< / li >
2020-01-27 15:20:44 +01:00
< li > This gets me thinking, I wonder if I can use something like nginx’ s rate limiter to automatically change the user agent of clients who make too many requests< / li >
2019-11-28 16:30:45 +01:00
< li > Perhaps using a combination of geo and map, like illustrated here: < a href = "https://www.nginx.com/blog/rate-limiting-nginx/" > https://www.nginx.com/blog/rate-limiting-nginx/< / a > < / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-11" > 2017-11-11< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > I was looking at the Google index and noticed there are 4,090 search results for dspace.ilri.org but only seven for mahider.ilri.org< / li >
< li > Search with something like: inurl:dspace.ilri.org inurl:https< / li >
< li > I want to get rid of those legacy domains eventually!< / li >
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-12" > 2017-11-12< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Update the < a href = "https://github.com/ilri/rmg-ansible-public" > Ansible infrastructure templates< / a > to be a little more modular and flexible< / li >
2020-01-27 15:20:44 +01:00
< li > Looking at the top client IPs on CGSpace so far this morning, even though it’ s only been eight hours:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep " 12/Nov/2017" | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
243 5.83.120.111
335 40.77.167.103
424 66.249.66.91
529 207.46.13.36
554 40.77.167.129
604 207.46.13.53
754 104.196.152.243
883 66.249.66.90
1150 95.108.181.88
1381 5.9.6.51
< / code > < / pre > < ul >
< li > 5.9.6.51 seems to be a Russian bot:< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > # grep 5.9.6.51 /var/log/nginx/access.log | tail -n 1
2022-03-04 13:30:06 +01:00
5.9.6.51 - - [12/Nov/2017:08:13:13 +0000] " GET /handle/10568/16515/recent-submissions HTTP/1.1" 200 5097 " -" " Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)"
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > What’ s amazing is that it seems to reuse its Java session across all requests:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > $ grep -c -E ' session_id=[A-Z0-9]{32}:ip_addr=5.9.6.51' dspace.log.2017-11-12
2018-02-11 17:28:23 +01:00
1558
2022-03-04 13:30:06 +01:00
$ grep 5.9.6.51 dspace.log.2017-11-12 | grep -o -E ' session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
1
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > Bravo to MegaIndex.ru!< / li >
2020-01-27 15:20:44 +01:00
< li > The same cannot be said for 95.108.181.88, which appears to be YandexBot, even though Tomcat’ s Crawler Session Manager valve regex should match ‘ YandexBot’ :< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > # grep 95.108.181.88 /var/log/nginx/access.log | tail -n 1
2022-03-04 13:30:06 +01:00
95.108.181.88 - - [12/Nov/2017:08:33:17 +0000] " GET /bitstream/handle/10568/57004/GenebankColombia_23Feb2015.pdf HTTP/1.1" 200 972019 " -" " Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
$ grep -c -E ' session_id=[A-Z0-9]{32}:ip_addr=95.108.181.88' dspace.log.2017-11-12
2018-02-11 17:28:23 +01:00
991
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > Move some items and collections on CGSpace for Peter Ballantyne, running < a href = "https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515" > < code > move_collections.sh< / code > < / a > with the following configuration:< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > 10947/6 10947/1 10568/83389
2018-02-11 17:28:23 +01:00
10947/34 10947/1 10568/83389
10947/2512 10947/1 10568/83389
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > I explored nginx rate limits as a way to aggressively throttle Baidu bot which doesn’ t seem to respect disallowed URLs in robots.txt< / li >
< li > There’ s an interesting < a href = "https://www.nginx.com/blog/rate-limiting-nginx/" > blog post from Nginx’ s team about rate limiting< / a > as well as a < a href = "https://gist.github.com/arosenhagen/8aaf5d7f94171778c0e9" > clever use of mapping with rate limits< / a > < / li >
2019-11-28 16:30:45 +01:00
< li > The solution < a href = "https://github.com/ilri/rmg-ansible-public/commit/f0646991772660c505bea9c5ac586490e7c86156" > I came up with< / a > uses tricks from both of those< / li >
< li > I deployed the limit on CGSpace and DSpace Test and it seems to work well:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > $ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:' Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
2018-02-11 17:28:23 +01:00
HTTP/1.1 200 OK
Connection: keep-alive
Content-Encoding: gzip
Content-Language: en-US
Content-Type: text/html;charset=utf-8
Date: Sun, 12 Nov 2017 16:30:19 GMT
Server: nginx
Strict-Transport-Security: max-age=15768000
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-Cocoon-Version: 2.2.0
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
2022-03-04 13:30:06 +01:00
$ http --print h https://cgspace.cgiar.org/handle/10568/1 User-Agent:' Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'
2018-02-11 17:28:23 +01:00
HTTP/1.1 503 Service Temporarily Unavailable
Connection: keep-alive
Content-Length: 206
Content-Type: text/html
Date: Sun, 12 Nov 2017 16:30:21 GMT
Server: nginx
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > The first request works, second is denied with an HTTP 503!< / li >
< li > I need to remember to check the Munin graphs for PostgreSQL and JVM next week to see how this affects them< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-13" > 2017-11-13< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
2019-11-28 16:30:45 +01:00
< li > At the end of the day I checked the logs and it really looks like the Baidu rate limiting is working, HTTP 200 vs 503:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep " 13/Nov/2017" | grep " Baiduspider" | grep -c " 200 "
2018-02-11 17:28:23 +01:00
1132
2022-03-04 13:30:06 +01:00
# zcat -f -- /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep " 13/Nov/2017" | grep " Baiduspider" | grep -c " 503 "
2018-02-11 17:28:23 +01:00
10105
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > Helping Sisay proof 47 records for IITA: < a href = "https://dspacetest.cgiar.org/handle/10568/97029" > https://dspacetest.cgiar.org/handle/10568/97029< / a > < / li >
< li > From looking at the data in OpenRefine I found:
2018-02-11 17:28:23 +01:00
< ul >
< li > Errors in < code > cg.authorship.types< / code > < / li >
< li > Errors in < code > cg.coverage.country< / code > (smart quote in “ COTE D’ IVOIRE” , “ HAWAII” is not a country)< / li >
< li > Whitespace issues in some < code > cg.contributor.affiliation< / code > < / li >
< li > Whitespace issues in some < code > cg.identifier.doi< / code > fields and most values are using HTTP instead of HTTPS< / li >
< li > Whitespace issues in some < code > dc.contributor.author< / code > fields< / li >
< li > Issue with invalid < code > dc.date.issued< / code > value “ 2011-3” < / li >
< li > Description fields are poorly copy– pasted< / li >
< li > Whitespace issues in < code > dc.description.sponsorship< / code > < / li >
< li > Lots of inconsistency in < code > dc.format.extent< / code > (mixed dash style, periods at the end of values)< / li >
< li > Whitespace errors in < code > dc.identifier.citation< / code > < / li >
< li > Whitespace errors in < code > dc.subject< / code > < / li >
< li > Whitespace errors in < code > dc.title< / code > < / li >
< / ul >
2019-11-28 16:30:45 +01:00
< / li >
< li > After uploading and looking at the data in DSpace Test I saw more errors with CRPs, subjects (one item had four copies of all of its subjects, another had a “ .” in it), affiliations, sponsors, etc.< / li >
< li > Atmire responded to the < a href = "https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=510" > ticket about ORCID stuff< / a > a few days ago, today I told them that I need to talk to Peter and the partners to see what we would like to do< / li >
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-14" > 2017-11-14< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Deploy some nginx configuration updates to CGSpace< / li >
< li > They had been waiting on a branch for a few months and I think I just forgot about them< / li >
2020-01-27 15:20:44 +01:00
< li > I have been running them on DSpace Test for a few days and haven’ t seen any issues there< / li >
2018-02-11 17:28:23 +01:00
< li > Started testing DSpace 6.2 and a few things have changed< / li >
2019-11-28 16:30:45 +01:00
< li > Now PostgreSQL needs < code > pgcrypto< / code > :< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > $ psql dspace6
2018-02-11 17:28:23 +01:00
dspace6=# CREATE EXTENSION pgcrypto;
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > Also, local settings are no longer in < code > build.properties< / code > , they are now in < code > local.cfg< / code > < / li >
2020-01-27 15:20:44 +01:00
< li > I’ m not sure if we can use separate profiles like we did before with < code > mvn -Denv=blah< / code > to use blah.properties< / li >
2019-11-28 16:30:45 +01:00
< li > It seems we need to use “ system properties” to override settings, ie: < code > -Ddspace.dir=/Users/aorth/dspace6< / code > < / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-15" > 2017-11-15< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Send Adam Hunt an invite to the DSpace Developers network on Yammer< / li >
< li > He is the new head of communications at WLE, since Michael left< / li >
2020-01-27 15:20:44 +01:00
< li > Merge changes to item view’ s wording of link metadata (< a href = "https://github.com/ilri/DSpace/pull/348" > #348< / a > )< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-17" > 2017-11-17< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Uptime Robot said that CGSpace went down today and I see lots of < code > Timeout waiting for idle object< / code > errors in the DSpace logs< / li >
< li > I looked in PostgreSQL using < code > SELECT * FROM pg_stat_activity;< / code > and saw that there were 73 active connections< / li >
< li > After a few minutes the connecitons went down to 44 and CGSpace was kinda back up, it seems like Tsega restarted Tomcat< / li >
2020-01-27 15:20:44 +01:00
< li > Looking at the REST and XMLUI log files, I don’ t see anything too crazy:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep " 17/Nov/2017" | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
13 66.249.66.223
14 207.46.13.36
17 207.46.13.137
22 207.46.13.23
23 66.249.66.221
92 66.249.66.219
187 104.196.152.243
1400 70.32.83.92
1503 50.116.102.77
6037 45.5.184.196
2022-03-04 13:30:06 +01:00
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep " 17/Nov/2017" | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
325 139.162.247.24
354 66.249.66.223
422 207.46.13.36
434 207.46.13.23
501 207.46.13.137
647 66.249.66.221
662 34.192.116.178
762 213.55.99.121
1867 104.196.152.243
2020 66.249.66.219
< / code > < / pre > < ul >
< li > I need to look into using JMX to analyze active sessions I think, rather than looking at log files< / li >
2020-01-27 15:20:44 +01:00
< li > After adding appropriate < a href = "https://geekflare.com/enable-jmx-tomcat-to-monitor-administer/" > JMX listener options to Tomcat’ s JAVA_OPTS< / a > and restarting Tomcat, I can connect remotely using an SSH dynamic port forward (SOCKS) on port 7777 for example, and then start jconsole locally like:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > $ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=7777 service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi -J-DsocksNonProxyHosts=
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > Looking at the MBeans you can drill down in Catalina→Manager→webapp→localhost→Attributes and see active sessions, etc< / li >
< li > I want to enable JMX listener on CGSpace but I need to do some more testing on DSpace Test and see if it causes any performance impact, for example< / li >
< li > If I hit the server with some requests as a normal user I see the session counter increase, but if I specify a bot user agent then the sessions seem to be reused (meaning the Crawler Session Manager is working)< / li >
< li > Here is the Jconsole screen after looping < code > http --print Hh https://dspacetest.cgiar.org/handle/10568/1< / code > for a few minutes:< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-11-28 16:30:45 +01:00
< p > < img src = "/cgspace-notes/2017/11/jconsole-sessions.png" alt = "Jconsole sessions for XMLUI" > < / p >
2018-02-11 17:28:23 +01:00
< ul >
< li > Switch DSpace Test to using the G1GC for JVM so I can see what the JVM graph looks like eventually, and start evaluating it for production< / li >
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-19" > 2017-11-19< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Linode sent an alert that CGSpace was using a lot of CPU around 4– 6 AM< / li >
2019-11-28 16:30:45 +01:00
< li > Looking in the nginx access logs I see the most active XMLUI users between 4 and 6 AM:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E " 19/Nov/2017:0[456]" | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
111 66.249.66.155
171 5.9.6.51
188 54.162.241.40
229 207.46.13.23
233 207.46.13.137
247 40.77.167.6
251 207.46.13.36
275 68.180.229.254
325 104.196.152.243
1610 66.249.66.153
< / code > < / pre > < ul >
< li > 66.249.66.153 appears to be Googlebot:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > 66.249.66.153 - - [19/Nov/2017:06:26:01 +0000] " GET /handle/10568/2203 HTTP/1.1" 200 6309 " -" " Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > We know Googlebot is persistent but behaves well, so I guess it was just a coincidence that it came at a time when we had other traffic and server activity< / li >
< li > In related news, I see an Atmire update process going for many hours and responsible for hundreds of thousands of log entries (two thirds of all log entries)< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > $ wc -l dspace.log.2017-11-19
2018-02-11 17:28:23 +01:00
388472 dspace.log.2017-11-19
$ grep -c com.atmire.utils.UpdateSolrStatsMetadata dspace.log.2017-11-19
267494
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > WTF is this process doing every day, and for so many hours?< / li >
< li > In unrelated news, when I was looking at the DSpace logs I saw a bunch of errors like this:< / li >
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > 2017-11-19 03:00:32,806 INFO org.apache.pdfbox.pdfparser.PDFParser @ Document is encrypted
2018-02-11 17:28:23 +01:00
2017-11-19 03:00:32,807 ERROR org.apache.pdfbox.filter.FlateFilter @ FlateFilter: stop reading corrupt stream due to a DataFormatException
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > It’ s been a few days since I enabled the G1GC on DSpace Test and the JVM graph definitely changed:< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-11-28 16:30:45 +01:00
< p > < img src = "/cgspace-notes/2017/11/tomcat-jvm-g1gc.png" alt = "Tomcat G1GC" > < / p >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-20" > 2017-11-20< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > I found < a href = "https://www.cakesolutions.net/teamblogs/low-pause-gc-on-the-jvm" > an article about JVM tuning< / a > that gives some pointers how to enable logging and tools to analyze logs for you< / li >
< li > Also notes on < a href = "https://blog.gceasy.io/2016/11/15/rotating-gc-log-files/" > rotating GC logs< / a > < / li >
2020-01-27 15:20:44 +01:00
< li > I decided to switch DSpace Test back to the CMS garbage collector because it is designed for low pauses and high throughput (like G1GC!) and because we haven’ t even tried to monitor or tune it< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-21" > 2017-11-21< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
2019-11-28 16:30:45 +01:00
< li > Magdalena was having problems logging in via LDAP and it seems to be a problem with the CGIAR LDAP server:< / li >
2019-05-05 15:45:12 +02:00
< / ul >
2021-09-13 15:21:16 +02:00
< pre tabindex = "0" > < code > 2017-11-21 11:11:09,621 WARN org.dspace.authenticate.LDAPAuthentication @ anonymous:session_id=2FEC0E5286C17B6694567FFD77C3171C:ip_addr=77.241.141.58:ldap_authentication:type=failed_auth javax.naming.CommunicationException\colon; simple bind failed\colon; svcgroot2.cgiarad.org\colon;3269 [Root exception is javax.net.ssl.SSLHandshakeException\colon; sun.security.validator.ValidatorException\colon; PKIX path validation failed\colon; java.security.cert.CertPathValidatorException\colon; validity check failed]
2019-12-17 13:49:24 +01:00
< / code > < / pre > < h2 id = "2017-11-22" > 2017-11-22< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Linode sent an alert that the CPU usage on the CGSpace server was very high around 4 to 6 AM< / li >
2020-01-27 15:20:44 +01:00
< li > The logs don’ t show anything particularly abnormal between those hours:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E " 22/Nov/2017:0[456]" | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
136 31.6.77.23
174 68.180.229.254
217 66.249.66.91
256 157.55.39.79
268 54.144.57.183
281 207.46.13.137
282 207.46.13.36
290 207.46.13.23
696 66.249.66.90
707 104.196.152.243
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > I haven’ t seen 54.144.57.183 before, it is apparently the CCBot from commoncrawl.org< / li >
2019-11-28 16:30:45 +01:00
< li > In other news, it looks like the JVM garbage collection pattern is back to its standard jigsaw pattern after switching back to CMS a few days ago:< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-11-28 16:30:45 +01:00
< p > < img src = "/cgspace-notes/2017/11/tomcat-jvm-cms.png" alt = "Tomcat JVM with CMS GC" > < / p >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-23" > 2017-11-23< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Linode alerted again that CPU usage was high on CGSpace from 4:13 to 6:13 AM< / li >
2019-11-28 16:30:45 +01:00
< li > I see a lot of Googlebot (66.249.66.90) in the XMLUI access logs< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E " 23/Nov/2017:0[456]" | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
88 66.249.66.91
140 68.180.229.254
155 54.196.2.131
182 54.224.164.166
301 157.55.39.79
315 207.46.13.36
331 207.46.13.23
358 207.46.13.137
565 104.196.152.243
1570 66.249.66.90
< / code > < / pre > < ul >
< li > … and the usual REST scrapers from CIAT (45.5.184.196) and CCAFS (70.32.83.92):< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 | grep -E " 23/Nov/2017:0[456]" | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
5 190.120.6.219
6 104.198.9.108
14 104.196.152.243
21 112.134.150.6
22 157.55.39.79
22 207.46.13.137
23 207.46.13.36
26 207.46.13.23
942 45.5.184.196
3995 70.32.83.92
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > These IPs crawling the REST API don’ t specify user agents and I’ d assume they are creating many Tomcat sessions< / li >
2019-11-28 16:30:45 +01:00
< li > I would catch them in nginx to assign a “ bot” user agent to them so that the Tomcat Crawler Session Manager valve could deal with them, but they seem to create any really — at least not in the dspace.log:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > $ grep 70.32.83.92 dspace.log.2017-11-23 | grep -o -E ' session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
2
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > I’ m wondering if REST works differently, or just doesn’ t log these sessions?< / li >
2019-11-28 16:30:45 +01:00
< li > I wonder if they are measurable via JMX MBeans?< / li >
2020-01-27 15:20:44 +01:00
< li > I did some tests locally and I don’ t see the sessionCounter incrementing after making requests to REST, but it does with XMLUI and OAI< / li >
< li > I came across some interesting PostgreSQL tuning advice for SSDs: < a href = "https://amplitude.engineering/how-a-single-postgresql-config-change-improved-slow-query-performance-by-50x-85593b8991b0" > https://amplitude.engineering/how-a-single-postgresql-config-change-improved-slow-query-performance-by-50x-85593b8991b0< / a > < / li >
2019-11-28 16:30:45 +01:00
< li > Apparently setting < code > random_page_cost< / code > to 1 is “ common” advice for systems running PostgreSQL on SSD (the default is 4)< / li >
< li > So I deployed this on DSpace Test and will check the Munin PostgreSQL graphs in a few days to see if anything changes< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-24" > 2017-11-24< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
2020-01-27 15:20:44 +01:00
< li > It’ s too early to tell for sure, but after I made the < code > random_page_cost< / code > change on DSpace Test’ s PostgreSQL yesterday the number of connections dropped drastically:< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-11-28 16:30:45 +01:00
< p > < img src = "/cgspace-notes/2017/11/postgres-connections-week.png" alt = "PostgreSQL connections after tweak (week)" > < / p >
2018-02-11 17:28:23 +01:00
< ul >
< li > There have been other temporary drops before, but if I look at the past month and actually the whole year, the trend is that connections are four or five times higher on average:< / li >
< / ul >
2019-11-28 16:30:45 +01:00
< p > < img src = "/cgspace-notes/2017/11/postgres-connections-month.png" alt = "PostgreSQL connections after tweak (month)" > < / p >
2018-02-11 17:28:23 +01:00
< ul >
2020-01-27 15:20:44 +01:00
< li > I just realized that we’ re not logging access requests to other vhosts on CGSpace, so it’ s possible I have no idea that we’ re getting slammed at 4AM on another domain that we’ re just silently redirecting to cgspace.cgiar.org< / li >
< li > I’ ve enabled logging on the CGIAR Library on CGSpace so I can check to see if there are many requests there< / li >
2018-02-11 17:28:23 +01:00
< li > In just a few seconds I already see a dozen requests from Googlebot (of course they get HTTP 301 redirects to cgspace.cgiar.org)< / li >
2019-11-28 16:30:45 +01:00
< li > I also noticed that CGNET appears to be monitoring the old domain every few minutes:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > 192.156.137.184 - - [24/Nov/2017:20:33:58 +0000] " HEAD / HTTP/1.1" 301 0 " -" " curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
< li > I should probably tell CGIAR people to have CGNET stop that< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-26" > 2017-11-26< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Linode alerted that CGSpace server was using too much CPU from 5:18 to 7:18 AM< / li >
2019-11-28 16:30:45 +01:00
< li > Yet another mystery because the load for all domains looks fine at that time:< / li >
2019-05-05 15:45:12 +02:00
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E " 26/Nov/2017:0[567]" | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
190 66.249.66.83
195 104.196.152.243
220 40.77.167.82
246 207.46.13.137
247 68.180.229.254
257 157.55.39.214
289 66.249.66.91
298 157.55.39.206
379 66.249.66.70
1855 66.249.66.90
2019-12-17 13:49:24 +01:00
< / code > < / pre > < h2 id = "2017-11-29" > 2017-11-29< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Linode alerted that CGSpace was using 279% CPU from 6 to 8 AM this morning< / li >
< li > About an hour later Uptime Robot said that the server was down< / li >
2019-11-28 16:30:45 +01:00
< li > Here are all the top XMLUI and REST users from today:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > # cat /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/access.log /var/log/nginx/access.log.1 /var/log/nginx/library-access.log /var/log/nginx/library-access.log.1 | grep -E " 29/Nov/2017" | awk ' {print $1}' | sort -n | uniq -c | sort -h | tail
2019-11-28 16:30:45 +01:00
540 66.249.66.83
659 40.77.167.36
663 157.55.39.214
681 157.55.39.206
733 157.55.39.158
850 66.249.66.70
1311 66.249.66.90
1340 104.196.152.243
4008 70.32.83.92
6053 45.5.184.196
< / code > < / pre > < ul >
< li > PostgreSQL activity shows 69 connections< / li >
2020-01-27 15:20:44 +01:00
< li > I don’ t have time to troubleshoot more as I’ m in Nairobi working on the HPC so I just restarted Tomcat for now< / li >
2019-11-28 16:30:45 +01:00
< li > A few hours later Uptime Robot says the server is down again< / li >
2020-01-27 15:20:44 +01:00
< li > I don’ t see much activity in the logs but there are 87 PostgreSQL connections< / li >
2019-11-28 16:30:45 +01:00
< li > But shit, there were 10,000 unique Tomcat sessions today:< / li >
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > $ cat dspace.log.2017-11-29 | grep -o -E ' session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
10037
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > Although maybe that’ s not much, as the previous two days had more:< / li >
2019-11-28 16:30:45 +01:00
< / ul >
2022-03-04 13:30:06 +01:00
< pre tabindex = "0" > < code > $ cat dspace.log.2017-11-27 | grep -o -E ' session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
12377
2022-03-04 13:30:06 +01:00
$ cat dspace.log.2017-11-28 | grep -o -E ' session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
2018-02-11 17:28:23 +01:00
16984
2019-11-28 16:30:45 +01:00
< / code > < / pre > < ul >
2020-01-27 15:20:44 +01:00
< li > I think we just need start increasing the number of allowed PostgreSQL connections instead of fighting this, as it’ s the most common source of crashes we have< / li >
< li > I will bump DSpace’ s < code > db.maxconnections< / code > from 60 to 90, and PostgreSQL’ s < code > max_connections< / code > from 183 to 273 (which is using my loose formula of 90 * webapps + 3)< / li >
2019-11-28 16:30:45 +01:00
< li > I really need to figure out how to get DSpace to use a PostgreSQL connection pool< / li >
2018-02-11 17:28:23 +01:00
< / ul >
2019-12-17 13:49:24 +01:00
< h2 id = "2017-11-30" > 2017-11-30< / h2 >
2018-02-11 17:28:23 +01:00
< ul >
< li > Linode alerted about high CPU usage on CGSpace again around 6 to 8 AM< / li >
2020-01-27 15:20:44 +01:00
< li > Then Uptime Robot said CGSpace was down a few minutes later, but it resolved itself I think (or Tsega restarted Tomcat, I don’ t know)< / li >
2018-02-11 17:28:23 +01:00
< / ul >
< / article >
< / div > <!-- /.blog - main -->
< aside class = "col-sm-3 ml-auto blog-sidebar" >
< section class = "sidebar-module" >
< h4 > Recent Posts< / h4 >
< ol class = "list-unstyled" >
2022-08-01 15:36:13 +02:00
< li > < a href = "/cgspace-notes/2022-08/" > August, 2022< / a > < / li >
2022-07-04 08:25:14 +02:00
< li > < a href = "/cgspace-notes/2022-07/" > July, 2022< / a > < / li >
2022-06-06 08:45:43 +02:00
< li > < a href = "/cgspace-notes/2022-06/" > June, 2022< / a > < / li >
2022-05-04 10:09:45 +02:00
< li > < a href = "/cgspace-notes/2022-05/" > May, 2022< / a > < / li >
2022-04-27 08:58:45 +02:00
< li > < a href = "/cgspace-notes/2022-04/" > April, 2022< / a > < / li >
2022-03-01 15:48:40 +01:00
2018-02-11 17:28:23 +01:00
< / ol >
< / section >
< section class = "sidebar-module" >
< h4 > Links< / h4 >
< ol class = "list-unstyled" >
< li > < a href = "https://cgspace.cgiar.org" > CGSpace< / a > < / li >
< li > < a href = "https://dspacetest.cgiar.org" > DSpace Test< / a > < / li >
< li > < a href = "https://github.com/ilri/DSpace" > CGSpace @ GitHub< / a > < / li >
< / ol >
< / section >
< / aside >
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
2019-10-11 10:19:42 +02:00
< p dir = "auto" >
2018-02-11 17:28:23 +01:00
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< p >
< a href = "#" > Back to top< / a >
< / p >
< / footer >
< / body >
< / html >