diff --git a/content/post/2017-10.md b/content/post/2017-10.md
index 26d46d174..618add04c 100644
--- a/content/post/2017-10.md
+++ b/content/post/2017-10.md
@@ -224,3 +224,89 @@ http://library.cgiar.org/browse?value=Intellectual%20Assets%20Reports&type=subje
 - I will check the logs in a few days to see if they are harvesting us regularly, then add their bot's user agent to the Tomcat Crawler Session Valve
 - After browsing the CORE site it seems that the CGIAR Library is somehow a member of CORE, so they have probably only been harvesting CGSpace since we did the migration, as library.cgiar.org directs to us now
 - For now I will just contact them to have them update their contact info in the bot's user agent, but eventually I think I'll tell them to swap out the CGIAR Library entry for CGSpace
+
+## 2017-10-30
+
+- Like clockwork, Linode alerted about high CPU usage on CGSpace again this morning (this time at 8:13 AM)
+- Uptime Robot noticed that CGSpace went down around 10:15 AM, and I saw that there were 93 PostgreSQL connections:
+
+```
+dspace=# SELECT * FROM pg_stat_activity;
+...
+(93 rows)
+```
+
+- Surprise surprise, the CORE bot is likely responsible for the recent load issues, making more than 160,000 requests across yesterday and today:
+
+```
+# grep -c "CORE/0.6" /var/log/nginx/access.log
+26475
+# grep -c "CORE/0.6" /var/log/nginx/access.log.1
+135083
+```
+
+- IP addresses for this bot currently seem to be:
+
+```
+# grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
+137.108.70.6
+137.108.70.7
+```
+
+- I will add their user agent to the Tomcat Crawler Session Manager Valve, but it won't help much because they are only using two sessions:
+
+```
+# grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
+session_id=5771742CABA3D0780860B8DA81E0551B
+session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
+```
+
+- ... and most of their requests are for dynamic discover pages:
+
+```
+# grep -c 137.108.70 /var/log/nginx/access.log
+26622
+# grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
+24055
+```
+
+- Just because I'm curious who the top IPs are:
+
+```
+# awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
+    496 62.210.247.93
+    571 46.4.94.226
+    651 40.77.167.39
+    763 157.55.39.231
+    782 207.46.13.90
+    998 66.249.66.90
+   1948 104.196.152.243
+   4247 190.19.92.5
+  31602 137.108.70.6
+  31636 137.108.70.7
+```
+
+- At least we know the top two are CORE, but who are the others?
+- 190.19.92.5 is apparently in Argentina, and 104.196.152.243 is from Google Cloud
+- Actually, these two scrapers might be more responsible for the heavy load than the CORE bot, because they don't reuse their session ID, creating thousands of new sessions!
+
+```
+# grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
+1419
+# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
+2811
+```
+
+- From looking at the requests, it appears these are from CIAT and CCAFS
+- I wonder if I could somehow instruct them to send a distinctive user agent so that we could apply the Crawler Session Manager Valve to them
+- Actually, according to the Tomcat docs, we could match them by IP with `crawlerIps`: https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Crawler_Session_Manager_Valve
+- For now, it seems the CORE bot coming from 137.108.70.6 and 137.108.70.7 is only using a few sessions per day, which is good:
+
+```
+# grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h
+    410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
+    574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
+   1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
+```
+
+- I will check again tomorrow
diff --git a/public/2017-10/index.html b/public/2017-10/index.html
index df48a36bb..9df71442e 100644
--- a/public/2017-10/index.html
+++ b/public/2017-10/index.html
@@ -66,7 +66,7 @@ Add Katherine Lutz to the groups for content sumission and edit steps of the CGI
   "@type": "BlogPosting",
   "headline": "October, 2017",
   "url": "https://alanorth.github.io/cgspace-notes/2017-10/",
-  "wordCount": "1851",
+  "wordCount": "2261",
   "datePublished": "2017-10-01T08:07:54+03:00",
   "dateModified": "2017-10-29T10:02:34+02:00",
   "author": {
@@ -395,6 +395,102 @@ Add Katherine Lutz to the groups for content sumission and edit steps of the CGI
  • For now I will just contact them to have them update their contact info in the bot’s user agent, but eventually I think I’ll tell them to swap out the CGIAR Library entry for CGSpace
  • +

    2017-10-30

    dspace=# SELECT * FROM pg_stat_activity;
    +...
    +(93 rows)
    +
    # grep -c "CORE/0.6" /var/log/nginx/access.log 
    +26475
    +# grep -c "CORE/0.6" /var/log/nginx/access.log.1
    +135083
    +
    # grep "CORE/0.6" /var/log/nginx/access.log | awk '{print $1}' | sort -n | uniq
    +137.108.70.6
    +137.108.70.7
    +
    # grep 137.108.70 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq
    +session_id=5771742CABA3D0780860B8DA81E0551B
    +session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A
    +
    # grep -c 137.108.70 /var/log/nginx/access.log
    +26622
    +# grep 137.108.70 /var/log/nginx/access.log | grep -c "GET /discover"
    +24055
    +
    # awk '{print $1}' /var/log/nginx/access.log | sort -n | uniq -c | sort -h | tail
    +    496 62.210.247.93
    +    571 46.4.94.226
    +    651 40.77.167.39
    +    763 157.55.39.231
    +    782 207.46.13.90
    +    998 66.249.66.90
    +   1948 104.196.152.243
    +   4247 190.19.92.5
    +  31602 137.108.70.6
    +  31636 137.108.70.7
    +
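The `awk | sort | uniq -c | sort -h | tail` pipeline above can also be sketched in Python, which is handy if the tally needs to feed into further analysis. This is only a minimal sketch: it assumes the client IP is the first whitespace-separated field of each nginx access log line (as the `awk '{print $1}'` step does), and the function name `top_client_ips` and the sample log lines are made up for illustration.

```python
from collections import Counter


def top_client_ips(lines, n=10):
    """Count requests per client IP, assuming the IP is the first field of each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    # most_common() returns (ip, count) pairs, highest count first
    return counts.most_common(n)


# Hypothetical sample lines in a combined-log-like shape (not real log data)
sample = [
    '137.108.70.7 - - [30/Oct/2017:08:00:01 +0000] "GET /discover HTTP/1.1" 200 1234',
    '190.19.92.5 - - [30/Oct/2017:08:00:02 +0000] "GET /handle/1 HTTP/1.1" 200 1234',
    '137.108.70.7 - - [30/Oct/2017:08:00:03 +0000] "GET /discover HTTP/1.1" 200 1234',
    '66.249.66.90 - - [30/Oct/2017:08:00:04 +0000] "GET / HTTP/1.1" 200 1234',
    '190.19.92.5 - - [30/Oct/2017:08:00:05 +0000] "GET /handle/2 HTTP/1.1" 200 1234',
    '137.108.70.7 - - [30/Oct/2017:08:00:06 +0000] "GET /discover HTTP/1.1" 200 1234',
]

top = top_client_ips(sample, n=2)
print(top)
```

On a real server the `lines` argument could simply be an open file handle on the access log, since the function only iterates over it.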
    # grep 190.19.92.5 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +1419
    +# grep 104.196.152.243 dspace.log.2017-10-30 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort -n | uniq | wc -l
    +2811
    +
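Counting distinct sessions one IP at a time gets tedious with `grep`, so here is a rough sketch that does it for every IP in one pass. It assumes the `session_id=<32 chars>:ip_addr=<IPv4>` token layout that the `grep -o -E` patterns in these notes match against in `dspace.log`; the function name `sessions_per_ip` and the sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Same session ID shape as the grep pattern, followed by the client IP
SESSION_RE = re.compile(
    r"session_id=([A-Z0-9]{32}):ip_addr=(\d{1,3}(?:\.\d{1,3}){3})"
)


def sessions_per_ip(lines):
    """Return {ip: number of distinct session IDs seen for that ip}."""
    sessions = defaultdict(set)
    for line in lines:
        match = SESSION_RE.search(line)
        if match:
            session_id, ip = match.groups()
            sessions[ip].add(session_id)
    return {ip: len(ids) for ip, ids in sessions.items()}


# Hypothetical log lines; the session IDs are placeholders, not real ones
sample = [
    f"2017-10-30 08:00:01 INFO x @ anonymous:session_id={'A' * 32}:ip_addr=190.19.92.5:view",
    f"2017-10-30 08:00:02 INFO x @ anonymous:session_id={'B' * 32}:ip_addr=190.19.92.5:view",
    f"2017-10-30 08:00:03 INFO x @ anonymous:session_id={'C' * 32}:ip_addr=137.108.70.6:view",
    f"2017-10-30 08:00:04 INFO x @ anonymous:session_id={'C' * 32}:ip_addr=137.108.70.6:view",
    "2017-10-30 08:00:05 INFO a line without any session information",
]

result = sessions_per_ip(sample)
print(result)
```

An IP with many requests but few sessions (like the CORE bot) is reusing its session, while a count close to the request count points at a client that opens a new session on every hit.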
    # grep -o -E 'session_id=[A-Z0-9]{32}:ip_addr=137.108.70.(6|7)' dspace.log.2017-10-30 | sort -n | uniq -c | sort -h 
    +    410 session_id=74F0C3A133DBF1132E7EC30A7E7E0D60:ip_addr=137.108.70.7
    +    574 session_id=5771742CABA3D0780860B8DA81E0551B:ip_addr=137.108.70.7
    +   1012 session_id=6C30F10B4351A4ED83EC6ED50AFD6B6A:ip_addr=137.108.70.6
    +
    + + +
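Before putting a regular expression into the valve's `crawlerIps` attribute, it may be worth sanity-checking it against the addresses seen in these notes. The sketch below uses Python's `re.fullmatch` as a stand-in: Tomcat compiles the attribute with Java's regex engine and, as far as I know, requires the whole address to match, so treat this as an approximation and verify against the Tomcat docs. The pattern itself and the name `is_crawler_ip` are assumptions for illustration, not a configuration taken from the notes.

```python
import re

# Hypothetical candidate for crawlerIps: the two CORE addresses plus the
# two scrapers that were creating thousands of sessions
CANDIDATE_CRAWLER_IPS = r"137\.108\.70\.(6|7)|190\.19\.92\.5|104\.196\.152\.243"


def is_crawler_ip(ip, pattern=CANDIDATE_CRAWLER_IPS):
    """True if the whole IP string matches the candidate pattern."""
    return re.fullmatch(pattern, ip) is not None


for ip in ["137.108.70.6", "137.108.70.7", "190.19.92.5", "137.108.70.60", "66.249.66.90"]:
    print(ip, is_crawler_ip(ip))
```

Escaping the dots matters here: an unescaped `.` matches any character, so a looser pattern could accidentally catch unrelated addresses.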