CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

November, 2019

2019-11-04

  • Peter noticed that there were 5.2 million hits on CGSpace in 2019-10 according to the Atmire usage statistics

    • I looked in the nginx logs and saw 4.6 million requests in the access logs and 1.2 million in the API logs:

      # zcat --force /var/log/nginx/*access.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
      4671942
      # zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE "[0-9]{1,2}/Oct/2019"
      1277694
      
  • So 4.6 million from XMLUI and another 1.2 million from API requests
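
  • The two counts above run the same pipeline over different file globs. A minimal sketch of a reusable wrapper follows; the function name `count_month_hits` is hypothetical, and it assumes GNU `zcat` and `grep` as used in the commands above:

```shell
# count_month_hits MONTH FILE...   (hypothetical helper, not part of any tool)
# Prints how many log lines carry a date in MONTH, for example "Oct/2019".
count_month_hits() {
    month="$1"
    shift
    # --force lets zcat pass through files that are not compressed
    zcat --force "$@" | grep -cE "[0-9]{1,2}/${month}"
}

# Example calls, mirroring the commands above:
#   count_month_hits Oct/2019 /var/log/nginx/*access.log.*.gz
#   count_month_hits Oct/2019 /var/log/nginx/{rest,oai,statistics}.log.*.gz
```

    Adding the two results reported above gives 4671942 + 1277694 = 5949636 logged requests for October.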

  • Let’s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):

    # zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E "[0-9]{1,2}/Oct/2019"
    1183456 
    # zcat --force /var/log/nginx/rest.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E "/rest/bitstreams"
    106781
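
  • To put those two numbers in proportion, here is a small worked calculation. The helper name `percent_of` is hypothetical; the inputs are the two counts printed above:

```shell
# Counts taken from the two commands above
rest_total=1183456
rest_bitstreams=106781

# percent_of PART TOTAL   (hypothetical helper)
# Prints PART as a percentage of TOTAL with two decimal places.
percent_of() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * part / total }'
}

percent_of "$rest_bitstreams" "$rest_total"   # prints 9.02
```

    So roughly 9% of the REST API requests in October were for bitstreams.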
    
  • The types of requests in the access logs, found by lazily extracting the sixth field of each nginx log line:

    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | awk '{print $6}' | sed 's/"//' | sort | uniq -c | sort -n
      1 PUT
      8 PROPFIND
    283 OPTIONS
    30102 POST
    46581 HEAD
    4594967 GET
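
  • Taking the sixth whitespace-separated field works as long as the fields before the request contain no spaces. A slightly more defensive sketch splits on the double quotes around the request line instead; the function name `count_methods` is hypothetical, and it assumes the log lines follow nginx's combined format:

```shell
# count_methods   (hypothetical helper)
# Reads access log lines on stdin and prints a count per HTTP method,
# smallest count first.
count_methods() {
    # With '"' as the field separator, $2 is the quoted request line,
    # for example: GET /handle/10568/103 HTTP/1.1
    # The first word of that request line is the method.
    awk -F'"' '{ split($2, req, " "); print req[1] }' | sort | uniq -c | sort -n
}

# Example call, mirroring the command above:
#   zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | count_methods
```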
    
  • Two very active IPs are 34.224.4.16 and 34.234.204.152, which made over 360,000 requests in October:

    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep -c -E '(34\.224\.4\.16|34\.234\.204\.152)'
    365288
    
  • Their user agent is one I’ve never seen before:

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
    
  • Most of their requests seem to be for community or collection discover and browse results pages, like /handle/10568/103/discover:

    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -o -E "GET /(bitstream|discover|handle)" | sort | uniq -c
       6566 GET /bitstream
     351928 GET /handle
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c discover
    214209
    # zcat --force /var/log/nginx/*access.log.*.gz | grep -E "[0-9]{1,2}/Oct/2019" | grep Amazonbot | grep -E "GET /(bitstream|discover|handle)" | grep -c browse
    86874

  • As far as I can tell, none of their requests are counted in the Solr statistics:

    $ http --print b 'http://localhost:8081/solr/statistics/select?q=(ip%3A34.224.4.16+OR+ip%3A34.234.204.152)&rows=0&wt=json&indent=true'

  • Still, those requests are CPU intensive, so I will add their user agent to the "badbots" rate limiting in nginx to reduce the impact on server load

  • After deploying it I checked by setting my user agent to Amazonbot and making a few requests (which were denied with HTTP 503):

    $ http --print Hh 'https://dspacetest.cgiar.org/handle/10568/1/discover' User-Agent:"Amazonbot/0.1"