CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

October, 2018

2018-10-01

  • Phil Thornton got an ORCID identifier so we need to add it to the list on CGSpace and tag his existing items
  • I created a GitHub issue (#389) to track this, because I’m super busy in Nairobi right now

2018-10-03

  • I see Moayad was busy collecting item views and downloads from CGSpace yesterday:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    933 40.77.167.90
    971 95.108.181.88
   1043 41.204.190.40
   1454 157.55.39.54
   1538 207.46.13.69
   1719 66.249.64.61
   2048 50.116.102.77
   4639 66.249.64.59
   4736 35.237.175.180
 150362 34.218.226.147
  • Of those, about 20% were HTTP 500 responses (!):
$ zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Oct/2018" | grep 34.218.226.147 | awk '{print $9}' | sort -n | uniq -c
 118927 200
  31435 500
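A quick sanity check on that “about 20%” figure, using the counts from the output above:

```python
# Response counts for 34.218.226.147 on 2018-10-02, taken from the output above
http_200 = 118927
http_500 = 31435

total = http_200 + http_500
error_rate = http_500 / total

print(f"{error_rate:.1%} of {total} requests were HTTP 500")
# → 20.9% of 150362 requests were HTTP 500
```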
  • I added Phil Thornton and Sonal Henson’s ORCID identifiers to the controlled vocabulary for cg.creator.orcid and then re-generated the names using my resolve-orcids.py script:
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml | sort | uniq > 2018-10-03-orcids.txt
$ ./resolve-orcids.py -i 2018-10-03-orcids.txt -o 2018-10-03-names.txt -d
  • I found a new corner case error that I need to check, where both the given and family names are deactivated:
Looking up the names associated with ORCID iD: 0000-0001-7930-5752
Given Names Deactivated Family Name Deactivated: 0000-0001-7930-5752
  • It appears to be Jim Lorenzen… I need to check that later!
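For deactivated ORCID records the public API returns literal placeholder strings instead of a real name, which would explain the log line above. A minimal sketch of how a resolver could reconstruct the display name from the `name` section of an ORCID `/person` response (the field handling here is my assumption, not resolve-orcids.py’s actual code):

```python
# Hypothetical helper: given the "name" section of an ORCID public API
# /person response, build a display name from given names + family name.
# Either field may be null, so fall back to an empty string.
def display_name(name):
    given = (name.get("given-names") or {}).get("value", "")
    family = (name.get("family-name") or {}).get("value", "")
    return f"{given} {family}".strip()

# Deactivated records return these literal placeholder strings,
# matching the log output above
deactivated = {
    "given-names": {"value": "Given Names Deactivated"},
    "family-name": {"value": "Family Name Deactivated"},
}
print(display_name(deactivated))
# → Given Names Deactivated Family Name Deactivated
```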
  • I merged the changes to the 5_x-prod branch (#390)
  • Linode sent another alert about CPU usage on CGSpace (linode18) this evening
  • It seems that Moayad is making quite a lot of requests today:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "03/Oct/2018" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
   1594 157.55.39.160
   1627 157.55.39.173
   1774 136.243.6.84
   4228 35.237.175.180
   4497 70.32.83.92
   4856 66.249.64.59
   7120 50.116.102.77
  12518 138.201.49.199
  87646 34.218.226.147
 111729 213.139.53.62
  • But in super positive news, he says they are using my new dspace-statistics-api and it’s MUCH faster than using Atmire CUA’s internal “restlet” API
  • I don’t recognize the 138.201.49.199 IP, but it is in Germany (Hetzner) and appears to be paginating over some browse pages and downloading bitstreams:
# grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /[a-z]+' | sort | uniq -c
   8324 GET /bitstream
   4193 GET /handle
  • Suspiciously, it’s only grabbing the CGIAR System Office community (handle prefix 10947):
# grep 138.201.49.199 /var/log/nginx/access.log | grep -o -E 'GET /handle/[0-9]{5}' | sort | uniq -c
      7 GET /handle/10568
   4186 GET /handle/10947
  • The user agent is suspicious too:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36
  • It’s clearly a bot and it’s not re-using its Tomcat session, so I will add its IP to the nginx bad bot list
  • I looked in Solr’s statistics core and these hits were actually all counted as isBot:false (of course)… hmmm
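The actual bad bot handling lives in our nginx configuration; a minimal sketch of the mechanism, with this IP as the example (the map variable name and placeholder user agent are illustrative, not the real config):

```nginx
# Sketch: treat requests from known bad IPs as bots by overriding the
# effective user agent, so they fall into the existing bot-handling logic
map $remote_addr $ua {
    default          $http_user_agent;
    138.201.49.199   'BadBot';
}
```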
  • I tagged all of Sonal and Phil’s items with their ORCID identifiers on CGSpace using my add-orcid-identifiers-csv.py script:
$ ./add-orcid-identifiers-csv.py -i 2018-10-03-add-orcids.csv -db dspace -u dspace -p 'fuuu'
  • Where 2018-10-03-add-orcids.csv contained:
dc.contributor.author,cg.creator.id
"Henson, Sonal P.",Sonal Henson: 0000-0002-2002-5462
"Henson, S.",Sonal Henson: 0000-0002-2002-5462
"Thornton, P.K.",Philip Thornton: 0000-0002-1854-0182
"Thornton, Philip K",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phil",Philip Thornton: 0000-0002-1854-0182
"Thornton, Philip K.",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip",Philip Thornton: 0000-0002-1854-0182
"Thornton, Phillip K.",Philip Thornton: 0000-0002-1854-0182
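The cg.creator.id values follow a “Name: ORCID iD” convention; a small sketch of splitting and validating them (a hypothetical helper for illustration, not the script’s actual code):

```python
import re

# Split a cg.creator.id value like "Sonal Henson: 0000-0002-2002-5462"
# into the display name and the ORCID iD, validating the iD format
def parse_creator_id(value):
    name, orcid = value.rsplit(": ", 1)
    if not re.fullmatch(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9X]{4}", orcid):
        raise ValueError(f"Invalid ORCID iD: {orcid}")
    return name, orcid

print(parse_creator_id("Philip Thornton: 0000-0002-1854-0182"))
# → ('Philip Thornton', '0000-0002-1854-0182')
```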

2018-10-04

  • Salem raised an issue that the dspace-statistics-api reports downloads for some items that have no bitstreams (like many limited access items)
  • Every item has at least a LICENSE bundle, and some have a THUMBNAIL bundle, but the indexing code is specifically checking for downloads from the ORIGINAL bundle
  • I see there are other bundles we might need to pay attention to: TEXT, LOGO-COLLECTION, LOGO-COMMUNITY, etc…
  • On a hunch I dropped the statistics table and re-indexed, and now those two items have no downloads
  • So it’s fixed, but I’m not sure why!
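Since the indexer only counts downloads from the ORIGINAL bundle, the relevant Solr filter looks something like the sketch below (the exact field names are my assumption based on DSpace’s statistics schema, not the dspace-statistics-api’s actual code):

```python
# Sketch of the kind of Solr query used to count bitstream downloads:
# type:0 is a bitstream hit in DSpace's statistics schema, and restricting
# to bundleName:ORIGINAL excludes LICENSE, THUMBNAIL, TEXT, and logo bundles
def download_query(owning_item=None):
    clauses = ["type:0", "-isBot:true", "statistics_type:view", "bundleName:ORIGINAL"]
    if owning_item:
        clauses.append(f"owningItem:{owning_item}")
    return " AND ".join(clauses)

print(download_query())
# → type:0 AND -isBot:true AND statistics_type:view AND bundleName:ORIGINAL
```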
  • Peter wants to know the number of API requests per month, which was about 250,000 in September (excluding statlet requests):
# zcat --force /var/log/nginx/{oai,rest}.log* | grep -E 'Sep/2018' | grep -c -v 'statlets'
251226