CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

September, 2019

2019-09-01

  • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
  • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:

    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     440 17.58.101.255
     441 157.55.39.101
     485 207.46.13.43
     728 169.60.128.125
     730 207.46.13.108
     758 157.55.39.9
     808 66.160.140.179
     814 207.46.13.212
    2472 163.172.71.23
    6092 3.94.211.189
    # zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
      33 2a01:7e00::f03c:91ff:fe16:fcb
      57 3.83.192.124
      57 3.87.77.25
      57 54.82.1.8
     822 2a01:9cc0:47:1:1a:4:0:2
    1223 45.5.184.72
    1633 172.104.229.92
    5112 205.186.128.185
    7249 2a01:7e00::f03c:91ff:fe18:7396
    9124 45.5.186.2
    
  • 3.94.211.189 is MauiBot, and most of its requests are to Discovery and get rate limited with HTTP 503
  • 163.172.71.23 is some IP on Online SAS in France and its user agent is:

    Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
    
  • It actually got mostly HTTP 200 responses:

    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | grep 163.172.71.23 | awk '{print $9}' | sort | uniq -c
    1775 200
     703 499
      72 503
    
  • And it was mostly requesting Discover pages:

    # zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | grep 163.172.71.23 | grep -o -E "(bitstream|discover|handle)" | sort | uniq -c 
    2350 discover
      71 handle
    
  • I’m not sure why the outbound traffic rate was so high…
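
Request counts alone may not explain outbound traffic, since a few large responses can outweigh many small ones. As a rough sketch for following up, the two shell functions below (hypothetical helpers, not part of CGSpace) read access log lines on stdin. `bytes_by_ip` sums the response size per client, and `status_by_ip` counts status codes for one exact client IP. Both assume the same whitespace-separated layout the commands above rely on (field 1 is the client IP, field 9 the status). They also assume field 10 is the response body size, as in nginx's default combined format, which should be checked against the actual `log_format`.

```shell
# Hypothetical helpers for nginx access logs; both read log lines on stdin.
# Assumed layout: field 1 = client IP, field 9 = HTTP status,
# field 10 = response body bytes (nginx default "combined" format).

# Print the top N clients by total response bytes (default N = 10).
bytes_by_ip() {
    awk '$10 ~ /^[0-9]+$/ { bytes[$1] += $10 }
         END { for (ip in bytes) printf "%.0f %s\n", bytes[ip], ip }' \
        | sort -rn \
        | head -n "${1:-10}"
}

# Count HTTP status codes for requests whose client IP is exactly $1.
status_by_ip() {
    awk -v ip="$1" '$1 == ip { print $9 }' | sort | uniq -c | sort -rn
}
```

For example, `zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | bytes_by_ip 10` would list the ten clients that received the most bytes in the same window. Matching on field 1 in `status_by_ip` avoids counting lines where the IP string merely appears elsewhere, such as in a referrer.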

2019-09-02

  • Follow up with Carol and Francesca from Bioversity as they were on holiday during mid-to-late August
    • I told them to check the temporary collection on DSpace Test where I uploaded the 1,427 items so they can see how it will look
    • Also, I told them to advise me about the strange file extensions (.7z, .zip, .lck)
    • Also, I reminded Abenet to check the metadata, as the institutional authors at least will need some modification

2019-09-10

  • Altmetric responded to say that they have fixed an issue with their badge code so now research outputs with multiple handles are showing badges!
  • Follow up with Bosede about the mixup with PDFs in the items uploaded in 2018-12 (aka Daniel1807)
    • These are the same ones that Peter noticed last week, and that Bosede and I had been discussing earlier this year but never sorted out
    • It looks like these items were uploaded by Sisay on 2018-12-19, so we can use the accession date as a filter to narrow it down to 230 items (of which only 104 have PDFs, according to the Daniel1807.xls input file)
  • Continue working on CG Core v2 migration, focusing on the crosswalk mappings
    • I think we can skip the MODS crosswalk for now because it is only used in AIP exports that are meant for non-DSpace systems
    • We should probably do the QDC crosswalk as well as those in xhtml-head-item.properties
    • Ouch, there is potentially a lot of work in the OAI metadata formats like DIM, METS, and QDC (see dspace/config/crosswalks/oai/*.xsl)
    • In general I think I should only modify the left side of the crosswalk mappings (i.e., where the metadata is coming from) so we maintain exactly the same output for search engines, etc.