CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

September, 2019

2019-09-01

  • Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning
  • Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:
# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    440 17.58.101.255
    441 157.55.39.101
    485 207.46.13.43
    728 169.60.128.125
    730 207.46.13.108
    758 157.55.39.9
    808 66.160.140.179
    814 207.46.13.212
   2472 163.172.71.23
   6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E "01/Sep/2019:0" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     33 2a01:7e00::f03c:91ff:fe16:fcb
     57 3.83.192.124
     57 3.87.77.25
     57 54.82.1.8
    822 2a01:9cc0:47:1:1a:4:0:2
   1223 45.5.184.72
   1633 172.104.229.92
   5112 205.186.128.185
   7249 2a01:7e00::f03c:91ff:fe18:7396
   9124 45.5.186.2
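  • The same per-client tally can be sketched in Python, which may be handier when the counts need further processing. This is only a rough equivalent of the pipelines above: the function names `open_log` and `top_clients` are made up for illustration, and it assumes the client address is the first whitespace-separated field of each line (as the `awk '{print $1}'` step does).

```python
import gzip
import re
from collections import Counter
from pathlib import Path


def open_log(path):
    """Open a log as text, reading gzip or plain files much like `zcat --force`."""
    path = Path(path)
    with path.open("rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"  # gzip magic bytes
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return path.open("rt", encoding="utf-8", errors="replace")


def top_clients(paths, timestamp_pattern, n=10):
    """Return the n most frequent first fields among lines matching the pattern."""
    pattern = re.compile(timestamp_pattern)
    counts = Counter()
    for path in paths:
        with open_log(path) as fh:
            for line in fh:
                if pattern.search(line):
                    fields = line.split()
                    if fields:
                        counts[fields[0]] += 1
    # Note: sorted from most to least frequent, the reverse of `sort -n | tail`
    return counts.most_common(n)
```

  • Calling `top_clients(["/var/log/nginx/access.log", "/var/log/nginx/access.log.1"], r"01/Sep/2019:0")` would then correspond to the first command above, with the busiest client listed first rather than last.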

August, 2019

2019-08-03

  • Look at Bioversity’s latest migration CSV and now I see that Francesco has cleaned up the extra columns and the newline at the end of the file, but many of the column headers have an extra space in the name…

2019-08-04

  • Deploy ORCID identifier updates requested by Bioversity to CGSpace
  • Run system updates on CGSpace (linode18) and reboot it
    • Before updating it I checked Solr and verified that all statistics cores were loaded properly…
    • After rebooting, all statistics cores were loaded… wow, that’s lucky.
  • Run system updates on DSpace Test (linode19) and reboot it

July, 2019

2019-07-01

  • Create an “AfricaRice books and book chapters” collection on CGSpace for AfricaRice
  • Last month Sisay asked why the following “most popular” statistics link for a range of months in 2018 works for the CIAT community on DSpace Test, but not on CGSpace:
  • Abenet had another similar issue a few days ago when trying to find the stats for 2018 in the RTB community

May, 2019

2019-05-01

  • Help CCAFS with regenerating some item thumbnails after they uploaded new PDFs to some items on CGSpace
  • A user on the dspace-tech mailing list offered some suggestions for troubleshooting the problem with the inability to delete certain items
    • Apparently if the item is in the workflowitem table it is submitted to a workflow
    • And if it is in the workspaceitem table it is in the pre-submitted state
  • The item seems to be in a pre-submitted state, so I tried to delete it from there:
dspace=# DELETE FROM workspaceitem WHERE item_id=74648;
DELETE 1
  • But after this I tried to delete the item from the XMLUI and it is still present…
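  • The table-based rule from the mailing list can be sketched as a small check. This is an illustration only: it uses an in-memory SQLite database as a stand-in for DSpace's PostgreSQL, creates just the `item_id` column that the `DELETE` above relies on (the real tables have more columns), and the function name `submission_state` is hypothetical.

```python
import sqlite3


def submission_state(conn, item_id):
    """Classify an item by which submission table references it."""
    cur = conn.cursor()
    cur.execute("SELECT 1 FROM workflowitem WHERE item_id = ?", (item_id,))
    if cur.fetchone():
        return "in workflow"
    cur.execute("SELECT 1 FROM workspaceitem WHERE item_id = ?", (item_id,))
    if cur.fetchone():
        return "pre-submitted (workspace)"
    return "not in either table"


# Stand-in database (assumption): only the columns this sketch needs
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workspaceitem (item_id INTEGER)")
conn.execute("CREATE TABLE workflowitem (item_id INTEGER)")
conn.execute("INSERT INTO workspaceitem VALUES (74648)")

print(submission_state(conn, 74648))  # pre-submitted (workspace)
conn.execute("DELETE FROM workspaceitem WHERE item_id = ?", (74648,))
print(submission_state(conn, 74648))  # not in either table
```

  • Against the real database the same queries would go through a PostgreSQL driver such as psycopg2, which uses `%s` placeholders instead of `?`. The sketch only shows which table references the item; it does not explain why the item still appears in the XMLUI.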

April, 2019

2019-04-01

  • Meeting with AgroKnow to discuss CGSpace, ILRI data, AReS, GARDIAN, etc
    • They asked if we had plans to enable RDF support in CGSpace
  • There have been 4,400 more downloads of the CTA Spore publication from those strange Amazon IP addresses today
    • I suspected that some might not be successful, because the stats show fewer, but today they were all HTTP 200!
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
   4432 200
  • In the last two weeks there have been 47,000 downloads of this same exact PDF by these three IP addresses
  • Apply country and region corrections and deletions on DSpace Test and CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
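  • The status-code tally for the Spore PDF earlier in this entry could also be done in Python. A minimal sketch, assuming nginx's default "combined" log format, where the request path is the seventh whitespace-separated field and the status code the ninth (the same field as `awk '{print $9}'`). The function name `status_counts` is made up for illustration.

```python
from collections import Counter


def status_counts(lines, path_fragment, client_ips):
    """Tally HTTP status codes for requests to a path from specific clients."""
    wanted = set(client_ips)
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 9:
            continue  # not a complete combined-format line
        client, request_path, status = fields[0], fields[6], fields[8]
        if client in wanted and path_fragment in request_path:
            counts[status] += 1
    return counts
```

  • Unlike the `grep` pipeline, this matches client addresses exactly rather than by regular expression, and it looks for the file name only in the request path, not anywhere on the line (for example in the referrer).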

March, 2019

2019-03-01

  • I checked IITA’s 259 Feb 14 records from last month for duplicates using Atmire’s Duplicate Checker on a fresh snapshot of CGSpace on my local machine and everything looks good
  • I am now only waiting to hear from her about where the items should go, though I assume Journal Articles go to IITA Journal Articles collection, etc…
  • Looking at the other half of Udana’s WLE records from 2018-11
    • I finished the ones for Restoring Degraded Landscapes (RDL), but these are for Variability, Risks and Competing Uses (VRC)
    • I did the usual cleanups for whitespace, added regions where they made sense for certain countries, cleaned up the DOI link formats, added rights information based on the publications page for a few items
    • Most worryingly, there are encoding errors in the abstracts for eleven items, for example:
    • 68.15% � 9.45 instead of 68.15% ± 9.45
    • 2003�2013 instead of 2003–2013
  • I think I will need to ask Udana to re-copy and paste the abstracts with more care using Google Docs
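  • Encoding damage like this can be found mechanically, because the broken characters show up as the Unicode replacement character (U+FFFD). A minimal sketch, assuming the abstracts are available as (identifier, text) pairs. The function name `find_replacement_chars` and the input shape are assumptions, not part of any existing tooling.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character


def find_replacement_chars(records, context=10):
    """Yield (record_id, snippet) for every U+FFFD found in a record's text."""
    for record_id, text in records:
        start = text.find(REPLACEMENT)
        while start != -1:
            lo = max(0, start - context)
            # Include a few characters on either side to help locate the damage
            yield record_id, text[lo:start + 1 + context]
            start = text.find(REPLACEMENT, start + 1)
```

  • This only locates the damage. Once a character has been replaced with U+FFFD the original is lost, so the affected abstracts still need to be copied again from the source.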

February, 2019

2019-02-01

  • Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high, even though I increased the alert threshold last week from 250% to 275%. I might need to increase it again!
  • The top IPs before, during, and after this latest alert tonight were:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "01/Feb/2019:(17|18|19|20|21)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
    245 207.46.13.5
    332 54.70.40.11
    385 5.143.231.38
    405 207.46.13.173
    405 207.46.13.75
   1117 66.249.66.219
   1121 35.237.175.180
   1546 5.9.6.51
   2474 45.5.186.2
   5490 85.25.237.71
  • 85.25.237.71 is the “Linguee Bot” that I first saw last month
  • The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase
  • There were just over 3 million accesses in the nginx logs last month:
# time zcat --force /var/log/nginx/* | grep -cE "[0-9]{1,2}/Jan/2019"
3018243

real    0m19.873s
user    0m22.203s
sys     0m1.979s
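  • To see whether the web server logs show an increase over time, the one-month count above could be extended to a tally per month. A rough sketch, assuming each line carries a bracketed nginx timestamp such as `[01/Feb/2019:17:00:00 +0000]`. The function name `requests_per_month` is made up for illustration.

```python
import re
from collections import Counter

# Day, abbreviated month name, and year at the start of a bracketed timestamp
TIMESTAMP = re.compile(r"\[(\d{1,2})/([A-Z][a-z]{2})/(\d{4}):")


def requests_per_month(lines):
    """Count log lines per (year, month), keyed like '2019-Jan'."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            counts[f"{match.group(3)}-{match.group(2)}"] += 1
    return counts
```

  • The `grep -cE` above matches the date pattern anywhere on the line, while this sketch only looks at the bracketed timestamp, so the two counts could differ slightly if a date string appears in a URL or referrer.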

January, 2019

2019-01-02

  • Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning
  • I don’t see anything interesting in the web server logs around that time though:
# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "02/Jan/2019:0(1|2|3)" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
     92 40.77.167.4
     99 210.7.29.100
    120 38.126.157.45
    177 35.237.175.180
    177 40.77.167.32
    216 66.249.75.219
    225 18.203.76.93
    261 46.101.86.248
    357 207.46.13.1
    903 54.70.40.11

December, 2018

2018-12-01

  • Switch CGSpace (linode18) to use OpenJDK instead of Oracle JDK
  • I manually installed OpenJDK, then removed Oracle JDK, then re-ran the Ansible playbook to update all configuration files, etc
  • Then I ran all system updates and restarted the server

2018-12-02
