CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

March, 2020

2020-03-02

$ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x

2020-03-03

  • Skype with Peter and Abenet to discuss the CG Core survey
    • We also discussed some other CGSpace issues

2020-03-04

  • Abenet asked me to add some new ILRI subjects to CGSpace
    • I updated the input-forms.xml in our 5_x-prod branch on GitHub
    • Abenet said we are changing HEALTH to HUMAN HEALTH so I need to fix those using my fix-metadata-values.py script:
$ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
  • But I have not run it on CGSpace yet because we want to ask Peter if he is sure about it…
  • Send a message to Macaroni Bros to ask them about their Drupal module and its readiness for DSpace 6 UUIDs
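For reference, a minimal sketch of how the corrections file for the fix-metadata-values.py command above could be put together. It assumes the CSV has one column named after the metadata field (matching -f) and one named correct (matching -t). That layout is my reading of the flags, not something confirmed here, and write_corrections_csv is a made-up helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a two-column corrections CSV.
# Assumption: the first column holds the existing value and the second the
# replacement, with headers matching the -f and -t arguments.
# $1 = output file, $2 = field name, $3 = old value, $4 = new value
write_corrections_csv() {
    printf '%s,correct\n' "$2" > "$1"
    printf '%s,%s\n' "$3" "$4" >> "$1"
}

# Example (not run here):
#   write_corrections_csv 2020-03-04-fix-1-ilri-subject.csv cg.subject.ilri 'HEALTH' 'HUMAN HEALTH'
```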

2020-03-05

Lucene/Solr 7.0 was the first version that successfully passed our tests using Java 9 and higher. You should avoid Java 9 or later for Lucene/Solr 6.x or earlier.

2020-03-08

  • I want to try to consolidate our yearly Solr statistics cores back into one statistics core using the solr-import-export-json tool
  • I will try it on DSpace Test, doing one year at a time:
$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2010.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2010*</query></delete>"
$ ./run.sh -s http://localhost:8081/solr/statistics-2011 -a export -o /tmp/statistics-2011.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2011.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2011*</query></delete>"
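$ ./run.sh -s http://localhost:8081/solr/statistics-2012 -a export -o /tmp/statistics-2012.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2012.json -k uid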
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
  "response":{"numFound":3761989,"start":0,"docs":[]
$ curl -s 'http://localhost:8081/solr/statistics-2012/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
  "response":{"numFound":3761989,"start":0,"docs":[]
$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2012*</query></delete>"
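The count comparison above can be wrapped in a small guard so that the delete only runs when both cores report the same number of documents. This is a rough sketch, not something I ran: num_found, count_year, and counts_match are made-up helper names, and the Solr URL is the same local one used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull the first numFound value out of a Solr JSON
# response read from stdin.
num_found() {
    grep -o '"numFound":[0-9]*' | head -n 1 | cut -d: -f2
}

# Hypothetical helper: number of documents for one year in one core.
# $1 = core name, $2 = year
count_year() {
    curl -s "http://localhost:8081/solr/$1/select?q=time:$2*&rows=0&wt=json" | num_found
}

# Succeed only when both counts are non-empty and identical.
counts_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example (not run here):
#   src=$(count_year statistics-2012 2012)
#   dst=$(count_year statistics 2012)
#   counts_match "$src" "$dst" && echo "safe to delete 2012 from statistics-2012"
```

Requiring a non-empty count means a failed curl request never looks like a match.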
  • I will do this for as many cores as I can (disk space limited) and then monitor the effect on the system and JVM memory usage
    • Exporting half years might work, using a filter query with months as a regular expression:
$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
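The first-half filter has an obvious counterpart for July to December. A small sketch that builds both, assuming the regular-expression syntax accepted by -f also supports grouping and alternation (I have not checked this against the tool). half_year_filter is just a name I picked.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a filter query for one half of a year.
# $1 = year, $2 = half (1 for January-June, 2 for July-December)
half_year_filter() {
    case "$2" in
        1) printf 'time:/%s-0[1-6].*/\n' "$1" ;;
        2) printf 'time:/%s-(0[7-9]|1[0-2]).*/\n' "$1" ;;
        *) return 1 ;;
    esac
}

# Example (not run here):
#   ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export \
#       -o /tmp/statistics-2014-2.json -k uid -f "$(half_year_filter 2014 2)"
```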
  • Upgrade PostgreSQL from 9.6 to 10 on DSpace Test (linode19)
    • I’ve been running it for one month in my local environment, and others have reported on the dspace-tech mailing list that they are using 10 and 11
# apt install postgresql-10 postgresql-contrib-10
# systemctl stop tomcat7
# pg_ctlcluster 9.6 main stop
# tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6
# tar -cvzpf etc-postgresql-9.6.tar.gz /etc/postgresql/9.6
# pg_ctlcluster 10 main stop
# pg_dropcluster 10 main
# pg_upgradecluster 9.6 main
# pg_dropcluster 9.6 main
# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r
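Before dropping the 9.6 cluster it seems sensible to confirm that the upgraded 10 cluster is actually online. A minimal sketch that parses pg_lsclusters output. cluster_online is a hypothetical helper, and it assumes the usual column order of version, cluster name, port, and status.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `pg_lsclusters -h` output (no header line) on
# stdin and succeed if the given cluster is listed with an online status.
# $1 = version, $2 = cluster name
cluster_online() {
    awk -v ver="$1" -v name="$2" '
        $1 == ver && $2 == name && $4 ~ /^online/ { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Example (not run here):
#   pg_lsclusters -h | cluster_online 10 main && echo "10/main is online"
```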

2020-03-09

  • Peter noticed that the Solr stats were not showing anything before 2020
    • I had to restart Tomcat three times before all cores loaded properly…

2020-03-10

  • Fix some logic issues in the nginx config
    • Use generic blocking of [Bb]ot and [Cc]rawl and [Ss]pider in the “badbots” rate limiting logic instead of trying to list them all one by one (bots should not be trying to index dynamic pages no matter what so we punish hard here)
    • We were not properly forwarding the remote IP address to Tomcat in all nginx location blocks, so some locations logged hits from 127.0.0.1 (a location block that sets its own headers does not inherit the global proxy params, so we need to include them again explicitly)
    • Unfortunately this affected the REST API and there are a few hundred thousand requests from this user agent:
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)
  • It seems to only be a problem in the last week:
# zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}
/var/log/nginx/rest.log.1:0
/var/log/nginx/rest.log.2:0
/var/log/nginx/rest.log.3:0
/var/log/nginx/rest.log.4:3625
/var/log/nginx/rest.log.5:27458
/var/log/nginx/rest.log.6:0
/var/log/nginx/rest.log.7:0
/var/log/nginx/rest.log.8:0
/var/log/nginx/rest.log.9:0
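To get a total across the rotated logs, the per-file counts from zgrep -c can be summed with awk. A small sketch. sum_counts is a hypothetical helper that reads file:count lines on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the last colon-separated field of each line,
# which is the count in `zgrep -c` output such as "rest.log.4:3625".
sum_counts() {
    awk -F: '{ total += $NF } END { print total + 0 }'
}

# Example (not run here):
#   zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9} | sum_counts
```

For the counts listed above this adds up to 31083 requests.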
  • In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on Digital Ocean
  • I will purge them from Solr statistics:
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'
  • Another user agent that seems to be a bot is:
Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
  • In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx’s logs I see it belongs to three IPs on Online.net in France:
# zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
  63090 163.172.68.99
 183428 163.172.70.248
 147608 163.172.71.24
  • It is making 10,000 to 40,000 requests to XMLUI per day…
# zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.*.gz
/var/log/nginx/access.log.30.gz:18687
/var/log/nginx/access.log.31.gz:28936
/var/log/nginx/access.log.32.gz:36402
/var/log/nginx/access.log.33.gz:38886
/var/log/nginx/access.log.34.gz:30607
/var/log/nginx/access.log.35.gz:19040
/var/log/nginx/access.log.36.gz:10780
/var/log/nginx/access.log.37.gz:5808
/var/log/nginx/access.log.38.gz:3100
/var/log/nginx/access.log.39.gz:1485
/var/log/nginx/access.log.3.gz:2898
/var/log/nginx/access.log.40.gz:373
/var/log/nginx/access.log.41.gz:3909
/var/log/nginx/access.log.42.gz:4729
/var/log/nginx/access.log.43.gz:3906
  • I will purge those hits too!
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'
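Since the same purge is repeated for each user agent, the delete payload could be generated by a helper. A sketch only: delete_payload is a hypothetical name, and it escapes the XML special characters &, <, and > but does not handle double quotes or backslashes inside the user agent, which would need extra escaping for the Solr query.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the XML delete-by-query body for one user agent.
# $1 = exact user agent string
delete_payload() {
    # Escape the characters that are special in XML text content
    escaped=$(printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')
    printf '<delete><query>userAgent:"%s"</query></delete>\n' "$escaped"
}

# Example (not run here):
#   curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" \
#       -H "Content-Type: text/xml" --data-binary "$(delete_payload 'Some Agent/1.0')"
```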
  • Shit, something also went wrong and a few thousand hits from user agents with “Bot” in their name got through
    • I need to re-run the check-spider-hits.sh script with the standard COUNTER-Robots list again, but add my own versions of a few patterns because the script/Solr doesn’t support case-insensitive regular expressions:
$ ./check-spider-hits.sh -f /tmp/bots -d -p
(DEBUG) Using spiders pattern file: /tmp/bots
(DEBUG) Checking for hits from spider: Citoid
Purging 11 hits from Citoid in statistics
(DEBUG) Checking for hits from spider: ecointernet
Purging 375 hits from ecointernet in statistics
(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
Purging 1 hits from ^Pattern\/[0-9] in statistics
(DEBUG) Checking for hits from spider: sqlmap
(DEBUG) Checking for hits from spider: Typhoeus
Purging 6 hits from Typhoeus in statistics
(DEBUG) Checking for hits from spider: 7siters
(DEBUG) Checking for hits from spider: Apache-HttpClient
Purging 3178 hits from Apache-HttpClient in statistics

Total number of bot hits purged: 3571

$ ./check-spider-hits.sh -f /tmp/bots -d -p
(DEBUG) Using spiders pattern file: /tmp/bots
(DEBUG) Checking for hits from spider: [Bb]ot
Purging 8317 hits from [Bb]ot in statistics
(DEBUG) Checking for hits from spider: [Cc]rawl
Purging 1314 hits from [Cc]rawl in statistics
(DEBUG) Checking for hits from spider: [Ss]pider
Purging 62 hits from [Ss]pider in statistics
(DEBUG) Checking for hits from spider: Citoid
(DEBUG) Checking for hits from spider: ecointernet
(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
(DEBUG) Checking for hits from spider: sqlmap
(DEBUG) Checking for hits from spider: Typhoeus
(DEBUG) Checking for hits from spider: 7siters
(DEBUG) Checking for hits from spider: Apache-HttpClient
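
The [Bb]ot style patterns are a workaround for the lack of case-insensitive matching: only the first letter is allowed in either case. A sketch of generating such a pattern from a plain word. case_variant is a hypothetical helper, and it only handles the first character, which is all the patterns above do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a word like "bot" into "[Bb]ot" so that a
# case-sensitive regex matches both capitalizations of the first letter.
# $1 = word
case_variant() {
    first=$(printf '%s' "$1" | cut -c1)
    rest=$(printf '%s' "$1" | cut -c2-)
    upper=$(printf '%s' "$first" | tr '[:lower:]' '[:upper:]')
    lower=$(printf '%s' "$first" | tr '[:upper:]' '[:lower:]')
    printf '[%s%s]%s\n' "$upper" "$lower" "$rest"
}

# Example (not run here): write the three generic patterns to a file
#   for word in bot crawl spider; do case_variant "$word"; done > /tmp/generic-bots
```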