CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

March, 2020

2020-03-02

$ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x

2020-03-03

  • Skype with Peter and Abenet to discuss the CG Core survey
    • We also discussed some other CGSpace issues

2020-03-04

  • Abenet asked me to add some new ILRI subjects to CGSpace
    • I updated the input-forms.xml in our 5_x-prod branch on GitHub
    • Abenet said we are changing HEALTH to HUMAN HEALTH so I need to fix those using my fix-metadata-values.py script:
$ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
  • But I have not run it on CGSpace yet because we want to ask Peter if he is sure about it…
  • Send a message to Macaroni Bros to ask them about their Drupal module and its readiness for DSpace 6 UUIDs

2020-03-05

Lucene/Solr 7.0 was the first version that successfully passed our tests using Java 9 and higher. You should avoid Java 9 or later for Lucene/Solr 6.x or earlier.

2020-03-08

  • I want to try to consolidate our yearly Solr statistics cores back into one statistics core using the solr-import-export-json tool
  • I will try it on DSpace test, doing one year at a time:
$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2010.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2010*</query></delete>"
$ ./run.sh -s http://localhost:8081/solr/statistics-2011 -a export -o /tmp/statistics-2011.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2011.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2011*</query></delete>"
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2012.json -k uid
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
  "response":{"numFound":3761989,"start":0,"docs":[]
$ curl -s 'http://localhost:8081/solr/statistics-2012/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
  "response":{"numFound":3761989,"start":0,"docs":[]
$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2012*</query></delete>"
  • I will do this for as many cores as I can (disk space limited) and then monitor the effect on the system and JVM memory usage
    • Exporting half years might work, using a filter query with months as a regular expression:
$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
  • Upgrade PostgreSQL from 9.6 to 10 on DSpace Test (linode19)
    • I’ve been running it for one month in my local environment, and others have reported on the dspace-tech mailing list that they are using 10 and 11
# apt install postgresql-10 postgresql-contrib-10
# systemctl stop tomcat7
# pg_ctlcluster 9.6 main stop
# tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6
# tar -cvzpf etc-postgresql-9.6.tar.gz /etc/postgresql/9.6
# pg_ctlcluster 10 main stop
# pg_dropcluster 10 main
# pg_upgradecluster 9.6 main
# pg_dropcluster 9.6 main
# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r

2020-03-09

  • Peter noticed that the Solr stats were not showing anything before 2020
    • I had to restart Tomcat three times before all cores loaded properly…

2020-03-10

  • Fix some logic issues in the nginx config
    • Use generic blocking of [Bb]ot and [Cc]rawl and [Ss]pider in the “badbots” rate limiting logic instead of trying to list them all one by one (bots should not be trying to index dynamic pages no matter what so we punish hard here)
    • We were not properly forwarding the remote IP address to Tomcat in all nginx location blocks, which led some locations to log a hit from 127.0.0.1 (because we need to explicitly add the global proxy params when setting other headers in location blocks)
    • Unfortunately this affected the REST API and there are a few hundred thousand requests from this user agent:
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)
  • It seems to only be a problem in the last week:
# zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}
/var/log/nginx/rest.log.1:0
/var/log/nginx/rest.log.2:0
/var/log/nginx/rest.log.3:0
/var/log/nginx/rest.log.4:3625
/var/log/nginx/rest.log.5:27458
/var/log/nginx/rest.log.6:0
/var/log/nginx/rest.log.7:0
/var/log/nginx/rest.log.8:0
/var/log/nginx/rest.log.9:0
  • In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on Digital Ocean
  • I will purge them from Solr statistics:
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'
  • Another user agent that seems to be a bot is:
Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
  • In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx’s logs I see it belongs to three IPs on Online.net in France:
# zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
  63090 163.172.68.99
 183428 163.172.70.248
 147608 163.172.71.24
  • It is making 10,000 to 40,000 requests to XMLUI per day…
# zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
/var/log/nginx/access.log.30.gz:18687
/var/log/nginx/access.log.31.gz:28936
/var/log/nginx/access.log.32.gz:36402
/var/log/nginx/access.log.33.gz:38886
/var/log/nginx/access.log.34.gz:30607
/var/log/nginx/access.log.35.gz:19040
/var/log/nginx/access.log.36.gz:10780
/var/log/nginx/access.log.37.gz:5808
/var/log/nginx/access.log.38.gz:3100
/var/log/nginx/access.log.39.gz:1485
/var/log/nginx/access.log.3.gz:2898
/var/log/nginx/access.log.40.gz:373
/var/log/nginx/access.log.41.gz:3909
/var/log/nginx/access.log.42.gz:4729
/var/log/nginx/access.log.43.gz:3906
  • I will purge those hits too!
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'
  • Shit, and something happened and a few thousand hits from user agents with “Bot” in their user agent got through
    • I need to re-run the check-bot-hits.sh script with the standard COUNTER-Robots list again, but add my own versions of a few because the script/Solr doesn’t support case-insensitive regular expressions:
$ ./check-spider-hits.sh -f /tmp/bots -d -p
(DEBUG) Using spiders pattern file: /tmp/bots
(DEBUG) Checking for hits from spider: Citoid
Purging 11 hits from Citoid in statistics
(DEBUG) Checking for hits from spider: ecointernet
Purging 375 hits from ecointernet in statistics
(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
Purging 1 hits from ^Pattern\/[0-9] in statistics
(DEBUG) Checking for hits from spider: sqlmap
(DEBUG) Checking for hits from spider: Typhoeus
Purging 6 hits from Typhoeus in statistics
(DEBUG) Checking for hits from spider: 7siters
(DEBUG) Checking for hits from spider: Apache-HttpClient
Purging 3178 hits from Apache-HttpClient in statistics

Total number of bot hits purged: 3571

$ ./check-spider-hits.sh -f /tmp/bots -d -p
(DEBUG) Using spiders pattern file: /tmp/bots
(DEBUG) Checking for hits from spider: [Bb]ot
Purging 8317 hits from [Bb]ot in statistics
(DEBUG) Checking for hits from spider: [Cc]rawl
Purging 1314 hits from [Cc]rawl in statistics
(DEBUG) Checking for hits from spider: [Ss]pider
Purging 62 hits from [Ss]pider in statistics
(DEBUG) Checking for hits from spider: Citoid
(DEBUG) Checking for hits from spider: ecointernet
(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
(DEBUG) Checking for hits from spider: sqlmap
(DEBUG) Checking for hits from spider: Typhoeus
(DEBUG) Checking for hits from spider: 7siters
(DEBUG) Checking for hits from spider: Apache-HttpClient

2020-03-11

  • Ask Michael Victor for permission to create a new Linode server for DSpace Test

2020-3-12

  • I’m working on the 170 IITA records on DSpace Test from January finally
    • It’s been two months since I last looked and I want to do a thorough check to make sure Bosede didn’t introduce any new issues, but I want to consolidate all the text languages for these records so it’s easier to check them in OpenRefine
    • First I got a list of IDs from csvcut and then I updated the text languages for only those records:
dspace=# SELECT DISTINCT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND resource_id in (111295,111294,111293,111292,111291,111290,111288,111286,111285,111284,111283,111282,111281,111280,111279,111278,111277,111276,111275,111274,111273,111272,111271,111270,111269,111268,111267,111266,111265,111264,111263,111262,111261,111260,111259,111258,111257,111256,111255,111254,111253,111252,111251,111250,111249,111248,111247,111246,111245,111244,111243,111242,111241,111240,111238,111237,111236,111235,111234,111233,111232,111231,111230,111229,111228,111227,111226,111225,111224,111223,111222,111221,111220,111219,111218,111217,111216,111215,111214,111213,111212,111211,111209,111208,111207,111206,111205,111204,111203,111202,111201,111200,111199,111198,111197,111196,111195,111194,111193,111192,111191,111190,111189,111188,111187,111186,111185,111184,111183,111182,111181,111180,111179,111178,111177,111176,111175,111174,111173,111172,111171,111170,111169,111168,111299,111298,111297,111296,111167,111166,111165,111164,111163,111162,111161,111160,111159,111158,111157,111156,111155,111154,111153,111152,111151,111150,111149,111148,111147,111146,111145,111144,111143,111142,111141,111140,111139,111138,111137,111136,111135,111134,111133,111132,111131,111129,111128,111127,111126,111125) GROUP BY text_lang ORDER BY count;
  • Then I exported the metadata from DSpace Test and imported it into OpenRefine
    • I corrected one invalid AGROVOC subject using my csv-metadata-quality script
  • I exported a new list of affiliations from the database, added line numbers with csvcut, and then validated them in OpenRefine using reconcile-csv:
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2020-03-12-affiliations.csv WITH CSV HEADER;`
dspace=# \q
$ csvcut -l -c 0 /tmp/2020-03-12-affiliations.csv | sed -e 's/^line_number/id/' -e 's/text_value/name/' > /tmp/affiliations.csv
$ lein run /tmp/affiliations.csv name id
  • I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: if(cell.recon.matched, cell.recon.match.name, value)
  • I mapped all 170 items to their appropriate collections based on type and uploaded them to CGSpace

2020-03-16

  • I’m looking at the CPU usage of CGSpace (linode18) over the past year and I see we rarely even go over two CPUs on average sustained usage:

linode18 CPU usage year

  • Also clearly visible is the effect of CPU steal in 2019-03

linode18 RAM usage year

linode18 JVM heap usage year

  • At max we have committed 10GB of RAM, the rest is used opportunistically by the filesystem cache, likely for Solr
    • There was a huge drop in 2019-07 when I changed the JVM settings
    • I think we should re-evaluate our deployment and perhaps target a different instance type and add block storage for assetstore (as we determined Linode’s block storage to be too slow for Solr)

2020-03-17

  • Update the PostgreSQL JDBC driver to version 42.2.11