March, 2020
2020-03-02
- Update dspace-statistics-api for DSpace 6+ UUIDs
- Tag version 1.2.0 on GitHub
- Test migrating legacy Solr statistics to UUIDs with the as-of-yet unreleased SolrUpgradePre6xStatistics.java
- You need to download this into the DSpace 6.x source and compile it
$ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
$ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
2020-03-03
- Skype with Peter and Abenet to discuss the CG Core survey
- We also discussed some other CGSpace issues
2020-03-04
- Abenet asked me to add some new ILRI subjects to CGSpace
- I updated the input-forms.xml in our
5_x-prod
branch on GitHub - Abenet said we are changing
HEALTH
toHUMAN HEALTH
so I need to fix those using myfix-metadata-values.py
script:
- I updated the input-forms.xml in our
$ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
- But I have not run it on CGSpace yet because we want to ask Peter if he is sure about it…
- Send a message to Macaroni Bros to ask them about their Drupal module and its readiness for DSpace 6 UUIDs
2020-03-05
- I found a very interesting comment on the Solr 8.1 guide about Java compatibility:
Lucene/Solr 7.0 was the first version that successfully passed our tests using Java 9 and higher. You should avoid Java 9 or later for Lucene/Solr 6.x or earlier.
2020-03-08
- I want to try to consolidate our yearly Solr statistics cores back into one
statistics
core using the solr-import-export-json tool - I will try it on DSpace test, doing one year at a time:
$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2010.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2010*</query></delete>"
$ ./run.sh -s http://localhost:8081/solr/statistics-2011 -a export -o /tmp/statistics-2011.json -k uid
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2011.json -k uid
$ curl -s "http://localhost:8081/solr/statistics-2011/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2011*</query></delete>"
$ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2012.json -k uid
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
"response":{"numFound":3761989,"start":0,"docs":[]
$ curl -s 'http://localhost:8081/solr/statistics-2012/select?q=time:2012*&rows=0&wt=json&indent=true' | grep numFound
"response":{"numFound":3761989,"start":0,"docs":[]
$ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2012*</query></delete>"
- I will do this for as many cores as I can (disk space limited) and then monitor the effect on the system and JVM memory usage
- Exporting half years might work, using a filter query with months as a regular expression:
$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
- Upgrade PostgreSQL from 9.6 to 10 on DSpace Test (linode19)
- I’ve been running it for one month in my local environment, and others have reported on the dspace-tech mailing list that they are using 10 and 11
# apt install postgresql-10 postgresql-contrib-10
# systemctl stop tomcat7
# pg_ctlcluster 9.6 main stop
# tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6
# tar -cvzpf etc-postgresql-9.6.tar.gz /etc/postgresql/9.6
# pg_ctlcluster 10 main stop
# pg_dropcluster 10 main
# pg_upgradecluster 9.6 main
# pg_dropcluster 9.6 main
# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r
2020-03-09
- Peter noticed that the Solr stats were not showing anything before 2020
- I had to restart Tomcat three times before all cores loaded properly…
2020-03-10
- Fix some logic issues in the nginx config
- Use generic blocking of
[Bb]ot
and[Cc]rawl
and[Ss]pider
in the “badbots” rate limiting logic instead of trying to list them all one by one (bots should not be trying to index dynamic pages no matter what so we punish hard here) - We were not properly forwarding the remote IP address to Tomcat in all nginx location blocks, which led some locations to log a hit from 127.0.0.1 (because we need to explicitly add the global proxy params when setting other headers in location blocks)
- Unfortunately this affected the REST API and there are a few hundred thousand requests from this user agent:
- Use generic blocking of
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)
- It seems to only be a problem in the last week:
# zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}
/var/log/nginx/rest.log.1:0
/var/log/nginx/rest.log.2:0
/var/log/nginx/rest.log.3:0
/var/log/nginx/rest.log.4:3625
/var/log/nginx/rest.log.5:27458
/var/log/nginx/rest.log.6:0
/var/log/nginx/rest.log.7:0
/var/log/nginx/rest.log.8:0
/var/log/nginx/rest.log.9:0
- In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on Digital Ocean
- I will purge them from Solr statistics:
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'
- Another user agent that seems to be a bot is:
Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
- In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx’s logs I see it belongs to three IPs on Online.net in France:
# zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
63090 163.172.68.99
183428 163.172.70.248
147608 163.172.71.24
- It is making 10,000 to 40,000 requests to XMLUI per day…
# zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
/var/log/nginx/access.log.30.gz:18687
/var/log/nginx/access.log.31.gz:28936
/var/log/nginx/access.log.32.gz:36402
/var/log/nginx/access.log.33.gz:38886
/var/log/nginx/access.log.34.gz:30607
/var/log/nginx/access.log.35.gz:19040
/var/log/nginx/access.log.36.gz:10780
/var/log/nginx/access.log.37.gz:5808
/var/log/nginx/access.log.38.gz:3100
/var/log/nginx/access.log.39.gz:1485
/var/log/nginx/access.log.3.gz:2898
/var/log/nginx/access.log.40.gz:373
/var/log/nginx/access.log.41.gz:3909
/var/log/nginx/access.log.42.gz:4729
/var/log/nginx/access.log.43.gz:3906
- I will purge those hits too!
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'
- Shit, and something happened and a few thousand hits from user agents with “Bot” in their user agent got through
- I need to re-run the
check-bot-hits.sh
script with the standard COUNTER-Robots list again, but add my own versions of a few because the script/Solr doesn’t support case-insensitive regular expressions:
- I need to re-run the
$ ./check-spider-hits.sh -f /tmp/bots -d -p
(DEBUG) Using spiders pattern file: /tmp/bots
(DEBUG) Checking for hits from spider: Citoid
Purging 11 hits from Citoid in statistics
(DEBUG) Checking for hits from spider: ecointernet
Purging 375 hits from ecointernet in statistics
(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
Purging 1 hits from ^Pattern\/[0-9] in statistics
(DEBUG) Checking for hits from spider: sqlmap
(DEBUG) Checking for hits from spider: Typhoeus
Purging 6 hits from Typhoeus in statistics
(DEBUG) Checking for hits from spider: 7siters
(DEBUG) Checking for hits from spider: Apache-HttpClient
Purging 3178 hits from Apache-HttpClient in statistics
Total number of bot hits purged: 3571
$ ./check-spider-hits.sh -f /tmp/bots -d -p
(DEBUG) Using spiders pattern file: /tmp/bots
(DEBUG) Checking for hits from spider: [Bb]ot
Purging 8317 hits from [Bb]ot in statistics
(DEBUG) Checking for hits from spider: [Cc]rawl
Purging 1314 hits from [Cc]rawl in statistics
(DEBUG) Checking for hits from spider: [Ss]pider
Purging 62 hits from [Ss]pider in statistics
(DEBUG) Checking for hits from spider: Citoid
(DEBUG) Checking for hits from spider: ecointernet
(DEBUG) Checking for hits from spider: ^Pattern\/[0-9]
(DEBUG) Checking for hits from spider: sqlmap
(DEBUG) Checking for hits from spider: Typhoeus
(DEBUG) Checking for hits from spider: 7siters
(DEBUG) Checking for hits from spider: Apache-HttpClient
2020-03-11
- Ask Michael Victor for permission to create a new Linode server for DSpace Test
2020-3-12
- I’m working on the 170 IITA records on DSpace Test from January finally
- It’s been two months since I last looked and I want to do a thorough check to make sure Bosede didn’t introduce any new issues, but I want to consolidate all the text languages for these records so it’s easier to check them in OpenRefine
- First I got a list of IDs from
csvcut
and then I updated the text languages for only those records:
dspace=# SELECT DISTINCT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND resource_id in (111295,111294,111293,111292,111291,111290,111288,111286,111285,111284,111283,111282,111281,111280,111279,111278,111277,111276,111275,111274,111273,111272,111271,111270,111269,111268,111267,111266,111265,111264,111263,111262,111261,111260,111259,111258,111257,111256,111255,111254,111253,111252,111251,111250,111249,111248,111247,111246,111245,111244,111243,111242,111241,111240,111238,111237,111236,111235,111234,111233,111232,111231,111230,111229,111228,111227,111226,111225,111224,111223,111222,111221,111220,111219,111218,111217,111216,111215,111214,111213,111212,111211,111209,111208,111207,111206,111205,111204,111203,111202,111201,111200,111199,111198,111197,111196,111195,111194,111193,111192,111191,111190,111189,111188,111187,111186,111185,111184,111183,111182,111181,111180,111179,111178,111177,111176,111175,111174,111173,111172,111171,111170,111169,111168,111299,111298,111297,111296,111167,111166,111165,111164,111163,111162,111161,111160,111159,111158,111157,111156,111155,111154,111153,111152,111151,111150,111149,111148,111147,111146,111145,111144,111143,111142,111141,111140,111139,111138,111137,111136,111135,111134,111133,111132,111131,111129,111128,111127,111126,111125) GROUP BY text_lang ORDER BY count;
- Then I exported the metadata from DSpace Test and imported it into OpenRefine
- I corrected one invalid AGROVOC subject using my
csv-metadata-quality
script
- I corrected one invalid AGROVOC subject using my
- I exported a new list of affiliations from the database, added line numbers with
csvcut
, and then validated them in OpenRefine usingreconcile-csv
:
dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2020-03-12-affiliations.csv WITH CSV HEADER;`
dspace=# \q
$ csvcut -l -c 0 /tmp/2020-03-12-affiliations.csv | sed -e 's/^line_number/id/' -e 's/text_value/name/' > /tmp/affiliations.csv
$ lein run /tmp/affiliations.csv name id
- I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL:
if(cell.recon.matched, cell.recon.match.name, value)
- I mapped all 170 items to their appropriate collections based on type and uploaded them to CGSpace
2020-03-16
- I’m looking at the CPU usage of CGSpace (linode18) over the past year and I see we rarely even go over two CPUs on average sustained usage:
- Also clearly visible is the effect of CPU steal in 2019-03
- At max we have committed 10GB of RAM, the rest is used opportunistically by the filesystem cache, likely for Solr
- There was a huge drop in 2019-07 when I changed the JVM settings
- I think we should re-evaluate our deployment and perhaps target a different instance type and add block storage for assetstore (as we determined Linode’s block storage to be too slow for Solr)