CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

February, 2020

2020-02-02

  • Continue working on porting CGSpace’s DSpace 5 code to DSpace 6.3 that I started yesterday
    • Sign up for an account with MaxMind so I can get the GeoLite2-City.mmdb database
    • I still need to wire up the API credentials and cron job into the Ansible infrastructure playbooks
    • Fix some minor issues in the config and XMLUI themes, like removing Atmire stuff
    • The code finally builds and runs with a fresh install
  • Now we don’t specify the build environment because site modification are in local.cfg, so we just build like this:
$ schedtool -D -e ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
  • And it seems that we need to enable pgcrypto now (used for UUIDs):
$ psql -h localhost -U postgres dspace63
dspace63=# CREATE EXTENSION pgcrypto;
CREATE EXTENSION pgcrypto;
  • I tried importing a PostgreSQL snapshot from CGSpace and had errors due to missing Atmire database migrations
    • If I try to run dspace database migrate I get the IDs of the migrations that are missing
    • I delete them manually in psql:
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
  • Then I ran dspace database migrate and got an error:
$ ~/dspace63/bin/dspace database migrate

Database URL: jdbc:postgresql://localhost:5432/dspace63?ApplicationName=dspaceCli
Migrating database to latest version... (Check dspace logs for details)
Migration exception:
java.sql.SQLException: Flyway migration error occurred
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:673)
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:576)
        at org.dspace.storage.rdbms.DatabaseUtils.main(DatabaseUtils.java:221)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
Caused by: org.flywaydb.core.internal.dbsupport.FlywaySqlScriptException:
Migration V6.0_2015.03.07__DS-2701_Hibernate_migration.sql failed
-----------------------------------------------------------------
SQL State  : 2BP01
Error Code : 0
Message    : ERROR: cannot drop table metadatavalue column resource_id because other objects depend on it
  Detail: view eperson_metadata depends on table metadatavalue column resource_id
  Hint: Use DROP ... CASCADE to drop the dependent objects too.
Location   : org/dspace/storage/rdbms/sqlmigration/postgres/V6.0_2015.03.07__DS-2701_Hibernate_migration.sql (/home/aorth/src/git/DSpace-6.3/file:/home/aorth/dspace63/lib/dspace-api-6.3.jar!/org/dspace/storage/rdbms/sqlmigration/postgres/V6.0_2015.03.07__DS-2701_Hibernate_migration.sql)
Line       : 391
Statement  : ALTER TABLE metadatavalue DROP COLUMN IF EXISTS resource_id

        at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:117)
        at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.execute(SqlMigrationExecutor.java:71)
        at org.flywaydb.core.internal.command.DbMigrate.doMigrate(DbMigrate.java:352)
        at org.flywaydb.core.internal.command.DbMigrate.access$1100(DbMigrate.java:47)
        at org.flywaydb.core.internal.command.DbMigrate$4.doInTransaction(DbMigrate.java:308)
        at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
        at org.flywaydb.core.internal.command.DbMigrate.applyMigration(DbMigrate.java:305)
        at org.flywaydb.core.internal.command.DbMigrate.access$1000(DbMigrate.java:47)
        at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:230)
        at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:173)
        at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
        at org.flywaydb.core.internal.command.DbMigrate.migrate(DbMigrate.java:173)
        at org.flywaydb.core.Flyway$1.execute(Flyway.java:959)
        at org.flywaydb.core.Flyway$1.execute(Flyway.java:917)
        at org.flywaydb.core.Flyway.execute(Flyway.java:1373)
        at org.flywaydb.core.Flyway.migrate(Flyway.java:917)
        at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:662)
        ... 8 more
Caused by: org.postgresql.util.PSQLException: ERROR: cannot drop table metadatavalue column resource_id because other objects depend on it
  Detail: view eperson_metadata depends on table metadatavalue column resource_id
  Hint: Use DROP ... CASCADE to drop the dependent objects too.
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2422)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2167)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:306)
        at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
        at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:307)
        at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:293)
        at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:270)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:266)
        at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
        at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
        at org.flywaydb.core.internal.dbsupport.JdbcTemplate.executeStatement(JdbcTemplate.java:238)
        at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:114)
        ... 24 more
  • I think I might need to update the sequences first… nope
  • Perhaps it’s due to some missing bitstream IDs and I need to run dspace cleanup on CGSpace and take a new PostgreSQL dump… nope
  • A thread on the dspace-tech mailing list regarding this migration noticed that his database had some views created that were using the resource_id column
  • Our database had the same issue, where the eperson_metadata view was created by something (Atmire module?) but has no references in the vanilla DSpace code, so I dropped it and tried the migration again:
dspace63=# DROP VIEW eperson_metadata;
DROP VIEW
  • After that the migration was successful and DSpace starts up successfully and begins indexing
    • xmlui, solr, jspui, rest, and oai are working (rest was redirecting to HTTPS, so I set the Tomcat connector to secure="true" and it fixed it on localhost, but caused other issues so I disabled it for now)
    • I started diffing our themes against the Mirage 2 reference theme to capture the latest changes

2020-02-03

  • Update DSpace mimetype fallback images from KDE Breeze Icons project
  • Issues remaining in the DSpace 6 port of our CGSpace 5.x code:
    • Community and collection pages only show one recent submission (seems that there is only one item in Solr?)
    • Community and collection pages have tons of “Browse” buttons that we need to remove
    • Order of navigation elements in right side bar (“My Account” etc, compare to DSpace Test)
    • Home page trail says “CGSpace Home” instead of “CGSpace Home / Community List” (see DSpace Test)
  • There are lots of errors in the DSpace log, which might explain some of the issues with recent submissions / Solr:
2020-02-03 10:27:14,485 ERROR org.dspace.browse.ItemCountDAOSolr @ caught exception: 
org.dspace.discovery.SearchServiceException: Invalid UUID string: 1
2020-02-03 13:20:20,475 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
  • If I look in Solr’s search core I do actually see items with integers for their resource ID, which I think are all supposed to be UUIDs now…
  • I dropped all the documents in the search core:
$ http --print b 'http://localhost:8080/solr/search/update?stream.body=<delete><query>*:*</query></delete>&commit=true'
  • Still didn’t work, so I’m going to try a clean database import and migration:
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost dspace_2020-01-27.backup
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspace63                               
dspace63=# CREATE EXTENSION pgcrypto;
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
dspace63=# DROP VIEW eperson_metadata;
dspace63=# \q
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
$ ~/dspace63/bin/dspace database migrate
  • I notice that the indexing doesn’t work correctly if I start it manually with dspace index-discovery -b (search.resourceid becomes an integer!)
    • If I induce an indexing by touching dspace/solr/search/conf/reindex.flag the search.resourceid are all UUIDs…
  • Speaking of database stuff, there was a performance-related update for the indexes that we used in DSpace 5
    • We might want to apply it in DSpace 6, as it was never merged to 6.x, but it helped with the performance of /submissions in XMLUI for us in 2018-03

2020-02-04

  • The indexing issue I was having yesterday seems to only present itself the first time a new installation is running DSpace 6
    • Once the indexing induced by touching dspace/solr/search/conf/reindex.flag has finished, subsequent manual invocations of dspace index-discovery -b work as expected
    • Nevertheless, I sent a message to the dspace-tech mailing list describing the issue to see if anyone has any comments
  • I am seeing that the number of important commits on the unreleased DSpace 6.4 are really numerous and it might be better for us to target that version
    • I did a simple test and it’s easy to rebase my current 6.3 branch on top of the upstream dspace-6_x branch:
$ git checkout -b 6_x-dev64 6_x-dev
$ git rebase -i upstream/dspace-6_x
  • I finally understand why our themes show all the “Browse by” buttons on community and collection pages in DSpace 6.x
    • The code in ./dspace-xmlui/src/main/java/org/dspace/app/xmlui/aspect/browseArtifacts/CommunityBrowse.java iterates over all the browse indexes and prints them when it is called
    • The XMLUI theme code in dspace/modules/xmlui-mirage2/src/main/webapp/themes/0_CGIAR/xsl/preprocess/browse.xsl calls the template because the id of the div matches “aspect.browseArtifacts.CommunityBrowse.list.community-browse”
    • I checked the DRI of a community page on my local 6.x and DSpace Test 5.x by appending ?XML to the URL and I see the ID is missing on DSpace 5.x
    • The issue is the same with the ordering of the “My Account” link, but in Navigation.java
    • I tried modifying preprocess/browse.xsl but it always ends up printing some default list of browse by links…
    • I’m starting to wonder if Atmire’s modules somehow override this, as I don’t see how CommunityBrowse.java can behave like ours on DSpace 5.x unless they have overridden it (as the open source code is the same in 5.x and 6.x)
    • At least the “account” link in the sidebar is overridden in our 5.x branch because Atmire copied a modified Navigation.java to the local xmlui modules folder… so that explains that (and it’s easy to replicate in 6.x)

2020-02-05

  • UptimeRobot told me that AReS Explorer crashed last night, so I logged into it, ran all updates, and rebooted it
  • Testing Discovery indexing speed on my local DSpace 6.3:
$ time schedtool -D -e ~/dspace63/bin/dspace index-discovery -b
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  3771.78s user 93.63s system 41% cpu 2:34:19.53 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  3360.28s user 82.63s system 38% cpu 2:30:22.07 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  4678.72s user 138.87s system 42% cpu 3:08:35.72 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  3334.19s user 86.54s system 35% cpu 2:41:56.73 total
  • DSpace 5.8 was taking about 1 hour (or less on this laptop), so this is 2-3 times longer!
$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  299.53s user 69.67s system 20% cpu 30:34.47 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  270.31s user 69.88s system 19% cpu 29:01.38 total
  • Checking out the DSpace 6.x REST API query client
    • There is a tutorial that explains how it works and I see it is very powerful because you can export a CSV of results in order to fix and re-upload them with batch import!
    • Custom queries can be added in dspace-rest/src/main/webapp/static/reports/restQueryReport.js
  • I noticed two new bots in the logs with the following user agents:
    • Jersey/2.6 (HttpUrlConnection 1.8.0_152)
    • magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)
  • I filed an issue to add Jersey to the COUNTER-Robots list
  • Peter noticed that the statlets on community, collection, and item pages aren’t working on CGSpace
    • I thought it might be related to the fact that the yearly sharding didn’t complete successfully this year so the statistics-2019 core is empty
    • I removed the statistics-2019 core and had to restart Tomcat like six times before all cores would load properly (ugh!!!!)
    • After that the statlets were working properly…
  • Run all system updates on DSpace Test (linode19) and restart it

2020-02-06

  • I sent a mail to the dspace-tech mailing list asking about slow Discovery indexing speed in DSpace 6
  • I destroyed my PostgreSQL 9.6 containers and re-created them using PostgreSQL 10 to see if there are any speedups with DSpace 6.x:
$ podman pull postgres:10-alpine
$ podman run --name dspacedb10 -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:10-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspacetest -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost ~/Downloads/cgspace_2020-02-06.backup
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest
$ psql -h localhost -U postgres -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
$ psql -h localhost -U postgres dspace63                               
dspace63=# CREATE EXTENSION pgcrypto;
dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
dspace63=# DROP VIEW eperson_metadata;
dspace63=# \q
  • I purged ~33,000 hits from the “Jersey/2.6” bot in CGSpace’s statistics using my check-spider-hits.sh script:
$ ./check-spider-hits.sh -d -p -f /tmp/jersey -s statistics -u http://localhost:8081/solr
$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s "statistics-${year}" -u http://localhost:8081/solr; done
  • I noticed another user agen in the logs that we should add to the list:
ReactorNetty/0.9.2.RELEASE
$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
$ ls -lh /tmp/statistics-2019-01.json
-rw-rw-r-- 1 aorth aorth 3.7G Feb  6 09:26 /tmp/statistics-2019-01.json
  • Then I tested importing this by creating a new core in my development environment:
$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace/solr/statistics&dataDir=/home/aorth/dspace/solr/statistics-2019/data'
$ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Downloads/statistics-2019-01.json -k uid
  • This imports the records into the core, but DSpace can’t see them, and when I restart Tomcat the core is not seen by Solr…
  • I got the core to load by adding it to dspace/solr/solr.xml manually, ie:
  <cores adminPath="/admin/cores">
  ...
    <core name="statistics" instanceDir="statistics" />
    <core name="statistics-2019" instanceDir="statistics">
        <property name="dataDir" value="/home/aorth/dspace/solr/statistics-2019/data" />
    </core>
  ...
  </cores>
  • But I don’t like having to do that… why doesn’t it load automatically?
  • I sent a mail to the dspace-tech mailing list to ask about it
  • Just for fun I tried to load these stats into a Solr 7.7.2 instance using the DSpace 7 solr config:
  • First, create a Solr statistics core using the DSpace 7 config:
$ ./bin/solr create_core -c statistics -d ~/src/git/DSpace/dspace/solr/statistics/conf -p 8983
  • Then try to import the stats, skipping a shitload of fields that are apparently added to our Solr statistics by Atmire modules:
$ ./run.sh -s http://localhost:8983/solr/statistics -a import -o ~/Downloads/statistics-2019-01.json -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
  • OK that imported! I wonder if it works… maybe I’ll try another day

2020-02-07

$ cd ~/src/git/perf-map-agent
$ cmake     .
$ make
$ ./bin/create-links-in ~/.local/bin
$ export FLAMEGRAPH_DIR=/home/aorth/src/git/FlameGraph
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
$ export JAVA_OPTS="-XX:+PreserveFramePointer"
$ ~/dspace63/bin/dspace index-discovery -b &
# pid of tomcat java process
$ perf-java-flames 4478
# pid of java indexing process
$ perf-java-flames 11359
  • All Java processes need to have -XX:+PreserveFramePointer if you want to trace their methods
  • I did the same tests against DSpace 5.8 and 6.4-SNAPSHOT’s CLI indexing process and Tomcat process
    • For what it’s worth, it appears all the Hibernate stuff is in the CLI processes, so we don’t need to trace the Tomcat process
  • Here is the flame graph for DSpace 5.8’s dspace index-discovery -b java process:

DSpace 5.8 index-discovery flame graph

  • Here is the flame graph for DSpace 6.4-SNAPSHOT’s dspace index-discovery -b java process:

DSpace 6.4-SNAPSHOT index-discovery flame graph

  • If the width of the stacks indicates time, then it’s clear that Hibernate takes longer…
  • Apparently there is a “flame diff” tool, I wonder if we can use that to compare!

2020-02-09

  • This weekend I did a lot more testing of indexing performance with our DSpace 5.8 branch, vanilla DSpace 5.10, and vanilla DSpace 6.4-SNAPSHOT:
# CGSpace 5.8
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  385.72s user 131.16s system 19% cpu 43:21.18 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  382.95s user 127.31s system 20% cpu 42:10.07 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  368.56s user 143.97s system 20% cpu 42:22.66 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b  360.09s user 104.03s system 19% cpu 39:24.41 total

# Vanilla DSpace 5.10
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b  236.19s user 59.70s system 3% cpu 2:03:31.14 total
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b  232.41s user 50.38s system 3% cpu 2:04:16.00 total

# Vanilla DSpace 6.4-SNAPSHOT
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  5112.96s user 127.80s system 40% cpu 3:36:53.98 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b  5112.96s user 127.80s system 40% cpu 3:21:0.0 total
  • I generated better flame graphs for the DSpace indexing process by using perf-record-stack and filtering out the java process:
$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
$ export PERF_RECORD_SECONDS=60
$ export JAVA_OPTS="-XX:+PreserveFramePointer"
$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b &
# process id of java indexing process (not Tomcat)
$ perf-java-record-stack 169639
$ sudo perf script -i /tmp/perf-169639.data > out.dspace510-1
$ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' | ../FlameGraph/flamegraph.pl --color=java --hash > out.dspace510-1.svg
  • All data recorded on my laptop with the same kernel, same boot, etc.
  • CGSpace 5.8 (with Atmire patches):

DSpace 5.8 (with Atmire modules) index-discovery flame graph

  • Vanilla DSpace 5.10:

Vanilla DSpace 5.10 index-discovery flame graph

  • Vanilla DSpace 6.4-SNAPSHOT:

Vanilla DSpace 6.4-SNAPSHOT index-discovery flame graph

  • I sent my feedback to the dspace-tech mailing list so someone can hopefully comment.
  • Last week Peter asked Sisay to upload some items to CGSpace in the GENNOVATE collection (part of Gender CRP)

2020-02-10

2020-02-11

  • Maria from Bioversity asked me to add some ORCID iDs to our controlled vocabulary so I combined them with our existing ones and updated the names from the ORCID API:
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-02-11-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-02-11-combined-orcids.txt -o /tmp/2020-02-11-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
  • Then I noticed some author names had changed, so I captured the old and new names in a CSV file and fixed them using fix-metadata-values.py:
$ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -t correct -m 240 -d
  • On a hunch I decided to try to add these ORCID iDs to existing items that might not have them yet
    • I checked the database for likely matches to the author name and then created a CSV with the author names and ORCID iDs:
dc.contributor.author,cg.creator.id
"Staver, Charles",charles staver: 0000-0002-4532-6077
"Staver, C.",charles staver: 0000-0002-4532-6077
"Fungo, R.",Robert Fungo: 0000-0002-4264-6905
"Remans, R.",Roseline Remans: 0000-0003-3659-8529
"Remans, Roseline",Roseline Remans: 0000-0003-3659-8529
"Rietveld A.",Anne Rietveld: 0000-0002-9400-9473
"Rietveld, A.",Anne Rietveld: 0000-0002-9400-9473
"Rietveld, A.M.",Anne Rietveld: 0000-0002-9400-9473
"Rietveld, Anne M.",Anne Rietveld: 0000-0002-9400-9473
"Fongar, A.",Andrea Fongar: 0000-0003-2084-1571
"Müller, Anna",Anna Müller: 0000-0003-3120-8560
"Müller, A.",Anna Müller: 0000-0003-3120-8560
  • Running the add-orcid-identifiers-csv.py script I added 144 ORCID iDs to items on CGSpace!
$ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu'
  • Minor updates to all Python utility scripts in the CGSpace git repository
  • Update the spider agent patterns in CGSpace 5_x-prod branch from the latest COUNTER-Robots project
    • I ran the check-spider-hits.sh script with the updated file and purged 6,000 hits from our Solr statistics core on CGSpace

2020-02-12

  • Follow up with people about AReS funding for next phase
  • Peter asked about the “stats” and “summary” reports that he had requested in December
  • Peter asked me to update John McIntire’s name format on CGSpace so I ran the following PostgreSQL query:
dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
UPDATE 26

2020-02-17

  • A few days ago Atmire responded to my question about DSpace 6.4-SNAPSHOT saying that they can only confirm that 6.3 works with their modules
    • I responded to say that we agree to target 6.3, but that I will cherry-pick important patches from the dspace-6_x branch at our own responsibility
  • Send a message to dspace-devel asking them to tag DSpace 6.4
  • Udana from IWMI asked about the OAI base URL for their community on CGSpace

2020-02-19

  • I noticed a thread on the mailing list about the Tomcat header size and Solr max boolean clauses error
    • The solution is to do as we have done and increase the headers / boolean clauses, or to simply disable access rights awareness in Discovery
    • I applied the fix to the 5_x-prod branch and cherry-picked it to 6_x-dev
  • Upgrade Tomcat from 7.0.99 to 7.0.100 in Ansible infrastructure playbooks
  • Upgrade PostgreSQL JDBC driver from 42.2.9 to 42.2.10 in Ansible infrastructure playbooks
  • Run Tomcat and PostgreSQL JDBC driver updates on DSpace Test (linode19)

2020-02-23

  • I see a new spider in the nginx logs on CGSpace:
Mozilla/5.0 (compatible;Linespider/1.1;+https://lin.ee/4dwXkTH)
  • I think this should be covered by the COUNTER-Robots patterns for the statistics at least…
  • I see some IP (186.32.217.255) in Costa Rica making requests like a bot with the following user agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
  • Another IP address (31.6.77.23) in the UK making a few hundred requests without a user agent
  • I will add the IP addresses to the nginx badbots list
  • 31.6.77.23 is in the UK and judging by its DNS it belongs to a web marketing company called Bronco
    • I looked for its DNS entry in Solr statistics and found a few hundred thousand over the years:
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=dns:/squeeze3.bronco.co.uk./&rows=0"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">4</int><lst name="params"><str name="q">dns:/squeeze3.bronco.co.uk./</str><str name="rows">0</str></lst></lst><result name="response" numFound="86044" start="0"></result>
</response>
  • The totals in each core are:
    • statistics: 86044
    • statistics-2018: 65144
    • statistics-2017: 79405
    • statistics-2016: 121316
    • statistics-2015: 30720
    • statistics-2014: 4524
    • … so about 387,000 hits!
  • I will purge them from each core one by one, ie:
$ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
$ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
  • Deploy latest Tomcat and PostgreSQL JDBC driver changes on CGSpace (linode18)
  • Deploy latest 5_x-prod branch on CGSpace (linode18)
  • Run all system updates on CGSpace (linode18) server and reboot it
    • After the server came back up Tomcat started, but there were errors loading some Solr statistics cores
    • Luckily after restarting Tomcat once more they all came back up
  • I ran the dspace cleanup -v process on CGSpace and got an error:
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(183996) is still referenced from table "bundle".
  • The solution is, as always:
# su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
UPDATE 1
  • Аdd one more new Bioversity ORCID iD to the controlled vocabulary on CGSpace
  • Felix Shaw from Earlham emailed me to ask about his admin account on DSpace Test
    • His old one got lost when I re-sync’d DSpace Test with CGSpace a few weeks ago
    • I added a new account for him and added it to the Administrators group:
$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
  • For some reason the Atmire Content and Usage Analysis (CUA) module’s Usage Statistics is drawing blank graphs
    • I looked in the dspace.log and see:
2020-02-23 11:28:13,696 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoClassDefFoundError: Could not
 initialize class org.jfree.chart.JFreeChart
  • The same error happens on DSpace Test, but graphs are working on my local instance
    • The only thing I’ve changed recently is the Tomcat version, but it’s working locally…
    • I see the following file on my local instance, CGSpace, and DSpace Test: dspace/webapps/xmlui/WEB-INF/lib/jfreechart-1.0.5.jar
    • I deployed Tomcat 7.0.99 on DSpace Test but the JFreeChart classs still can’t be found…
    • So it must be somthing with the library search path…
    • Strange it works with Tomcat 7.0.100 on my local machine
  • I copied the jfreechart-1.0.5.jar file to the Tomcat lib folder and then there was a different error when I loaded Atmire CUA:
2020-02-23 16:25:10,841 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!  org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.awt.AWTError: Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper
  • Some search results suggested commenting out the following line in /etc/java-8-openjdk/accessibility.properties:
assistive_technologies=org.GNOME.Accessibility.AtkWrapper
  • And removing the extra jfreechart library and restarting Tomcat I was able to load the usage statistics graph on DSpace Test…
    • Hmm, actually I think this is an Java bug, perhaps introduced or at least present in 18.04, with lots of references to it happening in other configurations like Debian 9 with Jenkins, etc…
    • Apparently if you use the non-headless version of openjdk this doesn’t happen… but that pulls in X11 stuff so no thanks
    • Also, I see dozens of occurences of this going back over one month (we have logs for about that period):
# grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
dspace.log.2020-01-12:4
dspace.log.2020-01-13:66
dspace.log.2020-01-14:4
dspace.log.2020-01-15:36
dspace.log.2020-01-16:88
dspace.log.2020-01-17:4
dspace.log.2020-01-18:4
dspace.log.2020-01-19:4
dspace.log.2020-01-20:4
dspace.log.2020-01-21:4
...
  • I deployed the fix on CGSpace (linode18) and I was able to see the graphs in the Atmire CUA Usage Statistics…
  • On an unrelated note there is something weird going on in that I see millions of hits from IP 34.218.226.147 in Solr statistics, but if I remember correctly that IP belongs to CodeObia’s AReS explorer, but it should only be using REST and therefore no Solr statistics…?
$ curl -s "http://localhost:8081/solr/statistics-2018/select" -d "q=ip:34.218.226.147&rows=0"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">811</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="rows">0</str></lst></lst><result name="response" numFound="5536097" start="0"></result>
</response>
  • And there are apparently two million from last month (2020-01):
$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=ip:34.218.226.147&fq=dateYearMonth:2020-01&rows=0"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">248</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="fq">dateYearMonth:2020-01</str><str name="rows">0</str></lst></lst><result name="response" numFound="2173455" start="0"></result>
</response>
  • But when I look at the nginx access logs for the past month or so I only see 84,000, all of which are on /rest and none of which are to XMLUI:
# zcat /var/log/nginx/*.log.*.gz | grep -c 34.218.226.147
84322
# zcat /var/log/nginx/*.log.*.gz | grep 34.218.226.147 | grep -c '/rest'
84322
  • Either the requests didn’t get logged, or there is some mixup with the Solr documents (fuck!)
    • On second inspection, I do see lots of notes here about 34.218.226.147, including 150,000 on one day in October, 2018 alone…
  • To make matters worse, I see hits from REST in the regular nginx access log!
    • I did a few tests and I can’t figure out, but it seems that hits appear in either (not both)
    • Also, I see zero hits to /rest in the access.log on DSpace Test (linode19)
  • Anyways, I faceted by IP in 2020-01 and see:
$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-01&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
...
        "172.104.229.92",2686876,
        "34.218.226.147",2173455,
        "163.172.70.248",80945,
        "163.172.71.24",55211,
        "163.172.68.99",38427,
  • Surprise surprise, the top two IPs are from AReS servers… wtf.
  • The next three are from Online in France and they are all using this weird user agent and making tens of thousands of requests to Discovery:
Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
  • And all the same three are already inflating the statistics for 2020-02… hmmm.
  • I need to see why AReS harvesting is inflating the stats, as it should only be making REST requests…
  • Shiiiiit, I see 84,000 requests from the AReS IP today alone:
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+ip:172.104.229.92&rows=0&wt=json&indent=true'
...
  "response":{"numFound":84594,"start":0,"docs":[]
  • Fuck! And of course the ILRI websites doing their daily REST harvesting are causing issues too, from today alone:
        "2a01:7e00::f03c:91ff:fe9a:3a37",35512,
        "2a01:7e00::f03c:91ff:fe18:7396",26155,
  • I need to try to make some requests for these URLs and observe if they make a statistics hit:
    • /rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450
    • /rest/handle/10568/28702?expand=all
  • Those are the requests AReS and ILRI servers are making… nearly 150,000 per day!
  • Well that settles it!
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
  "response":{"numFound":12,"start":0,"docs":[
$ curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450'
$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true'
$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
  "response":{"numFound":62,"start":0,"docs":[
  • A REST request with limit=50 will make exactly fifty statistics_type=view statistics in the Solr core… fuck.
    • So not only do I need to purge all these millions of hits, we need to add these IPs to the list of spider IPs so they don’t get recorded

2020-02-24

  • I tried to add some IPs to the DSpace spider list so they would not get recorded in Solr statistics, but it doesn’t support IPv6
    • A better method is actually to just use the nginx mapping logic we already have to reset the user agent for these requests to “bot”
    • That, or to really insist that users harvesting us specify some kind of user agent
  • I tried to add the IPs to our nginx IP bot mapping but it doesn’t seem to work… WTF, why is everything broken?!
  • Oh lord have mercy, the two AReS harvester IPs alone are responsible for 42 MILLION hits in 2019 and 2020 so far by themselves:
$ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&rows=0&wt=json&indent=true' | grep numFound
  "response":{"numFound":42395486,"start":0,"docs":[]
  • I modified my check-spider-hits.sh script to create a version that works with IPs and purged 47 million stats from Solr on CGSpace:
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics -p
Purging 22809216 hits from 34.218.226.147 in statistics
Purging 19586270 hits from 172.104.229.92 in statistics
Purging 111137 hits from 2a01:7e00::f03c:91ff:fe9a:3a37 in statistics
Purging 271668 hits from 2a01:7e00::f03c:91ff:fe18:7396 in statistics

Total number of bot hits purged: 42778291
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics-2018 -p
Purging 5535399 hits from 34.218.226.147 in statistics-2018

Total number of bot hits purged: 5535399
  • (The statistics core holds 2019 and 2020 stats, because the yearly sharding process failed this year)
  • Attached is a before and after of the period from 2019-01 to 2020-02:

CGSpace stats for 2019 and 2020 before the purge

CGSpace stats for 2019 and 2020 after the purge

  • And here is a graph of the stats by year since 2011:

CGSpace stats by year since 2011 after the purge

  • I’m a little suspicious of the 2012, 2013, and 2014 numbers, though
    • I should facet those years by IP and see if any stand out…
  • The next thing I need to do is figure out why the nginx IP to bot mapping isn’t working…
    • Actually, and I’ve probably learned this before, but the bot mapping is working, but nginx only logs the real user agent (of course!), as I’m only using the mapped one in the proxy pass…
    • This trick for adding a header with the mapped “ua” variable is nice:
add_header X-debug-message "ua is $ua" always;
  • Then in the HTTP response you see:
X-debug-message: ua is bot
  • So the IP to bot mapping is working, phew.
  • More bad news, I checked the remaining IPs in our existing bot IP mapping, and there are statistics registered for them!
    • For example, ciat.cgiar.org was previously 104.196.152.243, but it is now 35.237.175.180, which I had noticed as a “mystery” client on Google Cloud in 2018-09
    • Others I should probably add to the nginx bot map list are:
      • wle.cgiar.org (70.32.90.172)
      • ccafs.cgiar.org (205.186.128.185)
      • another CIAT scraper using the PHP GuzzleHttp library (45.5.184.72)
      • macaronilab.com (63.32.242.35)
      • africa-rising.net (162.243.171.159
  • These IPs are all active in the REST API logs over the last few months and they account for thirty-four million more hits in the statistics!
  • I purged them from CGSpace:
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 15 hits from 104.196.152.243 in statistics
Purging 61064 hits from 35.237.175.180 in statistics
Purging 1378 hits from 70.32.90.172 in statistics
Purging 28880 hits from 205.186.128.185 in statistics
Purging 464613 hits from 63.32.242.35 in statistics
Purging 131 hits from 162.243.171.159 in statistics

Total number of bot hits purged: 556081
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 684888 hits from 104.196.152.243 in statistics-2018
Purging 323737 hits from 35.227.26.162 in statistics-2018
Purging 221091 hits from 35.237.175.180 in statistics-2018
Purging 3834 hits from 205.186.128.185 in statistics-2018
Purging 20337 hits from 63.32.242.35 in statistics-2018

Total number of bot hits purged: 1253887
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 1752548 hits from 104.196.152.243 in statistics-2017

Total number of bot hits purged: 1752548
  • I looked in the REST API logs for the past month and found a few more IPs:
    • 95.110.154.135 (BioversityBot)
    • 34.209.213.122 (IITA? bot)
  • The client at 3.225.28.105 is using the following user agent:
Apache-HttpClient/4.3.4 (java 1.5)
  • But I don’t see any hits for it in the statistics core for some reason
  • Looking more into the 2015 statistics I see some questionable IPs:
    • 50.115.121.196 has a DNS of saltlakecity2tr.monitis.com
    • 70.32.99.142 has userAgent Drupal
    • 104.130.164.111 was some scraper on Rackspace.com that made ~30,000 requests per month
    • 45.56.65.158 was some scraper on Linode that made ~30,000 requests per month
    • 23.97.198.40 was some scraper with an IP owned by Microsoft that made ~4,000 requests per month and had no user agent
    • 180.76.15.6 and dozens of other IPs with DNS like baiduspider-180-76-15-6.crawl.baidu.com. (and they were using a Mozilla/5.0 user agent!)
  • For the IPs I purged them using check-spider-ip-hits.sh:
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 11478 hits from 95.110.154.135 in statistics
Purging 1208 hits from 34.209.213.122 in statistics
Purging 10 hits from 54.184.39.242 in statistics

Total number of bot hits purged: 12696
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 12572 hits from 95.110.154.135 in statistics-2018
Purging 233 hits from 34.209.213.122 in statistics-2018

Total number of bot hits purged: 12805
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 37503 hits from 95.110.154.135 in statistics-2017
Purging 25 hits from 34.209.213.122 in statistics-2017
Purging 8621 hits from 23.97.198.40 in statistics-2017

Total number of bot hits purged: 46149
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
Purging 1476 hits from 95.110.154.135 in statistics-2016
Purging 10490 hits from 70.32.99.142 in statistics-2016
Purging 29519 hits from 50.115.121.196 in statistics-2016
Purging 175758 hits from 45.56.65.158 in statistics-2016
Purging 26279 hits from 23.97.198.40 in statistics-2016

Total number of bot hits purged: 243522
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
Purging 49351 hits from 70.32.99.142 in statistics-2015
Purging 30278 hits from 50.115.121.196 in statistics-2015
Purging 172292 hits from 104.130.164.111 in statistics-2015
Purging 78571 hits from 45.56.65.158 in statistics-2015
Purging 16069 hits from 23.97.198.40 in statistics-2015

Total number of bot hits purged: 346561
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2014 -p 
Purging 462 hits from 70.32.99.142 in statistics-2014
Purging 1766 hits from 50.115.121.196 in statistics-2014

Total number of bot hits purged: 2228
  • Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries because they didn’t have a proper user agent and the only way to identify them was via DNS:
$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:*crawl.baidu.com.</query></delete>"
  • Jesus, the more I keep looking, the more I see ridiculous stuff…
  • In 2019 there were a few hundred thousand requests from CodeObia on Orange Jordan network…
    • 79.173.222.114
    • 149.200.141.57
    • 86.108.89.91
    • And others…
  • Also I see a CIAT IP 45.5.186.2 that was making hundreds of thousands of requests (and 100/sec at one point in 2019)
  • Also I see some IP on Hetzner making 10,000 requests per month: 2a01:4f8:210:51ef::2
  • Also I see some IP in Greece making 130,000 requests with weird user agents: 143.233.242.130
  • I purged a bunch more from all cores:
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p     
Purging 109965 hits from 45.5.186.2 in statistics
Purging 78648 hits from 79.173.222.114 in statistics
Purging 49032 hits from 149.200.141.57 in statistics
Purging 26897 hits from 86.108.89.91 in statistics
Purging 80898 hits from 2a01:4f8:210:51ef::2 in statistics
Purging 130831 hits from 143.233.242.130 in statistics
Purging 46489 hits from 83.103.94.48 in statistics

Total number of bot hits purged: 522760
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2018 -p
Purging 41574 hits from 45.5.186.2 in statistics-2018
Purging 39620 hits from 2a01:4f8:210:51ef::2 in statistics-2018
Purging 19325 hits from 83.103.94.48 in statistics-2018

Total number of bot hits purged: 100519
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017   
Found 296 hits from 45.5.186.2 in statistics-2017
Found 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
Found 16086 hits from 83.103.94.48 in statistics-2017

Total number of hits from bots: 16772
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2017 -p
Purging 296 hits from 45.5.186.2 in statistics-2017
Purging 390 hits from 2a01:4f8:210:51ef::2 in statistics-2017
Purging 16086 hits from 83.103.94.48 in statistics-2017

Total number of bot hits purged: 16772
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2016 -p
Purging 394 hits from 2a01:4f8:210:51ef::2 in statistics-2016
Purging 26519 hits from 83.103.94.48 in statistics-2016

Total number of bot hits purged: 26913
$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics-2015 -p
Purging 1 hits from 143.233.242.130 in statistics-2015
Purging 14109 hits from 83.103.94.48 in statistics-2015

Total number of bot hits purged: 14110
  • Though looking in my REST logs for the last month I am second guessing my judgement on 45.5.186.2 because I see user agents like “Microsoft Office Word 2014”
  • Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:
# zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\" '{print $6}' | sort | uniq -c | sort -h
      1 Microsoft Office Word 2014
      1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
      1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
      1 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
      2 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
      3 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
     24 GuzzleHttp/6.3.3 curl/7.59.0 PHP/7.0.31
     34 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
     98 Apache-HttpClient/4.3.4 (java 1.5)
  54850 -
  • I see lots of requests coming from the following user agents:
"Apache-HttpClient/4.5.7 (Java/11.0.3)"
"Apache-HttpClient/4.5.7 (Java/11.0.2)"
"LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)"
"EventMachine HttpClient"
  • I should definitely add HttpClient to the bot user agents…
  • Also, while bot, spider, and crawl are in the pattern list already and can be used for case-insensitive matching when used by DSpace in Java, I can’t do case-insensitive matching in Solr with check-spider-hits.sh
    • I need to add Bot, Spider, and Crawl to my local user agent file to purge them
    • Also, I see lots of hits from “Indy Library”, which we’ve been blocking for a long time, but somehow these got through (I think it’s the Greek guys using Delphi)
    • Somehow my regex conversion isn’t working in check-spider-hits.sh, but “Indy” will work for now
    • Purging just these case-sensitive patterns removed ~1 million more hits from 2011 to 2020
  • More weird user agents in 2019:
ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/

2020-02-25

  • And what’s the 950,000 hits from Online.net IPs with the following user agent:
Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
  • Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I’m purging them all
  • I looked deeper in the Solr statistics and found a bunch more weird user agents:
LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3
EventMachine HttpClient
ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
Biosphere EcoSearch http://search.ecointernet.org/
Typhoeus - https://github.com/typhoeus/typhoeus
Citoid (Wikimedia tool; learn more at https://www.mediawiki.org/wiki/Citoid)
node-fetch/1.0 (+https://github.com/bitinn/node-fetch)
7Siters/1.08 (+https://7ooo.ru/siters/)
sqlmap/1.0-dev-nongit-20190527 (http://sqlmap.org)
sqlmap/1.3.4.14#dev (http://sqlmap.org)
lua-resty-http/0.10 (Lua) ngx_lua/10000
omgili/0.5 +http://omgili.com
IZaBEE/IZaBEE-1.01 (Buzzing Abound The Web; https://izabee.com; info at izabee dot com)
Twurly v1.1 (https://twurly.org)
okhttp/3.11.0
okhttp/3.10.0
Pattern/2.6 +http://www.clips.ua.ac.be/pattern
Link Check; EPrints 3.3.x;
CyotekWebCopy/1.7 CyotekHTTP/2.0
Adestra Link Checker: http://www.adestra.co.uk
HTTPie/1.0.2
  • I notice that some of these would be matched by the COUNTER-Robots list when DSpace uses it in Java because there we have more robust (and case-insensitive) matching
    • I created a temporary file of some of the patterns and converted them to use capitalization so I could run them through check-spider-hits.sh
Link.?Check
Http.?Client
ecointernet
  • That removes another 500,000 or so:
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics -p
Purging 253 hits from Jersey\/[0-9] in statistics
Purging 7302 hits from Link.?Check in statistics
Purging 85574 hits from Http.?Client in statistics
Purging 495 hits from HTTPie\/[0-9] in statistics
Purging 56726 hits from ecointernet in statistics

Total number of bot hits purged: 150350
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2018 -p
Purging 3442 hits from Link.?Check in statistics-2018
Purging 21922 hits from Http.?Client in statistics-2018
Purging 2120 hits from HTTPie\/[0-9] in statistics-2018
Purging 10 hits from ecointernet in statistics-2018

Total number of bot hits purged: 27494
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2017 -p
Purging 6416 hits from Link.?Check in statistics-2017
Purging 403402 hits from Http.?Client in statistics-2017
Purging 12 hits from HTTPie\/[0-9] in statistics-2017
Purging 6 hits from ecointernet in statistics-2017

Total number of bot hits purged: 409836
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2016 -p
Purging 2348 hits from Link.?Check in statistics-2016
Purging 225664 hits from Http.?Client in statistics-2016
Purging 15 hits from HTTPie\/[0-9] in statistics-2016

Total number of bot hits purged: 228027
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2015 -p
Purging 3459 hits from Link.?Check in statistics-2015
Purging 263 hits from Http.?Client in statistics-2015
Purging 15 hits from HTTPie\/[0-9] in statistics-2015

Total number of bot hits purged: 3737
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2014 -p
Purging 5 hits from Link.?Check in statistics-2014
Purging 8 hits from Http.?Client in statistics-2014
Purging 4 hits from HTTPie\/[0-9] in statistics-2014

Total number of bot hits purged: 17
$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics-2011 -p
Purging 159 hits from Http.?Client in statistics-2011

Total number of bot hits purged: 159
  • Make pull requests for issues with user agents in the COUNTER-Robots repository:
  • One benefit of all this is that the size of the statistics Solr core has reduced by 6GiB since yesterday, though I can’t remember how big it was before that
    • According to my notes it was 43GiB in January when it failed the first time
    • I wonder if the sharding process would work now…

2020-02-26

  • Bosede finally got back to me about the IITA records from earlier last month (IITA_201907_Jan13)
    • She said she has added more information to fifty-three of the journal articles, as I had requested
  • I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month:
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
$ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-util.log.$(date --iso-8601)
  • Interestingly I saw this in the Solr log:
2020-02-26 08:55:47,433 INFO  org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/
2020-02-26 08:55:47,511 INFO  org.apache.solr.servlet.SolrDispatchFilter @ [admin] webapp=null path=/admin/cores params={dataDir=[dspace]/solr/statistics-2019/data&name=statistics-2019&action=CREATE&instanceDir=statistics&wt=javabin&version=2} status=0 QTime=590
  • The process has been going for several hours now and I suspect it will fail eventually
    • I want to explore manually creating and migrating the core
  • Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment:
$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
  • After that the statistics-2019 core was immediately available in the Solr UI, but after restarting Tomcat it was gone
    • I wonder if I import some old statistics into the current statistics core and then let DSpace create the statistics-2019 core itself using dspace stats-util -s will work…
  • First export a small slice of 2019 stats from the main CGSpace statistics core, skipping Atmire schema additions:
$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
  • Then import into my local statistics core:
$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid
$ ~/dspace63/bin/dspace stats-util -s
Moving: 21993 into core statistics-2019
  • To my surprise, the statistics-2019 core is created and the documents are immediately visible in the Solr UI!
    • Also, I am able to see the stats in DSpace’s default “View Usage Statistics” screen
    • Items appear with the words “(legacy)” at the end, ie “Improving farming practices in flood-prone areas in the Solomon Islands(legacy)”
    • Interestingly, if I make a bunch of requests for that item they will not be recognized as the same item, showing up as “Improving farming practices in flood-prone areas in the Solomon Islands” without the the legacy identifier
    • I need to remember to test out the SolrUpgradePre6xStatistics tool
  • After restarting my local Tomcat on DSpace 6.4-SNAPSHOT the statistics-2019 core loaded up…
    • I wonder what the difference is between the core I created vs the one created by stats-util?
    • I’m honestly considering just moving everything back into one core…
    • Or perhaps I can export all the stats for 2019 by month, then delete everything, re-import each month, and migrate them with stats-util
  • A few hours later the sharding has completed successfully so I guess I don’t have to worry about this any more for now, though I’m seriously considering moving all my data back into the one statistics core
  • Testing some proposed patches for 6.4 in my local 6_x-dev64 branch
  • DS-4135 (citation author UTF-8)
<meta content="Thu hoạch v&agrave; bảo quản c&agrave; ph&ecirc; ch&egrave; đ&uacute;ng kỹ thuật (Harvesting and storing Arabica coffee)" name="citation_title">
<meta name="citation_title" content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" />

2020-02-27

  • Tezira startd a discussion on Yammer about the ISI Journal field
    • She and Abenet both insist that selecting N/A for the “Journal status” in the submission form makes the item show ISI Journal on the item display page
    • I told them that the N/A does not store a value so this is impossible
    • I tested it to be sure on DSpace Test, and it does not show a value…
    • I checked this morning’s database snapshot and found three items that had a value of N/A, but they have already been fixed manually on CGSpace by Abenet or Tezira
    • I re-worded the N/A to say “Non-ISI Journal” in the submission form, though it still does not store a value
  • I tested the one last remaining issue with our 6.x-dev branch: the export CSV from search results
  • I added some debugging to the Solr core loading in DSpace 6.4-SNAPSHOT (SolrLoggerServiceImpl.java) and I see this when DSpace starts up now:
2020-02-27 12:26:35,695 INFO  org.dspace.statistics.SolrLoggerServiceImpl @ Alan Ping of Solr Core [statistics-2019] Failed with [org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException].  New Core Will be Created
  • When I check Solr I see the statistics-2019 core loaded (from stats-util -s yesterday, not manually created)