<li>I tried importing a PostgreSQL snapshot from CGSpace and had errors due to missing Atmire database migrations
<ul>
<li>If I try to run <code>dspace database migrate</code> I get the IDs of the migrations that are missing</li>
<li>I delete them manually in psql:</li>
</ul>
</li>
</ul>
<pre><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
</code></pre><ul>
<li>Then I ran <code>dspace database migrate</code> and got an error:</li>
</ul>
<pre><code>Statement : ALTER TABLE metadatavalue DROP COLUMN IF EXISTS resource_id
at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:117)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.execute(SqlMigrationExecutor.java:71)
at org.flywaydb.core.internal.command.DbMigrate.doMigrate(DbMigrate.java:352)
at org.flywaydb.core.internal.command.DbMigrate.access$1100(DbMigrate.java:47)
at org.flywaydb.core.internal.command.DbMigrate$4.doInTransaction(DbMigrate.java:308)
at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
at org.flywaydb.core.internal.command.DbMigrate.applyMigration(DbMigrate.java:305)
at org.flywaydb.core.internal.command.DbMigrate.access$1000(DbMigrate.java:47)
at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:230)
at org.flywaydb.core.internal.command.DbMigrate$2.doInTransaction(DbMigrate.java:173)
at org.flywaydb.core.internal.util.jdbc.TransactionTemplate.execute(TransactionTemplate.java:72)
at org.flywaydb.core.internal.command.DbMigrate.migrate(DbMigrate.java:173)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:959)
at org.flywaydb.core.Flyway$1.execute(Flyway.java:917)
at org.flywaydb.core.Flyway.execute(Flyway.java:1373)
at org.flywaydb.core.Flyway.migrate(Flyway.java:917)
at org.dspace.storage.rdbms.DatabaseUtils.updateDatabase(DatabaseUtils.java:662)
... 8 more
Caused by: org.postgresql.util.PSQLException: ERROR: cannot drop table metadatavalue column resource_id because other objects depend on it
Detail: view eperson_metadata depends on table metadatavalue column resource_id
Hint: Use DROP ... CASCADE to drop the dependent objects too.
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2422)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2167)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:306)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:441)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:365)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:307)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:293)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:270)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:266)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:291)
at org.flywaydb.core.internal.dbsupport.JdbcTemplate.executeStatement(JdbcTemplate.java:238)
at org.flywaydb.core.internal.dbsupport.SqlScript.execute(SqlScript.java:114)
... 24 more
</code></pre><ul>
<li>I think I might need to update the sequences first… nope</li>
<li>Perhaps it’s due to some missing bitstream IDs and I need to run <code>dspace cleanup</code> on CGSpace and take a new PostgreSQL dump… nope</li>
<li>In a thread on the dspace-tech mailing list regarding this migration, someone noticed that his database had some views that were using the <code>resource_id</code> column</li>
<li>Our database had the same issue, where the <code>eperson_metadata</code> view was created by something (Atmire module?) but has no references in the vanilla DSpace code, so I dropped it and tried the migration again:</li>
</ul>
<pre><code>dspace63=# DROP VIEW eperson_metadata;
DROP VIEW
</code></pre><ul>
<li>After that the migration was successful and DSpace starts up successfully and begins indexing
<ul>
<li>xmlui, solr, jspui, rest, and oai are working (rest was redirecting to HTTPS, so I set the Tomcat connector to <code>secure="true"</code> and it fixed it on localhost, but caused other issues so I disabled it for now)</li>
<li><input checked="" disabled="" type="checkbox">Community and collection pages only show one recent submission (seems that there is only one item in Solr?)</li>
<li><input checked="" disabled="" type="checkbox">Order of navigation elements in right side bar (“My Account” etc, compare to DSpace Test)</li>
<li><input disabled="" type="checkbox">Home page trail says “CGSpace Home” instead of “CGSpace Home / Community List” (see DSpace Test)</li>
</ul>
</li>
<li>There are lots of errors in the DSpace log, which might explain some of the issues with recent submissions / Solr:</li>
<li>If I look in Solr’s search core I do actually see items with integers for their resource ID, which I think are all supposed to be UUIDs now…</li>
<li>I dropped all the documents in the search core:</li>
</ul>
<pre><code>$ http --print b 'http://localhost:8080/solr/search/update?stream.body=&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&amp;commit=true'
</code></pre><ul>
<li>Still didn’t work, so I’m going to try a clean database import and migration:</li>
</ul>
<pre><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
</code></pre><ul>
<li>I notice that the indexing doesn’t work correctly if I start it manually with <code>dspace index-discovery -b</code> (search.resourceid becomes an integer!)
<ul>
<li>If I induce an indexing by touching <code>dspace/solr/search/conf/reindex.flag</code> the search.resourceid are all UUIDs…</li>
</ul>
</li>
<li>Speaking of database stuff, there was a performance-related update for the <a href="https://github.com/DSpace/DSpace/pull/1791/">indexes that we used in DSpace 5</a>
<ul>
<li>We might want to <a href="https://github.com/DSpace/DSpace/pull/1792">apply it in DSpace 6</a>, as it was never merged to 6.x, but it helped with the performance of <code>/submissions</code> in XMLUI for us in <a href="/cgspace-notes/2018-03/">2018-03</a></li>
</ul>
</li>
<li>The indexing issue I was having yesterday seems to only present itself the first time a new installation is running DSpace 6
<ul>
<li>Once the indexing induced by touching <code>dspace/solr/search/conf/reindex.flag</code> has finished, subsequent manual invocations of <code>dspace index-discovery -b</code> work as expected</li>
<li>Nevertheless, I sent a message to the dspace-tech mailing list describing the issue to see if anyone has any comments</li>
</ul>
</li>
<li>I see that there are many important commits on the unreleased DSpace 6.4, so it might be better for us to target that version
<ul>
<li>I did a simple test and it’s easy to rebase my current 6.3 branch on top of the upstream <code>dspace-6_x</code> branch:</li>
</ul>
</li>
</ul>
<pre><code>$ git checkout -b 6_x-dev64 6_x-dev
$ git rebase -i upstream/dspace-6_x
</code></pre><ul>
<li>I finally understand why our themes show all the “Browse by” buttons on community and collection pages in DSpace 6.x
<ul>
<li>The code in <code>./dspace-xmlui/src/main/java/org/dspace/app/xmlui/aspect/browseArtifacts/CommunityBrowse.java</code> iterates over all the browse indexes and prints them when it is called</li>
<li>The XMLUI theme code in <code>dspace/modules/xmlui-mirage2/src/main/webapp/themes/0_CGIAR/xsl/preprocess/browse.xsl</code> calls the template because the id of the div matches “aspect.browseArtifacts.CommunityBrowse.list.community-browse”</li>
<li>I checked the DRI of a community page on my local 6.x and DSpace Test 5.x by appending <code>?XML</code> to the URL and I see the ID is missing on DSpace 5.x</li>
<li>The issue is the same with the ordering of the “My Account” link, but in Navigation.java</li>
<li>I tried modifying <code>preprocess/browse.xsl</code> but it always ends up printing some default list of browse by links…</li>
<li>I’m starting to wonder if Atmire’s modules somehow override this, as I don’t see how <code>CommunityBrowse.java</code> can behave like ours on DSpace 5.x unless they have overridden it (as the open source code is the same in 5.x and 6.x)</li>
<li>At least the “account” link in the sidebar is overridden in our 5.x branch because Atmire copied a modified <code>Navigation.java</code> to the local xmlui modules folder… so that explains that (and it’s easy to replicate in 6.x)</li>
</ul>
</li>
<li>Checking out the DSpace 6.x REST API query client
<ul>
<li>There is a <a href="https://terrywbrady.github.io/restReportTutorial/intro">tutorial</a> that explains how it works and I see it is very powerful because you can export a CSV of results in order to fix and re-upload them with batch import!</li>
<li>Custom queries can be added in <code>dspace-rest/src/main/webapp/static/reports/restQueryReport.js</code></li>
</ul>
</li>
<li>I noticed two new bots in the logs with the following user agents:
<ul>
<li><code>magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)</code></li>
</ul>
</li>
<li>I filed an <a href="https://github.com/atmire/COUNTER-Robots/issues/30">issue to add Jersey to the COUNTER-Robots</a> list</li>
<li>Peter noticed that the statlets on community, collection, and item pages aren’t working on CGSpace
<ul>
<li>I thought it might be related to the fact that the yearly sharding didn’t complete successfully this year so the <code>statistics-2019</code> core is empty</li>
<li>I removed the <code>statistics-2019</code> core and had to restart Tomcat like six times before all cores would load properly (ugh!!!!)</li>
<li>After that the statlets were working properly…</li>
</ul>
</li>
<li>Run all system updates on DSpace Test (linode19) and restart it</li>
</ul>
<pre><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
dspace63=# DROP VIEW eperson_metadata;
dspace63=# \q
</code></pre><ul>
<li>I purged ~33,000 hits from the “Jersey/2.6” bot in CGSpace’s statistics using my <code>check-spider-hits.sh</code> script:</li>
</ul>
<pre><code>$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s "statistics-${year}" -u http://localhost:8081/solr; done
</code></pre><ul>
<li>I noticed another user agent in the logs that we should add to the list:</li>
</ul>
<pre><code>ReactorNetty/0.9.2.RELEASE
</code></pre><ul>
<li>I made <a href="https://github.com/atmire/COUNTER-Robots/issues/31">an issue on the COUNTER-Robots repository</a></li>
<li>I found a <a href="https://github.com/freedev/solr-import-export-json">nice tool for exporting and importing Solr records</a> and it seems to work for exporting our 2019 stats from the large statistics core!</li>
<li>Using flame graphs with Java: <a href="https://netflixtechblog.com/java-in-flames-e763b3d32166">https://netflixtechblog.com/java-in-flames-e763b3d32166</a></li>
<li>Fantastic wrapper scripts for doing perf on Java processes: <a href="https://github.com/jvm-profiling-tools/perf-map-agent">https://github.com/jvm-profiling-tools/perf-map-agent</a></li>
<li>This weekend I did a lot more testing of indexing performance with our DSpace 5.8 branch, vanilla DSpace 5.10, and vanilla DSpace 6.4-SNAPSHOT:</li>
</ul>
<pre><code># CGSpace 5.8
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 385.72s user 131.16s system 19% cpu 43:21.18 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 382.95s user 127.31s system 20% cpu 42:10.07 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 368.56s user 143.97s system 20% cpu 42:22.66 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 360.09s user 104.03s system 19% cpu 39:24.41 total
# Vanilla DSpace 5.10
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b 236.19s user 59.70s system 3% cpu 2:03:31.14 total
schedtool -D -e ~/dspace510/bin/dspace index-discovery -b 232.41s user 50.38s system 3% cpu 2:04:16.00 total
# Vanilla DSpace 6.4-SNAPSHOT
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 5112.96s user 127.80s system 40% cpu 3:36:53.98 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 5112.96s user 127.80s system 40% cpu 3:21:0.0 total
</code></pre><ul>
<li>I generated better flame graphs for the DSpace indexing process by using <code>perf-record-stack</code> and filtering out the java process:</li>
<li>Follow up with <a href="https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706">Atmire about DSpace 6.x upgrade</a>
<ul>
<li>I raised the issue of targeting 6.4-SNAPSHOT as well as the Discovery indexing performance issues in 6.x</li>
</ul>
</li>
</ul>
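<p>To put the wall-clock totals from the indexing runs above side by side, here is a minimal Python sketch that converts the <code>total</code> column of the timing output into seconds. The helper name <code>wall_clock_seconds</code> is hypothetical, and the three sample values are simply copied from one run of each version above, so treat the ratios as rough.</p>
<pre><code class="language-python">def wall_clock_seconds(total):
    """Convert a time total like '43:21.18' or '2:03:31.14' to seconds."""
    seconds = 0.0
    # Fields are [hours:]minutes:seconds, so fold them left in base 60
    for field in total.split(":"):
        seconds = seconds * 60 + float(field)
    return seconds


# One run per version, copied from the timings above
runs = {
    "CGSpace 5.8": "43:21.18",
    "Vanilla DSpace 5.10": "2:03:31.14",
    "Vanilla DSpace 6.4-SNAPSHOT": "3:36:53.98",
}

baseline = wall_clock_seconds(runs["CGSpace 5.8"])
for name, total in runs.items():
    secs = wall_clock_seconds(total)
    print(f"{name}: {secs:.0f}s ({secs / baseline:.1f}x the 5.8 run)")
</code></pre>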
<h2 id="2020-02-11">2020-02-11</h2>
<ul>
<li>Maria from Bioversity asked me to add some ORCID iDs to our controlled vocabulary so I combined them with our existing ones and updated the names from the ORCID API:</li>
<li>Then I noticed some author names had changed, so I captured the old and new names in a CSV file and fixed them using <code>fix-metadata-values.py</code>:</li>
<li>Minor updates to all Python utility scripts in the CGSpace git repository</li>
<li>Update the spider agent patterns in CGSpace <code>5_x-prod</code> branch from the latest <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> project
<ul>
<li>I ran the <code>check-spider-hits.sh</code> script with the updated file and purged 6,000 hits from our Solr statistics core on CGSpace</li>
</ul>
</li>
</ul>
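<p>Combining new ORCID iDs with an existing list, as described above, can be sketched in Python roughly like this. It only covers the extract, de-duplicate, and sort step, not the name lookup against the ORCID API. The function name <code>combine_orcid_ids</code> and the sample entries are placeholders, not the real vocabulary or the commands that were actually run.</p>
<pre><code class="language-python">import re

# ORCID iDs are four groups of four characters; the last one may be an X
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def combine_orcid_ids(*sources):
    """Return the sorted, de-duplicated ORCID iDs found in all sources."""
    found = set()
    for lines in sources:
        for line in lines:
            found.update(ORCID_PATTERN.findall(line))
    return sorted(found)


# Placeholder data: an existing "Name: iD" entry plus newly requested iDs
existing = ["Josiah Carberry: 0000-0002-1825-0097"]
requested = ["0000-0002-1825-0097", "https://orcid.org/0000-0001-2345-6789"]

for orcid in combine_orcid_ids(existing, requested):
    print(orcid)
</code></pre>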
<h2 id="2020-02-12">2020-02-12</h2>
<ul>
<li>Follow up with people about AReS funding for next phase</li>
<li>Peter asked about the “stats” and “summary” reports that he had requested in December
<ul>
<li>I opened a <a href="https://github.com/ilri/AReS/issues/13">new issue on AReS for the “summary” report</a></li>
</ul>
</li>
<li>Peter asked me to update John McIntire’s name format on CGSpace so I ran the following PostgreSQL query:</li>
</ul>
<pre><code>dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
UPDATE 26
</code></pre><h2 id="2020-02-17">2020-02-17</h2>
<ul>
<li>A few days ago Atmire responded to my question about DSpace 6.4-SNAPSHOT saying that they can only confirm that 6.3 works with their modules
<ul>
<li>I responded to say that we agree to target 6.3, but that I will cherry-pick important patches from the <code>dspace-6_x</code> branch at our own responsibility</li>
</ul>
</li>
<li>Send a message to dspace-devel asking them to tag DSpace 6.4</li>
<li>Udana from IWMI asked about the OAI base URL for their community on CGSpace
<ul>
<li>I think it should be this: <a href="https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=com_10568_16814">https://cgspace.cgiar.org/oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=com_10568_16814</a></li>
</ul>
</li>
</ul>
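<p>The OAI request above follows the usual OAI-PMH shape: a <code>verb</code>, a <code>metadataPrefix</code>, and a <code>set</code> naming the community. As a small sketch, the same URL can be assembled in Python for any other community; the function name <code>oai_list_records_url</code> is hypothetical, while the base URL and set name are the ones from the link above.</p>
<pre><code class="language-python">from urllib.parse import urlencode

# Base URL taken from the link above
OAI_BASE = "https://cgspace.cgiar.org/oai/request"


def oai_list_records_url(set_spec, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL restricted to one set."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    # urlencode keeps the dictionary order and escapes the values
    return OAI_BASE + "?" + urlencode(params)


print(oai_list_records_url("com_10568_16814"))
</code></pre>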
<h2 id="2020-02-19">2020-02-19</h2>
<ul>
<li>I noticed a thread on the mailing list about the Tomcat header size and Solr max boolean clauses error
<ul>
<li>The solution is to do as we have done and increase the headers / boolean clauses, or to simply <a href="https://wiki.lyrasis.org/display/DSPACE/TechnicalFaq#TechnicalFAQ-I'mgetting%22SolrException:BadRequest%22followedbyalongqueryora%22tooManyClauses%22Exception">disable access rights awareness</a> in Discovery</li>
<li>I applied the fix to the <code>5_x-prod</code> branch and cherry-picked it to <code>6_x-dev</code></li>
</ul>
</li>
<li>Upgrade Tomcat from 7.0.99 to 7.0.100 in <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>Upgrade PostgreSQL JDBC driver from 42.2.9 to 42.2.10 in <a href="https://github.com/ilri/rmg-ansible-public">Ansible infrastructure playbooks</a></li>
<li>Run Tomcat and PostgreSQL JDBC driver updates on DSpace Test (linode19)</li>
<li>I think this should be covered by the <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> patterns for the statistics at least…</li>
<li>I see some IP (186.32.217.255) in Costa Rica making requests like a bot with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
</code></pre><ul>
<li>Another IP address (31.6.77.23) in the UK making a few hundred requests without a user agent</li>
<li>I will add the IP addresses to the nginx badbots list</li>
<li>And removing the extra jfreechart library and restarting Tomcat I was able to load the usage statistics graph on DSpace Test…
<ul>
<li>Hmm, actually I think this is a Java bug, perhaps introduced or at <a href="https://bugs.openjdk.java.net/browse/JDK-8204862">least present in 18.04</a>, with lots of <a href="https://code-maven.com/slides/jenkins-intro/no-graph-error">references</a> to it <a href="https://issues.jenkins-ci.org/browse/JENKINS-39636">happening in other</a> configurations like Debian 9 with Jenkins, etc…</li>
<li>Apparently if you use the <em>non-headless</em> version of openjdk this doesn’t happen… but that pulls in X11 stuff so no thanks</li>
<li>Also, I see dozens of occurrences of this going back over one month (we have logs for about that period):</li>
</ul>
</li>
</ul>
<pre><code># grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
dspace.log.2020-01-12:4
dspace.log.2020-01-13:66
dspace.log.2020-01-14:4
dspace.log.2020-01-15:36
dspace.log.2020-01-16:88
dspace.log.2020-01-17:4
dspace.log.2020-01-18:4
dspace.log.2020-01-19:4
dspace.log.2020-01-20:4
dspace.log.2020-01-21:4
...
</code></pre><ul>
<li>I deployed the fix on CGSpace (linode18) and I was able to see the graphs in the Atmire CUA Usage Statistics…</li>
<li>On an unrelated note, there is something weird going on: I see millions of hits from IP 34.218.226.147 in Solr statistics. If I remember correctly that IP belongs to CodeObia’s AReS explorer, which should only be using REST and therefore should not generate any Solr statistics…?</li>
<li>But when I look at the nginx access logs for the past month or so I only see 84,000, all of which are on <code>/rest</code> and none of which are to XMLUI:</li>
<li>A REST request with <code>limit=50</code> will make exactly fifty <code>statistics_type=view</code> statistics in the Solr core… fuck.
<ul>
<li>So not only do I need to purge all these millions of hits, we need to add these IPs to the list of spider IPs so they don’t get recorded</li>
</ul>
</li>
</ul>
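<p>Checking which parts of the site a single IP is hitting, as with the AReS harvester above, can be sketched in Python like this. It assumes nginx’s default combined log format; the function name <code>count_paths_for_ip</code> is hypothetical and the sample log lines are made-up placeholders, not real entries from CGSpace.</p>
<pre><code class="language-python">import re
from collections import Counter

# Start of a combined-format line: client IP, then the quoted request line
# with the HTTP method and the first segment of the path
LINE_PATTERN = re.compile(r'^(\S+) .*?"[A-Z]+ (/[^/ ?]*)')


def count_paths_for_ip(log_lines, ip):
    """Count requests from one IP, grouped by the first path segment."""
    counts = Counter()
    for line in log_lines:
        match = LINE_PATTERN.match(line)
        if match and match.group(1) == ip:
            counts[match.group(2)] += 1
    return counts


# Placeholder log lines in the combined format
sample = [
    '34.218.226.147 - - [19/Feb/2020:10:00:00 +0100] "GET /rest/items?limit=50 HTTP/1.1" 200 512 "-" "-"',
    '34.218.226.147 - - [19/Feb/2020:10:00:05 +0100] "GET /rest/items?limit=50 HTTP/1.1" 200 512 "-" "-"',
    '31.6.77.23 - - [19/Feb/2020:10:00:07 +0100] "GET /handle/10568/1 HTTP/1.1" 200 2048 "-" "-"',
]

for path, hits in count_paths_for_ip(sample, "34.218.226.147").items():
    print(path, hits)
</code></pre>
<p>Given the observation above that a REST request with <code>limit=50</code> creates fifty view statistics, the hits recorded in Solr for such a client can far exceed its request count in the nginx logs.</p>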
<h2 id="2020-02-24">2020-02-24</h2>
<ul>
<li>I tried to add some IPs to the DSpace spider list so they would not get recorded in Solr statistics, but it doesn’t support IPv6
<ul>
<li>A better method is actually to just use the nginx mapping logic we already have to reset the user agent for these requests to “bot”</li>
<li>That, or to really insist that users harvesting us specify some kind of user agent</li>
</ul>
</li>
<li>I tried to add the IPs to our nginx IP bot mapping but it doesn’t seem to work… WTF, why is everything broken?!</li>
<li>Oh lord have mercy, the two AReS harvester IPs alone are responsible for 42 MILLION hits in 2019 and 2020 so far by themselves:</li>
<li>I modified my <code>check-spider-hits.sh</code> script to create a version that works with IPs and purged 47 million stats from Solr on CGSpace:</li>
</ul>
<pre><code>Purging 5535399 hits from 34.218.226.147 in statistics-2018
Total number of bot hits purged: 5535399
</code></pre><ul>
<li>(The <code>statistics</code> core holds 2019 and 2020 stats, because the yearly sharding process failed this year)</li>
<li>Attached is a before and after of the period from 2019-01 to 2020-02:</li>
</ul>
<p><img src="/cgspace-notes/2020/02/cgspace-stats-before.png" alt="CGSpace stats for 2019 and 2020 before the purge"></p>
<p><img src="/cgspace-notes/2020/02/cgspace-stats-after.png" alt="CGSpace stats for 2019 and 2020 after the purge"></p>
<ul>
<li>And here is a graph of the stats by year since 2011:</li>
</ul>
<p><img src="/cgspace-notes/2020/02/cgspace-stats-years.png" alt="CGSpace stats by year since 2011 after the purge"></p>
<ul>
<li>I’m a little suspicious of the 2012, 2013, and 2014 numbers, though
<ul>
<li>I should facet those years by IP and see if any stand out…</li>
</ul>
</li>
<li>The next thing I need to do is figure out why the nginx IP to bot mapping isn’t working…
<ul>
<li>Actually (and I’ve probably learned this before), the bot mapping is working, but nginx only logs the real user agent (of course!), as I’m only using the mapped one in the proxy pass…</li>
<li>This trick for adding a header with the mapped “ua” variable is nice:</li>
</ul>
</li>
</ul>
<pre><code>add_header X-debug-message "ua is $ua" always;
</code></pre><ul>
<li>Then in the HTTP response you see:</li>
</ul>
<pre><code>X-debug-message: ua is bot
</code></pre><ul>
<li>So the IP to bot mapping is working, phew.</li>
<li>More bad news, I checked the remaining IPs in our existing bot IP mapping, and there are statistics registered for them!
<ul>
<li>For example, ciat.cgiar.org was previously 104.196.152.243, but it is now 35.237.175.180, which I had noticed as a “mystery” client on Google Cloud in 2018-09</li>
<li>Others I should probably add to the nginx bot map list are:
<ul>
<li>wle.cgiar.org (70.32.90.172)</li>
<li>ccafs.cgiar.org (205.186.128.185)</li>
<li>another CIAT scraper using the PHP GuzzleHttp library (45.5.184.72)</li>
</ul>
</li>
</ul>
</li>
<li>These IPs are all active in the REST API logs over the last few months and they account for <em>thirty-four million</em> more hits in the statistics!</li>
<li>I looked in the REST API logs for the past month and found a few more IPs:
<ul>
<li>95.110.154.135 (BioversityBot)</li>
<li>34.209.213.122 (IITA? bot)</li>
</ul>
</li>
<li>The client at 3.225.28.105 is using the following user agent:</li>
</ul>
<pre><code>Apache-HttpClient/4.3.4 (java 1.5)
</code></pre><ul>
<li>But I don’t see any hits for it in the statistics core for some reason</li>
<li>Looking more into the 2015 statistics I see some questionable IPs:
<ul>
<li>50.115.121.196 has a DNS of saltlakecity2tr.monitis.com</li>
<li>70.32.99.142 has userAgent Drupal</li>
<li>104.130.164.111 was some scraper on Rackspace.com that made ~30,000 requests per month</li>
<li>45.56.65.158 was some scraper on Linode that made ~30,000 requests per month</li>
<li>23.97.198.40 was some scraper with an IP owned by Microsoft that made ~4,000 requests per month and had no user agent</li>
<li>180.76.15.6 and <em>dozens</em> of other IPs with DNS like baiduspider-180-76-15-6.crawl.baidu.com. (and they were using a Mozilla/5.0 user agent!)</li>
</ul>
</li>
<li>For the IPs I purged them using <code>check-spider-ip-hits.sh</code>:</li>
</ul>
<pre><code>Purging 462 hits from 70.32.99.142 in statistics-2014
Purging 1766 hits from 50.115.121.196 in statistics-2014
Total number of bot hits purged: 2228
</code></pre><ul>
<li>Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries because they didn’t have a proper user agent and the only way to identify them was via DNS:</li>
</ul>
<pre><code>Purging 1 hits from 143.233.242.130 in statistics-2015
Purging 14109 hits from 83.103.94.48 in statistics-2015
Total number of bot hits purged: 14110
</code></pre><ul>
<li>Though looking in my REST logs for the last month I am second guessing my judgement on 45.5.186.2 because I see user agents like “Microsoft Office Word 2014”</li>
<li>Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:</li>
</ul>
<pre><code>"LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)"
"EventMachine HttpClient"
</code></pre><ul>
<li>I should definitely add HttpClient to the bot user agents…</li>
<li>Also, while <code>bot</code>, <code>spider</code>, and <code>crawl</code> are in the pattern list already and can be used for case-insensitive matching when used by DSpace in Java, I can’t do case-insensitive matching in Solr with <code>check-spider-hits.sh</code>
<ul>
<li>I need to add <code>Bot</code>, <code>Spider</code>, and <code>Crawl</code> to my local user agent file to purge them</li>
<li>Also, I see lots of hits from “Indy Library”, which we’ve been blocking for a long time, but somehow these got through (I think it’s the Greek guys using Delphi)</li>
<li>Somehow my regex conversion isn’t working in check-spider-hits.sh, but “<em>Indy</em>” will work for now</li>
</ul>
</li>
<li>Purging just these case-sensitive patterns removed ~1 million more hits from 2011 to 2020</li>
<li>And what are the 950,000 hits from Online.net IPs with the following user agent:</li>
</ul>
<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I’m purging them all</li>
</ul>
<pre><code>IZaBEE/IZaBEE-1.01 (Buzzing Abound The Web; https://izabee.com; info at izabee dot com)
Twurly v1.1 (https://twurly.org)
okhttp/3.11.0
okhttp/3.10.0
Pattern/2.6 +http://www.clips.ua.ac.be/pattern
Link Check; EPrints 3.3.x;
CyotekWebCopy/1.7 CyotekHTTP/2.0
Adestra Link Checker: http://www.adestra.co.uk
HTTPie/1.0.2
</code></pre><ul>
<li>I notice that some of these would be matched by the COUNTER-Robots list when DSpace uses it in Java because there we have more robust (and case-insensitive) matching
<ul>
<li>I created a temporary file of some of the patterns and converted them to use capitalization so I could run them through <code>check-spider-hits.sh</code></li>
<li><a href="https://github.com/atmire/COUNTER-Robots/pull/34">Add new bots</a></li>
</ul>
</li>
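<li>The difference between the case-insensitive matching in Java and the case-sensitive patterns needed for the purge script can be illustrated with a small Python sketch. It is not how <code>check-spider-hits.sh</code> or DSpace work internally; the helper names <code>capitalized_variants</code> and <code>matches_any</code> are hypothetical, and the capitalization trick is only reasonable for plain-word patterns like the three below:
<pre><code class="language-python">import re


def capitalized_variants(patterns):
    """Return each pattern plus its capitalized form, without duplicates."""
    variants = []
    for pattern in patterns:
        # str.capitalize() is only safe for plain words, not real regexes
        for candidate in (pattern, pattern.capitalize()):
            if candidate not in variants:
                variants.append(candidate)
    return variants


def matches_any(user_agent, patterns, ignore_case=False):
    """Check whether any pattern is found in the user agent string."""
    flags = re.IGNORECASE if ignore_case else 0
    return any(re.search(p, user_agent, flags) for p in patterns)


patterns = ["bot", "spider", "crawl"]
agent = "LinkedInBot/1.0"

# Case-sensitive matching misses "Bot" unless a capitalized variant exists
print(matches_any(agent, patterns))
print(matches_any(agent, patterns, ignore_case=True))
print(matches_any(agent, capitalized_variants(patterns)))
</code></pre>
</li>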
<li>One benefit of all this is that the size of the statistics Solr core has reduced by 6GB since yesterday, though I can’t remember how big it was before that
<ul>
<li>I wonder if the sharding process would work now…</li>