diff --git a/content/posts/2020-10.md b/content/posts/2020-10.md
index 307c2bc6f..4ca416f68 100644
--- a/content/posts/2020-10.md
+++ b/content/posts/2020-10.md
@@ -215,4 +215,80 @@ $ http POST http://localhost:8080/rest/collections/1549/items rest-dspace-token:
 2. If the collection has a workflow the item will enter it and the API returns an item ID
 3. If the collection does not have a workflow then the item is committed to the archive and you get a Handle
+## 2020-10-09
+
+- Skype with Peter about AReS and CGSpace
+  - We discussed removing Atmire Listings and Reports from DSpace 6 because we can probably make the same reports in AReS, and this module is the one that is currently holding us back from the upgrade
+  - We discussed allowing partners to submit content via the REST API, and perhaps making it an extra fee due to the burden it incurs with unfinished submissions, manual duplicate checking, developer support, etc.
+  - He was excited about the possibility of using my statistics API for more things on AReS as well as on item view pages
+- Also I fixed a bunch of the CRP mappings in the AReS value mapper and started a fresh re-indexing
+
+## 2020-10-12
+
+- Looking at CGSpace's Solr statistics for 2020-09 I see:
+  - `RTB website BOT`: 212916
+  - `Java/1.8.0_66`: 3122
+  - `Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1`: 614
+  - `omgili/0.5 +http://omgili.com`: 272
+  - `Mozilla/5.0 (compatible; TrendsmapResolver/0.1)`: 199
+  - `Vizzit`: 160
+  - `Scoop.it`: 151
+- I'm confused because a pattern for `bot` has existed in the default DSpace spider agents file forever...
+  - I see 259,000 hits in CGSpace's 2020 Solr core when I search for this: `userAgent:/.*[Bb][Oo][Tt].*/`
+  - This includes 228,000 for `RTB website BOT` and 18,000 for `ILRI Livestock Website Publications importer BOT`
+  - I made a few requests to DSpace Test with the RTB user agent to see if they get logged or not:
+
+```
+$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
+$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
+$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
+$ http --print Hh https://dspacetest.cgiar.org/rest/bitstreams/dfa1d9c3-75d3-4380-a9d3-4c8cbbed2d21/retrieve User-Agent:"RTB website BOT"
+```
+
+- After a few minutes I saw these four hits in Solr... WTF
+  - So is there some issue with DSpace's parsing of the spider agent files?
+  - I added `RTB website BOT` to the ilri pattern file, restarted Tomcat, and made four more requests to the bitstream
+  - These four requests were recorded in Solr too, WTF!
+  - It seems like the patterns aren't working at all...
+  - I decided to try something drastic: I removed all pattern files, adding only one single pattern, `bot`, to make sure this is not a syntax or precedence issue
+  - Now even those four requests were recorded in Solr, WTF!
+  - I will try one last thing, putting a single entry with the exact pattern `RTB website BOT` in a single spider agents pattern file...
+  - Nope! Still records the hits... WTF
+  - As a last resort I tried to use the vanilla [DSpace 6 `example` file](https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace/config/spiders/agents/example)
+  - And the hits still get recorded... WTF
+  - So now I'm wondering if this is because of our custom Atmire shit?
+  - I will have to test on a vanilla DSpace instance I guess before I can complain to the dspace-tech mailing list
+- I re-factored the `check-spider-hits.sh` script to read patterns from a text file rather than sed's stdout, and to properly search for spaces in patterns that use `\s`, because Lucene's regex syntax doesn't support it (and literal spaces work just fine)
+  - Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html
+  - Reference: https://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Regexp_Searches
+- I added `[Ss]pider` to the Tomcat Crawler Session Manager Valve regex because this can catch a few more generic bots and force them to use the same Tomcat JSESSIONID
+- I added a few of the patterns from above to our local agents list and ran `check-spider-hits.sh` on CGSpace:
+
+```
+$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics -u http://localhost:8083/solr -p
+Purging 228916 hits from RTB website BOT in statistics
+Purging 18707 hits from ILRI Livestock Website Publications importer BOT in statistics
+Purging 2661 hits from ^Java\/[0-9]{1,2}.[0-9] in statistics
+Purging 199 hits from [Ss]pider in statistics
+Purging 2326 hits from ubermetrics in statistics
+Purging 888 hits from omgili\.com in statistics
+Purging 1888 hits from TrendsmapResolver in statistics
+Purging 3546 hits from Vizzit in statistics
+Purging 2127 hits from Scoop\.it in statistics
+
+Total number of bot hits purged: 261258
+$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics-2019 -u http://localhost:8083/solr -p
+Purging 2952 hits from TrendsmapResolver in statistics-2019
+Purging 4252 hits from Vizzit in statistics-2019
+Purging 2976 hits from Scoop\.it in statistics-2019
+
+Total number of bot hits purged: 10180
+$ ./check-spider-hits.sh -f dspace/config/spiders/agents/ilri -s statistics-2018 -u http://localhost:8083/solr -p
+Purging 1702 hits from TrendsmapResolver in statistics-2018
+Purging 1062 hits from Vizzit in statistics-2018
+Purging 920 hits from Scoop\.it in statistics-2018
+
+Total number of bot hits purged: 3684
+```
+
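The `userAgent` regex search above can be reproduced directly against the Solr statistics core with curl; this is just a minimal sketch using the standard Solr select handler, assuming Solr is listening on `http://localhost:8083/solr` as in the `check-spider-hits.sh` runs above (`--data-urlencode` keeps the regex's brackets and slashes intact):

```
$ curl -s 'http://localhost:8083/solr/statistics/select' \
    --data-urlencode 'q=userAgent:/.*[Bb][Oo][Tt].*/' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'wt=json'
$ curl -s 'http://localhost:8083/solr/statistics/select' \
    --data-urlencode 'q=userAgent:"RTB website BOT"' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'wt=json'
```

The `numFound` value in the JSON response is the hit count, so this is a quick way to check whether test requests were recorded without opening the Solr admin UI.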
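For reference, the core of a pattern check like `check-spider-hits.sh` can be sketched in a few lines of shell. This is not the actual script, just an illustration of reading one pattern per line from an agents file, swapping `\s` for a literal space (since Lucene regex syntax has no `\s`), and counting matches in a given statistics core:

```
#!/usr/bin/env bash
# Rough sketch only, not the real check-spider-hits.sh
solr_url="http://localhost:8083/solr"
core="statistics"

while read -r pattern; do
    # skip blank lines and comments in the agents file
    [[ -z $pattern || $pattern == \#* ]] && continue
    # Lucene regex has no \s, but a literal space matches fine
    pattern="${pattern//\\s/ }"
    # anchored patterns like ^Java\/[0-9]{1,2}.[0-9] would need extra handling here
    hits=$(curl -s "$solr_url/$core/select" \
        --data-urlencode "q=userAgent:/.*$pattern.*/" \
        --data-urlencode "rows=0" \
        --data-urlencode "wt=json" \
        | grep -oE '"numFound":[0-9]+' | cut -d: -f2)
    [[ -n $hits && $hits -gt 0 ]] && echo "Found $hits hits from $pattern in $core"
done < dspace/config/spiders/agents/ilri
```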
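The purging itself presumably boils down to a Solr delete-by-query on the `userAgent` field; a hedged example of what that might look like with plain curl (the exact request `check-spider-hits.sh` sends may differ):

```
$ curl -s 'http://localhost:8083/solr/statistics/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>userAgent:"RTB website BOT"</query></delete>'
```

Running the same request against the statistics-2019 and statistics-2018 cores would clear the yearly shards.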
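The Crawler Session Manager Valve mentioned above lives in Tomcat's server.xml; a minimal sketch of an entry with `[Ss]pider` appended to the stock `crawlerUserAgents` regex (the value actually deployed on CGSpace comes from the Ansible infrastructure scripts, so treat this as illustrative only):

```
<!-- Clients whose User-Agent matches the regex share a single session -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*|.*[Ss]pider.*"
       sessionInactiveInterval="60" />
```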