diff --git a/content/posts/2020-07.md b/content/posts/2020-07.md
index 57e88fa5e..6f8dd69b2 100644
--- a/content/posts/2020-07.md
+++ b/content/posts/2020-07.md
@@ -673,6 +673,11 @@ $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H
   - I started the statistics-2017 core and it finished in 3:44:15
   - I started the statistics-2016 core and it finished in 2:27:08
   - I started the statistics-2015 core and it finished in 1:07:38
+  - I started the statistics-2014 core and it finished in 1:45:44
+  - I started the statistics-2013 core and it finished in 1:41:50
+  - I started the statistics-2012 core and it finished in 1:23:36
+  - I started the statistics-2011 core and it finished in 0:39:37
+  - I started the statistics-2010 core and it finished in 0:01:46
 
 ## 2020-07-24
 
@@ -699,7 +704,7 @@ Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3
 - 54.214.112.202 made 839,000 requests with no user agent...
   - It is on Amazon Web Services (AWS) and made 100% `statistics_type:view` so I guess it was harvesting via the REST API
 - A few IPs owned by perfectip.net made 400,000 requests in 2018-01
-  - They are 2607:fa98:40:9:26b6:fdff:feff:195d and 2607:fa98:40:9:26b6:fdff:feff:1888 and 2607:fa98:40:9:26b6:fdff:feff:1c96
+  - They are 2607:fa98:40:9:26b6:fdff:feff:195d and 2607:fa98:40:9:26b6:fdff:feff:1888 and 2607:fa98:40:9:26b6:fdff:feff:1c96 and 70.36.107.49
   - All the requests used this user agent:
 
 ```
@@ -709,6 +714,8 @@ Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrom
 - Then there is 213.139.53.62 in 2018, which is on Orange Telecom Jordan, so it's definitely CodeObia / ICARDA and I will purge them
 - Jesus, and then there are 100,000 from the ILRI harvestor on Linode on 2a01:7e00::f03c:91ff:fe0a:d645
 - Jesus fuck there is 46.101.86.248 making 15,000 requests per month in 2018 with no user agent...
+- Jesus fuck there is 84.38.130.177 in Latvia that was making 75,000 requests in 2018-11 and 2018-10
+- Jesus fuck there is 104.198.9.108 on Google Cloud that was making 30,000 requests with no user agent
 - I will purge the hits from all the following IPs:
 
 ```
@@ -727,12 +734,16 @@ Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrom
 213.139.53.62
 2a01:7e00::f03c:91ff:fe0a:d645
 46.101.86.248
+54.214.112.202
+84.38.130.177
+104.198.9.108
+70.36.107.49
 ```
 
 - In total these accounted for the following amount of requests in each year:
   - 2020: 1436
-  - 2019: 933148
-  - 2018: 613936
+  - 2019: 960274
+  - 2018: 1588149
 - I noticed a few other user agents that should be purged too:
 
 ```
@@ -751,7 +762,7 @@ mailto\:team@impactstory\.org
 ```
 
 - I purged them from the stats too:
-  - 2020: 18153
+  - 2020: 19553
   - 2019: 29745
   - 2018: 18083
   - 2017: 19399
@@ -759,4 +770,58 @@ mailto\:team@impactstory\.org
   - 2015: 16659
   - 2014: 713
 
+## 2020-07-26
+
+- I continued with the Solr ID to UUID migrations (solr-upgrade-statistics-6x) from last week and updated my notes for each core above
+  - After all cores finished migrating I optimized them to delete old documents
+- Export some of the CGSpace Solr stats minus the Atmire CUA schema additions for Salem to play with:
+
+```
+$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
+```
+
+- Run system updates on DSpace Test (linode26) and reboot it
+- I looked into the umigrated Solr records more and they are overwhelmingly `type: 5` (which means "Site" according to the DSpace constants):
+  - statistics
+    - id: -1-unmigrated
+      - type 5: 167316
+    - id: 0-unmigrated
+      - type 5: 32581
+    - id: -1
+      - type 5: 10198
+  - statistics-2019
+    - id: -1
+      - type 5: 2690500
+    - id: -1-unmigrated
+      - type 5: 1348202
+    - id: 0-unmigrated
+      - type 5: 141576
+  - statistics-2018
+    - id: -1
+      - type 5: 365466
+    - id: -1-unmigrated
+      - type 5: 254680
+    - id: 0-unmigrated
+      - type 5: 204854
+    - 145870-unmigrated
+      - type 0: 83235
+  - statistics-2017
+    - id: -1
+      - type 5: 808346
+    - id: -1-unmigrated
+      - type 5: 598022
+    - id: 0-unmigrated
+      - type 5: 254014
+    - 145870-unmigrated
+      - type 0: 28168
+        - bundleName THUMBNAIL: 28168
+
+- There is another one appears in 2018 and 2017 at least of type 0, which would be download
+  - In that case the id is of a bitstream that no longer exists...?
+- I started processing Solr stats with the Atmire tool now:
+
+```
+$ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -c statistics -f -t 12
+```
+
diff --git a/docs/2020-07/index.html b/docs/2020-07/index.html
index 1f6cf52ac..f478d2753 100644
--- a/docs/2020-07/index.html
+++ b/docs/2020-07/index.html
@@ -20,7 +20,7 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
-
+
@@ -45,9 +45,9 @@ Since I was restarting Tomcat anyways I decided to redeploy the latest changes f
   "@type": "BlogPosting",
   "headline": "July, 2020",
   "url": "https://alanorth.github.io/cgspace-notes/2020-07/",
-  "wordCount": "4728",
+  "wordCount": "5045",
   "datePublished": "2020-07-01T10:53:54+03:00",
-  "dateModified": "2020-07-23T12:32:11+03:00",
+  "dateModified": "2020-07-24T23:23:15+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -807,6 +807,11 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
  • I started the statistics-2017 core and it finished in 3:44:15
  • I started the statistics-2016 core and it finished in 2:27:08
  • I started the statistics-2015 core and it finished in 1:07:38
+ • I started the statistics-2014 core and it finished in 1:45:44
+ • I started the statistics-2013 core and it finished in 1:41:50
+ • I started the statistics-2012 core and it finished in 1:23:36
+ • I started the statistics-2011 core and it finished in 0:39:37
+ • I started the statistics-2010 core and it finished in 0:01:46
@@ -845,7 +850,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
  • A few IPs owned by perfectip.net made 400,000 requests in 2018-01
@@ -857,6 +862,8 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
  • Then there is 213.139.53.62 in 2018, which is on Orange Telecom Jordan, so it’s definitely CodeObia / ICARDA and I will purge them
  • Jesus, and then there are 100,000 from the ILRI harvestor on Linode on 2a01:7e00::f03c:91ff:fe0a:d645
  • Jesus fuck there is 46.101.86.248 making 15,000 requests per month in 2018 with no user agent…
+ • Jesus fuck there is 84.38.130.177 in Latvia that was making 75,000 requests in 2018-11 and 2018-10
+ • Jesus fuck there is 104.198.9.108 on Google Cloud that was making 30,000 requests with no user agent
  • I will purge the hits from all the following IPs:
  • 192.157.89.4
@@ -874,12 +881,16 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
 213.139.53.62
 2a01:7e00::f03c:91ff:fe0a:d645
 46.101.86.248
+54.214.112.202
+84.38.130.177
+104.198.9.108
+70.36.107.49