Add notes for 2021-09-13
@@ -38,7 +38,7 @@ The code finally builds and runs with a fresh install
"/>
<meta name="generator" content="Hugo 0.88.1" />
@@ -138,11 +138,11 @@ The code finally builds and runs with a fresh install
<ul>
<li>Now we don’t specify the build environment because site modifications are in <code>local.cfg</code>, so we just build like this:</li>
</ul>
<pre tabindex="0"><code>$ schedtool -D -e ionice -c2 -n7 nice -n19 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false clean package
</code></pre><ul>
<li>And it seems that we need to enable <code>pgcrypto</code> now (used for UUIDs):</li>
</ul>
<pre tabindex="0"><code>$ psql -h localhost -U postgres dspace63
dspace63=# CREATE EXTENSION pgcrypto;
CREATE EXTENSION pgcrypto;
</code></pre><ul>
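<li>To double check that the extension is actually available you can list it and generate a test UUID (a quick sanity check, not from the original notes):</li>
</ul>
<pre tabindex="0"><code>dspace63=# \dx pgcrypto
dspace63=# SELECT gen_random_uuid();
</code></pre><ul>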
@@ -153,11 +153,11 @@ CREATE EXTENSION pgcrypto;
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace63=# DELETE FROM schema_version WHERE version IN ('5.0.2015.01.27', '5.6.2015.12.03.2', '5.6.2016.08.08', '5.0.2017.04.28', '5.0.2017.09.25', '5.8.2015.12.03.3');
</code></pre><ul>
<li>Then I ran <code>dspace database migrate</code> and got an error:</li>
</ul>
<pre tabindex="0"><code>$ ~/dspace63/bin/dspace database migrate

Database URL: jdbc:postgresql://localhost:5432/dspace63?ApplicationName=dspaceCli
Migrating database to latest version... (Check dspace logs for details)
@@ -225,7 +225,7 @@ Caused by: org.postgresql.util.PSQLException: ERROR: cannot drop table metadatav
<li>A thread on the dspace-tech mailing list regarding this migration noted that the poster’s database had some views that were using the <code>resource_id</code> column</li>
<li>Our database had the same issue, where the <code>eperson_metadata</code> view was created by something (an Atmire module?) but has no references in the vanilla DSpace code, so I dropped it and tried the migration again:</li>
</ul>
<pre tabindex="0"><code>dspace63=# DROP VIEW eperson_metadata;
DROP VIEW
</code></pre><ul>
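<li>A quick way to check whether any other views still reference that column is to query the standard <code>information_schema</code> (a generic sketch, not a command from the original notes):</li>
</ul>
<pre tabindex="0"><code>dspace63=# SELECT DISTINCT view_name FROM information_schema.view_column_usage WHERE column_name = 'resource_id';
</code></pre><ul>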
<li>After that the migration was successful and DSpace starts up and begins indexing
@@ -252,7 +252,7 @@ DROP VIEW
</li>
<li>There are lots of errors in the DSpace log, which might explain some of the issues with recent submissions / Solr:</li>
</ul>
<pre tabindex="0"><code>2020-02-03 10:27:14,485 ERROR org.dspace.browse.ItemCountDAOSolr @ caught exception:
org.dspace.discovery.SearchServiceException: Invalid UUID string: 1
2020-02-03 13:20:20,475 ERROR org.dspace.app.xmlui.aspect.discovery.AbstractRecentSubmissionTransformer @ Caught SearchServiceException while retrieving recent submission for: home page
org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
@@ -260,11 +260,11 @@ org.dspace.discovery.SearchServiceException: Invalid UUID string: 111210
<li>If I look in Solr’s search core I do actually see items with integers for their resource ID, which I think are all supposed to be UUIDs now…</li>
<li>I dropped all the documents in the search core:</li>
</ul>
<pre tabindex="0"><code>$ http --print b 'http://localhost:8080/solr/search/update?stream.body=<delete><query>*:*</query></delete>&commit=true'
</code></pre><ul>
<li>Still didn’t work, so I’m going to try a clean database import and migration:</li>
</ul>
<pre tabindex="0"><code>$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspace63
$ psql -h localhost -U postgres -c 'alter user dspacetest superuser;'
$ pg_restore -h localhost -U postgres -d dspace63 -O --role=dspacetest -h localhost dspace_2020-01-27.backup
$ psql -h localhost -U postgres -c 'alter user dspacetest nosuperuser;'
@@ -301,7 +301,7 @@ $ ~/dspace63/bin/dspace database migrate
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ git checkout -b 6_x-dev64 6_x-dev
$ git rebase -i upstream/dspace-6_x
</code></pre><ul>
<li>I finally understand why our themes show all the “Browse by” buttons on community and collection pages in DSpace 6.x
@@ -321,7 +321,7 @@ $ git rebase -i upstream/dspace-6_x
<li>UptimeRobot told me that AReS Explorer crashed last night, so I logged into it, ran all updates, and rebooted it</li>
<li>Testing Discovery indexing speed on my local DSpace 6.3:</li>
</ul>
<pre tabindex="0"><code>$ time schedtool -D -e ~/dspace63/bin/dspace index-discovery -b
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3771.78s user 93.63s system 41% cpu 2:34:19.53 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3360.28s user 82.63s system 38% cpu 2:30:22.07 total
schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 4678.72s user 138.87s system 42% cpu 3:08:35.72 total
@@ -329,7 +329,7 @@ schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 3334.19s user 86.54s s
</code></pre><ul>
<li>DSpace 5.8 was taking about 1 hour (or less on this laptop), so this is 2-3 times longer!</li>
</ul>
<pre tabindex="0"><code>$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 299.53s user 69.67s system 20% cpu 30:34.47 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 270.31s user 69.88s system 19% cpu 29:01.38 total
</code></pre><ul>
@@ -360,7 +360,7 @@ schedtool -D -e ~/dspace/bin/dspace index-discovery -b 270.31s user 69.88s syst
<li>I sent a mail to the dspace-tech mailing list asking about slow Discovery indexing speed in DSpace 6</li>
<li>I destroyed my PostgreSQL 9.6 containers and re-created them using PostgreSQL 10 to see if there are any speedups with DSpace 6.x:</li>
</ul>
<pre tabindex="0"><code>$ podman pull postgres:10-alpine
$ podman run --name dspacedb10 -v dspacedb_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres:10-alpine
$ createuser -h localhost -U postgres --pwprompt dspacetest
$ createdb -h localhost -U postgres -O dspacetest --encoding=UNICODE dspacetest
@@ -379,29 +379,29 @@ dspace63=# \q
</code></pre><ul>
<li>I purged ~33,000 hits from the “Jersey/2.6” bot in CGSpace’s statistics using my <code>check-spider-hits.sh</code> script:</li>
</ul>
<pre tabindex="0"><code>$ ./check-spider-hits.sh -d -p -f /tmp/jersey -s statistics -u http://localhost:8081/solr
$ for year in 2018 2017 2016 2015; do ./check-spider-hits.sh -d -p -f /tmp/jersey -s "statistics-${year}" -u http://localhost:8081/solr; done
</code></pre><ul>
<li>I noticed another user agent in the logs that we should add to the list:</li>
</ul>
<pre tabindex="0"><code>ReactorNetty/0.9.2.RELEASE
</code></pre><ul>
<li>I made <a href="https://github.com/atmire/COUNTER-Robots/issues/31">an issue on the COUNTER-Robots repository</a></li>
<li>I found a <a href="https://github.com/freedev/solr-import-export-json">nice tool for exporting and importing Solr records</a> and it seems to work for exporting our 2019 stats from the large statistics core!</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01.json -f 'dateYearMonth:2019-01' -k uid
$ ls -lh /tmp/statistics-2019-01.json
-rw-rw-r-- 1 aorth aorth 3.7G Feb 6 09:26 /tmp/statistics-2019-01.json
</code></pre><ul>
<li>Then I tested importing this by creating a new core in my development environment:</li>
</ul>
<pre tabindex="0"><code>$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace/solr/statistics&dataDir=/home/aorth/dspace/solr/statistics-2019/data'
$ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Downloads/statistics-2019-01.json -k uid
</code></pre><ul>
<li>This imports the records into the core, but DSpace can’t see them, and when I restart Tomcat the core is not seen by Solr…</li>
<li>I got the core to load by adding it to <code>dspace/solr/solr.xml</code> manually, ie:</li>
</ul>
<pre tabindex="0"><code> <cores adminPath="/admin/cores">
...
<core name="statistics" instanceDir="statistics" />
<core name="statistics-2019" instanceDir="statistics">
@@ -415,11 +415,11 @@ $ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Download
<li>Just for fun I tried to load these stats into a Solr 7.7.2 instance using the DSpace 7 solr config:</li>
<li>First, create a Solr statistics core using the DSpace 7 config:</li>
</ul>
<pre tabindex="0"><code>$ ./bin/solr create_core -c statistics -d ~/src/git/DSpace/dspace/solr/statistics/conf -p 8983
</code></pre><ul>
<li>Then try to import the stats, skipping a shitload of fields that are apparently added to our Solr statistics by Atmire modules:</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8983/solr/statistics -a import -o ~/Downloads/statistics-2019-01.json -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
</code></pre><ul>
<li>OK that imported! I wonder if it works… maybe I’ll try another day</li>
</ul>
@@ -433,7 +433,7 @@ $ ./run.sh -s http://localhost:8080/solr/statistics-2019 -a import -o ~/Download
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ cd ~/src/git/perf-map-agent
$ cmake .
$ make
$ ./bin/create-links-in ~/.local/bin
@@ -467,7 +467,7 @@ $ perf-java-flames 11359
<ul>
<li>This weekend I did a lot more testing of indexing performance with our DSpace 5.8 branch, vanilla DSpace 5.10, and vanilla DSpace 6.4-SNAPSHOT:</li>
</ul>
<pre tabindex="0"><code># CGSpace 5.8
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 385.72s user 131.16s system 19% cpu 43:21.18 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 382.95s user 127.31s system 20% cpu 42:10.07 total
schedtool -D -e ~/dspace/bin/dspace index-discovery -b 368.56s user 143.97s system 20% cpu 42:22.66 total
@@ -483,7 +483,7 @@ schedtool -D -e ~/dspace63/bin/dspace index-discovery -b 5112.96s user 127.80s
</code></pre><ul>
<li>I generated better flame graphs for the DSpace indexing process by using <code>perf-record-stack</code> and filtering for the java process:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
$ export PERF_RECORD_SECONDS=60
$ export JAVA_OPTS="-XX:+PreserveFramePointer"
$ time schedtool -D -e ~/dspace/bin/dspace index-discovery -b &
@@ -525,14 +525,14 @@ $ cat out.dspace510-1 | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' |
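<ul>
<li>The rest of the FlameGraph pipeline is roughly the standard one; a generic sketch with illustrative file names, not the exact command from these notes:</li>
</ul>
<pre tabindex="0"><code>$ perf script -i perf.data > out.stacks
$ cat out.stacks | ../FlameGraph/stackcollapse-perf.pl | grep -E '^java' | ../FlameGraph/flamegraph.pl > flames-dspace.svg
</code></pre>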
<ul>
<li>Maria from Bioversity asked me to add some ORCID iDs to our controlled vocabulary so I combined them with our existing ones and updated the names from the ORCID API:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcid-ids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-02-11-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-02-11-combined-orcids.txt -o /tmp/2020-02-11-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
</code></pre><ul>
<li>Then I noticed some author names had changed, so I captured the old and new names in a CSV file and fixed them using <code>fix-metadata-values.py</code>:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-02-11-correct-orcid-ids.csv -db dspace -u dspace -p 'fuuu' -f cg.creator.id -t correct -m 240 -d
</code></pre><ul>
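<li>The input CSV presumably just pairs each existing value with its correction, in columns matching the <code>-f</code> and <code>-t</code> options; something like this purely illustrative example:</li>
</ul>
<pre tabindex="0"><code>cg.creator.id,correct
"robert fungo: 0000-0002-4264-6905","Robert Fungo: 0000-0002-4264-6905"
</code></pre><ul>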
<li>On a hunch I decided to try to add these ORCID iDs to existing items that might not have them yet
<ul>
@@ -540,7 +540,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</ul>
</li>
</ul>
<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
"Staver, Charles",charles staver: 0000-0002-4532-6077
"Staver, C.",charles staver: 0000-0002-4532-6077
"Fungo, R.",Robert Fungo: 0000-0002-4264-6905
@@ -556,7 +556,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</code></pre><ul>
<li>Running the <code>add-orcid-identifiers-csv.py</code> script I added 144 ORCID iDs to items on CGSpace!</li>
</ul>
<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-02-11-add-orcid-ids.csv -db dspace -u dspace -p 'fuuu'
</code></pre><ul>
<li>Minor updates to all Python utility scripts in the CGSpace git repository</li>
<li>Update the spider agent patterns in CGSpace <code>5_x-prod</code> branch from the latest <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> project (see the sketch after this item)
@@ -575,7 +575,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
</li>
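<li>The COUNTER-Robots project publishes its patterns as JSON, so refreshing a local patterns file is roughly the following; the exact file name and path in that repository are assumptions here:</li>
</ul>
<pre tabindex="0"><code>$ curl -s https://raw.githubusercontent.com/atmire/COUNTER-Robots/master/COUNTER_Robots_list.json | jq -r '.[].pattern' > /tmp/counter-robots-patterns.txt
</code></pre><ul>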
<li>Peter asked me to update John McIntire’s name format on CGSpace so I ran the following PostgreSQL query:</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value='McIntire, John M.' WHERE resource_type_id=2 AND metadata_field_id=3 AND text_value='McIntire, John';
UPDATE 26
</code></pre><h2 id="2020-02-17">2020-02-17</h2>
<ul>
@@ -607,12 +607,12 @@ UPDATE 26
<ul>
<li>I see a new spider in the nginx logs on CGSpace:</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (compatible;Linespider/1.1;+https://lin.ee/4dwXkTH)
</code></pre><ul>
<li>I think this should be covered by the <a href="https://github.com/atmire/COUNTER-Robots">COUNTER-Robots</a> patterns for the statistics at least…</li>
<li>I see some IP (186.32.217.255) in Costa Rica making requests like a bot with the following user agent:</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36
</code></pre><ul>
<li>Another IP address (31.6.77.23) in the UK is making a few hundred requests without a user agent</li>
<li>I will add the IP addresses to the nginx badbots list, roughly as sketched below</li>
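<li>A minimal sketch of that kind of IP-to-bot mapping in nginx, assuming a <code>map</code> on <code>$remote_addr</code> whose result is passed upstream as the user agent (the details here are illustrative, not our exact configuration):</li>
</ul>
<pre tabindex="0"><code># in the http {} context
map $remote_addr $ua {
    default            $http_user_agent;
    31.6.77.23         'bot';
    186.32.217.255     'bot';
}

# in the proxy configuration
proxy_set_header User-Agent $ua;
</code></pre><ul>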
@@ -622,7 +622,7 @@ UPDATE 26
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=dns:/squeeze3.bronco.co.uk./&rows=0"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">4</int><lst name="params"><str name="q">dns:/squeeze3.bronco.co.uk./</str><str name="rows">0</str></lst></lst><result name="response" numFound="86044" start="0"></result>
@@ -641,7 +641,7 @@ UPDATE 26
</li>
<li>I will purge them from each core one by one, ie:</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
$ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:squeeze3.bronco.co.uk.</query></delete>"
</code></pre><ul>
<li>Deploy latest Tomcat and PostgreSQL JDBC driver changes on CGSpace (linode18)</li>
@@ -654,12 +654,12 @@ $ curl -s "http://localhost:8081/solr/statistics-2014/update?softCommit=tru
</li>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(183996) is still referenced from table "bundle".
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code># su - postgres
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (183996);'
UPDATE 1
</code></pre><ul>
@@ -671,7 +671,7 @@ UPDATE 1
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ dspace user -a -m wow@me.com -g Felix -s Shaw -p 'fuananaaa'
</code></pre><ul>
<li>For some reason the Atmire Content and Usage Analysis (CUA) module’s Usage Statistics is drawing blank graphs
<ul>
@@ -679,7 +679,7 @@ UPDATE 1
</ul>
</li>
</ul>
<pre tabindex="0"><code>2020-02-23 11:28:13,696 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.NoClassDefFoundError: Could not
initialize class org.jfree.chart.JFreeChart
</code></pre><ul>
@@ -694,11 +694,11 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
</li>
<li>I copied the <code>jfreechart-1.0.5.jar</code> file to the Tomcat lib folder and then there was a different error when I loaded Atmire CUA:</li>
</ul>
<pre tabindex="0"><code>2020-02-23 16:25:10,841 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request! org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.awt.AWTError: Assistive Technology not found: org.GNOME.Accessibility.AtkWrapper
</code></pre><ul>
<li>Some search results suggested commenting out the following line in <code>/etc/java-8-openjdk/accessibility.properties</code>:</li>
</ul>
<pre tabindex="0"><code>assistive_technologies=org.GNOME.Accessibility.AtkWrapper
</code></pre><ul>
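<li>One way to comment that line out non-interactively, assuming the stock OpenJDK 8 path shown above (a sketch, not from the original notes):</li>
</ul>
<pre tabindex="0"><code># sed -i 's/^assistive_technologies=/#assistive_technologies=/' /etc/java-8-openjdk/accessibility.properties
</code></pre><ul>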
<li>After also removing the extra jfreechart library and restarting Tomcat I was able to load the usage statistics graph on DSpace Test…
<ul>
@@ -708,7 +708,7 @@ org.springframework.web.util.NestedServletException: Handler processing failed;
</ul>
</li>
</ul>
<pre tabindex="0"><code># grep -c 'initialize class org.jfree.chart.JFreeChart' dspace.log.2020-0*
dspace.log.2020-01-12:4
dspace.log.2020-01-13:66
dspace.log.2020-01-14:4
@@ -724,7 +724,7 @@ dspace.log.2020-01-21:4
<li>I deployed the fix on CGSpace (linode18) and I was able to see the graphs in the Atmire CUA Usage Statistics…</li>
<li>On an unrelated note there is something weird going on: I see millions of hits from IP 34.218.226.147 in the Solr statistics, but if I remember correctly that IP belongs to CodeObia’s AReS explorer, which should only be using the REST API and therefore should not be registering any Solr statistics…?</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2018/select" -d "q=ip:34.218.226.147&rows=0"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">811</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="rows">0</str></lst></lst><result name="response" numFound="5536097" start="0"></result>
@@ -732,7 +732,7 @@ dspace.log.2020-01-21:4
</code></pre><ul>
<li>And there are apparently two million from last month (2020-01):</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=ip:34.218.226.147&fq=dateYearMonth:2020-01&rows=0"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">248</int><lst name="params"><str name="q">ip:34.218.226.147</str><str name="fq">dateYearMonth:2020-01</str><str name="rows">0</str></lst></lst><result name="response" numFound="2173455" start="0"></result>
@@ -740,7 +740,7 @@ dspace.log.2020-01-21:4
</code></pre><ul>
<li>But when I look at the nginx access logs for the past month or so I only see about 84,000 requests from that IP, all of which are on <code>/rest</code> and none of which are to XMLUI:</li>
</ul>
<pre tabindex="0"><code># zcat /var/log/nginx/*.log.*.gz | grep -c 34.218.226.147
84322
# zcat /var/log/nginx/*.log.*.gz | grep 34.218.226.147 | grep -c '/rest'
84322
@@ -758,7 +758,7 @@ dspace.log.2020-01-21:4
</li>
<li>Anyways, I faceted by IP in 2020-01 and see:</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=*:*&fq=dateYearMonth:2020-01&rows=0&wt=json&indent=true&facet=true&facet.field=ip'
...
"172.104.229.92",2686876,
"34.218.226.147",2173455,
@@ -769,19 +769,19 @@ dspace.log.2020-01-21:4
<li>Surprise surprise, the top two IPs are from AReS servers… wtf.</li>
<li>The next three are from Online.net in France and they are all using this weird user agent and making tens of thousands of requests to Discovery:</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>And all the same three are already inflating the statistics for 2020-02… hmmm.</li>
<li>I need to see why AReS harvesting is inflating the stats, as it should only be making REST requests…</li>
<li>Shiiiiit, I see 84,000 requests from the AReS IP today alone:</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-22*+AND+ip:172.104.229.92&rows=0&wt=json&indent=true'
...
"response":{"numFound":84594,"start":0,"docs":[]
</code></pre><ul>
<li>Fuck! And of course the ILRI websites doing their daily REST harvesting are causing issues too, from today alone:</li>
</ul>
<pre tabindex="0"><code> "2a01:7e00::f03c:91ff:fe9a:3a37",35512,
 "2a01:7e00::f03c:91ff:fe18:7396",26155,
</code></pre><ul>
<li>I need to try to make some requests for these URLs and observe if they make a statistics hit:
@@ -793,7 +793,7 @@ dspace.log.2020-01-21:4
<li>Those are the requests AReS and ILRI servers are making… nearly 150,000 per day!</li>
<li>Well that settles it!</li>
</ul>
<pre tabindex="0"><code>$ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+statistics_type:view&fq=ip:78.128.99.24&rows=10&wt=json&indent=true' | grep numFound
"response":{"numFound":12,"start":0,"docs":[
$ curl -s 'https://dspacetest.cgiar.org/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=50&offset=82450'
$ curl -s 'http://localhost:8081/solr/statistics/update?softCommit=true'
@@ -817,12 +817,12 @@ $ curl -s 'http://localhost:8081/solr/statistics/select?q=time:2020-02-23*+AND+s
<li>I tried to add the IPs to our nginx IP bot mapping but it doesn’t seem to work… WTF, why is everything broken?!</li>
<li>Oh lord have mercy, the two AReS harvester IPs alone are responsible for 42 MILLION hits in 2019 and 2020 so far by themselves:</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:8081/solr/statistics/select?q=ip:34.218.226.147+OR+ip:172.104.229.92&rows=0&wt=json&indent=true' | grep numFound
"response":{"numFound":42395486,"start":0,"docs":[]
</code></pre><ul>
<li>I modified my <code>check-spider-hits.sh</code> script to create a version that works with IPs and purged 47 million stats from Solr on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f 2020-02-24-bot-ips.txt -s statistics -p
Purging 22809216 hits from 34.218.226.147 in statistics
Purging 19586270 hits from 172.104.229.92 in statistics
Purging 111137 hits from 2a01:7e00::f03c:91ff:fe9a:3a37 in statistics
@@ -856,11 +856,11 @@ Total number of bot hits purged: 5535399
</ul>
</li>
</ul>
<pre tabindex="0"><code>add_header X-debug-message "ua is $ua" always;
</code></pre><ul>
<li>Then in the HTTP response you see:</li>
</ul>
<pre tabindex="0"><code>X-debug-message: ua is bot
</code></pre><ul>
<li>So the IP to bot mapping is working, phew.</li>
<li>More bad news: I checked the remaining IPs in our existing bot IP mapping, and there are statistics registered for them!
@@ -880,7 +880,7 @@ Total number of bot hits purged: 5535399
<li>These IPs are all active in the REST API logs over the last few months and they account for <em>thirty-four million</em> more hits in the statistics!</li>
<li>I purged them from CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 15 hits from 104.196.152.243 in statistics
Purging 61064 hits from 35.237.175.180 in statistics
Purging 1378 hits from 70.32.90.172 in statistics
@@ -910,7 +910,7 @@ Total number of bot hits purged: 1752548
</li>
<li>The client at 3.225.28.105 is using the following user agent:</li>
</ul>
<pre tabindex="0"><code>Apache-HttpClient/4.3.4 (java 1.5)
</code></pre><ul>
<li>But I don’t see any hits for it in the statistics core for some reason</li>
<li>Looking more into the 2015 statistics I see some questionable IPs:
@@ -925,7 +925,7 @@ Total number of bot hits purged: 1752548
</li>
<li>I purged the hits from those IPs using <code>check-spider-ip-hits.sh</code>:</li>
</ul>
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 11478 hits from 95.110.154.135 in statistics
Purging 1208 hits from 34.209.213.122 in statistics
Purging 10 hits from 54.184.39.242 in statistics
@@ -966,7 +966,7 @@ Total number of bot hits purged: 2228
</code></pre><ul>
<li>Then I purged about 200,000 Baidu hits from the 2015 to 2019 statistics cores with a few manual delete queries, because they didn’t have a proper user agent and the only way to identify them was via DNS:</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:*crawl.baidu.com.</query></delete>"
</code></pre><ul>
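<li>The same delete can be repeated over the yearly cores with a small loop, for example (a sketch based on the command above; the exact list of core names is an assumption):</li>
</ul>
<pre tabindex="0"><code>$ for core in statistics statistics-2015 statistics-2016 statistics-2017 statistics-2018; do curl -s "http://localhost:8081/solr/${core}/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>dns:*crawl.baidu.com.</query></delete>"; done
</code></pre><ul>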
<li>Jesus, the more I keep looking, the more I see ridiculous stuff…</li>
<li>In 2019 there were a few hundred thousand requests from CodeObia on the Orange Jordan network…
@@ -982,7 +982,7 @@ Total number of bot hits purged: 2228
<li>Also I see some IP in Greece making 130,000 requests with weird user agents: 143.233.242.130</li>
<li>I purged a bunch more from all cores:</li>
</ul>
<pre tabindex="0"><code>$ ./check-spider-ip-hits.sh -u http://localhost:8081/solr -f /tmp/ips.txt -s statistics -p
Purging 109965 hits from 45.5.186.2 in statistics
Purging 78648 hits from 79.173.222.114 in statistics
Purging 49032 hits from 149.200.141.57 in statistics
@@ -1024,7 +1024,7 @@ Total number of bot hits purged: 14110
<li>Though looking in my REST logs for the last month I am second-guessing my judgement on 45.5.186.2 because I see user agents like “Microsoft Office Word 2014”</li>
<li>Actually no, the overwhelming majority of these are coming from something harvesting the REST API with no user agent:</li>
</ul>
<pre tabindex="0"><code># zgrep 45.5.186.2 /var/log/nginx/rest.log.[1234]* | awk -F\" '{print $6}' | sort | uniq -c | sort -h
1 Microsoft Office Word 2014
1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Win64; x64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; ms-office)
1 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
@@ -1038,7 +1038,7 @@ Total number of bot hits purged: 14110
</code></pre><ul>
<li>I see lots of requests coming from the following user agents:</li>
</ul>
<pre tabindex="0"><code>"Apache-HttpClient/4.5.7 (Java/11.0.3)"
"Apache-HttpClient/4.5.7 (Java/11.0.2)"
"LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3 +http://www.linkedin.com)"
"EventMachine HttpClient"
@@ -1054,7 +1054,7 @@ Total number of bot hits purged: 14110
</li>
<li>More weird user agents in 2019:</li>
</ul>
<pre tabindex="0"><code>ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
EcoInternet http://www.ecointernet.org/
EcoInternet http://ecointernet.org/
@@ -1062,12 +1062,12 @@ EcoInternet http://ecointernet.org/
<ul>
<li>And what about the 950,000 hits from Online.net IPs with the following user agent?</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
</code></pre><ul>
<li>Over half of the requests were to Discover and Browse pages, and the rest were to actual item pages, but they were within seconds of each other, so I’m purging them all</li>
<li>I looked deeper in the Solr statistics and found a bunch more weird user agents:</li>
</ul>
<pre tabindex="0"><code>LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/4.3
EventMachine HttpClient
ecolink (+https://search.ecointernet.org/)
ecoweb (+https://search.ecointernet.org/)
@@ -1098,13 +1098,13 @@ HTTPie/1.0.2
</ul>
</li>
</ul>
<pre tabindex="0"><code>Link.?Check
Http.?Client
ecointernet
</code></pre><ul>
<li>That removes another 500,000 or so:</li>
</ul>
<pre tabindex="0"><code>$ ./check-spider-hits.sh -u http://localhost:8081/solr -f /tmp/agents -s statistics -p
Purging 253 hits from Jersey\/[0-9] in statistics
Purging 7302 hits from Link.?Check in statistics
Purging 85574 hits from Http.?Client in statistics
@@ -1171,12 +1171,12 @@ Total number of bot hits purged: 159
</li>
<li>I tried to migrate the 2019 Solr statistics again on CGSpace because the automatic sharding failed last month:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx512m"
$ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-util.log.$(date --iso-8601)
</code></pre><ul>
<li>Interestingly I saw this in the Solr log:</li>
</ul>
<pre tabindex="0"><code>2020-02-26 08:55:47,433 INFO org.apache.solr.core.SolrCore @ [statistics-2019] Opening new SolrCore at [dspace]/solr/statistics/, dataDir=[dspace]/solr/statistics-2019/data/
2020-02-26 08:55:47,511 INFO org.apache.solr.servlet.SolrDispatchFilter @ [admin] webapp=null path=/admin/cores params={dataDir=[dspace]/solr/statistics-2019/data&name=statistics-2019&action=CREATE&instanceDir=statistics&wt=javabin&version=2} status=0 QTime=590
</code></pre><ul>
<li>The process has been going for several hours now and I suspect it will fail eventually
@@ -1186,7 +1186,7 @@ $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-ut
</li>
<li>Manually create a core in the DSpace 6.4-SNAPSHOT instance on my local environment:</li>
</ul>
<pre tabindex="0"><code>$ curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=statistics-2019&instanceDir=/home/aorth/dspace63/solr/statistics&dataDir=/home/aorth/dspace63/solr/statistics-2019/data'
</code></pre><ul>
<li>After that the <code>statistics-2019</code> core was immediately available in the Solr UI, but after restarting Tomcat it was gone
<ul>
@@ -1195,11 +1195,11 @@ $ schedtool -D -e ionice -c2 -n7 dspace stats-util -s >> log/cron-stats-ut
</li>
<li>First export a small slice of 2019 stats from the main CGSpace <code>statistics</code> core, skipping Atmire schema additions:</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2019-01-16.json -f 'time:2019-01-16*' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
</code></pre><ul>
<li>Then import into my local <code>statistics</code> core:</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-01-16.json -k uid
$ ~/dspace63/bin/dspace stats-util -s
Moving: 21993 into core statistics-2019
</code></pre><ul>
@@ -1226,7 +1226,7 @@ Moving: 21993 into core statistics-2019
</ul>
</li>
</ul>
<pre tabindex="0"><code><meta content="Thu hoạch v&agrave; bảo quản c&agrave; ph&ecirc; ch&egrave; đ&uacute;ng kỹ thuật (Harvesting and storing Arabica coffee)" name="citation_title">
<meta name="citation_title" content="Thu hoạch và bảo quản cà phê chè đúng kỹ thuật (Harvesting and storing Arabica coffee)" />
</code></pre><ul>
<li><a href="https://jira.lyrasis.org/browse/DS-4397">DS-4397 controlled vocabulary loading speedup</a></li>
@@ -1250,7 +1250,7 @@ Moving: 21993 into core statistics-2019
</li>
<li>I added some debugging to the Solr core loading in DSpace 6.4-SNAPSHOT (<code>SolrLoggerServiceImpl.java</code>) and I see this when DSpace starts up now:</li>
</ul>
<pre tabindex="0"><code>2020-02-27 12:26:35,695 INFO org.dspace.statistics.SolrLoggerServiceImpl @ Alan Ping of Solr Core [statistics-2019] Failed with [org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException]. New Core Will be Created
</code></pre><ul>
<li>When I check Solr I see the <code>statistics-2019</code> core loaded (from <code>stats-util -s</code> yesterday, not manually created)</li>
</ul>