mirror of https://github.com/alanorth/cgspace-notes.git, synced 2025-01-27 05:49:12 +01:00
Add notes for 2021-09-13
Since I was restarting Tomcat anyways I decided to redeploy the latest changes from the 5_x-prod branch and I added a note about COVID-19 items to the CGSpace frontpage at Peter’s request
"/>
<meta name="generator" content="Hugo 0.88.1" />
<li>Also, Linode is alerting that we had high outbound traffic rate early this morning around midnight AND high CPU load later in the morning</li>
<li>First looking at the traffic in the morning:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log.1 /var/log/nginx/*.log | grep -E "01/Jul/2020:(00|01|02|03|04)" | goaccess --log-format=COMBINED -
...
9659 33.56% 1 0.08% 340.94 MiB 64.39.99.13
3317 11.53% 1 0.08% 871.71 MiB 199.47.87.140
</code></pre><ul>
<li>64.39.99.13 belongs to Qualys, but I see they are using a normal desktop user agent:</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
</code></pre><ul>
<li>I will purge hits from that IP from Solr</li>
<li>The 199.47.87.x IPs belong to Turnitin, and apparently they are NOT marked as bots and we have 40,000 hits from them in 2020 statistics alone:</li>
</ul>
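Purging hits for a single IP like 64.39.99.13 above amounts to a Solr delete-by-query. A hedged sketch, assuming the `statistics` core and `ip` field used in the other queries in these notes (the `curl` line is left commented because it targets a live Solr):

```shell
# Build a delete-by-query payload for one IP; this mirrors the delete
# commands used elsewhere in these notes (a sketch, not the exact
# command that was run).
SOLR_URL="http://localhost:8081/solr/statistics/update?softCommit=true"
PAYLOAD='<delete><query>ip:64.39.99.13</query></delete>'
echo "$PAYLOAD"
# curl -s "$SOLR_URL" -H "Content-Type: text/xml" --data-binary "$PAYLOAD"
```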
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics/select" -d "q=userAgent:/Turnitin.*/&rows=0" | grep -oE 'numFound="[0-9]+"'
numFound="41694"
</code></pre><ul>
<li>They used to be “TurnitinBot”… hhmmmm, seems they use both: <a href="https://turnitin.com/robot/crawlerinfo.html">https://turnitin.com/robot/crawlerinfo.html</a></li>
<li>I will add Turnitin to the DSpace bot user agent list, but I see they are requesting <code>robots.txt</code> and only requesting item pages, so that’s impressive! I don’t need to add them to the “bad bot” rate limit list in nginx</li>
<li>While looking at the logs I noticed eighty-one IPs in the range 185.152.250.x making a few requests each with this user agent:</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:76.0) Gecko/20100101 Firefox/76.0
</code></pre><ul>
<li>The IPs all belong to HostRoyale:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | wc -l
81
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep '01/Jul/2020' | awk '{print $1}' | grep 185.152.250. | sort | uniq | sort -h
185.152.250.1
<li>I purged 20,000 hits from IPs and 45,000 hits from user agents</li>
<li>I will revert the default “example” agents file back to the upstream master branch of COUNTER-Robots, and then add all my custom ones that are pending in pull requests they haven’t merged yet:</li>
</ul>
<pre tabindex="0"><code>$ diff --unchanged-line-format= --old-line-format= --new-line-format='%L' dspace/config/spiders/agents/example ~/src/git/COUNTER-Robots/COUNTER_Robots_list.txt
Citoid
ecointernet
GigablastOpenSource
</code></pre><ul>
<li>Just a note that I <em>still</em> can’t deploy the <code>6_x-dev-atmire-modules</code> branch as it fails at ant update:</li>
</ul>
<pre tabindex="0"><code> [java] java.lang.RuntimeException: Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'DefaultStorageUpdateConfig': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire method: public void com.atmire.statistics.util.StorageReportsUpdater.setStorageReportServices(java.util.List); nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cuaEPersonStorageReportService': Injection of autowired dependencies failed; nested exception is org.springframework.beans.factory.BeanCreationException: Could not autowire field: private com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO com.atmire.dspace.cua.CUAStorageReportServiceImpl$CUAEPersonStorageReportServiceImpl.CUAEPersonStorageReportDAO; nested exception is org.springframework.beans.factory.NoUniqueBeanDefinitionException: No qualifying bean of type [com.atmire.dspace.cua.dao.storage.CUAEPersonStorageReportDAO] is defined: expected single matching bean but found 2: com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#0,com.atmire.dspace.cua.dao.impl.CUAStorageReportDAOImpl$CUAEPersonStorageReportDAOImpl#1
</code></pre><ul>
<li>I had told Atmire about this several weeks ago… but I reminded them again in the ticket
<ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -g -s 'http://localhost:8081/solr/statistics-2019/select?q=*:*&fq=time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z%5D&rows=0&wt=json&indent=true'
{
"responseHeader":{
"status":0,
</code></pre><ul>
<li>But not in solr-import-export-json… hmmm… seems we need to URL encode <em>only</em> the date range itself, but not the brackets:</li>
</ul>
<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:%5B2019-01-01T00%3A00%3A00Z%20TO%202019-06-30T23%3A59%3A59Z]' -k uid
$ zstd /tmp/statistics-2019-1.json
</code></pre><ul>
<li>Then import it on my local dev environment:</li>
</ul>
<pre tabindex="0"><code>$ zstd -d statistics-2019-1.json.zst
$ ./run.sh -s http://localhost:8080/solr/statistics -a import -o ~/Downloads/statistics-2019-1.json -k uid
</code></pre><h2 id="2020-07-05">2020-07-05</h2>
<ul>
</li>
<li>I noticed that we have 20,000 distinct values for <code>dc.subject</code>, but there are at least 500 that are lower or mixed case that we should simply uppercase without further thought:</li>
</ul>
<pre tabindex="0"><code>dspace=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
</code></pre><ul>
<li>DSpace Test needs a different query because it is running DSpace 6 with UUIDs for everything:</li>
</ul>
<pre tabindex="0"><code>dspace63=# UPDATE metadatavalue SET text_value=UPPER(text_value) WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=57 AND text_value ~ '[[:lower:]]';
</code></pre><ul>
<li>Note the use of the POSIX character class :)</li>
<li>I suggest that we generate a list of the top 5,000 values that don’t match AGROVOC so that Sisay can correct them
</ul>
</li>
</ul>
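As a quick aside, the `[[:lower:]]` POSIX character class used in the UPDATE statements above matches any value containing at least one lowercase letter, i.e. exactly the rows that get uppercased. An illustration with grep (sample values invented):

```shell
# Only values with at least one lowercase letter match [[:lower:]];
# fully uppercase values pass through untouched.
printf 'SOIL\nSoil fertility\nMAIZE\n' | grep '[[:lower:]]'
```

This prints only `Soil fertility`.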
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=57 GROUP BY text_value ORDER BY count DESC) TO /tmp/2020-07-05-subjects.csv WITH CSV;
COPY 19640
dspace=# \q
$ csvcut -c1 /tmp/2020-07-05-subjects-upper.csv | head -n 6500 > 2020-07-05-cgspace-subjects.txt
</code></pre><ul>
<li>Then start looking them up using <code>agrovoc-lookup.py</code>:</li>
</ul>
<pre tabindex="0"><code>$ ./agrovoc-lookup.py -i 2020-07-05-cgspace-subjects.txt -om 2020-07-05-cgspace-subjects-matched.txt -or 2020-07-05-cgspace-subjects-rejected.txt -d
</code></pre><h2 id="2020-07-06">2020-07-06</h2>
<ul>
<li>I made some optimizations to the suite of Python utility scripts in our DSpace directory as well as the <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a> script
<ul>
<li>Peter asked me to send him a list of sponsors on CGSpace</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC) TO /tmp/2020-07-07-sponsors.csv WITH CSV HEADER;
COPY 707
</code></pre><ul>
<li>I ran it quickly through my <code>csv-metadata-quality</code> tool and found two issues that I will correct with <code>fix-metadata-values.py</code> on CGSpace immediately:</li>
</ul>
<pre tabindex="0"><code>$ cat 2020-07-07-fix-sponsors.csv
dc.description.sponsorship,correct
"Ministe`re des Affaires Etrange`res et Européennes, France","Ministère des Affaires Étrangères et Européennes, France"
"Global Food Security Programme, United Kingdom","Global Food Security Programme, United Kingdom"
<ul>
<li>Generate a CSV of all the AGROVOC subjects that didn’t match from the top 6500 I exported earlier this week:</li>
</ul>
<pre tabindex="0"><code>$ csvgrep -c 'number of matches' -r "^0$" 2020-07-05-cgspace-subjects.csv | csvcut -c 1 > 2020-07-05-cgspace-invalid-subjects.csv
</code></pre><ul>
<li>Yesterday Gabriela from CIP emailed to say that she was removing the accents from her authors' names because of “funny character” issues with reports generated from CGSpace
<ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ csvgrep -c 2 -r "^.+$" ~/Downloads/cip-authors-GH-20200706.csv | csvgrep -c 1 -r "^.*[À-ú].*$" | csvgrep -c 2 -r "^.*[À-ú].*$" -i | csvcut -c 1,2
dc.contributor.author,correction
"López, G.","Lopez, G."
"Gómez, R.","Gomez, R."
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-07-08-affiliations.csv WITH CSV HEADER;
</code></pre><ul>
<li>Then I stripped the CSV header and quotes to make it a plain text file and ran <code>ror-lookup.py</code>:</li>
</ul>
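The header-and-quote stripping step itself is not shown; a plausible sketch (hypothetical, since the exact command was not recorded in these notes — awk with FPAT would be more robust if names contain quoted commas):

```shell
# Drop the CSV header, remove the trailing count column, and strip the
# surrounding quotes to get a plain-text list of affiliation names.
tail -n +2 /tmp/2020-07-08-affiliations.csv \
  | sed -e 's/,[0-9]*$//' -e 's/^"//' -e 's/"$//' \
  > /tmp/2020-07-08-affiliations.txt
```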
<pre tabindex="0"><code>$ ./ror-lookup.py -i /tmp/2020-07-08-affiliations.txt -r ror.json -o 2020-07-08-affiliations-ror.csv -d
$ wc -l /tmp/2020-07-08-affiliations.txt
5866 /tmp/2020-07-08-affiliations.txt
$ csvgrep -c matched -m true 2020-07-08-affiliations-ror.csv | wc -l
</li>
<li>I updated <code>ror-lookup.py</code> to check aliases and acronyms as well and now the results are better for CGSpace’s affiliation list:</li>
</ul>
<pre tabindex="0"><code>$ wc -l /tmp/2020-07-08-affiliations.txt
5866 /tmp/2020-07-08-affiliations.txt
$ csvgrep -c matched -m true 2020-07-08-affiliations-ror.csv | wc -l
1516
<li>So now our matching improves to 1515 out of 5866 (25.8%)</li>
<li>Gabriela from CIP said that I should run the author corrections minus those that remove accent characters so I will run it on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-09-fix-90-cip-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correction -m 3
</code></pre><ul>
<li>Apply 110 fixes and 90 deletions to sponsorships that Peter sent me a few days ago:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-07-fix-110-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -t 'correct/action' -m 29
$ ./delete-metadata-values.py -i /tmp/2020-07-07-delete-90-sponsors.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29
</code></pre><ul>
<li>Start a full Discovery re-index on CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ time chrt -b 0 dspace index-discovery -b
real 94m21.413s
user 9m40.364s
</code></pre><ul>
<li>I modified <code>crossref-funders-lookup.py</code> to be case insensitive and now CGSpace’s sponsors match 173 out of 534 (32.4%):</li>
</ul>
<pre tabindex="0"><code>$ ./crossref-funders-lookup.py -i 2020-07-09-cgspace-sponsors.txt -o 2020-07-09-cgspace-sponsors-crossref.csv -d -e a.orth@cgiar.org
$ wc -l 2020-07-09-cgspace-sponsors.txt
534 2020-07-09-cgspace-sponsors.txt
$ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
</ul>
</li>
</ul>
<pre tabindex="0"><code># grep 199.47.87 dspace.log.2020-07-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2815
</code></pre><ul>
<li>So I need to add this alternative user-agent to the Tomcat Crawler Session Manager valve to force it to re-use a common bot session</li>
</ul>
</li>
</ul>
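For reference, the Crawler Session Manager valve mentioned above is configured in Tomcat's `server.xml`; an entry covering bot-like user agents might look like this sketch (the `crawlerUserAgents` regex here is an assumption for illustration, not the value actually deployed):

```
<!-- Share one session per crawler instead of one session per request;
     the regex below is illustrative, not the production value. -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*(bot|crawl|spider|Turnitin).*" />
```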
<pre tabindex="0"><code>Mozilla/5.0 (compatible; heritrix/3.4.0-SNAPSHOT-2019-02-07T13:53:20Z +http://ifm.uni-mannheim.de)
</code></pre><ul>
<li>Generate a list of sponsors to update our controlled vocabulary:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
COPY 125
dspace=# \q
$ csvcut -c 1 --tabs /tmp/2020-07-12-sponsors.csv > dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
<ul>
<li>I ran the <code>dspace cleanup -v</code> process on CGSpace and got an error:</li>
</ul>
<pre tabindex="0"><code>Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(189618) is still referenced from table "bundle".
</code></pre><ul>
<li>The solution is, as always:</li>
</ul>
<pre tabindex="0"><code>$ psql -d dspace -U dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (189618, 188837);'
UPDATE 1
</code></pre><ul>
<li>Udana from WLE asked me about some items that didn’t show Altmetric donuts
<li>All four IWMI items that I tweeted yesterday have Altmetric donuts with a score of 1 now…</li>
<li>Export CGSpace countries to check them against ISO 3166-1 and ISO 3166-3 (historic countries):</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-07-15-countries.csv;
COPY 194
</code></pre><ul>
<li>I wrote a script <code>iso3166-lookup.py</code> to check them:</li>
</ul>
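The script itself is not reproduced here; at its core such a check is a case-insensitive, whole-line match of each country value against the ISO 3166 name lists, which can be illustrated with grep on toy data (the real script presumably uses the full ISO lists):

```shell
# Toy illustration of the lookup: values with no case-insensitive,
# whole-line match in the ISO name list are the "false" rows.
printf 'KENYA\nCABO VERDE\n' > /tmp/iso-names.txt
printf 'KENYA\nCAPE VERDE\n' | grep -ivxf /tmp/iso-names.txt
```

This prints `CAPE VERDE`, consistent with the unmatched rows in the output below (ISO 3166-1 uses “Cabo Verde”).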
<pre tabindex="0"><code>$ ./iso3166-1-lookup.py -i /tmp/2020-07-15-countries.csv -o /tmp/2020-07-15-countries-resolved.csv
$ csvgrep -c matched -m false /tmp/2020-07-15-countries-resolved.csv
country,match type,matched
CAPE VERDE,,false
</code></pre><ul>
<li>Check the database for DOIs that are not in the preferred “<a href="https://doi.org/">https://doi.org/</a>” format:</li>
</ul>
<pre tabindex="0"><code>dspace=# \COPY (SELECT text_value as "cg.identifier.doi" FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=220 AND text_value NOT LIKE 'https://doi.org/%') TO /tmp/2020-07-15-doi.csv WITH CSV HEADER;
COPY 186
</code></pre><ul>
<li>Then I imported them into OpenRefine and replaced them in a new “correct” column using this GREL transform:</li>
</ul>
<pre tabindex="0"><code>value.replace("dx.doi.org", "doi.org").replace("http://", "https://").replace("https://dx,doi,org", "https://doi.org").replace("https://doi.dx.org", "https://doi.org").replace("https://dx.doi:", "https://doi.org").replace("DOI: ", "https://doi.org/").replace("doi: ", "https://doi.org/").replace("http://dx.doi.org", "https://doi.org").replace("https://dx. doi.org. ", "https://doi.org").replace("https://dx.doi", "https://doi.org").replace("https://dx.doi:", "https://doi.org/").replace("hdl.handle.net", "doi.org")
</code></pre><ul>
<li>Then I fixed the DOIs on CGSpace:</li>
</ul>
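As an aside, the GREL chain above can be reproduced outside OpenRefine; a sed sketch covering a subset of the cases (illustrative, not the command actually run — order matters, just as in the GREL chain):

```shell
# Normalize a couple of legacy DOI forms to https://doi.org/
# (subset of the GREL transform; illustrative only).
printf 'http://dx.doi.org/10.1000/xyz\nDOI: 10.1000/abc\n' \
  | sed -e 's|dx\.doi\.org|doi.org|' \
        -e 's|^http://|https://|' \
        -e 's|^DOI: |https://doi.org/|'
```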
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-15-fix-164-DOIs.csv -db dspace -u dspace -p 'fuuu' -f cg.identifier.doi -t 'correct' -m 220
</code></pre><ul>
<li>I filed <a href="https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/10">an issue on Debian’s iso-codes</a> project to ask why “Swaziland” does not appear in the ISO 3166-3 list of historical country names despite it being changed to “Eswatini” in 2018.</li>
<li>Atmire responded about the Solr issue
<ul>
<li>Looking at the nginx logs on CGSpace (linode18) last night I see that the Macaroni Bros have started using a unique identifier for at least one of their harvesters:</li>
</ul>
<pre tabindex="0"><code>217.64.192.138 - - [20/Jul/2020:01:01:39 +0200] "GET /rest/rest/bitstreams/114779/retrieve HTTP/1.0" 302 138 "-" "ILRI Livestock Website Publications importer BOT"
</code></pre><ul>
<li>I still see 12,000 records in Solr from this user agent, though.
<ul>
<li>I re-ran the <code>check-spider-hits.sh</code> script with the new lists and purged another 14,000 more stats hits for several years each (2020, 2019, 2018, 2017, 2016), around 70,000 total</li>
<li>I looked at the <a href="https://clarisa.cgiar.org/">CLARISA</a> institutions list again, since I hadn’t looked at it in over six months:</li>
</ul>
<pre tabindex="0"><code>$ cat ~/Downloads/response_1595270924560.json | jq '.[] | {name: .name}' | grep name | awk -F: '{print $2}' | sed -e 's/"//g' -e 's/^ //' -e '1iname' | csvcut -l | sed '1s/line_number/id/' > /tmp/clarisa-institutions.csv
</code></pre><ul>
<li>The API still needs a key unless you query it from the Swagger web interface
<ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ csv-metadata-quality -i /tmp/clarisa-institutions.csv -o /tmp/clarisa-institutions-cleaned.csv
Removing excessive whitespace (name): Comitato Internazionale per lo Sviluppo dei Popoli / International Committee for the Development of Peoples
Removing excessive whitespace (name): Deutsche Landwirtschaftsgesellschaft / German agriculture society
Removing excessive whitespace (name): Institute of Arid Regions of Medenine
</li>
<li>I started processing the 2019 stats in a batch of 1 million on DSpace Test:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2019
...
*** Statistics Records with Legacy Id ***
</code></pre><ul>
<li>The statistics-2019 finished processing after about 9 hours so I started the 2018 ones:</li>
</ul>
<pre tabindex="0"><code>$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 1000000 -i statistics-2018
*** Statistics Records with Legacy Id ***
</code></pre><ul>
<li>Moayad finally made OpenRXV use a unique user agent:</li>
</ul>
<pre tabindex="0"><code>OpenRXV harvesting bot; https://github.com/ilri/OpenRXV
</code></pre><ul>
<li>I see nearly 200,000 hits in Solr from the IP address, though, so I need to make sure those are old ones from before today
<ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
</code></pre><ul>
<li>There were four records so I deleted them:</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:10</query></delete>'
</code></pre><ul>
<li>Meeting with Moayad and Peter and Abenet to discuss the latest AReS changes</li>
</ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
</code></pre><ul>
<li>Also, in the same month with the same <em>exact</em> user agent, I see 300,000 hits from 192.157.89.x
<ul>
</ul>
</li>
</ul>
<pre tabindex="0"><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
|
||||
</code></pre><ul>
|
||||
<li>In statistics-2018 I see more weird IPs
|
||||
<ul>
|
||||
@@ -860,7 +860,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
 </ul>
 </li>
 </ul>
-<pre><code>Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36
+<pre tabindex="0"><code>Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36
 </code></pre><ul>
 <li>Then there is 213.139.53.62 in 2018, which is on Orange Telecom Jordan, so it’s definitely CodeObia / ICARDA and I will purge them</li>
 <li>Jesus, and then there are 100,000 from the ILRI harvestor on Linode on 2a01:7e00::f03c:91ff:fe0a:d645</li>
@@ -869,7 +869,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
 <li>Jesus fuck there is 104.198.9.108 on Google Cloud that was making 30,000 requests with no user agent</li>
 <li>I will purge the hits from all the following IPs:</li>
 </ul>
-<pre><code>192.157.89.4
+<pre tabindex="0"><code>192.157.89.4
 192.157.89.5
 192.157.89.6
 192.157.89.7
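A purge list like the one in the hunk above can be collapsed into a single Solr delete query rather than one request per address. This is a sketch under assumptions (the statistics core URL is not from the notes, and the actual delete is left commented out as a dry run):

```shell
#!/bin/sh
# Build one Solr delete-by-query covering several IPs at once,
# i.e. ip:(192.157.89.4 OR 192.157.89.5 OR ...).
ips="192.157.89.4 192.157.89.5 192.157.89.6 192.157.89.7"

# Join the space-separated list with " OR " to form the query
query="ip:($(echo $ips | sed 's/ / OR /g'))"
echo "$query"

# The actual purge (commented out so this sketch stays a dry run; the core
# URL below is an assumption, not taken from the notes):
# curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" \
#     -H "Content-Type: text/xml" \
#     --data-binary "<delete><query>$query</query></delete>"
```

One combined query also means a single commit on the Solr side instead of four.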
@@ -898,7 +898,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error whil
 </li>
 <li>I noticed a few other user agents that should be purged too:</li>
 </ul>
-<pre><code>^Java\/\d{1,2}.\d
+<pre tabindex="0"><code>^Java\/\d{1,2}.\d
 FlipboardProxy\/\d
 API scraper
 RebelMouse\/\d
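The purge patterns in the hunk above are written with Lucene-style escaping and `\d`, which POSIX grep does not understand. Translated to POSIX ERE (`\d` becomes `[0-9]`, the slashes need no escaping), they can be sanity-checked against sample agents; the helper name and the sample user-agent strings here are invented for illustration:

```shell
#!/bin/sh
# matches_purge_pattern checks one user-agent string against the purge
# patterns from the notes, rewritten as POSIX extended regular expressions.
matches_purge_pattern() {
    printf '%s\n' "$1" | grep -E -q \
        -e '^Java/[0-9]{1,2}.[0-9]' \
        -e 'FlipboardProxy/[0-9]' \
        -e 'API scraper' \
        -e 'RebelMouse/[0-9]'
}

matches_purge_pattern "Java/1.8.0_231" && echo "purge: Java/1.8.0_231"
matches_purge_pattern "Mozilla/5.0 (X11; Linux x86_64)" || echo "keep: normal browser"
```

Testing the translated patterns this way before feeding them to a purge script helps avoid deleting hits from legitimate browsers.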
@@ -932,7 +932,7 @@ mailto\:team@impactstory\.org
 </li>
 <li>Export some of the CGSpace Solr stats minus the Atmire CUA schema additions for Salem to play with:</li>
 </ul>
-<pre><code>$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
+<pre tabindex="0"><code>$ chrt -b 0 ./run.sh -s http://localhost:8081/solr/statistics-2019 -a export -o /tmp/statistics-2019-1.json -f 'time:[2019-01-01T00\:00\:00Z TO 2019-06-30T23\:59\:59Z]' -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId
 </code></pre><ul>
 <li>
 <p>Run system updates on DSpace Test (linode26) and reboot it</p>
@@ -1036,11 +1036,11 @@ mailto\:team@impactstory\.org
 <p>I started processing Solr stats with the Atmire tool now:</p>
 </li>
 </ul>
-<pre><code>$ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -c statistics -f -t 12
+<pre tabindex="0"><code>$ dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -c statistics -f -t 12
 </code></pre><ul>
 <li>This one failed after a few hours:</li>
 </ul>
-<pre><code>Record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b couldn't be processed
+<pre tabindex="0"><code>Record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b couldn't be processed
 com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: c4b5974a-025d-4adc-b6c3-c8846048b62b, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
 at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
 at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
@@ -1063,7 +1063,7 @@ If run the update again with the resume option (-r) they will be reattempted
 <li>I started the same script for the statistics-2019 core (12 million records…)</li>
 <li>Update an ILRI author’s name on CGSpace:</li>
 </ul>
-<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-07-27-fix-ILRI-author.csv -db dspace -u cgspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
+<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2020-07-27-fix-ILRI-author.csv -db dspace -u cgspace -p 'fuuu' -f dc.contributor.author -t 'correct' -m 3
 Fixed 13 occurences of: Muloi, D.
 Fixed 4 occurences of: Muloi, D.M.
 </code></pre><h2 id="2020-07-28">2020-07-28</h2>
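Judging from the `-f dc.contributor.author -t 'correct'` flags in the hunk above, the script appears to read a CSV with one column named after the metadata field (holding the value to find) and one named `correct` (holding the replacement). A hypothetical input file in that shape; the author names below are placeholders, not the values actually used:

```shell
#!/bin/sh
# Hypothetical input for fix-metadata-values.py, inferred from the -f/-t
# flags above. The column headers match the flag values; the rows are
# placeholder data for illustration only.
cat > /tmp/fix-author-sample.csv <<'EOF'
dc.contributor.author,correct
"Name, Old","Name, Corrected"
EOF

cat /tmp/fix-author-sample.csv
```

Quoting the fields matters here, since author names in "Surname, Initials" form contain the CSV delimiter.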
@@ -1110,7 +1110,7 @@ Fixed 4 occurences of: Muloi, D.M.
 </ul>
 </li>
 </ul>
-<pre><code># grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
+<pre tabindex="0"><code># grep -c numeric /usr/share/iso-codes/json/iso_3166-1.json
 249
 # grep -c -E '"name":' /usr/share/iso-codes/json/iso_3166-1.json
 249
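The `grep -c` check in the hunk above works because the iso-codes JSON is pretty-printed with each key on its own line, so counting lines containing `numeric` and `"name":` counts entries. A minimal reproduction against a tiny invented fixture (two made-up entries instead of the real file's 249):

```shell
#!/bin/sh
# Reproduce the sanity check above against a small fixture rather than
# /usr/share/iso-codes/json/iso_3166-1.json. Same shape: one key per line.
cat > /tmp/iso-sample.json <<'EOF'
{
  "3166-1": [
    {
      "alpha_2": "KE",
      "name": "Kenya",
      "numeric": "404"
    },
    {
      "alpha_2": "ET",
      "name": "Ethiopia",
      "numeric": "231"
    }
  ]
}
EOF

# Both counts agree when every entry carries both keys
grep -c numeric /tmp/iso-sample.json
grep -c -E '"name":' /tmp/iso-sample.json
```

If the counts ever diverge on the real file, some entry is missing one of the two keys, which is exactly what the original check was probing for.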