Mirror of https://github.com/alanorth/cgspace-notes.git, synced 2025-01-27 05:49:12 +01:00
Add notes for 2021-09-13
@@ -42,7 +42,7 @@ You need to download this into the DSpace 6.x source and compile it
 
 
 "/>
-<meta name="generator" content="Hugo 0.87.0" />
+<meta name="generator" content="Hugo 0.88.1" />
 
 
 
@@ -141,7 +141,7 @@ You need to download this into the DSpace 6.x source and compile it
 </ul>
 </li>
 </ul>
-<pre><code>$ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
+<pre tabindex="0"><code>$ export JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8"
 $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
 </code></pre><h2 id="2020-03-03">2020-03-03</h2>
 <ul>
@@ -160,7 +160,7 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
 </ul>
 </li>
 </ul>
-<pre><code>$ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
+<pre tabindex="0"><code>$ ./fix-metadata-values.py -i 2020-03-04-fix-1-ilri-subject.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
 </code></pre><ul>
 <li>But I have not run it on CGSpace yet because we want to ask Peter if he is sure about it…</li>
 <li>Send a message to Macaroni Bros to ask them about their Drupal module and its readiness for DSpace 6 UUIDs</li>
@@ -177,7 +177,7 @@ $ ~/dspace63/bin/dspace solr-upgrade-statistics-6x
 <li>I want to try to consolidate our yearly Solr statistics cores back into one <code>statistics</code> core using the solr-import-export-json tool</li>
 <li>I will try it on DSpace test, doing one year at a time:</li>
 </ul>
-<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid
+<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2010 -a export -o /tmp/statistics-2010.json -k uid
 $ ./run.sh -s http://localhost:8081/solr/statistics -a import -o /tmp/statistics-2010.json -k uid
 $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>time:2010*</query></delete>"
 $ ./run.sh -s http://localhost:8081/solr/statistics-2011 -a export -o /tmp/statistics-2011.json -k uid
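The per-year consolidation in the hunk above repeats one pattern: export a yearly core, import the dump into `statistics`, then delete the exported documents from the yearly core. A minimal Python sketch that only formats those three command lines (it does not run anything) might look like this. The function name `consolidation_commands` and the `base_url` and `tmp_dir` parameters are illustrative assumptions; the `run.sh` flags and the Solr delete query are copied from the notes.

```python
import shlex


def consolidation_commands(year, base_url="http://localhost:8081/solr", tmp_dir="/tmp"):
    """Return the three shell commands that fold one yearly core into `statistics`.

    Hypothetical helper: it only builds the command strings shown in the notes.
    """
    dump = f"{tmp_dir}/statistics-{year}.json"
    # Export the yearly core to a JSON dump, keyed on the uid field
    export = ["./run.sh", "-s", f"{base_url}/statistics-{year}",
              "-a", "export", "-o", dump, "-k", "uid"]
    # Import the same dump into the main statistics core
    load = ["./run.sh", "-s", f"{base_url}/statistics",
            "-a", "import", "-o", dump, "-k", "uid"]
    # Delete that year's documents from the yearly core
    purge = ["curl", "-s", f"{base_url}/statistics-{year}/update?softCommit=true",
             "-H", "Content-Type: text/xml",
             "--data-binary", f"<delete><query>time:{year}*</query></delete>"]
    # shlex.join quotes each argument so the result is safe to paste into a shell
    return [shlex.join(cmd) for cmd in (export, load, purge)]
```

Printing the result for each year (for example `for cmd in consolidation_commands(2011): print(cmd)`) gives a list that can be reviewed before anything is run against a live core.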
@@ -196,7 +196,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
 </ul>
 </li>
 </ul>
-<pre><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
+<pre tabindex="0"><code>$ ./run.sh -s http://localhost:8081/solr/statistics-2014 -a export -o /tmp/statistics-2014-1.json -k uid -f 'time:/2014-0[1-6].*/'
 </code></pre><ul>
 <li>Upgrade PostgreSQL from 9.6 to 10 on DSpace Test (linode19)
 <ul>
@@ -204,7 +204,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
 </ul>
 </li>
 </ul>
-<pre><code># apt install postgresql-10 postgresql-contrib-10
+<pre tabindex="0"><code># apt install postgresql-10 postgresql-contrib-10
 # systemctl stop tomcat7
 # pg_ctlcluster 9.6 main stop
 # tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6
@@ -232,11 +232,11 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
 </ul>
 </li>
 </ul>
-<pre><code>Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)
+<pre tabindex="0"><code>Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)
 </code></pre><ul>
 <li>It seems to only be a problem in the last week:</li>
 </ul>
-<pre><code># zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}
+<pre tabindex="0"><code># zgrep -c 64.225.40.66 /var/log/nginx/rest.log.{1..9}
 /var/log/nginx/rest.log.1:0
 /var/log/nginx/rest.log.2:0
 /var/log/nginx/rest.log.3:0
@@ -250,22 +250,22 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
 <li>In Solr the IP is 127.0.0.1, but in the nginx logs I can luckily see the real IP (64.225.40.66), which is on Digital Ocean</li>
 <li>I will purge them from Solr statistics:</li>
 </ul>
-<pre><code>$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'
+<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 (.NET CLR 3.5.30729)"</query></delete>'
 </code></pre><ul>
 <li>Another user agent that seems to be a bot is:</li>
 </ul>
-<pre><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
+<pre tabindex="0"><code>Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)
 </code></pre><ul>
 <li>In Solr the IP is 127.0.0.1 because of the misconfiguration, but in nginx’s logs I see it belongs to three IPs on Online.net in France:</li>
 </ul>
-<pre><code># zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
+<pre tabindex="0"><code># zcat /var/log/nginx/access.log.*.gz /var/log/nginx/rest.log.*.gz | grep 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' | awk '{print $1}' | sort | uniq -c
 63090 163.172.68.99
 183428 163.172.70.248
 147608 163.172.71.24
 </code></pre><ul>
 <li>It is making 10,000 to 40,000 requests to XMLUI per day…</li>
 </ul>
-<pre><code># zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
+<pre tabindex="0"><code># zgrep -c 'Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)' /var/log/nginx/access.log.{1..9}
 /var/log/nginx/access.log.30.gz:18687
 /var/log/nginx/access.log.31.gz:28936
 /var/log/nginx/access.log.32.gz:36402
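The `zcat | grep | awk '{print $1}' | sort | uniq -c` pipeline in the hunk above counts requests per client IP for one user agent. A rough Python equivalent is sketched below. It assumes, as the `awk` call does, that the client IP is the first whitespace-separated field of each log line; `count_ips` and `read_gzip_logs` are hypothetical helper names, not part of the author's scripts.

```python
import gzip
from collections import Counter


def count_ips(lines, user_agent):
    """Count requests per client IP among log lines that contain `user_agent`.

    Assumes the IP is the first whitespace-separated field, like awk's $1.
    """
    counts = Counter()
    for line in lines:
        if user_agent in line:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
    return counts


def read_gzip_logs(paths):
    """Yield text lines from gzip-compressed log files (hypothetical helper)."""
    for path in paths:
        # errors="replace" keeps going if a log line contains invalid UTF-8
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            yield from handle
```

Calling `count_ips(read_gzip_logs(paths), agent).most_common()` lists the busiest IPs first, which is the ordering that matters when deciding what to block or purge.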
@@ -284,7 +284,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
 </code></pre><ul>
 <li>I will purge those hits too!</li>
 </ul>
-<pre><code>$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'
+<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>userAgent:"Mozilla/5.0 ((Windows; U; Windows NT 6.1; fr; rv:1.9.2) Gecko/20100115 Firefox/3.6)"</query></delete>'
 </code></pre><ul>
 <li>Shit, and something happened and a few thousand hits from user agents with “Bot” in their user agent got through
 <ul>
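The purge commands above send Solr an XML delete-by-query body that matches an exact `userAgent` phrase. When the user agent string comes from a log file rather than being typed by hand, it may contain characters that break either the quoted phrase or the XML. A small sketch of building that body defensively follows; `delete_by_user_agent` is a hypothetical helper, and the backslash escaping assumes the standard Lucene query syntax for quoted phrases.

```python
from xml.sax.saxutils import escape


def delete_by_user_agent(user_agent):
    """Build a Solr XML delete-by-query body for an exact userAgent phrase."""
    # Escape backslashes and double quotes so the value stays inside the phrase
    phrase = user_agent.replace("\\", "\\\\").replace('"', '\\"')
    query = 'userAgent:"' + phrase + '"'
    # escape() handles &, < and > so the query is valid XML character data
    return "<delete><query>" + escape(query) + "</query></delete>"
```

The returned string is what the notes pass to `curl --data-binary` against the core's `/update?softCommit=true` endpoint.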
@@ -292,7 +292,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2012/update?softCommit=tru
 </ul>
 </li>
 </ul>
-<pre><code>$ ./check-spider-hits.sh -f /tmp/bots -d -p
+<pre tabindex="0"><code>$ ./check-spider-hits.sh -f /tmp/bots -d -p
 (DEBUG) Using spiders pattern file: /tmp/bots
 (DEBUG) Checking for hits from spider: Citoid
 Purging 11 hits from Citoid in statistics
@@ -337,7 +337,7 @@ Purging 62 hits from [Ss]pider in statistics
 </ul>
 </li>
 </ul>
-<pre><code>dspace=# SELECT DISTINCT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND resource_id in (111295,111294,111293,111292,111291,111290,111288,111286,111285,111284,111283,111282,111281,111280,111279,111278,111277,111276,111275,111274,111273,111272,111271,111270,111269,111268,111267,111266,111265,111264,111263,111262,111261,111260,111259,111258,111257,111256,111255,111254,111253,111252,111251,111250,111249,111248,111247,111246,111245,111244,111243,111242,111241,111240,111238,111237,111236,111235,111234,111233,111232,111231,111230,111229,111228,111227,111226,111225,111224,111223,111222,111221,111220,111219,111218,111217,111216,111215,111214,111213,111212,111211,111209,111208,111207,111206,111205,111204,111203,111202,111201,111200,111199,111198,111197,111196,111195,111194,111193,111192,111191,111190,111189,111188,111187,111186,111185,111184,111183,111182,111181,111180,111179,111178,111177,111176,111175,111174,111173,111172,111171,111170,111169,111168,111299,111298,111297,111296,111167,111166,111165,111164,111163,111162,111161,111160,111159,111158,111157,111156,111155,111154,111153,111152,111151,111150,111149,111148,111147,111146,111145,111144,111143,111142,111141,111140,111139,111138,111137,111136,111135,111134,111133,111132,111131,111129,111128,111127,111126,111125) GROUP BY text_lang ORDER BY count;
+<pre tabindex="0"><code>dspace=# SELECT DISTINCT text_lang, COUNT(*) FROM metadatavalue WHERE resource_type_id=2 AND resource_id in (111295,111294,111293,111292,111291,111290,111288,111286,111285,111284,111283,111282,111281,111280,111279,111278,111277,111276,111275,111274,111273,111272,111271,111270,111269,111268,111267,111266,111265,111264,111263,111262,111261,111260,111259,111258,111257,111256,111255,111254,111253,111252,111251,111250,111249,111248,111247,111246,111245,111244,111243,111242,111241,111240,111238,111237,111236,111235,111234,111233,111232,111231,111230,111229,111228,111227,111226,111225,111224,111223,111222,111221,111220,111219,111218,111217,111216,111215,111214,111213,111212,111211,111209,111208,111207,111206,111205,111204,111203,111202,111201,111200,111199,111198,111197,111196,111195,111194,111193,111192,111191,111190,111189,111188,111187,111186,111185,111184,111183,111182,111181,111180,111179,111178,111177,111176,111175,111174,111173,111172,111171,111170,111169,111168,111299,111298,111297,111296,111167,111166,111165,111164,111163,111162,111161,111160,111159,111158,111157,111156,111155,111154,111153,111152,111151,111150,111149,111148,111147,111146,111145,111144,111143,111142,111141,111140,111139,111138,111137,111136,111135,111134,111133,111132,111131,111129,111128,111127,111126,111125) GROUP BY text_lang ORDER BY count;
 </code></pre><ul>
 <li>Then I exported the metadata from DSpace Test and imported it into OpenRefine
 <ul>
@@ -346,7 +346,7 @@ Purging 62 hits from [Ss]pider in statistics
 </li>
 <li>I exported a new list of affiliations from the database, added line numbers with <code>csvcut</code>, and then validated them in OpenRefine using <code>reconcile-csv</code>:</li>
 </ul>
-<pre><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2020-03-12-affiliations.csv WITH CSV HEADER;
+<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2020-03-12-affiliations.csv WITH CSV HEADER;
 dspace=# \q
 $ csvcut -l -c 0 /tmp/2020-03-12-affiliations.csv | sed -e 's/^line_number/id/' -e 's/text_value/name/' > /tmp/affiliations.csv
 $ lein run /tmp/affiliations.csv name id
@@ -417,14 +417,14 @@ $ lein run /tmp/affiliations.csv name id
 <li>Update Tomcat to version 7.0.103 in the Ansible infrastructure playbooks and deploy on DSpace Test (linode26)</li>
 <li>Maria sent me a few new ORCID identifiers from Bioversity so I combined them with our existing ones, filtered the unique ones, and then resolved their names using my <code>resolve-orcids.py</code> script:</li>
 </ul>
-<pre><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcids | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-03-26-combined-orcids.txt
+<pre tabindex="0"><code>$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/bioversity-orcids | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-03-26-combined-orcids.txt
 $ ./resolve-orcids.py -i /tmp/2020-03-26-combined-orcids.txt -o /tmp/2020-03-26-combined-names.txt -d
 # sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
 $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
 </code></pre><ul>
 <li>I checked the database for likely matches to the author name and then created a CSV with the author names and ORCID iDs:</li>
 </ul>
-<pre><code>dc.contributor.author,cg.creator.id
+<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
 "King, Brian","Brian King: 0000-0002-7056-9214"
 "Ortiz-Crespo, Berta","Berta Ortiz-Crespo: 0000-0002-6664-0815"
 "Ekesa, Beatrice","Beatrice Ekesa: 0000-0002-2630-258X"
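The `grep -oE ... | sort | uniq` step in the hunk above pulls every ORCID-like identifier out of the vocabulary file and the new list, then de-duplicates them. The same idea in Python, as a minimal sketch: the regular expression is the one from the notes, while `unique_orcids` is a hypothetical helper name. Python sorts by code point, so the order may differ from a locale-aware shell `sort`.

```python
import re

# Same pattern as the grep -oE call: four dash-separated groups of four characters
ORCID_PATTERN = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")


def unique_orcids(*texts):
    """Return the sorted, de-duplicated ORCID-like identifiers found in the given strings."""
    found = set()
    for text in texts:
        found.update(ORCID_PATTERN.findall(text))
    return sorted(found)
```

Passing the contents of both input files as separate arguments mirrors the `cat file1 file2` at the start of the shell pipeline. The pattern is deliberately loose and does not validate the ORCID checksum.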
@@ -434,7 +434,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
 </code></pre><ul>
 <li>Running the <code>add-orcid-identifiers-csv.py</code> script I added 32 ORCID iDs to items on CGSpace!</li>
 </ul>
-<pre><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-03-26-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
+<pre tabindex="0"><code>$ ./add-orcid-identifiers-csv.py -i /tmp/2020-03-26-ciat-orcids.csv -db dspace -u dspace -p 'fuuu'
 </code></pre><ul>
 <li>Udana from IWMI asked about some items that are missing Altmetric donuts on CGSpace
 <ul>
@@ -449,7 +449,7 @@ $ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-i
 <ul>
 <li>Add two more Bioversity ORCID iDs to CGSpace and then tag ~70 of the authors' existing publications in the database using this CSV with my <code>add-orcid-identifiers-csv.py</code> script:</li>
 </ul>
-<pre><code>dc.contributor.author,cg.creator.id
+<pre tabindex="0"><code>dc.contributor.author,cg.creator.id
 "Snook, L.K.","Laura Snook: 0000-0002-9168-1301"
 "Snook, L.","Laura Snook: 0000-0002-9168-1301"
 "Zheng, S.J.","Sijun Zheng: 0000-0003-1550-3738"