mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-03-04
@@ -36,7 +36,7 @@ It is class based so I can easily add support for other vocabularies, and the te
"/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@@ -150,8 +150,8 @@ It is class based so I can easily add support for other vocabularies, and the te
</li>
<li>I purged all unmigrated stats in a few cores and then restarted processing:</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
</code></pre><ul>
<li>Andrea from Macaroni Bros emailed me a few days ago to say he’s having issues with the CGSpace REST API
@@ -192,16 +192,16 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
"numberItems" : 63,
$ http 'http://localhost:8080/rest/collections/1445/items' | jq '. | length'
61
</code></pre><ul>
<li>Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:</li>
</ul>
<pre tabindex="0"><code>$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
"numberItems" : 61,
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
59
</code></pre><ul>
<li>Ah! I exported that collection’s metadata and checked it in OpenRefine, where I noticed that two items are mapped twice
@@ -210,7 +210,7 @@ $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-49
</ul>
</li>
</ul>
<pre tabindex="0"><code>dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
   id   | collection_id | item_id
--------+---------------+---------
 133698 |           966 |  107687
@@ -220,8 +220,8 @@ $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-49
</code></pre><ul>
<li>So for each id you can delete one duplicate mapping:</li>
</ul>
<pre tabindex="0"><code>dspace=# DELETE FROM collection2item WHERE id='134686';
dspace=# DELETE FROM collection2item WHERE id='128819';
</code></pre><ul>
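<li>As an illustrative sketch of my own (assuming the DSpace 5 schema shown above, where <code>collection2item</code> has its own surrogate <code>id</code> column), a single query could list every duplicated mapping at once instead of checking items one by one:<pre tabindex="0"><code>dspace=# SELECT item_id, collection_id, COUNT(*) FROM collection2item GROUP BY item_id, collection_id HAVING COUNT(*) > 1;</code></pre></li>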
<li>Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter’s preferred display names</li>
</ul>
@@ -229,11 +229,11 @@ dspace=# DELETE FROM collection2item WHERE id='128819';
cg.coverage.country,correct
CAPE VERDE,CABO VERDE
COCOS ISLANDS,COCOS (KEELING) ISLANDS
"CONGO, DR","CONGO, DEMOCRATIC REPUBLIC OF"
COTE D'IVOIRE,CÔTE D'IVOIRE
"KOREA, REPUBLIC","KOREA, REPUBLIC OF"
PALESTINE,"PALESTINE, STATE OF"
$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
</code></pre><ul>
<li>I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly
<ul>
@@ -267,7 +267,7 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
</li>
<li>I checked the nginx logs around 5PM yesterday to see who was accessing the server:</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
</code></pre><ul>
<li>I see the Macaroni Bros are using their new user agent for harvesting: <code>RTB website BOT</code>
<ul>
@@ -276,7 +276,7 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5693
</code></pre><ul>
<li>DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don’t misuse the resources
@@ -291,9 +291,9 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
</li>
<li>A few more IPs causing lots of Tomcat sessions yesterday:</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
|
||||
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
|
||||
1585
|
||||
$ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
|
||||
$ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
|
||||
5691
|
||||
</code></pre><ul>
|
||||
<li>38.128.66.10 isn’t creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:</li>
|
||||
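<li>For illustration only, a minimal Crawler Session Manager Valve entry for Tomcat’s server.xml might look like this sketch; the <code>className</code> and <code>crawlerUserAgents</code> attribute are the real valve configuration knobs, but the exact pattern here is my assumption, not taken from the notes:<pre tabindex="0"><code><Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*[bB]ot.*|.*Turnitin.*" /></code></pre></li>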
@@ -318,8 +318,8 @@ Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
</code></pre><ul>
<li>And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):</li>
</ul>
<pre tabindex="0"><code>$ cat dspace.log.2020-08-04 | grep "199.47.87.145" | grep -E 'sessi
on_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2777
</code></pre><ul>
<li>I will add <code>Turnitin</code> to the Tomcat Crawler Session Manager Valve regex as well…</li>
@@ -377,8 +377,8 @@ on_id=[A-Z0-9]{32}' | sort | uniq | wc -l
<ul>
<li>The Atmire stats processing for the statistics-2018 Solr core keeps stopping with this error:</li>
</ul>
<pre tabindex="0"><code>Exception: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's most likely an issue with the connection to the solr server. Shutting down.
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.storeOnServer(SourceFile:317)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:177)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.performRun(SourceFile:161)
@@ -398,71 +398,71 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/[0-9]+/</query></delete>'
</code></pre><h2 id="2020-08-09">2020-08-09</h2>
<ul>
<li>The Atmire script did something to the server and created 132GB of log files so the root partition ran out of space…</li>
<li>I removed the log file and tried to re-run the process but it seems to be looping over 11,000 records and failing, creating millions of lines in the logs again:</li>
</ul>
<pre tabindex="0"><code># grep -oE "Record uid: ([a-f0-9\\-]*){1} couldn't be processed" /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 > /tmp/not-processed-errors.txt
|
||||
<pre tabindex="0"><code># grep -oE "Record uid: ([a-f0-9\\-]*){1} couldn't be processed" /home/dspacetest.cgiar.org/log/dspace.log.2020-08-09 > /tmp/not-processed-errors.txt
|
||||
# wc -l /tmp/not-processed-errors.txt
|
||||
2202973 /tmp/not-processed-errors.txt
|
||||
# sort /tmp/not-processed-errors.txt | uniq -c | tail -n 10
|
||||
220 Record uid: ffe52878-ba23-44fb-8df7-a261bb358abc couldn't be processed
|
||||
220 Record uid: ffecb2b0-944d-4629-afdf-5ad995facaf9 couldn't be processed
|
||||
220 Record uid: ffedde6b-0782-4d9f-93ff-d1ba1a737585 couldn't be processed
|
||||
220 Record uid: ffedfb13-e929-4909-b600-a18295520a97 couldn't be processed
|
||||
220 Record uid: fff116fb-a1a0-40d0-b0fb-b71e9bb898e5 couldn't be processed
|
||||
221 Record uid: fff1349d-79d5-4ceb-89a1-ce78107d982d couldn't be processed
|
||||
220 Record uid: fff13ddb-b2a2-410a-9baa-97e333118c74 couldn't be processed
|
||||
220 Record uid: fff232a6-a008-47d0-ad83-6e209bb6cdf9 couldn't be processed
|
||||
221 Record uid: fff75243-c3be-48a0-98f8-a656f925cb68 couldn't be processed
|
||||
221 Record uid: fff88af8-88d4-4f79-ba1a-79853973c872 couldn't be processed
|
||||
220 Record uid: ffe52878-ba23-44fb-8df7-a261bb358abc couldn't be processed
|
||||
220 Record uid: ffecb2b0-944d-4629-afdf-5ad995facaf9 couldn't be processed
|
||||
220 Record uid: ffedde6b-0782-4d9f-93ff-d1ba1a737585 couldn't be processed
|
||||
220 Record uid: ffedfb13-e929-4909-b600-a18295520a97 couldn't be processed
|
||||
220 Record uid: fff116fb-a1a0-40d0-b0fb-b71e9bb898e5 couldn't be processed
|
||||
221 Record uid: fff1349d-79d5-4ceb-89a1-ce78107d982d couldn't be processed
|
||||
220 Record uid: fff13ddb-b2a2-410a-9baa-97e333118c74 couldn't be processed
|
||||
220 Record uid: fff232a6-a008-47d0-ad83-6e209bb6cdf9 couldn't be processed
|
||||
221 Record uid: fff75243-c3be-48a0-98f8-a656f925cb68 couldn't be processed
|
||||
221 Record uid: fff88af8-88d4-4f79-ba1a-79853973c872 couldn't be processed
|
||||
</code></pre><ul>
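<li>As a quick sanity check (a sketch of my own, reusing the error file generated above), counting the distinct failing record UIDs rather than the total error lines:<pre tabindex="0"><code>$ sort -u /tmp/not-processed-errors.txt | wc -l</code></pre></li>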
<li>I looked at some of those records and saw strange objects in their <code>containerCommunity</code>, <code>containerCollection</code>, etc…</li>
</ul>
<pre tabindex="0"><code>{
|
||||
"responseHeader": {
|
||||
"status": 0,
|
||||
"QTime": 0,
|
||||
"params": {
|
||||
"q": "uid:fff1349d-79d5-4ceb-89a1-ce78107d982d",
|
||||
"indent": "true",
|
||||
"wt": "json",
|
||||
"_": "1596957629970"
|
||||
"responseHeader": {
|
||||
"status": 0,
|
||||
"QTime": 0,
|
||||
"params": {
|
||||
"q": "uid:fff1349d-79d5-4ceb-89a1-ce78107d982d",
|
||||
"indent": "true",
|
||||
"wt": "json",
|
||||
"_": "1596957629970"
|
||||
}
|
||||
},
|
||||
"response": {
|
||||
"numFound": 1,
|
||||
"start": 0,
|
||||
"docs": [
|
||||
"response": {
|
||||
"numFound": 1,
|
||||
"start": 0,
|
||||
"docs": [
|
||||
{
|
||||
"containerCommunity": [
|
||||
"155",
|
||||
"155",
|
||||
"{set=null}"
|
||||
"containerCommunity": [
|
||||
"155",
|
||||
"155",
|
||||
"{set=null}"
|
||||
],
|
||||
"uid": "fff1349d-79d5-4ceb-89a1-ce78107d982d",
|
||||
"containerCollection": [
|
||||
"1099",
|
||||
"830",
|
||||
"{set=830}"
|
||||
"uid": "fff1349d-79d5-4ceb-89a1-ce78107d982d",
|
||||
"containerCollection": [
|
||||
"1099",
|
||||
"830",
|
||||
"{set=830}"
|
||||
],
|
||||
"owningComm": [
|
||||
"155",
|
||||
"155",
|
||||
"{set=null}"
|
||||
"owningComm": [
|
||||
"155",
|
||||
"155",
|
||||
"{set=null}"
|
||||
],
|
||||
"isInternal": false,
|
||||
"isBot": false,
|
||||
"statistics_type": "view",
|
||||
"time": "2018-05-08T23:17:00.157Z",
|
||||
"owningColl": [
|
||||
"1099",
|
||||
"830",
|
||||
"{set=830}"
|
||||
"isInternal": false,
|
||||
"isBot": false,
|
||||
"statistics_type": "view",
|
||||
"time": "2018-05-08T23:17:00.157Z",
|
||||
"owningColl": [
|
||||
"1099",
|
||||
"830",
|
||||
"{set=830}"
|
||||
],
|
||||
"_version_": 1621500445042147300
|
||||
"_version_": 1621500445042147300
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -470,8 +470,8 @@ java.lang.RuntimeException: 50 consecutive records couldn't be saved. There's mo
</code></pre><ul>
<li>I deleted those 11,724 records with the strange “set” object in the collections and communities, as well as 360,000 records with <code>id: -1</code></li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:\-1</query></delete>'
</code></pre><ul>
<li>I was going to compare the CUA stats for 2018 and 2019 on CGSpace and DSpace Test, but after Linode rebooted CGSpace (linode18) for maintenance yesterday the solr cores didn’t all come back up OK
<ul>
@@ -487,24 +487,24 @@ $ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=tru
</ul>
<pre tabindex="0"><code>$ cat 2020-08-09-add-ILRI-orcids.csv
|
||||
dc.contributor.author,cg.creator.id
|
||||
"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
|
||||
"Delia Grace","Delia Grace: 0000-0002-0195-9489"
|
||||
"Baker, Derek","Derek Baker: 0000-0001-6020-6973"
|
||||
"Ngan Tran Thi","Tran Thi Ngan: 0000-0002-7184-3086"
|
||||
"Dang Xuan Sinh","Sinh Dang-Xuan: 0000-0002-0522-7808"
|
||||
"Hung Nguyen-Viet","Hung Nguyen-Viet: 0000-0001-9877-0596"
|
||||
"Pham Van Hung","Pham Anh Hung: 0000-0001-9366-0259"
|
||||
"Lindahl, Johanna F.","Johanna Lindahl: 0000-0002-1175-0398"
|
||||
"Teufel, Nils","Nils Teufel: 0000-0001-5305-6620"
|
||||
"Duncan, Alan J.",Alan Duncan: 0000-0002-3954-3067"
|
||||
"Moodley, Arshnee","Arshnee Moodley: 0000-0002-6469-3948"
|
||||
"Grace, Delia","Delia Grace: 0000-0002-0195-9489"
|
||||
"Delia Grace","Delia Grace: 0000-0002-0195-9489"
|
||||
"Baker, Derek","Derek Baker: 0000-0001-6020-6973"
|
||||
"Ngan Tran Thi","Tran Thi Ngan: 0000-0002-7184-3086"
|
||||
"Dang Xuan Sinh","Sinh Dang-Xuan: 0000-0002-0522-7808"
|
||||
"Hung Nguyen-Viet","Hung Nguyen-Viet: 0000-0001-9877-0596"
|
||||
"Pham Van Hung","Pham Anh Hung: 0000-0001-9366-0259"
|
||||
"Lindahl, Johanna F.","Johanna Lindahl: 0000-0002-1175-0398"
|
||||
"Teufel, Nils","Nils Teufel: 0000-0001-5305-6620"
|
||||
"Duncan, Alan J.",Alan Duncan: 0000-0002-3954-3067"
|
||||
"Moodley, Arshnee","Arshnee Moodley: 0000-0002-6469-3948"
|
||||
</code></pre><ul>
|
||||
<li>That got me curious, so I generated a list of all the unique ORCID identifiers we have in the database:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>dspace=# \COPY (SELECT DISTINCT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=240) TO /tmp/2020-08-09-orcid-identifiers.csv;
|
||||
COPY 2095
|
||||
dspace=# \q
|
||||
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' /tmp/2020-08-09-orcid-identifiers.csv | sort | uniq > /tmp/2020-08-09-orcid-identifiers-uniq.csv
|
||||
$ grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' /tmp/2020-08-09-orcid-identifiers.csv | sort | uniq > /tmp/2020-08-09-orcid-identifiers-uniq.csv
|
||||
$ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
|
||||
1949 /tmp/2020-08-09-orcid-identifiers-uniq.csv
|
||||
</code></pre><ul>
|
||||
@ -517,9 +517,9 @@ $ wc -l /tmp/2020-08-09-orcid-identifiers-uniq.csv
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
|
||||
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2018/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
|
||||
...
|
||||
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
|
||||
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'
|
||||
</code></pre><ul>
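<li>A sketch of my own for looping the per-core purges that are elided with “…” above, assuming the yearly cores are named <code>statistics-2010</code> through <code>statistics-2018</code>:<pre tabindex="0"><code>$ for year in $(seq 2010 2018); do curl -s "http://localhost:8081/solr/statistics-$year/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>owningColl:/.*set.*/</query></delete>'; done</code></pre></li>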
<li>I added <code>Googlebot</code> and <code>Twitterbot</code> to the list of explicitly allowed bots
<ul>
@@ -573,7 +573,7 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
<ul>
<li>Last night I started the processing of the statistics-2016 core with the Atmire stats util and I see some errors like this:</li>
</ul>
<pre tabindex="0"><code>Record uid: f6b288d7-d60d-4df9-b311-1696b88552a0 couldn't be processed
com.atmire.statistics.util.update.atomic.ProcessingException: something went wrong while processing record uid: f6b288d7-d60d-4df9-b311-1696b88552a0, an error occured in the com.atmire.statistics.util.update.atomic.processor.ContainerOwnerDBProcessor
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.applyProcessors(SourceFile:304)
at com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdater.processRecords(SourceFile:176)
@@ -598,8 +598,8 @@ Caused by: java.lang.NullPointerException
</li>
<li>I purged the unmigrated docs and continued processing:</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2016/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics-2016
</code></pre><ul>
<li>Altmetric asked for a dump of CGSpace’s OAI “sets” so they can update their affiliation mappings
@@ -608,8 +608,8 @@ $ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisti
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/oai/request?verb=ListSets' > /tmp/0.xml
|
||||
$ for num in {100..1300..100}; do http "https://cgspace.cgiar.org/oai/request?verb=ListSets&resumptionToken=////$num" > /tmp/$num.xml; sleep 2; done
|
||||
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/oai/request?verb=ListSets' > /tmp/0.xml
|
||||
$ for num in {100..1300..100}; do http "https://cgspace.cgiar.org/oai/request?verb=ListSets&resumptionToken=////$num" > /tmp/$num.xml; sleep 2; done
|
||||
$ for num in {0..1300..100}; do cat /tmp/$num.xml >> /tmp/cgspace-oai-sets.xml; done
|
||||
</code></pre><ul>
|
||||
<li>This produces one file that has all the sets, albeit with 14 pages of responses concatenated into one document, but that’s how theirs was in the first place…</li>
|
||||
@@ -620,9 +620,9 @@ $ for num in {0..1300..100}; do cat /tmp/$num.xml >> /tmp/cgspace-oai-sets
<li>The <code>com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI</code> script that was processing 2015 records last night started spitting out tons of errors and created 120GB of logs…</li>
<li>I looked at a few of the UIDs that it was having problems with and they were unmigrated ones… so I purged them in 2015 and all the rest of the statistics cores</li>
</ul>
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
|
||||
<pre tabindex="0"><code>$ curl -s "http://localhost:8081/solr/statistics-2015/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
|
||||
...
|
||||
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
|
||||
$ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
|
||||
</code></pre><h2 id="2020-08-19">2020-08-19</h2>
|
||||
<ul>
|
||||
<li>I tested the DSpace 5 and DSpace 6 versions of the <a href="https://github.com/ilri/cgspace-java-helpers">country code tagger curation task</a> and noticed a few things
|
||||
@@ -715,17 +715,17 @@ $ curl -s "http://localhost:8081/solr/statistics-2010/update?softCommit=tru
</ul>
</li>
</ul>
<pre tabindex="0"><code>$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=0' User-Agent:'curl' > /tmp/wle-trade-off-page1.xml
$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=100' User-Agent:'curl' > /tmp/wle-trade-off-page2.xml
$ http 'https://cgspace.cgiar.org/open-search/discover?scope=10568%2F34494&query=trade+off&rpp=100&start=200' User-Agent:'curl' > /tmp/wle-trade-off-page3.xml
</code></pre><ul>
<li>Ugh, and to extract the <code><id></code> from each <code><entry></code> we have to use an XPath query, but use a <a href="http://blog.powered-up-games.com/wordpress/archives/70">hack to ignore the default namespace by setting each element’s local name</a>:</li>
</ul>
<pre tabindex="0"><code>$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page1.xml >> /tmp/ids.txt
$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page2.xml >> /tmp/ids.txt
$ xmllint --xpath '//*[local-name()="entry"]/*[local-name()="id"]/text()' /tmp/wle-trade-off-page3.xml >> /tmp/ids.txt
$ sort -u /tmp/ids.txt > /tmp/ids-sorted.txt
$ grep -oE '[0-9]+/[0-9]+' /tmp/ids.txt > /tmp/handles.txt
</code></pre><ul>
<li>Now I have all the handles for the matching items and I can use the REST API to get each item’s PDFs…
<ul>
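<li>For instance, a rough sketch of my own (not from the notes), assuming the DSpace 5 REST API’s <code>/rest/handle</code> lookup with the <code>expand=bitstreams</code> parameter and reusing the <code>/tmp/handles.txt</code> file generated above:<pre tabindex="0"><code>$ while read -r handle; do http "https://cgspace.cgiar.org/rest/handle/$handle?expand=bitstreams" | jq -r '.bitstreams[].retrieveLink'; sleep 2; done < /tmp/handles.txt</code></pre></li>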