mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2022-03-04
@ -28,7 +28,7 @@ I checked to see if the Solr sharding task that is supposed to run on January 1s
|
||||
I tested on DSpace Test as well and it doesn’t work there either
|
||||
I asked on the dspace-tech mailing list because it seems to be broken, and actually now I’m not sure if we’ve ever had the sharding task run successfully over all these years
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.92.2" />
|
||||
<meta name="generator" content="Hugo 0.93.1" />
|
||||
@ -124,7 +124,7 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
|
||||
<ul>
|
||||
<li>I tried to shard my local dev instance and it fails the same way:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace stats-util -s
|
||||
<pre tabindex="0"><code>$ JAVA_OPTS="-Xms768m -Xmx768m -Dfile.encoding=UTF-8" ~/dspace/bin/dspace stats-util -s
|
||||
Moving: 9318 into core statistics-2016
|
||||
Exception: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
|
||||
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://localhost:8081/solr//statistics-2016
|
||||
@ -179,15 +179,15 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
|
||||
<li>Despite failing instantly, a <code>statistics-2016</code> directory was created, but it only has a data dir (no conf)</li>
|
||||
<li>The Tomcat access logs show more:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-17YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 423
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 77
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016&instanceDir=statistics&dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&wt=javabin&version=2 HTTP/1.1" 200 63
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/select?csv.mv.separator=%7C&q=*%3A*&fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&rows=10000&wt=csv HTTP/1.1" 200 4359517
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16248
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&f.que
ry.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156
|
||||
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update?wt=javabin&version=2 HTTP/1.1" 200 41
|
||||
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update HTTP/1.1" 200 40
|
||||
<pre tabindex="0"><code>127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=type%3A2+AND+id%3A1&wt=javabin&version=2 HTTP/1.1" 200 107
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/statistics/select?q=*%3A*&rows=0&facet=true&facet.range=time&facet.range.start=NOW%2FYEAR-17YEARS&facet.range.end=NOW%2FYEAR%2B0YEARS&facet.range.gap=%2B1YEAR&facet.mincount=1&wt=javabin&version=2 HTTP/1.1" 200 423
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=STATUS&core=statistics-2016&indexInfo=true&wt=javabin&version=2 HTTP/1.1" 200 77
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016&instanceDir=statistics&dataDir=%2FUsers%2Faorth%2Fdspace%2Fsolr%2Fstatistics-2016%2Fdata&wt=javabin&version=2 HTTP/1.1" 200 63
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/select?csv.mv.separator=%7C&q=*%3A*&fq=time%3A%28%5B2016%5C-01%5C-01T00%5C%3A00%5C%3A00Z+TO+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%5D+NOT+2017%5C-01%5C-01T00%5C%3A00%5C%3A00Z%29&rows=10000&wt=csv HTTP/1.1" 200 4359517
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "GET /solr/statistics/admin/luke?show=schema&wt=javabin&version=2 HTTP/1.1" 200 16248
|
||||
127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "POST /solr//statistics-2016/update/csv?commit=true&softCommit=false&waitSearcher=true&f.previousWorkflowStep.split=true&f.previousWorkflowStep.separator=%7C&f.previousWorkflowStep.encapsulator=%22&f.actingGroupId.split=true&f.actingGroupId.separator=%7C&f.actingGroupId.encapsulator=%22&f.containerCommunity.split=true&f.containerCommunity.separator=%7C&f.containerCommunity.encapsulator=%22&f.range.split=true&f.range.separator=%7C&f.range.encapsulator=%22&f.containerItem.split=true&f.containerItem.separator=%7C&f.containerItem.encapsulator=%22&f.p_communities_map.split=true&f.p_communities_map.separator=%7C&f.p_communities_map.encapsulator=%22&f.ngram_query_search.split=true&f.ngram_query_search.separator=%7C&f.ngram_query_search.encapsulator=%22&f.containerBitstream.split=true&f.containerBitstream.separator=%7C&f.containerBitstream.encapsulator=%22&f.owningItem.split=true&f.owningItem.separator=%7C&f.owningItem.encapsulator=%22&f.actingGroupParentId.split=true&f.actingGroupParentId.separator=%7C&f.actingGroupParentId.encapsulator=%22&f.text.split=true&f.text.separator=%7C&f.text.encapsulator=%22&f.simple_query_search.split=true&f.simple_query_search.separator=%7C&f.simple_query_search.encapsulator=%22&f.owningComm.split=true&f.owningComm.separator=%7C&f.owningComm.encapsulator=%22&f.owner.split=true&f.owner.separator=%7C&f.owner.encapsulator=%22&f.filterquery.split=true&f.filterquery.separator=%7C&f.filterquery.encapsulator=%22&f.p_group_map.split=true&f.p_group_map.separator=%7C&f.p_group_map.encapsulator=%22&f.actorMemberGroupId.split=true&f.actorMemberGroupId.separator=%7C&f.actorMemberGroupId.encapsulator=%22&f.bitstreamId.split=true&f.bitstreamId.separator=%7C&f.bitstreamId.encapsulator=%22&f.group_name.split=true&f.group_name.separator=%7C&f.group_name.encapsulator=%22&f.p_communities_name.split=true&f.p_communities_name.separator=%7C&f.p_communities_name.encapsulator=%22&f.query.split=true&f.query.separator=%7C&f.que
ry.encapsulator=%22&f.workflowStep.split=true&f.workflowStep.separator=%7C&f.workflowStep.encapsulator=%22&f.containerCollection.split=true&f.containerCollection.separator=%7C&f.containerCollection.encapsulator=%22&f.complete_query_search.split=true&f.complete_query_search.separator=%7C&f.complete_query_search.encapsulator=%22&f.p_communities_id.split=true&f.p_communities_id.separator=%7C&f.p_communities_id.encapsulator=%22&f.rangeDescription.split=true&f.rangeDescription.separator=%7C&f.rangeDescription.encapsulator=%22&f.group_id.split=true&f.group_id.separator=%7C&f.group_id.encapsulator=%22&f.bundleName.split=true&f.bundleName.separator=%7C&f.bundleName.encapsulator=%22&f.ngram_simplequery_search.split=true&f.ngram_simplequery_search.separator=%7C&f.ngram_simplequery_search.encapsulator=%22&f.group_map.split=true&f.group_map.separator=%7C&f.group_map.encapsulator=%22&f.owningColl.split=true&f.owningColl.separator=%7C&f.owningColl.encapsulator=%22&f.p_group_id.split=true&f.p_group_id.separator=%7C&f.p_group_id.encapsulator=%22&f.p_group_name.split=true&f.p_group_name.separator=%7C&f.p_group_name.encapsulator=%22&wt=javabin&version=2 HTTP/1.1" 409 156
|
||||
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update?wt=javabin&version=2 HTTP/1.1" 200 41
|
||||
127.0.0.1 - - [04/Jan/2017:22:44:00 +0200] "POST /solr/datatables/update HTTP/1.1" 200 40
|
||||
</code></pre><ul>
|
||||
<li>Very interesting… it creates the core and then fails somehow</li>
|
||||
</ul>
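To make the failing step easier to spot in a long access log like the one above, here is a minimal sketch that filters for requests with an HTTP status of 400 or higher. It assumes the log uses the common log format seen in the excerpt; the `failed_requests` helper and the shortened sample lines are hypothetical and only for illustration, not part of DSpace or Solr.

```python
import re

# Common log format, as in the Tomcat access log excerpt above:
# host ident user [timestamp] "METHOD path PROTOCOL" status bytes
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def failed_requests(lines):
    """Return (time, method, path without query string, status) for requests with status >= 400."""
    failures = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not match the expected format
        status = int(match.group("status"))
        if status >= 400:
            # Drop the query string so the very long update/csv URL stays readable
            path = match.group("path").split("?", 1)[0]
            failures.append((match.group("time"), match.group("method"), path, status))
    return failures


# Hypothetical sample: two lines from the excerpt with their query strings shortened
sample = [
    '127.0.0.1 - - [04/Jan/2017:22:39:05 +0200] "GET /solr/admin/cores?action=CREATE&name=statistics-2016 HTTP/1.1" 200 63',
    '127.0.0.1 - - [04/Jan/2017:22:39:07 +0200] "POST /solr//statistics-2016/update/csv?commit=true HTTP/1.1" 409 156',
]

for time, method, path, status in failed_requests(sample):
    print(f"{time} {method} {path} -> {status}")
```

On the sample, only the `POST` to `/solr//statistics-2016/update/csv` is reported, which matches the request that returned 409 right after the core was created.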
|
||||
@ -208,11 +208,11 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
|
||||
<li>I found an old post on the mailing list discussing a similar issue, and listing some SQL commands that might help</li>
|
||||
<li>For example, this shows 186 mappings for the item, the first three of which are real:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80596';
|
||||
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80596';
|
||||
</code></pre><ul>
|
||||
<li>Then I deleted the others:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
|
||||
<pre tabindex="0"><code>dspace=# delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
|
||||
</code></pre><ul>
|
||||
<li>And in the item view it now shows the correct mappings</li>
|
||||
<li>I will have to ask the DSpace people if this is a valid approach</li>
|
||||
@ -224,19 +224,19 @@ Caused by: java.net.SocketException: Broken pipe (Write failed)
|
||||
<li>Error in <code>fix-metadata-values.py</code> when it tries to print the value for Entwicklung & Ländlicher Raum:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>Traceback (most recent call last):
|
||||
File "./fix-metadata-values.py", line 80, in <module>
|
||||
print("Fixing {} occurences of: {}".format(records_to_fix, record[0]))
|
||||
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
|
||||
File "./fix-metadata-values.py", line 80, in <module>
|
||||
print("Fixing {} occurences of: {}".format(records_to_fix, record[0]))
|
||||
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
|
||||
</code></pre><ul>
|
||||
<li>Seems we need to encode as UTF-8 before printing to screen, i.e.:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
|
||||
<pre tabindex="0"><code>print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
|
||||
</code></pre><ul>
|
||||
<li>See: <a href="http://stackoverflow.com/a/36427358/487333">http://stackoverflow.com/a/36427358/487333</a></li>
|
||||
<li>I’m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database… I’ve never had this issue before</li>
|
||||
<li>Now back to cleaning up some journal titles so we can make the controlled vocabulary:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
|
||||
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
|
||||
</code></pre><ul>
|
||||
<li>Now get the top 500 journal titles:</li>
|
||||
</ul>
|
||||
@ -255,9 +255,9 @@ UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15:
|
||||
<li>Fix the two items Maria found with duplicate mappings with this script:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>/* 184 incorrect mappings: https://cgspace.cgiar.org/handle/10568/78596 */
|
||||
delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
|
||||
delete from collection2item where item_id = '80596' and id not in (90792, 90806, 90807);
|
||||
/* 1 incorrect mapping: https://cgspace.cgiar.org/handle/10568/78658 */
|
||||
delete from collection2item where id = '91082';
|
||||
delete from collection2item where id = '91082';
|
||||
</code></pre><h2 id="2017-01-17">2017-01-17</h2>
|
||||
<ul>
|
||||
<li>Helping clean up some file names in the 232 CIAT records that Sisay worked on last week</li>
|
||||
@ -266,15 +266,15 @@ delete from collection2item where id = '91082';
|
||||
<li>And the file names don’t really matter either, as long as the SAF Builder tool can read them—after that DSpace renames them with a hash in the assetstore</li>
|
||||
<li>Seems like the only ones I should replace are the <code>'</code> apostrophe characters, as <code>%27</code>:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>value.replace("'",'%27')
|
||||
<pre tabindex="0"><code>value.replace("'",'%27')
|
||||
</code></pre><ul>
|
||||
<li>Add the item’s Type to the filename column as a hint to SAF Builder so it can set a more useful description field:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>value + "__description:" + cells["dc.type"].value
|
||||
<pre tabindex="0"><code>value + "__description:" + cells["dc.type"].value
|
||||
</code></pre><ul>
|
||||
<li>Test importing of the new CIAT records (actually there are 232, not 234):</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
|
||||
<pre tabindex="0"><code>$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/dspacetest.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/79042 --source /home/aorth/CIAT_234/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
|
||||
</code></pre><ul>
|
||||
<li>Many of the PDFs are 20, 30, 40, 50+ MB, which makes a total of 4GB</li>
|
||||
<li>These are scanned from paper and likely have no compression, so we should test whether these compression techniques help without compromising the quality too much:</li>
|
||||
@ -289,7 +289,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
|
||||
<li>In testing a random sample of CIAT’s PDFs for compressibility, it looks like all of these methods generally increase the file size, so we will just import them as they are</li>
|
||||
<li>Import 232 CIAT records into CGSpace:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
|
||||
<pre tabindex="0"><code>$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/68704 --source /home/aorth/CIAT_232/SimpleArchiveFormat/ --mapfile=/tmp/ciat.map &> /tmp/ciat.log
|
||||
</code></pre><h2 id="2017-01-22">2017-01-22</h2>
|
||||
<ul>
|
||||
<li>Looking at some records that Sisay is having problems importing into DSpace Test (seems to be because of copious whitespace return characters from Excel’s CSV exporter)</li>
|
||||
@ -300,7 +300,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
|
||||
<li>I merged Atmire’s pull request into the development branch so they can deploy it on DSpace Test</li>
|
||||
<li>Move some old ILRI Program communities to a new subcommunity for former programs (10568/79164):</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child="$community" && /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child="$community"; done
|
||||
<pre tabindex="0"><code>$ for community in 10568/171 10568/27868 10568/231 10568/27869 10568/150 10568/230 10568/32724 10568/172; do /home/cgspace.cgiar.org/bin/dspace community-filiator --remove --parent=10568/27866 --child="$community" && /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/79164 --child="$community"; done
|
||||
</code></pre><ul>
|
||||
<li>Move some collections with <a href="https://gist.github.com/alanorth/e60b530ed4989df0c731afbb0c640515"><code>move-collections.sh</code></a> using the following config:</li>
|
||||
</ul>
|
||||
@ -311,7 +311,7 @@ $ gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -
|
||||
<li>Run all updates on DSpace Test and reboot the server</li>
|
||||
<li>Run fixes for journal titles on CGSpace:</li>
|
||||
</ul>
|
||||
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
|
||||
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/fix-49-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'password'
|
||||
</code></pre><ul>
|
||||
<li>Create a new list of the top 500 journal titles from the database:</li>
|
||||
</ul>
|
||||