mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2021-09-13
@ -40,7 +40,7 @@ Testing the CMYK patch on a collection with 650 items:
|
||||
|
||||
$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.87.0" />
|
||||
<meta name="generator" content="Hugo 0.88.1" />
|
||||
|
||||
|
||||
|
||||
@ -136,16 +136,16 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
|
||||
<li>Remove redundant/duplicate text in the DSpace submission license</li>
|
||||
<li>Testing the CMYK patch on a collection with 650 items:</li>
|
||||
</ul>
|
||||
<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
|
||||
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
|
||||
</code></pre><h2 id="2017-04-03">2017-04-03</h2>
|
||||
<ul>
|
||||
<li>Continue testing the CMYK patch on more communities:</li>
|
||||
</ul>
|
||||
<pre><code>$ [dspace]/bin/dspace filter-media -f -i 10568/1 -p "ImageMagick PDF Thumbnail" -v >> /tmp/filter-media-cmyk.txt 2>&1
|
||||
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/1 -p "ImageMagick PDF Thumbnail" -v >> /tmp/filter-media-cmyk.txt 2>&1
|
||||
</code></pre><ul>
|
||||
<li>So far there are almost 500:</li>
|
||||
</ul>
|
||||
<pre><code>$ grep -c profile /tmp/filter-media-cmyk.txt
|
||||
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
|
||||
484
|
||||
</code></pre><ul>
|
||||
<li>Looking at the CG Core document again, I’ll send some feedback to Peter and Abenet:
|
||||
@ -157,39 +157,39 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p "ImageMagick PDF Th
|
||||
</li>
|
||||
<li>Also, I’m noticing some weird outliers in <code>cg.coverage.region</code>, need to remember to go correct these later:</li>
|
||||
</ul>
|
||||
<pre><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
|
||||
<pre tabindex="0"><code>dspace=# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=227;
|
||||
</code></pre><h2 id="2017-04-04">2017-04-04</h2>
|
||||
<ul>
|
||||
<li>The <code>filter-media</code> script has been running on more large communities and now there are many more CMYK PDFs that have been fixed:</li>
|
||||
</ul>
|
||||
<pre><code>$ grep -c profile /tmp/filter-media-cmyk.txt
|
||||
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
|
||||
1584
|
||||
</code></pre><ul>
|
||||
<li>Trying to find a way to get the number of items submitted by a certain user in 2016</li>
|
||||
<li>It’s not possible in the DSpace search / module interfaces, but might be able to be derived from <code>dc.description.provenance</code>, as that field contains the name and email of the submitter/approver, ie:</li>
|
||||
</ul>
|
||||
<pre><code>Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
|
||||
<pre tabindex="0"><code>Submitted by Francesca Giampieri (fgiampieri) on 2016-01-19T13:56:43Z^M
|
||||
No. of bitstreams: 1^M
|
||||
ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0 (MD5)
|
||||
</code></pre><ul>
|
||||
<li>This SQL query returns fields that were submitted or approved by giampieri in 2016 and contain a “checksum” (ie, there was a bitstream in the submission):</li>
|
||||
</ul>
|
||||
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
|
||||
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
|
||||
</code></pre><ul>
|
||||
<li>Then this one does the same, but for fields that don’t contain checksums (ie, there was no bitstream in the submission):</li>
|
||||
</ul>
|
||||
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
|
||||
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^(Submitted|Approved).*giampieri.*2016-.*' and text_value !~ '^(Submitted|Approved).*giampieri.*2016-.*checksum.*';
|
||||
</code></pre><ul>
|
||||
<li>For some reason there seem to be way too many fields, for example there are 498 + 13 here, which is 511 items for just this one user.</li>
|
||||
<li>It looks like there can be a scenario where the user submitted AND approved it, so some records might be doubled…</li>
|
||||
<li>In that case it might just be better to see how many the user submitted (both <em>with</em> and <em>without</em> bitstreams):</li>
|
||||
</ul>
|
||||
<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
|
||||
<pre tabindex="0"><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=28 and text_value ~ '^Submitted.*giampieri.*2016-.*';
|
||||
</code></pre><h2 id="2017-04-05">2017-04-05</h2>
|
||||
<ul>
|
||||
<li>After doing a few more large communities it seems this is the final count of CMYK PDFs:</li>
|
||||
</ul>
|
||||
<pre><code>$ grep -c profile /tmp/filter-media-cmyk.txt
|
||||
<pre tabindex="0"><code>$ grep -c profile /tmp/filter-media-cmyk.txt
|
||||
2505
|
||||
</code></pre><h2 id="2017-04-06">2017-04-06</h2>
|
||||
<ul>
|
||||
@ -260,7 +260,7 @@ ILAC_Brief21_PMCA.pdf: 113462 bytes, checksum: 249fef468f401c066a119f5db687add0
|
||||
<li>I will have to check the OAI cron scripts on DSpace Test, and then run them on CGSpace</li>
|
||||
<li>Running <code>dspace oai import</code> and <code>dspace oai clean-cache</code> have zero effect, but this seems to rebuild the cache from scratch:</li>
|
||||
</ul>
|
||||
<pre><code>$ /home/dspacetest.cgiar.org/bin/dspace oai import -c
|
||||
<pre tabindex="0"><code>$ /home/dspacetest.cgiar.org/bin/dspace oai import -c
|
||||
...
|
||||
63900 items imported so far...
|
||||
64000 items imported so far...
|
||||
@ -273,7 +273,7 @@ OAI 2.0 manager action ended. It took 829 seconds.
|
||||
<li>The import command should theoretically catch situations like this where an item’s metadata was updated, but in this case we changed the metadata schema and it doesn’t seem to catch it (could be a bug!)</li>
|
||||
<li>Attempting a full rebuild of OAI on CGSpace:</li>
|
||||
</ul>
|
||||
<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
|
||||
<pre tabindex="0"><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
|
||||
$ time schedtool -D -e ionice -c2 -n7 nice -n19 /home/cgspace.cgiar.org/bin/dspace oai import -c
|
||||
...
|
||||
58700 items imported so far...
|
||||
@ -326,14 +326,14 @@ sys 1m29.310s
|
||||
<li>One thing we need to remember if we start using OAI is to enable the autostart of the harvester process (see <code>harvester.autoStart</code> in <code>dspace/config/modules/oai.cfg</code>)</li>
|
||||
<li>Error when running DSpace cleanup task on DSpace Test and CGSpace (on the same item), I need to look this up:</li>
|
||||
</ul>
|
||||
<pre><code>Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
|
||||
<pre tabindex="0"><code>Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
|
||||
Detail: Key (bitstream_id)=(435) is still referenced from table "bundle".
|
||||
</code></pre><h2 id="2017-04-18">2017-04-18</h2>
|
||||
<ul>
|
||||
<li>Helping Tsega test his new <a href="https://github.com/ilri/ckm-cgspace-rest-api">CGSpace REST API Rails app</a> on DSpace Test</li>
|
||||
<li>Setup and run with:</li>
|
||||
</ul>
|
||||
<pre><code>$ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
|
||||
<pre tabindex="0"><code>$ git clone https://github.com/ilri/ckm-cgspace-rest-api.git
|
||||
$ cd ckm-cgspace-rest-api/app
|
||||
$ gem install bundler
|
||||
$ bundle
|
||||
@ -342,12 +342,12 @@ $ rails -s
|
||||
</code></pre><ul>
|
||||
<li>I used Ansible to create a PostgreSQL user that only has <code>SELECT</code> privileges on the tables it needs:</li>
|
||||
</ul>
|
||||
<pre><code>$ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present'
|
||||
<pre tabindex="0"><code>$ ansible linode02 -u aorth -b --become-user=postgres -K -m postgresql_user -a 'db=database name=username password=password priv=CONNECT/item:SELECT/metadatavalue:SELECT/metadatafieldregistry:SELECT/metadataschemaregistry:SELECT/collection:SELECT/handle:SELECT/bundle2bitstream:SELECT/bitstream:SELECT/bundle:SELECT/item2bundle:SELECT state=present'
|
||||
</code></pre><ul>
|
||||
<li>Need to look into <a href="https://github.com/puma/puma/blob/master/docs/systemd.md">running this via systemd</a></li>
|
||||
<li>This is interesting for creating runnable commands from <code>bundle</code>:</li>
|
||||
</ul>
|
||||
<pre><code>$ bundle binstubs puma --path ./sbin
|
||||
<pre tabindex="0"><code>$ bundle binstubs puma --path ./sbin
|
||||
</code></pre><h2 id="2017-04-19">2017-04-19</h2>
|
||||
<ul>
|
||||
<li>Usman sent another link to their OAI interface, where the country names are now capitalized: <a href="https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&metadataPrefix=dim&identifier=oai:data.cifor.org:11463/947">https://data.cifor.org/dspace/oai/cgiar?verb=GetRecord&metadataPrefix=dim&identifier=oai:data.cifor.org:11463/947</a></li>
|
||||
@ -360,15 +360,15 @@ $ rails -s
|
||||
<li>Looking at 933 CIAT records from Sisay, he’s having problems creating a SAF bundle to import to DSpace Test</li>
|
||||
<li>I started by looking at his CSV in OpenRefine, and I see there are a <em>bunch</em> of fields with whitespace issues that I cleaned up:</li>
|
||||
</ul>
|
||||
<pre><code>value.replace(" ||","||").replace("|| ","||").replace(" || ","||")
|
||||
<pre tabindex="0"><code>value.replace(" ||","||").replace("|| ","||").replace(" || ","||")
|
||||
</code></pre><ul>
|
||||
<li>Also, all the filenames have spaces and URL encoded characters in them, so I decoded them from URL encoding:</li>
|
||||
</ul>
|
||||
<pre><code>unescape(value,"url")
|
||||
<pre tabindex="0"><code>unescape(value,"url")
|
||||
</code></pre><ul>
|
||||
<li>Then create the filename column using the following transform from URL:</li>
|
||||
</ul>
|
||||
<pre><code>value.split('/')[-1].replace(/#.*$/,"")
|
||||
<pre tabindex="0"><code>value.split('/')[-1].replace(/#.*$/,"")
|
||||
</code></pre><ul>
|
||||
<li>The <code>replace</code> part is because some URLs have an anchor like <code>#page=14</code> which we obviously don’t want on the filename</li>
|
||||
<li>Also, we need to only use the PDF on the item corresponding with page 1, so we don’t end up with literally hundreds of duplicate PDFs</li>
|
||||
@ -381,7 +381,7 @@ $ rails -s
|
||||
<li>Looking at the CIAT data again, a bunch of items have metadata values ending in <code>||</code>, which might cause blank fields to be added at import time</li>
|
||||
<li>Cleaning them up with OpenRefine:</li>
|
||||
</ul>
|
||||
<pre><code>value.replace(/\|\|$/,"")
|
||||
<pre tabindex="0"><code>value.replace(/\|\|$/,"")
|
||||
</code></pre><ul>
|
||||
<li>Working with the CIAT data in OpenRefine to remove the filename column from all but the first item which requires a particular PDF, as there are many items pointing to the same PDF, which would cause hundreds of duplicates to be added if we included them in the SAF bundle</li>
|
||||
<li>I did some massaging in OpenRefine, flagging duplicates with stars and flags, then filtering and removing the filenames of those items</li>
|
||||
@ -391,15 +391,15 @@ $ rails -s
|
||||
<li>Also there are loads of whitespace errors in almost every field, so I trimmed leading/trailing whitespace</li>
|
||||
<li>Unbelievable, there are also metadata values like:</li>
|
||||
</ul>
|
||||
<pre><code>COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
|
||||
<pre tabindex="0"><code>COLLETOTRICHUM LINDEMUTHIANUM|| FUSARIUM||GERMPLASM
|
||||
</code></pre><ul>
|
||||
<li>Add a description to the file names using:</li>
|
||||
</ul>
|
||||
<pre><code>value + "__description:" + cells["dc.type"].value
|
||||
<pre tabindex="0"><code>value + "__description:" + cells["dc.type"].value
|
||||
</code></pre><ul>
|
||||
<li>Test import of 933 records:</li>
|
||||
</ul>
|
||||
<pre><code>$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
|
||||
<pre tabindex="0"><code>$ [dspace]/bin/dspace import -a -e aorth@mjanja.ch -c 10568/87193 -s /home/aorth/src/CIAT-Books/SimpleArchiveFormat/ -m /tmp/ciat
|
||||
$ wc -l /tmp/ciat
|
||||
933 /tmp/ciat
|
||||
</code></pre><ul>
|
||||
@ -409,7 +409,7 @@ $ wc -l /tmp/ciat
|
||||
<li>More work on Ansible infrastructure stuff for Tsega’s CKM DSpace REST API</li>
|
||||
<li>I’m going to start re-processing all the PDF thumbnails on CGSpace, one community at a time:</li>
|
||||
</ul>
|
||||
<pre><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
|
||||
<pre tabindex="0"><code>$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
|
||||
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media -f -v -i 10568/71249 -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media-cmyk.txt
|
||||
</code></pre><h2 id="2017-04-22">2017-04-22</h2>
|
||||
<ul>
|
||||
@ -417,13 +417,13 @@ $ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace filter-media
|
||||
<li>The solution is to remove the ID (ie set to NULL) from the <code>primary_bitstream_id</code> column in the <code>bundle</code> table</li>
|
||||
<li>After doing that and running the <code>cleanup</code> task again I find more bitstreams that are affected and end up with a long list of IDs that need to be fixed:</li>
|
||||
</ul>
|
||||
<pre><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);
|
||||
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1136, 1132, 1220, 1236, 3002, 3255, 5322);
|
||||
</code></pre><h2 id="2017-04-24">2017-04-24</h2>
|
||||
<ul>
|
||||
<li>Two users mentioned some items they recently approved not showing up in the search / XMLUI</li>
|
||||
<li>I looked at the logs from yesterday and it seems the Discovery indexing has been crashing:</li>
|
||||
</ul>
|
||||
<pre><code>2017-04-24 00:00:15,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
|
||||
<pre tabindex="0"><code>2017-04-24 00:00:15,578 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (55 of 58853): 70590
|
||||
2017-04-24 00:00:15,586 INFO com.atmire.dspace.discovery.AtmireSolrService @ Processing (56 of 58853): 74507
|
||||
2017-04-24 00:00:15,614 ERROR com.atmire.dspace.discovery.AtmireSolrService @ this IndexWriter is closed
|
||||
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this IndexWriter is closed
|
||||
@ -447,7 +447,7 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
|
||||
</code></pre><ul>
|
||||
<li>Looking at the past few days of logs, it looks like the indexing process started crashing on 2017-04-20:</li>
|
||||
</ul>
|
||||
<pre><code># grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
|
||||
<pre tabindex="0"><code># grep -c 'IndexWriter is closed' [dspace]/log/dspace.log.2017-04-*
|
||||
[dspace]/log/dspace.log.2017-04-01:0
|
||||
[dspace]/log/dspace.log.2017-04-02:0
|
||||
[dspace]/log/dspace.log.2017-04-03:0
|
||||
@ -475,12 +475,12 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
|
||||
</code></pre><ul>
|
||||
<li>I restarted Tomcat and re-ran the discovery process manually:</li>
|
||||
</ul>
|
||||
<pre><code>[dspace]/bin/dspace index-discovery
|
||||
<pre tabindex="0"><code>[dspace]/bin/dspace index-discovery
|
||||
</code></pre><ul>
|
||||
<li>Now everything is ok</li>
|
||||
<li>Finally finished manually running the cleanup task over and over and null’ing the conflicting IDs:</li>
|
||||
</ul>
|
||||
<pre><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
|
||||
<pre tabindex="0"><code>dspace=# update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (435, 1132, 1136, 1220, 1236, 3002, 3255, 5322, 5098, 5982, 5897, 6245, 6184, 4927, 6070, 4925, 6888, 7368, 7136, 7294, 7698, 7864, 10799, 10839, 11765, 13241, 13634, 13642, 14127, 14146, 15582, 16116, 16254, 17136, 17486, 17824, 18098, 22091, 22149, 22206, 22449, 22548, 22559, 22454, 22253, 22553, 22897, 22941, 30262, 33657, 39796, 46943, 56561, 58237, 58739, 58734, 62020, 62535, 64149, 64672, 66988, 66919, 76005, 79780, 78545, 81078, 83620, 84492, 92513, 93915);
|
||||
</code></pre><ul>
|
||||
<li>Now running the cleanup script on DSpace Test and already seeing 11GB freed from the assetstore—it’s likely we haven’t had a cleanup task complete successfully in years…</li>
|
||||
</ul>
|
||||
@ -489,12 +489,12 @@ org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: this Index
|
||||
<li>Finally finished running the PDF thumbnail re-processing on CGSpace, the final count of CMYK PDFs is about 2751</li>
|
||||
<li>Preparing to run the cleanup task on CGSpace, I want to see how many files are in the assetstore:</li>
|
||||
</ul>
|
||||
<pre><code># find [dspace]/assetstore/ -type f | wc -l
|
||||
<pre tabindex="0"><code># find [dspace]/assetstore/ -type f | wc -l
|
||||
113104
|
||||
</code></pre><ul>
|
||||
<li>Troubleshooting the Atmire Solr update process that runs at 3:00 AM every morning, after finishing at 100% it has this error:</li>
|
||||
</ul>
|
||||
<pre><code>[=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
|
||||
<pre tabindex="0"><code>[=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
|
||||
[=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
|
||||
[=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:12
|
||||
[=================================================> ]99% time remaining: 0 seconds. timestamp: 2017-04-25 09:07:13
|
||||
@ -557,7 +557,7 @@ Caused by: java.lang.ClassNotFoundException: org.dspace.statistics.content.DSpac
|
||||
<li>The size of the CGSpace database dump went from 111MB to 96MB, not sure about actual database size though</li>
|
||||
<li>Update RVM’s Ruby from 2.3.0 to 2.4.0 on DSpace Test:</li>
|
||||
</ul>
|
||||
<pre><code>$ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
|
||||
<pre tabindex="0"><code>$ gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3
|
||||
$ \curl -sSL https://raw.githubusercontent.com/wayneeseguin/rvm/master/binscripts/rvm-installer | bash -s stable --ruby
|
||||
... reload shell to get new Ruby
|
||||
$ gem install sass -v 3.3.14
|
||||