mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-01-27
@ -35,7 +35,7 @@ I noticed we have a very interesting list of countries on CGSpace:
|
||||
Not only are there 49,000 countries, we have some blanks (25)…
|
||||
Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE”
|
||||
"/>
|
||||
<meta name="generator" content="Hugo 0.62.2" />
|
||||
<meta name="generator" content="Hugo 0.63.1" />
|
||||
|
||||
|
||||
|
||||
@ -65,7 +65,7 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r
|
||||
|
||||
<!-- combined, minified CSS -->
|
||||
|
||||
<link href="https://alanorth.github.io/cgspace-notes/css/style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css" rel="stylesheet" integrity="sha256-ogwaQ2djljLNs0HSPCfKRP7cx1sPizy+piAwENoVPTw=" crossorigin="anonymous">
|
||||
<link href="https://alanorth.github.io/cgspace-notes/css/style.23e2c3298bcc8c1136c19aba330c211ec94c36f7c4454ea15cf4d3548370042a.css" rel="stylesheet" integrity="sha256-I+LDKYvMjBE2wZq6MwwhHslMNvfERU6hXPTTVINwBCo=" crossorigin="anonymous">
|
||||
|
||||
|
||||
<!-- RSS 2.0 feed -->
|
||||
@ -113,7 +113,7 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r
|
||||
<h2 class="blog-post-title" dir="auto"><a href="https://alanorth.github.io/cgspace-notes/2016-02/">February, 2016</a></h2>
|
||||
<p class="blog-post-meta"><time datetime="2016-02-05T13:18:00+03:00">Fri Feb 05, 2016</time> by Alan Orth in
|
||||
|
||||
<i class="fa fa-tag" aria-hidden="true"></i> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
|
||||
<span class="fas fa-tag" aria-hidden="true"></span> <a href="/cgspace-notes/tags/notes" rel="tag">Notes</a>
|
||||
|
||||
</p>
|
||||
</header>
|
||||
@ -144,7 +144,7 @@ Also, lots of things like “COTE D`LVOIRE” and “COTE D IVOIRE&r
|
||||
</ul>
|
||||
<pre><code>dspacetest=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '22678';
|
||||
</code></pre><ul>
|
||||
<li>It's 25 items so editing in the web UI is annoying, let's try SQL!</li>
|
||||
<li>It’s 25 items so editing in the web UI is annoying, let’s try SQL!</li>
|
||||
</ul>
|
||||
<pre><code>dspacetest=# delete from metadatavalue where metadata_field_id=78 and text_value='';
|
||||
DELETE 25
|
||||
@ -157,7 +157,7 @@ DELETE 25
|
||||
</ul>
|
||||
<h2 id="2016-02-07">2016-02-07</h2>
|
||||
<ul>
|
||||
<li>Working on cleaning up Abenet's DAGRIS data with OpenRefine</li>
|
||||
<li>Working on cleaning up Abenet’s DAGRIS data with OpenRefine</li>
|
||||
<li>I discovered two really nice functions in OpenRefine: <code>value.trim()</code> and <code>value.escape("javascript")</code> which shows whitespace characters like <code>\r\n</code>!</li>
|
||||
<li>For some reason when you import an Excel file into OpenRefine it exports dates like 1949 to 1949.0 in the CSV</li>
|
||||
<li>I re-import the resulting CSV and run a GREL on the date issued column: <code>value.replace("\.0", "")</code></li>
|
||||
@ -178,7 +178,7 @@ postgres=# \q
|
||||
$ vacuumdb dspacetest
|
||||
$ psql -U dspacetest -f ~/src/git/DSpace/dspace/etc/postgres/update-sequences.sql dspacetest -h localhost
|
||||
</code></pre><ul>
|
||||
<li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat's webapps folder:</li>
|
||||
<li>After building and running a <code>fresh_install</code> I symlinked the webapps into Tomcat’s webapps folder:</li>
|
||||
</ul>
|
||||
<pre><code>$ mv /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT.orig
|
||||
$ ln -sfv ~/dspace/webapps/xmlui /opt/brew/Cellar/tomcat/8.0.30/libexec/webapps/ROOT
|
||||
@ -199,7 +199,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
|
||||
</code></pre><h2 id="2016-02-08">2016-02-08</h2>
|
||||
<ul>
|
||||
<li>Finish cleaning up and importing ~400 DAGRIS items into CGSpace</li>
|
||||
<li>Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme's brand colors (<a href="https://github.com/ilri/DSpace/issues/154">#154</a>)</li>
|
||||
<li>Whip up some quick CSS to make the button in the submission workflow use the XMLUI theme’s brand colors (<a href="https://github.com/ilri/DSpace/issues/154">#154</a>)</li>
|
||||
</ul>
|
||||
<p><img src="/cgspace-notes/2016/02/submit-button-ilri.png" alt="ILRI submission buttons">
|
||||
<img src="/cgspace-notes/2016/02/submit-button-drylands.png" alt="Drylands submission buttons"></p>
|
||||
@ -207,7 +207,7 @@ $ /opt/brew/Cellar/tomcat/8.0.30/bin/catalina start
|
||||
<ul>
|
||||
<li>Re-sync DSpace Test with CGSpace</li>
|
||||
<li>Help Sisay with OpenRefine</li>
|
||||
<li>Enable HTTPS on DSpace Test using Let's Encrypt:</li>
|
||||
<li>Enable HTTPS on DSpace Test using Let’s Encrypt:</li>
|
||||
</ul>
|
||||
<pre><code>$ cd ~/src/git
|
||||
$ git clone https://github.com/letsencrypt/letsencrypt
|
||||
@ -222,7 +222,7 @@ $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-becom
|
||||
<li>I had to export some CIAT items that were being cleaned up on the test server and I noticed their <code>dc.contributor.author</code> fields have DSpace 5 authority index UUIDs…</li>
|
||||
<li>To clean those up in OpenRefine I used this GREL expression: <code>value.replace(/::\w{8}-\w{4}-\w{4}-\w{4}-\w{12}::600/,"")</code></li>
|
||||
<li>Getting more and more hangs on DSpace Test, seemingly random but also during CSV import</li>
|
||||
<li>Logs don't always show anything right when it fails, but eventually one of these appears:</li>
|
||||
<li>Logs don’t always show anything right when it fails, but eventually one of these appears:</li>
|
||||
</ul>
|
||||
<pre><code>org.dspace.discovery.SearchServiceException: Error while processing facet fields: java.lang.OutOfMemoryError: Java heap space
|
||||
</code></pre><ul>
|
||||
@ -230,7 +230,7 @@ $ ansible-playbook dspace.yml -l linode02 -t nginx,firewall -u aorth --ask-becom
|
||||
</ul>
|
||||
<pre><code>Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
|
||||
</code></pre><ul>
|
||||
<li>Right now DSpace Test's Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li>
|
||||
<li>Right now DSpace Test’s Tomcat heap is set to 1536m and we have quite a bit of free RAM:</li>
|
||||
</ul>
|
||||
<pre><code># free -m
|
||||
total used free shared buffers cached
|
||||
@ -238,7 +238,7 @@ Mem: 3950 3902 48 9 37 1311
|
||||
-/+ buffers/cache: 2552 1397
|
||||
Swap: 255 57 198
|
||||
</code></pre><ul>
|
||||
<li>So I'll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)</li>
|
||||
<li>So I’ll bump up the Tomcat heap to 2048 (CGSpace production server is using 3GB)</li>
|
||||
</ul>
|
||||
<h2 id="2016-02-11">2016-02-11</h2>
|
||||
<ul>
|
||||
@ -259,16 +259,16 @@ Processing 64195.pdf
|
||||
> Creating thumbnail for 64195.pdf
|
||||
</code></pre><h2 id="2016-02-12">2016-02-12</h2>
|
||||
<ul>
|
||||
<li>Looking at CIAT's records again, there are some problems with a dozen or so files (out of 1200)</li>
|
||||
<li>Looking at CIAT’s records again, there are some problems with a dozen or so files (out of 1200)</li>
|
||||
<li>A few items are using the same exact PDF</li>
|
||||
<li>A few items are using HTM or DOC files</li>
|
||||
<li>A few items link to PDFs on IFPRI's e-Library or Research Gate</li>
|
||||
<li>A few items link to PDFs on IFPRI’s e-Library or Research Gate</li>
|
||||
<li>A few items have no item</li>
|
||||
<li>Also, I'm not sure: if we import these items, will we remove the <code>dc.identifier.url</code> field from the records?</li>
|
||||
<li>Also, I’m not sure: if we import these items, will we remove the <code>dc.identifier.url</code> field from the records?</li>
|
||||
</ul>
|
||||
<h2 id="2016-02-12-1">2016-02-12</h2>
|
||||
<ul>
|
||||
<li>Looking at CIAT's records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I'm not sure if we can use those</li>
|
||||
<li>Looking at CIAT’s records again, there are some files linking to PDFs on Slide Share, Embrapa, UEA UK, and Condesan, so I’m not sure if we can use those</li>
|
||||
<li>265 items have dirty, URL-encoded filenames:</li>
|
||||
</ul>
|
||||
<pre><code>$ ls | grep -c -E "%"
|
||||
@ -291,7 +291,7 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
|
||||
</code></pre><ul>
|
||||
<li>This turns the URLs into human-readable versions that we can use as proper filenames</li>
|
||||
<li>Run web server and system updates on DSpace Test and reboot</li>
|
||||
<li>To merge <code>dc.identifier.url</code> and <code>dc.identifier.url[]</code>, rename the second column so it doesn't have the brackets, like <code>dc.identifier.url2</code></li>
|
||||
<li>To merge <code>dc.identifier.url</code> and <code>dc.identifier.url[]</code>, rename the second column so it doesn’t have the brackets, like <code>dc.identifier.url2</code></li>
|
||||
<li>Then you create a facet for blank values on each column, show the rows that have values for one and not the other, then transform each independently to have the contents of the other, with “||” in between</li>
|
||||
<li>Work on Python script for parsing and downloading PDF records from <code>dc.identifier.url</code></li>
|
||||
<li>To get filenames from <code>dc.identifier.url</code>, create a new column based on this transform: <code>forEach(value.split('||'), v, v.split('/')[-1]).join('||')</code></li>
|
||||
@ -306,8 +306,8 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
|
||||
</ul>
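<p>The filename transform on <code>dc.identifier.url</code> described above can also be sketched in Python. This is only an illustration, assuming the field uses <code>||</code> as the multi-value separator and that the percent-encoded names are UTF-8; the function name and the example URLs are hypothetical and not taken from the CIAT data.</p>

```python
from urllib.parse import unquote


def filenames_from_urls(field: str, separator: str = "||") -> str:
    """Return the URL-decoded last path segment of each URL in a multi-value field."""
    names = []
    for url in field.split(separator):
        # Same idea as the GREL v.split('/')[-1]: keep only the last path segment
        last_segment = url.strip().split("/")[-1]
        # Decode percent-escapes (UTF-8 by default) into a human-readable name
        names.append(unquote(last_segment))
    return separator.join(names)


# Hypothetical example value, not from the real records
example = (
    "http://example.org/docs/T%C3%A9cnicas_de_cultivo.pdf"
    "||http://example.org/a/b/report.pdf"
)
print(filenames_from_urls(example))
```

<p>Doing the split and the decoding in one pass keeps the resulting filename column aligned with the URL column when a record carries several URLs.</p>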
|
||||
<h2 id="2016-02-20">2016-02-20</h2>
|
||||
<ul>
|
||||
<li>Turns out the “bug” in SAFBuilder isn't a bug, it's a feature that allows you to encode extra information like the destination bundle in the filename</li>
|
||||
<li>Also, it seems DSpace's SAF import tool doesn't like importing filenames that have accents in them:</li>
|
||||
<li>Turns out the “bug” in SAFBuilder isn’t a bug, it’s a feature that allows you to encode extra information like the destination bundle in the filename</li>
|
||||
<li>Also, it seems DSpace’s SAF import tool doesn’t like importing filenames that have accents in them:</li>
|
||||
</ul>
|
||||
<pre><code>java.io.FileNotFoundException: /usr/share/tomcat7/SimpleArchiveFormat/item_1021/CIAT_COLOMBIA_000075_Medición_de_palatabilidad_en_forrajes.pdf (No such file or directory)
|
||||
</code></pre><ul>
|
||||
@ -327,29 +327,29 @@ CIAT_COLOMBIA_000169_Técnicas_para_el_aislamiento_y_cultivo_de_protoplastos_de_
|
||||
Bitstream: tést señora.pdf
|
||||
Bitstream: tést señora alimentación.pdf
|
||||
</code></pre><ul>
|
||||
<li>Seems it could be something with the HFS+ filesystem actually, as it's not UTF-8 (<a href="http://www.cio.com/article/2868393/linus-torvalds-apples-hfs-is-probably-the-worst-file-system-ever.html">it's something like UCS-2</a>)</li>
|
||||
<li>HFS+ stores filenames as a string, and filenames with accents get stored as <a href="https://blog.vrypan.net/2012/11/13/hfsplus-unicode-and-accented-chars/">character+accent</a> whereas Linux's ext4 stores them as an array of bytes</li>
|
||||
<li>Running the SAFBuilder on Mac OS X works if you're going to import the resulting bundle on Mac OS X, but if your DSpace is running on Linux you need to run the SAFBuilder there where the filesystem's encoding matches</li>
|
||||
<li>Seems it could be something with the HFS+ filesystem actually, as it’s not UTF-8 (<a href="http://www.cio.com/article/2868393/linus-torvalds-apples-hfs-is-probably-the-worst-file-system-ever.html">it’s something like UCS-2</a>)</li>
|
||||
<li>HFS+ stores filenames as a string, and filenames with accents get stored as <a href="https://blog.vrypan.net/2012/11/13/hfsplus-unicode-and-accented-chars/">character+accent</a> whereas Linux’s ext4 stores them as an array of bytes</li>
|
||||
<li>Running the SAFBuilder on Mac OS X works if you’re going to import the resulting bundle on Mac OS X, but if your DSpace is running on Linux you need to run the SAFBuilder there where the filesystem’s encoding matches</li>
|
||||
</ul>
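<p>The character+accent behavior can be reproduced in Python with Unicode normalization. This is a rough sketch, assuming the HFS+ storage form corresponds to the decomposed form (NFD) and that Linux tools typically produce the composed form (NFC); the helper name <code>same_filename</code> is hypothetical.</p>

```python
import unicodedata

# Composed form: "ó" is a single code point
name_nfc = unicodedata.normalize("NFC", "Medición_de_palatabilidad_en_forrajes.pdf")
# Decomposed form: "o" followed by a combining acute accent (character+accent)
name_nfd = unicodedata.normalize("NFD", name_nfc)

# The two strings look identical on screen but differ as code points and bytes,
# so a byte-oriented filesystem treats them as two different filenames
print(name_nfc == name_nfd)
print(len(name_nfc), len(name_nfd))
print(name_nfc.encode("utf-8") == name_nfd.encode("utf-8"))


def same_filename(a: str, b: str) -> bool:
    """Compare two filenames after normalizing both to the same form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_filename(name_nfc, name_nfd))
```

<p>Normalizing both the metadata and the names on disk to one form before building the bundle might avoid the mismatch, though running SAFBuilder on the target platform remains the approach noted above.</p>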
|
||||
<h2 id="2016-02-29">2016-02-29</h2>
|
||||
<ul>
|
||||
<li>Got notified by some CIFOR colleagues that the Google Scholar team had contacted them about CGSpace's incorrect ordering of authors in Google Scholar metadata</li>
|
||||
<li>Got notified by some CIFOR colleagues that the Google Scholar team had contacted them about CGSpace’s incorrect ordering of authors in Google Scholar metadata</li>
|
||||
<li>Turns out there is a patch, and it was merged in DSpace 5.4: <a href="https://jira.duraspace.org/browse/DS-2679">https://jira.duraspace.org/browse/DS-2679</a></li>
|
||||
<li>I've merged it into our <code>5_x-prod</code> branch that is currently based on DSpace 5.1</li>
|
||||
<li>I’ve merged it into our <code>5_x-prod</code> branch that is currently based on DSpace 5.1</li>
|
||||
<li>We found a bug when a user searches from the homepage, sorts the results, and then tries to click “View More” in a sidebar facet</li>
|
||||
<li>I am not sure what causes it yet, but I opened an issue for it: <a href="https://github.com/ilri/DSpace/issues/179">https://github.com/ilri/DSpace/issues/179</a></li>
|
||||
<li>Have more problems with SAFBuilder on Mac OS X</li>
|
||||
<li>Now it doesn't recognize description hints in the filename column, like: <code>test.pdf__description:Blah</code></li>
|
||||
<li>Now it doesn’t recognize description hints in the filename column, like: <code>test.pdf__description:Blah</code></li>
|
||||
<li>But on Linux it works fine</li>
|
||||
<li>Trying to test Atmire's series of stats and CUA fixes from January and February, but their branch history is really messy and it's hard to see what's going on</li>
|
||||
<li>Rebasing their branch on top of our production branch results in a broken Tomcat, so I'm going to tell them to fix their history and make a proper pull request</li>
|
||||
<li>Trying to test Atmire’s series of stats and CUA fixes from January and February, but their branch history is really messy and it’s hard to see what’s going on</li>
|
||||
<li>Rebasing their branch on top of our production branch results in a broken Tomcat, so I’m going to tell them to fix their history and make a proper pull request</li>
|
||||
<li>Looking at the filenames for the CIAT Reports, some have some really ugly characters, like: <code>'</code> or <code>,</code> or <code>=</code> or <code>[</code> or <code>]</code> or <code>(</code> or <code>)</code> or <code>_.pdf</code> or <code>._</code> etc</li>
|
||||
<li>It's tricky to parse those things in some programming languages so I'd rather just get rid of the weird stuff now in OpenRefine:</li>
|
||||
<li>It’s tricky to parse those things in some programming languages so I’d rather just get rid of the weird stuff now in OpenRefine:</li>
|
||||
</ul>
|
||||
<pre><code>value.replace("'",'').replace('_=_','_').replace(',','').replace('[','').replace(']','').replace('(','').replace(')','').replace('_.pdf','.pdf').replace('._','_')
|
||||
</code></pre><ul>
|
||||
<li>Finally import the 1127 CIAT items into CGSpace: <a href="https://cgspace.cgiar.org/handle/10568/35710">https://cgspace.cgiar.org/handle/10568/35710</a></li>
|
||||
<li>Re-deploy CGSpace with the Google Scholar fix, but I'm waiting on the Atmire fixes for now, as the branch history is ugly</li>
|
||||
<li>Re-deploy CGSpace with the Google Scholar fix, but I’m waiting on the Atmire fixes for now, as the branch history is ugly</li>
|
||||
</ul>
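<p>For reference, the OpenRefine cleanup chain above could be ported to Python roughly as follows. This is a sketch that assumes each GREL <code>replace</code> is a literal, replace-all substitution applied in the same order; the function name and the sample filename are hypothetical.</p>

```python
# (old, new) pairs in the same order as the GREL replace() chain
REPLACEMENTS = [
    ("'", ""),
    ("_=_", "_"),
    (",", ""),
    ("[", ""),
    ("]", ""),
    ("(", ""),
    (")", ""),
    ("_.pdf", ".pdf"),
    ("._", "_"),
]


def clean_filename(name: str) -> str:
    """Apply each literal replacement in order, like the chained GREL calls."""
    for old, new in REPLACEMENTS:
        name = name.replace(old, new)
    return name


# Hypothetical sample, not one of the real CIAT filenames
print(clean_filename("CIAT's_report_[draft]_(v2),_=_final_.pdf"))
```

<p>Order matters here: <code>_=_</code> is collapsed before the single-character removals, and <code>_.pdf</code> is fixed last so that underscores left behind by earlier replacements are caught.</p>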