Add notes for 2016-09-06

Alan Orth 2016-09-06 15:17:40 +03:00
parent ca94154759
commit 31f36b37b8
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
4 changed files with 239 additions and 0 deletions

@ -113,3 +113,50 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
- After updating the Authority indexes (`bin/dspace index-authority`) everything looks good
- Run authority updates on CGSpace
## 2016-09-05
- After one week of logging TLS connections on CGSpace:
```
# zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
217
# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
1164376
# zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
TLSv1/DES-CBC3-SHA
TLSv1/EDH-RSA-DES-CBC3-SHA
```
- So this represents `0.02%` of 1.16M connections over a one-week period
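- Double-checking that percentage with a quick Python sketch (the variable names are mine; the counts are copied from the output above):

```python
# Counts copied from the zgrep/zcat output above
des_cbc3_connections = 217
total_connections = 1164376

# Share of all logged connections that negotiated a DES-CBC3 cipher
percentage = des_cbc3_connections / total_connections * 100
print(f"{percentage:.2f}%")  # prints 0.02%
```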
- Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:
```
value + "__description:" + cells["dc.type"].value
```
- This gives you, for example: `Mainstreaming gender in agricultural R&D.pdf__description:Brief`
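- For reference, a rough Python equivalent of that expression, in case the same transformation is ever needed outside OpenRefine (the function name `add_description` is hypothetical, not part of OpenRefine or SAFBuilder):

```python
def add_description(filename, dc_type):
    """Append the dc.type value as a description suffix.

    Mirrors the OpenRefine expression:
    value + "__description:" + cells["dc.type"].value
    """
    return filename + "__description:" + dc_type


print(add_description("Mainstreaming gender in agricultural R&D.pdf", "Brief"))
# prints Mainstreaming gender in agricultural R&D.pdf__description:Brief
```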
## 2016-09-06
- Trying to import the records for CIAT from yesterday, but having filename encoding issues from their zip file
- Create a zip on Mac OS X from a SAF bundle containing only one record with one PDF:
- Filename: Complementing Farmers Genetic Knowledge Farmer Breeding Workshop in Turipaná, Colombia.pdf
- Imports fine on DSpace running on Mac OS X
- Fails to import on DSpace running on Linux with error `No such file or directory`
- Change diacritic in file name from á to a and re-create SAF bundle and zip
- Success on both Mac OS X and Linux...
- Looks like on the Mac OS X file system the file names represent á as: a (U+0061) + ́ (U+0301)
- See: http://www.fileformat.info/info/unicode/char/e1/index.htm
- See: http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&s=&uv=0
- If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8
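- The two representations of á can be compared with Python's standard `unicodedata` module; this is only a sketch to illustrate the difference (the variable names are mine), not something DSpace does itself:

```python
import unicodedata

precomposed = "\u00e1"  # á as one code point (U+00E1)
decomposed = "a\u0301"  # a (U+0061) + combining acute accent (U+0301)

# They look identical on screen but are different strings and different bytes
print(precomposed == decomposed)     # False
print(precomposed.encode("utf-8"))   # b'\xc3\xa1'
print(decomposed.encode("utf-8"))    # b'a\xcc\x81'

# Normalizing converts between the two canonically equivalent forms
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

- That would explain the `No such file or directory` error: a byte-for-byte file name lookup on Linux will not match one form against the other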
- We should definitely clean filenames so they don't use characters that are tricky to process in CSV and shell scripts, like: `,`, `'`, and `"`
```
value.replace("'","").replace(",","").replace('"','')
```
- I need to write a Python script to match that for renaming files in the file system
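- A minimal sketch of what such a script could look like, assuming all the files sit in a single directory (the names `clean_name` and `rename_tricky_files` are made up for illustration, and it skips a file rather than overwrite an existing one):

```python
import os

# Same characters the OpenRefine expression strips: comma, single and double quote
TRICKY_CHARACTERS = ",'\""
DELETE_TABLE = str.maketrans("", "", TRICKY_CHARACTERS)


def clean_name(name):
    """Return the file name with the tricky characters removed."""
    return name.translate(DELETE_TABLE)


def rename_tricky_files(directory):
    """Rename files in one directory; return a list of (old, new) names."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        cleaned = clean_name(name)
        if cleaned == name:
            continue
        target = os.path.join(directory, cleaned)
        if os.path.exists(target):
            # Do not clobber a file that already has the cleaned name
            continue
        os.rename(os.path.join(directory, name), target)
        renamed.append((name, cleaned))
    return renamed
```

- Calling `rename_tricky_files()` on the directory holding the PDFs should then leave names that match the cleaned values in the CSV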
- When importing SAF bundles it seems you can specify the target collection on the command line using `-c 10568/4003` or in the `collections` file inside each item in the bundle
- Seems that the latter method causes a null pointer exception, so I will just have to use the former method
- In the end I was able to import the files after unzipping them ONLY on Linux
- The CSV file was giving file names in UTF-8, but unzipping the zip on Mac OS X and transferring it was converting the file names to the canonically equivalent decomposed form (a + U+0301) that I saw above
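- If files have already passed through Mac OS X, one possible workaround would be to rename them back to the precomposed form on Linux; this is an untested sketch using `unicodedata.normalize`, and the function names `nfc_name` and `normalize_filenames` are hypothetical:

```python
import os
import unicodedata


def nfc_name(name):
    """Return the precomposed (NFC) form of a file name."""
    return unicodedata.normalize("NFC", name)


def normalize_filenames(directory):
    """Rename decomposed file names in one directory to NFC.

    Returns a list of (old, new) names for the files that were renamed.
    """
    renamed = []
    for name in sorted(os.listdir(directory)):
        target_name = nfc_name(name)
        if target_name == name:
            continue
        target = os.path.join(directory, target_name)
        if os.path.exists(target):
            # Leave both files alone rather than overwrite one of them
            continue
        os.rename(os.path.join(directory, name), target)
        renamed.append((name, target_name))
    return renamed
```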

@ -199,6 +199,70 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
<ul>
<li>After updating the Authority indexes (<code>bin/dspace index-authority</code>) everything looks good</li>
<li>Run authority updates on CGSpace</li>
</ul>
<h2 id="2016-09-05">2016-09-05</h2>
<ul>
<li>After one week of logging TLS connections on CGSpace:</li>
</ul>
<pre><code># zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
217
# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
1164376
# zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
TLSv1/DES-CBC3-SHA
TLSv1/EDH-RSA-DES-CBC3-SHA
</code></pre>
<ul>
<li>So this represents <code>0.02%</code> of 1.16M connections over a one-week period</li>
<li>Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:</li>
</ul>
<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
</code></pre>
<ul>
<li>This gives you, for example: <code>Mainstreaming gender in agricultural R&amp;D.pdf__description:Brief</code></li>
</ul>
<h2 id="2016-09-06">2016-09-06</h2>
<ul>
<li>Trying to import the records for CIAT from yesterday, but having filename encoding issues from their zip file</li>
<li>Create a zip on Mac OS X from a SAF bundle containing only one record with one PDF:
<ul>
<li>Filename: Complementing Farmers Genetic Knowledge Farmer Breeding Workshop in Turipaná, Colombia.pdf</li>
<li>Imports fine on DSpace running on Mac OS X</li>
<li>Fails to import on DSpace running on Linux with error <code>No such file or directory</code></li>
</ul></li>
<li>Change diacritic in file name from á to a and re-create SAF bundle and zip
<ul>
<li>Success on both Mac OS X and Linux&hellip;</li>
</ul></li>
<li>Looks like on the Mac OS X file system the file names represent á as: a (U+0061) + ́ (U+0301)</li>
<li>See: <a href="http://www.fileformat.info/info/unicode/char/e1/index.htm">http://www.fileformat.info/info/unicode/char/e1/index.htm</a></li>
<li>See: <a href="http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0">http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0</a></li>
<li>If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8</li>
<li>We should definitely clean filenames so they don&rsquo;t use characters that are tricky to process in CSV and shell scripts, like: <code>,</code>, <code>'</code>, and <code>&quot;</code></li>
</ul>
<pre><code>value.replace(&quot;'&quot;,&quot;&quot;).replace(&quot;,&quot;,&quot;&quot;).replace('&quot;','')
</code></pre>
<ul>
<li>I need to write a Python script to match that for renaming files in the file system</li>
<li>When importing SAF bundles it seems you can specify the target collection on the command line using <code>-c 10568/4003</code> or in the <code>collections</code> file inside each item in the bundle</li>
<li>Seems that the latter method causes a null pointer exception, so I will just have to use the former method</li>
<li>In the end I was able to import the files after unzipping them ONLY on Linux
<ul>
<li>The CSV file was giving file names in UTF-8, but unzipping the zip on Mac OS X and transferring it was converting the file names to the canonically equivalent decomposed form (a + U+0301) that I saw above</li>
</ul></li>
</ul>
</section>

@ -138,6 +138,70 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
&lt;li&gt;After updating the Authority indexes (&lt;code&gt;bin/dspace index-authority&lt;/code&gt;) everything looks good&lt;/li&gt;
&lt;li&gt;Run authority updates on CGSpace&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;2016-09-05&#34;&gt;2016-09-05&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;After one week of logging TLS connections on CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zgrep &amp;quot;DES-CBC3&amp;quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
217
# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
1164376
# zgrep &amp;quot;DES-CBC3&amp;quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk &#39;{print $6}&#39; | sort | uniq
TLSv1/DES-CBC3-SHA
TLSv1/EDH-RSA-DES-CBC3-SHA
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;So this represents &lt;code&gt;0.02%&lt;/code&gt; of 1.16M connections over a one-week period&lt;/li&gt;
&lt;li&gt;Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;value + &amp;quot;__description:&amp;quot; + cells[&amp;quot;dc.type&amp;quot;].value
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;This gives you, for example: &lt;code&gt;Mainstreaming gender in agricultural R&amp;amp;D.pdf__description:Brief&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;2016-09-06&#34;&gt;2016-09-06&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Trying to import the records for CIAT from yesterday, but having filename encoding issues from their zip file&lt;/li&gt;
&lt;li&gt;Create a zip on Mac OS X from a SAF bundle containing only one record with one PDF:
&lt;ul&gt;
&lt;li&gt;Filename: Complementing Farmers Genetic Knowledge Farmer Breeding Workshop in Turipaná, Colombia.pdf&lt;/li&gt;
&lt;li&gt;Imports fine on DSpace running on Mac OS X&lt;/li&gt;
&lt;li&gt;Fails to import on DSpace running on Linux with error &lt;code&gt;No such file or directory&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Change diacritic in file name from á to a and re-create SAF bundle and zip
&lt;ul&gt;
&lt;li&gt;Success on both Mac OS X and Linux&amp;hellip;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Looks like on the Mac OS X file system the file names represent á as: a (U+0061) + ́ (U+0301)&lt;/li&gt;
&lt;li&gt;See: &lt;a href=&#34;http://www.fileformat.info/info/unicode/char/e1/index.htm&#34;&gt;http://www.fileformat.info/info/unicode/char/e1/index.htm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;See: &lt;a href=&#34;http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;amp;s=&amp;amp;uv=0&#34;&gt;http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;amp;s=&amp;amp;uv=0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8&lt;/li&gt;
&lt;li&gt;We should definitely clean filenames so they don&amp;rsquo;t use characters that are tricky to process in CSV and shell scripts, like: &lt;code&gt;,&lt;/code&gt;, &lt;code&gt;&#39;&lt;/code&gt;, and &lt;code&gt;&amp;quot;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;value.replace(&amp;quot;&#39;&amp;quot;,&amp;quot;&amp;quot;).replace(&amp;quot;,&amp;quot;,&amp;quot;&amp;quot;).replace(&#39;&amp;quot;&#39;,&#39;&#39;)
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;I need to write a Python script to match that for renaming files in the file system&lt;/li&gt;
&lt;li&gt;When importing SAF bundles it seems you can specify the target collection on the command line using &lt;code&gt;-c 10568/4003&lt;/code&gt; or in the &lt;code&gt;collections&lt;/code&gt; file inside each item in the bundle&lt;/li&gt;
&lt;li&gt;Seems that the latter method causes a null pointer exception, so I will just have to use the former method&lt;/li&gt;
&lt;li&gt;In the end I was able to import the files after unzipping them ONLY on Linux
&lt;ul&gt;
&lt;li&gt;The CSV file was giving file names in UTF-8, but unzipping the zip on Mac OS X and transferring it was converting the file names to the canonically equivalent decomposed form (a + U+0301) that I saw above&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
</item>

@ -138,6 +138,70 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
&lt;li&gt;After updating the Authority indexes (&lt;code&gt;bin/dspace index-authority&lt;/code&gt;) everything looks good&lt;/li&gt;
&lt;li&gt;Run authority updates on CGSpace&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;2016-09-05&#34;&gt;2016-09-05&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;After one week of logging TLS connections on CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;# zgrep &amp;quot;DES-CBC3&amp;quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
217
# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
1164376
# zgrep &amp;quot;DES-CBC3&amp;quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk &#39;{print $6}&#39; | sort | uniq
TLSv1/DES-CBC3-SHA
TLSv1/EDH-RSA-DES-CBC3-SHA
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;So this represents &lt;code&gt;0.02%&lt;/code&gt; of 1.16M connections over a one-week period&lt;/li&gt;
&lt;li&gt;Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;value + &amp;quot;__description:&amp;quot; + cells[&amp;quot;dc.type&amp;quot;].value
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;This gives you, for example: &lt;code&gt;Mainstreaming gender in agricultural R&amp;amp;D.pdf__description:Brief&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;2016-09-06&#34;&gt;2016-09-06&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Trying to import the records for CIAT from yesterday, but having filename encoding issues from their zip file&lt;/li&gt;
&lt;li&gt;Create a zip on Mac OS X from a SAF bundle containing only one record with one PDF:
&lt;ul&gt;
&lt;li&gt;Filename: Complementing Farmers Genetic Knowledge Farmer Breeding Workshop in Turipaná, Colombia.pdf&lt;/li&gt;
&lt;li&gt;Imports fine on DSpace running on Mac OS X&lt;/li&gt;
&lt;li&gt;Fails to import on DSpace running on Linux with error &lt;code&gt;No such file or directory&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Change diacritic in file name from á to a and re-create SAF bundle and zip
&lt;ul&gt;
&lt;li&gt;Success on both Mac OS X and Linux&amp;hellip;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Looks like on the Mac OS X file system the file names represent á as: a (U+0061) + ́ (U+0301)&lt;/li&gt;
&lt;li&gt;See: &lt;a href=&#34;http://www.fileformat.info/info/unicode/char/e1/index.htm&#34;&gt;http://www.fileformat.info/info/unicode/char/e1/index.htm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;See: &lt;a href=&#34;http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;amp;s=&amp;amp;uv=0&#34;&gt;http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;amp;s=&amp;amp;uv=0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8&lt;/li&gt;
&lt;li&gt;We should definitely clean filenames so they don&amp;rsquo;t use characters that are tricky to process in CSV and shell scripts, like: &lt;code&gt;,&lt;/code&gt;, &lt;code&gt;&#39;&lt;/code&gt;, and &lt;code&gt;&amp;quot;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;value.replace(&amp;quot;&#39;&amp;quot;,&amp;quot;&amp;quot;).replace(&amp;quot;,&amp;quot;,&amp;quot;&amp;quot;).replace(&#39;&amp;quot;&#39;,&#39;&#39;)
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;I need to write a Python script to match that for renaming files in the file system&lt;/li&gt;
&lt;li&gt;When importing SAF bundles it seems you can specify the target collection on the command line using &lt;code&gt;-c 10568/4003&lt;/code&gt; or in the &lt;code&gt;collections&lt;/code&gt; file inside each item in the bundle&lt;/li&gt;
&lt;li&gt;Seems that the latter method causes a null pointer exception, so I will just have to use the former method&lt;/li&gt;
&lt;li&gt;In the end I was able to import the files after unzipping them ONLY on Linux
&lt;ul&gt;
&lt;li&gt;The CSV file was giving file names in UTF-8, but unzipping the zip on Mac OS X and transferring it was converting the file names to the canonically equivalent decomposed form (a + U+0301) that I saw above&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
</item>