From 31f36b37b82a1a6b543ef9c75cbf3c0913675a6e Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Tue, 6 Sep 2016 15:17:40 +0300
Subject: [PATCH] Add notes for 2016-09-06

---
 content/2016-09.md          | 47 +++++++++++++++++++++++++++
 public/2016-09/index.html   | 64 +++++++++++++++++++++++++++++++++++++
 public/index.xml            | 64 +++++++++++++++++++++++++++++++++++++
 public/tags/notes/index.xml | 64 +++++++++++++++++++++++++++++++++++++
 4 files changed, 239 insertions(+)

diff --git a/content/2016-09.md b/content/2016-09.md
index 497f343a8..fbb1d752a 100644
--- a/content/2016-09.md
+++ b/content/2016-09.md
@@ -113,3 +113,50 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
 - After updating the Authority indexes (`bin/dspace index-authority`) everything looks good
 - Run authority updates on CGSpace
+
+## 2016-09-05
+
+- After one week of logging TLS connections on CGSpace:
+
+```
+# zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
+217
+# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
+1164376
+# zgrep "DES-CBC3" /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
+TLSv1/DES-CBC3-SHA
+TLSv1/EDH-RSA-DES-CBC3-SHA
+```
+- So this represents `0.02%` of 1.16M connections over a one-week period
+- Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:
+
+```
+value + "__description:" + cells["dc.type"].value
+```
+
+- This gives you, for example: `Mainstreaming gender in agricultural R&D.pdf__description:Brief`
+
+## 2016-09-06
+
+- Trying to import the records for CIAT from yesterday, but having filename encoding issues from their zip file
+- Create a zip on Mac OS X from a SAF bundle containing only one record with one PDF:
+    - Filename: Complementing Farmers Genetic Knowledge Farmer Breeding Workshop in Turipaná, Colombia.pdf
+    - Imports fine on DSpace running on Mac OS X
+    - Fails to import on DSpace running on Linux with error `No such file or directory`
+- Change diacritic in file name from á to a and re-create SAF bundle and zip
+    - Success on both Mac OS X and Linux...
+- Looks like on the Mac OS X file system the file names represent á as: a (U+0061) + ́ (U+0301)
+- See: http://www.fileformat.info/info/unicode/char/e1/index.htm
+- See: http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&s=&uv=0
+- If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8
+- We should definitely clean filenames so they don't use characters that are tricky to process in CSV and shell scripts, like: `,`, `'`, and `"`
+
+```
+value.replace("'","").replace(",","").replace('"','')
+```
+
+- I need to write a Python script to match that for renaming files in the file system
+- When importing SAF bundles it seems you can specify the target collection on the command line using `-c 10568/4003` or in the `collections` file inside each item in the bundle
+- Seems that the latter method causes a null pointer exception, so I will just have to use the former method
+- In the end I was able to import the files after unzipping them ONLY on Linux
+    - The CSV file was giving file names in UTF-8, and unzipping the zip on Mac OS X and transferring it was converting the file names to Unicode equivalence like I saw above
diff --git a/public/2016-09/index.html b/public/2016-09/index.html
index 6f1627671..2de7e4749 100644
--- a/public/2016-09/index.html
+++ b/public/2016-09/index.html
@@ -199,6 +199,70 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
+
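+<h2 id="2016-09-05">2016-09-05</h2>
+
+<ul>
+<li>After one week of logging TLS connections on CGSpace:</li>
+</ul>
+
+<pre><code># zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
+217
+# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
+1164376
+# zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
+TLSv1/DES-CBC3-SHA
+TLSv1/EDH-RSA-DES-CBC3-SHA
+</code></pre>
+
+<ul>
+<li>So this represents <code>0.02%</code> of 1.16M connections over a one-week period</li>
+<li>Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:</li>
+</ul>
+
+<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
+</code></pre>
+
+<ul>
+<li>This gives you, for example: <code>Mainstreaming gender in agricultural R&amp;D.pdf__description:Brief</code></li>
+</ul>
+
+<h2 id="2016-09-06">2016-09-06</h2>
+
+<ul>
+<li>Trying to import the records for CIAT from yesterday, but having filename encoding issues from their zip file</li>
+<li>Create a zip on Mac OS X from a SAF bundle containing only one record with one PDF:
+
+<ul>
+<li>Filename: Complementing Farmers Genetic Knowledge Farmer Breeding Workshop in Turipaná, Colombia.pdf</li>
+<li>Imports fine on DSpace running on Mac OS X</li>
+<li>Fails to import on DSpace running on Linux with error <code>No such file or directory</code></li>
+</ul></li>
+<li>Change diacritic in file name from á to a and re-create SAF bundle and zip
+
+<ul>
+<li>Success on both Mac OS X and Linux&hellip;</li>
+</ul></li>
+<li>Looks like on the Mac OS X file system the file names represent á as: a (U+0061) + ́ (U+0301)</li>
+<li>See: <a href="http://www.fileformat.info/info/unicode/char/e1/index.htm">http://www.fileformat.info/info/unicode/char/e1/index.htm</a></li>
+<li>See: <a href="http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0">http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0</a></li>
+<li>If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8</li>
+<li>We should definitely clean filenames so they don&rsquo;t use characters that are tricky to process in CSV and shell scripts, like: <code>,</code>, <code>'</code>, and <code>&quot;</code></li>
+</ul>
+
+<pre><code>value.replace(&quot;'&quot;,&quot;&quot;).replace(&quot;,&quot;,&quot;&quot;).replace('&quot;','')
+</code></pre>
+
+<ul>
+<li>I need to write a Python script to match that for renaming files in the file system</li>
+<li>When importing SAF bundles it seems you can specify the target collection on the command line using <code>-c 10568/4003</code> or in the <code>collections</code> file inside each item in the bundle</li>
+<li>Seems that the latter method causes a null pointer exception, so I will just have to use the former method</li>
+<li>In the end I was able to import the files after unzipping them ONLY on Linux
+
+<ul>
+<li>The CSV file was giving file names in UTF-8, and unzipping the zip on Mac OS X and transferring it was converting the file names to Unicode equivalence like I saw above</li>
+</ul></li>
+</ul>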
diff --git a/public/index.xml b/public/index.xml
index 112b31b4d..44616fdb3 100644
--- a/public/index.xml
+++ b/public/index.xml
@@ -138,6 +138,70 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
 <li>After updating the Authority indexes (<code>bin/dspace index-authority</code>) everything looks good</li>
 <li>Run authority updates on CGSpace</li>
 </ul>
+
+<h2 id="2016-09-05">2016-09-05</h2>
+
+<ul>
+<li>After one week of logging TLS connections on CGSpace:</li>
+</ul>
+
+<pre><code># zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
+217
+# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
+1164376
+# zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
+TLSv1/DES-CBC3-SHA
+TLSv1/EDH-RSA-DES-CBC3-SHA
+</code></pre>
+
+<ul>
+<li>So this represents <code>0.02%</code> of 1.16M connections over a one-week period</li>
+<li>Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:</li>
+</ul>
+
+<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
+</code></pre>
+
+<ul>
+<li>This gives you, for example: <code>Mainstreaming gender in agricultural R&amp;D.pdf__description:Brief</code></li>
+</ul>
+
+<h2 id="2016-09-06">2016-09-06</h2>
+
+<ul>
+<li>Trying to import the records for CIAT from yesterday, but having filename encoding issues from their zip file</li>
+<li>Create a zip on Mac OS X from a SAF bundle containing only one record with one PDF:
+
+<ul>
+<li>Filename: Complementing Farmers Genetic Knowledge Farmer Breeding Workshop in Turipaná, Colombia.pdf</li>
+<li>Imports fine on DSpace running on Mac OS X</li>
+<li>Fails to import on DSpace running on Linux with error <code>No such file or directory</code></li>
+</ul></li>
+<li>Change diacritic in file name from á to a and re-create SAF bundle and zip
+
+<ul>
+<li>Success on both Mac OS X and Linux&hellip;</li>
+</ul></li>
+<li>Looks like on the Mac OS X file system the file names represent á as: a (U+0061) + ́ (U+0301)</li>
+<li>See: <a href="http://www.fileformat.info/info/unicode/char/e1/index.htm">http://www.fileformat.info/info/unicode/char/e1/index.htm</a></li>
+<li>See: <a href="http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0">http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0</a></li>
+<li>If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8</li>
+<li>We should definitely clean filenames so they don&rsquo;t use characters that are tricky to process in CSV and shell scripts, like: <code>,</code>, <code>'</code>, and <code>&quot;</code></li>
+</ul>
+
+<pre><code>value.replace(&quot;'&quot;,&quot;&quot;).replace(&quot;,&quot;,&quot;&quot;).replace('&quot;','')
+</code></pre>
+
+<ul>
+<li>I need to write a Python script to match that for renaming files in the file system</li>
+<li>When importing SAF bundles it seems you can specify the target collection on the command line using <code>-c 10568/4003</code> or in the <code>collections</code> file inside each item in the bundle</li>
+<li>Seems that the latter method causes a null pointer exception, so I will just have to use the former method</li>
+<li>In the end I was able to import the files after unzipping them ONLY on Linux
+
+<ul>
+<li>The CSV file was giving file names in UTF-8, and unzipping the zip on Mac OS X and transferring it was converting the file names to Unicode equivalence like I saw above</li>
+</ul></li>
+</ul>
diff --git a/public/tags/notes/index.xml b/public/tags/notes/index.xml
index 1e091756f..14e585d4b 100644
--- a/public/tags/notes/index.xml
+++ b/public/tags/notes/index.xml
@@ -138,6 +138,70 @@ dspacetest=# select distinct text_value, authority, confidence from metadatavalu
 <li>After updating the Authority indexes (<code>bin/dspace index-authority</code>) everything looks good</li>
 <li>Run authority updates on CGSpace</li>
 </ul>
+
+<h2 id="2016-09-05">2016-09-05</h2>
+
+<ul>
+<li>After one week of logging TLS connections on CGSpace:</li>
+</ul>
+
+<pre><code># zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
+217
+# zcat -f -- /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | wc -l
+1164376
+# zgrep &quot;DES-CBC3&quot; /var/log/nginx/cgspace.cgiar.org-access-ssl.log* | awk '{print $6}' | sort | uniq
+TLSv1/DES-CBC3-SHA
+TLSv1/EDH-RSA-DES-CBC3-SHA
+</code></pre>
+
+<ul>
+<li>So this represents <code>0.02%</code> of 1.16M connections over a one-week period</li>
+<li>Transforming some filenames in OpenRefine so they can have a useful description for SAFBuilder:</li>
+</ul>
+
+<pre><code>value + &quot;__description:&quot; + cells[&quot;dc.type&quot;].value
+</code></pre>
+
+<ul>
+<li>This gives you, for example: <code>Mainstreaming gender in agricultural R&amp;D.pdf__description:Brief</code></li>
+</ul>
+
+<h2 id="2016-09-06">2016-09-06</h2>
+
+<ul>
+<li>Trying to import the records for CIAT from yesterday, but having filename encoding issues from their zip file</li>
+<li>Create a zip on Mac OS X from a SAF bundle containing only one record with one PDF:
+
+<ul>
+<li>Filename: Complementing Farmers Genetic Knowledge Farmer Breeding Workshop in Turipaná, Colombia.pdf</li>
+<li>Imports fine on DSpace running on Mac OS X</li>
+<li>Fails to import on DSpace running on Linux with error <code>No such file or directory</code></li>
+</ul></li>
+<li>Change diacritic in file name from á to a and re-create SAF bundle and zip
+
+<ul>
+<li>Success on both Mac OS X and Linux&hellip;</li>
+</ul></li>
+<li>Looks like on the Mac OS X file system the file names represent á as: a (U+0061) + ́ (U+0301)</li>
+<li>See: <a href="http://www.fileformat.info/info/unicode/char/e1/index.htm">http://www.fileformat.info/info/unicode/char/e1/index.htm</a></li>
+<li>See: <a href="http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0">http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%A1&amp;s=&amp;uv=0</a></li>
+<li>If I unzip the original zip from CIAT on Windows, re-zip it with 7zip on Windows, and then unzip it on Linux directly, the file names seem to be proper UTF-8</li>
+<li>We should definitely clean filenames so they don&rsquo;t use characters that are tricky to process in CSV and shell scripts, like: <code>,</code>, <code>'</code>, and <code>&quot;</code></li>
+</ul>
+
+<pre><code>value.replace(&quot;'&quot;,&quot;&quot;).replace(&quot;,&quot;,&quot;&quot;).replace('&quot;','')
+</code></pre>
+
+<ul>
+<li>I need to write a Python script to match that for renaming files in the file system</li>
+<li>When importing SAF bundles it seems you can specify the target collection on the command line using <code>-c 10568/4003</code> or in the <code>collections</code> file inside each item in the bundle</li>
+<li>Seems that the latter method causes a null pointer exception, so I will just have to use the former method</li>
+<li>In the end I was able to import the files after unzipping them ONLY on Linux
+
+<ul>
+<li>The CSV file was giving file names in UTF-8, and unzipping the zip on Mac OS X and transferring it was converting the file names to Unicode equivalence like I saw above</li>
+</ul></li>
+</ul>
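
Reviewer's note on the 2016-09-06 entry: the Python script mentioned there is not part of this patch. A minimal sketch of what it might look like follows, assuming the file names in the CSV use the precomposed form (NFC) and that the mismatch comes from Mac OS X storing the decomposed form (a followed by U+0301). The function name `clean_filename` and the `TRICKY_CHARS` constant are hypothetical, chosen only for illustration; the character set mirrors the OpenRefine expression in the notes.

```python
import unicodedata

# Characters the notes flag as awkward in CSV and shell scripts.
TRICKY_CHARS = ",'\""


def clean_filename(name: str) -> str:
    """Return the name in NFC form with the tricky characters removed.

    Assumption: NFC (precomposed) is the form used by the CSV metadata,
    so normalizing to it should make decomposed Mac OS X names match.
    """
    composed = unicodedata.normalize("NFC", name)
    return "".join(ch for ch in composed if ch not in TRICKY_CHARS)


# The same visible name in two different Unicode forms.
decomposed = "Turipana\u0301, Colombia.pdf"  # a + U+0301 (combining acute)
precomposed = "Turipan\u00e1, Colombia.pdf"  # single code point U+00E1

print(decomposed == precomposed)  # False: the code points differ
print(clean_filename(decomposed) == clean_filename(precomposed))  # True
```

To rename files on disk, the same function could be applied to each entry returned by `os.listdir()` and passed to `os.rename()`, but that part is left out here because the bundle layout is not shown in the notes.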