Update notes for 2017-08-01

2025-01-27 05:49:12 +01:00 · 2017-08-01 16:31:58 +03:00
parent 5b11434f0f
commit e10b1fecb4
9 changed files with 59 additions and 8 deletions
--- a/content/post/2017-08.md
+++ b/content/post/2017-08.md
@@ -18,5 +18,11 @@ tags = ["Notes"]
 - Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
 - It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids the search engine from _indexing_ the page, not crawling it!
 - Also, the bot has to successfully browse the page first so it can receive the HTTP header...
+- We might actually have to _block_ these requests with HTTP 403 depending on the user agent
+- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
+- This was due to newline characters in the `dc.description.abstract` column, which caused OpenRefine to choke when exporting the CSV
+- I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using `g/^$/d`
+- Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet

 <!--more-->
+
--- a/public/2017-08/index.html
+++ b/public/2017-08/index.html
@@ -23,6 +23,11 @@ The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we w
 Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
 It turns out that we&rsquo;re already adding the X-Robots-Tag &quot;none&quot; HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
 Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;
+We might actually have to block these requests with HTTP 403 depending on the user agent
+Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
+This was due to newline characters in the dc.description.abstract column, which caused OpenRefine to choke when exporting the CSV
+I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
+Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet


 " />
@@ -32,7 +37,7 @@ Also, the bot has to successfully browse the page first so it can receive the HT


 <meta property="article:published_time" content="2017-08-01T11:51:52&#43;03:00"/>
-<meta property="article:modified_time" content="2017-08-01T11:57:37&#43;03:00"/>
+<meta property="article:modified_time" content="2017-08-01T12:03:37&#43;03:00"/>



@@ -69,6 +74,11 @@ The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we w
 Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
 It turns out that we&rsquo;re already adding the X-Robots-Tag &quot;none&quot; HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
 Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;
+We might actually have to block these requests with HTTP 403 depending on the user agent
+Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
+This was due to newline characters in the dc.description.abstract column, which caused OpenRefine to choke when exporting the CSV
+I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
+Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet


 "/>
@@ -83,9 +93,9 @@ Also, the bot has to successfully browse the page first so it can receive the HT
  "@type": "BlogPosting",
  "headline": "August, 2017",
  "url": "https://alanorth.github.io/cgspace-notes/2017-08/",
-  "wordCount": "166",
+  "wordCount": "262",
  "datePublished": "2017-08-01T11:51:52&#43;03:00",
-  "dateModified": "2017-08-01T11:57:37&#43;03:00",
+  "dateModified": "2017-08-01T12:03:37&#43;03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
@@ -165,6 +175,11 @@ Also, the bot has to successfully browse the page first so it can receive the HT
 <li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
 <li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
 <li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
+<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
+<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
+<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
+<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
+<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
 </ul>

 <p></p>
--- a/public/index.html
+++ b/public/index.html
@@ -121,6 +121,11 @@
 <li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
 <li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
 <li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
+<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
+<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
+<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
+<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
+<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
 </ul>

 <p></p>
--- a/public/index.xml
+++ b/public/index.xml
@@ -34,6 +34,11 @@
 &lt;li&gt;Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): &lt;a href=&#34;https://jira.duraspace.org/browse/DS-2962&#34;&gt;https://jira.duraspace.org/browse/DS-2962&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;It turns out that we&amp;rsquo;re already adding the &lt;code&gt;X-Robots-Tag &amp;quot;none&amp;quot;&lt;/code&gt; HTTP header, but this only forbids the search engine from &lt;em&gt;indexing&lt;/em&gt; the page, not crawling it!&lt;/li&gt;
 &lt;li&gt;Also, the bot has to successfully browse the page first so it can receive the HTTP header&amp;hellip;&lt;/li&gt;
+&lt;li&gt;We might actually have to &lt;em&gt;block&lt;/em&gt; these requests with HTTP 403 depending on the user agent&lt;/li&gt;
+&lt;li&gt;Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415&lt;/li&gt;
+&lt;li&gt;This was due to newline characters in the &lt;code&gt;dc.description.abstract&lt;/code&gt; column, which caused OpenRefine to choke when exporting the CSV&lt;/li&gt;
+&lt;li&gt;I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using &lt;code&gt;g/^$/d&lt;/code&gt;&lt;/li&gt;
+&lt;li&gt;Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet&lt;/li&gt;
 &lt;/ul&gt;

 &lt;p&gt;&lt;/p&gt;</description>
--- a/public/post/index.html
+++ b/public/post/index.html
@@ -121,6 +121,11 @@
 <li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
 <li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
 <li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
+<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
+<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
+<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
+<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
+<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
 </ul>

 <p></p>
--- a/public/post/index.xml
+++ b/public/post/index.xml
@@ -34,6 +34,11 @@
 &lt;li&gt;Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): &lt;a href=&#34;https://jira.duraspace.org/browse/DS-2962&#34;&gt;https://jira.duraspace.org/browse/DS-2962&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;It turns out that we&amp;rsquo;re already adding the &lt;code&gt;X-Robots-Tag &amp;quot;none&amp;quot;&lt;/code&gt; HTTP header, but this only forbids the search engine from &lt;em&gt;indexing&lt;/em&gt; the page, not crawling it!&lt;/li&gt;
 &lt;li&gt;Also, the bot has to successfully browse the page first so it can receive the HTTP header&amp;hellip;&lt;/li&gt;
+&lt;li&gt;We might actually have to &lt;em&gt;block&lt;/em&gt; these requests with HTTP 403 depending on the user agent&lt;/li&gt;
+&lt;li&gt;Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415&lt;/li&gt;
+&lt;li&gt;This was due to newline characters in the &lt;code&gt;dc.description.abstract&lt;/code&gt; column, which caused OpenRefine to choke when exporting the CSV&lt;/li&gt;
+&lt;li&gt;I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using &lt;code&gt;g/^$/d&lt;/code&gt;&lt;/li&gt;
+&lt;li&gt;Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet&lt;/li&gt;
 &lt;/ul&gt;

 &lt;p&gt;&lt;/p&gt;</description>
--- a/public/sitemap.xml
+++ b/public/sitemap.xml
@@ -4,7 +4,7 @@
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/2017-08/</loc>
-    <lastmod>2017-08-01T11:57:37+03:00</lastmod>
+    <lastmod>2017-08-01T12:03:37+03:00</lastmod>
  </url>
  
  <url>
@@ -114,7 +114,7 @@
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/</loc>
-    <lastmod>2017-08-01T11:57:37+03:00</lastmod>
+    <lastmod>2017-08-01T12:03:37+03:00</lastmod>
    <priority>0</priority>
  </url>
  
@@ -125,19 +125,19 @@
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
-    <lastmod>2017-08-01T11:57:37+03:00</lastmod>
+    <lastmod>2017-08-01T12:03:37+03:00</lastmod>
    <priority>0</priority>
  </url>
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/post/</loc>
-    <lastmod>2017-08-01T11:57:37+03:00</lastmod>
+    <lastmod>2017-08-01T12:03:37+03:00</lastmod>
    <priority>0</priority>
  </url>
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
-    <lastmod>2017-08-01T11:57:37+03:00</lastmod>
+    <lastmod>2017-08-01T12:03:37+03:00</lastmod>
    <priority>0</priority>
  </url>
  
--- a/public/tags/notes/index.html
+++ b/public/tags/notes/index.html
@@ -121,6 +121,11 @@
 <li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
 <li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
 <li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
+<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
+<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
+<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
+<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
+<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
 </ul>

 <p></p>
--- a/public/tags/notes/index.xml
+++ b/public/tags/notes/index.xml
@@ -34,6 +34,11 @@
 &lt;li&gt;Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): &lt;a href=&#34;https://jira.duraspace.org/browse/DS-2962&#34;&gt;https://jira.duraspace.org/browse/DS-2962&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;It turns out that we&amp;rsquo;re already adding the &lt;code&gt;X-Robots-Tag &amp;quot;none&amp;quot;&lt;/code&gt; HTTP header, but this only forbids the search engine from &lt;em&gt;indexing&lt;/em&gt; the page, not crawling it!&lt;/li&gt;
 &lt;li&gt;Also, the bot has to successfully browse the page first so it can receive the HTTP header&amp;hellip;&lt;/li&gt;
+&lt;li&gt;We might actually have to &lt;em&gt;block&lt;/em&gt; these requests with HTTP 403 depending on the user agent&lt;/li&gt;
+&lt;li&gt;Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415&lt;/li&gt;
+&lt;li&gt;This was due to newline characters in the &lt;code&gt;dc.description.abstract&lt;/code&gt; column, which caused OpenRefine to choke when exporting the CSV&lt;/li&gt;
+&lt;li&gt;I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using &lt;code&gt;g/^$/d&lt;/code&gt;&lt;/li&gt;
+&lt;li&gt;Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet&lt;/li&gt;
 &lt;/ul&gt;

 &lt;p&gt;&lt;/p&gt;</description>
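
The note added in this commit describes newline characters in the `dc.description.abstract` column breaking the CSV export, which were then removed by hand in vim. A minimal sketch of a scripted alternative follows, assuming the export is a standard quoted CSV so that Python's `csv` module can parse line breaks inside cells. The function name `strip_embedded_newlines` is hypothetical and is not part of DSpace or OpenRefine. Note that it collapses every run of whitespace in a cell to a single space, not only newlines.

```python
import csv
import io


def strip_embedded_newlines(csv_text: str) -> str:
    """Return csv_text with line breaks inside cells replaced by single spaces.

    Hypothetical helper for illustration: it assumes well-formed, quoted CSV
    and collapses all whitespace runs in each cell (newlines, tabs, repeated
    spaces) into one space.
    """
    # newline="" keeps the original line endings so the csv module can
    # recognise line breaks that sit inside quoted cells.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([" ".join(cell.split()) for cell in row])
    return out.getvalue()


if __name__ == "__main__":
    sample = 'id,dc.description.abstract\n1,"first line\nsecond line"\n'
    print(strip_embedded_newlines(sample), end="")
```

After cleaning, each record occupies exactly one line, so the line count of the file (minus the header) should match the number of items in the collection.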