mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-29 18:08:20 +01:00
Update notes for 2017-08-01
This commit is contained in:
parent
5b11434f0f
commit
e10b1fecb4
@ -18,5 +18,11 @@ tags = ["Notes"]
|
|||||||
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
|
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
|
||||||
- It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids the search engine from _indexing_ the page, not crawling it!
|
- It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids the search engine from _indexing_ the page, not crawling it!
|
||||||
- Also, the bot has to successfully browse the page first so it can receive the HTTP header...
|
- Also, the bot has to successfully browse the page first so it can receive the HTTP header...
|
||||||
|
- We might actually have to _block_ these requests with HTTP 403 depending on the user agent
|
||||||
|
- Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
|
||||||
|
- This was due to newline characters in the `dc.description.abstract` column, which caused OpenRefine to choke when exporting the CSV
|
||||||
|
- I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using `g/^$/d`
|
||||||
|
- Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
|
||||||
|
|
||||||
<!--more-->
|
<!--more-->
|
||||||
|
|
||||||
|
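The idea in the added note about blocking requests "with HTTP 403 depending on the user agent" can be sketched as follows. This is only an illustration, not the configuration used on the server: the bot names in `BLOCKED_AGENTS` and the function `status_for` are hypothetical, and the only details taken from the notes are the `/discover` and `/browse` path segments.

```python
import re

# Hypothetical crawler patterns; the notes do not name specific bots here.
BLOCKED_AGENTS = re.compile(r"(examplebot|othercrawler)", re.IGNORECASE)

# Path segments for the dynamic pages mentioned in the notes.
DYNAMIC_SEGMENTS = {"discover", "browse"}


def status_for(path: str, user_agent: str) -> int:
    """Return 403 when a listed bot requests a dynamic page, otherwise 200."""
    # Drop the query string, then split the path into its segments.
    segments = path.split("?")[0].strip("/").split("/")
    is_dynamic = any(segment in DYNAMIC_SEGMENTS for segment in segments)
    is_blocked_bot = bool(BLOCKED_AGENTS.search(user_agent))
    return 403 if (is_dynamic and is_blocked_bot) else 200
```

Unlike the `X-Robots-Tag` header, a refusal like this does not depend on the bot cooperating after it has already fetched the page.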
@ -23,6 +23,11 @@ The robots.txt only blocks the top-level /discover and /browse URLs… we w
|
|||||||
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
|
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
|
||||||
It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
|
It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
|
||||||
Also, the bot has to successfully browse the page first so it can receive the HTTP header…
|
Also, the bot has to successfully browse the page first so it can receive the HTTP header…
|
||||||
|
We might actually have to block these requests with HTTP 403 depending on the user agent
|
||||||
|
Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
|
||||||
|
This was due to newline characters in the dc.description.abstract column, which caused OpenRefine to choke when exporting the CSV
|
||||||
|
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
|
||||||
|
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
|
||||||
|
|
||||||
|
|
||||||
" />
|
" />
|
||||||
@ -32,7 +37,7 @@ Also, the bot has to successfully browse the page first so it can receive the HT
|
|||||||
|
|
||||||
|
|
||||||
<meta property="article:published_time" content="2017-08-01T11:51:52+03:00"/>
|
<meta property="article:published_time" content="2017-08-01T11:51:52+03:00"/>
|
||||||
<meta property="article:modified_time" content="2017-08-01T11:57:37+03:00"/>
|
<meta property="article:modified_time" content="2017-08-01T12:03:37+03:00"/>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@ -69,6 +74,11 @@ The robots.txt only blocks the top-level /discover and /browse URLs… we w
|
|||||||
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
|
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
|
||||||
It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
|
It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
|
||||||
Also, the bot has to successfully browse the page first so it can receive the HTTP header…
|
Also, the bot has to successfully browse the page first so it can receive the HTTP header…
|
||||||
|
We might actually have to block these requests with HTTP 403 depending on the user agent
|
||||||
|
Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415
|
||||||
|
This was due to newline characters in the dc.description.abstract column, which caused OpenRefine to choke when exporting the CSV
|
||||||
|
I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using g/^$/d
|
||||||
|
Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet
|
||||||
|
|
||||||
|
|
||||||
"/>
|
"/>
|
||||||
@ -83,9 +93,9 @@ Also, the bot has to successfully browse the page first so it can receive the HT
|
|||||||
"@type": "BlogPosting",
|
"@type": "BlogPosting",
|
||||||
"headline": "August, 2017",
|
"headline": "August, 2017",
|
||||||
"url": "https://alanorth.github.io/cgspace-notes/2017-08/",
|
"url": "https://alanorth.github.io/cgspace-notes/2017-08/",
|
||||||
"wordCount": "166",
|
"wordCount": "262",
|
||||||
"datePublished": "2017-08-01T11:51:52+03:00",
|
"datePublished": "2017-08-01T11:51:52+03:00",
|
||||||
"dateModified": "2017-08-01T11:57:37+03:00",
|
"dateModified": "2017-08-01T12:03:37+03:00",
|
||||||
"author": {
|
"author": {
|
||||||
"@type": "Person",
|
"@type": "Person",
|
||||||
"name": "Alan Orth"
|
"name": "Alan Orth"
|
||||||
@ -165,6 +175,11 @@ Also, the bot has to successfully browse the page first so it can receive the HT
|
|||||||
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
||||||
<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
||||||
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
|
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
|
||||||
|
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
|
||||||
|
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
|
||||||
|
<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
|
||||||
|
<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
|
||||||
|
<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
<p></p>
|
<p></p>
|
||||||
|
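The newline problem in the `dc.description.abstract` column could also be handled with a script instead of by hand in vim. The sketch below is an assumption about how one might do it with Python's standard `csv` module, not what was actually run: the function name `flatten_newlines` is made up for illustration, and it collapses every run of whitespace (including embedded newlines) in the chosen column to a single space.

```python
import csv
import io


def flatten_newlines(csv_text: str, column: str = "dc.description.abstract") -> str:
    """Return csv_text with whitespace runs in `column` collapsed to one space."""
    # newline="" lets the csv module see embedded newlines inside quoted fields.
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row.get(column):
            # str.split() with no arguments splits on any whitespace, newlines included.
            row[column] = " ".join(row[column].split())
        writer.writerow(row)
    return out.getvalue()
```

Because the reader parses quoted fields properly, a record whose abstract spans several physical lines still comes out as one row, which is what the export needs for all 2415 entries to survive.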
@ -121,6 +121,11 @@
|
|||||||
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
||||||
<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
||||||
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
|
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
|
||||||
|
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
|
||||||
|
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
|
||||||
|
<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
|
||||||
|
<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
|
||||||
|
<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
<p></p>
|
<p></p>
|
||||||
|
@ -34,6 +34,11 @@
|
|||||||
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
||||||
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
||||||
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
|
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
|
||||||
|
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
|
||||||
|
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
|
||||||
|
<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
|
||||||
|
<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
|
||||||
|
<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
<p></p></description>
|
<p></p></description>
|
||||||
|
@ -121,6 +121,11 @@
|
|||||||
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
||||||
<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
||||||
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
|
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
|
||||||
|
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
|
||||||
|
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
|
||||||
|
<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
|
||||||
|
<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
|
||||||
|
<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
<p></p>
|
<p></p>
|
||||||
|
@ -34,6 +34,11 @@
|
|||||||
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
||||||
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
||||||
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
|
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
|
||||||
|
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
|
||||||
|
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
|
||||||
|
<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
|
||||||
|
<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
|
||||||
|
<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
<p></p></description>
|
<p></p></description>
|
||||||
|
@ -4,7 +4,7 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/2017-08/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/2017-08/</loc>
|
||||||
<lastmod>2017-08-01T11:57:37+03:00</lastmod>
|
<lastmod>2017-08-01T12:03:37+03:00</lastmod>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
<url>
|
<url>
|
||||||
@ -114,7 +114,7 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
||||||
<lastmod>2017-08-01T11:57:37+03:00</lastmod>
|
<lastmod>2017-08-01T12:03:37+03:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
@ -125,19 +125,19 @@
|
|||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
|
||||||
<lastmod>2017-08-01T11:57:37+03:00</lastmod>
|
<lastmod>2017-08-01T12:03:37+03:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/post/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/post/</loc>
|
||||||
<lastmod>2017-08-01T11:57:37+03:00</lastmod>
|
<lastmod>2017-08-01T12:03:37+03:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
<url>
|
<url>
|
||||||
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
|
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
|
||||||
<lastmod>2017-08-01T11:57:37+03:00</lastmod>
|
<lastmod>2017-08-01T12:03:37+03:00</lastmod>
|
||||||
<priority>0</priority>
|
<priority>0</priority>
|
||||||
</url>
|
</url>
|
||||||
|
|
||||||
|
@ -121,6 +121,11 @@
|
|||||||
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
||||||
<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
||||||
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
|
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
|
||||||
|
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
|
||||||
|
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
|
||||||
|
<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
|
||||||
|
<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
|
||||||
|
<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
<p></p>
|
<p></p>
|
||||||
|
@ -34,6 +34,11 @@
|
|||||||
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
|
||||||
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
|
||||||
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
|
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
|
||||||
|
<li>We might actually have to <em>block</em> these requests with HTTP 403 depending on the user agent</li>
|
||||||
|
<li>Abenet pointed out that the CGIAR Library Historical Archive collection I sent July 20th only had ~100 entries, instead of 2415</li>
|
||||||
|
<li>This was due to newline characters in the <code>dc.description.abstract</code> column, which caused OpenRefine to choke when exporting the CSV</li>
|
||||||
|
<li>I exported a new CSV from the collection on DSpace Test and then manually removed the characters in vim using <code>g/^$/d</code></li>
|
||||||
|
<li>Then I cleaned up the author authorities and HTML characters in OpenRefine and sent the file back to Abenet</li>
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
<p></p></description>
|
<p></p></description>
|
||||||
|