Update notes for 2017-08-01

2025-01-27 05:49:12 +01:00 · 2017-08-01 12:03:37 +03:00
parent e3e602881e
commit 5b11434f0f
30 changed files with 91 additions and 71 deletions
--- a/public/2017-08/index.html
+++ b/public/2017-08/index.html
@@ -21,6 +21,8 @@ But many of the bots are browsing dynamic URLs like:

 The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid them from accessing these!
 Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
+It turns out that we&rsquo;re already adding the X-Robots-Tag &quot;none&quot; HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;


 " />
@@ -30,7 +32,7 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura


 <meta property="article:published_time" content="2017-08-01T11:51:52&#43;03:00"/>
-<meta property="article:modified_time" content="2017-08-01T11:51:52&#43;03:00"/>
+<meta property="article:modified_time" content="2017-08-01T11:57:37&#43;03:00"/>



@@ -65,6 +67,8 @@ But many of the bots are browsing dynamic URLs like:

 The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid them from accessing these!
 Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
+It turns out that we&rsquo;re already adding the X-Robots-Tag &quot;none&quot; HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;


 "/>
@@ -79,9 +83,9 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura
  "@type": "BlogPosting",
  "headline": "August, 2017",
  "url": "https://alanorth.github.io/cgspace-notes/2017-08/",
-  "wordCount": "123",
+  "wordCount": "166",
  "datePublished": "2017-08-01T11:51:52&#43;03:00",
-  "dateModified": "2017-08-01T11:51:52&#43;03:00",
+  "dateModified": "2017-08-01T11:57:37&#43;03:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
@@ -159,6 +163,8 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura
 </ul></li>
 <li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li>
 <li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
+<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
+<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
 </ul>

 <p></p>