Update notes for 2017-08-01

2017-08-01 12:03:37 +03:00
parent e3e602881e
commit 5b11434f0f
30 changed files with 91 additions and 71 deletions

@ -21,6 +21,8 @@ But many of the bots are browsing dynamic URLs like:
The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
+It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+Also, the bot has to successfully browse the page first so it can receive the HTTP header…
" />
@ -30,7 +32,7 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura
<meta property="article:published_time" content="2017-08-01T11:51:52&#43;03:00"/>
-<meta property="article:modified_time" content="2017-08-01T11:51:52&#43;03:00"/>
+<meta property="article:modified_time" content="2017-08-01T11:57:37&#43;03:00"/>
@ -65,6 +67,8 @@ But many of the bots are browsing dynamic URLs like:
The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
+It turns out that we&rsquo;re already adding the X-Robots-Tag &quot;none&quot; HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;
"/>
@ -79,9 +83,9 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura
"@type": "BlogPosting",
"headline": "August, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-08/",
-"wordCount": "123",
+"wordCount": "166",
"datePublished": "2017-08-01T11:51:52&#43;03:00",
-"dateModified": "2017-08-01T11:51:52&#43;03:00",
+"dateModified": "2017-08-01T11:57:37&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -159,6 +163,8 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura
</ul></li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
+<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
+<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
</ul>
<p></p>
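The point about <code>robots.txt</code> only blocking the top-level <code>/discover</code> and <code>/browse</code> URLs can be checked with a rough sketch like the one below. It uses Python's standard <code>urllib.robotparser</code>, which matches rules by path prefix. The rules in <code>ROBOTS_TXT</code>, the bot name, and the nested handle path are all assumptions made up for illustration; they are not copied from the real site configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules mirroring the note: only the top-level paths are listed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /discover
Disallow: /browse
"""


def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the (assumed) rules let `agent` crawl `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # The nested path is a made-up example of a dynamic URL.
    for path in ("/discover", "/browse", "/handle/10568/1234/discover"):
        print(path, "allowed" if allowed(path) else "blocked")
```

Because the match is a simple prefix comparison, a path that merely ends in <code>/discover</code> is not covered by <code>Disallow: /discover</code>. This fits the observation above that bots keep reaching the dynamic URLs, and that the <code>X-Robots-Tag</code> header only arrives after the page has already been fetched.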