mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Update notes for 2017-08-01
@@ -21,6 +21,8 @@ But many of the bots are browsing dynamic URLs like:
 The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
 Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
+It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+Also, the bot has to successfully browse the page first so it can receive the HTTP header…
 " />
@@ -30,7 +32,7 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura
 <meta property="article:published_time" content="2017-08-01T11:51:52+03:00"/>
-<meta property="article:modified_time" content="2017-08-01T11:51:52+03:00"/>
+<meta property="article:modified_time" content="2017-08-01T11:57:37+03:00"/>
@@ -65,6 +67,8 @@ But many of the bots are browsing dynamic URLs like:
 The robots.txt only blocks the top-level /discover and /browse URLs… we will need to find a way to forbid them from accessing these!
 Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
+It turns out that we’re already adding the X-Robots-Tag "none" HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
+Also, the bot has to successfully browse the page first so it can receive the HTTP header…
 "/>
@@ -79,9 +83,9 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura
 "@type": "BlogPosting",
 "headline": "August, 2017",
 "url": "https://alanorth.github.io/cgspace-notes/2017-08/",
-"wordCount": "123",
+"wordCount": "166",
 "datePublished": "2017-08-01T11:51:52+03:00",
-"dateModified": "2017-08-01T11:51:52+03:00",
+"dateModified": "2017-08-01T11:57:37+03:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -159,6 +163,8 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura
 </ul></li>
 <li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs… we will need to find a way to forbid them from accessing these!</li>
 <li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
+<li>It turns out that we’re already adding the <code>X-Robots-Tag "none"</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
+<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header…</li>
 </ul>
 <p></p>
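
The note in this commit about robots.txt only covering the top-level /discover and /browse URLs can be checked with a small sketch. It assumes the rules are plain `Disallow: /discover` and `Disallow: /browse` prefix rules (the real robots.txt is not shown in this commit), and the handle path and the bot name `ExampleBot` below are hypothetical, used only for illustration. Python's standard `urllib.robotparser` does simple prefix matching, so a nested dynamic URL is probably not covered by such rules:

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: the actual robots.txt is not part of this commit.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /discover
Disallow: /browse
"""


def is_allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the assumed rules let `agent` crawl `path`."""
    parser = RobotFileParser()
    parser.parse(ASSUMED_ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # The last path is a hypothetical nested URL, not one taken from the notes.
    for path in ("/discover", "/browse?type=author", "/handle/123456789/1/discover"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Under these assumptions the top-level paths are blocked while the nested one is allowed, which matches the observation in the notes. Separately, the `X-Robots-Tag` header can only take effect after the bot has already fetched the page, so it does not reduce crawling either.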