Update notes for 2017-08-01

Alan Orth 2017-08-01 12:03:37 +03:00
parent e3e602881e
commit 5b11434f0f
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
30 changed files with 91 additions and 71 deletions

View File

@ -16,5 +16,7 @@ tags = ["Notes"]
- /handle/10568/16510/browse
- The `robots.txt` only blocks the top-level `/discover` and `/browse` URLs... we will need to find a way to forbid them from accessing these!
- Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
- It turns out that we're already adding the `X-Robots-Tag "none"` HTTP header, but this only forbids the search engine from _indexing_ the page, not crawling it!
- Also, the bot has to successfully browse the page first so it can receive the HTTP header...
<!--more-->
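
A minimal sketch of why the existing rules miss those dynamic URLs, assuming the `robots.txt` contains only the two top-level `Disallow` lines described above. The rule text and the `ExampleBot` user agent are illustrative assumptions, not copied from the live server. Python's standard `urllib.robotparser` matches rules by path prefix, so `/browse` is blocked while a `/browse` nested under a handle is not:

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: only the top-level /discover and /browse are disallowed.
rules = """\
User-agent: *
Disallow: /discover
Disallow: /browse
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)


def allowed(path):
    """Return True if a generic bot may fetch the given path under the rules."""
    return parser.can_fetch("ExampleBot", path)


# Top-level URL: matched by the "Disallow: /browse" prefix rule.
print(allowed("/browse"))
# Handle-scoped URL from the notes: no rule has this path as a prefix.
print(allowed("/handle/10568/16510/browse"))
```

Real crawlers may interpret rules differently (some support wildcards, which `urllib.robotparser` does not), so this only illustrates the prefix-matching gap, not how any particular bot behaves.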

View File

@ -25,7 +25,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
<meta property="article:published_time" content="2015-11-23T17:00:57&#43;03:00"/>
<meta property="article:modified_time" content="2015-11-23T17:00:57&#43;03:00"/>
<meta property="article:modified_time" content="2016-09-28T17:02:30&#43;03:00"/>
@ -71,7 +71,7 @@ $ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspac
"url": "https://alanorth.github.io/cgspace-notes/2015-11/",
"wordCount": "798",
"datePublished": "2015-11-23T17:00:57&#43;03:00",
"dateModified": "2015-11-23T17:00:57&#43;03:00",
"dateModified": "2016-09-28T17:02:30&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -26,7 +26,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
<meta property="article:published_time" content="2015-12-02T13:18:00&#43;03:00"/>
<meta property="article:modified_time" content="2015-12-02T13:18:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-09T16:18:07&#43;02:00"/>
@ -73,7 +73,7 @@ Replace lzop with xz in log compression cron jobs on DSpace Test—it uses less
"url": "https://alanorth.github.io/cgspace-notes/2015-12/",
"wordCount": "753",
"datePublished": "2015-12-02T13:18:00&#43;03:00",
"dateModified": "2015-12-02T13:18:00&#43;03:00",
"dateModified": "2017-01-09T16:18:07&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -21,7 +21,7 @@ Update GitHub wiki for documentation of maintenance tasks.
<meta property="article:published_time" content="2016-01-13T13:18:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-01-13T13:18:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-09T16:18:07&#43;02:00"/>
@ -63,7 +63,7 @@ Update GitHub wiki for documentation of maintenance tasks.
"url": "https://alanorth.github.io/cgspace-notes/2016-01/",
"wordCount": "466",
"datePublished": "2016-01-13T13:18:00&#43;03:00",
"dateModified": "2016-01-13T13:18:00&#43;03:00",
"dateModified": "2017-01-09T16:18:07&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -28,7 +28,7 @@ Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&r
<meta property="article:published_time" content="2016-02-05T13:18:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-02-05T13:18:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-09T16:18:07&#43;02:00"/>
@ -77,7 +77,7 @@ Also, lots of things like &ldquo;COTE D`LVOIRE&rdquo; and &ldquo;COTE D IVOIRE&r
"url": "https://alanorth.github.io/cgspace-notes/2016-02/",
"wordCount": "1657",
"datePublished": "2016-02-05T13:18:00&#43;03:00",
"dateModified": "2016-02-05T13:18:00&#43;03:00",
"dateModified": "2017-01-09T16:18:07&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -21,7 +21,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
<meta property="article:published_time" content="2016-03-02T16:50:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-03-02T16:50:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-09T16:18:07&#43;02:00"/>
@ -63,7 +63,7 @@ Reinstall my local (Mac OS X) DSpace stack with Tomcat 7, PostgreSQL 9.3, and Ja
"url": "https://alanorth.github.io/cgspace-notes/2016-03/",
"wordCount": "1581",
"datePublished": "2016-03-02T16:50:00&#43;03:00",
"dateModified": "2016-03-02T16:50:00&#43;03:00",
"dateModified": "2017-01-09T16:18:07&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -23,7 +23,7 @@ Also, I noticed the checker log has some errors we should pay attention to:
<meta property="article:published_time" content="2016-04-04T11:06:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-04-04T11:06:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-09-28T17:02:30&#43;03:00"/>
@ -67,7 +67,7 @@ Also, I noticed the checker log has some errors we should pay attention to:
"url": "https://alanorth.github.io/cgspace-notes/2016-04/",
"wordCount": "2006",
"datePublished": "2016-04-04T11:06:00&#43;03:00",
"dateModified": "2016-04-04T11:06:00&#43;03:00",
"dateModified": "2016-09-28T17:02:30&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -25,7 +25,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
<meta property="article:published_time" content="2016-05-01T23:06:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-05-01T23:06:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-09T16:18:07&#43;02:00"/>
@ -71,7 +71,7 @@ There are 3,000 IPs accessing the REST API in a 24-hour period!
"url": "https://alanorth.github.io/cgspace-notes/2016-05/",
"wordCount": "1349",
"datePublished": "2016-05-01T23:06:00&#43;03:00",
"dateModified": "2016-05-01T23:06:00&#43;03:00",
"dateModified": "2017-01-09T16:18:07&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -24,7 +24,7 @@ Working on second phase of metadata migration, looks like this will work for mov
<meta property="article:published_time" content="2016-06-01T10:53:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-06-01T10:53:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-09T16:18:07&#43;02:00"/>
@ -69,7 +69,7 @@ Working on second phase of metadata migration, looks like this will work for mov
"url": "https://alanorth.github.io/cgspace-notes/2016-06/",
"wordCount": "1549",
"datePublished": "2016-06-01T10:53:00&#43;03:00",
"dateModified": "2016-06-01T10:53:00&#43;03:00",
"dateModified": "2017-01-09T16:18:07&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -32,7 +32,7 @@ In this case the select query was showing 95 results before the update
<meta property="article:published_time" content="2016-07-01T10:53:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-07-01T10:53:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-09T16:18:07&#43;02:00"/>
@ -85,7 +85,7 @@ In this case the select query was showing 95 results before the update
"url": "https://alanorth.github.io/cgspace-notes/2016-07/",
"wordCount": "866",
"datePublished": "2016-07-01T10:53:00&#43;03:00",
"dateModified": "2016-07-01T10:53:00&#43;03:00",
"dateModified": "2017-01-09T16:18:07&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -29,7 +29,7 @@ $ git rebase -i dspace-5.5
<meta property="article:published_time" content="2016-08-01T15:53:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-08-01T15:53:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-09T16:18:07&#43;02:00"/>
@ -79,7 +79,7 @@ $ git rebase -i dspace-5.5
"url": "https://alanorth.github.io/cgspace-notes/2016-08/",
"wordCount": "1514",
"datePublished": "2016-08-01T15:53:00&#43;03:00",
"dateModified": "2016-08-01T15:53:00&#43;03:00",
"dateModified": "2017-01-09T16:18:07&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -25,7 +25,7 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=or
<meta property="article:published_time" content="2016-09-01T15:53:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-09-01T15:53:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-09T16:18:07&#43;02:00"/>
@ -71,7 +71,7 @@ $ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=or
"url": "https://alanorth.github.io/cgspace-notes/2016-09/",
"wordCount": "3298",
"datePublished": "2016-09-01T15:53:00&#43;03:00",
"dateModified": "2016-09-01T15:53:00&#43;03:00",
"dateModified": "2017-01-09T16:18:07&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -29,7 +29,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
<meta property="article:published_time" content="2016-10-03T15:53:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-10-03T15:53:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-10T16:21:47&#43;02:00"/>
@ -79,7 +79,7 @@ I exported a random item&rsquo;s metadata as CSV, deleted all columns except id
"url": "https://alanorth.github.io/cgspace-notes/2016-10/",
"wordCount": "1828",
"datePublished": "2016-10-03T15:53:00&#43;03:00",
"dateModified": "2016-10-03T15:53:00&#43;03:00",
"dateModified": "2017-01-10T16:21:47&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -21,7 +21,7 @@ Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module
<meta property="article:published_time" content="2016-11-01T09:21:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-11-01T09:21:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-10T16:21:47&#43;02:00"/>
@ -63,7 +63,7 @@ Add dc.type to the output options for Atmire&rsquo;s Listings and Reports module
"url": "https://alanorth.github.io/cgspace-notes/2016-11/",
"wordCount": "2825",
"datePublished": "2016-11-01T09:21:00&#43;03:00",
"dateModified": "2016-11-01T09:21:00&#43;03:00",
"dateModified": "2017-01-10T16:21:47&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -33,7 +33,7 @@ Another worrying error from dspace.log is:
<meta property="article:published_time" content="2016-12-02T10:43:00&#43;03:00"/>
<meta property="article:modified_time" content="2016-12-02T10:43:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-10T16:21:47&#43;02:00"/>
@ -87,7 +87,7 @@ Another worrying error from dspace.log is:
"url": "https://alanorth.github.io/cgspace-notes/2016-12/",
"wordCount": "4078",
"datePublished": "2016-12-02T10:43:00&#43;03:00",
"dateModified": "2016-12-02T10:43:00&#43;03:00",
"dateModified": "2017-01-10T16:21:47&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -21,7 +21,7 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
<meta property="article:published_time" content="2017-01-02T10:43:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-02T10:43:00&#43;03:00"/>
<meta property="article:modified_time" content="2017-01-29T13:18:32&#43;02:00"/>
@ -63,7 +63,7 @@ I asked on the dspace-tech mailing list because it seems to be broken, and actua
"url": "https://alanorth.github.io/cgspace-notes/2017-01/",
"wordCount": "1594",
"datePublished": "2017-01-02T10:43:00&#43;03:00",
"dateModified": "2017-01-02T10:43:00&#43;03:00",
"dateModified": "2017-01-29T13:18:32&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -35,7 +35,7 @@ Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
<meta property="article:published_time" content="2017-02-07T07:04:52-08:00"/>
<meta property="article:modified_time" content="2017-02-07T07:04:52-08:00"/>
<meta property="article:modified_time" content="2017-02-28T22:58:29&#43;02:00"/>
@ -91,7 +91,7 @@ Looks like we&rsquo;ll be using cg.identifier.ccafsprojectpii as the field name
"url": "https://alanorth.github.io/cgspace-notes/2017-02/",
"wordCount": "2028",
"datePublished": "2017-02-07T07:04:52-08:00",
"dateModified": "2017-02-07T07:04:52-08:00",
"dateModified": "2017-02-28T22:58:29&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -37,7 +37,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
<meta property="article:published_time" content="2017-03-01T17:08:52&#43;02:00"/>
<meta property="article:modified_time" content="2017-03-01T17:08:52&#43;02:00"/>
<meta property="article:modified_time" content="2017-03-31T05:36:10&#43;03:00"/>
@ -95,7 +95,7 @@ $ identify ~/Desktop/alc_contrastes_desafios.jpg
"url": "https://alanorth.github.io/cgspace-notes/2017-03/",
"wordCount": "1538",
"datePublished": "2017-03-01T17:08:52&#43;02:00",
"dateModified": "2017-03-01T17:08:52&#43;02:00",
"dateModified": "2017-03-31T05:36:10&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -30,7 +30,7 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Th
<meta property="article:published_time" content="2017-04-02T17:08:52&#43;02:00"/>
<meta property="article:modified_time" content="2017-04-02T17:08:52&#43;02:00"/>
<meta property="article:modified_time" content="2017-04-26T13:35:10&#43;03:00"/>
@ -81,7 +81,7 @@ $ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Th
"url": "https://alanorth.github.io/cgspace-notes/2017-04/",
"wordCount": "2917",
"datePublished": "2017-04-02T17:08:52&#43;02:00",
"dateModified": "2017-04-02T17:08:52&#43;02:00",
"dateModified": "2017-04-26T13:35:10&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -13,7 +13,7 @@
<meta property="article:published_time" content="2017-05-01T16:21:52&#43;02:00"/>
<meta property="article:modified_time" content="2017-05-01T16:21:52&#43;02:00"/>
<meta property="article:modified_time" content="2017-05-29T13:15:22&#43;03:00"/>
@ -47,7 +47,7 @@
"url": "https://alanorth.github.io/cgspace-notes/2017-05/",
"wordCount": "2412",
"datePublished": "2017-05-01T16:21:52&#43;02:00",
"dateModified": "2017-05-01T16:21:52&#43;02:00",
"dateModified": "2017-05-29T13:15:22&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -13,7 +13,7 @@
<meta property="article:published_time" content="2017-06-01T10:14:52&#43;03:00"/>
<meta property="article:modified_time" content="2017-06-01T10:14:52&#43;03:00"/>
<meta property="article:modified_time" content="2017-06-30T18:34:51&#43;03:00"/>
@ -47,7 +47,7 @@
"url": "https://alanorth.github.io/cgspace-notes/2017-06/",
"wordCount": "1261",
"datePublished": "2017-06-01T10:14:52&#43;03:00",
"dateModified": "2017-06-01T10:14:52&#43;03:00",
"dateModified": "2017-06-30T18:34:51&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -27,7 +27,7 @@ We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the
<meta property="article:published_time" content="2017-07-01T18:03:52&#43;03:00"/>
<meta property="article:modified_time" content="2017-07-01T18:03:52&#43;03:00"/>
<meta property="article:modified_time" content="2017-08-01T08:55:37&#43;03:00"/>
@ -75,7 +75,7 @@ We can use PostgreSQL&rsquo;s extended output format (-x) plus sed to format the
"url": "https://alanorth.github.io/cgspace-notes/2017-07/",
"wordCount": "1151",
"datePublished": "2017-07-01T18:03:52&#43;03:00",
"dateModified": "2017-07-01T18:03:52&#43;03:00",
"dateModified": "2017-08-01T08:55:37&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"

View File

@ -21,6 +21,8 @@ But many of the bots are browsing dynamic URLs like:
The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
It turns out that we&rsquo;re already adding the X-Robots-Tag &quot;none&quot; HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;
" />
@ -30,7 +32,7 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura
<meta property="article:published_time" content="2017-08-01T11:51:52&#43;03:00"/>
<meta property="article:modified_time" content="2017-08-01T11:51:52&#43;03:00"/>
<meta property="article:modified_time" content="2017-08-01T11:57:37&#43;03:00"/>
@ -65,6 +67,8 @@ But many of the bots are browsing dynamic URLs like:
The robots.txt only blocks the top-level /discover and /browse URLs&hellip; we will need to find a way to forbid them from accessing these!
Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.duraspace.org/browse/DS-2962
It turns out that we&rsquo;re already adding the X-Robots-Tag &quot;none&quot; HTTP header, but this only forbids the search engine from indexing the page, not crawling it!
Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;
"/>
@ -79,9 +83,9 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura
"@type": "BlogPosting",
"headline": "August, 2017",
"url": "https://alanorth.github.io/cgspace-notes/2017-08/",
"wordCount": "123",
"wordCount": "166",
"datePublished": "2017-08-01T11:51:52&#43;03:00",
"dateModified": "2017-08-01T11:51:52&#43;03:00",
"dateModified": "2017-08-01T11:57:37&#43;03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -159,6 +163,8 @@ Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): https://jira.dura
</ul></li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
</ul>
<p></p>

View File

@ -119,6 +119,8 @@
</ul></li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
</ul>
<p></p>

View File

@ -32,6 +32,8 @@
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;robots.txt&lt;/code&gt; only blocks the top-level &lt;code&gt;/discover&lt;/code&gt; and &lt;code&gt;/browse&lt;/code&gt; URLs&amp;hellip; we will need to find a way to forbid them from accessing these!&lt;/li&gt;
&lt;li&gt;Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): &lt;a href=&#34;https://jira.duraspace.org/browse/DS-2962&#34;&gt;https://jira.duraspace.org/browse/DS-2962&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;It turns out that we&amp;rsquo;re already adding the &lt;code&gt;X-Robots-Tag &amp;quot;none&amp;quot;&lt;/code&gt; HTTP header, but this only forbids the search engine from &lt;em&gt;indexing&lt;/em&gt; the page, not crawling it!&lt;/li&gt;
&lt;li&gt;Also, the bot has to successfully browse the page first so it can receive the HTTP header&amp;hellip;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;/p&gt;</description>

View File

@ -119,6 +119,8 @@
</ul></li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
</ul>
<p></p>

View File

@ -32,6 +32,8 @@
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;robots.txt&lt;/code&gt; only blocks the top-level &lt;code&gt;/discover&lt;/code&gt; and &lt;code&gt;/browse&lt;/code&gt; URLs&amp;hellip; we will need to find a way to forbid them from accessing these!&lt;/li&gt;
&lt;li&gt;Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): &lt;a href=&#34;https://jira.duraspace.org/browse/DS-2962&#34;&gt;https://jira.duraspace.org/browse/DS-2962&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;It turns out that we&amp;rsquo;re already adding the &lt;code&gt;X-Robots-Tag &amp;quot;none&amp;quot;&lt;/code&gt; HTTP header, but this only forbids the search engine from &lt;em&gt;indexing&lt;/em&gt; the page, not crawling it!&lt;/li&gt;
&lt;li&gt;Also, the bot has to successfully browse the page first so it can receive the HTTP header&amp;hellip;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;/p&gt;</description>

View File

@ -4,117 +4,117 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-08/</loc>
<lastmod>2017-08-01T11:51:52+03:00</lastmod>
<lastmod>2017-08-01T11:57:37+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-07/</loc>
<lastmod>2017-07-01T18:03:52+03:00</lastmod>
<lastmod>2017-08-01T08:55:37+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-06/</loc>
<lastmod>2017-06-01T10:14:52+03:00</lastmod>
<lastmod>2017-06-30T18:34:51+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-05/</loc>
<lastmod>2017-05-01T16:21:52+02:00</lastmod>
<lastmod>2017-05-29T13:15:22+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-04/</loc>
<lastmod>2017-04-02T17:08:52+02:00</lastmod>
<lastmod>2017-04-26T13:35:10+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-03/</loc>
<lastmod>2017-03-01T17:08:52+02:00</lastmod>
<lastmod>2017-03-31T05:36:10+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-02/</loc>
<lastmod>2017-02-07T07:04:52-08:00</lastmod>
<lastmod>2017-02-28T22:58:29+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2017-01/</loc>
<lastmod>2017-01-02T10:43:00+03:00</lastmod>
<lastmod>2017-01-29T13:18:32+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-12/</loc>
<lastmod>2016-12-02T10:43:00+03:00</lastmod>
<lastmod>2017-01-10T16:21:47+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-11/</loc>
<lastmod>2016-11-01T09:21:00+03:00</lastmod>
<lastmod>2017-01-10T16:21:47+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-10/</loc>
<lastmod>2016-10-03T15:53:00+03:00</lastmod>
<lastmod>2017-01-10T16:21:47+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-09/</loc>
<lastmod>2016-09-01T15:53:00+03:00</lastmod>
<lastmod>2017-01-09T16:18:07+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-08/</loc>
<lastmod>2016-08-01T15:53:00+03:00</lastmod>
<lastmod>2017-01-09T16:18:07+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-07/</loc>
<lastmod>2016-07-01T10:53:00+03:00</lastmod>
<lastmod>2017-01-09T16:18:07+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-06/</loc>
<lastmod>2016-06-01T10:53:00+03:00</lastmod>
<lastmod>2017-01-09T16:18:07+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-05/</loc>
<lastmod>2016-05-01T23:06:00+03:00</lastmod>
<lastmod>2017-01-09T16:18:07+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-04/</loc>
<lastmod>2016-04-04T11:06:00+03:00</lastmod>
<lastmod>2016-09-28T17:02:30+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-03/</loc>
<lastmod>2016-03-02T16:50:00+03:00</lastmod>
<lastmod>2017-01-09T16:18:07+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-02/</loc>
<lastmod>2016-02-05T13:18:00+03:00</lastmod>
<lastmod>2017-01-09T16:18:07+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2016-01/</loc>
<lastmod>2016-01-13T13:18:00+03:00</lastmod>
<lastmod>2017-01-09T16:18:07+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2015-12/</loc>
<lastmod>2015-12-02T13:18:00+03:00</lastmod>
<lastmod>2017-01-09T16:18:07+02:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/2015-11/</loc>
<lastmod>2015-11-23T17:00:57+03:00</lastmod>
<lastmod>2016-09-28T17:02:30+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2017-08-01T11:51:52+03:00</lastmod>
<lastmod>2017-08-01T11:57:37+03:00</lastmod>
<priority>0</priority>
</url>
@ -125,19 +125,19 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2017-08-01T11:51:52+03:00</lastmod>
<lastmod>2017-08-01T11:57:37+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/post/</loc>
<lastmod>2017-08-01T11:51:52+03:00</lastmod>
<lastmod>2017-08-01T11:57:37+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2017-08-01T11:51:52+03:00</lastmod>
<lastmod>2017-08-01T11:57:37+03:00</lastmod>
<priority>0</priority>
</url>

View File

@ -119,6 +119,8 @@
</ul></li>
<li>The <code>robots.txt</code> only blocks the top-level <code>/discover</code> and <code>/browse</code> URLs&hellip; we will need to find a way to forbid them from accessing these!</li>
<li>Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): <a href="https://jira.duraspace.org/browse/DS-2962">https://jira.duraspace.org/browse/DS-2962</a></li>
<li>It turns out that we&rsquo;re already adding the <code>X-Robots-Tag &quot;none&quot;</code> HTTP header, but this only forbids the search engine from <em>indexing</em> the page, not crawling it!</li>
<li>Also, the bot has to successfully browse the page first so it can receive the HTTP header&hellip;</li>
</ul>
<p></p>

View File

@ -32,6 +32,8 @@
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;robots.txt&lt;/code&gt; only blocks the top-level &lt;code&gt;/discover&lt;/code&gt; and &lt;code&gt;/browse&lt;/code&gt; URLs&amp;hellip; we will need to find a way to forbid them from accessing these!&lt;/li&gt;
&lt;li&gt;Relevant issue from DSpace Jira (semi resolved in DSpace 6.0): &lt;a href=&#34;https://jira.duraspace.org/browse/DS-2962&#34;&gt;https://jira.duraspace.org/browse/DS-2962&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;It turns out that we&amp;rsquo;re already adding the &lt;code&gt;X-Robots-Tag &amp;quot;none&amp;quot;&lt;/code&gt; HTTP header, but this only forbids the search engine from &lt;em&gt;indexing&lt;/em&gt; the page, not crawling it!&lt;/li&gt;
&lt;li&gt;Also, the bot has to successfully browse the page first so it can receive the HTTP header&amp;hellip;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;/p&gt;</description>