Update notes for 2020-08-05

This commit is contained in:
Alan Orth 2020-08-05 16:58:31 +03:00
parent 6a0e08aff0
commit c6ac8b9afe
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
19 changed files with 82 additions and 22 deletions

View File

@ -120,5 +120,65 @@ $ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspa
- Seems that something happened yesterday afternoon at around 5PM...
- For now I will just run all updates on the server and reboot it, as I have no idea what causes this issue
- I had to restart Tomcat 7 three times after the server came back up before all Solr statistics cores came up properly
- I checked the nginx logs around 5PM yesterday to see who was accessing the server:
```
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
```
- I see the Macaroni Bros are using their new user agent for harvesting: `RTB website BOT`
- But that pattern doesn't match in the nginx bot list or Tomcat's crawler session manager valve because we're only checking for `[Bb]ot`!
- So they have created thousands of Tomcat sessions:
```
$ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5693
```
- DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don't misuse the resources
- Perhaps `[Bb][Oo][Tt]`...
- I see another IP 104.198.96.245, which is also using the "RTB website BOT" but there are 70,000 hits in Solr from earlier this year before they started using the user agent
- I purged all the hits from Solr, including a few thousand from 64.62.202.71
- A few more IPs causing lots of Tomcat sessions yesterday:
```
$ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
1585
$ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5691
```
- 38.128.66.10 isn't creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:
```
Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
```
- 64.62.202.71 is using a user agent I've never seen before:
```
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
```
- So now our "bot" regex can't even match that...
- Unless we change it to `[Bb]\.?[Oo]\.?[Tt]\.?`... which seems to match all variations of "bot" I can think of right now, according to [regexr.com](https://regexr.com/59lpt):
```
RTB website BOT
Altmetribot
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
```
- And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):
```
$ cat dspace.log.2020-08-04 | grep "199.47.87.145" | grep -E 'sessi
on_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2777
```
- I will add `Turnitin` to the Tomcat Crawler Session Manager Valve regex as well...
<!-- vim: set sw=2 ts=2: -->

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Categories"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="CGSpace Notes"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -9,7 +9,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2020-08-03T16:27:51+03:00" />
<meta property="og:updated_time" content="2020-08-05T15:00:06+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>

View File

@ -4,27 +4,27 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2020-07/</loc>
<lastmod>2020-08-03T16:27:51+03:00</lastmod>
<lastmod>2020-08-05T15:00:06+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2020-08-03T16:27:51+03:00</lastmod>
<lastmod>2020-08-05T15:00:06+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2020-08-03T16:27:51+03:00</lastmod>
<lastmod>2020-08-05T15:00:06+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2020-08-03T16:27:51+03:00</lastmod>
<lastmod>2020-08-05T15:00:06+03:00</lastmod>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2020-08-03T16:27:51+03:00</lastmod>
<lastmod>2020-08-05T15:00:06+03:00</lastmod>
</url>
<url>