Add notes for 2020-07-12

2020-07-12 15:52:26 +03:00
parent 1cc1e23aba
commit 425a85df5b
28 changed files with 118 additions and 26 deletions

@@ -411,4 +411,47 @@ $ csvgrep -c matched -m true 2020-07-09-cgspace-sponsors-crossref.csv | wc -l
174
```
## 2020-07-12
- On 2020-07-10 Macaroni Bros emailed to ask if there are issues with CGSpace because they are getting HTTP 504 on the REST API
- First, I looked in Munin and I see a high number of DSpace sessions and threads on Friday evening around midnight, though that was much later than their email:
![DSpace sessions](/cgspace-notes/2020/07/jmx_dspace_sessions-day.png)
![Threads](/cgspace-notes/2020/07/threads-day.png)
![PostgreSQL locks](/cgspace-notes/2020/07/postgres_locks_ALL-day.png)
![PostgreSQL transactions](/cgspace-notes/2020/07/postgres_transactions_ALL-day.png)
- CPU load and memory were not high then, but there was some load on the database and firewall...
- Looking in the nginx logs I see a few IPs we've seen recently, like those 199.47.x.x IPs from Turnitin (which I need to remember to purge from Solr again because I didn't update the spider agents on CGSpace yet) and a new one, 186.112.8.167
- Also, the Turnitin bot doesn't re-use its Tomcat JSESSIONID; I see this in today's log:
```
# grep 199.47.87 dspace.log.2020-07-12 | grep -o -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2815
```
- So I need to add this alternative user-agent to the Tomcat Crawler Session Manager valve to force it to re-use a common bot session
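To re-check the session count after changing the valve, the one-liner above could be wrapped in a small helper. This is only a sketch: `count_sessions` is a hypothetical name, and it assumes the log lines contain `session_id=` followed by a 32-character ID, as in the grep above. It uses `grep -F` so the dots in the IP are matched literally rather than as regex wildcards.

```shell
# Hypothetical helper: count distinct Tomcat session IDs logged for a
# client IP (or IP prefix) in a DSpace log file.
# Usage: count_sessions <ip-or-prefix> <dspace-log-file>
count_sessions() {
    # -F matches the IP as a fixed string, so "." is not a wildcard
    grep -F -- "$1" "$2" \
        | grep -o -E 'session_id=[A-Z0-9]{32}' \
        | sort -u \
        | wc -l \
        | tr -d ' '   # some wc implementations pad the count with spaces
}
```

For example, `count_sessions 199.47.87 dspace.log.2020-07-12` should report a much smaller number once the bot is forced to share a session.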
- There are around 9,000 requests from `186.112.8.167` in Colombia with the user agent `Java/1.8.0_241`, but those were mostly to the REST API and I don't see any hits in Solr
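The per-IP and per-user-agent request counts from the nginx logs can be reproduced with a couple of small pipelines. This is a sketch under the assumption that the access log uses nginx's standard "combined" format (client IP as the first space-separated field, user agent as the last quoted field); `top_clients` and `top_agents` are hypothetical names, and the log path is whatever the server actually uses.

```shell
# Hypothetical helpers, assuming nginx "combined" log format.

# Print the N busiest client IPs (default 10) with their request counts.
# Usage: top_clients <access-log> [N]
top_clients() {
    awk '{ print $1 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Print the N most common user agents (default 10) with their counts.
# Splitting on double quotes makes the user agent the 6th field.
# Usage: top_agents <access-log> [N]
top_agents() {
    awk -F'"' '{ print $6 }' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

Each output line is a count followed by the IP or user agent, sorted with the heaviest clients first.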
- Earlier in the day Linode had alerted that there was high outgoing bandwidth
- I see that a new bot from 134.155.96.78 made ~10,000 requests with the user agent below, but it appears to already be in our DSpace user agent list via COUNTER-Robots:
```
Mozilla/5.0 (compatible; heritrix/3.4.0-SNAPSHOT-2019-02-07T13:53:20Z +http://ifm.uni-mannheim.de)
```
- Generate a list of sponsors to update our controlled vocabulary:
```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.description.sponsorship", count(text_value) FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=29 GROUP BY "dc.description.sponsorship" ORDER BY count DESC LIMIT 125) TO /tmp/2020-07-12-sponsors.csv;
COPY 125
dspace=# \q
$ csvcut -c 1 --tabs /tmp/2020-07-12-sponsors.csv > dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
# add XML formatting
$ dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
$ tidy -xml -utf8 -m -iq -w 0 dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
```
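The "add XML formatting" step above is done by hand, but it could plausibly be scripted. The sketch below wraps a plain list of sponsor names (one per line on stdin) in what I assume is the usual DSpace controlled-vocabulary layout: a root `node` element containing `isComposedBy` with one child `node` per term. The function name `sponsors_to_xml` and the root `id`/`label` values are assumptions, so compare the result with the existing vocabulary file before using it.

```shell
# Hypothetical helper: turn a list of names (one per line on stdin)
# into a DSpace-style controlled vocabulary XML document on stdout.
# The root id/label below are assumptions, not taken from the real file.
sponsors_to_xml() {
    awk '
        BEGIN {
            print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            print "<node id=\"dc-description-sponsorship\" label=\"dc-description-sponsorship\">"
            print "<isComposedBy>"
        }
        NF {
            # Escape XML special characters; "&" must be handled first
            name = $0
            gsub(/&/, "\\&amp;", name)
            gsub(/</, "\\&lt;", name)
            gsub(/>/, "\\&gt;", name)
            gsub(/"/, "\\&quot;", name)
            printf "<node id=\"%s\" label=\"%s\"/>\n", name, name
        }
        END {
            print "</isComposedBy>"
            print "</node>"
        }
    '
}
```

The output is not indented; the `tidy -xml` command above can still be run afterwards to pretty-print it in place.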
- Deploy latest `5_x-prod` branch on CGSpace (linode18), run all system updates, and reboot the server
- After rebooting it I had to restart Tomcat 7 once to get all Solr statistics cores to come up properly
<!-- vim: set sw=2 ts=2: -->