---
title: "September, 2024"
date: 2024-09-01T21:16:00-07:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2024-09-01
- Upgrade CGSpace to DSpace 7.6.2
<!--more-->
## 2024-09-05
- Finalize work on migrating DSpace Angular from Yarn to NPM
## 2024-09-06
- This morning Tomcat crashed due to an OOM kill:
```
Sep 06 00:00:24 server systemd[1]: tomcat9.service: A process of this unit has been killed by the OOM killer.
Sep 06 00:00:25 server systemd[1]: tomcat9.service: Main process exited, code=killed, status=9/KILL
Sep 06 00:00:25 server systemd[1]: tomcat9.service: Failed with result 'oom-kill'.
```
- According to the system journal, it was a Node.js dspace-angular process that tried to allocate memory and failed, thus invoking the OOM killer
- Currently I see high memory usage in those processes:
```console
$ pm2 status
┌────┬──────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐
│ id │ name         │ namespace   │ version │ mode    │ pid      │ uptime │ ↺    │ status    │ cpu      │ mem      │ user     │ watching │
├────┼──────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤
│ 0  │ dspace-ui    │ default     │ 7.6.3-… │ cluster │ 994      │ 4D     │ 0    │ online    │ 0%       │ 3.4gb    │ dspace   │ disabled │
│ 1  │ dspace-ui    │ default     │ 7.6.3-… │ cluster │ 1015     │ 4D     │ 0    │ online    │ 0%       │ 3.4gb    │ dspace   │ disabled │
│ 2  │ dspace-ui    │ default     │ 7.6.3-… │ cluster │ 1029     │ 4D     │ 0    │ online    │ 0%       │ 3.4gb    │ dspace   │ disabled │
│ 3  │ dspace-ui    │ default     │ 7.6.3-… │ cluster │ 1042     │ 4D     │ 0    │ online    │ 0%       │ 3.4gb    │ dspace   │ disabled │
└────┴──────────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┴──────────┴──────────┴──────────┴──────────┘
```
- I bet if I look in the logs I'll find some kind of heavy traffic on the frontend that is causing Angular SSR to cache aggressively
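- One way to confirm from the journal which process the kernel actually killed is to grep the kernel messages, for example:

```console
# journalctl -k --since today | grep -iE 'out of memory|oom-kill'
```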
## 2024-09-08
- Analyzing memory use in our DSpace hosts, which have 32GB of memory
- PostgreSQL's effective cache size is estimated at 11GB, which seems way too high since the database is only 2GB
- Realistically we should adjust this so that PostgreSQL uses ~8GB (or less) and each dspace-angular process is pinned at 2GB (see the sketch after this list)...
> Total - Solr - Tomcat - Postgres - Nginx - Angular
> 31366 - (1024×4.4) - 7168 - (8×1024) - 512 - (4×2048) = 2796.4 left...
- I put some of these changes in on DSpace Test and will monitor this week
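- For reference, the knobs involved on the Node side are pm2's memory-restart threshold and Node's old-space heap cap, and on the PostgreSQL side `shared_buffers` and `effective_cache_size` in postgresql.conf; a rough sketch with illustrative values (not necessarily the exact settings I deployed):

```console
$ pm2 delete dspace-ui
$ pm2 start dist/server/main.js --name dspace-ui -i 4 \
    --node-args='--max-old-space-size=2048' --max-memory-restart 2G
```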
## 2024-09-10
- Some bot in South Africa made a ton of requests to the API and sent the load through the roof:
```console
# grep -E '10/Sep/2024:1[01]' /var/log/nginx/api-access.log | awk '{print $1}' | sort | uniq -c | sort -h
...
149720 102.182.38.90
```
- They are using several user agents, so they are obviously a bot:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:130.0) Gecko/20100101 Firefox/130.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0
Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0
```
- I added them to the list of bot networks in nginx and the load went down
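- For the record, the "bot networks" list is just an nginx `geo` map keyed on the client address; a minimal sketch (the subnet and the 429 response are illustrative, not my exact config):

```
geo $bot_network {
    default          0;
    102.182.38.0/24  1;   # subnet seen hammering the API on 2024-09-10
}

server {
    location / {
        # deny or rate limit flagged networks
        if ($bot_network) {
            return 429;
        }
    }
}
```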
## 2024-09-11
- Upgrade DSpace 7 Test to Ubuntu 24.04
- I did some minor maintenance to test dspace-statistics-api with Python 3.12
- I tagged version 1.4.4 and released it on GitHub
## 2024-09-14
- Noticed a persistent higher than usual load on CGSpace and checked the server logs
- Found some new data center subnets to block because they were making thousands of requests with normal user agents
- I enabled HTTP/3 in nginx
- I enabled the SSR patch in Angular: <https://github.com/DSpace/dspace-angular/issues/3110>
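- The HTTP/3 change itself is only a few lines of nginx config; a sketch assuming nginx 1.25+ built with QUIC support (not necessarily my exact config):

```
server {
    listen 443 ssl;
    listen 443 quic reuseport;
    http2 on;

    # advertise HTTP/3 to clients connecting over TLS/TCP
    add_header Alt-Svc 'h3=":443"; ma=86400';
}
```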
## 2024-09-16
- Experiment with the [dspace-statistics-api-js](https://github.com/codeobia/dspace-statistics-api-js) on DSpace 7 Test
- In the past it always caused Solr to run out of memory, but I increased Solr's heap from 2g to 3g and it runs without crashing
- I attached VisualVM to Solr with a 3g and 4g heap and iterated over 1260 pages of results in the dspace-statistics-api-js:
![Solr with 3g heap](/cgspace-notes/2024/09/2024-09-16-Solr-3g-heap.png)
![Solr with 4g heap](/cgspace-notes/2024/09/2024-09-16-Solr-4g-heap.png)
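- For reference, the heap increase itself is a one-line change in Solr's startup config, assuming the stock `solr.in.sh` layout, followed by a restart of the Solr service:

```
# /etc/default/solr.in.sh
SOLR_HEAP="3g"
```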
## 2024-09-23
- Upgrade PostgreSQL from version 14 to 15 on DSpace Test the same way I did last year:
```console
# apt update
# apt install postgresql-15
# Update configs with Ansible
# systemctl stop tomcat9
# pg_ctlcluster 14 main stop
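# Back up the version 14 data and config directories before upgrading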
# tar -cvzpf var-lib-postgresql-14.tar.gz /var/lib/postgresql/14
# tar -cvzpf etc-postgresql-14.tar.gz /etc/postgresql/14
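# Stop and drop the stock (empty) 15 cluster so pg_upgradecluster can recreate it from 14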
# pg_ctlcluster 15 main stop
# pg_dropcluster 15 main
# pg_upgradecluster 14 main
# pg_ctlcluster 15 main start
...
ERROR: could not find function "xml_is_well_formed" in file "/usr/lib/postgresql/15/lib/pgxml.so"
ERROR: function public.xml_is_well_formed(text) does not exist
ERROR: could not find function "xml_is_well_formed" in file "/usr/lib/postgresql/15/lib/pgxml.so"
ERROR: function public.xml_valid(text) does not exist
```
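- Those errors look like leftovers from the old `xml2` contrib module (`pgxml.so`) still registered in the `public` schema; assuming nothing uses them, one way to clean them up would be:

```console
$ psql dspace -c 'DROP FUNCTION IF EXISTS public.xml_is_well_formed(text)'
$ psql dspace -c 'DROP FUNCTION IF EXISTS public.xml_valid(text)'
```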
- After that I [reindexed the database](https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after-upgrading-to-postgresql-13/) using a generated script:
```console
$ su - postgres
$ cat /tmp/generate-reindex.sql
SELECT 'REINDEX TABLE CONCURRENTLY ' || quote_ident(relname) || ' /*' || pg_size_pretty(pg_total_relation_size(C.oid)) || '*/;'
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname = 'public'
AND C.relkind = 'r'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(C.oid) ASC;
$ psql dspace < /tmp/generate-reindex.sql > /tmp/reindex.sql
$ <trim the extra stuff from /tmp/reindex.sql>
$ psql dspace < /tmp/reindex.sql
```
- The database shrank by 186MB!
## 2024-09-29
- I upgraded the database on CGSpace to PostgreSQL 15
<!-- vim: set sw=2 ts=2: -->