mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Update notes
@ -168,3 +168,114 @@ UPDATE 108
- cg.identifier.ccafsprojectpii
- cg.subject.cifor
- For now we will keep them in the search filters
- I modified my `check-duplicates.py` script a bit to fix a logic error for deleted items and add similarity scores from spacy (see: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents)
- I want to use this with the MARLO innovation reports, to find related publications and working papers on CGSpace
- I am curious to see how the similarity scores compare to those from trgm... perhaps we don't need them actually
- Deploy latest changes to submission form, Discovery, and browse on CGSpace
- Also run all system updates and reboot the host
- Fix 152 `dcterms.relation` that are using "cgspace.cgiar.org" links instead of handles:

```console
UPDATE metadatavalue SET text_value = REGEXP_REPLACE(text_value, '.*cgspace\.cgiar\.org/handle/(\d+/\d+)$', 'https://hdl.handle.net/\1') WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=180 AND text_value ~ 'cgspace\.cgiar\.org/handle/\d+/\d+$';
```

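- A rough Python sketch of the same rewrite, which could be used to preview the effect on a few values before touching the database. The regular expression is a direct translation of the one in the `UPDATE`; the function name and the sample values are made up for illustration:

```python
import re

# Python translation of the PostgreSQL pattern used in the UPDATE statement.
PATTERN = re.compile(r".*cgspace\.cgiar\.org/handle/(\d+/\d+)$")


def to_handle_url(text_value: str) -> str:
    """Rewrite a CGSpace item link to its hdl.handle.net form.

    Values that do not end in cgspace.cgiar.org/handle/<prefix>/<suffix>
    are returned unchanged.
    """
    return PATTERN.sub(r"https://hdl.handle.net/\1", text_value)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the real database.
    samples = (
        "https://cgspace.cgiar.org/handle/10568/12345",
        "https://hdl.handle.net/10568/12345",
    )
    for value in samples:
        print(value, "->", to_handle_url(value))
```
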
## 2022-07-10

- UptimeRobot says that CGSpace is down
- I see high load around 22, high CPU around 800%
- Doesn't seem to be a lot of unique IPs:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access,oai,rest}.log | sort -u | wc -l
2243
```

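- A minimal Python sketch of the same count that also ranks the busiest clients, assuming (as the `awk` command does) that the client address is the first whitespace-separated field of each log line. The function name is hypothetical and the sample lines use documentation-only addresses:

```python
from collections import Counter
from typing import Iterable


def count_clients(lines: Iterable[str]) -> Counter:
    """Count requests per client, keyed on the first field of each line."""
    hits: Counter = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            hits[fields[0]] += 1
    return hits


if __name__ == "__main__":
    # Made-up log lines; in practice these would be read from the nginx logs.
    sample = [
        '192.0.2.10 - - "GET /handle/10568/1 HTTP/1.1" 200',
        '192.0.2.10 - - "GET /handle/10568/2 HTTP/1.1" 200',
        '203.0.113.5 - - "GET / HTTP/1.1" 200',
    ]
    hits = count_clients(sample)
    print(len(hits), "unique addresses")
    for address, count in hits.most_common(20):
        print(f"{count:7d} {address}")
```
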
Looking at the top twenty I see some usual IPs, but some new ones on Hetzner that are using many DSpace sessions:

```console
$ grep 65.109.2.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1613
$ grep 95.216.174.97 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1696
$ grep 65.109.15.213 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1708
$ grep 65.108.80.78 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1830
$ grep 65.108.95.23 dspace.log.2022-07-10 | grep -oE 'session_id=[A-Z0-9]{32}:ip_addr=' | sort | uniq | wc -l
1811
```

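- The per-IP `grep` pipelines could be folded into one pass over the log. A hedged sketch: it assumes the IPv4 client address follows `ip_addr=` directly after the session ID (the `grep` commands only match the IP anywhere on the line, so counts could differ slightly), and the function name and sample lines are made up:

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Assumption: "session_id=<32 chars>:ip_addr=<IPv4 address>" appears in each
# relevant line. IPv6 clients are deliberately not handled here.
SESSION_RE = re.compile(
    r"session_id=([A-Z0-9]{32}):ip_addr=(\d{1,3}(?:\.\d{1,3}){3})"
)


def sessions_per_ip(lines: Iterable[str]) -> Dict[str, int]:
    """Return the number of distinct session IDs seen for each IPv4 address."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        for session_id, ip in SESSION_RE.findall(line):
            seen[ip].add(session_id)
    return {ip: len(ids) for ip, ids in seen.items()}


if __name__ == "__main__":
    # Synthetic lines with fake session IDs, only to show the shape of the output.
    sample = [
        "INFO session_id=" + "A" * 32 + ":ip_addr=192.0.2.10:view_item",
        "INFO session_id=" + "B" * 32 + ":ip_addr=192.0.2.10:view_item",
        "INFO session_id=" + "A" * 32 + ":ip_addr=192.0.2.10:search",
    ]
    for ip, count in sorted(sessions_per_ip(sample).items()):
        print(ip, count)
```
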
- These IPs are using normal-looking user agents:
  - `Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0.1`
  - `Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/45.0`
  - `Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1`
  - `Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.85 Safari/537.36`
- I will add networks I'm seeing now to nginx's bot-networks.conf for now (not all of Hetzner) and purge the hits later:
  - 65.108.0.0/16
  - 65.21.0.0/16
  - 95.216.0.0/16
  - 135.181.0.0/16
  - 138.201.0.0/16
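- A quick way to check which client addresses a list like this covers, sketched with Python's standard `ipaddress` module. The function name is hypothetical; the networks are the five listed here and the addresses are the ones from the session counts. With the list as written, the two 65.109.x.x addresses come back unmatched:

```python
import ipaddress
from typing import Optional, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# The five networks added to bot-networks.conf in these notes.
BOT_NETWORKS = [
    ipaddress.ip_network(n)
    for n in (
        "65.108.0.0/16",
        "65.21.0.0/16",
        "95.216.0.0/16",
        "135.181.0.0/16",
        "138.201.0.0/16",
    )
]


def matching_network(address: str) -> Optional[Network]:
    """Return the first listed network containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for network in BOT_NETWORKS:
        if ip in network:
            return network
    return None


if __name__ == "__main__":
    # Addresses taken from the session counts earlier in these notes.
    for address in (
        "65.109.2.97",
        "95.216.174.97",
        "65.109.15.213",
        "65.108.80.78",
        "65.108.95.23",
    ):
        print(address, "->", matching_network(address) or "not covered")
```
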
- I think I'm going to get to a point where I categorize all commercial subnets as bots by default and then whitelist those we need
- Sheesh, there are a bunch more IPv6 addresses also on Hetzner:

```console
# awk '{print $1}' /var/log/nginx/{access,library-access}.log | sort | grep 2a01:4f9 | uniq -c | sort -h
      1 2a01:4f9:6a:1c2b::2
      2 2a01:4f9:2b:5a8::2
      2 2a01:4f9:4b:4495::2
     96 2a01:4f9:c010:518c::1
    137 2a01:4f9:c010:a9bc::1
    142 2a01:4f9:c010:58c9::1
    142 2a01:4f9:c010:58ea::1
    144 2a01:4f9:c010:58eb::1
    145 2a01:4f9:c010:6ff8::1
    148 2a01:4f9:c010:5190::1
    149 2a01:4f9:c010:7d6d::1
    153 2a01:4f9:c010:5226::1
    156 2a01:4f9:c010:7f74::1
    160 2a01:4f9:c010:5188::1
    161 2a01:4f9:c010:58e5::1
    168 2a01:4f9:c010:58ed::1
    170 2a01:4f9:c010:548e::1
    170 2a01:4f9:c010:8c97::1
    175 2a01:4f9:c010:58c8::1
    175 2a01:4f9:c010:aada::1
    182 2a01:4f9:c010:58ec::1
    182 2a01:4f9:c010:ae8c::1
    502 2a01:4f9:c010:ee57::1
    530 2a01:4f9:c011:567a::1
    535 2a01:4f9:c010:d04e::1
    539 2a01:4f9:c010:3d9a::1
    586 2a01:4f9:c010:93db::1
    593 2a01:4f9:c010:a04a::1
    601 2a01:4f9:c011:4166::1
    607 2a01:4f9:c010:9881::1
    640 2a01:4f9:c010:87fb::1
    648 2a01:4f9:c010:e680::1
   1141 2a01:4f9:3a:2696::2
   1146 2a01:4f9:3a:2555::2
   3207 2a01:4f9:3a:2c19::2
```

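- Since most of these addresses share a few prefixes, the hits could be summed per network before deciding what to block. A small sketch using the standard `ipaddress` module; the /48 grouping is my own assumption for illustration (not something Hetzner documents here), the function name is hypothetical, and the sample pairs are a handful of rows from the listing:

```python
import ipaddress
from collections import Counter
from typing import Iterable, Tuple


def hits_per_prefix(counts: Iterable[Tuple[str, int]], prefixlen: int = 48) -> Counter:
    """Sum hit counts per enclosing network of the given prefix length."""
    totals: Counter = Counter()
    for address, hits in counts:
        # strict=False masks off the host bits instead of raising ValueError.
        network = ipaddress.ip_network(f"{address}/{prefixlen}", strict=False)
        totals[network] += hits
    return totals


if __name__ == "__main__":
    # A few (address, hits) rows copied from the listing.
    sample = [
        ("2a01:4f9:c010:518c::1", 96),
        ("2a01:4f9:c010:a9bc::1", 137),
        ("2a01:4f9:3a:2696::2", 1141),
        ("2a01:4f9:3a:2c19::2", 3207),
    ]
    for network, hits in hits_per_prefix(sample).most_common():
        print(f"{hits:7d} {network}")
```
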
- Maybe it's time I ban all of Hetzner... sheesh.
- I left for a few hours and the server was going up and down the whole time, still very high CPU and database when I got back
- I am not sure what's going on
- I extracted all the IPs and used `resolve-addresses-geoip2.py` to analyze them and extract all the Hetzner networks and block them
- It's 181 IPs on Hetzner...
- I rebooted the server to see if it was just some stuck locks in PostgreSQL...
- The load is still higher than I would expect, and after a few more hours I see more Hetzner IPs coming through? Two more subnets to block
- Start a harvest on AReS

## 2022-07-12

- Update an incorrect ORCID identifier for Alliance

<!-- vim: set sw=2 ts=2: -->