- I think we can move those to a new `cg.identifier.project` if we create one
- The `cg.identifier.cpwfproject` field is similarly sparse, but the CCAFS ones are widely used
## 2024-01-12
- Export a list of affiliations to do some cleanup:
```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 211 GROUP BY "cg.contributor.affiliation" ORDER BY count DESC) to /tmp/2024-01-affiliations.csv WITH CSV HEADER;
COPY 11719
```
- I first did some clustering and editing in OpenRefine; next I will import those corrections back into CGSpace and do another export
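- A rough sketch of applying one corrected value directly in PostgreSQL (placeholder values; the real corrections will come from the cleaned OpenRefine CSV, and field 211 is `cg.contributor.affiliation` as above):
```console
localhost/dspace7= ☘ BEGIN;
localhost/dspace7= ☘ -- placeholder values, repeated per correction from the OpenRefine CSV
localhost/dspace7= ☘ UPDATE metadatavalue SET text_value = 'New Affiliation Name' WHERE dspace_object_id IN (SELECT dspace_object_id FROM item) AND metadata_field_id = 211 AND text_value = 'Old Affiliation Name';
localhost/dspace7= ☘ COMMIT;
```
- Metadata changed directly in the database would also need a Discovery reindex (`./dspace index-discovery -b`) before it shows up in search and browse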
- Troubleshooting the statistics pages that aren't working on DSpace 7
- On a hunch, I queried for Solr statistics documents that **did not have an `id` matching the 36-character UUID pattern**:
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&rows=0'
{
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"q":"-id:/.{36}/",
"rows":"0"}},
"response":{"numFound":800167,"start":0,"numFoundExact":true,"docs":[]
}}
```
- They seem to come mostly from 2020, 2023, and 2024:
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2010-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1YEAR&rows=0'
{
"responseHeader":{
"status":0,
"QTime":13,
"params":{
"facet.range":"time",
"q":"-id:/.{36}/",
"facet.range.gap":"+1YEAR",
"rows":"0",
"facet":"true",
"facet.range.start":"2010-01-01T00:00:00Z",
"facet.range.end":"NOW"}},
"response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{
"time":{
"counts":[
"2010-01-01T00:00:00Z",0,
"2011-01-01T00:00:00Z",0,
"2012-01-01T00:00:00Z",0,
"2013-01-01T00:00:00Z",0,
"2014-01-01T00:00:00Z",0,
"2015-01-01T00:00:00Z",89,
"2016-01-01T00:00:00Z",11,
"2017-01-01T00:00:00Z",0,
"2018-01-01T00:00:00Z",0,
"2019-01-01T00:00:00Z",0,
"2020-01-01T00:00:00Z",1339,
"2021-01-01T00:00:00Z",0,
"2022-01-01T00:00:00Z",0,
"2023-01-01T00:00:00Z",653736,
"2024-01-01T00:00:00Z",144993],
"gap":"+1YEAR",
"start":"2010-01-01T00:00:00Z",
"end":"2025-01-01T00:00:00Z"}},
"facet_intervals":{},
"facet_heatmaps":{}}}
```
- They seem to come from 2023-08 until now (so way before we migrated to DSpace 7):
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2023-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1MONTH&rows=0'
{
"responseHeader":{
"status":0,
"QTime":196,
"params":{
"facet.range":"time",
"q":"-id:/.{36}/",
"facet.range.gap":"+1MONTH",
"rows":"0",
"facet":"true",
"facet.range.start":"2023-01-01T00:00:00Z",
"facet.range.end":"NOW"}},
"response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{},
"facet_ranges":{
"time":{
"counts":[
"2023-01-01T00:00:00Z",1,
"2023-02-01T00:00:00Z",0,
"2023-03-01T00:00:00Z",0,
"2023-04-01T00:00:00Z",0,
"2023-05-01T00:00:00Z",0,
"2023-06-01T00:00:00Z",0,
"2023-07-01T00:00:00Z",0,
"2023-08-01T00:00:00Z",27621,
"2023-09-01T00:00:00Z",59165,
"2023-10-01T00:00:00Z",115338,
"2023-11-01T00:00:00Z",96147,
"2023-12-01T00:00:00Z",355464,
"2024-01-01T00:00:00Z",125429],
"gap":"+1MONTH",
"start":"2023-01-01T00:00:00Z",
"end":"2024-02-01T00:00:00Z"}},
"facet_intervals":{},
"facet_heatmaps":{}}}
```
- I see that we had 31,744 statistics events yesterday, and 799 of them have no `id`!
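- For example, a query like this should count only the id-less documents from the last day (a sketch, assuming the same `time` field and standard Solr date math):
```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F+AND+time%3A%5BNOW%2FDAY-1DAY+TO+NOW%2FDAY%5D&rows=0'
```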
- I asked about this on Slack and will file an issue on GitHub if someone else also finds such records
- Several people said they have them, so it's a bug of some sort in DSpace, not our configuration
## 2024-01-13
- Yesterday alone we had 37,000 unique IPs making requests to nginx
- I looked up the ASNs and found 6,000 IPs from one network in Amazon Singapore: 47.128.0.0/14
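- For example, with the day's unique IPs saved to a file, grepcidr can count the ones inside that range (a sketch; the file name is hypothetical):
```console
$ grepcidr 47.128.0.0/14 /tmp/2024-01-13-ips.txt | wc -l  # hypothetical list of the day's unique IPs
```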
## 2024-01-15
- Investigating the CSS selector warning that I've seen in PM2 logs:
```console
0|dspace-ui | 1 rules skipped due to selector errors:
0|dspace-ui | .custom-file-input:lang(en)~.custom-file-label -> unmatched pseudo-class :lang
```
- It seems to be a bug in Angular, as this selector comes from Bootstrap 4.6.x and is actually valid
- But that led me to a more interesting issue with the `inlineCritical` optimization for styles in Angular SSR that might be causing high load in the frontend
- See: https://github.com/angular/angular/issues/42098
- See: https://github.com/angular/universal/issues/2106
- See: https://github.com/GoogleChromeLabs/critters/issues/78
- Since the production site was flapping a lot, I decided to try disabling `inlineCriticalCss`
- There have been on and off load issues with the Angular frontend today
- I think I will just block all data center networks for now
- In the last week I see almost 200,000 unique IPs:
```console
# zcat -f /var/log/nginx/*access.log /var/log/nginx/*access.log.1 /var/log/nginx/*access.log.2.gz /var/log/nginx/*access.log.3.gz /var/log/nginx/*access.log.4.gz /var/log/nginx/*access.log.5.gz /var/log/nginx/*access.log.6.gz | awk '{print $1}' | sort -u | tee /tmp/ips.txt | wc -l
196493
```
- Looking these IPs up I see there are 18,000 coming from Comcast, 10,000 from AT&T, 4,110 from Charter, 3,500 from Cox, and dozens of other residential ISPs
- I highly doubt these are home users browsing CGSpace... seems super fishy
- Also, over 1,000 IPs from SpaceX Starlink in the last week. RIGHT
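- These ASN lookups can be done offline against a local GeoLite2 ASN database with mmdblookup, for example (a sketch with a placeholder IP and database path, not necessarily the exact tooling used here):
```console
$ mmdblookup --file /var/lib/GeoIP/GeoLite2-ASN.mmdb --ip 203.0.113.4 autonomous_system_organization  # placeholder IP; loop over /tmp/ips.txt in practice
```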
- I will temporarily add a few new datacenter ISP network blocks to our rate limit:
- 16509 Amazon-02
- 701 UUNET
- 8075 Microsoft
- 15169 Google
- 14618 Amazon-AES
- 396982 Google Cloud
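- One way to translate those ASNs into network blocks for the nginx rate limit is to query the RADB route registry, for example for AS16509 (a sketch; the prefixes would still need de-duplicating and aggregating before going into the nginx config):
```console
$ whois -h whois.radb.net -- '-i origin AS16509' | grep -E '^route:' | awk '{print $2}'  # IPv4 prefixes originated by AS16509 (Amazon-02)
```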
- The load on the server *immediately* dropped
<!-- vim: set sw=2 ts=2: -->