- I think we can move those to a new `cg.identifier.project` if we create one
- The `cg.identifier.cpwfproject` field is similarly sparse, but the CCAFS ones are widely used

## 2024-01-12

- Export a list of affiliations to do some cleanup:

```console
localhost/dspace7= ☘ \COPY (SELECT DISTINCT text_value AS "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id = 211 GROUP BY "cg.contributor.affiliation" ORDER BY count DESC) to /tmp/2024-01-affiliations.csv WITH CSV HEADER;
COPY 11719
```

- I did some initial clustering and editing in OpenRefine; I'll import those back into CGSpace and then do another export
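- As a rough illustration of the kind of fix that comes out of that cleanup (the values below are made up, not from my actual CSV), a single corrected affiliation could be applied directly in PostgreSQL against the same metadata field:

```console
localhost/dspace7= ☘ UPDATE metadatavalue SET text_value = 'International Livestock Research Institute' WHERE metadata_field_id = 211 AND text_value = 'Intl Livestock Research Inst';
```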
- Troubleshooting the statistics pages that aren't working on DSpace 7
- On a hunch, I queried for Solr statistics documents that **did not have an `id` matching the 36-character UUID pattern**:

```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"-id:/.{36}/",
      "rows":"0"}},
  "response":{"numFound":800167,"start":0,"numFoundExact":true,"docs":[]
  }}
```
- They seem to come mostly from 2020, 2023, and 2024:

```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2010-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1YEAR&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":13,
    "params":{
      "facet.range":"time",
      "q":"-id:/.{36}/",
      "facet.range.gap":"+1YEAR",
      "rows":"0",
      "facet":"true",
      "facet.range.start":"2010-01-01T00:00:00Z",
      "facet.range.end":"NOW"}},
  "response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{
      "time":{
        "counts":[
          "2010-01-01T00:00:00Z",0,
          "2011-01-01T00:00:00Z",0,
          "2012-01-01T00:00:00Z",0,
          "2013-01-01T00:00:00Z",0,
          "2014-01-01T00:00:00Z",0,
          "2015-01-01T00:00:00Z",89,
          "2016-01-01T00:00:00Z",11,
          "2017-01-01T00:00:00Z",0,
          "2018-01-01T00:00:00Z",0,
          "2019-01-01T00:00:00Z",0,
          "2020-01-01T00:00:00Z",1339,
          "2021-01-01T00:00:00Z",0,
          "2022-01-01T00:00:00Z",0,
          "2023-01-01T00:00:00Z",653736,
          "2024-01-01T00:00:00Z",144993],
        "gap":"+1YEAR",
        "start":"2010-01-01T00:00:00Z",
        "end":"2025-01-01T00:00:00Z"}},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
```
- They seem to start in 2023-08 (so well before we migrated to DSpace 7) and continue until now:

```console
$ curl 'http://localhost:8983/solr/statistics/select?q=-id%3A%2F.\{36\}%2F&facet.range=time&facet=true&facet.range.start=2023-01-01T00:00:00Z&facet.range.end=NOW&facet.range.gap=%2B1MONTH&rows=0'
{
  "responseHeader":{
    "status":0,
    "QTime":196,
    "params":{
      "facet.range":"time",
      "q":"-id:/.{36}/",
      "facet.range.gap":"+1MONTH",
      "rows":"0",
      "facet":"true",
      "facet.range.start":"2023-01-01T00:00:00Z",
      "facet.range.end":"NOW"}},
  "response":{"numFound":800168,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{
      "time":{
        "counts":[
          "2023-01-01T00:00:00Z",1,
          "2023-02-01T00:00:00Z",0,
          "2023-03-01T00:00:00Z",0,
          "2023-04-01T00:00:00Z",0,
          "2023-05-01T00:00:00Z",0,
          "2023-06-01T00:00:00Z",0,
          "2023-07-01T00:00:00Z",0,
          "2023-08-01T00:00:00Z",27621,
          "2023-09-01T00:00:00Z",59165,
          "2023-10-01T00:00:00Z",115338,
          "2023-11-01T00:00:00Z",96147,
          "2023-12-01T00:00:00Z",355464,
          "2024-01-01T00:00:00Z",125429],
        "gap":"+1MONTH",
        "start":"2023-01-01T00:00:00Z",
        "end":"2024-02-01T00:00:00Z"}},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
```
- I see that we had 31,744 statistics events yesterday, and 799 of them have no `id`!
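- For reference, counts like those can be pulled with Solr date math, where the first query gives yesterday's total and the second the documents whose `id` doesn't match the UUID pattern (a sketch of equivalent queries, not necessarily the exact ones I ran):

```console
$ curl -s 'http://localhost:8983/solr/statistics/select' -d 'rows=0' --data-urlencode 'q=time:[NOW/DAY-1DAY TO NOW/DAY]'
$ curl -s 'http://localhost:8983/solr/statistics/select' -d 'rows=0' --data-urlencode 'q=-id:/.{36}/ AND time:[NOW/DAY-1DAY TO NOW/DAY]'
```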
- I asked about this on Slack and will file an issue on GitHub if someone else also finds such records
- Several people said they have them, so it's a bug of some sort in DSpace, not our configuration

## 2024-01-13

- Yesterday alone we had 37,000 unique IPs making requests to nginx
- I looked up the ASNs and found 6,000 IPs from this network in Amazon Singapore: 47.128.0.0/14
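- Counting how many of a day's unique IPs fall inside a network block like that is easy with grepcidr (a sketch, assuming the day's unique IPs were dumped to a file; the path here is hypothetical):

```console
$ grepcidr 47.128.0.0/14 /tmp/ips-2024-01-12.txt | wc -l
```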
## 2024-01-15

- Investigating the CSS selector warning that I've seen in PM2 logs:

```console
0|dspace-ui | 1 rules skipped due to selector errors:
0|dspace-ui | .custom-file-input:lang(en)~.custom-file-label -> unmatched pseudo-class :lang
```

- It seems to be a bug in Angular, as this selector comes from Bootstrap 4.6.x and is not actually invalid
- But that led me to a more interesting issue with `inlineCritical` optimization for styles in Angular SSR that might be responsible for causing high load in the frontend
  - See: https://github.com/angular/angular/issues/42098
  - See: https://github.com/angular/universal/issues/2106
  - See: https://github.com/GoogleChromeLabs/critters/issues/78
- Since the production site was flapping a lot, I decided to try disabling `inlineCriticalCss`
- There have been on-and-off load issues with the Angular frontend today
- I think I will just block all data center network blocks for now
- In the last week I see almost 200,000 unique IPs:

```console
# zcat -f /var/log/nginx/*access.log /var/log/nginx/*access.log.1 /var/log/nginx/*access.log.2.gz /var/log/nginx/*access.log.3.gz /var/log/nginx/*access.log.4.gz /var/log/nginx/*access.log.5.gz /var/log/nginx/*access.log.6.gz | awk '{print $1}' | sort -u |
tee /tmp/ips.txt | wc -l
196493
```
- Looking these IPs up I see there are 18,000 coming from Comcast, 10,000 from AT&T, 4,110 from Charter, 3,500 from Cox, and dozens of other residential ISPs
- I highly doubt these are home users browsing CGSpace... seems super fishy
- Also, over 1,000 IPs from SpaceX Starlink in the last week. RIGHT
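- A quick way to get a per-organization tally like that is to run the IP list through a local GeoLite2 ASN database with mmdblookup (a sketch, assuming the libmaxminddb tools and a downloaded database; the database path may differ):

```console
$ while read -r ip; do mmdblookup --file /var/lib/GeoIP/GeoLite2-ASN.mmdb --ip "$ip" autonomous_system_organization; done < /tmp/ips.txt | grep utf8_string | sort | uniq -c | sort -rn | head
```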
- I will temporarily add a few new datacenter ISP network blocks to our rate limit:
  - 16509 Amazon-02
  - 701 UUNET
  - 8075 Microsoft
  - 15169 Google
  - 14618 Amazon-AES
  - 396982 Google Cloud
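- One way to turn an ASN into concrete network blocks for a rate limit is to query a route registry for its announced prefixes (a sketch using the public RADb whois server, IPv4 `route:` objects only, and not necessarily how ours were generated):

```console
$ whois -h whois.radb.net -- '-i origin AS16509' | grep -E '^route:' | awk '{print $2}' | sort -u > /tmp/amazon-02-networks.txt
```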
- The load on the server *immediately* dropped

<!-- vim: set sw=2 ts=2: -->