## 2023-01-18
- I'm looking at all the ORCID identifiers in the database, which seem to be way more than I realized:
```console
localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2023-01-18-orcid-identifiers.txt;
COPY 4231
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2023-01-18-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-01-18-orcids.txt
$ wc -l /tmp/2023-01-18-orcids.txt
4518 /tmp/2023-01-18-orcids.txt
```
- Then I resolved them from ORCID and updated them in the database:
```console
$ ./ilri/resolve-orcids.py -i /tmp/2023-01-18-orcids.txt -o /tmp/2023-01-18-orcids-names.txt -d
$ ./ilri/update-orcids.py -i /tmp/2023-01-18-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
```
- Then I updated the controlled vocabulary
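- As a sanity check, the number of entries in the vocabulary file should now be close to the number of ORCID identifiers I resolved (a rough count, since non-entry elements can match too):
```console
$ grep -c '<node' ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml
```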
- CGSpace became inactive in the afternoon, with a high number of locks, but surprisingly low CPU usage:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
83 dspaceApi
7829 dspaceWeb
```
- In the DSpace logs I see some weird SQL messages, so I decided to restart PostgreSQL and Tomcat 7...
  - I hope this doesn't cause some issue with in-progress workflows...
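  - For the record, the restart itself (assuming the systemd units are named `postgresql` and `tomcat7` on this host):
```console
$ sudo systemctl restart postgresql tomcat7
```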
- I see another user on Cox in the US (98.186.216.144) crawling and scraping XMLUI with Python
  - I will add `python` to the list of bad bot user agents in nginx
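  - Once that nginx change is deployed, a quick way to verify it (assuming blocked agents get an HTTP 403):
```console
$ curl -sI -A 'python-requests/2.28.1' https://cgspace.cgiar.org/ | head -n 1
```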
- While looking into the locks I see some potential Java heap issues
- Indeed, I see two out of memory errors in Tomcat's journal:
```console
tomcat7[310996]: java.lang.OutOfMemoryError: Java heap space
tomcat7[310996]: Jan 18, 2023 1:37:03 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
```
- Which explains why the locks went down to normal numbers as I was watching... (because Java crashed)
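- A quick way to count these across the whole journal (assuming the `tomcat7` systemd unit):
```console
$ sudo journalctl -u tomcat7 | grep -c 'java.lang.OutOfMemoryError'
```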
## 2023-01-19
- Update a bunch of ORCID identifiers, Initiative mappings, and regions on CGSpace
- So it seems an IFPRI user got caught up in the blocking I did yesterday
  - Their ISP is Comcast...
- I need to re-work the ASN blocking on nginx, but for now I will just get the ASNs again minus Comcast:
```console
$ wget https://asn.ipinfo.app/api/text/list/AS714 \
https://asn.ipinfo.app/api/text/list/AS16276 \
https://asn.ipinfo.app/api/text/list/AS15169 \
https://asn.ipinfo.app/api/text/list/AS23576 \
https://asn.ipinfo.app/api/text/list/AS24940 \
https://asn.ipinfo.app/api/text/list/AS13238 \
https://asn.ipinfo.app/api/text/list/AS32934 \
https://asn.ipinfo.app/api/text/list/AS14061 \
https://asn.ipinfo.app/api/text/list/AS12876 \
https://asn.ipinfo.app/api/text/list/AS55286 \
https://asn.ipinfo.app/api/text/list/AS203020 \
https://asn.ipinfo.app/api/text/list/AS204287 \
https://asn.ipinfo.app/api/text/list/AS50245 \
https://asn.ipinfo.app/api/text/list/AS6939 \
https://asn.ipinfo.app/api/text/list/AS16509 \
https://asn.ipinfo.app/api/text/list/AS14618
$ cat AS* | sort | uniq | wc -l
18179
$ cat /tmp/AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
$ wc -l /tmp/networks.txt
5872 /tmp/networks.txt
```
## 2023-01-20
- A lot of work on CGSpace metadata (ORCID identifiers, regions, and Initiatives)
- I noticed that MEL and CGSpace are using slightly different vocabularies for SDGs so I sent an email to Salem and Sara
## 2023-01-21
- Export the Initiatives community again to perform collection mappings and country/region fixes
## 2023-01-22
- There has been a high load on the server for a few days, currently 8.0... and I've been seeing some PostgreSQL locks stuck all day:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
11 dspaceApi
28 dspaceCli
981 dspaceWeb
```
- Looking at the locks I see they are from this morning at 5:00 AM, which is when the `dspace checker-email` script runs
  - Last week I disabled the one that runs at 4:00 AM, but I guess I will experiment with disabling this one too...
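  - Disabling it should just be a matter of commenting out the relevant entry in the dspace user's crontab (which entry and schedule it is exactly is an assumption):
```console
$ sudo crontab -u dspace -e
# ... then comment out the line that invokes `dspace checker-email`
```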
- Then I killed the PIDs of the locks:
```console
$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';" | less -S
...
$ ps auxw | grep 18986
postgres 1429108 1.9 1.5 3359712 508148 ? Ss 05:00 13:40 postgres: 12/main: dspace dspace 127.0.0.1(18986) SELECT
```
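- For reference, rather than `kill`, the cleaner way is to terminate the backends from inside PostgreSQL, for example everything from `dspaceCli`:
```console
$ psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name = 'dspaceCli';"
```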
- Also, I checked the age of the locks and killed anything over 1 day:
```console
$ psql < locks-age.sql | grep days | less -S
```
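- I don't have the contents of locks-age.sql in these notes, but the gist of it is something like this (a sketch, not the exact file):
```console
$ psql -c "SELECT psa.pid, now() - psa.query_start AS age, psa.state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid ORDER BY age DESC NULLS LAST;" | less -S
```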
- Then I ran all updates on the server and restarted it...
- Salem responded to my question about the SDG mismatch between MEL and CGSpace
  - We agreed to use a version based on the text of [this site](http://metadata.un.org/sdg/?lang=en)
- Salem is having issues with some REST API submissions and updates
  - I updated DSpace Test with a recent CGSpace backup and created a super admin user for him to test
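  - DSpace has a built-in command for creating the administrator account, run from the installation directory (the path here is an assumption):
```console
$ ~/dspace/bin/dspace create-administrator
```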
- Clean and normalize fifty-eight IFPRI records for batch import to CGSpace
  - I did a duplicate check and found six, so it's good I checked!
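  - For reference, a minimal way to check for exact duplicate titles within the batch CSV itself, using csvkit (the file name and column name are assumptions):
```console
$ csvcut -c 'dc.title[en_US]' /tmp/2023-01-22-ifpri.csv | sort | uniq -d
```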
- I exported the entire CGSpace to check for missing Initiative mappings
- Then I exported the Initiatives community to check for missing regions
- Then I ran the script to check for missing ORCID identifiers
- Then *finally*, I started a harvest on AReS
<!-- vim: set sw=2 ts=2: -->