diff --git a/content/posts/2023-01.md b/content/posts/2023-01.md
index 198ab6335..3caccfaab 100644
--- a/content/posts/2023-01.md
+++ b/content/posts/2023-01.md
@@ -285,4 +285,126 @@ Total number of bot hits purged: 4027
 - Export entire CGSpace to check Initiative mappings, and add nineteen...
 - Start a harvest on AReS
 
+## 2023-01-18
+
+- I'm looking at all the ORCID identifiers in the database, which seem to be way more than I realized:
+
+```console
+localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2023-01-18-orcid-identifiers.txt;
+COPY 4231
+$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2023-01-18-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-01-18-orcids.txt
+$ wc -l /tmp/2023-01-18-orcids.txt
+4518 /tmp/2023-01-18-orcids.txt
+```
+
+- Then I resolved them from ORCID and updated them in the database:
+
+```console
+$ ./ilri/resolve-orcids.py -i /tmp/2023-01-18-orcids.txt -o /tmp/2023-01-18-orcids-names.txt -d
+$ ./ilri/update-orcids.py -i /tmp/2023-01-18-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
+```
+
+- Then I updated the controlled vocabulary
+- CGSpace became unresponsive in the afternoon, with a high number of locks, but surprisingly low CPU usage:
+
+```console
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
+     83 dspaceApi
+   7829 dspaceWeb
+```
+
+- In the DSpace logs I see some weird SQL messages, so I decided to restart PostgreSQL and Tomcat 7...
+  - I hope this doesn't cause some issue with in-progress workflows...
+- I see another user on Cox in the US (98.186.216.144) crawling and scraping XMLUI with Python
+  - I will add python to the list of bad bot user agents in nginx (see the sketch at the end of this section)
+- While looking into the locks I see some potential Java heap issues
+  - Indeed, I see two out of memory errors in Tomcat's journal:
+
+```console
+tomcat7[310996]: java.lang.OutOfMemoryError: Java heap space
+tomcat7[310996]: Jan 18, 2023 1:37:03 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
+```
+
+- Which explains why the locks went down to normal numbers as I was watching... (because Java crashed)
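+
+- For future reference, the nginx side of the bad bot blocking is roughly a case-insensitive map on `$http_user_agent` (a minimal sketch only, not our exact config; the `$bad_bot` variable name and the patterns are illustrative):
+
+```nginx
+# Sketch: flag requests whose user agent matches known bad bots
+# (the map lives in the http{} context; patterns here are examples)
+map $http_user_agent $bad_bot {
+    default      0;
+    ~*python     1;   # python-requests, python-urllib, etc.
+    ~*scrapy     1;
+    ~*nikto      1;
+}
+
+server {
+    # ...
+    if ($bad_bot) {
+        return 403;
+    }
+}
+```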
+
+## 2023-01-19
+
+- Update a bunch of ORCID identifiers, Initiative mappings, and regions on CGSpace
+- So it seems an IFPRI user got caught up in the blocking I did yesterday
+  - Their ISP is Comcast...
+  - I need to re-work the ASN blocking on nginx, but for now I will just get the ASNs again minus Comcast:
+
+```console
+$ wget https://asn.ipinfo.app/api/text/list/AS714 \
+    https://asn.ipinfo.app/api/text/list/AS16276 \
+    https://asn.ipinfo.app/api/text/list/AS15169 \
+    https://asn.ipinfo.app/api/text/list/AS23576 \
+    https://asn.ipinfo.app/api/text/list/AS24940 \
+    https://asn.ipinfo.app/api/text/list/AS13238 \
+    https://asn.ipinfo.app/api/text/list/AS32934 \
+    https://asn.ipinfo.app/api/text/list/AS14061 \
+    https://asn.ipinfo.app/api/text/list/AS12876 \
+    https://asn.ipinfo.app/api/text/list/AS55286 \
+    https://asn.ipinfo.app/api/text/list/AS203020 \
+    https://asn.ipinfo.app/api/text/list/AS204287 \
+    https://asn.ipinfo.app/api/text/list/AS50245 \
+    https://asn.ipinfo.app/api/text/list/AS6939 \
+    https://asn.ipinfo.app/api/text/list/AS16509 \
+    https://asn.ipinfo.app/api/text/list/AS14618
+$ cat AS* | sort | uniq | wc -l
+18179
+$ cat /tmp/AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
+$ wc -l /tmp/networks.txt
+5872 /tmp/networks.txt
+```
+
+## 2023-01-20
+
+- A lot of work on CGSpace metadata (ORCID identifiers, regions, and Initiatives)
+- I noticed that MEL and CGSpace are using slightly different vocabularies for SDGs, so I sent an email to Salem and Sara
+
+## 2023-01-21
+
+- Export the Initiatives community again to perform collection mappings and country/region fixes
+
+## 2023-01-22
+
+- There has been a high load on the server for a few days, currently 8.0... and I've been seeing some PostgreSQL locks stuck all day:
+
+```console
+$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
+     11 dspaceApi
+     28 dspaceCli
+    981 dspaceWeb
+```
+
+- Looking at the locks I see they are from this morning at 5:00 AM, which is the `dspace checker-email` script
+  - Last week I disabled the one that runs at 4:00 AM, but I guess I will experiment with disabling this too...
+  - Then I killed the PIDs of the locks
+
+```console
+$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';" | less -S
+...
+$ ps auxw | grep 18986
+postgres 1429108  1.9  1.5 3359712 508148 ?  Ss   05:00  13:40 postgres: 12/main: dspace dspace 127.0.0.1(18986) SELECT
+```
+
+- Also, I checked the age of the locks and killed anything over 1 day:
+
+```console
+$ psql < locks-age.sql | grep days | less -S
+```
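+
+- For the record, the gist of that check plus the kill step is just the query age from `pg_stat_activity` and `pg_terminate_backend()` on anything too old (a rough sketch, not the literal contents of `locks-age.sql`; the `dspaceCli` filter and the one-day cutoff mirror what I did above):
+
+```console
+$ psql -c "SELECT pid, state, now() - query_start AS age, left(query, 40) AS query FROM pg_stat_activity WHERE application_name = 'dspaceCli' ORDER BY age DESC;"
+$ psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name = 'dspaceCli' AND now() - query_start > interval '1 day';"
+```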
+
+- Then I ran all updates on the server and restarted it...
+- Salem responded to my question about the SDG mismatch between MEL and CGSpace
+  - We agreed to use a version based on the text of [this site](http://metadata.un.org/sdg/?lang=en)
+- Salem is having issues with some REST API submission / updates
+  - I updated DSpace Test with a recent CGSpace backup and created a super admin user for him to test (see the note at the end of this section)
+- Clean and normalize fifty-eight IFPRI records for batch import to CGSpace
+  - I did a duplicate check and found six, so that's good!
+- I exported the entire CGSpace to check for missing Initiative mappings
+  - Then I exported the Initiatives community to check for missing regions
+  - Then I ran the script to check for missing ORCID identifiers
+  - Then *finally*, I started a harvest on AReS
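+
+- As a note on the super admin user above: I used DSpace's built-in helper, which prompts interactively for the email, name, and password (a sketch; `[dspace]` stands for the installation directory on DSpace Test):
+
+```console
+$ [dspace]/bin/dspace create-administrator
+```
+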