diff --git a/content/posts/2023-01.md b/content/posts/2023-01.md index 198ab6335..3caccfaab 100644 --- a/content/posts/2023-01.md +++ b/content/posts/2023-01.md @@ -285,4 +285,126 @@ Total number of bot hits purged: 4027 - Export entire CGSpace to check Initiative mappings, and add nineteen... - Start a harvest on AReS +## 2023-01-18 + +- I'm looking at all the ORCID identifiers in the database, which seem to be way more than I realized: + +```console +localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2023-01-18-orcid-identifiers.txt; +COPY 4231 +$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2023-01-18-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-01-18-orcids.txt +$ wc -l /tmp/2023-01-18-orcids.txt +4518 /tmp/2023-01-18-orcids.txt +``` + +- Then I resolved them from ORCID and updated them in the database: + +```console +$ ./ilri/resolve-orcids.py -i /tmp/2023-01-18-orcids.txt -o /tmp/2023-01-18-orcids-names.txt -d +$ ./ilri/update-orcids.py -i /tmp/2023-01-18-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247 +``` + +- Then I updated the controlled vocabulary +- CGSpace became inactive in the afternoon, with a high number of locks, but surprisingly low CPU usage: + +```console +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c + 83 dspaceApi + 7829 dspaceWeb +``` + +- In the DSpace logs I see some weird SQL messages, so I decided to restart PostgreSQL and Tomcat 7... + - I hope this doesn't cause some issue with in-progress workflows... 
+- I see another user on Cox in the US (98.186.216.144) crawling and scraping XMLUI with Python + - I will add python to the list of bad bot user agents in nginx +- While looking into the locks I see some potential Java heap issues + - Indeed, I see two out of memory errors in Tomcat's journal: + +```console +tomcat7[310996]: java.lang.OutOfMemoryError: Java heap space +tomcat7[310996]: Jan 18, 2023 1:37:03 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon +``` + +- Which explains why the locks went down to normal numbers as I was watching... (because Java crashed) + +## 2023-01-19 + +- Update a bunch of ORCID identifiers, Initiative mappings, and regions on CGSpace +- So it seems an IFPRI user got caught up in the blocking I did yesterday + - Their ISP is Comcast... + - I need to re-work the ASN blocking on nginx, but for now I will just get the ASNs again minus Comcast: + +```console +$ wget https://asn.ipinfo.app/api/text/list/AS714 \ + https://asn.ipinfo.app/api/text/list/AS16276 \ + https://asn.ipinfo.app/api/text/list/AS15169 \ + https://asn.ipinfo.app/api/text/list/AS23576 \ + https://asn.ipinfo.app/api/text/list/AS24940 \ + https://asn.ipinfo.app/api/text/list/AS13238 \ + https://asn.ipinfo.app/api/text/list/AS32934 \ + https://asn.ipinfo.app/api/text/list/AS14061 \ + https://asn.ipinfo.app/api/text/list/AS12876 \ + https://asn.ipinfo.app/api/text/list/AS55286 \ + https://asn.ipinfo.app/api/text/list/AS203020 \ + https://asn.ipinfo.app/api/text/list/AS204287 \ + https://asn.ipinfo.app/api/text/list/AS50245 \ + https://asn.ipinfo.app/api/text/list/AS6939 \ + https://asn.ipinfo.app/api/text/list/AS16509 \ + https://asn.ipinfo.app/api/text/list/AS14618 +$ cat AS* | sort | uniq | wc -l +18179 +$ cat /tmp/AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt +$ wc -l /tmp/networks.txt +5872 /tmp/networks.txt +``` + +## 2023-01-20 + +- A lot of work on CGSpace metadata (ORCID identifiers, regions, and Initiatives) +- I noticed that MEL and CGSpace are using slightly 
different vocabularies for SDGs so I sent an email to Salem and Sara + +## 2023-01-21 + +- Export the Initiatives community again to perform collection mappings and country/region fixes + +## 2023-01-22 + +- There has been a high load on the server for a few days, currently 8.0... and I've been seeing some PostgreSQL locks stuck all day: + +```console +$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c + 11 dspaceApi + 28 dspaceCli + 981 dspaceWeb +``` + +- Looking at the locks I see they are from this morning at 5:00 AM, which is the `dspace checker-email` script + - Last week I disabled the one that runs at 4:00 AM, but I guess I will experiment with disabling this too... + - Then I killed the PIDs of the locks + +```console +$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';" | less -S +... +$ ps auxw | grep 18986 +postgres 1429108 1.9 1.5 3359712 508148 ? Ss 05:00 13:40 postgres: 12/main: dspace dspace 127.0.0.1(18986) SELECT +``` + +- Also, I checked the age of the locks and killed anything over 1 day: + +```console +$ psql < locks-age.sql | grep days | less -S +``` + +- Then I ran all updates on the server and restarted it... +- Salem responded to my question about the SDG mismatch between MEL and CGSpace + - We agreed to use a version based on the text of [this site](http://metadata.un.org/sdg/?lang=en) +- Salem is having issues with some REST API submission / updates + - I updated DSpace Test with a recent CGSpace backup and created a super admin user for him to test +- Clean and normalize fifty-eight IFPRI records for batch import to CGSpace + - I did a duplicate check and found six, so that's good!
+- I exported the entire CGSpace to check for missing Initiative mappings + - Then I exported the Initiatives community to check for missing regions + - Then I ran the script to check for missing ORCID identifiers + - Then *finally*, I started a harvest on AReS + diff --git a/docs/2023-01/index.html b/docs/2023-01/index.html index 4a3345e97..c17b3736e 100644 --- a/docs/2023-01/index.html +++ b/docs/2023-01/index.html @@ -19,7 +19,7 @@ I see we have some new ones that aren’t in our list if I combine with this - + @@ -44,9 +44,9 @@ I see we have some new ones that aren’t in our list if I combine with this "@type": "BlogPosting", "headline": "January, 2023", "url": "https://alanorth.github.io/cgspace-notes/2023-01/", - "wordCount": "2250", + "wordCount": "2969", "datePublished": "2023-01-01T08:44:36+03:00", - "dateModified": "2023-01-15T08:10:16+03:00", + "dateModified": "2023-01-17T22:38:55+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -452,6 +452,138 @@ I see we have some new ones that aren’t in our list if I combine with this
  • Export entire CGSpace to check Initiative mappings, and add nineteen…
  • Start a harvest on AReS
  • +

    2023-01-18

    + +
    localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(text_value) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=247) to /tmp/2023-01-18-orcid-identifiers.txt;
    +COPY 4231
    +$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-identifier.xml /tmp/2023-01-18-orcid-identifiers.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort -u > /tmp/2023-01-18-orcids.txt
    +$ wc -l /tmp/2023-01-18-orcids.txt
    +4518 /tmp/2023-01-18-orcids.txt
    +
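The `grep -oE` pattern above accepts any four groups of uppercase letters or digits, which is looser than the ORCID format (sixteen digits, where the last character may be `X`). A minimal Python sketch of a stricter extraction, assuming we also want to drop strings that fail the ISO 7064 MOD 11-2 check digit that ORCID uses. The names `ORCID_RE`, `orcid_checksum_ok` and `extract_orcids` are hypothetical and not part of the `ilri` scripts:

```python
import re

# Stricter than the grep pattern: digits only, final character may be X
ORCID_RE = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")


def orcid_checksum_ok(orcid: str) -> bool:
    """Check the last character against the ISO 7064 MOD 11-2 check digit."""
    digits = orcid.replace("-", "")
    total = 0
    for char in digits[:-1]:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[-1] == expected


def extract_orcids(text: str) -> list[str]:
    """Return the unique, checksum-valid ORCID identifiers in text, sorted."""
    return sorted({m for m in ORCID_RE.findall(text) if orcid_checksum_ok(m)})
```

Running `extract_orcids()` over the concatenated vocabulary and export files would likely give a slightly shorter list than `sort -u`, because malformed identifiers are filtered out before they are sent to the ORCID API.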
    +
    $ ./ilri/resolve-orcids.py -i /tmp/2023-01-18-orcids.txt -o /tmp/2023-01-18-orcids-names.txt -d
    +$ ./ilri/update-orcids.py -i /tmp/2023-01-18-orcids-names.txt -db dspace -u dspace -p 'fuuu' -m 247
    +
    +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +     83 dspaceApi
    +   7829 dspaceWeb
    +
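The `grep -o -E ... | sort | uniq -c` pipeline above tallies locks by the connection pool's application name. A rough Python equivalent, assuming the raw `psql` output has already been captured as a string; `count_lock_owners` is a hypothetical helper name:

```python
import re
from collections import Counter

# Same alternation as the grep pattern in the console session
APP_RE = re.compile(r"dspaceWeb|dspaceApi|dspaceCli")


def count_lock_owners(psql_output: str) -> Counter:
    """Count every occurrence of each DSpace application name, like grep -o | sort | uniq -c."""
    return Counter(APP_RE.findall(psql_output))
```

This only reproduces the counting step. It says nothing about why the locks are held, so the DSpace and Tomcat logs are still needed for that.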
    +
    tomcat7[310996]: java.lang.OutOfMemoryError: Java heap space
    +tomcat7[310996]: Jan 18, 2023 1:37:03 PM org.apache.tomcat.jdbc.pool.ConnectionPool abandon
    +
    +

    2023-01-19

    + +
    $ wget https://asn.ipinfo.app/api/text/list/AS714 \
    +     https://asn.ipinfo.app/api/text/list/AS16276 \
    +     https://asn.ipinfo.app/api/text/list/AS15169 \
    +     https://asn.ipinfo.app/api/text/list/AS23576 \
    +     https://asn.ipinfo.app/api/text/list/AS24940 \
    +     https://asn.ipinfo.app/api/text/list/AS13238 \
    +     https://asn.ipinfo.app/api/text/list/AS32934 \
    +     https://asn.ipinfo.app/api/text/list/AS14061 \
    +     https://asn.ipinfo.app/api/text/list/AS12876 \
    +     https://asn.ipinfo.app/api/text/list/AS55286 \
    +     https://asn.ipinfo.app/api/text/list/AS203020 \
    +     https://asn.ipinfo.app/api/text/list/AS204287 \
    +     https://asn.ipinfo.app/api/text/list/AS50245 \
    +     https://asn.ipinfo.app/api/text/list/AS6939 \
    +     https://asn.ipinfo.app/api/text/list/AS16509 \
    +     https://asn.ipinfo.app/api/text/list/AS14618
    +$ cat AS* | sort | uniq | wc -l
    +18179
    +$ cat /tmp/AS* | ~/go/bin/mapcidr -a > /tmp/networks.txt
    +$ wc -l /tmp/networks.txt
    +5872 /tmp/networks.txt
    +
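The `mapcidr -a` step above aggregates the per-ASN prefix lists into fewer, larger networks. The same idea can be sketched with Python's standard `ipaddress` module, assuming the downloaded files contain one CIDR per line. The function name `aggregate_networks` is hypothetical, and the result is not guaranteed to match mapcidr's output line for line:

```python
import ipaddress


def aggregate_networks(lines):
    """Collapse overlapping and adjacent CIDRs, handling IPv4 and IPv6 separately."""
    v4, v6 = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # strict=False tolerates entries with host bits set
        net = ipaddress.ip_network(line, strict=False)
        (v4 if net.version == 4 else v6).append(net)
    # collapse_addresses() raises TypeError on mixed versions, so merge each family on its own
    merged = list(ipaddress.collapse_addresses(v4)) + list(ipaddress.collapse_addresses(v6))
    return [str(net) for net in merged]
```

For example, `192.0.2.0/25` and `192.0.2.128/25` collapse into `192.0.2.0/24`, which is the kind of merge that takes the list from 18179 unique prefixes down to 5872 networks.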

    2023-01-20

    + +

    2023-01-21

    + +

    2023-01-22

    + +
    $ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
    +     11 dspaceApi
    +     28 dspaceCli
    +    981 dspaceWeb
    +
    +
    $ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE application_name='dspaceCli';" | less -S
    +...
    +$ ps auxw | grep 18986
    +postgres 1429108  1.9  1.5 3359712 508148 ?      Ss   05:00  13:40 postgres: 12/main: dspace dspace 127.0.0.1(18986) SELECT
    +
    +
    $ psql < locks-age.sql | grep days | less -S
    +
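The contents of `locks-age.sql` are not shown in these notes, so the following is only a sketch of the "kill anything over 1 day" selection, assuming rows of `(pid, application_name, query_start)` fetched from a join of `pg_locks` and `pg_stat_activity`. The query text and the `stale_pids` helper are assumptions for illustration, not the actual script:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical query; the real locks-age.sql may look different
LOCK_ACTIVITY_SQL = """
SELECT DISTINCT psa.pid, psa.application_name, psa.query_start
FROM pg_locks pl
JOIN pg_stat_activity psa ON pl.pid = psa.pid
"""


def stale_pids(rows, now, max_age=timedelta(days=1)):
    """Return the sorted backend PIDs whose query started more than max_age ago."""
    return sorted(
        {
            pid
            for pid, _app, started in rows
            if started is not None and now - started > max_age
        }
    )
```

The resulting PIDs could then be passed to PostgreSQL's `pg_terminate_backend()` one at a time, after checking that none of them belongs to a legitimate long-running job.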
    diff --git a/docs/categories/index.html b/docs/categories/index.html index 6a1a542ca..2f4b36db6 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index 2a3f395fa..62eda5505 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index 595b4f0a9..2889a0c48 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index 92e901562..2530f3b24 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 156503523..c2e768746 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/5/index.html b/docs/categories/notes/page/5/index.html index 9997c18a7..adb5073f7 100644 --- a/docs/categories/notes/page/5/index.html +++ b/docs/categories/notes/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/6/index.html b/docs/categories/notes/page/6/index.html index d74a5a7a7..06a458b9c 100644 --- a/docs/categories/notes/page/6/index.html +++ b/docs/categories/notes/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/categories/notes/page/7/index.html b/docs/categories/notes/page/7/index.html index 105899e6f..e3f7c9dd0 100644 --- a/docs/categories/notes/page/7/index.html +++ b/docs/categories/notes/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/index.html b/docs/index.html index 7dc153ec1..b4c4bf718 100644 --- a/docs/index.html +++ b/docs/index.html @@ 
-10,7 +10,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index 8ba04298f..dcc1fcf8c 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/3/index.html b/docs/page/3/index.html index d34d17573..b395bf067 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index ea4cbf2f1..0ecf22de6 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index 1f96b8e7a..7b92d87f6 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/6/index.html b/docs/page/6/index.html index e2930acfa..80fdf3be5 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/7/index.html b/docs/page/7/index.html index 6b851f0f1..ba7d6054c 100644 --- a/docs/page/7/index.html +++ b/docs/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/8/index.html b/docs/page/8/index.html index 9e143c254..7b05233af 100644 --- a/docs/page/8/index.html +++ b/docs/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/page/9/index.html b/docs/page/9/index.html index ac67f4219..531d91492 100644 --- a/docs/page/9/index.html +++ b/docs/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index 82efe447e..b5f990677 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 9004ef403..8f5ec674d 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index bb24b4ad9..650261370 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -10,7 +10,7 @@ - + diff --git 
a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index 9357c90eb..6773b6c45 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index 862e0aede..4dd6df847 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index e5bace86f..bfbc5e87e 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/7/index.html b/docs/posts/page/7/index.html index ad3ba1dc0..abf34d624 100644 --- a/docs/posts/page/7/index.html +++ b/docs/posts/page/7/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/8/index.html b/docs/posts/page/8/index.html index 400311775..281fbdf7a 100644 --- a/docs/posts/page/8/index.html +++ b/docs/posts/page/8/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/posts/page/9/index.html b/docs/posts/page/9/index.html index f630c6633..3fbfd9126 100644 --- a/docs/posts/page/9/index.html +++ b/docs/posts/page/9/index.html @@ -10,7 +10,7 @@ - + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 4e9820d38..933fef13e 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -3,19 +3,19 @@ xmlns:xhtml="http://www.w3.org/1999/xhtml"> https://alanorth.github.io/cgspace-notes/categories/ - 2023-01-15T08:10:16+03:00 + 2023-01-17T22:38:55+03:00 https://alanorth.github.io/cgspace-notes/ - 2023-01-15T08:10:16+03:00 + 2023-01-17T22:38:55+03:00 https://alanorth.github.io/cgspace-notes/2023-01/ - 2023-01-15T08:10:16+03:00 + 2023-01-17T22:38:55+03:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2023-01-15T08:10:16+03:00 + 2023-01-17T22:38:55+03:00 https://alanorth.github.io/cgspace-notes/posts/ - 2023-01-15T08:10:16+03:00 + 2023-01-17T22:38:55+03:00 https://alanorth.github.io/cgspace-notes/2022-12/ 2023-01-01T10:12:13+02:00