diff --git a/content/posts/2021-12.md b/content/posts/2021-12.md
new file mode 100644
index 000000000..2b771ca68
--- /dev/null
+++ b/content/posts/2021-12.md
@@ -0,0 +1,90 @@
+---
+title: "December, 2021"
+date: 2021-12-01T16:07:07+02:00
+author: "Alan Orth"
+categories: ["Notes"]
+---
+
+## 2021-12-01
+
+- Atmire merged some changes I had submitted to the COUNTER-Robots project
+- I updated our local spider user agents and then re-ran the list with my `check-spider-hits.sh` script on CGSpace:
+
+```console
+$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
+Purging 1989 hits from The Knowledge AI in statistics
+Purging 1235 hits from MaCoCu in statistics
+Purging 455 hits from WhatsApp in statistics
+
+Total number of bot hits purged: 3679
+```
+
+
+
+## 2021-12-02
+
+- Francesca from Alliance asked me for help with approving a submission that gets stuck
+  - I looked at the PostgreSQL activity and the locks are back up like they were earlier this week
+
+```console
+$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
+      1
+      1 ------------------
+      1 (1437 rows)
+      1 application_name
+      9 psql
+   1428 dspaceWeb
+```
+
+- Munin shows the same:
+
+![PostgreSQL locks week](/cgspace-notes/2021/12/postgres_locks_ALL-week.png)
+
+- Last month I enabled the `log_lock_waits` in PostgreSQL so I checked the log and was surprised to find only a few since I restarted PostgreSQL three days ago:
+
+```console
+# grep -E '^2021-(11-29|11-30|12-01|12-02)' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
+15
+```
+
+- I think you could analyze the locks for the `dspaceWeb` user (XMLUI) and find out what queries were locking... but it's so much information and I don't know where to start
+  - For now I just restarted PostgreSQL...
+  - Francesca was able to do her submission immediately...
+- On a related note, I want to enable the `pg_stat_statements` feature to see which queries get run the most, so I created the extension on the CGSpace database
+- I was doing some research on PostgreSQL locks and found some interesting things to consider
+  - The default `lock_timeout` is 0, aka disabled
+  - The default `statement_timeout` is 0, aka disabled
+  - It seems to be recommended to start by setting `statement_timeout` first, with a rule of thumb of [ten times longer than your longest query](https://github.com/jberkus/annotated.conf/blob/master/postgresql.10.simple.conf#L211)
+- Mark Wood mentioned the `checker` cron job that apparently runs in one transaction and might be an issue
+  - I definitely saw it holding a bunch of locks for ~30 minutes during the first part of its execution, then it dropped them and did some other less-intensive things without locks
+- Bizuwork was still not receiving emails even after we fixed the SMTP access on CGSpace
+  - After some troubleshooting it turns out that the emails from CGSpace were going to her Junk folder!
+
+## 2021-12-03
+
+- I see GARDIAN is now using a "GARDIAN" user agent finally
+  - I will add them to our local spider agent override in DSpace so that the hits don't get counted in Solr
+
+## 2021-12-05
+
+- Proofed fifty records Abenet sent me from Africa Rice Center ("AfricaRice 1st batch Import")
+  - Fixed forty-six incorrect collections
+  - Cleaned up and normalized affiliations
+  - Cleaned up dates (extra `*` character in all?)
+  - Cleaned up citation format
+  - Fixed some encoding issues in abstracts
+  - Removed empty columns
+  - Removed one duplicate: Enhancing Rice Productivity and Soil Nitrogen Using Dual-Purpose Cowpea-NERICA® Rice Sequence in Degraded Savanna
+  - Added volume and issue metadata by extracting it from the citations
+  - All PDFs hosted on davidpublishing.com are dead...
+  - All DOIs linking to African Journal of Agricultural Research are dead...
+  - Fixed a handful of items marked as "Open Access" that are actually closed
+  - Added many missing ISSNs
+  - Added many missing countries/regions
+  - Fixed invalid AGROVOC terms and added some more based on article subjects
+- I also made some minor changes to the [CSV Metadata Quality Checker](https://github.com/ilri/csv-metadata-quality)
+  - Added the ability to check if the item's title exists in the citation
+  - Updated to only run the mojibake check if we're not running in unsafe mode (so we don't print the same warning during both the check and fix steps)
+- I ran the re-harvesting on AReS
+
+
diff --git a/docs/2021-12/index.html b/docs/2021-12/index.html
index 56ac69fe8..7338e67d9 100644
--- a/docs/2021-12/index.html
+++ b/docs/2021-12/index.html
@@ -50,7 +50,7 @@ Total number of bot hits purged: 3679
   "@type": "BlogPosting",
   "headline": "December, 2021",
   "url": "https://alanorth.github.io/cgspace-notes/2021-12/",
-  "wordCount": "404",
+  "wordCount": "597",
   "datePublished": "2021-12-01T16:07:07+02:00",
   "dateModified": "2021-12-01T16:07:07+02:00",
   "author": {
@@ -191,10 +191,38 @@ Purging 455 hits from WhatsApp in statistics
+

2021-12-05

+