---
title: "December, 2021"
date: 2021-12-01T16:07:07+02:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2021-12-01

- Atmire merged some changes I had submitted to the COUNTER-Robots project
- I updated our local spider user agents and then re-ran the list with my `check-spider-hits.sh` script on CGSpace:

```console
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics

Total number of bot hits purged: 3679
```

## 2021-12-02

- Francesca from Alliance asked me for help with approving a submission that gets stuck
  - I looked at the PostgreSQL activity and the locks are back up like they were earlier this week

```console
$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
      1
      1 ------------------
      1 (1437 rows)
      1  application_name
      9  psql
   1428  dspaceWeb
```

- Munin shows the same:

![PostgreSQL locks week](/cgspace-notes/2021/12/postgres_locks_ALL-week.png)

- Last month I enabled the `log_lock_waits` in PostgreSQL so I checked the log and was surprised to find only a few since I restarted PostgreSQL three days ago:

```console
# grep -E '^2021-(11-29|11-30|12-01|12-02)' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
15
```

- I think you could analyze the locks for the `dspaceWeb` user (XMLUI) and find out what queries were locking... but it's so much information and I don't know where to start
  - For now I just restarted PostgreSQL...
  - Francesca was able to do her submission immediately...
- On a related note, I want to enable the `pg_stat_statements` feature to see which queries get run the most, so I created the extension on the CGSpace database
- I was doing some research on PostgreSQL locks and found some interesting things to consider
  - The default `lock_timeout` is 0, aka disabled
  - The default `statement_timeout` is 0, aka disabled
  - It seems to be recommended to start by setting `statement_timeout` first, rule of thumb [ten times longer than your longest query](https://github.com/jberkus/annotated.conf/blob/master/postgresql.10.simple.conf#L211)
- Mark Wood mentioned the `checker` cron job that apparently runs in one transaction and might be an issue
  - I definitely saw it holding a bunch of locks for ~30 minutes during the first part of its execution, then it dropped them and did some other less-intensive things without locks
- Bizuwork was still not receiving emails even after we fixed the SMTP access on CGSpace
  - After some troubleshooting it turns out that the emails from CGSpace were going into her Junk folder!

## 2021-12-03

- I see GARDIAN is now using a "GARDIAN" user agent finally
  - I will add them to our local spider agent override in DSpace so that the hits don't get counted in Solr

## 2021-12-05

- Proofed fifty records Abenet sent me from Africa Rice Center ("AfricaRice 1st batch Import")
  - Fixed forty-six incorrect collections
  - Cleaned up and normalized affiliations
  - Cleaned up dates (extra `*` character in all?)
  - Cleaned up citation format
  - Fixed some encoding issues in abstracts
  - Removed empty columns
  - Removed one duplicate: Enhancing Rice Productivity and Soil Nitrogen Using Dual-Purpose Cowpea-NERICA® Rice Sequence in Degraded Savanna
  - Added volume and issue metadata by extracting it from the citations
  - All PDFs hosted on davidpublishing.com are dead...
  - All DOIs linking to African Journal of Agricultural Research are dead...
  - Fixed a handful of items marked as "Open Access" that are actually closed
  - Added many missing ISSNs
  - Added many missing countries/regions
  - Fixed invalid AGROVOC terms and added some more based on article subjects
- I also made some minor changes to the [CSV Metadata Quality Checker](https://github.com/ilri/csv-metadata-quality)
  - Added the ability to check if the item's title exists in the citation
  - Updated to only run the mojibake check if we're not running in unsafe mode (so we don't print the same warning during both the check and fix steps)
- I ran the re-harvesting on AReS

## 2021-12-06

- Some minor work on the `check-duplicates.py` script I wrote last month
  - I found some corner cases where there were items that matched in the database, but they were `in_archive=f` and/or `withdrawn=t`, so I now check that before trying to resolve the handles of potential duplicates
- More work on the Africa Rice Center 1st batch import
  - I merged the metadata for three duplicates in Africa Rice's items and mapped them on CGSpace
  - I did a bit more work to add missing AGROVOC subjects, countries, regions, extents, etc and then uploaded the forty-six items to CGSpace
- I started looking at the seventy CAS records that Abenet has been working on for the past few months
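
For reference, the `in_archive` / `withdrawn` check added to `check-duplicates.py` could be sketched roughly like this. This is only an illustration and not the script's actual code: the `live_candidates` function name, the dictionary row shape, and the sample data are all assumptions.

```python
from typing import Iterable


def live_candidates(rows: Iterable[dict]) -> list[dict]:
    """Keep only matches that are archived and not withdrawn.

    Assumes each row is a dict with boolean "in_archive" and "withdrawn"
    keys (a hypothetical shape, not the real script's data structure).
    """
    return [
        row
        for row in rows
        if row.get("in_archive") is True and row.get("withdrawn") is not True
    ]


# Hypothetical matches as they might come back from the database
matches = [
    {"uuid": "a", "in_archive": True, "withdrawn": False},
    {"uuid": "b", "in_archive": False, "withdrawn": False},
    {"uuid": "c", "in_archive": True, "withdrawn": True},
]

# Only these rows would go on to have their handles resolved
for row in live_candidates(matches):
    print(row["uuid"])
```

Filtering before the handle lookup avoids trying to resolve handles for items that are not live in the repository.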