CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

December, 2021

2021-12-01

  • Atmire merged some changes I had submitted to the COUNTER-Robots project
  • I updated our local spider user agents and then re-ran my check-spider-hits.sh script with the new list on CGSpace:
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics

Total number of bot hits purged: 3679
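  • For reference, a list like that can be regenerated from the upstream COUNTER-Robots patterns, roughly like this (a sketch that assumes the upstream JSON is an array of objects with a pattern field):
$ wget -q https://raw.githubusercontent.com/atmire/COUNTER-Robots/master/COUNTER_Robots_list.json
$ jq -r '.[].pattern' COUNTER_Robots_list.json > /tmp/agents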

2021-12-02

  • Francesca from Alliance asked me for help with approving a submission that was stuck
    • I looked at the PostgreSQL activity and saw that the locks were back up like they were earlier this week
$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
      1 
      1 ------------------
      1 (1437 rows)
      1  application_name 
      9  psql
   1428  dspaceWeb
  • Munin shows the same:

PostgreSQL locks week

  • Last month I enabled log_lock_waits in PostgreSQL, so I checked the log and was surprised to find only a few lock waits since I restarted PostgreSQL three days ago:
# grep -E '^2021-(11-29|11-30|12-01|12-02)' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
15
  • I think you could analyze the locks held by the dspaceWeb user (XMLUI) and find out which queries were doing the locking (see the sketch below)… but it’s so much information and I don’t know where to start
    • For now I just restarted PostgreSQL…
    • Francesca was able to do her submission immediately…
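    • A starting point might be to group the dspaceWeb locks by the query of the session holding them, something like this (just a sketch using pg_locks and pg_stat_activity):
$ psql -c "SELECT psa.query, COUNT(*) FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE psa.application_name = 'dspaceWeb' GROUP BY psa.query ORDER BY COUNT(*) DESC LIMIT 10;"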
  • On a related note, I want to enable the pg_stat_statements extension to see which queries get run the most, so I created the extension on the CGSpace database
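    • A minimal sketch of creating the extension and later checking the most-called queries (the database name is a placeholder, and nothing gets recorded until pg_stat_statements is also added to shared_preload_libraries and PostgreSQL is restarted):
$ psql -d cgspace -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"
$ psql -d cgspace -c "SELECT calls, round(total_time::numeric) AS total_ms, query FROM pg_stat_statements ORDER BY calls DESC LIMIT 10;"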
  • I was doing some research on PostgreSQL locks and found some interesting things to consider
    • The default lock_timeout is 0, aka disabled
    • The default statement_timeout is 0, aka disabled
    • It seems to be recommended to start by setting statement_timeout first, with a rule of thumb of ten times your longest expected query
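    • For example, both settings could be changed globally and reloaded without a restart (the values here are purely illustrative, not what I would actually use):
$ psql -c "ALTER SYSTEM SET statement_timeout = '600s';"
$ psql -c "ALTER SYSTEM SET lock_timeout = '15s';"
$ psql -c "SELECT pg_reload_conf();"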
  • Mark Wood mentioned that the checker cron job apparently runs in one transaction, which might be an issue
    • I definitely saw it holding a bunch of locks for ~30 minutes during the first part of its execution, then it dropped them and did some other less-intensive things without locks
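    • If it does turn out to be a problem, the checksum checker’s runtime can at least be bounded when it runs from cron, something like this (the two-hour duration is just an example, and this assumes the DSpace bin directory is on the PATH):
$ dspace checker -d 2h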
  • Bizuwork was still not receiving emails, even after we fixed the SMTP access on CGSpace
    • After some troubleshooting it turned out that the emails from CGSpace were going to her Junk folder!
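    • DSpace’s built-in test command is handy for confirming the server-side mail configuration before suspecting the mail client:
$ dspace test-email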

2021-12-03

  • I see that GARDIAN is finally using a “GARDIAN” user agent
    • I will add them to our local spider agent override in DSpace so that the hits don’t get counted in Solr
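    • Any existing hits could then be purged the same way as above, for example (the agent file name here is just an example):
$ echo 'GARDIAN' > /tmp/gardian-agent
$ ./ilri/check-spider-hits.sh -f /tmp/gardian-agent -p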

2021-12-05

  • Proofed fifty records that Abenet sent me from Africa Rice Center (“AfricaRice 1st batch Import”)
    • Fixed forty-six incorrect collections
    • Cleaned up and normalized affiliations
    • Cleaned up dates (extra * character in all?)
    • Cleaned up citation format
    • Fixed some encoding issues in abstracts
    • Removed empty columns
    • Removed one duplicate: Enhancing Rice Productivity and Soil Nitrogen Using Dual-Purpose Cowpea-NERICA® Rice Sequence in Degraded Savanna
    • Added volume and issue metadata by extracting it from the citations
    • All PDFs hosted on davidpublishing.com are dead…
    • All DOIs linking to African Journal of Agricultural Research are dead… (see the link-check sketch below)
    • Fixed a handful of items marked as “Open Access” that are actually closed
    • Added many missing ISSNs
    • Added many missing countries/regions
    • Fixed invalid AGROVOC terms and added some more based on article subjects
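    • A quick way to verify dead links like the PDF and DOI URLs above is to loop over the URL column with curl (the file and column names here are only placeholders):
$ csvcut -c 'cg.identifier.url' /tmp/africarice.csv | sed 1d | while read -r url; do echo "$(curl -sL -o /dev/null -w '%{http_code}' "$url") $url"; done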
  • I also made some minor changes to the CSV Metadata Quality Checker
    • Added the ability to check if the item’s title exists in the citation
    • Updated to only run the mojibake check if we’re not running in unsafe mode (so we don’t print the same warning during both the check and fix steps)
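    • For the record, a run against a batch file like this one looks something like the following (file names are placeholders; -u enables the unsafe fixes):
$ csv-metadata-quality -i /tmp/africarice.csv -o /tmp/africarice-cleaned.csv -u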
  • I ran the re-harvesting on AReS