title: December, 2021
date: 2021-12-01T16:07:07+02:00
author: Alan Orth
categories: Notes

2021-12-01

  • Atmire merged some changes I had submitted to the COUNTER-Robots project
  • I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics

Total number of bot hits purged: 3679
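  • For reference, a list like /tmp/agents can be generated from the COUNTER-Robots repository; this is only a sketch, and the exact file name and branch I use here are assumptions rather than something recorded in these notes:
# assumption: the patterns live in COUNTER_Robots_list.json on the master branch
$ curl -s https://raw.githubusercontent.com/atmire/COUNTER-Robots/master/COUNTER_Robots_list.json | jq -r '.[].pattern' > /tmp/agents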

2021-12-02

  • Francesca from Alliance asked me for help with approving a submission that was getting stuck
    • I looked at the PostgreSQL activity and the locks are back up like they were earlier this week
$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
      1 
      1 ------------------
      1 (1437 rows)
      1  application_name 
      9  psql
   1428  dspaceWeb
  • Munin shows the same:

(Figure: PostgreSQL locks over the past week, from Munin)

  • Last month I enabled the log_lock_waits setting in PostgreSQL, so I checked the log and was surprised to find only a few lock waits since I restarted PostgreSQL three days ago:
# grep -E '^2021-(11-29|11-30|12-01|12-02)' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
15
  • I think you could analyze the locks held by the dspaceWeb user (XMLUI) and find out which queries were holding them (a sketch of one approach is below)... but it's so much information and I don't know where to start
    • For now I just restarted PostgreSQL...
    • Francesca was able to do her submission immediately...
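  • Coming back to the idea of analyzing the locks: one possible starting point (a sketch only, not something I actually ran on CGSpace) would be to group the locks currently held by dspaceWeb connections by their query text:
# sketch: count current locks per query for the dspaceWeb application
$ psql -c "SELECT psa.query, COUNT(*) AS locks FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid WHERE psa.application_name = 'dspaceWeb' GROUP BY psa.query ORDER BY locks DESC LIMIT 10"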
  • On a related note, I want to enable the pg_stat_statements extension to see which queries get run the most, so I created the extension on the CGSpace database (sketched below)
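  • A minimal sketch of what that involves (pg_stat_statements also has to be added to shared_preload_libraries in postgresql.conf and PostgreSQL restarted before it starts collecting data; the column names below are as in PostgreSQL 10):
$ psql -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements"
# once it has collected some data, show the most frequently run queries
$ psql -c "SELECT calls, mean_time, query FROM pg_stat_statements ORDER BY calls DESC LIMIT 10"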
  • I was doing some research on PostgreSQL locks and found some interesting things to consider
    • The default lock_timeout is 0, aka disabled
    • The default statement_timeout is 0, aka disabled
    • It seems to be recommended to start by setting statement_timeout first, with a rule of thumb of about ten times the duration of your longest query (see the example below)
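  • Purely as an illustration of that rule of thumb (the thirty-second figure is hypothetical, not a measurement from CGSpace): if the longest legitimate query took about thirty seconds, statement_timeout could be set to roughly 300 seconds, keeping in mind that the PostgreSQL docs advise against a global statement_timeout because it affects all sessions, including maintenance jobs:
# hypothetical value: ten times a longest query of ~30 seconds
$ psql -c "ALTER SYSTEM SET statement_timeout = '300s'"
$ psql -c "SELECT pg_reload_conf()"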
  • Mark Wood mentioned the checker cron job that apparently runs in one transaction and might be an issue
    • I definitely saw it holding a bunch of locks for ~30 minutes during the first part of its execution, then it dropped them and did some other less-intensive things without locks
  • Bizuwork was still not receiving emails even after we fixed the SMTP access on CGSpace
    • After some troubleshooting it turns out that the emails from CGSpace were landing in her Junk folder!

2021-12-03

  • I see that GARDIAN is finally using a "GARDIAN" user agent
    • I will add it to our local spider agent override in DSpace so that the hits don't get counted in Solr (see the sketch below)
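  • For reference, the local override is just the flat list of agent patterns used elsewhere in these notes (dspace/config/spiders/agents/ilri), so adding GARDIAN and purging its existing hits would look roughly like this (a sketch; the path on the server may differ from the relative one shown here):
$ echo 'GARDIAN' >> dspace/config/spiders/agents/ilri
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p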

2021-12-05

  • Proofed fifty records that Abenet sent me from the Africa Rice Center ("AfricaRice 1st batch Import")
    • Fixed forty-six incorrect collections
    • Cleaned up and normalized affiliations
    • Cleaned up dates (extra * character in all?)
    • Cleaned up citation format
    • Fixed some encoding issues in abstracts
    • Removed empty columns
    • Removed one duplicate: Enhancing Rice Productivity and Soil Nitrogen Using Dual-Purpose Cowpea-NERICA® Rice Sequence in Degraded Savanna
    • Added volume and issue metadata by extracting it from the citations
    • All PDFs hosted on davidpublishing.com are dead...
    • All DOIs linking to African Journal of Agricultural Research are dead...
    • Fixed a handful of items marked as "Open Access" that are actually closed
    • Added many missing ISSNs
    • Added many missing countries/regions
    • Fixed invalid AGROVOC terms and added some more based on article subjects
  • I also made some minor changes to the CSV Metadata Quality Checker
    • Added the ability to check if the item's title exists in the citation
    • Updated the mojibake check to only run if we're not running in unsafe mode (so we don't print the same warning during both the check and fix steps)
  • I ran the re-harvesting on AReS

2021-12-06

  • Some minor work on the check-duplicates.py script I wrote last month
    • I found some corner cases where items matched in the database but were in_archive=f and/or withdrawn=t, so I now check those flags before trying to resolve the handles of potential duplicates (see the sketch below)
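  • For context, those flags are columns on the item table in the DSpace database, so a quick way to see such items is something like this sketch (assuming a DSpace 6 schema; connection details for psql are omitted):
# sketch: list a few items that are not in the archive or are withdrawn
$ psql -c "SELECT uuid, in_archive, withdrawn FROM item WHERE in_archive = 'f' OR withdrawn = 't' LIMIT 10"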
  • More work on the Africa Rice Center 1st batch import
    • I merged the metadata for three duplicates in Africa Rice's items and mapped them on CGSpace
    • I did a bit more work to add missing AGROVOC subjects, countries, regions, extents, etc., and then uploaded the forty-six items to CGSpace
  • I started looking at the seventy CAS records that Abenet has been working on for the past few months

2021-12-07

  • I sent Vini from CGIAR CAS some questions about the seventy records I was working on yesterday
    • Also, I ran the check-duplicates.py script on them and found that they might ALL be duplicates!!!
    • I tweaked the script a bit more to use the issue dates as a third criterion, and now there are fewer duplicates, but it's still at least twenty or so...
    • The script now checks if the issue date of the item in the CSV and the issue date of the item in the database are less than 365 days apart (by default)
    • For example, many items like "Annual Report 2020" can have a similar title and type to previous annual reports but are not duplicates
  • I noticed a strange user agent in the XMLUI logs on CGSpace:
20.84.225.129 - - [07/Dec/2021:11:51:24 +0100] "GET /handle/10568/33203 HTTP/1.1" 200 6328 "-" "python-requests/2.25.1"
20.84.225.129 - - [07/Dec/2021:11:51:27 +0100] "GET /handle/10568/33203 HTTP/2.0" 200 6315 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36"
  • I looked into it more and see a dozen other IPs using that user agent, all owned by Microsoft (see the tally sketch below)
    • It could be someone on Azure?
    • I opened a pull request to COUNTER-Robots and I'll add this user agent to our local override until they decide whether or not to include it
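  • For reference, a quick way to tally which IPs are sending that user agent in the web server logs (a sketch; the log path is an assumption, not necessarily where the XMLUI requests are logged on CGSpace):
# count requests per IP for the HeadlessChrome user agent
$ grep 'HeadlessChrome' /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head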
  • I purged 34,000 hits from this user agent in our Solr statistics:
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 34458 hits from HeadlessChrome in statistics

Total number of bot hits purged: 34458
  • Meeting with partners about repositories in the One CGIAR