cgspace-notes/content/posts/2021-12.md

19 KiB

title date author categories
December, 2021 2021-12-01T16:07:07+02:00 Alan Orth
Notes

2021-12-01

  • Atmire merged some changes I had submitted to the COUNTER-Robots project
  • I updated our local spider user agents and then re-ran the list with my check-spider-hits.sh script on CGSpace:
$ ./ilri/check-spider-hits.sh -f /tmp/agents -p  
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics

Total number of bot hits purged: 3679

2021-12-02

  • Francesca from Alliance asked me for help with approving a submission that gets stuck
    • I looked at the PostgreSQL activity and the locks are back up like they were earlier this week
$ psql -c "SELECT application_name FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | sort | uniq -c | sort -n
      1 
      1 ------------------
      1 (1437 rows)
      1  application_name 
      9  psql
   1428  dspaceWeb
  • Munin shows the same:

PostgreSQL locks week

  • Last month I enabled the log_lock_waits in PostgreSQL so I checked the log and was surprised to find only a few since I restarted PostgreSQL three days ago:
# grep -E '^2021-(11-29|11-30|12-01|12-02)' /var/log/postgresql/postgresql-10-main.log | grep -c 'still waiting for'
15
  • I think you could analyze the locks for the dspaceWeb user (XMLUI) and find out what queries were locking... but it's so much information and I don't know where to start
    • For now I just restarted PostgreSQL...
    • Francesca was able to do her submission immediately...
  • On a related note, I want to enable the pg_stat_statement feature to see which queries get run the most, so I created the extension on the CGSpace database
  • I was doing some research on PostgreSQL locks and found some interesting things to consider
    • The default lock_timeout is 0, aka disabled
    • The default statement_timeout is 0, aka disabled
    • It seems to be recommended to start by setting statement_timeout first, rule of thumb ten times longer than your longest query
  • Mark Wood mentioned the checker cron job that apparently runs in one transaction and might be an issue
    • I definitely saw it holding a bunch of locks for ~30 minutes during the first part of its execution, then it dropped them and did some other less-intensive things without locks
  • Bizuwork was still not receiving emails even after we fixed the SMTP access on CGSpace
    • After some troubleshooting it turns out that the emails from CGSpace were going in her Junk!

2021-12-03

  • I see GARDIAN is now using a "GARDIAN" user agent finally
    • I will add them to our local spider agent override in DSpace so that the hits don't get counted in Solr

2021-12-05

  • Proof fifty records Abenet sent me from Africa Rice Center ("AfricaRice 1st batch Import")
    • Fixed forty-six incorrect collections
    • Cleaned up and normalize affiliations
    • Cleaned up dates (extra * character in all?)
    • Cleaned up citation format
    • Fixed some encoding issues in abstracts
    • Removed empty columns
    • Removed one duplicate: Enhancing Rice Productivity and Soil Nitrogen Using Dual-Purpose Cowpea-NERICA® Rice Sequence in Degraded Savanna
    • Added volume and issue metadata by extracting it from the citations
    • All PDFs hosted on davidpublishing.com are dead...
    • All DOIs linking to African Journal of Agricultural Research are dead...
    • Fixed a handful of items marked as "Open Access" that are actually closed
    • Added many missing ISSNs
    • Added many missing countries/regions
    • Fixed invalid AGROVOC terms and added some more based on article subjects
  • I also made some minor changes to the CSV Metadata Quality Checker
    • Added the ability to check if the item's title exists in the citation
    • Updated to only run the mojibake check if we're not running in unsafe mode (so we don't print the same warning during both the check and fix steps)
  • I ran the re-harvesting on AReS

2021-12-06

  • Some minor work on the check-duplicates.py script I wrote last month
    • I found some corner cases where there were items that matched in the database, but they were in_archive=f and or withdrawn=t so I check that before trying to resolve the handles of potential duplicates
  • More work on the Africa Rice Center 1st batch import
    • I merged the metadata for three duplicates in Africa Rice's items and mapped them on CGSpace
    • I did a bit more work to add missing AGROVOC subjects, countries, regions, extents, etc and then uploaded the forty-six items to CGSpace
  • I started looking at the seventy CAS records that Abenet has been working on for the past few months

2021-12-07

  • I sent Vini from CGIAR CAS some questions about the seventy records I was working on yesterday
    • Also, I ran the check-duplicates.py script on them and found that they might ALL be duplicates!!!
    • I tweaked the script a bit more to use the issue dates as a third criteria and now there are less duplicates, but it's still at least twenty or so...
    • The script now checks if the issue date of the item in the CSV and the issue date of the item in the database are less than 365 days apart (by default)
    • For example, many items like "Annual Report 2020" can have similar title and type to previous annual reports, but are not duplicates
  • I noticed a strange user agent in the XMLUI logs on CGSpace:
20.84.225.129 - - [07/Dec/2021:11:51:24 +0100] "GET /handle/10568/33203 HTTP/1.1" 200 6328 "-" "python-requests/2.25.1"
20.84.225.129 - - [07/Dec/2021:11:51:27 +0100] "GET /handle/10568/33203 HTTP/2.0" 200 6315 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4298.0 Safari/537.36"
  • I looked into it more and I see a dozen other IPs using that user agent, and they are all owned by Microsoft
    • It could be someone on Azure?
    • I opened a pull request to COUNTER-Robots and I'll add this user agent to our local override until they decide to include it or not
  • I purged 34,000 hits from this user agent in our Solr statistics:
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 34458 hits from HeadlessChrome in statistics

Total number of bot hits purged: 34458
  • Meeting with partners about repositories in the One CGIAR

2021-12-08

2021-12-09

  • Help Francesca upload the dataset for one CIAT publication (it has like 100 authors so we did it via CSV)

2021-12-12

  • Patch OpenRXV's Elasticsearch for the CVE-2021-44228 log4j vulnerability and re-deploy AReS
    • I added -Dlog4j2.formatMsgNoLookups=true to the Elasticsearch Java environment
  • Run AReS harvesting

2021-12-13

  • I ran the check-duplicates.py script on the 1,000 items from the CGIAR System Office TAC/ICW/Green Cover archives and found hundreds or thousands of potential duplicates
    • I sent feedback to Gaia
  • Help Jacquie from WorldFish try to find all outputs for the Fish CRP because there are a few different formats for that name
  • Create a temporary account for Rafael Rodriguez on DSpace Test so he can investigate the submission workflow
    • I added him to the admin group on the Alliance community...

2021-12-14

  • I finally caught some stuck locks on CGSpace after checking several times per day for the last week:
$ psql -c "SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid" | wc -l
1508
  • Now looking at the locks query sorting by age of locks:
$ cat locks-age.sql 
SELECT a.datname,
         l.relation::regclass,
         l.transactionid,
         l.mode,
         l.GRANTED,
         a.usename,
         a.query,
         a.query_start,
         age(now(), a.query_start) AS "age",
         a.pid
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
ORDER BY a.query_start;
  • The oldest locks are 9 hours and 26 minutes old and the time on the server is Tue Dec 14 18:41:58 CET 2021, so it seems something happened around 9:15 this morning
    • I looked at the maintenance tasks and there is nothing running around then (only the sitemap update that runs at 8AM, and should be quick)
    • I looked at the DSpace log, but didn't see anything interesting there: only editors making edits...
    • I looked at the nginx REST API logs and saw lots of GET action there from Drupal sites harvesting us...
    • So I'm not sure what it causing this... perhaps something in the XMLUI submission / task workflow
    • For now I just ran all system updates and rebooted the server
    • I also enabled Atmire's log-db-activity.sh script to run every four hours (in the DSpace user's crontab) so perhaps that will be better than me checking manually
  • Regarding Gaia's 1,000 items to upload to CGSpace, I checked the eighteen Green Cover records and there are no duplicates, so that's at least a starting point!
    • I sent her a spreadsheet with the eighteen items with a new collection column to indicate where they should go

2021-12-16

  • Working on the CGIAR CAS Green Cover records for Gaia
    • Add months to dcterms.issued from PDFs
    • Add languages
    • Format and fix several authors
  • I created a SAF archive with SAFBuilder and then imported it to DSpace Test:
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=fuuu@fuuu.com --source /tmp/SimpleArchiveFormat --mapfile=./2021-12-16-green-covers.map

2021-12-19

  • I tried to update all Docker containers on AReS and then run a build, but I got an error in the backend:
> openrxv-backend@0.0.1 build
> nest build

node_modules/@elastic/elasticsearch/api/types.d.ts:2454:13 - error TS2456: Type alias 'AggregationsAggregate' circularly references itself.

2454 export type AggregationsAggregate = AggregationsSingleBucketAggregate | AggregationsAutoDateHistogramAggregate | AggregationsFiltersAggregate | AggregationsSignificantTermsAggregate<any> | AggregationsTermsAggregate<any> | AggregationsBucketAggregate | AggregationsCompositeBucketAggregate | AggregationsMultiBucketAggregate<AggregationsBucket> | AggregationsMatrixStatsAggregate | AggregationsKeyedValueAggregate | AggregationsMetricAggregate
                 ~~~~~~~~~~~~~~~~~~~~~
node_modules/@elastic/elasticsearch/api/types.d.ts:3209:13 - error TS2456: Type alias 'AggregationsSingleBucketAggregate' circularly references itself.

3209 export type AggregationsSingleBucketAggregate = AggregationsSingleBucketAggregateKeys
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Found 2 error(s).
  • I'm not sure why because I build the backend successfully on my local machine...
    • For now I just ran all the system updates and rebooted the machine (linode20)
    • Then I started a fresh harvest
  • Now I cleared all images on my local machine and I get the same error when building the backend
    • It seems to be related to @elastic/elasticsearch-js](https://github.com/elastic/elasticsearch-js), which our package.json pins with version ^7.13.0
    • I see that AReS is currently using 7.15.0 in its package-lock.json, and 7.16.0 was released four days ago so perhaps it's that...
    • Pinning ~7.15.0 allows nest to build fine...
    • I made a pull request
  • But since software sucks, now I get an error in the frontend while starting nginx:
nginx: [emerg] host not found in upstream "backend:3000" in /etc/nginx/conf.d/default.conf:2
  • In other news, looking at updating our Redis from version 5 to 6 (which is slightly less old, but still old!) and I'm happy to see that the release notes for version 6 say that it is compatible with 5 except for one minor thing that we don't seem to be using (SPOP?)
  • For reference I see that our Redis 5 container is based on Debian 11, which I didn't expect... but I still want to try to upgrade to Redis 6 eventually:
$ docker exec -it redis bash
root@23692d6b51c5:/data# cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • I bumped the version to 6 on my local test machine and the logs look good:
$ docker logs redis
1:C 19 Dec 2021 19:27:15.583 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 19 Dec 2021 19:27:15.583 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 19 Dec 2021 19:27:15.583 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
1:M 19 Dec 2021 19:27:15.584 * monotonic clock: POSIX clock_gettime
1:M 19 Dec 2021 19:27:15.584 * Running mode=standalone, port=6379.
1:M 19 Dec 2021 19:27:15.584 # Server initialized
1:M 19 Dec 2021 19:27:15.585 * Loading RDB produced by version 5.0.14
1:M 19 Dec 2021 19:27:15.585 * RDB age 33 seconds
1:M 19 Dec 2021 19:27:15.585 * RDB memory usage when created 3.17 Mb
1:M 19 Dec 2021 19:27:15.595 # Done loading RDB, keys loaded: 932, keys expired: 1.
1:M 19 Dec 2021 19:27:15.595 * DB loaded from disk: 0.011 seconds
1:M 19 Dec 2021 19:27:15.595 * Ready to accept connections
  • The interface and harvesting all work as expected...
    • I pushed the update to OpenRXV
  • I also fixed the weird "unsafe" issue in the links on AReS that Abenet told me about last week
    • When testing my local instance I realized that the thumbnail field was missing on the production AReS, and that somehow breaks the links

2021-12-22

  • Fix apt error on DSpace servers due to updated /etc/java-8-openjdk/security/java.security file

2021-12-23

  • Add support for dropping invalid AGROVOC subjects to csv-metadata-quality
  • Move invalid AGROVOC subjects in Gaia's eighteen green cover items on DSpace Test to cg.subject.system
  • I created an "approve" user for Rafael from CIAT to do tests on DSpace Test:
$ dspace user -a -m rafael-approve@cgiar.org -g Rafael -s Rodriguez -p 'fuuuuuu'

2021-12-27

  • Start a fresh harvest on AReS

2021-12-29

  • Looking at the top IPs and user agents on CGSpace's Solr statistics I see a strange user agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.{random.randint(0, 9999)} Safari/537.{random.randint(0, 99)}
  • I found two IPs using user agents with the "randint" bug:
    • 47.252.80.214 (AliCloud in the US)
    • 61.143.40.50 (ChinaNet in China)
  • I wonder what other requests have been made from those hosts where the randint spoofer was working... ugh.
  • I found some IPs from the Russian SELECTEL network making thousands of requests with SQL injection attempts...
    • 45.134.26.171
    • 45.146.166.173
  • 3.225.28.105 is on Amazon and making thousands of requests for the same URL:
/rest/collections/1118/items?expand=all&limit=1
  • Most of the time it has a real-looking user agent, but sometimes it uses Apache-HttpClient/4.3.4 (java 1.5)
  • Another 82.65.26.228 is doing SQL injection attempts from France
  • 216.213.28.138 is some scrape-as-a-service bot from Sprious
  • I used my resolve-addresses-geoip2.py script to get the ASNs for all the IPs in Solr stats this month, then extracted the ASNs that were responsible for more than one IP:
$ ./ilri/resolve-addresses-geoip2.py -i /tmp/ips.txt -o /tmp/2021-12-29-ips.csv
$ csvcut -c asn /tmp/2021-12-29-ips.csv | sed 1d | sort | uniq -c | sort -h | awk '$1 > 1'
      2 10620
      2 265696
      2 6147
      2 9299
      3 3269
      5 16509
      5 49505
      9 24757
      9 24940
      9 64267
  • AS 64267 is Sprious, and it has used these IPs this month:
    • 216.213.28.136
    • 207.182.27.191
    • 216.41.235.187
    • 216.41.232.169
    • 216.41.235.186
    • 52.124.19.190
    • 216.213.28.138
    • 216.41.234.163
  • To be honest I want to ban all their networks but I'm afraid it's too many IPs... hmmm
  • AS 24940 is Hetzner, but I don't feel like going through all the IPs to see... they always pretend to be normal users and make semi-sane requests so it might be a proxy or something
  • AS 24757 is Ethiopian Telecom
  • I'm going to purge all these for sure, as they are a scraping-as-a-service company and don't use proper user agents or request robots.txt
  • AS 49505 is the Russian Selectel, and it has used these IPs this month:
    • 45.146.166.173
    • 45.134.26.171
    • 45.146.164.123
    • 45.155.205.231
    • 195.54.167.122
  • I will purge them all too because they are up to no good, as I already saw earlier today (SQL injections)
  • AS 16509 is Amazon, and it has used these IPs this month:
    • 18.135.23.223 (made requests using the Mozilla/5.0 (compatible; U; Koha checkurl) user agent, so I will purge it and add it to our DSpace user agent override and submit to COUNTER-Robots)
    • 54.76.137.83 (made hundreds of requests to "/" with a normal user agent)
    • 34.253.119.85 (made hundreds of requests to "/" with a normal user agent)
    • 34.216.201.131 (made hundreds of requests to "/" with a normal user agent)
    • 54.203.193.46 (made hundreds of requests to "/" with a normal user agent)
  • I ran the script to purge spider agents with the latest updates:
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
Purging 2530 hits from HeadlessChrome in statistics
Purging 10676 hits from randint in statistics
Purging 3579 hits from Koha in statistics

Total number of bot hits purged: 16785
  • Then the IPs:
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips-to-purge.txt -p
Purging 1190 hits from 216.213.28.136 in statistics
Purging 1128 hits from 207.182.27.191 in statistics
Purging 1095 hits from 216.41.235.187 in statistics
Purging 1087 hits from 216.41.232.169 in statistics
Purging 1011 hits from 216.41.235.186 in statistics
Purging 945 hits from 52.124.19.190 in statistics
Purging 933 hits from 216.213.28.138 in statistics
Purging 930 hits from 216.41.234.163 in statistics
Purging 4410 hits from 45.146.166.173 in statistics
Purging 2688 hits from 45.134.26.171 in statistics
Purging 1130 hits from 45.146.164.123 in statistics
Purging 536 hits from 45.155.205.231 in statistics
Purging 10676 hits from 195.54.167.122 in statistics
Purging 1350 hits from 54.76.137.83 in statistics
Purging 1240 hits from 34.253.119.85 in statistics
Purging 2879 hits from 34.216.201.131 in statistics
Purging 2909 hits from 54.203.193.46 in statistics
Purging 1822 hits from 2605\:b100\:316\:7f74\:8d67\:5860\:a9f3\:d87c in statistics

Total number of bot hits purged: 37959