---
title: "February, 2022"
date: 2022-02-01T14:06:54+02:00
author: "Alan Orth"
categories: ["Notes"]
---
## 2022-02-01
- Meeting with Peter and Abenet about CGSpace in the One CGIAR
- We agreed to buy $5,000 worth of credits from Atmire for future upgrades
- We agreed to move CRPs and non-CGIAR communities off the home page, as well as some other things for the CGIAR System Organization
- We agreed to make a Discovery facet for CGIAR Action Areas above the existing CGIAR Impact Areas one
- We agreed to try to do more alignment of affiliations/funders with ROR
<!--more-->
- I moved a bunch of communities:
```console
$ dspace community-filiator --remove --parent=10568/114639 --child=10568/115089
$ dspace community-filiator --remove --parent=10568/114639 --child=10568/115087
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/108598
$ dspace community-filiator --remove --parent=10568/83389 --child=10947/1
$ dspace community-filiator --set --parent=10568/35697 --child=10568/80211
$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2517
$ dspace community-filiator --set --parent=10568/97114 --child=10947/2517
$ dspace community-filiator --set --parent=10568/97114 --child=10568/89416
$ dspace community-filiator --set --parent=10568/97114 --child=10568/3530
$ dspace community-filiator --set --parent=10568/97114 --child=10568/80099
$ dspace community-filiator --set --parent=10568/97114 --child=10568/80100
$ dspace community-filiator --set --parent=10568/97114 --child=10568/34494
$ dspace community-filiator --set --parent=10568/117867 --child=10568/114644
$ dspace community-filiator --set --parent=10568/117867 --child=10568/16573
$ dspace community-filiator --set --parent=10568/117867 --child=10568/42211
$ dspace community-filiator --set --parent=10568/117865 --child=10568/109945
$ dspace community-filiator --set --parent=10568/117865 --child=10568/16498
$ dspace community-filiator --set --parent=10568/117865 --child=10568/99453
$ dspace community-filiator --set --parent=10568/117865 --child=10568/2983
$ dspace community-filiator --set --parent=10568/117865 --child=10568/133
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/1208
$ dspace community-filiator --set --parent=10568/117865 --child=10568/1208
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/56924
$ dspace community-filiator --set --parent=10568/117865 --child=10568/56924
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/91688
$ dspace community-filiator --set --parent=10947/1 --child=10568/91688
$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2515
$ dspace community-filiator --set --parent=10947/1 --child=10947/2515
```
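Each move above is a `--remove` from the old parent followed by a `--set` under the new one, so the command list could be generated from a small table instead of typed by hand. A minimal Python sketch, assuming a hypothetical `moves` list of (child, old parent, new parent) tuples; only the `--remove`, `--set`, `--parent` and `--child` flags come from the commands above, and the function name is made up for illustration:

```python
def filiator_commands(moves):
    """Build `dspace community-filiator` command lines for a list of moves.

    Each move is (child, old_parent, new_parent). Either parent may be None:
    no old parent means there is nothing to remove first, and no new parent
    means the child is only detached.
    """
    commands = []
    for child, old_parent, new_parent in moves:
        if old_parent:
            commands.append(
                f"dspace community-filiator --remove --parent={old_parent} --child={child}"
            )
        if new_parent:
            commands.append(
                f"dspace community-filiator --set --parent={new_parent} --child={child}"
            )
    return commands


# Hypothetical plan reusing two handles from the commands above
moves = [
    ("10568/1208", "10568/83389", "10568/117865"),
    ("10568/80211", None, "10568/35697"),
]

for command in filiator_commands(moves):
    print(command)
```

Printing the commands rather than running them leaves room to review the plan before anything touches the community hierarchy.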
- Remove CPWF and CTA subjects from the Discovery facets
- Start a full Discovery index on CGSpace:
```console
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 275m15.777s
user 182m52.171s
sys 2m51.573s
```
- I got a request to confirm validation of CGSpace on openarchives.org, with the requestor's IP being 128.84.116.66
- That is at Cornell... hmmmm who could that be?!
- Oh, the OpenArchives initiative is at Cornell... maybe this is an automated periodic check?
## 2022-02-02
- Looking at the top user agents and IP addresses in CGSpace's Solr statistics for 2022-01
- 64.39.98.40 made 26,000 requests; it is owned by Qualys, so it's some kind of security scanning
- 45.134.26.171 made 8,000 requests; it is owned by some Russian company and makes requests like this, hmmmmm:
```console
45.134.26.171 - - [12/Jan/2022:06:25:27 +0100] "GET /bitstream/handle/10568/81964/varietal-2faea58f.pdf?sequence=1 HTTP/1.1" 200 1157807 "https://cgspace.cgiar.org:443/bitstream/handle/10568/81964/varietal-2faea58f.pdf" "Opera/9.64 (Windows NT 6.1; U; MRA 5.5 (build 02842); ru) Presto/2.1.1)) AND 4734=CTXSYS.DRITHSX.SN(4734,(CHR(113)||CHR(120)||CHR(120)||CHR(112)||CHR(113)||(SELECT (CASE WHEN (4734=4734) THEN 1 ELSE 0 END) FROM DUAL)||CHR(113)||CHR(120)||CHR(113)||CHR(122)||CHR(113))) AND ((3917=3917"
```
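The user agent in that request carries an Oracle-style SQL injection probe (`CTXSYS.DRITHSX.SN`, `FROM DUAL`). A rough Python sketch for flagging such requests in an access log, assuming the combined log format shown above; the marker list and the function name are illustrative assumptions, not an exhaustive detector:

```python
import re

# Combined log format: ip - user [time] "request" status bytes "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>.*)"$'
)

# Illustrative SQL injection markers only; real probes vary widely
SQLI_MARKERS = re.compile(
    r"\bSELECT\b.+\bFROM\b|\bUNION\b.+\bSELECT\b|CTXSYS\.|\bCHR\(\d+\)|\bCASE\s+WHEN\b",
    re.IGNORECASE,
)


def suspicious_agent(line):
    """Return (ip, agent) if the user agent looks like an SQL injection probe, else None."""
    match = LOG_LINE.match(line.strip())
    if match and SQLI_MARKERS.search(match.group("agent")):
        return match.group("ip"), match.group("agent")
    return None


# Shortened version of the log entry above (the user agent is abbreviated)
probe = (
    '45.134.26.171 - - [12/Jan/2022:06:25:27 +0100] '
    '"GET /bitstream/handle/10568/81964/varietal-2faea58f.pdf?sequence=1 HTTP/1.1" '
    '200 1157807 "https://cgspace.cgiar.org:443/bitstream/handle/10568/81964/varietal-2faea58f.pdf" '
    '"Opera/9.64 (Windows NT 6.1; U; ru) Presto/2.1.1)) AND 4734=CTXSYS.DRITHSX.SN(4734,(CHR(113)||CHR(120)))"'
)

result = suspicious_agent(probe)
if result:
    print(f"Suspicious user agent from {result[0]}")
```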
- 3.225.28.105 made 3,000 requests mostly for one CIAT collection on the REST API and it is owned by Amazon
- The user agent is sometimes a normal user one, and sometimes `Apache-HttpClient/4.3.4 (java 1.5)`
- 217.182.21.193 made 2,400 requests and is on OVH
- I purged these hits:
```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 26817 hits from 64.39.98.40 in statistics
Purging 9446 hits from 45.134.26.171 in statistics
Purging 6490 hits from 3.225.28.105 in statistics
Purging 11949 hits from 217.182.21.193 in statistics
Total number of bot hits purged: 54702
```
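As a sanity check, the per-IP counts can be summed and compared with the reported total. A small Python sketch that parses the output format shown above; the function name is made up for illustration:

```python
import re

PURGE_LINE = re.compile(r"^Purging (\d+) hits from (\S+) in statistics$")
TOTAL_LINE = re.compile(r"^Total number of bot hits purged: (\d+)$")


def purged_hits(output):
    """Return ({ip: hits}, reported_total) parsed from the script output."""
    hits = {}
    reported_total = None
    for line in output.splitlines():
        line = line.strip()
        purge = PURGE_LINE.match(line)
        total = TOTAL_LINE.match(line)
        if purge:
            hits[purge.group(2)] = int(purge.group(1))
        elif total:
            reported_total = int(total.group(1))
    return hits, reported_total


output = """\
Purging 26817 hits from 64.39.98.40 in statistics
Purging 9446 hits from 45.134.26.171 in statistics
Purging 6490 hits from 3.225.28.105 in statistics
Purging 11949 hits from 217.182.21.193 in statistics
Total number of bot hits purged: 54702
"""

hits, reported_total = purged_hits(output)
print(sum(hits.values()) == reported_total)
```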
- Export donors and affiliations from CGSpace database:
```console
localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-donors.csv WITH CSV HEADER;
COPY 1036
localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-affiliations.csv WITH CSV HEADER;
COPY 7901
```
- Then check matches against the latest ROR dump:
```console
$ csvcut -c cg.contributor.donor /tmp/2022-02-02-donors.csv | sed '1d' > /tmp/2022-02-02-donors.txt
$ ./ilri/ror-lookup.py -i /tmp/2022-02-02-donors.txt -r 2021-09-23-ror-data.json -o /tmp/donor-ror-matches.csv
...
```
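The internals of `ror-lookup.py` are not shown here, so the following is only a sketch of how such matching might work. It assumes the ROR dump is a JSON array of records with `name`, `aliases`, `acronyms` and `labels` keys, and it uses case-insensitive exact matching, which may differ from what the real script does. The function names and the sample records are made up for illustration:

```python
import json


def load_ror_names(ror_records):
    """Collect lowercased names, aliases, acronyms and labels from ROR records."""
    names = set()
    for record in ror_records:
        names.add(record["name"].lower())
        for alias in record.get("aliases", []):
            names.add(alias.lower())
        for acronym in record.get("acronyms", []):
            names.add(acronym.lower())
        for label in record.get("labels", []):
            names.add(label["label"].lower())
    return names


def match_rate(candidates, ror_names):
    """Return (matched candidates, percentage matched)."""
    matched = [c for c in candidates if c.strip().lower() in ror_names]
    percent = 100 * len(matched) / len(candidates) if candidates else 0.0
    return matched, percent


# Stand-in records for illustration; with the real dump one would instead do:
#   with open("2021-09-23-ror-data.json") as f:
#       records = json.load(f)
records = json.loads(
    """[
      {"name": "Example Research Institute", "aliases": ["Example Institute"],
       "acronyms": ["ERI"], "labels": [{"label": "Institut Exemple", "iso639": "fr"}]}
    ]"""
)

ror_names = load_ror_names(records)
matched, percent = match_rate(["example institute", "Unknown Donor"], ror_names)
print(f"{len(matched)}/2 ({percent:.1f}%) matched")
```

The percentage is simply matches divided by distinct values, for example 100 × 258 / 1036 ≈ 24.9 for a list of 1036 donors with 258 matches.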
- I see we have 258/1036 (24.9%) of our donors matching ROR (as of the 2021-09-23 ROR dump)
- I see we have 1986/7901 (25.1%) of our affiliations matching ROR (as of the 2021-09-23 ROR dump)
- Update the PostgreSQL JDBC driver to 42.3.2 in the Ansible Infrastructure playbooks and deploy on DSpace Test
- Mishell from CIP sent me a copy of a security scan their ICT had done on CGSpace using QualysGuard
- The report was very long and generic, highlighting low-severity things like being able to post crap to search forms and have it appear on the results page
- Also they say we're using old jQuery and Bootstrap, etc (fair enough), but there are no exploits per se
- At least now I know why all those Qualys IPs are scanning us all the time!!!
- Mishell also said she's having issues logging into CGSpace
- According to the logs her account is failing on LDAP authentication
- I checked CGSpace's LDAP credentials using ldapsearch and was able to connect so it's gotta be something with her account
## 2022-02-03
- I synchronized DSpace Test with a fresh snapshot of CGSpace
- I noticed a bunch of thumbnails missing for items submitted in the last week on CGSpace so I ran the `dspace filter-media` script manually and eventually it crashed:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media
...
SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.txt' already exists
Generated Thumbnail ilri_establishiment.pdf matches pattern and is replacable.
SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.jpg' already exists
File: Agreement_on_the_Estab_of_ILRI.doc.txt
Exception: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
at org.textmining.extraction.word.model.FormattedDiskPage.<init>(FormattedDiskPage.java:66)
at org.textmining.extraction.word.model.CHPFormattedDiskPage.<init>(CHPFormattedDiskPage.java:62)
at org.textmining.extraction.word.model.CHPBinTable.<init>(CHPBinTable.java:70)
at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:122)
at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:63)
at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:83)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersAllItems(MediaFilterServiceImpl.java:111)
at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:212)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- I should look up that issue and report a bug somewhere perhaps, but for now I just forced the JPG thumbnails with:
```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media.log
```
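Since the output now goes to `/tmp/filter-media.log`, a short script could summarize what was skipped and what failed. A minimal Python sketch, assuming the log uses the `SKIPPED:`, `File:` and `Exception:` line prefixes seen in the earlier output; the verbose ImageMagick-only run may print other lines, which this simply ignores. The function name is made up for illustration:

```python
def summarize_filter_media(log_lines):
    """Count SKIPPED lines and pair each Exception with the preceding File line."""
    skipped = 0
    failures = []
    current_file = None
    for line in log_lines:
        line = line.strip()
        if line.startswith("SKIPPED: "):
            skipped += 1
        elif line.startswith("File: "):
            current_file = line[len("File: "):]
        elif line.startswith("Exception: "):
            failures.append((current_file, line[len("Exception: "):]))
    return skipped, failures


# Lines copied from the crash output earlier in these notes
sample = [
    "SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.txt' already exists",
    "SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.jpg' already exists",
    "File: Agreement_on_the_Estab_of_ILRI.doc.txt",
    "Exception: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I",
]

skipped, failures = summarize_filter_media(sample)
print(f"{skipped} skipped, {len(failures)} failed")
```

For the real log, the same function could be fed the lines of `/tmp/filter-media.log` via `open(...)`.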
<!-- vim: set sw=2 ts=2: -->