---
title: "February, 2022"
date: 2022-02-01T14:06:54+02:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2022-02-01

- Meeting with Peter and Abenet about CGSpace in the One CGIAR
  - We agreed to buy $5,000 worth of credits from Atmire for future upgrades
  - We agreed to move CRPs and non-CGIAR communities off the home page, as well as some other things for the CGIAR System Organization
  - We agreed to make a Discovery facet for CGIAR Action Areas above the existing CGIAR Impact Areas one
  - We agreed to try to do more alignment of affiliations/funders with ROR
- I moved a bunch of communities:

```console
$ dspace community-filiator --remove --parent=10568/114639 --child=10568/115089
$ dspace community-filiator --remove --parent=10568/114639 --child=10568/115087
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/108598
$ dspace community-filiator --remove --parent=10568/83389 --child=10947/1
$ dspace community-filiator --set --parent=10568/35697 --child=10568/80211
$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2517
$ dspace community-filiator --set --parent=10568/97114 --child=10947/2517
$ dspace community-filiator --set --parent=10568/97114 --child=10568/89416
$ dspace community-filiator --set --parent=10568/97114 --child=10568/3530
$ dspace community-filiator --set --parent=10568/97114 --child=10568/80099
$ dspace community-filiator --set --parent=10568/97114 --child=10568/80100
$ dspace community-filiator --set --parent=10568/97114 --child=10568/34494
$ dspace community-filiator --set --parent=10568/117867 --child=10568/114644
$ dspace community-filiator --set --parent=10568/117867 --child=10568/16573
$ dspace community-filiator --set --parent=10568/117867 --child=10568/42211
$ dspace community-filiator --set --parent=10568/117865 --child=10568/109945
$ dspace community-filiator --set --parent=10568/117865 --child=10568/16498
$ dspace community-filiator --set --parent=10568/117865 --child=10568/99453
$ dspace community-filiator --set --parent=10568/117865 --child=10568/2983
$ dspace community-filiator --set --parent=10568/117865 --child=10568/133
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/1208
$ dspace community-filiator --set --parent=10568/117865 --child=10568/1208
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/56924
$ dspace community-filiator --set --parent=10568/117865 --child=10568/56924
$ dspace community-filiator --remove --parent=10568/83389 --child=10568/91688
$ dspace community-filiator --set --parent=10947/1 --child=10568/91688
$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2515
$ dspace community-filiator --set --parent=10947/1 --child=10947/2515
```

- Remove CPWF and CTA subjects from the Discovery facets
- Start a full Discovery index on CGSpace:

```console
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    275m15.777s
user    182m52.171s
sys     2m51.573s
```

- I got a request to confirm validation of CGSpace on openarchives.org
  - The requestor's IP is at Cornell... hmmmm who could that be?!
  - Oh, the OpenArchives initiative is at Cornell... maybe this is an automated periodic check?

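- Several of the community moves above are a `--remove` from the old parent followed by a `--set` under the new one. A minimal sketch of generating those command pairs instead of typing them by hand, assuming Python is available; `filiator_commands` is a hypothetical helper (not part of DSpace) that only builds command strings from the `--remove`, `--set`, `--parent` and `--child` options shown above:

```python
from typing import List, Optional


def filiator_commands(
    child: str, new_parent: str, old_parent: Optional[str] = None
) -> List[str]:
    """Build the community-filiator commands to re-parent one community.

    Hypothetical helper: it only formats strings and does not run DSpace.
    """
    commands = []
    if old_parent is not None:
        # Detach from the current parent first, as in the moves above.
        commands.append(
            f"dspace community-filiator --remove --parent={old_parent} --child={child}"
        )
    # Attach under the new parent.
    commands.append(
        f"dspace community-filiator --set --parent={new_parent} --child={child}"
    )
    return commands


if __name__ == "__main__":
    # One of the moves from the notes above, printed for review before running.
    for command in filiator_commands(
        "10568/1208", "10568/117865", old_parent="10568/83389"
    ):
        print(command)
```

  Printing the commands rather than executing them keeps a manual review step before anything touches the community hierarchy.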
## 2022-02-02

- Looking at the top user agents and IP addresses in CGSpace's Solr statistics for 2022-01
  - One IP made 26,000 requests; it's owned by Qualys, so it's some kind of security scanning
  - Another IP made 8,000 requests; it's owned by some Russian company and makes requests like this, hmmmmm:

```console
- - [12/Jan/2022:06:25:27 +0100] "GET /bitstream/handle/10568/81964/varietal-2faea58f.pdf?sequence=1 HTTP/1.1" 200 1157807 "https://cgspace.cgiar.org:443/bitstream/handle/10568/81964/varietal-2faea58f.pdf" "Opera/9.64 (Windows NT 6.1; U; MRA 5.5 (build 02842); ru) Presto/2.1.1)) AND 4734=CTXSYS.DRITHSX.SN(4734,(CHR(113)||CHR(120)||CHR(120)||CHR(112)||CHR(113)||(SELECT (CASE WHEN (4734=4734) THEN 1 ELSE 0 END) FROM DUAL)||CHR(113)||CHR(120)||CHR(113)||CHR(122)||CHR(113))) AND ((3917=3917"
```

  - Another IP made 3,000 requests, mostly for one CIAT collection on the REST API, and it is owned by Amazon
    - The user agent is sometimes a normal user one, and sometimes `Apache-HttpClient/4.3.4 (java 1.5)`
  - Another IP made 2,400 requests and is on OVH
- I purged these hits:

```console
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
Purging 26817 hits from in statistics
Purging 9446 hits from in statistics
Purging 6490 hits from in statistics
Purging 11949 hits from in statistics

Total number of bot hits purged: 54702
```

- Export donors and affiliations from CGSpace database:

```console
localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-donors.csv WITH CSV HEADER;
COPY 1036
localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-affiliations.csv WITH CSV HEADER;
COPY 7901
```

- Then check matches against the latest ROR dump:

```console
$ csvcut -c cg.contributor.donor /tmp/2022-02-02-donors.csv | sed '1d' > /tmp/2022-02-02-donors.txt
$ ./ilri/ror-lookup.py -i /tmp/2022-02-02-donors.txt -r 2021-09-23-ror-data.json -o /tmp/donor-ror-matches.csv
...
```

- I see we have 258/1036 (24.9%) of our donors matching ROR (as of the 2021-09-23 ROR dump)
- I see we have 1986/7901 (25.1%) of our affiliations matching ROR (as of the 2021-09-23 ROR dump)
- Update the PostgreSQL JDBC driver to 42.3.2 in the Ansible Infrastructure playbooks and deploy on DSpace Test
- Mishell from CIP sent me a copy of a security scan their ICT had done on CGSpace using QualysGuard
  - The report was very long and generic, highlighting low-severity things like being able to post crap to search forms and have it appear on the results page
  - Also they say we're using old jQuery and Bootstrap, etc (fair enough) but there are no exploits per se
  - At least now I know why all those Qualys IPs are scanning us all the time!!!
- Mishell also said she's having issues logging into CGSpace
  - According to the logs her account is failing on LDAP authentication
  - I checked CGSpace's LDAP credentials using ldapsearch and was able to connect so it's gotta be something with her account

## 2022-02-03

- I synchronized DSpace Test with a fresh snapshot of CGSpace
- I noticed a bunch of thumbnails missing for items submitted in the last week on CGSpace so I ran the `dspace filter-media` script manually and eventually it crashed:

```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media
...
SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.txt' already exists
Generated Thumbnail ilri_establishiment.pdf matches pattern and is replacable.
SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.jpg' already exists
File: Agreement_on_the_Estab_of_ILRI.doc.txt
Exception: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
        at org.textmining.extraction.word.model.FormattedDiskPage.<init>(FormattedDiskPage.java:66)
        at org.textmining.extraction.word.model.CHPFormattedDiskPage.<init>(CHPFormattedDiskPage.java:62)
        at org.textmining.extraction.word.model.CHPBinTable.<init>(CHPBinTable.java:70)
        at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:122)
        at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:63)
        at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:83)
        at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
        at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
        at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
        at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
        at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersAllItems(MediaFilterServiceImpl.java:111)
        at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:212)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```

- I should look up that issue and report a bug somewhere perhaps, but for now I just forced the JPG thumbnails with:

```console
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media.log
```
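
- A rough sketch of how a filter-media log like the one written above could be summarized afterwards. It assumes the log uses the same `SKIPPED: bitstream ... (item: ...)` and `Exception:` line shapes seen in the earlier crash output, which may not hold for every plugin or verbosity level; `summarize_filter_media_log` is a hypothetical helper name:

```python
import re
from collections import Counter
from typing import Iterable, Set, Tuple

# Line shape assumed from the filter-media output quoted in these notes.
SKIPPED_RE = re.compile(r"^SKIPPED: bitstream (\S+) \(item: ([^)]+)\)")


def summarize_filter_media_log(lines: Iterable[str]) -> Tuple[Counter, Set[str]]:
    """Count skipped bitstreams and exceptions, and collect skipped item handles."""
    counts: Counter = Counter()
    skipped_items: Set[str] = set()
    for raw in lines:
        line = raw.strip()
        match = SKIPPED_RE.match(line)
        if match:
            counts["skipped"] += 1
            skipped_items.add(match.group(2))  # item handle, e.g. 10568/67391
        elif line.startswith("Exception:"):
            counts["exceptions"] += 1
    return counts, skipped_items


if __name__ == "__main__":
    # Sample lines copied from the crash output above.
    sample = [
        "SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) "
        "because 'ilri_establishiment.pdf.txt' already exists",
        "SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) "
        "because 'ilri_establishiment.pdf.jpg' already exists",
        "Exception: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I",
    ]
    counts, items = summarize_filter_media_log(sample)
    print(dict(counts), sorted(items))
```

  To run it against the real log, pass an open file handle instead of the sample list, for example `open("/tmp/filter-media.log", errors="replace")`.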