February, 2022
+ +2022-02-01
+-
+
- Meeting with Peter and Abenet about CGSpace in the One CGIAR
+
-
+
- We agreed to buy $5,000 worth of credits from Atmire for future upgrades +
- We agreed to move CRPs and non-CGIAR communities off the home page, as well as some other things for the CGIAR System Organization +
- We agreed to make a Discovery facet for CGIAR Action Areas above the existing CGIAR Impact Areas one +
- We agreed to try to do more alignment of affiliations/funders with ROR +
+
-
+
- I moved a bunch of communities: +
$ dspace community-filiator --remove --parent=10568/114639 --child=10568/115089
+$ dspace community-filiator --remove --parent=10568/114639 --child=10568/115087
+$ dspace community-filiator --remove --parent=10568/83389 --child=10568/108598
+$ dspace community-filiator --remove --parent=10568/83389 --child=10947/1
+$ dspace community-filiator --set --parent=10568/35697 --child=10568/80211
+$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2517
+$ dspace community-filiator --set --parent=10568/97114 --child=10947/2517
+$ dspace community-filiator --set --parent=10568/97114 --child=10568/89416
+$ dspace community-filiator --set --parent=10568/97114 --child=10568/3530
+$ dspace community-filiator --set --parent=10568/97114 --child=10568/80099
+$ dspace community-filiator --set --parent=10568/97114 --child=10568/80100
+$ dspace community-filiator --set --parent=10568/97114 --child=10568/34494
+$ dspace community-filiator --set --parent=10568/117867 --child=10568/114644
+$ dspace community-filiator --set --parent=10568/117867 --child=10568/16573
+$ dspace community-filiator --set --parent=10568/117867 --child=10568/42211
+$ dspace community-filiator --set --parent=10568/117865 --child=10568/109945
+$ dspace community-filiator --set --parent=10568/117865 --child=10568/16498
+$ dspace community-filiator --set --parent=10568/117865 --child=10568/99453
+$ dspace community-filiator --set --parent=10568/117865 --child=10568/2983
+$ dspace community-filiator --set --parent=10568/117865 --child=10568/133
+$ dspace community-filiator --remove --parent=10568/83389 --child=10568/1208
+$ dspace community-filiator --set --parent=10568/117865 --child=10568/1208
+$ dspace community-filiator --remove --parent=10568/83389 --child=10568/56924
+$ dspace community-filiator --set --parent=10568/117865 --child=10568/56924
+$ dspace community-filiator --remove --parent=10568/83389 --child=10568/91688
+$ dspace community-filiator --set --parent=10947/1 --child=10568/91688
+$ dspace community-filiator --remove --parent=10568/83389 --child=10947/2515
+$ dspace community-filiator --set --parent=10947/1 --child=10947/2515
+
-
+
- Remove CPWF and CTA subjects from the Discovery facets +
- Start a full Discovery index on CGSpace: +
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
+
+real 275m15.777s
+user 182m52.171s
+sys 2m51.573s
+
-
+
- I got a request to confirm validation of CGSpace on openarchives.org, with the requestor’s IP being 128.84.116.66
+
-
+
- That is at Cornell… hmmmm who could that be?! +
- Oh, the OpenArchives initiative is at Cornell… maybe this is an automated periodic check? +
+
2022-02-02
+-
+
- Looking at the top user agents and IP addresses in CGSpace’s Solr statistics for 2022-01
+
-
+
- 64.39.98.40 made 26,000 requests, owned by Qualys so it’s some kind of security scanning +
- 45.134.26.171 made 8,000 requests and it’s own by some Russian company and makes requests like this hmmmmm: +
+
45.134.26.171 - - [12/Jan/2022:06:25:27 +0100] "GET /bitstream/handle/10568/81964/varietal-2faea58f.pdf?sequence=1 HTTP/1.1" 200 1157807 "https://cgspace.cgiar.org:443/bitstream/handle/10568/81964/varietal-2faea58f.pdf" "Opera/9.64 (Windows NT 6.1; U; MRA 5.5 (build 02842); ru) Presto/2.1.1)) AND 4734=CTXSYS.DRITHSX.SN(4734,(CHR(113)||CHR(120)||CHR(120)||CHR(112)||CHR(113)||(SELECT (CASE WHEN (4734=4734) THEN 1 ELSE 0 END) FROM DUAL)||CHR(113)||CHR(120)||CHR(113)||CHR(122)||CHR(113))) AND ((3917=3917"
+
-
+
- 3.225.28.105 made 3,000 requests mostly for one CIAT collection on the REST API and it is owned by Amazon
+
-
+
- The user agent is sometimes a normal user one, and sometimes
Apache-HttpClient/4.3.4 (java 1.5)
+
+ - The user agent is sometimes a normal user one, and sometimes
- 217.182.21.193 made 2,400 requests and is on OVH +
- I purged these hits +
$ ./ilri/check-spider-ip-hits.sh -f /tmp/ips.txt -p
+Purging 26817 hits from 64.39.98.40 in statistics
+Purging 9446 hits from 45.134.26.171 in statistics
+Purging 6490 hits from 3.225.28.105 in statistics
+Purging 11949 hits from 217.182.21.193 in statistics
+
+Total number of bot hits purged: 54702
+
-
+
- Export donors and affiliations from CGSpace database: +
localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.donor", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 248 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-donors.csv WITH CSV HEADER;
+COPY 1036
+localhost/dspace63= ☘ \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2022-02-02-affiliations.csv WITH CSV HEADER;
+COPY 7901
+
-
+
- Then check matches against the latest ROR dump: +
$ csvcut -c cg.contributor.donor /tmp/2022-02-02-donors.csv | sed '1d' > /tmp/2022-02-02-donors.txt
+$ ./ilri/ror-lookup.py -i /tmp/2022-02-02-donors.txt -r 2021-09-23-ror-data.json -o /tmp/donor-ror-matches.csv
+...
+
-
+
- I see we have 258/1036 (24.9%) of our donors matching ROR (as of the 2021-09-23 ROR dump) +
- I see we have 1986/7901 (25.1%) of our affiliations matching ROR (as of the 2021-09-23 ROR dump) +
- Update the PostgreSQL JDBC driver to 42.3.2 in the Ansible Infrastructure playbooks and deploy on DSpace Test +
- Mishell from CIP sent me a copy of a security scan their ICT had done on CGSpace using QualysGuard
+
-
+
- The report was very long and generic, highlighting low-severity things like being able to post crap to search forms and have it appear on the results page +
- Also they say we’re using old jQuery and bootstrap, etc (fair enough) but there are no exploits per se +
- At least now I know why all those Qualys IPs are scanning us all the time!!! +
+ - Mishell also said she’s having issues logging into CGSpace
+
-
+
- According to the logs her account is failing on LDAP authentication +
- I checked CGSpace’s LDAP credentials using ldapsearch and was able to connect so it’s gotta be something with her account +
+
2022-02-03
+-
+
- I synchronized DSpace Test with a fresh snapshot of CGSpace +
- I noticed a bunch of thumbnails missing for items submitted in the last week on CGSpace so I ran the
dspace filter-media
script manually and eventually it crashed:
+
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media
+...
+SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.txt' already exists
+Generated Thumbnail ilri_establishiment.pdf matches pattern and is replacable.
+SKIPPED: bitstream 48612de7-eec5-4990-8f1b-589a87219a39 (item: 10568/67391) because 'ilri_establishiment.pdf.jpg' already exists
+File: Agreement_on_the_Estab_of_ILRI.doc.txt
+Exception: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
+java.lang.NoSuchMethodError: org.apache.poi.util.LittleEndian.getUnsignedByte([BI)I
+ at org.textmining.extraction.word.model.FormattedDiskPage.<init>(FormattedDiskPage.java:66)
+ at org.textmining.extraction.word.model.CHPFormattedDiskPage.<init>(CHPFormattedDiskPage.java:62)
+ at org.textmining.extraction.word.model.CHPBinTable.<init>(CHPBinTable.java:70)
+ at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:122)
+ at org.textmining.extraction.word.Word97TextExtractor.getText(Word97TextExtractor.java:63)
+ at org.dspace.app.mediafilter.WordFilter.getDestinationStream(WordFilter.java:83)
+ at com.atmire.dspace.app.mediafilter.AtmireMediaFilter.processBitstream(AtmireMediaFilter.java:103)
+ at com.atmire.dspace.app.mediafilter.AtmireMediaFilterServiceImpl.filterBitstream(AtmireMediaFilterServiceImpl.java:61)
+ at org.dspace.app.mediafilter.MediaFilterServiceImpl.filterItem(MediaFilterServiceImpl.java:181)
+ at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersItem(MediaFilterServiceImpl.java:159)
+ at org.dspace.app.mediafilter.MediaFilterServiceImpl.applyFiltersAllItems(MediaFilterServiceImpl.java:111)
+ at org.dspace.app.mediafilter.MediaFilterCLITool.main(MediaFilterCLITool.java:212)
+ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+ at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:498)
+ at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
+ at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
+
-
+
- I should look up that issue and report a bug somewhere perhaps, but for now I just forced the JPG thumbnails with: +
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -p "ImageMagick PDF Thumbnail" -v >& /tmp/filter-media.log
+
2022-02-04
+-
+
- I found a thread on the dspace-tech mailing list about the
media-filter
crash above +-
+
- The problem is that the default filter for Word files is outdated, so we need to switch to the PoiWordFilter extractor +
- After changing that I was able to filter the Word file on that item above: +
+
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace filter-media -i 10568/67391 -p "Word Text Extractor" -v
+The following MediaFilters are enabled:
+Full Filter Name: org.dspace.app.mediafilter.PoiWordFilter
+org.dspace.app.mediafilter.PoiWordFilter
+File: Agreement_on_the_Estab_of_ILRI.doc.txt
+
+FILTERED: bitstream 31db7d05-5369-4309-adeb-3b888c80b73d (item: 10568/67391) and created 'Agreement_on_the_Estab_of_ILRI.doc.txt'
+
-
+
- Meeting with the repositories working group to discuss issues moving forward in the One CGIAR +
2022-02-07
+-
+
- Gaia sent me her feedback on the duplicates for the TAC and ICW items for CGSpace a few days ago
+
-
+
- I used the IDs marked “delete” in her spreadsheet to create a custom text facet with this GREL in OpenRefine: +
+
or(
+isNotNull(value.match('1')),
+isNotNull(value.match('4')),
+isNotNull(value.match('5')),
+isNotNull(value.match('6')),
+isNotNull(value.match('8')),
+...
+sNotNull(value.match('178')),
+isNotNull(value.match('186')),
+isNotNull(value.match('188')),
+isNotNull(value.match('189')),
+isNotNull(value.match('197'))
+)
+
-
+
- Then I flagged all of these (seventy-five items)…
+
-
+
- I decided to flag the deletes instead of star the keeps because there are some items in the original file that we not marked as duplicates so we have to keep those too +
+ - I generated the next batch of 200 items, from IDs 201 to 400, checked them for duplicates, and then added the PDF file names to the CSV for reference: +
$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/tac.csv
+$ ./ilri/check-duplicates.py -i /tmp/tac.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' -o /tmp/2022-02-07-tac-batch2-201-400.csv
+$ csvcut -c id,filename ~/Downloads/2022-01-21-CGSpace-TAC-ICW-batch201-400.csv > /tmp/batch2-filenames.csv
+$ csvjoin -c id /tmp/2022-02-07-tac-batch2-201-400.csv /tmp/batch2-filenames.csv > /tmp/2022-02-07-tac-batch2-201-400-filenames.csv
+
-
+
- Then I sent this second batch of items to Gaia to look at +
2022-02-08
+-
+
- Create a SAF archive for the first 200 items (IDs 1 to 200) that were not flagged as duplicates and upload them to a new collection on DSpace Test: +
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" dspace import --add --eperson=bngo@mfin.com --source /tmp/SimpleArchiveFormat --mapfile=./2022-02-08-tac-batch1-1to200.map
+
-
+
- Fix some occurrences of “Hammond, Jim” to be “Hammond, James” on CGSpace +
- Start a full index on AReS +
2022-02-09
+-
+
- UptimeRobot said that CGSpace was down yesterday evening, but when I looked it was up and I didn’t see a high database load or anything wrong +
- Maria from Bioversity wrote to say that CGSpace was very slow also… +
2022-02-10
+-
+
- Looking at the Munin graphs on CGSpace I see several metrics showing that there was likely just increased load… +
+ + +
+-
+
- I extract the logs from nginx for yesterday so I can analyze the traffic: +
# zcat --force /var/log/nginx/access.log.1 /var/log/nginx/access.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-access.log
+# zcat --force /var/log/nginx/rest.log.1 /var/log/nginx/rest.log.2.gz | grep '09/Feb/2022' > /tmp/feb9-rest.log
+# awk '{print $1}' /tmp/feb9-* | less | sort -u > /tmp/feb9-ips.txt
+# wc -l /tmp/feb9-ips.txt
+11636 /tmp/feb9-ips.tx
+
-
+
- I started resolving them with my
resolve-addresses-geoip2.py
script
+ - In the mean time I am looking at the requests and I see a new user agent:
1science Resolver 1.0.0
+-
+
- Seems to be a defunct project from Elsevier (website down, Twitter account inactive since 2020) +
+ - I also see 3,400 requests from
EyeMonIT_bot_version_0.1_(http://www.eyemon.it/)
, but because it has “bot” in the name it gets heavily throttled… +-
+
- I wonder who is monitoring CGSpace with that service… +
+ - Looking at the top twenty or so ASNs for the resolved IPs I see lots of bot traffic, but nothing malicious: +
$ csvcut -c asn /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
+ 79 24940
+ 89 36908
+ 100 9299
+ 107 2635
+ 110 44546
+ 111 16509
+ 118 7552
+ 120 4837
+ 123 50245
+ 123 55836
+ 147 45899
+ 173 33771
+ 192 39832
+ 202 32934
+ 235 29465
+ 260 15169
+ 466 14618
+ 607 24757
+ 768 714
+ 1214 8075
+
-
+
- The same information, but by org name: +
$ csvcut -c org /tmp/feb9-ips.csv | sort | uniq -c | sort -h | tail -n 20
+ 92 Orange
+ 100 Hetzner Online GmbH
+ 100 Philippine Long Distance Telephone Company
+ 107 AUTOMATTIC
+ 110 ALFA TELECOM s.r.o.
+ 111 AMAZON-02
+ 118 Viettel Group
+ 120 CHINA UNICOM China169 Backbone
+ 123 Reliance Jio Infocomm Limited
+ 123 Serverel Inc.
+ 147 VNPT Corp
+ 173 SAFARICOM-LIMITED
+ 192 Opera Software AS
+ 202 FACEBOOK
+ 235 MTN NIGERIA Communication limited
+ 260 GOOGLE
+ 466 AMAZON-AES
+ 607 Ethiopian Telecommunication Corporation
+ 768 APPLE-ENGINEERING
+ 1214 MICROSOFT-CORP-MSN-AS-BLOCK
+
-
+
- Most of these are pretty normal except “Serverel” and Hetzner perhaps, but their user agents are pretending to be normal users so who knows… +
- I decided to look in the Solr stats with
facet.limit=1000&facet.mincount=1
and found a few more definitely non-human agents: +-
+
- scalaj-http/2.4.2 +
- scpitspi-rs +
- lua-resty-http +
- AHC/2.1 +
- acebookexternalhit <—- typo, but purge it!!! +
- Iframely/1.3.1 (+https://iframely.com/docs/about) Atlassian +
- qbhttp/1.0.0 +
- got (https://github.com/sindresorhus/got) +
- colly - https://github.com/gocolly/colly/v2 +
- article-parser/4.2.10 +
- SomeRandomText +
- adreview/1.0 +
+ - I added them to the ILRI override in the DSpace spider list and ran the
check-spider-hits.sh
script:
+
$ ./ilri/check-spider-hits.sh -f dspace/config/spiders/agents/ilri -p
+Purging 234 hits from randint in statistics
+Purging 337 hits from Koha in statistics
+Purging 1164 hits from scalaj-http in statistics
+Purging 1528 hits from scpitspi-rs in statistics
+Purging 3050 hits from lua-resty-http in statistics
+Purging 1683 hits from AHC in statistics
+Purging 1129 hits from acebookexternalhit in statistics
+Purging 534 hits from Iframely in statistics
+Purging 1022 hits from qbhttp in statistics
+Purging 330 hits from ^got in statistics
+Purging 156 hits from ^colly in statistics
+Purging 38 hits from article-parser in statistics
+Purging 1148 hits from SomeRandomText in statistics
+Purging 3126 hits from adreview in statistics
+
+Total number of bot hits purged: 14479
+
-
+
- I don’t have time right now to add any of these to the COUNTER-Robots list… +
- Peter asked me to add a new item type on CGSpace: Opinion Piece +
- Map an item on CGSpace for Maria since she couldn’t find it in the item mapper +