- Export all affiliations on CGSpace and run them against the latest ROR data dump:
```console
localhost/dspace63= > \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c 1 /tmp/2021-10-01-affiliations.csv | sed 1d > /tmp/2021-10-01-affiliations.txt
```
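- A quick way to see how many affiliations match ROR names exactly is to extract the names from the ROR JSON dump and compare the two lists (a minimal sketch, assuming the dump is the v1-style JSON array with a top-level `name` field; file names are hypothetical):

```console
$ # extract the primary organization names from the ROR dump
$ jq -r '.[].name' /tmp/ror-data.json | sort -u > /tmp/ror-names.txt
$ sort -u /tmp/2021-10-01-affiliations.txt > /tmp/affiliations-sorted.txt
$ # lines common to both files are exact matches
$ comm -12 /tmp/affiliations-sorted.txt /tmp/ror-names.txt | wc -l
```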
- Start looking at the last month of Solr statistics on CGSpace
- I see a number of IPs with "normal" user agents who clearly behave like bots
- 198.15.130.18: 21,000 requests to /discover with a normal-looking user agent, from ASN 11282 (SERVERYOU, US)
- 93.158.90.107: 8,500 requests to handle and browse links with a Firefox 84.0 user agent, from ASN 12552 (IPO-EU, SE)
- 193.235.141.162: 4,800 requests to handle, browse, and discovery links with a Firefox 84.0 user agent, from ASN 51747 (INTERNETBOLAGET, SE)
- 3.225.28.105: 2,900 requests to REST API for the CIAT Story Maps collection with a normal user agent, from ASN 14618 (AMAZON-AES, US)
- 34.228.236.6: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- 18.212.137.2: 2,800 requests to discovery for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- 3.81.123.72: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- 3.227.16.188: 2,800 requests to discovery and handles for the CGIAR System community with user agent `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)`, from ASN 14618 (AMAZON-AES, US)
- Looking closer into the requests with this Mozilla/4.0 user agent, I see 500+ IPs using it:
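- One way to get that list is to facet on `ip` for requests with that user agent in the Solr statistics core (a sketch; the host, port, and core name depend on the local setup):

```console
$ curl -s -G 'http://localhost:8081/solr/statistics/select' \
    --data-urlencode 'q=userAgent:"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.field=ip' \
    --data-urlencode 'facet.limit=-1' \
    --data-urlencode 'facet.mincount=1' \
    --data-urlencode 'wt=json'
```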
- Thinking about how we could check for duplicates before importing
- I found out that [PostgreSQL has a built-in similarity function](https://www.freecodecamp.org/news/fuzzy-string-matching-with-postgresql/):
```console
localhost/dspace63= > CREATE EXTENSION pg_trgm;
localhost/dspace63= > SELECT metadata_value_id, text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Molecular marker based genetic diversity assessment of Striga resistant maize inbred lines') > 0.5;
$ csv-metadata-quality -i /tmp/iwmi-to-check.csv -o /tmp/iwmi.csv | tee /tmp/out.log
$ xsv split -s 2000 /tmp /tmp/iwmi.csv
```
- I noticed each CSV only had 10 or 20 corrections; most notably, none of the duplicate metadata values were removed in the CSVs...
- I cut a subset of the fields from the main CSV and tried again, but DSpace said "no changes detected"
- The duplicates are definitely removed from the CSV, but DSpace doesn't detect the removals as changes
- I realized this is an issue I've had before, but forgot because I usually use csv-metadata-quality for new items, not ones already inside DSpace!
- I found a comment on a thread on the dspace-tech mailing list from helix84 in 2015 ("No changes were detected" when importing metadata via XMLUI) where he says:
> It's very likely that multiple values in a single field are being compared as an unordered set rather than an ordered list.
> Try doing it in two imports. In first import, remove all authors. In second import, add them in the new order.
- I decided to upload the cleaned IWMI community metadata by temporarily moving the cleaned field from `dcterms.subject[en_US]` to `dcterms.subject[en_Fu]`, uploading it, then moving it back and uploading again
- I started by copying just a handful of fields from the iwmi.csv community export:
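- Something like this, with a hypothetical column selection (the real export has more fields), temporarily renaming the cleaned column to `en_Fu`:

```console
$ csvcut -c 'id,collection,dcterms.subject[en_US]' /tmp/iwmi.csv > /tmp/iwmi-subjects.csv
$ # temporarily rename the cleaned column so DSpace sees it as a change
$ sed -i '1s/dcterms.subject\[en_US\]/dcterms.subject[en_Fu]/' /tmp/iwmi-subjects.csv
```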
- It takes a few hours per 2,000 items because DSpace processes them so slowly... sigh...
## 2021-10-08
- I decided to update these records in PostgreSQL instead of via several CSV batches, as there were several others to normalize too:
```console
cgspace=# SELECT DISTINCT text_lang, count(text_lang) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) GROUP BY text_lang ORDER BY count DESC;
text_lang | count
-----------+---------
en_US | 2603711
en_Fu | 115568
en | 8818
| 5286
fr | 2
vn | 2
| 0
(7 rows)
cgspace=# BEGIN;
cgspace=# UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item) AND text_lang IN ('en_Fu', 'en', '');
UPDATE 129673
cgspace=# COMMIT;
```
- So all this effort was only to remove ~400 duplicate metadata values in the IWMI community... hmmm:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
391
```
- I tried to export ILRI's community, but ran into the export bug (DS-4211)
- After applying the patch on my local instance I was able to export, but found many duplicate items in the CSV (as I also noticed in 2021-02):
```console
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sed '1d' | wc -l
32070
$ csvcut -c id /tmp/ilri-duplicate-metadata.csv | sort -u | sed '1d' | wc -l
19315
```
- It seems there are only about 200 duplicate values in this subset of fields in ILRI's community:
```console
$ grep -c 'Removing duplicate value' /tmp/out.log
220
```
- I found a cool way to select only the items with corrections
- First, extract a handful of fields from the CSV with csvcut
- Second, clean the CSV with csv-metadata-quality
- Third, rename the columns to something obvious in the cleaned CSV
- Fourth, use csvjoin to merge the cleaned file with the original
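- A minimal sketch of those four steps, with hypothetical file names and field choices:

```console
$ # 1. extract a handful of fields from the community export
$ csvcut -c 'id,dc.title[en_US],dcterms.subject[en_US]' /tmp/ilri.csv > /tmp/ilri-to-check.csv
$ # 2. clean the extracted fields
$ csv-metadata-quality -i /tmp/ilri-to-check.csv -o /tmp/ilri-cleaned.csv
$ # 3. rename the cleaned columns so they are obvious after the join
$ sed -i '1s/\[en_US\]/[en_US] (cleaned)/g' /tmp/ilri-cleaned.csv
$ # 4. join the cleaned columns back onto the original export on the id column
$ csvjoin -c id /tmp/ilri.csv /tmp/ilri-cleaned.csv > /tmp/ilri-joined.csv
```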
- Delete Atmire migrations and some others that were "unresolved":
```console
$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';"
$ psql -h localhost -p 5433 -U postgres dspace7 -c "DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');"
```
- Now DSpace 7 starts with my CGSpace data... nice
- The only thing I did on 2021-10-07 was import a few thousand metadata corrections...
- I restarted PostgreSQL (instead of restarting Tomcat), so let's see if that helps
- I filed [a bug for the DSpace 6/7 duplicate values metadata import issue](https://github.com/DSpace/DSpace/issues/7989)
- I tested the two patches for removing abandoned submissions from the workflow, but unfortunately it seems that they are for the configurable (aka XML) workflow, and we are using the basic workflow
- I discussed PostgreSQL issues with some people on the DSpace Slack
- Looking at postgresqltuner.pl and https://pgtune.leopard.in.ua I realized that there were some settings that I hadn't changed in a few years and that I probably need to re-evaluate
- For example, `random_page_cost` is recommended to be 1.1 for SSDs in the PostgreSQL 10 docs (the default is 4.0, but we have been using 1 since 2017, when it came up on Hacker News)
- Also, `effective_io_concurrency` is recommended to be "hundreds" if you are using an SSD (default is 1)
- I also enabled the `pg_stat_statements` extension to try to understand what queries are being run the most often, and how long they take
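- For reference, roughly what those changes and the extension setup look like (example values only; note that on PostgreSQL 13+ the `pg_stat_statements` column is `total_exec_time` instead of `total_time`):

```console
# postgresql.conf excerpt (example values from the tuning guides)
random_page_cost = 1.1
effective_io_concurrency = 200
shared_preload_libraries = 'pg_stat_statements'
# after restarting PostgreSQL, enable the extension and look at the slowest queries
cgspace=# CREATE EXTENSION pg_stat_statements;
cgspace=# SELECT calls, round(total_time::numeric, 2) AS total_ms, query FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;
```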
## 2021-10-12
- I looked again at the duplicate items query I was doing with trigrams recently and found a few new things
- Looking at the `EXPLAIN ANALYZE` plan for the query I noticed it wasn't using any indexes
- I [read on StackExchange](https://dba.stackexchange.com/questions/103821/best-index-for-similarity-function/103823) that, if we want to make use of indexes, we need to use the similarity operator (`%`), not the function `similarity()` because "index support is bound to operators in Postgres, not to functions"
- A note about the query plan output is that we need to read it from the bottom up!
- So with the similarity operator we need to set the threshold like this now:
```console
localhost/dspace= > SET pg_trgm.similarity_threshold = 0.5;
```
- Next I experimented with using GIN or GiST indexes on `metadatavalue` (something like the sketch below), but they were slower than the existing DSpace indexes
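- A trigram GIN index on `text_value` would look something like this (the GiST variant uses `gist_trgm_ops`), and wrapping the query in `EXPLAIN ANALYZE` shows whether the planner actually picks it up:

```console
localhost/dspace= > CREATE INDEX metadatavalue_text_value_trgm_gin_idx ON metadatavalue USING gin (text_value gin_trgm_ops);
localhost/dspace= > EXPLAIN ANALYZE SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
localhost/dspace= > DROP INDEX metadatavalue_text_value_trgm_gin_idx;
```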
- I tested a few variations of the query I had been using and found it's _much_ faster if I use the similarity operator and keep the condition that object IDs are in the item table...
```console
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 739.948 ms
```
- Now this script runs in four minutes (versus twenty-four!) and it still finds the same seven duplicates! Amazing!
- I still don't understand the differences in the query plan well enough, but I see it is using the DSpace default indexes and the results are accurate
- So to summarize, from the best to the worst query, all returning the same result:
```console
localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 683.165 ms
Time: 635.364 ms
Time: 674.666 ms
localhost/dspace= > DISCARD ALL;
localhost/dspace= > SET pg_trgm.similarity_threshold = 0.6;
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND text_value % 'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas';
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 1584.765 ms (00:01.585)
Time: 1665.594 ms (00:01.666)
Time: 1623.726 ms (00:01.624)
localhost/dspace= > DISCARD ALL;
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
Traditional knowledge affects soil management ability of smallholder farmers in marginal areas │ 7af059af-9cd7-431b-8a79-7514896ca7dc
(1 row)
Time: 4028.939 ms (00:04.029)
Time: 4022.239 ms (00:04.022)
Time: 4061.820 ms (00:04.062)
localhost/dspace= > DISCARD ALL;
localhost/dspace= > SELECT text_value, dspace_object_id FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id=64 AND SIMILARITY(text_value,'Traditional knowledge affects soil management ability of smallholder farmers in marginal areas') > 0.6;
```
- I started doing some tests to upgrade Elasticsearch from 7.6.2 to 7.7, 7.8, 7.9, and eventually 7.10 on OpenRXV
- I tested harvesting, reporting, filtering, and various admin actions with each version and they all worked fine, with no errors in any logs as far as I can see
- This fixes a bunch of issues, updates Java from 13 to 15, and updates the base image from CentOS 7 to 8, so it pays off a decent amount of technical debt!
- I even tried Elasticsearch 7.13.2, which has Java 16, and it works fine...
- I submitted a pull request: https://github.com/ilri/OpenRXV/pull/126
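- The core of the change is bumping the Elasticsearch image tag in OpenRXV's docker-compose.yml, roughly like this sketch (not the exact diff from the pull request):

```console
$ # bump the Elasticsearch image tag (versions shown as examples)
$ sed -i 's/elasticsearch:7\.6\.2/elasticsearch:7.10.2/' docker-compose.yml
$ # rebuild and restart the stack
$ docker-compose up --build -d
```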
## 2021-10-20
- Meeting with Big Data and CGIAR repository players about the feasibility of moving to a single repository
- We discussed several options, for example moving all DSpaces to CGSpace along with their permanent identifiers
- The issue would be for centers like IFPRI, which don't use DSpace and have integrations between their current repository and their website, etc.
- The IP is owned by [Internet Vikings](https://internetvikings.com) in Sweden
- I purged their statistics and set up a temporary HTTP 403 telling them to use a real user agent
- I saw another one in Sweden a few days ago (192.36.109.131), also using the exact same user agent as above, but belonging to [Resilans AB](http://webb.resilans.se/)
- I purged another 74,619 hits from this bot
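- Purging hits for an IP boils down to a delete-by-query on the Solr statistics core, something like this sketch (placeholder IP; the host and port depend on the local setup):

```console
$ curl -s 'http://localhost:8081/solr/statistics/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>ip:192.0.2.1</query></delete>'
```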
- I added these two IPs to the nginx IP bot identifier
- Jesus, I found a few Russian IPs attempting SQL injection and path traversal, for example: