---
title: "August, 2020"
date: 2020-08-02T15:35:54+03:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2020-08-02

- I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their `cg.coverage.country` text values
  - It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter's preferred "display" country names)
  - It implements a "force" mode too that will clear existing country codes and re-tag everything
  - It is class based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa...
- The code is currently on my personal GitHub: https://github.com/alanorth/dspace-curation-tasks
  - I still need to figure out how to integrate this with the DSpace build because currently you have to package it and copy the JAR to the `dspace/lib` directory (not to mention the config)
- I forked the [dspace-curation-tasks to ILRI's GitHub](https://github.com/ilri/dspace-curation-tasks) and [submitted the project to Maven Central](https://issues.sonatype.org/browse/OSSRH-59650) so I can integrate it more easily with our DSpace build via dependencies

## 2020-08-03

- Atmire responded to the ticket about the ongoing upgrade issues
  - They pushed an RC2 version of the CUA module that fixes the FontAwesome issue so that they now use classes instead of Unicode hex characters so our JS + SVG works!
  - They also said they have never experienced the `type: 5` site statistics issue, so I need to try to purge those and continue with the stats processing
- I purged all unmigrated stats in a few cores and then restarted processing:

```
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
```

- Andrea from Macaroni Bros emailed me a few days ago to say he's having issues with the CGSpace REST API
  - He said he noticed the issues when they were developing the WordPress plugin to harvest CGSpace for the RTB website: https://www.rtb.cgiar.org/publications/

## 2020-08-04

- Look into the REST API issues that Macaroni Bros raised last week:
  - The first one was about the `collections` endpoint returning empty items:
    - https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=2 (offset=2 is correct)
    - https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=3 (offset=3 is empty)
    - https://cgspace.cgiar.org/rest/collections/1445/items?limit=1&offset=4 (offset=4 is correct again)
  - I confirm that the second link returns zero items on CGSpace...
    - I tested on my local development instance and it returns one item correctly...
    - I tested on DSpace Test (currently DSpace 6 with UUIDs) and it returns one item correctly...
    - Perhaps an indexing issue?
  - The second issue is the `collections` endpoint returning the wrong number of items:
    - https://cgspace.cgiar.org/rest/collections/1445 (numberItems: 63)
    - https://cgspace.cgiar.org/rest/collections/1445/items (real number of items: 61)
  - I confirm that it is indeed happening on CGSpace...
    - And actually I can replicate the same issue on my local CGSpace 5.8 instance:

```
$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
   "numberItems" : 63,
$ http 'http://localhost:8080/rest/collections/1445/items' | jq '. | length'
61
```

- Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:

```
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
   "numberItems" : 61,
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
59
```

- Ah! I exported that collection's metadata and checked it in OpenRefine, where I noticed that two items are mapped twice
  - I dealt with this problem in 2017-01 and the solution is to check the `collection2item` table:

```
dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
   id   | collection_id | item_id
--------+---------------+---------
 133698 |           966 |  107687
 134685 |          1445 |  107687
 134686 |          1445 |  107687
(3 rows)
```

- So for each id you can delete one duplicate mapping:

```
dspace=# DELETE FROM collection2item WHERE id='134686';
dspace=# DELETE FROM collection2item WHERE id='128819';
```

- Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter's preferred display names

```
$ cat 2020-08-04-PB-new-countries.csv
cg.coverage.country,correct
CAPE VERDE,CABO VERDE
COCOS ISLANDS,COCOS (KEELING) ISLANDS
"CONGO, DR","CONGO, DEMOCRATIC REPUBLIC OF"
COTE D'IVOIRE,CÔTE D'IVOIRE
"KOREA, REPUBLIC","KOREA, REPUBLIC OF"
PALESTINE,"PALESTINE, STATE OF"
$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
```

- I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly
  - I started a full Discovery re-indexing

## 2020-08-05

- Port my [dspace-curation-tasks](https://github.com/ilri/dspace-curation-tasks) to DSpace 6 and tag version `6.0-SNAPSHOT`
- I downloaded the [UN M.49](https://unstats.un.org/unsd/methodology/m49/overview/) CSV file to start working on updating the CGSpace regions
  - First issue is they don't version the file so you have no idea when it was released
  - Second issue is that three rows have errors due to not using quotes around "China, Macao Special Administrative Region"
- Bizu said she was having problems approving tasks on CGSpace
  - I looked at the PostgreSQL locks and they have skyrocketed since yesterday:

![PostgreSQL locks day](/cgspace-notes/2020/08/postgres_locks_ALL-day.png)

![PostgreSQL query length day](/cgspace-notes/2020/08/postgres_querylength_ALL-day.png)

- Seems that something happened yesterday afternoon at around 5PM...
  - For now I will just run all updates on the server and reboot it, as I have no idea what causes this issue
  - I had to restart Tomcat 7 three times after the server came back up before all Solr statistics cores came up properly
- I checked the nginx logs around 5PM yesterday to see who was accessing the server:

```
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
```

- I see the Macaroni Bros are using their new user agent for harvesting: `RTB website BOT`
  - But that pattern doesn't match in the nginx bot list or Tomcat's crawler session manager valve because we're only checking for `[Bb]ot`!
  - So they have created thousands of Tomcat sessions:

```
$ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5693
```

- DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don't misuse the resources
  - Perhaps `[Bb][Oo][Tt]`...
- I see another IP, 104.198.96.245, which is also using the "RTB website BOT" user agent, but there are 70,000 hits in Solr from earlier this year before they started using the user agent
  - I purged all the hits from Solr, including a few thousand from 64.62.202.71
- A few more IPs causing lots of Tomcat sessions yesterday:

```
$ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
1585
$ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5691
```

- 38.128.66.10 isn't creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:

```
Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
```

- 64.62.202.71 is using a user agent I've never seen before:

```
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
```

- So now our "bot" regex can't even match that...
  - Unless we change it to `[Bb]\.?[Oo]\.?[Tt]\.?`... which seems to match all variations of "bot" I can think of right now, according to [regexr.com](https://regexr.com/59lpt):

```
RTB website BOT
Altmetribot
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
```

- And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):

```
$ cat dspace.log.2020-08-04 | grep "199.47.87.145" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2777
```

- I will add `Turnitin` to the Tomcat Crawler Session Manager Valve regex as well...
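
The regex comparison from this day's notes can be checked locally instead of on regexr.com. This is a minimal sketch in Python: the two patterns and the user agent strings are copied from the notes, while the helper `missed_by` and the constant names are made up for illustration. Python's `re` is also not the same engine as nginx or Tomcat, so it only approximates how those would behave.

```python
import re

# Narrow pattern currently used in nginx and the Tomcat valve, per the notes
OLD_PATTERN = re.compile(r"[Bb]ot")
# Proposed pattern: "bot" in any case, with an optional dot after each letter
BOT_PATTERN = re.compile(r"[Bb]\.?[Oo]\.?[Tt]\.?")

# User agents listed in the notes
USER_AGENTS = [
    "RTB website BOT",
    "Altmetribot",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)",
    "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)",
]


def missed_by(pattern, agents):
    """Return the user agents that the pattern fails to match anywhere (hypothetical helper)."""
    return [agent for agent in agents if not pattern.search(agent)]


if __name__ == "__main__":
    print("Missed by old pattern:", missed_by(OLD_PATTERN, USER_AGENTS))
    print("Missed by new pattern:", missed_by(BOT_PATTERN, USER_AGENTS))
```

Neither pattern matches the `brokenlinkcheck.com/1.2` agent quoted above, since it contains no variant of "bot", so that one would still need its own entry. The looser pattern is a plain substring match, so it may also catch agents that merely contain those letters.
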
## 2020-08-06

- I have been working on processing the Solr statistics with the Atmire tool on DSpace Test the last few days:
  - statistics:
    - 2,040,385 docs: 2h 28m 49s
  - statistics-2019:
    - 8,960,000 docs: 12h 7s
    - 1,780,575 docs: 2h 7m 29s
  - statistics-2018:
    - 1,970,000 docs: 12h 1m 28s
    - 360,000 docs: 2h 54m 56s (Linode rebooted)
    - 1,110,000 docs: 7h 1m 44s (Restarted Tomcat, oops)
- I decided to start the 2018 core over again, so I re-synced it from CGSpace and started again from the solr-upgrade-statistics-6x tool and now I'm having the same issues with Java heap space that I had last month
  - The process kept crashing due to memory, so I increased the memory to 3072m and finally 4096m...
  - Also, I decided to try to purge all the `-unmigrated` docs that it had found so far to see if that helps...
  - There were about 466,000 records unmigrated so far, most of which were `type: 5` (SITE statistics)
  - Now it is processing again...
- I developed a small Java class called `FixJpgJpgThumbnails` to remove ".jpg.jpg" thumbnails from the `THUMBNAIL` bundle and replace them with their originals from the `ORIGINAL` bundle
  - The code is based on [RemovePNGThumbnailsForPDFs.java](https://github.com/UoW-IRRs/DSpace-Scripts/blob/master/src/main/java/nz/ac/waikato/its/irr/scripts/RemovePNGThumbnailsForPDFs.java) by Andrea Schweer
  - I incorporated it into my dspace-curation-tasks repository, then renamed it to [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers)
  - In testing I found that I can replace ~4,000 thumbnails on CGSpace!

## 2020-08-07

- I improved the `RemovePNGThumbnailsForPDFs.java` a bit more to exclude infographics and original bitstreams larger than 100KiB
  - I ran it on CGSpace and it cleaned up 3,769 thumbnails!
  - Afterwards I ran `dspace cleanup -v` to remove the deleted thumbnails
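
The selection rule for the thumbnail cleanup can be sketched as a plain predicate. This is a rough Python illustration, not the Java class itself. It assumes the 100KiB and infographic exclusions apply to the ".jpg.jpg" cleanup, that a ".jpg.jpg" thumbnail is named after its original plus ".jpg", and that infographics can be recognised from an item type string. The `ThumbnailPair` type, its fields, and `should_replace` are all hypothetical names.

```python
from dataclasses import dataclass

# 100 KiB cut-off for original bitstreams, as mentioned in the notes
MAX_ORIGINAL_BYTES = 100 * 1024


@dataclass
class ThumbnailPair:
    """Hypothetical record pairing a THUMBNAIL bitstream with its ORIGINAL."""
    thumbnail_name: str   # name of the bitstream in the THUMBNAIL bundle
    original_name: str    # name of the bitstream in the ORIGINAL bundle
    original_bytes: int   # size of the original bitstream in bytes
    item_type: str        # assumed item type label, e.g. "Infographic"


def should_replace(pair):
    """Return True if the thumbnail looks like a '.jpg.jpg' copy worth replacing."""
    # Only consider originals that are already JPEGs
    if not pair.original_name.lower().endswith(".jpg"):
        return False
    # Assumed naming: the thumbnail is "<original name>.jpg"
    if pair.thumbnail_name != pair.original_name + ".jpg":
        return False
    # Exclude infographics (assumed to be identifiable by item type)
    if "infographic" in pair.item_type.lower():
        return False
    # Exclude originals larger than 100 KiB
    return pair.original_bytes <= MAX_ORIGINAL_BYTES


if __name__ == "__main__":
    example = ThumbnailPair("cover.jpg.jpg", "cover.jpg", 50 * 1024, "Report")
    print(should_replace(example))
```
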