2020-08-02

- I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha-2 country codes based on their `cg.coverage.country` text values (a rough sketch of the lookup logic follows below)
  - It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter’s preferred “display” country names)
  - It implements a “force” mode too that will clear existing country codes and re-tag everything
  - It is class-based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa…
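
A minimal sketch of that lookup order, with hypothetical class and method names and purely illustrative CGSpace overrides (the real curation task plugs into DSpace's curation framework, which is omitted here):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the lookup order used by the curation task: try the
// ISO 3166-1 English names first, then fall back to CGSpace's own country mapping.
public class CountryCodeLookup {
    private final Map<String, String> isoNames = new HashMap<>();
    private final Map<String, String> cgspaceNames = new HashMap<>();

    public CountryCodeLookup() {
        // Build a NAME -> alpha-2 map from the JDK's ISO 3166-1 data
        for (String alpha2 : Locale.getISOCountries()) {
            String name = new Locale("", alpha2).getDisplayCountry(Locale.ENGLISH);
            isoNames.put(name.toUpperCase(Locale.ENGLISH), alpha2);
        }
        // A few of Peter's preferred "display" names (illustrative values only)
        cgspaceNames.put("COTE D'IVOIRE", "CI");
        cgspaceNames.put("CONGO, DR", "CD");
    }

    public String lookup(String country) {
        String key = country.trim().toUpperCase(Locale.ENGLISH);
        String alpha2 = isoNames.get(key);
        return (alpha2 != null) ? alpha2 : cgspaceNames.get(key);
    }

    public static void main(String[] args) {
        CountryCodeLookup lookup = new CountryCodeLookup();
        System.out.println(lookup.lookup("Kenya"));         // KE
        System.out.println(lookup.lookup("COTE D'IVOIRE")); // CI, via the CGSpace mapping
    }
}
```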

2020-08-03

- Atmire responded to the ticket about the ongoing upgrade issues
  - They pushed an RC2 version of the CUA module that fixes the FontAwesome issue by using classes instead of Unicode hex characters, so our JS + SVG works!
  - They also said they have never experienced the `type: 5` site statistics issue, so I need to try to purge those and continue with the stats processing
- I purged all unmigrated stats in a few cores and then restarted processing:

```console
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
```

- Andrea from Macaroni Bros emailed me a few days ago to say he’s having issues with the CGSpace REST API

2020-08-04

- Look into the REST API issues that Macaroni Bros raised last week:
  - The first one was about the `collections` endpoint returning empty items:
    - I confirm that the second link returns zero items on CGSpace…
    - I tested on my local development instance and it returns one item correctly…
    - I tested on DSpace Test (currently DSpace 6 with UUIDs) and it returns one item correctly…
    - Perhaps an indexing issue?
  - The second issue is the `collections` endpoint returning the wrong number of items:
    - I confirm that it is indeed happening on CGSpace…
    - And actually I can replicate the same issue on my local CGSpace 5.8 instance:

```console
$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
   "numberItems" : 63,
$ http 'http://localhost:8080/rest/collections/1445/items' | jq '. | length'
61
```
- Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:

```console
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
   "numberItems" : 61,
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
59
```
- Ah! I exported that collection’s metadata and checked it in OpenRefine, where I noticed that two items are mapped twice
  - I dealt with this problem in 2017-01 and the solution is to check the `collection2item` table:

```console
dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
   id   | collection_id | item_id
--------+---------------+---------
 133698 |           966 |  107687
 134685 |          1445 |  107687
 134686 |          1445 |  107687
(3 rows)
```

- So for each id you can delete one duplicate mapping:

```console
dspace=# DELETE FROM collection2item WHERE id='134686';
dspace=# DELETE FROM collection2item WHERE id='128819';
```
- Update countries on CGSpace to be closer to ISO 3166-1, with some minor differences based on Peter’s preferred display names:

```console
$ cat 2020-08-04-PB-new-countries.csv
cg.coverage.country,correct
CAPE VERDE,CABO VERDE
COCOS ISLANDS,COCOS (KEELING) ISLANDS
"CONGO, DR","CONGO, DEMOCRATIC REPUBLIC OF"
COTE D'IVOIRE,CÔTE D'IVOIRE
"KOREA, REPUBLIC","KOREA, REPUBLIC OF"
PALESTINE,"PALESTINE, STATE OF"
$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
```

- I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly
- I started a full Discovery re-indexing

2020-08-05

- Port my dspace-curation-tasks to DSpace 6 and tag version `6.0-SNAPSHOT`
- I downloaded the UN M.49 CSV file to start working on updating the CGSpace regions
  - The first issue is that they don’t version the file, so you have no idea when it was released
  - The second issue is that three rows have errors because “China, Macao Special Administrative Region” is not quoted, so the embedded comma shifts the remaining columns
- Bizu said she was having problems approving tasks on CGSpace
  - I looked at the PostgreSQL locks and they have skyrocketed since yesterday:
- Seems that something happened yesterday afternoon at around 5PM…
  - For now I will just run all updates on the server and reboot it, as I have no idea what causes this issue
  - I had to restart Tomcat 7 three times after the server came back up before all Solr statistics cores came up properly
- I checked the nginx logs around 5PM yesterday to see who was accessing the server:

```console
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
```

- I see the Macaroni Bros are using their new user agent for harvesting: `RTB website BOT`
  - But that pattern doesn’t match in the nginx bot list or Tomcat’s Crawler Session Manager Valve because we’re only checking for `[Bb]ot`!
  - So they have created thousands of Tomcat sessions:

```console
$ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5693
```
- DSpace itself uses a case-sensitive regex for user agents, so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don’t misuse the resources
- I see another IP, 104.198.96.245, which is also using the “RTB website BOT” user agent, but there are 70,000 hits in Solr from earlier this year, before they started using the user agent
  - I purged all the hits from Solr, including a few thousand from 64.62.202.71
- A few more IPs causing lots of Tomcat sessions yesterday:

```console
$ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
1585
$ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
5691
```

- 38.128.66.10 isn’t creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions, so perhaps I need to force them to use one session in Tomcat:

```
Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
```
- 64.62.202.71 is using a user agent I’ve never seen before:

```
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
```

- So now our “bot” regex can’t even match that…
  - Unless we change it to `[Bb]\.?[Oo]\.?[Tt]\.?`… which seems to match all variations of “bot” I can think of right now, according to regexr.com:

```
RTB website BOT
Altmetribot
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
```
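
As a quick sanity check of that pattern, here is a small, purely illustrative Java test class (not part of DSpace or Tomcat) that runs the proposed regex against those user agents:

```java
import java.util.regex.Pattern;

// Quick sanity check of the proposed "bot" pattern against the user agents above.
// Illustrative only; the real matching happens in nginx, Tomcat's Crawler Session
// Manager Valve, and DSpace's spider detection.
public class BotPatternTest {
    public static void main(String[] args) {
        Pattern bot = Pattern.compile("[Bb]\\.?[Oo]\\.?[Tt]\\.?");

        String[] agents = {
            "RTB website BOT",
            "Altmetribot",
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)",
            "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"
        };

        for (String agent : agents) {
            // find() does a substring match, like the nginx and valve patterns do
            System.out.printf("%-75s %s%n", agent, bot.matcher(agent).find());
        }
    }
}
```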

- And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):

```console
$ cat dspace.log.2020-08-04 | grep "199.47.87.145" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
2777
```

- I will add `Turnitin` to the Tomcat Crawler Session Manager Valve regex as well…

2020-08-06

- I have been working on processing the Solr statistics with the Atmire tool on DSpace Test the last few days:
  - statistics:
    - 2,040,385 docs: 2h 28m 49s
  - statistics-2019:
    - 8,960,000 docs: 12h 7s
    - 1,780,575 docs: 2h 7m 29s
  - statistics-2018:
    - 1,970,000 docs: 12h 1m 28s
    - 360,000 docs: 2h 54m 56s (Linode rebooted)
    - 1,110,000 docs: 7h 1m 44s (restarted Tomcat, oops)
- I decided to start the 2018 core over again, so I re-synced it from CGSpace and started again from the solr-upgrade-statistics-6x tool, and now I’m having the same issues with Java heap space that I had last month
  - The process kept crashing due to memory, so I increased the memory to 3072m and finally 4096m…
  - Also, I decided to try to purge all the `-unmigrated` docs that it had found so far to see if that helps… (a SolrJ sketch of that purge is below)
  - There were about 466,000 unmigrated records so far, most of which were `type: 5` (SITE statistics)
  - Now it is processing again…
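
For reference, the same purge I did with curl earlier can be done from Java with SolrJ; this is only an illustrative sketch (assuming a SolrJ 7.x client on the classpath and the statistics-2018 core at the URL shown), not how the Atmire tool does it:

```java
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Illustrative sketch: delete the "unmigrated" documents from a statistics core
// with SolrJ, equivalent to the curl delete-by-query used earlier.
public class PurgeUnmigratedStatistics {
    public static void main(String[] args) throws SolrServerException, IOException {
        String core = "http://localhost:8081/solr/statistics-2018"; // assumed core URL
        try (HttpSolrClient solr = new HttpSolrClient.Builder(core).build()) {
            solr.deleteByQuery("id:/.*unmigrated.*/"); // same regex query as the curl purge
            solr.commit();
        }
    }
}
```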

- I developed a small Java class called `FixJpgJpgThumbnails` to remove “.jpg.jpg” thumbnails from the `THUMBNAIL` bundle and replace them with their originals from the `ORIGINAL` bundle
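
The real class works against the DSpace object model; as a rough, hypothetical illustration of just the matching logic (plain strings instead of DSpace’s Item/Bundle/Bitstream types):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical sketch of the FixJpgJpgThumbnails logic: if a THUMBNAIL
// bitstream is named like "photo.jpg.jpg", strip the extra ".jpg" that the media
// filter appended and prefer the matching bitstream from the ORIGINAL bundle.
public class FixJpgJpgThumbnailsSketch {
    static String preferredThumbnail(String thumbnailName, List<String> originalNames) {
        if (thumbnailName == null || !thumbnailName.toLowerCase().endsWith(".jpg.jpg")) {
            return thumbnailName; // nothing to fix
        }
        // Drop the trailing ".jpg" to recover the original bitstream's name
        String candidate = thumbnailName.substring(0, thumbnailName.length() - ".jpg".length());
        // Only replace if the ORIGINAL bundle really has a bitstream with that name
        return originalNames.contains(candidate) ? candidate : thumbnailName;
    }

    public static void main(String[] args) {
        List<String> originals = new ArrayList<>();
        originals.add("infographic.jpg");
        originals.add("report.pdf");

        System.out.println(preferredThumbnail("infographic.jpg.jpg", originals)); // infographic.jpg
        System.out.println(preferredThumbnail("report.pdf.jpg", originals));      // unchanged
    }
}
```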