From 811a12cb5e34de78c6364df3e4197ba444d70879 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Thu, 6 Aug 2020 16:24:01 +0300 Subject: [PATCH] Update --- content/posts/2020-08.md | 24 +- docs/2020-08/index.html | 436 ++++++++++++++++++++++++ docs/categories/index.html | 2 +- docs/categories/notes/index.html | 2 +- docs/categories/notes/page/2/index.html | 2 +- docs/categories/notes/page/3/index.html | 2 +- docs/categories/notes/page/4/index.html | 2 +- docs/index.html | 2 +- docs/page/2/index.html | 2 +- docs/page/3/index.html | 2 +- docs/page/4/index.html | 2 +- docs/page/5/index.html | 2 +- docs/page/6/index.html | 2 +- docs/posts/index.html | 2 +- docs/posts/page/2/index.html | 2 +- docs/posts/page/3/index.html | 2 +- docs/posts/page/4/index.html | 2 +- docs/posts/page/5/index.html | 2 +- docs/posts/page/6/index.html | 2 +- docs/sitemap.xml | 10 +- 20 files changed, 481 insertions(+), 23 deletions(-) create mode 100644 docs/2020-08/index.html diff --git a/content/posts/2020-08.md b/content/posts/2020-08.md index ab29704fe..adb375104 100644 --- a/content/posts/2020-08.md +++ b/content/posts/2020-08.md @@ -58,7 +58,7 @@ $ http 'http://localhost:8080/rest/collections/1445/items' jq '. | length' 61 ``` - - Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there: +- Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there: ``` $ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems @@ -181,4 +181,26 @@ on_id=[A-Z0-9]{32}' | sort | uniq | wc -l - I will add `Turnitin` to the Tomcat Crawler Session Manager Valve regex as well... 
+## 2020-08-06 + +- I have been working on processing the Solr statistics with the Atmire tool on DSpace Test the last few days: + - statistics: + - 2,040,385 docs: 2h 28m 49s + - statistics-2019: + - 8,960,000 docs: 12h 7s + - 1,780,575 docs: 2h 7m 29s + - statistics-2018: + - 1,970,000 docs: 12h 1m 28s + - 360,000 docs: 2h 54m 56s (Linode rebooted) + - 1,110,000 docs: 7h 1m 44s (Restarted Tomcat, oops) +- I decided to start the 2018 core over again, so I re-synced it from CGSpace and started again from the solr-upgrade-statistics-6x tool and now I'm having the same issues with Java heap space that I had last month + - The process kept crashing due to memory, so I increased the memory to 3072m and finally 4096m... + - Also, I decided to try to purge all the `-unmigrated` docs that it had found so far to see if that helps... + - There were about 466,000 records unmigrated so far, most of which were `type: 5` (SITE statistics) + - Now it is processing again... +- I developed a small Java class called `FixJpgJpgThumbnails` to remove ".jpg.jpg" thumbnails from the `THUMBNAIL` bundle and replace them with their originals from the `ORIGINAL` bundle + - The code is based on [RemovePNGThumbnailsForPDFs.java](https://github.com/UoW-IRRs/DSpace-Scripts/blob/master/src/main/java/nz/ac/waikato/its/irr/scripts/RemovePNGThumbnailsForPDFs.java) by Andrea Schweer + - I incorporated it into my dspace-curation-tasks repository, then renamed it to [cgspace-java-helpers](https://github.com/ilri/cgspace-java-helpers) + - In testing I found that I can replace ~3,500 thumbnails on CGSpace! + diff --git a/docs/2020-08/index.html b/docs/2020-08/index.html new file mode 100644 index 000000000..ea340c211 --- /dev/null +++ b/docs/2020-08/index.html @@ -0,0 +1,436 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + August, 2020 | CGSpace Notes + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+
+ + + + +
+
+

CGSpace Notes

+

Documenting day-to-day work on the CGSpace repository.

+
+
+ + + + +
+
+
+ + + + +
+
+

August, 2020

+ +
+

2020-08-02

+
    +
  • I spent a few days working on a Java-based curation task to tag items with ISO 3166-1 Alpha2 country codes based on their cg.coverage.country text values +
      +
    • It looks up the names in ISO 3166-1 first, and then in our CGSpace countries mapping (which has five or so of Peter’s preferred “display” country names)
    • +
    • It implements a “force” mode too that will clear existing country codes and re-tag everything
    • +
    • It is class based so I can easily add support for other vocabularies, and the technique could even be used for organizations with mappings to ROR and Clarisa…
    • +
    +
  • +
+ +

2020-08-03

+
    +
  • Atmire responded to the ticket about the ongoing upgrade issues +
      +
    • They pushed an RC2 version of the CUA module that fixes the FontAwesome issue: they now use classes instead of Unicode hex characters, so our JS + SVG works!
    • +
    • They also said they have never experienced the type: 5 site statistics issue, so I need to try to purge those and continue with the stats processing
    • +
    +
  • +
  • I purged all unmigrated stats in a few cores and then restarted processing:
  • +
+
$ curl -s "http://localhost:8081/solr/statistics/update?softCommit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>id:/.*unmigrated.*/</query></delete>'
+$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
+$ chrt -b 0 dspace dsrun com.atmire.statistics.util.update.atomic.AtomicStatisticsUpdateCLI -t 12 -c statistics
+
    +
  • Andrea from Macaroni Bros emailed me a few days ago to say he’s having issues with the CGSpace REST API + +
  • +
+

2020-08-04

+ +
$ http 'http://localhost:8080/rest/collections/1445' | json_pp | grep numberItems
+   "numberItems" : 63,
+$ http 'http://localhost:8080/rest/collections/1445/items' | jq '. | length'
+61
+
    +
  • Also on DSpace Test (which is running DSpace 6!), though the issue is slightly different there:
  • +
+
$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708' | json_pp | grep numberItems
+   "numberItems" : 61,
+$ http 'https://dspacetest.cgiar.org/rest/collections/5471c3aa-202e-42f0-96c2-497a18e3b708/items' | jq '. | length'
+59
+
    +
  • Ah! I exported that collection’s metadata and checked it in OpenRefine, where I noticed that two items are mapped twice +
      +
    • I dealt with this problem in 2017-01 and the solution is to check the collection2item table:
    • +
    +
  • +
+
dspace=# SELECT * FROM collection2item WHERE item_id = '107687';
+   id   | collection_id | item_id
+--------+---------------+---------
+ 133698 |           966 |  107687
+ 134685 |          1445 |  107687
+ 134686 |          1445 |  107687
+(3 rows)
+
    +
  • So for each id you can delete one duplicate mapping:
  • +
+
dspace=# DELETE FROM collection2item WHERE id='134686';
+dspace=# DELETE FROM collection2item WHERE id='128819';
+
    +
  • Update countries on CGSpace to be closer to ISO 3166-1 with some minor differences based on Peter’s preferred display names
  • +
+
$ cat 2020-08-04-PB-new-countries.csv
+cg.coverage.country,correct
+CAPE VERDE,CABO VERDE
+COCOS ISLANDS,COCOS (KEELING) ISLANDS
+"CONGO, DR","CONGO, DEMOCRATIC REPUBLIC OF"
+COTE D'IVOIRE,CÔTE D'IVOIRE
+"KOREA, REPUBLIC","KOREA, REPUBLIC OF"
+PALESTINE,"PALESTINE, STATE OF"
+$ ./fix-metadata-values.py -i 2020-08-04-PB-new-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
+
    +
  • I had to restart Tomcat 7 three times before all the Solr statistics cores came up properly +
      +
    • I started a full Discovery re-indexing
    • +
    +
  • +
+

2020-08-05

+
    +
  • Port my dspace-curation-tasks to DSpace 6 and tag version 6.0-SNAPSHOT
  • +
  • I downloaded the UN M.49 CSV file to start working on updating the CGSpace regions +
      +
    • First issue is they don’t version the file so you have no idea when it was released
    • +
    • Second issue is that three rows have errors due to not using quotes around “China, Macao Special Administrative Region”
    • +
    +
  • +
  • Bizu said she was having problems approving tasks on CGSpace +
      +
    • I looked at the PostgreSQL locks and they have skyrocketed since yesterday:
    • +
    +
  • +
+

[Figure: PostgreSQL locks, daily graph]

+

[Figure: PostgreSQL query length, daily graph]

+
    +
  • Seems that something happened yesterday afternoon at around 5PM… +
      +
    • For now I will just run all updates on the server and reboot it, as I have no idea what causes this issue
    • +
    • I had to restart Tomcat 7 three times after the server came back up before all Solr statistics cores came up properly
    • +
    +
  • +
  • I checked the nginx logs around 5PM yesterday to see who was accessing the server:
  • +
+
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E '04/Aug/2020:(17|18)' | goaccess --log-format=COMBINED -
+
    +
  • I see the Macaroni Bros are using their new user agent for harvesting: RTB website BOT +
      +
    • But that pattern doesn’t match in the nginx bot list or Tomcat’s crawler session manager valve because we’re only checking for [Bb]ot!
    • +
    • So they have created thousands of Tomcat sessions:
    • +
    +
  • +
+
$ cat dspace.log.2020-08-04 | grep -E "(63.32.242.35|64.62.202.71)" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
+5693
+
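One caveat about the count above: `grep -E` without `-o` prints the whole matching log line, so `sort | uniq` deduplicates log lines rather than session IDs, and the number may overstate the distinct sessions. A minimal sketch of counting distinct session IDs instead, assuming each log line carries a `session_id=` token followed by 32 uppercase alphanumeric characters; the helper name `count_sessions` is hypothetical and not part of DSpace:

```shell
# Hypothetical helper (not part of DSpace): count distinct session IDs
# logged for one IP address in a DSpace log file.
#   $1 = IP address, matched as a fixed string so the dots are literal
#   $2 = path to the log file
count_sessions() {
  grep -F "$1" "$2" \
    | grep -oE 'session_id=[A-Z0-9]{32}' \
    | sort -u \
    | wc -l
}

# Example usage, assuming the log file is in the current directory:
# count_sessions 64.62.202.71 dspace.log.2020-08-04
```

Here `-o` makes grep emit only the matched `session_id=...` token, so `sort -u` collapses repeated requests from the same session into one line before `wc -l` counts them.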
    +
  • DSpace itself uses a case-sensitive regex for user agents so there are no hits from those IPs in Solr, but I need to tweak the other regexes so they don’t misuse the resources +
      +
    • Perhaps [Bb][Oo][Tt]
    • +
    +
  • +
  • I see another IP, 104.198.96.245, which is also using the “RTB website BOT” user agent, but there are 70,000 hits in Solr from earlier this year, before they started using it +
      +
    • I purged all the hits from Solr, including a few thousand from 64.62.202.71
    • +
    +
  • +
  • A few more IPs causing lots of Tomcat sessions yesterday:
  • +
+
$ cat dspace.log.2020-08-04 | grep "38.128.66.10" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
+1585
+$ cat dspace.log.2020-08-04 | grep "64.62.202.71" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
+5691
+
    +
  • 38.128.66.10 isn’t creating any Solr statistics due to our DSpace agents pattern, but they are creating lots of sessions so perhaps I need to force them to use one session in Tomcat:
  • +
+
Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
+
    +
  • 64.62.202.71 is using a user agent I’ve never seen before:
  • +
+
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
+
    +
  • So now our “bot” regex can’t even match that… +
      +
    • Unless we change it to [Bb]\.?[Oo]\.?[Tt]\.?… which seems to match all variations of “bot” I can think of right now, according to regexr.com:
    • +
    +
  • +
+
RTB website BOT
+Altmetribot
+Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
+Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
+Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
+
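The same check can be done locally instead of on regexr.com. This is a minimal sketch, assuming `grep -E` treats the pattern the same way as the regex engines used by nginx and Tomcat (which is not guaranteed for every pattern); the helper name `is_bot` is hypothetical. The brokenlinkcheck.com agent from above is included as a negative case:

```shell
# Hypothetical helper: succeeds if the user agent string matches the
# proposed "bot" pattern, which allows an optional dot after each letter.
is_bot() {
  printf '%s\n' "$1" | grep -qE '[Bb]\.?[Oo]\.?[Tt]\.?'
}

# Check every sample user agent from the notes, one per line.
while IFS= read -r ua; do
  if is_bot "$ua"; then
    echo "match:    $ua"
  else
    echo "no match: $ua"
  fi
done <<'EOF'
RTB website BOT
Altmetribot
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; +centuryb.o.t9[at]gmail.com)
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
Mozilla/5.0 (Windows NT 5.1) brokenlinkcheck.com/1.2
EOF
```

The first five agents match. The brokenlinkcheck.com agent does not, so it would still need its own entry in whichever list is used.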
    +
  • And another IP belonging to Turnitin (the alternate user agent of Turnitinbot):
  • +
+
$ cat dspace.log.2020-08-04 | grep "199.47.87.145" | grep -E 'session_id=[A-Z0-9]{32}' | sort | uniq | wc -l
+2777
+
    +
  • I will add Turnitin to the Tomcat Crawler Session Manager Valve regex as well…
  • +
+

2020-08-06

+
    +
  • I have been working on processing the Solr statistics with the Atmire tool on DSpace Test the last few days: +
      +
    • statistics: +
        +
      • 2,040,385 docs: 2h 28m 49s
      • +
      +
    • +
    • statistics-2019: +
        +
      • 8,960,000 docs: 12h 7s
      • +
      • 1,780,575 docs: 2h 7m 29s
      • +
      +
    • +
    • statistics-2018: +
        +
      • 1,970,000 docs: 12h 1m 28s
      • +
      • 360,000 docs: 2h 54m 56s (Linode rebooted)
      • +
      • 1,110,000 docs: 7h 1m 44s (Restarted Tomcat, oops)
      • +
      +
    • +
    +
  • +
  • I decided to start the 2018 core over again, so I re-synced it from CGSpace and started again with the solr-upgrade-statistics-6x tool, and now I’m having the same Java heap space issues that I had last month +
      +
    • The process kept crashing due to memory, so I increased the memory to 3072m and finally 4096m…
    • +
    • Also, I decided to try to purge all the -unmigrated docs that it had found so far to see if that helps…
    • +
    • There were about 466,000 records unmigrated so far, most of which were type: 5 (SITE statistics)
    • +
    • Now it is processing again…
    • +
    +
  • +
  • I developed a small Java class called FixJpgJpgThumbnails to remove “.jpg.jpg” thumbnails from the THUMBNAIL bundle and replace them with their originals from the ORIGINAL bundle + +
  • +
+ + + + + + +
+ + + +
+ + + + +
+
+ + + + + + + + + diff --git a/docs/categories/index.html b/docs/categories/index.html index aeb8fe83f..f0188b048 100644 --- a/docs/categories/index.html +++ b/docs/categories/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/categories/notes/index.html b/docs/categories/notes/index.html index 9180ee399..48ad611a8 100644 --- a/docs/categories/notes/index.html +++ b/docs/categories/notes/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/categories/notes/page/2/index.html b/docs/categories/notes/page/2/index.html index c414de441..26aacb3ee 100644 --- a/docs/categories/notes/page/2/index.html +++ b/docs/categories/notes/page/2/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/categories/notes/page/3/index.html b/docs/categories/notes/page/3/index.html index 926534eb1..91e4514c2 100644 --- a/docs/categories/notes/page/3/index.html +++ b/docs/categories/notes/page/3/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/categories/notes/page/4/index.html b/docs/categories/notes/page/4/index.html index 620a6529a..9a0a1363d 100644 --- a/docs/categories/notes/page/4/index.html +++ b/docs/categories/notes/page/4/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/index.html b/docs/index.html index 8dacd4f63..9715258c8 100644 --- a/docs/index.html +++ b/docs/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/page/2/index.html b/docs/page/2/index.html index 698d2620d..6beb051f1 100644 --- a/docs/page/2/index.html +++ b/docs/page/2/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/page/3/index.html b/docs/page/3/index.html index 2d4d6c77e..efc34ac1d 100644 --- a/docs/page/3/index.html +++ b/docs/page/3/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/page/4/index.html b/docs/page/4/index.html index 92b572254..be323f486 100644 --- a/docs/page/4/index.html +++ b/docs/page/4/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/page/5/index.html b/docs/page/5/index.html index d215cee4c..7c6ab3a94 100644 --- a/docs/page/5/index.html +++ b/docs/page/5/index.html @@ -9,7 +9,7 @@ - + diff --git 
a/docs/page/6/index.html b/docs/page/6/index.html index 570830902..6cb63e81d 100644 --- a/docs/page/6/index.html +++ b/docs/page/6/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/posts/index.html b/docs/posts/index.html index a4cff738e..e78ea4207 100644 --- a/docs/posts/index.html +++ b/docs/posts/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/posts/page/2/index.html b/docs/posts/page/2/index.html index 9c9e99562..0acde5589 100644 --- a/docs/posts/page/2/index.html +++ b/docs/posts/page/2/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/posts/page/3/index.html b/docs/posts/page/3/index.html index fe124aff1..1fc117436 100644 --- a/docs/posts/page/3/index.html +++ b/docs/posts/page/3/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/posts/page/4/index.html b/docs/posts/page/4/index.html index bcb0b1135..098957cb1 100644 --- a/docs/posts/page/4/index.html +++ b/docs/posts/page/4/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/posts/page/5/index.html b/docs/posts/page/5/index.html index 058b9c542..e358282ce 100644 --- a/docs/posts/page/5/index.html +++ b/docs/posts/page/5/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/posts/page/6/index.html b/docs/posts/page/6/index.html index df271936a..b3926ca7b 100644 --- a/docs/posts/page/6/index.html +++ b/docs/posts/page/6/index.html @@ -9,7 +9,7 @@ - + diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 80c18fd53..ef9779b41 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -4,27 +4,27 @@ https://alanorth.github.io/cgspace-notes/2020-08/ - 2020-08-05T16:58:31+03:00 + 2020-08-06T10:56:13+03:00 https://alanorth.github.io/cgspace-notes/categories/ - 2020-08-05T16:58:31+03:00 + 2020-08-06T10:56:13+03:00 https://alanorth.github.io/cgspace-notes/ - 2020-08-05T16:58:31+03:00 + 2020-08-06T10:56:13+03:00 https://alanorth.github.io/cgspace-notes/categories/notes/ - 2020-08-05T16:58:31+03:00 + 2020-08-06T10:56:13+03:00 https://alanorth.github.io/cgspace-notes/posts/ - 2020-08-05T16:58:31+03:00 + 2020-08-06T10:56:13+03:00