mirror of https://github.com/alanorth/cgspace-notes.git, synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-01-14
@@ -1,6 +1,6 @@
---
title: "January, 2020"
date: 2019-01-06T10:48:30+02:00
date: 2020-01-06T10:48:30+02:00
author: "Alan Orth"
categories: ["Notes"]
---
@@ -53,7 +53,7 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000007: 72 r
```
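The same one-byte-per-line inspection can be sketched in Python, without `sed` and `xxd`. The string here is a made-up sample with a combining tilde, not the actual row from the authors CSV:

```python
# Inspect the UTF-8 bytes of a string, one byte per line, similar to
# `xxd -c1`. The sample string is hypothetical: 'n' followed by a
# combining tilde (U+0303), the kind of decomposed sequence that shows
# up in the authors file.
line = "Pin\u0303a"
for offset, byte in enumerate(line.encode("utf-8")):
    printable = chr(byte) if 32 <= byte < 127 else "."
    print(f"{offset:08x}: {byte:02x} {printable}")
# the combining tilde is encoded as the two bytes cc 83
```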
- According to the blog post linked above the troublesome character is probably the "High Octet Preset" (81), which vim identifies (using `ga` on the character) as:
- ~~According to the blog post linked above the troublesome character is probably the "High Octet Preset" (81)~~, which vim identifies (using `ga` on the character) as:
```
<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
@@ -65,4 +65,32 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
- I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
- Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the `5_x-prod` (no CG Core v2) branch
## 2020-01-14
- I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
- I manually ran it on the server as the DSpace user and it said "Moving: 51633080 into core statistics-2019"
- After a few hours it died with the same error that I had seen in the log from the first run:
```
Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
```
- I am not sure how I will fix that shard...
- I discovered a very interesting tool called [ftfy](https://github.com/LuminosoInsight/python-ftfy) that attempts to fix errors in UTF-8
- I'm curious to start checking input files with this to see what it highlights
- I ran it on the authors file from last week and it converted accented characters (like those in Spanish names) from decomposed sequences (a base letter plus a combining accent) to single precomposed characters (é→é), which vim identifies as:
- `<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401`
- `<é> 233, Hex 00e9, Oct 351, Digr e'`
- Ah hah! We need to be [normalizing characters into their canonical forms](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html)!
- In Python 3.8 we can even [check if the string is normalized using the `unicodedata` library](https://docs.python.org/3/library/unicodedata.html) (the first é below is the decomposed form, the second the precomposed form, though they render identically):
```
In [7]: unicodedata.is_normalized('NFC', 'é')
Out[7]: False
In [8]: unicodedata.is_normalized('NFC', 'é')
Out[8]: True
```
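A quick sketch of doing the normalization itself with the same `unicodedata` module, using the decomposed and precomposed forms of é as sample strings:

```python
import unicodedata

decomposed = "e\u0301"   # 'e' + combining acute accent (two code points)
precomposed = "\u00e9"   # single precomposed é code point

# NFC normalization combines the pair into the precomposed form
assert unicodedata.normalize("NFC", decomposed) == precomposed
# NFD goes the other way, splitting é back into base letter + accent
assert unicodedata.normalize("NFD", precomposed) == decomposed
print(len(decomposed), len(precomposed))  # 2 1
```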
<!-- vim: set sw=2 ts=2: -->