mirror of https://github.com/alanorth/cgspace-notes.git, synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-01-14
@@ -1,6 +1,6 @@
---
title: "January, 2020"
date: 2019-01-06T10:48:30+02:00
date: 2020-01-06T10:48:30+02:00
author: "Alan Orth"
categories: ["Notes"]
---
@@ -53,7 +53,7 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000007: 72 r
```
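The same one-byte-per-line inspection can be sketched in Python, without `sed` and `xxd`. The string here is a made-up sample with a combining tilde, not the actual row from the authors CSV:

```python
# Inspect the UTF-8 bytes of a string, one byte per line, similar to
# `xxd -c1`. The sample string is hypothetical: 'n' followed by a
# combining tilde (U+0303), the kind of decomposed sequence that shows
# up in the authors file.
line = "Pin\u0303a"
for offset, byte in enumerate(line.encode("utf-8")):
    printable = chr(byte) if 32 <= byte < 127 else "."
    print(f"{offset:08x}: {byte:02x} {printable}")
# the combining tilde is encoded as the two bytes cc 83
```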
- According to the blog post linked above the troublesome character is probably the "High Octet Preset" (81), which vim identifies (using `ga` on the character) as:
- ~~According to the blog post linked above the troublesome character is probably the "High Octet Preset" (81)~~, which vim identifies (using `ga` on the character) as:
```
<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
@@ -65,4 +65,32 @@ $ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
- I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
- Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the `5_x-prod` (no CG Core v2) branch
## 2020-01-14
- I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
- I manually ran it on the server as the DSpace user and it said "Moving: 51633080 into core statistics-2019"
- After a few hours it died with the same error that I had seen in the log from the first run:
```
Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
```
- I am not sure how I will fix that shard...
- I discovered a very interesting tool called [ftfy](https://github.com/LuminosoInsight/python-ftfy) that attempts to fix errors in UTF-8
- I'm curious to start checking input files with this to see what it highlights
- I ran it on the authors file from last week and it converted accented characters (like those in Spanish names) from decomposed sequences (a base letter plus a combining accent) to single precomposed characters (é→é), which vim identifies as:
- `<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401`
- `<é> 233, Hex 00e9, Oct 351, Digr e'`
- Ah hah! We need to be [normalizing characters into their canonical forms](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html)!
- In Python 3.8 we can even [check if the string is normalized using the `unicodedata` library](https://docs.python.org/3/library/unicodedata.html) (the first é below is the decomposed form, the second the precomposed form, though they render identically):
```
In [7]: unicodedata.is_normalized('NFC', 'é')
Out[7]: False
In [8]: unicodedata.is_normalized('NFC', 'é')
Out[8]: True
```
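A quick sketch of doing the normalization itself with the same `unicodedata` module, using the decomposed and precomposed forms of é as sample strings:

```python
import unicodedata

decomposed = "e\u0301"   # 'e' + combining acute accent (two code points)
precomposed = "\u00e9"   # single precomposed é code point

# NFC normalization combines the pair into the precomposed form
assert unicodedata.normalize("NFC", decomposed) == precomposed
# NFD goes the other way, splitting é back into base letter + accent
assert unicodedata.normalize("NFD", precomposed) == decomposed
print(len(decomposed), len(precomposed))  # 2 1
```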
<!-- vim: set sw=2 ts=2: -->