mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2025-01-27 05:49:12 +01:00
Add notes for 2020-01-08
@ -19,4 +19,50 @@ categories: ["Notes"]
- The DOI has a score of 259, but the Handle has no score at all
- I [tweeted](https://twitter.com/mralanorth/status/1214471427157626881) the CGSpace repository link

<!--more-->

## 2020-01-08

- Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:

```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
```

- Since I always have encoding issues with the files Peter sends, I tried converting the export to a Windows encoding, but got an error:

```
$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
```
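
As a cross-check on the iconv error, here is a minimal Python sketch that lists every character in a piece of text that cannot be encoded in a target encoding, along with its line and column. The function name `find_unencodable` and the sample row are made up for illustration (the sample reuses only the truncated bytes visible in the hex dump further down); `cp1252` is Python's name for windows-1252.

```python
def find_unencodable(text, encoding="cp1252"):
    """Return (line, column, character, code point) for each character
    that cannot be represented in the target encoding."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                problems.append((line_no, col, ch, f"U+{ord(ch):04X}"))
    return problems


# Hypothetical two-line sample, not the real export: a header row plus a
# truncated name containing "e" followed by a combining acute accent.
sample = '"dc.contributor.author",count\n"Oue\u0301dr",1\n'

for line_no, col, ch, codepoint in find_unencodable(sample):
    print(f"line {line_no}, column {col}: {codepoint}")
```

On the real file this could be run on the contents of `/tmp/2020-01-08-authors.csv` read with `encoding="utf-8"`, which would report all problem characters at once instead of stopping at the first one like iconv does.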

- According to [this trick](https://www.datafix.com.au/BASHing/2018-09-13.html) the troublesome character is on line 5227:

```
$ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv
5227: "Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22 "
00000001: 4f O
00000002: 75 u
00000003: 65 e
00000004: cc .
00000005: 81 .
00000006: 64 d
00000007: 72 r
```

- According to the blog post linked above, the troublesome character is probably the "High Octet Preset" (81), which vim identifies (using `ga` on the character) as an "e" followed by a combining acute accent (769, Hex 0301):

```
<e> 101, Hex 65, Octal 145 < ́> 769, Hex 0301, Octal 1401
```
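
One way to double-check what those bytes are is to decode them directly. The following is a rough Python sketch, assuming the eight bytes shown in the `xxd` output above are representative of how the name is stored. It decodes them as UTF-8, looks up the Unicode name of the accent, and then tries NFC normalization (composing "e" plus the combining accent into a single "é") before encoding to `cp1252`, Python's name for windows-1252. The variable names are only illustrative.

```python
import unicodedata

# The eight bytes from the xxd output: '"Oue', then cc 81, then 'dr'
raw = bytes([0x22, 0x4F, 0x75, 0x65, 0xCC, 0x81, 0x64, 0x72])

# The bytes decode cleanly as UTF-8; cc 81 is one code point, U+0301
text = raw.decode("utf-8")
accent = text[4]
print(f"U+{ord(accent):04X}", unicodedata.name(accent))

# NFC composes "e" + combining acute accent into the precomposed U+00E9
composed = unicodedata.normalize("NFC", text)

# The composed form can be encoded as windows-1252, the original cannot
encoded = composed.encode("cp1252")
try:
    text.encode("cp1252")
    decomposed_ok = True
except UnicodeEncodeError:
    decomposed_ok = False
```

If this holds for the whole export, normalizing the text to NFC before running the conversion might avoid this particular error, though it would not help with characters that windows-1252 cannot represent at all.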

- If I understand the situation correctly, the bytes `cc 81` are valid UTF-8 for the combining acute accent (U+0301), so the name is stored in the database in decomposed form ("e" followed by a combining accent) rather than with a precomposed "é", and Windows-1252 has no equivalent for the combining character...
- Other encodings like `windows-1251` and `windows-1257` also fail, on different characters like "ž" and "é", even though those are legitimate UTF-8 characters
- Then there is the issue of Russian, Chinese, etc. characters, which simply cannot be represented in any of those encodings
- I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
- Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the `5_x-prod` (no CG Core v2) branch
<!-- vim: set sw=2 ts=2: -->