---
title: "January, 2020"
date: 2020-01-06T10:48:30+02:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2020-01-06

- Open [a ticket](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706) with Atmire to request a quote for the upgrade to DSpace 6
- Last week Altmetric responded about the [item](https://hdl.handle.net/10568/97087) that had a lower score than its DOI
  - The score is now linked to the DOI
  - Another [item](https://hdl.handle.net/10568/91278) that had the same problem in 2019 is now also linked to the score for its DOI
  - Another [item](https://hdl.handle.net/10568/81236) that had the same problem in 2019 has also been fixed

## 2020-01-07

- Peter Ballantyne highlighted one more WLE [item](https://hdl.handle.net/10568/101286) that is missing the Altmetric score that its DOI has
  - The DOI has a score of 259, but the Handle has no score at all
  - I [tweeted](https://twitter.com/mralanorth/status/1214471427157626881) the CGSpace repository link

<!--more-->

## 2020-01-08

- Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:

```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
```

- As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:

```
$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
```

- According to [this trick](https://www.datafix.com.au/BASHing/2018-09-13.html) the troublesome character is on line 5227:

```
$ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv
5227: "Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22  "
00000001: 4f  O
00000002: 75  u
00000003: 65  e
00000004: cc  .
00000005: 81  .
00000006: 64  d
00000007: 72  r
```
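The same search can be sketched in Python, which reports every character that the target encoding cannot represent instead of stopping at the first failure like `iconv`. This is a minimal sketch, not part of the original session: the function name `find_unencodable` and the two-line `sample` string are hypothetical stand-ins for the exported CSV.

```python
import unicodedata


def find_unencodable(text, encoding="windows-1252"):
    """Return (line_number, character, Unicode name) for every character
    in `text` that cannot be encoded in `encoding`."""
    problems = []
    # Split on "\n" only, so line numbers match what sed and awk report.
    for line_number, line in enumerate(text.split("\n"), start=1):
        for char in line:
            try:
                char.encode(encoding)
            except UnicodeEncodeError:
                name = unicodedata.name(char, "UNKNOWN")
                problems.append((line_number, char, name))
    return problems


# Hypothetical sample: the second line contains "e" followed by U+0301,
# which is what the bytes 65 cc 81 in the hex dump decode to.
sample = 'plain ascii line\n"Oue\u0301dr\n'

for line_number, char, name in find_unencodable(sample):
    print(f"line {line_number}: U+{ord(char):04X} {name}")
```

For the real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to the same function.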

- According to the blog post linked above the troublesome character is probably the "High Octet Preset" (81), which vim identifies (using `ga` on the character) as:

```
<e>  101,  Hex 65,  Octal 145 < ́> 769, Hex 0301, Octal 1401
```

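- If I understand the situation correctly, the bytes `cc 81` are the UTF-8 encoding of U+0301 (combining acute accent), so the value is valid UTF-8 but is stored in the database in decomposed form (an `e` followed by a combining accent rather than a precomposed "é"), which `windows-1252` cannot represent...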
- Other encodings like `windows-1251` and `windows-1257` also fail on different characters like "ž" and "é" that _are_ legitimate UTF-8 characters
- Then there is the issue of Russian, Chinese, etc. characters, which are simply not representable in any of those encodings
- I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
- Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the `5_x-prod` (no CG Core v2) branch
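One way to test the reading of the hex dump above is to decode the same bytes in Python. This is a sketch rather than something from the original session; the variable names are illustrative, and the bytes are copied from the `xxd` output for line 5227. It suggests the sequence is well-formed UTF-8 in decomposed form, and that composing it with NFC normalization yields a character that `windows-1252` can represent.

```python
import unicodedata

# Bytes 22 4f 75 65 cc 81 64 72, copied from the xxd output above.
raw = bytes.fromhex("224f7565cc816472")

# Decoding succeeds, so the byte sequence is well-formed UTF-8.
decomposed = raw.decode("utf-8")
print([f"U+{ord(c):04X}" for c in decomposed])

# The combining accent U+0301 has no equivalent in windows-1252.
try:
    decomposed.encode("windows-1252")
    encodable_as_is = True
except UnicodeEncodeError:
    encodable_as_is = False
print("encodable without normalization:", encodable_as_is)

# NFC normalization merges "e" + U+0301 into the single code point U+00E9,
# which windows-1252 does include.
composed = unicodedata.normalize("NFC", decomposed)
print(composed.encode("windows-1252"))
```

Normalization only helps with characters that have a precomposed form in the target encoding; it does nothing for the Russian or Chinese names mentioned above.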

<!-- vim: set sw=2 ts=2: -->