---
title: "January, 2020"
date: 2020-01-06T10:48:30+02:00
author: "Alan Orth"
categories: ["Notes"]
---

## 2020-01-06

- Open [a ticket](https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=706) with Atmire to request a quote for the upgrade to DSpace 6
- Last week Altmetric responded about the [item](https://hdl.handle.net/10568/97087) that had a lower score than its DOI
  - The score is now linked to the DOI
  - Another [item](https://handle.hdl.net/10568/91278) that had the same problem in 2019 is now also linked to the score for its DOI
  - Another [item](https://hdl.handle.net/10568/81236) that had the same problem in 2019 has also been fixed

## 2020-01-07

- Peter Ballantyne highlighted one more WLE [item](https://hdl.handle.net/10568/101286) that is missing the Altmetric score that its DOI has
  - The DOI has a score of 259, but the Handle has no score at all
  - I [tweeted](https://twitter.com/mralanorth/status/1214471427157626881) the CGSpace repository link

<!--more-->

## 2020-01-08

- Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:

```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
```

- As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:

```
$ iconv -f utf-8 -t windows-1252 /tmp/2020-01-08-authors.csv -o /tmp/2020-01-08-authors-windows.csv
iconv: illegal input sequence at position 104779
```

- According to [this trick](https://www.datafix.com.au/BASHing/2018-09-13.html) the troublesome character is on line 5227:

```
$ awk 'END {print NR": "$0}' /tmp/2020-01-08-authors-windows.csv
5227: "Oue
$ sed -n '5227p' /tmp/2020-01-08-authors.csv | xxd -c1
00000000: 22  "
00000001: 4f  O
00000002: 75  u
00000003: 65  e
00000004: cc  .
00000005: 81  .
00000006: 64  d
00000007: 72  r
```

- ~~According to the blog post linked above the troublesome character is probably the "High Octet Preset" (81)~~, which vim identifies (using `ga` on the character) as:

```
<e>  101,  Hex 65,  Octal 145 < ́> 769, Hex 0301, Octal 1401
```

- If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database...
- Other encodings like `windows-1251` and `windows-1257` also fail on different characters like "ž" and "é" that _are_ legitimate UTF-8 characters
- Then there is the issue of Russian, Chinese, etc. characters, which are simply not representable in any of those encodings
- I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
- Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the `5_x-prod` (no CG Core v2) branch
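
A minimal sketch of how the characters that trip up `iconv` could be located from Python. The helper name `find_unencodable` is made up for illustration, and the input is only the handful of bytes shown in the `xxd` dump above, not the real authors file. Those bytes do decode as UTF-8, and the `cc 81` pair comes out as a combining accent that windows-1252 cannot represent:

```python
import unicodedata


def find_unencodable(lines, encoding="windows-1252"):
    """Yield (line number, character, Unicode name) for every character
    that cannot be represented in the target encoding."""
    for number, line in enumerate(lines, start=1):
        for char in line:
            try:
                char.encode(encoding)
            except UnicodeEncodeError:
                yield number, char, unicodedata.name(char, "UNKNOWN")


# The eight bytes from the xxd dump: "Oue, then cc 81, then dr
raw = b'"Oue\xcc\x81dr'
text = raw.decode("utf-8")  # decodes without raising an error

problems = list(find_unencodable([text]))
print(problems)  # [(1, '\u0301', 'COMBINING ACUTE ACCENT')]
```

Run against the whole export, this would list every line that needs attention in one pass, rather than stopping at the first failure the way `iconv` does.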

## 2020-01-14

- I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
  - I manually ran it on the server as the DSpace user and it said "Moving: 51633080 into core statistics-2019"
  - After a few hours it died with the same error that I had seen in the log from the first run:

```
Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
```

- I am not sure how I will fix that shard...
- I discovered a very interesting tool called [ftfy](https://github.com/LuminosoInsight/python-ftfy) that attempts to fix errors in UTF-8
  - I'm curious to start checking input files with this to see what it highlights
  - I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don't know what it's called?) to digraphs (é→é), which vim identifies as:
  - `<e>  101,  Hex 65,  Octal 145 < ́> 769, Hex 0301, Octal 1401`
  - `<é> 233, Hex 00e9, Oct 351, Digr e'`
- Ah hah! We need to be [normalizing characters into their canonical forms](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html)!
  - In Python 3.8 we can even [check if the string is normalized using the `unicodedata` library](https://docs.python.org/3/library/unicodedata.html):

```
In [7]: unicodedata.is_normalized('NFC', 'é')
Out[7]: False

In [8]: unicodedata.is_normalized('NFC', 'é')
Out[8]: True
```
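
The two strings in that session look identical on screen, so here is a small sketch that spells the code points out with escapes. It uses only the standard `unicodedata` module (`is_normalized` needs Python 3.8 or newer); the variable names are just for illustration:

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(unicodedata.is_normalized("NFC", decomposed))  # False
print(unicodedata.is_normalized("NFC", composed))    # True

# NFC composes the base letter and the accent into a single code point
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)            # True
print(len(decomposed), len(normalized))  # 2 1
```

Because the two forms compare as unequal until they are normalized, the same author name can end up as two distinct values in the database.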

## 2020-01-15

- I added support for Unicode normalization to my [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) tool in [v0.4.0](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.0)
- Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:

```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.bioversity", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
```

- She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC
- I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my `fix-metadata-values.py` script:

```
$ ./fix-metadata-values.py -i 2020-01-15-fix-8-ilri-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.ilri -m 203 -t correct -d
```

## 2020-01-16

- Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:

```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ciat", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35
```

- Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
  - We had delayed processing them because DSpace Test (linode19) was testing CG Core v2 implementation for the last few months
  - Sisay uploaded the records to DSpace Test as [IITA_201907_Jan13](https://dspacetest.cgiar.org/handle/10568/106567)
  - I started first with basic sanity checks using my csv-metadata-quality tool and found twenty-two items with extra whitespace, invalid multi-value separators, and duplicates, which means Sisay did not do any quality checking on the data
  - I corrected one invalid AGROVOC subject
  - Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
    - `$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id`
    - I always forget how to copy the reconciled values in OpenRefine, but you need to make a new column and populate it using this GREL: `if(cell.recon.matched, cell.recon.match.name, value)`
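
To make the logic of that GREL expression explicit, here is the same decision written as plain Python. This is not OpenRefine's API: the dictionary layout, the function name `reconciled_value`, and the affiliation strings are all hypothetical and only mirror the `cell.recon.matched` and `cell.recon.match.name` fields used above.

```python
def reconciled_value(cell):
    """Return the reconciled name if the cell was matched, else its original value."""
    recon = cell.get("recon") or {}
    if recon.get("matched"):
        return recon["match"]["name"]
    return cell["value"]


# Hypothetical cells: one matched against the reference list, one not
matched_cell = {
    "value": "Intl. Livestock Research Inst.",
    "recon": {"matched": True, "match": {"name": "International Livestock Research Institute"}},
}
unmatched_cell = {"value": "Some Unknown Institute", "recon": {"matched": False}}

print(reconciled_value(matched_cell))    # International Livestock Research Institute
print(reconciled_value(unmatched_cell))  # Some Unknown Institute
```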

## 2020-01-20

- Last week Atmire sent a quotation for the DSpace 6 upgrade that I had requested a few weeks ago
  - I forwarded it to Peter et al for their comment
  - We decided that we should probably buy enough credits to cover the upgrade and have 100 remaining for future development
- Visit CodeObia to discuss the next phase of AReS development

## 2020-01-21

- Create two accounts on CGSpace for CTA users
- Marie-Angelique finally responded to some of the pull requests I made on the CG Core v2 repository last month:
  - Merged: [HTML syntax fixes](https://github.com/AgriculturalSemantics/cg-core/pull/16)
  - Merged: [Add LICENSE file](https://github.com/AgriculturalSemantics/cg-core/pull/17)
  - Merged: [Build main.css using npm build](https://github.com/AgriculturalSemantics/cg-core/pull/18)
  - Approved a [wider scope for `cg.peer-reviewed`](https://github.com/AgriculturalSemantics/cg-core/issues/14) (renaming the field and using non-boolean values), but there is more discussion needed
- I opened a new [pull request](https://github.com/AgriculturalSemantics/cg-core/pull/24) on the cg-core repository to validate and fix the formatting of the HTML files
- Create more issues for OpenRXV:
  - Based on Peter's feedback on the [text for labels and tooltips](https://github.com/ilri/OpenRXV/issues/33)
  - Based on Peter's feedback for the [export icon](https://github.com/ilri/OpenRXV/issues/35)
  - Based on Peter's feedback for the [sort options](https://github.com/ilri/OpenRXV/issues/31)
  - Based on Abenet's feedback that [PDF and Word exports are not working](https://github.com/ilri/OpenRXV/issues/34)

## 2020-01-22

- I tried to create a MaxMind account so I can download the GeoLite2-City database with a license key, but their server refuses to accept me:

```
Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
```

- They started [limiting public access to the database in December, 2019 due to GDPR and CCPA](https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/)
  - This will be a problem in the future (see [DS-4409](https://jira.lyrasis.org/browse/DS-4409))
- Peter sent me his corrections for the list of authors that I had sent him earlier in the month
  - There were encoding issues when I checked the file in vim and using Python-based tools, but OpenRefine was able to read and export it as UTF-8
  - I will apply them on CGSpace and DSpace Test using my `fix-metadata-values.py` script:

```
$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
```

- Then I decided to export them again (with two author columns) so I can run them through the new Unicode normalization mode I added to [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality):

```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
COPY 67314
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
```

- Peter asked me to send him a list of affiliations to correct
  - First I decided to export them and run the Unicode normalizations and syntax checks with csv-metadata-quality and re-import the cleaned up values:

```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6170
dspace=# \q
$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
```

- I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:

```
$ sleep 4h && time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
```

- Then I generated a new list for Peter:

```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6162
```

- Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author "Hung, Nguyen"
  - I generated a report for 2019 and 2020 with each and I see there are indeed ten more Handles in the results from L&R:

```
$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u > hung-nguyen-ares-handles.txt
$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u > hung-nguyen-atmire-handles.txt
$ wc -l hung-nguyen-a*handles.txt
  46 hung-nguyen-ares-handles.txt
  56 hung-nguyen-atmire-handles.txt
 102 total
```

- Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven't been indexed yet
  - I am curious to check tomorrow to see if they are there
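
The comparison itself is a set difference. A minimal sketch, assuming one Handle per line as in the two `*-handles.txt` files above; the function name and the sample Handles are made up for illustration:

```python
def missing_handles(ares_lines, atmire_lines):
    """Return Handles present in the Atmire list but absent from the AReS list."""
    ares = {line.strip() for line in ares_lines if line.strip()}
    atmire = {line.strip() for line in atmire_lines if line.strip()}
    return sorted(atmire - ares)


# Hypothetical Handles standing in for the real files
ares_sample = ["10568/100001\n", "10568/100002\n"]
atmire_sample = ["10568/100001\n", "10568/100002\n", "10568/100003\n"]

print(missing_handles(ares_sample, atmire_sample))  # ['10568/100003']
```

With the real data the two arguments could simply be the open file objects for `hung-nguyen-ares-handles.txt` and `hung-nguyen-atmire-handles.txt`.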

## 2020-01-23

- I checked AReS and I see that there are now 55 items for author "Hung Nguyen-Viet"
- Linode sent an alert that the outbound traffic rate of CGSpace (linode18) was high for several hours this morning around 5AM UTC+1
  - I checked the nginx logs this morning for the few hours before and after that using goaccess:

```
# cat /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "23/Jan/2020:0[12345678]" | goaccess --log-format=COMBINED -
```
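
For reference, a rough sketch of the same kind of ranking (hosts ordered by bytes sent) in Python, assuming nginx's default `combined` log format. The sample lines and addresses are invented, and the regular expression is a simplification that does not handle escaped quotes inside the request field:

```python
import re
from collections import Counter

# host, ident, user, [time], "request", status, bytes (combined log format)
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} (?P<bytes>\d+|-)'
)
# Same time window as the grep above: hours 01 through 08 on 23 Jan 2020
HOURS_RE = re.compile(r"^23/Jan/2020:0[1-8]")


def bytes_by_host(lines, when=HOURS_RE):
    """Sum the bytes sent to each remote host for log lines inside the time window."""
    totals = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or not when.match(match.group("time")):
            continue
        sent = match.group("bytes")
        totals[match.group("host")] += 0 if sent == "-" else int(sent)
    return totals


# Invented log lines using documentation-only IPv6 addresses
sample = [
    '2001:db8::1 - - [23/Jan/2020:05:01:02 +0100] "GET /a.pdf HTTP/1.1" 200 5000 "-" "curl"',
    '2001:db8::2 - - [23/Jan/2020:05:03:04 +0100] "GET /b.pdf HTTP/1.1" 200 1200 "-" "curl"',
    '2001:db8::1 - - [23/Jan/2020:06:10:00 +0100] "GET /c.pdf HTTP/1.1" 200 2500 "-" "curl"',
    '2001:db8::3 - - [23/Jan/2020:11:00:00 +0100] "GET /d.pdf HTTP/1.1" 200 9999 "-" "curl"',
]

top_hosts = bytes_by_host(sample).most_common(2)
print(top_hosts)  # [('2001:db8::1', 7500), ('2001:db8::2', 1200)]
```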

- The top two hosts according to the amount of data transferred are:
  - 2a01:7e00::f03c:91ff:fe9a:3a37
  - 2a01:7e00::f03c:91ff:fe18:7396
- Both are on Linode, and appear to be the new and old ilri.org servers
  - I will ask the web team
  - Judging from the [ILRI publications site](https://www.ilri.org/publications/trade-offs-related-agricultural-use-antimicrobials-and-synergies-emanating-efforts) it seems they are downloading the PDFs so they can generate higher-quality thumbnails:
  - They are apparently using this Drupal module to generate the thumbnails: `sites/all/modules/contrib/pdf_to_imagefield`
  - I see some excellent suggestions in this [ImageMagick thread from 2012](https://www.imagemagick.org/discourse-server/viewtopic.php?t=21589) that led me to some nice thumbnails (default PDF density is 72, so supersample to 4X and then resize back to 25%) as well as [this blog post](https://duncanlock.net/blog/2013/11/18/how-to-create-thumbnails-for-pdfs-with-imagemagick-on-linux/):

```
$ convert -density 288 -filter lagrange -thumbnail 25% -background white -alpha remove -sampling-factor 1:1 -colorspace sRGB 10568-97925.pdf\[0\] 10568-97925.jpg
```

- Here I'm also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using `-flatten` like DSpace already does
- I did some tests with a modified version of the above that uses `-flatten` and drops the sampling-factor and colorspace, but bumps up the image size to 600px (default on CGSpace is currently 300):

```
$ convert -density 288 -filter lagrange -resize 25% -flatten 10568-97925.pdf\[0\] 10568-97925-d288-lagrange.pdf.jpg
$ convert -flatten 10568-97925.pdf\[0\] 10568-97925.pdf.jpg
$ convert -thumbnail x600 10568-97925-d288-lagrange.pdf.jpg 10568-97925-d288-lagrange-thumbnail.pdf.jpg
$ convert -thumbnail x600 10568-97925.pdf.jpg 10568-97925-thumbnail.pdf.jpg
```

- This emulates DSpace's method of generating a high-quality image from the PDF and then creating a thumbnail
- I put together a proof of concept of this by adding the extra options to dspace-api's `ImageMagickThumbnailFilter.java` and it works
- I need to run tests on a handful of PDFs to see if there are any side effects
- The file size is about double that of the old ones, but the quality is very good and the file size is nowhere near ilri.org's 400KiB PNG!
- Peter sent me the corrections and deletions for affiliations last night so I imported them into OpenRefine to work around the normal UTF-8 issue, ran them through csv-metadata-quality to make sure all Unicode values were normalized (NFC), then applied them on DSpace Test and CGSpace:

```
$ csv-metadata-quality -i ~/Downloads/2020-01-22-fix-1113-affiliations.csv -o /tmp/2020-01-22-fix-1113-affiliations.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
$ ./fix-metadata-values.py -i /tmp/2020-01-22-fix-1113-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct
$ ./delete-metadata-values.py -i /tmp/2020-01-22-delete-36-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
```

<!-- vim: set sw=2 ts=2: -->