- Export a list of authors from CGSpace for Peter Ballantyne to look through and correct:
```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-08-authors.csv WITH CSV HEADER;
COPY 68790
```
- As I always have encoding issues with files Peter sends, I tried to convert it to some Windows encoding, but got an error:
- ~~According to the blog post linked above the troublesome character is probably the "High Octect Preset" (81)~~, which vim identifies (using `ga` on the character) as:
- If I understand the situation correctly it sounds like this means that the character is not actually encoded as UTF-8, so it's stored incorrectly in the database...
- Other encodings like `windows-1251` and `windows-1257` also fail on different characters like "ž" and "é" that _are_ legitimate UTF-8 characters
- Then there is the issue of Russian, Chinese, etc characters, which are simply not representable in any of those encodings
- I think the solution is to upload it to Google Docs, or just send it to him and deal with each case manually in the corrections he sends me
- Re-deploy DSpace Test (linode19) with a fresh snapshot of the CGSpace database and assetstore, and using the `5_x-prod` (no CG Core v2) branch
- I checked the yearly Solr statistics sharding cron job that should have run on 2020-01 on CGSpace (linode18) and saw that there was an error
- I manually ran it on the server as the DSpace user and it said "Moving: 51633080 into core statistics-2019"
- After a few hours it died with the same error that I had seen in the log from the first run:
```
Exception: Read timed out
java.net.SocketTimeoutException: Read timed out
```
- I am not sure how I will fix that shard...
- I discovered a very interesting tool called [ftfy](https://github.com/LuminosoInsight/python-ftfy) that attempts to fix errors in UTF-8
- I'm curious to start checking input files with this to see what it highlights
- I ran it on the authors file from last week and it converted characters like those with Spanish accents from multi-byte sequences (I don't know what it's called?) to digraphs (é→é), which vim identifies as:
- Ah hah! We need to be [normalizing characters into their canonical forms](https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html)!
- In Python 3.8 we can even [check if the string is normalized using the `unicodedata` library](https://docs.python.org/3/library/unicodedata.html):
- I added support for Unicode normalization to my [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) tool in [v0.4.0](https://github.com/ilri/csv-metadata-quality/releases/tag/v0.4.0)
- Generate ILRI and Bioversity subject lists for Elizabeth Arnaud from Bioversity:
```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ilri", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 203 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-ilri-subjects.csv WITH CSV HEADER;
COPY 144
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.bioversity", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 120 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-15-bioversity-subjects.csv WITH CSV HEADER;
COPY 1325
```
- She will be meeting with FAO and will look over the terms to see if they can add some to AGROVOC
- I noticed a few errors in the ILRI subjects so I fixed them locally and on CGSpace (linode18) using my `fix-metadata.py` script:
- Extract a list of CIAT subjects from CGSpace for Elizabeth Arnaud from Bioversity:
```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.subject.ciat", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 122 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-16-ciat-subjects.csv WITH CSV HEADER;
COPY 35
```
- Start examining the 175 IITA records that Bosede originally sent in October, 2019 (201907.xls)
- We had delayed processing them because DSpace Test (linode19) was testing CG Core v2 implementation for the last few months
- Sisay uploaded the records to DSpace Test as [IITA_201907_Jan13](https://dspacetest.cgiar.org/handle/10568/106567)
- I started first with basic sanity checks using my csv-metadata-quality tool and found twenty-two items with extra whitespace, invalid multi-value separators, and duplicates, which means Sisay did not do any quality checking on the data
- I corrected one invalid AGROVOC subject
- Validate and normalize affiliations against our 2019-04 list using reconcile-csv and OpenRefine:
-`$ lein run ~/src/git/DSpace/2019-04-08-affiliations.csv name id`
- I always forget how to copy the reconciled values in OpenRefine, but you need to make a new colum and populate it using this GREL: `if(cell.recon.matched, cell.recon.match.name, value)`
- Merged: [Build main.css using npm build](https://github.com/AgriculturalSemantics/cg-core/pull/18)
- Approved a [wider scope for `cg.peer-reviewed`](https://github.com/AgriculturalSemantics/cg-core/issues/14) (renaming the field and using non-boolean values), but there is more discussion needed
- I opened a new [pull request](https://github.com/AgriculturalSemantics/cg-core/pull/24) on the cg-core repository validate and fix the formatting of the HTML files
- Create more issues for OpenRXV:
- Based on Peter's feedback on the [text for labels and tooltips](https://github.com/ilri/OpenRXV/issues/33)
- Based on Peter's feedback for the [export icon](https://github.com/ilri/OpenRXV/issues/35)
- Based on Peter's feedback for the [sort options](https://github.com/ilri/OpenRXV/issues/31)
- Based on Abenet's feedback that [PDF and Word exports are not working](https://github.com/ilri/OpenRXV/issues/34)
- I tried to create a MaxMind account so I can download the GeoLite2-City database with a license key, but their server refuses to accept me:
```
Sorry, we were not able to create your account. Please ensure that you are using an email that is not disposable, and that you are not connecting via a proxy or VPN.
```
- They started [limiting public access to the database in December, 2019 due to GDPR and CCPA](https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/)
- This will be a problem in the future (see [DS-4409](https://jira.lyrasis.org/browse/DS-4409))
- Then I decided to export them again (with two author columns) so I can perform the new Unicode normalization mode I added to [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality):
```
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
- Peter asked me to send him a list of affiliations to correct
- First I decided to export them and run the Unicode normalizations and syntax checks with csv-metadata-quality and re-import the cleaned up values:
```
dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
COPY 6162
```
- Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author "Hung, Nguyen"
- I generated a report for 2019 and 2020 with each and I see there are indeed ten more Handles in the results from L&R:
- Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven't been indexed yet
- I am curious to check tomorrow to see if they are there
- The top two hosts according to the amount of data transferred are:
- 2a01:7e00::f03c:91ff:fe9a:3a37
- 2a01:7e00::f03c:91ff:fe18:7396
- Both are on Linode, and appear to be the new and old ilri.org servers
- I will ask the web team
- Judging from the [ILRI publications site](https://www.ilri.org/publications/trade-offs-related-agricultural-use-antimicrobials-and-synergies-emanating-efforts) it seems they are downloading the PDFs so they can generate higher-quality thumbnails:
- They are apparently using this Drupal module to generate the thumbnails: `sites/all/modules/contrib/pdf_to_imagefield`
- I see some excellent suggestions in this [ImageMagick thread from 2012](https://www.imagemagick.org/discourse-server/viewtopic.php?t=21589) that lead me to some nice thumbnails (default PDF density is 72, so supersample to 4X and then resize back to 25%) as well as [this blog post](https://duncanlock.net/blog/2013/11/18/how-to-create-thumbnails-for-pdfs-with-imagemagick-on-linux/):
- Here I'm also explicitly setting the background to white and removing any alpha layers, but I could probably also just keep using `-flatten` like DSpace already does
- I did some tests with a modified version of above that uses uses `-flatten` and drops the sampling-factor and colorspace, but bumps up the image size to 600px (default on CGSpace is currently 300):
- This emulate's DSpace's method of generating a high-quality image from the PDF and then creating a thumbnail
- I put together a proof of concept of this by adding the extra options to dspace-api's `ImageMagickThumbnailFilter.java` and it works
- I need to run tests on a handful of PDFs to see if there are any side effects
- The file size is about double the old ones, but the quality is very good and the file size is nowhere near ilri.org's 400KiB PNG!
- Peter sent me the corrections and deletions for affiliations last night so I imported them into OpenRefine to work around the normal UTF-8 issue, ran them through csv-metadata-quality to make sure all Unicode values were normalized (NFC), then applied them on DSpace Test and CGSpace: