Update notes for 2019-09-21

2025-01-27 05:49:12 +01:00 · 2019-09-22 01:36:39 +03:00
parent 77bc2f3b6b
commit ddf3b1346b
77 changed files with 122 additions and 83 deletions
--- a/content/posts/2019-09.md
+++ b/content/posts/2019-09.md
@@ -291,4 +291,18 @@ $ dspace import -a me@cgiar.org -m 2019-09-20-bioversity2.map -s /home/aorth/Bio
  - Continue with institutional author normalization
  - Ask which collection to map items with type Brochure, Journal Item, and Thesis?

+## 2019-09-21
+
+- Re-upload the [IITA Sept 6 (20196th.xls) records to DSpace Test](https://dspacetest.cgiar.org/handle/10568/105116) after I did the re-sync yesterday
+  - Then I looked at the records again and sent some feedback about three duplicates to Bosede
+  - Also I noticed that many journal articles have the journal and page information in the citation, but are missing `dc.source` and `dc.format.extent` fields
+- Play with language identification using the langdetect, fasttext, polyglot, and langid libraries
+  - polyglot requires too many system dependencies to compile
+  - langdetect didn't seem as accurate as the others
+  - fasttext is likely the most accurate, but [prints a blank line to the console when loading a model](https://github.com/facebookresearch/fastText/issues/909)
+  - langid seems to be the best choice overall considering the above experiences
+- I added very experimental language detection to the [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality) module
+  - It works by checking the predicted language of the `dc.title` field against the item's `dc.language.iso` field
+  - I tested it on the Bioversity migration data set and actually managed to correct about eight incorrect language fields in their records!
+
 <!-- vim: set sw=2 ts=2: -->
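
The title-versus-language check described in the 2019-09-21 notes can be sketched roughly as follows. This is a minimal illustration, not the actual csv-metadata-quality code: the function name `find_language_mismatches`, the sample rows, and the `toy_detect` stand-in are all hypothetical, and it assumes `dc.language.iso` holds two-letter codes. With langid installed, the detector could instead be `lambda text: langid.classify(text)[0]`, since `langid.classify` returns a `(language, score)` tuple.

```python
import csv
import io


def find_language_mismatches(rows, detect,
                             title_field="dc.title",
                             language_field="dc.language.iso"):
    """Yield (title, declared, predicted) for rows whose predicted title
    language differs from the declared language field."""
    for row in rows:
        title = (row.get(title_field) or "").strip()
        declared = (row.get(language_field) or "").strip().lower()
        # Skip rows where there is nothing to compare
        if not title or not declared:
            continue
        predicted = detect(title).lower()
        if predicted != declared:
            yield title, declared, predicted


def toy_detect(text):
    """Hypothetical stand-in for a real detector such as langid.
    It only looks for a few marker words and is NOT accurate."""
    words = set(text.lower().split())
    if words & {"la", "des", "les"}:
        return "fr"
    if words & {"de", "el", "los"}:
        return "es"
    return "en"


# Made-up sample records, assuming two-letter language codes
SAMPLE_CSV = """dc.title,dc.language.iso
Banana diversity in East Africa,en
Les bananes et la nutrition,en
Manejo de los suelos,es
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
mismatches = list(find_language_mismatches(reader, toy_detect))

for title, declared, predicted in mismatches:
    print(f"{title!r}: declared {declared}, predicted {predicted}")
```

Passing the detector in as a callable keeps the comparison logic independent of whichever library ends up being used, which seems useful while the choice between langid and fasttext is still experimental.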