mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-18 04:37:04 +01:00
title | date | author | tags
---|---|---|---
June, 2018 | 2018-06-04T19:49:54-07:00 | Alan Orth |
2018-06-04
- Test the DSpace 5.8 module upgrades from Atmire (#378)
- There seems to be a problem with the CUA and L&R versions in `pom.xml` because they are using SNAPSHOT and it doesn't build
- I added the new CCAFS Phase II Project Tag `PII-FP1_PACCA2` and merged it into the `5_x-prod` branch (#379)
- I proofed and tested the ILRI author corrections that Peter sent back to me this week:
$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
- I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in [March, 2018]({{< relref "2018-03.md" >}})
- Time to index ~70,000 items on CGSpace:
$ time schedtool -D -e ionice -c2 -n7 nice -n19 [dspace]/bin/dspace index-discovery -b
real 74m42.646s
user 8m5.056s
sys 2m7.289s
2018-06-06
- It turns out that I needed to add a server block for `atmire.com-snapshots` to my Maven settings, so now the Atmire code builds
- Now Maven and Ant run properly, but I'm getting SQL migration errors in `dspace.log` after starting Tomcat
- I've updated my ticket on Atmire's bug tracker: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=560
2018-06-07
- Proofing 200 IITA records on DSpace Test for Sisay: IITA_Junel_06 (10568/95391)
- Misspelled authorship type: "CGAIR single center" should be "CGIAR single centre"
- I see some encoding errors in author affiliations, for example:
- Universidade de SÆo Paulo
- Institut National des Recherches Agricoles du B nin
- Centre de Coop ration Internationale en Recherche Agronomique pour le D veloppement
- Institut des Recherches Agricoles du B nin
- Institut des Savannes, C te d' Ivoire
- Institut f r Pflanzenpathologie und Pflanzenschutz der Universit t, Germany
- Projet de Gestion des Ressources Naturelles, B nin
- Universit t Hannover
- Universit F lix Houphouet-Boigny
- I uploaded fixes for all those now, but I will continue with the rest of the data later
- Regarding the SQL migration errors, Atmire told me I need to run some migrations manually in PostgreSQL:
delete from schema_version where version = '5.6.2015.12.03.2';
update schema_version set version = '5.6.2015.12.03.2' where version = '5.5.2015.12.03.2';
update schema_version set version = '5.8.2015.12.03.3' where version = '5.5.2015.12.03.3';
- And then I need to run the migrations that were previously ignored:
$ ~/dspace/bin/dspace database migrate ignored
- Now DSpace starts up properly!
- Gabriela from CIP got back to me about the author names we were correcting on CGSpace
- I did a quick sanity check on them and then did a test import with my `fix-metadata-values.py` script:
$ ./fix-metadata-values.py -i /tmp/2018-06-08-CIP-Authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
- I will apply them on CGSpace tomorrow I think...
2018-06-09
- It's pretty annoying, but the JVM monitoring for Munin was never set up when I migrated DSpace Test to its new server a few months ago
- I ran the tomcat and munin-node tags in Ansible again and now the stuff is all wired up and recording stats properly
- I applied the CIP author corrections on CGSpace and DSpace Test and re-ran the Discovery indexing
2018-06-10
- I spent some time removing the Atmire Metadata Quality Module (MQM) from the proposed DSpace 5.8 changes
- After removing all code mentioning MQM, mqm, metadata-quality, batchedit, duplicatechecker, etc, I think I got most of it removed, but there is a Spring error during Tomcat startup:
INFO [org.dspace.servicemanager.DSpaceServiceManager] Shutdown DSpace core service manager
Failed to startup the DSpace Service Manager: failure starting up spring service manager: Error creating bean with name 'org.dspace.servicemanager.spring.DSpaceBeanPostProcessor#0' defined in class path resource [spring/spring-dspace-applicationContext.xml]: Unsatisfied dependency expressed through constructor argument with index 0 of type [org.dspace.servicemanager.config.DSpaceConfigurationService]: : Cannot find class [com.atmire.dspace.discovery.ItemCollectionPlugin] for bean with name 'itemCollectionPlugin' defined in file [/home/aorth/dspace/config/spring/api/discovery.xml];
- I can fix this by commenting out the `ItemCollectionPlugin` line of `discovery.xml`, but from looking at the git log I'm not actually sure if that is related to MQM or not
- I will have to ask Atmire
- I continued to look at Sisay's IITA records from last week
- I normalized all DOIs to use HTTPS and "doi.org" instead of "dx.doi.org"
- I cleaned up white space in `cg.subject.iita` and `dc.subject`
- Even a bunch of IITA and AGROVOC subjects are missing accents, ie "FERTILIT DU SOL"
- More organization names in `dc.description.sponsorship` are incorrect (ie, missing accents) or inconsistent (ie, CGIAR centers should be spelled in English or multiple spellings of the same one, like "Rockefeller Foundation" and "Rockefeller foundation")
- A few dozen items have abstracts with character encoding errors, ie:
- 33.7øC
- MgSO4ú7H2O
- ha??1&/sup;
- En gen6ral
- dÕpassÕ
- Also the abstracts have missing accents, ie "recherche sur le d veloppement"
- I will have to tell IITA people to redo these entirely I think...
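The DOI normalization mentioned above could be scripted along these lines. This is only a sketch: `normalize_doi` is a hypothetical helper, not one of the scripts used in these notes, and the DOI in the usage example is made up.

```python
import re

# Hypothetical helper, not part of the scripts mentioned in these notes.
# Matches an optional scheme followed by doi.org or dx.doi.org.
_DOI_PREFIX = re.compile(r"^\s*(?:https?://)?(?:dx\.)?doi\.org/", re.IGNORECASE)


def normalize_doi(value: str) -> str:
    """Return the DOI as an https://doi.org/ URL.

    Values that do not look like a DOI (nothing starting with "10." once
    the resolver prefix is removed) are returned unchanged so that they
    can be reviewed by hand.
    """
    bare = _DOI_PREFIX.sub("", value.strip())
    if not bare.startswith("10."):
        return value
    return "https://doi.org/" + bare


if __name__ == "__main__":
    # Made-up DOI, for illustration only
    print(normalize_doi("http://dx.doi.org/10.1000/xyz"))
```

Values that don't look like a DOI are passed through untouched on purpose, since the ones with spaces or invalid characters still need a human to look at them.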
2018-06-11
- Sisay sent a new version of the last IITA records that he created from the original CSV from IITA
- The 200 records are in the IITA_Junel_11 (10568/95870) collection
- Many errors:
- Authorship types: "CGIAR ans advanced research institute", "CGAIR and advanced research institute", "CGIAR and advanced research institutes", "CGAIR single center"
- Lots of inconsistencies and misspellings in author affiliations:
- "Institut des Recherches Agricoles du Bénin" and "Institut National des Recherche Agricoles du Benin" and "National Agricultural Research Institute, Benin"
- International Insitute of Tropical Agriculture
- Centro Internacional de Agricultura Tropical
- "Rivers State University of Science and Technology" and "Rivers State University"
- "Institut de la Recherche Agronomique, Cameroon" and "Institut de Recherche Agronomique, Cameroon"
- Inconsistency in countries: "COTE D’IVOIRE" and "COTE D'IVOIRE"
- A few DOIs with spaces or invalid characters
- Inconsistency in IITA subjects, for example "PRODUCTION VEGETALE" and "PRODUCTION VÉGÉTALE" and several others
- I ran `value.unescape('javascript')` on the abstract and citation fields because it looks like this data came from a SQL database and some stuff was escaped
- It turns out that Abenet actually did a lot of small corrections on this data so when Sisay uses Bosede's original file it doesn't have all those corrections
- So I told Sisay to re-create the collection using Abenet's XLS from last week (`Mercy1805_AY.xls`)
- I was curious to see if I could create a GREL expression for use with a custom text facet in OpenRefine to find cells with two or more consecutive spaces
- I always use the built-in trim and collapse transformations anyways, but this seems to work to find the offending cells:
isNotNull(value.match(/.*?\s{2,}.*?/))
- I wonder if I should start checking for "smart" quotes like ’ (hex 2019)
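Outside of OpenRefine, the same two checks (consecutive whitespace and "smart" quotes) could be sketched in Python. The function name and the exact set of quote characters are my own assumptions, not something from the GREL above.

```python
import re

# Two or more consecutive whitespace characters, like the GREL facet above
MULTI_SPACE = re.compile(r"\s{2,}")
# Assumed set of "smart" quotes: U+2018, U+2019, U+201C, U+201D
SMART_QUOTES = re.compile("[\u2018\u2019\u201c\u201d]")


def check_value(value: str) -> list:
    """Return a list of problems found in a single metadata value."""
    problems = []
    if value != value.strip():
        problems.append("leading/trailing whitespace")
    if MULTI_SPACE.search(value):
        problems.append("consecutive whitespace")
    if SMART_QUOTES.search(value):
        problems.append("smart quote")
    return problems


if __name__ == "__main__":
    for candidate in ["COTE D\u2019IVOIRE", "Rivers  State University"]:
        print(repr(candidate), check_value(candidate))
```

An empty list means the value passed all three checks.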
2018-06-12
- Udana from IWMI asked about the OAI base URL for their community on CGSpace
- I think it should be this: https://cgspace.cgiar.org/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=com_10568_16814
- The style sheet obfuscates the data, but if you look at the source it is all there, including information about pagination of results
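For anyone harvesting that set, the pagination works through the `resumptionToken` element that OAI-PMH puts at the end of each `ListRecords` response. A minimal sketch of building the first request and the follow-up requests, assuming the base URL above (the function names are hypothetical, and no HTTP request is made here):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 responses
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE_URL = "https://cgspace.cgiar.org/oai/request"


def first_page_url(set_spec: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the initial ListRecords request for one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "set": set_spec}
    return BASE_URL + "?" + urlencode(params)


def next_page_url(response_xml: str):
    """Return the URL of the next page, or None when there is no token.

    Follow-up requests carry only the verb and the resumptionToken.
    """
    root = ET.fromstring(response_xml)
    token = root.find(".//oai:resumptionToken", OAI_NS)
    if token is None or not (token.text or "").strip():
        return None
    params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(first_page_url("com_10568_16814"))
```

A harvester would fetch `first_page_url(...)`, then keep calling `next_page_url()` on each response body until it returns `None`.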
- Regarding Udana's Book Chapters and Reports on DSpace Test last week, Abenet told him to fix some character encoding and CRP issues, then I told him I'd check them after that
- The latest batch of IITA's 200 records (based on Abenet's version `Mercy1805_AY.xls`) is now in the IITA_Jan_9_II_Ab collection
- So here are some corrections:
- use of Unicode smart quote (hex 2019) in countries and affiliations, for example "COTE D’IVOIRE" and "Institut d’Economic Rurale, Mali"
- inconsistencies in `cg.contributor.affiliation`:
- "Centro Internacional de Agricultura Tropical" and "Centro International de Agricultura Tropical" should use the English name of CIAT (International Center for Tropical Agriculture)
- "Institut International d'Agriculture Tropicale" should use the English name of IITA (International Institute of Tropical Agriculture)
- "East and Southern Africa Regional Center" and "Eastern and Southern Africa Regional Centre"
- "Institut de la Recherche Agronomique, Cameroon" and "Institut de Recherche Agronomique, Cameroon"
- "Institut des Recherches Agricoles du Bénin" and "Institut National des Recherche Agricoles du Benin" and "National Agricultural Research Institute, Benin"
- "Institute of Agronomic Research, Cameroon" and "Institute of Agronomy Research, Cameroon"
- "Rivers State University" and "Rivers State University of Science and Technology"
- "Universität Hannover" and "University of Hannover"
- inconsistencies in `cg.subject.iita`:
- "AMELIORATION DES PLANTES" and "AMÉLIORATION DES PLANTES"
- "PRODUCTION VEGETALE" and "PRODUCTION VÉGÉTALE"
- "CONTRÔLE DE MALADIES" and "CONTROLE DES MALADIES"
- "HANDLING, TRANSPORT, STORAGE AND PROTECTION OF AGRICULTURAL PRODUCT" and "HANDLING, TRANSPORT, STORAGE AND PROTECTION OF AGRICULTURAL PRODUCTS"
- "RAVAGEURS DE PLANTES" and "RAVAGEURS DES PLANTES"
- "SANTE DES PLANTES" and "SANTÉ DES PLANTES"
- "SOCIOECONOMIE" and "SOCIOECONOMY"
- inconsistencies in `dc.description.sponsorship`:
- "Belgian Corporation" and "Belgium Corporation"
- inconsistencies in `dc.subject`:
- "AFRICAN CASSAVA MOSAIC" and "AFRICAN CASSAVA MOSAIC DISEASE"
- "ASPERGILLU FLAVUS" and "ASPERGILLUS FLAVUS"
- "BIOTECHNOLOGIES" and "BIOTECHNOLOGY"
- "CASSAVA MOSAIC DISEASE" and "CASSAVA MOSAIC DISEASES" and "CASSAVA MOSAIC VIRUS"
- "CASSAVA PROCESSING" and "CASSAVA PROCESSING TECHNOLOGY"
- "CROPPING SYSTEM" and "CROPPING SYSTEMS"
- "DRY SEASON" and "DRY-SEASON"
- "FERTILIZER" and "FERTILIZERS"
- "LEGUME" and "LEGUMES"
- "LEGUMINOSAE" and "LEGUMINOUS"
- "LEGUMINOUS COVER CROP" and "LEGUMINOUS COVER CROPS"
- "MATÉRIEL DE PLANTATION" and "MATÉRIELS DE PLANTATION"
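Some of the inconsistencies above (missing accents, smart quotes, case and spacing differences) could probably be found automatically by grouping values on a normalized key. This is a rough sketch with hypothetical function names. It will not catch singular/plural pairs like "LEGUME" and "LEGUMES" or real misspellings, which still need manual review or OpenRefine's clustering.

```python
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Key that ignores accents, smart apostrophes, case and extra spaces."""
    decomposed = unicodedata.normalize("NFKD", value)
    # Drop combining marks, so "É" compares equal to "E"
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat the Unicode smart quote (hex 2019) as a plain apostrophe
    no_accents = no_accents.replace("\u2019", "'")
    return " ".join(no_accents.casefold().split())


def find_variants(values) -> dict:
    """Map each key to its spellings, keeping keys with more than one."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    subjects = ["PRODUCTION VEGETALE", "PRODUCTION V\u00c9G\u00c9TALE", "LEGUMES"]
    for key, spellings in find_variants(subjects).items():
        print(key, spellings)
```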
- I noticed that some records do have encoding errors in the `dc.description.abstract` field, but only four of them so probably not from Abenet's handling of the XLS file
- Based on manually eyeballing the text I used a custom text facet with this GREL to identify the records:
or(
value.contains('€'),
value.contains('6g'),
value.contains('6m'),
value.contains('6d'),
value.contains('6e')
)
- So IITA should double check the abstracts for these:
2018-06-13
- Elizabeth from CIAT contacted me to ask if I could add ORCID identifiers to all of Robin Buruchara's items
- I used my add-orcid-identifiers-csv.py script:
$ ./add-orcid-identifiers-csv.py -i 2018-06-13-Robin-Buruchara.csv -db dspace -u dspace -p 'fuuu'
- The contents of `2018-06-13-Robin-Buruchara.csv` were:
dc.contributor.author,cg.creator.id
"Buruchara, Robin",Robin Buruchara: 0000-0003-0934-1218
"Buruchara, Robin A.",Robin Buruchara: 0000-0003-0934-1218
- On a hunch I checked to see if CGSpace's bitstream cleanup was working properly and of course it's broken:
$ dspace cleanup -v
...
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(152402) is still referenced from table "bundle".
- As always, the solution is to unset that ID as the bundle's primary bitstream manually in PostgreSQL:
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (152402);'
UPDATE 1
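Since this happens regularly, the bitstream ID could be pulled out of the cleanup error and turned into the SQL statement automatically. A minimal sketch, assuming the error format shown above stays the same (both function names are hypothetical, and nothing here touches the database):

```python
import re

# Matches the "Detail:" line printed by the failed cleanup above
KEY_PATTERN = re.compile(
    r'Key \(bitstream_id\)=\((\d+)\) is still referenced from table "bundle"'
)


def bitstream_ids(cleanup_output: str) -> list:
    """Extract the offending bitstream IDs from the cleanup output."""
    return sorted({int(match) for match in KEY_PATTERN.findall(cleanup_output)})


def unset_primary_bitstream_sql(ids) -> str:
    """Build the UPDATE statement used above for one or more IDs."""
    if not ids:
        raise ValueError("no bitstream IDs given")
    # int() guards against anything that is not a plain number
    id_list = ", ".join(str(int(i)) for i in ids)
    return (
        "update bundle set primary_bitstream_id=NULL "
        f"where primary_bitstream_id in ({id_list});"
    )


if __name__ == "__main__":
    detail = 'Detail: Key (bitstream_id)=(152402) is still referenced from table "bundle".'
    print(unset_primary_bitstream_sql(bitstream_ids(detail)))
```

The generated statement would still have to be passed to `psql` by hand, followed by another `dspace cleanup -v` to see whether more IDs turn up.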