From 5353ceb221e6cf63c1c5e7e3119155e7783c2da8 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Tue, 9 Jul 2019 18:39:15 +0300
Subject: [PATCH] Add notes for 2019-07-09

---
 content/posts/2019-07.md | 37 ++++++++++++++++++++++++++++++++
 docs/2019-07/index.html  | 46 +++++++++++++++++++++++++++++++++++++---
 docs/sitemap.xml         | 10 ++++-----
 3 files changed, 85 insertions(+), 8 deletions(-)

diff --git a/content/posts/2019-07.md b/content/posts/2019-07.md
index 37ed6d9ca..3db1043bb 100644
--- a/content/posts/2019-07.md
+++ b/content/posts/2019-07.md
@@ -161,5 +161,42 @@ $ ./fix-metadata-values.py -i 2019-07-04-update-orcids.csv -db dspace -u dspace
 - Meeting with AgroKnow and CTA about their new ICT Update story telling thing
   - AgroKnow has developed a React application to display tag clouds based on harvesting metadata and full text from CGSpace items
   - We discussed how to host it technically, perhaps we purchase a server to run it on and just give AgroKnow guys access
+- Playing with the idea of using [xsv](https://github.com/BurntSushi/xsv) to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:
+
+```
+$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
+field,value,count
+cg.identifier.doi,https://doi.org/10.1016/j.agwat.2018.06.018,2
+$ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
+field,value,count
+dc.title,Reference evapotranspiration prediction using hybridized fuzzy model with firefly algorithm: Regional case study in Burkina Faso,2
+```
+
+- Or perhaps if DOIs are valid or not (having doi.org in the URL):
+
+```
+$ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
+field,value,count
+cg.identifier.doi,https://hdl.handle.net/10520/EJC-1236ac700f,1
+```
+
+- Or perhaps items with invalid ISSNs (according to the [ISSN code format](https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format)):
+
+```
+$ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '"' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
+dc.identifier.issn
+978-3-319-71997-9
+978-3-319-71997-9
+978-3-319-71997-9
+978-3-319-58789-9
+2320-7035 
+ 2593-9173
+```
+
+## 2019-07-09
+
+- Thinking about data cleaning automation again and found some resources about Python and Pandas:
+  - https://realpython.com/python-data-cleaning-numpy-pandas/
+  - https://mode.com/blog/python-data-cleaning-libraries
diff --git a/docs/2019-07/index.html b/docs/2019-07/index.html
index 90564c2a8..2cf38ccb3 100644
--- a/docs/2019-07/index.html
+++ b/docs/2019-07/index.html
@@ -21,7 +21,7 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
-
+
@@ -47,9 +47,9 @@ Abenet had another similar issue a few days ago when trying to find the stats fo
   "@type": "BlogPosting",
   "headline": "July, 2019",
   "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2019-07\/",
-  "wordCount": "855",
+  "wordCount": "1008",
   "datePublished": "2019-07-01T12:13:51\x2b03:00",
-  "dateModified": "2019-07-08T11:12:32\x2b03:00",
+  "dateModified": "2019-07-08T15:23:05\x2b03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -309,6 +309,46 @@ $ ./resolve-orcids.py -i /tmp/2019-07-04-orcid-ids.txt -o 2019-07-04-orcid-names
  • AgroKnow has developed a React application to display tag clouds based on harvesting metadata and full text from CGSpace items
  • We discussed how to host it technically, perhaps we purchase a server to run it on and just give AgroKnow guys access
  • + +
  • Playing with the idea of using xsv to do some basic batch quality checks on CSVs, for example to find items that might be duplicates if they have the same DOI or title:

    + +
    $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'
    +field,value,count
    +cg.identifier.doi,https://doi.org/10.1016/j.agwat.2018.06.018,2
    +$ xsv frequency --select dc.title --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E ',1'         
    +field,value,count
    +dc.title,Reference evapotranspiration prediction using hybridized fuzzy model with firefly algorithm: Regional case study in Burkina Faso,2
    +
  • + +
  • Or perhaps if DOIs are valid or not (having doi.org in the URL):

    + +
    $ xsv frequency --select cg.identifier.doi --no-nulls cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v -E 'doi.org'
    +field,value,count
    +cg.identifier.doi,https://hdl.handle.net/10520/EJC-1236ac700f,1
    +
  • + +
  • Or perhaps items with invalid ISSNs (according to the ISSN code format):

    + +
    $ xsv select dc.identifier.issn cgspace_metadata_africaRice-11to73_ay_id.csv | grep -v '"' | grep -v -E '^[0-9]{4}-[0-9]{3}[0-9xX]$'
    +dc.identifier.issn
    +978-3-319-71997-9
    +978-3-319-71997-9
    +978-3-319-71997-9
    +978-3-319-58789-9
    +2320-7035 
    + 2593-9173
    +
  • + + +

    2019-07-09

    + +
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 5646452bd..9a2edd965 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,30 +4,30 @@
  https://alanorth.github.io/cgspace-notes/
- 2019-07-08T11:18:51+03:00
+ 2019-07-08T15:23:05+03:00
  0
  https://alanorth.github.io/cgspace-notes/2019-07/
- 2019-07-08T11:12:32+03:00
+ 2019-07-08T15:23:05+03:00
  https://alanorth.github.io/cgspace-notes/tags/notes/
- 2019-07-08T11:18:51+03:00
+ 2019-07-08T15:23:05+03:00
  0
  https://alanorth.github.io/cgspace-notes/posts/
- 2019-07-08T11:18:51+03:00
+ 2019-07-08T15:23:05+03:00
  0
  https://alanorth.github.io/cgspace-notes/tags/
- 2019-07-08T11:18:51+03:00
+ 2019-07-08T15:23:05+03:00
  0
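
Since the 2019-07-09 note is about automating this kind of cleaning in Python, here is a minimal sketch of the three xsv checks from this patch (duplicate DOIs or titles, DOIs without doi.org, ISSNs that don't match the code format) using only the standard library. The function names are hypothetical, and the sample CSV is just assembled from values quoted in the notes plus a placeholder title; it is not the real AfricaRice export. One deliberate difference: counts are compared as numbers rather than by grepping for `,1`, and values are not stripped, so stray whitespace around an ISSN still gets flagged.

```python
import csv
import io
import re
from collections import Counter

# ISSN code format: four digits, a hyphen, three digits, then a digit or X
ISSN_PATTERN = re.compile(r"[0-9]{4}-[0-9]{3}[0-9xX]")


def read_rows(csv_file):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))


def duplicated_values(rows, column):
    """Return {value: count} for non-empty values that appear more than once."""
    counts = Counter(row[column] for row in rows if row.get(column))
    return {value: n for value, n in counts.items() if n > 1}


def non_doi_values(rows, column="cg.identifier.doi"):
    """Return the distinct non-empty values that do not contain doi.org."""
    return sorted(
        {row[column] for row in rows if row.get(column) and "doi.org" not in row[column]}
    )


def invalid_issns(rows, column="dc.identifier.issn"):
    """Return non-empty values that do not fully match the ISSN code format."""
    return [
        row[column]
        for row in rows
        if row.get(column) and not ISSN_PATTERN.fullmatch(row[column])
    ]


# Illustrative sample only: values are taken from the notes, the last title is a placeholder
TITLE = (
    "Reference evapotranspiration prediction using hybridized fuzzy model "
    "with firefly algorithm: Regional case study in Burkina Faso"
)
SAMPLE_CSV = "\n".join(
    [
        "dc.title,cg.identifier.doi,dc.identifier.issn",
        f"{TITLE},https://doi.org/10.1016/j.agwat.2018.06.018,2593-9173",
        f"{TITLE},https://doi.org/10.1016/j.agwat.2018.06.018, 2593-9173",
        "Placeholder title,https://hdl.handle.net/10520/EJC-1236ac700f,978-3-319-71997-9",
    ]
)

if __name__ == "__main__":
    rows = read_rows(io.StringIO(SAMPLE_CSV))
    print("Duplicate DOIs:", duplicated_values(rows, "cg.identifier.doi"))
    print("Duplicate titles:", duplicated_values(rows, "dc.title"))
    print("Not doi.org:", non_doi_values(rows))
    print("Invalid ISSNs:", invalid_issns(rows))
```

To run it against a real export, pass an opened file to `read_rows` instead of the `io.StringIO` sample, for example `open(path, newline="", encoding="utf-8")`.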