diff --git a/content/posts/2020-01.md b/content/posts/2020-01.md
index 6d0845fc1..8e3909eb0 100644
--- a/content/posts/2020-01.md
+++ b/content/posts/2020-01.md
@@ -162,5 +162,61 @@ Sorry, we were not able to create your account. Please ensure that you are using
 - They started [limiting public access to the database in December, 2019 due to GDPR and CCPA](https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/)
 - This will be a problem in the future (see [DS-4409](https://jira.lyrasis.org/browse/DS-4409))
+- Peter sent me his corrections for the list of authors that I had sent him earlier in the month
+  - There were encoding issues when I checked the file in vim and using Python-based tools, but OpenRefine was able to read and export it as UTF-8
+  - I will apply them on CGSpace and DSpace Test using my `fix-metadata-values.py` script:
+
+```
+$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
+```
+
+- Then I decided to export them again (with two author columns) so I can perform the new Unicode normalization mode I added to [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality):
+
+```
+dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
+COPY 67314
+dspace=# \q
+$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
+$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
+```
+
+- Peter asked me to send him a list of affiliations to correct
+  - First I decided to export them and run the Unicode normalizations and syntax checks with csv-metadata-quality and re-import the cleaned up values:
+
+```
+dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
+COPY 6170
+dspace=# \q
+$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
+$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
+```
+
+- I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:
+
+```
+$ sleep 4h && time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
+```
+
+- Then I generated a new list for Peter:
+
+```
+dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
+COPY 6162
+```
+
+- Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author "Hung, Nguyen"
+  - I generated a report for 2019 and 2020 with each and I see there are indeed ten more Handles in the results from L&R:
+
+```
+$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u > hung-nguyen-ares-handles.txt
+$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u > hung-nguyen-atmire-handles.txt
+$ wc -l hung-nguyen-a*handles.txt
+  46 hung-nguyen-ares-handles.txt
+  56 hung-nguyen-atmire-handles.txt
+ 102 total
+```
+
+- Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven't been indexed yet
+  - I am curious to check tomorrow to see if they are there
diff --git a/docs/2020-01/index.html b/docs/2020-01/index.html
index 43131c0f5..45be6d635 100644
--- a/docs/2020-01/index.html
+++ b/docs/2020-01/index.html
@@ -29,7 +29,7 @@ I tweeted the CGSpace repository link
-
+
@@ -63,9 +63,9 @@ I tweeted the CGSpace repository link
 "@type": "BlogPosting",
 "headline": "January, 2020",
 "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-01\/",
- "wordCount": "1209",
+ "wordCount": "1674",
 "datePublished": "2020-01-06T10:48:30+02:00",
- "dateModified": "2020-01-21T17:31:46+02:00",
+ "dateModified": "2020-01-22T10:35:46+02:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -300,6 +300,62 @@ COPY 35
  • This will be a problem in the future (see DS-4409)
  • +
  • Peter sent me his corrections for the list of authors that I had sent him earlier in the month + +
  • + +
    $ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
    +
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
    +COPY 67314
    +dspace=# \q
    +$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
    +$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
    +
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
    +COPY 6170
    +dspace=# \q
    +$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
    +$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
    +
    +
    $ sleep 4h && time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
    +
    +
    dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
    +COPY 6162
    +
    +
    $ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u > hung-nguyen-ares-handles.txt
    +$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u > hung-nguyen-atmire-handles.txt
    +$ wc -l hung-nguyen-a*handles.txt
    +  46 hung-nguyen-ares-handles.txt
    +  56 hung-nguyen-atmire-handles.txt
    + 102 total
    +
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 0bd7123a8..5ed9a1f90 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
 https://alanorth.github.io/cgspace-notes/categories/
- 2020-01-21T17:31:46+02:00
+ 2020-01-22T10:35:46+02:00
 https://alanorth.github.io/cgspace-notes/
- 2020-01-21T17:31:46+02:00
+ 2020-01-22T10:35:46+02:00
 https://alanorth.github.io/cgspace-notes/2020-01/
- 2020-01-21T17:31:46+02:00
+ 2020-01-22T10:35:46+02:00
 https://alanorth.github.io/cgspace-notes/categories/notes/
- 2020-01-21T17:31:46+02:00
+ 2020-01-22T10:35:46+02:00
 https://alanorth.github.io/cgspace-notes/posts/
- 2020-01-21T17:31:46+02:00
+ 2020-01-22T10:35:46+02:00
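
A note on the AReS versus Listing and Reports comparison in the 2020-01 notes: the notes compare the two handle lists but do not show how. Since both lists are written with `sort -u`, one way to find the handles that only appear in the Atmire export is `comm`. The following is only a sketch under stated assumptions: the handles and file names are made-up stand-ins in a temporary directory, not the real exports.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Use one collation order for both sort and comm, otherwise comm may
# complain that its input is not sorted
export LC_ALL=C

# Hypothetical sample data standing in for the real handle lists
tmpdir=$(mktemp -d)
printf '%s\n' 10568/100 10568/200 | sort -u > "$tmpdir/ares-handles.txt"
printf '%s\n' 10568/300 10568/100 10568/200 | sort -u > "$tmpdir/atmire-handles.txt"

# -1 hides lines unique to the first file and -3 hides lines common to both,
# so only the handles missing from the AReS list remain
comm -13 "$tmpdir/ares-handles.txt" "$tmpdir/atmire-handles.txt" > "$tmpdir/missing-from-ares.txt"

cat "$tmpdir/missing-from-ares.txt"
```

With the sample data this prints `10568/300`. Against the real files, the same `comm -13` call on `hung-nguyen-ares-handles.txt` and `hung-nguyen-atmire-handles.txt` should list the ten handles to re-check after the next indexing run, provided both files were sorted under the same locale.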