Update notes for 2020-01-22

2025-01-27 05:49:12 +01:00 · 2020-01-22 14:16:08 +02:00
parent 8d304b1fca
commit 7c401b10bf
3 changed files with 120 additions and 8 deletions
--- a/content/posts/2020-01.md
+++ b/content/posts/2020-01.md
@@ -162,5 +162,61 @@ Sorry, we were not able to create your account. Please ensure that you are using

 - They started [limiting public access to the database in December, 2019 due to GDPR and CCPA](https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/)
  - This will be a problem in the future (see [DS-4409](https://jira.lyrasis.org/browse/DS-4409))
+- Peter sent me his corrections for the list of authors that I had sent him earlier in the month
+  - There were encoding issues when I checked the file in vim and using Python-based tools, but OpenRefine was able to read and export it as UTF-8
+  - I will apply them on CGSpace and DSpace Test using my `fix-metadata-values.py` script:
+
+```
+$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
+```
+
+- Then I decided to export them again (with two author columns) so I can run the new Unicode normalization mode I added to [csv-metadata-quality](https://github.com/ilri/csv-metadata-quality):
+
+```
+dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.author", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
+COPY 67314
+dspace=# \q
+$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
+$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
+```
+
+- Peter asked me to send him a list of affiliations to correct
+  - First I decided to export them and run the Unicode normalizations and syntax checks with csv-metadata-quality and re-import the cleaned up values:
+
+```
+dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", text_value as "correct", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
+COPY 6170
+dspace=# \q
+$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
+$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
+```
+
+- I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:
+
+```
+$ sleep 4h && time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
+```
+
+- Then I generated a new list for Peter:
+
+```
+dspace=# \COPY (SELECT DISTINCT text_value as "cg.contributor.affiliation", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
+COPY 6162
+```
+
+- Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author "Hung, Nguyen"
+  - I generated a report for 2019 and 2020 with each and I see there are indeed ten more Handles in the results from L&R:
+
+```
+$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u > hung-nguyen-ares-handles.txt
+$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u > hung-nguyen-atmire-handles.txt
+$ wc -l hung-nguyen-a*handles.txt
+  46 hung-nguyen-ares-handles.txt
+  56 hung-nguyen-atmire-handles.txt
+ 102 total
+```
+
+- Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven't been indexed yet
+  - I am curious to check tomorrow to see if they are there

 <!-- vim: set sw=2 ts=2: -->
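
The Handle comparison in the notes above (46 Handles from AReS versus 56 from Atmire Listing and Reports) comes down to a set difference between the two lists. Here is a minimal Python sketch of that check, assuming one Handle per line in each file as produced by the `sort -u` commands in the notes. The helper names `read_handles` and `missing_handles` and the sample Handles are hypothetical and not part of the original workflow:

```python
from pathlib import Path


def read_handles(path):
    """Return the set of non-empty lines (one Handle per line) in a file."""
    text = Path(path).read_text()
    return {line.strip() for line in text.splitlines() if line.strip()}


def missing_handles(reference, other):
    """Return the Handles present in `reference` but absent from `other`, sorted."""
    return sorted(set(reference) - set(other))


if __name__ == "__main__":
    # Made-up Handles for illustration only, not real CGSpace items
    atmire = {"10568/1", "10568/2", "10568/3"}
    ares = {"10568/1", "10568/3"}
    print(missing_handles(atmire, ares))
```

With the real files, `missing_handles(read_handles("hung-nguyen-atmire-handles.txt"), read_handles("hung-nguyen-ares-handles.txt"))` would list the items that only L&R returned. That list has ten entries only if every AReS Handle also appears in the L&R results, which the line counts alone do not prove.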
--- a/docs/2020-01/index.html
+++ b/docs/2020-01/index.html
@@ -29,7 +29,7 @@ I tweeted the CGSpace repository link
 <meta property="og:type" content="article" />
 <meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-01/" />
 <meta property="article:published_time" content="2020-01-06T10:48:30+02:00" />
-<meta property="article:modified_time" content="2020-01-21T17:31:46+02:00" />
+<meta property="article:modified_time" content="2020-01-22T10:35:46+02:00" />

 <meta name="twitter:card" content="summary"/>
 <meta name="twitter:title" content="January, 2020"/>
@@ -63,9 +63,9 @@ I tweeted the CGSpace repository link
  "@type": "BlogPosting",
  "headline": "January, 2020",
  "url": "https:\/\/alanorth.github.io\/cgspace-notes\/2020-01\/",
-  "wordCount": "1209",
+  "wordCount": "1674",
  "datePublished": "2020-01-06T10:48:30+02:00",
-  "dateModified": "2020-01-21T17:31:46+02:00",
+  "dateModified": "2020-01-22T10:35:46+02:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
@@ -300,6 +300,62 @@ COPY 35
 <li>This will be a problem in the future (see <a href="https://jira.lyrasis.org/browse/DS-4409">DS-4409</a>)</li>
 </ul>
 </li>
+<li>Peter sent me his corrections for the list of authors that I had sent him earlier in the month
+<ul>
+<li>There were encoding issues when I checked the file in vim and using Python-based tools, but OpenRefine was able to read and export it as UTF-8</li>
+<li>I will apply them on CGSpace and DSpace Test using my <code>fix-metadata-values.py</code> script:</li>
+</ul>
+</li>
+</ul>
+<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-01-08-fix-2302-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct -d
+</code></pre><ul>
+<li>Then I decided to export them again (with two author columns) so I can run the new Unicode normalization mode I added to <a href="https://github.com/ilri/csv-metadata-quality">csv-metadata-quality</a>:</li>
+</ul>
+<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;dc.contributor.author&quot;, text_value as &quot;correct&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 3 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-authors.csv WITH CSV HEADER;
+COPY 67314
+dspace=# \q
+$ csv-metadata-quality -i /tmp/2020-01-22-authors.csv -o /tmp/authors-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],dc.contributor.author'
+$ ./fix-metadata-values.py -i /tmp/authors-normalized.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3 -t correct
+</code></pre><ul>
+<li>Peter asked me to send him a list of affiliations to correct
+<ul>
+<li>First I decided to export them and run the Unicode normalizations and syntax checks with csv-metadata-quality and re-import the cleaned up values:</li>
+</ul>
+</li>
+</ul>
+<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, text_value as &quot;correct&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
+COPY 6170
+dspace=# \q
+$ csv-metadata-quality -i /tmp/2020-01-22-affiliations.csv -o /tmp/affiliations-normalized.csv -u --exclude-fields 'dc.date.issued,dc.date.issued[],cg.contributor.affiliation'
+$ ./fix-metadata-values.py -i /tmp/affiliations-normalized.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211 -t correct -n
+</code></pre><ul>
+<li>I applied the corrections on DSpace Test and CGSpace, and then scheduled a full Discovery reindex for later tonight:</li>
+</ul>
+<pre><code>$ sleep 4h &amp;&amp; time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
+</code></pre><ul>
+<li>Then I generated a new list for Peter:</li>
+</ul>
+<pre><code>dspace=# \COPY (SELECT DISTINCT text_value as &quot;cg.contributor.affiliation&quot;, count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2020-01-22-affiliations.csv WITH CSV HEADER;
+COPY 6162
+</code></pre><ul>
+<li>Abenet said she noticed that she gets different results on AReS and Atmire Listing and Reports, for example with author &ldquo;Hung, Nguyen&rdquo;
+<ul>
+<li>I generated a report for 2019 and 2020 with each and I see there are indeed ten more Handles in the results from L&amp;R:</li>
+</ul>
+</li>
+</ul>
+<pre><code>$ in2csv AReS-1-801dd394-54b5-436c-ad09-4f2e25f7e62e.xlsx | sed -E 's/10568 ([0-9]+)/10568\/\1/' | csvcut -c Handle | grep -v Handle | sort -u &gt; hung-nguyen-ares-handles.txt
+$ grep -oE '10568\/[0-9]+' hung-nguyen-atmire.txt | sort -u &gt; hung-nguyen-atmire-handles.txt
+$ wc -l hung-nguyen-a*handles.txt
+  46 hung-nguyen-ares-handles.txt
+  56 hung-nguyen-atmire-handles.txt
+ 102 total
+</code></pre><ul>
+<li>Comparing the lists of items, I see that nine of the ten missing items were added less than twenty-four hours ago, and the other was added last week, so they apparently just haven't been indexed yet
+<ul>
+<li>I am curious to check tomorrow to see if they are there</li>
+</ul>
+</li>
 </ul>
 <!-- raw HTML omitted -->

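The Unicode normalization step in the notes above matters because the same visible name can be stored as different code point sequences, which then count as distinct metadata values. Here is a minimal Python sketch of the idea, assuming NFC as the target form (the notes do not say which form csv-metadata-quality applies). The function names and the sample name are hypothetical:

```python
import unicodedata


def normalize_value(value, form="NFC"):
    """Return the value normalized to the given Unicode form (NFC assumed here)."""
    return unicodedata.normalize(form, value)


def needs_normalization(value, form="NFC"):
    """Return True if normalizing would change the stored value."""
    return unicodedata.normalize(form, value) != value


if __name__ == "__main__":
    # "e" followed by a combining circumflex (U+0302): renders like a
    # precomposed "ê" but is a different sequence of code points
    decomposed = "Nguye\u0302n"
    composed = normalize_value(decomposed)
    print(needs_normalization(decomposed), len(decomposed), len(composed))
```

Values flagged by `needs_normalization` are the ones that would change in the "correct" column and so be rewritten by a later fix script. Plain ASCII values pass through unchanged.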
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,27 +4,27 @@
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
-    <lastmod>2020-01-21T17:31:46+02:00</lastmod>
+    <lastmod>2020-01-22T10:35:46+02:00</lastmod>
  </url>
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/</loc>
-    <lastmod>2020-01-21T17:31:46+02:00</lastmod>
+    <lastmod>2020-01-22T10:35:46+02:00</lastmod>
  </url>
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/2020-01/</loc>
-    <lastmod>2020-01-21T17:31:46+02:00</lastmod>
+    <lastmod>2020-01-22T10:35:46+02:00</lastmod>
  </url>
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
-    <lastmod>2020-01-21T17:31:46+02:00</lastmod>
+    <lastmod>2020-01-22T10:35:46+02:00</lastmod>
  </url>
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
-    <lastmod>2020-01-21T17:31:46+02:00</lastmod>
+    <lastmod>2020-01-22T10:35:46+02:00</lastmod>
  </url>
  
  <url>