Add notes for 2023-01-29

Alan Orth 2023-01-29 18:19:31 +03:00
parent 2c7f6b3e39
commit 81f04f48ad
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
29 changed files with 338 additions and 34 deletions

View File

@ -407,4 +407,148 @@ $ psql < locks-age.sql | grep days | less -S
- Then I ran the script to check for missing ORCID identifiers
- Then *finally*, I started a harvest on AReS
## 2023-01-23
- Salem found that you can actually harvest everything in DSpace 7 using the [`discover/browses` endpoint](https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=1&size=100)
- Exported CGSpace again to examine and clean up a bunch of stuff like ISBNs in the ISSN field, DOIs in the URL field, dataset URLs in the DOI field, normalized a bunch of publisher places, fixed some countries and regions, fixed some licenses, etc
- I noticed that we still have "North America" as a region, but according to UN M.49 that is the continent, which comprises "Northern America" the region, so I will update our controlled vocabularies and all existing entries
- I imported changes to 1,800 items
- When it finished five hours later I started a harvest on AReS
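- A minimal sketch of walking the `discover/browses` endpoint page by page, in case it is useful for a future harvester. The zero-based page numbering and the `_embedded.items` / `page.totalPages` keys are my assumptions about the DSpace 7 HAL response, and the `harvest` and `fetch_page` helpers are hypothetical names, not anything that exists in AReS:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://dspace7test.ilri.org/server/api/discover/browses/title/items"


def fetch_page(page, size=100):
    """Fetch one page of browse results and return the parsed JSON."""
    url = f"{BASE_URL}?{urlencode({'page': page, 'size': size})}"
    with urlopen(url) as response:
        return json.load(response)


def harvest(fetch=fetch_page, size=100):
    """Yield every item by requesting pages until the last one is reached."""
    page = 0  # assumption: page numbers start at zero
    while True:
        data = fetch(page, size)
        # assumption: items live under "_embedded.items" in the response
        yield from data.get("_embedded", {}).get("items", [])
        # assumption: the total page count is reported under "page.totalPages"
        total_pages = data.get("page", {}).get("totalPages", 0)
        page += 1
        if page >= total_pages:
            break
```

- The `fetch` argument is injectable so the paging logic can be tried without hitting the test server, for example with a function that returns canned pages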
## 2023-01-24
- Proof and upload seven items for the Rethinking Food Markets Initiative for IFPRI
- Export CGSpace to do some minor cleanups, Initiative collection mappings, and region fixes
- I also added "CGIAR Trust Fund" to all items with an Initiative in `cg.contributor.initiative`
## 2023-01-25
- Oh shit, the import last night ran for twelve hours and then died:
```console
Error committing changes to database: could not execute statement
Aborting most recent changes.
```
- I re-submitted a smaller version without the CGIAR Trust Fund changes for now just so we get the regions and other fixes
- Do some work on SAPLING issues for CGSpace, sending a large list of issues we found to the MEL team for items they submitted
- Abenet noticed that the number of items in the Initiatives community appears to have dropped by about 2,000 in the XMLUI
- We looked on AReS and all the items are still there
- I looked in the DSpace log and see around 2,000 messages like this:
```console
2023-01-25 07:14:59,529 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: c9fac1f2-6b2b-4941-8077-40b7b5c936b6 message:missing required field: epersonID
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
at org.dspace.core.Context.dispatchEvents(Context.java:455)
at org.dspace.core.Context.commit(Context.java:424)
at org.dspace.core.Context.complete(Context.java:380)
at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- I filed a ticket with Atmire to ask them
- For now I just did a light Discovery reindex (not the full one) and all the items appeared again
- Submit an issue to MEL GitHub regarding the capitalization of CRPs: https://github.com/CodeObia/MEL/issues/11133
- I talked to Salem and he said that this is a legacy thing from when CGSpace was using ALL CAPS for most of its metadata. I provided him with [our current controlled vocabulary for CRPs](https://ilri.github.io/cgspace-submission-guidelines/cg-contributor-crp/cg-contributor-crp.txt) and he will update it in MEL.
- On that note, Peter and Abenet and I realized that we still have an old field `cg.subject.crp` with about 450 values in it, but it has not been used for a few years (they are using the old ALL CAPS CRPs)
- I exported this list of values to lowercase them and move them to `cg.contributor.crp`
- Even if some items end up with multiple CRPs, they will get de-duplicated when I remove duplicate values soon
```console
$ ./ilri/fix-metadata-values.py -i /tmp/2023-01-25-fix-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t correct
$ ./ilri/move-metadata-values.py -i /tmp/2023-01-25-move-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t cg.contributor.crp
```
- After fixing and moving them all, I deleted the `cg.subject.crp` field from the metadata registry
- I realized a smarter way to update the text lang attributes of metadata would be to restrict the query to items that are in the archive and not withdrawn:
```sql
UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn) AND (text_lang IS NULL OR text_lang IN ('en', ''));
```
- I tried that in a transaction and it hung, so I canceled it and rolled back
- I see some PostgreSQL locks attributed to `dspaceApi` that were started at `2023-01-25 13:40:04.529087+01` and haven't changed since then (that's eight hours ago)
- I killed the pid...
- I also saw some locks owned by `dspaceWeb` that were nine and four hours old, so I killed those too...
- Now Maria was able to archive one submission of hers that was hanging all afternoon, but I still can't run the update on the text langs...
- Export entire CGSpace to do Initiative mappings again
- Started a harvest on AReS
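- A note to self on the text lang `UPDATE` from earlier today: in SQL, `AND` binds more tightly than `OR`, so the `text_lang` conditions need to be grouped in parentheses or the restriction to archived, non-withdrawn items only applies to the `IS NULL` part. A small sketch with an in-memory SQLite database to show the difference. The two toy tables and their rows are made up for illustration, and SQLite stores the booleans as integers, unlike PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (uuid TEXT PRIMARY KEY, in_archive INTEGER, withdrawn INTEGER);
CREATE TABLE metadatavalue (id INTEGER PRIMARY KEY, dspace_object_id TEXT, text_lang TEXT);
INSERT INTO item VALUES ('archived', 1, 0), ('withdrawn', 1, 1);
INSERT INTO metadatavalue VALUES
  (1, 'archived', NULL),
  (2, 'archived', 'en'),
  (3, 'withdrawn', 'en'),
  (4, 'withdrawn', '');
""")

# Same restriction as the UPDATE: only items in the archive and not withdrawn
RESTRICT = "dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)"

# Without parentheses this parses as (RESTRICT AND IS NULL) OR IN (...)
ungrouped = (
    f"SELECT id FROM metadatavalue WHERE {RESTRICT} "
    "AND text_lang IS NULL OR text_lang IN ('en', '') ORDER BY id"
)
# With parentheses the restriction applies to both text_lang conditions
grouped = (
    f"SELECT id FROM metadatavalue WHERE {RESTRICT} "
    "AND (text_lang IS NULL OR text_lang IN ('en', '')) ORDER BY id"
)

ungrouped_ids = [row[0] for row in conn.execute(ungrouped)]
grouped_ids = [row[0] for row in conn.execute(grouped)]

print(ungrouped_ids)  # [1, 2, 3, 4] (rows of the withdrawn item match too)
print(grouped_ids)    # [1, 2]
```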
## 2023-01-26
- Export entire CGSpace to do some metadata cleanup on various fields
- I also added "CGIAR Trust Fund" to all items in the Initiatives community
## 2023-01-27
- Export a list of affiliations in the Initiatives community for Peter, trying a new method to avoid exporting *everything* from PostgreSQL:
```console
$ dspace metadata-export -i 10568/115087 -f /tmp/2023-01-27-initiatives.csv
$ csvcut -c 'cg.contributor.affiliation[en_US]' /tmp/2023-01-27-initiatives.csv \
| sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
| sort | uniq -c | sort -h \
| awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}'\
| sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
> /tmp/2023-01-27-initiatives-affiliations.csv
```
- The first sed command strips the quotes, deletes empty lines, and splits multiple values on "||"
- The awk command sets the field separator to a regular expression matching the leading whitespace and count that `uniq -c` prepends, so the second "field" is the affiliation itself, ie:
```console
...
309 International Center for Agricultural Research in the Dry Areas
412 International Livestock Research Institute
```
- The second sed command adds the CSV header and quotes back
- I did the same for authors and donors and sent them to Peter to make corrections
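- For comparison, a rough Python sketch of the same idea as the csvcut, sed, sort, and awk pipeline above: split multi-value cells on "||", count the values, and write a one-column CSV. The function names and the sample rows are hypothetical, and unlike the sed version this quotes the header as well:

```python
import csv
import io
from collections import Counter


def count_values(csv_text, column, separator="||"):
    """Count each value in a multi-value column of a metadata export."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for value in (row.get(column) or "").split(separator):
            value = value.strip()
            if value:  # skip empty cells, like sed's '/^$/d'
                counts[value] += 1
    return counts


def to_single_column_csv(counts, header):
    """Write the unique values as a quoted one-column CSV, least frequent first."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([header])
    for value, _ in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        writer.writerow([value])
    return out.getvalue()


# Made-up sample rows in the shape of a DSpace metadata export
sample = (
    "id,cg.contributor.affiliation[en_US]\n"
    '1,"International Livestock Research Institute||'
    'International Center for Agricultural Research in the Dry Areas"\n'
    "2,International Livestock Research Institute\n"
    "3,\n"
)

counts = count_values(sample, "cg.contributor.affiliation[en_US]")
print(to_single_column_csv(counts, "cg.contributor.affiliation"))
```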
## 2023-01-28
- Daniel from the Alliance said they are getting an HTTP 401 when trying to submit items to CGSpace via the REST API
## 2023-01-29
- Export the entire CGSpace to do Initiatives collection mappings
- I was thinking about a way to use Crossref's API to enrich our data, for example checking registered DOIs for license information, publishers, etc
- Turns out I had already written `crossref-doi-lookup.py` last year, and it works
- I exported a list of all DOIs without licenses from CGSpace, minus the CIFOR ones because I know they aren't registered on Crossref, which is about 11,800 DOIs
```console
$ csvcut -c 'cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
| csvgrep -c 'cg.identifier.doi[en_US]' -r '.*cifor.*' -i \
| sed 1d > /tmp/2023-01-29-dois.txt
$ wc -l /tmp/2023-01-29-dois.txt
11819 /tmp/2023-01-29-dois.txt
$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-01-29-dois.txt -o /tmp/crossref-results.csv
$ csvcut -c 'id,cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
| sed -e 's_https://doi.org/__g' -e 's_https://dx.doi.org/__g' -e 's/cg.identifier.doi\[en_US\]/doi/' \
> /tmp/cgspace-temp.csv
$ csvjoin -c doi /tmp/cgspace-temp.csv /tmp/crossref-results.csv \
| csvgrep -c license -r 'creative' \
| csvcut -c id,license \
| sed '1s/license/dcterms.license[en_US]/' > /tmp/2023-01-29-new-licenses.csv
```
- The above was done with just 5,000 DOIs because it was taking a long time, and after the last step I imported the results into OpenRefine to clean up the license URLs
- Then I imported 635 new licenses to CGSpace woooo
- After checking the remaining 6,500 DOIs there were another 852 new licenses, woooo
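- The join and filter steps above could also be sketched in Python, which avoids depending on the exact DOI prefix in each row. This assumes the Crossref results have `doi` and `license` columns as in the csvjoin command; the `normalize_doi` and `new_licenses` helpers and the sample rows are hypothetical:

```python
import re

# Matches the resolver prefixes stripped by the sed command above
DOI_PREFIX = re.compile(r"^https?://(dx\.)?doi\.org/", re.IGNORECASE)


def normalize_doi(value):
    """Strip the resolver prefix and lowercase, since DOIs are case-insensitive."""
    return DOI_PREFIX.sub("", value.strip()).lower()


def new_licenses(cgspace_rows, crossref_rows):
    """Return (id, license) pairs for items whose DOI has a Creative Commons license."""
    licenses = {
        normalize_doi(row["doi"]): row["license"]
        for row in crossref_rows
        if "creative" in (row.get("license") or "").lower()
    }
    results = []
    for row in cgspace_rows:
        doi = normalize_doi(row["cg.identifier.doi[en_US]"])
        if doi in licenses:
            results.append((row["id"], licenses[doi]))
    return results


# Made-up rows for illustration only
cgspace = [
    {"id": "a1", "cg.identifier.doi[en_US]": "https://doi.org/10.1234/ABC"},
    {"id": "b2", "cg.identifier.doi[en_US]": "https://dx.doi.org/10.1234/def"},
]
crossref = [
    {"doi": "10.1234/abc", "license": "https://creativecommons.org/licenses/by/4.0/"},
    {"doi": "10.1234/def", "license": "https://www.elsevier.com/tdm/userlicense/1.0/"},
]

print(new_licenses(cgspace, crossref))
```

- The license URLs would still need the same cleanup in OpenRefine before importing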
<!-- vim: set sw=2 ts=2: -->

View File

@ -19,7 +19,7 @@ I see we have some new ones that aren&rsquo;t in our list if I combine with this
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2023-01/" />
<meta property="article:published_time" content="2023-01-01T08:44:36+03:00" />
<meta property="article:modified_time" content="2023-01-17T22:38:55+03:00" />
<meta property="article:modified_time" content="2023-01-22T21:53:45+03:00" />
@ -44,9 +44,9 @@ I see we have some new ones that aren&rsquo;t in our list if I combine with this
"@type": "BlogPosting",
"headline": "January, 2023",
"url": "https://alanorth.github.io/cgspace-notes/2023-01/",
"wordCount": "2969",
"wordCount": "4065",
"datePublished": "2023-01-01T08:44:36+03:00",
"dateModified": "2023-01-17T22:38:55+03:00",
"dateModified": "2023-01-22T21:53:45+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -584,6 +584,166 @@ I see we have some new ones that aren&rsquo;t in our list if I combine with this
</ul>
</li>
</ul>
<h2 id="2023-01-23">2023-01-23</h2>
<ul>
<li>Salem found that you can actually harvest everything in DSpace 7 using the <a href="https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=1&amp;size=100"><code>discover/browses</code> endpoint</a></li>
<li>Exported CGSpace again to examine and clean up a bunch of stuff like ISBNs in the ISSN field, DOIs in the URL field, dataset URLs in the DOI field, normalized a bunch of publisher places, fixed some countries and regions, fixed some licenses, etc
<ul>
<li>I noticed that we still have &ldquo;North America&rdquo; as a region, but according to UN M.49 that is the continent, which comprises &ldquo;Northern America&rdquo; the region, so I will update our controlled vocabularies and all existing entries</li>
<li>I imported changes to 1,800 items</li>
<li>When it finished five hours later I started a harvest on AReS</li>
</ul>
</li>
</ul>
<h2 id="2023-01-24">2023-01-24</h2>
<ul>
<li>Proof and upload seven items for the Rethinking Food Markets Initiative for IFPRI</li>
<li>Export CGSpace to do some minor cleanups, Initiative collection mappings, and region fixes
<ul>
<li>I also added &ldquo;CGIAR Trust Fund&rdquo; to all items with an Initiative in <code>cg.contributor.initiative</code></li>
</ul>
</li>
</ul>
<h2 id="2023-01-25">2023-01-25</h2>
<ul>
<li>Oh shit, the import last night ran for twelve hours and then died:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Error committing changes to database: could not execute statement
</span></span><span style="display:flex;"><span>Aborting most recent changes.
</span></span></code></pre></div><ul>
<li>I re-submitted a smaller version without the CGIAR Trust Fund changes for now just so we get the regions and other fixes</li>
<li>Do some work on SAPLING issues for CGSpace, sending a large list of issues we found to the MEL team for items they submitted</li>
<li>Abenet noticed that the number of items in the Initiatives community appears to have dropped by about 2,000 in the XMLUI
<ul>
<li>We looked on AReS and all the items are still there</li>
<li>I looked in the DSpace log and see around 2,000 messages like this:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2023-01-25 07:14:59,529 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: c9fac1f2-6b2b-4941-8077-40b7b5c936b6 message:missing required field: epersonID
</span></span><span style="display:flex;"><span>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
</span></span><span style="display:flex;"><span> at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
</span></span><span style="display:flex;"><span> at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
</span></span><span style="display:flex;"><span> at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
</span></span><span style="display:flex;"><span> at org.dspace.core.Context.dispatchEvents(Context.java:455)
</span></span><span style="display:flex;"><span> at org.dspace.core.Context.commit(Context.java:424)
</span></span><span style="display:flex;"><span> at org.dspace.core.Context.complete(Context.java:380)
</span></span><span style="display:flex;"><span> at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>I filed a ticket with Atmire to ask them</li>
<li>For now I just did a light Discovery reindex (not the full one) and all the items appeared again</li>
<li>Submit an issue to MEL GitHub regarding the capitalization of CRPs: <a href="https://github.com/CodeObia/MEL/issues/11133">https://github.com/CodeObia/MEL/issues/11133</a>
<ul>
<li>I talked to Salem and he said that this is a legacy thing from when CGSpace was using ALL CAPS for most of its metadata. I provided him with <a href="https://ilri.github.io/cgspace-submission-guidelines/cg-contributor-crp/cg-contributor-crp.txt">our current controlled vocabulary for CRPs</a> and he will update it in MEL.</li>
<li>On that note, Peter and Abenet and I realized that we still have an old field <code>cg.subject.crp</code> with about 450 values in it, but it has not been used for a few years (they are using the old ALL CAPS CRPs)</li>
<li>I exported this list of values to lowercase them and move them to <code>cg.contributor.crp</code></li>
<li>Even if some items end up with multiple CRPs, they will get de-duplicated when I remove duplicate values soon</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i /tmp/2023-01-25-fix-crp-subjects.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.subject.crp -t correct
</span></span><span style="display:flex;"><span>$ ./ilri/move-metadata-values.py -i /tmp/2023-01-25-move-crp-subjects.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -f cg.subject.crp -t cg.contributor.crp
</span></span></code></pre></div><ul>
<li>After fixing and moving them all, I deleted the <code>cg.subject.crp</code> field from the metadata registry</li>
<li>I realized a smarter way to update the text lang attributes of metadata would be to restrict the query to items that are in the archive and not withdrawn:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">UPDATE</span> metadatavalue <span style="color:#66d9ef">SET</span> text_lang<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;en_US&#39;</span> <span style="color:#66d9ef">WHERE</span> dspace_object_id <span style="color:#66d9ef">IN</span> (<span style="color:#66d9ef">SELECT</span> uuid <span style="color:#66d9ef">FROM</span> item <span style="color:#66d9ef">WHERE</span> in_archive <span style="color:#66d9ef">AND</span> <span style="color:#66d9ef">NOT</span> withdrawn) <span style="color:#66d9ef">AND</span> (text_lang <span style="color:#66d9ef">IS</span> <span style="color:#66d9ef">NULL</span> <span style="color:#66d9ef">OR</span> text_lang <span style="color:#66d9ef">IN</span> (<span style="color:#e6db74">&#39;en&#39;</span>, <span style="color:#e6db74">&#39;&#39;</span>));
</span></span></code></pre></div><ul>
<li>
<p>I tried that in a transaction and it hung, so I canceled it and rolled back</p>
</li>
<li>
<p>I see some PostgreSQL locks attributed to <code>dspaceApi</code> that were started at <code>2023-01-25 13:40:04.529087+01</code> and haven&rsquo;t changed since then (that&rsquo;s eight hours ago)</p>
<ul>
<li>I killed the pid&hellip;</li>
<li>I also saw some locks owned by <code>dspaceWeb</code> that were nine and four hours old, so I killed those too&hellip;</li>
<li>Now Maria was able to archive one submission of hers that was hanging all afternoon, but I still can&rsquo;t run the update on the text langs&hellip;</li>
</ul>
</li>
<li>
<p>Export entire CGSpace to do Initiative mappings again</p>
</li>
<li>
<p>Started a harvest on AReS</p>
</li>
</ul>
<h2 id="2023-01-26">2023-01-26</h2>
<ul>
<li>Export entire CGSpace to do some metadata cleanup on various fields
<ul>
<li>I also added &ldquo;CGIAR Trust Fund&rdquo; to all items in the Initiatives community</li>
</ul>
</li>
</ul>
<h2 id="2023-01-27">2023-01-27</h2>
<ul>
<li>Export a list of affiliations in the Initiatives community for Peter, trying a new method to avoid exporting <em>everything</em> from PostgreSQL:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace metadata-export -i 10568/115087 -f /tmp/2023-01-27-initiatives.csv
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;cg.contributor.affiliation[en_US]&#39;</span> /tmp/2023-01-27-initiatives.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | sed -e 1d -e &#39;s/^&#34;//&#39; -e &#39;s/&#34;$//&#39; -e &#39;s/||/\n/g&#39; -e &#39;/^$/d&#39; \
</span></span><span style="display:flex;"><span> | sort | uniq -c | sort -h \
</span></span><span style="display:flex;"><span> | awk &#39;BEGIN { FS = &#34;^[[:space:]]+[[:digit:]]+[[:space:]]+&#34; } {print $2}&#39;\
</span></span><span style="display:flex;"><span> | sed -e &#39;1i cg.contributor.affiliation&#39; -e &#39;s/^\(.*\)$/&#34;\1&#34;/&#39; \
</span></span><span style="display:flex;"><span> &gt; /tmp/2023-01-27-initiatives-affiliations.csv
</span></span></code></pre></div><ul>
<li>The first sed command strips the quotes, deletes empty lines, and splits multiple values on &ldquo;||&rdquo;</li>
<li>The awk command sets the field separator to a regular expression matching the leading whitespace and count that <code>uniq -c</code> prepends, so the second &ldquo;field&rdquo; is the affiliation itself, ie:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> 309 International Center for Agricultural Research in the Dry Areas
</span></span><span style="display:flex;"><span> 412 International Livestock Research Institute
</span></span></code></pre></div><ul>
<li>The second sed command adds the CSV header and quotes back</li>
<li>I did the same for authors and donors and sent them to Peter to make corrections</li>
</ul>
<h2 id="2023-01-28">2023-01-28</h2>
<ul>
<li>Daniel from the Alliance said they are getting an HTTP 401 when trying to submit items to CGSpace via the REST API</li>
</ul>
<h2 id="2023-01-29">2023-01-29</h2>
<ul>
<li>Export the entire CGSpace to do Initiatives collection mappings</li>
<li>I was thinking about a way to use Crossref&rsquo;s API to enrich our data, for example checking registered DOIs for license information, publishers, etc
<ul>
<li>Turns out I had already written <code>crossref-doi-lookup.py</code> last year, and it works</li>
<li>I exported a list of all DOIs without licenses from CGSpace, minus the CIFOR ones because I know they aren&rsquo;t registered on Crossref, which is about 11,800 DOIs</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;cg.identifier.doi[en_US]&#39;</span> ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | csvgrep -c &#39;cg.identifier.doi[en_US]&#39; -r &#39;.*cifor.*&#39; -i \
</span></span><span style="display:flex;"><span> | sed 1d &gt; /tmp/2023-01-29-dois.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2023-01-29-dois.txt
</span></span><span style="display:flex;"><span>11819 /tmp/2023-01-29-dois.txt
</span></span><span style="display:flex;"><span>$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-01-29-dois.txt -o /tmp/crossref-results.csv
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">&#39;id,cg.identifier.doi[en_US]&#39;</span> ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | sed -e &#39;s_https://doi.org/__g&#39; -e &#39;s_https://dx.doi.org/__g&#39; -e &#39;s/cg.identifier.doi\[en_US\]/doi/&#39; \
</span></span><span style="display:flex;"><span> &gt; /tmp/cgspace-temp.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c doi /tmp/cgspace-temp.csv /tmp/crossref-results.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | csvgrep -c license -r &#39;creative&#39; \
</span></span><span style="display:flex;"><span> | csvcut -c id,license \
</span></span><span style="display:flex;"><span> | sed &#39;1s/license/dcterms.license[en_US]/&#39; &gt; /tmp/2023-01-29-new-licenses.csv
</span></span></code></pre></div><ul>
<li>The above was done with just 5,000 DOIs because it was taking a long time, and after the last step I imported the results into OpenRefine to clean up the license URLs
<ul>
<li>Then I imported 635 new licenses to CGSpace woooo</li>
<li>After checking the remaining 6,500 DOIs there were another 852 new licenses, woooo</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

View File

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />

@ -3,19 +3,19 @@
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
<lastmod>2023-01-17T22:38:55+03:00</lastmod>
<lastmod>2023-01-22T21:53:45+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2023-01-17T22:38:55+03:00</lastmod>
<lastmod>2023-01-22T21:53:45+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2023-01/</loc>
<lastmod>2023-01-17T22:38:55+03:00</lastmod>
<lastmod>2023-01-22T21:53:45+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2023-01-17T22:38:55+03:00</lastmod>
<lastmod>2023-01-22T21:53:45+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
<lastmod>2023-01-17T22:38:55+03:00</lastmod>
<lastmod>2023-01-22T21:53:45+03:00</lastmod>
</url><url>
<loc>https://alanorth.github.io/cgspace-notes/2022-12/</loc>
<lastmod>2023-01-01T10:12:13+02:00</lastmod>