mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-22 06:35:03 +01:00
Add notes for 2023-01-29
parent 2c7f6b3e39
commit 81f04f48ad
@ -407,4 +407,148 @@ $ psql < locks-age.sql | grep days | less -S
- Then I ran the script to check for missing ORCID identifiers
- Then *finally*, I started a harvest on AReS

## 2023-01-23

- Salem found that you can actually harvest everything in DSpace 7 using the [`discover/browses` endpoint](https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=1&size=100)
- Exported CGSpace again to examine and clean up a bunch of stuff like ISBNs in the ISSN field, DOIs in the URL field, dataset URLs in the DOI field, normalized a bunch of publisher places, fixed some countries and regions, fixed some licenses, etc
  - I noticed that we still have "North America" as a region, but according to UN M.49 that is the continent, which comprises "Northern America" the region, so I will update our controlled vocabularies and all existing entries
  - I imported changes to 1,800 items
  - When it finished five hours later I started a harvest on AReS
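- A minimal Python sketch of how such a paginated harvest might look, assuming the endpoint returns HAL-style JSON with the items under `_embedded.items` and that an empty page marks the end (neither is confirmed in these notes, and `harvest_items` and `fetch_page` are hypothetical names):

```python
from typing import Callable, Iterator

# URL taken from the notes above; everything else here is illustrative.
BASE_URL = "https://dspace7test.ilri.org/server/api/discover/browses/title/items"


def harvest_items(
    fetch_page: Callable[[str], dict], size: int = 100, first_page: int = 0
) -> Iterator[dict]:
    """Yield items page by page until a page comes back empty.

    fetch_page is any callable that takes a URL and returns the decoded JSON
    body. The "_embedded" / "items" keys and the zero-based first_page are
    assumptions about the response format, not something the notes confirm.
    """
    page = first_page
    while True:
        url = f"{BASE_URL}?page={page}&size={size}"
        body = fetch_page(url)
        items = body.get("_embedded", {}).get("items", [])
        if not items:
            break
        yield from items
        page += 1
```

- Passing `fetch_page` in keeps the sketch independent of any particular HTTP library, so it can be tried against canned responses before pointing it at a real server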

## 2023-01-24

- Proof and upload seven items for the Rethinking Food Markets Initiative for IFPRI
- Export CGSpace to do some minor cleanups, Initiative collection mappings, and region fixes
  - I also added "CGIAR Trust Fund" to all items with an Initiative in `cg.contributor.initiative`

## 2023-01-25

- Oh shit, the import last night ran for twelve hours and then died:

```console
Error committing changes to database: could not execute statement
Aborting most recent changes.
```

- I re-submitted a smaller version without the CGIAR Trust Fund changes for now just so we get the regions and other fixes
- Do some work on SAPLING issues for CGSpace, sending a large list of issues we found to the MEL team for items they submitted
- Abenet noticed that the number of items in the Initiatives community appears to have dropped by about 2,000 in the XMLUI
  - We looked on AReS and all the items are still there
  - I looked in the DSpace log and see around 2,000 messages like this:

```console
2023-01-25 07:14:59,529 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: c9fac1f2-6b2b-4941-8077-40b7b5c936b6 message:missing required field: epersonID
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
    at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
    at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
    at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
    at org.dspace.core.Context.dispatchEvents(Context.java:455)
    at org.dspace.core.Context.commit(Context.java:424)
    at org.dspace.core.Context.complete(Context.java:380)
    at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
    at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```

- I filed a ticket with Atmire to ask them about it
- For now I just did a light Discovery reindex (not the full one) and all the items appeared again
- Submit an issue to MEL GitHub regarding the capitalization of CRPs: https://github.com/CodeObia/MEL/issues/11133
  - I talked to Salem and he said that this is a legacy thing from when CGSpace was using ALL CAPS for most of its metadata. I provided him with [our current controlled vocabulary for CRPs](https://ilri.github.io/cgspace-submission-guidelines/cg-contributor-crp/cg-contributor-crp.txt) and he will update it in MEL.
  - On that note, Peter and Abenet and I realized that we still have an old field `cg.subject.crp` with about 450 values in it, but it has not been used for a few years (they are using the old ALL CAPS CRPs)
  - I exported this list of values to lowercase them and move them to `cg.contributor.crp`
  - Even if some items end up with multiple CRPs, they will get de-duplicated when I remove duplicate values soon

```console
$ ./ilri/fix-metadata-values.py -i /tmp/2023-01-25-fix-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t correct
$ ./ilri/move-metadata-values.py -i /tmp/2023-01-25-move-crp-subjects.csv -db dspace -u dspace -p 'fuuu' -f cg.subject.crp -t cg.contributor.crp
```

- After fixing and moving them all, I deleted the `cg.subject.crp` field from the metadata registry
- I realized a smarter way to update the text lang attributes of metadata would be to restrict the query to items that are in the archive and not withdrawn:

```sql
UPDATE metadatavalue SET text_lang='en_US' WHERE dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn) AND (text_lang IS NULL OR text_lang IN ('en', ''));
```
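
- In SQL, `AND` binds more tightly than `OR`, so how the `text_lang` conditions are grouped decides whether the "in archive and not withdrawn" restriction applies to every matched row. A small sketch of the difference, using SQLite as a stand-in for PostgreSQL and a made-up two-item schema (the table contents and the `count_rows` helper are purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (uuid TEXT PRIMARY KEY, in_archive INTEGER, withdrawn INTEGER);
    CREATE TABLE metadatavalue (dspace_object_id TEXT, text_lang TEXT);
    -- one archived item and one withdrawn item (toy data)
    INSERT INTO item VALUES ('archived', 1, 0), ('withdrawn', 1, 1);
    INSERT INTO metadatavalue VALUES
        ('archived', NULL), ('archived', 'en'),
        ('withdrawn', NULL), ('withdrawn', 'en');
    """
)

ARCHIVED = "dspace_object_id IN (SELECT uuid FROM item WHERE in_archive AND NOT withdrawn)"


def count_rows(where: str) -> int:
    """Count metadatavalue rows matching a WHERE clause."""
    return conn.execute(f"SELECT COUNT(*) FROM metadatavalue WHERE {where}").fetchone()[0]


# Parsed as (ARCHIVED AND text_lang IS NULL) OR text_lang IN (...),
# so the 'en' row of the withdrawn item matches too.
ungrouped = count_rows(f"{ARCHIVED} AND text_lang IS NULL OR text_lang IN ('en', '')")

# With explicit grouping only rows of the archived item match.
grouped = count_rows(f"{ARCHIVED} AND (text_lang IS NULL OR text_lang IN ('en', ''))")

print(ungrouped, grouped)
```

- Running the same `WHERE` clause as a `SELECT COUNT(*)` first is a cheap way to check how many rows an `UPDATE` would touch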

- I tried that in a transaction and it hung, so I canceled it and rolled back
- I see some PostgreSQL locks attributed to `dspaceApi` that were started at `2023-01-25 13:40:04.529087+01` and haven't changed since then (that's eight hours ago)
  - I killed the pid...
  - I also saw some locks owned by `dspaceWeb` that were nine and four hours old, so I killed those too...
  - Now Maria was able to archive one submission of hers that was hanging all afternoon, but I still can't run the update on the text langs...

- Export entire CGSpace to do Initiative mappings again
- Started a harvest on AReS

## 2023-01-26

- Export entire CGSpace to do some metadata cleanup on various fields
  - I also added "CGIAR Trust Fund" to all items in the Initiatives community

## 2023-01-27

- Export a list of affiliations in the Initiatives community for Peter, trying a new method to avoid exporting *everything* from PostgreSQL:

```console
$ dspace metadata-export -i 10568/115087 -f /tmp/2023-01-27-initiatives.csv
$ csvcut -c 'cg.contributor.affiliation[en_US]' /tmp/2023-01-27-initiatives.csv \
  | sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
  | sort | uniq -c | sort -h \
  | awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}' \
  | sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
  > /tmp/2023-01-27-initiatives-affiliations.csv
```

- The first sed command removes the header, strips the quotes, splits multiple values on "||", and deletes empty lines
- The awk command sets the field separator to a regex matching the leading whitespace and count, so we can get the second "field" of the sorted `uniq -c` output, ie:

```console
...
    309 International Center for Agricultural Research in the Dry Areas
    412 International Livestock Research Institute
```

- The second sed command adds the CSV header and quotes back
- I did the same for authors and donors and sent them to Peter to make corrections
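
- For comparison, a rough Python sketch of the same split-count-sort idea, assuming the export is a regular CSV with multi-value cells joined by "||" (the `count_values` and `to_csv` helpers are hypothetical and not part of the `ilri` scripts):

```python
import csv
import io
from collections import Counter


def count_values(csv_text: str, column: str, separator: str = "||") -> Counter:
    """Count every value in a multi-value column, skipping empty cells."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for value in (row.get(column) or "").split(separator):
            value = value.strip()
            if value:
                counts[value] += 1
    return counts


def to_csv(counts: Counter, header: str) -> str:
    """Write the unique values as a one-column quoted CSV, least frequent first."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([header])
    # Ascending by count, like `sort -h` on the `uniq -c` output
    for value, _ in sorted(counts.items(), key=lambda pair: pair[1]):
        writer.writerow([value])
    return out.getvalue()
```

- Using the `csv` module avoids the manual quote stripping and re-quoting, at the cost of reading the whole column into memory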

## 2023-01-28

- Daniel from the Alliance said they are getting an HTTP 401 when trying to submit items to CGSpace via the REST API

## 2023-01-29

- Export the entire CGSpace to do Initiatives collection mappings
- I was thinking about a way to use Crossref's API to enrich our data, for example checking registered DOIs for license information, publishers, etc
  - Turns out I had already written `crossref-doi-lookup.py` last year, and it works
  - I exported a list of all DOIs without licenses from CGSpace, minus the CIFOR ones because I know they aren't registered on Crossref, which is about 11,800 DOIs

```console
$ csvcut -c 'cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
  | csvgrep -c 'cg.identifier.doi[en_US]' -r '.*cifor.*' -i \
  | sed 1d > /tmp/2023-01-29-dois.txt
$ wc -l /tmp/2023-01-29-dois.txt
11819 /tmp/2023-01-29-dois.txt
$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-01-29-dois.txt -o /tmp/crossref-results.csv
$ csvcut -c 'id,cg.identifier.doi[en_US]' ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv \
  | sed -e 's_https://doi.org/__g' -e 's_https://dx.doi.org/__g' -e 's/cg.identifier.doi\[en_US\]/doi/' \
  > /tmp/cgspace-temp.csv
$ csvjoin -c doi /tmp/cgspace-temp.csv /tmp/crossref-results.csv \
  | csvgrep -c license -r 'creative' \
  | csvcut -c id,license \
  | sed '1s/license/dcterms.license[en_US]/' > /tmp/2023-01-29-new-licenses.csv
```

- The above was done with just 5,000 DOIs because it was taking a long time, but after the last step I imported into OpenRefine to clean up the license URLs
  - Then I imported 635 new licenses to CGSpace woooo
  - After checking the remaining 6,500 DOIs there were another 852 new licenses, woooo
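
- A small Python sketch of the per-DOI steps, assuming the Crossref REST API's `/works/{doi}` response carries a `license` list with `URL` keys under `message` (this is not the actual `crossref-doi-lookup.py`; the three helper names are made up for illustration and no network call is made here):

```python
from urllib.parse import quote


def normalize_doi(value: str) -> str:
    """Strip the resolver prefix, like the sed expressions above."""
    for prefix in ("https://doi.org/", "https://dx.doi.org/"):
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def crossref_url(doi: str, mailto: str) -> str:
    """Build a works lookup URL; mailto identifies the caller to Crossref."""
    return f"https://api.crossref.org/works/{quote(doi)}?mailto={quote(mailto)}"


def creative_commons_licenses(work: dict) -> list:
    """Return only the Creative Commons license URLs from a decoded response."""
    licenses = work.get("message", {}).get("license", [])
    return [
        entry["URL"]
        for entry in licenses
        if "creativecommons" in entry.get("URL", "")
    ]
```

- Keeping only Creative Commons URLs mirrors the `csvgrep -r 'creative'` filter, since publisher text-mining licenses are not useful for `dcterms.license`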

<!-- vim: set sw=2 ts=2: -->
@ -19,7 +19,7 @@ I see we have some new ones that aren’t in our list if I combine with this
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2023-01/" />
<meta property="article:published_time" content="2023-01-01T08:44:36+03:00" />
<meta property="article:modified_time" content="2023-01-17T22:38:55+03:00" />
<meta property="article:modified_time" content="2023-01-22T21:53:45+03:00" />
@ -44,9 +44,9 @@ I see we have some new ones that aren’t in our list if I combine with this
"@type": "BlogPosting",
"headline": "January, 2023",
"url": "https://alanorth.github.io/cgspace-notes/2023-01/",
"wordCount": "2969",
"wordCount": "4065",
"datePublished": "2023-01-01T08:44:36+03:00",
"dateModified": "2023-01-17T22:38:55+03:00",
"dateModified": "2023-01-22T21:53:45+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -584,6 +584,166 @@ I see we have some new ones that aren’t in our list if I combine with this
</ul>
</li>
</ul>
<h2 id="2023-01-23">2023-01-23</h2>
<ul>
<li>Salem found that you can actually harvest everything in DSpace 7 using the <a href="https://dspace7test.ilri.org/server/api/discover/browses/title/items?page=1&size=100"><code>discover/browses</code> endpoint</a></li>
<li>Exported CGSpace again to examine and clean up a bunch of stuff like ISBNs in the ISSN field, DOIs in the URL field, dataset URLs in the DOI field, normalized a bunch of publisher places, fixed some countries and regions, fixed some licenses, etc
<ul>
<li>I noticed that we still have “North America” as a region, but according to UN M.49 that is the continent, which comprises “Northern America” the region, so I will update our controlled vocabularies and all existing entries</li>
<li>I imported changes to 1,800 items</li>
<li>When it finished five hours later I started a harvest on AReS</li>
</ul>
</li>
</ul>
<h2 id="2023-01-24">2023-01-24</h2>
<ul>
<li>Proof and upload seven items for the Rethinking Food Markets Initiative for IFPRI</li>
<li>Export CGSpace to do some minor cleanups, Initiative collection mappings, and region fixes
<ul>
<li>I also added “CGIAR Trust Fund” to all items with an Initiative in <code>cg.contributor.initiative</code></li>
</ul>
</li>
</ul>
<h2 id="2023-01-25">2023-01-25</h2>
<ul>
<li>Oh shit, the import last night ran for twelve hours and then died:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>Error committing changes to database: could not execute statement
</span></span><span style="display:flex;"><span>Aborting most recent changes.
</span></span></code></pre></div><ul>
<li>I re-submitted a smaller version without the CGIAR Trust Fund changes for now just so we get the regions and other fixes</li>
<li>Do some work on SAPLING issues for CGSpace, sending a large list of issues we found to the MEL team for items they submitted</li>
<li>Abenet noticed that the number of items in the Initiatives community appears to have dropped by about 2,000 in the XMLUI
<ul>
<li>We looked on AReS and all the items are still there</li>
<li>I looked in the DSpace log and see around 2,000 messages like this:</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>2023-01-25 07:14:59,529 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: c9fac1f2-6b2b-4941-8077-40b7b5c936b6 message:missing required field: epersonID
</span></span><span style="display:flex;"><span>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: missing required field: epersonID
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
</span></span><span style="display:flex;"><span> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
</span></span><span style="display:flex;"><span> at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
</span></span><span style="display:flex;"><span> at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
</span></span><span style="display:flex;"><span> at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
</span></span><span style="display:flex;"><span> at org.dspace.core.Context.dispatchEvents(Context.java:455)
</span></span><span style="display:flex;"><span> at org.dspace.core.Context.commit(Context.java:424)
</span></span><span style="display:flex;"><span> at org.dspace.core.Context.complete(Context.java:380)
</span></span><span style="display:flex;"><span> at org.dspace.app.bulkedit.MetadataImport.main(MetadataImport.java:1399)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><span style="display:flex;"><span> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
</span></span><span style="display:flex;"><span> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><span style="display:flex;"><span> at java.lang.reflect.Method.invoke(Method.java:498)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
</span></span><span style="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
</span></span></code></pre></div><ul>
<li>I filed a ticket with Atmire to ask them about it</li>
<li>For now I just did a light Discovery reindex (not the full one) and all the items appeared again</li>
<li>Submit an issue to MEL GitHub regarding the capitalization of CRPs: <a href="https://github.com/CodeObia/MEL/issues/11133">https://github.com/CodeObia/MEL/issues/11133</a>
<ul>
<li>I talked to Salem and he said that this is a legacy thing from when CGSpace was using ALL CAPS for most of its metadata. I provided him with <a href="https://ilri.github.io/cgspace-submission-guidelines/cg-contributor-crp/cg-contributor-crp.txt">our current controlled vocabulary for CRPs</a> and he will update it in MEL.</li>
<li>On that note, Peter and Abenet and I realized that we still have an old field <code>cg.subject.crp</code> with about 450 values in it, but it has not been used for a few years (they are using the old ALL CAPS CRPs)</li>
<li>I exported this list of values to lowercase them and move them to <code>cg.contributor.crp</code></li>
<li>Even if some items end up with multiple CRPs, they will get de-duplicated when I remove duplicate values soon</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/fix-metadata-values.py -i /tmp/2023-01-25-fix-crp-subjects.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span> -f cg.subject.crp -t correct
</span></span><span style="display:flex;"><span>$ ./ilri/move-metadata-values.py -i /tmp/2023-01-25-move-crp-subjects.csv -db dspace -u dspace -p <span style="color:#e6db74">'fuuu'</span> -f cg.subject.crp -t cg.contributor.crp
</span></span></code></pre></div><ul>
<li>After fixing and moving them all, I deleted the <code>cg.subject.crp</code> field from the metadata registry</li>
<li>I realized a smarter way to update the text lang attributes of metadata would be to restrict the query to items that are in the archive and not withdrawn:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">UPDATE</span> metadatavalue <span style="color:#66d9ef">SET</span> text_lang<span style="color:#f92672">=</span><span style="color:#e6db74">'en_US'</span> <span style="color:#66d9ef">WHERE</span> dspace_object_id <span style="color:#66d9ef">IN</span> (<span style="color:#66d9ef">SELECT</span> uuid <span style="color:#66d9ef">FROM</span> item <span style="color:#66d9ef">WHERE</span> in_archive <span style="color:#66d9ef">AND</span> <span style="color:#66d9ef">NOT</span> withdrawn) <span style="color:#66d9ef">AND</span> (text_lang <span style="color:#66d9ef">IS</span> <span style="color:#66d9ef">NULL</span> <span style="color:#66d9ef">OR</span> text_lang <span style="color:#66d9ef">IN</span> (<span style="color:#e6db74">'en'</span>, <span style="color:#e6db74">''</span>));
</span></span></code></pre></div><ul>
<li>
<p>I tried that in a transaction and it hung, so I canceled it and rolled back</p>
</li>
<li>
<p>I see some PostgreSQL locks attributed to <code>dspaceApi</code> that were started at <code>2023-01-25 13:40:04.529087+01</code> and haven’t changed since then (that’s eight hours ago)</p>
<ul>
<li>I killed the pid…</li>
<li>I also saw some locks owned by <code>dspaceWeb</code> that were nine and four hours old, so I killed those too…</li>
<li>Now Maria was able to archive one submission of hers that was hanging all afternoon, but I still can’t run the update on the text langs…</li>
</ul>
</li>
<li>
<p>Export entire CGSpace to do Initiative mappings again</p>
</li>
<li>
<p>Started a harvest on AReS</p>
</li>
</ul>
<h2 id="2023-01-26">2023-01-26</h2>
<ul>
<li>Export entire CGSpace to do some metadata cleanup on various fields
<ul>
<li>I also added “CGIAR Trust Fund” to all items in the Initiatives community</li>
</ul>
</li>
</ul>
<h2 id="2023-01-27">2023-01-27</h2>
<ul>
<li>Export a list of affiliations in the Initiatives community for Peter, trying a new method to avoid exporting <em>everything</em> from PostgreSQL:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ dspace metadata-export -i 10568/115087 -f /tmp/2023-01-27-initiatives.csv
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'cg.contributor.affiliation[en_US]'</span> /tmp/2023-01-27-initiatives.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | sed -e 1d -e 's/^"//' -e 's/"$//' -e 's/||/\n/g' -e '/^$/d' \
</span></span><span style="display:flex;"><span> | sort | uniq -c | sort -h \
</span></span><span style="display:flex;"><span> | awk 'BEGIN { FS = "^[[:space:]]+[[:digit:]]+[[:space:]]+" } {print $2}' \
</span></span><span style="display:flex;"><span> | sed -e '1i cg.contributor.affiliation' -e 's/^\(.*\)$/"\1"/' \
</span></span><span style="display:flex;"><span> > /tmp/2023-01-27-initiatives-affiliations.csv
</span></span></code></pre></div><ul>
<li>The first sed command removes the header, strips the quotes, splits multiple values on “||”, and deletes empty lines</li>
<li>The awk command sets the field separator to a regex matching the leading whitespace and count, so we can get the second “field” of the sorted <code>uniq -c</code> output, ie:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span> 309 International Center for Agricultural Research in the Dry Areas
</span></span><span style="display:flex;"><span> 412 International Livestock Research Institute
</span></span></code></pre></div><ul>
<li>The second sed command adds the CSV header and quotes back</li>
<li>I did the same for authors and donors and sent them to Peter to make corrections</li>
</ul>
<h2 id="2023-01-28">2023-01-28</h2>
<ul>
<li>Daniel from the Alliance said they are getting an HTTP 401 when trying to submit items to CGSpace via the REST API</li>
</ul>
<h2 id="2023-01-29">2023-01-29</h2>
<ul>
<li>Export the entire CGSpace to do Initiatives collection mappings</li>
<li>I was thinking about a way to use Crossref’s API to enrich our data, for example checking registered DOIs for license information, publishers, etc
<ul>
<li>Turns out I had already written <code>crossref-doi-lookup.py</code> last year, and it works</li>
<li>I exported a list of all DOIs without licenses from CGSpace, minus the CIFOR ones because I know they aren’t registered on Crossref, which is about 11,800 DOIs</li>
</ul>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'cg.identifier.doi[en_US]'</span> ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | csvgrep -c 'cg.identifier.doi[en_US]' -r '.*cifor.*' -i \
</span></span><span style="display:flex;"><span> | sed 1d > /tmp/2023-01-29-dois.txt
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2023-01-29-dois.txt
</span></span><span style="display:flex;"><span>11819 /tmp/2023-01-29-dois.txt
</span></span><span style="display:flex;"><span>$ ./ilri/crossref-doi-lookup.py -e a.orth@cgiar.org -i /tmp/2023-01-29-dois.txt -o /tmp/crossref-results.csv
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#e6db74">'id,cg.identifier.doi[en_US]'</span> ~/Downloads/2023-01-29-CGSpace-DOIs-without-licenses.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | sed -e 's_https://doi.org/__g' -e 's_https://dx.doi.org/__g' -e 's/cg.identifier.doi\[en_US\]/doi/' \
</span></span><span style="display:flex;"><span> > /tmp/cgspace-temp.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c doi /tmp/cgspace-temp.csv /tmp/crossref-results.csv <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span> | csvgrep -c license -r 'creative' \
</span></span><span style="display:flex;"><span> | csvcut -c id,license \
</span></span><span style="display:flex;"><span> | sed '1s/license/dcterms.license[en_US]/' > /tmp/2023-01-29-new-licenses.csv
</span></span></code></pre></div><ul>
<li>The above was done with just 5,000 DOIs because it was taking a long time, but after the last step I imported into OpenRefine to clean up the license URLs
<ul>
<li>Then I imported 635 new licenses to CGSpace woooo</li>
<li>After checking the remaining 6,500 DOIs there were another 852 new licenses, woooo</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/categories/notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
@ -10,7 +10,7 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -10,7 +10,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -10,7 +10,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -10,7 +10,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -10,7 +10,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -10,7 +10,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -10,7 +10,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -10,7 +10,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -10,7 +10,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -10,7 +10,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -10,7 +10,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -10,7 +10,7 @@
|
||||
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
|
||||
<meta property="og:type" content="website" />
|
||||
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
|
||||
<meta property="og:updated_time" content="2023-01-17T22:38:55+03:00" />
|
||||
<meta property="og:updated_time" content="2023-01-22T21:53:45+03:00" />
|
||||
|
||||
|
||||
|
||||
|
@ -3,19 +3,19 @@
|
||||
xmlns:xhtml="http://www.w3.org/1999/xhtml">
|
||||
<url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/categories/</loc>
|
||||
<lastmod>2023-01-17T22:38:55+03:00</lastmod>
|
||||
<lastmod>2023-01-22T21:53:45+03:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/</loc>
|
||||
<lastmod>2023-01-17T22:38:55+03:00</lastmod>
|
||||
<lastmod>2023-01-22T21:53:45+03:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/2023-01/</loc>
|
||||
<lastmod>2023-01-17T22:38:55+03:00</lastmod>
|
||||
<lastmod>2023-01-22T21:53:45+03:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
|
||||
<lastmod>2023-01-17T22:38:55+03:00</lastmod>
|
||||
<lastmod>2023-01-22T21:53:45+03:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
|
||||
<lastmod>2023-01-17T22:38:55+03:00</lastmod>
|
||||
<lastmod>2023-01-22T21:53:45+03:00</lastmod>
|
||||
</url><url>
|
||||
<loc>https://alanorth.github.io/cgspace-notes/2022-12/</loc>
|
||||
<lastmod>2023-01-01T10:12:13+02:00</lastmod>
|
||||
|