<metaproperty="og:description"content="2023-07-01 Export CGSpace to check for missing Initiative collection mappings Start harvesting on AReS 2023-07-02 Minor edits to the crossref_doi_lookup.py script while running some checks on 22,000 CGSpace DOIs 2023-07-03 I analyzed the licenses declared by Crossref and found with high confidence that ~400 of ours were incorrect I took the more accurate ones from Crossref and updated the items on CGSpace I took a few hundred ISBNs as well for where we were missing them I also tagged ~4,700 items with missing licenses as “Copyrighted; all rights reserved” based on their Crossref license status being TDM, mostly from Elsevier, Wiley, and Springer Checking a dozen or so manually, I confirmed that if Crossref only has a TDM license then it’s usually copyrighted (could still be open access, but we can’t tell via Crossref) I would be curious to write a script to check the Unpaywall API for open access status… In the past I found that their license status was not very accurate, but the open access status might be more reliable More minor work on the DSpace 7 item views I learned some new Angular template syntax I created a custom component to show Creative Commons licenses on the simple item page I also decided that I don’t like the Impact Area icons as a component because they don’t have any visual meaning 2023-07-04 Focus group meeting with CGSpace partners about DSpace 7 I added a themed file selection component to the CGSpace theme It displays the bitstream description instead of the file name, just like we did in DSpace 6 XMLUI I added a custom component to show share icons 2023-07-05 I spent some time trying to update OpenRXV from Angular 9 to 10 to 11 to 12 to 13 Most things work but there are some minor bugs it seems Mishell from CIP emailed me to say she was having problems approving an item on CGSpace Looking at PostgreSQL I saw there were a dozen or so locks that were several hours and even over one day old so I killed 
those processes and told her to try again 2023-07-06 Types meeting I wrote a Python script to check Unpaywall for some information about DOIs 2023-07-07 Continue exploring Unpaywall data for some of our DOIs In the past I’ve found their licensing information to not be very reliable (preferring Crossref), but I think their open access status is more reliable, especially when the provider is listed as being the publisher Even so, sometimes the version can be “acceptedVersion”, which is presumably the author’s version, as opposed to the “publishedVersion”, which means it’s available as open access on the publisher’s website I did some quality assurance and found ~100 that were marked as Limited Access, but should have been Open Access, and fixed a handful of licenses Delete duplicate metadata as described in my DSpace issue from last year: https://github."/>
<h2id="2023-07-01">2023-07-01</h2>
<ul>
<li>Export CGSpace to check for missing Initiative collection mappings</li>
<li>Start harvesting on AReS</li>
</ul>
<h2id="2023-07-02">2023-07-02</h2>
<ul>
<li>Minor edits to the <code>crossref_doi_lookup.py</code> script while running some checks on 22,000 CGSpace DOIs</li>
</ul>
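<ul>
<li>For readers unfamiliar with this kind of lookup, here is a minimal sketch of inspecting the licenses Crossref declares for a DOI. It is not the <code>crossref_doi_lookup.py</code> script itself. It assumes a Crossref work record (the <code>message</code> object of a works response) with a <code>license</code> array whose entries carry <code>URL</code> and <code>content-version</code> keys; the function name and the shape of the returned dictionary are hypothetical.</li>
</ul>

```python
from collections import defaultdict


def summarize_licenses(work: dict) -> dict:
    """Group the license URLs of one Crossref work record by content version.

    `work` is assumed to be the "message" object of a Crossref works response.
    """
    by_version = defaultdict(list)
    for entry in work.get("license", []):
        version = entry.get("content-version", "unspecified")
        by_version[version].append(entry.get("URL", ""))

    # Creative Commons URLs that apply to something other than text and data mining
    creative_commons = [
        url
        for version, urls in by_version.items()
        if version != "tdm"
        for url in urls
        if "creativecommons.org" in url
    ]

    return {
        "versions": sorted(by_version),
        "creative_commons": creative_commons,
        # True when the only licenses declared are for text and data mining
        "only_tdm": bool(by_version) and set(by_version) == {"tdm"},
    }
```

<ul>
<li>A record where <code>only_tdm</code> is true says nothing about open access either way, so it is only a hint that needs checking against another source.</li>
</ul>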
<h2id="2023-07-03">2023-07-03</h2>
<ul>
<li>I analyzed the licenses declared by Crossref and found with high confidence that ~400 of ours were incorrect
<ul>
<li>I took the more accurate ones from Crossref and updated the items on CGSpace</li>
<li>I took a few hundred ISBNs as well for where we were missing them</li>
<li>I also tagged ~4,700 items with missing licenses as “Copyrighted; all rights reserved” based on their Crossref license status being TDM, mostly from Elsevier, Wiley, and Springer</li>
<li>Checking a dozen or so manually, I confirmed that if Crossref only has a TDM license then it’s usually copyrighted (could still be open access, but we can’t tell via Crossref)</li>
<li>I would be curious to write a script to check the Unpaywall API for open access status…</li>
<li>In the past I found that their <em>license</em> status was not very accurate, but the open access status might be more reliable</li>
</ul>
</li>
<li>More minor work on the DSpace 7 item views
<ul>
<li>I learned some new Angular template syntax</li>
<li>I created a custom component to show Creative Commons licenses on the simple item page</li>
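<li>I also decided that I don’t like the Impact Area icons as a component because they don’t have any visual meaning</li>
</ul>
</li>
</ul>
<h2id="2023-07-04">2023-07-04</h2>
<ul>
<li>Focus group meeting with CGSpace partners about DSpace 7</li>
<li>I added a themed file selection component to the CGSpace theme
<ul>
<li>It displays the bitstream description instead of the file name, just like we did in DSpace 6 XMLUI</li>
</ul>
</li>
<li>I added a custom component to show share icons</li>
</ul>
<h2id="2023-07-05">2023-07-05</h2>
<ul>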
<li>I spent some time trying to update OpenRXV from Angular 9 to 10 to 11 to 12 to 13
<ul>
<li>Most things work but there are some minor bugs it seems</li>
</ul>
</li>
<li>Mishell from CIP emailed me to say she was having problems approving an item on CGSpace
<ul>
<li>Looking at PostgreSQL, I saw there were a dozen or so locks that were several hours old, and some even over a day old, so I killed those processes and told her to try again</li>
</ul>
</li>
</ul>
<h2id="2023-07-06">2023-07-06</h2>
<ul>
<li>Types meeting</li>
<li>I wrote a Python script to check Unpaywall for some information about DOIs</li>
</ul>
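<ul>
<li>A check of this kind could look roughly like the following. This is a minimal sketch, not the script mentioned above. It assumes the public Unpaywall v2 REST endpoint (<code>https://api.unpaywall.org/v2/</code>, which expects an <code>email</code> parameter) and its <code>is_oa</code> and <code>best_oa_location</code> response fields; the function names and the <code>published_oa</code> flag are hypothetical.</li>
</ul>

```python
import json
import urllib.parse
import urllib.request

UNPAYWALL_API = "https://api.unpaywall.org/v2/"


def fetch_unpaywall(doi: str, email: str) -> dict:
    """Fetch the Unpaywall record for one DOI (performs a network request)."""
    url = f"{UNPAYWALL_API}{urllib.parse.quote(doi)}?email={urllib.parse.quote(email)}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def summarize_access(record: dict) -> dict:
    """Reduce an Unpaywall record to the fields relevant for an access check."""
    # best_oa_location is null for closed items, so fall back to an empty dict
    location = record.get("best_oa_location") or {}
    is_oa = bool(record.get("is_oa"))
    return {
        "doi": record.get("doi"),
        "is_oa": is_oa,
        "host_type": location.get("host_type"),
        "version": location.get("version"),
        "license": location.get("license"),
        # Hypothetical flag: open access, hosted by the publisher, final version
        "published_oa": is_oa
        and location.get("host_type") == "publisher"
        and location.get("version") == "publishedVersion",
    }
```

<ul>
<li>Keeping the network call separate from the summary makes it possible to run the summary over saved responses without querying the API again.</li>
</ul>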
<h2id="2023-07-7">2023-07-07</h2>
<ul>
<li>Continue exploring Unpaywall data for some of our DOIs
<ul>
<li>In the past I’ve found their <em>licensing</em> information to not be very reliable (preferring Crossref), but I think their <em>open access status</em> is more reliable, especially when the provider is listed as being the publisher</li>
<li>Even so, sometimes the version can be “acceptedVersion”, which is presumably the author’s version, as opposed to the “publishedVersion”, which means it’s available as open access on the publisher’s website</li>
<li>I did some quality assurance and found ~100 that were marked as Limited Access, but should have been Open Access, and fixed a handful of licenses</li>
</ul>
</li>
<li>Delete duplicate metadata as described in my DSpace issue from last year: <ahref="https://github.com/DSpace/DSpace/issues/8253">https://github.com/DSpace/DSpace/issues/8253</a></li>
<li>Start working on some statistics on AGROVOC usage for my presentation next week
<ul>
<li>I used the following SQL query to dump values from all subject fields and lower case them:</li>
</ul>
</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>localhost/dspacetest= ☘ \COPY (SELECT DISTINCT(lower(text_value)) AS "subject" FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (187, 120, 210, 122, 215, 127, 208, 124, 128, 123, 125, 135, 203, 236, 238, 119)) to /tmp/2023-07-07-cgspace-subjects.csv WITH CSV HEADER;
</span></span></code></pre></div><ul>
<li>I did a lot more work on OpenRXV to test and update dependencies</li>
<li>I deployed the latest version on the production server</li>
</ul>
<h2id="2023-07-12">2023-07-12</h2>
<ul>
<li>CGSpace upgrade meeting with Americas and Africa group</li>
</ul>
<h2id="2023-07-13">2023-07-13</h2>
<ul>
<li>Michael Victor asked me to help Aditi extract some information from CGSpace
<ul>
<li>She was interested in journal articles published between 2018 and 2023 with a range of subjects related to drought, flooding, resilience, etc</li>
<li>I used an advanced query with some AGROVOC terms:</li>
</ul>
</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration" OR dcterms.subject:livestock)
</span></span></code></pre></div><ul>
<li>Interestingly, some variations of this same exact query produce no search results, and I see this error in the DSpace log:</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dcterms.issued:[2018 TO 2023] AND dcterms.type:"Journal Article" AND (dcterms.subject:flooding OR dcterms.subject:flood OR dcterms.subject:"extreme weather events" OR dcterms.subject:drought OR dcterms.subject:"drought resistance" OR dcterms.subject:"drought tolerance" OR dcterms.subject:"soil salinity" OR dcterms.subject:"pests of plants" OR dcterms.subject:pests OR dcterms.subject:heat OR dcterms.subject:fertilizers OR dcterms.subject:"fertilizer technology" OR dcterms.subject:"rice fields" OR dcterms.subject:livestock OR dcterms.subject:"landscape conservation" OR dcterms.subject:"landscape restoration\"\)': Lexical error at line 1, column 617. Encountered: <EOF> after : "\"landscape restoration\\\"\\)"
</span></span></code></pre></div><ul>
<li>It seems to happen when there is a quoted search term at the end of the parenthesized group
<ul>
<li>For what it’s worth this same query worked fine on DSpace 7.6</li>
</ul>
</li>
</ul>
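<ul>
<li>One way to sidestep the error while building such queries is to make sure the parenthesized group never ends with a quoted phrase. The following is a minimal sketch under that assumption, which rests only on the two queries shown above; the function names are hypothetical, and it cannot help if every term is a multi-word phrase.</li>
</ul>

```python
def quote_term(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_subject_query(subjects, start_year, end_year, item_type):
    """Build an advanced search query with all quoted phrases ahead of single words."""
    # sorted() is stable: phrases (key False) come first, single words (key True) last
    ordered = sorted(subjects, key=lambda term: " " not in term)
    clause = " OR ".join(f"dcterms.subject:{quote_term(term)}" for term in ordered)
    return (
        f"dcterms.issued:[{start_year} TO {end_year}] "
        f'AND dcterms.type:"{item_type}" AND ({clause})'
    )


if __name__ == "__main__":
    print(build_subject_query(["flood", "landscape restoration", "drought"], 2018, 2023, "Journal Article"))
```

<ul>
<li>Terms that themselves contain double quotes would need escaping, which this sketch does not attempt.</li>
</ul>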
<h2id="2023-07-15">2023-07-15</h2>
<ul>
<li>Export CGSpace to fix missing Initiative collection mappings</li>
<li>Start a harvest on AReS</li>
</ul>
<h2id="2023-07-17">2023-07-17</h2>
<ul>
<li>Rasika had sent me a list of new ORCID identifiers for new IWMI staff so I combined them with our existing list and ran <code>resolve_orcids.py</code> to refresh the names in our database
<ul>
<li>I updated the list, updated names in the database, and tagged new authors with missing identifiers in existing items</li>
</ul>
</li>
</ul>
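<ul>
<li>Combining two identifier lists before resolving names could be sketched as follows. This is not the <code>resolve_orcids.py</code> script; it only merges and validates identifiers. It assumes each input line contains at most one ORCID identifier, possibly next to a name or inside a URL, and it uses the ISO 7064 MOD 11-2 check digit that ORCID identifiers carry. The function names are hypothetical.</li>
</ul>

```python
import re

ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def has_valid_checksum(orcid: str) -> bool:
    """Verify the final character of an ORCID identifier (ISO 7064 MOD 11-2)."""
    digits = orcid.replace("-", "")
    total = 0
    for character in digits[:-1]:
        total = (total + int(character)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[-1] == expected


def combine_orcids(existing, new):
    """Merge two lists of lines, keeping each valid identifier once, sorted."""
    combined = set()
    for line in [*existing, *new]:
        match = ORCID_PATTERN.search(line)
        if match and has_valid_checksum(match.group()):
            combined.add(match.group())
    return sorted(combined)
```

<ul>
<li>Rejecting identifiers with a bad check digit up front avoids spending lookups on values that were mistyped in the incoming list.</li>
</ul>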
<h2id="2023-07-18">2023-07-18</h2>
<ul>
<li>Meeting with IWMI, IRRI, and IITA colleagues about CGSpace upgrade plans</li>
<li>Maria from the Alliance mentioned having some submissions stuck on CGSpace
<ul>
<li>I looked and found a number of locks stuck for nineteen, eighteen, and more hours…</li>
<li>I killed them and told her to try again</li>
</ul>
</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>$ psql < locks-age.sql | less -S
</span></span></code></pre></div><ul>
<li>I had to kill a bunch more locked processes in PostgreSQL; I’m not sure what’s going on</li>
<li>After some discussion about an advanced search bug with Tim on Slack, I filed <ahref="https://github.com/DSpace/DSpace/issues/8962">an issue on GitHub</a></li>
</ul>
<h2id="2023-07-20">2023-07-20</h2>
<ul>
<li>I added a new metadata field for CGIAR Impact Platforms (<code>cg.subject.impactPlatform</code>) to CGSpace</li>
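</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>Exception: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'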
</span></span><spanstyle="display:flex;"><span>org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=1a92472e-e39d-4602-9b4d-da022df8f233] unknown field 'containerCommunity'
</span></span><spanstyle="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:681)
</span></span><spanstyle="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
</span></span><spanstyle="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
</span></span><spanstyle="display:flex;"><span> at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
</span></span><spanstyle="display:flex;"><span> at org.dspace.util.SolrImportExport.importIndex(SolrImportExport.java:465)
</span></span><spanstyle="display:flex;"><span> at org.dspace.util.SolrImportExport.main(SolrImportExport.java:148)
</span></span><spanstyle="display:flex;"><span> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
</span></span><spanstyle="display:flex;"><span> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
</span></span><spanstyle="display:flex;"><span> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
</span></span><spanstyle="display:flex;"><span> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
</span></span><spanstyle="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:277)
</span></span><spanstyle="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:133)
</span></span><spanstyle="display:flex;"><span> at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:98)
</span></span></code></pre></div><ul>
<li>I will try using solr-import-export-json, which I’ve used in the past to skip Atmire custom fields in Solr:</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>$ chrt -b <spanstyle="color:#ae81ff">0</span> ./run.sh -s http://localhost:8081/solr/statistics -a export -o /tmp/statistics-2022.json -f <spanstyle="color:#e6db74">'time:[2022-01-01T00\:00\:00Z TO 2022-12-31T23\:59\:59Z]'</span> -k uid -S author_mtdt,author_mtdt_search,iso_mtdt_search,iso_mtdt,subject_mtdt,subject_mtdt_search,containerCollection,containerCommunity,containerItem,countryCode_ngram,countryCode_search,cua_version,dateYear,dateYearMonth,geoipcountrycode,geoIpCountryCode,ip_ngram,ip_search,isArchived,isInternal,isWithdrawn,containerBitstream,file_id,referrer_ngram,referrer_search,userAgent_ngram,userAgent_search,version_id,complete_query,complete_query_search,filterquery,ngram_query_search,ngram_simplequery_search,simple_query,simple_query_search,range,rangeDescription,rangeDescription_ngram,rangeDescription_search,range_ngram,range_search,actingGroupId,actorMemberGroupId,bitstreamCount,solr_update_time_stamp,bitstreamId,core_update_run_nb
</span></span></code></pre></div><ul>
<li>Some users complained that CGSpace was slow and I found a handful of locks that were hours and days old…
<ul>
<li>I killed those and told them to try again</li>
</ul>
</li>
<li>After importing the Solr statistics into DSpace 7 I realized that my DSpace Statistics API will work fine
<ul>
<li>I made some minor modifications to the Ansible infrastructure scripts to make sure it is enabled and then activated it on DSpace 7 Test</li>
</ul>
</li>
</ul>
<ul>
<li>I don’t know why these can get blocked for hours without resolution, but for now I just killed them
<ul>
<li>For what it’s worth <ahref="https://github.com/DSpace/DSpace/commit/16ae96b4c3d833c2a4acd1f05985d424c3a52bd7">these sequences were removed in DSpace 7.0</a> along with the “traditional” item workflow—maybe that means we won’t have such contention issues in DSpace 7!</li>
</ul>
</li>
<li>I wrote a slightly longer regex to match locks that have been stuck for more than 1 hour based on the output of the <code>locks-age.sql</code> script and killed them:</li>
<li>I filed <ahref="https://github.com/DSpace/dspace-angular/issues/2400">an issue for missing Altmetric badges on DSpace 7 Angular</a></li>
</ul>
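<ul>
<li>Picking out long-running locks from the query output can also be done with a short script instead of a regex on the command line. This is a minimal sketch, not the command used above. It assumes the <code>locks-age.sql</code> output is psql’s default aligned table with the process ID in the first column and the age, as a PostgreSQL interval, in the last column; that column layout and the function names are hypothetical.</li>
</ul>

```python
import re
from datetime import timedelta

# Matches PostgreSQL interval text such as "19:02:11.5" or "1 day 02:00:00"
AGE_PATTERN = re.compile(
    r"(?:(?P<days>\d+) days? )?(?P<hours>\d+):(?P<minutes>\d{2}):(?P<seconds>\d{2})(?:\.\d+)?"
)


def parse_age(text: str) -> timedelta:
    """Convert an interval string from psql output into a timedelta."""
    match = AGE_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized interval: {text!r}")
    return timedelta(
        days=int(match["days"] or 0),
        hours=int(match["hours"]),
        minutes=int(match["minutes"]),
        seconds=int(match["seconds"]),
    )


def stuck_pids(psql_output: str, threshold: timedelta = timedelta(hours=1)) -> list[int]:
    """Return the sorted process IDs whose age exceeds the threshold."""
    pids = set()
    for line in psql_output.splitlines():
        columns = [column.strip() for column in line.split("|")]
        # Skip the header, the separator line, and the "(N rows)" footer
        if len(columns) < 2 or not columns[0].isdigit():
            continue
        try:
            age = parse_age(columns[-1])
        except ValueError:
            continue
        if age > threshold:
            pids.add(int(columns[0]))
    return sorted(pids)
```

<ul>
<li>The resulting process IDs could then be passed to PostgreSQL’s <code>pg_terminate_backend()</code> function after a manual look at what each one is doing.</li>
</ul>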
<h2id="2023-07-27">2023-07-27</h2>
<ul>
<li>Export CGSpace to check countries, regions, types, and Initiatives
<ul>
<li>There were a few minor issues in countries and regions, and I noticed 186 items without types!</li>
<li>Then I ran the file through csv-metadata-quality to make sure items with countries have appropriate regions</li>
</ul>
</li>
<li>Brief discussion about OpenRXV bugs and fixes with Moayad</li>
<li>I was toying with the idea of using an expanded whitespace check/fix based on <ahref="https://eslint.org/docs/latest/rules/no-irregular-whitespace">ESLint’s no-irregular-whitespace</a> rule in csv-metadata-quality
<ul>
<li>I found 176 items in CGSpace with such whitespace in their titles alone</li>
<li>I compared the results of removing these characters and replacing them with a space</li>
<li>In <em>most</em> cases removing it is the correct thing to do, for example “Pesticides: une arme à double tranchant” → “Pesticides: une arme à double tranchant”</li>
<li>But in some items it is tricky, for example “L’environnement juridique est-il propice àla gestion” → “L’environnement juridique est-il propice àla gestion”</li>
<li>I guess it would really need some good heuristics or a human to verify…</li>
</ul>
</li>
<li>I upgraded OpenRXV to Angular v14</li>
</ul>
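<ul>
<li>The remove-versus-replace comparison described above could be sketched like this. It is not code from csv-metadata-quality. The character class is an assumed approximation of the list used by ESLint’s <code>no-irregular-whitespace</code> rule, and the function names are hypothetical. A value is flagged for review when removing the character and replacing it with a space give different results.</li>
</ul>

```python
import re

# Assumed approximation of the characters covered by ESLint's no-irregular-whitespace
IRREGULAR_WHITESPACE = re.compile(
    r"[\u000b\u000c\u0085\u00a0\u1680\u180e\u2000-\u200b\u2028\u2029\u202f\u205f\u3000\ufeff]"
)


def collapse_spaces(text: str) -> str:
    """Squeeze runs of regular spaces and trim the ends."""
    return re.sub(r" {2,}", " ", text).strip()


def candidate_fixes(value: str) -> tuple[str, str]:
    """Return the value with irregular whitespace removed, and replaced by a space."""
    removed = collapse_spaces(IRREGULAR_WHITESPACE.sub("", value))
    replaced = collapse_spaces(IRREGULAR_WHITESPACE.sub(" ", value))
    return removed, replaced


def needs_review(value: str) -> bool:
    """True when the two candidate fixes disagree, so a human should decide."""
    removed, replaced = candidate_fixes(value)
    return removed != replaced
```

<ul>
<li>When the irregular character sits next to a regular space, both fixes agree and the change is safe to automate; when it is the only separator between two words, they differ and the value needs a closer look.</li>
</ul>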
<h2id="2023-07-28">2023-07-28</h2>
<ul>
<li>After a bit more testing I merged the <ahref="https://github.com/ilri/OpenRXV/pull/184">Angular v14 changes to OpenRXV master</a></li>
<li>I am getting an error trying to import the 2020 Solr statistics from CGSpace to DSpace 7:</li>
</ul>
<divclass="highlight"><pretabindex="0"style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><codeclass="language-console"data-lang="console"><spanstyle="display:flex;"><span>Exception in thread "main" org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/statistics: ERROR: [doc=0008a7c1-e552-4a4e-93e4-4d23bf39964b] Error adding field 'workflowItemId'='0812be47-1bfe-45e2-9208-5bf10ee46f81' msg=For input string: "0812be47-1bfe-45e2-9208-5bf10ee46f81"
</span></span><spanstyle="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:745)
</span></span><spanstyle="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:259)
</span></span><spanstyle="display:flex;"><span> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
</span></span><spanstyle="display:flex;"><span> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:234)
</span></span><spanstyle="display:flex;"><span> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:102)
</span></span><spanstyle="display:flex;"><span> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:69)
</span></span><spanstyle="display:flex;"><span> at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:82)
</span></span><spanstyle="display:flex;"><span> at it.damore.solr.importexport.App.insertBatch(App.java:295)
</span></span><spanstyle="display:flex;"><span> at it.damore.solr.importexport.App.lambda$writeAllDocuments$10(App.java:276)
</span></span><spanstyle="display:flex;"><span> at it.damore.solr.importexport.BatchCollector.lambda$accumulator$0(BatchCollector.java:71)
</span></span><spanstyle="display:flex;"><span> at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
</span></span><spanstyle="display:flex;"><span> at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
</span></span><spanstyle="display:flex;"><span> at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
</span></span><spanstyle="display:flex;"><span> at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
</span></span><spanstyle="display:flex;"><span> at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
</span></span><spanstyle="display:flex;"><span> at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
</span></span><spanstyle="display:flex;"><span> at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
</span></span><spanstyle="display:flex;"><span> at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
</span></span><spanstyle="display:flex;"><span> at it.damore.solr.importexport.App.writeAllDocuments(App.java:252)
</span></span><spanstyle="display:flex;"><span> at it.damore.solr.importexport.App.main(App.java:150)
</span></span></code></pre></div><ul>
<li>Ahhhh, in DSpace 6 this field was a string in the Solr statistics schema, but in DSpace 7 it is an integer…?
<ul>
<li>Oh, it seems to be an Atmire change in our DSpace 6… hmmm, so we need to ignore the <code>workflowItemId</code> field when exporting