- Then extract the ID, DOI, journal, volume, issue, publisher, etc from the CGSpace dump and rename the `cg.identifier.doi[en_US]` to `doi` so we can join on it with the Crossref results file:
$ csvjoin -c doi /tmp/2023-02-01-cgspace-doi-metadata.csv ~/Downloads/2023-02-01-crossref-results.csv > /tmp/2023-02-01-cgspace-crossref-check.csv
```
- And import into OpenRefine for analysis and cleaning
- I just noticed that Crossref also has types, so we could use that in the future too!
- I got a few corrections after examining manually, but I didn't manage to identify any patterns that I could use to do any automatic matching or cleaning
## 2023-02-05
- Normalize text lang attributes in PostgreSQL, run a quick Discovery index, and then export CGSpace to check Initiative mappings and countries/regions
- Run all system updates on CGSpace (linode18) and reboot it
## 2023-02-06
- Peter said that a new Initiative was approved last month so we need to add it to CGSpace: `Fragility, Conflict, and Migration`
- There is lots of discussion about the "issue date" versus "available date" with Enrico and IFPRI, after lots of feedback from the PRMS QA
- I filed [an issue on CG Core to propose using `dcterms.available` as an optional field to indicate the online date](https://github.com/AgriculturalSemantics/cg-core/issues/43)
## 2023-02-07
- IFPRI's web developer Tony managed to get his Drupal harvester to have a useful user agent:
- He also noticed that there is no pagination on POST requests to `/rest/items/find-by-metadata-field`, and that he needs to increase his timeout for requests that return 100+ results, ie:
- I need to ask on the DSpace Slack about this POST pagination
- Abenet and Udana noticed that the Handle server was not running
- Looking in the `error.log` file I see that the service is complaining about a lock file being present
- This is because Linode had to do emergency maintenance on the VM host this morning and the Handle server didn't shut down properly
- I'm having an issue with `poetry update` so I spent some time debugging and filed [an issue](https://github.com/python-poetry/poetry/issues/7482)
- Proof and import nine items for the Digital Innovation Inititive for IFPRI
- There were only some minor issues in the metadata
- I also did a duplicate check with `check-duplicates.py` just in case
- I did some minor updates on csv-metadata-quality
- First, to reduce warnings on non-SPDX licenses like "Copyrighted; all rights reserved" and "Other" since they are very common for us and I'm sick of seeing the warnings
- Second, to skip whitespace and newline fixes on the abstract field since so many times they are intended
## 2023-02-08
- Make some edits to IFPRI records requested by Jawoo and Leigh
- Help Alessandra upload a last minute report for SAPLING
- Proof and upload twenty-seven IFPRI records to CGSpace
- It's a good thing I did a duplicate check because I found three duplicates!
- Export CGSpace to update Initiative mappings and country/region mappings
- Then to handle formats like "2022-April-26" and "2021-Nov-11" I used some replacement GRELs (note the order so we don't replace short patterns in longer strings prematurely):
- Export CGSpace to do some metadata quality checks
- I added CGIAR Trust Fund as a donor to some new Initiative outputs
- I moved some abstracts from the description field
- I moved some version information to the `cg.edition` field
## 2023-02-14
- The PRMS team in Colombia sent some questions about countries on CGSpace
- I had to fix some, that were clearly wrong, but there is also a difference between CGSpace and MEL because we use mostly iso-codes, and MEL uses the UN M.49 list
- Then I re-ran the country code tagger from cgspace-java-helpers, forcing the update on all items in the Initiatives community
- Remove Alliance research levers from `cg.contributor.crp` field after discussing with Daniel and Maria
- This was a mistake on TIP's part, and there is no direct mapping between research levers and CRPs
- I exported CGSpace to check Initiative collection mappings, regions, and licenses
- Peter told me that all CGIAR blog posts for the Initiatives should be CC-BY-4.0, and I see the logo at the bottom in light gray!
- I had previously missed that and removed some licenses for blog posts
- I checked cgiar.org, ifpri.org, icarda.org, iwmi.cgiar.org, irri.org, etc and corrected a handful
- Work on rebasing my local DSpace 7 dev branches on top of the latest 7.5-SNAPSHOT
- It seems the issues I had with the `dspace submission-forms-migrate` tool in [August, 2022]({{< relref "2022-08.md" >}}) were fixed
- I imported a fresh PostgreSQL snapshot from CGSpace and then removed the Atmire migrations and ran the new migrations as I originally noted in [March, 2022]({{< relref "2022-03.md" >}}), and is pointed out in the [DSpace 7 upgrade notes](https://wiki.lyrasis.org/display/DSDOC7x/Upgrading+DSpace)
- Now I get a new error:
```console
localhost/dspace7= ☘ DELETE FROM schema_version WHERE version IN ('5.0.2017.09.25', '6.0.2017.01.30', '6.0.2017.09.25');
localhost/dspace7= ☘ DELETE FROM schema_version WHERE description LIKE '%Atmire%' OR description LIKE '%CUA%' OR description LIKE '%cua%';
localhost/dspace7= \q
$ ./bin/dspace database migrate ignored
...
CREATE INDEX resourcepolicy_action_idx ON resourcepolicy(action_id)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.handleException(DefaultSqlScriptExecutor.java:275)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.executeStatement(DefaultSqlScriptExecutor.java:222)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.execute(DefaultSqlScriptExecutor.java:126)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.executeOnce(SqlMigrationExecutor.java:69)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.lambda$execute$0(SqlMigrationExecutor.java:58)
at org.flywaydb.core.internal.database.DefaultExecutionStrategy.execute(DefaultExecutionStrategy.java:27)
at org.flywaydb.core.internal.resolver.sql.SqlMigrationExecutor.execute(SqlMigrationExecutor.java:57)
at org.flywaydb.core.internal.command.DbMigrate.doMigrateGroup(DbMigrate.java:377)
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2676)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2366)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:356)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:496)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:413)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:333)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:319)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:295)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:290)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:193)
at org.apache.commons.dbcp2.DelegatingStatement.execute(DelegatingStatement.java:193)
at org.flywaydb.core.internal.jdbc.JdbcTemplate.executeStatement(JdbcTemplate.java:201)
at org.flywaydb.core.internal.sqlscript.ParsedSqlStatement.execute(ParsedSqlStatement.java:95)
at org.flywaydb.core.internal.sqlscript.DefaultSqlScriptExecutor.executeStatement(DefaultSqlScriptExecutor.java:210)
... 30 more
```
- I dropped that index and then the migration succeeded:
```console
localhost/dspace7= ☘ DROP INDEX resourcepolicy_action_idx;
localhost/dspace7= ☘ \q
$ ./bin/dspace database migrate ignored
Done.
```
- I think that particular error is because I applied the [indexes in this unmerged DSpace 6 patch](https://github.com/DSpace/DSpace/pull/1792), so I don't need to report this as an error in DSpace 7
- I found a suspicious number of PostgreSQL locks on CGSpace and decided to investigate:
```console
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | grep -o -E '(dspaceWeb|dspaceApi|dspaceCli)' | sort | uniq -c
44 dspaceApi
372 dspaceCli
446 dspaceWeb
```
- This started happening yesterday and I killed a few locks that were several hours old after inspecting the `locks-age.sql` output
- I also checked the `locks.sql` output, which helpfully lists the blocked PID and the blocking PID, to find one blocking PID that was idle in transaction
- I killed that process and then all other locks were instantly processed
- I filed [a GitHub issue](https://github.com/DSpace/dspace-angular/issues/2103) on dspace-angular requesting the item view to use the bitstream description instead of the file name if present
- Weekly CG Core types meeting
- I need to go through the actions and remove those items that are only for CGSpace internal use, ie:
- CD-ROM
- Manuscript-unpublished
- Photo Report
- Questionnaire
- Wiki
- Weekly CGIAR Repository Working Group meeting
- I did some experiments with Crossref dates for about 20,000 DOIs in CGSpace using my `crossref-doi-lookup.py` script
- Some things I noted from reading the [Crossref API docs](https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md) and inspecting the records for a few dozen DOIs manually:
-`["created"]["date-parts"]` → Date on which the DOI was first registered (not useful for us)
-`["published-print"]["date-parts"]` → Date on which the work was published in print
-`["journal-issue"]["published-print"]["date-parts"]` → When present, is 99% the same as the above
-`["published-online"]["date-parts"]` → Date on which the work was published online
-`["journal-issue"]["published-online"]["date-parts"]` → Much more rare, and only 50% the same as the above, so unreliable
-`["issued"]["date-parts"]` → Earliest of published-print and published-online (not useful to us)
- After checking the DOIs manully I decided that when the `published-print` date exists, it is usually more accurate than our issued dates
- I set 12,300 issue dates to those from Crossref
- I also decided that, when `published-online` exists, it is usually accurate when I check the publisher page (we don't have many online dates to compare)
- I set the available date for ~7,000 items to the published-online date as long as:
- There was no `dcterms.available` date already
- It was different than the issued date, because for now I only want online dates that are different, in case this is an online only journal in which case that can be the issue date... maybe I'll re-visit that later
## 2023-02-17
- It seems some (all?) of the changes I applied to dates last night didn't get saved...
- I don't know what happened, so I will run them again after some investigation
- I submitted the first batch of ~7,600 changes and it took twelve hours!
- I almost cancelled it because after applying the changes there was a lock blocking everything for two hours, and it seemed to be stuck, but I kept checking it and saw that the `query_start` and `state_change` were being updated despite it being state "idle in transaction":
```console
$ psql -c 'SELECT * FROM pg_stat_activity WHERE pid=1025176' | less -S
```
- I will apply the other changes in smaller batches...
- Lately I've noticed a lot of activity from the country code tagger curation task
- Looking in the logs I see items being tagged that are very old and should have already been tagged years ago
- Also, I see a ton of these errors whenever the task is updating an item:
```console
2023-02-17 08:01:00,252 INFO org.dspace.curate.Curator @ Curation task: countrycodetagger performed on: 10568/89020 with status: 0. Result: '10568/89020: added 1 alpha2 country code(s)'
2023-02-17 08:01:00,467 ERROR com.atmire.versioning.ModificationLogger @ Error while writing item to versioning index: a0fe9d9a-6ac1-4b6a-8fcb-dae07a6bbf58 message:missing required field: epersonID
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
at com.atmire.versioning.ModificationLogger.indexItem(ModificationLogger.java:263)
at com.atmire.versioning.ModificationConsumer.end(ModificationConsumer.java:134)
at org.dspace.event.BasicDispatcher.dispatch(BasicDispatcher.java:157)
at org.dspace.core.Context.dispatchEvents(Context.java:455)
at org.dspace.curate.Curator.visit(Curator.java:541)
at org.dspace.curate.Curator$TaskRunner.run(Curator.java:568)
at org.dspace.curate.Curator.doCollection(Curator.java:515)
at org.dspace.curate.Curator.doCommunity(Curator.java:487)
at org.dspace.curate.Curator.doSite(Curator.java:451)
at org.dspace.curate.Curator.curate(Curator.java:269)
at org.dspace.curate.Curator.curate(Curator.java:203)
at org.dspace.curate.CurationCli.main(CurationCli.java:220)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- This must be related...
## 2023-02-18
- I realized why the country-code-tagger was tagging everything: I had overridden the `force` parameter last week!
- Start a harvest on AReS
## 2023-02-20
- IWMI is concerned that some of their items with top Altmetric attention scores don't show up in the AReS Explorer
- I looked into it for one and found that AReS is using the Handle, but Altmetric hasn't associated the Handle with the DOI
- Looking into country and region issues for the PRMS team
- Last week they had some questions about some invalid countries that ended up being typos
- I realized my cgspace-java-helpers country-code-tagger curation task is not using the latest version, so it was missing Türkiye
- I compiled the new version and ran it manually, but I have to upload a new version to Maven Central and then update the dependency in `dspace/modules/additions/pom.xml` ughhhhhh
- I tagged version 6.2 with the change for Türkiye and uploaded to to Maven Central with `mvn clean deploy`
- I'm having second thoughts about switching to UN M.49 for countries because there are just too many tradeoffs
- I want to find a way to keep our existing list, and codify some rules for it
- There are several discussions related to the shortcomings of ISO themselves and the iso-codes project, for example:
- [Inconsistency with articles in ISO-3166-1 English short names](https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/33) (this one was filed by me two years ago!)
- [ISO 3166-1: What's the policy for `common_name`?](https://salsa.debian.org/iso-codes-team/iso-codes/-/issues/44)
- I almost want to say fuck it, let's just use iso-codes and tell everyone to deal with it, but make sure we handle ISO 3166-1 Alpha2 or probably Alpha3 in the future
- Something like:
- Prefer `common_name` if it exists
- Prefer the shorter of `name` and `official name`
## 2023-02-21
- Continue working on my `parse-iso-codes.py` script to parse the iso-codes JSON for ISO 3166-1
- I also started a spreadsheet to track current CGSpace country names, proposed new names using the compromise above, and UN M.49 names
- I proposed this to Peter but he wasn't happy because there are still some stupidly long and political names there
- I bumped the version of cgspace-java-helpers to 6.2-SNAPSHOT and pushed it to Maven Central because I can't figure out how to get non-snapshot releases to go there
- Ouch, grunt 1.6.0 was released a few weeks ago, which relies on Node.js v16, thus breaking the Mirage 2 build in DSpace 6
- I filed [an issue in DSpace](https://github.com/DSpace/DSpace/issues/8676)
- Help Moises from CIP troubleshoot harvesting issues on their WordPress site
- I see 2,000 requests with the user agent "RTB website BOT" today and they are all HTTP 200
- Then I extracted a list of AGROVOC terms the same way I did in [August, 2022]({{< relref "2022-08.md" >}}) and used this Jython code to extract matching terms:
```python
import re
with open(r"/tmp/agrovoc-subjects.txt",'r') as f :
terms = [name.rstrip().lower() for name in f]
return "||".join([term for term in terms if re.match(r".*\b" + term + r"\b.*", value.lower())])
```
- Then I used [this cool Jython to remove duplicate metadata values](https://stackoverflow.com/questions/15419080/openrefine-remove-duplicates-from-list-with-jython):
```python
deduped_list = list(set(value.split("||")))
return '||'.join(map(str, deduped_list))
```
- Then I did the same with countries, woooooo!
- I checked for duplicates and found forty-one
- I just stumbled upon UNTERM, which provides the official list of countries for the UN General Assembly, including a downloadable Excel with the short and formal names in all UN languages: https://unterm.un.org/unterm2/en/country
- I created a [pull request to add common names for Iran, Laos, and Syria on the Debian iso-codes package](https://salsa.debian.org/iso-codes-team/iso-codes/-/merge_requests/32)
- These are remarked upon in the ISO.org online browsing platform for ISO 3166-1
- I need to get some definitions from Peter for some types
- Peter sent some of the feedback from Indira to XMLUI
- I removed some old facets, limited others to less values, and adjusted the recent submissions from 5 to 10
## 2023-02-24
- More work on understanding Sam's CAS publications to prepare for uploading them to CGSpace
- I need to reconcile the duplicates and Peter's type re-classifications in the final version of the spreadsheet
- I flagged all the duplicates by creating a custom text facet matching all their titles like:
```console
or(
isNotNull(value.match("Evaluation of the CGIAR Research Program on Climate Change, Agriculture and Food Security (CCAFS)")),
isNotNull(value.match("Report of the IEA Workshop on Development, Use and Assessment of TOC in CGIAR Research, Rome, 12-13 January 2017")),
isNotNull(value.match("Report of the IEA Workshop on Evaluating the Quality of Science, Rome, 10-11 December 2015")),
isNotNull(value.match("Review of CGIAR’s Intellectual Assets Principles")),
...
)
```
- Annoyingly this seems to miss the ones with parenthesis so I had to do those manually
- This matched thirty-seven items, then I flagged them so I can handle them separately after uploading the others
- Then I used the URL field in the old version of the file to match the items with types `Evaluation` and `Independent Commentary` since Peter changed them
- I added extent, volume, issue, number, and affiliation to a few journal articles
- Then I did some last minute checks to make sure we're not uploading files for items marked as having "multiple documents"
## 2023-02-25
- Oh nice, my [pull request adding common names for Iran, Laos, and Syria to iso-codes](https://salsa.debian.org/iso-codes-team/iso-codes/-/merge_requests/32) was merged
- I did a test import of the 198 CAS Publications on DSpace Test, then inspected Abenet's file with Gaia's "multiple documents" field one more time and decided to do the import on CGSpace
- Gaia's "multiple documents" column had some text like "E6" and "F7" that didn't make any sense, and those files were not in the Sharepoint even
- I found two items for the CAS Publications that were marked as a duplicates, but upon second inspection were not, so I uploaded it to CGSpace
- That makes the total number of items for CAS 200...
- I did some CSV joining and inspections with the remaining thirty-six duplicates with the metadata for their existing items on CGSpace and uploaded them
- Do some work on the new DSpace 7 submission forms
- I ended up reverting to the stock configuration to use some new techniques like the style and type bind
## 2023-02-28
- Keep working on the DSpace 7 submission forms
- As part of this I asked Maria and Francesca if they are still using the `cg.link.permalink` (Bioversity publications permalink) and they said no, so we can remove it from the submission form
- I also removed `cg.subject.ccafs` since the CRP ended over a year ago and `cg.subject.pabra` since there have only been a handful of new items in [their collection](https://hdl.handle.net/10568/80211) and they seem to be using Alliance subjects instead
- I filed [a bug](https://github.com/DSpace/DSpace/issues/8686) on DSpace regarding the inability to add freetext values from an input field that uses a vocabulary