cgspace-notes/2017-05.md at 534f0d9cf813bbb6cc433637a27b10fb0b5e4ce8

mirror of https://github.com/alanorth/cgspace-notes.git synced 2024-10-31 20:33:00 +01:00

They moved to Lyrasis in 2019.

2020-04-13 15:30:34 +03:00

18 KiB


+++
date = "2017-05-01T16:21:52+02:00"
author = "Alan Orth"
title = "May, 2017"
tags = ["Notes"]
+++

2017-05-01

ICARDA apparently started working on CG Core on their MEL repository
They have added a few cg.* fields, but not very consistently, and have even copied some CGSpace items:
- https://mel.cgiar.org/xmlui/handle/20.500.11766/6911?show=full
- https://cgspace.cgiar.org/handle/10568/73683

2017-05-02

Atmire got back about the Workflow Statistics issue, and apparently it's a bug in the CUA module so they will send us a pull request

2017-05-04

Sync DSpace Test with database and assetstore from CGSpace
Re-deploy DSpace Test with Atmire's CUA patch for workflow statistics, run system updates, and restart the server
Now I can see the workflow statistics and am able to select users, but everything returns 0 items
Megan says some mapped items have still not appeared since last week, so I forced a full index-discovery -b
Need to remember to check tomorrow whether the collection has more items (currently 39 on CGSpace, but 118 on the freshly reindexed DSpace Test): https://cgspace.cgiar.org/handle/10568/80731

2017-05-05

Discovered that CGSpace has ~700 items that are missing the cg.identifier.status field
Need to perhaps try using the "required metadata" curation task to find items missing these fields:

$ [dspace]/bin/dspace curate -t requiredmetadata -i 10568/1 -r - > /tmp/curation.out

It seems the curation task dies when it finds an item which has missing metadata

2017-05-06

Add "Blog Post" to dc.type
Create ticket on Atmire tracker to ask about commissioning them to develop the feature to expose ORCID via REST/OAI: https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=510
According to the DSpace curation docs the fact that the requiredmetadata curation task stops when it finds a missing metadata field is by design

2017-05-07

Testing one replacement for CCAFS Flagships (cg.subject.ccafs), first changed in the submission forms, and then in the database:

$ ./fix-metadata-values.py -i ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p fuuu

Also, CCAFS wants to re-order their flagships to prioritize the Phase II ones
Waiting for feedback from CCAFS, then I can merge #320

2017-05-08

Start working on CGIAR Library migration
We decided to use AIP export to preserve the hierarchies and handles of communities and collections
When ingesting some collections I was getting java.lang.OutOfMemoryError: GC overhead limit exceeded, which can be solved by disabling the GC overhead limit check with -XX:-UseGCOverheadLimit
Other times I was getting an error about heap space, so I kept bumping the heap allocation by 512MB each time it crashed (up to 4096m!)
These failed ingests lead to tens of thousands of abandoned files in the assetstore, which need to be cleaned up using dspace cleanup -v, or else you'll run out of disk space
In the end I realized it's better to use submission mode (-s) to ingest the community object as a single AIP without its children, followed by each of the collections:

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx2048m -XX:-UseGCOverheadLimit"
$ [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10568/87775 /home/aorth/10947-1/10947-1.zip
$ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
$ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done

Note that in submission mode DSpace ignores the handle specified in mets.xml in the zip file, so you need to turn that off with -o ignoreHandle=false
The -u option suppresses prompts, to allow the process to run without user input
Give feedback to CIFOR about their data quality:
- Suggestion: uppercase dc.subject, cg.coverage.region, and cg.coverage.subregion in your crosswalk so they match CGSpace and therefore can be faceted / reported on easier
- Suggestion: use CGSpace's CRP names (cg.contributor.crp), see: dspace/config/input-forms.xml
- Suggestion: clean up duplicates and errors in funders, perhaps use a controlled vocabulary like ours, see: dspace/config/controlled-vocabularies/dc-description-sponsorship.xml
- Suggestion: use dc.type "Blog Post" instead of "Blog" for your blog post items (we are also adding a "Blog Post" type to CGSpace soon)
- Question: many of your items use dc.document.uri AND cg.identifier.url with the same text value?
Help Marianne from WLE with an Open Search query to show the latest WLE CRP outputs: https://cgspace.cgiar.org/open-search/discover?query=crpsubject:WATER%2C+LAND+AND+ECOSYSTEMS&sort_by=2&order=DESC
This uses the webui's item list sort options, see webui.itemlist.sort-option in dspace.cfg
The equivalent Discovery search would be: https://cgspace.cgiar.org/discover?filtertype_1=crpsubject&filter_relational_operator_1=equals&filter_1=WATER%2C+LAND+AND+ECOSYSTEMS&submit_apply_filter=&query=&rpp=10&sort_by=dc.date.issued_dt&order=desc
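
The OpenSearch URL above can also be built programmatically. This is only a minimal Python sketch that reuses the three parameters visible in that URL; the function name `open_search_url` is hypothetical and not part of DSpace:

```python
from urllib.parse import urlencode

BASE_URL = "https://cgspace.cgiar.org/open-search/discover"


def open_search_url(crp_subject, sort_by=2, order="DESC"):
    """Build an OpenSearch URL that filters on the crpsubject index."""
    params = {
        "query": f"crpsubject:{crp_subject}",
        "sort_by": sort_by,
        "order": order,
    }
    # safe=":" leaves the field separator readable; spaces become "+"
    # and commas become "%2C", as in the hand-written URL
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


print(open_search_url("WATER, LAND AND ECOSYSTEMS"))
```

The sort_by value refers to the position of the sort option in webui.itemlist.sort-option, so it would need to change if those options are re-ordered in dspace.cfg.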

2017-05-09

The CGIAR Library metadata has some blank metadata values, which leads to ||| in the Discovery facets
Clean these up in the database using:

dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';

I ended up running into issues during data cleaning and decided to wipe out the entire community and re-sync DSpace Test assetstore and database from CGSpace rather than waiting for the cleanup task to clean up
Hours into the re-ingestion I ran into more errors, and had to erase everything and start over again!
Now, no matter what I do I keep getting duplicate key errors...

Caused by: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "handle_pkey"
  Detail: Key (handle_id)=(80928) already exists.

I think those errors actually come from me running the update-sequences.sql script while Tomcat/DSpace are running
Apparently you need to stop Tomcat!

2017-05-10

Atmire says they are willing to extend the ORCID implementation, and I've asked them to provide a quote
I clarified that the scope of the implementation should be that ORCIDs are stored in the database and exposed via REST / API like other fields
Finally finished importing all the CGIAR Library content, final method was:

$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx3072m -XX:-UseGCOverheadLimit"
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2517/10947-2517.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2515/10947-2515.zip
$ [dspace]/bin/dspace packager -r -a -t AIP -o skipIfParentMissing=true -e some@user.com -p 10568/80923 /home/aorth/10947-2516/10947-2516.zip
$ [dspace]/bin/dspace packager -s -t AIP -o ignoreHandle=false -e some@user.com -p 10568/80923 /home/aorth/10947-1/10947-1.zip
$ for collection in /home/aorth/10947-1/COLLECTION@10947-*; do [dspace]/bin/dspace packager -s -o ignoreHandle=false -t AIP -e some@user.com -p 10947/1 $collection; done
$ for item in /home/aorth/10947-1/ITEM@10947-*; do [dspace]/bin/dspace packager -r -f -u -t AIP -e some@user.com $item; done

Basically, import the smaller communities using recursive AIP import (with skipIfParentMissing)
Then, for the larger community, create the community, collections, and items separately, ingesting the items one by one
The -XX:-UseGCOverheadLimit JVM option helps with some issues in large imports
After this I ran the update-sequences.sql script (with Tomcat shut down), and cleaned up the 200+ blank metadata records:

dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';

2017-05-13

After quite a bit of troubleshooting with importing cleaned up data as CSV, it seems that there are actually NUL characters in the dc.description.abstract field (at least) on the lines where CSV importing was failing
I tried to find a way to remove the characters in vim or Open Refine, but decided it was quicker to just remove the column temporarily and import it
The import was successful and detected 2022 changes, which should likely be the rest that were failing to import before
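
For next time, stripping the NUL bytes from the CSV before importing might be an alternative to dropping the column. This is not what I did above, just a minimal Python sketch; the function names are hypothetical:

```python
def strip_nul_bytes(data):
    """Return the data without NUL bytes, plus how many were removed."""
    removed = data.count(b"\x00")
    return data.replace(b"\x00", b""), removed


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path without NUL bytes; return the number removed."""
    # Work on raw bytes so the file's text encoding is left untouched
    with open(src_path, "rb") as src:
        cleaned, removed = strip_nul_bytes(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(cleaned)
    return removed
```

Writing to a separate output file keeps the original export around for comparison if the import still fails.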

2017-05-15

To delete the blank lines that cause issues during import we need to use a regex in vim: :g/^$/d
After that I started looking in the dc.subject field to try to pull countries and regions out, but there are too many values in there
Bump the Academicons dependency of the Mirage 2 themes from 1.6.0 to 1.8.0 because the upstream deleted the old tag and now the build is failing: #321
Merge changes to CCAFS project identifiers and flagships: #320
Run updates for CCAFS flagships on CGSpace:

$ ./fix-metadata-values.py -i /tmp/ccafs-flagships-may7.csv -f cg.subject.ccafs -t correct -m 210 -d dspace -u dspace -p 'fuuu'

These include:
- GENDER AND SOCIAL DIFFERENTIATION→GENDER AND SOCIAL INCLUSION
- MANAGING CLIMATE RISK→CLIMATE SERVICES AND SAFETY NETS
Re-deploy CGSpace and DSpace Test and run system updates
Reboot DSpace Test
Fix cron jobs for log management on DSpace Test, as they weren't catching dspace.log.* files correctly; we had over six months of them, taking up many gigabytes of disk space
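
The selection logic for such a log cleanup could look like the following. This is a rough Python sketch, not the actual cron job; it assumes the rotated logs are named dspace.log.YYYY-MM-DD, and the 30-day threshold and function name are made up for illustration:

```python
from datetime import date, timedelta
import re

# Assumption: rotated logs carry a date suffix, e.g. dspace.log.2017-05-15
LOG_NAME = re.compile(r"^dspace\.log\.(\d{4})-(\d{2})-(\d{2})$")


def logs_older_than(filenames, today, days=30):
    """Return the dspace.log.* names whose date suffix is more than `days` old."""
    cutoff = today - timedelta(days=days)
    old = []
    for name in filenames:
        match = LOG_NAME.match(name)
        if match and date(*map(int, match.groups())) < cutoff:
            old.append(name)
    return sorted(old)
```

Matching on the full file name avoids touching the live dspace.log or the logs of other applications in the same directory.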

2017-05-16

Discuss updates to WLE themes for their Phase II
Make an issue to track the changes to cg.subject.wle: #322

2017-05-17

Looking into the error I get when trying to create a new collection on DSpace Test:

ERROR: duplicate key value violates unique constraint "handle_pkey" Detail: Key (handle_id)=(84834) already exists.

I tried updating the sequences a few times, with Tomcat running and stopped, but it hasn't helped
It appears the item with handle_id 84834 is one of the imported CGIAR Library items:

dspace=# select * from handle where handle_id=84834;
 handle_id |   handle   | resource_type_id | resource_id
-----------+------------+------------------+-------------
     84834 | 10947/1332 |                2 |       87113

Looks like the max handle_id is actually much higher:

dspace=# select * from handle where handle_id=(select max(handle_id) from handle);
 handle_id |  handle  | resource_type_id | resource_id
-----------+----------+------------------+-------------
     86873 | 10947/99 |                2 |       89153
(1 row)

I've posted on the dspace-tech mailing list to see if I can just manually set the handle_seq to that value
Actually, it seems I can manually set the handle sequence using:

dspace=# select setval('handle_seq',86873);

After that I can create collections just fine, though I'm not sure if it has other side effects
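
My understanding of why this works, as a toy Python model rather than anything DSpace or PostgreSQL actually runs: the imported items occupy handle_id values above the sequence's current value, so the next value handed out collides with an existing row. The starting sequence value (84833) and the handle "10568/new" are assumptions for illustration:

```python
class Sequence:
    """Toy stand-in for a PostgreSQL sequence such as handle_seq."""

    def __init__(self, last_value):
        self.last_value = last_value

    def nextval(self):
        self.last_value += 1
        return self.last_value

    def setval(self, value):
        self.last_value = value
        return value


def insert_handle(table, seq, handle):
    """Insert a row keyed by the next sequence value, as a primary key would be."""
    handle_id = seq.nextval()
    if handle_id in table:
        raise ValueError(f"duplicate key: handle_id {handle_id} already exists")
    table[handle_id] = handle
    return handle_id


# Imported rows already use ids up to 86873, but the sequence lags behind
table = {84834: "10947/1332", 86873: "10947/99"}
seq = Sequence(84833)

try:
    insert_handle(table, seq, "10568/new")
except ValueError as error:
    print(error)

# Moving the sequence to the highest existing id lets the next insert succeed
seq.setval(max(table))
print(insert_handle(table, seq, "10568/new"))
```

In this model any id between the old sequence value and the maximum would collide sooner or later, which is why jumping straight to the maximum is the simplest fix.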

2017-05-21

Start creating a basic theme for the CGIAR System Organization's community on CGSpace
Using colors from the CGIAR Branding guidelines (2014)
Make a GitHub issue to track this work: #324

2017-05-22

Do some cleanups of community and collection names in CGIAR System Management Office community on DSpace Test, as well as move some items as Peter requested
Peter wanted a list of authors in here, so I generated a list of collections using the "View Source" on each community and this hacky awk:

$ grep 10947/ /tmp/collections | grep -v cocoon | awk -F/ '{print $3"/"$4}' | awk -F\" '{print $1}' | vim -
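
A less hacky version of that pipeline, including the joining step, might look like this in Python. It is a sketch that assumes the collection links in the page source look like href="/handle/10947/2", and both function names are hypothetical:

```python
import re

# Assumption: collection links in the page source look like href="/handle/10947/2"
HANDLE_LINK = re.compile(r'href="/handle/(10947/\d+)"')


def extract_handles(html):
    """Return the 10947/* handles linked from the page, in order, without repeats."""
    seen = []
    for handle in HANDLE_LINK.findall(html):
        if handle not in seen:
            seen.append(handle)
    return seen


def sql_in_list(handles):
    """Quote and join handles for use inside a SQL IN (...) clause."""
    return ", ".join(f"'{handle}'" for handle in handles)
```

The quoting here is only suitable for trusted values like these handles; anything user-supplied should go through query parameters instead.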

Then I joined them together and ran this old SQL query from the dspace-tech mailing list which gives you authors for items in those collections:

dspace=# select distinct text_value
from metadatavalue
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
AND resource_type_id = 2
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10947/2', '10947/3', '10947/10', '10947/4', '10947/5', '10947/6', '10947/7', '10947/8', '10947/9', '10947/11', '10947/25', '10947/12', '10947/26', '10947/27', '10947/28', '10947/29', '10947/30', '10947/13', '10947/14', '10947/15', '10947/16', '10947/31', '10947/32', '10947/33', '10947/34', '10947/35', '10947/36', '10947/37', '10947/17', '10947/18', '10947/38', '10947/19', '10947/39', '10947/40', '10947/41', '10947/42', '10947/43', '10947/2512', '10947/44', '10947/20', '10947/21', '10947/45', '10947/46', '10947/47', '10947/48', '10947/49', '10947/22', '10947/23', '10947/24', '10947/50', '10947/51', '10947/2518', '10947/2776', '10947/2790', '10947/2521', '10947/2522', '10947/2782', '10947/2525', '10947/2836', '10947/2524', '10947/2878', '10947/2520', '10947/2523', '10947/2786', '10947/2631', '10947/2589', '10947/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535', '10947/2537', '10568/93761')));

To get a CSV (with counts) from that:

dspace=# \copy (select distinct text_value, count(*)
from metadatavalue
where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author')
AND resource_type_id = 2
AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10947/2', '10947/3', '10947/10', '10947/4', '10947/5', '10947/6', '10947/7', '10947/8', '10947/9', '10947/11', '10947/25', '10947/12', '10947/26', '10947/27', '10947/28', '10947/29', '10947/30', '10947/13', '10947/14', '10947/15', '10947/16', '10947/31', '10947/32', '10947/33', '10947/34', '10947/35', '10947/36', '10947/37', '10947/17', '10947/18', '10947/38', '10947/19', '10947/39', '10947/40', '10947/41', '10947/42', '10947/43', '10947/2512', '10947/44', '10947/20', '10947/21', '10947/45', '10947/46', '10947/47', '10947/48', '10947/49', '10947/22', '10947/23', '10947/24', '10947/50', '10947/51', '10947/2518', '10947/2776', '10947/2790', '10947/2521', '10947/2522', '10947/2782', '10947/2525', '10947/2836', '10947/2524', '10947/2878', '10947/2520', '10947/2523', '10947/2786', '10947/2631', '10947/2589', '10947/2519', '10947/2708', '10947/2526', '10947/2871', '10947/2527', '10947/4467', '10947/3457', '10947/2528', '10947/2529', '10947/2533', '10947/2530', '10947/2531', '10947/2532', '10947/2538', '10947/2534', '10947/2540', '10947/2900', '10947/2539', '10947/2784', '10947/2536', '10947/2805', '10947/2541', '10947/2535', '10947/2537', '10568/93761'))) group by text_value order by count desc) to /tmp/cgiar-librar-authors.csv with csv;

2017-05-23

Add Affiliation to filters on Listing and Reports module (#325)
Start looking at WLE's Phase II metadata updates but it seems they are not tagging their items properly, as their website importer infers which theme to use based on the name of the CGSpace collection!
For now I've suggested that they just change the collection names and that we fix their metadata manually afterwards
Also, they have a lot of messed up values in their cg.subject.wle field so I will clean up some of those first:

dspace=# \copy (select distinct text_value from metadatavalue where resource_type_id=2 and metadata_field_id=119) to /tmp/wle.csv with csv;
COPY 111

Respond to Atmire message about ORCIDs, saying that right now we'd prefer to just have them available via REST API like any other metadata field, and that I'm available for a Skype call

2017-05-26

Increase max file size in nginx so that CIP can upload some larger PDFs
Agree to talk with Atmire after the June DSpace developers meeting where they will be discussing exposing ORCIDs via REST/OAI

2017-05-28

File an issue on GitHub to explore/track migration to proper country/region codes (ISO 3166-1 alpha-2/alpha-3 and UN M.49): #326
Ask Peter how the Landportal.info people should acknowledge us as the source of data on their website
Communicate with MARLO people about progress on exposing ORCIDs via the REST API, as it is set to be discussed in the June, 2017 DCAT meeting
Find all of Amos Omore's author name variations so I can link them to his authority entry that has an ORCID:

dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Omore, A%';

Set the authority for all variations to one containing an ORCID:

dspace=# update metadatavalue set authority='4428ee88-90ef-4107-b837-3c0ec988520b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Omore, A%';
UPDATE 187

Next I need to do Edgar Twine:

dspace=# select distinct text_value, authority, confidence from metadatavalue where metadata_field_id=3 and text_value like 'Twine, E%';

But it doesn't look like any of his existing entries are linked to an authority which has an ORCID, so I edited the metadata via "Edit this Item" and looked up his ORCID and linked it there
Now I should be able to set his name variations to the new authority:

dspace=# update metadatavalue set authority='f70d0a01-d562-45b8-bca3-9cf7f249bc8b', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Twine, E%';
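
The same pattern as a reusable function, tried against an in-memory SQLite table rather than the real DSpace database. This is only a sketch: the sample rows and the "old-authority" value are made up, and SQLite's LIKE is case-insensitive for ASCII whereas PostgreSQL's is case-sensitive:

```python
import sqlite3

# In-memory stand-in for the DSpace metadatavalue table (made-up sample rows)
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table metadatavalue ("
    "metadata_field_id integer, resource_type_id integer, "
    "text_value text, authority text, confidence integer)"
)
conn.executemany(
    "insert into metadatavalue values (3, 2, ?, ?, ?)",
    [
        ("Omore, A.", None, -1),
        ("Omore, Amos", "old-authority", 500),
        ("Orth, A.", None, -1),
    ],
)


def set_authority(conn, name_prefix, authority, confidence=600):
    """Point every author value starting with name_prefix at one authority."""
    cursor = conn.execute(
        "update metadatavalue set authority = ?, confidence = ? "
        "where metadata_field_id = 3 and resource_type_id = 2 "
        "and text_value like ?",
        (authority, confidence, name_prefix + "%"),
    )
    return cursor.rowcount


updated = set_authority(conn, "Omore, A", "4428ee88-90ef-4107-b837-3c0ec988520b")
print(updated)
```

Running the select first, as above, is still worthwhile, because a prefix like 'Omore, A%' would also catch any other author whose name happens to start the same way.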

Run the corrections on CGSpace and then update discovery / authority
I notice that there are a handful of java.lang.OutOfMemoryError: Java heap space errors in the Catalina logs on CGSpace; I should go look into that...

2017-05-29

Discuss WLE themes and subjects with Mia and Macaroni Bros
We decided we need to create metadata fields for Phase I and II themes
I've updated the existing GitHub issue for Phase II (#322) and created a new one to track the changes for Phase I themes (#327)
After Macaroni Bros update the WLE website importer we will rename the WLE collections to reflect Phase II
Also, we need to have Mia and Udana look through the existing metadata in cg.subject.wle as it is quite a mess
