<!DOCTYPE html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< meta name = "viewport" content = "width=device-width, initial-scale=1, shrink-to-fit=no" >
< meta property = "og:title" content = "March, 2018" / >
< meta property = "og:description" content = "2018-03-02
Export a CSV of the IITA community metadata for Martin Mueller
" />
< meta property = "og:type" content = "article" / >
< meta property = "og:url" content = "https://alanorth.github.io/cgspace-notes/2018-03/" / >
< meta property = "article:published_time" content = "2018-03-02T16:07:54+02:00" / >
< meta property = "article:modified_time" content = "2019-10-28T13:39:25+02:00" / >
< meta name = "twitter:card" content = "summary" / >
< meta name = "twitter:title" content = "March, 2018" / >
< meta name = "twitter:description" content = "2018-03-02
Export a CSV of the IITA community metadata for Martin Mueller
"/>
< meta name = "generator" content = "Hugo 0.62.2" / >
< script type = "application/ld+json" >
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"headline": "March, 2018",
"url": "https:\/\/alanorth.github.io\/cgspace-notes\/2018-03\/",
"wordCount": "2960",
"datePublished": "2018-03-02T16:07:54+02:00",
"dateModified": "2019-10-28T13:39:25+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
},
"keywords": "Notes"
}
< / script >
< link rel = "canonical" href = "https://alanorth.github.io/cgspace-notes/2018-03/" >
< title > March, 2018 | CGSpace Notes< / title >
<!-- combined, minified CSS -->
< link href = "https://alanorth.github.io/cgspace-notes/css/style.a20c1a4367639632cdb341d23c27ca44fedcc75b0f8b3cbea6203010da153d3c.css" rel = "stylesheet" integrity = "sha256-ogwaQ2djljLNs0HSPCfKRP7cx1sPizy+piAwENoVPTw=" crossorigin = "anonymous" >
<!-- RSS 2.0 feed -->
< / head >
< body >
< div class = "blog-masthead" >
< div class = "container" >
< nav class = "nav blog-nav" >
< a class = "nav-link " href = "https://alanorth.github.io/cgspace-notes/" > Home< / a >
< / nav >
< / div >
< / div >
< header class = "blog-header" >
< div class = "container" >
< h1 class = "blog-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/" rel = "home" > CGSpace Notes< / a > < / h1 >
< p class = "lead blog-description" dir = "auto" > Documenting day-to-day work on the < a href = "https://cgspace.cgiar.org" > CGSpace< / a > repository.< / p >
< / div >
< / header >
< div class = "container" >
< div class = "row" >
< div class = "col-sm-8 blog-main" >
< article class = "blog-post" >
< header >
< h2 class = "blog-post-title" dir = "auto" > < a href = "https://alanorth.github.io/cgspace-notes/2018-03/" > March, 2018< / a > < / h2 >
< p class = "blog-post-meta" > < time datetime = "2018-03-02T16:07:54+02:00" > Fri Mar 02, 2018< / time > by Alan Orth in
< i class = "fa fa-folder" aria-hidden = "true" > < / i > < a href = "/cgspace-notes/categories/notes" rel = "category tag" > Notes< / a >
< / p >
< / header >
< h2 id = "2018-03-02" > 2018-03-02< / h2 >
< ul >
< li > Export a CSV of the IITA community metadata for Martin Mueller< / li >
< / ul >
< h2 id = "2018-03-06" > 2018-03-06< / h2 >
< ul >
< li > Add three new CCAFS project tags to < code > input-forms.xml< / code > (< a href = "https://github.com/ilri/DSpace/pull/357" > #357< / a > )< / li >
< li > Andrea from Macaroni Bros had sent me an email that CCAFS needs them< / li >
< li > Give Udana more feedback on his WLE records from last month< / li >
< li > There were some records using a non-breaking space in their AGROVOC subject field< / li >
< li > I checked and tested some author corrections from Peter from last week, and then applied them on CGSpace< / li >
< / ul >
< pre > < code > $ ./fix-metadata-values.py -i Correct-309-authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i Delete-3-Authors-2018-03-06.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -m 3
< / code > < / pre > < ul >
< li > This time there were no errors in whitespace but I did have to correct one incorrectly encoded accent character< / li >
< li > Add new CRP subject “GRAIN LEGUMES AND DRYLAND CEREALS” to < code > input-forms.xml< / code > (< a href = "https://github.com/ilri/DSpace/pull/358" > #358< / a > )< / li >
< li > Merge the ORCID integration stuff in to < code > 5_x-prod< / code > for deployment on CGSpace soon (< a href = "https://github.com/ilri/DSpace/pull/359" > #359< / a > )< / li >
< li > Deploy ORCID changes on CGSpace (linode18), run all system updates, and reboot the server< / li >
< li > Run all system updates on DSpace Test and reboot server< / li >
< li > I ran the < a href = "https://gist.github.com/alanorth/24d8081a5dc25e2a4e27e548e7e2389c" > orcid-authority-to-item.py< / a > script on CGSpace and mapped 2,864 ORCID identifiers from Solr to item metadata< / li >
< / ul >
< pre > < code > $ ./orcid-authority-to-item.py -db dspace -u dspace -p 'fuuu' -s http://localhost:8081/solr -d
< / code > < / pre > < ul >
< li > I ran the DSpace cleanup script on CGSpace and it threw an error (as always):< / li >
< / ul >
< pre > < code > Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".
< / code > < / pre > < ul >
< li > The solution is, as always:< / li >
< / ul >
< pre > < code > $ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (150659);'
UPDATE 1
< / code > < / pre > < ul >
< li > Apply the proposed PostgreSQL indexes from DS-3636 (pull request < a href = "https://github.com/DSpace/DSpace/pull/1791/" > #1791< / a > ) on CGSpace (linode18)< / li >
< / ul >
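< ul >
< li > Since the cleanup script fails this way regularly, the fix above could be generated from the error text. A minimal sketch in Python, assuming the error always has the same shape as the one logged above; the function name < code > primary_bitstream_fix< / code > is hypothetical, and the sketch only builds the SQL statement without running it:< / li >
< / ul >

```python
import re

# Matches the "Detail:" line PostgreSQL prints for the foreign key violation,
# for example: Key (bitstream_id)=(150659) is still referenced from table "bundle".
KEY_PATTERN = re.compile(r"Key \(bitstream_id\)=\((\d+)\)")


def primary_bitstream_fix(error_text):
    """Build the UPDATE statement that clears the offending primary bitstreams.

    Returns None when no bitstream ID can be found in the error text.
    """
    # De-duplicate and sort the IDs so the generated statement is stable
    ids = sorted({int(match) for match in KEY_PATTERN.findall(error_text)})
    if not ids:
        return None
    id_list = ", ".join(str(bitstream_id) for bitstream_id in ids)
    return (
        "update bundle set primary_bitstream_id=NULL "
        f"where primary_bitstream_id in ({id_list});"
    )


if __name__ == "__main__":
    sample = (
        'Error: ERROR: update or delete on table "bitstream" violates foreign key '
        'constraint "bundle_primary_bitstream_id_fkey" on table "bundle"\n'
        'Detail: Key (bitstream_id)=(150659) is still referenced from table "bundle".'
    )
    print(primary_bitstream_fix(sample))
```

< ul >
< li > The printed statement can then be passed to < code > psql dspace -c< / code > as shown earlier, after checking the IDs by hand< / li >
< / ul >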
< h2 id = "2018-03-07" > 2018-03-07< / h2 >
< ul >
< li > Add CIAT author Mauricio Efren Sotelo Cabrera to controlled vocabulary for ORCID identifiers (< a href = "https://github.com/ilri/DSpace/pull/360" > #360< / a > )< / li >
< li > Help Sisay proof 200 IITA records on DSpace Test< / li >
< li > Finally import Udana's 24 items to < a href = "https://cgspace.cgiar.org/handle/10568/36185" > IWMI Journal Articles< / a > on CGSpace< / li >
< li > Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc< / li >
< / ul >
< h2 id = "2018-03-08" > 2018-03-08< / h2 >
< ul >
< li > Looking at a CSV dump of the CIAT community I see there are tons of stupid text languages people add for their metadata< / li >
< li > This makes the CSV have tons of columns, for example < code > dc.title< / code > , < code > dc.title[]< / code > , < code > dc.title[en]< / code > , < code > dc.title[eng]< / code > , < code > dc.title[en_US]< / code > and so on!< / li >
< li > I think I can fix — or at least normalize — them in the database:< / li >
< / ul >
< pre > < code > dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------
ethnob
en
spa
EN
En
en_
en_US
E.
EN_US
en_U
eng
fr
es_ES
es
(16 rows)
dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
UPDATE 122227
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------
ethnob
en_US
spa
E.
fr
es_ES
es
(9 rows)
< / code > < / pre > < ul >
< li > On second inspection it looks like < code > dc.description.provenance< / code > fields use the text_lang “en” so that's probably why there are over 100,000 fields changed… < / li >
< li > If I skip that, there are about 2,000, which seems like a more reasonable number of fields that users have edited manually, or fucked up during CSV import, etc:< / li >
< / ul >
< pre > < code > dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('EN','En','en_','EN_US','en_U','eng');
UPDATE 2309
< / code > < / pre > < ul >
< li > I will apply this on CGSpace right now< / li >
< li > In other news, I was playing with adding ORCID identifiers to a dump of CIAT's community via CSV in OpenRefine< / li >
< li > Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the < code > cg.creator.id< / code > field< / li >
< li > For example, a GREL expression in a custom text facet to get all items with < code > dc.contributor.author[en_US]< / code > of a certain author with several name variations (this is how you use a logical OR in OpenRefine):< / li >
< / ul >
< pre > < code > or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
< / code > < / pre > < ul >
< li > Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:< / li >
< / ul >
< pre > < code > if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
< / code > < / pre > < ul >
< li > One thing that bothers me is that this won't honor author order< / li >
< li > It might be better to do batches of these in PostgreSQL with a script that takes the < code > place< / code > column of an author into account when setting the < code > cg.creator.id< / code > < / li >
< li > I wrote a Python script to read the author names and ORCID identifiers from CSV and create matching < code > cg.creator.id< / code > fields: < a href = "https://gist.github.com/alanorth/a49d85cd9c5dea89cddbe809813a7050" > add-orcid-identifiers-csv.py < / a > < / li >
< li > The CSV should have two columns: author name and ORCID identifier:< / li >
< / ul >
< pre > < code > dc.contributor.author,cg.creator.id
"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458
"Orth, A.",Alan S. Orth: 0000-0002-1735-7458
< / code > < / pre > < ul >
< li > I didn't integrate the ORCID API lookup for author names in this script for now because I was only interested in “tagging” old items for a few given authors< / li >
< li > I added ORCID identifiers for 187 items by CIAT's Hernan Ceballos, because that is what Elizabeth was trying to do manually!< / li >
< li > Also, I decided to add ORCID identifiers for all records from Peter, Abenet, and Sisay as well< / li >
< / ul >
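< ul >
< li > The two-column CSV and the set-or-append logic from the GREL expression above can be sketched in Python. This is not the < code > add-orcid-identifiers-csv.py< / code > script itself: the function names are hypothetical, whitespace-only values are treated as blank, and skipping an identifier that is already present is an added assumption. Like the GREL version, it does not account for author order:< / li >
< / ul >

```python
import csv
import io


def load_orcid_map(csv_text):
    """Map each author name variant to its "Name: ORCID" value.

    Expects the two columns shown above: dc.contributor.author and cg.creator.id.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["dc.contributor.author"]: row["cg.creator.id"] for row in reader}


def add_creator_id(existing, orcid):
    """Set cg.creator.id directly, or append to it using the || separator."""
    # Same idea as if(isBlank(value), ...) in the GREL expression
    if existing is None or existing.strip() == "":
        return orcid
    # Assumption: do not add the same identifier twice
    if orcid in existing.split("||"):
        return existing
    return existing + "||" + orcid


if __name__ == "__main__":
    sample = (
        "dc.contributor.author,cg.creator.id\n"
        '"Orth, Alan",Alan S. Orth: 0000-0002-1735-7458\n'
        '"Orth, A.",Alan S. Orth: 0000-0002-1735-7458\n'
    )
    orcids = load_orcid_map(sample)
    print(add_creator_id("", orcids["Orth, A."]))
    print(add_creator_id("Hernan Ceballos: 0000-0002-8744-7918", orcids["Orth, Alan"]))
```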
< h2 id = "2018-03-09" > 2018-03-09< / h2 >
< ul >
< li > Give James Stapleton input on Sisay's KRAs< / li >
< li > Create a pull request to disable ORCID authority integration for < code > dc.contributor.author< / code > in the submission forms and XMLUI display (< a href = "https://github.com/ilri/DSpace/pull/363" > #363< / a > )< / li >
< / ul >
< h2 id = "2018-03-11" > 2018-03-11< / h2 >
< ul >
< li > Peter also wrote to say he is having issues with the Atmire Listings and Reports module< / li >
< li > When I logged in to try it I got a blank white page after continuing, and I see this in dspace.log.2018-03-11:< / li >
< / ul >
< pre > < code > 2018-03-11 11:38:15,592 WARN org.dspace.app.webui.servlet.InternalErrorServlet @ :session_id=91C2C0C59669B33A7683570F6010603A:internal_error:-- URL Was: https://cgspace.cgiar.org/jspui/listings-and-reports
-- Method: POST
-- Parameters were:
-- selected_admin_preset: "ilri authors2"
-- load: "normal"
-- next: "NEXT STEP >>"
-- step: "1"
org.apache.jasper.JasperException: java.lang.NullPointerException
< / code > < / pre > < ul >
< li > Looks like I needed to remove the Humidtropics subject from Listings and Reports because it was looking for the terms and couldn't find them< / li >
< li > I made a quick fix and it's working now (< a href = "https://github.com/ilri/DSpace/pull/364" > #364< / a > )< / li >
< / ul >
< h2 id = "2018-03-12" > 2018-03-12< / h2 >
< ul >
< li > Increase upload size on CGSpace's nginx config to 85MB so Sisay can upload some data< / li >
< / ul >
< h2 id = "2018-03-13" > 2018-03-13< / h2 >
< ul >
< li > I created a new Linode server for DSpace Test (linode6623840) so I could try the block storage stuff, but when I went to add a 300GB volume it said that block storage capacity was exceeded in that datacenter (Newark, NJ)< / li >
< li > I deleted the Linode and created another one (linode6624164) in the Fremont, CA region< / li >
< li > After that I deployed the Ubuntu 16.04 image and attached a 300GB block storage volume to the image< / li >
< li > Magdalena wrote to ask why there was no Altmetric donut for an item on CGSpace, but there was one on the related CCAFS publication page< / li >
< li > It looks like the CCAFS publications page fetches the donut using its DOI, whereas CGSpace queries via Handle< / li >
< li > I will write to Altmetric support and ask them, as perhaps it's part of a larger issue< / li >
< li > CGSpace item: < a href = "https://cgspace.cgiar.org/handle/10568/89643" > https://cgspace.cgiar.org/handle/10568/89643< / a > < / li >
< li > CCAFS publication page: < a href = "https://ccafs.cgiar.org/publications/can-scenario-planning-catalyse-transformational-change-evaluating-climate-change-policy" > https://ccafs.cgiar.org/publications/can-scenario-planning-catalyse-transformational-change-evaluating-climate-change-policy< / a > < / li >
< li > Peter tweeted the Handle link and now Altmetric shows the donut for both the DOI and the Handle< / li >
< / ul >
< h2 id = "2018-03-14" > 2018-03-14< / h2 >
< ul >
< li > Help Abenet with a troublesome Listings and Reports question for CIAT author Steve Beebe< / li >
< li > Continue migrating DSpace Test to the new server (linode6624164)< / li >
< li > I emailed ILRI service desk to update the DNS records for dspacetest.cgiar.org< / li >
< li > Abenet was having problems saving Listings and Reports configurations or layouts but I tested it and it works< / li >
< / ul >
< h2 id = "2018-03-15" > 2018-03-15< / h2 >
< ul >
< li > Help Abenet troubleshoot the Listings and Reports issue again< / li >
< li > It looks like it's an issue with the layouts, if you create a new layout that only has one type (< code > dc.identifier.citation< / code > ):< / li >
< / ul >
< p > < img src = "/cgspace-notes/2018/03/layout-only-citation.png" alt = "Listing and Reports layout" > < / p >
< ul >
< li > The error in the DSpace log is:< / li >
< / ul >
< pre > < code > org.apache.jasper.JasperException: java.lang.ArrayIndexOutOfBoundsException: -1
< / code > < / pre > < ul >
< li > The full error is here: < a href = "https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca" > https://gist.github.com/alanorth/ea47c092725960e39610db9b0c13f6ca< / a > < / li >
< li > If I do a report for “Orth, Alan” with the same custom layout it works!< / li >
< li > I submitted a ticket to Atmire: < a href = "https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589" > https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589< / a > < / li >
< li > Small fix to the example citation text in Listings and Reports (< a href = "https://github.com/ilri/DSpace/pull/365" > #365< / a > )< / li >
< / ul >
< h2 id = "2018-03-16" > 2018-03-16< / h2 >
< ul >
< li > ICT made the DNS updates for dspacetest.cgiar.org late last night< / li >
< li > I have removed the old server (linode02 aka linode578611) in favor of linode19 aka linode6624164< / li >
< li > Looking at the CRP subjects on CGSpace I see there is one blank one so I'll just fix it:< / li >
< / ul >
< pre > < code > dspace=# delete from metadatavalue where resource_type_id=2 and metadata_field_id=230 and text_value='';
< / code > < / pre > < ul >
< li > Copy all CRP subjects to a CSV to do the mass updates:< / li >
< / ul >
< pre > < code > dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=230 group by text_value order by count desc) to /tmp/crps.csv with csv header;
COPY 21
< / code > < / pre > < ul >
< li > Once I prepare the new input forms (< a href = "https://github.com/ilri/DSpace/issues/362" > #362< / a > ) I will need to do the batch corrections:< / li >
< / ul >
< pre > < code > $ ./fix-metadata-values.py -i Correct-21-CRPs-2018-03-16.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.crp -t correct -m 230 -n -d
< / code > < / pre > < ul >
< li > Create a pull request to update the input forms for the new CRP subject style (< a href = "https://github.com/ilri/DSpace/pull/366" > #366< / a > )< / li >
< / ul >
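< ul >
< li > For reference, a batch correction like the one above boils down to one parameterized < code > UPDATE< / code > per CSV row. A rough sketch in Python, not the actual < code > fix-metadata-values.py< / code > : it assumes the CSV columns are named after the < code > -f< / code > and < code > -t< / code > arguments, uses psycopg2-style < code > %s< / code > placeholders, and only builds the statements instead of executing them. The CRP names in the sample are made up:< / li >
< / ul >

```python
import csv
import io

# Assumption: psycopg2-style placeholders; values are never pasted into the SQL
UPDATE_SQL = (
    "UPDATE metadatavalue SET text_value=%s "
    "WHERE resource_type_id=2 AND metadata_field_id=%s AND text_value=%s"
)


def build_corrections(csv_text, from_column, to_column, field_id):
    """Return a list of (sql, params) tuples, one per row that needs a change."""
    statements = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        old_value = row[from_column]
        new_value = row[to_column]
        # Skip rows with no correction or where nothing would change
        if not new_value or new_value == old_value:
            continue
        statements.append((UPDATE_SQL, (new_value, field_id, old_value)))
    return statements


if __name__ == "__main__":
    # Hypothetical sample data, not the real contents of Correct-21-CRPs-2018-03-16.csv
    sample = (
        "cg.contributor.crp,correct\n"
        "OLD CRP NAME,New CRP Name\n"
        "UNCHANGED NAME,UNCHANGED NAME\n"
    )
    for sql, params in build_corrections(sample, "cg.contributor.crp", "correct", 230):
        print(sql, params)
```

< ul >
< li > Printing the statements first is roughly what I would expect from a dry run, and each tuple could later be passed to a database cursor's < code > execute()< / code > < / li >
< / ul >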
< h2 id = "2018-03-19" > 2018-03-19< / h2 >
< ul >
< li > Tezira has been having problems accessing CGSpace from the ILRI Nairobi campus since last week< / li >
< li > She is getting an HTTPS error apparently< / li >
< li > It's working outside, and Ethiopian users seem to be having no issues so I've asked ICT to have a look< / li >
< li > CGSpace crashed this morning for about seven minutes and Dani restarted Tomcat< / li >
< li > Around that time there was an increase of SQL errors:< / li >
< / ul >
< pre > < code > 2018-03-19 09:10:54,856 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
...
2018-03-19 09:10:54,862 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL query singleTable Error -
< / code > < / pre > < ul >
< li > But these errors, I don't even know what they mean, because a handful of them happen every day:< / li >
< / ul >
< pre > < code > $ grep -c 'ERROR org.dspace.storage.rdbms.DatabaseManager' dspace.log.2018-03-1*
dspace.log.2018-03-10:13
dspace.log.2018-03-11:15
dspace.log.2018-03-12:13
dspace.log.2018-03-13:13
dspace.log.2018-03-14:14
dspace.log.2018-03-15:13
dspace.log.2018-03-16:13
dspace.log.2018-03-17:13
dspace.log.2018-03-18:15
dspace.log.2018-03-19:90
< / code > < / pre > < ul >
< li > There wasn't even a lot of traffic at the time (8–9 AM):< / li >
< / ul >
< pre > < code > # zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E "19/Mar/2018:0[89]:" | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.197
92 83.103.94.48
96 40.77.167.175
116 207.46.13.178
122 66.249.66.153
140 95.108.181.88
196 213.55.99.121
206 197.210.168.174
207 104.196.152.243
294 54.198.169.202
< / code > < / pre > < ul >
< li > Well there is a hint in Tomcat's < code > catalina.out< / code > :< / li >
< / ul >
< pre > < code > Mon Mar 19 09:05:28 UTC 2018 | Query:id: 92032 AND type:2
Exception in thread "http-bio-127.0.0.1-8081-exec-280" java.lang.OutOfMemoryError: Java heap space
< / code > < / pre > < ul >
< li > So someone was doing something heavy somehow… my guess is content and usage stats!< / li >
< li > ICT responded that they “fixed” the CGSpace connectivity issue in Nairobi without telling me the problem< / li >
< li > When I asked, Robert Okal said CGNET messed up when updating the DNS for cgspace.cgiar.org last week< / li >
< li > I told him that my request last week was for dspacetest.cgiar.org, not cgspace.cgiar.org!< / li >
< li > So they updated the wrong fucking DNS records< / li >
< li > Magdalena from CCAFS wrote to ask about one record that has a bunch of metadata missing in her Listings and Reports export< / li >
< li > It appears to be this one: < a href = "https://cgspace.cgiar.org/handle/10568/83473?show=full" > https://cgspace.cgiar.org/handle/10568/83473?show=full< / a > < / li >
< li > The title is “Untitled” and there is some metadata but indeed the citation is missing< / li >
< li > I don't know what would cause that< / li >
< / ul >
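< ul >
< li > The nginx one-liner above can also be written in Python, which makes the time window easier to adjust. A minimal sketch, assuming a log format where the client IP is the first field and the timestamp looks like < code > [19/Mar/2018:08:15:01 +0000]< / code > ; < code > top_clients< / code > is a hypothetical helper and the sample lines are made up:< / li >
< / ul >

```python
import re
from collections import Counter


def top_clients(lines, date="19/Mar/2018", hours=("08", "09"), limit=10):
    """Count requests per client IP within the given hours of one day.

    Returns (ip, count) pairs sorted from most to least active.
    """
    # Equivalent in spirit to: grep -E "19/Mar/2018:0[89]:"
    pattern = re.compile(r"\[" + re.escape(date) + ":(?:" + "|".join(hours) + "):")
    counts = Counter()
    for line in lines:
        if pattern.search(line):
            # Like awk '{print $1}': the client IP is the first field
            counts[line.split(" ", 1)[0]] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = [
        '54.198.169.202 - - [19/Mar/2018:08:15:01 +0000] "GET / HTTP/1.1" 200 512',
        '54.198.169.202 - - [19/Mar/2018:09:20:10 +0000] "GET / HTTP/1.1" 200 512',
        '66.249.66.153 - - [19/Mar/2018:09:59:59 +0000] "GET / HTTP/1.1" 200 512',
        '66.249.66.153 - - [19/Mar/2018:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    ]
    for ip, count in top_clients(sample):
        print(count, ip)
```

< ul >
< li > Note that this prints the busiest client first, whereas the shell pipeline with < code > sort -n | tail< / code > lists it last< / li >
< / ul >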
< h2 id = "2018-03-20" > 2018-03-20< / h2 >
< ul >
< li > DSpace Test has been down for a few hours with SQL and memory errors starting this morning:< / li >
< / ul >
< pre > < code > 2018-03-20 08:47:10,177 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
...
2018-03-20 08:53:11,624 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
< / code > < / pre > < ul >
< li > I have no idea why it crashed< / li >
< li > I ran all system updates and rebooted it< / li >
< li > Abenet told me that one of Lance Robinson's ORCID iDs on CGSpace is incorrect< / li >
< li > I will remove it from the controlled vocabulary (< a href = "https://github.com/ilri/DSpace/pull/367" > #367< / a > ) and update any items using the old one:< / li >
< / ul >
< pre > < code > dspace=# update metadatavalue set text_value='Lance W. Robinson: 0000-0002-5224-8644' where resource_type_id=2 and metadata_field_id=240 and text_value like '%0000-0002-6344-195X%';
UPDATE 1
< / code > < / pre > < ul >
< li > Communicate with DSpace editors on Yammer about being more careful about spaces and character editing when doing manual metadata edits< / li >
< li > Merge the changes to CRP names to the < code > 5_x-prod< / code > branch and deploy on CGSpace (< a href = "https://github.com/ilri/DSpace/pull/363" > #363< / a > )< / li >
< li > Run corrections for CRP names in the database:< / li >
< / ul >
< pre > < code > $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db dspace -u dspace -p 'fuuu'
< / code > < / pre > < ul >
< li > Run all system updates on CGSpace (linode18) and reboot the server< / li >
< li > I started a full Discovery re-index on CGSpace because of the updated CRPs< / li >
< li > I see this error in the DSpace log:< / li >
< / ul >
< pre > < code > 2018-03-20 19:03:14,844 ERROR com.atmire.dspace.discovery.AtmireSolrService @ No choices plugin was configured for field "dc_contributor_author".
java.lang.IllegalArgumentException: No choices plugin was configured for field "dc_contributor_author".
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:261)
at org.dspace.content.authority.ChoiceAuthorityManager.getLabel(ChoiceAuthorityManager.java:249)
at org.dspace.browse.SolrBrowseCreateDAO.additionalIndex(SolrBrowseCreateDAO.java:215)
at com.atmire.dspace.discovery.AtmireSolrService.buildDocument(AtmireSolrService.java:662)
at com.atmire.dspace.discovery.AtmireSolrService.indexContent(AtmireSolrService.java:807)
at com.atmire.dspace.discovery.AtmireSolrService.updateIndex(AtmireSolrService.java:876)
at org.dspace.discovery.SolrServiceImpl.createIndex(SolrServiceImpl.java:370)
at org.dspace.discovery.IndexClient.main(IndexClient.java:117)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:226)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:78)
< / code > < / pre > < ul >
< li > I have to figure that one out… < / li >
< / ul >
< h2 id = "2018-03-21" > 2018-03-21< / h2 >
< ul >
< li > Looks like the indexing gets confused that there is still data in the < code > authority< / code > column< / li >
< li > Unfortunately this causes those items to simply not be indexed, which users noticed because item counts were cut in half and old items showed up in RSS!< / li >
< li > Since we've migrated the ORCID identifiers associated with the authority data to the < code > cg.creator.id< / code > field we can nullify the authorities remaining in the database:< / li >
< / ul >
< div class = "highlight" > < pre style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4" > < code class = "language-sql" data-lang = "sql" > dspace< span style = "color:#f92672" > =< / span > < span style = "color:#f92672" > #< / span > < span style = "color:#66d9ef" > UPDATE< / span > metadatavalue < span style = "color:#66d9ef" > SET< / span > authority< span style = "color:#f92672" > =< / span > < span style = "color:#66d9ef" > NULL< / span > < span style = "color:#66d9ef" > WHERE< / span > resource_type_id< span style = "color:#f92672" > =< / span > < span style = "color:#ae81ff" > 2< / span > < span style = "color:#66d9ef" > AND< / span > metadata_field_id< span style = "color:#f92672" > =< / span > < span style = "color:#ae81ff" > 3< / span > < span style = "color:#66d9ef" > AND< / span > authority < span style = "color:#66d9ef" > IS< / span > < span style = "color:#66d9ef" > NOT< / span > < span style = "color:#66d9ef" > NULL< / span > ;
< span style = "color:#66d9ef" > UPDATE< / span > < span style = "color:#ae81ff" > 195463< / span >
< / code > < / pre > < / div > < ul >
< li > After this the indexing works as usual and item counts and facets are back to normal< / li >
< li > Send Peter a list of all authors to correct:< / li >
< / ul >
< div class = "highlight" > < pre style = "color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4" > < code class = "language-sql" data-lang = "sql" > dspace< span style = "color:#f92672" > =< / span > < span style = "color:#f92672" > #< / span > < span style = "color:#960050;background-color:#1e0010" > \< / span > < span style = "color:#66d9ef" > copy< / span > (< span style = "color:#66d9ef" > select< / span > < span style = "color:#66d9ef" > distinct< / span > text_value, < span style = "color:#66d9ef" > count< / span > (< span style = "color:#f92672" > *< / span > ) < span style = "color:#66d9ef" > as< / span > < span style = "color:#66d9ef" > count< / span > < span style = "color:#66d9ef" > from< / span > metadatavalue < span style = "color:#66d9ef" > where< / span > metadata_field_id < span style = "color:#f92672" > =< / span > (< span style = "color:#66d9ef" > select< / span > metadata_field_id < span style = "color:#66d9ef" > from< / span > metadatafieldregistry < span style = "color:#66d9ef" > where< / span > element < span style = "color:#f92672" > =< / span > < span style = "color:#e6db74" > ' < / span > < span style = "color:#e6db74" > contributor< / span > < span style = "color:#e6db74" > ' < / span > < span style = "color:#66d9ef" > and< / span > qualifier < span style = "color:#f92672" > =< / span > < span style = "color:#e6db74" > ' < / span > < span style = "color:#e6db74" > author< / span > < span style = "color:#e6db74" > ' < / span > ) < span style = "color:#66d9ef" > AND< / span > resource_type_id < span style = "color:#f92672" > =< / span > < span style = "color:#ae81ff" > 2< / span > < span style = "color:#66d9ef" > group< / span > < span style = "color:#66d9ef" > by< / span > text_value < span style = "color:#66d9ef" > order< / span > < span style = "color:#66d9ef" > by< / span > < span style = "color:#66d9ef" > count< / span > < span style = "color:#66d9ef" > desc< / span > ) < span style = "color:#66d9ef" > to< / span > < span style = "color:#f92672" > /< / span > tmp< span style = "color:#f92672" > /< / span > authors.csv < span style = "color:#66d9ef" > with< / span > csv header;
< span style = "color:#66d9ef" > COPY< / span > < span style = "color:#ae81ff" > 56156< / span >
< / code > < / pre > < / div > < ul >
< li > Afterwards we'll want to do some batch tagging of ORCID identifiers to these names< / li >
< li > CGSpace crashed again this afternoon, I'm not sure of the cause but there are a lot of SQL errors in the DSpace log:< / li >
< / ul >
< pre > < code > 2018-03-21 15:11:08,166 ERROR org.dspace.storage.rdbms.DatabaseManager @ SQL QueryTable Error -
java.sql.SQLException: Connection has already been closed.
< / code > < / pre > < ul >
< li > I have no idea why so many connections were abandoned this afternoon:< / li >
< / ul >
< pre > < code > # grep 'Mar 21, 2018' /var/log/tomcat7/catalina.out | grep -c 'org.apache.tomcat.jdbc.pool.ConnectionPool abandon'
268
< / code > < / pre > < ul >
< li > DSpace Test crashed again due to Java heap space, this is from the DSpace log:< / li >
< / ul >
< pre > < code > 2018-03-21 15:18:48,149 ERROR org.dspace.app.xmlui.cocoon.DSpaceCocoonServletFilter @ Serious Error Occurred Processing Request!
org.springframework.web.util.NestedServletException: Handler processing failed; nested exception is java.lang.OutOfMemoryError: Java heap space
< / code > < / pre > < ul >
< li > And this is from the Tomcat Catalina log:< / li >
< / ul >
< pre > < code > Mar 21, 2018 11:20:00 AM org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor run
SEVERE: Unexpected death of background thread ContainerBackgroundProcessor[StandardEngine[Catalina]]
java.lang.OutOfMemoryError: Java heap space
< / code > < / pre > < ul >
< li > But there are tons of heap space errors on DSpace Test actually:< / li >
< / ul >
< pre > < code > # grep -c 'java.lang.OutOfMemoryError: Java heap space' /var/log/tomcat7/catalina.out
319
< / code > < / pre > < ul >
< li > I guess we need to give it more RAM because it now has CGSpace's large Solr core< / li >
< li > I will increase the memory from 3072m to 4096m< / li >
< li > Update < a href = "https://github.com/ilri/rmg-ansible-public" > Ansible playbooks< / a > to use < a href = "https://jdbc.postgresql.org/" > PostgreSQL JDBC driver< / a > 42.2.2< / li >
< li > Deploy the new JDBC driver on DSpace Test< / li >
< li > I'm also curious to see how long the < code > dspace index-discovery -b< / code > takes on DSpace Test where the DSpace installation directory is on one of Linode's new block storage volumes< / li >
< / ul >
< pre > < code > $ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 208m19.155s
user 8m39.138s
sys 2m45.135s
< / code > < / pre > < ul >
< li > So that's about three times as long as it took on CGSpace this morning< / li >
< li > I should also check the raw read speed with < code > hdparm -tT /dev/sdc< / code > < / li >
< li > Looking at Peter's author corrections there are some mistakes due to Windows 1252 encoding< / li >
< li > I need to find a way to filter these easily with OpenRefine< / li >
< li > For example, Peter has inadvertently introduced Unicode character 0xfffd into several fields< / li >
< li > I can search for Unicode values by their hex code in OpenRefine using the following GREL expression:< / li >
< / ul >
< pre > < code > isNotNull(value.match(/.*\ufffd.*/))
< / code > < / pre > < ul >
< li > I need to be able to add many common characters though so that it is useful to copy and paste into a new project to find issues< / li >
< / ul >
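< p > As a rough illustration of how U+FFFD can end up in a field, here is a minimal Python sketch. It assumes the text was saved as Windows-1252 and later read as UTF-8 with replacement of invalid bytes, which is only one possible cause; the sample name and the < code > has_replacement_char< / code > helper are hypothetical and not part of any existing script.< / p >

```python
# U+FFFD is the Unicode replacement character; decoders emit it for
# bytes they cannot interpret.
REPLACEMENT = "\ufffd"


def has_replacement_char(value: str) -> bool:
    # Rough Python counterpart of the GREL check for \ufffd shown above.
    return REPLACEMENT in value


# Hypothetical example: a name saved as Windows-1252, then read as UTF-8.
original = "Muñoz, J."
raw = original.encode("cp1252")  # ñ becomes the single byte 0xF1
misread = raw.decode("utf-8", errors="replace")  # 0xF1 is not valid UTF-8 here

print(repr(misread))
print(has_replacement_char(misread))
print(has_replacement_char(original))
```

< p > Once the character has been replaced the original byte is gone, so such values can only be flagged for manual correction, not repaired automatically.< / p >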
< h2 id = "2018-03-22" > 2018-03-22< / h2 >
< ul >
< li > Add ORCID identifier for Silvia Alonso< / li >
< li > Update my Mirage 2 setup notes for Ubuntu 18.04: < a href = "https://gist.github.com/alanorth/9bfd29feb7d2e836a9d417633319b3f5" > https://gist.github.com/alanorth/9bfd29feb7d2e836a9d417633319b3f5< / a > < / li >
< / ul >
< h2 id = "2018-03-24" > 2018-03-24< / h2 >
< ul >
< li > More work on the Ubuntu 18.04 readiness stuff for the < a href = "https://github.com/ilri/rmg-ansible-public" > Ansible playbooks< / a > < / li >
< li > The playbook now uses the system's Ruby and Node.js so I don't have to manually install RVM and NVM after< / li >
< / ul >
< h2 id = "2018-03-25" > 2018-03-25< / h2 >
< ul >
< li > Looking at Peter's author corrections and trying to work out a way to find errors in OpenRefine easily< / li >
< li > I can find all names that have acceptable characters using a GREL expression like:< / li >
< / ul >
< pre > < code > isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
< / code > < / pre > < ul >
< li > But it's probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):< / li >
< / ul >
< pre > < code > or(
isNotNull(value.match(/.*[(|)].*/)),
isNotNull(value.match(/.*\uFFFD.*/)),
isNotNull(value.match(/.*\u00A0.*/)),
isNotNull(value.match(/.*\u200A.*/))
)
< / code > < / pre > < ul >
< li > And here's one combined GREL expression to check for items marked as to delete or check so I can flag them and export them to a separate CSV (though perhaps it's time to add delete support to my < code > fix-metadata-values.py< / code > script):< / li >
< / ul >
< pre > < code > or(
isNotNull(value.match(/.*delete.*/i)),
isNotNull(value.match(/.*remove.*/i)),
isNotNull(value.match(/.*check.*/i))
)
< / code > < / pre > < ul >
< li >
< p > So I guess the routine in OpenRefine is:< / p >
< ul >
< li > Transform: trim leading/trailing whitespace< / li >
< li > Transform: collapse consecutive whitespace< / li >
< li > Custom text facet for items to delete/check< / li >
< li > Custom text facet for illegal characters< / li >
< / ul >
< / li >
< li >
< p > Test the corrections and deletions locally, then run them on CGSpace:< / p >
< / li >
< / ul >
< pre > < code > $ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
< / code > < / pre > < ul >
< li > Afterwards I started a full Discovery reindexing on both CGSpace and DSpace Test< / li >
< li > CGSpace took 76m28.292s< / li >
< li > DSpace Test took 194m56.048s< / li >
< / ul >
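< p > The OpenRefine routine from earlier in this entry (trim, collapse, then the two custom facets) could also be approximated in Python before handing the CSVs to the fix and delete scripts. This is only a sketch under my own assumptions: the < code > clean< / code > and < code > classify< / code > helpers and the sample values are hypothetical, and the character list is just the one from the GREL expressions above.< / p >

```python
import re

# Characters the notes call out as invalid in author names: parentheses,
# pipe, U+FFFD, no-break space (U+00A0) and hair space (U+200A).
ILLEGAL = re.compile("[(|)\ufffd\u00a0\u200a]")
# Markers used in the corrections to flag rows for deletion or review.
FLAGGED = re.compile(r"delete|remove|check", re.IGNORECASE)


def clean(value: str) -> str:
    # Collapse runs of ASCII whitespace and trim the ends. re.ASCII keeps
    # \s from matching U+00A0 and U+200A, so those remain visible to ILLEGAL.
    return re.sub(r"\s+", " ", value, flags=re.ASCII).strip(" ")


def classify(value: str) -> str:
    # Mirror the two custom text facets: delete/check first, then characters.
    value = clean(value)
    if FLAGGED.search(value):
        return "delete/check"
    if ILLEGAL.search(value):
        return "illegal characters"
    return "ok"


for name in ["  Orth,   Alan ", "Smith, J. (delete)", "Doe,\u00a0Jane", "Lee | Kim"]:
    print(repr(clean(name)), "->", classify(name))
```

< p > Like the GREL version, the delete/check test is a plain substring match, so a legitimate name containing "check" would be flagged too and needs a manual look.< / p >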
< h2 id = "2018-03-26" > 2018-03-26< / h2 >
< ul >
< li > Atmire got back to me about the < a href = "https://tracker.atmire.com/tickets-cgiar-ilri/view-ticket?id=589" > Listings and Reports issue< / a > and said it's caused by items that have missing < code > dc.identifier.citation< / code > fields< / li >
< li > They will send a fix< / li >
< / ul >
< h2 id = "2018-03-27" > 2018-03-27< / h2 >
< ul >
< li > Atmire got back with an updated quote about the DSpace 5.8 compatibility so I've forwarded it to Peter< / li >
< / ul >
< h2 id = "2018-03-28" > 2018-03-28< / h2 >
< ul >
< li > DSpace Test crashed due to heap space so I've increased it from 4096m to 5120m< / li >
< li > The error in Tomcat's < code > catalina.out< / code > was:< / li >
< / ul >
< pre > < code > Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
< / code > < / pre > < ul >
< li > Add ISI Journal (cg.isijournal) as an option in Atmire's Listing and Reports layout (< a href = "https://github.com/ilri/DSpace/pull/370" > #370< / a > ) for Abenet< / li >
< li > I noticed a few hundred CRPs using the old capitalized formatting so I corrected them:< / li >
< / ul >
< pre > < code > $ ./fix-metadata-values.py -i /tmp/Correct-21-CRPs-2018-03-16.csv -f cg.contributor.crp -t correct -m 230 -db cgspace -u cgspace -p 'fuuu'
Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
Fixed 100 occurences of: ROOTS, TUBERS AND BANANAS
Fixed 31 occurences of: HUMIDTROPICS
Fixed 21 occurences of: MAIZE
Fixed 11 occurences of: POLICIES, INSTITUTIONS, AND MARKETS
Fixed 28 occurences of: GRAIN LEGUMES
Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
Fixed 5 occurences of: GENEBANKS
< / code > < / pre > < ul >
< li > That's weird because we just updated them last week… < / li >
< li > Create a pull request to enable searching by ORCID identifier (< code > cg.creator.id< / code > ) in Discovery and Listings and Reports (< a href = "https://github.com/ilri/DSpace/pull/371" > #371< / a > )< / li >
< li > I will test it on DSpace Test first!< / li >
< li > Fix one missing XMLUI string for “Access Status” (cg.identifier.status)< / li >
< li > Run all system updates on DSpace Test and reboot the machine< / li >
< / ul >
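< p > To put a number on the "few hundred" CRP corrections above, the output of < code > fix-metadata-values.py< / code > can be summed with a short Python sketch. The < code > output< / code > string is copied from the log in this entry (including the script's own spelling of "occurences"); the variable names are hypothetical.< / p >

```python
import re

# Output copied verbatim from the fix-metadata-values.py run above.
output = """\
Fixed 29 occurences of: CLIMATE CHANGE, AGRICULTURE AND FOOD SECURITY
Fixed 7 occurences of: WATER, LAND AND ECOSYSTEMS
Fixed 19 occurences of: AGRICULTURE FOR NUTRITION AND HEALTH
Fixed 100 occurences of: ROOTS, TUBERS AND BANANAS
Fixed 31 occurences of: HUMIDTROPICS
Fixed 21 occurences of: MAIZE
Fixed 11 occurences of: POLICIES, INSTITUTIONS, AND MARKETS
Fixed 28 occurences of: GRAIN LEGUMES
Fixed 3 occurences of: FORESTS, TREES AND AGROFORESTRY
Fixed 5 occurences of: GENEBANKS
"""

# One match per line: the count, then the CRP name as written in the log.
pattern = re.compile(r"^Fixed (\d+) occurences of: (.+)$", re.MULTILINE)
counts = {name: int(number) for number, name in pattern.findall(output)}

total = sum(counts.values())
print(f"{len(counts)} CRPs, {total} values fixed")
```

< p > That works out to 254 corrected values across the ten CRPs listed.< / p >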
< / article >
< / div > <!-- /.blog-main -->
< / div > <!-- /.row -->
< / div > <!-- /.container -->
< footer class = "blog-footer" >
< p dir = "auto" >
Blog template created by < a href = "https://twitter.com/mdo" > @mdo< / a > , ported to Hugo by < a href = 'https://twitter.com/mralanorth' > @mralanorth< / a > .
< / p >
< / footer >
< / body >
< / html >