cgspace-notes/content/post/2016-05.md

10 KiB

+++ date = "2016-05-01T23:06:00+03:00" author = "Alan Orth" title = "May, 2016" tags = ["notes"] image = "../images/bg.jpg"

+++

2016-05-01

  • Since yesterday there have been 10,000 REST errors and the site has been unstable again
  • I have blocked access to the API now
  • There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk '{print $1}' /var/log/nginx/rest.log  | uniq | wc -l
3168
  • The two most often requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29
  • 100% of the requests coming from Ethiopia are like this and result in an HTTP 500:
GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
  • For now I'll block just the Ethiopian IP
  • The owner of that application has said that the NaN (not a number) is an error in his code and he'll fix it

2016-05-03

  • Update nginx to 1.10.x branch on CGSpace
  • Fix a reference to dc.type.output in Discovery that I had missed when we migrated to dc.type last month (#223)

Item type in Discovery results

2016-05-06

  • DSpace Test is down, catalina.out has lots of messages about heap space from some time yesterday (!)
  • It looks like Sisay was doing some batch imports
  • Hmm, also disk space is full
  • I decided to blow away the solr indexes, since they are 50GB and we don't really need all the Atmire stuff there right now
  • I will re-generate the Discovery indexes after re-deploying
  • Testing renew-letsencrypt.sh script for nginx
#!/usr/bin/env bash

readonly SERVICE_BIN=/usr/sbin/service
readonly LETSENCRYPT_BIN=/opt/letsencrypt/letsencrypt-auto

# stop nginx so LE can listen on port 443
$SERVICE_BIN nginx stop

$LETSENCRYPT_BIN renew -nvv --standalone --standalone-supported-challenges tls-sni-01 > /var/log/letsencrypt/renew.log 2>&1

LE_RESULT=$?

$SERVICE_BIN nginx start

if [[ "$LE_RESULT" != 0 ]]; then
    echo 'Automated renewal failed:'

    cat /var/log/letsencrypt/renew.log

    exit 1
fi
  • Seems to work well

2016-05-10

  • Start looking at more metadata migrations
  • There are lots of fields in dcterms namespace that look interesting, like:
    • dcterms.type
    • dcterms.spatial
  • Not sure what dcterms is...
  • Looks like these were added in DSpace 4 to allow for future work to make DSpace more flexible
  • CGSpace's dc registry has 96 items, and the default DSpace one has 73.

2016-05-11

  • Identify and propose the next phase of CGSpace fields to migrate:

    • dc.title.jtitle → cg.title.journal
    • dc.identifier.status → cg.identifier.status
    • dc.river.basin → cg.river.basin
    • dc.Species → cg.species
    • dc.targetaudience → cg.targetaudience
    • dc.fulltextstatus → cg.fulltextstatus
    • dc.editon → cg.edition
    • dc.isijournal → cg.isijournal
  • Start a test rebase of the 5_x-prod branch on top of the dspace-5.5 tag

  • There were a handful of conflicts that I didn't understand

  • After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:

[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -> [Help 1]
  • I've sent them a question about it
  • A user mentioned having problems with uploading a 33 MB PDF
  • I told her I would increase the limit temporarily tomorrow morning
  • Turns out she was able to decrease the size of the PDF so we didn't have to do anything

2016-05-12

  • Looks like the issue that Abenet was having a few days ago with "Connection Reset" in Firefox might be due to a Firefox 46 issue: https://bugzilla.mozilla.org/show_bug.cgi?id=1268775
  • I finally found a copy of the latest CG Core metadata guidelines and it looks like we can add a few more fields to our next migration:
    • dc.rplace.region → cg.coverage.region
    • dc.cplace.country → cg.coverage.country
  • Questions for CG people:
    • Our dc.place and dc.srplace.subregion could both map to cg.coverage.admin-unit?
    • Should we use dc.contributor.crp or cg.contributor.crp for the CRP (ours is dc.crsubject.crpsubject)?
    • Our dc.contributor.affiliation and dc.contributor.corporate could both map to dc.contributor and possibly dc.contributor.center depending on if it's a CG center or not
    • dc.title.jtitle could either map to dc.publisher or dc.source depending on how you read things
  • Found ~200 messed up CIAT values in dc.publisher:
# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "%  %";

2016-05-13

  • More theorizing about CGcore
  • Add two new fields:
    • dc.srplace.subregion → cg.coverage.admin-unit
    • dc.place → cg.place
  • dc.place is our own field, so it's easy to move
  • I've removed dc.title.jtitle from the list for now because there's no use moving it out of DC until we know where it will go (see discussion yesterday)

2016-05-18

  • Work on 707 CCAFS records
  • They have thumbnails on Flickr and elsewhere
  • In OpenRefine I created a new filename column based on the thumbnail column with the following GREL:
if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
  • Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL
  • So for the hqdefault.jpg ones I just take the UUID (-2) and use it as the filename
  • Before importing with SAFBuilder I tested adding "__bundle:THUMBNAIL" to the filename column and it works fine

2016-05-19

  • More quality control on filename field of CCAFS records to make processing in shell and SAFBuilder more reliable:
value.replace('_','').replace('-','')
  • We need to hold off on moving dc.Species to cg.species because it is only used for plants, and might be better to move it to something like cg.species.plant
  • And dc.identifier.fund is MOSTLY used for CPWF project identifier but has some other sponsorship things
    • We should move PN*, SG*, CBA, IA, and PHASE* values to cg.identifier.cpwfproject
    • The rest, like BMGF and USAID etc, might have to go to either dc.description.sponsorship or cg.identifier.fund (not sure yet)
    • There are also some mistakes in CPWF's things, like "PN 47"
    • This ought to catch all the CPWF values (there don't appear to be and SG* values):
# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');

2016-05-20

  • More work on CCAFS Video and Images records
  • For SAFBuilder we need to modify filename column to have the thumbnail bundle:
value + "__bundle:THUMBNAIL"
  • Also, I fixed some weird characters using OpenRefine's transform with the following GREL:
value.replace(/\u0081/,'')

2016-05-23

  • Try to import the CCAFS Images and Videos to CGSpace but had some issues with LibreOffice and OpenRefine
  • LibreOffice excludes empty cells when it exports and all the fields shift over to the left and cause URLs to go to Subjects, etc.
  • Google Docs does this better, but somehow reorders the rows and when I paste the thumbnail/filename row in they don't match!
  • I will have to try later

2016-05-30

  • Export CCAFS video and image records from DSpace Test using the migrate option (-m):
$ mkdir ~/ccafs-images
$ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m
  • And then import to CGSpace:
$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
  • But now we have double authors for "CGIAR Research Program on Climate Change, Agriculture and Food Security" in the authority
  • I'm trying to do a Discovery index before messing with the authority index
  • Looks like we are missing the index-authority cron job, so who knows what's up with our authority index
  • Run system updates on DSpace Test, re-deploy code, and reboot the server
  • Clean up and import ~200 CTA records to CGSpace via CSV like:
$ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
$ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &> ~/CTA-May30/CTA-42229.log
  • Discovery indexing took a few hours for some reason, and after that I started the index-authority script
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority

2016-05-31

  • The index-authority script ran over night and was finished in the morning
  • Hopefully this was because we haven't been running it regularly and it will speed up next time
  • I am running it again with a timer to see:
$ time /home/cgspace.cgiar.org/bin/dspace index-authority
Retrieving all data
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
Cleaning the old index
Writing new data
All done !

real    37m26.538s
user    2m24.627s
sys     0m20.540s
  • Update tomcat7 crontab on CGSpace and DSpace Test to have the index-authority script that we were missing
  • Add new ILRI subject and CCAFS project tags to input-forms.xml (#226, #225)
  • Manually mapped the authors of a few old CCAFS records to the new CCAFS authority UUID and re-indexed authority indexes to see if it helps correct those items.
  • Re-sync DSpace Test data with CGSpace
  • Clean up and import ~65 more CTA items into CGSpace