May, 2016
2016-05-01
- Since yesterday there have been 10,000 REST errors and the site has been unstable again
- I have blocked access to the API now
There are 3,000 IPs accessing the REST API in a 24-hour period!
# awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l 3168
- The two most often requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29
100% of the requests coming from Ethiopia are like this and result in an HTTP 500:
GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
For now I’ll block just the Ethiopian IP
The owner of that application has said that the
NaN
(not a number) is an error in his code and he’ll fix it
2016-05-03
- Update nginx to 1.10.x branch on CGSpace
- Fix a reference to
dc.type.output
in Discovery that I had missed when we migrated todc.type
last month (#223)
2016-05-06
- DSpace Test is down,
catalina.out
has lots of messages about heap space from some time yesterday (!) - It looks like Sisay was doing some batch imports
- Hmm, also disk space is full
- I decided to blow away the solr indexes, since they are 50GB and we don’t really need all the Atmire stuff there right now
- I will re-generate the Discovery indexes after re-deploying
Testing
renew-letsencrypt.sh
script for nginx#!/usr/bin/env bash readonly SERVICE_BIN=/usr/sbin/service readonly LETSENCRYPT_BIN=/opt/letsencrypt/letsencrypt-auto # stop nginx so LE can listen on port 443 $SERVICE_BIN nginx stop $LETSENCRYPT_BIN renew -nvv --standalone --standalone-supported-challenges tls-sni-01 > /var/log/letsencrypt/renew.log 2>&1 LE_RESULT=$? $SERVICE_BIN nginx start if [[ "$LE_RESULT" != 0 ]]; then echo 'Automated renewal failed:' cat /var/log/letsencrypt/renew.log exit 1 fi
Seems to work well
2016-05-10
- Start looking at more metadata migrations
- There are lots of fields in
dcterms
namespace that look interesting, like:- dcterms.type
- dcterms.spatial
- Not sure what
dcterms
is… - Looks like these were added in DSpace 4 to allow for future work to make DSpace more flexible
- CGSpace’s
dc
registry has 96 items, and the default DSpace one has 73.
2016-05-11
Identify and propose the next phase of CGSpace fields to migrate:
- dc.title.jtitle → cg.title.journal
- dc.identifier.status → cg.identifier.status
- dc.river.basin → cg.river.basin
- dc.Species → cg.species
- dc.targetaudience → cg.targetaudience
- dc.fulltextstatus → cg.fulltextstatus
- dc.editon → cg.edition
- dc.isijournal → cg.isijournal
Start a test rebase of the
5_x-prod
branch on top of thedspace-5.5
tagThere were a handful of conflicts that I didn’t understand
After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:
[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -> [Help 1]
I’ve sent them a question about it
A user mentioned having problems with uploading a 33 MB PDF
I told her I would increase the limit temporarily tomorrow morning
Turns out she was able to decrease the size of the PDF so we didn’t have to do anything
2016-05-12
- Looks like the issue that Abenet was having a few days ago with “Connection Reset” in Firefox might be due to a Firefox 46 issue: https://bugzilla.mozilla.org/show_bug.cgi?id=1268775
- I finally found a copy of the latest CG Core metadata guidelines and it looks like we can add a few more fields to our next migration:
- dc.rplace.region → cg.coverage.region
- dc.cplace.country → cg.coverage.country
- Questions for CG people:
- Our
dc.place
anddc.srplace.subregion
could both map tocg.coverage.admin-unit
? - Should we use
dc.contributor.crp
orcg.contributor.crp
for the CRP (ours isdc.crsubject.crpsubject
)? - Our
dc.contributor.affiliation
anddc.contributor.corporate
could both map todc.contributor
and possiblydc.contributor.center
depending on if it’s a CG center or not dc.title.jtitle
could either map todc.publisher
ordc.source
depending on how you read things
- Our
Found ~200 messed up CIAT values in
dc.publisher
:# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "% %";
2016-05-13
- More theorizing about CGcore
- Add two new fields:
- dc.srplace.subregion → cg.coverage.admin-unit
- dc.place → cg.place
dc.place
is our own field, so it’s easy to move- I’ve removed
dc.title.jtitle
from the list for now because there’s no use moving it out of DC until we know where it will go (see discussion yesterday)
2016-05-18
- Work on 707 CCAFS records
- They have thumbnails on Flickr and elsewhere
In OpenRefine I created a new
filename
column based on thethumbnail
column with the following GREL:if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL
So for the
hqdefault.jpg
ones I just take the UUID (-2) and use it as the filenameBefore importing with SAFBuilder I tested adding “__bundle:THUMBNAIL” to the
filename
column and it works fine
2016-05-19
More quality control on
filename
field of CCAFS records to make processing in shell and SAFBuilder more reliable:value.replace('_','').replace('-','')
We need to hold off on moving
dc.Species
tocg.species
because it is only used for plants, and might be better to move it to something likecg.species.plant
And
dc.identifier.fund
is MOSTLY used for CPWF project identifier but has some other sponsorship things- We should move PN, SG, CBA, IA, and PHASE* values to
cg.identifier.cpwfproject
- The rest, like BMGF and USAID etc, might have to go to either
dc.description.sponsorship
orcg.identifier.fund
(not sure yet) - There are also some mistakes in CPWF’s things, like “PN 47”
This ought to catch all the CPWF values (there don’t appear to be and SG* values):
# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
- We should move PN, SG, CBA, IA, and PHASE* values to
2016-05-20
- More work on CCAFS Video and Images records
For SAFBuilder we need to modify filename column to have the thumbnail bundle:
value + "__bundle:THUMBNAIL"
Also, I fixed some weird characters using OpenRefine’s transform with the following GREL:
value.replace(/\u0081/,'')
Write shell script to resize thumbnails with height larger than 400: https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256
Upload 707 CCAFS records to DSpace Test
A few miscellaneous fixes for XMLUI display niggles (spaces in item lists and link target
_black
): #224Work on configuration changes for Phase 2 metadata migrations
2016-05-23
- Try to import the CCAFS Images and Videos to CGSpace but had some issues with LibreOffice and OpenRefine
- LibreOffice excludes empty cells when it exports and all the fields shift over to the left and cause URLs to go to Subjects, etc.
- Google Docs does this better, but somehow reorders the rows and when I paste the thumbnail/filename row in they don’t match!
- I will have to try later
2016-05-30
Export CCAFS video and image records from DSpace Test using the migrate option (
-m
):$ mkdir ~/ccafs-images $ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m
And then import to CGSpace:
$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
But now we have double authors for “CGIAR Research Program on Climate Change, Agriculture and Food Security” in the authority
I’m trying to do a Discovery index before messing with the authority index
Looks like we are missing the
index-authority
cron job, so who knows what’s up with our authority indexRun system updates on DSpace Test, re-deploy code, and reboot the server
Clean up and import ~200 CTA records to CGSpace via CSV like:
$ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" $ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &> ~/CTA-May30/CTA-42229.log
Discovery indexing took a few hours for some reason, and after that I started the
index-authority
script$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority
2016-05-31
- The
index-authority
script ran over night and was finished in the morning - Hopefully this was because we haven’t been running it regularly and it will speed up next time
I am running it again with a timer to see:
$ time /home/cgspace.cgiar.org/bin/dspace index-authority Retrieving all data Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer Cleaning the old index Writing new data All done ! real 37m26.538s user 2m24.627s sys 0m20.540s
Update
tomcat7
crontab on CGSpace and DSpace Test to have theindex-authority
script that we were missingAdd new ILRI subject and CCAFS project tags to
input-forms.xml
(#226, #225)Manually mapped the authors of a few old CCAFS records to the new CCAFS authority UUID and re-indexed authority indexes to see if it helps correct those items.
Re-sync DSpace Test data with CGSpace
Clean up and import ~65 more CTA items into CGSpace