mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-11-30 10:28:18 +01:00
250 lines
10 KiB
Markdown
250 lines
10 KiB
Markdown
+++
|
|
date = "2016-05-01T23:06:00+03:00"
|
|
author = "Alan Orth"
|
|
title = "May, 2016"
|
|
tags = ["Notes"]
|
|
|
|
+++
|
|
## 2016-05-01
|
|
|
|
- Since yesterday there have been 10,000 REST errors and the site has been unstable again
|
|
- I have blocked access to the API now
|
|
- There are 3,000 IPs accessing the REST API in a 24-hour period!
|
|
|
|
```
|
|
# awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
|
|
3168
|
|
```
|
|
|
|
<!--more-->
|
|
|
|
- The two most often requesters are in Ethiopia and Colombia: 213.55.99.121 and 181.118.144.29
|
|
- 100% of the requests coming from Ethiopia are like this and result in an HTTP 500:
|
|
|
|
```
|
|
GET /rest/handle/10568/NaN?expand=parentCommunityList,metadata HTTP/1.1
|
|
```
|
|
|
|
- For now I'll block just the Ethiopian IP
|
|
- The owner of that application has said that the `NaN` (not a number) is an error in his code and he'll fix it
|
|
|
|
## 2016-05-03
|
|
|
|
- Update nginx to 1.10.x branch on CGSpace
|
|
- Fix a reference to `dc.type.output` in Discovery that I had missed when we migrated to `dc.type` last month ([#223](https://github.com/ilri/DSpace/pull/223))
|
|
|
|
![Item type in Discovery results](/cgspace-notes/2016/05/discovery-types.png)
|
|
|
|
## 2016-05-06
|
|
|
|
- DSpace Test is down, `catalina.out` has lots of messages about heap space from some time yesterday (!)
|
|
- It looks like Sisay was doing some batch imports
|
|
- Hmm, also disk space is full
|
|
- I decided to blow away the solr indexes, since they are 50GB and we don't really need all the Atmire stuff there right now
|
|
- I will re-generate the Discovery indexes after re-deploying
|
|
- Testing `renew-letsencrypt.sh` script for nginx
|
|
|
|
```
|
|
#!/usr/bin/env bash
|
|
|
|
readonly SERVICE_BIN=/usr/sbin/service
|
|
readonly LETSENCRYPT_BIN=/opt/letsencrypt/letsencrypt-auto
|
|
|
|
# stop nginx so LE can listen on port 443
|
|
$SERVICE_BIN nginx stop
|
|
|
|
$LETSENCRYPT_BIN renew -nvv --standalone --standalone-supported-challenges tls-sni-01 > /var/log/letsencrypt/renew.log 2>&1
|
|
|
|
LE_RESULT=$?
|
|
|
|
$SERVICE_BIN nginx start
|
|
|
|
if [[ "$LE_RESULT" != 0 ]]; then
|
|
echo 'Automated renewal failed:'
|
|
|
|
cat /var/log/letsencrypt/renew.log
|
|
|
|
exit 1
|
|
fi
|
|
```
|
|
|
|
- Seems to work well
|
|
|
|
## 2016-05-10
|
|
|
|
- Start looking at more metadata migrations
|
|
- There are lots of fields in `dcterms` namespace that look interesting, like:
|
|
- dcterms.type
|
|
- dcterms.spatial
|
|
- Not sure what `dcterms` is...
|
|
- Looks like these were [added in DSpace 4](https://wiki.duraspace.org/display/DSDOC5x/Metadata+and+Bitstream+Format+Registries#MetadataandBitstreamFormatRegistries-DublinCoreTermsRegistry(DCTERMS)) to allow for future work to make DSpace more flexible
|
|
- CGSpace's `dc` registry has 96 items, and the default DSpace one has 73.
|
|
|
|
## 2016-05-11
|
|
|
|
- Identify and propose the next phase of CGSpace fields to migrate:
|
|
- dc.title.jtitle → cg.title.journal
|
|
- dc.identifier.status → cg.identifier.status
|
|
- dc.river.basin → cg.river.basin
|
|
- dc.Species → cg.species
|
|
- dc.targetaudience → cg.targetaudience
|
|
- dc.fulltextstatus → cg.fulltextstatus
|
|
- dc.editon → cg.edition
|
|
- dc.isijournal → cg.isijournal
|
|
|
|
- Start a test rebase of the `5_x-prod` branch on top of the `dspace-5.5` tag
|
|
- There were a handful of conflicts that I didn't understand
|
|
- After completing the rebase I tried to build with the module versions Atmire had indicated as being 5.5 ready but I got this error:
|
|
|
|
```
|
|
[ERROR] Failed to execute goal on project additions: Could not resolve dependencies for project org.dspace.modules:additions:jar:5.5: Could not find artifact com.atmire:atmire-metadata-quality-api:jar:5.5-2.10.1-0 in sonatype-releases (https://oss.sonatype.org/content/repositories/releases/) -> [Help 1]
|
|
```
|
|
|
|
- I've sent them a question about it
|
|
- A user mentioned having problems with uploading a 33 MB PDF
|
|
- I told her I would increase the limit temporarily tomorrow morning
|
|
- Turns out she was able to decrease the size of the PDF so we didn't have to do anything
|
|
|
|
## 2016-05-12
|
|
|
|
- Looks like the issue that Abenet was having a few days ago with "Connection Reset" in Firefox might be due to a Firefox 46 issue: https://bugzilla.mozilla.org/show_bug.cgi?id=1268775
|
|
- I finally found a copy of the latest CG Core metadata guidelines and it looks like we can add a few more fields to our next migration:
|
|
- dc.rplace.region → cg.coverage.region
|
|
- dc.cplace.country → cg.coverage.country
|
|
- Questions for CG people:
|
|
- Our `dc.place` and `dc.srplace.subregion` could both map to `cg.coverage.admin-unit`?
|
|
- Should we use `dc.contributor.crp` or `cg.contributor.crp` for the CRP (ours is `dc.crsubject.crpsubject`)?
|
|
- Our `dc.contributor.affiliation` and `dc.contributor.corporate` could both map to `dc.contributor` and possibly `dc.contributor.center` depending on if it's a CG center or not
|
|
- `dc.title.jtitle` could either map to `dc.publisher` or `dc.source` depending on how you read things
|
|
- Found ~200 messed up CIAT values in `dc.publisher`:
|
|
|
|
```
|
|
# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=39 and text_value similar to "% %";
|
|
```
|
|
|
|
## 2016-05-13
|
|
|
|
- More theorizing about CGcore
|
|
- Add two new fields:
|
|
- dc.srplace.subregion → cg.coverage.admin-unit
|
|
- dc.place → cg.place
|
|
- `dc.place` is our own field, so it's easy to move
|
|
- I've removed `dc.title.jtitle` from the list for now because there's no use moving it out of DC until we know where it will go (see discussion yesterday)
|
|
|
|
## 2016-05-18
|
|
|
|
- Work on 707 CCAFS records
|
|
- They have thumbnails on Flickr and elsewhere
|
|
- In OpenRefine I created a new `filename` column based on the `thumbnail` column with the following GREL:
|
|
|
|
```
|
|
if(cells['thumbnails'].value.contains('hqdefault'), cells['thumbnails'].value.split('/')[-2] + '.jpg', cells['thumbnails'].value.split('/')[-1])
|
|
```
|
|
|
|
- Because ~400 records had the same filename on Flickr (hqdefault.jpg) but different UUIDs in the URL
|
|
- So for the `hqdefault.jpg` ones I just take the UUID (-2) and use it as the filename
|
|
- Before importing with SAFBuilder I tested adding "__bundle:THUMBNAIL" to the `filename` column and it works fine
|
|
|
|
## 2016-05-19
|
|
|
|
- More quality control on `filename` field of CCAFS records to make processing in shell and SAFBuilder more reliable:
|
|
|
|
```
|
|
value.replace('_','').replace('-','')
|
|
```
|
|
|
|
- We need to hold off on moving `dc.Species` to `cg.species` because it is only used for plants, and might be better to move it to something like `cg.species.plant`
|
|
- And `dc.identifier.fund` is MOSTLY used for CPWF project identifier but has some other sponsorship things
|
|
- We should move PN*, SG*, CBA, IA, and PHASE* values to `cg.identifier.cpwfproject`
|
|
- The rest, like BMGF and USAID etc, might have to go to either `dc.description.sponsorship` or `cg.identifier.fund` (not sure yet)
|
|
- There are also some mistakes in CPWF's things, like "PN 47"
|
|
- This ought to catch all the CPWF values (there don't appear to be and SG* values):
|
|
|
|
```
|
|
# select text_value from metadatavalue where resource_type_id=2 and metadata_field_id=75 and (text_value like 'PN%' or text_value like 'PHASE%' or text_value = 'CBA' or text_value = 'IA');
|
|
```
|
|
|
|
## 2016-05-20
|
|
|
|
- More work on CCAFS Video and Images records
|
|
- For SAFBuilder we need to modify filename column to have the thumbnail bundle:
|
|
|
|
```
|
|
value + "__bundle:THUMBNAIL"
|
|
```
|
|
|
|
- Also, I fixed some weird characters using OpenRefine's transform with the following GREL:
|
|
|
|
```
|
|
value.replace(/\u0081/,'')
|
|
```
|
|
|
|
- Write shell script to resize thumbnails with height larger than 400: https://gist.github.com/alanorth/131401dcd39d00e0ce12e1be3ed13256
|
|
- Upload 707 CCAFS records to DSpace Test
|
|
- A few miscellaneous fixes for XMLUI display niggles (spaces in item lists and link target `_black`): [#224](https://github.com/ilri/DSpace/pull/224)
|
|
- Work on configuration changes for Phase 2 metadata migrations
|
|
|
|
## 2016-05-23
|
|
|
|
- Try to import the CCAFS Images and Videos to CGSpace but had some issues with LibreOffice and OpenRefine
|
|
- LibreOffice excludes empty cells when it exports and all the fields shift over to the left and cause URLs to go to Subjects, etc.
|
|
- Google Docs does this better, but somehow reorders the rows and when I paste the thumbnail/filename row in they don't match!
|
|
- I will have to try later
|
|
|
|
## 2016-05-30
|
|
|
|
- Export CCAFS video and image records from DSpace Test using the migrate option (`-m`):
|
|
|
|
```
|
|
$ mkdir ~/ccafs-images
|
|
$ /home/dspacetest.cgiar.org/bin/dspace export -t COLLECTION -i 10568/79355 -d ~/ccafs-images -n 0 -m
|
|
```
|
|
|
|
- And then import to CGSpace:
|
|
|
|
```
|
|
$ JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace import --add --eperson=aorth@mjanja.ch --collection=10568/70974 --source /tmp/ccafs-images --mapfile=/tmp/ccafs-images-may30.map &> /tmp/ccafs-images-may30.log
|
|
```
|
|
|
|
- But now we have double authors for "CGIAR Research Program on Climate Change, Agriculture and Food Security" in the authority
|
|
- I'm trying to do a Discovery index before messing with the authority index
|
|
- Looks like we are missing the `index-authority` cron job, so who knows what's up with our authority index
|
|
- Run system updates on DSpace Test, re-deploy code, and reboot the server
|
|
- Clean up and import ~200 CTA records to CGSpace via CSV like:
|
|
|
|
```
|
|
$ export JAVA_OPTS="-Xmx512m -Dfile.encoding=UTF-8"
|
|
$ /home/cgspace.cgiar.org/bin/dspace metadata-import -e aorth@mjanja.ch -f ~/CTA-May30/CTA-42229.csv &> ~/CTA-May30/CTA-42229.log
|
|
```
|
|
|
|
- Discovery indexing took a few hours for some reason, and after that I started the `index-authority` script
|
|
|
|
```
|
|
$ JAVA_OPTS="-Xmx1024m -Dfile.encoding=UTF-8" /home/cgspace.cgiar.org/bin/dspace index-authority
|
|
```
|
|
|
|
## 2016-05-31
|
|
|
|
- The `index-authority` script ran over night and was finished in the morning
|
|
- Hopefully this was because we haven't been running it regularly and it will speed up next time
|
|
- I am running it again with a timer to see:
|
|
|
|
```
|
|
$ time /home/cgspace.cgiar.org/bin/dspace index-authority
|
|
Retrieving all data
|
|
Initialize org.dspace.authority.indexer.DSpaceAuthorityIndexer
|
|
Cleaning the old index
|
|
Writing new data
|
|
All done !
|
|
|
|
real 37m26.538s
|
|
user 2m24.627s
|
|
sys 0m20.540s
|
|
```
|
|
|
|
- Update `tomcat7` crontab on CGSpace and DSpace Test to have the `index-authority` script that we were missing
|
|
- Add new ILRI subject and CCAFS project tags to `input-forms.xml` ([#226](https://github.com/ilri/DSpace/pull/226), [#225](https://github.com/ilri/DSpace/pull/225))
|
|
- Manually mapped the authors of a few old CCAFS records to the new CCAFS authority UUID and re-indexed authority indexes to see if it helps correct those items.
|
|
- Re-sync DSpace Test data with CGSpace
|
|
- Clean up and import ~65 more CTA items into CGSpace
|