mirror of
https://github.com/alanorth/cgspace-notes.git
synced 2024-12-18 11:12:18 +01:00
16 KiB
16 KiB
+++ date = "2016-10-03T15:53:00+03:00" author = "Alan Orth" title = "October, 2016" tags = ["Notes"]
+++
2016-10-03
- Testing adding ORCIDs to a CSV file for a single item to see if the author orders get messed up
- Need to test the following scenarios to see how author order is affected:
- ORCIDs only
- ORCIDs plus normal authors
- I exported a random item's metadata as CSV, deleted all columns except id and collection, and made a new coloum called
ORCID:dc.contributor.author
with the following random ORCIDs from the ORCID registry:
0000-0002-6115-0956||0000-0002-3812-8793||0000-0001-7462-405X
- Hmm, with the
dc.contributor.author
column removed, DSpace doesn't detect any changes - With a blank
dc.contributor.author
column, DSpace wants to remove all non-ORCID authors and add the new ORCID authors - I added the disclaimer text to the About page, then added a footer link to the disclaimer's ID, but there is a Bootstrap issue that causes the page content to disappear when using in-page anchors: https://github.com/twbs/bootstrap/issues/1768
- Looks like we'll just have to add the text to the About page (without a link) or add a separate page
2016-10-04
- Start testing cleanups of authors that Peter sent last week
- Out of 40,000+ rows, Peter had indicated corrections for ~3,200 of them—too many to look through carefully, so I did some basic quality checking:
- Trim leading/trailing whitespace
- Find invalid characters
- Cluster values to merge obvious authors
- That left us with 3,180 valid corrections and 3 deletions:
$ ./fix-metadata-values.py -i authors-fix-3180.csv -f dc.contributor.author -t correct -m 3 -d dspacetest -u dspacetest -p fuuu
$ ./delete-metadata-values.py -i authors-delete-3.csv -f dc.contributor.author -m 3 -d dspacetest -u dspacetest -p fuuu
- Remove old about page (#284)
- CGSpace crashed a few times today
- Generate list of unique authors in CCAFS collections:
dspacetest=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 AND resource_id IN (select item_id from collection2item where collection_id IN (select resource_id from handle where handle in ('10568/32729', '10568/5472', '10568/5473', '10568/10288', '10568/70974', '10568/3547', '10568/3549', '10568/3531','10568/16890','10568/5470','10568/3546', '10568/36024', '10568/66581', '10568/21789', '10568/5469', '10568/5468', '10568/3548', '10568/71053', '10568/25167'))) group by text_value order by count desc) to /tmp/ccafs-authors.csv with csv;
2016-10-05
- Work on more infrastructure cleanups for Ansible DSpace role
- Clean up Let's Encrypt plumbing and submit pull request for rmg-ansible-public (#60)
2016-10-06
- Nice! DSpace Test (linode02) is now having
java.lang.OutOfMemoryError: Java heap space
errors... - Heap space is 2048m, and we have 5GB of RAM being used for OS cache (Solr!) so let's just bump the memory to 3072m
- Magdalena from CCAFS asked why the colors in the thumbnails for these two items look different, even though they are the same in the PDF itself
- Turns out the first PDF was exported from InDesign using CMYK and the second one was using sRGB
- Run all system updates on DSpace Test and reboot it
2016-10-08
- Re-deploy CGSpace with latest changes from late September and early October
- Run fixes for ILRI subjects and delete blank metadata values:
dspace=# delete from metadatavalue where resource_type_id=2 and text_value='';
DELETE 11
- Run all system updates and reboot CGSpace
- Delete ten gigs of old 2015 Tomcat logs that never got rotated (WTF?):
root@linode01:~# ls -lh /var/log/tomcat7/localhost_access_log.2015* | wc -l
47
- Delete 2GB
cron-filter-media.log
file, as it is just a log from a cron job and it doesn't get rotated like normal log files (almost a year now maybe)
2016-10-14
- Run all system updates on DSpace Test and reboot server
- Looking into some issues with Discovery filters in Atmire's content and usage analysis module after adjusting the filter class
- Looks like changing the filters from
configuration.DiscoverySearchFilterFacet
toconfiguration.DiscoverySearchFilter
breaks them in Atmire CUA module
2016-10-17
- A bit more cleanup on the CCAFS authors, and run the corrections on DSpace Test:
$ ./fix-metadata-values.py -i ccafs-authors-oct-16.csv -f dc.contributor.author -t 'correct name' -m 3 -d dspace -u dspace -p fuuu
- One observation is that there are still some old versions of names in the author lookup because authors appear in other communities (as we only corrected authors from CCAFS for this round)
2016-10-18
- Start working on DSpace 5.5 porting work again:
$ git checkout -b 5_x-55 5_x-prod
$ git rebase -i dspace-5.5
- Have to fix about ten merge conflicts, mostly in the SCSS for the CGIAR theme
- Skip 1e34751b8cf17021f45d4cf2b9a5800c93fb4cb2 in lieu of upstream's 55e623d1c2b8b7b1fa45db6728e172e06bfa8598 (fixes X-Forwarded-For header) because I had made the same fix myself and it's better to use the upstream one
- I notice this rebase gets rid of GitHub merge commits... which actually might be fine because merges are fucking annoying to deal with when remote people merge without pulling and rebasing their branch first
- Finished up applying the 5.5 sitemap changes to all themes
- Merge the
discovery.xml
cleanups (#278) - Merge some minor edits to the distribution license (#285)
2016-10-19
- When we move to DSpace 5.5 we should also cherry pick some patches from 5.6 branch:
- memory cleanup: 9f0f5940e7921765c6a22e85337331656b18a403
- sql injection: c6fda557f731dbc200d7d58b8b61563f86fe6d06
- pdfbox security issue: b5330b78153b2052ed3dc2fd65917ccdbfcc0439
2016-10-20
- Run CCAFS author corrections on CGSpace
- Discovery reindexing took forever and kinda caused CGSpace to crash, so I ran all system updates and rebooted the server
2016-10-25
- Move the LIVES community from the top level to the ILRI projects community
$ /home/cgspace.cgiar.org/bin/dspace community-filiator --set --parent=10568/27629 --child=10568/25101
- Start testing some things for DSpace 5.5, like command line metadata import, PDF media filter, and Atmire CUA
- Start looking at batch fixing of "old" ILRI website links without www or https, for example:
dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ilri.org%';
- Also CCAFS has HTTPS and their links should use it where possible:
dspace=# select * from metadatavalue where resource_type_id=2 and text_value like 'http://ccafs.cgiar.org%';
- And this will find community and collection HTML text that is using the old style PNG/JPG icons for RSS and email (we should be using Font Awesome icons instead):
dspace=# select text_value from metadatavalue where resource_type_id in (3,4) and text_value like '%Iconrss2.png%';
- Turns out there are shit tons of varieties of this, like with http, https, www, separate
</img>
tags, alignments, etc - Had to find all variations and replace them individually:
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>','<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"/>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/email.jpg"/>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/Iconrss2.png"></img>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="http://www.ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="http://www.ilri.org/images/email.jpg"></img>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/Iconrss2.png"></img>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/email.jpg"></img>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/Iconrss2.png"></img>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/Iconrss2.png"></img>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://www.ilri.org/images/email.jpg"></img>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://www.ilri.org/images/email.jpg"></img>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/Iconrss2.png"/>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img align="left" src="https://ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img align="left" src="https://ilri.org/images/email.jpg"/>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/Iconrss2.png"/>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="https://www.ilri.org/images/email.jpg"/>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>', '<span class="fa fa-rss fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/Iconrss2.png"/>%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, '<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>', '<span class="fa fa-at fa-2x" aria-hidden="true"></span>') where resource_type_id in (3,4) and text_value like '%<img valign="center" align="left" src="http://www.ilri.org/images/email.jpg"/>%';
- Getting rid of these reduces the number of network requests each client makes on community/collection pages, and makes use of Font Awesome icons (which they are already loading anyways!)
- And now that I start looking, I want to fix a bunch of links to popular sites that should be using HTTPS, like Twitter, Facebook, Google, Feed Burner, DOI, etc
- I should look to see if any of those domains is sending an HTTP 301 or setting HSTS headers to their HTTPS domains, then just replace them
2016-10-27
- Run Font Awesome fixes on DSpace Test:
dspace=# \i /tmp/font-awesome-text-replace.sql
UPDATE 17
UPDATE 17
UPDATE 3
UPDATE 3
UPDATE 30
UPDATE 30
UPDATE 1
UPDATE 1
UPDATE 7
UPDATE 7
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 1
UPDATE 0
- Looks much better now:
- Run the same replacements on CGSpace
2016-10-30
- Fix some messed up authors on CGSpace:
dspace=# update metadatavalue set authority='799da1d8-22f3-43f5-8233-3d2ef5ebf8a8', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Charleston, B.%';
UPDATE 10
dspace=# update metadatavalue set authority='e936f5c5-343d-4c46-aa91-7a1fff6277ed', confidence=600 where metadata_field_id=3 and resource_type_id=2 and text_value like 'Knight-Jones%';
UPDATE 36
- I updated the authority index but nothing seemed to change, so I'll wait and do it again after I update Discovery below
- Skype chat with Tsega about the IFPRI contentdm bridge
- We tested harvesting OAI in an example collection to see how it works
- Talk to Carlos Quiros about CG Core metadata in CGSpace
- Get a list of countries from CGSpace so I can do some batch corrections:
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where metadata_field_id=228 group by text_value order by count desc) to /tmp/countries.csv with csv;
- Fix a bunch of countries in Open Refine and run the corrections on CGSpace:
$ ./fix-metadata-values.py -i countries-fix-18.csv -f dc.coverage.country -t 'correct' -m 228 -d dspace -u dspace -p fuuu
$ ./delete-metadata-values.py -i countries-delete-2.csv -f dc.coverage.country -m 228 -d dspace -u dspace -p fuuu
- Run a shit ton of author fixes from Peter Ballantyne that we've been cleaning up for two months:
$ ./fix-metadata-values.py -i /tmp/authors-fix-pb2.csv -f dc.contributor.author -t correct -m 3 -u dspace -d dspace -p fuuu
- Run a few URL corrections for ilri.org and doi.org, etc:
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://www.ilri.org','https://www.ilri.org') where resource_type_id=2 and text_value like '%http://www.ilri.org%';
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://mahider.ilri.org', 'https://cgspace.cgiar.org') where resource_type_id=2 and text_value like '%http://mahider.%.org%' and metadata_field_id not in (28);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://dx.doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://dx.doi.org%' and metadata_field_id not in (18,26,28,111);
dspace=# update metadatavalue set text_value = regexp_replace(text_value, 'http://doi.org', 'https://dx.doi.org') where resource_type_id=2 and text_value like '%http://doi.org%' and metadata_field_id not in (18,26,28,111);
- I skipped metadata fields like citation and description