CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

December, 2019

2019-12-01

  • Upgrade CGSpace (linode18) to Ubuntu 18.04:
    • Check any packages that have residual configs and purge them:
# dpkg -l | grep -E '^rc' | awk '{print $2}' | xargs dpkg -P
    • Make sure all packages are up to date and the package manager is up to date, then reboot:
# apt update && apt full-upgrade
# apt-get autoremove && apt-get autoclean
# dpkg -C
# reboot
  • Take some backups:
# dpkg -l > 2019-12-01-linode18-dpkg.txt
# tar czf 2019-12-01-linode18-etc.tar.gz /etc
  • Then check all third-party repositories in /etc/apt to see if everything using “xenial” has packages available for “bionic” and then update the sources:
# sed -i 's/xenial/bionic/' /etc/apt/sources.list.d/*.list
  • Pause the Uptime Robot monitoring for CGSpace
  • Make sure the update manager is installed and do the upgrade:
# apt install update-manager-core
# do-release-upgrade
  • After the upgrade finishes, remove Java 11, force the installation of bionic nginx, and reboot the server:
# apt purge openjdk-11-jre-headless
# apt install 'nginx=1.16.1-1~bionic'
# reboot
  • After the server comes back up, remove Python virtualenvs that were created with Python 3.5 and re-run certbot to make sure it's working:
# rm -rf /opt/eff.org/certbot/venv/bin/letsencrypt
# rm -rf /opt/ilri/dspace-statistics-api/venv
# /opt/certbot-auto
  • Clear Ansible's fact cache and re-run the playbooks to update the system's firewalls, SSH config, etc
  • Altmetric finally responded to my question about Dublin Core fields
    • They shared a list of fields they use for tracking, but it only mentions HTML meta tags, and not fields considered when harvesting via OAI
    • Anyway, there might be some areas we can improve in the HTML meta tags: looking at one item with a DOI, ISSN, etc., I see that we could at least add status (Open Access) and journal title
    • I merged a pull request into the 5_x-prod branch to add status and journal title to the XHTML meta tags
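
The residual-config check and the xenial-to-bionic source rewrite from the upgrade steps in this entry can be previewed without changing the system. This is only a minimal sketch, not the procedure that was actually run: the function names list_rc_packages and retarget_release are hypothetical, the sample input is made up for illustration, and both helpers only read standard input.

```shell
#!/bin/sh
# Sketch only: helper names and sample data are hypothetical.

# Print the names of packages that `dpkg -l` reports in state "rc"
# (removed, but configuration files remain). Reads dpkg -l output on stdin.
list_rc_packages() {
    awk '$1 == "rc" { print $2 }'
}

# Preview an APT source rewrite by replacing a release codename on stdin,
# without editing any file in place.
# Usage: retarget_release OLD NEW < some-file.list
retarget_release() {
    sed "s/$1/$2/g"
}

# Made-up sample input standing in for a live system:
sample_dpkg='ii  nginx          1.16.1-1~xenial amd64 web server
rc  openjdk-8-jre  8u222-b10       amd64 Java runtime
rc  libfoo1        1.0-1           amd64 example library'

printf '%s\n' "$sample_dpkg" | list_rc_packages
printf '%s\n' 'deb https://example.org/ubuntu xenial main' | retarget_release xenial bionic
```

On a real host the same helpers could be fed with `dpkg -l | list_rc_packages` and `retarget_release xenial bionic < /etc/apt/sources.list.d/some.list` (file name hypothetical) to review the result before running the destructive commands.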

2019-12-02

  • Raise the issue of old, low-quality thumbnails with Peter and the CGSpace team
    • I suggested that we move manually uploaded thumbnails from the ORIGINAL bundle to the THUMBNAIL bundle
    • Also replace old thumbnails where an item is available on Slideshare or YouTube because those are easy to get new, high-quality thumbnails for
  • Continue testing CG Core v2 implementation on DSpace Test
    • Compare the OAI QDC representation of a few items on CGSpace vs DSpace Test:
$ http 'https://cgspace.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/cgspace-104030.xml
$ http 'https://dspacetest.cgiar.org/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:cgspace.cgiar.org:10568/104030' > /tmp/dspacetest-104030.xml
  • The DSpace Test one now actually captures the DOI, whereas the CGSpace one doesn't…
  • And the DSpace Test one doesn't include review status as dc.description, but I don't think that's an important field
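
To spot differences like the DOI and review status without reading both XML files by eye, the two saved responses can be reduced to their Dublin Core elements and diffed. This is a rough sketch under some assumptions: the helper names dc_elements and compare_dc are hypothetical, the elements are assumed to use the dc: prefix, and a value that spans several lines would only be captured up to its first line break.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# List "<dc:element>value" pairs found in an OAI GetRecord response, sorted.
dc_elements() {
    grep -o '<dc:[^>]*>[^<]*' "$1" | sort
}

# Show the elements that differ between two responses:
# "<" lines exist only in the first file, ">" lines only in the second.
compare_dc() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    dc_elements "$1" > "$tmp_a"
    dc_elements "$2" > "$tmp_b"
    diff "$tmp_a" "$tmp_b" | grep '^[<>]'
    rm -f "$tmp_a" "$tmp_b"
}
```

With the files saved above this would be called as `compare_dc /tmp/cgspace-104030.xml /tmp/dspacetest-104030.xml`.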

2019-12-04

  • Peter noticed that there were about seventy items on CGSpace that were marked as private
    • Some have been withdrawn, but I extracted a list of the forty-eight that were not:
dspace=# \COPY (SELECT handle, owning_collection FROM item, handle WHERE item.discoverable='f' AND item.in_archive='t' AND handle.resource_id = item.item_id) to /tmp/2019-12-04-CGSpace-private-items.csv WITH CSV HEADER;
COPY 48

2019-12-05

2019-12-08

  • Enrico noticed that the AReS Explorer on CGSpace (linode18) was down
    • I only see HTTP 502 in the nginx logs on CGSpace… so I assume it's something wrong with the AReS server
    • I ran all system updates on the AReS server (linode20) and rebooted it
    • After rebooting the Explorer was accessible again
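
A quick way to confirm that the proxy was returning only HTTP 502 is to tally the status codes in the access log. This is a sketch that assumes nginx's default "combined" log format, where the status is the ninth whitespace-separated field. The helper name count_status is hypothetical, and the log path in the usage note is the standard default rather than something confirmed for this server.

```shell
#!/bin/sh
# Sketch only: assumes the "combined" log format; helper name is hypothetical.

# Count requests per HTTP status code, most frequent first.
# Reads the named log files, or standard input if none are given.
count_status() {
    awk '{ count[$9]++ } END { for (s in count) print count[s], s }' "$@" | sort -rn
}
```

For example, `count_status /var/log/nginx/access.log` prints one line per status code with its request count.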

2019-12-09

  • Update PostgreSQL JDBC driver to version 42.2.9 in Ansible playbooks
    • Deploy on DSpace Test (linode19) to test before deploying on CGSpace in a few days
  • Altmetric responded to my question about the WLE item that has a lower score than its DOI
    • They say that they will “reprocess” the item “before Christmas”

2019-12-11

  • Post message to Yammer about good practices for thumbnails on CGSpace
    • On the topic of thumbnails, I'm thinking we might want to force regenerate all PDF thumbnails on CGSpace since we upgraded it to Ubuntu 18.04 and got a new ghostscript…
  • More discussion about report formats for AReS
  • Peter noticed that the Atmire reports weren't showing any statistics before 2019
    • I checked and indeed Solr had an issue loading some core last time it was started
    • I restarted Tomcat three times before all cores came up successfully
  • While I was restarting the Tomcat service I upgraded the PostgreSQL JDBC driver to version 42.2.9, which had been deployed on DSpace Test earlier this week

2019-12-16

  • Visit CodeObia office to discuss next phase of OpenRXV/AReS development
    • We discussed using CSV instead of Excel for tabular reports
      • OpenRXV should only have “simple” reports with Dublin Core fields
      • AReS should have this as well as a customized “extended” report that has CRPs, Subjects, Sponsors, etc from CGSpace
    • We discussed using RTF instead of Word for graphical reports

2019-12-17

  • Start filing GitHub issues for the reporting features on OpenRXV and AReS
    • I created an issue for the “simple” tabular reports on OpenRXV GitHub (#29)
    • I created an issue for the “extended” tabular reports on AReS GitHub (#8)
    • I created an issue for “simple” text reports on the OpenRXV GitHub (#30)
    • I created an issue for “extended” text reports on the AReS GitHub (#9)
  • I looked into creating RTF documents from HTML in Node.js and there is a library called html-to-rtf that works well, but doesn't support images
  • Export a list of all investors (dc.description.sponsorship) for Peter to look through and correct:
dspace=# \COPY (SELECT DISTINCT text_value as "dc.contributor.sponsor", count(*) FROM metadatavalue WHERE resource_type_id = 2 AND metadata_field_id = 29 GROUP BY text_value ORDER BY count DESC LIMIT 1500) to /tmp/2019-12-17-investors.csv WITH CSV HEADER;
COPY 643

2019-12-18

  • Apply the investor corrections and deletions from Peter on CGSpace:
$ ./fix-metadata-values.py -i /tmp/2019-12-17-investors-fix-112.csv -db dspace -u dspace -p 'fuuu' -f dc.description.sponsorship -m 29 -t correct -d
$ ./delete-metadata-values.py -i /tmp/2019-12-17-investors-delete-68.csv -db dspace -u dspace -p 'fuuu' -m 29 -f dc.description.sponsorship -d
  • Peter asked about the “Open Government Licence 3.0” that is used by some items
    • I notice that it exists in SPDX as OGL-UK-3.0 so I created a GitHub issue to add this to our controlled vocabulary (#439)
    • I only see two in our database that use this for now, so I will update them:
dspace=# SELECT text_value FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open%';
         text_value          
-----------------------------
 Open Government License 3.0
 Open Government License 3.0
(2 rows)
dspace=# UPDATE metadatavalue SET text_value='OGL-UK-3.0' WHERE resource_type_id=2 AND metadata_field_id=53 AND text_value LIKE '%Open Government License 3.0%';
UPDATE 2
  • I created a pull request to add the license and merged it to the 5_x-prod branch (#440)
  • Add three new CCAFS Phase II project tags to CGSpace (#441)
  • Linode said DSpace Test (linode19) had an outbound traffic rate of 73Mb/sec for the last two hours
    • I see some Russian bot active in nginx's access logs:
# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -c MegaIndex.ru 
27320
  • I see they did check robots.txt and their requests are only going to XMLUI item pages… so I guess I just leave them alone
  • Peter wrote to ask why this one WLE item does not have an Altmetric attention score, but the DOI does
    • I tweeted the item just in case, but Peter said that he already did it yesterday
    • The item was added six months ago…
    • The DOI has an Altmetric score of 259, but for the Handle it is HTTP 404!
    • I emailed Altmetric support
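
For the crawler traffic on DSpace Test noted earlier in this entry, counting requests per user agent shows whether one bot dominates, instead of grepping for a single name. This is a sketch that assumes nginx's default "combined" log format, in which the user agent is the third double-quoted string on each line. The helper name top_user_agents is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes the "combined" log format; helper name is hypothetical.

# Print the ten most frequent user agents as "count<TAB>user agent".
# Splitting on double quotes makes the user agent the 6th field.
top_user_agents() {
    awk -F'"' '{ count[$6]++ } END { for (ua in count) print count[ua] "\t" ua }' "$@" |
        sort -rn | head -n 10
}
```

It can be run on the same files as the grep above: `top_user_agents /var/log/nginx/access.log /var/log/nginx/access.log.1`.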

2019-12-22

  • I ran the dspace cleanup process on CGSpace (linode18) and had an error:
Error: ERROR: update or delete on table "bitstream" violates foreign key constraint "bundle_primary_bitstream_id_fkey" on table "bundle"
  Detail: Key (bitstream_id)=(179441) is still referenced from table "bundle".
  • The solution is to manually clear the bundle's reference to that bitstream (set primary_bitstream_id to NULL):
$ psql dspace -c 'update bundle set primary_bitstream_id=NULL where primary_bitstream_id in (179441);'
UPDATE 1
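
Since this error recurs with a different ID each time, the ID can be pulled out of the "Detail:" line and the matching statement printed for review. This is a minimal sketch: the helper names bitstream_id_from_error and clear_primary_bitstream_sql are hypothetical, and the SQL is only printed, never executed.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; nothing here touches the database.

# Extract the bitstream ID from the "Detail:" line of the cleanup error.
# Reads the error text on standard input.
bitstream_id_from_error() {
    sed -n 's/.*Key (bitstream_id)=(\([0-9][0-9]*\)).*/\1/p'
}

# Print the UPDATE that clears the bundle's reference to the given ID.
clear_primary_bitstream_sql() {
    printf 'UPDATE bundle SET primary_bitstream_id=NULL WHERE primary_bitstream_id IN (%s);\n' "$1"
}
```

Assuming the error output was saved to a file such as /tmp/cleanup.log (a hypothetical path), `clear_primary_bitstream_sql "$(bitstream_id_from_error < /tmp/cleanup.log)"` prints the statement to pass to psql after checking it.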

2019-12-23

  • Follow up with Altmetric on the issue where an item has a different (lower) score for its Handle despite it having a correct DOI (with a higher score)
    • I've raised this issue three times to Altmetric this year, and a few weeks ago they said they would re-process the item “before Christmas”
  • Abenet suggested we use cg.reviewStatus instead of cg.review-status and I agree that we should follow other examples like DCTERMS.accessRights and DCTERMS.isPartOf