CGSpace Notes

Documenting day-to-day work on the CGSpace repository.

November, 2020

2020-11-01

  • Continue with processing the statistics-2019 Solr core with the AtomicStatisticsUpdateCLI tool on DSpace Test
    • So far we’ve spent at least fifty hours to process the statistics and statistics-2019 core… wow.

2020-11-02

  • Talk to Moayad and fix a few issues on OpenRXV:
    • Incorrect views and downloads (caused by Elasticsearch’s default result set size of 10)
    • Invalid share link
    • Missing “https://” for Handles in the Simple Excel report (caused by using the handle instead of the uri)
    • Sorting the list of items by views
  • I resumed the processing of the statistics-2018 Solr core after it spent 20 hours to get to 60%

2020-11-04

  • After 29 hours the statistics-2017 core finished processing so I started the statistics-2016 core on DSpace Test

2020-11-05

  • Peter sent me corrections and deletions for the author affiliations
    • I quickly proofed them for UTF-8 issues in OpenRefine and csv-metadata-quality and then tested them locally and then applied them on CGSpace:
$ ./fix-metadata-values.py -i 2020-11-05-fix-862-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -t 'correct' -m 211
$ ./delete-metadata-values.py -i 2020-11-05-delete-29-affiliations.csv -db dspace -u dspace -p 'fuuu' -f cg.contributor.affiliation -m 211
  • Then I started a Discovery re-index on CGSpace:
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    92m24.993s
user    8m11.858s
sys     2m26.931s

2020-11-06

  • Restart the AtomicStatisticsUpdateCLI processing of the statistics-2016 core on DSpace Test after 20 hours…
    • This phase finished after five hours so I started it on the statistics-2015 core

2020-11-07

  • Atmire responded about the issue with duplicate values in owningComm and containerCommunity etc
    • I told them to please look into it and use some of our credits if need be
  • The statistics-2015 core finished after 20 hours so I started the statistics-2014 core

2020-11-08

  • Add “Data Paper” to types on CGSpace
  • Add “SCALING CLIMATE-SMART AGRICULTURE” to CCAFS subjects on CGSpace
  • Add “ANDEAN ROOTS AND TUBERS” to CIP subjects on CGSpace
  • Add CGIAR System subjects to Discovery sidebar facets on CGSpace
    • Also add the System subject to item view on CGSpace
  • The statistics-2014 core finished processing after five hours, so I started processing the statistics-2013 core on DSpace Test
  • Since I was going to restart CGSpace and update the Discovery indexes anyways I decided to check for any straggling upper case AGROVOC entries and lower case them:
dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 164
dspace=# COMMIT;
  • Run system updates on CGSpace (linode18) and reboot it
    • I had to restart Tomcat once after the machine started up to get all Solr statistics cores to load properly
  • After about ten more hours the rest of the Solr statistics cores finished processing on DSpace Test and I started optimizing them in Solr admin UI

2020-11-10

  • I am noticing that CGSpace doesn’t have any statistics showing for years before 2020, but all cores are loaded successfully in Solr Admin UI… strange
    • I restarted Tomcat and I see in Solr Admin UI that the statistics-2015 core failed to load
    • Looking in the DSpace log I see:
2020-11-10 08:43:59,634 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:43:59,687 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2018
2020-11-10 08:43:59,707 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:44:00,004 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2020-11-10 08:44:00,005 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2020-11-10 08:44:00,005 WARN  org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2020-11-10 08:44:00,325 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2015
  • Seems that the core gets probed twice… perhaps a threading issue?
    • The only thing I can think of is the acceptorThreadCount parameter in Tomcat’s server.xml, which has been set to 2 since 2018-01 (we started sharding the Solr statistics cores in 2019-01 and that’s when this problem arose)
    • I will try reducing that to 1
    • Wow, now it’s even worse:
2020-11-10 08:51:03,007 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2018
2020-11-10 08:51:03,008 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:51:03,137 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2018
2020-11-10 08:51:03,153 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:51:03,289 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2015
2020-11-10 08:51:03,289 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2010
2020-11-10 08:51:03,475 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2010
2020-11-10 08:51:03,475 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2016
2020-11-10 08:51:03,730 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2020-11-10 08:51:03,731 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2017
2020-11-10 08:51:03,992 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2017
2020-11-10 08:51:03,992 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2011
2020-11-10 08:51:04,178 INFO  org.dspace.statistics.SolrLogger @ Created core with name: statistics-2011
2020-11-10 08:51:04,178 INFO  org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2012
  • Could it be because we have two Tomcat connectors?
    • I restarted Tomcat a few more times before all cores loaded, and still there are no stats before 2020-01… hmmmmm
  • I added a lowercase formatter to OpenRXV so that we can lowercase AGROVOC subjects during harvesting

2020-11-11

  • Atmire responded with a quote for the work to fix the duplicate owningComm, etc in our Solr data
    • I told them to proceed, as it’s within our budget of credits
    • They will write a processor for DSpace 6 to remove the duplicates
  • I did some tests to add a usage statistics chart to the item views on DSpace Test
    • It is inspired by Salem’s work on WorldFish’s repository, and it hits the dspace-statistics-api for the current item and displays a graph
    • I got it working very easily for all-time statistics with Chart.js, but I think I will need to use Highcharts or something else because Chart.js is HTML5 canvas and doesn’t allow theming via CSS (so our Bootstrap brand colors for each theme won’t work)
      • Hmm, Highcharts is not licensed under and open source license so I will not use it
      • Perhaps I’ll use Chartist with the popover plugin…
    • I think I’ll pursue this after the DSpace 6 upgrade…

2020-11-12

  • I was looking at Solr again trying to find a way to get community and collection stats by faceting on owningComm and owningColl and it seems to work actually
    • The duplicated values in the multi-value fields don’t seem to affect the counts, as I had thought previously (though we should still get rid of them)
    • One major difference between the raw numbers I was looking at and Atmire’s numbers is that Atmire’s code filters “Internal” IP addresses…
    • Also, instead of doing isBot:false I think I should do -isBot:true because it’s not a given that all documents will have this field and have it false, but we can definitely exclude the ones that have it as true
  • First we get the total number of communities with stats (using calcdistinct):
facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=1&facet.offset=0&stats=true&stats.field=owningComm&stats.calcdistinct=true&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
  • Then get stats themselves, iterating 100 items at a time with limit and offset:
facet=true&facet.field=owningComm&facet.mincount=1&facet.limit=100&facet.offset=0&shards=http://localhost:8081/solr/statistics,http://localhost:8081/solr/statistics-2019,http://localhost:8081/solr/statistics-2018,http://localhost:8081/solr/statistics-2017,http://localhost:8081/solr/statistics-2016,http://localhost:8081/solr/statistics-2015,http://localhost:8081/solr/statistics-2014,http://localhost:8081/solr/statistics-2013,http://localhost:8081/solr/statistics-2012,http://localhost:8081/solr/statistics-2011,http://localhost:8081/solr/statistics-2010
  • I was surprised to see 10,000,000 docs with isBot:true when I was testing on DSpace Test…
    • This has got to be a mistake of some kind, as I see 4 million in 2014 that are from dns:localhost., perhaps that’s when we didn’t have useProxies set up correctly?
    • I don’t see the same thing on CGSpace… I wonder what happened?
    • Perhaps they got re-tagged during the DSpace 6 upgrade, somehow during the Solr migration? Hmmmmm. Definitely have to be careful with isBot:true in the future and not automatically purge these!!!
  • I noticed 120,000+ hits from monit, FeedBurner, and Blackboard Safeassign in 2014, 2015, 2016, 2017, etc…
    • I hadn’t seen monit before, but the others are already in DSpace’s spider agents lists for some time so probably only appear in older stats cores
    • The issue with purging these using check-spider-hits.sh is that it can’t do case-insensitive regexes and some metacharacters like \s don’t work so I added case-sensitive patterns to a local agents file and purged them with the script

2020-11-15

  • Upgrade CGSpace to DSpace 6.3
    • First build, update, and migrate the database:
$ dspace cleanup -v
$ git checkout origin/6_x-dev-atmire-modules
$ npm install -g yarn
$ chrt -b 0 mvn -U -Dmirage2.on=true -Dmirage2.deps.included=false -P \!dspace-lni,\!dspace-rdf,\!dspace-sword,\!dspace-swordv2,\!dspace-jspui clean package
$ sudo su - postgres
$ psql dspace -c 'CREATE EXTENSION pgcrypto;'
$ psql dspace -c "DELETE FROM schema_version WHERE version IN ('5.8.2015.12.03.3');"
$ exit
$ rm -rf /home/cgspace/config/spring
$ ant update
$ dspace database info
$ dspace database migrate
$ sudo systemctl start tomcat7
  • After starting Tomcat DSpace should start up OK and begin Discovery indexing, but I want to also upgrade from PostgreSQL 9.6 to 10
    • I installed and configured PostgreSQL 10 using the Ansible playbooks, then migrated the database manually:
# systemctl stop tomcat7
# pg_ctlcluster 9.6 main stop
# tar -cvzpf var-lib-postgresql-9.6.tar.gz /var/lib/postgresql/9.6
# tar -cvzpf etc-postgresql-9.6.tar.gz /etc/postgresql/9.6
# pg_ctlcluster 10 main stop
# pg_dropcluster 10 main
# pg_upgradecluster 9.6 main
# pg_dropcluster 9.6 main
# systemctl start postgresql
# dpkg -l | grep postgresql | grep 9.6 | awk '{print $2}' | xargs dpkg -r
  • Then I ran all system updates and rebooted the server…
  • After the server came back up I re-ran the Ansible playbook to make sure all configs and services were updated
  • I disabled the dspace-statistsics-api for now because it won’t work until I migrate all the Solr statistics anyways
  • Start a full Discovery re-indexing:
$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b

real    211m30.726s
user    134m40.124s
sys     2m17.979s
  • Towards the end of the indexing there were a few dozen of these messages:
2020-11-15 13:23:21,685 INFO  com.atmire.dspace.discovery.service.AtmireSolrService @ Removed Item: null from Index
  • I updated all the Ansible infrastructure and DSpace branches to be the DSpace 6 ones
  • I will wait until the Discovery indexing is finished to start doing the Solr statistics migration
  • I tested the email functionality and it seems to need more configuration:
$ dspace test-email

About to send test email:
 - To: blah@cgiar.org
 - Subject: DSpace test email
 - Server: smtp.office365.com

Error sending email:
 - Error: com.sun.mail.smtp.SMTPSendFailedException: 451 5.7.3 STARTTLS is required to send mail [AM4PR0701CA0003.eurprd07.prod.outlook.com]
  • I copied the mail.extraproperties = mail.smtp.starttls.enable=true setting from the old DSpace 5 dspace.cfg and now the emails are working
  • After the Discovery indexing finished I started processing the Solr stats one core and 2.5 million records at a time:
$ export JAVA_OPTS='-Dfile.encoding=UTF-8 -Xmx2048m'
$ chrt -b 0 dspace solr-upgrade-statistics-6x -n 2500000 -i statistics
  • After about 6,000,000 records I got the same error that I’ve gotten every time I test this migration process:
Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
        at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
        at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)

2020-11-16

  • Users are having issues submitting items to CGSpace
    • Looking at the data I see that connections skyrocketed since DSpace 6 upgrade yesterday, and they are all in “waiting for lock” state:

PostgreSQL connections week PostgreSQL locks week

  • There are almost 1,500 locks:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1494
  • I sent a mail to the dspace-tech mailing list to ask for help…
    • For now I just restarted PostgreSQL and a few users were able to complete submissions…
  • While processing the statistics-2018 Solr core I got the same memory error that I have gotten every time I processed this core in testing:
Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuffer.append(StringBuffer.java:270)
        at java.io.StringWriter.write(StringWriter.java:101)
        at org.apache.solr.common.util.XML.writeXML(XML.java:133)
        at org.apache.solr.client.solrj.util.ClientUtils.writeVal(SourceFile:160)
        at org.apache.solr.client.solrj.util.ClientUtils.writeXML(SourceFile:128)
        at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:365)
        at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:281)
        at org.apache.solr.client.solrj.request.RequestWriter.getContentStream(RequestWriter.java:67)
        at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getDelegate(RequestWriter.java:95)
        at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getName(RequestWriter.java:105)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.createMethod(HttpSolrServer.java:302)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
        at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
        at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
  • I increased the Java heap memory to 4096MB and restarted the processing
    • After a few hours I got the following error, which I have gotten several times over the last few months:
Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
        at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
        at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)

2020-11-17

  • Chat with Peter about using some remaining CRP Livestock open access money to fund more work on OpenRXV / AReS
    • I will create GitHub issues for each of the things we talked about and then create ToRs to send to CodeObia for a quote
  • Continue migrating Solr statistics to DSpace 6 UUID format after the upgrade on Sunday
  • Regarding the IWMI issue about flagships and strategic priorities we can use CRP Livestock as an example because all their flagships are mapped to collections
  • Database issues are worse today…

PostgreSQL connections week

  • There are over 2,000 locks:
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
2071

2020-11-18

  • I decided to enable the rollbackOnReturn=true option in Tomcat’s JDBC connection pool parameters because I noticed that all of the “idle in transaction” connections waiting for locks were SELECT queries
    • There are many posts on the Internet about people having this issue with Hibernate
    • The locks are lower now, but Peter and Abenet are still having issues approving items and Tezira forwarded one strange case where an item was “approved” and was assigned a handle, but it doesn’t exist…
    • I sent another mail to the dspace-tech mailing list to ask for help
    • I reverted the rollbackOnReturn change in Tomcat…
    • I sent a message to Atmire to ask for urgent help
  • Call with IWMI and Abenet about them potentially moving from InMagic to CGSpace
    • They have questions about the reporting on AReS
    • We told them that we can use collections to infer Strategic Priorities and Research Groups and WLE Flagships
    • It sounds like we will create this structure under the top-level IWMI community:
      • IWMI Strategic Priorities (sub-community)
        • Water, Food and Ecosystems (sub-community)
          • Sustainable and Resilient Food Production Systems (collection)
          • Sustainable Water infrastructure and Ecosystems (collection)
          • Integrated Basin and Aquifer Management
        • Water, Climate Change and Resilience (sub-community)
          • Climate Change Adaptation and Resilience (collection)
        • etc…
    • They will submit items to their normal output type collections and map to these
  • In other news I finally finished processing the Solr statistics for UUIDs and re-indexed the stats with the dspace-statistics-api
  • Peter got a strange message this evening when trying to update metadata:
2020-11-18 16:57:33,309 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1]
2020-11-18 16:57:33,316 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [13]; actual row count: 0; expected: 1]
2020-11-18 16:57:33,385 INFO  org.hibernate.engine.jdbc.batch.internal.AbstractBatchImpl @ HHH000010: On release of batch it still contained JDBC statements
  • Minor bug fixes to limit parameter in DSpace Statistics API
  • Send a list of potential ToRs for a next phase of OpenRXV development to Michael Victor for feedback:
    • Enable advanced reporting templates using “Angular expressions” in Docxtemplater (would be used immediately for IWMI and Bioversity–CIAT)
    • Enable embedding of charts like world map and word cloud in reports
    • Enable embedding of item thumbnails in reports, similar to the “list of information products”
    • Enable something like the “Statistics” Excel report Peter wanted in 2019 so we can get community and collection statistics reports
    • Add a new “metrics” block with statistics about top authors and items by number of views and downloads for the current search terms
    • Add ability to change the explorer UI to “Usage Statistics” mode where lists of authors, affiliations, sponsors, CRPs, communities, collections, etc are sorted according to the number of views or downloads for the current search results, rather than by number of occurrences of metadata values
    • Add ability to “drill down” or modify search filter terms by clicking on countries in the map
    • Enable date-based usage statistics (currently only “all time” statistics are available)
    • Fixing minor bugs for all issues filed on GitHub
  • I also added GitHub issues for each of them

2020-11-19

  • I started a fresh reharvest on AReS and when it was done I noticed that the metadata from CGSpace is fine, but the views and downloads don’t seem to be working
  • Peter said he was able to approve a few items on CGSpace immediately “like old times” this morning
  • The PostgreSQL status looks much better now, though I haven’t changed anything

PostgreSQL connections week PostgreSQL locks week PostgreSQL transaction log week PostgreSQL transactions week

  • Very curious that there was such a high number of rolled back transactions after the update

2020-11-22

  • PostgreSQL situation on CGSpace (linode18) looks much better now:

PostgreSQL locks week PostgreSQL transaction log week

  • In other news, I noticed that harvesting DSpace 6 works fine in OpenRXV, but the statistics fail on page 1
  • Abenet asked for help trying to add a new user to the Bioversity and CIAT groups on CGSpace
    • I see that the user search is split on five results, so the user in question appears on page 2
    • I asked Abenet if she was getting an error or it was simply this…
  • Maria Garuccio sent me an example report that she wants to be able to generate from AReS
    • First, she would like to have the option to group by output type
    • Second, she would like to be able to control the sorting in the template, like sorting the citation alphabetically
    • I filed an issue: https://github.com/ilri/OpenRXV/issues/60
  • Mohammad Salem had asked if there was an item ID to UUID mapping for CGSpace
    • I found a thread on the dspace-tech mailing list that pointed out that there is a new uuid column in the item table
    • Only old items have an item_id so we can get a mapping easily:
dspace=# \COPY (SELECT item_id,uuid FROM item WHERE in_archive='t' AND withdrawn='f' AND item_id IS NOT NULL) TO /tmp/2020-11-22-item-id2uuid.csv WITH CSV HEADER;
COPY 87411
  • Saving some notes I wrote down about faceting by community and collection in Solr, for potential use in the future in the DSpace Statistics API
  • Facet by owningComm to see total number of distinct communities (136):
  facet=true&facet.mincount=1&facet.field=owningComm&facet.limit=1&facet.offset=0&stats=true&stats.field=id&stats.calcdistinct=true
  • Facet by owningComm and get the first 5 distinct:
  facet=true&facet.mincount=1&facet.field=owningComm&facet.limit=5&facet.offset=0&facet.pivot=id,countryCode
  • Facet by owningComm and countryCode using facet.pivot and maybe I can just skip the normal facet params?
facet=true&f.owningComm.facet.limit=5&f.owningComm.facet.offset=5&facet.pivot=owningComm,countryCode
  • Facet by owningComm and countryCode using facet.pivot and limiting to top five countries… fuck it’s possible!
facet=true&f.owningComm.facet.limit=5&f.owningComm.facet.offset=5&f.countryCode.facet.limit=5&facet.pivot=owningComm,countryCode

2020-11-23

2020-11-24

  • Yesterday Abenet asked me to investigate why AReS only shows 9,000 “livestock” terms in the ILRI community on AReS, but on CGSpace we have over 10,000
    • I added the lowercase formatter to all center and CRP subjects fields and re-harvested
    • Now I see there are 9,999, which seems suspicious
    • I filed a bug on GitHub: https://github.com/ilri/OpenRXV/issues/61
  • Help Abenet map an item on CGSpace for CIAT
    • If I search for the entire item title I don’t get any results, but I notice this item had a “:” in the title, so I tried searching for part of the title without the colon and it worked
    • It is a mystery to me that you can’t map an item using its Handle…
  • I started processing the statistics-2011 core with Atmire’s AtomicStatisticsUpdateCLI tool
  • I called Moayad and we worked on the views/downloads issue on OpenRXV
    • It turns out to be a mapping (schema) issue in Elasticsearch due to DSpace 6 UUIDs (LOL!!)

2020-11-25

  • Zoom meeting with ILRI communicators about CGSpace, Altmetric, and AReS
  • Send an email to Richard Fulss and Paola Camargo Paz at CIMMYT about having them work closer with us on AReS
  • Send an email to Usman at CIFOR to ask how his DSpace stuff is going
  • The Atmire AtomicStatisticsUpdateCLI tool finished processing the statistics-2017 core
  • Atmire responded about the duplicate fields in Solr and said they don’t see them
    • I sent a few examples that I found after thirty seconds of randomly looking in several Solr cores

2020-11-27

  • I finished processing the statistics-2016 core with the AtomicStatisticsUpdateCLI tool so I started the statistics-2015 core

2020-11-28

  • I finished processing the statistics-2015 core with the AtomicStatisticsUpdateCLI tool so I started the statistics-2014 core
  • I finished processing the statistics-2014 core with the AtomicStatisticsUpdateCLI tool so I started the statistics-2013 core
  • I finished processing the statistics-2014 core with the AtomicStatisticsUpdateCLI tool so I started the statistics-2012 core
  • I finished processing the statistics-2014 core with the AtomicStatisticsUpdateCLI tool so I started the statistics-2012 core
  • I finished processing the statistics-2014 core with the AtomicStatisticsUpdateCLI tool so I started the statistics-2010 core

2020-11-29

2020-11-30

  • Ben Hack asked for the ILRI subject we are using on CGSpace
    • I linked him the input-forms.xml file and also sent him a list of 112 terms extracted with xml from the xmlstarlet package:
$ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
  • IWMI sent me a few new ORCID identifiers so I combined them with our existing ones as well as another ILRI one that Tezira asked me to update, filtered the unique ones, and then resolved their names using my resolve-orcids.py script:
$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/iwmi-orcids.txt /tmp/hung.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2020-11-30-combined-orcids.txt
$ ./resolve-orcids.py -i /tmp/2020-11-30-combined-orcids.txt -o /tmp/2020-11-30-combined-orcids-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
  • I used my fix-metadata-values.py script to update the old occurences of Hung’s ORCID and some others that I see have changed:
$ cat 2020-11-30-fix-hung-orcid.csv
cg.creator.id,correct
"Hung Nguyen-Viet: 0000-0001-9877-0596","Hung Nguyen-Viet: 0000-0003-1549-2733"
"Adriana Tofiño: 0000-0001-7115-7169","Adriana Tofiño Rivera: 0000-0001-7115-7169"
"Cristhian Puerta Rodriguez: 0000-0001-5992-1697","David Puerta: 0000-0001-5992-1697"
"Ermias Betemariam: 0000-0002-1955-6995","Ermias Aynekulu: 0000-0002-1955-6995"
"Hirut Betaw: 0000-0002-1205-3711","Betaw Hirut: 0000-0002-1205-3711"
"Megan Zandstra: 0000-0002-3326-6492","Megan McNeil Zandstra: 0000-0002-3326-6492"
"Tolu Eyinla: 0000-0003-1442-4392","Toluwalope Emmanuel: 0000-0003-1442-4392"
"VInay Nangia: 0000-0001-5148-8614","Vinay Nangia: 0000-0001-5148-8614"
$ ./fix-metadata-values.py -i 2020-11-30-fix-hung-orcid.csv -db dspace63 -u dspacetest -p 'dom@in34sniper' -f cg.creator.id -t 'correct' -m 240