- The statistics-2014 core finished processing after five hours, so I started processing the statistics-2013 core on DSpace Test
- Since I was going to restart CGSpace and update the Discovery indexes anyways I decided to check for any straggling upper case AGROVOC entries and lower case them:
```
dspace=# BEGIN;
dspace=# UPDATE metadatavalue SET text_value=LOWER(text_value) WHERE resource_type_id=2 AND metadata_field_id=57 AND text_value ~ '[[:upper:]]';
UPDATE 164
dspace=# COMMIT;
```
- Run system updates on CGSpace (linode18) and reboot it
- I had to restart Tomcat once after the machine started up to get all Solr statistics cores to load properly
- After about ten more hours the rest of the Solr statistics cores finished processing on DSpace Test and I started optimizing them in Solr admin UI
## 2020-11-10
- I am noticing that CGSpace doesn't have any statistics showing for years before 2020, but all cores are loaded successfully in Solr Admin UI... strange
- I restarted Tomcat and I see in Solr Admin UI that the statistics-2015 core failed to load
- Looking in the DSpace log I see:
```
2020-11-10 08:43:59,634 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:43:59,687 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2018
2020-11-10 08:43:59,707 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:44:00,004 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2020-11-10 08:44:00,005 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2020-11-10 08:44:00,005 WARN org.dspace.core.ConfigurationManager @ Requested configuration module: atmire-datatables not found
2020-11-10 08:44:00,325 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2015
```
- Seems that the core gets probed twice... perhaps a threading issue?
- The only thing I can think of is the `acceptorThreadCount` parameter in Tomcat's server.xml, which has been set to 2 since 2018-01 (we started sharding the Solr statistics cores in 2019-01 and that's when this problem arose)
- I will try reducing that to 1
- Wow, now it's even worse:
```
2020-11-10 08:51:03,007 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2018
2020-11-10 08:51:03,008 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:51:03,137 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2018
2020-11-10 08:51:03,153 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2015
2020-11-10 08:51:03,289 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2015
2020-11-10 08:51:03,289 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2010
2020-11-10 08:51:03,475 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2010
2020-11-10 08:51:03,475 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2016
2020-11-10 08:51:03,730 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2016
2020-11-10 08:51:03,731 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2017
2020-11-10 08:51:03,992 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2017
2020-11-10 08:51:03,992 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2011
2020-11-10 08:51:04,178 INFO org.dspace.statistics.SolrLogger @ Created core with name: statistics-2011
2020-11-10 08:51:04,178 INFO org.dspace.statistics.SolrLogger @ Loading core with name: statistics-2012
```
- Could it be because we have two Tomcat connectors?
- I restarted Tomcat a few more times before all cores loaded, and still there are no stats before 2020-01... hmmmmm
- I added a [lowercase formatter to OpenRXV](https://github.com/ilri/OpenRXV/commit/3816b9b3f3d9182d2ba1a899c1017c5895a59dee) so that we can lowercase AGROVOC subjects during harvesting
- Atmire responded with a quote for the work to fix the duplicate owningComm, etc in our Solr data
- I told them to proceed, as it's within our budget of credits
- They will write a processor for DSpace 6 to remove the duplicates
- I did some tests to add a usage statistics chart to the item views on DSpace Test
- It is inspired by Salem's work on WorldFish's repository, and it hits the dspace-statistics-api for the current item and displays a graph
- I got it working very easily for all-time statistics with Chart.js, but I think I will need to use Highcharts or something else because Chart.js is HTML5 canvas and doesn't allow theming via CSS (so our Bootstrap brand colors for each theme won't work)
- I think I'll pursue this after the DSpace 6 upgrade...
## 2020-11-12
- I was looking at Solr again trying to find a way to get community and collection stats by faceting on `owningComm` and `owningColl` and it seems to work actually
- The duplicated values in the multi-value fields don't seem to affect the counts, as I had thought previously (though we should still get rid of them)
- One major difference between the raw numbers I was looking at and Atmire's numbers is that Atmire's code filters "Internal" IP addresses...
- Also, instead of doing `isBot:false` I think I should do `-isBot:true` because it's not a given that all documents will have this field and have it false, but we can definitely exclude the ones that have it as true
- First we get the total number of communities with stats (using calcdistinct):
- I was surprised to see 10,000,000 docs with `isBot:true` when I was testing on DSpace Test...
- This has got to be a mistake of some kind, as I see 4 million in 2014 that are from `dns:localhost.`, perhaps that's when we didn't have useProxies set up correctly?
- I don't see the same thing on CGSpace... I wonder what happened?
- Perhaps they got re-tagged during the DSpace 6 upgrade, somehow during the Solr migration? Hmmmmm. Definitely have to be careful with `isBot:true` in the future and not automatically purge these!!!
- I noticed 120,000+ hits from monit, FeedBurner, and Blackboard Safeassign in 2014, 2015, 2016, 2017, etc...
- I hadn't seen monit before, but the others are already in DSpace's spider agents lists for some time so probably only appear in older stats cores
- The issue with purging these using `check-spider-hits.sh` is that it can't do case-insensitive regexes and some metacharacters like `\s` don't work so I added case-sensitive patterns to a local agents file and purged them with the script
- After about 6,000,000 records I got the same error that I've gotten every time I test this migration process:
```
Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
$ psql -c 'SELECT * FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;' | wc -l
1494
```
- I sent a mail to the dspace-tech mailing list to ask for help...
- For now I just restarted PostgreSQL and a few users were able to complete submissions...
- While processing the statistics-2018 Solr core I got the *same* memory error that I have gotten every time I processed this core in testing:
```
Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuffer.append(StringBuffer.java:270)
at java.io.StringWriter.write(StringWriter.java:101)
at org.apache.solr.common.util.XML.writeXML(XML.java:133)
at org.apache.solr.client.solrj.util.ClientUtils.writeVal(SourceFile:160)
at org.apache.solr.client.solrj.util.ClientUtils.writeXML(SourceFile:128)
at org.apache.solr.client.solrj.request.UpdateRequest.writeXML(UpdateRequest.java:365)
at org.apache.solr.client.solrj.request.UpdateRequest.getXML(UpdateRequest.java:281)
at org.apache.solr.client.solrj.request.RequestWriter.getContentStream(RequestWriter.java:67)
at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getDelegate(RequestWriter.java:95)
at org.apache.solr.client.solrj.request.RequestWriter$LazyContentStream.getName(RequestWriter.java:105)
at org.apache.solr.client.solrj.impl.HttpSolrServer.createMethod(HttpSolrServer.java:302)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
- I increased the Java heap memory to 4096MB and restarted the processing
- After a few hours I got the following error, which I have gotten several times over the last few months:
```
Exception: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error while creating field 'p_group_id{type=uuid,properties=indexed,stored,multiValued}' from value '10'
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.dspace.util.SolrUpgradePre6xStatistics.batchUpdateStats(SolrUpgradePre6xStatistics.java:161)
at org.dspace.util.SolrUpgradePre6xStatistics.run(SolrUpgradePre6xStatistics.java:456)
at org.dspace.util.SolrUpgradePre6xStatistics.main(SolrUpgradePre6xStatistics.java:365)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:229)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:81)
```
## 2020-11-17
- Chat with Peter about using some remaining CRP Livestock open access money to fund more work on OpenRXV / AReS
- I will create GitHub issues for each of the things we talked about and then create ToRs to send to CodeObia for a quote
- Continue migrating Solr statistics to DSpace 6 UUID format after the upgrade on Sunday
- Regarding the IWMI issue about flagships and strategic priorities we can use CRP Livestock as an example because all their [flagships are mapped to collections](https://cgspace.cgiar.org/handle/10568/80102)
- I decided to enable the `rollbackOnReturn=true` option in [Tomcat's JDBC connection pool parameters](https://tomcat.apache.org/tomcat-7.0-doc/jdbc-pool.html) because I noticed that all of the "idle in transaction" connections waiting for locks were SELECT queries
- There are many posts on the Internet about people having this issue with Hibernate
- The locks are lower now, but Peter and Abenet are still having issues approving items and Tezira forwarded one strange case where an item was "approved" and was assigned a handle, but it doesn't exist...
- I sent another mail to the dspace-tech mailing list to ask for help
- I reverted the `rollbackOnReturn` change in Tomcat...
- I sent a message to Atmire to ask for urgent help
- Call with IWMI and Abenet about them potentially moving from InMagic to CGSpace
- They have questions about the reporting on AReS
- We told them that we can use collections to infer Strategic Priorities and Research Groups and WLE Flagships
- It sounds like we will create this structure under the top-level IWMI community:
- IWMI Strategic Priorities (sub-community)
- Water, Food and Ecosystems (sub-community)
- Sustainable and Resilient Food Production Systems (collection)
- Sustainable Water infrastructure and Ecosystems (collection)
- Integrated Basin and Aquifer Management
- Water, Climate Change and Resilience (sub-community)
- Climate Change Adaptation and Resilience (collection)
- etc...
- They will submit items to their normal output type collections and map to these
- In other news I finally finished processing the Solr statistics for UUIDs and re-indexed the stats with the dspace-statistics-api
- I started the Atmire stats processing, notes in the dedicated [CGSpace DSpace 6 Upgrade section]({{< relref "cgspace-dspace6-upgrade.md" >}})
- Peter got a strange message this evening when trying to update metadata:
```
2020-11-18 16:57:33,309 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1]
2020-11-18 16:57:33,316 ERROR org.hibernate.engine.jdbc.batch.internal.BatchingBatch @ HHH000315: Exception executing batch [Batch update returned unexpected row count from update [13]; actual row count: 0; expected: 1]
2020-11-18 16:57:33,385 INFO org.hibernate.engine.jdbc.batch.internal.AbstractBatchImpl @ HHH000010: On release of batch it still contained JDBC statements
```
- Minor bug fixes to limit parameter in DSpace Statistics API
- Send a list of potential ToRs for a next phase of OpenRXV development to Michael Victor for feedback:
- Enable advanced reporting templates using "Angular expressions" in Docxtemplater (would be used immediately for IWMI and Bioversity–CIAT)
- Enable embedding of charts like world map and word cloud in reports
- Enable embedding of item thumbnails in reports, similar to the "list of information products"
- Enable something like the "Statistics" Excel report Peter wanted in 2019 so we can get community and collection statistics reports
- Add a new "metrics" block with statistics about top authors and items by number of views and downloads for the current search terms
- Add ability to change the explorer UI to "Usage Statistics" mode where lists of authors, affiliations, sponsors, CRPs, communities, collections, etc are sorted according to the number of views or downloads for the current search results, rather than by number of occurrences of metadata values
- Add ability to "drill down" or modify search filter terms by clicking on countries in the map
- Enable date-based usage statistics (currently only "all time" statistics are available)
- Fixing minor bugs for all issues filed on GitHub
- I started a fresh reharvest on AReS and when it was done I noticed that the metadata from CGSpace is fine, but the views and downloads don't seem to be working
- In other news, I noticed that harvesting DSpace 6 works fine in OpenRXV, but the statistics fail on page 1
- I filed an issue: https://github.com/ilri/OpenRXV/issues/59
- Abenet asked for help trying to add a new user to the Bioversity and CIAT groups on CGSpace
- I see that the user search is split on five results, so the user in question appears on page 2
- I asked Abenet if she was getting an error or it was simply this...
- Maria Garuccio sent me an example report that she wants to be able to generate from AReS
- First, she would like to have the option to group by output type
- Second, she would like to be able to control the sorting in the template, like sorting the citation alphabetically
- I filed an issue: https://github.com/ilri/OpenRXV/issues/60
- Mohammad Salem had asked if there was an item ID to UUID mapping for CGSpace
- I found a thread on the dspace-tech mailing list that pointed out that there is a new `uuid` column in the item table
- Only old items have an `item_id` so we can get a mapping easily:
```
dspace=# \COPY (SELECT item_id,uuid FROM item WHERE in_archive='t' AND withdrawn='f' AND item_id IS NOT NULL) TO /tmp/2020-11-22-item-id2uuid.csv WITH CSV HEADER;
COPY 87411
```
- Saving some notes I wrote down about faceting by community and collection in Solr, for potential use in the future in the DSpace Statistics API
- Facet by owningComm to see total number of distinct communities (136):
- I created the sub-communities and collections for IWMI's Strategic Priorities and Research Groups on CGSpace: https://cgspace.cgiar.org/handle/10568/110259
- Yesterday Abenet asked me to investigate why AReS only shows 9,000 "livestock" terms in the ILRI community on AReS, but on CGSpace we have over 10,000
- I added the lowercase formatter to all center and CRP subjects fields and re-harvested
- Now I see there are 9,999, which seems suspicious
- I filed a bug on GitHub: https://github.com/ilri/OpenRXV/issues/61
- Help Abenet map an item on CGSpace for CIAT
- If I search for the entire item title I don't get any results, but I notice this item had a ":" in the title, so I tried searching for part of the title without the colon and it worked
- It is a mystery to me that you can't map an item using its Handle...
- I started processing the statistics-2011 core with Atmire's AtomicStatisticsUpdateCLI tool
- I called Moayad and we worked on the views/downloads issue on OpenRXV
- It turns out to be a mapping (schema) issue in Elasticsearch due to DSpace 6 UUIDs (LOL!!)
- Peter told me that he can't find the [CGIAR Research Program on Livestock](https://cgspace.cgiar.org/handle/10568/80099) community in the community filters on AReS
- I looked briefly and couldn't find it either so I filed an issue on OpenRXV: https://github.com/ilri/OpenRXV/issues/62
- Ben Hack asked for the ILRI subject we are using on CGSpace
- I linked him the input-forms.xml file and also sent him a list of 112 terms extracted with `xml` from the xmlstarlet package:
```
$ xml sel -t -m '//value-pairs[@value-pairs-name="ilrisubject"]/pair/displayed-value/text()' -c '.' -n dspace/config/input-forms.xml
```
- IWMI sent me a few new ORCID identifiers so I combined them with our existing ones as well as another ILRI one that Tezira asked me to update, filtered the unique ones, and then resolved their names using my `resolve-orcids.py` script: