dspace-statistics-api

mirror of https://github.com/ilri/dspace-statistics-api.git synced 2024-11-22 14:25:01 +01:00

Author	SHA1	Message	Date
Alan Orth	4ede966dbb	indexer.py: Fix logic error in SQL insert This was inserting correctly on the first run, but subsequent runs were inserting into the incorrect column on conflict. This made it seem like there were downloads for items where there were none.	2018-10-05 00:16:24 +03:00
Alan Orth	b14c3eef4d	indexer.py: Use ujson instead of json Falcon optionally makes use of the ujson library to speed up media (de)serialization, error serialization, and query string parsing. See: https://falcon.readthedocs.io/en/stable/user/install.html	2018-09-27 09:51:40 +03:00
Alan Orth	2c1e4952b1	indexer.py: Remove comment I had left this there so I could remember how to get the number of facets, but I don't need it anymore.	2018-09-26 23:27:48 +03:00
Alan Orth	385a34e5d0	indexer.py: Use psycopg2's execute_values to batch inserts Batch inserts are much faster than a series of individual inserts because they drastically reduce the overhead caused by round-trip communication with the server. My tests in development confirm: - cursor.execute(): 19 seconds - execute_values(): 14 seconds I'm currently only working with 4,500 rows, but I will experiment with larger data sets, as well as larger batches. For example, on the PostgreSQL mailing list a user reports doing 10,000 rows with a page size of 100. See: http://initd.org/psycopg/docs/extras.html#psycopg2.extras.execute_values See: https://github.com/psycopg/psycopg2/issues/491#issuecomment-276551038	2018-09-26 23:10:29 +03:00
Alan Orth	e604d8ca81	indexer.py: Major refactor Basically Solr's numFound has nothing to do with the actual number of distinct facets that are returned. You need to use Solr's stats component to get the number of distinct facets, aka countDistinct. This is apparently deprecated in newer Solr versions, but we're on version 4.10 and it works there. Also, I realized that there is no need to return facets for items without any views or downloads. Using facet.mincount=1 reduces the result set size and also means we can store less data in the database. The API returns HTTP 404 Not Found if an item is not in the database anyways. I can't figure it out exactly, but there is some weird issue with Solr's facet results when you don't use facet.mincount=1. For some reason you get tons of results with an id that doesn't even exist in the document database, let alone as an actual DSpace item! See: https://lucene.apache.org/solr/guide/6_6/the-stats-component.html	2018-09-26 02:41:10 +03:00
Alan Orth	bfceffd84d	indexer.py: Improve inline documentation	2018-09-25 12:23:31 +03:00
Alan Orth	4b72f626d9	Update string substitution format Instead of doing numbered strings I will just depend on the order, at least to be consistent.	2018-09-25 02:19:29 +03:00
Alan Orth	3327884f21	Update docs to remove SQLite stuff I've decided to use PostgreSQL instead of SQLite because the UPSERT support is available in versions of PostgreSQL we're already running, whereas SQLite needs a VERY new (3.24.0) version that is not available on any recent long-term support Ubuntu releases.	2018-09-25 00:56:01 +03:00
Alan Orth	8f7450f67a	Use PostgreSQL instead of SQLite I was very surprised how easy and fast and robust SQLite was, but in the end I realized that its UPSERT support only came in version 3.24 and both Ubuntu 16.04 and 18.04 have older versions than that! I did manage to install libsqlite3-0 from Ubuntu 18.04 cosmic on my xenial host, but that feels dirty. PostgreSQL has support for UPSERT since 9.5, not to mention the same nice LIMIT and OFFSET clauses.	2018-09-25 00:49:47 +03:00
Alan Orth	2cab456f16	indexer.py: Use single items table with UPSERT I was using two separate tables for item views and downloads without realizing that SQLite didn't support FULL OUTER JOIN, which would be needed to get views and downloads for a given item in a single query. Instead I can use one table with a default value of 0 for both views and downloads, and then use "UPSERT" to populate the statistics. This is a newish SQL concept that allows you to attempt an INSERT and then specify an action to perform in case of conflict. This works well in SQLite and actually simplifies my Python logic greatly! Note that the "excluded" table qualifier is a special keyword that allows you to reference the value that would have been inserted. See: https://www.sqlite.org/lang_UPSERT.html	2018-09-24 14:19:50 +03:00
Alan Orth	53615dea2d	indexer.py: Add license and documentation	2018-09-24 09:18:50 +03:00
Alan Orth	a51422273c	Remove SOLR_CORE configuration variable This parameter is not customizable. All DSpace instances use this name for the Solr statistics core.	2018-09-24 00:20:54 +03:00
Alan Orth	89621af85d	Split database access into RW and RO The indexer needs to be able to write to the database, but the API only needs to read it.	2018-09-24 00:00:05 +03:00
Alan Orth	2db5e02be9	Add indexer.py Standalone script to ingest item views and downloads from Solr into SQLite.	2018-09-23 16:47:48 +03:00

14 Commits
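
The single-table UPSERT approach described in commits 2cab456f16 and 4ede966dbb can be sketched roughly as follows. This is only an illustration, not the code from indexer.py: the `items` schema, the `upsert_counts` helper and the `demo` function are hypothetical names, and it uses Python's built-in `sqlite3` module (which needs SQLite 3.24 or newer for `ON CONFLICT ... DO UPDATE`) rather than PostgreSQL so that it runs without a server. The point it tries to show is that the column updated on conflict must be the same column that was being inserted, which is the kind of mismatch the 4ede966dbb message describes.

```python
import sqlite3

# Hypothetical schema: one row per item, both counters default to 0.
SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    id INTEGER PRIMARY KEY,
    views INTEGER DEFAULT 0,
    downloads INTEGER DEFAULT 0
)
"""


def upsert_counts(conn, column, rows):
    """Insert (id, count) pairs into `column`, updating that same column on conflict."""
    # Whitelist the column name because it is interpolated into the SQL text.
    if column not in ("views", "downloads"):
        raise ValueError(f"unexpected column: {column}")
    sql = (
        f"INSERT INTO items (id, {column}) VALUES (?, ?) "
        # "excluded" refers to the row that would have been inserted.
        f"ON CONFLICT (id) DO UPDATE SET {column} = excluded.{column}"
    )
    conn.executemany(sql, rows)
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    upsert_counts(conn, "views", [(1, 10), (2, 3)])  # first run: plain inserts
    upsert_counts(conn, "downloads", [(1, 4)])       # conflict on id 1: sets downloads only
    upsert_counts(conn, "views", [(1, 12)])          # later run: refreshes views only
    return conn.execute(
        "SELECT id, views, downloads FROM items ORDER BY id"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

In this sketch, item 2 never receives a download count, so its `downloads` value stays at the default of 0 across repeated runs.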