mirror of https://github.com/ilri/dspace-statistics-api.git synced 2024-12-27 06:54:29 +01:00
Commit Graph

400 Commits

Author SHA1 Message Date
8f7450f67a
Use PostgreSQL instead of SQLite
I was very surprised how easy and fast and robust SQLite was, but in
the end I realized that its UPSERT support only came in version 3.24
and both Ubuntu 16.04 and 18.04 have older versions than that! I did
manage to install libsqlite3-0 from Ubuntu 18.10 (cosmic) on my xenial
host, but that feels dirty.

PostgreSQL has supported UPSERT since 9.5, not to mention that it has
the same nice LIMIT and OFFSET clauses.
2018-09-25 00:49:47 +03:00
28d61fb041
README.md: Add notes about Python and SQLite versions 2018-09-24 17:26:48 +03:00
cbc98991b4
CHANGELOG.md: Move unreleased notes to version 0.1.0 2018-09-24 16:14:14 +03:00
6c28be0463
README.md: Add note about route for all items 2018-09-24 16:13:26 +03:00
42e8f17305
CHANGELOG.md: Add note about route for all items 2018-09-24 16:13:05 +03:00
19a45f3f6f
app.py: Add route to page through all item statistics
This route exposes all item statistics and uses the limit and offset
parameters to control paging through the result set. The logic here
is extremely easy thanks to the brilliant LIMIT and OFFSET features
of SQLite (of course the SQL query sorts the results by some unique
field to ensure the order is always the same).
2018-09-24 16:07:26 +03:00
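A minimal sketch of the paging idea described in this commit, using Python's built-in sqlite3 module. The `items` table name follows the later commit messages; the column names (`id`, `views`, `downloads`), the sample rows, and the `fetch_page` helper are assumptions for illustration, not the code in app.py.

```python
import sqlite3


def fetch_page(conn, limit, offset):
    """Return one page of (id, views, downloads) rows.

    Sorting by the unique id column keeps the order identical between
    requests, so consecutive pages never overlap or skip rows.
    """
    cursor = conn.execute(
        "SELECT id, views, downloads FROM items ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return cursor.fetchall()


# Hypothetical in-memory data set with seven items.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    "id INTEGER PRIMARY KEY, views INTEGER DEFAULT 0, downloads INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO items (id, views, downloads) VALUES (?, ?, ?)",
    [(i, i * 10, i) for i in range(1, 8)],
)

first_page = fetch_page(conn, limit=3, offset=0)
second_page = fetch_page(conn, limit=3, offset=3)
last_page = fetch_page(conn, limit=3, offset=6)
```

Without the ORDER BY clause the database is free to return rows in any order, which is why the commit message insists on sorting by a unique field.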
505ef31101
CHANGELOG.md: Add note about UPSERT 2018-09-24 14:31:05 +03:00
1543cacc54
app.py: Update SQL logic to use single table
The indexer.py script was updated to use a single table because I
learned about UPSERT. This simplifies the database schema and the
Python logic, and makes it easier to page all views and downloads
at once without complicated JOIN queries.
2018-09-24 14:28:00 +03:00
2cab456f16
indexer.py: Use single items table with UPSERT
I was using two separate tables for item views and downloads without
realizing that SQLite didn't support FULL OUTER JOIN, which would be
needed to get views and downloads for a given item in a single query.

Instead I can use one table with a default value of 0 for both views
and downloads, and then use "UPSERT" to populate the statistics. This
is a newish SQL concept that allows you to attempt an INSERT and then
specify an action to perform in case of conflict. This works well in
SQLite and actually simplifies my Python logic greatly!

Note that the "excluded" table qualifier is a special keyword that
allows you to reference the value that would have been inserted.

See: https://www.sqlite.org/lang_UPSERT.html
2018-09-24 14:19:50 +03:00
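A rough illustration of the single-table UPSERT approach described in this commit. It needs SQLite 3.24 or newer, as noted elsewhere in this log. The column names and the two helper functions are assumptions for illustration; the real indexer.py may be organized differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table, with views and downloads both defaulting to 0.
conn.execute(
    "CREATE TABLE items ("
    "id INTEGER PRIMARY KEY, views INTEGER DEFAULT 0, downloads INTEGER DEFAULT 0)"
)


def upsert_views(conn, item_id, views):
    """Insert the item, or update its views if the id already exists."""
    conn.execute(
        "INSERT INTO items (id, views) VALUES (?, ?) "
        # "excluded" refers to the row that the INSERT tried to add.
        "ON CONFLICT(id) DO UPDATE SET views = excluded.views",
        (item_id, views),
    )


def upsert_downloads(conn, item_id, downloads):
    """Insert the item, or update its downloads if the id already exists."""
    conn.execute(
        "INSERT INTO items (id, downloads) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET downloads = excluded.downloads",
        (item_id, downloads),
    )


upsert_views(conn, 1, 5)       # new row: views=5, downloads=0 (default)
upsert_downloads(conn, 1, 2)   # conflict on id=1: only downloads changes
upsert_views(conn, 1, 7)       # conflict on id=1: only views changes
upsert_downloads(conn, 2, 4)   # new row: views=0 (default), downloads=4

rows = conn.execute("SELECT id, views, downloads FROM items ORDER BY id").fetchall()
```

Because each statement touches only one of the two counters, views and downloads can be indexed in separate passes and still end up in the same row, with no JOIN needed when reading.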
53615dea2d
indexer.py: Add license and documentation 2018-09-24 09:18:50 +03:00
2d8d1e6833
README.md: Add TODO for nonexistent items 2018-09-24 00:48:02 +03:00
e26e595ea1
README.md: Add more TODOs 2018-09-24 00:35:00 +03:00
a9151b5bbf
CHANGELOG.md: Update unreleased notes 2018-09-24 00:30:58 +03:00
76833d6f5f
contrib: Update some old CGSpace references to DSpace 2018-09-24 00:30:26 +03:00
a51422273c
Remove SOLR_CORE configuration variable
This parameter is not customizable. All DSpace instances use this
name for the Solr statistics core.
2018-09-24 00:20:54 +03:00
89621af85d
Split database access into RW and RO
The indexer needs to be able to write to the database, but the API only
needs to read it.
2018-09-24 00:00:05 +03:00
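The commit message does not say how the split is implemented. One way to get the same effect with SQLite, sketched here as an assumption, is to open the API's connection through a `file:` URI with `mode=ro`. The helper names, the table layout, and the temporary database path are hypothetical.

```python
import os
import sqlite3
import tempfile


def connect_rw(path):
    """Read-write connection, as the indexer would need."""
    return sqlite3.connect(path)


def connect_ro(path):
    """Read-only connection, as the API would need."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


# Hypothetical database file in a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "statistics.db")

writer = connect_rw(db_path)
writer.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, views INTEGER DEFAULT 0)")
writer.execute("INSERT INTO items (id, views) VALUES (1, 5)")
writer.commit()

reader = connect_ro(db_path)
rows = reader.execute("SELECT id, views FROM items").fetchall()

# Any write attempted through the read-only connection is rejected.
try:
    reader.execute("INSERT INTO items (id, views) VALUES (2, 9)")
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True
```

With this arrangement a bug in an API route cannot modify the cached statistics, because the database itself refuses the write.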
c554404d7f
CHANGELOG.md: Add systemd units for indexer 2018-09-23 23:15:27 +03:00
90d7a452bd
contrib: Add systemd units for indexer
An example systemd service unit for the indexer and an accompanying
timer unit.
2018-09-23 23:13:43 +03:00
431a1c9d64
CHANGELOG.md: Add unreleased changes 2018-09-23 23:04:01 +03:00
e1b9d1284f
Rename project to DSpace Statistics API
At first I called it "CGSpace" because I was making it specifically
for our CGSpace DSpace repository, but the potential here is bigger
than that!
2018-09-23 23:02:21 +03:00
bac764a0a4
CHANGELOG.md: Move entries to version 0.0.4 2018-09-23 16:49:25 +03:00
1a650e57c0
CHANGELOG.md: Update unreleased features 2018-09-23 16:48:39 +03:00
2db5e02be9
Add indexer.py
Standalone script to ingest item views and downloads from Solr into
SQLite.
2018-09-23 16:47:48 +03:00
9e942736b1
app.py: Get item statistics from SQLite database
It is much more efficient to cache view and download statistics in
a database than to query Solr on demand (not to mention that it is
not possible to page easily with facets in Solr). I decided to use
SQLite because it is fast, native in Python 3, and doesn't require
any extra steps during provisioning (assuming permissions are ok).
2018-09-23 16:47:00 +03:00
ea85393b13
app.py: Use parameterized URI instead of query for /item
Falcon's get_param_as_int() is really nice in that it gets a query
parameter and does validation for you, but I really wanted to have
cleaner URIs for API routes so I am now using a route URI template
with a field converter. This is cleaner, but means that parameters
not matching the template will return HTTP 404.

See: https://falcon.readthedocs.io/en/stable/api/routing.html#field-converters
2018-09-23 16:23:33 +03:00
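To illustrate the trade-off mentioned in this commit, here is a standard-library sketch of what an integer field in a route template implies. It is not Falcon's implementation or API; the regular expression and the `route` function are hypothetical stand-ins for a template such as `/item/{item_id:int}`.

```python
import re

# Hypothetical stand-in for a route template with an integer field.
ITEM_ROUTE = re.compile(r"/item/(?P<item_id>[0-9]+)")


def route(path):
    """Return (status, item_id) for a request path.

    Paths that do not fit the template are not routed at all, so the
    caller sees HTTP 404 rather than a validation error.
    """
    match = ITEM_ROUTE.fullmatch(path)
    if match is None:
        return 404, None
    return 200, int(match.group("item_id"))


ok = route("/item/17")
not_a_number = route("/item/abc")
missing_id = route("/item/")
```

With a query parameter, a malformed ID can be reported as a bad request; with a typed path segment, the same input simply fails to match any route.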
cbeb7c89a7
CHANGELOG.md: Add note about Solr connection refactor 2018-09-23 13:27:43 +03:00
b0d81a543c
Refactor Solr components
This makes it so we only need to define and connect once and then we
can re-use the connection everywhere else.
2018-09-23 13:24:30 +03:00
84801a4ab5
Add vim modeline to all Python files
Uses four spaces for tab and shift widths, and turns on expansion of
tabs to spaces.
2018-09-23 11:33:26 +03:00
4e8621e3d9
README.md: Add TODO about API documentation 2018-09-23 09:52:36 +03:00
2c8430171d
CHANGELOG.md: Add note about systemd unit file 2018-09-23 07:58:15 +03:00
fb60133713
Add example systemd unit for statistics API 2018-09-23 07:50:04 +03:00
9e01a80011
CHANGELOG.md: Move changes to version 0.0.3 2018-09-20 17:41:47 +03:00
a263996582
app.py: Fix Solr queries for item views
According to dspace-api's Constants.java, items are type 2 and they
use a unique ID field of `id` instead of `owningItem`. There is no
need to check the bundleName for item types.

Also, I decided to use the main Solr query for item IDs because the
filter query parameter (fq) stores results in the filterCache and
can be quite expensive with cores storing tens of millions of
documents (we currently have 149 million docs!). It makes sense to use
the filter query parameter to reduce the result set returned by the
main Solr query.
2018-09-20 17:37:13 +03:00
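A hedged sketch of how the query described in this commit might be assembled, combined with the rows=0 and numFound approach from the earlier commits in this log. The item ID goes in the main query and the filter query only narrows the result. The specific filter fields (`isBot`, `statistics_type`), the example item ID, and the sample response are assumptions for illustration; the project itself uses SolrClient rather than hand-built parameters.

```python
import json
from urllib.parse import urlencode


def build_item_views_params(item_id):
    """Build Solr select parameters for counting views of one item."""
    return {
        # Items are type 2 and are identified by the id field.
        "q": f"type:2 AND id:{item_id}",
        # Assumed filters; fq only reduces the set matched by q.
        "fq": "isBot:false AND statistics_type:view",
        # No documents are needed, only the total count.
        "rows": 0,
        "wt": "json",
    }


def num_found(response_body):
    """Extract the hit count from a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]


params = build_item_views_params(11576)  # hypothetical item ID
query_string = urlencode(params)

# Hand-written sample in the shape of a Solr JSON response.
sample_response = '{"response": {"numFound": 42, "start": 0, "docs": []}}'
views = num_found(sample_response)
```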
ed9d25294e
app.py: Use SolrClient's rows parameter
Instead of putting this in the raw query we can just use SolrClient's
native rows parameter.
2018-09-19 12:48:28 +03:00
5e165d2e88
CHANGELOG.md: Add note about using rows=0 in Solr queries 2018-09-19 01:50:14 +03:00
8e29fd8a43
app.py: Use rows=0 for Solr queries
There is no need to return any rows of the result because I am only
interested in the numFound.
2018-09-19 01:48:35 +03:00
24af83b03f
CHANGELOG.md: Add note about simplified Solr query 2018-09-19 00:30:28 +03:00
a87aaba812
app.py: Simplify Solr query for bitstream downloads
This whole business with negative query ranges is confusing as hell
and I'll definitely forget it in the future. In DSpace's Solr
terminology a "download" is a view to some bitstream that lives in the
ORIGINAL bundle. This is where bitstreams that are uploaded during
the item submission process go, versus generated thumbnails, etc.
2018-09-19 00:24:23 +03:00
57faec59c8
CHANGELOG.md: Add note about config refactor 2018-09-18 17:01:24 +03:00
06ab254017
Refactor configuration into separate module
There is a good example of this in the Project Weekend GitHub profile.

See: https://github.com/projectweekend/Falcon-PostgreSQL-API-Seed
2018-09-18 16:59:28 +03:00
5b5cab8b34
README.md: Update todo 2018-09-18 15:59:27 +03:00
40ce3c72a9
CHANGELOG.md: Update for version 0.0.2 2018-09-18 15:36:56 +03:00
ea2283355b
Add CHANGELOG.md
See: https://keepachangelog.com/en/1.0.0/
2018-09-18 15:35:42 +03:00
4b4a959a1c
Add ability to get Solr parameters from environment
You can use the SOLR_SERVER and SOLR_CORE variables to make deployment
via systemd, etc. easier.
2018-09-18 15:34:25 +03:00
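A minimal sketch of reading these two settings from the environment. The variable names come from the commit message; the fallback values and the `load_solr_config` helper are assumptions for illustration.

```python
import os


def load_solr_config(environ=os.environ):
    """Read Solr settings from the environment, with assumed defaults."""
    return {
        # Assumed default; adjust to wherever Solr actually listens.
        "SOLR_SERVER": environ.get("SOLR_SERVER", "http://localhost:8080/solr"),
        # Assumed default core name.
        "SOLR_CORE": environ.get("SOLR_CORE", "statistics"),
    }


defaults = load_solr_config({})
overridden = load_solr_config({"SOLR_SERVER": "http://solr.example.org:8983/solr"})
```

In a systemd unit the same variables can be supplied with `Environment=` lines, so the code needs no changes between hosts.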
1e16beed30
README.md: Add todo list 2018-09-18 14:19:14 +03:00
182e13efca
Add GPLv3 license 2018-09-18 14:16:07 +03:00
fe43423256
README.md: Update introduction 2018-09-18 14:11:29 +03:00
4d610e04b7
Add .gitignore 2018-09-18 14:09:53 +03:00
6c66303b45
Add README.md 2018-09-18 14:09:29 +03:00
36633e405a
Initial commit
Add first working version of the statistics API.
2018-09-18 14:03:15 +03:00