Commit Graph

551 Commits

Author SHA1 Message Date
Alan Orth 3327884f21
Update docs to remove SQLite stuff
I've decided to use PostgreSQL instead of SQLite because the UPSERT
support is available in versions of PostgreSQL we're already running,
whereas SQLite needs a VERY new (3.24.0) version that is not available
on any recent long-term support Ubuntu releases.
2018-09-25 00:56:01 +03:00
Alan Orth 8f7450f67a
Use PostgreSQL instead of SQLite
I was very surprised how easy and fast and robust SQLite was, but in
the end I realized that its UPSERT support only came in version 3.24
and both Ubuntu 16.04 and 18.04 have older versions than that! I did
manage to install libsqlite3-0 from Ubuntu 18.10 cosmic on my xenial
host, but that feels dirty.

PostgreSQL has supported UPSERT since 9.5, not to mention the same
nice LIMIT and OFFSET clauses.
2018-09-25 00:49:47 +03:00
Alan Orth 28d61fb041
README.md: Add notes about Python and SQLite versions 2018-09-24 17:26:48 +03:00
Alan Orth cbc98991b4
CHANGELOG.md: Move unreleased notes to version 0.1.0 2018-09-24 16:14:14 +03:00
Alan Orth 6c28be0463
README.md: Add note about route for all items 2018-09-24 16:13:26 +03:00
Alan Orth 42e8f17305
CHANGELOG.md: Add note about route for all items 2018-09-24 16:13:05 +03:00
Alan Orth 19a45f3f6f
app.py: Add route to page through all item statistics
This route exposes all item statistics and uses the limit and offset
parameters to control paging through the result set. The logic here
is extremely easy thanks to the brilliant LIMIT and OFFSET features
of SQLite (of course the SQL query sorts the results by some unique
field to ensure the order is always the same).
2018-09-24 16:07:26 +03:00
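A minimal sketch of the paging described in this commit, using Python's built-in sqlite3 module. The table name `items`, its columns, and the helper `page_items` are assumptions for illustration, not the project's actual code. Ordering by the unique `id` column is what keeps consecutive pages from overlapping.

```python
import sqlite3


def page_items(conn, limit=100, offset=0):
    """Return one page of (id, views, downloads) rows.

    Sorting by the unique id column gives a stable order, so the same
    limit and offset always select the same slice of the result set.
    """
    cursor = conn.execute(
        "SELECT id, views, downloads FROM items ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return cursor.fetchall()


# Hypothetical schema and made-up numbers, only to exercise the query
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    "id INTEGER PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO items (id, views, downloads) VALUES (?, ?, ?)",
    [(i, i * 10, i) for i in range(1, 8)],
)

first_page = page_items(conn, limit=3, offset=0)
second_page = page_items(conn, limit=3, offset=3)
last_page = page_items(conn, limit=3, offset=6)
```

With seven rows and a limit of three, the third page holds only the one remaining row, which is how a client can tell it has reached the end.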
Alan Orth 505ef31101
CHANGELOG.md: Add note about UPSERT 2018-09-24 14:31:05 +03:00
Alan Orth 1543cacc54
app.py: Update SQL logic to use single table
The indexer.py script was updated to use a single table because I
learned about UPSERT. This simplifies the database schema and the
Python logic, and makes it easier to page all views and downloads
at once without complicated JOIN queries.
2018-09-24 14:28:00 +03:00
Alan Orth 2cab456f16
indexer.py: Use single items table with UPSERT
I was using two separate tables for item views and downloads without
realizing that SQLite didn't support FULL OUTER JOIN, which would be
needed to get views and downloads for a given item in a single query.

Instead I can use one table with a default value of 0 for both views
and downloads, and then use "UPSERT" to populate the statistics. This
is a newish SQL concept that allows you to attempt an INSERT and then
specify an action to perform in case of conflict. This works well in
SQLite and actually simplifies my Python logic greatly!

Note that the "excluded" table qualifier is a special keyword that
allows you to reference the value that would have been inserted.

See: https://www.sqlite.org/lang_UPSERT.html
2018-09-24 14:19:50 +03:00
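The single-table UPSERT approach from this commit might look roughly like the sketch below. The column names and the two helper functions are assumptions, not the contents of indexer.py, and the SQLite library linked into Python must be version 3.24.0 or newer for `ON CONFLICT ... DO UPDATE` to parse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table, with views and downloads both defaulting to 0
conn.execute(
    "CREATE TABLE items ("
    "id INTEGER PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)"
)


def upsert_views(conn, item_id, views):
    """Insert a row for the item, or update its views if it already exists."""
    conn.execute(
        "INSERT INTO items (id, views) VALUES (?, ?) "
        # excluded.views is the value the INSERT tried to write
        "ON CONFLICT(id) DO UPDATE SET views = excluded.views",
        (item_id, views),
    )


def upsert_downloads(conn, item_id, downloads):
    """Insert a row for the item, or update its downloads if it already exists."""
    conn.execute(
        "INSERT INTO items (id, downloads) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET downloads = excluded.downloads",
        (item_id, downloads),
    )


# Made-up statistics for two hypothetical items
upsert_views(conn, 10, 5)       # new row, downloads stays at the default 0
upsert_downloads(conn, 10, 2)   # conflict on id, so only downloads changes
upsert_views(conn, 10, 8)       # conflict again, views is overwritten
upsert_downloads(conn, 11, 1)   # new row, views stays at the default 0

rows = conn.execute("SELECT id, views, downloads FROM items ORDER BY id").fetchall()
```

Because each statement only touches its own column on conflict, views and downloads can be indexed in separate passes without either pass clobbering the other.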
Alan Orth 53615dea2d
indexer.py: Add license and documentation 2018-09-24 09:18:50 +03:00
Alan Orth 2d8d1e6833
README.md: Add TODO for nonexistent items 2018-09-24 00:48:02 +03:00
Alan Orth e26e595ea1
README.md: Add more TODOs 2018-09-24 00:35:00 +03:00
Alan Orth a9151b5bbf
CHANGELOG.md: Update unreleased notes 2018-09-24 00:30:58 +03:00
Alan Orth 76833d6f5f
contrib: Update some old CGSpace references to DSpace 2018-09-24 00:30:26 +03:00
Alan Orth a51422273c
Remove SOLR_CORE configuration variable
This parameter is not customizable. All DSpace instances use this
name for the Solr statistics core.
2018-09-24 00:20:54 +03:00
Alan Orth 89621af85d
Split database access into RW and RO
The indexer needs to be able to write to the database, but the API only
needs to read it.
2018-09-24 00:00:05 +03:00
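The commit message does not say how the split is implemented. One possible way with SQLite, sketched here under that assumption, is to open the API's connection through a URI with `mode=ro`. The file name and the `connect_rw` and `connect_ro` helpers are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path

# Throwaway database file so the sketch is self-contained
db_path = Path(tempfile.mkdtemp()) / "statistics.db"


def connect_rw(path):
    """Read-write connection, as the indexer would need."""
    return sqlite3.connect(str(path))


def connect_ro(path):
    """Read-only connection, as the API would need."""
    return sqlite3.connect(f"{Path(path).as_uri()}?mode=ro", uri=True)


rw = connect_rw(db_path)
rw.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, views INT DEFAULT 0)")
rw.execute("INSERT INTO items (id, views) VALUES (1, 42)")
rw.commit()

ro = connect_ro(db_path)
views = ro.execute("SELECT views FROM items WHERE id = 1").fetchone()[0]

# Any write through the read-only connection is refused by SQLite
try:
    ro.execute("INSERT INTO items (id, views) VALUES (2, 7)")
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True
```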
Alan Orth c554404d7f
CHANGELOG.md: Add systemd units for indexer 2018-09-23 23:15:27 +03:00
Alan Orth 90d7a452bd
contrib: Add systemd units for indexer
An example systemd service unit for the indexer and an accompanying
timer unit.
2018-09-23 23:13:43 +03:00
Alan Orth 431a1c9d64
CHANGELOG.md: Add unreleased changes 2018-09-23 23:04:01 +03:00
Alan Orth e1b9d1284f
Rename project to DSpace Statistics API
At first I called it "CGSpace" because I was making it specifically
for our CGSpace DSpace repository, but the potential here is bigger
than that!
2018-09-23 23:02:21 +03:00
Alan Orth bac764a0a4
CHANGELOG.md: Move entries to version 0.0.4 2018-09-23 16:49:25 +03:00
Alan Orth 1a650e57c0
CHANGELOG.md: Update unreleased features 2018-09-23 16:48:39 +03:00
Alan Orth 2db5e02be9
Add indexer.py
Standalone script to ingest item views and downloads from Solr into
SQLite.
2018-09-23 16:47:48 +03:00
Alan Orth 9e942736b1
app.py: Get item statistics from SQLite database
It is much more efficient to cache view and download statistics in
a database than to query Solr on demand (not to mention that it is
not possible to page easily with facets in Solr). I decided to use
SQLite because it is fast, native in Python 3, and doesn't require
any extra steps during provisioning (assuming permissions are ok).
2018-09-23 16:47:00 +03:00
Alan Orth ea85393b13
app.py: Use parameterized URI instead of query for /item
Falcon's get_param_as_int() is really nice in that it gets a query
parameter and does validation for you, but I really wanted to have
cleaner URIs for API routes so I am now using a route URI template
with a field converter. This is cleaner, but means that parameters
not matching the template will return HTTP 404.

See: https://falcon.readthedocs.io/en/stable/api/routing.html#field-converters
2018-09-23 16:23:33 +03:00
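To illustrate the trade-off described in this commit, the sketch below imitates an integer field converter with a plain regular expression. It is not Falcon's implementation or API, only a standard-library stand-in; the `/item/<id>` path shape and the `dispatch` function are assumptions.

```python
import re

# Hand-written stand-in for a route template with an integer field
ITEM_ROUTE = re.compile(r"/item/(?P<item_id>\d+)")


def dispatch(path):
    """Return (status, body) for a request path."""
    match = ITEM_ROUTE.fullmatch(path)
    if match is None:
        # The path does not fit the template, so no route matches at all
        return 404, None
    # The field arrives already converted to an int
    return 200, {"id": int(match.group("item_id"))}


ok = dispatch("/item/17")
not_a_number = dispatch("/item/abc")
old_query_style = dispatch("/item?id=17")
```

With a query parameter, a bad value could be answered with a validation error; with a templated route, the same bad value simply fails to match and ends up as a 404.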
Alan Orth cbeb7c89a7
CHANGELOG.md: Add note about Solr connection refactor 2018-09-23 13:27:43 +03:00
Alan Orth b0d81a543c
Refactor Solr components
This makes it so we only need to define and connect once and then we
can re-use the connection everywhere else.
2018-09-23 13:24:30 +03:00
Alan Orth 84801a4ab5
Add vim modeline to all Python files
Uses four spaces for tab and shift widths, and turns on expansion of
tabs to spaces.
2018-09-23 11:33:26 +03:00
Alan Orth 4e8621e3d9
README.md: Add TODO about API documentation 2018-09-23 09:52:36 +03:00
Alan Orth 2c8430171d
CHANGELOG.md: Add note about systemd unit file 2018-09-23 07:58:15 +03:00
Alan Orth fb60133713
Add example systemd unit for statistics API 2018-09-23 07:50:04 +03:00
Alan Orth 9e01a80011
CHANGELOG.md: Move changes to version 0.0.3 2018-09-20 17:41:47 +03:00
Alan Orth a263996582
app.py: Fix Solr queries for item views
According to dspace-api's Constants.java, items are type 2 and they
use a unique ID field of `id` instead of `owningItem`. There is no
need to check the bundleName for item types.

Also, I decided to use the main Solr query for item IDs because the
filter query parameter (fq) stores results in the filterCache and
can be quite expensive with cores storing tens of millions of
documents (we currently have 149 million docs!). It makes sense to use
the filter query parameter to reduce the result set returned by the
main Solr query.
2018-09-20 17:37:13 +03:00
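A rough sketch of how the query parameters described in this commit could be laid out: the item ID goes in the main query (q), while a filter that is the same for every item goes in fq, where caching it pays off. The `isBot` and `statistics_type` filter fields, the server URL, the core name, and both helper functions are assumptions for illustration and do not come from this commit.

```python
from urllib.parse import urlencode


def item_views_params(item_id):
    """Build Solr select parameters for counting views of one item."""
    return {
        # Per-item part: unique to each request, so it stays out of fq
        "q": f"type:2 AND id:{item_id}",
        # Shared part (assumed fields): identical across items, so caching helps
        "fq": "isBot:false AND statistics_type:view",
        # Only the hit count is needed, not the documents themselves
        "rows": 0,
    }


def select_url(solr_server, core, params):
    """Join server, core and encoded parameters into a select URL."""
    return f"{solr_server}/{core}/select?{urlencode(params)}"


params = item_views_params(123)
# Hypothetical server and core name
url = select_url("http://localhost:8983/solr", "statistics", params)
```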
Alan Orth ed9d25294e
app.py: Use SolrClient's rows parameter
Instead of putting this in the raw query we can just use SolrClient's
native rows parameter.
2018-09-19 12:48:28 +03:00
Alan Orth 5e165d2e88
CHANGELOG.md: Add note about using rows=0 in Solr queries 2018-09-19 01:50:14 +03:00
Alan Orth 8e29fd8a43
app.py: Use rows=0 for Solr queries
There is no need to return any rows of the result because I am only
interested in the numFound.
2018-09-19 01:48:35 +03:00
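As a small illustration of why rows=0 is enough, the sketch below reads numFound from a response body. The body is made up, shaped like the output of Solr's JSON response writer, and `count_from_response` is a hypothetical helper rather than the project's code.

```python
import json

# Made-up response: with rows=0 the docs list is empty, but the
# total number of matching documents is still reported
sample_body = json.dumps(
    {
        "responseHeader": {"status": 0},
        "response": {"numFound": 1234, "start": 0, "docs": []},
    }
)


def count_from_response(body):
    """Extract the total hit count from a Solr JSON response body."""
    return json.loads(body)["response"]["numFound"]


views = count_from_response(sample_body)
```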
Alan Orth 24af83b03f
CHANGELOG.md: Add note about simplified Solr query 2018-09-19 00:30:28 +03:00
Alan Orth a87aaba812
app.py: Simplify Solr query for bitstream downloads
This whole business with negative query ranges is confusing as hell
and I'll definitely forget it in the future. In DSpace's Solr
terminology a "download" is a view to some bitstream that lives in the
ORIGINAL bundle. This is where bitstreams that are uploaded during
the item submission process go, versus generated thumbnails, etc.
2018-09-19 00:24:23 +03:00
Alan Orth 57faec59c8
CHANGELOG.md: Add note about config refactor 2018-09-18 17:01:24 +03:00
Alan Orth 06ab254017
Refactor configuration into separate module
There is a good example of this in the Project Weekend GitHub profile.

See: https://github.com/projectweekend/Falcon-PostgreSQL-API-Seed
2018-09-18 16:59:28 +03:00
Alan Orth 5b5cab8b34
README.md: Update todo 2018-09-18 15:59:27 +03:00
Alan Orth 40ce3c72a9
CHANGELOG.md: Update for version 0.0.2 2018-09-18 15:36:56 +03:00
Alan Orth ea2283355b
Add CHANGELOG.md
See: https://keepachangelog.com/en/1.0.0/
2018-09-18 15:35:42 +03:00
Alan Orth 4b4a959a1c
Add ability to get Solr parameters from environment
You can use the SOLR_SERVER and SOLR_CORE variables to make deployment
via systemd, etc. easier.
2018-09-18 15:34:25 +03:00
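Reading these two variables from the environment could look something like the sketch below. The variable names come from the commit message; the fallback values and the `load_solr_config` helper are assumptions, not the project's defaults.

```python
import os


def load_solr_config(environ=os.environ):
    """Read Solr settings from the environment, with fallback values."""
    return {
        # Fallback URL is a placeholder, not a known project default
        "server": environ.get("SOLR_SERVER", "http://localhost:8080/solr"),
        # Fallback core name is likewise an assumption
        "core": environ.get("SOLR_CORE", "statistics"),
    }


# Passing a plain dict instead of os.environ makes the behavior easy to see
overridden = load_solr_config({"SOLR_SERVER": "http://solr.example.org/solr"})
defaults = load_solr_config({})
```

A systemd unit can then set the variables with `Environment=` lines and the application code does not need to change between hosts.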
Alan Orth 1e16beed30
README.md: Add todo list 2018-09-18 14:19:14 +03:00
Alan Orth 182e13efca
Add GPLv3 license 2018-09-18 14:16:07 +03:00
Alan Orth fe43423256
README.md: Update introduction 2018-09-18 14:11:29 +03:00
Alan Orth 4d610e04b7
Add .gitignore 2018-09-18 14:09:53 +03:00
Alan Orth 6c66303b45
Add README.md 2018-09-18 14:09:29 +03:00