CHANGELOG.md: Release version 0.2.1

app.py: Remove comment
This comment was added when I first began the application and the testing status is documented in the README now.
10 changed files with 308 additions and 30 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,39 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.2.1] - 2018-09-24
+### Changed
+- Environment settings in example systemd unit files
+- Use psycopg2.extras.DictCursor for PostgreSQL connection
+
+## [0.2.0] - 2018-09-24
+### Changed
+- Use PostgreSQL instead of SQLite because UPSERT support needs a very new libsqlite3 whereas it's already in PostgreSQL 9.5+
+
+## [0.1.0] - 2018-09-24
+### Changed
+- Rename project to "DSpace Statistics API"
+- Use read-only database connection in API
+- Update systemd units for CGSpace→DSpace rename
+- Use UPSERT to simplify database schema and Python logic
+
+### Added
+- Example systemd service and timer unit for indexer service
+- Add top-level route to expose all item statistics
+
+### Removed
+- Ability to customize SOLR_CORE variable
+
+## [0.0.4] - 2018-09-23
+### Added
+- Added example systemd unit file for API
+- Added indexer.py to ingest views and downloads from Solr to a SQLite database
+
+### Changed
+- Refactor Solr configuration and connection
+- /item route now expects id as part of the URI instead of a query parameter: /item/id
+- View and download stats are now fetched from a SQLite database
+
 ## [0.0.3] - 2018-09-20
 ### Changed
 - Refactor environment variables into config module
--- a/README.md
+++ b/README.md
@@ -1,19 +1,22 @@
-# CGSpace Statistics API
+# DSpace Statistics API
 A quick and dirty REST API to expose Solr view and download statistics for items in a DSpace repository.

-Written and tested in Python 3.6. SolrClient (0.2.1) does not currently run in Python 3.7.0.
+Written and tested in Python 3.6. SolrClient (0.2.1) does not currently run in Python 3.7.0. Requires PostgreSQL version 9.5 or greater for [`UPSERT` support](https://wiki.postgresql.org/wiki/UPSERT).

 ## Installation
 Create a virtual environment and run it:

    $ virtualenv -p /usr/bin/python3.6 venv
    $ . venv/bin/activate
-    $ pip install falcon gunicorn SolrClient
+    $ pip install falcon gunicorn SolrClient psycopg2-binary
    $ gunicorn app:api

 ## Todo

- Ability to return a paginated list of items (on a different route?)
+- Add API documentation
+- Close up DB connection when gunicorn shuts down gracefully
+- Better logging
+- Return HTTP 404 when item_id is nonexistent

 ## License
 This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
--- a/app.py
+++ b/app.py
@@ -1,44 +1,65 @@
-# Tested with Python 3.6
-# See DSpace Solr docs for tips about parameters
-# https://wiki.duraspace.org/display/DSPACE/Solr
-
-from config import SOLR_SERVER
-from config import SOLR_CORE
+from database import database_connection
 import falcon
-from SolrClient import SolrClient
+from solr import solr_connection

+db = database_connection()
+db.set_session(readonly=True)
+solr = solr_connection()

-class ItemResource:
+class AllItemsResource:
    def on_get(self, req, resp):
        """Handles GET requests"""
        # Return HTTPBadRequest if id parameter is not present and valid
-        item_id = req.get_param_as_int("id", required=True, min=0)
+        limit = req.get_param_as_int("limit", min=0, max=100) or 100
+        page = req.get_param_as_int("page", min=0) or 0
+        offset = limit * page

-        solr = SolrClient(SOLR_SERVER)
+        cursor = db.cursor()

-        # Get views
-        res = solr.query(SOLR_CORE, {
-            'q':'type:2 AND id:{0}'.format(item_id),
-            'fq':'isBot:false AND statistics_type:view'
-        }, rows=0)
+        # get total number of items so we can estimate the pages
+        cursor.execute('SELECT COUNT(id) FROM items')
+        pages = round(cursor.fetchone()[0] / limit)

-        views = res.get_num_found()
+        # get statistics, ordered by id, and use limit and offset to page through results
+        cursor.execute('SELECT id, views, downloads FROM items ORDER BY id ASC LIMIT {} OFFSET {}'.format(limit, offset))
+        results = cursor.fetchmany(limit)
+        cursor.close()

-        # Get downloads
-        res = solr.query(SOLR_CORE, {
-            'q':'type:0 AND owningItem:{0}'.format(item_id),
-            'fq':'isBot:false AND statistics_type:view AND bundleName:ORIGINAL'
-        }, rows=0)
+        # create a list to hold dicts of item stats
+        statistics = list()

-        downloads = res.get_num_found() 
+        # iterate over results and build statistics object
+        for item in results:
+            statistics.append({ 'id': item['id'], 'views': item['views'], 'downloads': item['downloads'] })
+
+        message = {
+                'currentPage': page,
+                'totalPages': pages,
+                'limit': limit,
+                'statistics': statistics
+        }
+
+        resp.media = message
+
+class ItemResource:
+    def on_get(self, req, resp, item_id):
+        """Handles GET requests"""
+
+        cursor = db.cursor()
+        cursor.execute('SELECT views, downloads FROM items WHERE id={}'.format(item_id))
+        results = cursor.fetchone()
+        cursor.close()

        statistics = {
            'id': item_id,
-            'views': views,
-            'downloads': downloads
+            'views': results['views'],
+            'downloads': results['downloads']
        }

        resp.media = statistics

 api = falcon.API()
-api.add_route('/item', ItemResource())
+api.add_route('/', AllItemsResource())
+api.add_route('/item/{item_id:int}', ItemResource())
+
+# vim: set sw=4 ts=4 expandtab:
--- a/config.py
+++ b/config.py
@@ -2,4 +2,10 @@ import os

 # Check if Solr connection information was provided in the environment
 SOLR_SERVER = os.environ.get('SOLR_SERVER', 'http://localhost:8080/solr')
-SOLR_CORE = os.environ.get('SOLR_CORE', 'statistics')
+
+DATABASE_NAME = os.environ.get('DATABASE_NAME', 'dspacestatistics')
+DATABASE_USER = os.environ.get('DATABASE_USER', 'dspacestatistics')
+DATABASE_PASS = os.environ.get('DATABASE_PASS', 'dspacestatistics')
+DATABASE_HOST = os.environ.get('DATABASE_HOST', 'localhost')
+
+# vim: set sw=4 ts=4 expandtab:
--- a/contrib/dspace-statistics-api.service
+++ b/contrib/dspace-statistics-api.service
@@ -0,0 +1,20 @@
+[Unit]
+Description=DSpace Statistics API
+After=network.target
+
+[Service]
+Environment=DATABASE_NAME=dspacestatistics
+Environment=DATABASE_USER=dspacestatistics
+Environment=DATABASE_PASS=dspacestatistics
+Environment=DATABASE_HOST=localhost
+User=nobody
+Group=nogroup
+WorkingDirectory=/opt/ilri/dspace-statistics-api
+ExecStart=/opt/ilri/dspace-statistics-api/venv/bin/gunicorn \
+          --bind 127.0.0.1:5000                             \
+          app:api
+ExecReload=/bin/kill -s HUP $MAINPID
+ExecStop=/bin/kill -s TERM $MAINPID
+
+[Install]
+WantedBy=multi-user.target
--- a/contrib/dspace-statistics-indexer.service
+++ b/contrib/dspace-statistics-indexer.service
@@ -0,0 +1,17 @@
+[Unit]
+Description=DSpace Statistics Indexer
+After=tomcat7.target
+
+[Service]
+Environment=SOLR_SERVER=http://localhost:8081/solr
+Environment=DATABASE_NAME=dspacestatistics
+Environment=DATABASE_USER=dspacestatistics
+Environment=DATABASE_PASS=dspacestatistics
+Environment=DATABASE_HOST=localhost
+User=nobody
+Group=nogroup
+WorkingDirectory=/opt/ilri/dspace-statistics-api
+ExecStart=/opt/ilri/dspace-statistics-api/venv/bin/python indexer.py
+
+[Install]
+WantedBy=multi-user.target
--- a/contrib/dspace-statistics-indexer.timer
+++ b/contrib/dspace-statistics-indexer.timer
@@ -0,0 +1,12 @@
+[Unit]
+Description=DSpace Statistics Indexer
+
+[Timer]
+# twice a day, at 6AM and 6PM
+OnCalendar=*-*-* 06:00:00,18:00:00
+# Add a random delay of 0–3600 seconds
+RandomizedDelaySec=3600
+Persistent=true
+
+[Install]
+WantedBy=timers.target
--- a/database.py
+++ b/database.py
@@ -0,0 +1,13 @@
+from config import DATABASE_NAME
+from config import DATABASE_USER
+from config import DATABASE_PASS
+from config import DATABASE_HOST
+import psycopg2
+import psycopg2.extras
+
+def database_connection():
+    connection = psycopg2.connect("dbname={} user={} password={} host='{}'".format(DATABASE_NAME, DATABASE_USER, DATABASE_PASS, DATABASE_HOST), cursor_factory=psycopg2.extras.DictCursor)
+
+    return connection
+
+# vim: set sw=4 ts=4 expandtab:
--- a/indexer.py
+++ b/indexer.py
@@ -0,0 +1,144 @@
+#!/usr/bin/env python
+#
+# indexer.py
+#
+# Copyright 2018 Alan Orth.
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# ---
+#
+# Connects to a DSpace Solr statistics core and ingests item views and downloads
+# into a Postgres database for use with other applications (an API, for example).
+#
+# This script is written for Python 3 and requires several modules that you can
+# install with pip (I recommend setting up a Python virtual environment first):
+#
+#   $ pip install SolrClient psycopg2-binary
+#
+# See: https://solrclient.readthedocs.io/en/latest/SolrClient.html
+# See: https://wiki.duraspace.org/display/DSPACE/Solr
+#
+# Tested with Python 3.5 and 3.6.
+
+from database import database_connection
+from solr import solr_connection
+
+def index_views():
+    print("Populating database with item views.")
+
+    # determine the total number of items with views (aka Solr's numFound)
+    res = solr.query('statistics', {
+        'q':'type:2',
+        'fq':'isBot:false AND statistics_type:view',
+        'facet':True,
+        'facet.field':'id',
+    }, rows=0)
+
+    # divide results into "pages" (numFound / 100)
+    results_numFound = res.get_num_found()
+    results_per_page = 100
+    results_num_pages = round(results_numFound / results_per_page)
+    results_current_page = 0
+
+    cursor = db.cursor()
+
+    while results_current_page <= results_num_pages:
+        print('Page {} of {}.'.format(results_current_page, results_num_pages))
+
+        res = solr.query('statistics', {
+            'q':'type:2',
+            'fq':'isBot:false AND statistics_type:view',
+            'facet':True,
+            'facet.field':'id',
+            'facet.limit':results_per_page,
+            'facet.offset':results_current_page * results_per_page
+        })
+
+        # make sure total number of results > 0
+        if res.get_num_found() > 0:
+            # SolrClient's get_facets() returns a dict of dicts
+            views = res.get_facets()
+            # in this case iterate over the 'id' dict and get the item ids and views
+            for item_id, item_views in views['id'].items():
+                cursor.execute('''INSERT INTO items(id, views) VALUES(%s, %s)
+                               ON CONFLICT(id) DO UPDATE SET views=excluded.views''',
+                               (item_id, item_views))
+
+        db.commit()
+
+        results_current_page += 1
+
+    cursor.close()
+
+def index_downloads():
+    print("Populating database with item downloads.")
+
+    # determine the total number of items with downloads (aka Solr's numFound)
+    res = solr.query('statistics', {
+        'q':'type:0',
+        'fq':'isBot:false AND statistics_type:view AND bundleName:ORIGINAL',
+        'facet':True,
+        'facet.field':'owningItem',
+    }, rows=0)
+
+    # divide results into "pages" (numFound / 100)
+    results_numFound = res.get_num_found()
+    results_per_page = 100
+    results_num_pages = round(results_numFound / results_per_page)
+    results_current_page = 0
+
+    cursor = db.cursor()
+
+    while results_current_page <= results_num_pages:
+        print('Page {} of {}.'.format(results_current_page, results_num_pages))
+
+        res = solr.query('statistics', {
+            'q':'type:0',
+            'fq':'isBot:false AND statistics_type:view AND bundleName:ORIGINAL',
+            'facet':True,
+            'facet.field':'owningItem',
+            'facet.limit':results_per_page,
+            'facet.offset':results_current_page * results_per_page
+        })
+
+        # make sure total number of results > 0
+        if res.get_num_found() > 0:
+            # SolrClient's get_facets() returns a dict of dicts
+            downloads = res.get_facets()
+            # in this case iterate over the 'owningItem' dict and get the item ids and downloads
+            for item_id, item_downloads in downloads['owningItem'].items():
+                cursor.execute('''INSERT INTO items(id, downloads) VALUES(%s, %s)
+                               ON CONFLICT(id) DO UPDATE SET downloads=excluded.downloads''',
+                               (item_id, item_downloads))
+
+        db.commit()
+
+        results_current_page += 1
+
+    cursor.close()
+
+db = database_connection()
+solr = solr_connection()
+
+# create table to store item views and downloads
+cursor = db.cursor()
+cursor.execute('''CREATE TABLE IF NOT EXISTS items
+                  (id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)''')
+index_views()
+index_downloads()
+
+db.close()
+
+# vim: set sw=4 ts=4 expandtab:
--- a/solr.py
+++ b/solr.py
@@ -0,0 +1,9 @@
+from config import SOLR_SERVER
+from SolrClient import SolrClient
+
+def solr_connection():
+    connection = SolrClient(SOLR_SERVER)
+
+    return connection
+
+# vim: set sw=4 ts=4 expandtab:
Author	SHA1	Message	Date
Alan Orth	87dbb6c4df	CHANGELOG.md: Release version 0.2.1	2018-09-25 02:21:44 +03:00
Alan Orth	3160c44566	app.py: Remove comment This comment was added when I first began the application and the testing status is documented in the README now.	2018-09-25 02:20:51 +03:00
Alan Orth	4b72f626d9	Update string substitution format Instead of doing numbered strings I will just depend on the order, at least to be consistent.	2018-09-25 02:19:29 +03:00
Alan Orth	2d3b7620e3	CHANGELOG.md: Add note about psycopg2.extras.DictCursor	2018-09-25 02:08:54 +03:00
Alan Orth	6e4bc630f7	database.py: Use psycopg2.extras.DictCursor This allows us to access records using their column name. I didn't notice that this was not working, as I had been testing the wrong server! See: http://initd.org/psycopg/docs/extras.html	2018-09-25 02:06:29 +03:00
Alan Orth	44884140e5	CHANGELOG.md: Add new unreleased changes	2018-09-25 01:11:37 +03:00
Alan Orth	74ff86ee3b	contrib: Update environment settings in system units	2018-09-25 01:10:14 +03:00
Alan Orth	3327884f21	Update docs to remove SQLite stuff I've decided to use PostgreSQL instead of SQLite because the UPSERT support is available in versions of PostgreSQL we're already running, whereas SQLite needs a VERY new (3.24.0) version that is not available on any recent long-term support Ubuntu releases.	2018-09-25 00:56:01 +03:00
Alan Orth	8f7450f67a	Use PostgreSQL instead of SQLite I was very surprised how easy and fast and robust SQLite was, but in the end I realized that its UPSERT support only came in version 3.24 and both Ubuntu 16.04 and 18.04 have older versions than that! I did manage to install libsqlite3-0 from Ubuntu 18.04 cosmic on my xenial host, but that feels dirty. PostgreSQL has support for UPSERT since 9.5, not to mention the same nice LIMIT and OFFSET clauses.	2018-09-25 00:49:47 +03:00
Alan Orth	28d61fb041	README.md: Add notes about Python and SQLite versions	2018-09-24 17:26:48 +03:00
Alan Orth	cbc98991b4	CHANGELOG.md: Move unreleased notes to version 0.1.0	2018-09-24 16:14:14 +03:00
Alan Orth	6c28be0463	README.md: Add note about route for all items	2018-09-24 16:13:26 +03:00
Alan Orth	42e8f17305	CHANGELOG.md: Add note about route for all items	2018-09-24 16:13:05 +03:00
Alan Orth	19a45f3f6f	app.py: Add route to page through all item statistics This route exposes all item statistics and uses the limit and offset parameters to control paging through the result set. The logic here is extremely easy thanks to the brilliant LIMIT and OFFSET features of SQLite (of course the SQL query sorts the results by some unique field to ensure the order is already the same).	2018-09-24 16:07:26 +03:00
Alan Orth	505ef31101	CHANGELOG.md: Add note about UPSERT	2018-09-24 14:31:05 +03:00
Alan Orth	1543cacc54	app.py: Update SQL logic to use single table The indexer.py script was updated to use a single table because I learned about UPSERT. This simplifies the database schema and the Python logic, and makes it easier to page all views and downloads at once without complicated JOIN queries.	2018-09-24 14:28:00 +03:00
Alan Orth	2cab456f16	indexer.py: Use single items table with UPSERT I was using two separate tables for item views and downloads without realizing that SQLite didn't support FULL OUTER JOIN, which would be needed to get views and downloads for a given item in a single query. Instead I can use one table with a default value of 0 for both views and downloads, and then use "UPSERT" to populate the statistics. This is a newish SQL concept that allows you to attempt an INSERT and then specify an action to perform in case of conflict. This works well in SQLite and actually simplifies my Python logic greatly! Note that the "excluded" table qualifier is a special keyword that allows you to reference the value that would have been inserted. See: https://www.sqlite.org/lang_UPSERT.html	2018-09-24 14:19:50 +03:00
Alan Orth	53615dea2d	indexer.py: Add license and documentation	2018-09-24 09:18:50 +03:00
Alan Orth	2d8d1e6833	README.md: Add TODO for nonexistent items	2018-09-24 00:48:02 +03:00
Alan Orth	e26e595ea1	README.md: Add more TODOs	2018-09-24 00:35:00 +03:00
Alan Orth	a9151b5bbf	CHANGELOG.md: Update unreleased notes	2018-09-24 00:30:58 +03:00
Alan Orth	76833d6f5f	contrib: Update some old CGSpace references to DSpace	2018-09-24 00:30:26 +03:00
Alan Orth	a51422273c	Remove SOLR_CORE configuration variable This parameter is not customizable. All DSpace instances use this name for the Solr statistics core.	2018-09-24 00:20:54 +03:00
Alan Orth	89621af85d	Split database access into RW and RO The indexer needs to be able to write to the database, but the API only needs to read it.	2018-09-24 00:00:05 +03:00
Alan Orth	c554404d7f	CHANGELOG.md: Add systemd units for indexer	2018-09-23 23:15:27 +03:00
Alan Orth	90d7a452bd	contrib: Add systemd units for indexer An example systemd service unit for the indexer and an accompanying timer unit.	2018-09-23 23:13:43 +03:00
Alan Orth	431a1c9d64	CHANGELOG.md: Add unreleased changes	2018-09-23 23:04:01 +03:00
Alan Orth	e1b9d1284f	Rename project to DSpace Statistics API At first I called it "CGSpace" because I was making it specifically for our CGSpace DSpace repository, but the potential here is bigger than that!	2018-09-23 23:02:21 +03:00
Alan Orth	bac764a0a4	CHANGELOG.md: Move entries to version 0.0.4	2018-09-23 16:49:25 +03:00
Alan Orth	1a650e57c0	CHANGELOG.md: Update unreleased features	2018-09-23 16:48:39 +03:00
Alan Orth	2db5e02be9	Add indexer.py Standalone script to ingest item views and downloads from Solr into SQLite.	2018-09-23 16:47:48 +03:00
Alan Orth	9e942736b1	app.py: Get item statistics from SQLite database It is much more efficient to cache view and download statistics in a database than to query Solr on demand (not to mention that it is not possible to page easily with facets in Solr). I decided to use SQLite because it is fast, native in Python 3, and doesn't require any extra steps during provisioning (assuming permissions are ok).	2018-09-23 16:47:00 +03:00
Alan Orth	ea85393b13	app.py: Use parameterized URI instead of query for /item Falcon's get_param_as_int() is really nice in that it gets a query parameter and does validation for you, but I really wanted to have cleaner URIs for API routes so I am now using a route URI template with a field converter. This is cleaner, but means that parameters not matching the template will return HTTP 404. See: https://falcon.readthedocs.io/en/stable/api/routing.html#field-converters	2018-09-23 16:23:33 +03:00
Alan Orth	cbeb7c89a7	CHANGELOG.md: Add note about Solr connection refactor	2018-09-23 13:27:43 +03:00
Alan Orth	b0d81a543c	Refactor Solr components This makes it so we only need to define and connect once and then we can re-use the connection everywhere else.	2018-09-23 13:24:30 +03:00
Alan Orth	84801a4ab5	Add vim modeline to all Python files Uses four spaces for tab and shift widths, and turns on expansion of tabs to spaces.	2018-09-23 11:33:26 +03:00
Alan Orth	4e8621e3d9	README.md: Add TODO about API documentation	2018-09-23 09:52:36 +03:00
Alan Orth	2c8430171d	CHANGELOG.md: Add note about systemd unit file	2018-09-23 07:58:15 +03:00
Alan Orth	fb60133713	Add example systemd unit for statistics API	2018-09-23 07:50:04 +03:00
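
The single-table UPSERT approach described in commits 2cab456f16 and 8f7450f67a can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses Python's built-in `sqlite3` with an in-memory database purely so that it runs without a server (this assumes the bundled SQLite is 3.24 or newer), whereas the project connects to PostgreSQL through psycopg2 and uses `%s` placeholders. The helper name `upsert_stat` is hypothetical.

```python
import sqlite3

# In-memory SQLite keeps the sketch self-contained; the project itself
# uses PostgreSQL, which accepts the same ON CONFLICT ... DO UPDATE clause.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE IF NOT EXISTS items "
    "(id INT PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)"
)


def upsert_stat(db, column, item_id, value):
    """Insert a statistic for an item, or update it if the id already exists.

    Hypothetical helper: the column name is interpolated into the SQL, so it
    is checked against a fixed allow-list first.
    """
    if column not in ("views", "downloads"):
        raise ValueError("unknown column: {}".format(column))

    # "excluded" refers to the row that the INSERT attempted to add
    db.execute(
        "INSERT INTO items(id, {0}) VALUES(?, ?) "
        "ON CONFLICT(id) DO UPDATE SET {0}=excluded.{0}".format(column),
        (item_id, value),
    )


# Views and downloads arrive in separate passes, as in indexer.py
upsert_stat(db, "views", 1, 10)
upsert_stat(db, "downloads", 1, 3)
upsert_stat(db, "views", 1, 12)  # a later run overwrites the earlier count
db.commit()

row = db.execute("SELECT id, views, downloads FROM items WHERE id=1").fetchone()
print(row)
```

Each pass touches only its own column, so one row per item ends up holding both statistics and the API can read them with a single `SELECT` instead of a join.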