CHANGELOG.md: Move entries to version 0.0.4

CHANGELOG.md: Update unreleased features
Add indexer.py
2025-05-10 15:16:02 +02:00 · 2018-09-23 16:49:25 +03:00 · 2018-09-23 16:48:39 +03:00 · 2018-09-23 16:47:48 +03:00 · 2018-09-23 16:47:00 +03:00 · 2018-09-23 16:23:33 +03:00
9 changed files with 196 additions and 25 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,3 @@
 __pycache__
 venv
+*.db
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,23 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.0.4] - 2018-09-23
+### Added
+- Added example systemd unit file for API
+- Added indexer.py to ingest views and downloads from Solr to a SQLite database
+
+### Changed
+- Refactor Solr configuration and connection
+- /item route now expects id as part of the URI instead of a query parameter: /item/id
+- View and download stats are now fetched from a SQLite database
+
+## [0.0.3] - 2018-09-20
+### Changed
+- Refactor environment variables into config module
+- Simplify Solr query for "downloads"
+- Optimize Solr query by using rows=0
+- Fix Solr queries for item views
+
 ## [0.0.2] - 2018-09-18
 ### Added
 - Ability to get Solr parameters from environment (`SOLR_SERVER` and `SOLR_CORE`)
--- a/README.md
+++ b/README.md
@@ -13,7 +13,8 @@ Create a virtual environment and run it:

 ## Todo

-- Take a list of items (POST in JSON?)
+- Ability to return a paginated list of items (on a different route?)
+- Add API documentation

 ## License
 This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
--- a/app.py
+++ b/app.py
@@ -2,37 +2,34 @@
 # See DSpace Solr docs for tips about parameters
 # https://wiki.duraspace.org/display/DSPACE/Solr

+from config import SOLR_CORE
+from database import database_connection
 import falcon
-import os
-from SolrClient import SolrClient
+from solr import solr_connection

-# Check if Solr connection information was provided in the environment
-solr_server = os.environ.get('SOLR_SERVER', 'http://localhost:8080/solr')
-solr_core = os.environ.get('SOLR_CORE', 'statistics')
+db = database_connection()
+solr = solr_connection()

 class ItemResource:
-    def on_get(self, req, resp):
+    def on_get(self, req, resp, item_id):
        """Handles GET requests"""
-        # Return HTTPBadRequest if id parameter is not present and valid
-        item_id = req.get_param_as_int("id", required=True, min=0)

-        solr = SolrClient(solr_server)
+        cursor = db.cursor()
+        # get item views (and catch the TypeError if item doesn't have any views)
+        cursor.execute('SELECT views FROM itemviews WHERE id={0}'.format(item_id))
+        try:
+            views = cursor.fetchone()['views']
+        except:
+            views = 0

-        # Get views
-        res = solr.query(solr_core, {
-            'q':'type:0',
-            'fq':'owningItem:{0} AND isBot:false AND statistics_type:view AND -bundleName:ORIGINAL'.format(item_id)
-        })
+        # get item downloads (and catch the TypeError if item doesn't have any downloads)
+        cursor.execute('SELECT downloads FROM itemdownloads WHERE id={0}'.format(item_id))
+        try:
+            downloads = cursor.fetchone()['downloads']
+        except:
+            downloads = 0

-        views = res.get_num_found()
-
-        # Get downloads
-        res = solr.query(solr_core, {
-            'q':'type:0',
-            'fq':'owningItem:{0} AND isBot:false AND statistics_type:view AND -(bundleName:[* TO *] -bundleName:ORIGINAL)'.format(item_id)
-        })
-
-        downloads = res.get_num_found() 
+        cursor.close()

        statistics = {
            'id': item_id,
@@ -43,4 +40,6 @@ class ItemResource:
        resp.media = statistics

 api = falcon.API()
-api.add_route('/item', ItemResource())
+api.add_route('/item/{item_id:int}', ItemResource())
+
+# vim: set sw=4 ts=4 expandtab:
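
Review note on the app.py change above: the lookups build SQL with `str.format()` and rely on a bare `except` to detect a missing row. The `{item_id:int}` route converter keeps this safe in practice, but a bound parameter and an explicit `None` check express the same intent more directly. The following is a minimal sketch, not the author's implementation: `get_count` is a hypothetical helper, and the in-memory database stands in for `statistics.db`.

```python
import sqlite3

def get_count(db, table, column, item_id):
    """Return the stored count for item_id, or 0 when the item has no row."""
    cursor = db.cursor()
    # table and column are fixed by the calling code; only item_id is bound
    # as a parameter, so it is never interpolated into the SQL text
    cursor.execute('SELECT {0} FROM {1} WHERE id=?'.format(column, table), (item_id,))
    row = cursor.fetchone()
    cursor.close()
    # fetchone() returns None when no row matches, so no exception is needed
    return row[column] if row is not None else 0

# in-memory database used only for illustration
db = sqlite3.connect(':memory:')
db.row_factory = sqlite3.Row
db.execute('CREATE TABLE itemviews (id integer primary key, views integer)')
db.execute('INSERT INTO itemviews VALUES (?, ?)', (42, 17))

views = get_count(db, 'itemviews', 'views', 42)
missing = get_count(db, 'itemviews', 'views', 99)
```

With a helper like this, `on_get` could call it once for `itemviews` and once for `itemdownloads` instead of repeating the try/except block.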
--- a/config.py
+++ b/config.py
@@ -0,0 +1,9 @@
+import os
+
+# Check if Solr connection information was provided in the environment
+SOLR_SERVER = os.environ.get('SOLR_SERVER', 'http://localhost:8080/solr')
+SOLR_CORE = os.environ.get('SOLR_CORE', 'statistics')
+
+SQLITE_DB = os.environ.get('SQLITE_DB', 'statistics.db')
+
+# vim: set sw=4 ts=4 expandtab:
--- a/contrib/cgspace-statistics-api.service
+++ b/contrib/cgspace-statistics-api.service
@@ -0,0 +1,18 @@
+[Unit]
+Description=CGSpace Statistics API
+After=network.target
+
+[Service]
+Environment=SOLR_SERVER=http://localhost:8081/solr
+Environment=SOLR_CORE=statistics
+User=nobody
+Group=nogroup
+WorkingDirectory=/opt/ilri/cgspace-statistics-api
+ExecStart=/opt/ilri/cgspace-statistics-api/venv/bin/gunicorn \
+          --bind 127.0.0.1:5000                              \
+          app:api
+ExecReload=/bin/kill -s HUP $MAINPID
+ExecStop=/bin/kill -s TERM $MAINPID
+
+[Install]
+WantedBy=multi-user.target
--- a/database.py
+++ b/database.py
@@ -0,0 +1,11 @@
+from config import SQLITE_DB
+import sqlite3
+
+def database_connection():
+    connection = sqlite3.connect(SQLITE_DB)
+    # allow iterating over row results by column key
+    connection.row_factory = sqlite3.Row
+
+    return connection
+
+# vim: set sw=4 ts=4 expandtab:
--- a/indexer.py
+++ b/indexer.py
@@ -0,0 +1,106 @@
+#!/usr/bin/env python
+#
+# Tested with Python 3.6
+# See DSpace Solr docs for tips about parameters
+# https://wiki.duraspace.org/display/DSPACE/Solr
+
+from config import SOLR_CORE
+from database import database_connection
+from solr import solr_connection
+
+def index_views():
+    print("Populating database with item views.")
+
+    # determine the total number of items with views (aka Solr's numFound)
+    res = solr.query(SOLR_CORE, {
+        'q':'type:2',
+        'fq':'isBot:false AND statistics_type:view',
+        'facet':True,
+        'facet.field':'id',
+    }, rows=0)
+
+    # divide results into "pages" (numFound / 100)
+    results_numFound = res.get_num_found()
+    results_per_page = 100
+    results_num_pages = round(results_numFound / results_per_page)
+    results_current_page = 0
+
+    while results_current_page <= results_num_pages:
+        print('Page {0} of {1}.'.format(results_current_page, results_num_pages))
+
+        res = solr.query(SOLR_CORE, {
+            'q':'type:2',
+            'fq':'isBot:false AND statistics_type:view',
+            'facet':True,
+            'facet.field':'id',
+            'facet.limit':results_per_page,
+            'facet.offset':results_current_page * results_per_page
+        })
+
+        # make sure total number of results > 0
+        if res.get_num_found() > 0:
+            # SolrClient's get_facets() returns a dict of dicts
+            views = res.get_facets()
+            # in this case iterate over the 'id' dict and get the item ids and views
+            for item_id, item_views in views['id'].items():
+                db.execute('''REPLACE INTO itemviews VALUES (?, ?)''', (item_id, item_views))
+
+        db.commit()
+
+        results_current_page += 1
+
+def index_downloads():
+    print("Populating database with item downloads.")
+
+    # determine the total number of items with downloads (aka Solr's numFound)
+    res = solr.query(SOLR_CORE, {
+        'q':'type:0',
+        'fq':'isBot:false AND statistics_type:view AND bundleName:ORIGINAL',
+        'facet':True,
+        'facet.field':'owningItem',
+    }, rows=0)
+
+    # divide results into "pages" (numFound / 100)
+    results_numFound = res.get_num_found()
+    results_per_page = 100
+    results_num_pages = round(results_numFound / results_per_page)
+    results_current_page = 0
+
+    while results_current_page <= results_num_pages:
+        print('Page {0} of {1}.'.format(results_current_page, results_num_pages))
+
+        res = solr.query(SOLR_CORE, {
+            'q':'type:0',
+            'fq':'isBot:false AND statistics_type:view AND bundleName:ORIGINAL',
+            'facet':True,
+            'facet.field':'owningItem',
+            'facet.limit':results_per_page,
+            'facet.offset':results_current_page * results_per_page
+        })
+
+        # make sure total number of results > 0
+        if res.get_num_found() > 0:
+            # SolrClient's get_facets() returns a dict of dicts
+            downloads = res.get_facets()
+            # in this case iterate over the 'owningItem' dict and get the item ids and downloads
+            for item_id, item_downloads in downloads['owningItem'].items():
+                db.execute('''REPLACE INTO itemdownloads VALUES (?, ?)''', (item_id, item_downloads))
+
+        db.commit()
+
+        results_current_page += 1
+
+db = database_connection()
+solr = solr_connection()
+
+# use separate views and downloads tables so we can REPLACE INTO carelessly (ie, item may have views but no downloads)
+db.execute('''CREATE TABLE IF NOT EXISTS itemviews
+                  (id integer primary key, views integer)''')
+db.execute('''CREATE TABLE IF NOT EXISTS itemdownloads
+                  (id integer primary key, downloads integer)''')
+index_views()
+index_downloads()
+
+db.close()
+
+# vim: set sw=4 ts=4 expandtab:
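
Review note on the paging loops in indexer.py above: `round(numFound / 100)` combined with `while current <= num_pages` does cover every page, but it can request one trailing empty page, and Solr's `numFound` counts matching documents rather than distinct facet values. A possible alternative is sketched below, assuming `total` is the number of distinct facet values. `facet_offsets` and `store_counts` are hypothetical helpers, and the in-memory database is only for illustration.

```python
import math
import sqlite3

def facet_offsets(total, per_page=100):
    """Yield the facet.offset for each page needed to cover `total` facet values."""
    num_pages = math.ceil(total / per_page)
    for page in range(num_pages):
        yield page * per_page

def store_counts(db, table, counts):
    """Upsert {item_id: count} pairs into a two-column table."""
    # the table name is interpolated, so it must come from trusted code;
    # REPLACE INTO overwrites any existing row with the same primary key
    db.executemany('REPLACE INTO {0} VALUES (?, ?)'.format(table), counts.items())
    db.commit()

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE IF NOT EXISTS itemviews (id integer primary key, views integer)')

store_counts(db, 'itemviews', {10: 5, 11: 7})
store_counts(db, 'itemviews', {10: 6})  # re-indexing replaces the old count for item 10
```

Because both tables use `id` as the primary key, running the indexer repeatedly refreshes counts in place rather than accumulating duplicate rows, which matches the intent stated in the script's comment about using `REPLACE INTO`.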
--- a/solr.py
+++ b/solr.py
@@ -0,0 +1,9 @@
+from config import SOLR_SERVER
+from SolrClient import SolrClient
+
+def solr_connection():
+    connection = SolrClient(SOLR_SERVER)
+
+    return connection
+
+# vim: set sw=4 ts=4 expandtab:
Author	SHA1	Message	Date
Alan Orth	bac764a0a4	CHANGELOG.md: Move entries to version 0.0.4	2018-09-23 16:49:25 +03:00
Alan Orth	1a650e57c0	CHANGELOG.md: Update unreleased features	2018-09-23 16:48:39 +03:00
Alan Orth	2db5e02be9	Add indexer.py Standalone script to ingest item views and downloads from Solr into SQLite.	2018-09-23 16:47:48 +03:00
Alan Orth	9e942736b1	app.py: Get item statistics from SQLite database It is much more efficient to cache view and download statistics in a database than to query Solr on demand (not to mention that it is not possible to page easily with facets in Solr). I decided to use SQLite because it is fast, native in Python 3, and doesn't require any extra steps during provisioning (assuming permissions are ok).	2018-09-23 16:47:00 +03:00
Alan Orth	ea85393b13	app.py: Use parameterized URI instead of query for /item Falcon's get_param_as_int() is really nice in that it gets a query parameter and does validation for you, but I really wanted to have cleaner URIs for API routes so I am now using a route URI template with a field converter. This is cleaner, but means that parameters not matching the template will return HTTP 404. See: https://falcon.readthedocs.io/en/stable/api/routing.html#field-converters	2018-09-23 16:23:33 +03:00
Alan Orth	cbeb7c89a7	CHANGELOG.md: Add note about Solr connection refactor	2018-09-23 13:27:43 +03:00
Alan Orth	b0d81a543c	Refactor Solr components This makes it so we only need to define and connect once and then we can re-use the connection everywhere else.	2018-09-23 13:24:30 +03:00
Alan Orth	84801a4ab5	Add vim modeline to all Python files Uses four spaces for tab and shift widths, and turns on expansion of tabs to spaces.	2018-09-23 11:33:26 +03:00
Alan Orth	4e8621e3d9	README.md: Add TODO about API documentation	2018-09-23 09:52:36 +03:00
Alan Orth	2c8430171d	CHANGELOG.md: Add note about systemd unit file	2018-09-23 07:58:15 +03:00
Alan Orth	fb60133713	Add example systemd unit for statistics API	2018-09-23 07:50:04 +03:00
Alan Orth	9e01a80011	CHANGELOG.md: Move changes to version 0.0.3	2018-09-20 17:41:47 +03:00
Alan Orth	a263996582	app.py: Fix Solr queries for item views According to dspace-api's Constants.java, items are type 2 and they use a unique ID field of `id` instead of `owningItem`. There is no need to check the bundleName for item types. Also, I decided to use the main Solr query for item IDs because the filter query parameter (fq) stores results in the filterCache and can be quite expensive with cores storing tens of millions of documents (we currently have 149 million docs!). It makes sense to use the filter query parameter to reduce the result set returned by the main Solr query.	2018-09-20 17:37:13 +03:00
Alan Orth	ed9d25294e	app.py: Use SolrClient's rows parameter Instead of putting this in the raw query we can just use SolrClient's native rows parameter.	2018-09-19 12:48:28 +03:00
Alan Orth	5e165d2e88	CHANGELOG.md: Add note about using rows=0 in Solr queries	2018-09-19 01:50:14 +03:00
Alan Orth	8e29fd8a43	app.py: Use rows=0 for Solr queries There is no need to return any rows of the result because I am only interested in the numFound.	2018-09-19 01:48:35 +03:00
Alan Orth	24af83b03f	CHANGELOG.md: Add note about simplified Solr query	2018-09-19 00:30:28 +03:00
Alan Orth	a87aaba812	app.py: Simplify Solr query for bitstream downloads This whole business with negative query ranges is confusing as hell and I'll definitely forget it in the future. In DSpace's Solr terminology a "download" is a view to some bitstream that lives in the ORIGINAL bundle. This is where bitstreams that are uploaded during the item submission process go, versus generated thumbnails, etc.	2018-09-19 00:24:23 +03:00
Alan Orth	57faec59c8	CHANGELOG.md: Add note about config refactor	2018-09-18 17:01:24 +03:00
Alan Orth	06ab254017	Refactor configuration into separate module There is a good example of this in the Project Weekend GitHub profile. See: https://github.com/projectweekend/Falcon-PostgreSQL-API-Seed	2018-09-18 16:59:28 +03:00
Alan Orth	5b5cab8b34	README.md: Update todo	2018-09-18 15:59:27 +03:00
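
The commit messages for a263996582 and a87aaba812 describe how the per-item Solr queries were split between the main query and the filter query. A rough sketch of that split follows. The exact `q` strings are an assumption reconstructed from those messages (the intermediate app.py is not shown in this diff), and both function names are hypothetical. The `fq` strings are the ones that appear in indexer.py.

```python
def item_views_params(item_id):
    """Solr parameters counting views of one item (type 2, keyed by `id`)."""
    return {
        # assumed shape: the per-item clause goes in the main query, since
        # caching one filterCache entry per item would be expensive
        'q': 'type:2 AND id:{0}'.format(int(item_id)),
        # restrictions shared by every request go in the filter query
        'fq': 'isBot:false AND statistics_type:view',
    }

def item_downloads_params(item_id):
    """Solr parameters counting downloads: bitstream views (type 0) in the ORIGINAL bundle."""
    return {
        'q': 'type:0 AND owningItem:{0}'.format(int(item_id)),
        'fq': 'isBot:false AND statistics_type:view AND bundleName:ORIGINAL',
    }

views_params = item_views_params(42)
downloads_params = item_downloads_params(42)
```

Either dict could then be passed to `solr.query(SOLR_CORE, params, rows=0)` in the way indexer.py does, reading only the number of matches from the response.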