mirror of https://github.com/ilri/dspace-statistics-api.git synced 2025-05-10 15:16:02 +02:00

Compare commits

21 Commits

Author SHA1 Message Date
bac764a0a4 CHANGELOG.md: Move entries to version 0.0.4 2018-09-23 16:49:25 +03:00
1a650e57c0 CHANGELOG.md: Update unreleased features 2018-09-23 16:48:39 +03:00
2db5e02be9 Add indexer.py
Standalone script to ingest item views and downloads from Solr into
SQLite.
2018-09-23 16:47:48 +03:00
9e942736b1 app.py: Get item statistics from SQLite database
It is much more efficient to cache view and download statistics in
a database than to query Solr on demand (not to mention that it is
not possible to page easily with facets in Solr). I decided to use
SQLite because it is fast, native in Python 3, and doesn't require
any extra steps during provisioning (assuming permissions are ok).
2018-09-23 16:47:00 +03:00
ea85393b13 app.py: Use parameterized URI instead of query for /item
Falcon's get_param_as_int() is really nice in that it gets a query
parameter and does validation for you, but I really wanted to have
cleaner URIs for API routes so I am now using a route URI template
with a field converter. This is cleaner, but means that parameters
not matching the template will return HTTP 404.

See: https://falcon.readthedocs.io/en/stable/api/routing.html#field-converters
2018-09-23 16:23:33 +03:00
cbeb7c89a7 CHANGELOG.md: Add note about Solr connection refactor 2018-09-23 13:27:43 +03:00
b0d81a543c Refactor Solr components
This makes it so we only need to define and connect once and then we
can re-use the connection everywhere else.
2018-09-23 13:24:30 +03:00
84801a4ab5 Add vim modeline to all Python files
Uses four spaces for tab and shift widths, and turns on expansion of
tabs to spaces.
2018-09-23 11:33:26 +03:00
4e8621e3d9 README.md: Add TODO about API documentation 2018-09-23 09:52:36 +03:00
2c8430171d CHANGELOG.md: Add note about systemd unit file 2018-09-23 07:58:15 +03:00
fb60133713 Add example systemd unit for statistics API 2018-09-23 07:50:04 +03:00
9e01a80011 CHANGELOG.md: Move changes to version 0.0.3 2018-09-20 17:41:47 +03:00
a263996582 app.py: Fix Solr queries for item views
According to dspace-api's Constants.java, items are type 2 and they
use a unique ID field of `id` instead of `owningItem`. There is no
need to check the bundleName for item types.

Also, I decided to use the main Solr query for item IDs because the
filter query parameter (fq) stores results in the filterCache and
can be quite expensive with cores storing tens of millions of
documents (we currently have 149 million docs!). It makes sense to use
the filter query parameter to reduce the result set returned by the
main Solr query.
2018-09-20 17:37:13 +03:00
ed9d25294e app.py: Use SolrClient's rows parameter
Instead of putting this in the raw query we can just use SolrClient's
native rows parameter.
2018-09-19 12:48:28 +03:00
5e165d2e88 CHANGELOG.md: Add note about using rows=0 in Solr queries 2018-09-19 01:50:14 +03:00
8e29fd8a43 app.py: Use rows=0 for Solr queries
There is no need to return any rows of the result because I am only
interested in the numFound.
2018-09-19 01:48:35 +03:00
24af83b03f CHANGELOG.md: Add note about simplified Solr query 2018-09-19 00:30:28 +03:00
a87aaba812 app.py: Simplify Solr query for bitstream downloads
This whole business with negative query ranges is confusing as hell
and I'll definitely forget it in the future. In DSpace's Solr
terminology a "download" is a view of a bitstream that lives in the
ORIGINAL bundle. This is where bitstreams that are uploaded during
the item submission process go, versus generated thumbnails, etc.
2018-09-19 00:24:23 +03:00
57faec59c8 CHANGELOG.md: Add note about config refactor 2018-09-18 17:01:24 +03:00
06ab254017 Refactor configuration into separate module
There is a good example of this in the Project Weekend GitHub profile.

See: https://github.com/projectweekend/Falcon-PostgreSQL-API-Seed
2018-09-18 16:59:28 +03:00
5b5cab8b34 README.md: Update todo 2018-09-18 15:59:27 +03:00
9 changed files with 196 additions and 25 deletions
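
The commit messages above describe splitting a Solr request into a main query (`q`) that selects one item and a filter query (`fq`) that narrows the result set, with `rows=0` because only `numFound` matters. A minimal sketch of that parameter layout follows. The helper name `build_item_views_query` is hypothetical, and `rows` is shown as a plain Solr request parameter here, whereas the commits pass it through SolrClient's own `rows` argument.

```python
def build_item_views_query(item_id):
    """Return Solr parameters for counting views of one item (illustrative only)."""
    return {
        # main query: per the commit message, items are type 2 and keyed by `id`
        'q': 'type:2 AND id:{0}'.format(int(item_id)),
        # filter query: reduce the result set to non-bot view events
        'fq': 'isBot:false AND statistics_type:view',
        # only numFound is needed, so ask Solr to return no documents
        'rows': 0,
    }


params = build_item_views_query(1234)
print(params['q'])
print(params['fq'])
```

Casting `item_id` with `int()` is an extra guard added for this sketch so that only numeric IDs end up in the query string.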

1
.gitignore vendored

@ -1,2 +1,3 @@
__pycache__
venv
*.db

CHANGELOG.md

@ -4,6 +4,23 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.0.4] - 2018-09-23
### Added
- Added example systemd unit file for API
- Added indexer.py to ingest views and downloads from Solr to a SQLite database
### Changed
- Refactor Solr configuration and connection
- /item route now expects id as part of the URI instead of a query parameter: /item/id
- View and download stats are now fetched from a SQLite database
## [0.0.3] - 2018-09-20
### Changed
- Refactor environment variables into config module
- Simplify Solr query for "downloads"
- Optimize Solr query by using rows=0
- Fix Solr queries for item views
## [0.0.2] - 2018-09-18
### Added
- Ability to get Solr parameters from environment (`SOLR_SERVER` and `SOLR_CORE`)

README.md

@ -13,7 +13,8 @@ Create a virtual environment and run it:
## Todo
- Take a list of items (POST in JSON?)
- Ability to return a paginated list of items (on a different route?)
- Add API documentation
## License
This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).

47
app.py

@ -2,37 +2,34 @
 # See DSpace Solr docs for tips about parameters
 # https://wiki.duraspace.org/display/DSPACE/Solr
+from config import SOLR_CORE
+from database import database_connection
 import falcon
-import os
-from SolrClient import SolrClient
+from solr import solr_connection
-# Check if Solr connection information was provided in the environment
-solr_server = os.environ.get('SOLR_SERVER', 'http://localhost:8080/solr')
-solr_core = os.environ.get('SOLR_CORE', 'statistics')
+db = database_connection()
+solr = solr_connection()
 class ItemResource:
-    def on_get(self, req, resp):
+    def on_get(self, req, resp, item_id):
         """Handles GET requests"""
-        # Return HTTPBadRequest if id parameter is not present and valid
-        item_id = req.get_param_as_int("id", required=True, min=0)
-        solr = SolrClient(solr_server)
+        cursor = db.cursor()
+        # get item views (and catch the TypeError if item doesn't have any views)
+        cursor.execute('SELECT views FROM itemviews WHERE id={0}'.format(item_id))
+        try:
+            views = cursor.fetchone()['views']
+        except:
+            views = 0
-        # Get views
-        res = solr.query(solr_core, {
-            'q':'type:0',
-            'fq':'owningItem:{0} AND isBot:false AND statistics_type:view AND -bundleName:ORIGINAL'.format(item_id)
-        })
+        # get item downloads (and catch the TypeError if item doesn't have any downloads)
+        cursor.execute('SELECT downloads FROM itemdownloads WHERE id={0}'.format(item_id))
+        try:
+            downloads = cursor.fetchone()['downloads']
+        except:
+            downloads = 0
-        views = res.get_num_found()
-        # Get downloads
-        res = solr.query(solr_core, {
-            'q':'type:0',
-            'fq':'owningItem:{0} AND isBot:false AND statistics_type:view AND -(bundleName:[* TO *] -bundleName:ORIGINAL)'.format(item_id)
-        })
-        downloads = res.get_num_found()
+        cursor.close()
         statistics = {
             'id': item_id,
@ -43,4 +40,6 @ class ItemResource:
         resp.media = statistics
 api = falcon.API()
-api.add_route('/item', ItemResource())
+api.add_route('/item/{item_id:int}', ItemResource())
+# vim: set sw=4 ts=4 expandtab:
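
The lookup in `app.py` builds its SQL with `str.format()` and relies on a bare `except` to handle items with no row. One possible alternative, shown only as a sketch and not as the repository's code, uses a `?` placeholder and checks for `None` explicitly. The helper name `get_item_views` and the in-memory database are assumptions made for illustration. The table layout matches the `CREATE TABLE` statement in `indexer.py`.

```python
import sqlite3


def get_item_views(connection, item_id):
    """Return the cached view count for an item, or 0 if it has no row."""
    cursor = connection.cursor()
    # the ? placeholder lets sqlite3 bind the value instead of formatting it into the SQL
    cursor.execute('SELECT views FROM itemviews WHERE id=?', (item_id,))
    row = cursor.fetchone()
    cursor.close()
    # fetchone() returns None when no row matches
    return row['views'] if row is not None else 0


# in-memory database standing in for statistics.db (assumption for this sketch)
connection = sqlite3.connect(':memory:')
connection.row_factory = sqlite3.Row
connection.execute('CREATE TABLE itemviews (id integer primary key, views integer)')
connection.execute('INSERT INTO itemviews VALUES (?, ?)', (1, 10))

print(get_item_views(connection, 1))
print(get_item_views(connection, 999))
```

Because the route's `int` field converter already guarantees a numeric `item_id`, the placeholder is mostly a matter of habit here, but it keeps the query safe if the route ever changes.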

9
config.py Normal file

@ -0,0 +1,9 @@
import os
# Check if Solr connection information was provided in the environment
SOLR_SERVER = os.environ.get('SOLR_SERVER', 'http://localhost:8080/solr')
SOLR_CORE = os.environ.get('SOLR_CORE', 'statistics')
SQLITE_DB = os.environ.get('SQLITE_DB', 'statistics.db')
# vim: set sw=4 ts=4 expandtab:


@ -0,0 +1,18 @@
[Unit]
Description=CGSpace Statistics API
After=network.target
[Service]
Environment=SOLR_SERVER=http://localhost:8081/solr
Environment=SOLR_CORE=statistics
User=nobody
Group=nogroup
WorkingDirectory=/opt/ilri/cgspace-statistics-api
ExecStart=/opt/ilri/cgspace-statistics-api/venv/bin/gunicorn \
--bind 127.0.0.1:5000 \
app:api
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
[Install]
WantedBy=multi-user.target

11
database.py Normal file

@ -0,0 +1,11 @@
from config import SQLITE_DB
import sqlite3
def database_connection():
    connection = sqlite3.connect(SQLITE_DB)
    # allow iterating over row results by column key
    connection.row_factory = sqlite3.Row
    return connection
# vim: set sw=4 ts=4 expandtab:
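
To illustrate what setting `row_factory` changes, the following sketch compares a default tuple row with a `sqlite3.Row`. It uses an in-memory database and sample values that are assumptions for this example, with the `itemdownloads` table layout taken from `indexer.py`.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE itemdownloads (id integer primary key, downloads integer)')
connection.execute('INSERT INTO itemdownloads VALUES (?, ?)', (7, 42))

# default behaviour: each row is a plain tuple, addressed by position
plain_row = connection.execute('SELECT id, downloads FROM itemdownloads').fetchone()

# with sqlite3.Row, cursors created afterwards return rows addressable by column name
connection.row_factory = sqlite3.Row
keyed_row = connection.execute('SELECT id, downloads FROM itemdownloads').fetchone()

print(plain_row[1])
print(keyed_row['downloads'])
print(keyed_row.keys())
```

This is why `app.py` can write `cursor.fetchone()['views']` rather than indexing by position.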

106
indexer.py Executable file

@ -0,0 +1,106 @@
#!/usr/bin/env python
#
# Tested with Python 3.6
# See DSpace Solr docs for tips about parameters
# https://wiki.duraspace.org/display/DSPACE/Solr
from config import SOLR_CORE
from database import database_connection
from solr import solr_connection
def index_views():
    print("Populating database with item views.")

    # determine the total number of items with views (aka Solr's numFound)
    res = solr.query(SOLR_CORE, {
        'q':'type:2',
        'fq':'isBot:false AND statistics_type:view',
        'facet':True,
        'facet.field':'id',
    }, rows=0)

    # divide results into "pages" (numFound / 100)
    results_numFound = res.get_num_found()
    results_per_page = 100
    results_num_pages = round(results_numFound / results_per_page)
    results_current_page = 0

    while results_current_page <= results_num_pages:
        print('Page {0} of {1}.'.format(results_current_page, results_num_pages))

        res = solr.query(SOLR_CORE, {
            'q':'type:2',
            'fq':'isBot:false AND statistics_type:view',
            'facet':True,
            'facet.field':'id',
            'facet.limit':results_per_page,
            'facet.offset':results_current_page * results_per_page
        })

        # make sure total number of results > 0
        if res.get_num_found() > 0:
            # SolrClient's get_facets() returns a dict of dicts
            views = res.get_facets()
            # in this case iterate over the 'id' dict and get the item ids and views
            for item_id, item_views in views['id'].items():
                db.execute('''REPLACE INTO itemviews VALUES (?, ?)''', (item_id, item_views))

        db.commit()

        results_current_page += 1

def index_downloads():
    print("Populating database with item downloads.")

    # determine the total number of items with downloads (aka Solr's numFound)
    res = solr.query(SOLR_CORE, {
        'q':'type:0',
        'fq':'isBot:false AND statistics_type:view AND bundleName:ORIGINAL',
        'facet':True,
        'facet.field':'owningItem',
    }, rows=0)

    # divide results into "pages" (numFound / 100)
    results_numFound = res.get_num_found()
    results_per_page = 100
    results_num_pages = round(results_numFound / results_per_page)
    results_current_page = 0

    while results_current_page <= results_num_pages:
        print('Page {0} of {1}.'.format(results_current_page, results_num_pages))

        res = solr.query(SOLR_CORE, {
            'q':'type:0',
            'fq':'isBot:false AND statistics_type:view AND bundleName:ORIGINAL',
            'facet':True,
            'facet.field':'owningItem',
            'facet.limit':results_per_page,
            'facet.offset':results_current_page * results_per_page
        })

        # make sure total number of results > 0
        if res.get_num_found() > 0:
            # SolrClient's get_facets() returns a dict of dicts
            downloads = res.get_facets()
            # in this case iterate over the 'owningItem' dict and get the item ids and downloads
            for item_id, item_downloads in downloads['owningItem'].items():
                db.execute('''REPLACE INTO itemdownloads VALUES (?, ?)''', (item_id, item_downloads))

        db.commit()

        results_current_page += 1

db = database_connection()
solr = solr_connection()

# use separate views and downloads tables so we can REPLACE INTO carelessly (ie, item may have views but no downloads)
db.execute('''CREATE TABLE IF NOT EXISTS itemviews
              (id integer primary key, views integer)''')
db.execute('''CREATE TABLE IF NOT EXISTS itemdownloads
              (id integer primary key, downloads integer)''')

index_views()
index_downloads()

db.close()
# vim: set sw=4 ts=4 expandtab:
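
The indexer pages through facet values with `facet.limit` and `facet.offset`, then writes each count with `REPLACE INTO`. The sketch below isolates those two ideas without contacting Solr. The helper `facet_offsets` and the sample `facet_counts` dict are hypothetical stand-ins (the dict plays the role of one page of SolrClient's `get_facets()['id']`), and `math.ceil` is a swapped-in choice rather than the `round()` used in the script. Note that Solr's `numFound` counts matching documents rather than distinct facet values, so it is only an upper bound on how many facet values there are to page through.

```python
import math
import sqlite3


def facet_offsets(total_values, per_page=100):
    """Yield one facet.offset per page, enough to cover total_values facet values."""
    for page in range(math.ceil(total_values / per_page)):
        yield page * per_page


db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE IF NOT EXISTS itemviews (id integer primary key, views integer)')

# hypothetical facet page: item id (as a string key) mapped to its view count
facet_counts = {'10': 5, '11': 3, '12': 8}
for item_id, item_views in facet_counts.items():
    db.execute('REPLACE INTO itemviews VALUES (?, ?)', (int(item_id), item_views))

# a later indexing run replaces the existing row for id 10 instead of failing
db.execute('REPLACE INTO itemviews VALUES (?, ?)', (10, 6))
db.commit()

print(list(facet_offsets(250)))
print(db.execute('SELECT views FROM itemviews WHERE id=?', (10,)).fetchone()[0])
```

Because `id` is the primary key, `REPLACE INTO` deletes any conflicting row before inserting the new one, which is what lets the script be re-run against an existing database.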

9
solr.py Normal file

@ -0,0 +1,9 @@
from config import SOLR_SERVER
from SolrClient import SolrClient
def solr_connection():
    connection = SolrClient(SOLR_SERVER)
    return connection
# vim: set sw=4 ts=4 expandtab: