mirror of https://github.com/ilri/dspace-statistics-api.git synced 2025-05-10 15:16:02 +02:00

Compare commits

38 Commits

Author SHA1 Message Date
78900b5d85 CHANGELOG.md: Add changes for v0.6.1 2018-11-01 00:39:12 +02:00
eb08832bf8 Sync API documentation HTML with README.md 2018-11-01 00:37:52 +02:00
c2ec780ad9 README.md: Improve API documentation 2018-11-01 00:37:40 +02:00
df8ebc8bf1 README.md: Improve API endpoint documentation 2018-11-01 00:31:16 +02:00
0d4be5f4c8 README.md: Add API documentation endpoint 2018-11-01 00:22:16 +02:00
30dc7f1939 Add basic API documentation on root (/)
I had imagined plugging in an interactive Swagger or OpenAPI instance
here, but that's actually much more involved in Falcon than I want to
deal with right now.
2018-11-01 00:19:39 +02:00
77194707fd README.md: Improve introduction 2018-11-01 00:08:24 +02:00
10c1f8bdcc README.md: Update Travis CI badge 2018-10-31 23:14:38 +02:00
da74943da2 README.md: Update introduction 2018-10-31 22:40:36 +02:00
fc8348ab29 README.md: Add acknowledgement about the Solr queries 2018-10-31 19:36:50 +02:00
15c3299b99 CHANGELOG.md: Add changes for v0.6.0 2018-10-31 19:26:45 +02:00
d36be5ee50 contrib: Update systemd unit files for refactor 2018-10-28 11:14:21 +02:00
2f45d27554 dspace_statistics_api/app.py: remove unused code
This was added accidentally when I refactored. I was trying to see
if I could use Falcon's on_exit() hook.
2018-10-28 11:14:21 +02:00
b8356f7a87 Add "application" alias to API object
By default gunicorn looks for an "application" object to run, so this
saves us having to type api:app.
2018-10-28 11:14:21 +02:00
2136dc79ce Remove shebang from indexer.py
This is run as a Python module now so does not need a shebang.
2018-10-28 11:14:21 +02:00
ed60120cef Remove executable bit from indexer.py
Now it is run as a Python module.
2018-10-28 11:14:21 +02:00
c027f01b48 Refactor project structure
This follows guidance from several well-known Python best practices
guides. Basically, the idea is to create a package for the application
that comprises several reusable modules.

See: https://docs.python-guide.org/writing/structure/
See: https://realpython.com/python-application-layouts/
2018-10-28 11:14:21 +02:00
754663f062 CHANGELOG.md: Add changes for version 0.5.2 2018-10-28 11:12:27 +02:00
507699e58a requirements.txt: Update libraries
Switch to a personal fork of SolrClient so that we can use kazoo 2.5.0
and get rid of the error about the 'async' keyword on Python 3.7. Also
this bumps some of the other libraries to their latest versions.
2018-10-28 11:09:47 +02:00
a016916995 CHANGELOG.md: Add note about ujson 2018-10-24 14:15:03 +03:00
6fd2827a7c Use Python's native json instead of ujson
Falcon can optionally use ujson to speed up JSON (de)serialization,
but Falcon's already really fast and requiring ujson actually makes
deployment trickier in some cases (for example in Docker containers
that are based on Alpine Linux).

Here are some tests of Falcon 1.4.1 on Python 3.5 from my laptop:

    1. falcon...............60172 req/sec or 16.62 μs/req (36x)
    2. falcon-ext...........34186 req/sec or 29.25 μs/req (20x)
    3. bottle...............32924 req/sec or 30.37 μs/req (20x)
    4. werkzeug.............11948 req/sec or 83.70 μs/req (7x)
    5. flask.................6654 req/sec or 150.30 μs/req (4x)
    6. django................4565 req/sec or 219.04 μs/req (3x)
    7. pecan.................1672 req/sec or 598.19 μs/req (1x)

The tests were conducted with Falcon's official Docker benchmarking
tools on my Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz on Arch Linux.

See: https://github.com/falconry/falcon/tree/master/docker
2018-10-24 14:08:23 +03:00
62142eb79e CHANGELOG.md: Move unreleased changes to v0.5.0 2018-10-24 12:02:42 +03:00
fda0321942 CHANGELOG.md: Add note about Solr in API component 2018-10-24 12:01:47 +03:00
963aa245c8 app.py: Don't initialize Solr connection
We only need Solr in the indexing component, not for the API itself.
2018-10-24 11:59:50 +03:00
568ff2eebb CHANGELOG.md: Add note about nginx configuration 2018-10-23 14:56:44 +03:00
deecb8a10b README.md: Add example nginx configuration 2018-10-23 14:55:36 +03:00
12f45d7c08 contrib: Adjust example path 2018-10-23 14:34:29 +03:00
f65089f9ce CHANGELOG.md: Update and move to 0.4.3 release 2018-10-17 09:51:44 +03:00
1db5cf1c29 README.md: Grammar 2018-10-17 09:51:35 +03:00
e581c4b1aa README.md: Improve documentation 2018-10-17 09:50:30 +03:00
e8d356c9ca README.md: Add TODO about Python 3.6+ f-string syntax
They are faster.
2018-10-17 09:13:25 +03:00
34a9b8d629 CHANGELOG.md: Add unreleased changes for Travis CI 2018-10-14 19:02:09 +03:00
41e3d66a0e .travis.yml: Only build master branch 2018-10-14 19:00:31 +03:00
9b2a6137b4 README.md: Add Travis CI badge
For now this is only an indicator that the Python requirements can
be satisfied and installed.
2018-10-14 18:58:12 +03:00
600b986f99 .travis.yml: Use Python 3.7-dev instead of 3.7
I don't think Travis supports Python 3.7 yet because the builds for
that version keep failing.
2018-10-14 18:57:30 +03:00
49a7790794 .travis.yml: Move script to one line 2018-10-14 18:53:45 +03:00
f2deba627c .travis.yml: Run pip install as script
Basically, for now there are no tests, so I just want to check
that requirements.txt is correct and that all dependencies can be
installed.
2018-10-14 18:47:14 +03:00
9323513794 README.md: Update instructions 2018-10-14 18:45:40 +03:00
13 changed files with 133 additions and 39 deletions

View File

@@ -2,8 +2,10 @@ language: python
 python:
   - "3.5"
   - "3.6"
-  - "3.7"
-install:
-  - pip install -r requirements.txt
+  - "3.7-dev"
+script: pip install -r requirements.txt
+branches:
+  only:
+    - master
 # vim: ts=2 sw=2 et

View File

@ -4,6 +4,36 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.6.1] - 2018-10-31
### Added
- API documentation at root path (/)
## [0.6.0] - 2018-10-31
### Changed
- Refactor project structure (note breaking changes to API and indexing invocation, see contrib and README.md)
## [0.5.2] - 2018-10-28
### Changed
- Update library versions in requirements.txt
## [0.5.1] - 2018-10-24
### Changed
- Use Python's native json instead of ujson
## [0.5.0] - 2018-10-24
### Added
- Example nginx configuration to README.md
### Changed
- Don't initialize Solr connection in API
## [0.4.3] - 2018-10-17
### Changed
- Use pip install as script for Travis CI
### Improved
- Documentation for deployment and testing
## [0.4.2] - 2018-10-04
### Changed
- README.md introduction and requirements

View File

@ -1,5 +1,7 @@
# DSpace Statistics API
A simple REST API to expose Solr view and download statistics for items in a DSpace repository. This project contains a standalone indexing component and a WSGI application.
# DSpace Statistics API [![Build Status](https://travis-ci.org/ilri/dspace-statistics-api.svg?branch=master)](https://travis-ci.org/ilri/dspace-statistics-api)
DSpace versions 4.0 and up include a [REST API](https://wiki.duraspace.org/display/DSDOC5x/REST+API) that allows the repository to be queried programmatically. The API exposes information about communities, collections, items, and bitstreams, but not item views or downloads. This project contains a lightweight indexer and a web application to make the view and download statistics available via a simple REST API that can be deployed simultaneously with DSpace's own.
You can read more about the Solr queries used to gather the item view and download statistics on the [DSpace wiki](https://wiki.duraspace.org/display/DSPACE/Solr).
## Requirements
@ -14,32 +16,68 @@ Create a Python virtual environment and install the dependencies:
$ . venv/bin/activate
$ pip install -r requirements.txt
Set up the environment variables Solr and PostgreSQL:
Set up the environment variables for Solr and PostgreSQL:
$ export SOLR_SERVER=http://localhost:8080/solr
$
$ gunicorn app:api
$ export DATABASE_NAME=dspacestatistics
$ export DATABASE_USER=dspacestatistics
$ export DATABASE_PASS=dspacestatistics
$ export DATABASE_HOST=localhost
Index the Solr statistics core to populate the PostgreSQL database:
$ python -m dspace_statistics_api.indexer
Run the REST API:
$ gunicorn dspace_statistics_api.app
Test to see if there are any statistics:
$ curl 'http://localhost:8000/items?limit=1'
## Deployment
There are example systemd service and timer units in the `contrib` directory.
There are example systemd service and timer units in the `contrib` directory. The API service listens on localhost by default so you will need to expose it publicly using a web server like nginx.
An example nginx configuration is:
```
server {
#...
location ~ /rest/statistics/?(.*) {
access_log /var/log/nginx/statistics.log;
proxy_pass http://statistics_api/$1$is_args$args;
}
}
upstream statistics_api {
server 127.0.0.1:5000;
}
```
This would expose the API at `/rest/statistics`.
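To see what that `location` block does to a request, the rewrite can be traced by hand. The sketch below mimics it in Python under simplifying assumptions: it uses Python's `re` module rather than nginx's own regex engine, it ignores nginx's URI normalization, and `upstream_uri` is a hypothetical helper name that is not part of this project.

```python
import re
from urllib.parse import urlsplit

# Same pattern as the nginx "location ~" block above.
LOCATION = re.compile(r"/rest/statistics/?(.*)")


def upstream_uri(request_uri):
    """Approximate the URI that proxy_pass would send to the upstream."""
    parts = urlsplit(request_uri)
    match = LOCATION.search(parts.path)  # nginx matches the path, not the query
    if match is None:
        return None  # this location block would not handle the request
    is_args = "?" if parts.query else ""  # stands in for nginx's $is_args
    return "/" + match.group(1) + is_args + parts.query


print(upstream_uri("/rest/statistics/items?limit=1"))
print(upstream_uri("/rest/statistics/item/17"))
```

Because the pattern is not anchored with `^`, it would also match a path such as `/foo/rest/statistics/items`. Whether that matters depends on the rest of the server configuration.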
## Using the API
The API exposes the following endpoints:
- GET `/items`: return views and downloads for all items that Solr knows about¹. Accepts `limit` and `page` query parameters for pagination of results.
- GET `/item/id`: return views and downloads for a single item (*id* must be a positive integer). Returns HTTP 404 if an item id is not found.
- GET `/`: return a basic API documentation page.
- GET `/items`: return views and downloads for all items that Solr knows about¹. Accepts `limit` and `page` query parameters for pagination of results (`limit` must be an integer between 1 and 100, and `page` must be an integer greater than or equal to 0).
- GET `/item/id`: return views and downloads for a single item (`id` must be a positive integer). Returns HTTP 404 if an item id is not found.
¹ We are querying the Solr statistics core, which technically only knows about items that have either views or downloads.
The item id is the *internal* id for an item. You can get these from the standard DSpace REST API.
¹ We are querying the Solr statistics core, which technically only knows about items that have either views or downloads. If an item is not present here you can assume it has zero views and zero downloads, but not necessarily that it does not exist in the repository.
## Todo
- Add API documentation
- Close up DB connection when gunicorn shuts down gracefully
- Close DB connection when gunicorn shuts down gracefully
- Better logging
- Tests
- Check if database exists (try/except)
- Version API
- Use JSON in PostgreSQL
- Switch to [Python 3.6+ f-string syntax](https://realpython.com/python-f-strings/)
## License
This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).

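The pagination rules that the README gives for `/items` can be expressed as a small validation step. This is a minimal sketch, not the project's actual handler: `parse_pagination` is a hypothetical helper, and the defaults used when a parameter is absent (`limit=100`, `page=0`) as well as the `limit * page` offset are assumptions, since the README does not state them.

```python
def parse_pagination(params, default_limit=100, default_page=0):
    """Validate `limit` and `page` taken from a dict of query parameters.

    Returns (limit, page, offset). The defaults and the offset formula are
    assumptions made for illustration.
    """
    try:
        limit = int(params.get("limit", default_limit))
        page = int(params.get("page", default_page))
    except (TypeError, ValueError):
        raise ValueError("limit and page must be integers")

    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    if page < 0:
        raise ValueError("page must be greater than or equal to 0")

    return limit, page, limit * page


print(parse_pagination({"limit": "10", "page": "2"}))
```

A request such as `/items?limit=10&page=2` would then skip the first 20 items under the assumed offset formula.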
View File

@@ -9,10 +9,10 @@ Environment=DATABASE_PASS=dspacestatistics
 Environment=DATABASE_HOST=localhost
 User=nobody
 Group=nogroup
-WorkingDirectory=/opt/ilri/dspace-statistics-api
-ExecStart=/opt/ilri/dspace-statistics-api/venv/bin/gunicorn \
+WorkingDirectory=/var/lib/dspace-statistics-api
+ExecStart=/var/lib/dspace-statistics-api/venv/bin/gunicorn \
 --bind 127.0.0.1:5000 \
-app:api
+dspace_statistics_api.app
 ExecReload=/bin/kill -s HUP $MAINPID
 ExecStop=/bin/kill -s TERM $MAINPID

View File

@@ -10,8 +10,8 @@ Environment=DATABASE_PASS=dspacestatistics
 Environment=DATABASE_HOST=localhost
 User=nobody
 Group=nogroup
-WorkingDirectory=/opt/ilri/dspace-statistics-api
-ExecStart=/opt/ilri/dspace-statistics-api/venv/bin/python indexer.py
+WorkingDirectory=/var/lib/dspace-statistics-api
+ExecStart=/var/lib/dspace-statistics-api/venv/bin/python -m dspace_statistics_api.indexer
 [Install]
 WantedBy=multi-user.target

View File

View File

@@ -1,10 +1,15 @@
-from database import database_connection
+from .database import database_connection
 import falcon
-from solr import solr_connection
 db = database_connection()
 db.set_session(readonly=True)
-solr = solr_connection()
+class RootResource:
+    def on_get(self, req, resp):
+        resp.status = falcon.HTTP_200
+        resp.content_type = 'text/html'
+        with open('dspace_statistics_api/docs/index.html', 'r') as f:
+            resp.body = f.read()
 class AllItemsResource:
     def on_get(self, req, resp):
@@ -65,7 +70,8 @@ class ItemResource:
         cursor.close()
-api = falcon.API()
+api = application = falcon.API()
+api.add_route('/', RootResource())
 api.add_route('/items', AllItemsResource())
 api.add_route('/item/{item_id:int}', ItemResource())

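`RootResource` opens `dspace_statistics_api/docs/index.html` with a path relative to the process's working directory. That works when gunicorn is started from the repository root, which is what the systemd unit's `WorkingDirectory` arranges. One possible alternative, shown here only as a sketch, is to resolve the file relative to the module itself. `docs_index_path` is a hypothetical helper and not code from this changeset.

```python
from pathlib import Path


def docs_index_path(module_file):
    """Return the docs/index.html path that sits next to the given module file."""
    return Path(module_file).parent / "docs" / "index.html"


# Inside app.py this would be called as docs_index_path(__file__).
print(docs_index_path("/srv/example/dspace_statistics_api/app.py").as_posix())
```

With this approach the documentation page would be found regardless of the directory from which the server is launched.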
View File

@@ -1,7 +1,7 @@
-from config import DATABASE_NAME
-from config import DATABASE_USER
-from config import DATABASE_PASS
-from config import DATABASE_HOST
+from .config import DATABASE_NAME
+from .config import DATABASE_USER
+from .config import DATABASE_PASS
+from .config import DATABASE_HOST
 import psycopg2, psycopg2.extras
 def database_connection():

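The `.config` module imported here is not part of this diff. Going by the README, which sets `DATABASE_NAME`, `DATABASE_USER`, `DATABASE_PASS`, and `DATABASE_HOST` in the environment, it presumably reads those variables. A rough sketch of that pattern follows. The fallback values and the `load_database_config` and `to_dsn` helpers are assumptions made for illustration.

```python
import os

# Maps libpq connection keywords to (environment variable, assumed fallback).
SETTINGS = (
    ("dbname", "DATABASE_NAME", "dspacestatistics"),
    ("user", "DATABASE_USER", "dspacestatistics"),
    ("password", "DATABASE_PASS", "dspacestatistics"),
    ("host", "DATABASE_HOST", "localhost"),
)


def load_database_config(environ=os.environ):
    """Collect database settings from the environment, with assumed fallbacks."""
    return [(key, environ.get(var, default)) for key, var, default in SETTINGS]


def to_dsn(config):
    """Format settings as a keyword/value connection string.

    Simplification: values containing spaces or quotes would need escaping.
    """
    return " ".join("{}={}".format(key, value) for key, value in config)


print(to_dsn(load_database_config({"DATABASE_HOST": "db.example.org"})))
```

A string of this form can be passed to `psycopg2.connect()`, although the project's own `database_connection()` may build its connection differently.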
View File

@ -0,0 +1,20 @@
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title>DSpace Statistics API</title>
</head>
<body>
<h1>DSpace Statistics API</h1>
<p>This site is running the <a href="https://github.com/ilri/dspace-statistics-api" title="DSpace Statistics API project">DSpace Statistics API</a>. The following endpoints are available:</p>
<ul>
<li>GET <code>/</code>: return a basic API documentation page.</li>
<li>GET <code>/items</code>: return views and downloads for all items that Solr knows about¹. Accepts <code>limit</code> and <code>page</code> query parameters for pagination of results (<code>limit</code> must be an integer between 1 and 100, and <code>page</code> must be an integer greater than or equal to 0).</li>
<li>GET <code>/item/id</code>: return views and downloads for a single item (<code>id</code> must be a positive integer). Returns HTTP 404 if an item id is not found.</li>
</ul>
<p>The item id is the <em>internal</em> id for an item. You can get these from the standard DSpace REST API.</p>
<p>¹ We are querying the Solr statistics core, which technically only knows about items that have either views or downloads. If an item is not present here you can assume it has zero views and zero downloads, but not necessarily that it does not exist in the repository.</p>
</body>
</html>

indexer.py → dspace_statistics_api/indexer.py Executable file → Normal file
View File

@@ -1,4 +1,3 @@
-#!/usr/bin/env python
 #
 # indexer.py
 #
@@ -30,10 +29,10 @@
 # See: https://solrclient.readthedocs.io/en/latest/SolrClient.html
 # See: https://wiki.duraspace.org/display/DSPACE/Solr
-from database import database_connection
-import ujson
+from .database import database_connection
+import json
 import psycopg2.extras
-from solr import solr_connection
+from .solr import solr_connection
 def index_views():
     # get total number of distinct facets for items with a minimum of 1 view,
@@ -56,7 +55,7 @@ def index_views():
     }, rows=0)
     # get total number of distinct facets (countDistinct)
-    results_totalNumFacets = ujson.loads(res.get_json())['stats']['stats_fields']['id']['countDistinct']
+    results_totalNumFacets = json.loads(res.get_json())['stats']['stats_fields']['id']['countDistinct']
     # divide results into "pages" (cast to int to effectively round down)
     results_per_page = 100
@@ -115,7 +114,7 @@ def index_downloads():
     }, rows=0)
     # get total number of distinct facets (countDistinct)
-    results_totalNumFacets = ujson.loads(res.get_json())['stats']['stats_fields']['owningItem']['countDistinct']
+    results_totalNumFacets = json.loads(res.get_json())['stats']['stats_fields']['owningItem']['countDistinct']
     # divide results into "pages" (cast to int to effectively round down)
     results_per_page = 100

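The two changed lines pull `countDistinct` out of Solr's stats response, and the comment that follows describes splitting that total into pages of 100. Below is a self-contained sketch of both steps using the standard `json` module. The sample response is made up for illustration and only mirrors the key path used in the diff, and `count_distinct` and `page_offsets` are hypothetical helpers. Unlike the "round down" approach in the comment, this version rounds up, so a final partial page is counted without issuing an extra empty request.

```python
import json

# Made-up sample that mirrors only the key path used by the indexer.
SAMPLE_RESPONSE = '{"stats": {"stats_fields": {"id": {"countDistinct": 250}}}}'


def count_distinct(raw_json, field):
    """Extract stats.stats_fields.<field>.countDistinct from a JSON string."""
    return json.loads(raw_json)["stats"]["stats_fields"][field]["countDistinct"]


def page_offsets(total, per_page=100):
    """Return the starting offset of every page needed to cover `total` results."""
    num_pages = -(-total // per_page)  # ceiling division on non-negative ints
    return [page * per_page for page in range(num_pages)]


total = count_distinct(SAMPLE_RESPONSE, "id")
print(total, page_offsets(total))
```

For the downloads index the same helper would be called with `"owningItem"` as the field, matching the second changed line.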
View File

@@ -1,4 +1,4 @@
-from config import SOLR_SERVER
+from .config import SOLR_SERVER
 from SolrClient import SolrClient
 def solr_connection():

View File

@@ -1,4 +1,4 @@
-certifi==2018.8.24
+certifi==2018.10.15
 chardet==3.0.4
 falcon==1.4.1
 gunicorn==19.9.0
@@ -6,8 +6,7 @@ idna==2.7
 kazoo==2.5.0
 psycopg2-binary==2.7.5
 python-mimeparse==1.6.0
-requests==2.19.1
+requests==2.20.0
 six==1.11.0
-SolrClient==0.2.1
-ujson==1.35
-urllib3==1.23
+-e git://github.com/alanorth/SolrClient.git@c629e3475be37c82770b2be61748be7e29882648#egg=SolrClient
+urllib3==1.24