mirror of https://github.com/ilri/dspace-statistics-api.git synced 2024-11-25 23:58:18 +01:00

Compare commits


No commits in common. "ab82e90773423512c6017972206b8da76e9163d6" and "787eec20ea69d82b42aebdfbffa3021bebd06af2" have entirely different histories.

8 changed files with 186 additions and 314 deletions


@ -5,14 +5,10 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## Unreleased ## Unreleased
### Added
- indexer.py now indexes views and downloads for communities and collections
- API endpoints for /communities, /community/id, /collections, and /collections/id
### Changed ### Changed
- Add ORDER BY to /items resource to make sure results are returned - Add ORDER BY to /items resource to make sure results are returned
deterministically deterministically
- Use `fl` parameter in indexer to return only the field we are faceting by - Use `fl` parameter in indexer to return only the id field
- Minor refactoring of imports for PEP8 style - Minor refactoring of imports for PEP8 style
## [1.3.2] - 2020-11-18 ## [1.3.2] - 2020-11-18


@ -4,7 +4,7 @@ DSpace stores item view and download events in a Solr "statistics" core. This in
- If your DSpace is version 4 or 5, use [dspace-statistics-api v1.1.1](https://github.com/ilri/dspace-statistics-api/releases/tag/v1.1.1) - If your DSpace is version 4 or 5, use [dspace-statistics-api v1.1.1](https://github.com/ilri/dspace-statistics-api/releases/tag/v1.1.1)
- If your DSpace is version 6+, use [dspace-statistics-api v1.2.0 or greater](https://github.com/ilri/dspace-statistics-api/releases/tag/v1.2.0) - If your DSpace is version 6+, use [dspace-statistics-api v1.2.0 or greater](https://github.com/ilri/dspace-statistics-api/releases/tag/v1.2.0)
This project contains an indexer and a [Falcon-based](https://falcon.readthedocs.io/) web application to make the item, community, and collection statistics available via a simple REST API. You can read more about the Solr queries used to gather the item view and download statistics on the [DSpace wiki](https://wiki.lyrasis.org/display/DSPACE/Solr). This project contains an indexer and a [Falcon-based](https://falcon.readthedocs.io/) web application to make the statistics available via a simple REST API. You can read more about the Solr queries used to gather the item view and download statistics on the [DSpace wiki](https://wiki.lyrasis.org/display/DSPACE/Solr).
If you use the DSpace Statistics API please cite: If you use the DSpace Statistics API please cite:
@ -83,18 +83,12 @@ The API exposes the following endpoints:
- GET `/items`: return views and downloads for all items that Solr knows about¹. Accepts `limit` and `page` query parameters for pagination of results (`limit` must be an integer between 1 and 100, and `page` must be an integer greater than or equal to 0). - GET `/items`: return views and downloads for all items that Solr knows about¹. Accepts `limit` and `page` query parameters for pagination of results (`limit` must be an integer between 1 and 100, and `page` must be an integer greater than or equal to 0).
- POST `/items`: return views and downloads for an arbitrary list of items with an optional date range. Accepts `limit`, `page`, `dateFrom`, and `dateTo` parameters². - POST `/items`: return views and downloads for an arbitrary list of items with an optional date range. Accepts `limit`, `page`, `dateFrom`, and `dateTo` parameters².
- GET `/item/id`: return views and downloads for a single item (`id` must be a UUID). Returns HTTP 404 if an item id is not found. - GET `/item/id`: return views and downloads for a single item (`id` must be a UUID). Returns HTTP 404 if an item id is not found.
- GET `/communities`: return views and downloads for all communities that Solr knows about¹. Accepts `limit` and `page` query parameters for pagination of results (`limit` must be an integer between 1 and 100, and `page` must be an integer greater than or equal to 0).
- POST `/communities`: return views and downloads for an arbitrary list of communities with an optional date range. Accepts `limit`, `page`, `dateFrom`, and `dateTo` parameters².
- GET `/community/id`: return views and downloads for a single community (`id` must be a UUID). Returns HTTP 404 if a community id is not found.
- GET `/collections`: return views and downloads for all collections that Solr knows about¹. Accepts `limit` and `page` query parameters for pagination of results (`limit` must be an integer between 1 and 100, and `page` must be an integer greater than or equal to 0).
- POST `/collections`: return views and downloads for an arbitrary list of collections with an optional date range. Accepts `limit`, `page`, `dateFrom`, and `dateTo` parameters².
- GET `/collection/id`: return views and downloads for a single collection (`id` must be a UUID). Returns HTTP 404 if a collection id is not found.
The id is the *internal* UUID for an item, community, or collection. You can get these from the standard DSpace REST API. The item id is the *internal* UUID for an item. You can get these from the standard DSpace REST API.
¹ We are querying the Solr statistics core, which technically only knows about items, communities, or collections that have either views or downloads. If an item, community, or collection is not present here you can assume it has zero views and zero downloads, but not necessarily that it does not exist in the repository. ¹ We are querying the Solr statistics core, which technically only knows about items that have either views or downloads. If an item is not present here you can assume it has zero views and zero downloads, but not necessarily that it does not exist in the repository.
² POST requests to `/items`, `/communities`, and `/collections` should be in JSON format with the following parameters (substitute the "items" list for communities or collections accordingly): ² POST requests to `/items` should be in JSON format with the following parameters:
``` ```
{ {
@ -119,6 +113,8 @@ The id is the *internal* UUID for an item, community, or collection. You can get
- Use JSON in PostgreSQL - Use JSON in PostgreSQL
- Add top items endpoint, perhaps `/top/items` or `/items/top`? - Add top items endpoint, perhaps `/top/items` or `/items/top`?
- Actually we could add `/items?limit=10&sort=views` - Actually we could add `/items?limit=10&sort=views`
- Make community and collection stats available
- Facet on owningComm and owningColl
- Add Swagger with OpenAPI 3.0.x with [falcon-swagger-ui](https://github.com/rdidyk/falcon-swagger-ui) - Add Swagger with OpenAPI 3.0.x with [falcon-swagger-ui](https://github.com/rdidyk/falcon-swagger-ui)
## License ## License
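
As a rough illustration of the request body described in footnote ² of the README, the sketch below builds and checks a JSON payload for POST `/items`. The helper name `build_items_payload` is hypothetical and not part of this project. The `limit` and `page` bounds come from the README text; treating every entry of `items` as a UUID string is an assumption, and because the exact `dateFrom`/`dateTo` format is not shown in this excerpt the dates are passed through as opaque strings.

```python
import json
import uuid
from typing import List, Optional


def build_items_payload(
    items: List[str],
    limit: int = 100,
    page: int = 0,
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
) -> str:
    """Return a JSON request body for POST /items (hypothetical helper)."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be an integer between 1 and 100")
    if page < 0:
        raise ValueError("page must be an integer greater than or equal to 0")

    # Assumption: ids are UUID strings; uuid.UUID() raises ValueError otherwise
    for item_id in items:
        uuid.UUID(item_id)

    payload = {"limit": limit, "page": page, "items": items}
    # Only send the optional date range keys when they were provided
    if date_from is not None:
        payload["dateFrom"] = date_from
    if date_to is not None:
        payload["dateTo"] = date_to

    return json.dumps(payload)
```

The resulting string could then be sent as the body of a POST request by whatever HTTP client you already use.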


@ -2,8 +2,8 @@ import falcon
import psycopg2.extras import psycopg2.extras
from .database import DatabaseManager from .database import DatabaseManager
from .stats import get_downloads, get_views from .items import get_downloads, get_views
from .util import set_statistics_scope, validate_post_parameters from .util import validate_items_post_parameters
class RootResource: class RootResource:
@ -14,8 +14,7 @@ class RootResource:
resp.body = f.read() resp.body = f.read()
class AllStatisticsResource: class AllItemsResource:
@falcon.before(set_statistics_scope)
def on_get(self, req, resp): def on_get(self, req, resp):
"""Handles GET requests""" """Handles GET requests"""
# Return HTTPBadRequest if id parameter is not present and valid # Return HTTPBadRequest if id parameter is not present and valid
@ -27,26 +26,26 @@ class AllStatisticsResource:
db.set_session(readonly=True) db.set_session(readonly=True)
with db.cursor() as cursor: with db.cursor() as cursor:
# get total number of communities/collections/items so we can estimate the pages # get total number of items so we can estimate the pages
cursor.execute(f"SELECT COUNT(id) FROM {req.context.statistics_scope}") cursor.execute("SELECT COUNT(id) FROM items")
pages = round(cursor.fetchone()[0] / limit) pages = round(cursor.fetchone()[0] / limit)
# get statistics and use limit and offset to page through results # get statistics and use limit and offset to page through results
cursor.execute( cursor.execute(
f"SELECT id, views, downloads FROM {req.context.statistics_scope} ORDER BY id LIMIT %s OFFSET %s", "SELECT id, views, downloads FROM items ORDER BY id LIMIT %s OFFSET %s",
[limit, offset], [limit, offset],
) )
# create a list to hold dicts of stats # create a list to hold dicts of item stats
statistics = list() statistics = list()
# iterate over results and build statistics object # iterate over results and build statistics object
for result in cursor: for item in cursor:
statistics.append( statistics.append(
{ {
"id": str(result["id"]), "id": str(item["id"]),
"views": result["views"], "views": item["views"],
"downloads": result["downloads"], "downloads": item["downloads"],
} }
) )
@ -59,15 +58,9 @@ class AllStatisticsResource:
resp.media = message resp.media = message
@falcon.before(set_statistics_scope) @falcon.before(validate_items_post_parameters)
@falcon.before(validate_post_parameters)
def on_post(self, req, resp): def on_post(self, req, resp):
"""Handles POST requests. """Handles POST requests"""
Uses two `before` hooks to set the statistics "scope" and validate the
POST parameters. The "scope" is the type of statistics we want, which
will be items, communities, or collections, depending on the request.
"""
# Build the Solr date string, ie: [* TO *] # Build the Solr date string, ie: [* TO *]
if req.context.dateFrom and req.context.dateTo: if req.context.dateFrom and req.context.dateTo:
@ -81,10 +74,10 @@ class AllStatisticsResource:
# Helper variables to make working with pages/items/results easier and # Helper variables to make working with pages/items/results easier and
# to make the code easier to understand # to make the code easier to understand
number_of_elements: int = len(req.context.elements) number_of_items: int = len(req.context.items)
pages: int = int(number_of_elements / req.context.limit) pages: int = int(number_of_items / req.context.limit)
first_element: int = req.context.page * req.context.limit first_item: int = req.context.page * req.context.limit
last_element: int = first_element + req.context.limit last_item: int = first_item + req.context.limit
# Get a subset of the POSTed items based on our limit. Note that Python # Get a subset of the POSTed items based on our limit. Note that Python
# list slicing and indexing are both zero based, but the first and last # list slicing and indexing are both zero based, but the first and last
# items in a slice can be confusing. See this ASCII diagram: # items in a slice can be confusing. See this ASCII diagram:
@ -95,24 +88,20 @@ class AllStatisticsResource:
# Slice position: 0 1 2 3 4 5 6 # Slice position: 0 1 2 3 4 5 6
# Index position: 0 1 2 3 4 5 # Index position: 0 1 2 3 4 5
# #
# So if we have a list of items with 240 items: # So if we have a list items with 240 items:
# #
# 1st set: items[0:100] would give items at indexes 0 to 99 # 1st set: items[0:100] would give items at indexes 0 to 99
# 2nd set: items[100:200] would give items at indexes 100 to 199 # 2nd set: items[100:200] would give items at indexes 100 to 199
# 3rd set: items[200:300] would give items at indexes 200 to 239 # 3rd set: items[200:300] would give items at indexes 200 to 239
elements_subset: list = req.context.elements[first_element:last_element] items_subset: list = req.context.items[first_item:last_item]
views: dict = get_views( views: dict = get_views(solr_date_string, items_subset)
solr_date_string, elements_subset, req.context.views_facet_field downloads: dict = get_downloads(solr_date_string, items_subset)
)
downloads: dict = get_downloads(
solr_date_string, elements_subset, req.context.downloads_facet_field
)
# create a list to hold dicts of stats # create a list to hold dicts of item stats
statistics = list() statistics = list()
# iterate over views dict to extract views and use the element id as an # iterate over views dict to extract views and use the item id as an
# index to the downloads dict to extract downloads. # index to the downloads dict to extract downloads.
for k, v in views.items(): for k, v in views.items():
statistics.append({"id": k, "views": v, "downloads": downloads[k]}) statistics.append({"id": k, "views": v, "downloads": downloads[k]})
@ -128,9 +117,8 @@ class AllStatisticsResource:
resp.media = message resp.media = message
class SingleStatisticsResource: class ItemResource:
@falcon.before(set_statistics_scope) def on_get(self, req, resp, item_id):
def on_get(self, req, resp, id_):
"""Handles GET requests""" """Handles GET requests"""
# Adapt Python's uuid.UUID type to PostgreSQL's uuid
@ -143,19 +131,18 @@ class SingleStatisticsResource:
with db.cursor() as cursor: with db.cursor() as cursor:
cursor = db.cursor() cursor = db.cursor()
cursor.execute( cursor.execute(
f"SELECT views, downloads FROM {req.context.database} WHERE id=%s", "SELECT views, downloads FROM items WHERE id=%s", [str(item_id)]
[str(id_)],
) )
if cursor.rowcount == 0: if cursor.rowcount == 0:
raise falcon.HTTPNotFound( raise falcon.HTTPNotFound(
title=f"{req.context.statistics_scope} not found", title="Item not found",
description=f'The {req.context.statistics_scope} with id "{str(id_)}" was not found.', description=f'The item with id "{str(item_id)}" was not found.',
) )
else: else:
results = cursor.fetchone() results = cursor.fetchone()
statistics = { statistics = {
"id": str(id_), "id": str(item_id),
"views": results["views"], "views": results["views"],
"downloads": results["downloads"], "downloads": results["downloads"],
} }
@ -165,17 +152,7 @@ class SingleStatisticsResource:
api = application = falcon.API() api = application = falcon.API()
api.add_route("/", RootResource()) api.add_route("/", RootResource())
api.add_route("/items", AllItemsResource())
# Item routes api.add_route("/item/{item_id:uuid}", ItemResource())
api.add_route("/items", AllStatisticsResource())
api.add_route("/item/{id_:uuid}", SingleStatisticsResource())
# Community routes
api.add_route("/communities", AllStatisticsResource())
api.add_route("/community/{id_:uuid}", SingleStatisticsResource())
# Collection routes
api.add_route("/collections", AllStatisticsResource())
api.add_route("/collection/{id_:uuid}", SingleStatisticsResource())
# vim: set sw=4 ts=4 expandtab: # vim: set sw=4 ts=4 expandtab:
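
The slicing comment in `on_post` above can be checked with a small stand-alone sketch. It assumes nothing about Falcon or the request context; `page_slice` is a hypothetical helper that mirrors the `first`/`last` arithmetic from the handler, and the ids are placeholder strings.

```python
from typing import List, Sequence


def page_slice(elements: Sequence[str], page: int, limit: int) -> List[str]:
    """Return the subset of elements for a zero-based page."""
    first = page * limit
    last = first + limit
    # Slicing past the end of a sequence is safe in Python: it just stops early
    return list(elements[first:last])


ids = [f"id-{n}" for n in range(240)]
sizes = [len(page_slice(ids, page, 100)) for page in range(3)]
print(sizes)  # [100, 100, 40]
```

With 240 elements and a limit of 100, `int(240 / 100)` is 2, which is also the index of the last non-empty page. When the element count is an exact multiple of the limit, the same expression is one higher than the index of the last non-empty page.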


@ -12,21 +12,14 @@
<li>GET <code>/items</code>: return views and downloads for all items that Solr knows about¹. Accepts <code>limit</code> and <code>page</code> query parameters for pagination of results (<code>limit</code> must be an integer between 1 and 100, and <code>page</code> must be an integer greater than or equal to 0).</li> <li>GET <code>/items</code>: return views and downloads for all items that Solr knows about¹. Accepts <code>limit</code> and <code>page</code> query parameters for pagination of results (<code>limit</code> must be an integer between 1 and 100, and <code>page</code> must be an integer greater than or equal to 0).</li>
<li>POST <code>/items</code>: return views and downloads for an arbitrary list of items with an optional date range. Accepts <code>limit</code>, <code>page</code>, <code>dateFrom</code>, and <code>dateTo</code> parameters².</li> <li>POST <code>/items</code>: return views and downloads for an arbitrary list of items with an optional date range. Accepts <code>limit</code>, <code>page</code>, <code>dateFrom</code>, and <code>dateTo</code> parameters².</li>
<li>GET <code>/item/id</code>: return views and downloads for a single item (<code>id</code> must be a UUID). Returns HTTP 404 if an item id is not found.</li> <li>GET <code>/item/id</code>: return views and downloads for a single item (<code>id</code> must be a UUID). Returns HTTP 404 if an item id is not found.</li>
<li>GET <code>/communities</code>: return views and downloads for all communities that Solr knows about¹. Accepts <code>limit</code> and <code>page</code> query parameters for pagination of results (<code>limit</code> must be an integer between 1 and 100, and <code>page</code> must be an integer greater than or equal to 0).
<li>POST <code>/communities</code>: return views and downloads for an arbitrary list of communities with an optional date range. Accepts <code>limit</code>, <code>page</code>, <code>dateFrom</code>, and <code>dateTo</code> parameters².
<li>GET <code>/community/id</code>: return views and downloads for a single community (<code>id</code> must be a UUID). Returns HTTP 404 if a community id is not found.
<li>GET <code>/collections</code>: return views and downloads for all collections that Solr knows about¹. Accepts <code>limit</code> and <code>page</code> query parameters for pagination of results (<code>limit</code> must be an integer between 1 and 100, and <code>page</code> must be an integer greater than or equal to 0).
<li>POST <code>/collections</code>: return views and downloads for an arbitrary list of collections with an optional date range. Accepts <code>limit</code>, <code>page</code>, <code>dateFrom</code>, and <code>dateTo</code> parameters².
<li>GET <code>/collection/id</code>: return views and downloads for a single collection (<code>id</code> must be a UUID). Returns HTTP 404 if a collection id is not found.
</ul> </ul>
<p>The id is the <em>internal</em> UUID for an item, community, or collection. You can get these from the standard DSpace REST API.</p> <p>The item id is the <em>internal</em> uuid for an item. You can get these from the standard DSpace REST API.</p>
<hr/> <hr/>
<p>¹ We are querying the Solr statistics core, which technically only knows about items, communities, or collections that have either views or downloads. If an item, community, or collection is not present here you can assume it has zero views and zero downloads, but not necessarily that it does not exist in the repository.</p> <p>¹ We are querying the Solr statistics core, which technically only knows about items that have either views or downloads. If an item is not present here you can assume it has zero views and zero downloads, but not necessarily that it does not exist in the repository.</p>
<p>² POST requests to <code>/items</code> should be in JSON format with the following parameters:
<p>² POST requests to <code>/items</code>, <code>/communities</code>, and <code>/collections</code> should be in JSON format with the following parameters (substitute the "items" list for communities or collections accordingly):</p>
<pre><code>{ <pre><code>{
"limit": 100, // optional, integer between 1 and 100, default 100 "limit": 100, // optional, integer between 1 and 100, default 100
"page": 0, // optional, integer greater than or equal to 0, default 0 "page": 0, // optional, integer greater than or equal to 0, default 0


@ -18,8 +18,8 @@
# #
# --- # ---
# #
# Connects to a DSpace Solr statistics core and ingests views and downloads for # Connects to a DSpace Solr statistics core and ingests item views and downloads
# communities, collections, and items into a PostgreSQL database. # into a PostgreSQL database for use by other applications (like an API).
# #
# This script is written for Python 3.6+ and requires several modules that you # This script is written for Python 3.6+ and requires several modules that you
# can install with pip (I recommend using a Python virtual environment): # can install with pip (I recommend using a Python virtual environment):
@ -36,7 +36,7 @@ from .database import DatabaseManager
from .util import get_statistics_shards from .util import get_statistics_shards
def index_views(indexType: str, facetField: str): def index_views():
# get total number of distinct facets for items with a minimum of 1 view, # get total number of distinct facets for items with a minimum of 1 view,
# otherwise Solr returns all kinds of weird ids that are actually not in # otherwise Solr returns all kinds of weird ids that are actually not in
# the database. Also, stats are expensive, but we need stats.calcdistinct # the database. Also, stats are expensive, but we need stats.calcdistinct
@ -47,14 +47,14 @@ def index_views(indexType: str, facetField: str):
solr_query_params = { solr_query_params = {
"q": "type:2", "q": "type:2",
"fq": "-isBot:true AND statistics_type:view", "fq": "-isBot:true AND statistics_type:view",
"fl": facetField, "fl": "id",
"facet": "true", "facet": "true",
"facet.field": facetField, "facet.field": "id",
"facet.mincount": 1, "facet.mincount": 1,
"facet.limit": 1, "facet.limit": 1,
"facet.offset": 0, "facet.offset": 0,
"stats": "true", "stats": "true",
"stats.field": facetField, "stats.field": "id",
"stats.calcdistinct": "true", "stats.calcdistinct": "true",
"shards": shards, "shards": shards,
"rows": 0, "rows": 0,
@ -67,11 +67,11 @@ def index_views(indexType: str, facetField: str):
try: try:
# get total number of distinct facets (countDistinct) # get total number of distinct facets (countDistinct)
results_totalNumFacets = res.json()["stats"]["stats_fields"][facetField][ results_totalNumFacets = res.json()["stats"]["stats_fields"]["id"][
"countDistinct" "countDistinct"
] ]
except TypeError: except TypeError:
print(f"{indexType}: no views, exiting.") print("No item views to index, exiting.")
exit(0) exit(0)
@ -88,15 +88,15 @@ def index_views(indexType: str, facetField: str):
while results_current_page <= results_num_pages: while results_current_page <= results_num_pages:
# "pages" are zero based, but one based is more human readable # "pages" are zero based, but one based is more human readable
print( print(
f"{indexType}: indexing views (page {results_current_page + 1} of {results_num_pages + 1})" f"Indexing item views (page {results_current_page + 1} of {results_num_pages + 1})"
) )
solr_query_params = { solr_query_params = {
"q": "type:2", "q": "type:2",
"fq": "-isBot:true AND statistics_type:view", "fq": "-isBot:true AND statistics_type:view",
"fl": facetField, "fl": "id",
"facet": "true", "facet": "true",
"facet.field": facetField, "facet.field": "id",
"facet.mincount": 1, "facet.mincount": 1,
"facet.limit": results_per_page, "facet.limit": results_per_page,
"facet.offset": results_current_page * results_per_page, "facet.offset": results_current_page * results_per_page,
@ -110,12 +110,12 @@ def index_views(indexType: str, facetField: str):
# Solr returns facets as a dict of dicts (see json.nl parameter) # Solr returns facets as a dict of dicts (see json.nl parameter)
views = res.json()["facet_counts"]["facet_fields"] views = res.json()["facet_counts"]["facet_fields"]
# iterate over the facetField dict and get the ids and views # iterate over the 'id' dict and get the item ids and views
for id_, views in views[facetField].items(): for item_id, item_views in views["id"].items():
data.append((id_, views)) data.append((item_id, item_views))
# do a batch insert of values from the current "page" of results # do a batch insert of values from the current "page" of results
sql = f"INSERT INTO {indexType}(id, views) VALUES %s ON CONFLICT(id) DO UPDATE SET views=excluded.views" sql = "INSERT INTO items(id, views) VALUES %s ON CONFLICT(id) DO UPDATE SET views=excluded.views"
psycopg2.extras.execute_values(cursor, sql, data, template="(%s, %s)") psycopg2.extras.execute_values(cursor, sql, data, template="(%s, %s)")
db.commit() db.commit()
@ -125,19 +125,19 @@ def index_views(indexType: str, facetField: str):
results_current_page += 1 results_current_page += 1
def index_downloads(indexType: str, facetField: str): def index_downloads():
# get the total number of distinct facets for items with at least 1 download # get the total number of distinct facets for items with at least 1 download
solr_query_params = { solr_query_params = {
"q": "type:0", "q": "type:0",
"fq": "-isBot:true AND statistics_type:view AND bundleName:ORIGINAL", "fq": "-isBot:true AND statistics_type:view AND bundleName:ORIGINAL",
"fl": facetField, "fl": "owningItem",
"facet": "true", "facet": "true",
"facet.field": facetField, "facet.field": "owningItem",
"facet.mincount": 1, "facet.mincount": 1,
"facet.limit": 1, "facet.limit": 1,
"facet.offset": 0, "facet.offset": 0,
"stats": "true", "stats": "true",
"stats.field": facetField, "stats.field": "owningItem",
"stats.calcdistinct": "true", "stats.calcdistinct": "true",
"shards": shards, "shards": shards,
"rows": 0, "rows": 0,
@ -150,11 +150,11 @@ def index_downloads(indexType: str, facetField: str):
try: try:
# get total number of distinct facets (countDistinct) # get total number of distinct facets (countDistinct)
results_totalNumFacets = res.json()["stats"]["stats_fields"][facetField][ results_totalNumFacets = res.json()["stats"]["stats_fields"]["owningItem"][
"countDistinct" "countDistinct"
] ]
except TypeError: except TypeError:
print(f"{indexType}: no downloads, exiting.") print("No item downloads to index, exiting.")
exit(0) exit(0)
@ -171,15 +171,15 @@ def index_downloads(indexType: str, facetField: str):
while results_current_page <= results_num_pages: while results_current_page <= results_num_pages:
# "pages" are zero based, but one based is more human readable # "pages" are zero based, but one based is more human readable
print( print(
f"{indexType}: indexing downloads (page {results_current_page + 1} of {results_num_pages + 1})" f"Indexing item downloads (page {results_current_page + 1} of {results_num_pages + 1})"
) )
solr_query_params = { solr_query_params = {
"q": "type:0", "q": "type:0",
"fq": "-isBot:true AND statistics_type:view AND bundleName:ORIGINAL", "fq": "-isBot:true AND statistics_type:view AND bundleName:ORIGINAL",
"fl": facetField, "fl": "owningItem",
"facet": "true", "facet": "true",
"facet.field": facetField, "facet.field": "owningItem",
"facet.mincount": 1, "facet.mincount": 1,
"facet.limit": results_per_page, "facet.limit": results_per_page,
"facet.offset": results_current_page * results_per_page, "facet.offset": results_current_page * results_per_page,
@ -193,12 +193,12 @@ def index_downloads(indexType: str, facetField: str):
# Solr returns facets as a dict of dicts (see json.nl parameter) # Solr returns facets as a dict of dicts (see json.nl parameter)
downloads = res.json()["facet_counts"]["facet_fields"] downloads = res.json()["facet_counts"]["facet_fields"]
# iterate over the facetField dict and get the item ids and downloads # iterate over the 'owningItem' dict and get the item ids and downloads
for id_, downloads in downloads[facetField].items(): for item_id, item_downloads in downloads["owningItem"].items():
data.append((id_, downloads)) data.append((item_id, item_downloads))
# do a batch insert of values from the current "page" of results # do a batch insert of values from the current "page" of results
sql = f"INSERT INTO {indexType}(id, downloads) VALUES %s ON CONFLICT(id) DO UPDATE SET downloads=excluded.downloads" sql = "INSERT INTO items(id, downloads) VALUES %s ON CONFLICT(id) DO UPDATE SET downloads=excluded.downloads"
psycopg2.extras.execute_values(cursor, sql, data, template="(%s, %s)") psycopg2.extras.execute_values(cursor, sql, data, template="(%s, %s)")
db.commit() db.commit()
@ -215,32 +215,13 @@ with DatabaseManager() as db:
"""CREATE TABLE IF NOT EXISTS items """CREATE TABLE IF NOT EXISTS items
(id UUID PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)""" (id UUID PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)"""
) )
# create table to store community views and downloads
cursor.execute(
"""CREATE TABLE IF NOT EXISTS communities
(id UUID PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)"""
)
# create table to store collection views and downloads
cursor.execute(
"""CREATE TABLE IF NOT EXISTS collections
(id UUID PRIMARY KEY, views INT DEFAULT 0, downloads INT DEFAULT 0)"""
)
# commit the table creation before closing the database connection # commit the table creation before closing the database connection
db.commit() db.commit()
shards = get_statistics_shards() shards = get_statistics_shards()
# Index views and downloads for items, communities, and collections. Here the index_views()
# first parameter is the type of indexing to perform, and the second parameter index_downloads()
# is the field to facet by in Solr's statistics to get this information.
index_views("items", "id")
index_views("communities", "owningComm")
index_views("collections", "owningColl")
index_downloads("items", "owningItem")
index_downloads("communities", "owningComm")
index_downloads("collections", "owningColl")
# vim: set sw=4 ts=4 expandtab: # vim: set sw=4 ts=4 expandtab:
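
The indexer above pages through Solr facets by setting `facet.limit` to the page size and `facet.offset` to `results_current_page * results_per_page`. The sketch below shows that offset arithmetic in isolation. How the indexer derives `results_num_pages` is not visible in this excerpt, so the `math.ceil` used here is an assumption, and `facet_pages` is a hypothetical helper rather than code from the project.

```python
import math
from typing import Iterator, Tuple


def facet_pages(total_facets: int, per_page: int) -> Iterator[Tuple[int, int]]:
    """Yield (page, facet.offset) pairs that cover total_facets facet values."""
    # Assumption: round up so a final partial page is still requested
    num_pages = math.ceil(total_facets / per_page)
    for page in range(num_pages):
        # Same arithmetic as the facet.offset parameter in the indexer
        yield page, page * per_page


for page, offset in facet_pages(250, 100):
    # "pages" are zero based, but one based is more human readable
    print(f"page {page + 1}: facet.offset={offset}")
```

Each offset would be passed as `facet.offset` in one Solr request, with `facet.limit` held at the page size.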


@ -0,0 +1,107 @@
import requests

from .config import SOLR_SERVER
from .util import get_statistics_shards


def get_views(solr_date_string: str, items: list):
    """
    Get view statistics for a list of items from Solr.

    :parameter solr_date_string (str): Solr date string, for example "[* TO *]"
    :parameter items (list): a list of item IDs
    :returns: A dict of item IDs and views
    """
    shards = get_statistics_shards()

    # Join the UUIDs with "OR" and escape the hyphens for Solr
    solr_items_string: str = " OR ".join(items).replace("-", r"\-")

    solr_query_params = {
        "q": f"id:({solr_items_string})",
        "fq": f"type:2 AND isBot:false AND statistics_type:view AND time:{solr_date_string}",
        "fl": "id",
        "facet": "true",
        "facet.field": "id",
        "facet.mincount": 1,
        "shards": shards,
        "rows": 0,
        "wt": "json",
        "json.nl": "map",  # return facets as a dict instead of a flat list
    }

    solr_url = SOLR_SERVER + "/statistics/select"
    res = requests.get(solr_url, params=solr_query_params)

    # Create an empty dict to store views
    data = {}

    # Solr returns facets as a dict of dicts (see the json.nl parameter)
    views = res.json()["facet_counts"]["facet_fields"]
    # iterate over the 'id' dict and get the item ids and views
    for item_id, item_views in views["id"].items():
        data[item_id] = item_views

    # Check if any items have missing stats so we can set them to 0
    if len(data) < len(items):
        # List comprehension to get a list of item ids (keys) in the data
        data_ids = [k for k, v in data.items()]
        for item_id in items:
            if item_id not in data_ids:
                data[item_id] = 0
                continue

    return data


def get_downloads(solr_date_string: str, items: list):
    """
    Get download statistics for a list of items from Solr.

    :parameter solr_date_string (str): Solr date string, for example "[* TO *]"
    :parameter items (list): a list of item IDs
    :returns: A dict of item IDs and downloads
    """
    shards = get_statistics_shards()

    # Join the UUIDs with "OR" and escape the hyphens for Solr
    solr_items_string: str = " OR ".join(items).replace("-", r"\-")

    solr_query_params = {
        "q": f"owningItem:({solr_items_string})",
        "fq": f"type:0 AND isBot:false AND statistics_type:view AND bundleName:ORIGINAL AND time:{solr_date_string}",
        "fl": "owningItem",
        "facet": "true",
        "facet.field": "owningItem",
        "facet.mincount": 1,
        "shards": shards,
        "rows": 0,
        "wt": "json",
        "json.nl": "map",  # return facets as a dict instead of a flat list
    }

    solr_url = SOLR_SERVER + "/statistics/select"
    res = requests.get(solr_url, params=solr_query_params)

    # Create an empty dict to store downloads
    data = {}

    # Solr returns facets as a dict of dicts (see the json.nl parameter)
    downloads = res.json()["facet_counts"]["facet_fields"]
    # Iterate over the 'owningItem' dict and get the item ids and downloads
    for item_id, item_downloads in downloads["owningItem"].items():
        data[item_id] = item_downloads

    # Check if any items have missing stats so we can set them to 0
    if len(data) < len(items):
        # List comprehension to get a list of item ids (keys) in the data
        data_ids = [k for k, v in data.items()]
        for item_id in items:
            if item_id not in data_ids:
                data[item_id] = 0
                continue

    return data

# vim: set sw=4 ts=4 expandtab:
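
Two pieces of the functions above can be exercised without a Solr server: building the `field:(id OR id ...)` query with escaped hyphens, and defaulting missing ids to zero. The sketch below is a minimal variation, not the project's code. Both function names are hypothetical, and the zero-filling step here returns entries in the order of the requested ids and drops any id that was not requested, which differs slightly from the loop in `get_views()`.

```python
from typing import Dict, List


def build_solr_ids_query(field: str, ids: List[str]) -> str:
    """Build a Solr query matching any of the ids, escaping hyphens."""
    # Join the UUIDs with "OR" and escape the hyphens for Solr
    escaped = " OR ".join(ids).replace("-", r"\-")
    return f"{field}:({escaped})"


def fill_missing_with_zero(counts: Dict[str, int], ids: List[str]) -> Dict[str, int]:
    """Return counts for the requested ids only, using 0 when Solr had none."""
    return {id_: counts.get(id_, 0) for id_ in ids}


print(build_solr_ids_query("id", ["a-b", "c-d"]))
print(fill_missing_with_zero({"a-b": 3, "x": 9}, ["a-b", "c-d"]))
```

Using `dict.get` with a default avoids building a separate list of known ids before checking for missing ones.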


@ -1,124 +0,0 @@
import requests
from .config import SOLR_SERVER
from .util import get_statistics_shards
def get_views(solr_date_string: str, elements: list, facetField: str):
"""
Get view statistics for a list of elements from Solr. Depending on the
request this could be items, communities, or collections.
:parameter solr_date_string (str): Solr date string, for example "[* TO *]"
:parameter elements (list): a list of IDs
:parameter facetField (str): Solr field to facet by, for example "id"
:returns: A dict of IDs and views
"""
shards = get_statistics_shards()
# Join the UUIDs with "OR" and escape the hyphens for Solr
solr_elements_string: str = " OR ".join(elements).replace("-", r"\-")
solr_query_params = {
"q": f"{facetField}:({solr_elements_string})",
"fq": f"type:2 AND -isBot:true AND statistics_type:view AND time:{solr_date_string}",
"fl": facetField,
"facet": "true",
"facet.field": facetField,
"facet.mincount": 1,
"shards": shards,
"rows": 0,
"wt": "json",
"json.nl": "map", # return facets as a dict instead of a flat list
}
solr_url = SOLR_SERVER + "/statistics/select"
res = requests.get(solr_url, params=solr_query_params)
# Create an empty dict to store views
data = {}
# Solr returns facets as a dict of dicts (see the json.nl parameter)
views = res.json()["facet_counts"]["facet_fields"]
# Iterate over the facetField dict and get the ids and views
for id_, element_views in views[facetField].items():
# For items we can rely on Solr returning facets for *only* the ids in
# our query, but for communities and collections, the owningComm and
# owningColl fields are multi-value so Solr will return facets with the
# values in our query as well as *any others* that happen to be present
# in the field (which looks like Solr returning unrelated results until
# you realize that the field is multi-value and this is correct).
#
# To work around this I make sure that each id in the returned dict is
# present in the elements list POSTed by the user.
if id_ in elements:
data[id_] = element_views
# Check if any ids have missing stats so we can set them to 0
if len(data) < len(elements):
# List comprehension to get a list of ids (keys) in the data
data_ids = [k for k, v in data.items()]
for element_id in elements:
if element_id not in data_ids:
data[element_id] = 0
continue
return data
def get_downloads(solr_date_string: str, elements: list, facetField: str):
"""
Get download statistics for a list of elements from Solr. Depending on the
request this could be items, communities, or collections.
:parameter solr_date_string (str): Solr date string, for example "[* TO *]"
:parameter elements (list): a list of IDs
:parameter facetField (str): Solr field to facet by, for example "id"
:returns: A dict of IDs and downloads
"""
shards = get_statistics_shards()
# Join the UUIDs with "OR" and escape the hyphens for Solr
solr_elements_string: str = " OR ".join(elements).replace("-", r"\-")
solr_query_params = {
"q": f"{facetField}:({solr_elements_string})",
"fq": f"type:0 AND -isBot:true AND statistics_type:view AND bundleName:ORIGINAL AND time:{solr_date_string}",
"fl": facetField,
"facet": "true",
"facet.field": facetField,
"facet.mincount": 1,
"shards": shards,
"rows": 0,
"wt": "json",
"json.nl": "map", # return facets as a dict instead of a flat list
}
solr_url = SOLR_SERVER + "/statistics/select"
res = requests.get(solr_url, params=solr_query_params)
# Create an empty dict to store downloads
data = {}
# Solr returns facets as a dict of dicts (see the json.nl parameter)
downloads = res.json()["facet_counts"]["facet_fields"]
# Iterate over the facetField dict and get the ids and downloads
for id_, element_downloads in downloads[facetField].items():
# Make sure that each id in the returned dict is present in the
# elements list POSTed by the user.
if id_ in elements:
data[id_] = element_downloads
# Check if any elements have missing stats so we can set them to 0
if len(data) < len(elements):
# List comprehension to get a list of ids (keys) in the data
data_ids = [k for k, v in data.items()]
for element_id in elements:
if element_id not in data_ids:
data[element_id] = 0
continue
return data
# vim: set sw=4 ts=4 expandtab:
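
The filtering and zero-filling that `get_views()` and `get_downloads()` apply to the Solr facets can be sketched in isolation. `filter_and_fill` is a hypothetical name, and the facet dict below is invented sample data in the shape Solr returns when `json.nl` is set to `map`; it is not output from a real statistics core.

```python
def filter_and_fill(facets: dict, elements: list) -> dict:
    """Keep only the requested ids and default missing ones to 0.

    Hypothetical helper for illustration only; not part of the project.
    """
    # Multi-value fields such as owningComm can produce facets for ids
    # that were never requested, so drop anything not in the request.
    data = {id_: count for id_, count in facets.items() if id_ in elements}

    # Requested ids with no hits are absent from the facets; report 0.
    for element_id in elements:
        data.setdefault(element_id, 0)

    return data


# Invented sample facets: "comm-parent" was not requested but appears
# because the field is multi-value.
facets = {"comm-1": 10, "comm-2": 4, "comm-parent": 14}
result = filter_and_fill(facets, ["comm-1", "comm-3"])
print(result)
```

Using `setdefault` avoids building a separate list of keys first, at the cost of always looping over the requested ids.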

View File

@ -74,9 +74,8 @@ def is_valid_date(date):
) )
def validate_post_parameters(req, resp, resource, params): def validate_items_post_parameters(req, resp, resource, params):
"""Check the POSTed request parameters for the `/items`, `/communities` and """Check the POSTed request parameters for the `/items` endpoint.
`/collections` endpoints.
Meant to be used as a `before` hook. Meant to be used as a `before` hook.
""" """
@ -126,67 +125,14 @@ def validate_post_parameters(req, resp, resource, params):
else: else:
req.context.page = 0 req.context.page = 0
# Parse the list of elements from the POST request body # Parse the list of items from the POST request body
if req.context.statistics_scope in doc: if "items" in doc:
if ( if isinstance(doc["items"], list) and len(doc["items"]) > 0:
isinstance(doc[req.context.statistics_scope], list) req.context.items = doc["items"]
and len(doc[req.context.statistics_scope]) > 0
):
req.context.elements = doc[req.context.statistics_scope]
else: else:
raise falcon.HTTPBadRequest( raise falcon.HTTPBadRequest(
title="Invalid parameter", title="Invalid parameter",
description=f'The "{req.context.statistics_scope}" parameter is invalid. The value must be a comma-separated list of UUIDs.', description='The "items" parameter is invalid. The value must be a comma-separated list of item UUIDs.',
) )
else: else:
req.context.elements = list() req.context.items = list()
def set_statistics_scope(req, resp, resource, params):
"""Set the statistics scope (item, collection, or community) of the request
as well as the appropriate database (for GET requests) and Solr facet fields
(for POST requests).
Meant to be used as a `before` hook.
"""
# Extract the scope from the request path. This is *guaranteed* to be one
# of the following values because we only send requests matching these few
# patterns to routes using this set_statistics_scope hook.
#
# Note: this regex is ordered so that "items" and "collections" match before
# "item" and "collection".
req.context.statistics_scope = re.findall(
r"^/(communities|community|collections|collection|items|item)", req.path
)[0]
# Set the correct database based on the statistics_scope. The database is
# used for all GET requests where statistics are returned directly from the
# database. In this case we can return early.
if req.method == "GET":
if re.findall(r"^(item|items)$", req.context.statistics_scope):
req.context.database = "items"
elif re.findall(r"^(community|communities)$", req.context.statistics_scope):
req.context.database = "communities"
elif re.findall(r"^(collection|collections)$", req.context.statistics_scope):
req.context.database = "collections"
# GET requests only need the scope and the database so we can return now
return
# If the current request is for one of the plural items, communities, or
# collections endpoints and includes a list of element ids POSTed in the
# request body, then we need to set the Solr facet fields so we can get the
# live results.
if req.method == "POST":
if req.context.statistics_scope == "items":
req.context.views_facet_field = "id"
req.context.downloads_facet_field = "owningItem"
elif req.context.statistics_scope == "communities":
req.context.views_facet_field = "owningComm"
req.context.downloads_facet_field = "owningComm"
elif req.context.statistics_scope == "collections":
req.context.views_facet_field = "owningColl"
req.context.downloads_facet_field = "owningColl"
# vim: set sw=4 ts=4 expandtab:
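
The note in `set_statistics_scope` about the ordering of the regex alternatives can be checked with a short sketch. The pattern is copied from the hook above; `scope_from_path` and the example paths are assumptions made for illustration.

```python
import re

# Same pattern as in set_statistics_scope: plural forms come first
SCOPE_PATTERN = r"^/(communities|community|collections|collection|items|item)"


def scope_from_path(path: str) -> str:
    """Return the statistics scope for a request path.

    Hypothetical helper for illustration only; like the hook above, it
    assumes the path always matches one of the alternatives.
    """
    return re.findall(SCOPE_PATTERN, path)[0]


plural = scope_from_path("/items")
singular = scope_from_path("/item/some-id")

# With the singular form listed first, the alternation stops at "item"
# even for the plural path, because nothing anchors the end of the match.
misordered = re.findall(r"^/(item|items)", "/items")[0]

print(plural, singular, misordered)
```

Python's `re` module tries alternatives left to right and takes the first one that matches, which is why the longer plural forms have to be listed before their singular prefixes.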