Mirror of https://github.com/ilri/dspace-statistics-api.git (synced 2025-05-10 15:16:02 +02:00)

Compare commits (49 commits)
Commits in this range:

62142eb79e, fda0321942, 963aa245c8, 568ff2eebb, deecb8a10b, 12f45d7c08,
f65089f9ce, 1db5cf1c29, e581c4b1aa, e8d356c9ca, 34a9b8d629, 41e3d66a0e,
9b2a6137b4, 600b986f99, 49a7790794, f2deba627c, 9323513794, daf15610f2,
4ede966dbb, 3580473a6d, 071c24535f, 4291aecac4, 46bf537e88, eaca5354d3,
4600288ee4, 8179563378, b14c3eef4d, 71a789b13f, c68ddacaa4, 9c9e79769e,
2ad5ade556, 7412a09670, bb744a00b8, 7499b89d99, 2c1e4952b1, 379f202c3f,
560fa6056d, 385a34e5d0, d0ea62d2bd, 366ae25b8e, 0f3054ae03, 6bf34235d4,
e604d8ca81, fc35b816f3, 9e6a2f7559, 46cfc3ffbc, 2850035a4c, c0b550109a,
bfceffd84d

.travis.yml

```diff
@@ -2,8 +2,10 @@ language: python
 python:
 - "3.5"
 - "3.6"
-- "3.7"
+- "3.7-dev"
-install:
-- pip install -r requirements.txt
+script: pip install -r requirements.txt
+branches:
+  only:
+  - master

 # vim: ts=2 sw=2 et
```

CHANGELOG.md (36 changed lines)
```diff
@@ -4,6 +4,42 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.5.0] - 2018-10-24
+### Added
+- Example nginx configuration to README.md
+
+### Changed
+- Don't initialize Solr connection in API
+
+## [0.4.3] - 2018-10-17
+### Changed
+- Use pip install as script for Travis CI
+
+### Improved
+- Documentation for deployment and testing
+
+## [0.4.2] - 2018-10-04
+### Changed
+- README.md introduction and requirements
+- Use ujson instead of json
+- Iterate directly on SQL cursor in `/items` route
+
+### Fixed
+- Logic error in SQL for item views
+
+## [0.4.1] - 2018-09-26
+### Changed
+- Use execute_values() to batch insert records to PostgreSQL
+
+## [0.4.0] - 2018-09-25
+### Fixed
+- Invalid OnCalendar syntax in dspace-statistics-indexer.timer
+- Major logic error in indexer.py
+
+## [0.3.2] - 2018-09-25
+### Changed
+- /item/id route now returns HTTP 404 if an item is not found
+
 ## [0.3.1] - 2018-09-25
 ### Changed
 - Force SolrClient's kazoo dependency to version 2.5.0 to work with Python 3.7
```

README.md (66 changed lines)
````diff
@@ -1,31 +1,79 @@
-# DSpace Statistics API
+# DSpace Statistics API [![Build Status](https://travis-ci.org/alanorth/dspace-statistics-api.svg?branch=master)](https://travis-ci.org/alanorth/dspace-statistics-api)
-A quick and dirty REST API to expose Solr view and download statistics for items in a DSpace repository.
+A simple REST API to expose Solr view and download statistics for items in a DSpace repository. This project contains a standalone indexing component and a WSGI application.

-Written and tested in Python 3.5, 3.6, and 3.7. Requires PostgreSQL version 9.5 or greater for [`UPSERT` support](https://wiki.postgresql.org/wiki/UPSERT).
+## Requirements

-## Installation
-Create a virtual environment and run it:
+- Python 3.5+
+- PostgreSQL version 9.5+ (due to [`UPSERT` support](https://wiki.postgresql.org/wiki/UPSERT))
+- DSpace 4+ with [Solr usage statistics enabled](https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics)
+
+## Installation and Testing
+Create a Python virtual environment and install the dependencies:

     $ python -m venv venv
     $ . venv/bin/activate
     $ pip install -r requirements.txt
+
+Set up the environment variables for Solr and PostgreSQL:
+
+    $ export SOLR_SERVER=http://localhost:8080/solr
+    $ export DATABASE_NAME=dspacestatistics
+    $ export DATABASE_USER=dspacestatistics
+    $ export DATABASE_PASS=dspacestatistics
+    $ export DATABASE_HOST=localhost
+
+Index the Solr statistics core to populate the PostgreSQL database:
+
+    $ ./indexer.py
+
+Run the REST API:
+
     $ gunicorn app:api
+
+Test to see if there are any statistics:
+
+    $ curl 'http://localhost:8000/items?limit=1'
+
+## Deployment
+There are example systemd service and timer units in the `contrib` directory. The API service listens on localhost by default so you will need to expose it publicly using a web server like nginx.
+
+An example nginx configuration is:
+
+```
+server {
+    #...
+
+    location ~ /rest/statistics/?(.*) {
+        access_log /var/log/nginx/statistics.log;
+        proxy_pass http://statistics_api/$1$is_args$args;
+    }
+}
+
+upstream statistics_api {
+    server 127.0.0.1:5000;
+}
+```
+
+This would expose the API at `/rest/statistics`.

 ## Using the API
 The API exposes the following endpoints:

 - GET `/items` — return views and downloads for all items that Solr knows about¹. Accepts `limit` and `page` query parameters for pagination of results.
-- GET `/item/id` — return views and downloads for a single item (*id* must be a positive integer).
+- GET `/item/id` — return views and downloads for a single item (*id* must be a positive integer). Returns HTTP 404 if an item id is not found.

-¹ We are querying the Solr statistics core, which technically only knows about all items that have either views or downloads.
+¹ We are querying the Solr statistics core, which technically only knows about items that have either views or downloads.

 ## Todo

 - Add API documentation
-- Close up DB connection when gunicorn shuts down gracefully
+- Close DB connection when gunicorn shuts down gracefully
 - Better logging
-- Return HTTP 404 when item_id is nonexistent
 - Tests
+- Check if database exists (try/except)
+- Version API
+- Use JSON in PostgreSQL
+- Switch to [Python 3.6+ f-string syntax](https://realpython.com/python-f-strings/)

 ## License
 This work is licensed under the [GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html).
````

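Because the `/items` route reports `currentPage` and `totalPages` alongside the `statistics` list (see the `message` dict in the app.py diff below), a client can walk the whole collection page by page. Here is a minimal sketch using `requests`, which is already pinned in requirements.txt; the base URL and the stopping condition on `totalPages` are assumptions, not part of this repository:

```python
# Sketch of a paginated client for the /items endpoint. Assumes the API is
# listening on localhost:8000 (as in the README's gunicorn example) and that
# 'totalPages' is the index of the last page, per the message dict in app.py.
import requests

def fetch_all_statistics(base_url='http://localhost:8000', limit=100):
    statistics = []
    page = 0
    while True:
        r = requests.get('{}/items'.format(base_url),
                         params={'limit': limit, 'page': page})
        r.raise_for_status()
        body = r.json()
        statistics.extend(body['statistics'])
        # stop once the last reported page has been fetched
        if page >= body['totalPages']:
            break
        page += 1
    return statistics

print('Fetched statistics for {} items'.format(len(fetch_all_statistics())))
```
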
app.py (34 changed lines)
```diff
@@ -1,10 +1,8 @@
 from database import database_connection
 import falcon
-from solr import solr_connection

 db = database_connection()
 db.set_session(readonly=True)
-solr = solr_connection()

 class AllItemsResource:
     def on_get(self, req, resp):
@@ -22,16 +20,16 @@ class AllItemsResource:

         # get statistics, ordered by id, and use limit and offset to page through results
         cursor.execute('SELECT id, views, downloads FROM items ORDER BY id ASC LIMIT {} OFFSET {}'.format(limit, offset))
-        results = cursor.fetchmany(limit)
-        cursor.close()

         # create a list to hold dicts of item stats
         statistics = list()

         # iterate over results and build statistics object
-        for item in results:
+        for item in cursor:
             statistics.append({ 'id': item['id'], 'views': item['views'], 'downloads': item['downloads'] })
+
+        cursor.close()

         message = {
             'currentPage': page,
             'totalPages': pages,
@@ -47,16 +45,26 @@ class ItemResource:

         cursor = db.cursor()
         cursor.execute('SELECT views, downloads FROM items WHERE id={}'.format(item_id))
-        results = cursor.fetchone()
-        cursor.close()
-
-        statistics = {
-            'id': item_id,
-            'views': results['views'],
-            'downloads': results['downloads']
-        }
-
-        resp.media = statistics
-
+        if cursor.rowcount == 0:
+            raise falcon.HTTPNotFound(
+                title='Item not found',
+                description='The item with id "{}" was not found.'.format(item_id)
+            )
+        else:
+            results = cursor.fetchone()
+
+            statistics = {
+                'id': item_id,
+                'views': results['views'],
+                'downloads': results['downloads']
+            }
+
+            resp.media = statistics
+
+        cursor.close()
+
+def on_exit(api):
+    print("Shutting down DB")
+
 api = falcon.API()
 api.add_route('/items', AllItemsResource())
```

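The new `on_exit()` function above pairs with the Todo item about closing the DB connection when gunicorn shuts down gracefully. gunicorn only calls such hooks from its configuration file, so one hedged sketch of wiring it up (the config file name and the `import app` are assumptions, not part of this diff) might be:

```python
# gunicorn.conf.py (hypothetical) -- gunicorn invokes an on_exit(server) hook
# defined in its config file just before the master process exits.
def on_exit(server):
    import app
    print("Shutting down DB")
    app.db.close()  # close the module-level psycopg2 connection from app.py
```

Started as `gunicorn --config gunicorn.conf.py app:api`, this would close the module-level connection on shutdown.
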
contrib/dspace-statistics-api.service

```diff
@@ -9,8 +9,8 @@ Environment=DATABASE_PASS=dspacestatistics
 Environment=DATABASE_HOST=localhost
 User=nobody
 Group=nogroup
-WorkingDirectory=/opt/ilri/dspace-statistics-api
-ExecStart=/opt/ilri/dspace-statistics-api/venv/bin/gunicorn \
+WorkingDirectory=/var/lib/dspace-statistics-api
+ExecStart=/var/lib/dspace-statistics-api/venv/bin/gunicorn \
     --bind 127.0.0.1:5000 \
     app:api
 ExecReload=/bin/kill -s HUP $MAINPID
```

contrib/dspace-statistics-indexer.service

```diff
@@ -10,8 +10,8 @@ Environment=DATABASE_PASS=dspacestatistics
 Environment=DATABASE_HOST=localhost
 User=nobody
 Group=nogroup
-WorkingDirectory=/opt/ilri/dspace-statistics-api
-ExecStart=/opt/ilri/dspace-statistics-api/venv/bin/python indexer.py
+WorkingDirectory=/var/lib/dspace-statistics-api
+ExecStart=/var/lib/dspace-statistics-api/venv/bin/python indexer.py

 [Install]
 WantedBy=multi-user.target
```

contrib/dspace-statistics-indexer.timer

```diff
@@ -3,7 +3,7 @@ Description=DSpace Statistics Indexer

 [Timer]
 # twice a day, at 6AM and 6PM
-OnCalendar=*-*-* 06:00:00,18:00:00
+OnCalendar=*-*-* 06,18:00:00
 # Add a random delay of 0–3600 seconds
 RandomizedDelaySec=3600
 Persistent=true
```

The old value is not valid systemd calendar syntax; listing the alternate hours inside the hour field (`06,18:00:00`) is how systemd expresses 06:00:00 and 18:00:00 daily, which is the fix the 0.4.0 changelog entry describes.

database.py

```diff
@@ -2,8 +2,7 @@ from config import DATABASE_NAME
 from config import DATABASE_USER
 from config import DATABASE_PASS
 from config import DATABASE_HOST
-import psycopg2
-import psycopg2.extras
+import psycopg2, psycopg2.extras

 def database_connection():
     connection = psycopg2.connect("dbname={} user={} password={} host='{}'".format(DATABASE_NAME, DATABASE_USER, DATABASE_PASS, DATABASE_HOST), cursor_factory=psycopg2.extras.DictCursor)
```

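`database_connection()` passes `cursor_factory=psycopg2.extras.DictCursor`, which is why app.py can read `item['views']` instead of unpacking tuples. A small sketch of that behavior; the connection parameters mirror the defaults from the README and are assumptions here:

```python
# Rows from a DictCursor-backed cursor act like dicts keyed by column name.
import psycopg2
import psycopg2.extras

connection = psycopg2.connect(
    'dbname=dspacestatistics user=dspacestatistics '
    'password=dspacestatistics host=localhost',
    cursor_factory=psycopg2.extras.DictCursor
)

cursor = connection.cursor()
cursor.execute('SELECT id, views, downloads FROM items LIMIT 1')
row = cursor.fetchone()
if row is not None:
    # access columns by name instead of positional index
    print(row['id'], row['views'], row['downloads'])
cursor.close()
```
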
indexer.py (109 changed lines)
```diff
@@ -20,111 +20,140 @@
 # ---
 #
 # Connects to a DSpace Solr statistics core and ingests item views and downloads
-# into a Postgres database for use with other applications (an API, for example).
+# into a PostgreSQL database for use by other applications (like an API).
 #
-# This script is written for Python 3 and requires several modules that you can
-# install with pip (I recommend setting up a Python virtual environment first):
+# This script is written for Python 3.5+ and requires several modules that you
+# can install with pip (I recommend using a Python virtual environment):
 #
-# $ pip install SolrClient
+# $ pip install SolrClient psycopg2-binary
 #
 # See: https://solrclient.readthedocs.io/en/latest/SolrClient.html
 # See: https://wiki.duraspace.org/display/DSPACE/Solr
-#
-# Tested with Python 3.5 and 3.6.

 from database import database_connection
+import ujson
+import psycopg2.extras
 from solr import solr_connection

 def index_views():
-    print("Populating database with item views.")
-
-    # determine the total number of items with views (aka Solr's numFound)
+    # get total number of distinct facets for items with a minimum of 1 view,
+    # otherwise Solr returns all kinds of weird ids that are actually not in
+    # the database. Also, stats are expensive, but we need stats.calcdistinct
+    # so we can get the countDistinct summary.
+    #
+    # see: https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
     res = solr.query('statistics', {
         'q':'type:2',
         'fq':'isBot:false AND statistics_type:view',
         'facet':True,
         'facet.field':'id',
+        'facet.mincount':1,
+        'facet.limit':1,
+        'facet.offset':0,
+        'stats':True,
+        'stats.field':'id',
+        'stats.calcdistinct':True
     }, rows=0)

-    # divide results into "pages" (numFound / 100)
-    results_numFound = res.get_num_found()
+    # get total number of distinct facets (countDistinct)
+    results_totalNumFacets = ujson.loads(res.get_json())['stats']['stats_fields']['id']['countDistinct']
+
+    # divide results into "pages" (cast to int to effectively round down)
     results_per_page = 100
-    results_num_pages = round(results_numFound / results_per_page)
+    results_num_pages = int(results_totalNumFacets / results_per_page)
     results_current_page = 0

     cursor = db.cursor()
+
+    # create an empty list to store values for batch insertion
+    data = []

     while results_current_page <= results_num_pages:
-        print('Page {} of {}.'.format(results_current_page, results_num_pages))
+        print('Indexing item views (page {} of {})'.format(results_current_page, results_num_pages))

         res = solr.query('statistics', {
             'q':'type:2',
             'fq':'isBot:false AND statistics_type:view',
             'facet':True,
             'facet.field':'id',
+            'facet.mincount':1,
             'facet.limit':results_per_page,
             'facet.offset':results_current_page * results_per_page
-        })
+        }, rows=0)

-        # make sure total number of results > 0
-        if res.get_num_found() > 0:
-            # SolrClient's get_facets() returns a dict of dicts
-            views = res.get_facets()
-            # in this case iterate over the 'id' dict and get the item ids and views
-            for item_id, item_views in views['id'].items():
-                cursor.execute('''INSERT INTO items(id, views) VALUES(%s, %s)
-                                  ON CONFLICT(id) DO UPDATE SET downloads=excluded.views''',
-                                  (item_id, item_views))
+        # SolrClient's get_facets() returns a dict of dicts
+        views = res.get_facets()
+        # in this case iterate over the 'id' dict and get the item ids and views
+        for item_id, item_views in views['id'].items():
+            data.append((item_id, item_views))

+        # do a batch insert of values from the current "page" of results
+        sql = 'INSERT INTO items(id, views) VALUES %s ON CONFLICT(id) DO UPDATE SET views=excluded.views'
+        psycopg2.extras.execute_values(cursor, sql, data, template='(%s, %s)')
         db.commit()
+
+        # clear all items from the list so we can populate it with the next batch
+        data.clear()

         results_current_page += 1

     cursor.close()

 def index_downloads():
-    print("Populating database with item downloads.")
-
-    # determine the total number of items with downloads (aka Solr's numFound)
+    # get the total number of distinct facets for items with at least 1 download
     res = solr.query('statistics', {
         'q':'type:0',
         'fq':'isBot:false AND statistics_type:view AND bundleName:ORIGINAL',
         'facet':True,
         'facet.field':'owningItem',
+        'facet.mincount':1,
+        'facet.limit':1,
+        'facet.offset':0,
+        'stats':True,
+        'stats.field':'owningItem',
+        'stats.calcdistinct':True
     }, rows=0)

-    # divide results into "pages" (numFound / 100)
-    results_numFound = res.get_num_found()
+    # get total number of distinct facets (countDistinct)
+    results_totalNumFacets = ujson.loads(res.get_json())['stats']['stats_fields']['owningItem']['countDistinct']
+
+    # divide results into "pages" (cast to int to effectively round down)
     results_per_page = 100
-    results_num_pages = round(results_numFound / results_per_page)
+    results_num_pages = int(results_totalNumFacets / results_per_page)
     results_current_page = 0

     cursor = db.cursor()
+
+    # create an empty list to store values for batch insertion
+    data = []

     while results_current_page <= results_num_pages:
-        print('Page {} of {}.'.format(results_current_page, results_num_pages))
+        print('Indexing item downloads (page {} of {})'.format(results_current_page, results_num_pages))

         res = solr.query('statistics', {
             'q':'type:0',
             'fq':'isBot:false AND statistics_type:view AND bundleName:ORIGINAL',
             'facet':True,
             'facet.field':'owningItem',
+            'facet.mincount':1,
             'facet.limit':results_per_page,
             'facet.offset':results_current_page * results_per_page
-        })
+        }, rows=0)

-        # make sure total number of results > 0
-        if res.get_num_found() > 0:
-            # SolrClient's get_facets() returns a dict of dicts
-            downloads = res.get_facets()
-            # in this case iterate over the 'owningItem' dict and get the item ids and downloads
-            for item_id, item_downloads in downloads['owningItem'].items():
-                cursor.execute('''INSERT INTO items(id, downloads) VALUES(%s, %s)
-                                  ON CONFLICT(id) DO UPDATE SET downloads=excluded.downloads''',
-                                  (item_id, item_downloads))
+        # SolrClient's get_facets() returns a dict of dicts
+        downloads = res.get_facets()
+        # in this case iterate over the 'owningItem' dict and get the item ids and downloads
+        for item_id, item_downloads in downloads['owningItem'].items():
+            data.append((item_id, item_downloads))

+        # do a batch insert of values from the current "page" of results
+        sql = 'INSERT INTO items(id, downloads) VALUES %s ON CONFLICT(id) DO UPDATE SET downloads=excluded.downloads'
+        psycopg2.extras.execute_values(cursor, sql, data, template='(%s, %s)')
         db.commit()
+
+        # clear all items from the list so we can populate it with the next batch
+        data.clear()

         results_current_page += 1

     cursor.close()
```

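The heart of the 0.4.1 change is replacing per-row `INSERT`s with one `execute_values()` call per facet page. A self-contained sketch of that batching pattern follows; the connection parameters and the sample tuples are illustrative only:

```python
# Batch-upsert (id, views) pairs in a single round trip using
# psycopg2.extras.execute_values(), which expands the VALUES %s placeholder
# with one '(%s, %s)' template per tuple in the list.
import psycopg2
import psycopg2.extras

db = psycopg2.connect('dbname=dspacestatistics user=dspacestatistics '
                      'password=dspacestatistics host=localhost')
cursor = db.cursor()

data = [(1, 100), (2, 34), (3, 7)]  # hypothetical (item id, views) pairs

sql = 'INSERT INTO items(id, views) VALUES %s ON CONFLICT(id) DO UPDATE SET views=excluded.views'
psycopg2.extras.execute_values(cursor, sql, data, template='(%s, %s)')
db.commit()
cursor.close()
```

Compared with executing one `INSERT` per item, this sends a whole page of up to 100 facet values to PostgreSQL at once, which is why the indexer accumulates tuples in `data` and calls `data.clear()` after each commit.
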
requirements.txt

```diff
@@ -9,4 +9,5 @@ python-mimeparse==1.6.0
 requests==2.19.1
 six==1.11.0
 SolrClient==0.2.1
+ujson==1.35
 urllib3==1.23
```