dspace-statistics-api

Commit Graph

Author	SHA1	Message	Date
Alan Orth	0650c5985e	Add SPDX short license identifier to all Python files See: https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/	2021-03-22 13:42:42 +02:00
Alan Orth	49751b53f0	dspace_statistics_api/indexer.py: Limit to UUIDs We need to make sure that the indexer only tries to index UUIDs, as opposed to legacy IDs that may have been left over from a migration from earlier DSpace versions. For example, "98110-unmigrated", "-1" etc. For matching the UUIDs in Solr I decided that it is sufficient for our use case to simply match thirty-six characters, where a UUID is composed of thirty-two hexadecimal characters and four dashes. We don't need to do any verification of "real" UUIDs because it would be needlessly complex in our case. See: https://github.com/ilri/dspace-statistics-api/issues/12	2021-01-05 12:30:27 +02:00
Alan Orth	4f8cd1097b	Rework paging The "totalPages" value in our response is calculated incorrectly. Instead of casting to int and rounding, we should rather round up to the next integer with math.ceil. This is a more correct way to get the value. Also update the indexer to use the same logic, although there the values are printed with +1 so they are more readable.	2020-12-27 12:22:07 +02:00
Alan Orth	20c8ba0cf8	indexer.py: Add support for communities and collections The logic to get views and downloads is very similar to that used for items, but we facet by different fields. This uses a generic function for indexing that takes an "indexType" and a "facetField" parameter. The indexType parameter controls which database table to insert into, and the facetField parameter indicates which field to facet by in Solr.	2020-12-18 22:53:16 +02:00
Alan Orth	b486f51dd7	indexer.py: Rename index functions for items Start making plans for indexing communities and collections.	2020-12-18 22:53:16 +02:00
Alan Orth	4bbbaa4af3	dspace_statistics_api/indexer.py: Use `fl` parameter I forgot to add the fl parameter to the downloads function.	2020-12-18 10:44:02 +02:00
Alan Orth	2407aeec70	dspace_statistics_api/indexer.py: Use `fl` parameter When indexing item views and downloads the only field we need is the id. The `fl` parameter tells Solr which fields to return in the search results. This should theoretically be more efficient, though I don't have any time to figure out how to measure it right now.	2020-12-17 12:25:28 +02:00
Alan Orth	810508d038	dspace_statistics_api/indexer.py: Use -isBot:true Minor change to bot filtering. We should use a negated match for documents that have `isBot:true` rather than looking for documents that are tagged with `isBot:false` (the distinction is subtle, but important).	2020-11-17 17:40:08 +02:00
Alan Orth	b06651d1ec	dspace_statistics_api/indexer.py: Fix Python comment	2020-09-25 13:35:05 +03:00
Alan Orth	f58c209609	dspace_statistics_api/indexer.py: Update comment I don't remember why we needed the stats, but it seems that it was because without them there is no way to know how many results were returned and therefore no way to know how many pages we'll need to iterate over. Having the total number allows us to use a limit and an offset to page through them deterministically.	2020-09-25 13:25:34 +03:00
Alan Orth	495386856b	Refactor indexer Move the get_statistics_shards() method to a utility module so it can be used by other things.	2020-09-24 12:03:12 +03:00
Alan Orth	8e87f80e9a	dspace_statistics_api/indexer.py: Remove duplicate solr_url variable This is declared twice and it never changes.	2020-09-24 11:54:31 +03:00
Alan Orth	6ff95bb5f2	dspace_statistics_api/indexer.py: Remove SolrClient reference We stopped using SolrClient in favor of vanilla requests.	2020-09-24 11:30:31 +03:00
Alan Orth	0ef071a91d	dspace_statistics_api: Use f-strings instead of format() We had previously been avoiding the f-strings because we needed to run on Python 3.5 and they were only available in Python 3.6+, but now the black formatter requires Python 3.6 and all our systems are running Python 3.6+ anyways.	2020-03-02 11:24:29 +02:00
Alan Orth	250fd8164f	dspace_statistics_api/indexer.py: Use UUID DSpace 6+ uses a UUID for item identifiers instead of an integer so we need to update the PostgreSQL schema accordingly. Solr still refers to them as "id" in its schema so we don't need to change anything there.	2020-03-01 21:22:10 +02:00
Alan Orth	cb3c3d37fa	Sort imports with isort	2019-11-27 12:31:04 +02:00
Alan Orth	4ff1fd4a22	Format code with black	2019-11-27 12:30:06 +02:00
Alan Orth	eeb8e6bba1	dspace_statistics_api/indexer.py: Fix minor issues raised by flake8	2019-11-27 12:12:05 +02:00
Alan Orth	2acd08e0ab	Use one-based paging in indexer output It is easier for humans to understand one-based paging output like "page 1 of 3" than "page 0 of 2" in the indexer.	2019-04-15 10:25:54 +03:00
Alan Orth	8f46ceb8d8	Refactor to use vanilla requests library The SolrClient library is unmaintained, which is starting to cause problems due to the moving Python ecosystem. Switching to requests does not change my code in any meaningful way and makes maintenance easier.	2019-04-15 10:19:50 +03:00
Alan Orth	043d897cef	dspace_statistics_api/indexer.py: Catch case of no views/downloads Don't fail with an exception when there are no views or downloads, for example on a new DSpace installation.	2019-01-22 09:00:22 +02:00
Alan Orth	40e284dac0	dspace_statistics_api/indexer.py: Query multiple shards DSpace's stats-util script splits the Solr statistics core into yearly shards. We need to use Solr's `shards` query parameter in order to get the statistics for previous years. This commit adds a helper function to enumerate the active Solr cores to find yearly shards matching the statistics-YYYY pattern and add them to the query.	2019-01-22 08:39:36 +02:00
Alan Orth	f6e866a589	dspace_statistics_api/indexer.py: Remove debug code	2018-11-07 17:51:24 +02:00
Alan Orth	2f342be948	Refactor database code to use a context manager Instead of opening one global persistent database connection when the application starts, I am now abstracting it to a class that I can use in combination with Python's "with" context. Both connections and cursors are kept for the context of each "with" block and closed automatically when exiting. See: https://alysivji.github.io/managing-resources-with-context-managers-pythonic.html See: http://initd.org/psycopg/docs/connection.html#connection.close	2018-11-07 17:41:21 +02:00
Alan Orth	cc5ce3ab98	Correct issues highlighted by Flake8 Flake8 validates code style against PEP 8 in order to encourage the writing of idiomatic Python. For reference, I am currently ignoring errors about line length (E501) because I feel it makes code harder to read. This is the invocation I am using: $ flake8 --ignore E501 dspace_statistics_api	2018-11-04 00:04:27 +02:00
Alan Orth	2136dc79ce	Remove shebang from indexer.py This is run as a Python module now so does not need a shebang.	2018-10-28 11:14:21 +02:00
Alan Orth	ed60120cef	Remove executable bit from indexer.py Now it is run as a Python module.	2018-10-28 11:14:21 +02:00
Alan Orth	c027f01b48	Refactor project structure This follows guidance from several well-known Python best practices guides. Basically, the idea is to create a package for the application that is composed of several re-usable modules. See: https://docs.python-guide.org/writing/structure/ See: https://realpython.com/python-application-layouts/	2018-10-28 11:14:21 +02:00
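
The paging change described in 4f8cd1097b can be illustrated with a minimal sketch. Only the use of math.ceil comes from the commit message; the function name `total_pages` and the sample numbers are hypothetical and not taken from the repository.

```python
import math


def total_pages(total_results: int, limit: int) -> int:
    """Return how many pages are needed to hold total_results at limit per page."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Round up so that a partial final page still counts as a page.
    return math.ceil(total_results / limit)


# Hypothetical example: 250 results at 100 per page need three pages,
# while truncating the quotient with int() would report only two.
print(int(250 / 100))
print(total_pages(250, 100))
```

Rounding up matters whenever the result count is not an exact multiple of the page size, because the leftover results still occupy a final page.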

28 Commits
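
The thirty-six character check described in 49751b53f0 could look roughly like the following in plain Python. This is a sketch under stated assumptions: the commit describes matching UUIDs in Solr, whereas this example applies the same length-only idea on the client side, and the helper name `looks_like_uuid` and the sample UUID are hypothetical.

```python
import re

# Length-only check: any thirty-six characters are accepted, as the commit
# message describes. This deliberately does not validate "real" UUIDs.
THIRTY_SIX_CHARS = re.compile(r".{36}")


def looks_like_uuid(identifier: str) -> bool:
    """Return True if identifier is exactly thirty-six characters long."""
    return THIRTY_SIX_CHARS.fullmatch(identifier) is not None


# Legacy IDs quoted in the commit message are rejected; a UUID-shaped
# string (hypothetical sample) is accepted.
for candidate in ["98110-unmigrated", "-1", "f44cf173-2344-4eb2-8f00-ee55df32c76f"]:
    print(candidate, looks_like_uuid(candidate))
```

The trade-off is the one the commit message names: the check is simple and filters out leftover legacy IDs, at the cost of accepting any thirty-six character string.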