Stats and Analytics¶
References¶
https://github.com/inveniosoftware/invenio-stats/blob/master/invenio_stats/utils.py#L94 https://github.com/inveniosoftware/invenio-app-rdm/blob/master/invenio_app_rdm/config.py#L1186 https://github.com/inveniosoftware/react-invenio-app-ils/blob/c55054d59f2a033a5ffffc59b67f4fa1819d2913/src/lib/api/stats/stats.js#L5
Stats indices¶
Usage stats are stored only in the search index, not in the db. This makes it imperative that the stats indices are regularly backed up, because the stats cannot be recovered from the db.
Individual events¶
Indices are created to store the individual usage events
kcworks-events-stats-record-view-YYYY-MM
kcworks-events-stats-file-download-YYYY-MM
These are collectively aliased to
kcworks-events-stats-record-view
kcworks-events-stats-file-download
Aggregations¶
In addition, aggregate indices are created for each month with the naming patterns
kcworks-stats-file-download-YYYY-MM
kcworks-stats-record-view-YYYY-MM
These are collectively aliased tokcworks-stats-file-download
kcworks-stats-record-view
A bookmarks index tracks the creation of the aggregate indices:
kcworks-stats-bookmarks
The aggregate records each count a single record’s events (views or downloads) during the month for the index. They include both the recid and the parent recid, allowing the aggregations to be collected either for an individual version or for all versions of a work (i.e., parent).
bucket aggregations create a bucket for each unique term in a field
“aggregations.xxx.buckets” is an array of objects with “key” and “doc_count” values
Aggregation requests¶
in agg_iter:
dsl.Search(using=self.client, index=self.event_index).filter(
# Filter for the specific interval (hour, day, month)
"term",
timestamp=rounded_dt,
)
for p in range(num_partitions)
terms = agg_query.aggs.bucket(
"terms",
"terms",
field=unique_id,
include={"partition": p, "num_partitions": num_partitions},
size=self.max_bucket_size,
)
terms.metric(
"top_hit", "top_hits", size=1, sort={"timestamp": "desc"}
)
for dst, (metric, src, opts) in self.metric_fields.items():
terms.metric(dst, metric, field=src, **opts)
{
"aggs": {
"???": {
"terms": {
"terms": {
"field": "unique_session_id",
"partition": n,
"num_partitions": x,
"size": y
}
}
"cardinality": {
"field": "unique_session_id",
"precision_threshold": 1000,
}
}
}
}
query result basis for aggregation records, from a terms bucket aggregation with only stats returned:
organized in buckets by “ui” + record id (or “unique_id”)
“doc_count” lists number of hits (events) for a given day (the set interval)
{
'_shards': {'failed': 0, 'skipped': 0, 'successful': 69, 'total': 69},
'aggregations': {
'terms': {
'buckets': [
...
{
'doc_count': 1,
'key': 'ui_yv6k0-srr25',
'last_update': {
'value': 1726598317179.0,
'value_as_string': '2024-09-17T18:38:37.179Z'
},
'top_hit': {
'hits': {
'hits': [
{
'_id': '2024-04-17T12:19:30-'
'd719203049c7b7ffa7bad3dfc97de87225549cf7',
'_index': 'events-stats-record-view-2024-04',
'_score': None,
'_source': {
'country': 'imported',
'is_robot': False,
'parent_recid': 'q4wph-4hk75',
'recid': 'yv6k0-srr25',
'timestamp': '2024-04-17T11:19:33',
'unique_id': 'ui_yv6k0-srr25',
'unique_session_id':
'a6f9fcc160c3360209'
'd163fb95983caee3273e3fb1a4ba549765f75c',
'updated_timestamp': '2024-09-i'
'17T18:38:37.179332',
'via_api': False,
'visitor_id':
'a6f9fcc160c3360209d163fb959'
'83caee3273e3fb1a4ba549765f75c'
},
'sort': [1713352773000]
}
],
'max_score': None,
'total': {
'relation': 'eq',
'value': 1
}
}
},
'unique_count': {
'value': 1
}
}
],
'doc_count_error_upper_bound': 0,
'sum_other_doc_count': 0
}
},
'hits': {
'hits': [],
'max_score': None,
'total': {'relation': 'eq', 'value': 174}},
'timed_out': False,
'took': 10
}
Index record schemas¶
kcworks-events-stats-record-view¶
{
"_index": "kcworks-events-stats-record-view-2019-01",
"_id": "2019-01-01T00:00:00-8e31b4cd058701ecf5ce0dbdc057f3cc775d94ce",
"_score": 1.0,
"_source": {
"timestamp": "2019-01-01T00:00:00",
"recid": "jnhz6-e2n32",
"parent_recid": "pbwf2-yty55",
"unique_id": "ui_jnhz6-e2n32",
"is_robot": false,
"country": "imported",
"via_api": false,
"visitor_id": "4c2caf11d9698e896a9b5990c43f60949578cc66deb98d2e31eddabd",
"unique_session_id": "4c2caf11d9698e896a9b5990c43f60949578cc66deb98d2e31eddabd",
"updated_timestamp": "2024-07-16T22:58:10.107933"
}
}
kcworks-events-stats-file-download¶
{
"_index": "kcworks-events-stats-file-download-2019-01",
"_id": "2019-01-01T00:00:00-0f6eaae908805134818089196c2855f8804781de",
"_score": 1.0,
"_source": {
"timestamp": "2019-01-01T00:00:00",
"bucket_id": "019b4cd0-31f3-413c-bbc4-95047d1fde34",
"file_id": "da479344-5585-4bf9-9c6d-23cadf51e7ee",
"file_key": "lrvol2019iss13fulltext.pdf",
"size": 742774,
"recid": "k59an-48c77",
"parent_recid": "19xa6-enq55",
"is_robot": false,
"country": "imported",
"unique_id": "019b4cd0-31f3-413c-bbc4-95047d1fde34_da479344-5585-4bf9-9c6d-23cadf51e7ee",
"via_api": false,
"visitor_id": "4c2caf11d9698e896a9b5990c43f60949578cc66deb98d2e31eddabd",
"unique_session_id": "4c2caf11d9698e896a9b5990c43f60949578cc66deb98d2e31eddabd",
"updated_timestamp": "2024-07-17T18:34:13.204519"
}
}
kcworks-stats-bookmarks¶
{
"_index": "kcworks-stats-bookmarks",
"_id": "9GFKe5ABGzTNItjdhxgL",
"_score": 1.0,
"_source": {
"date": "2024-07-04T01:10:00.181946",
"aggregation_type": "stats_reindex"
}
}
kcworks-stats-record-view (aggregates)¶
{
"_index": "kcworks-stats-record-view-2019-01",
"_id": "ui_3bn84-vwr29-2019-01-01",
"_score": 1.0,
"_source": {
"timestamp": "2019-01-01T00:00:00",
"unique_id": "ui_3bn84-vwr29",
"count": 2,
"updated_timestamp": "2024-09-04T21:05:28.568054",
"unique_count": 2,
"recid": "3bn84-vwr29",
"parent_recid": "v51aw-69a63",
"via_api": false
}
}
kcworks-stats-file-download (aggregates)¶
{
"_index": "kcworks-stats-file-download-2019-01",
"_id": "1770b96d-04a2-42ff-8bd5-4369f4fceb3c_78785ea0-70a1-49cf-b082-960bf6deae9c-2019-01-01",
"_score": 1.0,
"_source": {
"timestamp": "2019-01-01T00:00:00",
"unique_id": "1770b96d-04a2-42ff-8bd5-4369f4fceb3c_78785ea0-70a1-49cf-b082-960bf6deae9c",
"count": 2,
"updated_timestamp": "2024-09-04T20:56:41.105763",
"unique_count": 2,
"volume": 53651620.0,
"file_id": "78785ea0-70a1-49cf-b082-960bf6deae9c",
"file_key": "fair-use-in-the-visual-arts-lesson-plans-for-librarians.pdf",
"bucket_id": "1770b96d-04a2-42ff-8bd5-4369f4fceb3c",
"recid": "c2gjs-86225",
"parent_recid": "cqazz-qzj09"
}
}
Retrieving stats¶
Useful OpenSearch queries¶
To retrieve the cumulative file download stats for all records from a specific date, use the following query:
GET kcworks-events-stats-file-download/_search?scroll=1m
{
"size": 10,
"query": {
"bool": {
"must": [
{"term": {"is_robot": false}},
{"range": {"timestamp": {"gte": "2024-11-15T00:00:00"}}}
]
}
},
"aggs": {
"unique_sessions": {
"cardinality": {"field": "unique_session_id"}
}
}
}
This query deduplicates the events by unique_session_id
.
To retrieve the record view stats, you can use the same query but replace the kcworks-events-stats-file-download
index with kcworks-events-stats-record-view
.
Recording events¶
invenio_records¶
signals before and after…
record create
record update
recort delete
record revert
Stats event recording is triggered by the record insertion signals from invenio_records, which are subscribed to like this:
from invenio_records.signals import (before_record_insert, \
after_record_insert)
listener = before_record_insert.connect(do_something)
listener = after_record_insert.connect(do_something_else)
invenio_rdm_records¶
The invenio_rdm_records
app subscribes to the record signals and triggers the stats event recording.
Stats collection configuration¶
docs at https://invenio-stats.readthedocs.io/en/latest/configuration.html
The configuration objects are shaped like this:
STATS_EVENTS
{
"file-download": {
"templates": "invenio_rdm_records.records.stats.templates.events.file_download",
"event_builders": [
"invenio_rdm_records.resources.stats.file_download_event_builder",
"invenio_rdm_records.resources.stats.check_if_via_api",
],
"cls": EventsIndexer,
"params": {
"preprocessors": [flag_robots, anonymize_user, build_file_unique_id]
},
},
STATS_AGGREGATIONS
"file-download-agg": {
"templates": "invenio_rdm_records.records.stats.templates.aggregations.aggr_file_download",
"cls": StatAggregator,
"params": {
"event": "file-download",
"field": "unique_id",
"interval": "day",
"index_interval": "month",
"copy_fields": {
"file_id": "file_id",
"file_key": "file_key",
"bucket_id": "bucket_id",
"recid": "recid",
"parent_recid": "parent_recid",
},
"metric_fields": {
"unique_count": (
"cardinality",
"unique_session_id",
{"precision_threshold": 1000},
),
"volume": ("sum", "size", {}),
},
},
},
Stat calculation for records¶
calculated in invenio_rdm_records.records.stats.api.Statistics.get_record_stats()
Stats task scheduling¶
CELERY_BEAT_SCHEDULE = {
# indexing of statistics events & aggregations
"stats-process-events": {
StatsEventTask,
"schedule": crontab(minute="25,55"), # Every hour at minute 25 and 55
},
"stats-aggregate-events": {
StatsAggregationTask,
"schedule": crontab(minute=0), # Every hour at minute 0
},
"reindex-stats": StatsRDMReindexTask, # Every hour at minute 10
}
API query configuration¶
STATS_QUERIES config variable defines allowed queries, params, permissions like this:
STATS_QUERIES = {
"record-view": {
"cls": TermsQuery,
"permission_factory": None,
"params": {
"index": "stats-record-view",
"doc_type": "record-view-day-aggregation",
"copy_fields": {
"recid": "recid",
"parent_recid": "parent_recid",
},
"query_modifiers": [],
"required_filters": {
"recid": "recid",
},
"metric_fields": {
"views": ("sum", "count", {}),
"unique_views": ("sum", "unique_count", {}),
},
},
}
API endpoint permissions¶
STATS_PERMISSION_FACTORY = permissions_policy_lookup_factory
Other kinds of stats¶
Language of records¶
To retrieve the counts of records with each language code from OpenSearch, use the following query:
GET kcworks-rdmrecords-records/_search?scroll=1m
{
"size": 1,
"aggs": {
"languages": {
"terms": {"field": "metadata.languages.id", "size": 1000}
}
}
}