Aggregate events at a fixed interval¶
This recipe aggregates possibly non-uniformly sampled events into fixed-length intervals (e.g., seconds, hours, days, or weeks). In other words, it converts the event features into time series.
For example, suppose we have the sales log from a store, where each sold item is represented by an event. Each sale event has a date-time, the sale price, and the unit cost of the product. We want to calculate total daily sales, producing a single event at 00:00 each day.
Example data¶
Let's create some sale events with non-uniform sampling and the features mentioned above.
```python
import pandas as pd
import temporian as tp

sales_data = pd.DataFrame(
    data=[
        # sale timestamp, price, cost
        ["2020-01-01 13:04", 3.0, 1.0],
        ["2020-01-01 13:04", 5.0, 2.0],  # duplicated timestamp
        ["2020-01-02 15:24", 7.0, 3.0],
        ["2020-01-03 13:45", 3.0, 1.0],
        ["2020-01-03 16:10", 7.0, 3.0],
        ["2020-01-03 17:30", 10.0, 5.0],
        ["2020-01-06 10:10", 4.0, 2.0],
        ["2020-01-06 19:35", 3.0, 1.0],
    ],
    columns=[
        "timestamp",
        "unit_price",
        "unit_cost",
    ],
)

sales_evset = tp.from_pandas(sales_data)
sales_evset.plot()
```
Solution¶
We want to calculate total daily sales. This is what we can do:

1. Create a uniform sampling with one tick per day (it could be any other interval), at time `00:00:00`.
2. Add up all sales that happened between `00:00:01` on the previous day and the current tick at `00:00:00`.
1. Create uniform sampling¶
```python
# Define the time span to cover: one week
time_span = tp.event_set(timestamps=["2020-01-01 00:00", "2020-01-07 00:00"])

# Create daily ticks at 00:00
interval = tp.duration.days(1)
ticks = time_span.tick(interval)
ticks
```
| timestamp |
| --- |
| 2020-01-01 00:00:00+00:00 |
| 2020-01-02 00:00:00+00:00 |
| 2020-01-03 00:00:00+00:00 |
| 2020-01-04 00:00:00+00:00 |
| 2020-01-05 00:00:00+00:00 |
| 2020-01-06 00:00:00+00:00 |
| 2020-01-07 00:00:00+00:00 |
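For intuition, these seven midnight ticks line up with what a daily `pandas.date_range` would produce (a pandas-only illustration, ignoring time-zone details; it is not how Temporian generates ticks internally):

```python
import pandas as pd

# Seven daily timestamps covering the same one-week span as the ticks above.
tick_times = pd.date_range("2020-01-01", "2020-01-07", freq="D")
print(tick_times)
```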
2. Aggregate the events¶
Now we can aggregate the events between ticks, in this case by running a moving sum over the specified `sampling=ticks`, with the `window_length` equal to the interval between ticks.

Note that all moving window operators support the `sampling` argument, so any other kind of aggregation could be used depending on the use case (e.g., moving average, max, min).
```python
# Provide uniform ticks as sampling
moving_sum = sales_evset.moving_sum(window_length=interval, sampling=ticks)
moving_sum
```
| timestamp | unit_price | unit_cost |
| --- | --- | --- |
| 2020-01-01 00:00:00+00:00 | 0 | 0 |
| 2020-01-02 00:00:00+00:00 | 8 | 3 |
| 2020-01-03 00:00:00+00:00 | 7 | 3 |
| 2020-01-04 00:00:00+00:00 | 20 | 9 |
| 2020-01-05 00:00:00+00:00 | 0 | 0 |
| 2020-01-06 00:00:00+00:00 | 0 | 0 |
| 2020-01-07 00:00:00+00:00 | 7 | 3 |
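As a sanity check, the same daily totals can be reproduced with plain pandas by resampling into right-closed, right-labeled daily bins (a pandas-only sketch using the example data from above; Temporian is not needed for this check):

```python
import pandas as pd

# Re-create the example sales log (same values as in the "Example data" section).
sales_data = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2020-01-01 13:04", "2020-01-01 13:04", "2020-01-02 15:24",
                "2020-01-03 13:45", "2020-01-03 16:10", "2020-01-03 17:30",
                "2020-01-06 10:10", "2020-01-06 19:35",
            ]
        ),
        "unit_price": [3.0, 5.0, 7.0, 3.0, 7.0, 10.0, 4.0, 3.0],
        "unit_cost": [1.0, 2.0, 3.0, 1.0, 3.0, 5.0, 2.0, 1.0],
    }
)

# Right-closed, right-labeled daily bins mimic a moving sum sampled at
# midnight ticks: each sale is attributed to the next 00:00 tick.
daily = (
    sales_data.set_index("timestamp")
    .resample("D", closed="right", label="right")
    .sum()
)
print(daily)
```

Unlike the Temporian output, the resampled frame has no 2020-01-01 row, since pandas only creates bins spanning the observed data; the remaining rows match the table above.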
(Optional) Rename and plot¶
Finally, we can rename features to match their actual meaning after aggregation.
In this case we also calculate and plot the daily profit.
```python
# Rename aggregated features
daily_sales = moving_sum.rename(
    {"unit_price": "daily_revenue", "unit_cost": "daily_cost"}
)

# Profit = revenue - cost
daily_profit = (
    daily_sales["daily_revenue"] - daily_sales["daily_cost"]
).rename("daily_profit")
daily_profit.plot()
```
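For reference, the rename-and-subtract step can also be sketched with plain pandas, using the aggregated daily values from the moving-sum table above (values copied by hand; the column names follow the renaming in this section):

```python
import pandas as pd

# Aggregated daily values from the moving-sum table above.
daily_sales = pd.DataFrame(
    {
        "daily_revenue": [0.0, 8.0, 7.0, 20.0, 0.0, 0.0, 7.0],
        "daily_cost": [0.0, 3.0, 3.0, 9.0, 0.0, 0.0, 3.0],
    },
    index=pd.date_range("2020-01-01", "2020-01-07", freq="D"),
)

# Profit = revenue - cost, as in the Temporian version.
daily_profit = daily_sales["daily_revenue"] - daily_sales["daily_cost"]
print(daily_profit)
```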