Aggregate events by calendar features (month/year)¶
In this recipe we'll learn how to aggregate events based on calendar features (e.g: monthly, yearly).
For example, suppose we want to calculate total monthly sales, having one event per month that accumulates all sales of the past month.
Here we'll use a more general use case: for every month, show the sales of the past 2 months. This covers the previous case as well, only by changing a parameter's value (steps=1
).
Example data¶
Let's create some sale events with datetime samplings.
import pandas as pd
import temporian as tp
sales_data = pd.DataFrame(
data=[
# sale timestamp, price, cost
["2020-01-01 13:04", 3.0, 1.0], # January
["2020-01-15 15:24", 7.0, 3.0],
["2020-02-01 13:45", 3.0, 1.0], # February
["2020-02-20 16:10", 7.0, 3.0],
["2020-03-10 10:00", 10.0, 5.0], # March
["2020-03-28 10:10", 4.0, 2.0],
["2020-04-15 19:35", 3.0, 1.0], # April
["2020-05-25 18:30", 18.0, 2.0], # May
],
columns=[
"timestamp",
"unit_price",
"unit_cost",
],
)
sales_evset = tp.from_pandas(sales_data)
sales_evset.plot()
Solution¶
We want to calculate for every month, the accumulated sales from the last 2 months. So this is what we can do:
- Create a tick on the first day of every month.
- Use a
moving_sum
with variable window length, at each tick covering the duration since the last 2 months.
1. Create a tick every month¶
# Period to cover. Includes the first day of the month after the last event.
time_span = tp.event_set(timestamps=["2020-01-01 00:00:00", "2020-06-01 00:00:00"])
# Tick first day of every month (equivalent: set mday=1)
monthly_ticks = time_span.tick_calendar(month='*')
monthly_ticks.plot()
2. Moving sum with variable window length¶
The window_length
argument can be an EventSet
with one single feature, which specifies the duration (in seconds) of the window at each timestamp.
Using the since_last()
operator, we get exactly that: an EventSet
with the duration (in seconds) since the last previous event, or since the number events indicated by the steps
parameter. For example, using steps=1
(default), would accumulate events by month, and using steps=6
means a rolling sum over the previous 6 months.
We want the last 2 months calculated every month, so tick every month and use since_last(steps=2)
.
# Same sampling as monthly_ticks, create single feature with the duration of the last 2 months (in seconds)
monthly_window_lengths = monthly_ticks.since_last(steps=2)
# Remove 01/01 and 01/02 (not enough previous data)
monthly_window_lengths = monthly_window_lengths.filter(monthly_window_lengths.notnan())
# Use since_last() feature as variable window length
moving_sum = sales_evset.moving_sum(window_length=monthly_window_lengths)
moving_sum
timestamp | unit_price | unit_cost |
---|---|---|
2020-03-01 00:00:00+00:00 | 20 | 8 |
2020-04-01 00:00:00+00:00 | 24 | 11 |
2020-05-01 00:00:00+00:00 | 17 | 8 |
2020-06-01 00:00:00+00:00 | 21 | 3 |
(Optional) Rename and plot¶
Finally, we can rename features to match their actual meaning after aggregation.
In this case we also calculate and plot the daily profit.
# Rename aggregated features
monthly_sales = moving_sum.rename({"unit_price": "monthly_revenue", "unit_cost": "monthly_cost"})
# Profit = revenue - cost
monthly_profit = (monthly_sales["monthly_revenue"] - monthly_sales["monthly_cost"]).rename("monthly_profit")
monthly_profit.plot(style='line', interactive=True, width_px=600)