Aggregate events from different indexes¶
This recipe applies when you have events indexed by one or more features, and you want to drop some index levels and unify the events with the same timestamps.
In this example, we aggregate daily sales indexed by store and product into daily revenue for each individual store (i.e., each store's total daily sales across all products).
Example data¶
Let's define 2 stores, each with 2 products. Product IDs may or may not be shared across stores.
For each store/product pair, we'll create sales (in USD) for the same 3 days (January 1, 2 and 3, 2020).
In [1]:
import pandas as pd
import temporian as tp

sales_data = pd.DataFrame(
    data=[
        # date, store ID (1), product ID, sales (USD)
        ["2020-01-01", "store_1", "product_1", 300.0],
        ["2020-01-02", "store_1", "product_1", 450.0],
        ["2020-01-03", "store_1", "product_1", 600.0],
        ["2020-01-01", "store_1", "product_2", 100.0],
        ["2020-01-02", "store_1", "product_2", 250.0],
        ["2020-01-03", "store_1", "product_2", 100.0],
        # date, store ID (2), product ID, sales (USD)
        ["2020-01-01", "store_2", "product_1", 900.0],
        ["2020-01-02", "store_2", "product_1", 750.0],
        ["2020-01-03", "store_2", "product_1", 750.0],
        ["2020-01-01", "store_2", "product_3", 20.0],
        ["2020-01-02", "store_2", "product_3", 40.0],
        ["2020-01-03", "store_2", "product_3", 30.0],
    ],
    columns=[
        "timestamp",
        "store_id",
        "product_id",
        "sales_usd",
    ],
)

# Load data indexed by store/product
sales_evset = tp.from_pandas(sales_data, indexes=["store_id", "product_id"])
sales_evset.plot()
WARNING:root:Feature "store_id" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
WARNING:root:Feature "product_id" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
1. Drop the product index¶
First, we drop the product_id index level so that events are indexed by store only.
In [2]:
store_sales = sales_evset.drop_index("product_id")
store_sales["sales_usd"].plot()
As you can see, each timestamp is now duplicated: there is one event per product at every timestamp.
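What drop_index does here can be mimicked in plain pandas, which makes the duplicated timestamps easy to see. This is only an illustrative sketch of the reshaping, not Temporian's implementation (the frame below reuses a subset of the example data):

```python
import pandas as pd

# A subset of the example data above, in a plain DataFrame.
sales_data = pd.DataFrame(
    [
        ["2020-01-01", "store_1", "product_1", 300.0],
        ["2020-01-01", "store_1", "product_2", 100.0],
        ["2020-01-02", "store_1", "product_1", 450.0],
        ["2020-01-02", "store_1", "product_2", 250.0],
    ],
    columns=["timestamp", "store_id", "product_id", "sales_usd"],
)

# Index by store and product, analogous to the EventSet's indexes.
indexed = sales_data.set_index(["store_id", "product_id"])

# "Dropping" product_id turns it back into a regular column: events are
# now grouped by store only, and each timestamp appears once per product.
store_level = indexed.reset_index("product_id")
print(store_level.loc["store_1"])
```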
2. Unify events¶
We want to unify events that share a timestamp, adding up their sales. unique_timestamps() creates one sampling event per distinct timestamp, and a moving_sum() with a 1-day window evaluated at that sampling adds up all the sales within each day.
In [3]:
unique_days = store_sales.unique_timestamps()
store_daily_sales = store_sales["sales_usd"].moving_sum(window_length=tp.duration.days(1), sampling=unique_days)
store_daily_sales.plot()
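Since the sampling timestamps produced by unique_timestamps() are exactly one day apart, the 1-day moving_sum at each sampling point adds up precisely the events sharing that timestamp. The expected numbers can be cross-checked with a plain pandas groupby over the original data (an independent sanity check, not the Temporian computation):

```python
import pandas as pd

# The full example data from above.
sales_data = pd.DataFrame(
    [
        ["2020-01-01", "store_1", "product_1", 300.0],
        ["2020-01-02", "store_1", "product_1", 450.0],
        ["2020-01-03", "store_1", "product_1", 600.0],
        ["2020-01-01", "store_1", "product_2", 100.0],
        ["2020-01-02", "store_1", "product_2", 250.0],
        ["2020-01-03", "store_1", "product_2", 100.0],
        ["2020-01-01", "store_2", "product_1", 900.0],
        ["2020-01-02", "store_2", "product_1", 750.0],
        ["2020-01-03", "store_2", "product_1", 750.0],
        ["2020-01-01", "store_2", "product_3", 20.0],
        ["2020-01-02", "store_2", "product_3", 40.0],
        ["2020-01-03", "store_2", "product_3", 30.0],
    ],
    columns=["timestamp", "store_id", "product_id", "sales_usd"],
)

# Total daily revenue per store, summing across products.
daily = sales_data.groupby(["store_id", "timestamp"])["sales_usd"].sum()
print(daily)
# e.g. store_1 on 2020-01-01: 300 + 100 = 400 USD
```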