Aggregate events from different indexes¶
This recipe applies when you have events indexed by one or more features, and you want to drop some index levels and unify the events with the same timestamps.
In this example, we aggregate daily sales indexed by store and product into daily revenue for each individual store (i.e., each store's total daily sales across all products).
Example data¶
Let's define 2 stores, each with 2 products. Product IDs may or may not be shared across stores.
For each store/product pair, we'll create sales (in USD) for the same 3 days (January 1, 2 and 3, 2020).
In [1]:
import pandas as pd
import temporian as tp

sales_data = pd.DataFrame(
    data=[
        # date, store ID (1), product ID, sales (USD)
        ["2020-01-01", "store_1", "product_1", 300.0],
        ["2020-01-02", "store_1", "product_1", 450.0],
        ["2020-01-03", "store_1", "product_1", 600.0],
        ["2020-01-01", "store_1", "product_2", 100.0],
        ["2020-01-02", "store_1", "product_2", 250.0],
        ["2020-01-03", "store_1", "product_2", 100.0],
        # date, store ID (2), product ID, sales (USD)
        ["2020-01-01", "store_2", "product_1", 900.0],
        ["2020-01-02", "store_2", "product_1", 750.0],
        ["2020-01-03", "store_2", "product_1", 750.0],
        ["2020-01-01", "store_2", "product_3", 20.0],
        ["2020-01-02", "store_2", "product_3", 40.0],
        ["2020-01-03", "store_2", "product_3", 30.0],
    ],
    columns=[
        "timestamp",
        "store_id",
        "product_id",
        "sales_usd",
    ],
)

# Load data indexed by store/product
sales_evset = tp.from_pandas(sales_data, indexes=["store_id", "product_id"])
sales_evset.plot()
WARNING:root:Feature "store_id" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
WARNING:root:Feature "product_id" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
1. Drop the product index¶
First, we drop the product_id index level so that events are indexed by store only.
In [2]:
store_sales = sales_evset.drop_index("product_id")
store_sales["sales_usd"].plot()
As you can see, each timestamp is now duplicated: there is one event per product at every timestamp.
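What drop_index does here can be mimicked in plain pandas, which makes the duplicated timestamps easy to see. This is only an illustrative sketch of the reshaping, not Temporian's implementation (the frame below reuses a subset of the example data):

```python
import pandas as pd

# A subset of the example data above, in a plain DataFrame.
sales_data = pd.DataFrame(
    [
        ["2020-01-01", "store_1", "product_1", 300.0],
        ["2020-01-01", "store_1", "product_2", 100.0],
        ["2020-01-02", "store_1", "product_1", 450.0],
        ["2020-01-02", "store_1", "product_2", 250.0],
    ],
    columns=["timestamp", "store_id", "product_id", "sales_usd"],
)

# Index by store and product, analogous to the EventSet's indexes.
indexed = sales_data.set_index(["store_id", "product_id"])

# "Dropping" product_id turns it back into a regular column: events are
# now grouped by store only, and each timestamp appears once per product.
store_level = indexed.reset_index("product_id")
print(store_level.loc["store_1"])
```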
2. Unify events¶
We want to unify events that share a timestamp, adding up their sales. unique_timestamps() creates one sampling event per distinct timestamp, and a moving_sum() with a 1-day window evaluated at that sampling adds up all the sales within each day.
In [3]:
unique_days = store_sales.unique_timestamps()
store_daily_sales = store_sales["sales_usd"].moving_sum(window_length=tp.duration.days(1), sampling=unique_days)
store_daily_sales.plot()
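Since the sampling timestamps produced by unique_timestamps() are exactly one day apart, the 1-day moving_sum at each sampling point adds up precisely the events sharing that timestamp. The expected numbers can be cross-checked with a plain pandas groupby over the original data (an independent sanity check, not the Temporian computation):

```python
import pandas as pd

# The full example data from above.
sales_data = pd.DataFrame(
    [
        ["2020-01-01", "store_1", "product_1", 300.0],
        ["2020-01-02", "store_1", "product_1", 450.0],
        ["2020-01-03", "store_1", "product_1", 600.0],
        ["2020-01-01", "store_1", "product_2", 100.0],
        ["2020-01-02", "store_1", "product_2", 250.0],
        ["2020-01-03", "store_1", "product_2", 100.0],
        ["2020-01-01", "store_2", "product_1", 900.0],
        ["2020-01-02", "store_2", "product_1", 750.0],
        ["2020-01-03", "store_2", "product_1", 750.0],
        ["2020-01-01", "store_2", "product_3", 20.0],
        ["2020-01-02", "store_2", "product_3", 40.0],
        ["2020-01-03", "store_2", "product_3", 30.0],
    ],
    columns=["timestamp", "store_id", "product_id", "sales_usd"],
)

# Total daily revenue per store, summing across products.
daily = sales_data.groupby(["store_id", "timestamp"])["sales_usd"].sum()
print(daily)
# e.g. store_1 on 2020-01-01: 300 + 100 = 400 USD
```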