Split data at a given timestamp

This recipe splits an EventSet into two or more subsets at fixed timestamps.

The same procedure applies to multi-index data and to the default single empty index.

For example, training a machine learning forecasting model usually requires splitting the data into train, validation and test EventSets, in that chronological order, at fixed timestamps.

Example data

In this toy example we'll use a single index, idx, with two index values, but the recipe applies to any number of indexes, as mentioned above.

The second index value (idx=2) has two extra data points from the previous year (2019). Both series end on the same date.

In [1]:

import pandas as pd
import temporian as tp

sample_data = pd.DataFrame(
    data=[
        # date,  index=1, feature
        ["2020-01-01", 1, 1.0],
        ["2020-02-01", 1, 2.0],
        ["2020-03-01", 1, 3.0],
        ["2020-04-01", 1, 4.0],
        ["2020-05-01", 1, 5.0],
        ["2020-06-01", 1, 6.0],
        # date,  index=2, feature
        ["2019-11-01", 2, -20.0],
        ["2019-12-01", 2, -10.0],
        ["2020-01-01", 2, 10.0],
        ["2020-02-01", 2, 20.0],
        ["2020-03-01", 2, 30.0],
        ["2020-04-01", 2, 40.0],
        ["2020-05-01", 2, 50.0],
        ["2020-06-01", 2, 60.0],
    ],
    columns=[
        "timestamp",
        "idx",
        "feature",
    ],
)

sample_evset = tp.from_pandas(sample_data, indexes=["idx"])
sample_evset

Out[1]:

features [1]: feature (float64)

indexes [1]: idx (int64)

events: 14

index values: 2

memory usage: 1.0 kB

index ( idx: 1 ) with 6 events

timestamp	feature
2020-01-01 00:00:00+00:00	1
2020-02-01 00:00:00+00:00	2
2020-03-01 00:00:00+00:00	3
2020-04-01 00:00:00+00:00	4
2020-05-01 00:00:00+00:00	5
2020-06-01 00:00:00+00:00	6

index ( idx: 2 ) with 8 events

timestamp	feature
2019-11-01 00:00:00+00:00	-20
2019-12-01 00:00:00+00:00	-10
2020-01-01 00:00:00+00:00	10
…	…
2020-04-01 00:00:00+00:00	40
2020-05-01 00:00:00+00:00	50
2020-06-01 00:00:00+00:00	60

Solution

We want to split this into 3 separate EventSets as follows:

Train: all data points up to and including 2020-03-01
Validation: data points after 2020-03-01 (exclusive), up to and including 2020-05-01
Test: data points after 2020-05-01 (exclusive)

So the proposed steps are:

1. Convert the timestamps of events into a feature.
2. Filter train/validation/test by comparing the timestamps feature to the defined boundaries.

In [2]:

from datetime import datetime, timezone

# Define boundaries for train/validation/test as Unix timestamps.
# The datetimes are explicitly set to UTC to match the EventSet's UTC
# timestamps; a naive datetime would be interpreted in local time.
train_until = datetime(2020, 3, 1, tzinfo=timezone.utc).timestamp()
val_until = datetime(2020, 5, 1, tzinfo=timezone.utc).timestamp()
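
The EventSet's timestamps are displayed in UTC (the +00:00 suffix in the output above), so the boundaries should be UTC instants too. As a minimal sketch using only the standard library, the hypothetical helper `utc_boundary` below (not part of Temporian) builds a timezone-aware datetime before converting it. A naive `datetime(...)` would instead be interpreted in the machine's local time zone by `.timestamp()`, which can shift the boundary by a few hours and move an event at midnight UTC into the wrong subset.

```python
from datetime import datetime, timezone


def utc_boundary(year: int, month: int, day: int) -> float:
    """Return midnight UTC of the given date as a Unix timestamp in seconds.

    Hypothetical helper for illustration; it is not part of Temporian.
    """
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


train_until_utc = utc_boundary(2020, 3, 1)
val_until_utc = utc_boundary(2020, 5, 1)

print(train_until_utc)  # 1583020800.0
print(val_until_utc)    # 1588291200.0

# For comparison: this value depends on the local time zone of the machine.
naive_boundary = datetime(2020, 3, 1).timestamp()
print(naive_boundary - train_until_utc)  # offset in seconds, 0.0 only if local time is UTC
```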

1. Convert the timestamps into a feature

The timestamps() operator creates a single-feature EventSet holding the Unix timestamp of each event. Its indexes and sampling stay compatible with the original EventSet, which is what allows the result to be used to filter it later.

In [3]:

# Get the data timestamps as a feature
sample_timestamps = sample_evset.timestamps()
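
To read the values held by the timestamps feature, it can help to convert one back into a date. The sketch below uses only the standard library and assumes the values are Unix timestamps in seconds, the same unit that `datetime.timestamp()` returns for the boundaries.

```python
from datetime import datetime, timezone

# 1583020800 seconds after 1970-01-01 00:00 UTC is 2020-03-01 00:00 UTC.
unix_seconds = 1583020800.0

# Passing tz=timezone.utc avoids converting to the machine's local time zone.
as_date = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
print(as_date.isoformat())  # 2020-03-01T00:00:00+00:00
```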

2. Split based on timestamps

Now we compare the timestamps feature to the boundary timestamps of each subset. This will create boolean EventSets that can be passed directly to the filter() operator.

In [4]:

train_evset = sample_evset.filter(sample_timestamps <= train_until)
val_evset = sample_evset.filter((sample_timestamps > train_until) & (sample_timestamps <= val_until))
test_evset = sample_evset.filter(sample_timestamps > val_until)
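
The same pattern extends to any number of boundaries. The function below is a hypothetical helper, not part of Temporian, sketched under the assumption that the object passed in supports the `timestamps()` and `filter()` operators and the comparisons used in this recipe. Each boundary is included in the subset that ends at it, as in the train/validation/test split.

```python
def split_at_timestamps(evset, boundaries):
    """Split `evset` into len(boundaries) + 1 consecutive subsets.

    Hypothetical helper for illustration; it is not part of Temporian.
    `boundaries` must be Unix timestamps in increasing order. Each boundary
    belongs to the subset that ends at it (inclusive upper bound).
    """
    boundaries = list(boundaries)
    if boundaries != sorted(boundaries):
        raise ValueError("boundaries must be sorted in increasing order")
    if not boundaries:
        return [evset]

    timestamps = evset.timestamps()
    subsets = []
    previous = None
    for boundary in boundaries:
        # Events up to and including this boundary...
        mask = timestamps <= boundary
        if previous is not None:
            # ...and strictly after the previous boundary.
            mask = (timestamps > previous) & mask
        subsets.append(evset.filter(mask))
        previous = boundary

    # Everything strictly after the last boundary.
    subsets.append(evset.filter(timestamps > previous))
    return subsets
```

With the variables from this recipe, `split_at_timestamps(sample_evset, [train_until, val_until])` should return the train, validation and test subsets in that order.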

Check results

In [5]:

train_evset

Out[5]:

features [1]: feature (float64)

indexes [1]: idx (int64)

events: 8

index values: 2

memory usage: 0.9 kB

index ( idx: 1 ) with 3 events

timestamp	feature
2020-01-01 00:00:00+00:00	1
2020-02-01 00:00:00+00:00	2
2020-03-01 00:00:00+00:00	3

index ( idx: 2 ) with 5 events

timestamp	feature
2019-11-01 00:00:00+00:00	-20
2019-12-01 00:00:00+00:00	-10
2020-01-01 00:00:00+00:00	10
2020-02-01 00:00:00+00:00	20
2020-03-01 00:00:00+00:00	30

In [6]:

val_evset

Out[6]:

features [1]: feature (float64)

indexes [1]: idx (int64)

events: 4

index values: 2

memory usage: 0.9 kB

index ( idx: 1 ) with 2 events

timestamp	feature
2020-04-01 00:00:00+00:00	4
2020-05-01 00:00:00+00:00	5

index ( idx: 2 ) with 2 events

timestamp	feature
2020-04-01 00:00:00+00:00	40
2020-05-01 00:00:00+00:00	50

In [7]:

test_evset

Out[7]:

features [1]: feature (float64)

indexes [1]: idx (int64)

events: 2

index values: 2

memory usage: 0.8 kB

index ( idx: 1 ) with 1 events

timestamp	feature
2020-06-01 00:00:00+00:00	6
…	…

index ( idx: 2 ) with 1 events

timestamp	feature
2020-06-01 00:00:00+00:00	60
…	…
