Split data at a given timestamp
This recipe splits an EventSet into two or more subsets at fixed timestamps.
The same procedure applies to data with any number of indexes, including the default single empty index.
For example, training a machine learning forecasting model usually requires splitting the data into train, validation and test EventSets, in that order, at fixed timestamps.
Example data
This toy example uses two separate indexes, but the recipe applies to any number of indexes, as mentioned above.
The second index has two extra data points from the previous year. Both indexes end on the same date.
import pandas as pd
import temporian as tp
sample_data = pd.DataFrame(
    data=[
        # date, index=1, feature
        ["2020-01-01", 1, 1.0],
        ["2020-02-01", 1, 2.0],
        ["2020-03-01", 1, 3.0],
        ["2020-04-01", 1, 4.0],
        ["2020-05-01", 1, 5.0],
        ["2020-06-01", 1, 6.0],
        # date, index=2, feature
        ["2019-11-01", 2, -20.0],
        ["2019-12-01", 2, -10.0],
        ["2020-01-01", 2, 10.0],
        ["2020-02-01", 2, 20.0],
        ["2020-03-01", 2, 30.0],
        ["2020-04-01", 2, 40.0],
        ["2020-05-01", 2, 50.0],
        ["2020-06-01", 2, 60.0],
    ],
    columns=[
        "timestamp",
        "idx",
        "feature",
    ],
)
sample_evset = tp.from_pandas(sample_data, indexes=["idx"])
sample_evset
timestamp | feature |
---|---|
2020-01-01 00:00:00+00:00 | 1 |
2020-02-01 00:00:00+00:00 | 2 |
2020-03-01 00:00:00+00:00 | 3 |
2020-04-01 00:00:00+00:00 | 4 |
2020-05-01 00:00:00+00:00 | 5 |
2020-06-01 00:00:00+00:00 | 6 |
timestamp | feature |
---|---|
2019-11-01 00:00:00+00:00 | -20 |
2019-12-01 00:00:00+00:00 | -10 |
2020-01-01 00:00:00+00:00 | 10 |
… | … |
2020-04-01 00:00:00+00:00 | 40 |
2020-05-01 00:00:00+00:00 | 50 |
2020-06-01 00:00:00+00:00 | 60 |
Solution
We want to split this into 3 separate EventSets as follows:
- Train data: all data points up to and including 2020-03-01
- Validation: data points after 2020-03-01, up to and including 2020-05-01
- Test: data points after 2020-05-01 (not including it)
So the proposed steps are:
- Convert the timestamps of events into a feature.
- Filter train/validation/test by comparing the timestamps feature to the defined boundaries.
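from datetime import datetime, timezone
# Define boundaries for train/validation/test.
# Use UTC explicitly: a naive datetime is interpreted in the local timezone,
# while the EventSet timestamps shown above are in UTC.
train_until = datetime(2020, 3, 1, tzinfo=timezone.utc).timestamp()
val_until = datetime(2020, 5, 1, tzinfo=timezone.utc).timestamp()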
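The EventSet timestamps are displayed in UTC (the +00:00 suffix in the outputs above), so the boundaries should not depend on the machine's local timezone. The following minimal sketch uses only the standard library to compare a naive boundary with a UTC-aware one. The variable names are illustrative and not part of the recipe.

```python
from datetime import datetime, timezone

# Naive datetime: .timestamp() interprets it in the machine's local timezone.
naive_boundary = datetime(2020, 3, 1).timestamp()

# Timezone-aware datetime: the result is the same on every machine.
utc_boundary = datetime(2020, 3, 1, tzinfo=timezone.utc).timestamp()

print(utc_boundary)  # 1583020800.0

# Zero only when the local timezone had no UTC offset on that date.
# Any other value would shift events that sit exactly on a boundary.
print(naive_boundary - utc_boundary)
```

Since every event in the example falls at midnight UTC, even a one-hour offset could move the 2020-03-01 and 2020-05-01 events into the wrong subset.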
1. Convert the timestamps into a feature
The timestamps() operator creates a single-feature EventSet with the unix timestamp of each event, keeping the indexes and samplings compatible with the original EventSet.
# Get the data timestamps as a feature
sample_timestamps = sample_evset.timestamps()
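To give a feel for the values this feature holds, here is a plain-Python emulation on a few rows of the example data. It is only an illustration of "unix timestamp per event, grouped by index", not how Temporian computes it. The helper to_unix_seconds and the dictionary timestamps_by_idx are hypothetical names.

```python
from datetime import datetime, timezone

# A few (date, idx, feature) rows taken from the example data.
rows = [
    ("2020-01-01", 1, 1.0),
    ("2020-02-01", 1, 2.0),
    ("2019-11-01", 2, -20.0),
    ("2019-12-01", 2, -10.0),
]


def to_unix_seconds(date_str):
    """Seconds since 1970-01-01 UTC for midnight UTC on the given date."""
    return datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc).timestamp()


# One list of unix timestamps per index value, in event order.
timestamps_by_idx = {}
for date_str, idx, _feature in rows:
    timestamps_by_idx.setdefault(idx, []).append(to_unix_seconds(date_str))

print(timestamps_by_idx[1])  # [1577836800.0, 1580515200.0]
print(timestamps_by_idx[2])  # [1572566400.0, 1575158400.0]
```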
2. Split based on timestamps
Now we compare the timestamps feature to the boundary timestamps of each subset. This creates boolean EventSets that can be passed directly to the filter() operator.
train_evset = sample_evset.filter(sample_timestamps <= train_until)
val_evset = sample_evset.filter((sample_timestamps > train_until) & (sample_timestamps <= val_until))
test_evset = sample_evset.filter(sample_timestamps > val_until)
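The comparisons above put an event that falls exactly on a boundary into the earlier subset. The plain-Python sketch below makes that rule explicit and shows how it could extend to any number of sorted boundaries. It does not use Temporian, and utc_seconds and subset_of are hypothetical helpers introduced for illustration.

```python
from bisect import bisect_left
from datetime import datetime, timezone


def utc_seconds(year, month, day):
    """Unix timestamp (seconds) of midnight UTC on the given date."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


def subset_of(timestamp, boundaries):
    """Position of the subset a timestamp falls into.

    `boundaries` must be sorted in increasing order. A timestamp equal to a
    boundary belongs to the earlier subset, mirroring `<=` and `>` above.
    """
    return bisect_left(boundaries, timestamp)


boundaries = [utc_seconds(2020, 3, 1), utc_seconds(2020, 5, 1)]

# Monthly events from 2020-01-01 to 2020-06-01, like the first index.
monthly = [utc_seconds(2020, month, 1) for month in range(1, 7)]

# 0 = train, 1 = validation, 2 = test
assignments = [subset_of(t, boundaries) for t in monthly]
print(assignments)  # [0, 0, 0, 1, 1, 2]
```

The assignments agree with the split described in this recipe: January to March go to train, April and May to validation, and June to test.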
Check results
train_evset
timestamp | feature |
---|---|
2020-01-01 00:00:00+00:00 | 1 |
2020-02-01 00:00:00+00:00 | 2 |
2020-03-01 00:00:00+00:00 | 3 |
timestamp | feature |
---|---|
2019-11-01 00:00:00+00:00 | -20 |
2019-12-01 00:00:00+00:00 | -10 |
2020-01-01 00:00:00+00:00 | 10 |
2020-02-01 00:00:00+00:00 | 20 |
2020-03-01 00:00:00+00:00 | 30 |
val_evset
timestamp | feature |
---|---|
2020-04-01 00:00:00+00:00 | 4 |
2020-05-01 00:00:00+00:00 | 5 |
timestamp | feature |
---|---|
2020-04-01 00:00:00+00:00 | 40 |
2020-05-01 00:00:00+00:00 | 50 |
test_evset
timestamp | feature |
---|---|
2020-06-01 00:00:00+00:00 | 6 |
… | … |
timestamp | feature |
---|---|
2020-06-01 00:00:00+00:00 | 60 |
… | … |