Split data by fraction¶
This recipe splits an EventSet into two or more subsets, each containing a specified fraction of the total number of data points.
For example, to train a machine learning forecasting model, the data usually needs to be split into train, validation and test EventSets. In this case we'll use 60% of the data for training, 20% for validation, and 20% for test.
Example data¶
import temporian as tp
import numpy as np
T = 10
t = np.arange(0, T, 0.1)
signal_evset = tp.event_set(timestamps=t, features={"signal": np.sin(t)})
signal_evset.plot()
Solution¶
We want to split this into 3 separate EventSets as follows:
- Train data: 60% of the events, at the beginning of the series.
- Validation: 20% of the events, following the training data.
- Test: Remaining 20% of the events.
The proposed steps are:
- Get the total number of events and calculate split limits.
- Get each event's position in the EventSet.
- Split by comparing each event's position to the split limits.
1. Calculate split limits¶
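# The EventSet has no index, so () selects its single group of events.
n_events = len(signal_evset.get_index_value(()))
# Limit between train and validation (60% of the events).
train_until = int(n_events * 0.6)
# Limit between validation and test (a further 20% of the events).
val_until = train_until + int(n_events * 0.2)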
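The limit arithmetic above can be sketched in plain Python for any list of fractions. The helper name split_limits is hypothetical and not part of Temporian; it only mirrors the int() truncation used in this recipe, which means any leftover events end up in the last subset.

```python
def split_limits(n_events, fractions):
    """Return the cumulative split limits for all subsets but the last.

    Hypothetical helper, not part of Temporian. Each subset size is
    truncated with int(), so leftover events fall into the last subset.
    """
    limits = []
    total = 0
    for fraction in fractions[:-1]:
        total += int(n_events * fraction)
        limits.append(total)
    return limits


# 100 events: limits at 60 and 80, leaving 20 events for the last subset.
print(split_limits(100, [0.6, 0.2, 0.2]))  # [60, 80]

# 103 events: 61.8 and 20.6 are truncated, so the last subset gets 22 events.
print(split_limits(103, [0.6, 0.2, 0.2]))  # [61, 81]
```

When the number of events is not a multiple of the fractions, the subsets are therefore only approximately 60%, 20% and 20% of the data.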
2. Get each event's position¶
The enumerate() operator creates a single-feature EventSet with the position of each event, keeping the indexes and samplings compatible with the original EventSet.
sample_positions = signal_evset.enumerate()
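As a rough mental model, the position feature can be imitated in plain Python. The sketch below is an analogy, not Temporian's implementation: it assumes positions are zero-based and counted separately within each index group, which is how the Temporian documentation describes enumerate(). The function name enumerate_per_index and the index keys "A" and "B" are made up for illustration.

```python
from collections import defaultdict


def enumerate_per_index(index_keys):
    """Return the zero-based position of each event within its index group.

    Plain-Python analogy only; index_keys holds one index key per event,
    in timestamp order.
    """
    counters = defaultdict(int)
    positions = []
    for key in index_keys:
        positions.append(counters[key])
        counters[key] += 1
    return positions


# Without an index, every event shares the same key: positions are 0..n-1.
print(enumerate_per_index([()] * 4))  # [0, 1, 2, 3]

# With an index, each group is numbered independently.
print(enumerate_per_index(["A", "A", "B", "A", "B"]))  # [0, 1, 0, 2, 1]
```

The example data in this recipe has no index, so the positions simply run from 0 to n_events - 1. For an indexed EventSet, the split limits would need to be computed per index group.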
3. Split based on positions¶
Now we compare sample_positions to the limits of each subset. This creates boolean EventSets that can be passed directly to the filter() operator.
# enumerate() positions start at 0, so "< train_until" keeps exactly train_until events.
train_evset = signal_evset.filter(sample_positions < train_until)
val_evset = signal_evset.filter((sample_positions >= train_until) & (sample_positions < val_until))
test_evset = signal_evset.filter(sample_positions >= val_until)
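To reason about the boundaries, the comparisons can be replayed on a plain list of positions. This is a minimal sketch that assumes zero-based positions (0 to n_events - 1); under that assumption, a strict "position < limit" test keeps exactly "limit" events. It uses Python lists as stand-ins and does not call Temporian.

```python
# Stand-in for the position feature: zero-based positions of 100 events.
n_events = 100
positions = list(range(n_events))

train_until = int(n_events * 0.6)              # 60
val_until = train_until + int(n_events * 0.2)  # 80

# Half-open ranges: [0, train_until), [train_until, val_until), [val_until, n).
train_positions = [p for p in positions if p < train_until]
val_positions = [p for p in positions if train_until <= p < val_until]
test_positions = [p for p in positions if p >= val_until]

print(len(train_positions), len(val_positions), len(test_positions))  # 60 20 20

# Every position lands in exactly one subset, in the original order.
assert train_positions + val_positions + test_positions == positions
```

Half-open ranges make the subset sizes equal to the differences between consecutive limits, so no event is counted twice or dropped at a boundary.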
Check results¶
train_evset.plot()
val_evset.plot()
test_evset.plot()