Split data by fraction

This recipe splits an EventSet into two or more subsets, each containing a specified fraction of the total number of events.

For example, to train a machine learning forecasting model, the data usually needs to be split into train, validation and test EventSets. In this case we'll use 60% of the data for training, 20% for validation, and 20% for test.

Example data

In [1]:

import temporian as tp
import numpy as np

# Timestamps from 0 up to (but not including) T, in steps of 0.1,
# each carrying the value of a sine wave at that time.
T = 10
t = np.arange(0, T, 0.1)
signal_evset = tp.event_set(timestamps=t, features={"signal": np.sin(t)})

signal_evset.plot()

Solution

We want to split this into 3 separate EventSets as follows:

- Train data: 60% of the events, at the beginning of the series.
- Validation: 20% of the events, following training data.
- Test: Remaining 20% of the events.

The proposed steps are:

1. Get the total number of events and calculate split limits.
2. Get each event's position in the EventSet.
3. Split comparing each event's position to the split limits.

1. Calculate split limits

In [2]:

# Total number of events. The EventSet has no index, so its only
# index key is the empty tuple.
n_events = len(signal_evset.get_index_value(()))

# Cumulative split limits: 60% for training, then a further 20% for validation.
train_until = int(n_events * 0.6)
val_until = train_until + int(n_events * 0.2)
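
The same arithmetic extends to any number of subsets. The sketch below is a minimal illustration in plain Python; `split_limits` is a hypothetical helper introduced here, not part of Temporian. It truncates each fraction with `int()` as the cell above does, and leaves the last subset implicit (it receives whatever remains).

```python
def split_limits(n_events, fractions):
    """Return cumulative event counts marking the end of each leading subset.

    `fractions` lists the share of every subset except the last one,
    which implicitly receives the remaining events.
    """
    limits = []
    total = 0
    for fraction in fractions:
        total += int(n_events * fraction)  # truncate, as in the recipe
        limits.append(total)
    return limits


# 100 events split 60% / 20% / (remaining 20%)
train_until, val_until = split_limits(100, [0.6, 0.2])
print(train_until, val_until)  # 60 80
```

Because each term is truncated, the last subset can end up slightly larger than its nominal fraction when `n_events` is not a multiple of the chosen fractions.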

2. Get each event's position

The enumerate() operator creates a single-feature EventSet with the position of each event, keeping the indexes and samplings compatible with the original EventSet.

In [3]:

sample_positions = signal_evset.enumerate()
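
To build intuition for what the positions look like when an EventSet has an index, the sketch below mimics the behaviour in plain NumPy. It assumes that positions start at 0 and restart within each index group; `enumerate_per_group` is a hypothetical stand-in written for this illustration, not a Temporian function.

```python
import numpy as np


def enumerate_per_group(groups):
    """Assign each element its 0-based position within its own group."""
    groups = np.asarray(groups)
    positions = np.empty(len(groups), dtype=np.int64)
    for group in np.unique(groups):
        mask = groups == group
        # Count 0, 1, 2, ... over the elements belonging to this group.
        positions[mask] = np.arange(mask.sum())
    return positions


positions = enumerate_per_group(["A", "A", "B", "A", "B"])
print(positions.tolist())  # [0, 1, 0, 2, 1]
```

Under that assumption, a single limit such as train_until would be compared against per-group positions, so an indexed EventSet may need limits computed for each index value. The example EventSet in this recipe has no index, so its positions simply run from 0 to n_events - 1.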

3. Split based on positions

Now we compare sample_positions to the limits of each subset. Each comparison creates a boolean EventSet that can be passed directly to the filter() operator.

In [4]:

# enumerate() positions start at 0, so strict upper bounds select exactly
# train_until training events and val_until - train_until validation events.
train_evset = signal_evset.filter(sample_positions < train_until)
val_evset = signal_evset.filter((sample_positions >= train_until) & (sample_positions < val_until))
test_evset = signal_evset.filter(sample_positions >= val_until)
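
Because positions are counted from 0, the choice between strict and inclusive comparisons moves each boundary by one event. The NumPy sketch below is only an illustration: `positions` stands in for the enumerate() feature (assumed 0-based), and `split_sizes` is a hypothetical helper. It counts how many events each convention assigns to the three subsets for 100 events.

```python
import numpy as np

n_events = 100
train_until = int(n_events * 0.6)              # 60
val_until = train_until + int(n_events * 0.2)  # 80

# Stand-in for the enumerate() feature, assuming 0-based positions.
positions = np.arange(n_events)


def split_sizes(train_mask, val_mask, test_mask):
    """Count the events selected by each boolean mask."""
    return int(train_mask.sum()), int(val_mask.sum()), int(test_mask.sum())


# Strict upper bounds: position < limit.
exclusive = split_sizes(
    positions < train_until,
    (positions >= train_until) & (positions < val_until),
    positions >= val_until,
)

# Inclusive upper bounds: position <= limit.
inclusive = split_sizes(
    positions <= train_until,
    (positions > train_until) & (positions <= val_until),
    positions > val_until,
)

print(exclusive)  # (60, 20, 20)
print(inclusive)  # (61, 20, 19)
```

With strict bounds the sizes match the intended 60/20/20 split, while inclusive bounds shift one event from the test subset to the training subset.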

Check results

In [5]:

train_evset.plot()

In [6]:

val_evset.plot()

In [7]:

test_evset.plot()
