Getting Started
Temporian is an open-source Python library for preprocessing and feature engineering temporal data, to get it ready for machine learning applications.
This guide will introduce you to the basics of the library, including how to:
- Create an EventSet and use it.
- Visualize input/output data using EventSet.plot() and interactive plots.
- Convert back and forth between EventSet and pandas DataFrame.
- Transform an EventSet by using operators.
- Work with indexes.
- Use common operators like glue, resample, lag, moving windows and arithmetics.
If you're interested in a topic that is not covered here, the final section links to other parts of the documentation where you can continue learning.
By reading this guide, you will learn how to implement a processing pipeline with Temporian, to get your data ready to train machine learning models by using straightforward operations and avoiding common mistakes.
Setup
# Skip this cell if you are running the notebook locally and have already installed temporian.
%pip install temporian -q
Note: you may need to restart the kernel to use updated packages.
import temporian as tp
import pandas as pd
import numpy as np
Part 1: Events and EventSets
Events are the basic unit of data in Temporian. They consist of a timestamp and a set of feature values. Events are not handled individually, but are instead grouped together into EventSets.
The main data structure in Temporian is the EventSet, which represents multivariate and multi-index time sequences. Let's break that down:
- multivariate: indicates that each event in the time sequence holds several feature values.
- multi-index: indicates that the events can represent hierarchical data, and can therefore be grouped by the values of one or more of their features.
- time sequence: indicates that the events are not necessarily sampled at a uniform rate (in which case we would call it a time series).
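To make the distinction between a time series and a time sequence concrete, here is a minimal NumPy sketch. The helper `is_uniformly_sampled` is a hypothetical name introduced for illustration and is not part of Temporian; it only checks whether consecutive timestamps are equally spaced.

```python
import numpy as np


def is_uniformly_sampled(timestamps, rtol=1e-9):
    """Return True if consecutive timestamps are all (almost) equally spaced.

    Hypothetical helper for illustration only; not a Temporian function.
    """
    gaps = np.diff(np.asarray(timestamps, dtype=float))
    if gaps.size < 2:
        # Zero or one gap: nothing to compare, so treat it as uniform.
        return True
    return bool(np.allclose(gaps, gaps[0], rtol=rtol, atol=0.0))


print(is_uniformly_sampled([0.0, 0.5, 1.0, 1.5]))  # evenly spaced: a time series
print(is_uniformly_sampled([1.0, 2.0, 3.5, 9.0]))  # irregular: a time sequence
```

Both kinds of data can be described as events with timestamps, which is why uniform sampling is not required.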
You can create an EventSet from a pandas DataFrame, NumPy arrays, CSV files, and more. Here is an example containing only 3 events and 2 features:
evset = tp.event_set(
    timestamps=[1, 2, 3],
    features={
        "feature_1": [10, 20, 30],
        "feature_2": [False, False, True],
    },
)
evset
timestamp | feature_1 | feature_2 |
---|---|---|
1 | 10 | False |
2 | 20 | False |
3 | 30 | True |
An EventSet can hold one or several time sequences, depending on its index.
- If it has no index (e.g. the case above), an EventSet holds a single multivariate time sequence.
- If it has one (or more) indexes, the events are grouped by their index values. This means that the EventSet will hold one multivariate time sequence for each unique value (or unique combination of values) of its indexes.
Operators are applied on each time sequence of an EventSet independently. Indexing is the primary way to handle rich and complex databases. For instance, in a retail database, you can index on customers, stores, products, etc.
The following example creates one sequence for blue events and another for red ones, by specifying that one of the features is an index:
# EventSet with indexes
evset = tp.event_set(
    timestamps=["2023-02-04", "2023-02-06", "2023-02-07", "2023-02-07"],
    features={
        "feature_1": [0.5, 0.6, np.nan, 0.9],
        "feature_2": ["red", "blue", "red", "blue"],
        "feature_3": [10.0, -1.0, 5.0, 5.0],
    },
    indexes=["feature_2"],
)
evset
feature_2: blue
timestamp | feature_1 | feature_3 |
---|---|---|
2023-02-06 00:00:00+00:00 | 0.6 | -1 |
2023-02-07 00:00:00+00:00 | 0.9 | 5 |
feature_2: red
timestamp | feature_1 | feature_3 |
---|---|---|
2023-02-04 00:00:00+00:00 | 0.5 | 10 |
2023-02-07 00:00:00+00:00 | nan | 5 |
The last part of this tutorial shows more examples that combine indexes and operators.
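As a rough analogy for readers coming from pandas, indexing on feature_2 resembles splitting a DataFrame with groupby and keeping one independent sequence per group. The sketch below uses only pandas and NumPy; it illustrates the grouping idea and is not how Temporian stores or processes an EventSet.

```python
import numpy as np
import pandas as pd

# Same four events as in the EventSet example above.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2023-02-04", "2023-02-06", "2023-02-07", "2023-02-07"]
        ),
        "feature_1": [0.5, 0.6, np.nan, 0.9],
        "feature_2": ["red", "blue", "red", "blue"],
        "feature_3": [10.0, -1.0, 5.0, 5.0],
    }
)

# One independent sequence per unique value of the index column.
sequences = {
    key: group.drop(columns="feature_2").reset_index(drop=True)
    for key, group in df.groupby("feature_2")
}

for key, sequence in sequences.items():
    print(f"feature_2: {key}")
    print(sequence)
```

Each entry of `sequences` (a name chosen for this sketch) holds the events of one index value, which mirrors the two tables displayed for the indexed EventSet.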
Example Data
For the following examples, we will generate some fake data consisting of a signal with a timestamp for each sample.
The signal is composed of a periodic season (a sine wave) and a slight positive slope, which we call trend, plus the ubiquitous noise. We will include all these components as separate features, together with the resulting signal.
# Generate a synthetic dataset
timestamps = np.arange(0, 100, 0.1)
n = len(timestamps)
noise = 0.1 * np.random.randn(n)
trend = 0.01 * timestamps
season = 0.4 * np.sin(timestamps)
# Convention: 'df_' for DataFrame
df_signals = pd.DataFrame(
    {
        "timestamp": timestamps,
        "noise": noise,
        "trend": trend,
        "season": season,
        "signal": noise + trend + season,
    }
)
df_signals
| | timestamp | noise | trend | season | signal |
---|---|---|---|---|---|
0 | 0.0 | -0.011801 | 0.000 | 0.000000 | -0.011801 |
1 | 0.1 | -0.059150 | 0.001 | 0.039933 | -0.018217 |
2 | 0.2 | -0.125599 | 0.002 | 0.079468 | -0.044131 |
3 | 0.3 | 0.059551 | 0.003 | 0.118208 | 0.180759 |
4 | 0.4 | 0.014413 | 0.004 | 0.155767 | 0.174180 |
... | ... | ... | ... | ... | ... |
995 | 99.5 | 0.152641 | 0.995 | -0.343118 | 0.804523 |
996 | 99.6 | -0.068763 | 0.996 | -0.320879 | 0.606358 |
997 | 99.7 | -0.056275 | 0.997 | -0.295433 | 0.645292 |
998 | 99.8 | 0.062650 | 0.998 | -0.267035 | 0.793614 |
999 | 99.9 | 0.078997 | 0.999 | -0.235970 | 0.842027 |
1000 rows × 5 columns
Creating an EventSet from a DataFrame
As mentioned in the previous section, any kind of signal is represented in Temporian as a collection of events, using the EventSet object.
In this case there are no indexes because we only have one sequence. In the third part we'll learn how to use them and why they can be useful.
# Convert the DataFrame into a Temporian EventSet
evset_signals = tp.from_pandas(df_signals)
evset_signals
timestamp | noise | trend | season | signal |
---|---|---|---|---|
0 | -0.0118 | 0 | 0 | -0.0118 |
0.1 | -0.05915 | 0.001 | 0.03993 | -0.01822 |
0.2 | -0.1256 | 0.002 | 0.07947 | -0.04413 |
… | … | … | … | … |
99.7 | -0.05628 | 0.997 | -0.2954 | 0.6453 |
99.8 | 0.06265 | 0.998 | -0.267 | 0.7936 |
99.9 | 0.079 | 0.999 | -0.236 | 0.842 |
# Plot the dataset
_ = evset_signals.plot()
Note: If you're wondering why the plot has an empty () in the title, it's because we have no indexes, as mentioned above.
Part 2: Using Operators
Now, let's actually transform our data with a couple of operations.
To extract only the long-term trend, we first remove the sine and noise components with a moving average over a large window; both have zero mean, so they average out.
# Pick only one feature
signal = evset_signals["signal"]
# Moving avg
trend = signal.simple_moving_average(tp.duration.seconds(30))
trend.plot()
Notice that the feature is still named signal?
Let's give it a new name to avoid confusion.
# Let's rename the feature by adding a prefix
trend = trend.prefix("trend_")
trend.plot()
Now that we have the long-term trend, we can subtract it from the original signal to isolate the season component (still mixed with noise).
# Remove the slow 'trend' to get 'season'
detrend = signal - trend
# Rename resulting feature
detrend = detrend.rename("detrend")
detrend.plot()
Using a shorter moving average, we can filter out the noise.
denoise = detrend.simple_moving_average(tp.duration.seconds(1.5)).rename("denoise")
denoise.plot()
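To build intuition for what the two moving averages in this part compute, here is a minimal NumPy sketch of the same pipeline. It assumes a trailing window, meaning the value at time t is the mean of the samples with timestamps in (t - window_length, t]. The function `trailing_moving_average` is a hypothetical helper written for this illustration, not Temporian's implementation, and it does not reproduce details such as handling of missing values. The fixed random seed is also an addition, so that the numbers are reproducible.

```python
import numpy as np


def trailing_moving_average(timestamps, values, window_length):
    """Mean, at each timestamp t, of the values observed in (t - window_length, t].

    Hypothetical helper for illustration; timestamps must be sorted.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    # Prefix sums give the sum of any contiguous slice in constant time.
    prefix = np.concatenate(([0.0], np.cumsum(values)))
    # Index of the first event strictly newer than t - window_length.
    start = np.searchsorted(timestamps, timestamps - window_length, side="right")
    # One past the index of the last event at or before t.
    end = np.searchsorted(timestamps, timestamps, side="right")
    return (prefix[end] - prefix[start]) / (end - start)


# Same recipe as the synthetic dataset, with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
timestamps = np.arange(0, 100, 0.1)
noise = 0.1 * rng.standard_normal(len(timestamps))
trend = 0.01 * timestamps
season = 0.4 * np.sin(timestamps)
signal = noise + trend + season

slow = trailing_moving_average(timestamps, signal, 30.0)     # long-term trend estimate
detrend = signal - slow                                      # season plus noise
denoise = trailing_moving_average(timestamps, detrend, 1.5)  # smoothed season

# A 30-second window spans several periods (2*pi seconds) of the sine wave,
# so the sine contributes very little to the long-window average.
season_leak = trailing_moving_average(timestamps, season, 30.0)
print(np.abs(season_leak[timestamps >= 30]).max())
```

The long window is what makes the trend estimate insensitive to the season, while the short window is narrow enough to follow the sine wave and still average away most of the noise.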