Getting Started

Temporian is an open-source Python library for preprocessing and feature engineering temporal data, to get it ready for machine learning applications 🤖.

This guide will introduce you to the basics of the library, including how to:

Create an EventSet and use it.
Visualize input/output data using EventSet.plot() and interactive plots.
Convert back and forth between EventSet and pandas DataFrame.
Transform an EventSet by using operators.
Work with indexes.
Use common operators like glue, resample, lag, moving windows and arithmetic.

If you're interested in a topic that is not covered here, the final section links to other parts of the documentation where you can continue learning.

By reading this guide, you will learn how to implement a processing pipeline with Temporian, to get your data ready to train machine learning models by using straightforward operations and avoiding common mistakes.

Setup

In [1]:

# Skip this cell if you are running the notebook locally and have already installed temporian.
%pip install temporian -q

Note: you may need to restart the kernel to use updated packages.

In [2]:

import temporian as tp

import pandas as pd
import numpy as np

Part 1: Events and EventSets

Events are the basic unit of data in Temporian. They consist of a timestamp and a set of feature values. Events are not handled individually, but are instead grouped together into EventSets.

The main data structure in Temporian is the EventSet, and it represents multivariate and multi-index time sequences. Let's break that down:

multivariate: indicates that each event in the time sequence holds several feature values.
multi-index: indicates that the events can represent hierarchical data, and can therefore be grouped by the values of one or more of their features.
time sequence: indicates that the events are not necessarily sampled at a uniform rate (in which case we would call it a time series).

You can create an EventSet from a pandas DataFrame, NumPy arrays, CSV files, and more. Here is an example containing only 3 events and 2 features:

In [3]:

evset = tp.event_set(
    timestamps=[1, 2, 3],
    features={
        "feature_1": [10, 20, 30],
        "feature_2":  [False, False, True],
    },
)
evset

Out[3]:

features [2]: feature_1 (int64) , feature_2 (bool_)

indexes [0]: none

events: 3

index values: 1

memory usage: 0.7 kB

index ( ) with 3 events

timestamp	feature_1	feature_2
1	10	False
2	20	False
3	30	True

An EventSet can hold one or several time sequences, depending on its index.

If it has no index (e.g., the case above), an EventSet holds a single multivariate time sequence.
If it has one (or more) indexes, the events are grouped by their index values. This means that the EventSet will hold one multivariate time sequence for each unique value (or unique combination of values) of its indexes.

Operators are applied on each time sequence of an EventSet independently. Indexing is the primary way to handle rich and complex databases. For instance, in a retail database, you can index on customers, stores, products, etc.

The following example will create one sequence for blue events, and another one for red ones, by specifying that one of the features is an index:

In [4]:

# EventSet with indexes
evset = tp.event_set(
    timestamps=["2023-02-04", "2023-02-06", "2023-02-07", "2023-02-07"],
    features={
        "feature_1": [0.5, 0.6, np.nan, 0.9],
        "feature_2": ["red", "blue", "red", "blue"],
        "feature_3":  [10.0, -1.0, 5.0, 5.0],
    },
    indexes=["feature_2"],
)
evset

Out[4]:

features [2]: feature_1 (float64) , feature_3 (float64)

indexes [1]: feature_2 (str_)

events: 4

index values: 2

memory usage: 1.1 kB

index ( feature_2: blue ) with 2 events

timestamp	feature_1	feature_3
2023-02-06 00:00:00+00:00	0.6	-1
2023-02-07 00:00:00+00:00	0.9	5

index ( feature_2: red ) with 2 events

timestamp	feature_1	feature_3
2023-02-04 00:00:00+00:00	0.5	10
2023-02-07 00:00:00+00:00	nan	5

See the last part of this tutorial for examples that use indexes and operators.

Example Data

For the following examples, we will generate some fake data which consists of a signal with a timestamp for each sample.

The signal is composed of a periodic season (a sine wave), a slight positive slope that we call trend, and the ubiquitous noise. We will include all these components as separate features, together with the resulting signal.

In [5]:

# Generate a synthetic dataset
timestamps = np.arange(0, 100, 0.1)
n = len(timestamps)
noise = 0.1 * np.random.randn(n)
trend = 0.01 * timestamps
season = 0.4 * np.sin(timestamps)

# Convention: 'df_' for DataFrame
df_signals = pd.DataFrame(
    {
        "timestamp": timestamps,
        "noise": noise,
        "trend": trend,
        "season": season,
        "signal": noise + trend + season,
    }
)

df_signals

Out[5]:

	timestamp	noise	trend	season	signal
0	0.0	0.108578	0.000	0.000000	0.108578
1	0.1	0.030499	0.001	0.039933	0.071432
2	0.2	-0.047119	0.002	0.079468	0.034348
3	0.3	0.172282	0.003	0.118208	0.293490
4	0.4	0.191504	0.004	0.155767	0.351271
...	...	...	...	...	...
995	99.5	-0.226783	0.995	-0.343118	0.425099
996	99.6	-0.049185	0.996	-0.320879	0.625936
997	99.7	0.040880	0.997	-0.295433	0.742447
998	99.8	0.042679	0.998	-0.267035	0.773643
999	99.9	-0.106103	0.999	-0.235970	0.656928

1000 rows × 5 columns

Creating an EventSet from a DataFrame

As mentioned in the previous section, any kind of signal is represented in Temporian as a collection of events, using the EventSet object.

In this case there are no indexes because we only have one sequence. In the third part we'll learn how to use them and why they can be useful.

In [6]:

# Convert the DataFrame into a Temporian EventSet
evset_signals = tp.from_pandas(df_signals)

evset_signals

Out[6]:

features [4]: noise (float64) , trend (float64) , season (float64) , signal (float64)

indexes [0]: none

events: 1000

index values: 1

memory usage: 8.8 kB

index ( ) with 1000 events

timestamp	noise	trend	season	signal
0	0.1086	0	0	0.1086
0.1	0.0305	0.001	0.03993	0.07143
0.2	-0.04712	0.002	0.07947	0.03435
…	…	…	…	…
99.7	0.04088	0.997	-0.2954	0.7424
99.8	0.04268	0.998	-0.267	0.7736
99.9	-0.1061	0.999	-0.236	0.6569

In [7]:

# Plot the dataset
_ = evset_signals.plot()

Note: If you're wondering why the plot has an empty () in the title, it's because we have no indexes, as mentioned above.

Part 2: Using Operators

Now, let's actually transform our data with a couple of operations.

To extract only the long-term trend, we first remove the sine and noise components with a moving average over a large window. Both have zero mean, so they largely average out.
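
If you're curious about what a moving average over a time window computes, here is a minimal NumPy sketch. It is not Temporian's actual implementation: it assumes that the window is trailing, i.e. that the output at time t is the mean of the values with timestamps in (t - window, t], and the helper name trailing_moving_average is made up for this illustration. It is fed the first three signal values from the DataFrame printed above.

```python
import numpy as np


def trailing_moving_average(timestamps, values, window):
    """Mean of the values whose timestamp lies in (t - window, t], for each t.

    Illustrative helper only (assumed window convention), not a Temporian API.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for i, t in enumerate(timestamps):
        # Select the events that fall inside the trailing window ending at t.
        in_window = (timestamps > t - window) & (timestamps <= t)
        out[i] = values[in_window].mean()
    return out


# First three 'signal' samples from the DataFrame output above.
head_ts = [0.0, 0.1, 0.2]
head_signal = [0.108578, 0.071432, 0.034348]

head_trend = trailing_moving_average(head_ts, head_signal, window=30.0)
print(np.round(head_trend, 6))
```

Early outputs average only the few samples seen so far, which is why the first value equals the input itself. The three results agree with the first trend_signal values in the DataFrame exported later in this guide.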

In [8]:

# Pick only one feature
signal = evset_signals["signal"]

# Moving avg
trend = signal.simple_moving_average(tp.duration.seconds(30))
trend.plot()

Notice that the feature is still named signal?

Let's give it a new name to avoid confusion.

In [9]:

# Let's rename the feature by adding a prefix
trend = trend.prefix("trend_")
trend.plot()

Now that we have the long-term trend, we can subtract it from the original signal to get only the season component.

In [10]:

# Remove the slow 'trend' to get 'season'
detrend = signal - trend

# Rename resulting feature
detrend = detrend.rename("detrend")

detrend.plot()

Using a shorter moving average, we can filter out the noise.

In [11]:

denoise = detrend.simple_moving_average(tp.duration.seconds(1.5)).rename("denoise")
denoise.plot()

Selecting and combining features

Features can be selected and combined to create new EventSets using two operations:

Select: using evset["feature_1"] or evset[["feature_1", "feature_2"]] will return a new EventSet object with only one or two features respectively.
Glue: using tp.glue(evset_1, evset_2) will return a new EventSet combining all features from both inputs. But the feature names cannot be repeated, so you may need to use prefix() or rename() before combining.

Let's add some operations and then plot everything together:

The slope of one of the signals is calculated by subtracting a delayed version of itself. Note that the time axis for this plot is shifted.

In [12]:

# Pack results to show all plots together
evset_result = tp.glue(
    signal,
    trend,
    detrend,
    denoise
)

evset_result.plot()

Lag and resample

Just as another example, let's also calculate the derivative of the denoised signal, numerically.

In [13]:

# Estimate numeric derivative

# Time step
delta_t = 1

# Increment in y axis
y = denoise
y_lag = y.lag(delta_t)
delta_y = y - y_lag.resample(y)

# Remember the formula? :)
derivative = delta_y / delta_t

# Also, let's use an interactive plot just for fun.
derivative.plot(interactive=True, width_px=600)

Pretty accurate! We had a 0.4 amplitude sine wave with unit angular frequency (sin(t)), so the derivative should be a 0.4 amplitude cosine.
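
To get a feeling for how accurate a backward difference with delta_t = 1 can be, here is an idealized NumPy check on a noise-free 0.4 * sin(t). It is only a sketch: it ignores the smoothing introduced by the moving averages, so it will not match the plot exactly. It relies on the identity sin(t) - sin(t - d) = 2 * sin(d / 2) * cos(t - d / 2).

```python
import numpy as np

delta_t = 1.0
t = np.arange(0, 100, 0.1)


def clean_season(x):
    """Noise-free version of the season component used in this guide."""
    return 0.4 * np.sin(x)


# Backward difference, the same formula as in the cell above.
backward_diff = (clean_season(t) - clean_season(t - delta_t)) / delta_t

# Closed form of the same quantity: a cosine that is slightly attenuated
# (by `gain`) and delayed by delta_t / 2.
gain = 2 * np.sin(delta_t / 2) / delta_t
predicted = 0.4 * gain * np.cos(t - delta_t / 2)

print(round(gain, 4))                         # attenuation factor
print(np.allclose(backward_diff, predicted))  # closed form matches
```

With delta_t = 1 the estimate is a cosine whose amplitude is about 4% smaller than 0.4 and that lags the true derivative by half a time unit. A smaller delta_t reduces both effects, at the cost of amplifying any remaining noise.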

Now, taking a look at the operators, the lag() call is pretty self-descriptive. But you might be wondering, why is the resample() operator needed?

That's because y.lag(delta_t) just shifts the timestamps by delta_t, and as a result, y and y_lag are signals with different samplings.

But how would you subtract two signals that are defined at different timestamps? In Temporian, we don't like error-prone implicit magic behavior, so you have to do it explicitly. You can only do arithmetic between signals with the same sampling.

To create matching samplings, we explicitly use y_lag.resample(y), creating a signal that uses the timestamps from y but takes its values from y_lag. It's essentially the same signal as y_lag, but sampled at the same timestamps as y.
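
The idea behind resampling can be sketched in plain NumPy. This is an illustration of the concept rather than Temporian's implementation: it assumes that each target timestamp receives the latest source value at or before it, and NaN when no such value exists. The helper name resample_latest and the toy values are made up.

```python
import numpy as np


def resample_latest(src_timestamps, src_values, target_timestamps):
    """For each target timestamp, take the latest source value at or before it.

    Illustrative helper only (assumed semantics), not a Temporian API.
    Source timestamps must be sorted in increasing order.
    """
    src_timestamps = np.asarray(src_timestamps, dtype=float)
    src_values = np.asarray(src_values, dtype=float)
    target_timestamps = np.asarray(target_timestamps, dtype=float)

    # Position of the last source timestamp <= each target (-1 if none).
    idx = np.searchsorted(src_timestamps, target_timestamps, side="right") - 1

    out = np.full(target_timestamps.shape, np.nan)
    has_prior = idx >= 0
    out[has_prior] = src_values[idx[has_prior]]
    return out


# Toy signal y and a lagged copy whose timestamps are shifted by 1.5.
y_ts = np.array([0.0, 1.0, 2.0, 3.0])
y_vals = np.array([10.0, 20.0, 40.0, 70.0])
lag_ts = y_ts + 1.5  # same values, later timestamps

# Bring the lagged values back onto y's timestamps, then subtract.
y_lag_on_y = resample_latest(lag_ts, y_vals, y_ts)
delta_y = y_vals - y_lag_on_y
print(y_lag_on_y)
print(delta_y)
```

The first two timestamps of y come before any lagged event, so under this assumption they have no value to subtract. After that, both arrays share the same timestamps and the subtraction is well defined.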

Exporting outputs from Temporian

You may need to use this data in different ways for downstream tasks, like training a model using whatever library you need.

If you can't use the data directly from Temporian, you can always go back to a pandas DataFrame:

In [14]:

tp.to_pandas(evset_result)

Out[14]:

	signal	trend_signal	detrend	denoise	timestamp
0	0.108578	0.108578	0.000000	0.000000	0.0
1	0.071432	0.090005	-0.018573	-0.009287	0.1
2	0.034348	0.071453	-0.037104	-0.018559	0.2
3	0.293490	0.126962	0.166528	0.027713	0.3
4	0.351271	0.171824	0.179447	0.058060	0.4
...	...	...	...	...	...
995	0.425099	0.841116	-0.416017	-0.241947	99.5
996	0.625936	0.840020	-0.214084	-0.239183	99.6
997	0.742447	0.839153	-0.096706	-0.227394	99.7
998	0.773643	0.838790	-0.065146	-0.225435	99.8
999	0.656928	0.837832	-0.180904	-0.219381	99.9

1000 rows × 5 columns

Part 3: Using indexes

This is the final important concept to get from this introduction.

Indexes are useful to handle multiple signals in parallel (as mentioned at the top of this notebook). For example, working with signals from multiple sensor devices or representing sales from many stores or products. The feature names may be exactly the same for all the data, but we need to separate them by setting the correct index for each one.

New example data: multiple devices

Let's create two signals with overlapping timestamps, with a different device_id:

In [15]:

# Two devices with overlapping timestamps
df_device_1 = df_signals[:900].copy()
df_device_2 = df_signals[300:].copy()

# Add a column with device_id and concat
df_device_1["device_id"] = "Device 1"
df_device_2["device_id"] = "Device 2"
df_both_devices = pd.concat([df_device_1, df_device_2])

# Create evset using 'device_id' as index
evset_devices = tp.from_pandas(df_both_devices, indexes=["device_id"])
evset_devices

WARNING:root:Feature "device_id" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).

Out[15]:

features [4]: noise (float64) , trend (float64) , season (float64) , signal (float64)

indexes [1]: device_id (str_)

events: 1600

index values: 2

memory usage: 65.5 kB

index ( device_id: Device 1 ) with 900 events

timestamp	noise	trend	season	signal
0	0.1086	0	0	0.1086
0.1	0.0305	0.001	0.03993	0.07143
0.2	-0.04712	0.002	0.07947	0.03435
…	…	…	…	…
89.7	0.08734	0.897	0.3946	1.379
89.8	0.1026	0.898	0.3861	1.387
89.9	-0.2063	0.899	0.3737	1.066

index ( device_id: Device 2 ) with 700 events

timestamp	noise	trend	season	signal
30	0.1919	0.3	-0.3952	0.0967
30.1	-0.04673	0.301	-0.3871	-0.1328
30.2	0.009876	0.302	-0.3751	-0.0632
…	…	…	…	…
99.7	0.04088	0.997	-0.2954	0.7424
99.8	0.04268	0.998	-0.267	0.7736
99.9	-0.1061	0.999	-0.236	0.6569

As you can see above, each index value has its own timestamps and feature values. They will always have the same features though, because they're in the same EventSet.

The plots also adapt to show each index value separately. In particular, see below how the timestamps are different and partly overlapping, and that's completely fine for separate index values. This wouldn't be possible by using different feature names for each sensor, for example.

In [16]:

evset_devices["signal"].plot()

Operations with index

Any operator that we apply now is aware of the index and will be performed over each index value separately.

In [17]:

# Apply some operations
trend_i = evset_devices["signal"].simple_moving_average(tp.duration.seconds(30))
detrend_i = evset_devices["signal"] - trend_i
denoise_i = detrend_i.simple_moving_average(tp.duration.seconds(1.5))

# Plot for each index
tp.glue(evset_devices["signal"],
        detrend_i.rename("detrend"),
        denoise_i.rename("denoise")
       ).plot()

Multi-indexes

Finally, let's point out that multiple columns of the input data may be set as indexes.

For example, in the case of sales in a store, we could use both the store and product columns to group the sequences. In this case, each group would contain the sales for a single product in a single store.

This is easy to do since the indexes argument is actually a list of columns, and each group is represented in Temporian by using a tuple (store, product) as the index key.

Summary

Congratulations! You now have the basic concepts needed to create a data preprocessing pipeline with Temporian:

Defining an EventSet and using operators on it.
Combining features using select and glue.
Converting data back and forth between Temporian's EventSet and pandas DataFrames.
Visualizing input/output data using EventSet.plot().
Operating and plotting with indexes.

Other important details

To keep it short and concise, there are interesting concepts that were not mentioned above:

Check the Time Units section of the User Guide. There are many calendar operators available when working with datetimes.
To combine or operate with events from different sampling sources (potentially non-uniform samplings) check the sampling section of the User Guide.
Temporian is strict on the feature data types when applying operations, to avoid potentially silent errors or memory issues. Check the User Guide's casting section to learn how to tackle those cases.

Next Steps

The Recipes are short and self-contained examples showing how to use Temporian in typical use cases.
Try the more advanced tutorials to continue learning by example about all these topics and more!
Learn how Temporian is ready for production, using graph mode or Apache Beam.
We could only cover a small fraction of all available operators.
We put a lot of ❤️ in the User Guide, so make sure to check it out 🙂.