Getting Started¶
Temporian is an open-source Python library for preprocessing and feature engineering temporal data, to get it ready for machine learning applications 🤖.
This guide will introduce you to the basics of the library, including how to:
- Create an `EventSet` and use it.
- Visualize input/output data using `EventSet.plot()` and interactive plots.
- Convert back and forth between `EventSet` and pandas `DataFrame`.
- Transform an `EventSet` by using operators.
- Work with `indexes`.
- Use common operators like `glue`, `resample`, `lag`, moving windows and arithmetic.
If you're interested in a topic that is not covered here, the final section links to other parts of the documentation where you can continue learning.
By reading this guide, you will learn how to implement a processing pipeline with Temporian, to get your data ready to train machine learning models by using straightforward operations and avoiding common mistakes.
Setup¶
# Skip this cell if you are running the notebook locally and have already installed temporian.
%pip install temporian -q
Note: you may need to restart the kernel to use updated packages.
import temporian as tp
import pandas as pd
import numpy as np
Part 1: Events and EventSets¶
Events are the basic unit of data in Temporian. They consist of a timestamp and a set of feature values. Events are not handled individually, but are instead grouped together into `EventSets`.
The main data structure in Temporian is the `EventSet`, and it represents multivariate and multi-index time sequences. Let's break that down:
- multivariate: indicates that each event in the time sequence holds several feature values.
- multi-index: indicates that the events can represent hierarchical data, and can therefore be grouped by the values of one or more of their features.
- time sequence: indicates that the events are not necessarily sampled at a uniform rate (when they are, the sequence is called a time series).
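The difference between a time sequence and a time series can be made concrete with a small plain-Python check. This is only an illustrative sketch: `is_uniformly_sampled` is a hypothetical helper written for this guide, not part of Temporian.

```python
import math


def is_uniformly_sampled(timestamps, rel_tol=1e-9):
    """Return True if consecutive timestamps are separated by a constant step."""
    if len(timestamps) < 3:
        return True  # zero or one gap is trivially uniform
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return all(math.isclose(gap, gaps[0], rel_tol=rel_tol) for gap in gaps)


regular = [1, 2, 3]         # constant step: a time series
irregular = [1, 2, 3.5, 7]  # varying step: a general time sequence

print(is_uniformly_sampled(regular))
print(is_uniformly_sampled(irregular))
```

Both kinds of data fit in an `EventSet`; the helper only shows which property separates them.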
You can create an `EventSet` from a pandas DataFrame, NumPy arrays, CSV files, and more. Here is an example containing only 3 events and 2 features:
evset = tp.event_set(
    timestamps=[1, 2, 3],
    features={
        "feature_1": [10, 20, 30],
        "feature_2": [False, False, True],
    },
)
evset
timestamp | feature_1 | feature_2 |
---|---|---|
1 | 10 | False |
2 | 20 | False |
3 | 30 | True |
An `EventSet` can hold one or several time sequences, depending on its index.
- If it has no index (e.g. the case above), an `EventSet` holds a single multivariate time sequence.
- If it has one (or more) indexes, the events are grouped by their index values. This means that the `EventSet` will hold one multivariate time sequence for each unique value (or unique combination of values) of its indexes.
Operators are applied on each time sequence of an `EventSet` independently. Indexing is the primary way to handle rich and complex databases. For instance, in a retail database, you can index on customers, stores, products, etc.
The following example will create one sequence for `blue` events, and another one for `red` ones, by specifying that one of the features is an `index`:
# EventSet with indexes
evset = tp.event_set(
    timestamps=["2023-02-04", "2023-02-06", "2023-02-07", "2023-02-07"],
    features={
        "feature_1": [0.5, 0.6, np.nan, 0.9],
        "feature_2": ["red", "blue", "red", "blue"],
        "feature_3": [10.0, -1.0, 5.0, 5.0],
    },
    indexes=["feature_2"],
)
evset
feature_2: blue
timestamp | feature_1 | feature_3 |
---|---|---|
2023-02-06 00:00:00+00:00 | 0.6 | -1 |
2023-02-07 00:00:00+00:00 | 0.9 | 5 |
feature_2: red
timestamp | feature_1 | feature_3 |
---|---|---|
2023-02-04 00:00:00+00:00 | 0.5 | 10 |
2023-02-07 00:00:00+00:00 | nan | 5 |
The last part of this tutorial shows more examples that use `indexes` together with operators.
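To picture what an index does, the grouping can be mimicked in plain Python. The sketch below is only a mental model under simplifying assumptions, not Temporian's implementation; `group_by_index` and the tuple layout of `events` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Flat event records mirroring the example above:
# (timestamp, feature_1, feature_2, feature_3).
events = [
    ("2023-02-04", 0.5, "red", 10.0),
    ("2023-02-06", 0.6, "blue", -1.0),
    ("2023-02-07", None, "red", 5.0),  # None stands in for the missing value
    ("2023-02-07", 0.9, "blue", 5.0),
]


def group_by_index(rows, index_pos):
    """Split rows into one sequence per distinct value found at index_pos."""
    groups = defaultdict(list)
    for row in rows:
        key = row[index_pos]
        # The index value becomes the group key and is removed from the row,
        # which is why the index column no longer appears among the features.
        groups[key].append(row[:index_pos] + row[index_pos + 1:])
    return dict(groups)


sequences = group_by_index(events, index_pos=2)
for key in sorted(sequences):
    print(key, sequences[key])
```

Each key holds its own list of events with its own timestamps, which is the property that lets operators treat every sequence independently.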
Example Data¶
For the following examples, we will generate some fake data consisting of a `signal` with a `timestamp` for each sample.
The signal is composed of a periodic `season` (a sine wave) and a slight positive slope, which we call `trend`, plus the ubiquitous `noise`. We will include all these components as separate features, together with the resulting `signal`.
# Generate a synthetic dataset
timestamps = np.arange(0, 100, 0.1)
n = len(timestamps)
noise = 0.1 * np.random.randn(n)
trend = 0.01 * timestamps
season = 0.4 * np.sin(timestamps)
# Convention: 'df_' for DataFrame
df_signals = pd.DataFrame(
    {
        "timestamp": timestamps,
        "noise": noise,
        "trend": trend,
        "season": season,
        "signal": noise + trend + season,
    }
)
df_signals
 | timestamp | noise | trend | season | signal |
---|---|---|---|---|---|
0 | 0.0 | 0.108578 | 0.000 | 0.000000 | 0.108578 |
1 | 0.1 | 0.030499 | 0.001 | 0.039933 | 0.071432 |
2 | 0.2 | -0.047119 | 0.002 | 0.079468 | 0.034348 |
3 | 0.3 | 0.172282 | 0.003 | 0.118208 | 0.293490 |
4 | 0.4 | 0.191504 | 0.004 | 0.155767 | 0.351271 |
... | ... | ... | ... | ... | ... |
995 | 99.5 | -0.226783 | 0.995 | -0.343118 | 0.425099 |
996 | 99.6 | -0.049185 | 0.996 | -0.320879 | 0.625936 |
997 | 99.7 | 0.040880 | 0.997 | -0.295433 | 0.742447 |
998 | 99.8 | 0.042679 | 0.998 | -0.267035 | 0.773643 |
999 | 99.9 | -0.106103 | 0.999 | -0.235970 | 0.656928 |
1000 rows × 5 columns
Creating an EventSet from a DataFrame¶
As mentioned in the previous section, any kind of signal is represented in Temporian as a collection of events, using the `EventSet` object.
In this case there are no `indexes` because we only have one sequence. In the third part we'll learn how to use them and why they can be useful.
# Convert the DataFrame into a Temporian EventSet
evset_signals = tp.from_pandas(df_signals)
evset_signals
timestamp | noise | trend | season | signal |
---|---|---|---|---|
0 | 0.1086 | 0 | 0 | 0.1086 |
0.1 | 0.0305 | 0.001 | 0.03993 | 0.07143 |
0.2 | -0.04712 | 0.002 | 0.07947 | 0.03435 |
… | … | … | … | … |
99.7 | 0.04088 | 0.997 | -0.2954 | 0.7424 |
99.8 | 0.04268 | 0.998 | -0.267 | 0.7736 |
99.9 | -0.1061 | 0.999 | -0.236 | 0.6569 |
# Plot the dataset
_ = evset_signals.plot()
Note: If you're wondering why the plot has an empty `()` in the title, it's because we have no `indexes`, as mentioned above.
Part 2: Using Operators¶
Now, let's actually transform our data with a couple of operations.
To extract only the long-term trend, we first remove the sine and noise components with a moving average over a large window. This works because both components have zero mean, so they average out.
# Pick only one feature
signal = evset_signals["signal"]
# Moving avg
trend = signal.simple_moving_average(tp.duration.seconds(30))
trend.plot()
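A time-based moving average differs from a fixed-count rolling mean because the window is measured in time, so it also works on irregular timestamps. Here is a minimal plain-Python sketch of the idea, assuming the window at time `t` covers the interval `(t - window, t]`. It is a simplified model for intuition, not Temporian's implementation, and the function defined here is a standalone helper that merely shares the operator's name.

```python
def simple_moving_average(timestamps, values, window):
    """For each timestamp t, average the values whose timestamp is in (t - window, t]."""
    result = []
    for t in timestamps:
        in_window = [v for ts, v in zip(timestamps, values) if t - window < ts <= t]
        # The event at t itself is always in the window, so in_window is never empty.
        result.append(sum(in_window) / len(in_window))
    return result


# Irregular sampling: note the large gap between t=2 and t=10.
timestamps = [0, 1, 2, 10]
values = [1.0, 3.0, 5.0, 7.0]

smoothed = simple_moving_average(timestamps, values, window=2)
print(smoothed)  # [1.0, 2.0, 4.0, 7.0]
```

At `t = 10` only one event falls inside the window, so the average equals that single value. The same effect explains why the first samples of a moving average follow the raw signal closely.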
Notice that the feature is still named `signal`? Let's give it a new name to avoid confusion.
# Let's rename the feature by adding a prefix
trend = trend.prefix("trend_")
trend.plot()
Now that we have the long-term trend, we can subtract it from the original signal to isolate the `season` component (still mixed with noise at this point).
# Remove the slow 'trend' to get 'season'
detrend = signal - trend
# Rename resulting feature
detrend = detrend.rename("detrend")
detrend.plot()
Using a shorter moving average, we can filter out the noise.
denoise = detrend.simple_moving_average(tp.duration.seconds(1.5)).rename("denoise")
denoise.plot()
Selecting and combining features¶
Features can be selected and combined to create new `EventSets` using two operations:
- Select: using `evset["feature_1"]` or `evset[["feature_1", "feature_2"]]` will return a new `EventSet` object with only one or two features respectively.
- Glue: using `tp.glue(evset_1, evset_2)` will return a new `EventSet` combining all features from both inputs. But the feature names cannot be repeated, so you may need to use `prefix()` or `rename()` before combining.
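The naming rule for glue is easy to see with plain dictionaries that map feature names to value lists. This is a hedged sketch, not Temporian code: `glue` and `add_prefix` below are hypothetical stand-ins, and the sketch assumes all inputs share the same timestamps.

```python
def add_prefix(features, prefix):
    """Return a copy of the feature dict with every name prefixed."""
    return {prefix + name: values for name, values in features.items()}


def glue(*feature_sets):
    """Merge feature dicts, refusing to overwrite an existing feature name."""
    combined = {}
    for features in feature_sets:
        for name, values in features.items():
            if name in combined:
                raise ValueError(f"duplicate feature name: {name!r}")
            combined[name] = values
    return combined


raw = {"signal": [1.0, 2.0, 3.0]}
smooth = {"signal": [1.0, 1.5, 2.0]}  # same name as in `raw`

try:
    glue(raw, smooth)
except ValueError as err:
    print(err)  # the clash is reported instead of silently overwriting

combined = glue(raw, add_prefix(smooth, "trend_"))
print(list(combined))  # ['signal', 'trend_signal']
```

Failing loudly on a name clash avoids silently dropping one of the two features, which is why a rename or prefix step usually comes before combining.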
Let's add some operations and then plot everything together:
- The `slope` of one of the signals is calculated by subtracting a delayed version of itself. Note that the time axis for this plot is shifted.
# Pack results to show all plots together
evset_result = tp.glue(
    signal,
    trend,
    detrend,
    denoise,
)
evset_result.plot()
Lag and resample¶
Just as another example, let's also calculate the derivative of the denoised signal, numerically.
# Estimate numeric derivative
# Time step
delta_t = 1
# Increment in y axis
y = denoise
y_lag = y.lag(delta_t)
delta_y = y - y_lag.resample(y)
# Remember the formula? :)
derivative = delta_y / delta_t
# Also, let's use an interactive plot just for fun.
derivative.plot(interactive=True, width_px=600)
Pretty accurate! We had a sine wave with amplitude `0.4` and unit angular frequency, so the derivative should be a cosine with amplitude `0.4`.
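The expected amplitude can also be checked by hand. A backward difference with step `delta_t` applied to `A * sin(t)` equals `(2 * A / delta_t) * sin(delta_t / 2) * cos(t - delta_t / 2)`. With `A = 0.4` and `delta_t = 1`, the peak is therefore about 0.38 rather than exactly 0.4, and the curve trails the true derivative by half a step. The sketch below verifies this identity numerically using only the standard library. It looks at the noise-free sine wave alone and ignores the smoothing applied earlier, so it is a sanity check rather than a reproduction of the plot.

```python
import math

AMPLITUDE = 0.4  # amplitude of the `season` sine wave
DELTA_T = 1.0    # same step as `delta_t` in the cell above


def backward_difference(t):
    """Numeric derivative of AMPLITUDE * sin(t) using a lagged sample."""
    y_now = AMPLITUDE * math.sin(t)
    y_before = AMPLITUDE * math.sin(t - DELTA_T)
    return (y_now - y_before) / DELTA_T


def closed_form(t):
    """Same quantity rewritten as a scaled, shifted cosine."""
    scale = 2 * AMPLITUDE * math.sin(DELTA_T / 2) / DELTA_T
    return scale * math.cos(t - DELTA_T / 2)


peak = 2 * AMPLITUDE * math.sin(DELTA_T / 2) / DELTA_T
print(round(peak, 4))  # 0.3835

# The two expressions agree at every sampled point.
max_error = max(
    abs(backward_difference(0.1 * i) - closed_form(0.1 * i)) for i in range(1000)
)
print(max_error < 1e-12)  # True
```

A smaller `delta_t` brings the peak closer to 0.4 and reduces the half-step delay, at the cost of amplifying whatever noise remains in the signal.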
Now, taking a look at the operators, the `lag()` call is pretty self-descriptive. But you might be wondering, why is the `resample()` operator needed?
That's because `y.lag(delta_t)` just shifts the timestamps by `delta_t`, and as a result, `y` and `y_lag` are signals with different samplings.
But how would you subtract two signals that are defined at different timestamps? In Temporian, we don't like error-prone implicit magic behavior, so you have to do it explicitly. You can only do arithmetic between signals with the same sampling.
To create matching samplings, we explicitly use `y_lag.resample(y)`, creating a signal using the timestamps from `y`, but taking the values from `y_lag`. It's essentially the same signal as `y_lag`, but sampled at the same timestamps as `y`.
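The interplay between lagging and resampling can be sketched with plain lists. The helpers below are hypothetical stand-ins written for illustration, not Temporian's implementation. They assume that resampling picks, for each new timestamp, the most recent source value at or before it, and yields a missing value (`nan`) when no such value exists.

```python
import bisect
import math


def lag(timestamps, values, delta):
    """Shift every timestamp forward by delta; the values are unchanged."""
    return [t + delta for t in timestamps], list(values)


def resample(src_timestamps, src_values, new_timestamps):
    """Take, at each new timestamp, the latest source value at or before it."""
    out = []
    for t in new_timestamps:
        pos = bisect.bisect_right(src_timestamps, t)
        out.append(src_values[pos - 1] if pos > 0 else math.nan)
    return out


t = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 1.0, 4.0, 9.0]

t_lag, y_lag = lag(t, y, delta=1.0)  # timestamps become [1.0, 2.0, 3.0, 4.0]
y_prev = resample(t_lag, y_lag, t)   # aligned back onto the timestamps of y
delta_y = [now - before for now, before in zip(y, y_prev)]

print(t_lag)
print(y_prev)   # the first entry is nan: nothing precedes t=0
print(delta_y)  # nan followed by 1.0, 3.0, 5.0
```

After resampling, `y` and `y_prev` share the same timestamps, so the element-wise subtraction is well defined. The leading `nan` marks the one position where no lagged value is available yet.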
Exporting outputs from Temporian¶
You may need to use this data in different ways for downstream tasks, like training a model using whatever library you need.
If you can't use the data directly from Temporian, you can always go back to a pandas `DataFrame`:
tp.to_pandas(evset_result)
 | signal | trend_signal | detrend | denoise | timestamp |
---|---|---|---|---|---|
0 | 0.108578 | 0.108578 | 0.000000 | 0.000000 | 0.0 |
1 | 0.071432 | 0.090005 | -0.018573 | -0.009287 | 0.1 |
2 | 0.034348 | 0.071453 | -0.037104 | -0.018559 | 0.2 |
3 | 0.293490 | 0.126962 | 0.166528 | 0.027713 | 0.3 |
4 | 0.351271 | 0.171824 | 0.179447 | 0.058060 | 0.4 |
... | ... | ... | ... | ... | ... |
995 | 0.425099 | 0.841116 | -0.416017 | -0.241947 | 99.5 |
996 | 0.625936 | 0.840020 | -0.214084 | -0.239183 | 99.6 |
997 | 0.742447 | 0.839153 | -0.096706 | -0.227394 | 99.7 |
998 | 0.773643 | 0.838790 | -0.065146 | -0.225435 | 99.8 |
999 | 0.656928 | 0.837832 | -0.180904 | -0.219381 | 99.9 |
1000 rows × 5 columns
Part 3: Using indexes¶
This is the final important concept to take away from this introduction.
Indexes are useful to handle multiple signals in parallel (as mentioned at the top of this notebook).
For example, working with signals from multiple sensor devices or representing sales from many stores or products. The feature names may be exactly the same for all the data, but we need to separate them by setting the correct `index` for each one.
New example data: multiple devices¶
Let's create two signals with overlapping timestamps, each with a different `device_id`:
# Two devices with overlapping timestamps
df_device_1 = df_signals[:900].copy()
df_device_2 = df_signals[300:].copy()
# Add a column with device_id and concat
df_device_1["device_id"] = "Device 1"
df_device_2["device_id"] = "Device 2"
df_both_devices = pd.concat([df_device_1, df_device_2])
# Create evset using 'device_id' as index
evset_devices = tp.from_pandas(df_both_devices, indexes=["device_id"])
evset_devices
WARNING:root:Feature "device_id" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
device_id: Device 1
timestamp | noise | trend | season | signal |
---|---|---|---|---|
0 | 0.1086 | 0 | 0 | 0.1086 |
0.1 | 0.0305 | 0.001 | 0.03993 | 0.07143 |
0.2 | -0.04712 | 0.002 | 0.07947 | 0.03435 |
… | … | … | … | … |
89.7 | 0.08734 | 0.897 | 0.3946 | 1.379 |
89.8 | 0.1026 | 0.898 | 0.3861 | 1.387 |
89.9 | -0.2063 | 0.899 | 0.3737 | 1.066 |
device_id: Device 2
timestamp | noise | trend | season | signal |
---|---|---|---|---|
30 | 0.1919 | 0.3 | -0.3952 | 0.0967 |
30.1 | -0.04673 | 0.301 | -0.3871 | -0.1328 |
30.2 | 0.009876 | 0.302 | -0.3751 | -0.0632 |
… | … | … | … | … |
99.7 | 0.04088 | 0.997 | -0.2954 | 0.7424 |
99.8 | 0.04268 | 0.998 | -0.267 | 0.7736 |
99.9 | -0.1061 | 0.999 | -0.236 | 0.6569 |
As you can see above, each index value has its own timestamps and feature values. They will always have the same features though, because they're in the same `EventSet`.
The plots also adapt to show each index value separately. In particular, see below how the timestamps are different and partly overlapping, and that's completely fine for separate index values. This wouldn't be possible by using different feature names for each sensor, for example.
evset_devices["signal"].plot()
Operations with index¶
Any operator that we apply now is aware of the `index` and will be performed over each index value separately.
# Apply some operations
trend_i = evset_devices["signal"].simple_moving_average(tp.duration.seconds(30))
detrend_i = evset_devices["signal"] - trend_i
denoise_i = detrend_i.simple_moving_average(tp.duration.seconds(1.5))
# Plot for each index
tp.glue(
    evset_devices["signal"],
    detrend_i.rename("detrend"),
    denoise_i.rename("denoise"),
).plot()
Multi-indexes¶
Finally, let's point out that multiple columns of the input data may be set as indexes.
For example, in the case of sales in a store, we could use both the store and product columns to group the sequences. In this case, each group would contain the sales for a single product in a single store.
This is easy to do since the `indexes` argument is actually a list of columns, and each group is represented in Temporian by a tuple `(store, product)` as the index key.
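The tuple-key idea can be illustrated with a plain dictionary. The records and names below (`store_a`, `apple`, and so on) are made up for this sketch, and the code is a mental model rather than Temporian's implementation.

```python
from collections import defaultdict

# Hypothetical sales records: (timestamp, store, product, sales).
records = [
    (1, "store_a", "apple", 3.0),
    (1, "store_a", "pear", 1.0),
    (2, "store_b", "apple", 4.0),
    (3, "store_a", "apple", 2.0),
]

# One sequence per (store, product) combination.
sequences = defaultdict(list)
for timestamp, store, product, sales in records:
    sequences[(store, product)].append((timestamp, sales))

# An operation applied per group never mixes data across index keys.
total_sales = {
    key: sum(sales for _, sales in events) for key, events in sequences.items()
}

for key in sorted(total_sales):
    print(key, total_sales[key])
```

Here `("store_a", "apple")` and `("store_b", "apple")` stay separate even though they share a product, which is the behavior described above for a two-column index.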
Summary¶
Congratulations! You now have the basic concepts needed to create a data preprocessing pipeline with Temporian:
- Defining an `EventSet` and using operators on it.
- Combining features using `select` and `glue`.
- Converting data back and forth between Temporian's `EventSet` and pandas `DataFrames`.
- Visualizing input/output data using `EventSet.plot()`.
- Operating and plotting with `indexes`.
Other important details¶
To keep this guide short and concise, some interesting concepts were not covered above:
- Check the Time Units section of the User Guide. There are many calendar operators available when working with datetimes.
- To combine or operate on events from different sampling sources (potentially with non-uniform samplings), check the sampling section of the User Guide.
- Temporian is strict about feature data types when applying operations, to avoid potentially silent errors or memory issues. Check the User Guide's casting section to learn how to tackle those cases.
Next Steps¶
The Recipes are short and self-contained examples showing how to use Temporian in typical use cases.
Try the more advanced tutorials to continue learning by example about all these topics and more!
Learn how Temporian is ready for production, using graph mode or Apache Beam.
We could only cover a small fraction of all available operators.
We put a lot of ❤️ in the User Guide, so make sure to check it out 🙂.