Supervised anomaly detection with Temporian and scikit-learn¶
In this tutorial we'll use Temporian to perform exploratory data analysis and feature engineering on the ServerMachineDataset (SMD), published as part of the OmniAnomaly paper. We'll then train a simple MLP (multi-layer perceptron, or fully-connected neural network) on it in a supervised fashion to detect anomalies.
Check out the Unsupervised anomaly detection tutorial for a version of this notebook that trains a model without ground truth labels, which is very common in anomaly detection use cases.
The ServerMachineDataset (hosted as csv files in that same repository) is a 5-week-long dataset collected from a large internet company. It is made up of system metrics (such as CPU utilization, network in and out, memory usage, etc.) from 28 different machines belonging to 3 groups.
The data has been anonymized and normalized, so there's no telling which feature is which. Its timestamps have also been removed, so we'll need to treat it as a regular time series: we know the values are sequential, but not how much time has passed between them. This means we lose out on some of Temporian's potential, but it perfectly illustrates that Temporian can be used on time series data too!
Installation and imports¶
We'll be using scikit-learn's MLPClassifier as our model, since our dataset isn't too large (about 700k rows when using all 28 machines) and a simple two-layer neural network trained on CPU seems to be fit for the job.
%pip install temporian -q
import os
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.utils.class_weight import compute_class_weight
import temporian as tp
Downloading the dataset¶
The dataset is composed of 3 groups of 8, 9, and 11 machines respectively, with names "machine-1-1", ..., "machine-3-11".
Let's create the list of names, and then download each machine's data and labels to a tmp/temporian_server_machine_dataset/ folder.
For the sake of time we'll only be using 3 machines in each group, but we encourage you to try it out on the complete dataset by uncommenting the first machines_per_group line in the cell below and commenting out the second one.
# Create list of machine names
# machines_per_group = [8, 9, 11]
machines_per_group = [3, 3, 3]
machines = [f"machine-{group}-{id}" for group, machine in zip(range(1, 4), machines_per_group) for id in range(1, machine + 1)]
machines
['machine-1-1', 'machine-1-2', 'machine-1-3', 'machine-2-1', 'machine-2-2', 'machine-2-3', 'machine-3-1', 'machine-3-2', 'machine-3-3']
data_dir = Path("tmp/temporian_server_machine_dataset")
data_dir.mkdir(parents=True, exist_ok=True)
DATA = "data.csv"
LABELS = "labels.csv"
# Download the data and labels for each machine to its own folder
for machine in machines:
    print(f"Downloading data and labels for {machine}")
    dir = data_dir / machine
    dir.mkdir(exist_ok=True)
    data_path = dir / DATA
    if not data_path.exists():
        os.system(f"wget -q -O {data_path} https://raw.githubusercontent.com/NetManAIOps/OmniAnomaly/master/ServerMachineDataset/test/{machine}.txt")
    labels_path = dir / LABELS
    if not labels_path.exists():
        os.system(f"wget -q -O {labels_path} https://raw.githubusercontent.com/NetManAIOps/OmniAnomaly/master/ServerMachineDataset/test_label/{machine}.txt")
Downloading data and labels for machine-1-1
Downloading data and labels for machine-1-2
Downloading data and labels for machine-1-3
Downloading data and labels for machine-2-1
Downloading data and labels for machine-2-2
Downloading data and labels for machine-2-3
Downloading data and labels for machine-3-1
Downloading data and labels for machine-3-2
Downloading data and labels for machine-3-3
Loading the data¶
We'll use pandas to load the data and perform some basic manipulation of it before transforming it into a Temporian EventSet.
Note that in the code below, we'll be using the loaded data's pandas index (which is a sequential one) as the "timestamp" column for each DataFrame. This will effectively render a time series, since each new event will be one unit of time ahead of the previous one, but it means that the timestamp column has no actual semantic meaning.
dfs = []
for machine in machines:
    dir = data_dir / machine
    # Read the data and labels
    df = pd.read_csv(dir / DATA, header=None).add_prefix("f")
    labels = pd.read_csv(dir / LABELS, header=None)
    df = df.assign(label=labels)
    # Assign the group and machine as features (note that the group is the character at index 8 in "machine-1-1")
    df["group"] = machine[8]
    df["machine"] = machine
    # Use index as timestamps column
    df = df.reset_index(drop=False, names="timestamp")
    # Cast column names to string
    df.columns = df.columns.astype(str)
    print(f"Events in {machine}: {len(df)}")
    dfs.append(df)
df = pd.concat(dfs)
df
Events in machine-1-1: 28479
Events in machine-1-2: 23694
Events in machine-1-3: 23703
Events in machine-2-1: 23694
Events in machine-2-2: 23700
Events in machine-2-3: 23689
Events in machine-3-1: 28700
Events in machine-3-2: 23703
Events in machine-3-3: 23703
| | timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | ... | f31 | f32 | f33 | f34 | f35 | f36 | f37 | label | group | machine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.075269 | 0.065678 | 0.070234 | 0.074332 | 0.0 | 0.933333 | 0.274011 | 0.0 | 0.031081 | ... | 0.048893 | 0.000386 | 0.000034 | 0.064432 | 0.064500 | 0.000000 | 0.000000 | 0 | 1 | machine-1-1 |
| 1 | 1 | 0.086022 | 0.080508 | 0.075808 | 0.076655 | 0.0 | 0.930769 | 0.274953 | 0.0 | 0.031081 | ... | 0.050437 | 0.000386 | 0.000022 | 0.065228 | 0.065224 | 0.000000 | 0.000000 | 0 | 1 | machine-1-1 |
| 2 | 2 | 0.075269 | 0.064619 | 0.071349 | 0.074332 | 0.0 | 0.928205 | 0.274953 | 0.0 | 0.030940 | ... | 0.055069 | 0.000386 | 0.000045 | 0.067111 | 0.067178 | 0.000000 | 0.000000 | 0 | 1 | machine-1-1 |
| 3 | 3 | 0.086022 | 0.048729 | 0.063545 | 0.070848 | 0.0 | 0.928205 | 0.273070 | 0.0 | 0.027250 | ... | 0.051467 | 0.000000 | 0.000034 | 0.066676 | 0.066744 | 0.000000 | 0.000000 | 0 | 1 | machine-1-1 |
| 4 | 4 | 0.086022 | 0.051907 | 0.062430 | 0.070848 | 0.0 | 0.933333 | 0.274011 | 0.0 | 0.030940 | ... | 0.051467 | 0.000386 | 0.000022 | 0.066604 | 0.066671 | 0.000000 | 0.000000 | 0 | 1 | machine-1-1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23698 | 23698 | 0.139785 | 0.027030 | 0.025315 | 0.033124 | 0.0 | 0.985549 | 0.865359 | 0.0 | 0.008939 | ... | 0.083451 | 0.108108 | 0.058300 | 0.163431 | 0.162907 | 0.163221 | 0.163221 | 0 | 3 | machine-3-3 |
| 23699 | 23699 | 0.139785 | 0.018722 | 0.022852 | 0.031948 | 0.0 | 0.981936 | 0.864005 | 0.0 | 0.008939 | ... | 0.082037 | 0.108108 | 0.071146 | 0.149656 | 0.149123 | 0.149360 | 0.149360 | 0 | 3 | machine-3-3 |
| 23700 | 23700 | 0.139785 | 0.018371 | 0.021620 | 0.031164 | 0.0 | 0.977601 | 0.861976 | 0.0 | 0.007204 | ... | 0.082037 | 0.081081 | 0.057312 | 0.149029 | 0.148496 | 0.149078 | 0.149078 | 0 | 3 | machine-3-3 |
| 23701 | 23701 | 0.150538 | 0.013223 | 0.018883 | 0.029596 | 0.0 | 0.981214 | 0.861976 | 0.0 | 0.007605 | ... | 0.086280 | 0.081081 | 0.036561 | 0.164684 | 0.163534 | 0.164264 | 0.164264 | 0 | 3 | machine-3-3 |
| 23702 | 23702 | 0.193548 | 0.019541 | 0.019978 | 0.029988 | 0.0 | 0.986272 | 0.864682 | 0.0 | 0.011007 | ... | 0.091938 | 0.054054 | 0.042490 | 0.175955 | 0.175439 | 0.175829 | 0.175829 | 0 | 3 | machine-3-3 |
223065 rows × 42 columns
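As a small illustration of the index-as-timestamp trick, the sketch below applies the same reset_index call to two made-up machines. The toy values and the toy_ names are hypothetical and only serve to show that each machine's clock restarts at 0, which is why the concatenated DataFrame above contains the timestamps 0, 1, 2, ... once per machine.

```python
import pandas as pd

# Hypothetical toy data: two "machines" with a single metric each
toy_frames = []
for name, values in [("machine-1-1", [0.1, 0.2, 0.3]), ("machine-1-2", [0.5, 0.6])]:
    toy = pd.DataFrame({"f0": values})
    toy["machine"] = name
    # Same call as in the loading loop: the positional index becomes the "timestamp" column
    toy = toy.reset_index(drop=False, names="timestamp")
    toy_frames.append(toy)

toy_df = pd.concat(toy_frames)
print(toy_df)
```

Since the timestamps repeat across machines, they only make sense together with the machine they belong to.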
df.describe()
| | timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | ... | f29 | f30 | f31 | f32 | f33 | f34 | f35 | f36 | f37 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.0 | 223065.000000 | ... | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 |
| mean | 12475.482474 | 0.157293 | 0.057534 | 0.064723 | 0.073195 | 0.195905 | 0.769432 | 0.450433 | 0.0 | 0.017383 | ... | 0.057789 | 0.200988 | 0.108881 | 0.038904 | 0.035894 | 0.146026 | 0.148182 | 0.032332 | 0.032332 | 0.046507 |
| std | 7307.892531 | 0.148955 | 0.088460 | 0.097902 | 0.102525 | 0.335727 | 0.195054 | 0.278884 | 0.0 | 0.045717 | ... | 0.084540 | 0.174399 | 0.137078 | 0.071978 | 0.047155 | 0.170754 | 0.170771 | 0.109479 | 0.109479 | 0.210580 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 6196.000000 | 0.041237 | 0.002837 | 0.003745 | 0.005693 | 0.000000 | 0.638994 | 0.260829 | 0.0 | 0.000119 | ... | 0.000000 | 0.058296 | 0.000000 | 0.000000 | 0.005076 | 0.002687 | 0.003036 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 12392.000000 | 0.110000 | 0.023520 | 0.030332 | 0.041748 | 0.000000 | 0.812133 | 0.383190 | 0.0 | 0.003512 | ... | 0.020833 | 0.145740 | 0.068451 | 0.000000 | 0.019663 | 0.080189 | 0.081544 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 18588.000000 | 0.236559 | 0.080138 | 0.088292 | 0.101360 | 0.301887 | 0.922922 | 0.689052 | 0.0 | 0.022140 | ... | 0.064516 | 0.314263 | 0.153846 | 0.054054 | 0.044466 | 0.216912 | 0.224265 | 0.000000 | 0.000000 | 0.000000 |
| max | 28699.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 40 columns
Awesome! Seems like we have 223065 rows and 42 columns in our dataset (using 3 machines per group), made up of:
- 38 metric columns, which will be the main features we'll be working on with Temporian
- 2 categorical columns, which will allow us to tell apart events that belong to different machines
- 1 timestamps column
- 1 labels column
As stated previously, all metrics seem to be anonymized and normalized to [0, 1], so we won't need to take care of that ourselves.
Creating an EventSet¶
Now that our data's ready, let's create a Temporian EventSet from it.
We'll use the "group" and "machine" columns as its index, which means that Temporian will treat the events corresponding to each machine as an independent time series when computing features off of it. This is clear when displaying the EventSet, which renders one table for each of our indexes' values.
Note that, since we used one column as timestamps and two others as indexes, our EventSet has 39 features instead of the previous 42.
tp.config.max_display_features = 100
tp.config.max_display_events = 5
evset = tp.from_pandas(df, indexes=["group", "machine"])
print(evset.schema.features)
evset
WARNING:root:Feature "group" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_). WARNING:root:Feature "machine" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
[('f0', float64), ('f1', float64), ('f2', float64), ('f3', float64), ('f4', float64), ('f5', float64), ('f6', float64), ('f7', float64), ('f8', float64), ('f9', float64), ('f10', float64), ('f11', float64), ('f12', float64), ('f13', float64), ('f14', float64), ('f15', float64), ('f16', float64), ('f17', float64), ('f18', float64), ('f19', float64), ('f20', float64), ('f21', float64), ('f22', float64), ('f23', float64), ('f24', float64), ('f25', float64), ('f26', float64), ('f27', float64), ('f28', float64), ('f29', float64), ('f30', float64), ('f31', float64), ('f32', float64), ('f33', float64), ('f34', float64), ('f35', float64), ('f36', float64), ('f37', float64), ('label', int64)]
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.07527 | 0.06568 | 0.07023 | 0.07433 | 0 | 0.9333 | 0.274 | 0 | 0.03108 | 0 | 0.1341 | 0.08108 | 0.0274 | 0.06781 | 0.1258 | 0.1506 | … |
| 1 | 0.08602 | 0.08051 | 0.07581 | 0.07666 | 0 | 0.9308 | 0.275 | 0 | 0.03108 | 0.000122 | 0.1488 | 0.1622 | 0.0548 | 0.0714 | 0.1231 | 0.1645 | … |
| 2 | 0.07527 | 0.06462 | 0.07135 | 0.07433 | 0 | 0.9282 | 0.275 | 0 | 0.03094 | 0.000366 | 0.1348 | 0.09459 | 0.0274 | 0.06328 | 0.129 | 0.1515 | … |
| 3 | 0.08602 | 0.04873 | 0.06355 | 0.07085 | 0 | 0.9282 | 0.2731 | 0 | 0.02725 | 0.000244 | 0.1313 | 0.08108 | 0.0274 | 0.06784 | 0.1104 | 0.1456 | … |
| 4 | 0.08602 | 0.05191 | 0.06243 | 0.07085 | 0 | 0.9333 | 0.274 | 0 | 0.03094 | 0.000244 | 0.1027 | 0.1081 | 0.0411 | 0.07565 | 0.1191 | 0.1184 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 0.1095 | 0.1266 | 0.1338 | 0.2264 | 0.621 | 0.4093 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.00722 | 0 | 0.03696 | 0.003551 | 0.02929 | … |
| 1 | 0.25 | 0.1089 | 0.1271 | 0.134 | 0.2264 | 0.5473 | 0.337 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.009025 | 0 | 0.03926 | 0.004129 | 0.03305 | … |
| 2 | 0.24 | 0.1228 | 0.1308 | 0.1353 | 0.2264 | 0.4784 | 0.2648 | 0 | 0.000653 | 0.000134 | 8e-06 | 0.01805 | 0 | 0.09858 | 0.009166 | 0.09242 | … |
| 3 | 0.23 | 0.1093 | 0.1286 | 0.1348 | 0.2264 | 0.4841 | 0.2648 | 0 | 0.001026 | 0.000134 | 8e-06 | 0.01444 | 0 | 0.102 | 0.01292 | 0.09136 | … |
| 4 | 0.24 | 0.09617 | 0.1254 | 0.1338 | 0.2264 | 0.485 | 0.2649 | 0 | 0.000187 | 1.5e-05 | 0 | 0.02347 | 0 | 0.02632 | 0.009579 | 0.02257 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.2424 | 0.01391 | 0.01491 | 0.01739 | 0.9286 | 0.4351 | 0.374 | 0 | 0.002196 | 0.000394 | 0 | 0.03985 | 0 | 0.01917 | 0.01931 | 0.01111 | … |
| 1 | 0.2424 | 0.01385 | 0.01479 | 0.01739 | 0.9286 | 0.4373 | 0.3764 | 0 | 0.000119 | 0 | 0 | 0.01414 | 0 | 0.008432 | 0.002625 | 0.00104 | … |
| 2 | 0.2222 | 0.01391 | 0.01498 | 0.01755 | 0.9286 | 0.436 | 0.3749 | 0 | 0.000178 | 0 | 0 | 0.01671 | 0 | 0.007965 | 0.003243 | 0.000964 | … |
| 3 | 0.2323 | 0.009957 | 0.01373 | 0.01708 | 0.9286 | 0.4347 | 0.3734 | 0 | 0.000178 | 0 | 0 | 0.01028 | 0 | 0.007965 | 0.002789 | 0.000832 | … |
| 4 | 0.2323 | 0.009734 | 0.01324 | 0.01692 | 0.9286 | 0.4353 | 0.374 | 0 | 0.000297 | 0 | 0 | 0.01671 | 0 | 0.008803 | 0.004606 | 0.001191 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.1111 | 0.009689 | 0.0171 | 0.04285 | 0.1579 | 0.8356 | 0.4777 | 0 | 0 | 0 | 0 | 0.03304 | 0 | 0.01214 | 0.005469 | 0.01656 | … |
| 1 | 0.101 | 0.006856 | 0.01541 | 0.04118 | 0.1579 | 0.8359 | 0.4779 | 0 | 0 | 0 | 0 | 0.02261 | 0 | 0.0238 | 0.003045 | 0.02151 | … |
| 2 | 0.1212 | 0.01429 | 0.01815 | 0.04248 | 0.1579 | 0.839 | 0.4815 | 0 | 0.001073 | 0 | 0 | 0.01391 | 0 | 0.06879 | 0.01759 | 0.04456 | … |
| 3 | 0.1313 | 0.01273 | 0.01872 | 0.04267 | 0.1579 | 0.8518 | 0.4956 | 0 | 0.001877 | 0 | 0 | 0.01565 | 0 | 0.06783 | 0.03362 | 0.1281 | … |
| 4 | 0.1111 | 0.009114 | 0.01751 | 0.04155 | 0.1579 | 0.8665 | 0.5114 | 0 | 0 | 0 | 0 | 0.01913 | 0 | 0.02304 | 0.001367 | 0.01609 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
Let's free up some memory by deleting the pandas DataFrame and casting all of our features to float32 (which will also make Temporian create new float32 features when applying operators on the original ones) and our label to int32.
del df
evset = evset.cast(tp.float32).cast({"label": tp.int32})
evset.schema.features
[('f0', float32),
('f1', float32),
('f2', float32),
('f3', float32),
('f4', float32),
('f5', float32),
('f6', float32),
('f7', float32),
('f8', float32),
('f9', float32),
('f10', float32),
('f11', float32),
('f12', float32),
('f13', float32),
('f14', float32),
('f15', float32),
('f16', float32),
('f17', float32),
('f18', float32),
('f19', float32),
('f20', float32),
('f21', float32),
('f22', float32),
('f23', float32),
('f24', float32),
('f25', float32),
('f26', float32),
('f27', float32),
('f28', float32),
('f29', float32),
('f30', float32),
('f31', float32),
('f32', float32),
('f33', float32),
('f34', float32),
('f35', float32),
('f36', float32),
('f37', float32),
('label', int32)]
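To get a feel for the savings, here is a rough NumPy sketch, independent of Temporian's internal storage, that compares the size of a hypothetical array shaped like our 38 metric columns in both precisions. The array contents and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical array with the same shape as our metrics: 223065 rows x 38 features
values64 = np.zeros((223065, 38), dtype=np.float64)
values32 = values64.astype(np.float32)

print(f"float64: {values64.nbytes / 1e6:.1f} MB")
print(f"float32: {values32.nbytes / 1e6:.1f} MB")
```

float32 keeps roughly 7 significant decimal digits, which should be plenty for metrics normalized to [0, 1].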
Data visualization¶
Let's take a look at some of the first machine's features.
evset.plot(max_num_plots=15)
evset[["label"]].plot(max_num_plots=1)
The number of plots (351) is larger than "options.max_num_plots=15". Only the first plots will be printed. The number of plots (9) is larger than "options.max_num_plots=1". Only the first plots will be printed.
Great! A lot to unpack here:
- It seems to be easy to understand when an anomaly occurs (label takes value of 1) by looking at the other plots. Features 11 to 14, for example, seem to be very correlated to the label.
- The data seems to have some periodicity to it.
- Some features seem empty, and we could evaluate dropping them if needed.
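One way to check for empty features, sketched here with pandas on a made-up DataFrame, is to look for columns with a single distinct value. In the summary statistics shown earlier, f7 has a minimum, maximum and standard deviation of 0, so it would be flagged by a check like this one. The toy values below are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, where f7 is constant like in the summary statistics above
toy = pd.DataFrame({
    "f6": [0.274, 0.275, 0.273],
    "f7": [0.0, 0.0, 0.0],
    "f8": [0.031, 0.031, 0.027],
})

# A column with at most one distinct value carries no information for the model
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]
print(constant_cols)  # ['f7']
```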
Data preparation¶
To prepare our data to train a model on it, let's start off by separating the features from the labels.
feature_names = evset.schema.feature_names()
feature_names.remove("label")
raw_features = evset[feature_names]
labels = evset[["label"]]
print("Raw features:", raw_features.schema)
print("Labels:", labels.schema)
Raw features: features: [('f0', float32), ('f1', float32), ('f2', float32), ('f3', float32), ('f4', float32), ('f5', float32), ('f6', float32), ('f7', float32), ('f8', float32), ('f9', float32), ('f10', float32), ('f11', float32), ('f12', float32), ('f13', float32), ('f14', float32), ('f15', float32), ('f16', float32), ('f17', float32), ('f18', float32), ('f19', float32), ('f20', float32), ('f21', float32), ('f22', float32), ('f23', float32), ('f24', float32), ('f25', float32), ('f26', float32), ('f27', float32), ('f28', float32), ('f29', float32), ('f30', float32), ('f31', float32), ('f32', float32), ('f33', float32), ('f34', float32), ('f35', float32), ('f36', float32), ('f37', float32)]
indexes: [('group', str_), ('machine', str_)]
is_unix_timestamp: False
Labels: features: [('label', int32)]
indexes: [('group', str_), ('machine', str_)]
is_unix_timestamp: False
Next, we'll need to split our dataset into train and test sets, for which we'll use an 80/20 split.
We'll be creating reusable functions for each step, since we'll do some iteration over the feature engineering -> training -> evaluation cycle.
CATEGORICAL_COLS = ["group", "machine"]
DROP_COLS = CATEGORICAL_COLS + ["timestamp"]
def make_datasets(X: tp.EventSet, y: tp.EventSet):
    """Splits X and y into train and test sets and transforms categorical features into one-hot-encoded features."""
    # Compute the timestamp that corresponds to 80% of the data, and use it to split the data
    # Note that the machines' time series differ in length, so the longer ones will contribute more test examples
    train_cutoff = int(len(X.get_arbitrary_index_data()) * 0.8)
    print("Last train timestamp:", train_cutoff)
    # Compute masks and split data based on cutoff
    timestamp = X.timestamps()
    train_mask = timestamp <= train_cutoff
    test_mask = ~train_mask
    X_train = X.filter(train_mask)
    X_test = X.filter(test_mask)
    y_train = y.filter(train_mask)
    y_test = y.filter(test_mask)
    # Using a DataFrame for these last few steps to feed into the scikit-learn model
    # Note that even though the raw data has no NaNs, we will create some during our feature engineering
    X_train = tp.to_pandas(X_train).fillna(-1)
    # Define and fit the one-hot encoder for our categorical features
    encoder = OneHotEncoder(sparse_output=False)
    train_encoded = encoder.fit_transform(X_train[CATEGORICAL_COLS])
    # Replace timestamp and categorical columns with the new encoded ones
    X_train = X_train.drop(columns=DROP_COLS)
    X_train = np.concatenate([X_train.to_numpy(), train_encoded], axis=1)
    # Repeat process for test set
    X_test = tp.to_pandas(X_test).fillna(-1)
    test_encoded = encoder.transform(X_test[CATEGORICAL_COLS])
    X_test = X_test.drop(columns=DROP_COLS)
    X_test = np.concatenate([X_test.to_numpy(), test_encoded], axis=1)
    # Cast our labels and remove timestamp and categorical columns
    y_train = tp.to_pandas(y_train).drop(columns=DROP_COLS).squeeze()
    y_test = tp.to_pandas(y_test).drop(columns=DROP_COLS).squeeze()
    print("Number of samples in train set:", len(X_train))
    print("Number of positive (anomalous) samples in train set:", y_train.sum())
    print("Number of samples in test set:", len(X_test))
    print("Number of positive (anomalous) samples in test set:", y_test.sum())
    return X_train, y_train, X_test, y_test
X_train, y_train, X_test, y_test = make_datasets(raw_features, labels)
Last train timestamp: 18962
Number of samples in train set: 170667
Number of positive (anomalous) samples in train set: 7186
Number of samples in test set: 52398
Number of positive (anomalous) samples in test set: 3188
Those numbers look alright. However, we seem to be dealing with a fairly unbalanced dataset, with the positive labels in the training set accounting for only about 4% of the total (7186 out of 170667). We'll remember to take that into account when evaluating our model.
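The split sizes printed above can be reproduced with a little arithmetic, which helps confirm what the cutoff is doing. The sketch below uses the series lengths printed while loading the data. It assumes that the arbitrarily picked series had 23703 events, which is consistent with the printed cutoff of 18962 but is not something the output states directly.

```python
# Series lengths printed when loading the data (one per machine)
series_lengths = [28479, 23694, 23703, 23694, 23700, 23689, 28700, 23703, 23703]

# Assumption: the arbitrarily chosen series had 23703 events, which matches the printed cutoff
train_cutoff = int(23703 * 0.8)

# Timestamps start at 0, so `timestamp <= train_cutoff` keeps up to cutoff + 1 events per machine
n_train = sum(min(length, train_cutoff + 1) for length in series_lengths)
n_test = sum(series_lengths) - n_train

print(train_cutoff, n_train, n_test)  # 18962 170667 52398
```

Every machine contributes the same 18963 training events, so the longer series (such as machine-3-1, with 28700 events) end up with more test events than the shorter ones.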
Training¶
Having done all that work to prepare our data, all that remains is to train our MLP. A small network will do the trick.
def train(X_train, y_train):
    model = MLPClassifier(
        hidden_layer_sizes=(64, 16),
        learning_rate="adaptive",
        learning_rate_init=0.0001,
        batch_size=512,
        tol=0.001,
        n_iter_no_change=3,
        random_state=0,
        verbose=True,
    )
    model.fit(X_train, y_train)
    return model
model = train(X_train, y_train)
Iteration 1, loss = 0.35762397
Iteration 2, loss = 0.19198969
Iteration 3, loss = 0.17113193
Iteration 4, loss = 0.15829303
Iteration 5, loss = 0.14980958
Iteration 6, loss = 0.14465439
Iteration 7, loss = 0.14051211
Iteration 8, loss = 0.13632020
Iteration 9, loss = 0.13243515
Iteration 10, loss = 0.12874860
Iteration 11, loss = 0.12537540
Iteration 12, loss = 0.12237949
Iteration 13, loss = 0.11961306
Iteration 14, loss = 0.11716857
Iteration 15, loss = 0.11497058
Iteration 16, loss = 0.11302140
Iteration 17, loss = 0.11129302
Iteration 18, loss = 0.10975305
Iteration 19, loss = 0.10831662
Iteration 20, loss = 0.10700486
Iteration 21, loss = 0.10579782
Iteration 22, loss = 0.10478984
Iteration 23, loss = 0.10378929
Iteration 24, loss = 0.10297079
Iteration 25, loss = 0.10216076
Iteration 26, loss = 0.10148325
Iteration 27, loss = 0.10084022
Training loss did not improve more than tol=0.001000 for 3 consecutive epochs. Stopping.
Evaluation¶
As we noticed previously, the dataset is very unbalanced. MLPClassifier's .score() returns the mean accuracy over all samples, so a model that always predicted the majority class (0) would still score very high, which would make it very hard to assess how well we're doing on the cases we value the most. Luckily, that same function accepts a sample_weight parameter, which we'll use to balance out the impact of each class's accuracy on the final score.
On top of that, we'll be reporting the model's ROC AUC score, which provides an aggregate measure of performance across all possible classification thresholds (since our model outputs probabilities for each of its two classes, and in a real-world scenario it would be up to us to define the threshold above which we consider an event to be anomalous).
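To see what the sample weights do to the accuracy, here is a minimal NumPy sketch on made-up labels and predictions (all values are hypothetical). It assumes that the weighted score is a weighted average of per-sample correctness, and that the weights follow the "balanced" heuristic of n_samples / (n_classes * class_count).

```python
import numpy as np

# Hypothetical labels and predictions: 8 normal events, 2 anomalies
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])  # one anomaly is missed

correct = (y_true == y_pred).astype(float)
plain_accuracy = correct.mean()

# "Balanced" weights: n_samples / (n_classes * class_count), looked up per sample
counts = np.bincount(y_true)
sample_weight = (len(y_true) / (len(counts) * counts))[y_true]
weighted_accuracy = np.average(correct, weights=sample_weight)

print(plain_accuracy)     # 0.9
print(weighted_accuracy)  # 0.75
```

Missing half of the anomalies barely dents the plain accuracy, while the weighted version drops to the average of the per-class accuracies (1.0 for normal events and 0.5 for anomalies).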
# Compute class weights (we only need to do this once, since it won't be changing while we iterate over feature engineering)
classes = y_train.unique()
class_weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weights = dict(zip(classes, class_weights))
print("Class weights:", class_weights)
Class weights: {0: 0.5219780891969097, 1: 11.87496521013081}
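The printed weights can be checked by hand. The sketch below applies the "balanced" heuristic, n_samples / (n_classes * class_count), to the training set counts printed earlier. The variable names are ours and not part of the notebook.

```python
# Counts printed by make_datasets for the training set
n_samples = 170667
n_positive = 7186
n_negative = n_samples - n_positive
n_classes = 2

# "balanced" heuristic: n_samples / (n_classes * class_count)
weight_negative = n_samples / (n_classes * n_negative)
weight_positive = n_samples / (n_classes * n_positive)

print(weight_negative, weight_positive)
```

Each anomalous sample ends up weighing roughly 23 times as much as a normal one, which is the ratio between the two class counts.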
figsize=(20,3)
results = {}
def evaluate(model, X_train, y_train, X_test, y_test, name):
    """Evaluates a model on its training data and unseen test data, computing accuracy score and plotting ground truth vs predictions."""
    # Compute sample weights based on class weights
    train_sample_weights = compute_sample_weight(class_weights, y_train)
    test_sample_weights = compute_sample_weight(class_weights, y_test)
    # Compute scores
    train_score = model.score(X_train, y_train, sample_weight=train_sample_weights)
    test_score = model.score(X_test, y_test, sample_weight=test_sample_weights)
    train_roc_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results[name] = {"train_acc": train_score, "test_acc": test_score, "train_roc_auc": train_roc_auc, "test_roc_auc": test_roc_auc}
    print("Results:")
    print(pd.DataFrame(results))
    # Get predictions
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    print("Train and test labels and predictions")
    tp.event_set(timestamps=y_train.index, features={"label": y_train, "pred": train_preds}).plot(style="vline")
    tp.event_set(timestamps=y_test.index, features={"label": y_test, "pred": test_preds}).plot(style="vline")
evaluate(model, X_train, y_train, X_test, y_test, "raw features")
Results:
raw features
test_acc 0.577362
test_roc_auc 0.729571
train_acc 0.635011
train_roc_auc 0.920704
Train and test labels and predictions
That's pretty decent for a first try. Our model seems to be learning, but not overfitting, on its training data. There's plenty of room for improvement though, so let's kick off the feature engineering!
Feature engineering¶
Lag features¶
Right now our model only has access to each event's raw metric values, plus the group and machine that it belongs to. This means that it has no knowledge of the context an event is happening in: some values might have been completely normal when the measuring started, but anomalous a couple of weeks later, e.g. if that machine's usage went up as a whole during that time and its baseline usage now stands much higher than it used to.
To combat this, we'll start by lagging the values of each feature. In doing this, we're providing the model with (some) information about what the metric's value looked like a couple of steps into the past.
lag_features = []
# Lag each raw feature by 1, 2, ..., 10 steps
for window in range(1, 11):
    lag_features.append(raw_features.lag(window).resample(raw_features).prefix(f"lag_{window}_"))
features = tp.glue(raw_features, *lag_features)
features
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.07527 | 0.06568 | 0.07023 | 0.07433 | 0 | 0.9333 | 0.274 | 0 | 0.03108 | 0 | 0.1341 | 0.08108 | 0.0274 | 0.06781 | 0.1258 | 0.1506 | … |
| 1 | 0.08602 | 0.08051 | 0.07581 | 0.07666 | 0 | 0.9308 | 0.275 | 0 | 0.03108 | 0.000122 | 0.1488 | 0.1622 | 0.0548 | 0.0714 | 0.1231 | 0.1645 | … |
| 2 | 0.07527 | 0.06462 | 0.07135 | 0.07433 | 0 | 0.9282 | 0.275 | 0 | 0.03094 | 0.000366 | 0.1348 | 0.0946 | 0.0274 | 0.06328 | 0.129 | 0.1515 | … |
| 3 | 0.08602 | 0.04873 | 0.06355 | 0.07085 | 0 | 0.9282 | 0.2731 | 0 | 0.02725 | 0.000244 | 0.1313 | 0.08108 | 0.0274 | 0.06784 | 0.1104 | 0.1456 | … |
| 4 | 0.08602 | 0.05191 | 0.06243 | 0.07085 | 0 | 0.9333 | 0.274 | 0 | 0.03094 | 0.000244 | 0.1027 | 0.1081 | 0.0411 | 0.07565 | 0.1191 | 0.1184 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 0.1095 | 0.1266 | 0.1338 | 0.2264 | 0.621 | 0.4093 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.00722 | 0 | 0.03695 | 0.003551 | 0.02929 | … |
| 1 | 0.25 | 0.1089 | 0.1271 | 0.134 | 0.2264 | 0.5473 | 0.337 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.009025 | 0 | 0.03926 | 0.004129 | 0.03305 | … |
| 2 | 0.24 | 0.1228 | 0.1308 | 0.1353 | 0.2264 | 0.4785 | 0.2648 | 0 | 0.000653 | 0.000134 | 8e-06 | 0.01805 | 0 | 0.09858 | 0.009166 | 0.09242 | … |
| 3 | 0.23 | 0.1093 | 0.1286 | 0.1348 | 0.2264 | 0.4841 | 0.2648 | 0 | 0.001026 | 0.000134 | 8e-06 | 0.01444 | 0 | 0.102 | 0.01292 | 0.09136 | … |
| 4 | 0.24 | 0.09617 | 0.1254 | 0.1338 | 0.2264 | 0.485 | 0.2649 | 0 | 0.000187 | 1.5e-05 | 0 | 0.02347 | 0 | 0.02632 | 0.009579 | 0.02257 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.2424 | 0.01391 | 0.01491 | 0.01739 | 0.9286 | 0.4351 | 0.374 | 0 | 0.002196 | 0.000394 | 0 | 0.03985 | 0 | 0.01917 | 0.01931 | 0.01111 | … |
| 1 | 0.2424 | 0.01385 | 0.01479 | 0.01739 | 0.9286 | 0.4373 | 0.3764 | 0 | 0.000119 | 0 | 0 | 0.01414 | 0 | 0.008432 | 0.002625 | 0.00104 | … |
| 2 | 0.2222 | 0.01391 | 0.01498 | 0.01755 | 0.9286 | 0.436 | 0.3749 | 0 | 0.000178 | 0 | 0 | 0.01671 | 0 | 0.007965 | 0.003243 | 0.000964 | … |
| 3 | 0.2323 | 0.009957 | 0.01373 | 0.01708 | 0.9286 | 0.4347 | 0.3734 | 0 | 0.000178 | 0 | 0 | 0.01028 | 0 | 0.007965 | 0.002789 | 0.000832 | … |
| 4 | 0.2323 | 0.009734 | 0.01324 | 0.01692 | 0.9286 | 0.4353 | 0.374 | 0 | 0.000297 | 0 | 0 | 0.01671 | 0 | 0.008803 | 0.004606 | 0.001191 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.1111 | 0.009689 | 0.0171 | 0.04285 | 0.1579 | 0.8356 | 0.4777 | 0 | 0 | 0 | 0 | 0.03304 | 0 | 0.01214 | 0.005469 | 0.01656 | … |
| 1 | 0.101 | 0.006856 | 0.01541 | 0.04118 | 0.1579 | 0.8359 | 0.4779 | 0 | 0 | 0 | 0 | 0.02261 | 0 | 0.0238 | 0.003045 | 0.02151 | … |
| 2 | 0.1212 | 0.01429 | 0.01815 | 0.04248 | 0.1579 | 0.839 | 0.4815 | 0 | 0.001073 | 0 | 0 | 0.01391 | 0 | 0.06879 | 0.01759 | 0.04456 | … |
| 3 | 0.1313 | 0.01273 | 0.01872 | 0.04267 | 0.1579 | 0.8518 | 0.4956 | 0 | 0.001877 | 0 | 0 | 0.01565 | 0 | 0.06783 | 0.03362 | 0.1281 | … |
| 4 | 0.1111 | 0.009114 | 0.01751 | 0.04155 | 0.1579 | 0.8665 | 0.5114 | 0 | 0 | 0 | 0 | 0.01913 | 0 | 0.02304 | 0.001367 | 0.01609 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
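For readers more familiar with pandas, the lag features can be pictured with a grouped shift. This is only a rough analogy on made-up data, not how Temporian computes them, and it holds here because consecutive events within a machine are exactly one time unit apart.

```python
import pandas as pd

# Hypothetical toy data: two machines, one metric
toy = pd.DataFrame({
    "machine": ["a", "a", "a", "a", "b", "b", "b"],
    "timestamp": [0, 1, 2, 3, 0, 1, 2],
    "f0": [0.1, 0.2, 0.3, 0.4, 0.9, 0.8, 0.7],
})

# Shift within each machine so values never leak from one series into another
for k in (1, 2):
    toy[f"lag_{k}_f0"] = toy.groupby("machine")["f0"].shift(k)

print(toy)
```

The first k events of each machine have no value k steps in the past, so their lagged features are missing. These are the NaNs that make_datasets replaces with -1 before training.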
Let's take a look at any metric alongside its lagged values. We'll select a small time window, to be able to appreciate how the time series moves to the right as the number of lagged timesteps increases.
f13_lags = features[["f13"] + [f"lag_{i}_f13" for i in range(1, 11)]]
timestamps = f13_lags.timestamps()
f13_lags = f13_lags.filter((timestamps > 21500) & (timestamps < 21600))
f13_lags.plot(max_num_plots=10)
The number of plots (99) is larger than "options.max_num_plots=10". Only the first plots will be printed.
Time to train and evaluate a new model with these new features!
X_train, y_train, X_test, y_test = make_datasets(features, labels)
model = train(X_train, y_train)
evaluate(model, X_train, y_train, X_test, y_test, "lagged features")
Last train timestamp: 18962
Number of samples in train set: 170667
Number of positive (anomalous) samples in train set: 7186
Number of samples in test set: 52398
Number of positive (anomalous) samples in test set: 3188
Iteration 1, loss = 0.26836240
Iteration 2, loss = 0.16748997
Iteration 3, loss = 0.14817471
Iteration 4, loss = 0.13660890
Iteration 5, loss = 0.12654223
Iteration 6, loss = 0.11901831
Iteration 7, loss = 0.11333389
Iteration 8, loss = 0.10884045
Iteration 9, loss = 0.10498634
Iteration 10, loss = 0.10173576
Iteration 11, loss = 0.09892545
Iteration 12, loss = 0.09629730
Iteration 13, loss = 0.09404799
Iteration 14, loss = 0.09212120
Iteration 15, loss = 0.09032455
Iteration 16, loss = 0.08873757
Iteration 17, loss = 0.08747842
Iteration 18, loss = 0.08614309
Iteration 19, loss = 0.08491832
Iteration 20, loss = 0.08392474
Iteration 21, loss = 0.08277086
Iteration 22, loss = 0.08168309
Iteration 23, loss = 0.08081591
Iteration 24, loss = 0.07996610
Iteration 25, loss = 0.07889448
Iteration 26, loss = 0.07808412
Iteration 27, loss = 0.07734530
Iteration 28, loss = 0.07645521
Iteration 29, loss = 0.07567317
Training loss did not improve more than tol=0.001000 for 3 consecutive epochs. Stopping.
Results:
| metric | raw features | lagged features |
|---|---|---|
| train_acc | 0.635011 | 0.717378 |
| test_acc | 0.577362 | 0.599498 |
| train_roc_auc | 0.920704 | 0.964004 |
| test_roc_auc | 0.729571 | 0.809318 |
Train and test labels and predictions
Moving statistic features¶
The improvement in performance given by the lag features was significant!
However, although useful, the raw lagged values aren't enough to provide the model a comprehensive look at each value's past context. Note also that we only gave it a glimpse of 10 steps into the past, and each time series has more than 24k values.
This is where moving statistics can come in handy. Instead of a list of raw values, we can provide the model an aggregation of each metric's values over the last N timesteps. For example, we can tell it what the maximum and minimum values of a metric were over the last few steps, or what the standard deviation was over the last 1000.
Luckily, Temporian's window operators make this a breeze.
moving_statistic_features = []
# Compute the moving average, standard deviation, max, and min over different windows
for window in [20, 200, 2000]:
    moving_statistic_features.append(raw_features.simple_moving_average(window).prefix(f"avg_{window}_"))
    moving_statistic_features.append(raw_features.moving_standard_deviation(window).prefix(f"std_{window}_"))
    moving_statistic_features.append(raw_features.moving_max(window).prefix(f"max_{window}_"))
    moving_statistic_features.append(raw_features.moving_min(window).prefix(f"min_{window}_"))
features = tp.glue(raw_features, *lag_features, *moving_statistic_features)
features
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.07527 | 0.06568 | 0.07023 | 0.07433 | 0 | 0.9333 | 0.274 | 0 | 0.03108 | 0 | 0.1341 | 0.08108 | 0.0274 | 0.06781 | 0.1258 | 0.1506 | … |
| 1 | 0.08602 | 0.08051 | 0.07581 | 0.07666 | 0 | 0.9308 | 0.275 | 0 | 0.03108 | 0.000122 | 0.1488 | 0.1622 | 0.0548 | 0.0714 | 0.1231 | 0.1645 | … |
| 2 | 0.07527 | 0.06462 | 0.07135 | 0.07433 | 0 | 0.9282 | 0.275 | 0 | 0.03094 | 0.000366 | 0.1348 | 0.0946 | 0.0274 | 0.06328 | 0.129 | 0.1515 | … |
| 3 | 0.08602 | 0.04873 | 0.06355 | 0.07085 | 0 | 0.9282 | 0.2731 | 0 | 0.02725 | 0.000244 | 0.1313 | 0.08108 | 0.0274 | 0.06784 | 0.1104 | 0.1456 | … |
| 4 | 0.08602 | 0.05191 | 0.06243 | 0.07085 | 0 | 0.9333 | 0.274 | 0 | 0.03094 | 0.000244 | 0.1027 | 0.1081 | 0.0411 | 0.07565 | 0.1191 | 0.1184 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 0.1095 | 0.1266 | 0.1338 | 0.2264 | 0.621 | 0.4093 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.00722 | 0 | 0.03695 | 0.003551 | 0.02929 | … |
| 1 | 0.25 | 0.1089 | 0.1271 | 0.134 | 0.2264 | 0.5473 | 0.337 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.009025 | 0 | 0.03926 | 0.004129 | 0.03305 | … |
| 2 | 0.24 | 0.1228 | 0.1308 | 0.1353 | 0.2264 | 0.4785 | 0.2648 | 0 | 0.000653 | 0.000134 | 8e-06 | 0.01805 | 0 | 0.09858 | 0.009166 | 0.09242 | … |
| 3 | 0.23 | 0.1093 | 0.1286 | 0.1348 | 0.2264 | 0.4841 | 0.2648 | 0 | 0.001026 | 0.000134 | 8e-06 | 0.01444 | 0 | 0.102 | 0.01292 | 0.09136 | … |
| 4 | 0.24 | 0.09617 | 0.1254 | 0.1338 | 0.2264 | 0.485 | 0.2649 | 0 | 0.000187 | 1.5e-05 | 0 | 0.02347 | 0 | 0.02632 | 0.009579 | 0.02257 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.2424 | 0.01391 | 0.01491 | 0.01739 | 0.9286 | 0.4351 | 0.374 | 0 | 0.002196 | 0.000394 | 0 | 0.03985 | 0 | 0.01917 | 0.01931 | 0.01111 | … |
| 1 | 0.2424 | 0.01385 | 0.01479 | 0.01739 | 0.9286 | 0.4373 | 0.3764 | 0 | 0.000119 | 0 | 0 | 0.01414 | 0 | 0.008432 | 0.002625 | 0.00104 | … |
| 2 | 0.2222 | 0.01391 | 0.01498 | 0.01755 | 0.9286 | 0.436 | 0.3749 | 0 | 0.000178 | 0 | 0 | 0.01671 | 0 | 0.007965 | 0.003243 | 0.000964 | … |
| 3 | 0.2323 | 0.009957 | 0.01373 | 0.01708 | 0.9286 | 0.4347 | 0.3734 | 0 | 0.000178 | 0 | 0 | 0.01028 | 0 | 0.007965 | 0.002789 | 0.000832 | … |
| 4 | 0.2323 | 0.009734 | 0.01324 | 0.01692 | 0.9286 | 0.4353 | 0.374 | 0 | 0.000297 | 0 | 0 | 0.01671 | 0 | 0.008803 | 0.004606 | 0.001191 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.1111 | 0.009689 | 0.0171 | 0.04285 | 0.1579 | 0.8356 | 0.4777 | 0 | 0 | 0 | 0 | 0.03304 | 0 | 0.01214 | 0.005469 | 0.01656 | … |
| 1 | 0.101 | 0.006856 | 0.01541 | 0.04118 | 0.1579 | 0.8359 | 0.4779 | 0 | 0 | 0 | 0 | 0.02261 | 0 | 0.0238 | 0.003045 | 0.02151 | … |
| 2 | 0.1212 | 0.01429 | 0.01815 | 0.04248 | 0.1579 | 0.839 | 0.4815 | 0 | 0.001073 | 0 | 0 | 0.01391 | 0 | 0.06879 | 0.01759 | 0.04456 | … |
| 3 | 0.1313 | 0.01273 | 0.01872 | 0.04267 | 0.1579 | 0.8518 | 0.4956 | 0 | 0.001877 | 0 | 0 | 0.01565 | 0 | 0.06783 | 0.03362 | 0.1281 | … |
| 4 | 0.1111 | 0.009114 | 0.01751 | 0.04155 | 0.1579 | 0.8665 | 0.5114 | 0 | 0 | 0 | 0 | 0.01913 | 0 | 0.02304 | 0.001367 | 0.01609 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
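For intuition about what these window operators compute, the following is a rough plain-Python sketch of trailing statistics over the last `window` values. The function `moving_stats` and the toy series are hypothetical, and using the population standard deviation is an assumption of this sketch rather than a statement about Temporian's behavior.

```python
from statistics import fmean, pstdev


def moving_stats(values, window):
    """Compute trailing mean, std, max and min over the last `window` values.

    Each output row summarizes values[i - window + 1 : i + 1]; near the
    start of the series the window is simply shorter.
    """
    rows = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        rows.append({
            "avg": fmean(recent),
            "std": pstdev(recent),  # population std: an assumption of this sketch
            "max": max(recent),
            "min": min(recent),
        })
    return rows


# Toy series with a single spike (hypothetical values).
series = [1.0, 2.0, 3.0, 10.0, 3.0]
stats = moving_stats(series, window=3)

for value, row in zip(series, stats):
    print(value, row)
```

Each row depends only on current and past values, so these features summarize recent context without peeking into the future.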
Taking a look at some of the generated features:
f13_stats = features[["f13", "avg_20_f13", "avg_200_f13", "max_20_f13", "max_200_f13"]]
timestamps = f13_stats.timestamps()
f13_stats = f13_stats.filter((timestamps > 21000) & (timestamps < 21100))
f13_stats.plot(max_num_plots=10)
The number of plots (45) is larger than "options.max_num_plots=10". Only the first plots will be printed.
X_train, y_train, X_test, y_test = make_datasets(features, labels)
model = train(X_train, y_train)
evaluate(model, X_train, y_train, X_test, y_test, "moving statistics")
Last train timestamp: 18962
Number of samples in train set: 170667
Number of positive (anomalous) samples in train set: 7186
Number of samples in test set: 52398
Number of positive (anomalous) samples in test set: 3188
Iteration 1, loss = 0.21733885
Iteration 2, loss = 0.13700396
Iteration 3, loss = 0.10732090
Iteration 4, loss = 0.08874875
Iteration 5, loss = 0.07546138
Iteration 6, loss = 0.06413285
Iteration 7, loss = 0.05557835
Iteration 8, loss = 0.04873534
Iteration 9, loss = 0.04378912
Iteration 10, loss = 0.03967100
Iteration 11, loss = 0.03651742
Iteration 12, loss = 0.03402103
Iteration 13, loss = 0.03181854
Iteration 14, loss = 0.02990912
Iteration 15, loss = 0.02830940
Iteration 16, loss = 0.02705818
Iteration 17, loss = 0.02552236
Iteration 18, loss = 0.02439079
Iteration 19, loss = 0.02345999
Iteration 20, loss = 0.02249410
Iteration 21, loss = 0.02164874
Iteration 22, loss = 0.02067219
Training loss did not improve more than tol=0.001000 for 3 consecutive epochs. Stopping.
Results:
| metric | raw features | lagged features | moving statistics |
|---|---|---|---|
| train_acc | 0.635011 | 0.717378 | 0.940251 |
| test_acc | 0.577362 | 0.599498 | 0.741069 |
| train_roc_auc | 0.920704 | 0.964004 | 0.996714 |
| test_roc_auc | 0.729571 | 0.809318 | 0.860711 |
Train and test labels and predictions
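That's quite an improvement! Test ROC AUC rose from 0.809318 to 0.860711 and test accuracy from 0.599498 to 0.741069. The gap between train and test metrics has also widened, which suggests our model is not generalizing as well as before, but these new features have clearly helped it learn to recognize anomalies.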
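The generalization remark can be checked with a little arithmetic on the accuracy rows of the results table. The snippet below only copies those reported numbers and subtracts them; the variable names are illustrative.

```python
# Accuracy values copied from the results table above.
results = {
    "raw features": {"train_acc": 0.635011, "test_acc": 0.577362},
    "lagged features": {"train_acc": 0.717378, "test_acc": 0.599498},
    "moving statistics": {"train_acc": 0.940251, "test_acc": 0.741069},
}

# A larger train/test gap is a rough signal of weaker generalization.
gaps = {
    name: round(metrics["train_acc"] - metrics["test_acc"], 6)
    for name, metrics in results.items()
}

for name, gap in gaps.items():
    print(f"{name}: train/test accuracy gap = {gap:.6f}")
```

The gap grows with each richer feature set even though the test metrics improve, so both statements hold at once: the model is better, and it is overfitting more.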
Per-group features¶
As of now, each machine's events only have access to that same machine's lagged values and moving statistics.
In some cases, giving the model information about each entity's parent can be helpful. Here, that could mean providing each machine's events with the average value of each metric across its group (remember that each machine belongs to one of 3 groups). Since we don't know the semantics of what a group means, this could be of little to no use, but in other contexts it can be incredibly useful, such as feeding a store's aggregated sales to each product, or one country's music preferences to each user in it!
To compute these hierarchically-aggregated features, Temporian's indexes and .propagate() operator are incredibly powerful.
grouped_features = []
# Drop the "machine" index to obtain an EventSet indexed by group only
# Operators will now operate on each group, instead of on each machine!
grouped_raw_features = raw_features.drop_index("machine", keep=False)
grouped_features.append(grouped_raw_features.moving_sum(1).prefix("gr_sum_").propagate(raw_features, resample=True))
grouped_features.append(grouped_raw_features.simple_moving_average(10).prefix("gr_sma_10_").propagate(raw_features, resample=True))
grouped_features.append(grouped_raw_features.simple_moving_average(100).prefix("gr_sma_100_").propagate(raw_features, resample=True))
grouped_features.append(grouped_raw_features.simple_moving_average(1000).prefix("gr_sma_1000_").propagate(raw_features, resample=True))
features = tp.glue(raw_features, *lag_features, *moving_statistic_features, *grouped_features)
features
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.07527 | 0.06568 | 0.07023 | 0.07433 | 0 | 0.9333 | 0.274 | 0 | 0.03108 | 0 | 0.1341 | 0.08108 | 0.0274 | 0.06781 | 0.1258 | 0.1506 | … |
| 1 | 0.08602 | 0.08051 | 0.07581 | 0.07666 | 0 | 0.9308 | 0.275 | 0 | 0.03108 | 0.000122 | 0.1488 | 0.1622 | 0.0548 | 0.0714 | 0.1231 | 0.1645 | … |
| 2 | 0.07527 | 0.06462 | 0.07135 | 0.07433 | 0 | 0.9282 | 0.275 | 0 | 0.03094 | 0.000366 | 0.1348 | 0.0946 | 0.0274 | 0.06328 | 0.129 | 0.1515 | … |
| 3 | 0.08602 | 0.04873 | 0.06355 | 0.07085 | 0 | 0.9282 | 0.2731 | 0 | 0.02725 | 0.000244 | 0.1313 | 0.08108 | 0.0274 | 0.06784 | 0.1104 | 0.1456 | … |
| 4 | 0.08602 | 0.05191 | 0.06243 | 0.07085 | 0 | 0.9333 | 0.274 | 0 | 0.03094 | 0.000244 | 0.1027 | 0.1081 | 0.0411 | 0.07565 | 0.1191 | 0.1184 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 0.1095 | 0.1266 | 0.1338 | 0.2264 | 0.621 | 0.4093 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.00722 | 0 | 0.03695 | 0.003551 | 0.02929 | … |
| 1 | 0.25 | 0.1089 | 0.1271 | 0.134 | 0.2264 | 0.5473 | 0.337 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.009025 | 0 | 0.03926 | 0.004129 | 0.03305 | … |
| 2 | 0.24 | 0.1228 | 0.1308 | 0.1353 | 0.2264 | 0.4785 | 0.2648 | 0 | 0.000653 | 0.000134 | 8e-06 | 0.01805 | 0 | 0.09858 | 0.009166 | 0.09242 | … |
| 3 | 0.23 | 0.1093 | 0.1286 | 0.1348 | 0.2264 | 0.4841 | 0.2648 | 0 | 0.001026 | 0.000134 | 8e-06 | 0.01444 | 0 | 0.102 | 0.01292 | 0.09136 | … |
| 4 | 0.24 | 0.09617 | 0.1254 | 0.1338 | 0.2264 | 0.485 | 0.2649 | 0 | 0.000187 | 1.5e-05 | 0 | 0.02347 | 0 | 0.02632 | 0.009579 | 0.02257 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.2424 | 0.01391 | 0.01491 | 0.01739 | 0.9286 | 0.4351 | 0.374 | 0 | 0.002196 | 0.000394 | 0 | 0.03985 | 0 | 0.01917 | 0.01931 | 0.01111 | … |
| 1 | 0.2424 | 0.01385 | 0.01479 | 0.01739 | 0.9286 | 0.4373 | 0.3764 | 0 | 0.000119 | 0 | 0 | 0.01414 | 0 | 0.008432 | 0.002625 | 0.00104 | … |
| 2 | 0.2222 | 0.01391 | 0.01498 | 0.01755 | 0.9286 | 0.436 | 0.3749 | 0 | 0.000178 | 0 | 0 | 0.01671 | 0 | 0.007965 | 0.003243 | 0.000964 | … |
| 3 | 0.2323 | 0.009957 | 0.01373 | 0.01708 | 0.9286 | 0.4347 | 0.3734 | 0 | 0.000178 | 0 | 0 | 0.01028 | 0 | 0.007965 | 0.002789 | 0.000832 | … |
| 4 | 0.2323 | 0.009734 | 0.01324 | 0.01692 | 0.9286 | 0.4353 | 0.374 | 0 | 0.000297 | 0 | 0 | 0.01671 | 0 | 0.008803 | 0.004606 | 0.001191 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.1111 | 0.009689 | 0.0171 | 0.04285 | 0.1579 | 0.8356 | 0.4777 | 0 | 0 | 0 | 0 | 0.03304 | 0 | 0.01214 | 0.005469 | 0.01656 | … |
| 1 | 0.101 | 0.006856 | 0.01541 | 0.04118 | 0.1579 | 0.8359 | 0.4779 | 0 | 0 | 0 | 0 | 0.02261 | 0 | 0.0238 | 0.003045 | 0.02151 | … |
| 2 | 0.1212 | 0.01429 | 0.01815 | 0.04248 | 0.1579 | 0.839 | 0.4815 | 0 | 0.001073 | 0 | 0 | 0.01391 | 0 | 0.06879 | 0.01759 | 0.04456 | … |
| 3 | 0.1313 | 0.01273 | 0.01872 | 0.04267 | 0.1579 | 0.8518 | 0.4956 | 0 | 0.001877 | 0 | 0 | 0.01565 | 0 | 0.06783 | 0.03362 | 0.1281 | … |
| 4 | 0.1111 | 0.009114 | 0.01751 | 0.04155 | 0.1579 | 0.8665 | 0.5114 | 0 | 0 | 0 | 0 | 0.01913 | 0 | 0.02304 | 0.001367 | 0.01609 | … |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
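The idea behind the code above (aggregate per group, then hand the result back to every machine in the group) can be sketched in plain Python. The toy data, group names, and machine names below are hypothetical, and a per-timestep mean stands in for the moving windows; this illustrates the concept and is not Temporian's implementation.

```python
from collections import defaultdict

# Hypothetical toy data: (group, machine) -> one metric's values per timestep.
metric = {
    ("g1", "m1"): [0.1, 0.2, 0.3],
    ("g1", "m2"): [0.3, 0.4, 0.5],
    ("g2", "m3"): [0.9, 0.8, 0.7],
}

# Step 1: collect the series by group, ignoring which machine they came from.
per_group = defaultdict(list)
for (group, _machine), values in metric.items():
    per_group[group].append(values)

# Step 2: aggregate across the machines of each group, timestep by timestep.
group_mean = {
    group: [sum(step) / len(step) for step in zip(*series)]
    for group, series in per_group.items()
}

# Step 3: broadcast each group's aggregate back to every machine in it.
group_feature = {key: group_mean[key[0]] for key in metric}

for key, values in group_feature.items():
    print(key, values)
```

Both machines in `g1` receive the same group-level series, which is the information a single machine's own lags and moving statistics cannot provide.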
X_train, y_train, X_test, y_test = make_datasets(features, labels)
model = train(X_train, y_train)
evaluate(model, X_train, y_train, X_test, y_test, "grouped moving statistics")
Last train timestamp: 18962
Number of samples in train set: 170667
Number of positive (anomalous) samples in train set: 7186
Number of samples in test set: 52398
Number of positive (anomalous) samples in test set: 3188
Iteration 1, loss = 0.34921445
Iteration 2, loss = 0.14615578
Iteration 3, loss = 0.11712329
Iteration 4, loss = 0.10027315
Iteration 5, loss = 0.08660832
Iteration 6, loss = 0.07493218
Iteration 7, loss = 0.06539237
Iteration 8, loss = 0.05848477
Iteration 9, loss = 0.05307936
Iteration 10, loss = 0.04870466
Iteration 11, loss = 0.04500330
Iteration 12, loss = 0.04192038
Iteration 13, loss = 0.03925250
Iteration 14, loss = 0.03700998
Iteration 15, loss = 0.03483423
Iteration 16, loss = 0.03309578
Iteration 17, loss = 0.03143816
Iteration 18, loss = 0.02976545
Iteration 19, loss = 0.02842013
Iteration 20, loss = 0.02715795
Iteration 21, loss = 0.02588653
Iteration 22, loss = 0.02489748
Iteration 23, loss = 0.02367352
Iteration 24, loss = 0.02287543
Iteration 25, loss = 0.02200763
Iteration 26, loss = 0.02102837
Iteration 27, loss = 0.02033093
Training loss did not improve more than tol=0.001000 for 3 consecutive epochs. Stopping.
Results:
| metric | raw features | lagged features | moving statistics | grouped moving statistics |
|---|---|---|---|---|
| train_acc | 0.635011 | 0.717378 | 0.940251 | 0.937290 |
| test_acc | 0.577362 | 0.599498 | 0.741069 | 0.702891 |
| train_roc_auc | 0.920704 | 0.964004 | 0.996714 | 0.996597 |
| test_roc_auc | 0.729571 | 0.809318 | 0.860711 | 0.847673 |
Train and test labels and predictions
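The per-group features don't seem to have been of much use: test ROC AUC dropped slightly, from 0.860711 to 0.847673, and test accuracy fell from 0.741069 to 0.702891.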
Wrapping up¶
In this notebook we learned how to perform feature engineering and visualization using Temporian, applying it to a real-world anomaly detection use case.
There's some further work that could be done on this problem! Here are some ideas:
- Train a larger model! Our two-layer MLP has been alright, but a more capable model would likely be able to find new patterns in the data, although it'll probably require some extra regularization work too.
- Keep adding new features! As we demonstrated, a very simple model can go a long way if the correct features are provided to it. This is where Temporian shines - check out the full list of operators in the API Reference for some inspiration!
- Use the dataset's unlabeled train data to craft an unsupervised solution, and then test it on the labeled test data we used for this notebook. Check out the unsupervised version of this notebook for inspiration!