Supervised anomaly detection with Temporian and scikit-learn¶
In this tutorial we'll use Temporian to perform exploratory data analysis and feature engineering on the ServerMachineDataset (SMD), published as part of the OmniAnomaly paper, and then train a simple MLP (multi-layer perceptron, i.e. a fully-connected neural network) on it in a supervised fashion to detect anomalies.
Check out the Unsupervised anomaly detection tutorial for a version of this notebook that trains a model without ground truth labels, which is very common in anomaly detection use cases.
The ServerMachineDataset (hosted as csv files in that same repository) is a 5-week-long dataset collected from a large internet company. It is made up of system metrics (such as CPU utilization, network in and out, memory usage, etc.) from 28 different machines belonging to 3 groups.
The data has been anonymized and normalized, so there's no telling which feature measures what, and its timestamps have also been removed, so we'll need to treat it as a plain sequential time series: we know the values are ordered, but not how much time passed between each one. This makes us lose out on some of Temporian's potential - but it perfectly illustrates that Temporian can be used on plain time series data too!
Installation and imports¶
We'll be using scikit-learn's MLPClassifier as our model, since our dataset isn't too large (about 700k rows) and a simple two-layer neural network trained on CPU seems fit for the job.
%pip install temporian -q
import os
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.utils.class_weight import compute_class_weight
import temporian as tp
Downloading the dataset¶
The dataset is comprised of 3 groups of 8, 9, and 11 machines respectively, with names "machine-1-1", ..., "machine-3-11".
Let's create the list of names, and then download each machine's data and labels to a tmp/temporian_server_machine_dataset/ folder.
For the sake of time we'll only use 3 machines from each group, but we encourage you to try it out on the complete dataset by swapping the commented line for the one below it in the next cell.
# Create list of machine names
# machines_per_group = [8, 9, 11]
machines_per_group = [3, 3, 3]
machines = [f"machine-{group}-{id}" for group, count in zip(range(1, 4), machines_per_group) for id in range(1, count + 1)]
machines
['machine-1-1', 'machine-1-2', 'machine-1-3', 'machine-2-1', 'machine-2-2', 'machine-2-3', 'machine-3-1', 'machine-3-2', 'machine-3-3']
data_dir = Path("tmp/temporian_server_machine_dataset")
data_dir.mkdir(parents=True, exist_ok=True)
DATA = "data.csv"
LABELS = "labels.csv"
# Download the data and labels for each machine to its own folder
for machine in machines:
print(f"Downloading data and labels for {machine}")
dir = data_dir / machine
dir.mkdir(exist_ok=True)
data_path = dir / DATA
if not data_path.exists():
os.system(f"wget -q -O {data_path} https://raw.githubusercontent.com/NetManAIOps/OmniAnomaly/master/ServerMachineDataset/test/{machine}.txt")
labels_path = dir / LABELS
if not labels_path.exists():
os.system(f"wget -q -O {labels_path} https://raw.githubusercontent.com/NetManAIOps/OmniAnomaly/master/ServerMachineDataset/test_label/{machine}.txt")
Downloading data and labels for machine-1-1 Downloading data and labels for machine-1-2 Downloading data and labels for machine-1-3 Downloading data and labels for machine-2-1 Downloading data and labels for machine-2-2 Downloading data and labels for machine-2-3 Downloading data and labels for machine-3-1 Downloading data and labels for machine-3-2 Downloading data and labels for machine-3-3
Loading the data¶
We'll use pandas to load the data and perform some basic manipulation on it before transforming it into a Temporian EventSet.
Note that in the code below, we'll use the loaded data's pandas index (which is sequential) as the "timestamp" column for each DataFrame. This will effectively render a time series, since each new event will be one unit of time ahead of the previous one, but it means that the timestamp column has no actual semantic meaning.
dfs = []
for machine in machines:
dir = data_dir / machine
# Read the data and labels
df = pd.read_csv(dir / DATA, header=None).add_prefix("f")
labels = pd.read_csv(dir / LABELS, header=None)
df = df.assign(label=labels)
# Assign the group and machine as features (note that the group is the character at index 8 in e.g. "machine-1-1")
df["group"] = machine[8]
df["machine"] = machine
# Use index as timestamps column
df = df.reset_index(drop=False, names="timestamp")
# Cast column names to string
df.columns = df.columns.astype(str)
print(f"Events in {machine}: {len(df)}")
dfs.append(df)
df = pd.concat(dfs)
df
Events in machine-1-1: 28479 Events in machine-1-2: 23694 Events in machine-1-3: 23703 Events in machine-2-1: 23694 Events in machine-2-2: 23700 Events in machine-2-3: 23689 Events in machine-3-1: 28700 Events in machine-3-2: 23703 Events in machine-3-3: 23703
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | ... | f31 | f32 | f33 | f34 | f35 | f36 | f37 | label | group | machine | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0.075269 | 0.065678 | 0.070234 | 0.074332 | 0.0 | 0.933333 | 0.274011 | 0.0 | 0.031081 | ... | 0.048893 | 0.000386 | 0.000034 | 0.064432 | 0.064500 | 0.000000 | 0.000000 | 0 | 1 | machine-1-1 |
1 | 1 | 0.086022 | 0.080508 | 0.075808 | 0.076655 | 0.0 | 0.930769 | 0.274953 | 0.0 | 0.031081 | ... | 0.050437 | 0.000386 | 0.000022 | 0.065228 | 0.065224 | 0.000000 | 0.000000 | 0 | 1 | machine-1-1 |
2 | 2 | 0.075269 | 0.064619 | 0.071349 | 0.074332 | 0.0 | 0.928205 | 0.274953 | 0.0 | 0.030940 | ... | 0.055069 | 0.000386 | 0.000045 | 0.067111 | 0.067178 | 0.000000 | 0.000000 | 0 | 1 | machine-1-1 |
3 | 3 | 0.086022 | 0.048729 | 0.063545 | 0.070848 | 0.0 | 0.928205 | 0.273070 | 0.0 | 0.027250 | ... | 0.051467 | 0.000000 | 0.000034 | 0.066676 | 0.066744 | 0.000000 | 0.000000 | 0 | 1 | machine-1-1 |
4 | 4 | 0.086022 | 0.051907 | 0.062430 | 0.070848 | 0.0 | 0.933333 | 0.274011 | 0.0 | 0.030940 | ... | 0.051467 | 0.000386 | 0.000022 | 0.066604 | 0.066671 | 0.000000 | 0.000000 | 0 | 1 | machine-1-1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23698 | 23698 | 0.139785 | 0.027030 | 0.025315 | 0.033124 | 0.0 | 0.985549 | 0.865359 | 0.0 | 0.008939 | ... | 0.083451 | 0.108108 | 0.058300 | 0.163431 | 0.162907 | 0.163221 | 0.163221 | 0 | 3 | machine-3-3 |
23699 | 23699 | 0.139785 | 0.018722 | 0.022852 | 0.031948 | 0.0 | 0.981936 | 0.864005 | 0.0 | 0.008939 | ... | 0.082037 | 0.108108 | 0.071146 | 0.149656 | 0.149123 | 0.149360 | 0.149360 | 0 | 3 | machine-3-3 |
23700 | 23700 | 0.139785 | 0.018371 | 0.021620 | 0.031164 | 0.0 | 0.977601 | 0.861976 | 0.0 | 0.007204 | ... | 0.082037 | 0.081081 | 0.057312 | 0.149029 | 0.148496 | 0.149078 | 0.149078 | 0 | 3 | machine-3-3 |
23701 | 23701 | 0.150538 | 0.013223 | 0.018883 | 0.029596 | 0.0 | 0.981214 | 0.861976 | 0.0 | 0.007605 | ... | 0.086280 | 0.081081 | 0.036561 | 0.164684 | 0.163534 | 0.164264 | 0.164264 | 0 | 3 | machine-3-3 |
23702 | 23702 | 0.193548 | 0.019541 | 0.019978 | 0.029988 | 0.0 | 0.986272 | 0.864682 | 0.0 | 0.011007 | ... | 0.091938 | 0.054054 | 0.042490 | 0.175955 | 0.175439 | 0.175829 | 0.175829 | 0 | 3 | machine-3-3 |
223065 rows × 42 columns
df.describe()
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | ... | f29 | f30 | f31 | f32 | f33 | f34 | f35 | f36 | f37 | label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.0 | 223065.000000 | ... | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 | 223065.000000 |
mean | 12475.482474 | 0.157293 | 0.057534 | 0.064723 | 0.073195 | 0.195905 | 0.769432 | 0.450433 | 0.0 | 0.017383 | ... | 0.057789 | 0.200988 | 0.108881 | 0.038904 | 0.035894 | 0.146026 | 0.148182 | 0.032332 | 0.032332 | 0.046507 |
std | 7307.892531 | 0.148955 | 0.088460 | 0.097902 | 0.102525 | 0.335727 | 0.195054 | 0.278884 | 0.0 | 0.045717 | ... | 0.084540 | 0.174399 | 0.137078 | 0.071978 | 0.047155 | 0.170754 | 0.170771 | 0.109479 | 0.109479 | 0.210580 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 6196.000000 | 0.041237 | 0.002837 | 0.003745 | 0.005693 | 0.000000 | 0.638994 | 0.260829 | 0.0 | 0.000119 | ... | 0.000000 | 0.058296 | 0.000000 | 0.000000 | 0.005076 | 0.002687 | 0.003036 | 0.000000 | 0.000000 | 0.000000 |
50% | 12392.000000 | 0.110000 | 0.023520 | 0.030332 | 0.041748 | 0.000000 | 0.812133 | 0.383190 | 0.0 | 0.003512 | ... | 0.020833 | 0.145740 | 0.068451 | 0.000000 | 0.019663 | 0.080189 | 0.081544 | 0.000000 | 0.000000 | 0.000000 |
75% | 18588.000000 | 0.236559 | 0.080138 | 0.088292 | 0.101360 | 0.301887 | 0.922922 | 0.689052 | 0.0 | 0.022140 | ... | 0.064516 | 0.314263 | 0.153846 | 0.054054 | 0.044466 | 0.216912 | 0.224265 | 0.000000 | 0.000000 | 0.000000 |
max | 28699.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 40 columns
Awesome! Seems like we have 223065 rows and 42 columns in our dataset (708420 rows if using the complete dataset), made up of:
- 38 metric columns, which will be the main features we'll be working on with Temporian
- 2 categorical columns, which will allow us to tell apart events that belong to different machines
- 1 timestamps column
- 1 labels column
As stated previously, all metrics seem to be anonymized and normalized to [0, 1], so we won't need to take care of that ourselves.
Creating an EventSet¶
Now that our data's ready, let's create a Temporian EventSet from it.
We'll use the "group" and "machine" columns as its index, which means that Temporian will treat the events corresponding to each machine as an independent time series when computing features off of it. This is clear when displaying the EventSet, which renders one table for each of our indexes' values.
Note that, since we used one column as timestamps and two others as indexes, our EventSet has 39 features instead of the previous 42.
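For intuition, here's what indexing by ("group", "machine") means conceptually, sketched in plain Python on a few made-up events (the values below are hypothetical, not taken from the dataset):

```python
from itertools import groupby

# hypothetical (group, machine, value) events, already sorted by index
events = [("1", "machine-1-1", 0.075),
          ("1", "machine-1-1", 0.086),
          ("2", "machine-2-1", 0.230)]

# grouping by (group, machine) yields one independent series per index value,
# which is how Temporian isolates each machine's time series
series = {key: [v for *_, v in evs]
          for key, evs in groupby(events, key=lambda e: (e[0], e[1]))}
```

Each per-index series is then processed independently by every operator we apply.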
tp.config.max_display_features = 100
tp.config.max_display_events = 5
evset = tp.from_pandas(df, indexes=["group", "machine"])
print(evset.schema.features)
evset
WARNING:root:Feature "group" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_). WARNING:root:Feature "machine" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
[('f0', float64), ('f1', float64), ('f2', float64), ('f3', float64), ('f4', float64), ('f5', float64), ('f6', float64), ('f7', float64), ('f8', float64), ('f9', float64), ('f10', float64), ('f11', float64), ('f12', float64), ('f13', float64), ('f14', float64), ('f15', float64), ('f16', float64), ('f17', float64), ('f18', float64), ('f19', float64), ('f20', float64), ('f21', float64), ('f22', float64), ('f23', float64), ('f24', float64), ('f25', float64), ('f26', float64), ('f27', float64), ('f28', float64), ('f29', float64), ('f30', float64), ('f31', float64), ('f32', float64), ('f33', float64), ('f34', float64), ('f35', float64), ('f36', float64), ('f37', float64), ('label', int64)]
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.07527 | 0.06568 | 0.07023 | 0.07433 | 0 | 0.9333 | 0.274 | 0 | 0.03108 | 0 | 0.1341 | 0.08108 | 0.0274 | 0.06781 | 0.1258 | 0.1506 | … |
1 | 0.08602 | 0.08051 | 0.07581 | 0.07666 | 0 | 0.9308 | 0.275 | 0 | 0.03108 | 0.000122 | 0.1488 | 0.1622 | 0.0548 | 0.0714 | 0.1231 | 0.1645 | … |
2 | 0.07527 | 0.06462 | 0.07135 | 0.07433 | 0 | 0.9282 | 0.275 | 0 | 0.03094 | 0.000366 | 0.1348 | 0.09459 | 0.0274 | 0.06328 | 0.129 | 0.1515 | … |
3 | 0.08602 | 0.04873 | 0.06355 | 0.07085 | 0 | 0.9282 | 0.2731 | 0 | 0.02725 | 0.000244 | 0.1313 | 0.08108 | 0.0274 | 0.06784 | 0.1104 | 0.1456 | … |
4 | 0.08602 | 0.05191 | 0.06243 | 0.07085 | 0 | 0.9333 | 0.274 | 0 | 0.03094 | 0.000244 | 0.1027 | 0.1081 | 0.0411 | 0.07565 | 0.1191 | 0.1184 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | 0.1095 | 0.1266 | 0.1338 | 0.2264 | 0.621 | 0.4093 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.00722 | 0 | 0.03696 | 0.003551 | 0.02929 | … |
1 | 0.25 | 0.1089 | 0.1271 | 0.134 | 0.2264 | 0.5473 | 0.337 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.009025 | 0 | 0.03926 | 0.004129 | 0.03305 | … |
2 | 0.24 | 0.1228 | 0.1308 | 0.1353 | 0.2264 | 0.4784 | 0.2648 | 0 | 0.000653 | 0.000134 | 8e-06 | 0.01805 | 0 | 0.09858 | 0.009166 | 0.09242 | … |
3 | 0.23 | 0.1093 | 0.1286 | 0.1348 | 0.2264 | 0.4841 | 0.2648 | 0 | 0.001026 | 0.000134 | 8e-06 | 0.01444 | 0 | 0.102 | 0.01292 | 0.09136 | … |
4 | 0.24 | 0.09617 | 0.1254 | 0.1338 | 0.2264 | 0.485 | 0.2649 | 0 | 0.000187 | 1.5e-05 | 0 | 0.02347 | 0 | 0.02632 | 0.009579 | 0.02257 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.2424 | 0.01391 | 0.01491 | 0.01739 | 0.9286 | 0.4351 | 0.374 | 0 | 0.002196 | 0.000394 | 0 | 0.03985 | 0 | 0.01917 | 0.01931 | 0.01111 | … |
1 | 0.2424 | 0.01385 | 0.01479 | 0.01739 | 0.9286 | 0.4373 | 0.3764 | 0 | 0.000119 | 0 | 0 | 0.01414 | 0 | 0.008432 | 0.002625 | 0.00104 | … |
2 | 0.2222 | 0.01391 | 0.01498 | 0.01755 | 0.9286 | 0.436 | 0.3749 | 0 | 0.000178 | 0 | 0 | 0.01671 | 0 | 0.007965 | 0.003243 | 0.000964 | … |
3 | 0.2323 | 0.009957 | 0.01373 | 0.01708 | 0.9286 | 0.4347 | 0.3734 | 0 | 0.000178 | 0 | 0 | 0.01028 | 0 | 0.007965 | 0.002789 | 0.000832 | … |
4 | 0.2323 | 0.009734 | 0.01324 | 0.01692 | 0.9286 | 0.4353 | 0.374 | 0 | 0.000297 | 0 | 0 | 0.01671 | 0 | 0.008803 | 0.004606 | 0.001191 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.1111 | 0.009689 | 0.0171 | 0.04285 | 0.1579 | 0.8356 | 0.4777 | 0 | 0 | 0 | 0 | 0.03304 | 0 | 0.01214 | 0.005469 | 0.01656 | … |
1 | 0.101 | 0.006856 | 0.01541 | 0.04118 | 0.1579 | 0.8359 | 0.4779 | 0 | 0 | 0 | 0 | 0.02261 | 0 | 0.0238 | 0.003045 | 0.02151 | … |
2 | 0.1212 | 0.01429 | 0.01815 | 0.04248 | 0.1579 | 0.839 | 0.4815 | 0 | 0.001073 | 0 | 0 | 0.01391 | 0 | 0.06879 | 0.01759 | 0.04456 | … |
3 | 0.1313 | 0.01273 | 0.01872 | 0.04267 | 0.1579 | 0.8518 | 0.4956 | 0 | 0.001877 | 0 | 0 | 0.01565 | 0 | 0.06783 | 0.03362 | 0.1281 | … |
4 | 0.1111 | 0.009114 | 0.01751 | 0.04155 | 0.1579 | 0.8665 | 0.5114 | 0 | 0 | 0 | 0 | 0.01913 | 0 | 0.02304 | 0.001367 | 0.01609 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
Let's free up some memory by deleting the pandas DataFrame and casting all of our features to float32 (which will also make Temporian create new float32 features when applying operators on the original ones) and our label to int32.
del df
evset = evset.cast(tp.float32).cast({"label": tp.int32})
evset.schema.features
[('f0', float32), ('f1', float32), ('f2', float32), ('f3', float32), ('f4', float32), ('f5', float32), ('f6', float32), ('f7', float32), ('f8', float32), ('f9', float32), ('f10', float32), ('f11', float32), ('f12', float32), ('f13', float32), ('f14', float32), ('f15', float32), ('f16', float32), ('f17', float32), ('f18', float32), ('f19', float32), ('f20', float32), ('f21', float32), ('f22', float32), ('f23', float32), ('f24', float32), ('f25', float32), ('f26', float32), ('f27', float32), ('f28', float32), ('f29', float32), ('f30', float32), ('f31', float32), ('f32', float32), ('f33', float32), ('f34', float32), ('f35', float32), ('f36', float32), ('f37', float32), ('label', int32)]
Data visualization¶
Let's take a look at some of the first machine's features.
evset.plot(max_num_plots=15)
evset[["label"]].plot(max_num_plots=1)
The number of plots (351) is larger than "options.max_num_plots=15". Only the first plots will be printed. The number of plots (9) is larger than "options.max_num_plots=1". Only the first plots will be printed.
Great! A lot to unpack here:
- It seems easy to tell when an anomaly occurs (the label takes a value of 1) by looking at the other plots. Features 11 to 14, for example, seem to be highly correlated with the label.
- The data seems to have some periodicity to it.
- Some features seem empty, and we could evaluate dropping them if needed.
Data preparation¶
To prepare our data to train a model on it, let's start off by separating the features from the labels.
feature_names = evset.schema.feature_names()
feature_names.remove("label")
raw_features = evset[feature_names]
labels = evset[["label"]]
print("Raw features:", raw_features.schema)
print("Labels:", labels.schema)
Raw features: features: [('f0', float32), ('f1', float32), ('f2', float32), ('f3', float32), ('f4', float32), ('f5', float32), ('f6', float32), ('f7', float32), ('f8', float32), ('f9', float32), ('f10', float32), ('f11', float32), ('f12', float32), ('f13', float32), ('f14', float32), ('f15', float32), ('f16', float32), ('f17', float32), ('f18', float32), ('f19', float32), ('f20', float32), ('f21', float32), ('f22', float32), ('f23', float32), ('f24', float32), ('f25', float32), ('f26', float32), ('f27', float32), ('f28', float32), ('f29', float32), ('f30', float32), ('f31', float32), ('f32', float32), ('f33', float32), ('f34', float32), ('f35', float32), ('f36', float32), ('f37', float32)] indexes: [('group', str_), ('machine', str_)] is_unix_timestamp: False Labels: features: [('label', int32)] indexes: [('group', str_), ('machine', str_)] is_unix_timestamp: False
Next, we'll need to split our dataset into train and test sets, using an 80/20 split.
We'll create reusable functions for each step, since we'll iterate a few times over the feature engineering -> training -> evaluation cycle.
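Since our timestamps are just sequential indices, the temporal 80/20 split we're about to implement boils down to a cutoff on the timestamp value. A minimal plain-Python sketch on a toy 10-event series:

```python
# toy series with sequential "timestamps" 0..9, mirroring the split logic below
timestamps = list(range(10))
cutoff = int(len(timestamps) * 0.8)            # timestamp marking the 80% point
train = [t for t in timestamps if t <= cutoff] # everything up to the cutoff
test = [t for t in timestamps if t > cutoff]   # everything after it
```

Splitting by time rather than at random is essential here: a random split would leak future values into the training set.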
CATEGORICAL_COLS = ["group", "machine"]
DROP_COLS = CATEGORICAL_COLS + ["timestamp"]
def make_datasets(X: tp.EventSet, y: tp.EventSet):
"""Splits X and y into train and test sets and transforms categorical features into one-hot-encoded features."""
# Compute the timestamp that corresponds to 80% of the data, and use it to split the data
# Note that the machines' time series aren't all the same length, so the longer ones will contribute more test examples
train_cutoff = int(len(X.get_arbitrary_index_data()) * 0.8)
print("Last train timestamp:", train_cutoff)
# Compute masks and split data based on cutoff
timestamp = X.timestamps()
train_mask = timestamp <= train_cutoff
test_mask = ~train_mask
X_train = X.filter(train_mask)
X_test = X.filter(test_mask)
y_train = y.filter(train_mask)
y_test = y.filter(test_mask)
# Using a DataFrame for these last few steps to feed into the scikit-learn model
# Note that even though the raw data has no NaNs, we will create some during our feature engineering
X_train = tp.to_pandas(X_train).fillna(-1)
# Define and fit the one-hot encoder for our categorical features
encoder = OneHotEncoder(sparse_output=False)
train_encoded = encoder.fit_transform(X_train[CATEGORICAL_COLS])
# Replace timestamp and categorical columns with the new encoded ones
X_train = X_train.drop(columns=DROP_COLS)
X_train = np.concatenate([X_train.to_numpy(), train_encoded], axis=1)
# Repeat process for test set
X_test = tp.to_pandas(X_test).fillna(-1)
test_encoded = encoder.transform(X_test[CATEGORICAL_COLS])
X_test = X_test.drop(columns=DROP_COLS)
X_test = np.concatenate([X_test.to_numpy(), test_encoded], axis=1)
# Cast our labels and remove timestamp and categorical columns
y_train = tp.to_pandas(y_train).drop(columns=DROP_COLS).squeeze()
y_test = tp.to_pandas(y_test).drop(columns=DROP_COLS).squeeze()
print("Number of samples in train set:", len(X_train))
print("Number of positive (anomalous) samples in train set:", y_train.sum())
print("Number of samples in test set:", len(X_test))
print("Number of positive (anomalous) samples in test set:", y_test.sum())
return X_train, y_train, X_test, y_test
X_train, y_train, X_test, y_test = make_datasets(raw_features, labels)
Last train timestamp: 18962 Number of samples in train set: 170667 Number of positive (anomalous) samples in train set: 7186 Number of samples in test set: 52398 Number of positive (anomalous) samples in test set: 3188
Those numbers look alright. However, we seem to be dealing with a fairly unbalanced dataset, with the positive labels in the training set accounting for only about 4% of the total. We'll remember to take that into account when evaluating our model.
Training¶
Having done all that work to prepare our data, all that remains is to train our MLP. A small network will do the trick.
def train(X_train, y_train):
model = MLPClassifier(
hidden_layer_sizes=(64, 16),
learning_rate="adaptive",
learning_rate_init=0.0001,
batch_size=512,
tol=0.001,
n_iter_no_change=3,
random_state=0,
verbose=True,
)
model.fit(X_train, y_train)
return model
model = train(X_train, y_train)
Iteration 1, loss = 0.35762397 Iteration 2, loss = 0.19198969 Iteration 3, loss = 0.17113193 Iteration 4, loss = 0.15829303 Iteration 5, loss = 0.14980958 Iteration 6, loss = 0.14465439 Iteration 7, loss = 0.14051211 Iteration 8, loss = 0.13632020 Iteration 9, loss = 0.13243515 Iteration 10, loss = 0.12874860 Iteration 11, loss = 0.12537540 Iteration 12, loss = 0.12237949 Iteration 13, loss = 0.11961306 Iteration 14, loss = 0.11716857 Iteration 15, loss = 0.11497058 Iteration 16, loss = 0.11302140 Iteration 17, loss = 0.11129302 Iteration 18, loss = 0.10975305 Iteration 19, loss = 0.10831662 Iteration 20, loss = 0.10700486 Iteration 21, loss = 0.10579782 Iteration 22, loss = 0.10478984 Iteration 23, loss = 0.10378929 Iteration 24, loss = 0.10297079 Iteration 25, loss = 0.10216076 Iteration 26, loss = 0.10148325 Iteration 27, loss = 0.10084022 Training loss did not improve more than tol=0.001000 for 3 consecutive epochs. Stopping.
Evaluation¶
As we noticed previously, the dataset is very unbalanced. MLPClassifier's .score() returns the mean accuracy over all samples, which would make it very hard to assess how well we're doing on the cases we value the most. Luckily, that same method accepts a sample_weight parameter, which we'll use to balance out the impact of each class' accuracy on the final score.
On top of that, we'll report the model's ROC AUC score, which provides an aggregate measure of performance across all possible classification thresholds (since our model outputs probabilities for each of its two classes, and in a real-world scenario it would be up to us to define the threshold above which we consider an event anomalous).
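For intuition, ROC AUC has an equivalent pairwise definition: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties counting half). A minimal plain-Python sketch of that definition:

```python
def roc_auc(labels, scores):
    # fraction of (positive, negative) pairs where the positive
    # outranks the negative; ties contribute 0.5
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice we use sklearn.metrics.roc_auc_score, which computes the same quantity efficiently from the ROC curve.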
# Compute class weights (we only need to do this once, since it won't be changing while we iterate over feature engineering)
classes = y_train.unique()
class_weights = compute_class_weight("balanced", classes=classes, y=y_train)
class_weights = dict(zip(classes, class_weights))
print("Class weights:", class_weights)
Class weights: {0: 0.5219780891969097, 1: 11.87496521013081}
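As a sanity check, scikit-learn's "balanced" mode uses the formula n_samples / (n_classes * n_samples_in_class); plugging in the train-set counts printed earlier reproduces these exact weights:

```python
# counts from the train set printed above
n_samples, n_pos = 170667, 7186
n_neg = n_samples - n_pos

w_neg = n_samples / (2 * n_neg)  # weight for class 0 (~0.522)
w_pos = n_samples / (2 * n_pos)  # weight for class 1 (~11.875)
```

The rarer a class, the larger its weight, so each class contributes equally to the weighted score.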
figsize=(20,3)
results = {}
def evaluate(model, X_train, y_train, X_test, y_test, name):
"""Evaluates a model on its training data and unseen test data, computing accuracy score and plotting ground truth vs predictions."""
# Compute sample weights based on class weights
train_sample_weights = compute_sample_weight(class_weights, y_train)
test_sample_weights = compute_sample_weight(class_weights, y_test)
# Compute scores
train_score = model.score(X_train, y_train, sample_weight=train_sample_weights)
test_score = model.score(X_test, y_test, sample_weight=test_sample_weights)
train_roc_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
results[name] = {"train_acc": train_score, "test_acc": test_score, "train_roc_auc": train_roc_auc, "test_roc_auc": test_roc_auc}
print("Results:")
print(pd.DataFrame(results))
# Get predictions
train_preds = model.predict(X_train)
test_preds = model.predict(X_test)
print("Train and test labels and predictions")
tp.event_set(timestamps=y_train.index, features={"label": y_train, "pred": train_preds}).plot(style="vline")
tp.event_set(timestamps=y_test.index, features={"label": y_test, "pred": test_preds}).plot(style="vline")
evaluate(model, X_train, y_train, X_test, y_test, "raw features")
Results: raw features test_acc 0.577362 test_roc_auc 0.729571 train_acc 0.635011 train_roc_auc 0.920704 Train and test labels and predictions
That's pretty decent for a first try. Our model seems to be learning, but not overfitting, on its training data. There's plenty of room for improvement though, so let's kick off the feature engineering!
Feature engineering¶
Lag features¶
Right now our model only has access to each event's raw metric values, plus the group and machine that it belongs to. This means that it has no knowledge of the context an event is happening in - some values might have been completely normal when the measuring started, but anomalous a couple of weeks later, e.g. if that machine's usage went up as a whole during that time and its baseline usage now stands much higher than it used to.
To combat this, we'll start by lagging the values of each feature. In doing this, we're providing the model (some) information about what each metric's value looked like a couple of steps into the past.
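Conceptually, on a regularly sampled series like ours, lagging by k steps just shifts the values forward and leaves the first k positions without any history (which is where the NaNs we mentioned earlier come from). A plain-Python sketch, for intuition:

```python
def lag(values, k, fill=None):
    # shift the series k steps into the past; the first k events have
    # no history yet, so they get a fill value (NaN in Temporian)
    return [fill] * k + values[: len(values) - k]
```

Temporian's lag operator does this per index (i.e. per machine), and resample aligns the shifted values back onto the original timestamps.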
lag_features = []
# Lag each raw feature by 1, 2, ..., 10 steps
for window in range(1, 11):
lag_features.append(raw_features.lag(window).resample(raw_features).prefix(f"lag_{window}_"))
features = tp.glue(raw_features, *lag_features)
features
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.07527 | 0.06568 | 0.07023 | 0.07433 | 0 | 0.9333 | 0.274 | 0 | 0.03108 | 0 | 0.1341 | 0.08108 | 0.0274 | 0.06781 | 0.1258 | 0.1506 | … |
1 | 0.08602 | 0.08051 | 0.07581 | 0.07666 | 0 | 0.9308 | 0.275 | 0 | 0.03108 | 0.000122 | 0.1488 | 0.1622 | 0.0548 | 0.0714 | 0.1231 | 0.1645 | … |
2 | 0.07527 | 0.06462 | 0.07135 | 0.07433 | 0 | 0.9282 | 0.275 | 0 | 0.03094 | 0.000366 | 0.1348 | 0.0946 | 0.0274 | 0.06328 | 0.129 | 0.1515 | … |
3 | 0.08602 | 0.04873 | 0.06355 | 0.07085 | 0 | 0.9282 | 0.2731 | 0 | 0.02725 | 0.000244 | 0.1313 | 0.08108 | 0.0274 | 0.06784 | 0.1104 | 0.1456 | … |
4 | 0.08602 | 0.05191 | 0.06243 | 0.07085 | 0 | 0.9333 | 0.274 | 0 | 0.03094 | 0.000244 | 0.1027 | 0.1081 | 0.0411 | 0.07565 | 0.1191 | 0.1184 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | 0.1095 | 0.1266 | 0.1338 | 0.2264 | 0.621 | 0.4093 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.00722 | 0 | 0.03695 | 0.003551 | 0.02929 | … |
1 | 0.25 | 0.1089 | 0.1271 | 0.134 | 0.2264 | 0.5473 | 0.337 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.009025 | 0 | 0.03926 | 0.004129 | 0.03305 | … |
2 | 0.24 | 0.1228 | 0.1308 | 0.1353 | 0.2264 | 0.4785 | 0.2648 | 0 | 0.000653 | 0.000134 | 8e-06 | 0.01805 | 0 | 0.09858 | 0.009166 | 0.09242 | … |
3 | 0.23 | 0.1093 | 0.1286 | 0.1348 | 0.2264 | 0.4841 | 0.2648 | 0 | 0.001026 | 0.000134 | 8e-06 | 0.01444 | 0 | 0.102 | 0.01292 | 0.09136 | … |
4 | 0.24 | 0.09617 | 0.1254 | 0.1338 | 0.2264 | 0.485 | 0.2649 | 0 | 0.000187 | 1.5e-05 | 0 | 0.02347 | 0 | 0.02632 | 0.009579 | 0.02257 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.2424 | 0.01391 | 0.01491 | 0.01739 | 0.9286 | 0.4351 | 0.374 | 0 | 0.002196 | 0.000394 | 0 | 0.03985 | 0 | 0.01917 | 0.01931 | 0.01111 | … |
1 | 0.2424 | 0.01385 | 0.01479 | 0.01739 | 0.9286 | 0.4373 | 0.3764 | 0 | 0.000119 | 0 | 0 | 0.01414 | 0 | 0.008432 | 0.002625 | 0.00104 | … |
2 | 0.2222 | 0.01391 | 0.01498 | 0.01755 | 0.9286 | 0.436 | 0.3749 | 0 | 0.000178 | 0 | 0 | 0.01671 | 0 | 0.007965 | 0.003243 | 0.000964 | … |
3 | 0.2323 | 0.009957 | 0.01373 | 0.01708 | 0.9286 | 0.4347 | 0.3734 | 0 | 0.000178 | 0 | 0 | 0.01028 | 0 | 0.007965 | 0.002789 | 0.000832 | … |
4 | 0.2323 | 0.009734 | 0.01324 | 0.01692 | 0.9286 | 0.4353 | 0.374 | 0 | 0.000297 | 0 | 0 | 0.01671 | 0 | 0.008803 | 0.004606 | 0.001191 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.1111 | 0.009689 | 0.0171 | 0.04285 | 0.1579 | 0.8356 | 0.4777 | 0 | 0 | 0 | 0 | 0.03304 | 0 | 0.01214 | 0.005469 | 0.01656 | … |
1 | 0.101 | 0.006856 | 0.01541 | 0.04118 | 0.1579 | 0.8359 | 0.4779 | 0 | 0 | 0 | 0 | 0.02261 | 0 | 0.0238 | 0.003045 | 0.02151 | … |
2 | 0.1212 | 0.01429 | 0.01815 | 0.04248 | 0.1579 | 0.839 | 0.4815 | 0 | 0.001073 | 0 | 0 | 0.01391 | 0 | 0.06879 | 0.01759 | 0.04456 | … |
3 | 0.1313 | 0.01273 | 0.01872 | 0.04267 | 0.1579 | 0.8518 | 0.4956 | 0 | 0.001877 | 0 | 0 | 0.01565 | 0 | 0.06783 | 0.03362 | 0.1281 | … |
4 | 0.1111 | 0.009114 | 0.01751 | 0.04155 | 0.1579 | 0.8665 | 0.5114 | 0 | 0 | 0 | 0 | 0.01913 | 0 | 0.02304 | 0.001367 | 0.01609 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
Let's take a look at one metric alongside its lagged values. We'll select a small time window, to be able to appreciate how the time series moves to the right as the number of lagged timesteps increases.
f13_lags = features[["f13"] + [f"lag_{i}_f13" for i in range(1, 11)]]
timestamps = f13_lags.timestamps()
f13_lags = f13_lags.filter((timestamps > 21500) & (timestamps < 21600))
f13_lags.plot(max_num_plots=10)
The number of plots (99) is larger than "options.max_num_plots=10". Only the first plots will be printed.
Time to train and evaluate a new model with these new features!
X_train, y_train, X_test, y_test = make_datasets(features, labels)
model = train(X_train, y_train)
evaluate(model, X_train, y_train, X_test, y_test, "lagged features")
Last train timestamp: 18962 Number of samples in train set: 170667 Number of positive (anomalous) samples in train set: 7186 Number of samples in test set: 52398 Number of positive (anomalous) samples in test set: 3188 Iteration 1, loss = 0.26836240 Iteration 2, loss = 0.16748997 Iteration 3, loss = 0.14817471 Iteration 4, loss = 0.13660890 Iteration 5, loss = 0.12654223 Iteration 6, loss = 0.11901831 Iteration 7, loss = 0.11333389 Iteration 8, loss = 0.10884045 Iteration 9, loss = 0.10498634 Iteration 10, loss = 0.10173576 Iteration 11, loss = 0.09892545 Iteration 12, loss = 0.09629730 Iteration 13, loss = 0.09404799 Iteration 14, loss = 0.09212120 Iteration 15, loss = 0.09032455 Iteration 16, loss = 0.08873757 Iteration 17, loss = 0.08747842 Iteration 18, loss = 0.08614309 Iteration 19, loss = 0.08491832 Iteration 20, loss = 0.08392474 Iteration 21, loss = 0.08277086 Iteration 22, loss = 0.08168309 Iteration 23, loss = 0.08081591 Iteration 24, loss = 0.07996610 Iteration 25, loss = 0.07889448 Iteration 26, loss = 0.07808412 Iteration 27, loss = 0.07734530 Iteration 28, loss = 0.07645521 Iteration 29, loss = 0.07567317 Training loss did not improve more than tol=0.001000 for 3 consecutive epochs. Stopping. Results: raw features lagged features train_acc 0.635011 0.717378 test_acc 0.577362 0.599498 train_roc_auc 0.920704 0.964004 test_roc_auc 0.729571 0.809318 Train and test labels and predictions
Moving statistic features¶
The improvement in performance given by the lag features was significant!
However, useful as they are, raw lagged values alone don't give the model a comprehensive view of each value's past context. Note also that we only gave it a glimpse of 10 steps into the past, while each time series has more than 24k values.
This is where moving statistics come in handy. Instead of a list of raw values, we can provide the model an aggregation of each metric's values over the last N timesteps. For example, we can tell it what the maximum and minimum values of a metric were in the last 20 steps, or what its standard deviation was over the last 2000.
Luckily, Temporian's window operators make this a breeze.
moving_statistic_features = []
# Compute the moving average, standard deviation, max, and min over different windows
for window in [20, 200, 2000]:
moving_statistic_features.append(raw_features.simple_moving_average(window).prefix(f"avg_{window}_"))
moving_statistic_features.append(raw_features.moving_standard_deviation(window).prefix(f"std_{window}_"))
moving_statistic_features.append(raw_features.moving_max(window).prefix(f"max_{window}_"))
moving_statistic_features.append(raw_features.moving_min(window).prefix(f"min_{window}_"))
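For intuition: since this dataset's timestamps are unit-spaced, a moving statistic over a window of N is roughly equivalent to a pandas rolling aggregation with `min_periods=1`. A minimal sketch on toy stand-in data (not the Temporian API, and not the real SMD values):

```python
import pandas as pd

# Toy stand-in for one machine's metric values.
values = pd.Series([0.1, 0.4, 0.2, 0.9, 0.3, 0.5])

# With unit-spaced timestamps, a moving window of 3 aggregates the
# current value plus the previous two, emitting a value at every step.
avg_3 = values.rolling(window=3, min_periods=1).mean()
max_3 = values.rolling(window=3, min_periods=1).max()

print(avg_3.tolist())
print(max_3.tolist())
```

Each output series is aligned with the input, which is why the moving-statistic features can be glued directly onto the raw features.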
features = tp.glue(raw_features, *lag_features, *moving_statistic_features)
features
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.07527 | 0.06568 | 0.07023 | 0.07433 | 0 | 0.9333 | 0.274 | 0 | 0.03108 | 0 | 0.1341 | 0.08108 | 0.0274 | 0.06781 | 0.1258 | 0.1506 | … |
1 | 0.08602 | 0.08051 | 0.07581 | 0.07666 | 0 | 0.9308 | 0.275 | 0 | 0.03108 | 0.000122 | 0.1488 | 0.1622 | 0.0548 | 0.0714 | 0.1231 | 0.1645 | … |
2 | 0.07527 | 0.06462 | 0.07135 | 0.07433 | 0 | 0.9282 | 0.275 | 0 | 0.03094 | 0.000366 | 0.1348 | 0.0946 | 0.0274 | 0.06328 | 0.129 | 0.1515 | … |
3 | 0.08602 | 0.04873 | 0.06355 | 0.07085 | 0 | 0.9282 | 0.2731 | 0 | 0.02725 | 0.000244 | 0.1313 | 0.08108 | 0.0274 | 0.06784 | 0.1104 | 0.1456 | … |
4 | 0.08602 | 0.05191 | 0.06243 | 0.07085 | 0 | 0.9333 | 0.274 | 0 | 0.03094 | 0.000244 | 0.1027 | 0.1081 | 0.0411 | 0.07565 | 0.1191 | 0.1184 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | 0.1095 | 0.1266 | 0.1338 | 0.2264 | 0.621 | 0.4093 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.00722 | 0 | 0.03695 | 0.003551 | 0.02929 | … |
1 | 0.25 | 0.1089 | 0.1271 | 0.134 | 0.2264 | 0.5473 | 0.337 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.009025 | 0 | 0.03926 | 0.004129 | 0.03305 | … |
2 | 0.24 | 0.1228 | 0.1308 | 0.1353 | 0.2264 | 0.4785 | 0.2648 | 0 | 0.000653 | 0.000134 | 8e-06 | 0.01805 | 0 | 0.09858 | 0.009166 | 0.09242 | … |
3 | 0.23 | 0.1093 | 0.1286 | 0.1348 | 0.2264 | 0.4841 | 0.2648 | 0 | 0.001026 | 0.000134 | 8e-06 | 0.01444 | 0 | 0.102 | 0.01292 | 0.09136 | … |
4 | 0.24 | 0.09617 | 0.1254 | 0.1338 | 0.2264 | 0.485 | 0.2649 | 0 | 0.000187 | 1.5e-05 | 0 | 0.02347 | 0 | 0.02632 | 0.009579 | 0.02257 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.2424 | 0.01391 | 0.01491 | 0.01739 | 0.9286 | 0.4351 | 0.374 | 0 | 0.002196 | 0.000394 | 0 | 0.03985 | 0 | 0.01917 | 0.01931 | 0.01111 | … |
1 | 0.2424 | 0.01385 | 0.01479 | 0.01739 | 0.9286 | 0.4373 | 0.3764 | 0 | 0.000119 | 0 | 0 | 0.01414 | 0 | 0.008432 | 0.002625 | 0.00104 | … |
2 | 0.2222 | 0.01391 | 0.01498 | 0.01755 | 0.9286 | 0.436 | 0.3749 | 0 | 0.000178 | 0 | 0 | 0.01671 | 0 | 0.007965 | 0.003243 | 0.000964 | … |
3 | 0.2323 | 0.009957 | 0.01373 | 0.01708 | 0.9286 | 0.4347 | 0.3734 | 0 | 0.000178 | 0 | 0 | 0.01028 | 0 | 0.007965 | 0.002789 | 0.000832 | … |
4 | 0.2323 | 0.009734 | 0.01324 | 0.01692 | 0.9286 | 0.4353 | 0.374 | 0 | 0.000297 | 0 | 0 | 0.01671 | 0 | 0.008803 | 0.004606 | 0.001191 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.1111 | 0.009689 | 0.0171 | 0.04285 | 0.1579 | 0.8356 | 0.4777 | 0 | 0 | 0 | 0 | 0.03304 | 0 | 0.01214 | 0.005469 | 0.01656 | … |
1 | 0.101 | 0.006856 | 0.01541 | 0.04118 | 0.1579 | 0.8359 | 0.4779 | 0 | 0 | 0 | 0 | 0.02261 | 0 | 0.0238 | 0.003045 | 0.02151 | … |
2 | 0.1212 | 0.01429 | 0.01815 | 0.04248 | 0.1579 | 0.839 | 0.4815 | 0 | 0.001073 | 0 | 0 | 0.01391 | 0 | 0.06879 | 0.01759 | 0.04456 | … |
3 | 0.1313 | 0.01273 | 0.01872 | 0.04267 | 0.1579 | 0.8518 | 0.4956 | 0 | 0.001877 | 0 | 0 | 0.01565 | 0 | 0.06783 | 0.03362 | 0.1281 | … |
4 | 0.1111 | 0.009114 | 0.01751 | 0.04155 | 0.1579 | 0.8665 | 0.5114 | 0 | 0 | 0 | 0 | 0.01913 | 0 | 0.02304 | 0.001367 | 0.01609 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
Taking a look at some of the generated features:
f13_stats = features[["f13", "avg_20_f13", "avg_200_f13", "max_20_f13", "max_200_f13"]]
timestamps = f13_stats.timestamps()
f13_stats = f13_stats.filter((timestamps > 21000) & (timestamps < 21100))
f13_stats.plot(max_num_plots=10)
The number of plots (45) is larger than "options.max_num_plots=10". Only the first plots will be printed.
X_train, y_train, X_test, y_test = make_datasets(features, labels)
model = train(X_train, y_train)
evaluate(model, X_train, y_train, X_test, y_test, "moving statistics")
Last train timestamp: 18962
Number of samples in train set: 170667
Number of positive (anomalous) samples in train set: 7186
Number of samples in test set: 52398
Number of positive (anomalous) samples in test set: 3188
Iteration 1, loss = 0.21733885
Iteration 2, loss = 0.13700396
Iteration 3, loss = 0.10732090
Iteration 4, loss = 0.08874875
Iteration 5, loss = 0.07546138
Iteration 6, loss = 0.06413285
Iteration 7, loss = 0.05557835
Iteration 8, loss = 0.04873534
Iteration 9, loss = 0.04378912
Iteration 10, loss = 0.03967100
Iteration 11, loss = 0.03651742
Iteration 12, loss = 0.03402103
Iteration 13, loss = 0.03181854
Iteration 14, loss = 0.02990912
Iteration 15, loss = 0.02830940
Iteration 16, loss = 0.02705818
Iteration 17, loss = 0.02552236
Iteration 18, loss = 0.02439079
Iteration 19, loss = 0.02345999
Iteration 20, loss = 0.02249410
Iteration 21, loss = 0.02164874
Iteration 22, loss = 0.02067219
Training loss did not improve more than tol=0.001000 for 3 consecutive epochs. Stopping.
Results:
               raw features  lagged features  moving statistics
train_acc          0.635011         0.717378           0.940251
test_acc           0.577362         0.599498           0.741069
train_roc_auc      0.920704         0.964004           0.996714
test_roc_auc       0.729571         0.809318           0.860711
Train and test labels and predictions
That's quite an improvement! The larger gap between train and test metrics suggests the model isn't generalizing as well as before, but these new features have definitely helped it learn to recognize anomalies.
Per-group features¶
As of now, each machine's events only have access to that same machine's lagged values and moving statistics.
In some cases, giving the model information about each entity's parent can be helpful. Here, that could mean providing each machine's events with the average value of each metric across its group (remember that each machine belongs to one of 3 groups). Since the data is anonymized, we don't know what a group represents, so this may be of little use here - but in other contexts it can be incredibly useful, such as feeding a store's aggregated sales to each of its products, or a country's music preferences to each user in it!
To compute these hierarchically aggregated features, Temporian's indexes and the .propagate() operator are incredibly powerful.
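Conceptually, this is the same shape of computation as a pandas groupby-transform that broadcasts a group-level aggregate back to every member of the group. A toy sketch with made-up machine and group names (not the Temporian API):

```python
import pandas as pd

# Hypothetical toy data: one metric for four machines in two groups.
df = pd.DataFrame({
    "machine": ["m0", "m1", "m2", "m3"],
    "group":   ["g1", "g1", "g2", "g2"],
    "f0":      [0.2, 0.4, 0.6, 0.8],
})

# Aggregate per group, then broadcast the result back to each machine --
# analogous to drop_index("machine") + a window op + propagate(..., resample=True).
df["gr_avg_f0"] = df.groupby("group")["f0"].transform("mean")

print(df)
```

In Temporian, dropping the "machine" index makes the window operators run per group, and .propagate() then re-samples those group-level results back onto each machine's timeline.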
grouped_features = []
# Drop the "machine" index to obtain an EventSet indexed by group only
# Operators will now operate on each group, instead of on each machine!
grouped_raw_features = raw_features.drop_index("machine", keep=False)
grouped_features.append(grouped_raw_features.moving_sum(1).prefix("gr_sum_").propagate(raw_features, resample=True))
grouped_features.append(grouped_raw_features.simple_moving_average(10).prefix("gr_sma_10_").propagate(raw_features, resample=True))
grouped_features.append(grouped_raw_features.simple_moving_average(100).prefix("gr_sma_100_").propagate(raw_features, resample=True))
grouped_features.append(grouped_raw_features.simple_moving_average(1000).prefix("gr_sma_1000_").propagate(raw_features, resample=True))
features = tp.glue(raw_features, *lag_features, *moving_statistic_features, *grouped_features)
features
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.07527 | 0.06568 | 0.07023 | 0.07433 | 0 | 0.9333 | 0.274 | 0 | 0.03108 | 0 | 0.1341 | 0.08108 | 0.0274 | 0.06781 | 0.1258 | 0.1506 | … |
1 | 0.08602 | 0.08051 | 0.07581 | 0.07666 | 0 | 0.9308 | 0.275 | 0 | 0.03108 | 0.000122 | 0.1488 | 0.1622 | 0.0548 | 0.0714 | 0.1231 | 0.1645 | … |
2 | 0.07527 | 0.06462 | 0.07135 | 0.07433 | 0 | 0.9282 | 0.275 | 0 | 0.03094 | 0.000366 | 0.1348 | 0.0946 | 0.0274 | 0.06328 | 0.129 | 0.1515 | … |
3 | 0.08602 | 0.04873 | 0.06355 | 0.07085 | 0 | 0.9282 | 0.2731 | 0 | 0.02725 | 0.000244 | 0.1313 | 0.08108 | 0.0274 | 0.06784 | 0.1104 | 0.1456 | … |
4 | 0.08602 | 0.05191 | 0.06243 | 0.07085 | 0 | 0.9333 | 0.274 | 0 | 0.03094 | 0.000244 | 0.1027 | 0.1081 | 0.0411 | 0.07565 | 0.1191 | 0.1184 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | 0.1095 | 0.1266 | 0.1338 | 0.2264 | 0.621 | 0.4093 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.00722 | 0 | 0.03695 | 0.003551 | 0.02929 | … |
1 | 0.25 | 0.1089 | 0.1271 | 0.134 | 0.2264 | 0.5473 | 0.337 | 0 | 9.3e-05 | 1.5e-05 | 0 | 0.009025 | 0 | 0.03926 | 0.004129 | 0.03305 | … |
2 | 0.24 | 0.1228 | 0.1308 | 0.1353 | 0.2264 | 0.4785 | 0.2648 | 0 | 0.000653 | 0.000134 | 8e-06 | 0.01805 | 0 | 0.09858 | 0.009166 | 0.09242 | … |
3 | 0.23 | 0.1093 | 0.1286 | 0.1348 | 0.2264 | 0.4841 | 0.2648 | 0 | 0.001026 | 0.000134 | 8e-06 | 0.01444 | 0 | 0.102 | 0.01292 | 0.09136 | … |
4 | 0.24 | 0.09617 | 0.1254 | 0.1338 | 0.2264 | 0.485 | 0.2649 | 0 | 0.000187 | 1.5e-05 | 0 | 0.02347 | 0 | 0.02632 | 0.009579 | 0.02257 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.2424 | 0.01391 | 0.01491 | 0.01739 | 0.9286 | 0.4351 | 0.374 | 0 | 0.002196 | 0.000394 | 0 | 0.03985 | 0 | 0.01917 | 0.01931 | 0.01111 | … |
1 | 0.2424 | 0.01385 | 0.01479 | 0.01739 | 0.9286 | 0.4373 | 0.3764 | 0 | 0.000119 | 0 | 0 | 0.01414 | 0 | 0.008432 | 0.002625 | 0.00104 | … |
2 | 0.2222 | 0.01391 | 0.01498 | 0.01755 | 0.9286 | 0.436 | 0.3749 | 0 | 0.000178 | 0 | 0 | 0.01671 | 0 | 0.007965 | 0.003243 | 0.000964 | … |
3 | 0.2323 | 0.009957 | 0.01373 | 0.01708 | 0.9286 | 0.4347 | 0.3734 | 0 | 0.000178 | 0 | 0 | 0.01028 | 0 | 0.007965 | 0.002789 | 0.000832 | … |
4 | 0.2323 | 0.009734 | 0.01324 | 0.01692 | 0.9286 | 0.4353 | 0.374 | 0 | 0.000297 | 0 | 0 | 0.01671 | 0 | 0.008803 | 0.004606 | 0.001191 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
timestamp | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.1111 | 0.009689 | 0.0171 | 0.04285 | 0.1579 | 0.8356 | 0.4777 | 0 | 0 | 0 | 0 | 0.03304 | 0 | 0.01214 | 0.005469 | 0.01656 | … |
1 | 0.101 | 0.006856 | 0.01541 | 0.04118 | 0.1579 | 0.8359 | 0.4779 | 0 | 0 | 0 | 0 | 0.02261 | 0 | 0.0238 | 0.003045 | 0.02151 | … |
2 | 0.1212 | 0.01429 | 0.01815 | 0.04248 | 0.1579 | 0.839 | 0.4815 | 0 | 0.001073 | 0 | 0 | 0.01391 | 0 | 0.06879 | 0.01759 | 0.04456 | … |
3 | 0.1313 | 0.01273 | 0.01872 | 0.04267 | 0.1579 | 0.8518 | 0.4956 | 0 | 0.001877 | 0 | 0 | 0.01565 | 0 | 0.06783 | 0.03362 | 0.1281 | … |
4 | 0.1111 | 0.009114 | 0.01751 | 0.04155 | 0.1579 | 0.8665 | 0.5114 | 0 | 0 | 0 | 0 | 0.01913 | 0 | 0.02304 | 0.001367 | 0.01609 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
X_train, y_train, X_test, y_test = make_datasets(features, labels)
model = train(X_train, y_train)
evaluate(model, X_train, y_train, X_test, y_test, "grouped moving statistics")
Last train timestamp: 18962
Number of samples in train set: 170667
Number of positive (anomalous) samples in train set: 7186
Number of samples in test set: 52398
Number of positive (anomalous) samples in test set: 3188
Iteration 1, loss = 0.34921445
Iteration 2, loss = 0.14615578
Iteration 3, loss = 0.11712329
Iteration 4, loss = 0.10027315
Iteration 5, loss = 0.08660832
Iteration 6, loss = 0.07493218
Iteration 7, loss = 0.06539237
Iteration 8, loss = 0.05848477
Iteration 9, loss = 0.05307936
Iteration 10, loss = 0.04870466
Iteration 11, loss = 0.04500330
Iteration 12, loss = 0.04192038
Iteration 13, loss = 0.03925250
Iteration 14, loss = 0.03700998
Iteration 15, loss = 0.03483423
Iteration 16, loss = 0.03309578
Iteration 17, loss = 0.03143816
Iteration 18, loss = 0.02976545
Iteration 19, loss = 0.02842013
Iteration 20, loss = 0.02715795
Iteration 21, loss = 0.02588653
Iteration 22, loss = 0.02489748
Iteration 23, loss = 0.02367352
Iteration 24, loss = 0.02287543
Iteration 25, loss = 0.02200763
Iteration 26, loss = 0.02102837
Iteration 27, loss = 0.02033093
Training loss did not improve more than tol=0.001000 for 3 consecutive epochs. Stopping.
Results:
               raw features  lagged features  moving statistics  grouped moving statistics
train_acc          0.635011         0.717378           0.940251                   0.937290
test_acc           0.577362         0.599498           0.741069                   0.702891
train_roc_auc      0.920704         0.964004           0.996714                   0.996597
test_roc_auc       0.729571         0.809318           0.860711                   0.847673
Train and test labels and predictions
The grouped features don't seem to have helped: test accuracy and ROC AUC actually dropped slightly compared to using the moving statistics alone.
Wrapping up¶
In this notebook we learned how to perform feature engineering and visualization using Temporian, applying it to a real-world anomaly detection use case.
There's some further work that could be done on this problem! Here are some ideas:
- Train a larger model! Our two-layer MLP has done alright, but a more capable model could well find new patterns in the data - though it would probably require some extra regularization work too
- Keep adding new features! As we demonstrated, a very simple model can go a long way if the correct features are provided to it. This is where Temporian shines - check out the full list of operators in the API Reference for some inspiration!
- Use the dataset's unlabeled train data to craft an unsupervised solution, and then test it on the labeled test data we used for this notebook. Check out the unsupervised version of this notebook for inspiration!
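As a starting point for the first idea, here's a hedged sketch of a deeper, more regularized MLPClassifier. It runs on synthetic stand-in data rather than the SMD features, and all hyperparameters are illustrative, not tuned:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = MLPClassifier(
    hidden_layer_sizes=(128, 64, 32),  # a deeper stack than a two-layer MLP
    alpha=1e-3,                        # stronger L2 regularization
    early_stopping=True,               # hold out a validation split to stop on
    max_iter=300,
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))
```

Combining `alpha` with `early_stopping` is one simple way to keep a larger network from overfitting the train split the way our moving-statistics model started to.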