Detecting payment card fraud with Temporian and YDF¶
Detection of fraud in online banking is critical for banks, businesses, and their consumers. The book "Reproducible Machine Learning for Credit Card Fraud Detection" by Le Borgne et al. introduces the problem of payment card fraud and shows how fraud can be detected using machine learning. However, since banking transactions are sensitive and not widely available, the book uses a synthetic dataset for practical exercises.
This notebook uses the same dataset to show how to use Temporian and YDF to detect fraud. Temporian is used for data preprocessing and feature engineering, while YDF is used for model training. Feature engineering is often critical for temporal data, and this notebook demonstrates how complex feature engineering can be performed with ease using Temporian.
The notebook is divided into three parts:
- Download the dataset and import it to Temporian.
- Perform various types of augmentations and visualize the correlation between the engineered features and fraud target labels.
- Train and evaluate a machine learning model to detect fraud using the engineered features.
Note: This notebook assumes a basic understanding of Temporian. If you are not familiar with Temporian, we recommend that you read the Getting started guide first.
Install and import dependencies¶
# For data preprocessing and augmentation
%pip install temporian -q
# For model training
%pip install ydf -q
# To plot and analyze the dataset
%pip install seaborn -q
# To compute the ROC curve and AUC of the model
%pip install scikit-learn -q
import datetime
import concurrent.futures
from pathlib import Path
from typing import Dict
# Temporian is used for temporal data augmentation
import temporian as tp
# Since the dataset is small, we use Pandas to load the raw data and feed the augmented
# data to the model (YDF supports Pandas DataFrames natively).
import pandas as pd
# We use Temporian to plot time sequences and other temporal data.
# We use Matplotlib and Seaborn to plot non-temporal data.
import matplotlib.pyplot as plt
import seaborn as sns
# We use YDF to train the fraud detection model using
# the features computed by Temporian.
import ydf
Load the data¶
The dataset consists of banking transactions sampled between April 1, 2018 and September 30, 2018. The transactions are stored in CSV files, one for each day. The transactions from April 1, 2018 to August 31, 2018 (inclusive) are used for training, while the transactions from September 1, 2018 to September 30, 2018 are used for evaluation.
start_date = datetime.date(2018, 4, 1)
end_date = datetime.date(2018, 9, 30)
train_test_split = datetime.datetime(2018, 9, 1)
# Note: You can reduce the end and train/test split dates to speed-up the notebook execution.
# List the input csv files
filenames = []
while start_date <= end_date:
    filenames.append(f"{start_date}.pkl")
    start_date += datetime.timedelta(days=1)
print(f"{len(filenames)} dates")
183 dates
The dataset is downloaded and converted into a Pandas dataframe.
path_tmp = Path("tmp") / "temporian_bank_fraud_detection"
path_downloads = path_tmp / "downloads"
path_downloads.mkdir(parents=True, exist_ok=True)
def load_date(filename):
    """Downloads and saves pickle to cache, or loads cached."""
    cached_path = path_downloads / filename
    if cached_path.exists():
        print(">", end="", flush=True)
        df = pd.read_pickle(cached_path)
    else:
        print(".", end="", flush=True)
        df = pd.read_pickle(f"https://github.com/Fraud-Detection-Handbook/simulated-data-raw/raw/main/data/{filename}")
        df.to_pickle(cached_path)
    return df
print("Downloading dataset", end="")
# Download (or load cached) files in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    frames = executor.map(load_date, filenames)
dataset_pd = pd.concat(frames)
print("done")
print(f"Found {len(dataset_pd)} transactions")
dataset_pd
Downloading dataset.......................................................................................................................................................................................done Found 1754155 transactions
TRANSACTION_ID | TX_DATETIME | CUSTOMER_ID | TERMINAL_ID | TX_AMOUNT | TX_TIME_SECONDS | TX_TIME_DAYS | TX_FRAUD | TX_FRAUD_SCENARIO | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2018-04-01 00:00:31 | 596 | 3156 | 57.16 | 31 | 0 | 0 | 0 |
1 | 1 | 2018-04-01 00:02:10 | 4961 | 3412 | 81.51 | 130 | 0 | 0 | 0 |
2 | 2 | 2018-04-01 00:07:56 | 2 | 1365 | 146.00 | 476 | 0 | 0 | 0 |
3 | 3 | 2018-04-01 00:09:29 | 4128 | 8737 | 64.49 | 569 | 0 | 0 | 0 |
4 | 4 | 2018-04-01 00:10:34 | 927 | 9906 | 50.99 | 634 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1754150 | 1754150 | 2018-09-30 23:56:36 | 161 | 655 | 54.24 | 15810996 | 182 | 0 | 0 |
1754151 | 1754151 | 2018-09-30 23:57:38 | 4342 | 6181 | 1.23 | 15811058 | 182 | 0 | 0 |
1754152 | 1754152 | 2018-09-30 23:58:21 | 618 | 1502 | 6.62 | 15811101 | 182 | 0 | 0 |
1754153 | 1754153 | 2018-09-30 23:59:52 | 4056 | 3067 | 55.40 | 15811192 | 182 | 0 | 0 |
1754154 | 1754154 | 2018-09-30 23:59:57 | 3542 | 9849 | 23.59 | 15811197 | 182 | 0 | 0 |
1754155 rows × 9 columns
We only keep the following columns of interest:
- TX_DATETIME: The date and time of the transaction.
- CUSTOMER_ID: The unique identifier of the customer.
- TERMINAL_ID: The identifier of the terminal where the transaction was made.
- TX_AMOUNT: The amount of the transaction.
- TX_FRAUD: Whether the transaction is fraudulent (1) or not (0).
Our goal is to predict whether a transaction is fraudulent at the time of the transaction, using only the information from this transaction and previous transactions. The information about whether a transaction is fraudulent is not known at the time of the transaction. Instead, it is only known one week after the transaction. While this is too late to prevent the fraudulent transaction, it is available for future transactions.
dataset_pd = dataset_pd[["TX_DATETIME", "CUSTOMER_ID", "TERMINAL_ID", "TX_AMOUNT", "TX_FRAUD"]]
Create and plot an EventSet¶
Convert the Pandas DataFrame into a Temporian EventSet.
dataset_tp = tp.from_pandas(dataset_pd, timestamps="TX_DATETIME")
dataset_tp
WARNING:root:Feature "CUSTOMER_ID" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
WARNING:root:Feature "TERMINAL_ID" is an array of numpy.object_ and will be casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
timestamp | CUSTOMER_ID | TERMINAL_ID | TX_AMOUNT | TX_FRAUD |
---|---|---|---|---|
2018-04-01 00:00:31+00:00 | 596 | 3156 | 57.16 | 0 |
2018-04-01 00:02:10+00:00 | 4961 | 3412 | 81.51 | 0 |
2018-04-01 00:07:56+00:00 | 2 | 1365 | 146 | 0 |
… | … | … | … | … |
2018-09-30 23:58:21+00:00 | 618 | 1502 | 6.62 | 0 |
2018-09-30 23:59:52+00:00 | 4056 | 3067 | 55.4 | 0 |
2018-09-30 23:59:57+00:00 | 3542 | 9849 | 23.59 | 0 |
We can plot the whole dataset, but the resulting plot will be very busy because all the transactions are currently grouped together.
dataset_tp.plot()
We can also plot the transactions of a single client.
dataset_tp.add_index("CUSTOMER_ID").plot(indexes="3774")
# Same plot as:
# dataset_tp.filter(dataset_tp["CUSTOMER_ID"].equal("3774")).plot()
Compute features¶
After exploring the dataset, we want to compute some features that may correlate with fraudulent activities. We will compute the following three features:
Calendar features: Extract the hour of the day and the day of the week as individual features. This is useful because fraudulent transactions may be more likely to occur at specific times.
Moving sum of fraud per customer: For each client, we will extract the number of fraudulent transactions in the last 4 weeks. This is useful because clients who start to experience fraud (for example, because a card was stolen) are more likely to experience fraud again in the future. However, since we only know a week later whether a transaction is fraudulent, there will be a lag in this feature.
Moving sum of fraud per terminal: For each terminal, we will extract the number of fraudulent transactions in the last 4 weeks. This is useful because some fraudulent transactions may be caused by ATM skimmers. In this case, many transactions from the same terminal may be fraudulent. However, since we only know after a week if a transaction is fraudulent, there will be a lag in this feature as well.
Features often have parameters that need to be selected. For example, why look at the last 4 weeks instead of the last 8 weeks? In practice, you will want to compute the features with many different parameter values (e.g., 1 day, 2 days, 1 week, 2 weeks, 4 weeks, and 8 weeks). However, to keep this example simple, we will only use 4 weeks here.
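To illustrate, here is a minimal sketch (not used in the rest of the notebook) of how the same lag-and-moving-sum pattern could be applied to several window lengths at once. The window choices and feature names are illustrative.
# Sketch only (not used below): per-customer fraud counts over several window lengths.
per_customer_sketch = dataset_tp.add_index("CUSTOMER_ID")
lagged_fraud = per_customer_sketch["TX_FRAUD"].lag(tp.duration.weeks(1))
multi_window_frauds = tp.glue(*[
    lagged_fraud.moving_sum(window, sampling=per_customer_sketch)
    .rename(f"per_customer.moving_sum_frauds_{int(window)}s")
    for window in [tp.duration.days(1), tp.duration.weeks(1), tp.duration.weeks(4), tp.duration.weeks(8)]
])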
@tp.compile
def augment_transactions(transactions: tp.types.EventSetOrNode) -> Dict[str, tp.types.EventSetOrNode]:
    """Temporian function to augment the transactions with temporal features."""

    print("TRANSACTIONS:\n", transactions.schema, sep="")

    # Create a unique ID for each transaction.
    transaction_id = transactions.enumerate().rename("transaction_id")
    transactions = tp.glue(transactions, transaction_id)

    # 1.
    # Hour of day and day of week of each transaction.
    calendar = tp.glue(
        transactions.calendar_hour(),
        transactions.calendar_day_of_week(),
        transactions["transaction_id"],
    )
    print("CALENDAR:\n", calendar.schema, sep="")

    # 2.
    # Index the transactions per customer.
    per_customer = transactions.add_index("CUSTOMER_ID")
    # Lag the fraud label by 1 week.
    lagged_fraud_per_customer = per_customer["TX_FRAUD"].lag(tp.duration.weeks(1))
    # Moving sum of frauds over the last 4 weeks.
    feature_per_customer = lagged_fraud_per_customer.moving_sum(tp.duration.weeks(4), sampling=per_customer)
    # Rename the feature for book-keeping.
    feature_per_customer = feature_per_customer.rename("per_customer.moving_sum_frauds")
    # Aggregate the newly computed feature with the other customer features.
    feature_per_customer = tp.glue(feature_per_customer, per_customer)
    # Print the schema.
    print("PER CUSTOMER:\n", per_customer.schema, sep="")

    # 3.
    # The moving sum of frauds per terminal is similar to the moving sum per customer.
    # Instead of indexing by customer, the dataset is indexed by terminal.
    per_terminal = transactions.add_index("TERMINAL_ID")
    lagged_fraud_per_terminal = per_terminal["TX_FRAUD"].lag(tp.duration.weeks(1))
    feature_per_terminal = lagged_fraud_per_terminal.moving_sum(tp.duration.weeks(4), sampling=per_terminal)
    feature_per_terminal = feature_per_terminal.rename("per_terminal.moving_sum_frauds")
    feature_per_terminal = tp.glue(feature_per_terminal, per_terminal)
    print("PER TERMINAL:\n", per_terminal.schema, sep="")

    # Join the per-customer and per-terminal features.
    augmented_transactions = feature_per_terminal.drop_index().join(
        feature_per_customer.drop_index()[["per_customer.moving_sum_frauds", "transaction_id"]],
        on="transaction_id")

    # Join the calendar features.
    augmented_transactions = augmented_transactions.join(
        calendar[["calendar_hour", "calendar_day_of_week", "transaction_id"]],
        on="transaction_id")

    print("AUGMENTED TRANSACTIONS:\n", augmented_transactions.schema)

    return {"output": augmented_transactions}
# Compute the augmented features
augmented_dataset_tp = augment_transactions(dataset_tp)["output"]
TRANSACTIONS:
features: [('CUSTOMER_ID', str_), ('TERMINAL_ID', str_), ('TX_AMOUNT', float64), ('TX_FRAUD', int64)]
indexes: []
is_unix_timestamp: True
CALENDAR:
features: [('calendar_hour', int32), ('calendar_day_of_week', int32), ('transaction_id', int64)]
indexes: []
is_unix_timestamp: True
PER CUSTOMER:
features: [('TERMINAL_ID', str_), ('TX_AMOUNT', float64), ('TX_FRAUD', int64), ('transaction_id', int64)]
indexes: [('CUSTOMER_ID', str_)]
is_unix_timestamp: True
PER TERMINAL:
features: [('CUSTOMER_ID', str_), ('TX_AMOUNT', float64), ('TX_FRAUD', int64), ('transaction_id', int64)]
indexes: [('TERMINAL_ID', str_)]
is_unix_timestamp: True
AUGMENTED TRANSACTIONS:
features: [('per_terminal.moving_sum_frauds', int64), ('CUSTOMER_ID', str_), ('TX_AMOUNT', float64), ('TX_FRAUD', int64), ('transaction_id', int64), ('TERMINAL_ID', str_), ('per_customer.moving_sum_frauds', int64), ('calendar_hour', int32), ('calendar_day_of_week', int32)]
indexes: []
is_unix_timestamp: True
Plot the engineered features on the selected customer.
augmented_dataset_tp.add_index("CUSTOMER_ID").plot(indexes="3774")
Save the Temporian program¶
Save the Temporian program that computes the augmented transactions to disk. We will not use this program again in this notebook, but in practice, the feature engineering stage should be included with the model.
A saved Temporian program can be reloaded later or applied on a large dataset using the Apache Beam backend.
tp.save(augment_transactions, transactions=dataset_tp, path="/tmp/augment_transactions.tempo")
TRANSACTIONS:
features: [('CUSTOMER_ID', str_), ('TERMINAL_ID', str_), ('TX_AMOUNT', float64), ('TX_FRAUD', int64)]
indexes: []
is_unix_timestamp: True
CALENDAR:
features: [('calendar_hour', int32), ('calendar_day_of_week', int32), ('transaction_id', int64)]
indexes: []
is_unix_timestamp: True
PER CUSTOMER:
features: [('TERMINAL_ID', str_), ('TX_AMOUNT', float64), ('TX_FRAUD', int64), ('transaction_id', int64)]
indexes: [('CUSTOMER_ID', str_)]
is_unix_timestamp: True
PER TERMINAL:
features: [('CUSTOMER_ID', str_), ('TX_AMOUNT', float64), ('TX_FRAUD', int64), ('transaction_id', int64)]
indexes: [('TERMINAL_ID', str_)]
is_unix_timestamp: True
AUGMENTED TRANSACTIONS:
features: [('per_terminal.moving_sum_frauds', int64), ('CUSTOMER_ID', str_), ('TX_AMOUNT', float64), ('TX_FRAUD', int64), ('transaction_id', int64), ('TERMINAL_ID', str_), ('per_customer.moving_sum_frauds', int64), ('calendar_hour', int32), ('calendar_day_of_week', int32)]
indexes: []
is_unix_timestamp: True
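As a sketch of the reload path mentioned above, the saved program could later be restored and re-applied. This assumes tp.load is the counterpart of tp.save; check the Temporian serialization documentation for the exact API.
# Sketch only: reload the saved Temporian program and re-apply it.
# Assumes tp.load() exists as the counterpart of tp.save() (see the
# Temporian serialization docs for the exact signature).
restored_augment_transactions = tp.load("/tmp/augment_transactions.tempo")
restored_output = restored_augment_transactions(transactions=dataset_tp)["output"]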
Analyze the engineered features¶
Plot the relation between the engineered features and the label.
Observations: The features per_terminal.moving_sum_frauds and per_customer.moving_sum_frauds seem to discriminate between fraudulent and non-fraudulent transactions, while the calendar features are not discriminative.
augmented_dataset_pd = tp.to_pandas(augmented_dataset_tp)
fig, axs = plt.subplots(ncols=2, nrows=2, figsize=(10, 8))
sns.ecdfplot(data=augmented_dataset_pd, x="per_terminal.moving_sum_frauds", hue="TX_FRAUD", ax=axs[0,0])
sns.ecdfplot(data=augmented_dataset_pd, x="per_customer.moving_sum_frauds", hue="TX_FRAUD", ax=axs[1,0])
sns.ecdfplot(data=augmented_dataset_pd, x="calendar_hour", hue="TX_FRAUD", ax=axs[0,1])
sns.ecdfplot(data=augmented_dataset_pd, x="calendar_day_of_week", hue="TX_FRAUD", ax=axs[1,1])
<Axes: xlabel='calendar_day_of_week', ylabel='Proportion'>
Split the data¶
The next step is to split the dataset into a training and testing dataset.
One common approach is to use the EventSet.timestamps() operator. This operator converts the timestamp of a transaction into a feature that can be compared to train_test_split.
is_train = augmented_dataset_tp.timestamps() < train_test_split.timestamp()
is_test = ~is_train
# Plot
is_train = is_train.rename("is_train")
is_test = is_test.rename("is_test")
tp.plot([is_train, is_test])
An alternative and equivalent solution is to create a demarcating event that separates the training and testing examples. We can then use the EventSet.since_last() and EventSet.isnan() operators to compute, for each transaction, whether the demarcating event has already been seen.
# Create a demarcating event.
train_test_switch_tp = tp.event_set(timestamps=[train_test_split])
# Plot
train_test_switch_tp.plot()
# All the transactions before the demarcating event are part of the training dataset (i.e. `is_train=True`)
is_train = train_test_switch_tp.since_last(sampling=augmented_dataset_tp).isnan()
is_test = ~is_train
# Plot
is_train = is_train.rename("is_train")
is_test = is_test.rename("is_test")
tp.plot([is_train, is_test])
We can now split the dataset into training and testing.
augmented_dataset_train_tp = augmented_dataset_tp.filter(is_train)
augmented_dataset_test_tp = augmented_dataset_tp.filter(is_test)
# Print the schema of the training dataset
augmented_dataset_train_tp.schema.features
[('per_terminal.moving_sum_frauds', int64), ('CUSTOMER_ID', str_), ('TX_AMOUNT', float64), ('TX_FRAUD', int64), ('transaction_id', int64), ('TERMINAL_ID', str_), ('per_customer.moving_sum_frauds', int64), ('calendar_hour', int32), ('calendar_day_of_week', int32)]
Train a model¶
We first convert the Temporian EventSets into Pandas DataFrames. YDF consumes Pandas DataFrames natively.
# Temporian EventSet to Pandas DataFrame
dataset_train_pd = tp.to_pandas(augmented_dataset_train_tp, timestamp_to_datetime=False)
dataset_test_pd = tp.to_pandas(augmented_dataset_test_tp, timestamp_to_datetime=False)
print(f"Train example: {len(dataset_train_pd)}")
print(f"Test example: {len(dataset_test_pd)}")
Train example: 1466282 Test example: 287873
We can then train a YDF model.
model_path = path_tmp / "ydf_model"

if model_path.exists():
    model = ydf.load_model(str(model_path))
else:
    ydf.verbose(2)
    learner = ydf.GradientBoostedTreesLearner(
        label="TX_FRAUD",
        features=["per_customer.moving_sum_frauds",
                  "per_terminal.moving_sum_frauds",
                  "calendar_hour",
                  "calendar_day_of_week"])
    model = learner.train(dataset_train_pd)
    ydf.verbose(1)
    model.save(str(model_path))
Warning: ydf.verbose(2) but logs cannot be displayed in the cell. Check colab logs or install wurlitzer with 'pip install wurlitzer'
Train model on 1466282 examples
[INFO 24-03-26 19:30:36.0728 CET dataset.cc:400] max_vocab_count = -1 for column TX_FRAUD, the dictionary will not be pruned by size. [WARNING 24-03-26 19:30:36.1534 CET gradient_boosted_trees.cc:1840] "goss_alpha" set but "sampling_method" not equal to "GOSS". [WARNING 24-03-26 19:30:36.1534 CET gradient_boosted_trees.cc:1851] "goss_beta" set but "sampling_method" not equal to "GOSS". [WARNING 24-03-26 19:30:36.1535 CET gradient_boosted_trees.cc:1865] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB". [INFO 24-03-26 19:30:36.1535 CET abstract_learner.cc:128] No input feature explicitly specified. Using all the available input features. [INFO 24-03-26 19:30:36.1535 CET abstract_learner.cc:142] The label "TX_FRAUD" was removed from the input feature set. [INFO 24-03-26 19:30:36.1535 CET gradient_boosted_trees.cc:544] Default loss set to BINOMIAL_LOG_LIKELIHOOD [INFO 24-03-26 19:30:36.1535 CET gradient_boosted_trees.cc:1171] Training gradient boosted tree on 1466282 example(s) and 4 feature(s). [INFO 24-03-26 19:30:36.1924 CET gradient_boosted_trees.cc:1214] 1319613 examples used for training and 146669 examples used for validation [INFO 24-03-26 19:30:36.6044 CET gradient_boosted_trees.cc:1590] num-trees:1 train-loss:0.074777 train-accuracy:0.991936 valid-loss:0.072254 valid-accuracy:0.992152 [INFO 24-03-26 19:30:36.9006 CET gradient_boosted_trees.cc:1592] num-trees:2 train-loss:0.074167 train-accuracy:0.992011 valid-loss:0.071683 valid-accuracy:0.992186 [INFO 24-03-26 19:31:07.2584 CET gradient_boosted_trees.cc:1592] num-trees:93 train-loss:0.070782 train-accuracy:0.992315 valid-loss:0.068867 valid-accuracy:0.992418
Model trained in 0:00:34.875405
[INFO 24-03-26 19:31:11.0267 CET early_stopping.cc:53] Early stop of the training because the validation loss does not decrease anymore. Best valid-loss: 0.0688434 [INFO 24-03-26 19:31:11.0267 CET gradient_boosted_trees.cc:270] Truncates the model to 73 tree(s) i.e. 73 iteration(s). [INFO 24-03-26 19:31:11.0270 CET gradient_boosted_trees.cc:333] Final model num-trees:73 valid-loss:0.068843 valid-accuracy:0.992432
It is a good idea to look at the model.
model.describe()
Task : CLASSIFICATION
Label : TX_FRAUD
Features (4) : per_customer.moving_sum_frauds per_terminal.moving_sum_frauds calendar_hour calendar_day_of_week
Weights : None
Trained with tuner : No
Model size : 1016 kB
Number of records: 1466282
Number of columns: 5

Number of columns by type:
    NUMERICAL: 4 (80%)
    CATEGORICAL: 1 (20%)

Columns:

NUMERICAL: 4 (80%)
    0: "per_customer.moving_sum_frauds" NUMERICAL mean:0.504242 min:0 max:30 sd:1.61012
    1: "per_terminal.moving_sum_frauds" NUMERICAL mean:0.197961 min:0 max:53 sd:1.59567
    2: "calendar_hour" NUMERICAL mean:11.5006 min:0 max:23 sd:5.05715
    3: "calendar_day_of_week" NUMERICAL mean:2.98623 min:0 max:6 sd:2.00086

CATEGORICAL: 1 (20%)
    4: "TX_FRAUD" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"0" 1454148 (99.1725%)

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION
Label: TX_FRAUD
Loss (BINOMIAL_LOG_LIKELIHOOD): 0.0688434
Accuracy: 0.992432 CI95[W][0 1]
ErrorRate: 0.00756806

Confusion Table:
truth\prediction       0      1
0                 145310    191
1                    919    249
Total: 146669
Variable importances measure the importance of an input feature for a model.
1. "per_terminal.moving_sum_frauds" 0.859896 ################ 2. "per_customer.moving_sum_frauds" 0.241117 3. "calendar_hour" 0.240364 4. "calendar_day_of_week" 0.203979
1. "per_terminal.moving_sum_frauds" 63.000000 ################ 2. "per_customer.moving_sum_frauds" 8.000000 # 3. "calendar_hour" 2.000000
1. "per_terminal.moving_sum_frauds" 697.000000 ################ 2. "calendar_hour" 568.000000 ########## 3. "per_customer.moving_sum_frauds" 458.000000 ##### 4. "calendar_day_of_week" 342.000000
1. "per_terminal.moving_sum_frauds" 2452.551309 ################ 2. "per_customer.moving_sum_frauds" 212.165475 3. "calendar_hour" 195.976629 4. "calendar_day_of_week" 121.874625
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: "per_terminal.moving_sum_frauds">=2.5 [s:0.00122549 n:1319613 np:10761 miss:0] ; pred:5.39188e-09 ├─(pos)─ "per_terminal.moving_sum_frauds">=18.5 [s:0.0176001 n:10761 np:3536 miss:0] ; pred:4.68486 | ├─(pos)─ "per_terminal.moving_sum_frauds">=25.5 [s:0.00442454 n:3536 np:1480 miss:0] ; pred:2.38372 | | ├─(pos)─ "per_terminal.moving_sum_frauds">=37.5 [s:0.000963484 n:1480 np:183 miss:0] ; pred:1.43237 | | | ├─(pos)─ "per_terminal.moving_sum_frauds">=42.5 [s:0.000909302 n:183 np:59 miss:0] ; pred:0.429633 | | | | ├─(pos)─ pred:-0.100838 | | | | └─(neg)─ pred:0.682034 | | | └─(neg)─ "per_customer.moving_sum_frauds">=15 [s:0.000825983 n:1297 np:5 miss:0] ; pred:1.57386 | | | ├─(pos)─ pred:5 | | | └─(neg)─ pred:1.55216 | | └─(neg)─ "per_terminal.moving_sum_frauds">=22.5 [s:0.000892859 n:2056 np:721 miss:0] ; pred:3.06854 | | ├─(pos)─ "calendar_hour">=8.5 [s:0.00249672 n:721 np:526 miss:1] ; pred:2.57515 | | | ├─(pos)─ pred:2.94433 | | | └─(neg)─ pred:1.57933 | | └─(neg)─ "calendar_day_of_week">=5.5 [s:0.000665012 n:1335 np:187 miss:0] ; pred:3.335 | | ├─(pos)─ pred:2.55967 | | └─(neg)─ pred:3.4613 | └─(neg)─ "per_terminal.moving_sum_frauds">=3.5 [s:0.00906797 n:7225 np:6188 miss:0] ; pred:5 | ├─(pos)─ "per_terminal.moving_sum_frauds">=9.5 [s:0.0036483 n:6188 np:3617 miss:0] ; pred:5 | | ├─(pos)─ "per_terminal.moving_sum_frauds">=15.5 [s:0.00202378 n:3617 np:1087 miss:0] ; pred:5 | | | ├─(pos)─ pred:4.83334 | | | └─(neg)─ pred:5 | | └─(neg)─ "per_terminal.moving_sum_frauds">=5.5 [s:0.0014287 n:2571 np:1679 miss:0] ; pred:5 | | ├─(pos)─ pred:5 | | └─(neg)─ pred:5 | └─(neg)─ "per_customer.moving_sum_frauds">=4.5 [s:0.00150037 n:1037 np:55 miss:0] ; pred:2.98837 | ├─(pos)─ "calendar_hour">=15.5 [s:0.00309917 n:55 np:15 miss:0] ; pred:1.0023 | | ├─(pos)─ pred:-0.100838 | | └─(neg)─ pred:1.41598 | └─(neg)─ "per_customer.moving_sum_frauds">=0.5 [s:0.00156966 n:982 np:275 miss:1] ; pred:3.09961 | ├─(pos)─ pred:3.87046 | └─(neg)─ pred:2.79977 └─(neg)─ "per_customer.moving_sum_frauds">=3.5 [s:9.99603e-06 n:1308852 np:34423 miss:0] ; pred:-0.0385175 ├─(pos)─ "per_customer.moving_sum_frauds">=10.5 [s:9.78865e-05 n:34423 np:8963 miss:0] ; pred:0.194919 | ├─(pos)─ "per_customer.moving_sum_frauds">=12.5 [s:1.71467e-05 n:8963 np:6754 miss:0] ; pred:-0.00742264 | | ├─(pos)─ "calendar_hour">=18.5 [s:3.59056e-06 n:6754 np:620 miss:0] ; pred:-0.0361589 | | | ├─(pos)─ pred:0.0361646 | | | └─(neg)─ pred:-0.043469 | | └─(neg)─ "calendar_hour">=5.5 [s:3.33246e-05 n:2209 np:1922 miss:1] ; pred:0.0804382 | | ├─(pos)─ pred:0.107507 | | └─(neg)─ pred:-0.100838 | └─(neg)─ "per_customer.moving_sum_frauds">=4.5 [s:5.82315e-05 n:25460 np:14304 miss:0] ; pred:0.266153 | ├─(pos)─ "per_customer.moving_sum_frauds">=8.5 [s:2.2442e-05 n:14304 np:2840 miss:0] ; pred:0.347929 | | ├─(pos)─ pred:0.232434 | | └─(neg)─ pred:0.376541 | └─(neg)─ "per_terminal.moving_sum_frauds">=0.5 [s:1.75842e-05 n:11156 np:1074 miss:0] ; pred:0.161301 | ├─(pos)─ pred:0.317204 | └─(neg)─ pred:0.144693 └─(neg)─ "per_terminal.moving_sum_frauds">=1.5 [s:6.43231e-06 n:1274429 np:7365 miss:0] ; pred:-0.0448228 ├─(pos)─ "per_customer.moving_sum_frauds">=0.5 [s:1.24714e-05 n:7365 np:1866 miss:1] ; pred:0.35884 | ├─(pos)─ "calendar_hour">=13.5 [s:0.000145744 n:1866 np:679 miss:0] ; pred:0.432404 | | ├─(pos)─ pred:0.238714 | | └─(neg)─ pred:0.543201 | └─(neg)─ "calendar_hour">=12.5 [s:2.62393e-05 n:5499 np:2374 miss:0] ; pred:0.333877 | ├─(pos)─ pred:0.405193 | └─(neg)─ pred:0.2797 └─(neg)─ "per_customer.moving_sum_frauds">=0.5 [s:2.1902e-06 n:1267064 
np:256151 miss:1] ; pred:-0.0471692 ├─(pos)─ "per_customer.moving_sum_frauds">=2.5 [s:3.175e-06 n:256151 np:25561 miss:0] ; pred:-0.0114934 | ├─(pos)─ pred:0.0534486 | └─(neg)─ pred:-0.0186923 └─(neg)─ "per_terminal.moving_sum_frauds">=0.5 [s:4.92022e-07 n:1010913 np:60759 miss:0] ; pred:-0.0562089 ├─(pos)─ pred:-0.0225495 └─(neg)─ pred:-0.0583613
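The description above notes that more informative variable importances are available when analyzing the model on a test dataset. A minimal sketch, assuming YDF's model.analyze method and the dataset_test_pd DataFrame created earlier:
# Sketch: analyze the model on the test dataset to obtain additional variable
# importances and partial dependence plots (assumes ydf's model.analyze()).
analysis = model.analyze(dataset_test_pd)
analysis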
Evaluate the model¶
Finally, we evaluate the model on the test dataset, plot the ROC (Receiver Operating Characteristic) curve, and compute the AUC (Area Under the Curve).
model.evaluate(dataset_test_pd)
Label \ Pred | 0 | 1 |
---|---|---|
0 | 284752 | 2001 |
1 | 574 | 546 |
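The confusion table above uses the default decision threshold. The ROC curve and AUC discussed below can be computed with scikit-learn (installed earlier for this purpose); this is a minimal sketch that assumes model.predict returns the probability of the fraud class.
# Sketch: ROC curve and AUC with scikit-learn. Assumes model.predict() returns
# the probability of the positive (fraud) class.
from sklearn import metrics

predictions = model.predict(dataset_test_pd)
fpr, tpr, _ = metrics.roc_curve(dataset_test_pd["TX_FRAUD"], predictions)
auc = metrics.roc_auc_score(dataset_test_pd["TX_FRAUD"], predictions)
plt.plot(fpr, tpr, label=f"AUC={auc:.3f}")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (recall)")
plt.legend()
plt.show()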
Observations:
- An AUC of 0.79 shows that the model is able to detect fraud.
- The engineered features are effective at identifying some types of fraud, as evidenced by the recall at low FPR (see FPR=0.02, TPR=0.5).
- However, for FPRs greater than 0.02, the TPR increases slowly, indicating that the remaining types of fraud are more difficult to detect. We need to conduct further analysis and create new features to improve our ability to detect them.
Homework
Do you have any ideas for other features that could improve the model's performance? For example, we could compute additional features per customer and per terminal, or create features based on the transacted amount (see the sketch below). These changes could help us reach an AUC above 0.88.
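For instance, here is a minimal sketch of amount-based features per customer. The statistics and feature names are illustrative; unlike the fraud label, TX_AMOUNT is known at transaction time, so no lag is needed.
# Sketch of one homework idea: per-customer statistics of the transacted
# amount over the last 4 weeks. Feature names are illustrative.
per_customer_amounts = dataset_tp.add_index("CUSTOMER_ID")["TX_AMOUNT"]
amount_features = tp.glue(
    per_customer_amounts.simple_moving_average(tp.duration.weeks(4)).rename("per_customer.mean_amount"),
    per_customer_amounts.moving_max(tp.duration.weeks(4)).rename("per_customer.max_amount"),
)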