Loan Outcomes Prediction¶

In this example we retrieve and prepare the data from the Financial relational dataset using Temporian.

This dataset contains 682 loans (606 successful and 76 not successful) along with their information and transactions. The standard task is to predict the outcome for finished loans (A vs B in loan.status) at the time of the loan start (defined by loan.date).

Dependencies¶

Since the dataset is provided in a relational database, we'll use sqlalchemy to retrieve the data.

In [2]:

Copied!

# Skip this cell if you are running the notebook locally and have already installed temporian.
%pip install temporian -q
# Skip this cell if you are running the notebook locally and have already installed temporian.
%pip install temporian -q

[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.

In [3]:

Copied!

%pip install "sqlalchemy<2.0" pymysql -q
%pip install "sqlalchemy<2.0" pymysql -q

[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.

In [4]:

Copied!





import pandas as pd

import temporian as tp
from pathlib import Path
from time import time
from sqlalchemy import create_engine
import pandas as pd

import temporian as tp
from pathlib import Path
from time import time
from sqlalchemy import create_engine

Load data¶

We'll first read the data from the tables loan and trans (transactions or trx) using SQL.

The connection is performed using the public credentials provided in the link above.

We load the data in pandas dataframes df_loan and df_trx, and we save them in CSV files as cache.

In [5]:

Copied!





tmp_dir = Path("tmp") / "temporian_loan"
tmp_dir.mkdir(parents=True, exist_ok=True)

path_df_loan = tmp_dir / "loan.csv"
path_df_trx = tmp_dir / "trans.csv"

if not path_df_loan.exists() or not path_df_trx.exists():
    print("Connecting to DB...")
    conn_str = (
        "mysql+pymysql://guest:relational@relational.fit.cvut.cz:3306/financial"
    )
    engine = create_engine(conn_str)
    print("Retrieving table: loan -> df_loan")
    t0 = time()
    df_loan = pd.read_sql("SELECT * FROM loan", engine)
    t1 = time()
    print(f"Took {t1 - t0:.1f} seconds")

    print("Retrieving table: trans -> df_trx")
    df_trx = pd.read_sql("SELECT * FROM trans", engine)
    print(f"Took {time() - t1:.1f} seconds")

    print(f"Caching tables into: {tmp_dir}")
    df_trx.to_csv(path_df_trx)
    df_loan.to_csv(path_df_loan)
else:
    print(f"Loading cached tables from {tmp_dir}")
    df_trx = pd.read_csv(path_df_trx, index_col=[0])
    df_loan = pd.read_csv(path_df_loan, index_col=[0])
tmp_dir = Path("tmp") / "temporian_loan"
tmp_dir.mkdir(parents=True, exist_ok=True)

path_df_loan = tmp_dir / "loan.csv"
path_df_trx = tmp_dir / "trans.csv"

if not path_df_loan.exists() or not path_df_trx.exists():
    print("Connecting to DB...")
    conn_str = (
        "mysql+pymysql://guest:relational@relational.fit.cvut.cz:3306/financial"
    )
    engine = create_engine(conn_str)
    print("Retrieving table: loan -> df_loan")
    t0 = time()
    df_loan = pd.read_sql("SELECT * FROM loan", engine)
    t1 = time()
    print(f"Took {t1 - t0:.1f} seconds")

    print("Retrieving table: trans -> df_trx")
    df_trx = pd.read_sql("SELECT * FROM trans", engine)
    print(f"Took {time() - t1:.1f} seconds")

    print(f"Caching tables into: {tmp_dir}")
    df_trx.to_csv(path_df_trx)
    df_loan.to_csv(path_df_loan)
else:
    print(f"Loading cached tables from {tmp_dir}")
    df_trx = pd.read_csv(path_df_trx, index_col=[0])
    df_loan = pd.read_csv(path_df_loan, index_col=[0])

Loading cached tables from tmp/temporian_loan

/var/folders/k4/wnjmb6mx5hl3chl6wfn7plxc0000gn/T/ipykernel_59206/2969384387.py:28: DtypeWarning: Columns (9) have mixed types. Specify dtype option on import or set low_memory=False.
  df_trx = pd.read_csv(path_df_trx, index_col=[0])

In [6]:

Copied!

# Check we have 682 loans
df_loan
# Check we have 682 loans
df_loan

Out[6]:

	loan_id	account_id	date	amount	duration	payments	status
0	4959	2	1994-01-05	80952	24	3373.0	A
1	4961	19	1996-04-29	30276	12	2523.0	B
2	4962	25	1997-12-08	30276	12	2523.0	A
3	4967	37	1998-10-14	318480	60	5308.0	D
4	4968	38	1998-04-19	110736	48	2307.0	C
...	...	...	...	...	...	...	...
677	7294	11327	1998-09-27	39168	24	1632.0	C
678	7295	11328	1998-07-18	280440	60	4674.0	C
679	7304	11349	1995-10-29	419880	60	6998.0	C
680	7305	11359	1996-08-06	54024	12	4502.0	A
681	7308	11362	1996-12-27	129408	24	5392.0	A

682 rows × 7 columns

In [7]:

Copied!

# Check 1_056_320 transactions
df_trx
# Check 1_056_320 transactions
df_trx

Out[7]:

	trans_id	account_id	date	type	operation	amount	balance	k_symbol	bank	account
0	1	1	1995-03-24	PRIJEM	VKLAD	1000	1000	NaN	NaN	NaN
1	5	1	1995-04-13	PRIJEM	PREVOD Z UCTU	3679	4679	NaN	AB	41403269.0
2	6	1	1995-05-13	PRIJEM	PREVOD Z UCTU	3679	20977	NaN	AB	41403269.0
3	7	1	1995-06-13	PRIJEM	PREVOD Z UCTU	3679	26835	NaN	AB	41403269.0
4	8	1	1995-07-13	PRIJEM	PREVOD Z UCTU	3679	30415	NaN	AB	41403269.0
...	...	...	...	...	...	...	...	...	...	...
1056315	3682983	10451	1998-08-31	PRIJEM	NaN	62	17300	UROK	NaN	NaN
1056316	3682984	10451	1998-09-30	PRIJEM	NaN	49	13442	UROK	NaN	NaN
1056317	3682985	10451	1998-10-31	PRIJEM	NaN	34	10118	UROK	NaN	NaN
1056318	3682986	10451	1998-11-30	PRIJEM	NaN	26	8398	UROK	NaN	NaN
1056319	3682987	10451	1998-12-31	PRIJEM	NaN	42	13695	UROK	NaN	NaN

1056320 rows × 10 columns

Filter valid data¶

The task for this dataset is to predict the loan outcome based on the transactions (trx) that a given account performed before the loan.

Each account_id should only have one loan, so we'll flag the valid transactions in each account by finding those that are previous to the date_loan.

In [8]:

Copied!

# Check that all accounts have a single loan
assert not df_loan["account_id"].duplicated().any()
# Check that all accounts have a single loan
assert not df_loan["account_id"].duplicated().any()

In [9]:

Copied!





# Convert date columns
df_loan["date"] = pd.to_datetime(df_loan["date"])
df_trx["date"] = pd.to_datetime(df_trx["date"])

# Remove unused columns
df_loan.drop(columns=["loan_id"], inplace=True, errors="ignore")

# Remove trx without loan and get loan dates into transactions
df_trx = df_trx.join(
    df_loan[["account_id", "date"]],
    on=["account_id"],
    how="inner",
    rsuffix="_loan",
)

# Only transactions before the loan can be used for prediction
df_trx["valid"] = df_trx["date_loan"] >= df_trx["date"]

# Delete no longer used
df_trx.drop(columns=["date_loan", "account_id_loan"], inplace=True)

print(f"Valid transactions ({len(df_trx)=}):")
df_trx["valid"].value_counts()
# Convert date columns
df_loan["date"] = pd.to_datetime(df_loan["date"])
df_trx["date"] = pd.to_datetime(df_trx["date"])

# Remove unused columns
df_loan.drop(columns=["loan_id"], inplace=True, errors="ignore")

# Remove trx without loan and get loan dates into transactions
df_trx = df_trx.join(
    df_loan[["account_id", "date"]],
    on=["account_id"],
    how="inner",
    rsuffix="_loan",
)

# Only transactions before the loan can be used for prediction
df_trx["valid"] = df_trx["date_loan"] >= df_trx["date"]

# Delete no longer used
df_trx.drop(columns=["date_loan", "account_id_loan"], inplace=True)

print(f"Valid transactions ({len(df_trx)=}):")
df_trx["valid"].value_counts()

Valid transactions (len(df_trx)=149397):

Out[9]:

False    81315
True     68082
Name: valid, dtype: int64

Types of transaction¶

According to this analysis there should be only 2 types of transactions:

Credit (PRIJEM)
Debit (VYDAJ)

In [10]:

Copied!

df_trx["type"].unique()
df_trx["type"].unique()

Out[10]:

array(['PRIJEM', 'VYDAJ', 'VYBER'], dtype=object)

Not exactly as expected. But looking at the types of operation, in the same link above, VYBER should also correspond to withdrawals.

We'll verify if those assumptions hold, by calculating the balance from the transaction amounts, and comparing it to the registered balance.

Create a Temporian EventSet¶

In [11]:

Copied!





trx_evset = tp.from_pandas(
    df_trx, timestamps="date", indexes=["account_id"]
)
trx_evset
trx_evset = tp.from_pandas(
    df_trx, timestamps="date", indexes=["account_id"]
)
trx_evset

WARNING:root:Feature "type" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
WARNING:root:Feature "operation" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
WARNING:root:Feature "k_symbol" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
WARNING:root:Feature "bank" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).

Out[11]:

features [9]: trans_id (int64) , type (str_) , operation (str_) , amount (int64) , balance (int64) , k_symbol (str_) , bank (str_) , account (float64) , valid (bool_)

indexes [1]: account_id (int64)

events: 149397

index values: 647

memory usage: 12.0 MB

index ( account_id: 1 ) with 239 events

timestamp	trans_id	type	operation	amount	balance	k_symbol	bank	account	valid
1995-03-24 00:00:00+00:00	1	PRIJEM	VKLAD	1000	1000	nan	nan	nan	True
1995-04-13 00:00:00+00:00	5	PRIJEM	PREVOD Z UCTU	3679	4679	nan	AB	4.14e+07	True
1995-04-23 00:00:00+00:00	199	PRIJEM	VKLAD	12600	17279	nan	nan	nan	True
1995-04-30 00:00:00+00:00	3530438	PRIJEM	nan	19	17298	UROK	nan	nan	True
1995-05-13 00:00:00+00:00	6	PRIJEM	PREVOD Z UCTU	3679	20977	nan	AB	4.14e+07	True
…	…	…	…	…	…	…	…	…	…

index ( account_id: 2 ) with 478 events

timestamp	trans_id	type	operation	amount	balance	k_symbol	bank	account	valid
1993-02-26 00:00:00+00:00	276	PRIJEM	VKLAD	1100	1100	nan	nan	nan	True
1993-03-12 00:00:00+00:00	279	PRIJEM	PREVOD Z UCTU	20236	21336	nan	ST	6.649e+07	True
1993-03-28 00:00:00+00:00	697	PRIJEM	VKLAD	3700	25036	nan	nan	nan	True
1993-03-31 00:00:00+00:00	3530483	PRIJEM	nan	14	25050	UROK	nan	nan	True
1993-04-12 00:00:00+00:00	280	PRIJEM	PREVOD Z UCTU	20236	45286	nan	ST	6.649e+07	True
…	…	…	…	…	…	…	…	…	…

index ( account_id: 3 ) with 117 events

timestamp	trans_id	type	operation	amount	balance	k_symbol	bank	account	valid
1997-07-07 00:00:00+00:00	837	PRIJEM	VKLAD	1000	1000	nan	nan	nan	True
1997-07-11 00:00:00+00:00	844	PRIJEM	VKLAD	7502	8502	nan	nan	nan	True
1997-07-31 00:00:00+00:00	3530553	PRIJEM	nan	24	8526	UROK	nan	nan	True
1997-08-06 00:00:00+00:00	982	PRIJEM	VKLAD	7800	16326	nan	nan	nan	True
1997-08-11 00:00:00+00:00	845	PRIJEM	VKLAD	7502	23828	nan	nan	nan	True
…	…	…	…	…	…	…	…	…	…

index ( account_id: 4 ) with 186 events

timestamp	trans_id	type	operation	amount	balance	k_symbol	bank	account	valid
1996-02-21 00:00:00+00:00	1017	PRIJEM	VKLAD	800	800	nan	nan	nan	True
1996-03-11 00:00:00+00:00	1020	PRIJEM	PREVOD Z UCTU	5553	6353	DUCHOD	UV	8.99e+07	True
1996-04-11 00:00:00+00:00	1021	PRIJEM	PREVOD Z UCTU	5553	11906	DUCHOD	UV	8.99e+07	True
1996-05-11 00:00:00+00:00	1022	PRIJEM	PREVOD Z UCTU	5553	17459	DUCHOD	UV	8.99e+07	True
1996-06-11 00:00:00+00:00	1023	PRIJEM	PREVOD Z UCTU	5553	23012	DUCHOD	UV	8.99e+07	True
…	…	…	…	…	…	…	…	…	…

… (643 more indexes not shown)

Check transaction plots¶

In [12]:

Copied!

# Visualize some data
trx_evset.plot()
# Visualize some data
trx_evset.plot()

The number of plots (5823) is larger than "options.max_num_plots=20". Only the first plots will be printed.

No description has been provided for this image

Features with Temporian¶

Calculate balance¶

Let's create a feature to calculate the balance from all the transaction amounts. The calculation has some rules that we need to figure out, but if we do it right we can check it by comparing the registered balance field with our calculated balance feature.

According to the dataset documentation and the analysis we did above, transactions (trx) can be deposits or withdrawals, depending on the type. The code PRIJEM indicates deposit and VYDAJ or VYBER should be withdrawals.

The amount is always positive so we need to multiply it by -1 for withdrawals before adding it on the balance.

Then we accumulate the amount (with sign) from all transactions to get the calculated balance.

In [13]:

Copied!





# Filter only valid transactions
valid_trx = trx_evset.filter(trx_evset["valid"])

# Get amount with sign +/- depending on type of transaction
# - PRIJEM: deposit
# - VYDAJ or VYBER: withdrawal
is_deposit = valid_trx["type"].equal("PRIJEM")
is_vydaj = valid_trx["type"].equal("VYDAJ")
is_vyber = valid_trx["type"].equal("VYBER")
is_withdrawal = is_vydaj | is_vyber

trx_sign = -1 * is_withdrawal.cast(int) + is_deposit.cast(int)
trx_signed_amount = trx_sign * valid_trx["amount"]

# Accumulated amount (signed) should be equal to "balance"
balance = trx_signed_amount.cumsum()

# Output features
result = tp.glue(
    valid_trx[["amount", "balance"]],
    balance.rename("balance_calculated"),
)

# Plot registered vs. calculated balance, only 3 accounts (2 plots per index)
result.plot(style="line", features=["balance", "balance_calculated"], max_num_plots=6)
# Filter only valid transactions
valid_trx = trx_evset.filter(trx_evset["valid"])

# Get amount with sign +/- depending on type of transaction
# - PRIJEM: deposit
# - VYDAJ or VYBER: withdrawal
is_deposit = valid_trx["type"].equal("PRIJEM")
is_vydaj = valid_trx["type"].equal("VYDAJ")
is_vyber = valid_trx["type"].equal("VYBER")
is_withdrawal = is_vydaj | is_vyber

trx_sign = -1 * is_withdrawal.cast(int) + is_deposit.cast(int)
trx_signed_amount = trx_sign * valid_trx["amount"]

# Accumulated amount (signed) should be equal to "balance"
balance = trx_signed_amount.cumsum()

# Output features
result = tp.glue(
    valid_trx[["amount", "balance"]],
    balance.rename("balance_calculated"),
)

# Plot registered vs. calculated balance, only 3 accounts (2 plots per index)
result.plot(style="line", features=["balance", "balance_calculated"], max_num_plots=6)

The number of plots (1294) is larger than "options.max_num_plots=6". Only the first plots will be printed.

As seen from the plots, the calculated balance matches the registered balance value. This means that the codes we're using to identify deposits and withdrawals was correct.

Customize matplotlib plots¶

We can customize the matplotlib figure by setting the return_fig=True argument. This way, the plots can be customized basically in any way, using the backend methods.

For example, suppose we want a specific time interval in the x-axis and fixed scale in y-axis:

In [14]:

Copied!





from datetime import date

# Let's compare indexes 1, 2 and 192
fig = result.plot(style="line", features=["balance", "balance_calculated"], indexes=[1, 2], return_fig=True)

for ax in fig.axes:
    ax.set_ylim([0, 60000])
    ax.set_xlim([date(1996, 1, 1), date(1996, 12, 31)])
from datetime import date

# Let's compare indexes 1, 2 and 192
fig = result.plot(style="line", features=["balance", "balance_calculated"], indexes=[1, 2], return_fig=True)

for ax in fig.axes:
    ax.set_ylim([0, 60000])
    ax.set_xlim([date(1996, 1, 1), date(1996, 12, 31)])

Show some interactive plots¶

In order to show interactive plots, we need the additional bokeh library.

After that, we just need to specify interactive=True in the plot function. Keep in mind that the backend is no longer matplotlib in this case, so you might notice some differences in the plot styles.

In [15]:

Copied!

!pip install bokeh -q
!pip install bokeh -q

[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: pip install --upgrade pip

In [16]:

Copied!





# Count the number of transactions per day and month
count_trx_daily = valid_trx["amount"].moving_count(tp.duration.days(1.0))
count_trx_monthly = valid_trx["amount"].moving_count(tp.duration.weeks(4.0))

counts = tp.glue(valid_trx["balance"],
    count_trx_daily.rename("daily_count"),
    count_trx_monthly.rename("monthly_count")
)

counts.plot(
    indexes=[1, 2],
    features=["balance", "daily_count", "monthly_count"],
    width_px=600,  # Narrow so that controls are visible (right side bar)
    interactive=True,
)
# Count the number of transactions per day and month
count_trx_daily = valid_trx["amount"].moving_count(tp.duration.days(1.0))
count_trx_monthly = valid_trx["amount"].moving_count(tp.duration.weeks(4.0))

counts = tp.glue(valid_trx["balance"],
    count_trx_daily.rename("daily_count"),
    count_trx_monthly.rename("monthly_count")
)

counts.plot(
    indexes=[1, 2],
    features=["balance", "daily_count", "monthly_count"],
    width_px=600,  # Narrow so that controls are visible (right side bar)
    interactive=True,
)