Loan Outcomes Prediction¶
In this example we retrieve and prepare the data from the Financial relational dataset using Temporian.
This dataset contains 682 loans (606 successful and 76 not successful) along with their information and transactions. The standard task is to predict the outcome for finished loans (A vs B in loan.status
) at the time of the loan start (defined by loan.date).
Dependencies¶
Since the dataset is provided in a relational database, we'll use sqlalchemy
to retrieve the data.
# Skip this cell if you are running the notebook locally and have already installed temporian.
%pip install temporian -q
[notice] A new release of pip is available: 23.1.2 -> 23.2.1 [notice] To update, run: pip install --upgrade pip Note: you may need to restart the kernel to use updated packages.
%pip install "sqlalchemy<2.0" pymysql -q
[notice] A new release of pip is available: 23.1.2 -> 23.2.1 [notice] To update, run: pip install --upgrade pip Note: you may need to restart the kernel to use updated packages.
import pandas as pd
import temporian as tp
from pathlib import Path
from time import time
from sqlalchemy import create_engine
Load data¶
We'll first read the data from the tables loan
and trans
(transactions or trx) using SQL.
The connection is performed using the public credentials provided in the link above.
We load the data in pandas dataframes df_loan
and df_trx
, and we save them in CSV files as cache.
tmp_dir = Path("tmp") / "temporian_loan"
tmp_dir.mkdir(parents=True, exist_ok=True)
path_df_loan = tmp_dir / "loan.csv"
path_df_trx = tmp_dir / "trans.csv"
if not path_df_loan.exists() or not path_df_trx.exists():
print("Connecting to DB...")
conn_str = (
"mysql+pymysql://guest:relational@relational.fit.cvut.cz:3306/financial"
)
engine = create_engine(conn_str)
print("Retrieving table: loan -> df_loan")
t0 = time()
df_loan = pd.read_sql("SELECT * FROM loan", engine)
t1 = time()
print(f"Took {t1 - t0:.1f} seconds")
print("Retrieving table: trans -> df_trx")
df_trx = pd.read_sql("SELECT * FROM trans", engine)
print(f"Took {time() - t1:.1f} seconds")
print(f"Caching tables into: {tmp_dir}")
df_trx.to_csv(path_df_trx)
df_loan.to_csv(path_df_loan)
else:
print(f"Loading cached tables from {tmp_dir}")
df_trx = pd.read_csv(path_df_trx, index_col=[0])
df_loan = pd.read_csv(path_df_loan, index_col=[0])
Loading cached tables from tmp/temporian_loan
/var/folders/k4/wnjmb6mx5hl3chl6wfn7plxc0000gn/T/ipykernel_59206/2969384387.py:28: DtypeWarning: Columns (9) have mixed types. Specify dtype option on import or set low_memory=False. df_trx = pd.read_csv(path_df_trx, index_col=[0])
# Check we have 682 loans
df_loan
loan_id | account_id | date | amount | duration | payments | status | |
---|---|---|---|---|---|---|---|
0 | 4959 | 2 | 1994-01-05 | 80952 | 24 | 3373.0 | A |
1 | 4961 | 19 | 1996-04-29 | 30276 | 12 | 2523.0 | B |
2 | 4962 | 25 | 1997-12-08 | 30276 | 12 | 2523.0 | A |
3 | 4967 | 37 | 1998-10-14 | 318480 | 60 | 5308.0 | D |
4 | 4968 | 38 | 1998-04-19 | 110736 | 48 | 2307.0 | C |
... | ... | ... | ... | ... | ... | ... | ... |
677 | 7294 | 11327 | 1998-09-27 | 39168 | 24 | 1632.0 | C |
678 | 7295 | 11328 | 1998-07-18 | 280440 | 60 | 4674.0 | C |
679 | 7304 | 11349 | 1995-10-29 | 419880 | 60 | 6998.0 | C |
680 | 7305 | 11359 | 1996-08-06 | 54024 | 12 | 4502.0 | A |
681 | 7308 | 11362 | 1996-12-27 | 129408 | 24 | 5392.0 | A |
682 rows × 7 columns
# Check 1_056_320 transactions
df_trx
trans_id | account_id | date | type | operation | amount | balance | k_symbol | bank | account | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 1995-03-24 | PRIJEM | VKLAD | 1000 | 1000 | NaN | NaN | NaN |
1 | 5 | 1 | 1995-04-13 | PRIJEM | PREVOD Z UCTU | 3679 | 4679 | NaN | AB | 41403269.0 |
2 | 6 | 1 | 1995-05-13 | PRIJEM | PREVOD Z UCTU | 3679 | 20977 | NaN | AB | 41403269.0 |
3 | 7 | 1 | 1995-06-13 | PRIJEM | PREVOD Z UCTU | 3679 | 26835 | NaN | AB | 41403269.0 |
4 | 8 | 1 | 1995-07-13 | PRIJEM | PREVOD Z UCTU | 3679 | 30415 | NaN | AB | 41403269.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1056315 | 3682983 | 10451 | 1998-08-31 | PRIJEM | NaN | 62 | 17300 | UROK | NaN | NaN |
1056316 | 3682984 | 10451 | 1998-09-30 | PRIJEM | NaN | 49 | 13442 | UROK | NaN | NaN |
1056317 | 3682985 | 10451 | 1998-10-31 | PRIJEM | NaN | 34 | 10118 | UROK | NaN | NaN |
1056318 | 3682986 | 10451 | 1998-11-30 | PRIJEM | NaN | 26 | 8398 | UROK | NaN | NaN |
1056319 | 3682987 | 10451 | 1998-12-31 | PRIJEM | NaN | 42 | 13695 | UROK | NaN | NaN |
1056320 rows × 10 columns
Filter valid data¶
The task for this dataset is to predict the loan outcome based on the transactions (trx) that a given account performed before the loan.
Each account_id
should only have one loan, so we'll flag the valid
transactions in each account by finding those that are previous to the date_loan
.
# Check that all accounts have a single loan
assert not df_loan["account_id"].duplicated().any()
# Convert date columns
df_loan["date"] = pd.to_datetime(df_loan["date"])
df_trx["date"] = pd.to_datetime(df_trx["date"])
# Remove unused columns
df_loan.drop(columns=["loan_id"], inplace=True, errors="ignore")
# Remove trx without loan and get loan dates into transactions
df_trx = df_trx.join(
df_loan[["account_id", "date"]],
on=["account_id"],
how="inner",
rsuffix="_loan",
)
# Only transactions before the loan can be used for prediction
df_trx["valid"] = df_trx["date_loan"] >= df_trx["date"]
# Delete no longer used
df_trx.drop(columns=["date_loan", "account_id_loan"], inplace=True)
print(f"Valid transactions ({len(df_trx)=}):")
df_trx["valid"].value_counts()
Valid transactions (len(df_trx)=149397):
False 81315 True 68082 Name: valid, dtype: int64
Types of transaction¶
According to this analysis there should be only 2 types of transactions:
- Credit (
PRIJEM
) - Debit (
VYDAJ
)
df_trx["type"].unique()
array(['PRIJEM', 'VYDAJ', 'VYBER'], dtype=object)
Not exactly as expected. But looking at the types of operation, in the same link above, VYBER
should also correspond to withdrawals.
We'll verify if those assumptions hold, by calculating the balance from the transaction amounts, and comparing it to the registered balance.
Create a Temporian EventSet¶
trx_evset = tp.from_pandas(
df_trx, timestamps="date", indexes=["account_id"]
)
trx_evset
WARNING:root:Feature "type" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_). WARNING:root:Feature "operation" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_). WARNING:root:Feature "k_symbol" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_). WARNING:root:Feature "bank" is an array of numpy.object_ and was casted to numpy.string_ (Note: numpy.string_ is equivalent to numpy.bytes_).
timestamp | trans_id | type | operation | amount | balance | k_symbol | bank | account | valid |
---|---|---|---|---|---|---|---|---|---|
1995-03-24 00:00:00+00:00 | 1 | PRIJEM | VKLAD | 1000 | 1000 | nan | nan | nan | True |
1995-04-13 00:00:00+00:00 | 5 | PRIJEM | PREVOD Z UCTU | 3679 | 4679 | nan | AB | 4.14e+07 | True |
1995-04-23 00:00:00+00:00 | 199 | PRIJEM | VKLAD | 12600 | 17279 | nan | nan | nan | True |
1995-04-30 00:00:00+00:00 | 3530438 | PRIJEM | nan | 19 | 17298 | UROK | nan | nan | True |
1995-05-13 00:00:00+00:00 | 6 | PRIJEM | PREVOD Z UCTU | 3679 | 20977 | nan | AB | 4.14e+07 | True |
… | … | … | … | … | … | … | … | … | … |
timestamp | trans_id | type | operation | amount | balance | k_symbol | bank | account | valid |
---|---|---|---|---|---|---|---|---|---|
1993-02-26 00:00:00+00:00 | 276 | PRIJEM | VKLAD | 1100 | 1100 | nan | nan | nan | True |
1993-03-12 00:00:00+00:00 | 279 | PRIJEM | PREVOD Z UCTU | 20236 | 21336 | nan | ST | 6.649e+07 | True |
1993-03-28 00:00:00+00:00 | 697 | PRIJEM | VKLAD | 3700 | 25036 | nan | nan | nan | True |
1993-03-31 00:00:00+00:00 | 3530483 | PRIJEM | nan | 14 | 25050 | UROK | nan | nan | True |
1993-04-12 00:00:00+00:00 | 280 | PRIJEM | PREVOD Z UCTU | 20236 | 45286 | nan | ST | 6.649e+07 | True |
… | … | … | … | … | … | … | … | … | … |
timestamp | trans_id | type | operation | amount | balance | k_symbol | bank | account | valid |
---|---|---|---|---|---|---|---|---|---|
1997-07-07 00:00:00+00:00 | 837 | PRIJEM | VKLAD | 1000 | 1000 | nan | nan | nan | True |
1997-07-11 00:00:00+00:00 | 844 | PRIJEM | VKLAD | 7502 | 8502 | nan | nan | nan | True |
1997-07-31 00:00:00+00:00 | 3530553 | PRIJEM | nan | 24 | 8526 | UROK | nan | nan | True |
1997-08-06 00:00:00+00:00 | 982 | PRIJEM | VKLAD | 7800 | 16326 | nan | nan | nan | True |
1997-08-11 00:00:00+00:00 | 845 | PRIJEM | VKLAD | 7502 | 23828 | nan | nan | nan | True |
… | … | … | … | … | … | … | … | … | … |
timestamp | trans_id | type | operation | amount | balance | k_symbol | bank | account | valid |
---|---|---|---|---|---|---|---|---|---|
1996-02-21 00:00:00+00:00 | 1017 | PRIJEM | VKLAD | 800 | 800 | nan | nan | nan | True |
1996-03-11 00:00:00+00:00 | 1020 | PRIJEM | PREVOD Z UCTU | 5553 | 6353 | DUCHOD | UV | 8.99e+07 | True |
1996-04-11 00:00:00+00:00 | 1021 | PRIJEM | PREVOD Z UCTU | 5553 | 11906 | DUCHOD | UV | 8.99e+07 | True |
1996-05-11 00:00:00+00:00 | 1022 | PRIJEM | PREVOD Z UCTU | 5553 | 17459 | DUCHOD | UV | 8.99e+07 | True |
1996-06-11 00:00:00+00:00 | 1023 | PRIJEM | PREVOD Z UCTU | 5553 | 23012 | DUCHOD | UV | 8.99e+07 | True |
… | … | … | … | … | … | … | … | … | … |
Check transaction plots¶
# Visualize some data
trx_evset.plot()
The number of plots (5823) is larger than "options.max_num_plots=20". Only the first plots will be printed.
Features with Temporian¶
Calculate balance¶
Let's create a feature to calculate the balance from all the transaction amounts. The calculation has some rules that we need to figure out, but if we do it right we can check it by comparing the registered balance
field with our calculated balance
feature.
According to the dataset documentation and the analysis we did above, transactions (trx) can be deposits or withdrawals, depending on the type
. The code PRIJEM
indicates deposit and VYDAJ
or VYBER
should be withdrawals.
The amount is always positive so we need to multiply it by -1
for withdrawals before adding it on the balance.
Then we accumulate the amount (with sign) from all transactions to get the calculated balance.
# Filter only valid transactions
valid_trx = trx_evset.filter(trx_evset["valid"])
# Get amount with sign +/- depending on type of transaction
# - PRIJEM: deposit
# - VYDAJ or VYBER: withdrawal
is_deposit = valid_trx["type"].equal("PRIJEM")
is_vydaj = valid_trx["type"].equal("VYDAJ")
is_vyber = valid_trx["type"].equal("VYBER")
is_withdrawal = is_vydaj | is_vyber
trx_sign = -1 * is_withdrawal.cast(int) + is_deposit.cast(int)
trx_signed_amount = trx_sign * valid_trx["amount"]
# Accumulated amount (signed) should be equal to "balance"
balance = trx_signed_amount.cumsum()
# Output features
result = tp.glue(
valid_trx[["amount", "balance"]],
balance.rename("balance_calculated"),
)
# Plot registered vs. calculated balance, only 3 accounts (2 plots per index)
result.plot(style="line", features=["balance", "balance_calculated"], max_num_plots=6)
The number of plots (1294) is larger than "options.max_num_plots=6". Only the first plots will be printed.
As seen from the plots, the calculated balance matches the registered balance value. This means that the codes we're using to identify deposits and withdrawals was correct.
Customize matplotlib plots¶
We can customize the matplotlib figure by setting the return_fig=True
argument. This way, the plots can be customized basically in any way, using the backend methods.
For example, suppose we want a specific time interval in the x-axis and fixed scale in y-axis:
from datetime import date
# Let's compare indexes 1, 2 and 192
fig = result.plot(style="line", features=["balance", "balance_calculated"], indexes=[1, 2], return_fig=True)
for ax in fig.axes:
ax.set_ylim([0, 60000])
ax.set_xlim([date(1996, 1, 1), date(1996, 12, 31)])
Show some interactive plots¶
In order to show interactive plots, we need the additional bokeh
library.
After that, we just need to specify interactive=True
in the plot function. Keep in mind that the backend is no longer matplotlib
in this case, so you might notice some differences in the plot styles.
!pip install bokeh -q
[notice] A new release of pip is available: 23.1.2 -> 23.2.1 [notice] To update, run: pip install --upgrade pip
# Count the number of transactions per day and month
count_trx_daily = valid_trx["amount"].moving_count(tp.duration.days(1.0))
count_trx_monthly = valid_trx["amount"].moving_count(tp.duration.weeks(4.0))
counts = tp.glue(valid_trx["balance"],
count_trx_daily.rename("daily_count"),
count_trx_monthly.rename("monthly_count")
)
counts.plot(
indexes=[1, 2],
features=["balance", "daily_count", "monthly_count"],
width_px=600, # Narrow so that controls are visible (right side bar)
interactive=True,
)