Detecting Credit Card Fraud with Snowflake, Snowpark, and Amazon SageMaker Studio Lab

Part 1 of 5: Loading the Credit Card Transaction Data Into Snowflake

Author: Josh Patterson
Date: zzzzzzz xxth, 2022

Other entries in this series:

For more use cases like these, check out our Financial Services vertical in our Use Case Repository.

Introduction

In this blog post series we will demonstrate how to build a credit card fraud detection system on realistic data.

Series Key Take-Aways:

  1. Learn how to manage time-series data on Snowflake
  2. Perform scalable feature engineering with Snowpark on Snowflake
  3. Use Amazon SageMaker Studio Lab to build multiple models
  4. Deploy machine learning applications to the cloud with Streamlit

In this series we further explore feature engineering with Snowpark and then show you how to use Amazon SageMaker Studio Lab to model the feature data stored in Snowflake.

In this article (Part 1 of the series) we're going to cover:

  • Working with data in Parquet files
  • Setting up tables and other configuration in Snowflake
  • Ingesting time-series data into Snowflake

Now let's take a closer look at our credit card transaction dataset.

The Credit Card Transaction Dataset

The work in this series is based on the dataset and general workflow from the online book:

Reproducible Machine Learning for Credit Card Fraud Detection - Practical Handbook, Yann-Aël Le Borgne, Wissam Siblini, Bertrand Lebichot, and Gianluca Bontempi, Université Libre de Bruxelles, 2022

The project also includes a separate GitHub repository containing the raw transaction data in .pkl files, which we'll use in the example further below in this article.

In places we've transformed the data to other formats or used different models for modeling the data, but the overall workflow ideas are the same. This also gives us a good reference baseline on which to compare Snowpark's dataframe API against the pandas dataframe API.

We'll start out by taking a look at how to convert the raw dataset's file format (pkl) into something that we can ingest into a Snowflake table.

Creating Parquet Files from the Raw PKL Transaction Files

When bulk loading a lot of data into Snowflake, there are multiple things to think about, such as:

  • what are the source files?
  • what type of data is in the source files?
  • how many files are there and how big should they be?

We note that Snowflake supports the following file formats for data loading:

  • CSV
  • JSON
  • AVRO
  • ORC
  • PARQUET
  • XML

We're not going to delve too deeply into the file format trade-offs (outside of the sidebar below), but for this blog series we're going to convert the .pkl files to Parquet files.

Our reasoning here is that while CSV files can load faster into Snowflake, CSVs are row-oriented (which means they're slow to query on their own) and difficult to store efficiently. CSVs also lack the ability to carry a schema embedded in the file format.

We want to explore using Parquet files because they carry the schema embedded in the file format, compress well (Parquet is an efficient columnar storage format that supports complex nested data structures in a flat columnar layout), and are well suited to huge amounts of complex data (Parquet offers a variety of data compression and encoding options). Parquet files are also easy to work with because they are supported by so many different projects.
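
As a quick illustration of those compression options, here's a minimal sketch that writes the same toy table with a few different codecs and compares the on-disk sizes. It assumes pyarrow is installed with zstd support; the paths and column names are placeholders, and real savings only show up at the scale of the full dataset.

import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Toy DataFrame standing in for the transaction data (placeholder values)
df = pd.DataFrame({"TX_AMOUNT": [61.65, 125.74, 97.75], "TX_FRAUD": [0, 0, 0]})
table = pa.Table.from_pandas(df, preserve_index=False)

# Write the same table with different codecs and compare the resulting file sizes
for codec in ["snappy", "gzip", "zstd"]:
    path = f"/tmp/compression_example_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")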

Parquet and Arrow

The simple way to describe the purpose of each project:

  • Parquet is the standard for columnar data on disk
  • Apache Arrow is the standard for columnar data in memory

Parquet stores data on disk in a highly efficient layout, reducing the volume of data read in (e.g., by reading only the columns you need at that moment). Parquet also stores the file schema in the file metadata. CSV files don't store file metadata, so readers either need to be supplied with the schema or the schema needs to be inferred. Supplying a schema is tedious and inferring a schema is error-prone and expensive.
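
As a small sketch of what "schema in the metadata" buys us, the snippet below (assuming pyarrow and using a placeholder path for the Parquet file we create later in this article) reads the column names, types, and row counts from the file footer without loading any row data:

import pyarrow.parquet as pq

# Placeholder path for the Parquet file we create later in this article
parquet_path = "/tmp/cc_txn_data_pkl_all_files_noindex.parquet"

# Read only the schema and footer metadata; no row data is loaded
schema = pq.read_schema(parquet_path)
metadata = pq.ParquetFile(parquet_path).metadata

print(schema)
print("rows:", metadata.num_rows, "row groups:", metadata.num_row_groups)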

We still need to bring data into memory to work with it, and Arrow's standard format allows zero-copy reads, which removes virtually all serialization overhead. In a blog post, Wes McKinney summarizes how Arrow helps here as follows:

"...you can memory map huge, bigger-than-RAM datasets and evaluate pandas-style algorithms on them in-place without loading them into memory like you have to with pandas now. You could read 1 megabyte from the middle of a 1 terabyte table, and you only pay the cost of performing those random reads totalling 1 megabyte"
W. McKinney, Apache Arrow and the “10 Things I Hate About pandas” (2017), Blog

Interoperability between Parquet and Arrow has been a key design goal from the start. Given that both are columnar, the projects include efficient vectorized converters between the two formats, so reading from Parquet into Arrow is much faster than it would be with a row-oriented representation.

Pandas is a good example of a tool that uses both projects: a user can save a pandas DataFrame to Parquet, read the Parquet file back into in-memory Arrow, and then work directly with the Arrow columns from pandas, which is a great example of interoperability.
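
A minimal sketch of that round trip, with a hypothetical two-row table standing in for the transaction data, might look like this:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical toy rows; the same pattern applies to the full transaction table
df = pd.DataFrame({"TRANSACTION_ID": [412660, 412661], "TX_AMOUNT": [61.65, 125.74]})

# pandas -> Parquet on disk
pq.write_table(pa.Table.from_pandas(df, preserve_index=False), "/tmp/interop_example.parquet")

# Parquet on disk -> Arrow table in memory
arrow_table = pq.read_table("/tmp/interop_example.parquet")

# Arrow table -> pandas again; the schema travels with the file the whole way
print(arrow_table.schema)
print(arrow_table.to_pandas().head())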

A great example of how performant Parquet files can be is how HuggingFace is able to iterate through 17GB of data in less than a minute with a RAM footprint of 9MB by using Apache Arrow under the hood.

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.
Apache Arrow, Landing Page (2020), Apache Arrow Website

Using Apache Arrow for internal serialization allows us to map blobs of data on disk without doing any deserialization. This gives us the advantage of using our dataset directly on disk, where we can use memory-mapping and pay effectively zero cost. Additionally, random access is O(1), which is powerful from an algorithmic standpoint.
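
As a rough sketch of that memory-mapping idea (using the Arrow IPC file format and a hypothetical file name, not anything Snowflake requires), pyarrow lets us map a file and read it back without copying the whole thing into RAM:

import pyarrow as pa
import pyarrow.ipc as ipc

# Write a small table once in the Arrow IPC file format (hypothetical file name)
table = pa.table({"TX_AMOUNT": [61.65, 125.74, 97.75]})
with ipc.new_file("/tmp/transactions.arrow", table.schema) as writer:
    writer.write_table(table)

# Memory-map the file; record batches are read as views into the mapped region
with pa.memory_map("/tmp/transactions.arrow", "r") as source:
    mapped_table = ipc.open_file(source).read_all()
    print(mapped_table.num_rows, "rows read via memory mapping")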

So while we may only be using Parquet (and Arrow under the hood) to move data into Snowflake in this example, understanding how Parquet files work is a useful tool in our toolbelt, especially if we had to do any pre-processing on the data before loading into Snowflake.

Script to Convert PKL Files to Parquet Files

To get a copy of the raw credit card transaction data from the project mentioned above, clone the GitHub repository with the command below:

git clone https://github.com/Fraud-Detection-Handbook/simulated-data-raw

That will give you a local copy of the raw credit card transaction data from the online book project (in the .pkl file format).

In the code listing below we show how to use the pyarrow Python module to convert the .pkl files into a single Parquet file containing 1,754,155 rows × 9 columns.


import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Directory containing the .pkl files from the cloned repository
strBasePath = "./source/pkl/data/path/"

# Quick sanity check: list the pickle files we're about to read
for pickle_file_name in os.listdir(strBasePath):
  print(pickle_file_name)

# Read each pickle file into a DataFrame, concatenate them all, and drop the original index
all_df = pd.concat([pd.read_pickle(strBasePath + pickle_file_name) for pickle_file_name in os.listdir(strBasePath)]).reset_index(drop=True)

# Convert the combined DataFrame to an Arrow table and write it out as a single Parquet file
table = pa.Table.from_pandas(all_df)

parquet_path = "./your/path/here/cc_txn_data_pkl_all_files_noindex.parquet"
pq.write_table(table, parquet_path, version='1.0')

Once we have our Parquet file(s), we can load them into a Snowflake database in the Snowflake Data Cloud.

Managing Credit Card Transaction Data with Snowflake

When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized, compressed, columnar format. Snowflake's columnar database engine supports ANSI SQL.

Some things we want to consider when choosing where to manage our transaction data:

  • what happens when we have to load large amounts of data? what should we consider?
  • what happens when we need to process (feature engineering) large amounts of transaction data for modeling?
  • what type of files can we load into our data storage system?

When we have to load large amounts of data we may want to look at efficient file formats that enable compression. Snowflake recommends data files roughly 100-250 MB (or larger) in size compressed.
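
One way to aim for that size range, sketched below under the assumption that the data is already in a pandas DataFrame, is to write the data out in slices. The function name and the rows_per_file value are hypothetical knobs you would tune after checking the compressed size of the first file you write.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def write_in_chunks(df: pd.DataFrame, out_dir: str, rows_per_file: int = 2_000_000) -> None:
    # Write slices of the DataFrame to separate compressed Parquet files so each
    # file lands near the recommended compressed size (tune rows_per_file for your data)
    for i, start in enumerate(range(0, len(df), rows_per_file)):
        chunk = df.iloc[start:start + rows_per_file]
        pq.write_table(
            pa.Table.from_pandas(chunk, preserve_index=False),
            f"{out_dir}/cc_txn_part_{i:04d}.parquet",
            compression="snappy",
        )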

Transactional data (e.g., credit card transactions, here) creates large amounts of repetitive data. Scalable systems such as Snowflake help us in the feature creation phase of machine learning because many times the data we actually model is smaller than the incoming raw transaction data.

As noted earlier, Snowflake supports many file types, including Parquet, which is what we're focused on in this article.

Loading Raw Transaction Data Into Snowflake with Parquet Files

Snowflake Cloud Data Warehouse is a cloud-native, fully relational ANSI SQL data warehouse service available in both AWS and Azure. Its scalability and ability to quickly provision a data warehouse make it a good fit for managing the credit card transaction data in this series.

We have our data in Parquet, but we need to do a few things before we can load it into Snowflake:

  • Sign-up for a Snowflake account (free, if you don't already have one)
  • Create a database
  • Create a table in the new database to hold the transaction data
  • Create a temporary stage in Snowflake (with a file format) to hold the data before loading
  • Upload the parquet data into the stage
  • Copy the data from the stage over into the final table

Let's start off by creating a database for our project in Snowflake.

Looking for Snowflake Help?

Our team can help -- we help companies with Snowflake platform operations, analytics, and machine learning.

Creating a Credit Card Company Database in Snowflake

You'll need to use either the online web UI for Snowflake or the command-line interface (the SnowSQL CLI client). We're going to do everything from the command line via SnowSQL in this blog post:

https://docs.snowflake.com/en/user-guide/snowsql.html

Once you have the SnowSQL CLI Client installed, open a terminal window and log into your Snowflake account from the command line with the following command:

$ snowsql -a [account_name] -u [user_name]

This should show console output similar to the output below:

jpatanooga#COMPUTE_WH@(no database).(no schema)>

Now that we have connectivity, let's move on to creating a database and table for our transaction data.

Creating a Database for Our Credit Card Transaction Data on Snowflake

Now let's use the SnowSQL client to create a database and a table to manage our data.

To create a database (CREDIT_CARD_COMPANY_DB) use the following command from SnowSQL CLI:

create or replace database CREDIT_CARD_COMPANY_DB;

Creating a Table to Hold the Credit Card Transaction Data

If we change our current database to CREDIT_CARD_COMPANY_DB, the output should look like:

jpatanooga#COMPUTE_WH@(no database).(no schema)>use CREDIT_CARD_COMPANY_DB;
+----------------------------------+
| status                           |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
1 Row(s) produced. Time Elapsed: 0.386s
jpatanooga#COMPUTE_WH@CREDIT_CARD_COMPANY_DB.PUBLIC>

Next we want to create a table in our Snowflake database.


create or replace table CUSTOMER_CC_TRANSACTIONS (
  TRANSACTION_ID int,
  TX_DATETIME timestamp,
  CUSTOMER_ID int,
  TERMINAL_ID int,
  TX_AMOUNT float,
  TX_TIME_SECONDS int,
  TX_TIME_DAYS int,
  TX_FRAUD int,
  TX_FRAUD_SCENARIO int
);

We can execute the commands above directly from SnowSQL or from a file via SnowSQL:

snowsql -a [xxxxxxx.us-east-1] -u [user_name] -f ./create_pm_db_and_table.sql 

Now let's move on to uploading the Parquet-based credit card transaction data into our Snowflake table.

Load Credit Card Transaction Dataset into Snowflake as Parquet Files

We can get more detail on our table CUSTOMER_CC_TRANSACTIONS with the describe table command:

describe table CUSTOMER_CC_TRANSACTIONS;
+-------------------+------------------+--------+-------+---------+-------------+------------+-------+------------+---------+-------------+
| name              | type             | kind   | null? | default | primary key | unique key | check | expression | comment | policy name |
|-------------------+------------------+--------+-------+---------+-------------+------------+-------+------------+---------+-------------|
| TRANSACTION_ID    | NUMBER(38,0)     | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    | NULL        |
| TX_DATETIME       | TIMESTAMP_NTZ(9) | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    | NULL        |
| CUSTOMER_ID       | NUMBER(38,0)     | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    | NULL        |
| TERMINAL_ID       | NUMBER(38,0)     | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    | NULL        |
| TX_AMOUNT         | FLOAT            | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    | NULL        |
| TX_TIME_SECONDS   | NUMBER(38,0)     | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    | NULL        |
| TX_TIME_DAYS      | NUMBER(38,0)     | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    | NULL        |
| TX_FRAUD          | NUMBER(38,0)     | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    | NULL        |
| TX_FRAUD_SCENARIO | NUMBER(38,0)     | COLUMN | Y     | NULL    | N           | N          | NULL  | NULL       | NULL    | NULL        |
+-------------------+------------------+--------+-------+---------+-------------+------------+-------+------------+---------+-------------+
9 Row(s) produced. Time Elapsed: 0.556s

Create a Snowflake File Format

When we load the data into the Snowflake stage, we'll need to tell the system a bit about how the data is formatted. Since we're using Parquet files, we'll create a Parquet file format called parquet_format inside our database.

create or replace file format parquet_format
  type = parquet;

We'll take the default option values for our file format.

Create Stage and Load Data into the Stage

We next need to put the data in an internal stage on Snowflake. First, we create a temporary internal stage with the command below:

create or replace temporary stage parquet_cc_data_loading_stage
	file_format = parquet_format;

Each table has a Snowflake stage allocated to it by default for storing files (the "table stage"; check out the staging documentation for details).

We can see which stages have been created for a database with the command show stages;.

We will use the named stage we just created (@parquet_cc_data_loading_stage) to load the data into Snowflake. We can see this PUT command below:

PUT file:///tmp/cc_txn_data_pkl_all_files_noindex.parquet @parquet_cc_data_loading_stage;

Copy Transaction Data from Stage to Final Table

We note that by default Snowflake reads Parquet data into a single Variant column (Variant is a tagged universal type that can hold up to 16 MB of any data type supported by Snowflake). The data in this Variant column can be queried with standard SQL (including joining against its data).

There is also the option to move the Parquet columns directly into separate Snowflake columns (extracting the individual columns into a structured schema) during the data load phase from stage to table. The structured approach gives us a little more performance on the load phase, and is easier to manage with the defined schema. We can see this load statement in action below.

copy into CUSTOMER_CC_TRANSACTIONS
  from (select
  $1:TRANSACTION_ID,
  $1:TX_DATETIME::varchar,
  $1:CUSTOMER_ID,
  $1:TERMINAL_ID,
  $1:TX_AMOUNT,
  $1:TX_TIME_SECONDS,
  $1:TX_TIME_DAYS,
  $1:TX_FRAUD,
  $1:TX_FRAUD_SCENARIO  
  from @parquet_cc_data_loading_stage/cc_txn_data_pkl_all_files_noindex.parquet);

In the SQL listing above, the main body of the COPY statement extracts the labeled fields contained in the Parquet data, mapping each one directly to the corresponding column in CUSTOMER_CC_TRANSACTIONS.

Loading data into a fully structured (columnarized) schema is ~10-20% faster than landing it in a VARIANT column (Reference: "How to Load Terabytes into Snowflake").

A Note About Date Fields

It's worth noting that if we do not explicitly cast the datetime column from Parquet into its Snowflake column as a varchar, it will be treated as a number. Treating the column as a number (and not a varchar) causes Snowflake to interpret the datetime incorrectly, resulting in bad dates in the table. A maddening wrinkle, no doubt!

You can see us explicitly calling this out in the line $1:TX_DATETIME::varchar, above.

To do a quick visual check that the transaction data loaded correctly, run the following command:

select * from CUSTOMER_CC_TRANSACTIONS limit 4;

The output should look similar to the following:

+----------------+-------------------------------+-------------+-------------+-----------+-----------------+--------------+----------+-------------------+
| TRANSACTION_ID | TX_DATETIME                   | CUSTOMER_ID | TERMINAL_ID | TX_AMOUNT | TX_TIME_SECONDS | TX_TIME_DAYS | TX_FRAUD | TX_FRAUD_SCENARIO |
|----------------+-------------------------------+-------------+-------------+-----------+-----------------+--------------+----------+-------------------|
|         412660 | 2018-05-14 00:00:01.000000000 |         851 |        9335 |     61.65 |         3715201 |           43 |        0 |                 0 |
|         412661 | 2018-05-14 00:01:46.000000000 |        2462 |        8563 |    125.74 |         3715306 |           43 |        0 |                 0 |
|         412662 | 2018-05-14 00:02:22.000000000 |         923 |        1514 |     97.75 |         3715342 |           43 |        0 |                 0 |
|         412663 | 2018-05-14 00:03:02.000000000 |        3142 |        3268 |     11.09 |         3715382 |           43 |        0 |                 0 |
+----------------+-------------------------------+-------------+-------------+-----------+-----------------+--------------+----------+-------------------+

Next Steps: Ready for Feature Engineering

In this post in our series we learned some basics around Parquet files and how to load them into Snowflake.

Now that we have established our cloud-based credit card transaction data management platform and loaded our data, let's move on to building some features with Snowflake and Snowpark.

Normally we'd do a lot of EDA (as in our previous blog series on Applied Predictive Maintenance), but here we're going to skip that since it's already been done for us by the authors of the original project. Instead, we'll focus on building out the project with Snowpark and Amazon SageMaker Studio Lab over the next two entries in this series.

Next: Part 2 of 5: Scalable Feature Engineering with Snowpark

Looking for Snowflake Help?

Our team can help -- we help companies with Snowflake platform operations, analytics, and machine learning.