We will be working on a Kaggle competition dataset, the IEEE-CIS Fraud Detection challenge. Here is the link: https://www.kaggle.com/c/ieee-fraud-detection
The dataset contains two CSV files: one with transaction data and one with identity data. The two files share TransactionID as a common key.
A better explanation of the data can be found in the competition's Data Description.
For any problem, we should look at the data and extract the important things (data analysis), then see if we can produce new features from existing ones (feature engineering). For most problems, feature engineering helps a lot.
We need some libraries to work with the data: pandas, numpy, matplotlib,… we will find out what else is required along the way. Let's get our hands dirty then. #excited😃
The labels in the data are skewed towards non-fraud. We call this imbalanced data.
Import the libraries.
Read the files.
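As a sketch, the imports and file reads might look like this. The filenames in the comments are the ones listed on the competition's data page; the tiny in-memory CSV below just makes the snippet runnable without the actual data:

```python
import io

import pandas as pd

# In the real notebook, read the competition files:
# train_txn = pd.read_csv('train_transaction.csv')
# train_id  = pd.read_csv('train_identity.csv')
# test_txn  = pd.read_csv('test_transaction.csv')
# test_id   = pd.read_csv('test_identity.csv')

# Tiny in-memory stand-in so this snippet runs anywhere:
csv_data = io.StringIO(
    "TransactionID,TransactionDT,TransactionAmt\n"
    "1,86401,50.0\n"
    "2,172805,25.5\n"
)
train_txn = pd.read_csv(csv_data)
print(train_txn.shape)  # (2, 3)
```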
We have TransactionDT, which represents the time the transaction was made, in seconds from when they started collecting data. A day count will be more helpful than seconds: when fraudsters get hold of an unauthorized card, they tend to spend all the money as fast as possible, so they might make all their transactions in one day. With a day number we can calculate how many times a particular customer used the card that day, and a lot more. That might help us.
Don't say that you don't know what 86400 means. #urgenius😉 BTW, did you know that every day has 86400 seconds? Interesting, isn't it?
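A minimal sketch of the day feature and the per-card daily count described above, on a toy frame (the column name card1 follows the competition data; card_day_count is a name I made up):

```python
import pandas as pd

df = pd.DataFrame({
    'card1': [111, 111, 222, 111],
    'TransactionDT': [86401, 90000, 86402, 172900],
})

# 86400 seconds per day: integer division turns the timestamp into a day index
df['day'] = df['TransactionDT'] // 86400

# how many transactions each card made on each day
df['card_day_count'] = df.groupby(['card1', 'day'])['day'].transform('count')

print(df['day'].tolist())             # [1, 1, 1, 2]
print(df['card_day_count'].tolist())  # [2, 2, 1, 1]
```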
We have to produce the same features for the test set along the way. Training and evaluation data should have the same features, in the same order; always make sure of that.
There are features card1-card6, which contain payment card information such as card type, card category, issuing bank, country, etc. We are going to use them to produce some new features.
Standard deviation might help us build a better model. Remember, standard deviation measures how spread out the data is.
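One common way to turn the card columns into new features is group statistics, for example the mean and standard deviation of TransactionAmt per card. A sketch on toy data (the new feature names are mine):

```python
import pandas as pd

df = pd.DataFrame({
    'card1': [1001, 1001, 2002, 2002],
    'TransactionAmt': [10.0, 30.0, 5.0, 5.0],
})

# transform broadcasts each group's statistic back onto its rows
df['amt_mean_card1'] = df.groupby('card1')['TransactionAmt'].transform('mean')
df['amt_std_card1'] = df.groupby('card1')['TransactionAmt'].transform('std')

print(df['amt_mean_card1'].tolist())  # [20.0, 20.0, 5.0, 5.0]
```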
Then comes the final part with the files: we have to merge the transaction file and the identity file. We merge them on TransactionID.
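The merge itself is one line with pandas; a left join keeps every transaction even when it has no identity record. A sketch on toy frames:

```python
import pandas as pd

txn = pd.DataFrame({'TransactionID': [1, 2, 3],
                    'TransactionAmt': [50.0, 20.0, 30.0]})
ident = pd.DataFrame({'TransactionID': [1, 3],
                      'DeviceType': ['mobile', 'desktop']})

# left join: every transaction survives; missing identity rows become NaN
train = txn.merge(ident, on='TransactionID', how='left')
print(train.shape)  # (3, 3)
```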
The test column names and the train column names are inconsistent. Let's rename them so they match.
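In this competition's files, the identity columns are underscored in train (id_01, …) but hyphenated in test (id-01, …), so a simple rename fixes it. A sketch on a toy frame:

```python
import pandas as pd

test = pd.DataFrame({'TransactionID': [1], 'id-01': [0.0], 'id-02': [5.0]})

# normalise the hyphenated test names to match the train names
test.columns = [c.replace('id-', 'id_') for c in test.columns]
print(list(test.columns))  # ['TransactionID', 'id_01', 'id_02']
```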
Remove the columns with more than 90% null values; a column that is mostly nulls is of no use to us. Also remove columns with a single unique value; when every row is the same, the column contributes nothing to the decision.
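A sketch of both drops on a toy frame, using the 90% threshold from the text:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'keep': range(20),
    'mostly_null': [1.0] + [np.nan] * 19,  # 95% null
    'constant': [7] * 20,                  # single unique value
})

# columns with more than 90% nulls
null_frac = df.isnull().mean()
drop_cols = set(null_frac[null_frac > 0.9].index)

# columns with a single unique value
drop_cols |= {c for c in df.columns if df[c].nunique(dropna=False) == 1}

df = df.drop(columns=sorted(drop_cols))
print(list(df.columns))  # ['keep']
```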
We should convert categorical features to numerical features.
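One simple encoding is pd.factorize, which maps each category to an integer (NaN becomes -1, which the model just treats as another value). A sketch; in practice, fit the encoding on train and test together so the mappings agree:

```python
import pandas as pd

df = pd.DataFrame({'ProductCD': ['W', 'C', 'W', 'H'],
                   'TransactionAmt': [1.0, 2.0, 3.0, 4.0]})

# factorize assigns integers in order of first appearance
for col in df.select_dtypes(include='object').columns:
    df[col], _ = pd.factorize(df[col])

print(df['ProductCD'].tolist())  # [0, 1, 0, 2]
```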
We don't have to do anything about NaNs; LightGBM handles them. But the data shouldn't contain any inf values, so we should replace them with NaNs if any exist.
We don't need TransactionDT any more, since we have the day number. Note: also remove the index column, and the level_0 and Unnamed: 0 columns if they exist in your data.
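Replacing infs and dropping the redundant column is a couple of lines; a sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'TransactionDT': [86401, 90000],
                   'day': [1, 1],
                   'ratio': [1.5, np.inf]})

# LightGBM tolerates NaN but not inf
df = df.replace([np.inf, -np.inf], np.nan)

# TransactionDT is redundant once we have the day number
df = df.drop(columns=['TransactionDT'])
print(df.columns.tolist())  # ['day', 'ratio']
```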
We have completed pre-processing the data. Storing the pre-processed files is best practice; you don't want to redo all of this every time.
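Pickle is a convenient format here: it preserves dtypes and loads much faster than CSV. A sketch (writing to a temp directory just to keep the example self-contained):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'TransactionID': [1, 2], 'day': [1, 1]})

# to_pickle / read_pickle round-trip the frame exactly, dtypes included
path = os.path.join(tempfile.gettempdir(), 'train_preprocessed.pkl')
df.to_pickle(path)

restored = pd.read_pickle(path)
print(restored.equals(df))  # True
```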
woohoo!! we are going to build a model using LightGBM. #veryexcited🙌
The LightGBM classifier has many parameters; check them out in the LightGBM docs.
Try out different values.
The most important thing is selecting which metric to use. The competition states that evaluation is done with the ROC AUC metric (search it, learn it; you know how this kind of learning works). We don't want accuracy here, since the labels are imbalanced.
Note: we have used the last 50,000 data points for evaluation.
We should save the model so that we don't need to train again; training is such a time-consuming process.
clf.booster_.save_model('lgbm.txt')
Everything is done. Now predict on the test data, create a submission file, submit it to the competition, and see the score.
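The submission file needs the TransactionID and isFraud columns, matching the competition's sample submission. A sketch with placeholder IDs and scores; in the real run, preds would come from clf.predict_proba on the preprocessed test frame:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# placeholders standing in for the real test IDs and model scores
test_ids = [1001, 1002, 1003]
preds = np.array([0.02, 0.87, 0.10])  # e.g. clf.predict_proba(test_X)[:, 1]

submission = pd.DataFrame({'TransactionID': test_ids, 'isFraud': preds})

# temp directory keeps the example self-contained
path = os.path.join(tempfile.gettempdir(), 'submission.csv')
submission.to_csv(path, index=False)
print(submission.shape)  # (3, 2)
```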
I got a score of 91+% with this model.
Try different methods and do parameter tuning to improve the score. That's what we do!
Happy machine learning 😇
If you find this useful. clap please 👏
Thanks for reading