Companies are always searching for ways to reduce their operational costs. Some of the costliest activities are promotions and marketing. With the advent of the internet, targeted advertising has become common in the 21st century. In earlier days, companies would put up posters and banners in the places where they were most hopeful of getting more business. In recent times, however, we have become equipped with more powerful techniques and a better understanding of the problem. In this case study I have attempted one such problem with the help of Machine Learning.
Table of Contents
- Business Problem
- Mapping to ML/DL Problem
- Understanding the data
- Exploratory Data Analysis
- Designing Features
- Doing Classification
- Future Work
1. Business Problem
Elo is one of the largest payment brands in Brazil. It has tied up with various merchants and offers discounts to its cardholders through them. The company now wants to find out which customers use these services the most, so that it can focus more on those customers and serve them better.
It gives each of its customers a score called the Loyalty Score (also called target), indicating how loyal the customer is towards the services. Knowing this score can save additional promotion costs borne by the company.
This is essentially a Kaggle problem.
2. Mapping to ML/DL Problem
We have to predict a score called the Loyalty Score for each customer. The loyalty score is a real number, so this problem can be posed as a regression problem with the loyalty score as output.
The metric provided in the competition is RMSE, which stands for Root Mean Squared Error.
RMSE is known to be very sensitive to outliers. We could have used Mean Absolute Error (MAE) instead to reduce their effect. However, around 1% of our data points are outliers and we cannot ignore them, so RMSE is a good metric here because it penalizes large errors on those edge cases heavily.
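To make the difference between the two metrics concrete, here is a minimal sketch on made-up numbers (not the competition data). It compares RMSE and MAE for a hypothetical model that always predicts 0 when one score out of a hundred is extreme.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean(|y_true - y_pred|)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy targets (made up): 99 scores at zero plus one extreme score.
y_true = np.concatenate([np.zeros(99), [-33.0]])
y_pred = np.zeros(100)  # a model that always predicts 0

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
print(rmse_value, mae_value)
```

With these toy numbers the single extreme point accounts for the entire error, and RMSE comes out ten times larger than MAE, which is why a model scored on RMSE cannot afford to ignore the outliers.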
3. Understanding the data
The data for this problem can be downloaded from Kaggle. All the data provided by Elo are simulated and not actual data. The dataset is contained in 5 files:
- train.csv and test.csv: These files consist of the card id and various categorical features whose meanings are not explicit. train.csv also contains the loyalty score, which we have to predict.
- historical_transactions.csv: This file consists of 3 months' worth of transactions for each card id.
- new_merchant_transactions.csv: This file consists of 2 months of transactions. These data are not present in the historical transactions.
- merchants.csv: This file consists of the merchants' details.
Note: Transactional files are almost 3GB in size and for most of the operations you need those data in memory.
The explicit meaning of each column can be found on the Kaggle competition page.
4. Exploratory Data Analysis
Here we will try to see correlation between various columns.
4.1 Checking if there’s any correlation between first_active_month and target. Both these columns are present in train.csv
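import matplotlib.pyplot as plt
import seaborn as sns
# `height` replaces the `size` argument, which newer seaborn releases no longer accept
sns.FacetGrid(df_train, height=10).map(plt.scatter, 'first_active_month', 'target')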
- One thing that can be clearly observed is that more recent dates have higher loyalty scores.
- It also shows that a larger number of users have started using the Elo service lately.
- This feature clearly indicates a trend in the loyalty score.
- There also seems to be a constant value below -30, which can't be separated using first_active_month.
4.2 Checking how target is distributed
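# The rest of this call was cut off in the original; a histogram of target is assumed here
sns.FacetGrid(df_train, height=7).map(plt.hist, 'target', bins=50)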
- Most of the data is centered around zero.
- Some scores lie in the region of -30.
- We need to deal with card ids having a score less than -30. We will see what can be used to differentiate these cards from the others.
- There are more than 2000 points between -10 and -34 (exclusive).
- This is roughly 1% of the data, so we cannot really ignore these data points.
4.3 Checking how the features on train dataset are correlated with target.
fig, ax = plt.subplots(1, 3, figsize = (18,10))
# One boxplot per categorical feature, each drawn on its own axis
sns.boxplot(x = 'feature_1', y = 'target', ax = ax[0], data = df_train)
sns.boxplot(x = 'feature_2', y = 'target', ax = ax[1], data = df_train)
sns.boxplot(x = 'feature_3', y = 'target', ax = ax[2], data = df_train)
- These categorical features aren’t really helpful at predicting the target variable.
- The boxplots are overlapping for all the categories.
4.4 Checking how many transactions were unauthorized. The relevant column can be found in both transaction files.
- Some transactions in historical_transactions are not authorized. We can join these transactions to the train data on card_id and look at their loyalty scores.
- There are no unauthorized transactions in new_merchant_transactions, so we can drop this column from new_merchant_transactions as all its values are the same.
So, here we have merged the transactions data with the train data on card_id to see how unauthorized transactions affect the loyalty score:
plt.hist(loyalty_Scores_UnAuth, color='g', label='Unauthorized_Trans')
plt.hist(loyalty_Scores_Auth, color='r', label='Authorized_Trans')
plt.gca().set(title='Authorized and Unauthorized transactions', ylabel='Count')
plt.legend()  # without this call the labels above are never displayed
- The scores seem to be evenly distributed, so not much can be deduced from them.
- The distributions are both similar so the latter overshadows the former.
- Unauthorized transactions aren’t really affecting the loyalty score.
- They can be due to some error in network or other reasons.
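The two series plotted in this section can be obtained with a merge on card_id followed by a filter on the authorization column. The sketch below uses toy frames with made-up values; the column name authorized_flag and its 'Y'/'N' coding are assumptions about the transaction files.

```python
import pandas as pd

# Toy stand-ins for train.csv and historical_transactions.csv (values are made up).
df_train_toy = pd.DataFrame({
    'card_id': ['C1', 'C2', 'C3'],
    'target': [0.4, -1.3, -33.2],
})
df_hist_toy = pd.DataFrame({
    'card_id': ['C1', 'C1', 'C2', 'C3', 'C3'],
    'authorized_flag': ['Y', 'N', 'Y', 'Y', 'N'],  # assumed column name and coding
})

# Attach the card-level loyalty score to every transaction of that card.
merged = df_hist_toy.merge(df_train_toy, on='card_id', how='inner')

# Split the scores by authorization status, ready to be passed to plt.hist.
loyalty_scores_unauth = merged.loc[merged['authorized_flag'] == 'N', 'target']
loyalty_scores_auth = merged.loc[merged['authorized_flag'] == 'Y', 'target']
print(len(loyalty_scores_unauth), len(loyalty_scores_auth))
```

Note that a card's score is repeated once per transaction after the merge, so cards with many transactions weigh more in the histogram.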
4.5 Checking how the number of installments, which is present in the transactional data, affects our target. The installments column can be found in the transaction files.
fig, ax = plt.subplots(1, 1, figsize = (7,7))
df_histTrans['installments'].value_counts().plot(kind = 'barh')
- Most of the installments are 0 and 1.
- There are also installments having values -1 and 999.
# `height` replaces the `size` argument, which newer seaborn releases no longer accept
sns.FacetGrid(df1, height=10).map(plt.scatter, 'installments', 'target')
- We can clearly see that loyalty scores are high for card ids having lower installments.
- This feature can be very useful while designing features as there is a clear trend.
4.6 Checking whether more frequent users have higher loyalty scores
We will first group the transaction data based on card ids and select the purchase_date column. After that, we will calculate the mean difference of days between consecutive transactions for a particular card id and see its relation against target.
- Here we calculated the difference in days between consecutive transactions for a particular card id and took its mean.
- We can see that card ids which transact more frequently tend to have higher loyalty scores, although many of them also have lower loyalty scores.
- The outliers, i.e. the card ids having the lowest loyalty scores, mostly have differences on the lower side.
- Overall, no clear trend can be seen.
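The mean gap between consecutive transactions can be computed with a grouped diff. This is a minimal sketch on a toy frame, assuming purchase_date has already been parsed as a datetime column; it is not necessarily the exact code used for the plot.

```python
import pandas as pd

# Toy transactions (made up): C1 buys on day 1, 3 and 9; C2 on day 1 and 11.
df_trans_toy = pd.DataFrame({
    'card_id': ['C1', 'C1', 'C1', 'C2', 'C2'],
    'purchase_date': pd.to_datetime([
        '2018-01-01', '2018-01-03', '2018-01-09',
        '2018-02-01', '2018-02-11',
    ]),
})

# Order each card's transactions in time before differencing.
df_sorted = df_trans_toy.sort_values(['card_id', 'purchase_date'])

# Gap in whole days to the previous transaction of the same card
# (NaN for the first transaction of each card).
gap_days = df_sorted.groupby('card_id')['purchase_date'].diff().dt.days

# Mean gap per card; NaN gaps are skipped by mean().
mean_gap = gap_days.groupby(df_sorted['card_id']).mean()
print(mean_gap)
```

The resulting series is indexed by card_id, so it can be joined to the train data and plotted against target. Since `.dt.days` drops partial days, `.dt.total_seconds() / 86400` could be used instead if finer resolution matters.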
- first_active_month is a useful feature as it is somewhat able to correlate with the target variable.
- Around 1% of data have values less than -30. We need to see how these affect our overall prediction.
- Most of the transactions are authorized.
- Installments is another very important feature.
- Most features will come from the transactional files.
- Feature Engineering is one of the most important aspects of this case study.
Note: There were some columns which contained missing values. Columns which had very low percentage of missing values were replaced by either column’s mean or mode based on the type of data. In some categorical columns where we had slightly higher percentage of missing values, we introduced a new category for those missing values.
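The imputation rules from the note can be sketched as follows. The frame and the choice of which column gets which rule are purely illustrative, and -1 as the "missing" category is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame (made up); column roles are illustrative only.
df_toy = pd.DataFrame({
    'purchase_amount': [1.0, np.nan, 3.0, 2.0],  # numeric, few missing values
    'category_3': ['A', 'A', None, 'B'],         # categorical, few missing values
    'category_2': [1.0, np.nan, np.nan, 3.0],    # categorical, more missing values
})

# Numeric column with few gaps: fill with the column mean.
df_toy['purchase_amount'] = df_toy['purchase_amount'].fillna(df_toy['purchase_amount'].mean())

# Categorical column with few gaps: fill with the most frequent value (mode).
df_toy['category_3'] = df_toy['category_3'].fillna(df_toy['category_3'].mode()[0])

# Categorical column with more gaps: treat "missing" as its own category (-1 is assumed).
df_toy['category_2'] = df_toy['category_2'].fillna(-1)
print(df_toy)
```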
5. Designing Features
5.1 train.csv and test.csv: These files contain first_active_month, which is a very useful feature. We can use the time elapsed since the first purchase as a feature. We can also use the starting month, day and year as features. We will do mean encoding on the categorical features based on whether the point is an outlier or not.
# Flag the outlier cards (target below -30)
df_train_FE['is_rare'] = 0
df_train_FE.loc[df_train_FE['target'] < -30, 'is_rare'] = 1
# Mean encoding based on whether the points are outliers or not
for f in ['feature_1', 'feature_2', 'feature_3']:
    mean_encoding = df_train_FE.groupby([f])['is_rare'].mean()
    df_train_FE[f] = df_train_FE[f].map(mean_encoding)
    df_test_FE[f] = df_test_FE[f].map(mean_encoding)
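The date-based features derived from first_active_month can be sketched like this. The toy values, the 'YYYY-MM' string format and the fixed reference_date are assumptions made so that the example is reproducible.

```python
import pandas as pd

# Toy first_active_month values (made up), assumed to be 'YYYY-MM' strings.
df_toy = pd.DataFrame({'first_active_month': ['2017-06', '2016-11', '2018-01']})
df_toy['first_active_month'] = pd.to_datetime(df_toy['first_active_month'], format='%Y-%m')

# Reference point for "time elapsed"; a fixed date is assumed here for reproducibility.
reference_date = pd.Timestamp('2018-03-01')

# Days elapsed since the card first became active.
df_toy['elapsed_days'] = (reference_date - df_toy['first_active_month']).dt.days

# Calendar parts of the starting date.
df_toy['start_year'] = df_toy['first_active_month'].dt.year
df_toy['start_month'] = df_toy['first_active_month'].dt.month
print(df_toy)
```

The same transformation has to be applied to the test frame with the same reference date, otherwise the elapsed-time feature is not comparable between train and test.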
5.2 historical_transactions.csv and new_merchant_transactions.csv: These files are the most important ones for extracting features. We can aggregate different columns after grouping the transactions by card_id.
We marked purchases which are done on a weekend. We extracted features based on difference between today and the purchase_date.
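A minimal sketch of these two row-level features on toy data follows. The fixed reference_date stands in for "today" so that the output is reproducible; it is an assumption, as are the example timestamps.

```python
import pandas as pd

# Toy transactions (made up): 2018-02-24 is a Saturday, 02-25 a Sunday, 02-26 a Monday.
df_trans_toy = pd.DataFrame({
    'card_id': ['C1', 'C1', 'C2'],
    'purchase_date': pd.to_datetime([
        '2018-02-24 10:15:00', '2018-02-26 18:30:00', '2018-02-25 09:00:00',
    ]),
})

# dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 are the weekend.
df_trans_toy['is_weekend'] = (df_trans_toy['purchase_date'].dt.dayofweek >= 5).astype(int)

# Whole days between the purchase and a reference date (stand-in for "today").
reference_date = pd.Timestamp('2018-03-01')
df_trans_toy['days_since_purchase'] = (reference_date - df_trans_toy['purchase_date']).dt.days
print(df_trans_toy)
```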
Here we are aggregating the transactional data:
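As an illustration of this kind of aggregation, here is a sketch on a toy frame. The particular statistics and output names are examples chosen for the sketch, not the exact feature list used in this case study.

```python
import pandas as pd

# Toy transactions (made up).
df_trans_toy = pd.DataFrame({
    'card_id': ['C1', 'C1', 'C1', 'C2', 'C2'],
    'purchase_amount': [10.0, 20.0, 30.0, 5.0, 15.0],
    'installments': [1, 0, 1, 3, 1],
    'merchant_id': ['M1', 'M2', 'M1', 'M3', 'M3'],
})

# One row of features per card, using pandas named aggregation.
card_features = df_trans_toy.groupby('card_id').agg(
    trans_count=('purchase_amount', 'size'),
    purchase_amount_sum=('purchase_amount', 'sum'),
    purchase_amount_mean=('purchase_amount', 'mean'),
    installments_max=('installments', 'max'),
    merchant_nunique=('merchant_id', 'nunique'),
).reset_index()
print(card_features)
```

The result has one row per card_id, so it can be merged straight onto the train and test frames.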
We can also try aggregating on both card_id and month_lag to get various features from installments and purchase_amount.
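One way to do this is a two-step aggregation: first per card and month, then per card. The sketch below runs on toy data, and the chosen statistics and output names are illustrative assumptions.

```python
import pandas as pd

# Toy transactions (made up); month_lag is the month offset of each transaction.
df_trans_toy = pd.DataFrame({
    'card_id': ['C1', 'C1', 'C1', 'C2', 'C2'],
    'month_lag': [-2, -2, -1, -1, -1],
    'purchase_amount': [10.0, 20.0, 30.0, 5.0, 15.0],
    'installments': [1, 0, 1, 3, 1],
})

# Step 1: one row per (card_id, month_lag).
monthly = df_trans_toy.groupby(['card_id', 'month_lag']).agg(
    amount_sum=('purchase_amount', 'sum'),
    installments_mean=('installments', 'mean'),
).reset_index()

# Step 2: summarise the monthly rows into one row per card.
card_monthly = monthly.groupby('card_id').agg(
    monthly_amount_mean=('amount_sum', 'mean'),
    monthly_amount_std=('amount_sum', 'std'),
    active_months=('month_lag', 'nunique'),
).reset_index()
print(card_monthly)
```

A card with a single active month gets NaN for the standard deviation, so features like this need a fill value before training.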
5.3 merchants.csv: To extract features from the merchants dataset, we first vertically concatenate both transaction files into a single file. After that, we merge this new file with the merchants data on merchant_id, which is the key the two share. We can then extract various features by aggregating different columns of this new data by card_id.
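The concatenate, merge and aggregate steps can be sketched as below, assuming the merchant table is keyed by merchant_id. The frames are toy data, and merchant_rating is a hypothetical column used only for illustration.

```python
import pandas as pd

# Toy stand-ins for the two transaction files and the merchant file (values are made up).
df_hist_toy = pd.DataFrame({
    'card_id': ['C1', 'C2'], 'merchant_id': ['M1', 'M2'], 'purchase_amount': [10.0, 5.0],
})
df_new_toy = pd.DataFrame({
    'card_id': ['C1'], 'merchant_id': ['M3'], 'purchase_amount': [7.0],
})
df_merch_toy = pd.DataFrame({
    'merchant_id': ['M1', 'M2', 'M3'],
    'merchant_rating': [4.0, 2.0, 3.0],  # hypothetical merchant attribute
})

# Stack both transaction files vertically.
all_trans = pd.concat([df_hist_toy, df_new_toy], ignore_index=True)

# Bring merchant attributes onto every transaction; a left join keeps all transactions.
trans_merch = all_trans.merge(df_merch_toy, on='merchant_id', how='left')

# Aggregate the merchant attributes back to one row per card.
merchant_features = trans_merch.groupby('card_id').agg(
    merchant_rating_mean=('merchant_rating', 'mean'),
).reset_index()
print(merchant_features)
```

If the merchant table contains repeated merchant_id rows, it is worth dropping duplicates before the merge, since each duplicate would otherwise multiply the matching transactions.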
After all this we can merge all the data into single file for training purpose. We also need to drop columns which we don’t further need from the final dataset. It can also happen that some rows contain inf/-inf values after calculating the features. We need to handle those.
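Ratio features are a typical source of inf/-inf values. A minimal sketch of one way to handle them is shown below; replacing them with NaN and then filling with 0 is an assumption, and another fill value may suit a given feature better.

```python
import numpy as np
import pandas as pd

# Toy feature frame (made up) where a ratio feature divides by zero.
df_feat_toy = pd.DataFrame({'amount_sum': [10.0, 0.0, 5.0], 'trans_count': [2, 0, 0]})
df_feat_toy['amount_per_trans'] = df_feat_toy['amount_sum'] / df_feat_toy['trans_count']
# The ratio is now 5.0, NaN (0/0) and inf (5/0).

# Turn +/-inf into NaN first, then fill every NaN with a chosen value (0 here).
df_feat_clean = df_feat_toy.replace([np.inf, -np.inf], np.nan).fillna(0)
print(df_feat_clean)
```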
For training the model we did a basic train test split and tried various models
6.1 Linear Regression:
6.2 AdaBoost: This model didn’t do well for this problem.
6.3 Custom Model: We tried a custom model where we first split the whole data into train and test sets in the ratio 80:20. We then split the train data into two sets, train and val, in the ratio 50:50. From this train data we created k (say, 100) samples, each drawn by sampling with replacement. We then trained k models, one on each sample, using Decision Trees. After that, we passed the val data through these k models, getting k predictions for each data point. Using these predictions, we created a new dataset and trained a meta-model on it to predict the target. For the meta-model, we used LGBM.
This model performed decently.
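The custom model can be sketched as follows on synthetic data. To keep the example dependent on scikit-learn only, GradientBoostingRegressor stands in for the LGBM meta-model; that swap, the synthetic data, k = 20 and the tree depth are all assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up) in place of the engineered features.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=600)

# 80:20 split into train and test, then a 50:50 split of train into base-train and val.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_base, X_val, y_base, y_val = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

k = 20  # number of bootstrap samples and base models (kept small for speed)
base_models = []
for i in range(k):
    # Sampling with replacement: draw row indices uniformly, repeats allowed.
    idx = rng.integers(0, len(X_base), size=len(X_base))
    tree = DecisionTreeRegressor(max_depth=5, random_state=i)
    base_models.append(tree.fit(X_base[idx], y_base[idx]))

def base_predictions(models, features):
    """Stack each base model's prediction as one column: shape (n_rows, k)."""
    return np.column_stack([m.predict(features) for m in models])

# The meta-model learns to map the k base predictions of the val rows to the target.
meta_model = GradientBoostingRegressor(random_state=0)  # stand-in for LGBM
meta_model.fit(base_predictions(base_models, X_val), y_val)

# Final prediction: test rows go through the base models, then the meta-model.
test_pred = meta_model.predict(base_predictions(base_models, X_test))
test_rmse = float(np.sqrt(np.mean((y_test - test_pred) ** 2)))
print(test_rmse)
```

Keeping the val rows away from the base models matters here: the meta-model should see base predictions on data the trees were not fitted on, otherwise it learns from over-optimistic inputs.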
6.4 LGBM: LightGBM is one of the variants of Gradient Boosting.
This model performed quite well.
6.5 LGBM with Stratified Fold based on outlier points: Here we first stratified the data based on whether the data points are outliers or not. We used stratified k-fold for training the model, with LGBM as the base model. This model performed the best.
Out of all these models, LGBM with stratified k-fold gave me the best Kaggle score.
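Stratifying a regression problem on an outlier flag can be sketched as below. The data is synthetic, and GradientBoostingRegressor is used as a stand-in for LGBM so that the example needs scikit-learn only; both are assumptions of this sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import StratifiedKFold

# Synthetic data (made up) with 10 extreme targets playing the role of the outliers.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
y = X[:, 0] + rng.normal(scale=0.5, size=n)
y[:10] = -33.0
is_outlier = (y < -30).astype(int)

# Stratify on the outlier flag, not on the continuous target itself.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = np.zeros(n)   # out-of-fold predictions
fold_outliers = []       # number of outliers in each validation fold

for train_idx, val_idx in skf.split(X, is_outlier):
    model = GradientBoostingRegressor(random_state=0)  # stand-in for LGBM
    model.fit(X[train_idx], y[train_idx])
    oof_pred[val_idx] = model.predict(X[val_idx])
    fold_outliers.append(int(is_outlier[val_idx].sum()))

oof_rmse = float(np.sqrt(np.mean((y - oof_pred) ** 2)))
print(fold_outliers, oof_rmse)
```

Because the split is stratified on the flag, every fold sees the same share of outliers in both its training and validation parts, which keeps the fold RMSE values comparable with each other.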
Here is my Kaggle Submission :
8. Doing Classification
We also tried to train classification models using this dataset, with the two classes being Loyal and Not Loyal.
We first standardized the target column and assigned the class ‘Loyal’ if the target was ≥ 0 else ‘Not Loyal’.
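The labelling step can be sketched on toy scores as follows. Standardizing here means subtracting the mean and dividing by the standard deviation, which is an assumption about the exact scaling used.

```python
import numpy as np
import pandas as pd

# Toy loyalty scores (made up), including one extreme value.
df_cls_toy = pd.DataFrame({'target': [1.0, -2.0, 0.5, 3.0, -33.0]})

# Standardize: zero mean, unit standard deviation.
standardized = (df_cls_toy['target'] - df_cls_toy['target'].mean()) / df_cls_toy['target'].std()

# 'Loyal' when the standardized score is >= 0, else 'Not Loyal'.
df_cls_toy['loyalty_class'] = np.where(standardized >= 0, 'Loyal', 'Not Loyal')
print(df_cls_toy)
```

Since the standardized score is at least 0 exactly when the raw score is at least the mean, the class boundary is the mean of target, and extreme scores pull that boundary towards themselves.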
We used the same feature set as generated above for training. We tried two models: Logistic Regression and XGBoost.
- Logistic Regression:
We have also deployed the LGBM regression model on AWS. We have provided a basic UI where users can select data from train.csv. Columns like card_id, first_active_month, feature_1, feature_2 and feature_3 can be selected and then submitted. The app calculates the other columns for that particular card id from the transactional data and then outputs a predicted Loyalty Score.
The files required for deployment can be found on this GitHub link.
- LGBM when trained using stratified data based on outliers proved to be a better model.
- Transactional data was very important for designing features.
- Features need to be carefully designed. Domain knowledge can certainly help to build better features.
11. Future Work
- We can come up with more features based on various transactional data.
- We can work on our Custom Model and look to improve it by fine-tuning the hyperparameters, or maybe by changing the architecture.
- We can work on ways to handle the outliers, because these points seriously affect our overall RMSE score.
- We can also look to try other models like Bayesian Regression.
Thank you for reading the blog. Hope you liked it!