A Starbucks Capstone Challenge
This is a capstone project of the Data Scientist Nanodegree Program of Udacity.
Photograph: Material from Luqman Hafiz, edited by Zheng Dai
Project Overview
This is a capstone project of Udacity's Data Scientist Nanodegree Program. Based on the given dataset, I need to create a model that predicts the behaviour of a customer.
The goal of this project is to find an appropriate offer for a Starbucks customer based on their purchase behavior and their responses to offers sent previously. Customers receive different offer messages, and their responses were captured over a 30-day period in the data provided by Starbucks.
To reach this goal, I will build a machine learning model that predicts whether a customer has a high probability of completing an offer, and which offer type that is. The conclusion of this project may help Starbucks identify target groups that react positively to certain offers. The most relevant features may include: income, gender, age, platform, device, offer message, and offer type. Some of these features may carry more weight than others.
The problem can be simplified to a binary classification problem: the model predicts whether a certain customer is likely to complete an offer based on their profile. The evaluation metric also needs to be considered; candidates include the overall accuracy, F1 score, and precision of the model.
Problem Statement and Understanding the Data
The goal of this project is to combine transaction, user profile, and offer data to determine which offer type a customer will respond to best. This dataset is a simplified version of the real Starbucks app data because it reflects only one product out of the dozens of products Starbucks sells.
This goal can be achieved through the following steps:
- Loading and Exploring the Data.
- Visualizing and Analyzing the Data.
- Processing the Data.
- Feature Selection and Feature Engineering.
- Normalizing the numerical features and Shuffling the Data.
- Trying several Learning Models.
- Evaluating and choosing the model using the f1_weighted score.
- Using GridSearchCV to find the best parameters to improve the performance of the chosen model.
- Reviewing the prediction results using a Confusion Matrix.
Dataset Description
The data is contained in three files:
- portfolio.json: offer ids and metadata about each offer
- profile.json: demographic data for each customer
- transcript.json: records of transactions and offer events
Metrics
To evaluate the model’s performance, I need to choose a score to compare different models. This is a real-world problem, and the solution can tolerate sending an offer to a customer who may not respond, but it should not skip any customer who might respond. That means lowering False Negatives is more important than lowering False Positives.
Moreover, the dataset may contain imbalanced classes. In this case, the F1 score is a better fit as the metric for model selection. The prediction target is an integer class value representing which type of offer a customer will respond to, so a metric that supports multiclass targets should be used. I would choose between f1_micro, f1_macro, f1_weighted, and f1_samples.
Here is the description of each F1 averaging option from scikit-learn.org:
- 'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
- 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
- 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
- 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score).
In my case, the prediction is not multilabel (multi-dimensional) data but a multiclass label, so I cannot use f1_samples. I also care about the prediction result for each label, since the offer type is critical to my goal. Therefore, f1_weighted is the best choice for my case.
f1_weighted computes the F1 for each label and returns the average weighted by the proportion of each label in the dataset, which makes it well suited to this purpose.
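For illustration, here is a minimal sketch of computing the weighted F1 with scikit-learn; the label values below are toy placeholders, not from the project data:

```python
from sklearn.metrics import f1_score

# Toy example with three imbalanced classes (0, 1, 2)
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

# Per-label F1 scores averaged using each label's support as the weight
print(f1_score(y_true, y_pred, average='weighted'))
```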
Exploratory Data Analysis
Portfolio Dataset
This Dataset contains 6 columns:
- reward (int)
- channels (list of strings)
- difficulty (int) — the minimum amount required to spend to complete an offer
- duration (int) — time for the offer to be open, in days
- offer_type (string) — the type of offer, i.e. BOGO, discount, informational
- id (string) — offer id
Head of the Portfolio Dataset.
The exploration of the Portfolio Dataset showed the following:
- The portfolio data has only 10 rows and a channels column that is a list of channels. For machine learning, having each column represent one feature makes it easier to assess the relevance of a feature to the learning result, so it is necessary to split the list column into several feature columns. The same idea applies to offer type: each type can become a new feature column in the dataset.
- Firstly, I need to split the offer type and channels into corresponding dummies.
- Secondly, I create a method to clean the portfolio.
After cleaning and creating dummies, I got a cleaned_portfolio with the following columns: [‘offer_index’, ‘offer_id’, ‘reward’, ‘difficulty’, ‘duration’, ‘channels’, ‘offer_type’, ‘channel_email’, ‘channel_mobile’, ‘channel_social’, ‘channel_web’, ‘offer_bogo’, ‘offer_discount’, ‘offer_informational’, ‘channels_str’]
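A minimal sketch of this cleaning step (the helper name and exact details are my assumptions, not the notebook's code):

```python
import pandas as pd

def clean_portfolio(portfolio):
    """Split the `channels` list and `offer_type` into one dummy column per value."""
    df = portfolio.copy()
    df = df.rename(columns={'id': 'offer_id'})

    # One indicator column per channel: channel_email, channel_mobile, channel_social, channel_web
    for channel in ['email', 'mobile', 'social', 'web']:
        df['channel_' + channel] = df['channels'].apply(lambda chs: int(channel in chs))

    # One indicator column per offer type: offer_bogo, offer_discount, offer_informational
    offer_dummies = pd.get_dummies(df['offer_type'], prefix='offer')
    return pd.concat([df, offer_dummies], axis=1)
```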
Profile Dataset
This Dataset contains 5 Columns:
- gender (str): gender of the customer (M: Male, F: Female, O: Others).
- age (int): age of the customer.
- id (str): customer id.
- became_member_on (int): the date when the customer created an app account.
- income (float): customer’s income.
Here is the Basic information of the Profile Dataset:
From the overview of the profile data, I can see that gender can be None and income can be NaN. My aim for this project is to predict whether a user will make a purchase decision based on their profile, which means gender and income are key features. I will check how many rows contain NaN or None, then drop them if possible.
The check shows that gender and income each have 12.8% missing data. Because the missing counts of gender and income are the same, there is a high probability that they occur in the same rows. To check this, I can union the two sliced subsets and verify that the count is 2175.
Overview of the numeric data in the profile dataset:
All counts are 2175, meaning the missing gender and income values come from the same rows, so we can drop those rows. Gender can be translated into dummy columns. Moreover, the number of days since a user became a member is more meaningful than the raw became_member_on date, so the clean method will calculate membership_days and format became_member_on as year-month-day.
Here are the steps to clean profile.json (a minimal sketch follows the list):
- clone the original data
- create dummies for gender
- drop None and empty data
- format the became_member_on column to yyyy-mm-dd
- create ‘membership_days’ by calculating the days since became_member_on
- rename id to customer_id
- reorder columns
- concat the dummies to the cloned data
- reset the index
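A minimal sketch of these steps, assuming pandas; the helper name and details are my assumptions:

```python
import pandas as pd

def clean_profile(profile):
    """Clean profile.json following the steps listed above."""
    df = profile.copy()
    df = df.rename(columns={'id': 'customer_id'})

    # Drop the 2175 rows with missing gender/income
    df = df.dropna(subset=['gender', 'income'])

    # Format became_member_on (e.g. 20170815) as a date, then derive membership_days
    df['became_member_on'] = pd.to_datetime(df['became_member_on'].astype(str), format='%Y%m%d')
    df['membership_days'] = (pd.Timestamp.today() - df['became_member_on']).dt.days

    # One dummy column per gender value (gender_M, gender_F, gender_O)
    gender_dummies = pd.get_dummies(df['gender'], prefix='gender')
    return pd.concat([df, gender_dummies], axis=1).reset_index(drop=True)
```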
And I get the cleaned_profile data like this:
Visualize profile data
It is interesting to get an intuitive feel for the user profiles. I check the columns of the profiles again to be sure which columns I can present, then create a function to visualize the data by gender and other attributes.
A few gender_other records are present in the dataset, but because the amount is too low to draw any conclusion from, I ignore that part in the visualization.
Here is a quick count distribution of gender vs age in the profile dataset:
From the pictures above, across all age groups, people aged around 40–70 appear most often in the dataset.
After 65 years old the count decreases. The dataset has no data for people under 18, so it only includes adults. The total count of males is larger than that of females and others.
And here is count distribution of gender vs income in the profile dataset:
From the pictures above, for incomes above 72,000 the counts for both males and females decrease.
The male counts decrease much more than the female counts.
The dataset does not include people with incomes under 30,000.
Transcript Dataset
This Dataset contains 4 columns
- person (str) — customer-id
- event (str) — record description (ie transaction, offer received, offer viewed, etc.)
- value — (dictionary of strings) — either an offer id or transaction amount depending on the record
- time (int) — time in hours since the start of the test. The data begins at time t=0
As the data presented above shows, there are 4 events that can happen, and the ‘value’ column is a dictionary that can contain different data depending on the event. After a check, the available keys in the value column are 'amount', 'offer id', 'offer_id' (with an underscore), and 'reward'. To make this explicit, the column needs to be split into amount, offer_id (offer id), and reward, so 3 new columns need to be added.
Here are the steps to clean transcript.json (a minimal sketch follows the list):
- clone the original data
- extract the value for each key from the value column
- create event dummies
- drop the value column
- rename person to customer_id
- reorder columns
- concat the dummies to the cloned data
- reset the index
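A minimal sketch of these steps (the helper name and details are my assumptions):

```python
import pandas as pd

def clean_transcript(transcript):
    """Clean transcript.json following the steps listed above."""
    df = transcript.copy()
    df = df.rename(columns={'person': 'customer_id'})

    # Extract each possible key from the `value` dict; 'offer id' and 'offer_id'
    # are the same field under two spellings
    df['amount'] = df['value'].apply(lambda v: v.get('amount'))
    df['offer_id'] = df['value'].apply(lambda v: v.get('offer id', v.get('offer_id')))
    df['reward'] = df['value'].apply(lambda v: v.get('reward'))
    df = df.drop(columns=['value'])

    # One dummy column per event, e.g. event_offer_received, event_offer_completed
    event_dummies = pd.get_dummies(df['event'].str.replace(' ', '_'), prefix='event')
    return pd.concat([df, event_dummies], axis=1).reset_index(drop=True)
```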
After processing, I got cleaned_transcript as follows:
A quick visualization shows the distribution of the number of events that happened per customer.
Most customers have around 10–23 events, which covers roughly 5 * 2,400 = 12,000 customers. Another view is the distribution of each type of event:
The most frequent event is a single transaction, and the number of offers completed is about half the number of offers received.
Thoughts
Both the profile dataset and the transcript dataset have a customer_id column, so I can combine them into one dataset to give each profile more features. After feature selection and engineering, the machine learning model will have rich, clean data to analyze.
Feature Selection and Feature Engineering
The columns in the profile and transcript datasets will be reviewed and put into a new dataset for later learning, and features that do not yet exist will be created.
In profile, gender, income, and age are normally kept to describe a user. Since I have membership_days, became_member_on can be ignored.
In transcript, one customer may have multiple transcript rows, and several events can happen for one user. Counting the offer_completed, offer_received, offer_viewed, and transaction events is useful.
offer_discount_completed_sum, offer_bogo_completed_sum, and offer_informational_completed_sum need to be engineered as the target labels for the later prediction.
It is also interesting to see how many times one customer interacted with each offer. Since the portfolio dataset only has 10 offers, I create 10 columns to store the count of each offer per customer.
The aggregated data includes the following columns, produced with the first, mean, and sum aggregation methods.
There are 25 features in the grouped data. I can see the offer_informational_completed_sum column is empty; no informational offer was completed, so this column can be removed. After removing it, I have more useful data for machine learning to process.
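A minimal sketch of the per-customer aggregation, assuming the merged profile/transcript frame is called combined_df; the column names and aggregation choices are assumptions based on the description above:

```python
def aggregate_by_customer(combined_df):
    """Group all of a customer's records into a single row."""
    agg_spec = {
        # profile attributes are constant per customer, so `first` keeps them
        'age': 'first', 'income': 'first', 'membership_days': 'first',
        'gender_M': 'first', 'gender_F': 'first',
        # event counts and monetary values are summed over all records
        'event_offer_received': 'sum', 'event_offer_viewed': 'sum',
        'event_offer_completed': 'sum', 'event_transaction': 'sum',
        'amount': 'sum', 'reward': 'sum',
    }
    return combined_df.groupby('customer_id').agg(agg_spec).reset_index()
```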
Before modeling, I made some visualizations to explore the data I have now:
The quick plot above shows the distribution of the offer-completed count per customer. I can see that similar numbers of customers completed 0, 1, 2, 3, or 4 offers, somewhat fewer completed 5 offers, and only a small number completed 6 offers.
Another quick plot shows the difference in the offer-completion distribution between males and females. The male counts do not grow with increasing completion count, while more of the females finished 2 or 3 offers. Far fewer females have completion counts of 0 or 1, and males and females look similar at counts of 2, 3, 4, 5, and 6.
One more picture compares completed events with incomplete events: slightly fewer events are completed in the dataset. This pie chart shows how balanced the dataset is, and this step verified that the dataset is fit for use in machine learning.
The plots above show the distribution of channels for offer_completed and offer_incompleted with different offer types. Although 4 channels allow many possible combinations, only 4 combinations appear in the dataset; the least frequent is the web_email combination.
Overall, the web_email channel appears less often than the other channel combinations, no informational offer was completed (as it is just an informational message), and more web_email_mobile channels were completed.
Quick Conclusion
grouped_df.shape shows (14825, 28), so the grouped DataFrame ends up having 28 columns. All numeric columns will be used for the machine learning model. In the next step, machine learning prediction, the target columns are ‘offer_discount_completed_sum’ and ‘offer_bogo_completed_sum’; because ‘offer_informational_completed_sum’ is always empty, only 2 predictable offer types are left. I can mark the target as a binary value to distinguish between BOGO and discount. I will verify the dataset and choose the most related columns for machine learning.
Data preprocessing for Machine Learning
To choose the best machine learning model, this section does another round of data preprocessing, normalizing, and engineering:
- Define a target split point as a rule to categorize customers into positive and negative
- Convert continuous data to binary data by cutting it into binary columns
- Remove categorical and unique columns, including ‘customer_id’, ‘event_completed’, ‘complete_rate’, ‘gender’, ‘gender_other’, and keep ‘complete_target’
- Normalize all the numerical features
- Create a heatmap to analyze the significant correlations
- Remove the less relevant/important features and shuffle the data
Split data into positive and negative
The target of this dataset is to find out which profiles are likely to complete an offer and of which type. Because a customer may complete some offers and not others, I need to set a threshold to distinguish customers who are likely to complete an offer. One way to do that is to check the completion rate for each customer.
I simply calculate the percentage of completed events among received events. The average of the complete_rate column is 0.487376, and I use it to generate a column ‘complete_target’ that splits customers into 2 groups.
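A minimal sketch of this split, assuming the per-customer aggregation is in grouped_df (the column names are assumptions):

```python
# Completion rate = completed events / received events per customer
# (customers with zero received offers would need separate handling)
grouped_df['complete_rate'] = (grouped_df['event_offer_completed']
                               / grouped_df['event_offer_received'])

# Customers above the mean completion rate (~0.487) form the positive group
threshold = grouped_df['complete_rate'].mean()
grouped_df['complete_target'] = (grouped_df['complete_rate'] > threshold).astype(int)
```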
The describe method shows the data range of each column: age is between 18 and 101, became_member_on is between 2013 and 2018, and membership_days ranges from 1208 to 3031.
Convert continuous data to binary data
It is better to convert continuous data into segments in order to engineer more usable features. This also reflects real-world user groups when making a prediction, e.g. age groups and membership-days groups. These groups are not split at the day level but, for example, into buckets of several years or hundreds of days.
I created the following columns (a sketch of the binning follows the list):
myear_2013, myear_2014, myear_2015, myear_2016, myear_2017, myear_2018
age_17_40, age_40_60, age_60_80, age_80_110
mdays_1200_1500, mdays_1500_2000, mdays_2000_2500, mdays_2500_3050
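A minimal sketch of this binning, assuming grouped_df from earlier and that became_member_on was parsed to a datetime during cleaning; the bin edges follow the column names above:

```python
import pandas as pd

age_bins = pd.cut(grouped_df['age'], bins=[17, 40, 60, 80, 110],
                  labels=['age_17_40', 'age_40_60', 'age_60_80', 'age_80_110'])
mdays_bins = pd.cut(grouped_df['membership_days'],
                    bins=[1200, 1500, 2000, 2500, 3050],
                    labels=['mdays_1200_1500', 'mdays_1500_2000',
                            'mdays_2000_2500', 'mdays_2500_3050'])
myear = 'myear_' + grouped_df['became_member_on'].dt.year.astype(str)

# Turn each bin label into its own 0/1 indicator column
grouped_df = pd.concat([grouped_df,
                        pd.get_dummies(age_bins),
                        pd.get_dummies(mdays_bins),
                        pd.get_dummies(myear)], axis=1)
```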
After some cleaning and validation, I added a step to even out the scale of the numeric values.
Normalize data and remove less relevant features
Before normalizing, I need to remove the string columns from the dataset: ‘customer_id’, ‘event_completed’, ‘complete_rate’, ‘gender’. Normalization only works on numerical data.
It is worth looking at the correlation of the normalized data.
This shows that some columns have little correlation with the other columns, which means they are less useful for building my model. From the heatmap shown above, the less relevant columns `time`, `gender_other`, `age_80_110`, and `myear_2013` can be removed.
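A minimal sketch of normalizing and inspecting correlations, assuming features_df holds only the numeric columns; the use of MinMaxScaler and seaborn here is my assumption:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Scale every numeric feature into the [0, 1] range
scaler = MinMaxScaler()
normalized_df = pd.DataFrame(scaler.fit_transform(features_df),
                             columns=features_df.columns)

# Heatmap of pairwise correlations to spot weakly related columns
plt.figure(figsize=(16, 12))
sns.heatmap(normalized_df.corr(), cmap='coolwarm', center=0)
plt.show()

# Drop the weakly correlated columns identified above
normalized_df = normalized_df.drop(columns=['time', 'gender_other',
                                            'age_80_110', 'myear_2013'])
```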
Shuffle the data and Create target
The target is to predict whether a customer will complete an offer, and which offer type. I use 0, 1, and 2 to represent the answers respectively, so the target is a list that can only contain 0, 1, or 2.
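A minimal sketch of building that target; the 0/1/2 mapping below is my assumption (0 = neither, 1 = BOGO completed, 2 = discount completed):

```python
import numpy as np

# Encode which offer type (if any) the customer completed
target = np.where(grouped_df['offer_bogo_completed_sum'] > 0, 1,
                  np.where(grouped_df['offer_discount_completed_sum'] > 0, 2, 0))
```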
Using train_test_split from sklearn, I can quickly split the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=40)
Machine Learning Modeling and Analysis
As stated at the very beginning, this is a real-world imbalanced dataset and a lower false-negative rate is more important than a lower false-positive rate, therefore the F1 score is a better fit as the metric. This section takes the train and test data and uses GridSearchCV to evaluate the F1 score for each algorithm, then finds the best-scoring and fastest one to build the model.
This section includes the following steps:
- Model selection: test different models
- Feature importance
- Predict test data
- Confusion Matrix
As noted in the Metrics section, average='weighted' computes the F1 for each label and returns the average weighted by the proportion of each label in the dataset (see https://stackoverflow.com/questions/55740220/macro-vs-micro-vs-weighted-vs-samples-f1-score). Since my target is a multiclass label rather than multilabel data, f1_weighted remains the best choice for this case.
Difficulty And Complications in Coding
Because I have 4 classifiers to test, `GridSearchCV` is used many times in a similar way, which produces a lot of duplicated work. I also need to customize its usage: sometimes I need the same printed output, sometimes a different cv value or verbose level. To make the code more flexible, I created several functions to handle the reuse cases, together with some for loops:
`get_forest_score`, `test_each_rfc_conf_dimention`, `get_clf_f1`.
The code can be found in the GitHub notebook.
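A minimal sketch of one such helper; the real get_clf_f1 in the notebook may differ:

```python
import time
from sklearn.model_selection import GridSearchCV

def get_clf_f1(clf, X_train, y_train, param_grid=None, cv=4, verbose=1):
    """Run GridSearchCV with the f1_weighted scorer and print a short report."""
    grid = GridSearchCV(clf, param_grid or {}, scoring='f1_weighted',
                        cv=cv, verbose=verbose, n_jobs=-1)
    start = time.time()
    grid.fit(X_train, y_train)
    elapsed = time.time() - start

    print('-' * 60)
    print(type(clf).__name__)
    print('Time Used: {:.2f} secs'.format(elapsed))
    print('F1 Score: {:.6f}'.format(grid.best_score_))
    return grid.best_score_, grid.best_estimator_, grid.best_params_, elapsed
```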
Set up each classifier for comparison
I got a comparison of the 4 models with their default settings:
# Initialize classification algorithms
# Set default configuration for each classifier explicitly
# The random_state set to 42 to get the same value each time
To reproduce the evaluation results, I used the following configurations (a sketch of the setup follows these settings):
LogisticRegression:
penalty: l2
dual: False
tol: 1e-4
C: 1.0
fit_intercept: True
intercept_scaling: 1
class_weight: None
random_state: 42
solver: lbfgs
max_iter: 230 (when max_iter < 224, error: ITERATIONS REACHED)
multi_class: auto
verbose: 0
warm_start: False
n_jobs: None
l1_ratio: None
RandomForestClassifier:
n_estimators: 100
criterion: gini
max_depth: None
min_samples_split: 2
min_samples_leaf: 1
min_weight_fraction_leaf: 0.0
max_features: auto
max_leaf_nodes: None
min_impurity_decrease: 0.0
bootstrap: True
oob_score: False
n_jobs: None
random_state: 42
verbose: 0
warm_start: False
class_weight: None
ccp_alpha: 0.0
max_samples: None
GradientBoostingClassifier:
loss: deviance
criterion: friedman_mse
min_samples_split: 2
min_samples_leaf: 1
min_weight_fraction_leaf: 0.0
max_depth: 3
min_impurity_decrease: 0.0
init: None
random_state: 42
max_features: None
verbose: 0
max_leaf_nodes: None
warm_start: False
validation_fraction: 0.1
n_iter_no_change: None
tol: 1e-4
ccp_alpha: 0.0
AdaBoostClassifier
base_estimator: None
n_estimators: 50
learning_rate: 1.0
algorithm: SAMME.R
random_state: None
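A minimal sketch of this setup; only the values that differ from the library defaults are passed explicitly, and the comparison loop reuses the get_clf_f1 helper sketched above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              AdaBoostClassifier)

classifiers = [
    LogisticRegression(random_state=42, max_iter=230),  # max_iter < 224 fails to converge
    RandomForestClassifier(random_state=42),
    GradientBoostingClassifier(random_state=42),
    AdaBoostClassifier(),
]

for clf in classifiers:
    get_clf_f1(clf, X_train, y_train, cv=4)
```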
------------------------------------------------------------
LogisticRegression
Fitting 4 folds for each of 1 candidates, totalling 4 fits
Time Used: 2.19 secs
F1 Score: 0.9281
------------------------------------------------------------
RandomForestClassifier
Fitting 4 folds for each of 1 candidates, totalling 4 fits
Time Used: 4.17 secs
F1 Score: 0.965104
------------------------------------------------------------
GradientBoostingClassifier
Fitting 4 folds for each of 1 candidates, totalling 4 fits
Time Used: 25.89 secs
F1 Score: 0.971976
------------------------------------------------------------
AdaBoostClassifier
Fitting 4 folds for each of 1 candidates, totalling 4 fits
Time Used: 2.31 secs
F1 Score: 0.906107
The time-used values depend on computer power; I was running on a 2.6 GHz 6-Core Intel Core i7 MacBook Pro.
With GridSearchCV given cv=4, random_state=42, and the default configurations for all the classifiers (LogisticRegression needs max_iter greater than 224, so I set 230), RandomForestClassifier takes about 4 seconds to reach a score of 0.965. It scores better than AdaBoostClassifier and LogisticRegression, but is slower than those two. Meanwhile, RandomForestClassifier runs much faster (about 6 times) than GradientBoostingClassifier, which has the better score of 0.972. Considering the balance between score and time, the best-performing classifier among the above 4 is RandomForestClassifier.
Tune Parameters for RandomForestClassifier
Before running a full-scale GridSearchCV, I set up a batch of speed tests for each configuration. This process helps me understand the impact on performance and speed.
First of all, I set cv to 10, because with a higher CV I get different combinations of parameters and the score can be better. I have tried max_samples, criterion, bootstrap, max_features, max_depth, min_samples_leaf, min_samples_split, and n_estimators.
And I got the following results:
- max_samples: 0.1 to 1.0 takes linearly increasing time; 1.0 scores better.
- criterion: gini and entropy take similar time; entropy scores better.
- bootstrap: False takes 500%–600% longer; False scores better.
- max_features: auto and sqrt take similar time; auto scores better.
- max_depth: 10 to 80 take similar time; 30 scores better.
- min_samples_leaf: 1 to 4 take similar time; 1 scores better.
- min_samples_split: 2 to 8 take similar time; 2 scores better.
- n_estimators: 100 to 400 takes linearly increasing time; 400 scores better.
Using GridSearchCV, I can find the best parameters for the classifier automatically:
RandomForestClassifier
Fitting 10 folds for each of 4 candidates, totalling 40 fits
Time Used: 352.55 secs
F1 Score: 0.970791
>>>>>>>>>>>>>>>>>>>> result:
forest_score 0.970790619690374
forest_estimator RandomForestClassifier(criterion='entropy', max_depth=30, n_estimators=1200,
random_state=42)
best_params {'criterion': 'entropy', 'max_depth': 30, 'n_estimators': 1200}
time_used 352.55
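The search above could be reproduced with a sketch like the following; the exact parameter grid is my assumption based on the speed tests, with 4 candidate combinations matching the log above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'criterion': ['entropy'],
    'max_depth': [30],
    'n_estimators': [100, 400, 800, 1200],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                    scoring='f1_weighted', cv=10, verbose=1, n_jobs=-1)
grid.fit(X_train, y_train)

forest_estimator = grid.best_estimator_
print(grid.best_score_, grid.best_params_)
```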
Conclusion of the RandomForestClassifier configuration:
RandomForestClassifier gets the best score of 0.97 (fitting 10 folds) with the following configuration:
n_estimators=1200,
criterion="entropy",
max_depth=30,
min_samples_split=2,
min_samples_leaf=1,
min_weight_fraction_leaf=0.0,
max_features="auto",
max_leaf_nodes=None,
min_impurity_decrease=0.0,
bootstrap=True,
oob_score=False,
n_jobs=None,
random_state=42,
verbose=0,
warm_start=False,
class_weight=None,
ccp_alpha=0.0,
max_samples=None
I can see that, after fine-tuning the trained RandomForestClassifier, the f1_score improves from 0.96 to 0.97.
Feature importance by RandomForestClassifier
To find the top 10 most important features for my model, I calculated the importance weight of each feature.
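A minimal sketch of this step, assuming forest_estimator is the tuned model and X_train is a DataFrame whose columns are the feature names:

```python
import pandas as pd

importances = pd.Series(forest_estimator.feature_importances_, index=X_train.columns)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
```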
From the results shown above I can conclude the following:
The top 10 features that influence whether the customer will respond to an offer after viewing it are:
- ‘reward’ (money received by the customer) is the most important feature, scoring more than double the second most important feature. It represents the reward amount; the reward stimulus works well to push a customer to come back, respond, and complete an offer.
- The counts of the offer_bogo and offer_discount features follow reward, as they carry a strong signal for deciding the offer type.
- ‘amount’ (money spent) is the next largest feature, and it can strongly influence whether the customer completes an offer after viewing it. It scores about double the third most important feature.
- ‘event received’ represents whether Starbucks sent the offer to the customer; a customer who opens the offer message is likely to respond and complete the offer. Its importance score is significantly larger than that of the fourth most important feature.
- ‘offer count’ represents the total number of offers a customer received. From this feature onward, the importance scores decrease gradually, and the exact order could differ depending on various factors.
- ‘channel email’, ‘duration’, ‘channel web’, ‘event viewed’, ‘difficulty’, and ‘membership days’ play similar roles in influencing the completion result.
Predict test data & Calculate the Confusion Matrix
The confusion matrix is computed to evaluate the accuracy of the classification: its entry in the i-th row and j-th column indicates the number of samples whose true label is the i-th class and whose predicted label is the j-th class.
The picture above shows the prediction result; the row index represents the true value and the columns represent the predicted result.
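A minimal sketch of this step, using scikit-learn's confusion_matrix with the 0/1/2 target encoding assumed earlier:

```python
from sklearn.metrics import confusion_matrix

y_pred = forest_estimator.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)  # rows = true labels, columns = predicted labels
```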
If I pick one offer type and treat it as true/false to get a binary metric, I can derive more metrics such as Precision and Recall.
After converting the multiclass classification to a binary classification, here is the result for offer_bogo:
---Prediction of an customer may accept offer_bogo---
Precision 94.80301760268232
Recall 95.92875318066157
Accuracy 97.0310391363023
F1-score 95.36256323777404
misclassifying acceptance 4.071246819338422
misclassifying refusal 2.454473475851148
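These binary metrics could be reproduced with a sketch like the following, assuming class 1 means "completed a BOGO offer" under the encoding assumed earlier:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

# Treat offer_bogo (class 1) as the positive class, everything else as negative
y_test_bogo = (np.asarray(y_test) == 1).astype(int)
y_pred_bogo = (np.asarray(y_pred) == 1).astype(int)

print('Precision', precision_score(y_test_bogo, y_pred_bogo) * 100)
print('Recall   ', recall_score(y_test_bogo, y_pred_bogo) * 100)
print('Accuracy ', accuracy_score(y_test_bogo, y_pred_bogo) * 100)
print('F1-score ', f1_score(y_test_bogo, y_pred_bogo) * 100)
```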
The results show that the prediction model has a Precision of 94.8%, Recall of 95.9%, Accuracy of 97.0%, and F1 score of 95.4%. Among the customers who would accept an offer, 4.07% are misclassified; among the customers who would refuse an offer, 2.45% are misclassified.
Because the misclassification rate for refusals is even lower than that for acceptances, the predictive model is doing well: it has a low chance of missing a customer who may respond to an offer or of confusing that customer with one who may ignore it.
In this project, sending an offer to a potential customer with a high chance of responding is more important than not sending one to a person who may refuse it. That means that even though both higher precision and higher recall are goals we aim for, higher recall is more meaningful to this project than precision. Therefore, the predictive model fits the purpose of this project.
Review and Conclusion
The problem in this project is to build a model that predicts whether a Starbucks customer will respond to an offer. After clarifying the problem definition, the solution has four main steps.
First, Data Loading and Cleaning. I preprocessed the portfolio, profile, and transcript datasets and obtained several cleaned and verified datasets. I also explored the cleaned data and showed some visualizations.
Second, Feature Selection and Engineering. In this step I combined the 3 datasets into one by joining on customer_id. To keep more information from the transcript dataset, the customer event data needed to be aggregated.
Third, Normalizing and Engineering the Data for Machine Learning. I created dummies and cut data to generate new features, removed the features with low relevance, then shuffled the data and split it into train and test sets.
Last, based on the split data, I obtained a quick comparison of F1 score and time used across 4 algorithms with GridSearchCV, and chose the best estimator, RandomForestClassifier, to train and predict on the split data. A confusion matrix was generated to verify the performance of my model.
The estimator forest_estimator which I got at the end fits the purpose of predicting whether or not a customer will respond to a Starbucks offer.
Improvement and Reflection
First of all, the result looks pretty good. I understood the data well before implementing any engineering code, and I planned the project using the conclusions from each step and section, following a methodology called reflection in action.
However, the result is based on several choices of data selection, feature engineering, and algorithm selection. For example, some features such as gender_others, time, and offer_discount were dropped because they showed low correlation with the other features. In another dataset, these dropped features might carry higher weight; gender_others could matter more than gender_male, and gender_male could then be the one to ignore. The low-correlation features could be detected by an algorithm automatically, which would then remove them or choose the top N most relevant features.
Another improvement is to use a higher number of folds in GridSearchCV, which may give a better result but run a bit longer. The slower runtime is tolerable, because the result is critical for the later model building.