Uses of Automated Feature Engineering for Predictive Modelling

Aug 7, 2019 | Blog, Other

Introduction

I am an intern learning about generating predictive models. Manual feature engineering is a critical part of the model-building process. With the advent of automated feature engineering in Featuretools, an open-source Python framework, manual work can be reduced, and manual and automated features can be combined to improve the efficiency and accuracy of a model.

The idea for this project is to leverage the automated feature engineering functionality of Featuretools with a specific use case to learn more about the generated features and their importance to model building. The use case is for a restaurant aggregator site like Yelp, which has clients who pay to have their restaurant featured. Through the site, consumers are able to book reservations and place various filters on restaurants they are searching for, such as area and cuisine.

The use case is to help the aggregator site to make sure restaurants do not churn. The goal of the model is to predict which clients are likely to leave. Knowing early can help the aggregator site be proactive in preventing churn.

Goals of the project

The objectives are 1) to use Featuretools to observe what kinds of features are generated automatically from our data and 2) to assess how much the generated features contribute to the model for our use case.

The project involves creating unique indices for the datasets, creating an EntitySet containing multiple datasets, and defining the relationship between those datasets. The next step is deep feature synthesis with Featuretools, which automatically generates features that are added to our primary dataset. Building and analysing models trained on different versions of this data (a combination of manual and automated features, only manual features, and only automated features) will help determine the usefulness of Featuretools and its implications for future modelling.

Data Variables

Client Data

  1. Agent ID: the ID of the agent assigned to the restaurant
  2. Client ID: the ID number of a specific restaurant
  3. Client Join Date: the date the restaurant was added to the aggregator site
  4. Client Left Date: if applicable, the date the restaurant left the aggregator site
  5. Total number of seats: maximum occupancy of the restaurant
  6. Internet reservation: whether a user of the site can make an online reservation for a restaurant [yes/no]
  7. Service Level: subscription level of the client

Call Data

  1. Client ID: the ID number of a specific restaurant
  2. Date: the date and time of the call from an agent to that restaurant 
  3. Agent_ID: the ID of the agent who called the restaurant
  4. Notes: content of the call

Data Manipulation

Before using Featuretools to perform automated feature engineering, data manipulation of client and call data was required. Using pandas in a Jupyter notebook allowed for standard operations such as loading client and call datasets, dropping columns, and adding a churn flag column. 

import math

import pandas as pd

# Read in the data
client_cleaned = pd.read_csv('client.csv', parse_dates=['Date_Added_as_Client'])
calls_cleaned = pd.read_csv('result.csv', parse_dates=['CallDate'])

# Clean up the data and remove fields containing Japanese characters
clients = client_cleaned.drop(columns=['Store_Name', 'Location',
                                       'Documents_for_Leaving',
                                       'Documents_for_Leaving_Memo',
                                       'Genre_of_Food',
                                       'Average_customer_transaction',
                                       'Search_budget',
                                       'Regular_Holiday',
                                       'Service_Area',
                                       'Middle_Service_Area',
                                       'Area'])
calls = calls_cleaned.drop(columns=['ClientTalker', 'Dialogue_channel',
                                    'Reason_For_Call', 'CallResult'])

# Create a churn flag column for the client data
def is_nan(x):
    # A missing Date_Left is read by pandas as a float NaN
    return isinstance(x, float) and math.isnan(x)

def churn(data):
    # 1 if the client has a leaving date (churned), 0 otherwise
    return 0 if is_nan(data) else 1

clients['churn_flag'] = clients['Date_Left'].apply(churn)

Featuretools requires the “parent” table to have one column with a unique value for each row. Our client data did not have such a column (Client IDs were repeated, because the data was collected over a period of 14 months), so a new column called “client/month” was created by combining the Client ID with the month the row was collected. This column was added to both the client data and the call data, so that the two tables could later be linked through it.
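The key construction described above could look roughly like the following sketch. The column names `Client_ID` and `year_month`, the key format, and the toy rows are assumptions made for illustration; the original notebook may have built the column differently.

```python
import pandas as pd

# Toy stand-ins for the real tables; 'Client_ID' is a hypothetical column name.
clients = pd.DataFrame({
    "Client_ID": [101, 101, 102],
    "year_month": ["2018-01", "2018-02", "2018-01"],
})
calls = pd.DataFrame({
    "Client_ID": [101, 101, 102, 101],
    "CallDate": pd.to_datetime(
        ["2018-01-05", "2018-01-20", "2018-01-11", "2018-02-03"]
    ),
})

def add_client_month(df, month_col):
    """Return a copy of df with a 'client/month' key such as '101/2018-01'."""
    out = df.copy()
    out["client/month"] = (
        out["Client_ID"].astype(str) + "/" + out[month_col].astype(str)
    )
    return out

# Calls carry a timestamp, so derive the month label from it first.
calls["year_month"] = calls["CallDate"].dt.strftime("%Y-%m")

clients = add_client_month(clients, "year_month")
calls = add_client_month(calls, "year_month")

# The parent key must be unique before it can serve as an index.
assert clients["client/month"].is_unique
```

Every call row now carries the key of exactly one client row, which is what a parent/child relationship needs.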

How Featuretools Is Used

After the data was transformed, it could be loaded into Featuretools. The first step was creating a new EntitySet to which the client and call data were added.

import featuretools as ft

# Create a new EntitySet
# (entity_from_dataframe and variable_types belong to the Featuretools API before version 1.0)
es = ft.EntitySet(id='clients')

# Create entities to put into the EntitySet
es = es.entity_from_dataframe(entity_id='clients',
                              dataframe=clients,
                              variable_types={'year_month': ft.variable_types.Categorical,
                                              'churn_flag': ft.variable_types.Categorical},
                              index='client/month',
                              time_index='Date_Added_as_Client')

es = es.entity_from_dataframe(entity_id='calls',
                              dataframe=calls,
                              variable_types={'Business period': ft.variable_types.Categorical,
                                              'year_month': ft.variable_types.Categorical},
                              make_index=True,
                              index='call_id',
                              time_index='CallDate')

variable_types is an optional argument that sets the type of specific variables in the table. If it is omitted, Featuretools infers the type of each variable automatically. Some categorical variables, such as year_month, churn_flag, and business_period, had been inferred as numerical. Using ft.variable_types.[Variable_Type], these variables were set to categorical. Running es shows both datasets added to the EntitySet.

The next step is defining the relationship between the two datasets, which are related by the column “client/month”. Using ft.Relationship, the parent client data was related to the child call data through their common column, “client/month”. This relationship was then added to the EntitySet, which can be seen by running es again.

# Relationship between client data (parent) and call data (child)
client_call_relationship = ft.Relationship(es['clients']['client/month'],
                                           es['calls']['client/month'])

# Add the relationship to the entity set
es = es.add_relationship(client_call_relationship)

After the relationship is defined, Featuretools allows us to perform automated deep feature synthesis on the data. 

# Perform deep feature synthesis without specifying primitives
feature_matrix, features_list = ft.dfs(entityset=es,
                                       target_entity='clients',
                                       chunk_size=0.01)

Aggregation and transformation functions, as well as stacked combinations of two functions (features with a depth of 2), were applied to variables in the existing data, and the results were added to the existing dataset as new columns. A list of the features added to our original dataset is shown below.

After these features were added, their utility could be evaluated in model building using ScoreFast™.
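As an aside on what such features compute, the following pandas sketch reproduces by hand one depth-1 aggregation and one depth-2 stacked feature on toy data. The rows are invented, and the feature names only imitate the style Featuretools uses; this is an illustration of the idea, not the library's implementation.

```python
import pandas as pd

# Toy parent and child tables linked by 'client/month'.
clients = pd.DataFrame({"client/month": ["101/2018-01", "102/2018-01"]})
calls = pd.DataFrame({
    "client/month": ["101/2018-01", "101/2018-01", "101/2018-01", "102/2018-01"],
    "CallDate": pd.to_datetime(
        ["2018-01-01", "2018-01-08", "2018-01-10", "2018-01-11"]
    ),
})

# Transformation: map each call to its weekday (Monday = 0).
calls["weekday"] = calls["CallDate"].dt.weekday

# Aggregations over the child rows of each parent row:
# - depth 1: number of calls per client/month
# - depth 2: number of distinct weekdays, i.e. an aggregation
#   stacked on top of the weekday transformation
aggregated = calls.groupby("client/month").agg(**{
    "COUNT(calls)": ("CallDate", "count"),
    "NUM_UNIQUE(calls.WEEKDAY(CallDate))": ("weekday", "nunique"),
}).reset_index()

# Attach the new columns to the parent table, one row per client/month.
features = clients.merge(aggregated, on="client/month", how="left")
```

Deep feature synthesis automates this pattern across all applicable columns and primitives, which is why the number of generated columns grows quickly with depth.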

Downsampling Data Before Modelling

Looking at the distribution of the data after uploading it to the ScoreFast™ platform proved useful, as it showed that the churn_flag data was skewed, with 90.22% of the rows having a churn_flag value of 0. This imbalanced class distribution caused the first round of modelling to have a disproportionately high error rate on rows whose churn_flag was 1. Downsampling the majority class to a 2:1 ratio of class 0 to class 1 therefore helped to balance the dataset.

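# Count rows per class; value_counts() sorts by frequency, so the majority class (0) comes first
count0, count1 = dataset.churn_flag.value_counts()

# Split the data by class
dataset_class0 = dataset[dataset['churn_flag'] == 0]
dataset_class1 = dataset[dataset['churn_flag'] == 1]

# Randomly keep twice as many class 0 rows as there are class 1 rows
dataset_class_0_under_2 = dataset_class0.sample(2 * count1)

# Recombine into the downsampled dataset
dataset_test_under = pd.concat([dataset_class_0_under_2, dataset_class1], axis=0)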

Code for downsampling the audited data

Distribution of the data before downsampling

Distribution of the data after downsampling

Modelling Phase

Using ScoreFast™, the final step was to evaluate the importance of the features added to our data in order to build a machine learning model that effectively predicts client churn. Models were trained and tested on the combined dataset including manual and automated features (M1), the original dataset that included only manually engineered features (M2), and a third dataset that contained only the automated features generated by Featuretools (M3).
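One way to derive the three datasets from a single feature matrix is to select columns by origin. The sketch below assumes the set of manually engineered column names is known; the helper `split_feature_sets` and every column name except churn_flag are hypothetical and not taken from the original project.

```python
import pandas as pd

def split_feature_sets(feature_matrix, manual_columns, target="churn_flag"):
    """Split one feature matrix into the three column sets M1, M2 and M3."""
    manual = [c for c in feature_matrix.columns
              if c in manual_columns and c != target]
    automated = [c for c in feature_matrix.columns
                 if c not in manual_columns and c != target]
    return {
        "M1": feature_matrix[manual + automated + [target]],  # manual + automated
        "M2": feature_matrix[manual + [target]],              # manual only
        "M3": feature_matrix[automated + [target]],           # automated only
    }

# Toy data for illustration only.
feature_matrix = pd.DataFrame({
    "Total_Seats": [40, 25, 60],
    "Service_Level": ["basic", "premium", "basic"],
    "COUNT(calls)": [3, 0, 5],
    "churn_flag": [0, 1, 0],
})
datasets = split_feature_sets(
    feature_matrix, manual_columns={"Total_Seats", "Service_Level"}
)
```

Keeping the target column in all three sets means each one can be exported and modelled on its own.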

In the modelling phase, the algorithm with the best precision and recall across the models was XGBoost, so it was chosen for prediction. When the three models are compared, M1 (manual and automated features) has the best accuracy, AUC, precision, and recall.
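For readers less familiar with these metrics, the sketch below computes accuracy, precision, and recall from scratch. The labels are invented (90 non-churners and 10 churners, close to the skew of the raw data) and are not results from this project. They show why accuracy alone can mislead on imbalanced data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted positive
        # or when there are no positive labels.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Illustrative labels only: 90 clients who stayed, 10 who churned.
y_true = [0] * 90 + [1] * 10

# A model that never predicts churn still scores 90% accuracy,
# but it finds none of the churners.
always_stay = [0] * 100
metrics = classification_metrics(y_true, always_stay)
```

Here `metrics["accuracy"]` is 0.9 while `metrics["recall"]` is 0.0, which is the kind of gap that precision and recall expose and accuracy hides.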

M1 Model Overview

M2 Model Overview

M3 Model Overview 

The variable importances of the features for each model are shown below.

Variable Importances for M1

Variable Importances for M2

Variable Importances for M3

Most of the highest-ranked variables are manually engineered features, with certain automated features also providing useful information to train the model, such as the count of calls and the day the client was added. M3 emphasizes this again: without manual features, the count of calls for that month, the year the restaurant was added as a client, and how many times a year they were contacted proved critical to the model.

Prediction Results

On ScoreFast™, we have the option of scoring a percentage of our data, or scoring individual rows of the data to see the model’s confidence in predicting the response variable, in this case the churn_flag. Scoring individual rows gives an idea of how confidently the model predicts a churn_flag of 0 or 1 compared to the actual value.

M1 Prediction Value

M2 Prediction Value

M3 Prediction Value

All models predicted the correct churn_flag value of 0 with fairly high confidence, with M1 slightly more confident than M2, and M3 somewhat less confident in the prediction.

Conclusions and Remarks

Having gone through data manipulation, deep feature synthesis with Featuretools, modelling, and prediction, I have learned the importance of making feature engineering as efficient as possible. In this use case, Featuretools was great for stacking features on top of each other for numerical data; however, the majority of the data in this use case was categorical, which did not always mesh well with the Featuretools aggregation functions.

Featuretools is a great way to start the feature engineering process or add auxiliary aggregation features after manually engineering other features. I look forward to exploring how it can be used with more numerical data and other projects in the future.

