Final Assignment: Customer Return Prediction

Business Analytics and Data Science - Winter 2020/2021 - Prof. Dr. Stefan Lessmann

Submission by David Schulte

Problem statement (as given by the instructor)

Customers send back a substantial part of the products that they purchase online. Return shipping is expensive for online platforms and return orders are said to reach 50% for certain industries and products. Nevertheless, free or inexpensive return shipping has become a customer expectation and de-facto standard in the fierce online competition on clothing, but shops have indirect ways to influence customer purchase behavior. For purchases where return seems likely, a shop could, for example, restrict payment options or display additional marketing communication.

For this assignment, you are provided with real-world data by an online retailer. Your task is to identify the items that are likely to be returned. When a customer is about to purchase an item that is likely to be returned, the shop plans to show a warning message. Your task is to build a targeting model to balance potential sales and return risk in order to optimize shop revenue. The data you receive is artificially balanced (1:1 ratio between returns and non-returns).

The final predictions will be evaluated by the total cost under the given cost matrix, in which v is the price of the item; the goal is to minimize this cost.

Cost matrix

Solution

First, we will import the libraries commonly used in data science.
Additionally, I wrote several classes, which we want to import here. This makes the notebook more readable.

Exploratory data analysis and data preparation

We will load the data into a pandas dataframe.

Let us have a first look at the data.

We can instantly make the following observations.

Our data has the following features:

Data type conversion

Before we focus more on the data, we want to typecast the features to more suitable datatypes.
We will use the class DataTransformer, which I wrote for this project. It is a custom transformer with several functions that help us with data cleaning and feature engineering. It is compatible with scikit-learn, which brings some advantages, as we will see later in this notebook.

We will change the datatypes of our features. We will downcast the numerical values, as none of them needs 64 bits of storage. The dates will be converted to the pandas datatype datetime. The other features of type object will be converted into category.
Although the IDs are categorical features, we will leave them as they are for now, because we can visualize them better this way.
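A minimal sketch of these conversions on a toy frame (the actual DataTransformer applies them across all columns; the column names match the data set, the values are made up):

```python
import pandas as pd

# Toy stand-in for the retailer data with one column per conversion type.
df = pd.DataFrame({
    "item_price": [59.9, 19.95, 0.0],
    "order_date": ["2016-01-01", "2016-02-15", "2016-03-30"],
    "user_title": ["Mrs", "Mr", "Mrs"],
})

before = df.memory_usage(deep=True).sum()

# Downcast numerics: none of them needs 64 bits of storage.
df["item_price"] = pd.to_numeric(df["item_price"], downcast="float")
# Dates become the pandas datetime type ...
df["order_date"] = pd.to_datetime(df["order_date"])
# ... and the remaining object columns become categories.
df["user_title"] = df["user_title"].astype("category")

after = df.memory_usage(deep=True).sum()
```

Even on this tiny frame, the memory footprint shrinks; on the full data set the savings are substantial.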

Our data is now easier to work with and memory usage has decreased significantly.

Visualization of the numerical features

Let us have a closer look at the data.

We make the following observations:

Comparing known and unknown data

Now, we will inspect the unknown data to see whether it is similar to the known data.

The histograms are mostly very similar. However, the following differences stand out:

Before continuing, we want to make sure that the user IDs from both datasets belong to the same users and that no IDs were reassigned. To do this, we will test whether a user with a specific ID that is contained in both data sets has the same attributes, namely user_dob, user_title and user_state.

We will do the same with the items by checking whether they have the same brand_id.
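The check can be sketched as follows, with toy user records standing in for the real data sets: concatenate both sets and verify that every user_id maps to exactly one distinct value per attribute.

```python
import pandas as pd

# Toy stand-ins for the known and unknown user records.
known_users = pd.DataFrame({
    "user_id": [1, 2],
    "user_dob": ["1980-05-01", "1975-09-12"],
    "user_title": ["Mrs", "Mr"],
    "user_state": ["Berlin", "Bavaria"],
})
unknown_users = pd.DataFrame({
    "user_id": [2, 3],
    "user_dob": ["1975-09-12", "1990-01-20"],
    "user_title": ["Mr", "Mrs"],
    "user_state": ["Bavaria", "Saxony"],
})

combined = pd.concat([known_users, unknown_users], ignore_index=True)
# Consistent IDs: each user_id has at most one distinct value per attribute.
per_id = combined.groupby("user_id")[["user_dob", "user_title", "user_state"]].nunique()
ids_consistent = bool((per_id <= 1).all().all())
```

The same pattern applies to the items, grouping item_id against brand_id.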

The IDs are in fact consistent.

Exploring the categorical variables

There are a lot of different item sizes. There may be a way to regroup them, but we will not investigate this further. The reason is that there is no indication why the size of an item should influence the probability of it being returned.

We see that the vast majority of orders were placed by female customers. Also, some titles were not reported.

Company customers show a different return rate. However, we will not make use of this, because there are so few companies in our dataset.

The average return rates of the different states are very similar. That is what we expected. We will later drop this feature.

Gaps in the data

We already noticed that we have gaps in the data in two of our features. Let us take a closer look.

There is no visible pattern in the gaps, so they do not seem to be related to each other. This is what we would expect.
An explanation for the missing values for user_dob could be that users are not required to fill in this information.
The missing values for delivery_date, however, are more interesting.

It looks like the items without a delivery date were not returned.

That is very interesting: items without a delivery date were not returned. We have two possible explanations for that. 1) Delivery dates are recorded after the items have been delivered; the users canceled their orders before the items were shipped, which would not count as a return.
2) Delivery dates are the predicted dates of delivery, and the items cannot be delivered. What speaks against this is that the ordering system should not allow a user to order an item that cannot be delivered.

We will assume that the first explanation is right. Because we have a sound explanation for this rule and the data supports our assumption so clearly, we will use it in our model. We could either create a boolean feature delivered and hope that our model learns to apply the rule by itself, or we could hardcode it. In our model, we will use the latter approach, because the association is so strict.
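Hardcoding the rule could look like the following sketch, where model_probs is a made-up stand-in for the classifier's predicted return probabilities:

```python
import numpy as np
import pandas as pd

# Toy delivery dates (NaT = missing) and hypothetical model outputs.
delivery_date = pd.Series(pd.to_datetime(
    ["2016-01-05", None, "2016-02-01", None]))
model_probs = np.array([0.8, 0.9, 0.2, 0.6])
threshold = 0.5

# Threshold the probabilities, then force items without a delivery
# date to "not returned" (label 0), overriding the model.
preds = (model_probs >= threshold).astype(int)
preds[delivery_date.isna().to_numpy()] = 0
```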

Let us now take a step back and think about the information delivery_date gives us. The value of this feature seems to be determined after the customer submits an order. If we recall the use case of our prediction model, we quickly realize that we will not have this information at the time our model makes new predictions. Therefore, we are not able to use delivery_date as a feature, and a model that relies on it to predict our unknown data will not be applicable to the case it was designed for.

To deal with this problem, we will create a parameter in our model, which we call user_delivery_dates.

For now, we will use the feature, as it will likely improve the prediction of our unknown data. Furthermore, it is possible that we misunderstood the process by which the feature values are determined. The best way to resolve this confusion would be to ask our retail company about it.

For now, we will not fill the gaps in our data. The reason is that we will later drop both of the affected features. Before doing so, we will construct features that use them and fill the gaps in those derived features.

Further exploration

We can see that the average price of returned items is higher than that of items that were not returned.

There are items that have a price of 0.

Excluding one data point, all those items have the size label unsized.

Some users returned those free items. That would be very uncommon behavior if the items were actually free. We are not sure what to make of this, but we will create a boolean feature that tells us whether an item is labeled as free.
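Creating the flag is a one-liner; the item_price values here are made up:

```python
import pandas as pd

# Flag items with a price of 0 as "free".
df = pd.DataFrame({"item_price": [59.9, 0.0, 19.95, 0.0]})
df["is_free"] = df["item_price"] == 0
```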

Feature Engineering

First, we will create the following features.

If we have set user_delivery_dates, we will also create the following two features:

We will also create several features that are dependent on other data points and labels.
When doing so, one has to be very careful in order to prevent data leakage. When evaluating our model, the training data must not contain any information that is determined by the test data. Also, the features of our test data must not be computed using test data labels.
We will keep that in mind and construct these features differently for training and test data. Our transformer inherits from the scikit-learn class TransformerMixin, which enables us to make this distinction in a very elegant way.
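A minimal sketch of this pattern, using a hypothetical per-user return rate rather than the project's actual DataTransformer code: the label-dependent statistic is learned in fit() on the training data only and merely looked up in transform(), so test labels never leak into test features.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class UserReturnRate(BaseEstimator, TransformerMixin):
    """Illustrative label-dependent feature without leakage."""

    def fit(self, X, y):
        # Learn per-user return rates from training labels only.
        train = X.assign(returned=y)
        self.rates_ = train.groupby("user_id")["returned"].mean()
        self.default_ = y.mean()  # fallback for unseen users
        return self

    def transform(self, X):
        # Look up the learned rates; no labels are touched here.
        X = X.copy()
        X["user_return_rate"] = (
            X["user_id"].map(self.rates_).fillna(self.default_))
        return X

X_train = pd.DataFrame({"user_id": [1, 1, 2, 2]})
y_train = pd.Series([1, 0, 1, 1])
X_test = pd.DataFrame({"user_id": [1, 3]})  # user 3 is unseen

tf = UserReturnRate().fit(X_train, y_train)
out = tf.transform(X_test)
```

User 1 gets its training-set return rate; the unseen user 3 falls back to the overall training mean.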
We will create the following dependent features:

The following features will be computed using labels. Therefore, we will compute them differently for training and test data:

For now, we will regard the whole dataset as training data, as we want to first see our new features.

Dropping unnecessary features

We will drop the following features:

Our data now looks like this:

Because we used features with missing values to compute user_age and delivery_span, there are still gaps in the data. We will handle that in the following sections.
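For illustration, the two features could be derived from the raw dates like this (a sketch on toy data; note how a gap in delivery_date propagates into delivery_span):

```python
import pandas as pd

# Toy rows: the second order has no delivery date.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2016-01-01", "2016-03-10"]),
    "delivery_date": pd.to_datetime(["2016-01-04", pd.NaT]),
    "user_dob": pd.to_datetime(["1980-01-01", "1990-06-15"]),
})

# Days from order to delivery; missing delivery dates yield NaN.
df["delivery_span"] = (df["delivery_date"] - df["order_date"]).dt.days
# Approximate age in years at the time of the order.
df["user_age"] = (df["order_date"] - df["user_dob"]).dt.days // 365
```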

Visualization of new features

We can see that there are some outliers in some of the features. Also, some of the distributions are right-skewed.

Handling of gaps and outliers

We will fill the gaps in user_age and delivery_span with the median of their values in the training set. We also do not believe that our customers are younger than 16 or older than 100 years. Therefore, we will also assign those users the user_age median of the training data. We do the same with the negative values for delivery_span.
We will handle the other outliers with upper bounds: every value that exceeds an upper bound will be set to it. The upper bounds are as follows:
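A sketch of the imputation and capping on a toy user_age column (the values and the upper bound of 80 are illustrative, not the actual bounds):

```python
import numpy as np
import pandas as pd

# Toy ages including a gap, an implausibly high and a too-low value.
train_age = pd.Series([25.0, 34.0, np.nan, 47.0, 150.0, 8.0])

# Median computed on plausible training values only.
age_median = train_age[(train_age >= 16) & (train_age <= 100)].median()

cleaned = train_age.copy()
# Fill gaps and implausible ages (< 16 or > 100) with the median ...
cleaned[cleaned.isna() | (cleaned < 16) | (cleaned > 100)] = age_median
# ... then cap remaining outliers at an upper bound.
upper_bound = 80
cleaned = cleaned.clip(upper=upper_bound)
```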

Handling skewness

We will try to handle the skewness of the features item_price, order_num_items, user_orders, item_popularity and item_color_popularity by applying the function log(x+1) to them. The addition of 1 inside the log function is important, as some of the features have value 0.
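NumPy's log1p implements log(x + 1) directly and is numerically safe at zero:

```python
import numpy as np

# A right-skewed feature with zeros, e.g. prices of "free" items.
prices = np.array([0.0, 9.95, 49.9, 399.0])

# log(x + 1): zero maps to zero, large values are compressed.
transformed = np.log1p(prices)
```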

Correlations of features and target

Because we created dependent features, we want to train our transformer and then inspect the correlations on our test data. In order to do this, we split our data into a training and a test set.
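The split-then-inspect step can be sketched as follows (toy data; in the notebook, the transformer is fit on the training part before correlations are inspected on the test part):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with one feature and the target.
df = pd.DataFrame({
    "item_price": [10, 20, 30, 40, 50, 60, 70, 80],
    "return": [0, 0, 0, 1, 1, 1, 1, 1],
})

# Hold out a test part, then compute correlations on it.
train, test = train_test_split(df, test_size=0.25, random_state=42)
corr = test.corr()
```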

There is some correlation among our features (multicollinearity), which is generally undesirable. However, the correlations are not too extreme.
What is very good to see, on the other hand, is that some of our features have a fairly high correlation to our target value. They promise to be great predictors.

Modelling

Now that we have prepared our data, we want to find a good prediction model. We will find a model with optimal AUC and then search for the threshold that is expected to minimize cost. Before testing different estimators, we will always apply our transformations to prepare the data and then apply a StandardScaler to scale the data.

We will try Logistic Regression with elastic net regularization. We will perform a grid search to find the best values for the regularization parameter C and the lasso ratio l1_ratio.
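A sketch of such a grid search on synthetic data (the parameter grid is illustrative; elastic net requires the saga solver):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scale, then fit elastic-net logistic regression.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000),
)
grid = GridSearchCV(
    pipe,
    param_grid={
        "logisticregression__C": [0.1, 1.0],
        "logisticregression__l1_ratio": [0.0, 0.5, 1.0],
    },
    scoring="roc_auc",  # model selection by AUC, as in the text
    cv=3,
)
grid.fit(X, y)
```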

Now, we will try a Random Forest Classifier and search over the parameters n_estimators, criterion and max_depth.
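The analogous search for the Random Forest, again with an illustrative grid on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared feature matrix.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [50, 100],
        "criterion": ["gini", "entropy"],
        "max_depth": [4, 8],
    },
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
```

(Tree ensembles do not need the StandardScaler, so it is omitted here.)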

The AUC values for both models with every parameter combination are extremely similar. It is disappointing that we do not gain any improvement by model tuning. On the other hand, the models are very consistent.

We will decide to use Logistic Regression with its best parameters in our model.

Finding a prediction threshold that minimizes costs

We will use the function calculate_costs to calculate the cost of a binary prediction. Using it, we will perform a grid search over possible threshold values and find the threshold that minimizes our costs with the function find_optimal_threshold.
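The two functions can be sketched as follows. The cost function used here is a deliberately simplified placeholder (every misclassification costs half the item price); the actual per-item costs come from the assignment's cost matrix and are not reproduced here.

```python
import numpy as np

def calculate_costs(y_true, y_pred, prices, cost_fn):
    # Sum the cost of each (true label, prediction, price) triple.
    return sum(cost_fn(t, p, v) for t, p, v in zip(y_true, y_pred, prices))

def find_optimal_threshold(y_true, probs, prices, cost_fn):
    # Grid search over thresholds, keeping the cheapest one.
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = [
        calculate_costs(y_true, (probs >= t).astype(int), prices, cost_fn)
        for t in thresholds
    ]
    return thresholds[int(np.argmin(costs))]

# Placeholder cost: misclassifications cost half the item price v.
toy_cost = lambda t, p, v: 0.5 * v if t != p else 0.0

y_true = np.array([1, 0, 1, 0])
probs = np.array([0.9, 0.2, 0.7, 0.4])
prices = np.array([40.0, 40.0, 40.0, 40.0])
best_t = find_optimal_threshold(y_true, probs, prices, toy_cost)
```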

Now, we will use our defined functions. We will split our dataset 5 times in order to reduce variability. Our final threshold will be computed as the average of the 5 resulting thresholds.

Prediction of the unknown values

Now that we have our model and an optimal prediction threshold to minimize cost, we will train our model on all the known data points and make a prediction of the unknown values.

Conclusion

The prediction of item returns is a difficult task. We found a good way to minimize cost, but the average cost in our dataset is still nearly 9€. In the end, the decision to return an item often depends on the taste of the customer and other factors that we cannot observe. Still, our model should be helpful to our retail company.
There are some more interesting approaches to tackle the problem that we did not pursue. For example, it would be interesting to use a neural network as our estimator. Also, one could use a clustering algorithm on users or items to engineer better features. These approaches could improve the results further.