1. Background
Count data is everywhere. It sounds easy to deal with: just non-negative integers, nothing special. If you think so, you will probably end up handling it wrong. How so? This blog provides some tips for working with count data in Machine Learning (ML), to help you avoid common mistakes you may never have noticed before.
Let’s start with a simple question. Suppose we are building ML models around movie watching, and one field is “count of cartoon movies the user watched in the last 6 months”. Since everyone has a different taste in movies, we see values like 0, 1, 2, 3, …, 101 (yes, the user who watched 101 movies must be a huge fan). Now here’s the question for you:
What statistical distribution might this count data follow?
If your answer is “normal distribution” or “I don’t know”, then congrats! I am sure this blog will help you.
If you don’t have much statistics knowledge, that’s fine. This blog aims to provide *hands-on* ML techniques, though I also include some statistical details for readers who are curious.
First, let’s look at a plot and a data summary of this count data (a toy dataset created for demonstration). As we know, the fastest way to dive into a brand new dataset is to make some plots. So here we go:
import numpy as np
import pandas as pd
import plotly.express as px

# Toy data: 201 zeros, 50 ones, 40 twos, plus some random larger counts
count = np.concatenate((np.zeros(201), np.repeat(1, 50), np.repeat(2, 40)))
np.random.seed(1)
a = np.random.choice(range(3, 60), 90)    # 90 counts between 3 and 59
b = np.random.choice(range(61, 100), 20)  # 20 counts between 61 and 99
df = pd.DataFrame(np.concatenate((count, a, b, [101])), columns=['val'])

fig = px.histogram(df, x='val', nbins=200)
fig.show()
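Since we also want a data summary (not just the plot), a quick way to get one is a small addition on top of the snippet above:
# Quick numeric summary plus the share of zeros, to complement the histogram
print(df['val'].describe())
print('share of zeros:', (df['val'] == 0).mean())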
Now I am sure you would no longer assume it’s a normal distribution. You have probably noticed that the data is highly skewed, with a long tail on the right and a lot of 0s. In fact, half of the values are 0, which means half of the users did not watch any cartoon movies in the last 6 months. You may start to agree with me (if you didn’t before) that this kind of data needs to be treated properly in your ML model. This type of data has a specific name in statistics: zero-inflated, meaning the distribution of the count data is sparse with many 0s.
Next, in section 2 we will discuss the common mistakes people make when using skewed count data as a modeling feature (i.e. independent variable). Then in section 3, we’ll look at the caveats when predicting the skewed count data itself is the goal (i.e. it is the response variable). If you are ready, let’s start!
2. Skewed count data as an independent variable
Common mistake 1: Remove outliers based on data’s statistics
We all know that the first thing to do after data exploration is data cleaning, and outlier removal is an important part of it. But please be careful when you apply common outlier removal methods such as the Standard Deviation Method and the Interquartile Range Method. As you will see below, the statistics these methods rely on (mean, standard deviation, quartiles) are heavily distorted when half of the values are 0 and the rest form a long right tail, so the resulting cut-offs flag the wrong points as outliers.
Mistake 1: use the Standard Deviation Method -> the mean and the standard deviation are both inflated by the long right tail, so the cut-offs become almost meaningless: the lower bound falls well below 0 (impossible for counts), and the only points flagged are the very largest counts, i.e. heavy-but-legitimate users such as our 101-movie fan.
# Mistake 1: Standard Deviation Method
# calculate summary statistics
data_mean, data_std = np.mean(df.val), np.std(df.val)
# calculate the outlier cut-offs (mean +/- 3 standard deviations)
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
print('lower and upper cut-offs:', lower, upper)
# identify outliers
outliers = [x for x in df.val if x < lower or x > upper]
print('the calculated percentage of outliers is:', 100*len(outliers)/df.shape[0])
Mistake 2: use the Interquartile Range Method -> the calculated percentage of outliers is 19.7%. Nearly a fifth of the data gets flagged, which is far too many: because so many values are 0, the quartiles sit very low and the upper fence ends up well inside the range of ordinary counts.
# Mistake 2: Interquartile Range Method
# calculate interquartile range
q25, q75 = np.percentile(df.val, 25), np.percentile(df.val, 75)
iqr = q75 - q25
# calculate the outlier cutoff
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off
# identify outliers
outliers = [x for x in df.val if x < lower or x > upper]
print('the calculated percentage of outliers is:',100*len(outliers)/df.shape[0])
Solution:
Be cautious with methods that are based on the data’s distribution statistics such as the mean, standard deviation, or quartiles. Consider a method like percentile capping instead: cap (or remove) the points that are larger than a high percentile threshold such as the 99th, as sketched below.
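As an illustration, here is a minimal sketch of percentile capping on the toy data; the 99th percentile is just an example threshold, so pick one that fits your use case:
# Percentile capping: cap (or drop) values above the 99th percentile
upper_cap = np.percentile(df['val'], 99)
df['val_capped'] = df['val'].clip(upper=upper_cap)   # cap the extreme values
# or, if you prefer to remove them instead of capping:
df_trimmed = df[df['val'] <= upper_cap]
print('points above the cap:', int((df['val'] > upper_cap).sum()))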
Common mistake 2: throw skewed count data into linear regression without doing anything
When building linear regression models, people often ignore the basic assumptions behind OLS, for example that the errors/residuals, after modeling with all the independent variables, should follow a normal distribution with a mean of 0. If you do nothing about the skewed count data, the OLS model may not give you reliable results.
Some statistics behind this, if you are interested: independent variables do not need to be normally distributed for linear regression, so why should you still care whether they are skewed? The reasons are: 1) skewed data is very likely to violate the assumptions mentioned above (normality of the residuals, homoscedasticity) and hurt model accuracy, although it is hard to tell before you actually plot the residuals; 2) heavy tails increase the chance of high-leverage points, which can hurt the performance of your regression model; 3) it is very common to compute t-statistics and p-values after running a linear regression, in order to test whether an estimated coefficient is significantly different from 0, and for that inference to be valid, the distribution of the estimated coefficients has to be (approximately) normal.
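If you want to see whether the residual assumption holds on your own data, here is a minimal sketch using statsmodels; the target y below is a random placeholder just to keep the snippet self-contained, so substitute your real response variable:
import statsmodels.api as sm

# Fit an OLS model and inspect the residuals
X = sm.add_constant(df[['val']])      # the skewed count feature from above
y = np.random.normal(size=len(df))    # placeholder target - use your real y here
ols_result = sm.OLS(y, X).fit()

# The residuals should look roughly normal and centered at 0
resid = pd.DataFrame({'resid': ols_result.resid})
px.histogram(resid, x='resid', nbins=50).show()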
So what should you do? Some people may think of data transformation methods. That is the right direction, but be sure to avoid the next mistake:
Common mistake 3: use Box-Cox method for data transformation
Box-Cox is a great way to transform data, but it should not be applied here, again because of the assumption it makes: the data has to be strictly positive. With the large number of 0s in our data, Box-Cox cannot be used directly.
Solutions for data transformation:
1) Add a small constant (I would suggest 0.5) to all the values in this field and apply a log transformation. Why add 0.5? Because the log of 0 is undefined.
2) Do a square root transformation by taking the square root of the values.
Both methods can help with the skewness, as shown in the output below.
print('skewness of raw data:', df['val'].skew())
# log transformation
df['log_trans'] = np.log(df.val + 0.5)
print('skewness of log-transformed data:', df['log_trans'].skew())
# square root transformation
df['sqr_trans'] = np.sqrt(df.val)
print('skewness of square root transformed data:', df['sqr_trans'].skew())
Additional tips:
- If your data is really sparse, for example more than 60% of the values are 0s, I would suggest converting it into a binary variable:
df['binary'] = np.where(df.val>0, 1, 0)
- If you are using non-linear ML methods such as tree-based models, you may not need to worry too much about skewed data. However, linear regression models are widely used in industry for multiple reasons such as interpretability, regulatory requirements, etc.
3. Skewed count data as a response variable
When skewed count (zero-inflated) data is the response variable, we face a few similar issues such as outlier removal, and you can refer to section 2 for the solutions. When it comes to modeling the skewed count data, we want to avoid the following mistake:
Common mistake 4: treat the data as a generic continuous response variable
Many people fit the data with linear regression (OLS) and find that the model performance is poor. This is understandable: the data does NOT come from a normal distribution (as discussed in section 1), so OLS will not estimate it well (even though it is technically valid to run). To get better performance, we need to understand the data distribution under the hood. This brings us back to the question from the beginning: “What statistical distribution might this count data follow?” The answer is:
Rather than describing it with a single distribution, it is better to describe it as a combination of a binomial process (watched or not) and a Poisson distribution (how many, among those who watched any).
Theory suggests that the excess zeros are generated by a separate process from the count values, i.e. the positive values (ref 1). Therefore, we can think of the data as having two parts: the first part is a binary variable indicating whether the value is 0 or not, and in the second part the positive counts come from a Poisson distribution.
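To make the two-process idea concrete, here is a small simulation sketch; the 50% zero-probability and the Poisson mean of 5 are arbitrary numbers chosen only for illustration:
# Zero-inflated counts simulated as two processes:
# 1) a Bernoulli draw decides whether the user watches at all,
# 2) a Poisson draw generates the count for those who do.
rng = np.random.default_rng(1)
watches_at_all = rng.binomial(n=1, p=0.5, size=1000)   # binary part
counts_if_watched = rng.poisson(lam=5, size=1000)      # count part
simulated = watches_at_all * counts_if_watched
print('share of zeros in the simulated data:', np.mean(simulated == 0))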
There are two ways you could try to fit the model:
3.1 Machine Learning method
If you have some 0s (<30%):
You could try fitting the data with a single model using the ZeroInflatedRegressor from the Python library “scikit-lego” (ref 2), as sketched below.
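Here is a minimal sketch of what that could look like, assuming scikit-lego is installed; the feature matrix X is a random placeholder for your real features, and the tree-based sub-models are just one possible choice (see ref 2 for details):
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
from sklego.meta import ZeroInflatedRegressor

X = np.random.normal(size=(df.shape[0], 3))   # placeholder features for illustration
y = df['val'].values                          # the zero-inflated count target

zir = ZeroInflatedRegressor(
    classifier=ExtraTreesClassifier(random_state=0),  # predicts whether the count is zero
    regressor=ExtraTreesRegressor(random_state=0),    # predicts the count when it is non-zero
)
zir.fit(X, y)
preds = zir.predict(X)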
If you have excess 0s:
Though you could still use the scikit-lego approach mentioned above, I would personally be a bit concerned with so many zeros, since it indicates an obvious segmentation/separation in the data. As discussed, the data comes from two processes, so I would suggest building two ML models for this type of data (a sketch follows the list):
- The first model predicts whether the value equals 0 or not. In this blog’s example, we use df['binary'] created in section 2 as the response variable for this step.
- In the second step, we filter out all the 0s and develop a model for the values larger than zero.
In our “watching cartoon movies” example, you can think of this process as a kind of user segmentation: 1) predict whether a user watched any cartoon movies, then 2) focus on the users who did and predict how many they watched.
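A rough sketch of this two-step setup with scikit-learn is below; X is again a random placeholder for your real features, df['binary'] is the variable created in section 2, and in practice you would use proper train/test splits and tuned models:
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

X = np.random.normal(size=(df.shape[0], 3))   # placeholder features for illustration
y = df['val'].values

# Step 1: classify zero vs. non-zero (this is df['binary'] from section 2)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, df['binary'])

# Step 2: regress the count on the non-zero subset only
mask = y > 0
reg = GradientBoostingRegressor(random_state=0)
reg.fit(X[mask], y[mask])

# Combine: keep the regression output only where the classifier predicts a non-zero count
pred = clf.predict(X) * reg.predict(X)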
3.2 Statistical modeling
There are many studies on zero-inflated data. Here are a few models you could try (ref 1); a minimal fitting sketch follows the list:
- Zero-inflated Poisson Regression – models the excess zeros with a binary (logit) component and the counts with a Poisson component.
- Zero-inflated Negative Binomial Regression – Negative binomial regression does better with overdispersed data, i.e. variance much larger than the mean.
- Ordinary Count Models – Poisson or negative binomial models might be more appropriate if there are no excess zeros.
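As an example, here is a minimal sketch of fitting a zero-inflated Poisson model with statsmodels; the feature matrix is again a random placeholder, and you should check the statsmodels documentation for the full interface:
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

X = sm.add_constant(np.random.normal(size=(df.shape[0], 2)))   # placeholder features
y = df['val'].astype(int).values

# exog models the Poisson (count) part; exog_infl models the zero-inflation part
zip_model = ZeroInflatedPoisson(endog=y, exog=X, exog_infl=np.ones((len(y), 1)),
                                inflation='logit')
zip_result = zip_model.fit(maxiter=200)
print(zip_result.summary())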
4. Summary
Count data is common, but it can be difficult to handle when there are excess zeros. This blog walked through four common mistakes when cleaning, transforming, and modeling skewed count data. I hope it helps you feel more confident in your machine learning journey!
Note: the positive values I created in this example do not necessarily follow a Poisson distribution, since this is just toy data for demonstration purposes. In practice, this kind of positive count data is typically modeled with a Poisson distribution.
References
1. https://stats.idre.ucla.edu/r/dae/zip/
2. https://scikit-lego.netlify.app/meta.html