UX Planet

How to clean AB testing data before analysis

Nsky

Published in

UX Planet

3 min read · Oct 29, 2018

Once enough data has been collected and the A/B test is complete, you might think it's time for data analysis. Before we start analyzing, however, we need to ETL the data.

ETL stands for Extract, Transform, Load. In this article, we will talk about the Transform part, which in our case means cleaning. Collected data often contains outliers, and if we want to apply a parametric method to compare means, we need data that is free of extreme values, because such methods are sensitive to them.

Let's consider an experiment with an "average check" metric, where we want to find the mean of the check values. Outliers can have a significant influence on the mean. If a regular customer's check is about $30 but some customers typically spend $300, the mean will be skewed upward. One option is to find and remove the outliers manually. Sounds easy, right? The challenge comes when the dataset contains millions of observations.
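To see how strongly a few large checks can pull the mean, here is a minimal sketch in R. The numbers are made up for illustration (they are not from the experiment): 95 regular customers with a $30 check and 5 customers who spend $300.

```r
# Hypothetical data: 95 regular checks of $30 and 5 large checks of $300
checks <- c(rep(30, 95), rep(300, 5))

mean_all     <- mean(checks)                 # mean with the large checks included
mean_regular <- mean(checks[checks < 300])   # mean of the regular customers only
median_all   <- median(checks)               # the median is not moved by the large checks

print(c(mean_all = mean_all, mean_regular = mean_regular, median_all = median_all))
```

In this toy sample, 5% of the customers raise the mean from $30 to $43.50, while the median stays at $30.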

Clearly, this dataset contains many outliers. Most likely, these are customers who spend significantly more than the average customer, the so-called "whales". Such users may generate a significant share of revenue and should be analyzed separately, one reason being that their behavioral patterns are likely to differ. We'll cover this topic in future posts. So how do we stabilize such a dataset and clean out the outliers?

One option that comes to mind is to get rid of values that lie more than three standard deviations from the mean.

Let's delve a little into the theory. The three-sigma rule states that almost all values of a normally distributed random variable lie within (x̅ − 3σ; x̅ + 3σ): the value falls in this interval with probability 0.9973 (strictly, this holds when x̅ and σ are the true parameters rather than estimates obtained from a sample).
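The 0.9973 figure can be checked in R with the normal cumulative distribution function, and the rule itself can be written as a simple filter. This is only a sketch of the literal three-sigma rule: the helper name `three.sigma.rm` and the simulated data are assumptions made for illustration.

```r
# Probability that a normal variable falls within 3 standard deviations of its mean
p_within <- pnorm(3) - pnorm(-3)
print(round(p_within, 4))

# Literal three-sigma filter (hypothetical helper, not from the original article)
three.sigma.rm <- function(x) {
  x[abs(x - mean(x)) <= 3 * sd(x)]
}

set.seed(42)                                   # reproducible simulated data
checks  <- c(rnorm(1000, mean = 30, sd = 5), 300, 350, 400)
cleaned <- three.sigma.rm(checks)
print(c(before = length(checks), after = length(cleaned)))
```

Keep in mind that the mean and standard deviation used by this filter are themselves inflated by the outliers, so on heavily contaminated data the interval can become too wide.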

Graph of the probability density of a normal distribution

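We can use the following R function to exclude such values. Note that it uses a quantile-based variant of the rule rather than the mean and standard deviation themselves: it keeps values within 2 × IQR of the midpoint between the first and third quartiles, which is the same as Tukey's fences (Q1 − 1.5 × IQR; Q3 + 1.5 × IQR). For normally distributed data this corresponds to roughly ±2.7σ around the mean, and the quartiles, unlike the mean and standard deviation, are not distorted by the outliers.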

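outliers.rm <- function(x) {
  # Midpoint between the first and third quartiles (a robust estimate of the center)
  center <- (quantile(x, 0.25) + quantile(x, 0.75)) / 2
  # Keep only the observations within 2 * IQR of that midpoint
  x[abs(x - center) <= 2 * IQR(x)]
}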

After applying this function, you will have removed a significant number of outliers from your dataset. Note, though, that you should use this method very carefully, because it can remove too much of your sample: 30% or even more.
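Before discarding anything, it helps to measure how much of the sample the filter would remove. The sketch below repeats the function from above so that it runs on its own and applies it to simulated, right-skewed data; the log-normal data and the name `share_removed` are assumptions for illustration.

```r
# Same rule as outliers.rm above, repeated here so the example is self-contained
outliers.rm <- function(x) {
  center <- (quantile(x, 0.25) + quantile(x, 0.75)) / 2
  x[abs(x - center) <= 2 * IQR(x)]
}

set.seed(1)
checks <- rlnorm(10000, meanlog = log(30), sdlog = 1)   # hypothetical skewed checks

cleaned       <- outliers.rm(checks)
share_removed <- 1 - length(cleaned) / length(checks)   # fraction of observations dropped
print(round(share_removed, 3))
```

If the share turns out to be large, the "outliers" are probably a real part of the distribution, and a transformation or a rank-based test may be a better choice than deletion.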

There are many more data cleaning approaches. In this article, we have reviewed only one of them, which is simple and quite popular.

In practice, the choice of a data cleaning method depends on the quality and nature of the data. A couple of alternatives that do not discard observations are worth mentioning: the Box-Cox transformation, and rank-based methods for comparing samples, such as the Kruskal-Wallis test, which are often used to compare medians. We will cover other data cleaning methods in upcoming articles.
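As a rough illustration of these two alternatives, the sketch below compares two simulated groups without removing any observations. The data, the helper `box_cox`, and the fixed choice of lambda = 0 (the log transform) are assumptions for the example; in practice lambda is usually estimated from the data, for instance with `boxcox` from the MASS package.

```r
set.seed(7)
# Hypothetical average-check samples for the control (A) and test (B) groups
a <- rlnorm(500, meanlog = log(30), sdlog = 0.8)
b <- rlnorm(500, meanlog = log(33), sdlog = 0.8)

# Box-Cox transformation with a fixed lambda; lambda = 0 is the log transform
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
t_res <- t.test(box_cox(a, 0), box_cox(b, 0))   # parametric test on transformed data

# Kruskal-Wallis rank test on the raw data, with no outlier removal
values <- c(a, b)
groups <- factor(rep(c("A", "B"), each = 500))
kw_res <- kruskal.test(values, groups)

print(c(t_test_p = t_res$p.value, kruskal_p = kw_res$p.value))
```

Both approaches keep every observation, including the "whales", and reduce their influence on the comparison instead of deleting them.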

Originally published at awsmd.com.
