Most Data Science (DS) projects have a clear goal to achieve, such as developing a supervised or unsupervised machine learning (ML) model, or testing a hypothesis. However, it's also not uncommon to be handed a dataset for exploration without any specific analysis goal defined. This happens when you attend a datathon, or join study groups like Meetup groups. The dataset is provided to spark creative DS ideas: how many insights can you find, and what story can you tell with it? In my opinion, these tasks are harder than the ones with a predefined DS question. It's easy to get stuck after you clean the data and make a bunch of plots to check the data distributions, because then the critical questions begin:

What else can I do with data visualization?

What perspectives should I take to dig into the data for storytelling and valuable insights?

Asking these questions is the first step toward structured thinking for advanced EDA (Exploratory Data Analysis) as a data scientist. Structured thinking requires not only brainstorming but also sound logic. In this blog, I share a method you can follow for advanced EDA beyond data cleaning and distribution checking, plus tips for dealing with these open-ended DS projects, based on a practice dataset and Jupyter notebook (available on my GitHub).

1. Overview of structured thinking for advanced EDA

What is structured thinking?

Structured thinking is a process of putting a framework around an unstructured problem. Having a structure not only helps an analyst understand the problem at a macro level, it also helps identify areas that require deeper understanding. (Ref. 1)

What is "advanced EDA"?

For DS projects with a clearly defined goal, EDA is to help understand the issues and distributions of the data, resulting in data cleaning and pre-processing for a model-ready dataset. For DS projects without a clearly defined goal, advanced EDA includes extra steps of plotting hidden correlations within the data and in-depth analysis with a specific perspective.

Let’s get started with the method!

The 5-step method is summarized below, and I will illustrate the details in the following sections using the practice dataset. Please note that this method has limitations and may not apply to every dataset; see the discussion in the last section of this blog.

  • Step 1: Preliminary EDA (univariate) on individual variables

    Goal: use the data dictionary to understand what each variable means, check the distributions, and perform data cleaning and pre-processing to deal with outliers, missing values, etc.

  • Step 2: Categorize variables by characteristics

    Goal: based on how the data was collected, group the variables into context variables (variables that represent characteristics, like age, location, and gender) and dynamic variables (variables observed/monitored during data collection, such as account balance, number of transactions, or game results).

  • Step 3: Explore correlations between variables

    Goal: understand how variables are related to each other by creating plots that show the correlations, segmenting the observations, creating new variables, etc.

  • Step 4: “Zoom In” analysis

    Goal: pick one subset for in-depth analysis by comparing it to the whole population. The subset could be one observation (e.g. one object vs. all) or one category (e.g. organic food vs. all kinds of food).

  • Step 5: Discover modeling potential

    Goal: investigate the potential for data modeling, and provide analytical results with statistical methods and/or machine learning techniques. For example, look for a variable that can serve as the response variable for predictive modeling, or perform clustering to understand the correlations hidden underneath.

2. Introduction to the practice dataset

To show how this method works and how to tell a story along with the analysis, an example DS project is created with a practice dataset of a marathon race. The data dictionary is shown below:

  • Place: The order in which each racer finished relative to racers of the same gender
  • Num: Racer’s bib number
  • Name: Name of the racer
  • Ag: Age of the racer
  • Hometown: Hometown of the racer. For domestic racers, it’s “City, State”, for foreign racers, it’s “Country, .”
  • Gender: Racer’s gender
  • div: Age division; each division comprises racers of the same gender and age group
  • gun_s: Gun time (minutes), the elapsed time from the formal start of the race to when the racer crossed the finish line
  • net_s: Net time (minutes), the elapsed time from when the racer crossed the starting line to when the racer crossed the finish line
  • pace_s: Racer’s average time (minutes) per mile during this race

Here’s a view of the first few rows of this dataset:
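Since the table itself isn't reproduced here, a tiny made-up sample matching the data dictionary gives the flavor (all values below are hypothetical, not actual rows from the race data):

```python
import pandas as pd

# Hypothetical sample rows following the data dictionary (values are made up)
df = pd.DataFrame({
    "Place": [1, 2, 1],
    "Num": [101, 102, 103],
    "Name": ["Runner A", "Runner B", "Runner C"],
    "Ag": [28, 34, 41],
    "Hometown": ["Denver, CO", "Kenya, .", "Austin, TX"],
    "Gender": ["M", "M", "F"],
    "div": ["20-29", "30-39", "40-49"],
    "gun_s": [38.5, 41.2, 55.0],
    "net_s": [38.1, 40.6, 52.9],
    "pace_s": [6.1, 6.5, 8.5],
})
print(df.head())
```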

It looks pretty simple, doesn't it? I am pretty sure it's not difficult to check the distribution of each column. We already mentioned some basic descriptive plots to make, such as density plots and box plots for numerical variables (e.g. pace, gun time), and bar charts and pie charts for categorical variables (e.g. gender, division).

Now let's dive into the dataset and see how to apply the method step by step!

3. Advanced EDA by steps

3.1 Preliminary EDA (univariate)

Most people are already familiar with this step, which usually involves data cleaning. There are many tutorials for this part, so I am not going to say much about it. The goal of this step is to produce a cleaned dataset from which you feel ready to start the exciting part: digging for insights! If you are interested in what I did for the data cleaning, please check the Jupyter notebook on GitHub named "part1_data_cleaning".
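As a rough illustration of what this step typically looks like (a minimal sketch on made-up data with common issues; column names follow the data dictionary, but the actual cleaning is in the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with typical issues: stringly-typed numbers,
# inconsistent category labels, missing values
raw = pd.DataFrame({
    "Ag": ["28", "34", None, "41"],
    "gun_s": [38.5, 41.2, 47.0, np.nan],
    "net_s": [38.1, 40.6, 46.2, 52.9],
    "Gender": ["M", "m", "F", "F"],
})

df = raw.copy()
df["Ag"] = pd.to_numeric(df["Ag"], errors="coerce")  # force age to numeric
df["Gender"] = df["Gender"].str.upper()              # normalize category labels
df = df.dropna(subset=["Ag", "gun_s"])               # drop rows missing key fields
print(df)
```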

3.2 Grouping variables

The second step does not take much effort in coding, but a lot in critical thinking. After we get a general idea of the data, we want to group the variables into context variables (variables that represent characteristics) and dynamic variables (variables observed during data collection). Why this step? Because:

  • To prepare for Step 3 where we create plots to see how the two groups of variables interact with each other.
  • To pick a story-telling perspective (i.e. pick subset from a context variable) for “Zoom in” analysis in Step 4
  • To discover the potential of using the context variables for clustering analysis and racers segmentation modeling in Step 5
  • To look for a response variable among the dynamic variables when exploring modeling potential in Step 5

To group the variables, we need to think about how the data was collected and what gives the data its structure. For this example dataset, the table below shows the grouping result: the race results (e.g. net time, gun time) were collected during the race itself, and the other features are characteristics of the racers:
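One plausible way to express this split in code (the blog's grouping table isn't reproduced here, so the exact assignment below is my reading of it; an identifier like Num could reasonably be left out of both groups):

```python
# Context vs. dynamic variables, expressed as a simple mapping
variable_groups = {
    "context": ["Name", "Ag", "Hometown", "Gender", "div"],  # racer characteristics
    "dynamic": ["Place", "gun_s", "net_s", "pace_s"],        # race results
}

# Later steps can pull out one group at a time, e.g. candidate response variables
dynamic_cols = variable_groups["dynamic"]
print(dynamic_cols)
```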

3.3 Explore interactions between variables

An interesting data story always includes an exploration of how variables interact with each other. When the dataset has many variables, there are so many ways to look at the data that it may be difficult to know how to efficiently explore it for the most valuable insights. That's why we grouped the variables in Step 2: now we can use some techniques to efficiently find useful correlations among the variables. With context variables and dynamic variables, there are 3 combinations to consider when exploring the interactions between them, and each combination offers a different perspective for data exploration:

Let’s try it with the practice dataset:

1 Context variable vs. context variable

In section 3.1 (Preliminary EDA) above, we already gained a basic understanding of the racers' characteristics from their distributions and statistics. The plot below shows some statistics about gender and age.

To find further correlations between gender and age, we can use a side-by-side plot showing the counts in the different divisions (based on age). Some interesting insights appear in the plot below: most racers are aged 30-49 (61.9%) for both males and females, and female racers are outnumbered by male racers in all divisions except 20-29 and 30-39.

import plotly.express as px

# Side-by-side histogram of racer counts per division, split by gender
fig = px.histogram(df, x="div", color="gender",
                   hover_data=df.columns, opacity=0.8)
fig.update_xaxes(title="Division")
fig.update_xaxes(categoryorder='array',
                 categoryarray=['0-14', '15-19', '20-29', '30-39', '40-49',
                                '50-59', '60-69', '70-79', '80-89'])
fig.update_yaxes(title="Count of Racers")
fig.update_layout(barmode='group')  # group the gender bars instead of stacking
fig.show()

Similarly, we could plot other context variables against each other. For example, use gender and hometown to draw a map and see whether male and female racers are evenly distributed across the country.

2 Dynamic variable vs. dynamic variable

Dynamic variables are often used as the response variable in ML models. They are the variables we'd like to study to answer a real-world question. For example, banks analyze customers' account balances to predict whether the customers will churn, and stores analyze customers' shopping records to decide which products to recommend. Here, the account balances and shopping records are the dynamic variables we want to investigate. In the example data, the dynamic variables are the race results (net time, gun time, pace, and place). They are the key variables we will work with to tell a story about the race.

Exploring data should not be limited to the variables provided; we can also create new variables, just as we do in feature engineering. The race results include two timed variables: gun time and net time. From the data description, we know that the gun time is the net time plus the time spent crossing the starting line after the gun was fired. Therefore, we can use this information to create a new variable, which I call "cross_t", for the difference between gun time and net time.
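The feature itself is a one-liner (shown here on a tiny made-up frame; in the notebook it would be applied to the cleaned dataset):

```python
import pandas as pd

# cross_t = gun time minus net time, i.e. minutes spent reaching the start line
df = pd.DataFrame({"gun_s": [40.0, 45.5], "net_s": [38.8, 41.9]})
df["cross_t"] = df["gun_s"] - df["net_s"]
print(df)
```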

Why is this new variable interesting? We need to use some knowledge outside the data itself. Thinking about the situation in which the data was collected is a good technique for discovering a new angle on the data: we tell a good story by asking good questions first. Before the marathon began, racers stood in line waiting for the gun to fire. Did racers who were ambitious to win care whether they stood close to the start line? How would that affect their race results? To answer these questions, I plotted cross time vs. net time to look at the fastest racers:

import plotly.express as px
fig = px.scatter(df, x="cross_t", y="net_s", color='gender',opacity=0.5)
fig.update_xaxes(title="Cross Time (minutes)")
fig.update_yaxes(title="Net Time (minutes)")

fig.show()

Insights for story-telling:

  • In general, racers who had shorter cross time also had shorter net time
  • Most of the fastest racers (net time < 40 min.) also spent less time to cross the start line after the gun was fired (they took the race seriously from the beginning!)
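The first insight can also be quantified; a hedged sketch with hypothetical arrays standing in for the notebook's df["cross_t"] and df["net_s"] columns:

```python
import numpy as np

# Hypothetical cross-time and net-time values (minutes); in the notebook
# these would come from the real columns
cross_t = np.array([0.5, 1.0, 2.0, 3.5, 5.0])
net_s = np.array([37.0, 40.0, 48.0, 55.0, 62.0])

# Pearson correlation: a value near 1 matches "shorter cross time,
# shorter net time"
r = np.corrcoef(cross_t, net_s)[0, 1]
print(round(r, 3))
```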

3 Context variable vs. dynamic variable

By putting context variables and dynamic variables together, we can see how the racers' characteristics affected their results. Notice that this is different from the basic analysis in Step 1 (preliminary EDA): now we are evaluating the race results within different subgroups of the population.

Let's first start with the gun time results and the gender groups; we can check how they interact with a side-by-side box plot:

import plotly.express as px

fig = px.box(df, y='gun_s', x='gender', color="gender", points="all",
             notched=False,  # standard (non-notched) boxes
             title="Box plot of gun time")
fig.update_xaxes(title="Gender")
fig.update_yaxes(title="Gun time (minutes)")
fig.update_traces(orientation='v')  # vertical box plots
fig.show()

Similarly, we could do the same plot for pace vs. gender:

Insights for story-telling: the male group was faster than the female group.

We can also check the net time differences in division groups:

fig = px.box(df, x="div", y="net_s", color="gender")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.update_xaxes(categoryorder='array', categoryarray= ['0-14','15-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89'])
fig.update_xaxes(title="Division")
fig.update_yaxes(title="Net time (minutes)")
fig.show()

Insights for story-telling: racers in the 15-19 age group had the best net time records, for both females and males. Also, the male group had better results than the female group in every division.

Remember the new variable "cross_t" we created in the last section? I cannot wait to see how it interacts with the context variables! But before that, let me plot the gun time and net time together for the two gender groups:

import matplotlib.pyplot as plt
import seaborn as sns

# df_f and df_m are the female and male subsets of the cleaned data
plt.figure(figsize=(12, 6))
sns.kdeplot(df_f.loc[:, 'gun_s'], color='orange', linestyle='--',
            label='Female gun time (min)', linewidth=2)
sns.kdeplot(df_m.loc[:, 'gun_s'], color='green', linestyle='--',
            label='Male gun time (min)', linewidth=2)
sns.kdeplot(df_f.loc[:, 'net_s'], color='orange',
            label='Female net time (min)', linewidth=2)
sns.kdeplot(df_m.loc[:, 'net_s'], color='green',
            label='Male net time (min)', linewidth=2)
plt.legend(loc=1, prop={'size': 10})
plt.xlabel('Time (minutes)', fontsize=16)
plt.ylabel('Density', fontsize=16)

Here I notice that the density curve of the male gun time is "fatter" (flatter) than that of the female gun time, which makes me curious about what happened between the gun firing and the racers crossing the start line. So I plotted the density of the new variable "cross_t" by gender:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.kdeplot(df.loc[df.gender == 'F', 'cross_t'], label='Female',
            color='orange', linewidth=2)
sns.kdeplot(df.loc[df.gender == 'M', 'cross_t'], label='Male',
            color='green', linewidth=2)
plt.xlabel('Cross Time (minutes)', fontsize=16)
plt.ylabel('Density', fontsize=16)
plt.title('Distribution of cross time in minutes', fontsize=18)
plt.legend(loc=1, prop={'size': 10})

First, this plot is interesting because of its 3 peaks. Also, if we look at the mark at ~2.2 minutes, we can see a pattern: at the beginning, the density line of the male group is above that of the female group, but the situation reverses after 2.2 minutes. So I did some simple calculations to extract the insights for story-telling:

  • Most racers have a cross time of around 1.3, 3.6, or 5.2 minutes, regardless of gender
  • Male racers were more aggressive at the beginning: 75.5% of female racers spent more than 2.2 minutes crossing the start line, compared to 54.7% of the male racers
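The "simple calculation" behind the second bullet can be sketched as a groupby (shown on made-up data, so the shares below won't match the real 75.5%/54.7%):

```python
import pandas as pd

# Hypothetical data; in the notebook this would be the full cleaned df
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "cross_t": [2.5, 3.0, 1.0, 4.1, 1.5, 2.0, 3.0, 1.8],
})

# Share of racers in each gender group who took more than 2.2 minutes
# to cross the start line
share_slow = df.groupby("gender")["cross_t"].apply(lambda s: (s > 2.2).mean())
print(share_slow)
```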

Other thoughts before we go to the next step:

  • In addition to the plots mentioned above, we could draw maps with the "hometown" variable and see how it affects the race results
  • Also, it is always nice to draw a correlation plot (e.g. a heatmap) for all numerical variables. Doing so can help you avoid collinearity if you decide to fit a model in Step 5
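A minimal sketch of that correlation check (hypothetical values; in the notebook this would run on the real numeric columns, with seaborn for the heatmap):

```python
import pandas as pd

# Hypothetical numeric columns standing in for the cleaned dataset
df = pd.DataFrame({
    "gun_s": [40.0, 45.5, 50.1, 62.3],
    "net_s": [38.8, 41.9, 48.0, 60.0],
    "pace_s": [6.2, 6.7, 7.7, 9.6],
})

corr = df.corr()  # pairwise Pearson correlations
print(corr.round(2))
# sns.heatmap(corr, annot=True)  # with seaborn, to visualize the matrix
```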

3.4 “Zoom In” analysis

After Step 3 in the last section, we already have some good insights from the dataset. How can we investigate further? The "Zoom In" analysis is a useful trick when digging deeper into the data to tell a story. "Zoom In" means we pick a subset of the data or a group from the population, and define an interesting question to analyze. So far we have been focusing on the whole population (i.e., all racers). Here, I pick a racer (Chris Doe) who is a middling performer and try to answer the question: "how much time separated Chris from the top racers?" Here's some basic information about this racer:

Then I did some calculations and created visualizations to compare his results to the top 10% of racers in his division:

import plotly.express as px

# df_cd: racers in Chris's division
# perc: net time at the top-10% boundary of the division (computed earlier)
# df_top: the top-10% racers of the division
fig = px.histogram(df_cd, x="net_s")
chris_time = df_m.loc[df_m.Name == "Chris Doe", 'net_s'].iloc[0]
fig.add_vline(x=chris_time, line_width=3, line_dash="dash", line_color="brown")
fig.add_vline(x=perc, line_width=3, line_dash="dash", line_color="yellow")
fig.add_vline(x=df_top.net_s.mean(), line_width=3, line_dash="dash", line_color="red")
fig.update_yaxes(title="Count of racers")
fig.update_xaxes(title="Net time (minutes)")
fig.show()

Insights for story-telling:

  • Chris needs to reduce his net time by 8 minutes to enter the top 10% racers group!
  • Chris’s net time is 11.6 minutes more than the average net time in the top 10% racers group
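The gap-to-the-top calculation behind these numbers can be sketched with a percentile (the division times and Chris's time below are made up, so the resulting gap is illustrative, not the blog's 8 minutes):

```python
import numpy as np

# Hypothetical net times for Chris's division; the notebook uses the real
# df_cd / df_top / perc values
division_times = np.array([36.0, 38.5, 40.0, 42.0, 44.0,
                           45.5, 47.0, 49.0, 52.0, 55.0])
chris_time = 48.0

top10_cutoff = np.percentile(division_times, 10)  # top-10% boundary (lower is better)
gap = chris_time - top10_cutoff                   # minutes Chris would need to shave off
print(round(gap, 2))
```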

Wow! It seems there's a long way to go to reach the top racers! Hope you do better next time, Chris!

3.5 Discover modeling potential with machine learning

By following the steps above, we provided insights from different perspectives, and I believe we gained a better understanding of the data and of the race itself. Now it's time to brainstorm and find opportunities to create a DS model from this dataset with machine learning (ML) techniques.

The first thing that came to my mind is hierarchical modeling with net time as the response variable. When plotting divisions vs. gender in section 3.3, I noticed that the number of racers in each division varies a lot: some divisions have over 300 racers while others have below 50. This suggests good potential for a hierarchical model that treats division and gender as random effects. We could add other variables (e.g. cross time) as independent variables for fixed effects. So I started with a simple two-way ANOVA (analysis of variance) to test the idea.

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Get an ANOVA table as R-like output
# Drop div 80-89 because only the male group has a racer in this range
df1 = df.copy().loc[df.loc[:, 'div'] != '80-89', :]
# Fit an Ordinary Least Squares (OLS) model with an interaction term
model = ols('net_s ~ C(div) + C(gender) + C(div):C(gender)', data=df1).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table.round(2)

Insights for story-telling: results from the two-way ANOVA show that division and gender significantly affect the net time, but their interaction does not have a significant effect.

The next step would be to create hierarchical models to test the effects of random intercepts with random slopes and compare the model performance. To keep this blog a reasonable length I did not continue this work, but if you are interested you can keep exploring. I would suggest using the lme4 library in R to fit the model, and "brms" to try Bayesian hierarchical models.
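If you prefer to stay in Python, statsmodels also offers mixed-effects models. A minimal sketch with a random intercept per division, on synthetic data since the race data isn't bundled here (the effect sizes and noise levels are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the race data: net time depends on a division-level
# offset (the random effect) plus cross time (the fixed effect)
rng = np.random.default_rng(0)
n = 200
div = rng.choice(["20-29", "30-39", "40-49", "50-59"], size=n)
cross_t = rng.uniform(0.5, 5.0, size=n)
div_effect = {"20-29": -3, "30-39": 0, "40-49": 2, "50-59": 5}
net_s = (45 + pd.Series(div).map(div_effect).to_numpy()
         + 1.5 * cross_t + rng.normal(0, 2, n))
df = pd.DataFrame({"div": div, "cross_t": cross_t, "net_s": net_s})

# Random intercept per division, fixed effect for cross time
model = smf.mixedlm("net_s ~ cross_t", df, groups=df["div"]).fit()
print(model.summary())
```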

Another modeling opportunity is a clustering analysis of the racers. We could try to segment the racers and see if that helps predict the net time.
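One way to start that segmentation (a hedged sketch with k-means from scikit-learn on synthetic features; the feature choice and number of clusters are assumptions, not the blog's):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for racer features (age and cross time); the real
# notebook would use the cleaned context variables instead
rng = np.random.default_rng(1)
X = np.column_stack([
    rng.normal(35, 10, 150),    # age
    rng.uniform(0.5, 5.5, 150)  # cross time (minutes)
])

X_scaled = StandardScaler().fit_transform(X)  # put features on one scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))  # racers per segment
```

The resulting segment labels could then be added as a categorical feature when modeling net time.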

4. Summary and discussion

This blog provides an easy-to-follow method to help you think through and discover the insights behind your data, using advanced EDA techniques. I hope that by following the steps, you have learned the basics of structured thinking for telling a good story with "open-ended" data science projects.

I also want to point out a limitation of the method presented here. Grouping variables into context variables and dynamic variables may not apply to every dataset, for example, one where time is an important factor. Suppose we have a dataset about users' shopping activities over one year; we may not have much "context" information (users' characteristics like age or gender), but instead information associated with timestamps (e.g. when a user opened an account or made a purchase). Then we need to take factors like seasonality and the sequence of user activities into our modeling considerations. That would be a different topic for EDA, and hopefully I can share more with you in another blog soon.

Reference:

  1. The art of structured thinking and analyzing (https://www.analyticsvidhya.com/blog/2013/06/art-structured-thinking-analyzing/)