EDA

May 11, 2018•6,860 words

Motivation

In my graduate program we had an excellent text on Regression Methods in Biostatistics (which I will refer to as VGSM). Unfortunately, all of the examples and code from the courses, labs, and text were in Stata. As an R user, I kept having to translate the topics from Stata into R. But this actually turned into a valuable exercise because it forced me to separate the underlying statistical concepts from the programming languages.

I've decided to post various parts of the text/topics because I'm sure there will be future R users in the same boat :)

What is Exploratory Data Analysis (EDA)?

"Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone--as the first step." - John W. Tukey, Exploratory Data Analysis, 1977

I think the "first step" is the most important point about exploratory data analysis (EDA)--it should precede any modeling, inference, or fancy machine learning algorithm. Without first looking at the predictors and outcomes in a graph, modeling data can be a waste of time or even misleading.

I'll be the first to admit, I'm not great at remembering tests, assumptions, transformations, or writing equations on a whiteboard. I'm OK at math, but I have never been exceptional at it. But I love visualizing data, and this passion (and a lot of general curiosity) has saved me plenty of times when I'm faced with a data question.

Essential graphs for EDA

The chapter that addresses exploring data in VGSM is chapter 2 "Exploratory and Descriptive Methods". The authors recommend three graphs for numerical data: box-plots, histograms, and Q-Q plots.

The Data

The data we will be using for this post comes from the "Western Collaborative Group Study (WstClbGrpStdy)". You can read more about it here. These data are available in the native wcgs data frame, but I'll be loading the version that came with my text because it has a few additional variables.

I also name this imported file WstClbGrpStdy because I like to capitalize and disemvowel (yes, that is a thing, thank you, James Joyce) names of data frames to distinguish them from other R objects (vectors, lists, etc.).

These data come from an epidemiological study of behavioral attributes and coronary heart disease. The findings were published in 1964, and ironically there isn't a single graph used to display these data in the manuscript. It's hard to know how familiar the authors were with EDA (John Tukey didn't publish Exploratory Data Analysis until 1977), so we can't really hold the lack of graphs and figures against them. The code to import this data set is below:

WstClbGrpStdy <- readxl::read_xls("Data/wcgs.xls")

Structure vs. glimpse

The tidyverse packages all share an underlying philosophy built around tidy data, with a consistent syntax and well-defined terms and ideas.

A simple example of this concept is utils::str() vs. dplyr::glimpse(). Viewing the shape and structure of an object is a common starting point, but it can also turn into a place of confusion for newcomers to the R language. The str() function gives a lot of information (and jargon) that isn't always necessary to start exploring your data.

I prefer using dplyr::glimpse() because it 1) puts more of the data on the screen than str(), 2) displays the format for each variable in the data set in a consistent way, and 3) omits the additional information that occasionally gets printed with str().

See the image below for an example:

str_vs_glimpse
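For readers who can't see the image, here is a minimal sketch of the same comparison. It uses a small made-up data frame (`toy` is a hypothetical name, and the values are not the WCGS data) and assumes the dplyr package is installed.

```r
# A small, made-up data frame (not the WCGS data) to compare the two views
toy <- data.frame(
  age    = c(50, 51, 59, 51, 44),
  sbp    = c(132, 120, 158, 126, 126),
  behpat = c("A1", "A1", "A2", "B3", "B4"),
  stringsAsFactors = FALSE
)

# utils::str(): a header line plus one line per column
str_out <- utils::capture.output(utils::str(toy))

# dplyr::glimpse(): one line per column, with the column type next to its name
glimpse_out <- utils::capture.output(dplyr::glimpse(toy, width = 78))

# Print both so they can be compared side by side
cat(str_out, sep = "\n")
cat(glimpse_out, sep = "\n")
```

Capturing the output first is only a convenience for comparing the two; calling str(toy) and glimpse(toy) directly prints the same text.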

I pass 78 as the width argument to ensure everything prints nicely within the screen width.

glimpse(WstClbGrpStdy, 78)

    ## Observations: 3,154
    ## Variables: 22
    ## $ age      <dbl> 50, 51, 59, 51, 44, 47, 40, 41, 50, 43, 59, 54, 48, 39, ...
    ## $ arcus    <dbl> 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
    ## $ behpat   <chr> "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A...
    ## $ bmi      <dbl> 31.32, 25.33, 28.69, 22.15, 22.31, 27.12, 23.24, 22.96, ...
    ## $ chd69    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "N...
    ## $ chol     <dbl> 249, 194, 258, 173, 214, 206, 190, 212, 130, 233, 181, 2...
    ## $ dbp      <dbl> 90, 74, 94, 80, 80, 76, 78, 84, 70, 80, 86, 76, 78, 74, ...
    ## $ dibpat   <chr> "Type A", "Type A", "Type A", "Type A", "Type A", "Type ...
    ## $ height   <dbl> 67, 73, 70, 69, 71, 64, 70, 70, 71, 68, 72, 67, 71, 70, ...
    ## $ id       <dbl> 2343, 3656, 3526, 22057, 12927, 16029, 3894, 11389, 1268...
    ## $ lnsbp    <dbl> 4.883, 4.787, 5.063, 4.836, 4.836, 4.754, 4.804, 4.868, ...
    ## $ lnwght   <dbl> 5.298, 5.257, 5.298, 5.011, 5.075, 5.063, 5.088, 5.075, ...
    ## $ ncigs    <dbl> 25, 25, 0, 0, 0, 80, 0, 25, 0, 25, 10, 0, 20, 0, 4, 0, 0...
    ## $ sbp      <dbl> 132, 120, 158, 126, 126, 116, 122, 130, 112, 120, 130, 1...
    ## $ smoke    <chr> "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No"...
    ## $ t1       <dbl> -1.6334, -4.0634, 0.6397, 1.1218, 2.4250, -0.7875, -0.60...
    ## $ time169  <dbl> 1367, 2991, 2960, 3069, 3081, 2114, 2929, 3010, 3104, 28...
    ## $ typchd69 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,...
    ## $ uni      <dbl> 0.48607, 0.18595, 0.72780, 0.62446, 0.37898, 0.73550, 0....
    ## $ weight   <dbl> 200, 192, 200, 150, 160, 158, 162, 160, 195, 187, 206, 1...
    ## $ wghtcat  <chr> "170-200", "170-200", "170-200", "140-170", "140-170", "...
    ## $ agec     <chr> "46-50", "51-55", "56-60", "51-55", "41-45", "46-50", "3...

“Exploratory data analysis is detective work–in the purest sense–finding and revealing the clues.” - John W. Tukey, Exploratory Data Analysis, 1977

The first variable we will look at in this data frame is the systolic blood pressure (sbp). In R, we want to make sure this variable is formatted correctly before putting it in a graph. We can also use glimpse() on a single variable.

WstClbGrpStdy$sbp %>% glimpse()
##  num [1:3154] 132 120 158 126 126 116 122 130 112 120 ...

Base R histograms with `hist()`

A histogram takes a numeric variable and displays its range of values across the horizontal scale (x-axis). Histograms also divide the horizontal scale into segments, called “bins.” The vertical scale (y-axis) displays how many observations fall into each bin.

Base R comes with a set of graphics:: functions for plotting data (this just means you won’t have to install any packages to use them). We can build a histogram using base R with the graphics::hist() function. This function takes the data you want to plot as its first argument (WstClbGrpStdy$sbp).
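To make the idea of bins concrete, the sketch below counts values per bin by hand with cut() and table(). The ten readings and the bin edges are made up for illustration; they are not taken from the WCGS data.

```r
# Ten made-up systolic blood pressure readings (mm Hg), for illustration only
sbp_toy <- c(101, 105, 118, 122, 126, 126, 131, 144, 158, 212)

# Split the range 100-220 into six bins, each 20 mm Hg wide
bin_edges <- seq(from = 100, to = 220, by = 20)
sbp_bins  <- cut(sbp_toy, breaks = bin_edges)

# Count how many readings fall into each bin (these are the bar heights)
bin_counts <- table(sbp_bins)
bin_counts
```

The two empty bins before the last one are what a long right tail looks like in a table of counts.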

# base R histogram of sbp
graphics::hist(WstClbGrpStdy$sbp)

Notice this plot lists “Histogram of WstClbGrpStdy$sbp” as the title and labels the x-axis as “WstClbGrpStdy$sbp”.

Technically this is all the function needs, but I recommend supplying something to the main = title and xlab = label arguments. Titles and labels help describe what we’re seeing (or what we expected to see) when R renders the plot. Future us will thank present us if we take the extra time to add a title and label our plots.

graphics::hist(x = WstClbGrpStdy$sbp,
               main = "Histogram of systolic blood pressure",
               xlab = "Systolic blood pressure in mm Hg (sbp)")

Curious what to include in a title or label? I tend to stick with the type of graph and the variable I’m displaying in plain English (no jargon). For labels, I include the name of the measurement (no acronyms) and the units. A title and label for every graph might seem like an arduous level of detail, but if we end up investigating a data set with many abbreviated, nebulous variable names, we are going to appreciate being able to look at a plot or graph and know precisely what it contains.

How many bins?

The graphics::hist() command will automatically divide up the data into bins, which usually isn’t a bad option. We can also specify the number of bins with the breaks argument. This changes the number and width of the vertical bars. As suggested in VGSM, the

“rule of thumb is to choose the number of bins to be about 1 + 3.3 log10(n)”

We can calculate this below:

wcgs_sbp_bins <- 1 + 3.3*log10(nrow(WstClbGrpStdy))
wcgs_sbp_bins
  ## [1] 12.55

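This new numeric value can be supplied to the breaks argument of hist(). Note that hist() treats a single number as a suggestion: the break points are rounded to “pretty” values, so the number of bars drawn may not match the request exactly.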
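As a rough check on the rule of thumb, the sketch below wraps it in a small function and compares it with base R's own Sturges helper. The name `rule_of_thumb_bins()` is hypothetical, and `fake_sbp` is simulated data standing in for the real sbp values.

```r
# Rule of thumb quoted above: about 1 + 3.3 * log10(n) bins
rule_of_thumb_bins <- function(n) 1 + 3.3 * log10(n)

n_obs <- 3154                      # number of rows in the WCGS data
rule_of_thumb_bins(n_obs)          # about 12.55

# Base R's Sturges helper uses ceiling(log2(n) + 1), which is nearly the
# same rule because 1 / log10(2) is roughly 3.32
sturges_bins <- grDevices::nclass.Sturges(seq_len(n_obs))
sturges_bins

# hist() treats a single number in `breaks` as a suggestion and rounds the
# break points to "pretty" values, so the bar count may differ from the request
set.seed(2018)
fake_sbp <- rnorm(n_obs, mean = 128, sd = 15)   # simulated, not the real sbp
h <- hist(fake_sbp, breaks = ceiling(rule_of_thumb_bins(n_obs)), plot = FALSE)
length(h$counts)
```

Setting plot = FALSE returns the bin edges and counts without drawing anything, which is handy for checking how many bars hist() actually chose.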

Adding reference lines to a `hist()` plot

Add a reference line with a measure of central tendency to the histogram graph with the abline() function. The first argument is either h or v (for a horizontal or vertical line), followed by the values you want the line to represent (in our case the median() of sbp).

I add two additional arguments to designate the color (col = "blue") and width of the line (lwd = 4).

graphics::hist(x = WstClbGrpStdy$sbp,
               main = "Histogram of systolic blood pressure",
               xlab = "Systolic blood pressure in mm Hg (sbp)",
               breaks = wcgs_sbp_bins)
abline(v = median(WstClbGrpStdy$sbp),
               col = "blue",
               lwd = 4)

Histogram quick plots with `qplot`

The next option for making a histogram comes from the ggplot2 package. The qplot() function (short for quick plot) is a fast way to plot a single variable. Titles can be added using ggtitle(), and axis labels with xlab(), the ggplot2 counterpart of the xlab argument we used in hist().

ggplot2::qplot(sbp, data = WstClbGrpStdy) +
    ggplot2::ggtitle("Histogram of systolic blood pressure") +
    xlab("Systolic blood pressure in mm Hg (sbp)")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The qplot() function will create a histogram or bar chart depending on the format of the variable you provide (numeric or factor). As the message tells us, the default number of bins is 30. This is more than twice the number of bins we calculated before, and we can see this has a profound effect on how these data are represented in the graph (more on this later).

If you’re like me and prefer to use the pipes from the magrittr package, this can be re-written using a little syntactic sugar:

WstClbGrpStdy %>% ggplot2::qplot(sbp, data = .) +
    ggplot2::ggtitle("Histogram of systolic blood pressure") +
    xlab("Systolic blood pressure in mm Hg (sbp)")

Why did we have to specify a period (`.`) in the `data =` argument?

magrittr interprets the period argument as “take the named parameter as the object on the left of the %>% operator.”

So Data %>% function(variable, named_argument = .) is equivalent to

function(variable, named_argument = Data)
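A minimal sketch of the dot placeholder, using the built-in mtcars data rather than the WCGS data (it assumes the magrittr package is installed):

```r
library(magrittr)

# Without a dot, the left-hand side becomes the FIRST argument
c(1, 4, 9) %>% sqrt()              # same as sqrt(c(1, 4, 9))

# With a dot, the left-hand side goes wherever the dot is placed.
# lm() expects a formula first, so the data frame is routed to `data =`
piped_fit  <- mtcars %>% lm(mpg ~ wt, data = .)
direct_fit <- lm(mpg ~ wt, data = mtcars)

# Both calls fit the same model
all.equal(coef(piped_fit), coef(direct_fit))
```

The same logic applies to qplot(): its first argument is the x variable, so the data frame has to be sent to data = with the dot.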

Notice we also get a message from this graph about the number of bins being used. The stat_bin() function uses a default of 30 bins, but we can adjust this with the bins argument.

Adding the median to a `qplot()`

We can also add the median to this plot by adding a geom, specifically a ggplot2::geom_vline(). This function takes an aesthetic mapping argument (aes(xintercept = median(sbp))), a color argument (col), and a size argument that will determine the width of the vertical line.

WstClbGrpStdy %>%
        ggplot2::qplot(sbp, data = ., bins = wcgs_sbp_bins) +
        ggplot2::geom_vline(aes(xintercept = median(sbp)), # add median
                        col = 'blue',
                        size = 2) +
        ggplot2::ggtitle("Histogram of systolic blood pressure") +
        xlab("Systolic blood pressure in mm Hg (sbp)")

This histogram now looks more like the histogram we made using graphics::hist() above (but with fewer bins).

`ggplot` histogram plots

Below is an example of how to get a histogram using ggplot2. I’ve covered using ggplot2 elsewhere, so I won’t repeat myself here. But know that ggplot2 allows users to build a plot layer by layer, starting with a data argument, mapping variables to aesthetics (aes(x = sbp)), and then supplying a geom. We got a small introduction to this method in the example above when we added the ggplot2::geom_vline() layer to the ggplot2::qplot().

Again, we can use magrittr to increase the readability on these functions.

WstClbGrpStdy %>% # Data
    ggplot2::ggplot(aes(x = sbp)) + # variable
        ggplot2::geom_histogram(bins = wcgs_sbp_bins) + # geom
        ggplot2::geom_vline(aes(xintercept = median(sbp)), # add median
                col = 'blue',
                size = 2) +
        ggplot2::ggtitle("Histogram of systolic blood pressure") + # title
        xlab("Systolic blood pressure in mm Hg (sbp)") # label

I don’t use qplot() very often, but only because I like having the ability to customize and build my plots layer-by-layer.

What are we seeing in a histogram?

These histograms show us how many sbp values fall into each bin. We can see above that most systolic blood pressures appear to be relatively close to (although slightly lower than) the median. We can also see the distribution extends to the right of the median (higher sbp values) all the way out to ~230 mm Hg. This is called a “right-skewed” distribution–the tail extends out to the right side (vs a “left-skewed” distribution where the tail extends out to the left side).

median(WstClbGrpStdy$sbp)
    ## [1] 126

It’s also a good idea to view multiple histograms with different numbers of bins to get an idea of how this variable’s values are distributed.

# Create some histograms -----
WCGS_hist_15 <- WstClbGrpStdy %>% # Data
    ggplot(aes(x = sbp)) + # variable
        geom_histogram(bins = 15) + # geom
        geom_vline(aes(xintercept = median(sbp)),
                col = 'blue',
                size = 1) +
        ggtitle("Histogram of sbp (15 bins)") + # title
        xlab("Systolic blood pressure in mm Hg (sbp)") # label
WCGS_hist_20 <- WstClbGrpStdy %>% # Data
    ggplot(aes(x = sbp)) + # variable
        geom_histogram(bins = 20) + # geom
        geom_vline(aes(xintercept = median(sbp)),
                col = 'blue',
                size = 1) +
        ggtitle("Histogram of sbp (20 bins)") + # title
        xlab("Systolic blood pressure in mm Hg (sbp)") # label
WCGS_hist_25 <- WstClbGrpStdy %>% # Data
    ggplot(aes(x = sbp)) + # variable
        geom_histogram(bins = 25) + # geom
        geom_vline(aes(xintercept = median(sbp)),
                col = 'blue',
                size = 1) +
        ggtitle("Histogram of sbp (25 bins)") + # title
        xlab("Systolic blood pressure in mm Hg (sbp)") # label
WCGS_hist_30 <- WstClbGrpStdy %>% # Data
    ggplot(aes(x = sbp)) + # variable
        geom_histogram(bins = 30) + # geom
        geom_vline(aes(xintercept = median(sbp)),
                col = 'blue',
                size = 1) +
        ggtitle("Histogram of sbp (30 bins)") + # title
        xlab("Systolic blood pressure in mm Hg (sbp)") # label
# arrange these plots ----
library(ggpubr)
ggarrange(WCGS_hist_15, WCGS_hist_20, WCGS_hist_25, WCGS_hist_30,
          ncol = 2, nrow = 2)

ggsave(filename = "Images/02_ggplot2_hists.png")
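The four nearly identical blocks above could also be generated with a small helper. This is only a sketch: `make_sbp_hist()` is a hypothetical helper name, `sim_sbp` is simulated data rather than the WCGS values, and `linewidth` assumes ggplot2 3.4.0 or later (older versions use `size` instead).

```r
library(ggplot2)

# Simulated, right-skewed stand-in for sbp (not the WCGS data)
set.seed(2018)
sim_sbp <- data.frame(sbp = 100 + rgamma(500, shape = 4, scale = 7))

# Hypothetical helper: one histogram with a median line for a given bin count
make_sbp_hist <- function(data, bins) {
  ggplot(data, aes(x = sbp)) +
    geom_histogram(bins = bins) +
    geom_vline(xintercept = median(data$sbp), col = "blue", linewidth = 1) +
    ggtitle(paste0("Histogram of sbp (", bins, " bins)")) +
    xlab("Systolic blood pressure in mm Hg (sbp)")
}

# Build one plot per bin count and keep them in a named list
bin_options <- c(15, 20, 25, 30)
sbp_hists   <- lapply(bin_options, function(b) make_sbp_hist(sim_sbp, b))
names(sbp_hists) <- paste0("bins_", bin_options)
```

The resulting list can be handed to ggpubr::ggarrange() through its plotlist argument, for example ggarrange(plotlist = sbp_hists, ncol = 2, nrow = 2).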

When we look at all four bin settings, we can see that each distribution has a cluster of values on the left-hand side, while the tail extends toward the higher values. This is an example of right-skewness.

When building histograms, we’re looking for the highs, lows, and overall spread of the data. The trick is to look at enough bin values to give us an idea about the variation of values in our variable, without making it too granular to interpret.

What aren’t we seeing in a histogram?

The histogram doesn’t tell us anything about the summary statistics we might be using for inferences (such as the mean, standard deviation, median, etc.). It also doesn’t give us a clear picture of the extreme cases–it’s more of a “bird’s eye view” of a variable’s distribution. For more descriptive information, we will have to move onto box-plots.

Percentiles, Medians, and Box-plots

Box-plots were originally created by John W. Tukey (although he called them “box-and-whisker” plots). Box-plots are useful for understanding the shape of a variable’s distribution because they divide up the values into distinct proportions. For example, if we order all the values in a variable from lowest to highest, the middle number is the 50th percentile (also called the median). We know half the values lie below this number, and the other half are above it.

Why use the median and not the mean (or average)?

The median isn’t as sensitive to extreme values as the mean. I’ll demonstrate this with a quick example. Take a vector (x) with 9 values and get the median() and mean().

x <- c(4, 7, 9, 11, 13, 15, 17, 19, 21)
median(x)
    ## [1] 13
mean(x)
    ## [1] 12.89

These two numbers are relatively close to one another. Now we can take a quick look at the distribution with hist(x).

hist(x)

This distribution looks symmetrical–more importantly, it looks symmetrical around the mean and the median (12.89 and 13). This is important because many modelling techniques assume the distributions of the variables are “normal” or approximately bell-shaped. Now we create a second vector (y) with a similar run of middle values, but with extreme values at both ends (0.0001 at the low end and 21000 at the high end). Calculate the median() and mean() of y.

y <- c(0.0001, 3, 7, 9, 11, 13, 15, 17, 21000)
median(y)
    ## [1] 11
mean(y)
    ## [1] 2342

Notice how the median barely moved (from 13 to 11), but the mean has increased from 12.89 to 2342. Take a look at the distribution of y with hist():

hist(y)

The bell-shape is gone, replaced by just two bars (one for the 8 values under 5,000, and one for the single value above 20,000).

Base R box-plots with `boxplot()`

Box-plots use quartiles to divide the data into the 25th, 50th (median), and 75th percentiles. We can create a box-plot in base R using the graphics::boxplot() function.

graphics::boxplot(WstClbGrpStdy$sbp,
                  main = "Box-plot of Systolic blood pressure in mm Hg",
                  ylab = "Systolic blood pressure in mm Hg (sbp)")

According to the authors in VGSM, a box-plot displays the following information:

Location, as measured by the median

Spread, as measured by the height of the box (this is called the interquartile range or IQR)

Range of the observations

Presence of outliers

Some information about shape

I’ve created a graphic that adds a little detail for each element of the box-plot.

Relationship between histograms and box-plots

box_plot.png

Each of these imaginary box-plots has a corresponding histogram. These box-plots weren’t created in R (and aren’t drawn perfectly to scale), but I’ve included this figure to illustrate that the information captured in a box-plot might be hard to see in a histogram:

Extreme values (sometimes called outliers) are made visible with a box-plot–they show up as dots or values beyond the 1.5x thresholds of each “whisker”.
We can also see the location changes by the shifting line within the box (the median).

Box-plots with `qplot()`

The code to create a box-plot is very similar to creating a histogram with qplot(). We supply a data frame (WstClbGrpStdy) to the qplot() function, but we also need to add an x argument (x = ""), because a box-plot geom needs to map the values of sbp onto the y axis. We add our title and label, but also set the xlab() label to NULL so the x label doesn’t display.

WstClbGrpStdy %>%
    qplot(x = "", y = sbp, data = ., geom = "boxplot") +
      ggtitle("Box-plot of Systolic blood pressure in mm Hg") +
      ylab("Systolic blood pressure in mm Hg (sbp)") +
      xlab(NULL)

`ggplot` box-plots

Finally, we can build the same box-plot with the layered grammar of ggplot2 by piping a data frame to ggplot(), mapping x to an empty string ("") and y to the sbp variable inside aes(), then adding geom_boxplot().

The title and labels get added in the same way as with the qplot() function.

WstClbGrpStdy %>%
    ggplot(aes(x = "", y = sbp)) +
        geom_boxplot() +
            ggtitle("Box-plot of Systolic blood pressure in mm Hg") +
            ylab("Systolic blood pressure in mm Hg (sbp)") +
            xlab(NULL)

What are we seeing in a boxplot?

To summarize, the box is drawn from three percentiles of the data: 1) the bottom of the box is the 25th percentile, meaning one-quarter of the values are below this value, 2) the line in the middle of the box is the median (50th percentile), and its position shows how far the middle value is from the 25th and 75th percentiles, and 3) the top of the box is the 75th percentile, so one-quarter of the values are above this value.

The entire length of the box is the interquartile range (IQR). The whiskers show 1) the largest value within 1.5 times the IQR above the upper (75th percentile) quartile, and 2) the smallest value within 1.5 times the IQR below the lower (25th percentile) quartile. The points beyond the whiskers are the extreme values.
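The arithmetic behind the box and whiskers can be sketched by hand. The nine values below are made up for illustration, and the quartiles come from quantile(); note that graphics::boxplot() computes the box edges with a slightly different method ("hinges"), so its numbers can differ a little on other data.

```r
# Nine made-up values with one extreme observation
v <- c(4, 7, 9, 11, 13, 15, 17, 19, 60)

q1  <- quantile(v, 0.25, names = FALSE)   # 25th percentile (bottom of box)
q3  <- quantile(v, 0.75, names = FALSE)   # 75th percentile (top of box)
iqr <- q3 - q1                            # length of the box

# "Fences": 1.5 * IQR beyond each end of the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme observed values inside the fences
lower_whisker <- min(v[v >= lower_fence])
upper_whisker <- max(v[v <= upper_fence])

# Anything outside the fences is drawn as an individual point
extreme_values <- v[v < lower_fence | v > upper_fence]

c(q1 = q1, q3 = q3, iqr = iqr,
  lower_whisker = lower_whisker, upper_whisker = upper_whisker)
extreme_values
```

Here the box runs from 9 to 17, the whiskers reach 4 and 19, and 60 is the only value plotted as an extreme point.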

With regard to these data, I thought this description from VGSM was enlightening,

“right-skewness will be indicated if the upper whisker is longer than the lower whisker or if there are more outliers in the upper range. Both the boxplot and the histogram show evidence for right-skewness in the SBP data.”

library(ggplot2)
library(ggpubr)
# box-plot of sbp, flipped so it lines up with the histogram's x-axis
sbp_boxplot <- WstClbGrpStdy %>%
    ggplot(aes(x = "", y = sbp)) +
        geom_boxplot(show.legend = FALSE) +
        coord_flip() +
        scale_y_continuous(expand = c(0, 5),
                           limits = c(90, 250)) +
        ggtitle("Box-plot & Histogram of systolic blood pressure") +
        xlab(NULL) +
        ylab(NULL)
# histogram of sbp on the same scale, with the median in blue
sbp_histogram <- WstClbGrpStdy %>% # Data
    ggplot(aes(x = sbp)) + # variable
        geom_histogram(bins = 20) + # geom
        geom_vline(aes(xintercept = median(sbp)),
                   col = 'blue',
                   size = 1) +
        scale_x_continuous(expand = c(0, 5),
                           limits = c(90, 250))
# stack the two plots ----
ggarrange(sbp_boxplot,
          sbp_histogram,
          heights = c(2.5, 3),
          align = "hv",
          ncol = 1,
          nrow = 2)

By combining these graphs, we can see how the median (blue line) in the histogram splits the distribution in the box, and how the skewness in the histogram is represented in the whiskers and points of the box-plot.

Base R Q-Q plots

This brings us to the final plot we will cover: the Q-Q plot, or quantile-quantile plot. This plot is helpful when we want to know how closely our data aligns with the bell-shaped “normal” distribution.

We can create a Q-Q plot in base R using the following functions. The qqnorm() function needs a variable we want to plot (sbp).

qqnorm(WstClbGrpStdy$sbp,
        main = "Systolic blood pressure QQ plot",
        ylab = "Systolic Blood Pressure mm Hg (sbp)")

The interpretation of the Q-Q plot is the following from VGSM,

“a normal Q–Q plot is constructed so that the data points fall along an approximately straight line when the data are from a normal distribution, and deviate systematically from a straight line when the data are from other distributions.”
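To see how such a plot is constructed, the sketch below builds the two sets of quantiles by hand and checks them against what qqnorm() computes. The data are simulated (`sample_values` is a made-up, right-skewed sample, not sbp).

```r
set.seed(2018)
sample_values <- rexp(200)          # simulated right-skewed data, not sbp

n <- length(sample_values)

# Theoretical quantiles: where n ordered draws from a standard normal
# would be expected to fall
theoretical <- qnorm(ppoints(n))

# Sample quantiles: the observed values sorted from smallest to largest
observed <- sort(sample_values)

# qqnorm() computes the same pairs; plot.it = FALSE returns them without drawing
qq <- qqnorm(sample_values, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), observed)
```

Plotting `observed` against `theoretical` gives the Q-Q plot: if the sample were normal, the sorted values would rise in step with the theoretical quantiles and trace a straight line.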

What does it mean if data are normally distributed?

In a broader sense, if I were to randomly sample a data point from a variable with a symmetrical, bell-shaped distribution, that point has an approximately equal probability of coming from either side of the distribution’s mean (or median).

The sbp data do not seem to fall along a straight line, but it would be helpful if we had a straight line for reference. We can add a straight line with qqline()–this function needs the same variable we provided to qqnorm(), and I also specified the line width with the lwd = 2 argument. I also included a col argument to color the systolic blood pressure values, and a call to the grid() function to give our coordinate system some gridlines.

qqnorm(WstClbGrpStdy$sbp,
        main = "Systolic blood pressure QQ plot",
        col = "blue",
        ylab = "Systolic Blood Pressure mm Hg (sbp)")
qqline(WstClbGrpStdy$sbp,
    lwd = 2)
grid(lty = "dotted",
    col = "gray75")

We can see these values definitely do not fall along the straight reference line of “theoretical quantiles”.

Q-Q quick plots with `stat_qq()`

The Q-Q plots in ggplot2 can be built in a variety of ways. The first is by setting the sample aesthetic to the variable we want to see (sbp) and then adding a stat_qq() layer.

WstClbGrpStdy %>%
    ggplot(aes(sample = sbp)) +
        stat_qq() +
    ggtitle("Q-Q plot of systolic blood pressure")

If we want to add a reference line to the stat_qq() plot, ggplot2 (as of development version 2.2.1.9000) is equipped with a stat_qq_line() function that draws a straight reference line. I use the geom for creating the points on the graph (geom_qq()) along with some opacity and color options, then add the reference line with stat_qq_line().

WstClbGrpStdy %>%
    ggplot(aes(sample = sbp)) +
        geom_qq(alpha = 0.3,
                color = "blue") +
            stat_qq_line() +
    ggtitle("Q-Q plot of systolic blood pressure")

Q-Q plots are also helpful for determining the shape of the distribution. See the following quote from the VGSM text:

“The shape and direction of the curvature can be used to diagnose the deviation from normality. Upward curvature is indicative of right-skewness, while downward curvature is indicative of left-skewness.”

The upward curve of the systolic blood pressure shows these data are right-skewed (which we knew from looking at the other two plots, but it’s always good to verify this again!).

Heavy and light tailed distributions

We are going to diverge from the sbp variable in the WstClbGrpStdy data frame and use the Tails data frame to show how two different distributions look in the three graphs we covered thus far.

Tails %>% glimpse(78)
    ## Observations: 1,000
    ## Variables: 2
    ## $ heavy_tail <dbl> -0.56, -0.40, -0.03, -0.77, -0.26, -0.38, -0.47, 0.32,...
    ## $ light_tail <dbl> 0.04615, 0.12579, 0.02782, 0.03540, 0.02829, 0.02750, ...

This data frame only has two variables, heavy_tail and light_tail. As the VGSM text notes, Q-Q plots can tell us a lot about how a distribution deviates from the normal “bell-curve”.
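The Tails data frame itself isn't constructed in this post, so for anyone following along, here is one way to simulate something comparable. Everything here is an assumption: `Tails_sim` is a hypothetical stand-in, and the t and uniform distributions are simply common examples of heavy and light tails, not necessarily how the original data were generated.

```r
set.seed(2018)

# Hypothetical stand-in for the Tails data frame (NOT the author's data)
Tails_sim <- data.frame(
  heavy_tail = rt(1000, df = 3),   # t distribution: more extreme values than normal
  light_tail = runif(1000)         # uniform distribution: no extreme values at all
)

# Excess kurtosis: roughly 0 for normal data, positive for heavy tails,
# negative for light tails
excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3
}

tail_kurtosis <- sapply(Tails_sim, excess_kurtosis)
tail_kurtosis
```

Swapping Tails_sim in for Tails in the plotting code should give the same general shapes, although the axis limits may need adjusting.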

“The other two common patterns are S-shaped. An S-shape as in [see below] indicates a heavy-tailed distribution,”

We can plot the heavy_tail data below and arrange all three plots to display together.

Heavy_qq <- Tails %>%
    ggplot(aes(sample = heavy_tail)) +
        geom_qq(alpha = 0.3,
                color = "blue") +
            stat_qq_line() +
     ggtitle("Q-Q plot of heavy tail distribution")
            # ylab(NULL)
Heavy_box <- Tails %>%
                ggplot(aes(x = "y",
                           y = heavy_tail)) +
                geom_boxplot(show.legend = FALSE) +
                    coord_flip() +
                    scale_y_continuous(expand = c(0, 1),
                                   limit = c(-6, 6)) +
     ggtitle("Box-plot of heavy tail distribution") +
                    xlab(NULL) +
                    ylab(NULL)
Heavy_hist <- Tails %>%
                ggplot(aes(x = heavy_tail)) +
                geom_histogram() +
                geom_vline(aes(xintercept = median(heavy_tail)),
                            col = 'blue',
                            size = 1) +
                scale_x_continuous(expand = c(0, 1),
                                   limit = c(-6, 6)) +
    ggtitle("Histogram of heavy tail distribution") +
                    xlab(NULL) +
                    ylab(NULL)
# arrange these plots ----
library(ggpubr)
ggarrange(Heavy_hist,
          Heavy_qq,
          Heavy_box,
          align = "hv",
          ncol = 2,
          nrow = 2,
          widths = c(5, 5, 5))

The S-curve being referred to is traced by the blue points in the Q-Q plot, and the distribution has a heavy tail if the points curve above the reference line on the positive (higher) end of the x-axis, but below the line on the negative (or lower) end of the x-axis.

We can also see the box-plot has many outliers outside the whiskers.

“while an S-shape like [see below] is indicative of a light-tailed distribution.”

Light_qq <- Tails %>%
    ggplot(aes(sample = light_tail)) +
        geom_qq(alpha = 0.3,
                color = "blue") +
            stat_qq_line() +
    ggtitle("Q-Q plot of light tail distribution")
Light_box <- Tails %>%
                ggplot(aes(x = "y",
                           y = light_tail)) +
                geom_boxplot(show.legend = FALSE) +
                    coord_flip() +
    ggtitle("Box-plot of light tail distribution") +
                    xlab(NULL) +
                    ylab(NULL)
Light_hist <- Tails %>%
                ggplot(aes(x = light_tail)) +
                geom_histogram() +
                geom_vline(aes(xintercept = median(light_tail)),
                            col = 'blue',
                            size = 1) +
    ggtitle("Histogram of light tail distribution") +
                    xlab(NULL) +
                    ylab(NULL)
# arrange these plots ----
library(ggpubr)
ggarrange(Light_hist,
          Light_qq,
          Light_box,
          align = "hv",
          ncol = 2,
          nrow = 2,
          widths = c(5, 5, 5))

This S-curve crosses the reference line, but in opposing directions from the heavy tail data (the line of values curves below the reference line on the positive (higher) end of the x-axis, and above the reference line on the negative (or lower) end of the x-axis).

The box-plot shows fewer outliers beyond the whiskers here, too.

Data Transformations

Many times data will require transformations before they can be entered into a statistical model. We have seen that variables can deviate from a normal distribution in their skewness or kurtosis, so visualization is important both before and after any transformations are made.

Common transformations are the natural log (log) and log10. I liked the description of what a log transformation does to a skewed distribution in VGSM,

“…a log transformation deemphasizes differences at the upper end of the scale and emphasizes those at the lower end.”
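A quick numeric sketch of that statement, using two made-up pairs of blood pressure values that are each 10 mm Hg apart:

```r
# The same 10 mm Hg gap at the low and high end of the sbp scale
low_gap_raw  <- 110 - 100
high_gap_raw <- 210 - 200

# On the log scale the gap near the bottom is larger than the gap near the top
low_gap_log  <- log(110) - log(100)   # log(1.10), about 0.095
high_gap_log <- log(210) - log(200)   # log(1.05), about 0.049

c(low_gap_log = low_gap_log, high_gap_log = high_gap_log)
```

Because the log scale measures relative rather than absolute differences, the long right tail gets pulled in toward the rest of the values.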

We will create a log-transformed version of sbp (log_sbp) and compare its distribution with the original using histograms (set at 25 bins each).

# add the natural log of sbp -----
WstClbGrpStdy <- WstClbGrpStdy %>%
    dplyr::mutate(log_sbp = log(sbp))
# histogram of log(sbp): median in blue, mean in red -----
WCGS_log_sbp_hist_25 <- WstClbGrpStdy %>%
    ggplot(aes(x = log_sbp)) +
    geom_histogram(bins = 25) +
    geom_vline(aes(xintercept = median(log_sbp)),
            col = 'blue',
            size = 1) +
    geom_vline(aes(xintercept = mean(log_sbp)),
            col = 'red',
            size = 1) +
    ggtitle("Histogram of log(sbp) (25 bins)") +
    xlab("Ln of Systolic blood pressure (log_sbp)") # label
# histogram of untransformed sbp: median in blue, mean in red -----
WCGS_sbp_hist_25 <- WstClbGrpStdy %>% # Data
    ggplot(aes(x = sbp)) + # variable
        geom_histogram(bins = 25) + # geom
        geom_vline(aes(xintercept = median(sbp)),
            col = 'blue',
            size = 1) +
        geom_vline(aes(xintercept = mean(sbp)),
            col = 'red',
            size = 1) +
        ggtitle("Histogram of sbp (25 bins)") + # title
        xlab("Systolic blood pressure in mm Hg (sbp)") # label
# arrange these plots ----
library(ggpubr)
ggarrange(WCGS_sbp_hist_25,
          WCGS_log_sbp_hist_25,
          align = "hv",
          ncol = 2,
          widths = c(5, 5))

We can see the log_sbp distribution is more symmetrical around the median/mean.

Doesn’t transforming my data make it harder to interpret when I model or graph them?

Maybe, but we just showed sbp before and after a transformation, so it’s possible to present both distributions without the reader losing track of the plot. The authors also point out that “a difference on the transformed scale is still a difference.”

Categorical Variables

Categorical or count data have fewer options for graphing, but it’s still possible to create a bar graph and see how these numbers differ within a variable. Below is a bar graph of behpat, which is the behavior type as a factor with levels A1, A2, B3, and B4.

# convert behpat to factor -----
WstClbGrpStdy$behpat <- factor(WstClbGrpStdy$behpat)
# create bar chart -----
WstClbGrpStdy %>% ggplot(aes(x = behpat)) +
    geom_bar() +
    ggtitle("Bar chart of behavior types") +
    xlab("Behavior types behpat")

Outcome and Predictor Variable Scatter Plots

This section introduces multivariate plots with a plot of systolic blood pressure (sbp) versus weight. I reproduce these plots below in base R with the plot() function.

plot(WstClbGrpStdy$weight, WstClbGrpStdy$sbp, # x = weight, y = sbp (matches the axis labels)
     main = "Systolic blood pressure vs. weight",
     xlab = "Weight",
     ylab = "Systolic blood pressure")

And we can create a similar chart in ggplot2 with the geom_point() function.

WstClbGrpStdy %>%
    ggplot(aes(x = weight, y = sbp)) +
        geom_point() +
        ggtitle("Systolic blood pressure vs. weight") +
            xlab("Weight") +
            ylab("Systolic blood pressure")

ggsave(filename = "Images/02_scatter_wtXsbp.png")

Adding a `LO`cally `WE`ighted `S`catterplot `S`moother (or `LOWESS`)

Sometimes it is helpful to fit a line through a scatter plot of two continuous variables to see if their relationship is linear. The text uses lowess sbp weight, bw(0.25) from Stata, and I’ve reproduced this graph below in ggplot2 with ggplot2::geom_smooth(method = "loess"). Note that this includes a confidence band around the smoothed line (in gray).

WstClbGrpStdy %>%
    ggplot(aes(x = weight, y = sbp)) +
        ggplot2::geom_point() +
        ggplot2::geom_smooth(method = "loess") +
        ggplot2::ggtitle("Systolic blood pressure vs. weight") +
            xlab("Weight") +
            ylab("Systolic blood pressure")

“This is all just a fancy way of drawing a flexible curve through a cloud of points.”

The upward trends at the lowest and highest weights are likely due to the lack of data at these ends of the distribution, where only a few points determine the curve.
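For a closer base R analogue to the Stata command, stats::lowess() takes a smoother span f that plays a role roughly similar to Stata's bandwidth. The sketch below uses simulated data (the variable values are assumptions, not the WCGS data), and f = 0.25 is chosen only to mirror bw(0.25). geom_smooth(method = "loess") relies on loess(), which uses a different default span, so the two curves should not be expected to match exactly.

```r
set.seed(2018)

# Simulated stand-ins for weight (lbs) and sbp (mm Hg); not the WCGS data
weight <- round(rnorm(300, mean = 170, sd = 21))
sbp    <- 100 + 0.17 * weight + rnorm(300, sd = 14)

# lowess() returns a list with the sorted x values and the smoothed y values;
# f is the proportion of points used in each local fit
fit <- lowess(x = weight, y = sbp, f = 0.25)

plot(weight, sbp,
     main = "Simulated sbp vs. weight with a lowess curve",
     xlab = "Weight",
     ylab = "Systolic blood pressure")
lines(fit, col = "blue", lwd = 2)
```

A smaller f follows the points more closely, while a larger f gives a smoother curve.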

Multivariate box-plots

Box-plots are also useful when we need to look at the distribution of a variable’s values across the levels or values of another variable. For example, if we wanted to examine systolic blood pressure sbp across behavior types (in the behpat variable), we could do this with the code below:

WstClbGrpStdy %>%
    ggplot(aes(x = behpat, y = sbp)) +
    geom_boxplot() +
        ggtitle("Box-plot of Systolic blood pressure and behavior types") +
        ylab("Systolic blood pressure in mm Hg (sbp)") +
        xlab("Behavior types (behpat)")

BONUS: What if you want to use different statistics in your box-plot?

The four box plots above show the median, interquartile range, and extreme values in sbp across the behavior types in behpat. But what if we wanted to compare these summary statistics to the mean and standard deviation of sbp across the behavior categories?

Fortunately ggplot2 will let me customize each element of the box-plot using the ggplot2::stat_summary() and ggplot2::stat_boxplot() functions.

WstClbGrpStdy %>%
    group_by(behpat) %>%
    summarise(mean_sbp = mean(sbp),
              med_sbp = median(sbp))
    ## # A tibble: 4 x 3
    ##   behpat mean_sbp med_sbp
    ##   <fct>     <dbl>   <dbl>
    ## 1 A1         129.     126
    ## 2 A2         130.     128
    ## 3 B3         128.     124
    ## 4 B4         127.     126

WstClbGrpStdy %>%
    ggplot(aes(x = behpat,
        y = sbp,
        color = behpat)) + # different color box for each level of behpat
    geom_boxplot(aes(x = behpat,
            y = sbp),
            show.legend = FALSE) + # remove legend
    stat_summary(fun = mean, # plots mean sbp per level of behpat (use fun.y in ggplot2 < 3.3.0)
                color = "darkslategray",
                geom = "point",
                size = 1.5) +
    stat_summary(fun.data = mean_sd, # plots mean +/- sd of sbp per level of behpat (mean_sd() is from ggpubr)
                    geom = "errorbar", # as error bars
                    linetype = 2, # makes them dashed
                    color = "darkslategray", # and a different color
                    width = 0.3,
                    size = 0.5) +
    stat_boxplot(aes(x = behpat, # customizes boxplot error bar
                        y = sbp),
                        show.legend = FALSE,
                        geom = "errorbar",
                        linetype = 1,
                        width = 0.2) +
        ggtitle("Box-plot of Systolic blood pressure and behavior types") +
        ylab("Systolic blood pressure in mm Hg (sbp)") +
        xlab("Behavior types (behpat)")

This shows us how systolic blood pressure varies according to the type of behavior. The box plot shows the distribution of sbp, while the two stat_summary() functions show the mean and standard deviations over the four levels of behpat.
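To see what the dashed error bars represent, here is a small base R sketch that computes the same kind of limits by hand. It assumes mean_sd() returns the group mean with limits one standard deviation on either side, and it uses simulated values (only the group labels mirror behpat).

```r
set.seed(2018)

# Simulated stand-in data; the group labels mirror behpat, the values do not
sim <- data.frame(
  behpat = factor(rep(c("A1", "A2", "B3", "B4"), each = 50)),
  sbp    = rnorm(200, mean = 128, sd = 15)
)

# Mean and standard deviation of sbp within each behavior type
grp_mean <- tapply(sim$sbp, sim$behpat, mean)
grp_sd   <- tapply(sim$sbp, sim$behpat, sd)

# Limits of the error bars: mean - sd and mean + sd
limits <- data.frame(
  behpat = names(grp_mean),
  ymin   = as.numeric(grp_mean - grp_sd),
  y      = as.numeric(grp_mean),
  ymax   = as.numeric(grp_mean + grp_sd)
)
limits
```

Unlike the box and whiskers, these limits are symmetric around the mean by construction, which is why the two summaries can disagree for skewed data.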

The next section shows how to create a cross-tabulation of weight categories and behavior types in Stata using tabulate behpat wghtcat, column.

We can do this in R, but first we should convert the wghtcat variable to a factor and order the levels.

WstClbGrpStdy$wghtcat <- factor(WstClbGrpStdy$wghtcat,
                                 levels = c("< 140", "140-170",
                                            "170-200", "> 200"))

To create cross-tabs for factor variables, we could use base R’s table() function.

table(WstClbGrpStdy$behpat, WstClbGrpStdy$wghtcat)
    ##
    ##      < 140 140-170 170-200 > 200
    ##   A1    20     125      98    21
    ##   A2   100     612     514    99
    ##   B3    90     610     443    73
    ##   B4    22     191     116    20
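Stata's column option reports column percentages, which table() alone does not. A minimal base R sketch of that step is below; the counts are typed in from the output above so the example runs without the data file, and the object names counts and col_pct are just for illustration.

```r
# Counts copied from the table() output above
counts <- as.table(matrix(
  c( 20, 125,  98, 21,
    100, 612, 514, 99,
     90, 610, 443, 73,
     22, 191, 116, 20),
  nrow = 4, byrow = TRUE,
  dimnames = list(behpat  = c("A1", "A2", "B3", "B4"),
                  wghtcat = c("< 140", "140-170", "170-200", "> 200"))
))

# margin = 2 divides each cell by its column total
col_pct <- round(prop.table(counts, margin = 2) * 100, 2)

# Counts with the column totals appended as a final row
addmargins(counts, margin = 1)
col_pct
```

With the real data, the same call would be prop.table(table(WstClbGrpStdy$behpat, WstClbGrpStdy$wghtcat), margin = 2).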

But I recommend learning some of dplyr’s handy programming syntax, because then you’ll be able to create your own custom table functions. These were adapted from this thread on StackOverflow.

freq_crosstab <- function(data, var1, var2) {
  var1 <- rlang::enquo(var1)
  var2 <- rlang::enquo(var2)
  data %>%
    dplyr::count(!!var1, !!var2) %>%
    tidyr::spread(!!var2, n, fill = 0) %>%
    dplyr::mutate(Total := rowSums(dplyr::select(., -!!var1)),
                  Freq = Total / sum(Total),
                  Freq = paste0(round(Freq*100, 2), "%")) %>%
    dplyr::bind_rows(dplyr::bind_cols(!!rlang::quo_name(var1) := "Total",
                               dplyr::summarize_if(., is.numeric, sum)))
}
freq_crosstab(WstClbGrpStdy, behpat, wghtcat)
    ## # A tibble: 5 x 7
    ##   behpat `< 140` `140-170` `170-200` `> 200` Total Freq
    ##   <chr>    <dbl>     <dbl>     <dbl>   <dbl> <dbl> <chr>
    ## 1 A1          20       125        98      21   264 8.37%
    ## 2 A2         100       612       514      99  1325 42.01%
    ## 3 B3          90       610       443      73  1216 38.55%
    ## 4 B4          22       191       116      20   349 11.07%
    ## 5 Total      232      1538      1171     213  3154 <NA>

Or add some percents to the frequency columns and make it tall.

freq_tall_table <- function(data,
                       group_var,
                       prop_var) {
  group_var <- enquo(group_var)
  prop_var  <- enquo(prop_var)
  data %>%
    group_by(!!group_var, !!prop_var) %>%
    summarise(n = n()) %>%
    mutate(freq = n / sum(n),
       freq = paste0(round(freq*100, 2), "%")) %>%
    ungroup
}
freq_tall_table(WstClbGrpStdy, behpat, wghtcat)
    ## # A tibble: 16 x 4
    ##    behpat wghtcat     n freq
    ##    <fct>  <fct>   <int> <chr>
    ##  1 A1     < 140      20 7.58%
    ##  2 A1     140-170   125 47.35%
    ##  3 A1     170-200    98 37.12%
    ##  4 A1     > 200      21 7.95%
    ##  5 A2     < 140     100 7.55%
    ##  6 A2     140-170   612 46.19%
    ##  7 A2     170-200   514 38.79%
    ##  8 A2     > 200      99 7.47%
    ##  9 B3     < 140      90 7.4%
    ## 10 B3     140-170   610 50.16%
    ## 11 B3     170-200   443 36.43%
    ## 12 B3     > 200      73 6%
    ## 13 B4     < 140      22 6.3%
    ## 14 B4     140-170   191 54.73%
    ## 15 B4     170-200   116 33.24%
    ## 16 B4     > 200      20 5.73%

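Knowing how to use the tidyr functions is always helpful for rearranging columns and rows. Note that spreading wghtcat while the freq column is still present leaves one row per behpat and freq combination, so most cells are NA; dropping freq first (for example with dplyr::select(-freq)) would give one row per behavior type.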

freq_tall_table(WstClbGrpStdy, behpat, wghtcat) %>%
    tidyr::spread(wghtcat, n)
    ## # A tibble: 16 x 6
    ##    behpat freq   `< 140` `140-170` `170-200` `> 200`
    ##    <fct>  <chr>    <int>     <int>     <int>   <int>
    ##  1 A1     37.12%      NA        NA        98      NA
    ##  2 A1     47.35%      NA       125        NA      NA
    ##  3 A1     7.58%       20        NA        NA      NA
    ##  4 A1     7.95%       NA        NA        NA      21
    ##  5 A2     38.79%      NA        NA       514      NA
    ##  6 A2     46.19%      NA       612        NA      NA
    ##  7 A2     7.47%       NA        NA        NA      99
    ##  8 A2     7.55%      100        NA        NA      NA
    ##  9 B3     36.43%      NA        NA       443      NA
    ## 10 B3     50.16%      NA       610        NA      NA
    ## 11 B3     6%          NA        NA        NA      73
    ## 12 B3     7.4%        90        NA        NA      NA
    ## 13 B4     33.24%      NA        NA       116      NA
    ## 14 B4     5.73%       NA        NA        NA      20
    ## 15 B4     54.73%      NA       191        NA      NA
    ## 16 B4     6.3%        22        NA        NA      NA

Multivariable Descriptions

Finally, this chapter concludes with a discussion of graphing and displaying pairwise combinations of variables. The scatter plot matrix is shown following a correlation matrix in Stata (correlate sbp age weight height (obs=3154)). To create a correlation matrix in R, we could use the cor() function. I like to keep everything as tibble() or data.frame in my working environment, so I do a little reformatting to the output.

CorrTable <- WstClbGrpStdy %>%
    dplyr::select(sbp,
                  age,
                  weight,
                  height) %>%
    cor() %>%
    # keep the matrix row names (sbp, age, weight, height) as a column
    as_tibble(rownames = "rowname")
CorrTable
    ## # A tibble: 4 x 5
    ##   rowname    sbp     age  weight  height
    ##   <chr>    <dbl>   <dbl>   <dbl>   <dbl>
    ## 1 sbp     1       0.166   0.253   0.0184
    ## 2 age     0.166   1      -0.0344 -0.0954
    ## 3 weight  0.253  -0.0344  1       0.533
    ## 4 height  0.0184 -0.0954  0.533   1
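Each cell in this matrix is a Pearson correlation, which is the covariance of two variables scaled by both standard deviations. The sketch below checks that against cor() on simulated data (the values are assumptions, not the WCGS data).

```r
set.seed(2018)

# Simulated stand-ins for two of the variables; not the WCGS data
weight <- rnorm(200, mean = 170, sd = 21)
sbp    <- 100 + 0.17 * weight + rnorm(200, sd = 14)

# Pearson correlation: covariance scaled by both standard deviations
r_by_hand <- cov(sbp, weight) / (sd(sbp) * sd(weight))

# cor() applied to a data frame returns the full symmetric matrix
corr_mat <- cor(data.frame(sbp, weight))

r_by_hand
corr_mat
```

The off-diagonal entries of corr_mat match r_by_hand, and the diagonal is 1 because every variable is perfectly correlated with itself.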

The text then displays a graph of weight and sbp across categories of behavior type. ggplot2 can do this type of graphing with faceting. We can take what we’ve learned about geom_smooth(method = "loess") and apply it here, only now we will add a group = behpat argument and a facet_wrap() function.

WstClbGrpStdy %>%
    ggplot(aes(x = weight,
               y = sbp,
               group = behpat)) + # separate by this factor
    geom_point(aes(color = behpat), # color aesthetic added to this layer
                   show.legend = FALSE) + # don't need this
    geom_smooth(method = "loess") + # add the loess line
        facet_wrap(. ~ behpat, # now wrap the plot in a 2x2 layout
                   nrow = 2,
                   ncol = 2) +
    ggtitle("Scatterplots of SBP vs. weight by behavior pattern")

Great! Now we can look at more complicated combinations of plots, like the scatter plot matrix.

The pairs() plot below shows the pairwise scatter plots between sbp, age, weight, and height.

pairs(WstClbGrpStdy[ , c("sbp", "age", "weight", "height")],
     main = "A scatter plot matrix between sbp, age, weight, and height")

This chart is lacking in some details, so we will add additional formatting and get a smoothed line through the lower scatter plots.

# WstClbGrpStdy[ , c(14, 1, 20, 9)]
pairs(WstClbGrpStdy[ , c(14, 1, 20, 9)],
                   lower.panel = panel.smooth,
      main = "A scatter plot matrix between sbp, age, weight, and height")

We can also create the scatter plot matrix with ggplot2 using the GGally package (you might also need the fix found here: devtools::install_github("ggobi/ggally#266")).

This function needs us to specify the columns we want to use in the plot, and list()s of options for the lower and diag arguments, which control the lower-triangle and diagonal panels.

# devtools::install_github("ggobi/ggally#266")
# WstClbGrpStdy[ , c(14, 1, 20, 9)] # use a little column subsetting...
GGally::ggpairs(WstClbGrpStdy[ , c(14, 1, 20, 9)],
                lower = list(
                    continuous = "smooth",
                    mapping = aes(alpha = 0.1)),
                diag = list(
                    continuous = "barDiag",
                    mapping = aes(alpha = 0.1))) +
    ggtitle("A scatter plot matrix between sbp, age, weight, and height")

Compare this to the results in CorrTable:

knitr::kable(CorrTable)

rowname	sbp	age	weight	height
sbp	1.0000	0.1657	0.2532	0.0184
age	0.1657	1.0000	-0.0344	-0.0954
weight	0.2532	-0.0344	1.0000	0.5329
height	0.0184	-0.0954	0.5329	1.0000

In conclusion

I hope anyone using VGSM or in the TICR program using R (or wanting to) will find this useful. I will be posting the other chapters after adding newer options/packages that have come out since the time I completed my coursework.

