EDA
May 11, 2018•6,860 words
Motivation
In my graduate program we had an excellent text on Regression Methods in Biostatistics (which I will refer to as VGSM). Unfortunately, all of the examples and code from the courses, labs, and text were in Stata. As an R user, I kept having to translate the topics from Stata into R. But this actually turned into a valuable exercise because it forced me to separate the underlying statistical concepts from the programming languages.
I've decided to post various parts of the text/topics because I'm sure there will be future R users in the same boat :)
What is Exploratory Data Analysis (EDA)?
"Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone--as the first step." - John W. Tukey, Exploratory Data Analysis, 1977
I think the "first step" is the most important part point of exploratory data analysis (EDA)--it should precede any modeling, inferences, or fancy machine learning algorithm. Without looking at the predictors and outcomes in a graph, modeling data can be a waste of time or misleading.
I'll be the first to admit, I'm not great at remembering tests, assumptions, transformations, or writing equations on a whiteboard. I'm OK at math, but I have never been exceptional at it. But I love visualizing data, and this passion (and a lot of general curiosity) has saved me plenty of times when I'm faced with a data question.
Essential graphs for EDA
The chapter that addresses exploring data in VGSM is chapter 2 "Exploratory and Descriptive Methods". The authors recommend three graphs for numerical data: box-plots, histograms, and Q-Q plots.
The Data
The data we will be using for this post comes from the "Western Collaborative Group Study (WstClbGrpStdy)". You can read more about it here. These data are available in the native wcgs
data frame, but I'll be loading the version that came with my text because it has a few additional variables.
I also name this imported file WstClbGrpStdy
because I like to capitalize and disemvowel (yes, that is a thing, thank you, James Joyce) names of data frames to distinguish them from other R objects (vectors, lists, etc.).
These data come from an epidemiological study of behavioral attributes and coronary heart disease. The findings were published in 1964, and ironically there isn't a single graph used to display these data in their manuscript. It's hard to know how familiar the authors were with EDA (John Tukey didn't published Exploratory Data Analysis until 1977), so we can't really hold the lack of graphs and figures against them. The code to import this data set is below:
WstClbGrpStdy <- readxl::read_xls("Data/wcgs.xls")
Structure vs. glimpse
The tidyverse packages all have a underlying philosophy around tidy data, with a consistent syntax and well defined terms/ideas.
A simple example of this concept utils::str()
vs. dplyr::glimpse()
. Viewing the shape and structure of an object is a common starting point, but it can also turn into a place of confusion for newcomers to the R language. The str()
function gives a lot of information (and jargon) that isn't always necessary to start exploring your data.
I prefer using dplyr::glimpse()
because it 1) puts more of the data on the screen than str()
, 2) it displays the format for each variable in the data set in a consistent way, and 3) it omits the additional information that occasionally gets printed with str()
.
See the image below for an example:
I add 78
to ensure everything prints nicely to the screen width.
glimpse(WstClbGrpStdy, 78)
## Observations: 3,154
## Variables: 22
## $ age <dbl> 50, 51, 59, 51, 44, 47, 40, 41, 50, 43, 59, 54, 48, 39, ...
## $ arcus <dbl> 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ behpat <chr> "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A...
## $ bmi <dbl> 31.32, 25.33, 28.69, 22.15, 22.31, 27.12, 23.24, 22.96, ...
## $ chd69 <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "N...
## $ chol <dbl> 249, 194, 258, 173, 214, 206, 190, 212, 130, 233, 181, 2...
## $ dbp <dbl> 90, 74, 94, 80, 80, 76, 78, 84, 70, 80, 86, 76, 78, 74, ...
## $ dibpat <chr> "Type A", "Type A", "Type A", "Type A", "Type A", "Type ...
## $ height <dbl> 67, 73, 70, 69, 71, 64, 70, 70, 71, 68, 72, 67, 71, 70, ...
## $ id <dbl> 2343, 3656, 3526, 22057, 12927, 16029, 3894, 11389, 1268...
## $ lnsbp <dbl> 4.883, 4.787, 5.063, 4.836, 4.836, 4.754, 4.804, 4.868, ...
## $ lnwght <dbl> 5.298, 5.257, 5.298, 5.011, 5.075, 5.063, 5.088, 5.075, ...
## $ ncigs <dbl> 25, 25, 0, 0, 0, 80, 0, 25, 0, 25, 10, 0, 20, 0, 4, 0, 0...
## $ sbp <dbl> 132, 120, 158, 126, 126, 116, 122, 130, 112, 120, 130, 1...
## $ smoke <chr> "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No"...
## $ t1 <dbl> -1.6334, -4.0634, 0.6397, 1.1218, 2.4250, -0.7875, -0.60...
## $ time169 <dbl> 1367, 2991, 2960, 3069, 3081, 2114, 2929, 3010, 3104, 28...
## $ typchd69 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,...
## $ uni <dbl> 0.48607, 0.18595, 0.72780, 0.62446, 0.37898, 0.73550, 0....
## $ weight <dbl> 200, 192, 200, 150, 160, 158, 162, 160, 195, 187, 206, 1...
## $ wghtcat <chr> "170-200", "170-200", "170-200", "140-170", "140-170", "...
## $ agec <chr> "46-50", "51-55", "56-60", "51-55", "41-45", "46-50", "3...
“Exploratory data analysis is detective work–in the purest sense–finding and revealing the clues.” - John W. Tukey, Exploratory Data Analysis, 1977
The first variable we will look at in this data frame is the systolic blood pressure (sdp
). In R, we want to make sure this variable is formatted correctly before putting it in a graph. We can also use glimpse()
on a single variable.
WstClbGrpStdy$sbp %>% glimpse()
## num [1:3154] 132 120 158 126 126 116 122 130 112 120 ...
Base R histograms with hist()
A histogram takes a numeric variable and displays the range of values of across the horizontal scale (x-axis). Histograms also divide the horizontal scale into segments, called “bins.” The left-hand side of a histogram displays how many observations fall into each bin (y-axis).
Base R comes with a set of graphics::
functions for plotting data (this just means you won’t have to install any packages to use them). We can build a histogram using base R with the graphics::hist()
. This function takes the data you want to plot as the first argument (WstClbGrpStdy$sbp
).
# base R histogram of sbp
graphics::hist(WstClbGrpStdy$sbp)
Notice this plot lists “Histogram of WstClbGrpStdy$sbp” as the title and labels the x-axis as “WstClbGrpStdy$sbp”.
Technically this is all the function needs, but I recommend supplying something to the main =
title and xlab =
label arguments. Titles and labels help describe what we’re seeing (or what we expected to see) when R renders the plot. Future us will thank present us if take the extra time to add a title and label our plots.
graphics::hist(x = WstClbGrpStdy$sbp,
main = "Histogram of systolic blood pressure",
xlab = "Systolic blood pressure in mm Hg (sbp)")
Curious what to include in a title or label? I tend to stick with the type of graph and the variable I’m displaying in plain English (no jargon). For labels, I include the name of the measurement (no acronyms) and the units. A title and label for every graph might seem like an arduous level of detail, but if we end up investigating a data set with many abbreviated, nebulous variable names, we are going to appreciate
being able to look at a plot or graph and know precisely what it contains.
How many bins?
The graphics::hist()
command will automatically divide up the data into bins, which usually isn’t a bad option. We can also specify the number of bins with the breaks
argument. This changes the number and width of the vertical bars. As suggested in VGSM, the
“rule of thumb is to choose the number of bins to be about
1 + 3.3 log10(n)
”
We can calculate this below:
wcgs_sbp_bins <- 1 + 3.3*log10(nrow(WstClbGrpStdy))
wcgs_sbp_bins
## [1] 12.55
This new numerical vector can be supplied to the breaks
argument in the hist()
graph.
Adding reference lines to a hist()
plot
Add a reference line with a measure of central tendency to the histogram graph with the abline()
function. The first argument is either h
or v
(for a horizontal or vertical line), followed by the values you want the line to represent (in our case the median()
of sbp
).
I add two additional arguments to designate the color (col = "blue"
)
and width of the line (lwd = 4
).
graphics::hist(x = WstClbGrpStdy$sbp,
main = "Histogram of systolic blood pressure",
xlab = "Systolic blood pressure in mm Hg (sbp)",
breaks = wcgs_sbp_bins)
abline(v = median(WstClbGrpStdy$sbp),
col = "blue",
lwd = 4)
Histogram quick plots with qplot
The next option for making a histogram comes from the ggplot2
package. The qplot()
function (short for quick plot) is a fast way to plot a single variable. Titles can be added using ggtitle()
and the same xlab()
function from before.
ggplot2::qplot(sbp, data = WstClbGrpStdy) +
ggplot2::ggtitle("Histogram of systolic blood pressure") +
xlab("Systolic blood pressure in mm Hg (sbp)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The qplot
function will create a histogram or bar chart depending on the format of the variable you provide (numeric or factor). As the message tells us, the default number of bins is 30. This is more than twice the number of bins we calculated before, and we can see this has a
profound effect on how these data are represented in the graph (more on this later).
If you’re like me and prefer to use the pipes from the magrittr
package, this can be re-written using a little syntactic sugar:
WstClbGrpStdy %>% ggplot2::qplot(sbp, data = .) +
ggplot2::ggtitle("Histogram of systolic blood pressure") +
xlab("Systolic blood pressure in mm Hg (sbp)")
Why did we have to specify a period (.
) in the data =
argument?
magrittr
interprets the period argument as “take the named parameter as the object on the left of the %>%
operator.”
So Data %>% function(variable, named_argument = .)
is equivalent to
function(variable, named_argument = Data)
Notice we also get a message from this graph about the number of bins
being used. The stat_bin()
argument is using a default of 30
, but we can adjust this with bins
argument.
Adding the median to a qplot()
We can also add the median to this plot by adding a geom
, specifically a ggplot2::geom_vline()
. This function takes an aesthetic mapping argument (aes(xintercept = median(sbp))
), a color argument (col
), and a size
argument that will determine the width of the vertical line.
WstClbGrpStdy %>%
ggplot2::qplot(sbp, data = ., bins = wcgs_sbp_bins) +
ggplot2::geom_vline(aes(xintercept = median(sbp)), # add median
col = 'blue',
size = 2) +
ggplot2::ggtitle("Histogram of systolic blood pressure") +
xlab("Systolic blood pressure in mm Hg (sbp)")
This histogram now looks more like the histogram we made using graphics::hist()
above (but with fewer bins).
ggplot
histogram plots
Below is an example of how to get a histogram using ggplot2
. I’ve covered using ggplot2
elsewhere, so I won’t repeat myself here. But know that ggplot2
allows users to build a plot layer by layer, starting with a data
argument, mapping variables to aesthetics (aes(x = sbp)
), and then supplying a geom
. We got a small introduction to this method in the example above when we added the ggplot2::geom_vline()
layer to the ggplot2::qplot()
.
Again, we can use magrittr
to increase the readability on these functions.
WstClbGrpStdy %>% # Data
ggplot2::ggplot(aes(x = sbp)) + # variable
ggplot2::geom_histogram(bins = wcgs_sbp_bins) + # geom
ggplot2::geom_vline(aes(xintercept = median(sbp)), # add median
col = 'blue',
size = 2) +
ggplot2::ggtitle("Histogram of systolic blood pressure") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
I don’t use qplot()
s very often, but only because I like the having the ability to customize and build my plots layer-by-layer.
What are we seeing in a histogram?
These histograms are showing us the frequency of each value in sbp
. We can see above that most systolic blood pressures appear to be relatively close (although slightly lower than) to the median. We can also see the distribution extends to the right of the median (higher sbp
values) all the way out to ~230 mm Hg. This is called a “right-skewed” distribution–the tail extends out to the right side (vs a “left-skewed” distribution where the tail extends out to the left side).
median(WstClbGrpStdy$sbp)
## [1] 126
It’s also a good idea to view multiple histograms with different levels of bins
to get an idea for how this variable’s values are distributed.
# Create some histograms -----
WCGS_hist_15 <- WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 15) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
ggtitle("Histogram of sbp (15 bins)") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
WCGS_hist_20 <- WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 20) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
ggtitle("Histogram of sbp (20 bins)") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
WCGS_hist_25 <- WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 25) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
ggtitle("Histogram of sbp (25 bins)") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
WCGS_hist_30 <- WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 30) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
ggtitle("Histogram of sbp (30 bins)") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
# arrange these plots ----
library(ggpubr)
ggarrange(WCGS_hist_15, WCGS_hist_20, WCGS_hist_25, WCGS_hist_30,
ncol = 2, nrow = 2)
ggsave(filename = "Images/02_ggplot2_hists.png")
When we look at all four levels of bins
, we can see that the distributions have a collection of scores on the left-hand side, but the tail extends to the higher or more positive scores. The is an example of right-skewness.
When building histograms, we’re looking for the highs, lows, and overall spread of the data. The trick is to look at enough bin values to give us an idea about the variation of values in our variable, without making it too granular to interpret.
What aren’t we seeing in a histogram?
The histogram doesn’t tell us anything about the summary statistics we might be using for inferences (such as the mean, standard deviation, median, etc.). It also doesn’t give us a clear picture of the extreme cases–it’s more of a “bird’s eye view” of a variable’s distribution. For more descriptive information, we will have to move onto box-plots.
Percentiles, Medians, and Box-plots
Box-plots were originally created by John W. Tukey (although he called them “box-and-whisker” plots). Box-plots are useful for understanding the shape of variable’s distribution because they divide up the values into distinct proportions. For example, if we order all the values in a variable from lowest to highest, the middle number is the 50th percentile (also called the median). I know half the values lie below this number, and the other half are above it.
Why use the median and not the mean (or average)?
The median isn’t as sensitive to extreme values as the mean. I’ll demonstrate this with a quick example. Take a vector (x
) with 10 values and get the median()
and mean()
.
x <- c(4, 7, 9, 11, 13, 15, 17, 19, 21)
median(x)
## [1] 13
mean(x)
## [1] 12.89
These two numbers are relatively close to one another. Now we can take a quick look at the distribution with hist(x)
.
hist(x)
This distribution looks symmetrical–more importantly, it looks symmetrical around the mean and the median (12.89
and 13
). This is important because many modelling techniques assume the distribution of the variables are “normal” or approximately bell-shaped. Now we create a second vector (y
), which is identical to x
, but the lowest and highest values have been changed substantially (1
to 0.0001
and 21
to 21000
). Calculate the median()
and mean()
of y
.
y <- c(0.0001, 3, 7, 9, 11, 13, 15, 17, 21000)
median(y)
## [1] 11
mean(y)
## [1] 2342
Notice how the median didn’t change at all, but the mean has increased from 12.89
to 2342
. Take a look at the distribution of y
with hist()
:
hist(y)
The bell-shape is gone, replaced by just two bars (one for the 8 values under 5,000, and one for the single value above 20,0000).
Base R box-plots with boxplot()
Box-plots use quartiles to divide the data into the 25th, 50th (median), and 75th percentiles. We can create a box-plot in base R using the graphics::boxplot()
function.
graphics::boxplot(WstClbGrpStdy$sbp,
main = "Box-plot of Systolic blood pressure in mm Hg",
ylab = "Systolic blood pressure in mm Hg (sbp)")
According to the authors in VGSM, a box-plot displays the following information:
- Location, as measured by the median
- Spread, as measured by the height of the box (this is called the interquartile range or IQR)
- Range of the observations
- Presence of outliers
- Some information about shape
I’ve created a graphic that adds a little detail for each element of the box-plot.
Relationship between histograms and box-plots
Each of these imaginary box-plots has a corresponding histogram. These box plots weren’t created in R (and aren’t drawn perfectly to scale), but I’ve included this figure to illustrate the information captured in
a box-plot might be hard to see in a histogram:
Extreme values (sometimes called outliers) are made visible with a box-plot–they show up as dots or values beyond the 1.5x thresholds of each “whisker”.
We can also see the location changes by the shifting line within the box (the median).
Box-plots with qplot()
The code to create a box-plot is very similar to creating a histogram with qplot()
. We supply a data frame (WstClbGrpStdy
) to the qplot()
function, but we also need to add an x argument (x = " "
), because a box-plot geom needs to map the values of sbp
onto the y
axis. We add our title and label, but also set the xlab()
label to NULL
so the x
doesn’t display.
WstClbGrpStdy %>%
qplot(x = "", y = sbp, data = ., geom = "boxplot") +
ggtitle("Box-plot of Systolic blood pressure in mm Hg") +
ylab("Systolic blood pressure in mm Hg (sbp)") +
xlab(NULL)
ggplot
box-plots
Finally, we can extend the ggplot2
functions to the layered grammar by piping a data frame to the ggplot(aes(
functions, mapping x
to an empty value (" "
), y
to the sbp
variable, then adding the
geom_boxplot()
.
The title and labels get added in the same way of as the qplot()
function.
WstClbGrpStdy %>%
ggplot(aes(x = "", y = sbp)) +
geom_boxplot() +
ggtitle("Box-plot of Systolic blood pressure in mm Hg") +
ylab("Systolic blood pressure in mm Hg (sbp)") +
xlab(NULL)
What are we seeing in a boxplot?
To summarize, the box displays the upper and lower quartiles as boundaries for three percentiles of the data: 1) the bottom of the box is the 25th percentile, meaning one-quarter of the values are below this value 2) the median (the line in the middle of the box) tells us how far the middle value is from the 25th and 75th percentile, and 3) the top of the box is the 75th percentile, so one-quarter of the values are above this value.
The entire length of the box is the interquartile range (IQR). The whiskers show 1) the maximum values within 1.5 times the IQR from the upper (75th) quartile, and 2) the minimum values that are within 1.5x the IQR from the lower (25th) quartile. The points beyond the whiskers are the extreme values.
With regard to these data, I thought this description from VGSM was enlightening,
“right-skewness will be indicated if the upper whisker is longer than the lower whisker or if there are more outliers in the upper range. Both the boxplot and the histogram show evidence for right-skewness in the SBP data.”
require(ggplot2)
require(ggpubr)
sbp_boxplot = WstClbGrpStdy %>%
ggplot(aes(x = "",
y = sbp)) +
geom_boxplot(show.legend = FALSE) +
coord_flip() +
scale_y_continuous(expand = c(0,5),
limit = c(90, 250)) +
ggtitle("Box-plot & Histogram of systolic blood pressure") +
xlab(NULL) +
ylab(NULL)
sbp_histogram = WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 20) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
scale_x_continuous(expand = c(0,5),
limit = c(90, 250))
ggarrange(sbp_boxplot,
sbp_histogram,
heights = c(2.5, 3),
align = "hv",
ncol = 1,
nrow = 2)
By combining these graphs, we can see how the median (blue line) in the histogram splits the distribution in the box, and how the skewness in the histogram is represented in the whiskers and points of the box-plot.
Base R Q-Q plots
This brings us to the final plot we will cover: the Q-Q plot, or quantile-quantile plot. This plot is helpful when we want to know how closely our data aligns with the bell-shaped “normal” distribution.
We can create a Q-Q plot in base R using the following functions. The qqnorm()
function needs a variable we want to plot (sbp
).
qqnorm(WstClbGrpStdy$sbp,
main = "Systolic blood pressure QQ plot",
ylab = "Systolic Blood Pressure mm Hg (sbp)")
The interpretation of the Q-Q plot is the following from VGSM,
“a normal Q–Q plot is constructed so that the data points fall along an approximately straight line when the data are from a normal distribution, and deviate systematically from a straight line when the data are from other distributions.”
What does it mean if data are normally distributed?
In a broader sense, if I were to randomly sample a data point from a variable with a symmetrical, bell-shaped distribution, that point has an approximately equal probability of coming from either side of the distribution’s mean (or median).
The sbp
data do not seem to fall along a straight line, but it would be helpful if we had a straight line for reference. We can add a straight line with qqline()
–this function needs the same variable we provided to qqnorm()
, but I also specified the line width in the lwd = 2
argument. I also included a col
argument to color the systolic blood pressure values, and a grid()
argument to give our coordinate system some gridlines.
qqnorm(WstClbGrpStdy$sbp,
main = "Systolic blood pressure QQ plot",
col = "blue",
ylab = "Systolic Blood Pressure mm Hg (sbp)")
qqline(WstClbGrpStdy$sbp,
lwd = 2)
grid(lty = "dotted",
col = "gray75")
We can see these values definitely do not fall along the straight reference line of “theoretical quantiles”.
Q-Q quick plots with stat_qq()
The Q-Q plots in ggplot2
can be built in a variety of ways. The first is by setting the sample
aesthetic to the variable we want to see (sbp
) and then adding a stat_qq()
layer.
WstClbGrpStdy %>%
ggplot(aes(sample = sbp)) +
stat_qq() +
ggtitle("Q-Q plot of systolic blood pressure")
If we want to add a reference line to the stat_qq()
plot, ggplot2
version 2.2.1.9000 is now equipped with a function stat_qq_line()
for a straight reference line. I use the geom for creating the points on the graph (geom_qq()
) along with some opacity and color options, then add the reference line with stat_qq_line()
.
WstClbGrpStdy %>%
ggplot(aes(sample = sbp)) +
geom_qq(alpha = 0.3,
color = "blue") +
stat_qq_line() +
ggtitle("Q-Q plot of systolic blood pressure")
Q-Q plots are also helpful for determining the shape of the distribution. See the following quote from VGSM text:
“The shape and direction of the curvature can be used to diagnose the deviation from normality. Upward curvature is indicative of right-skewness, while downward curvature is indicative of left-skewness.”
The upward curve of the systolic blood pressure shows these data are right-skewed (which we knew from looking at the other two plots, but it’s always good to verify this again!).
Heavy and light tailed distributions
We are going to diverge from the sbp
variable in the WstClbGrpStdy
data frame and use the Tails
data frame to show how two different distributions look in the three graphs we covered thus far.
Tails %>% glimpse(78)
## Observations: 1,000
## Variables: 2
## $ heavy_tail <dbl> -0.56, -0.40, -0.03, -0.77, -0.26, -0.38, -0.47, 0.32,...
## $ light_tail <dbl> 0.04615, 0.12579, 0.02782, 0.03540, 0.02829, 0.02750, ...
This data frame only has two variables, heavy_tail
and light_tail
. As the VGSM text notes, Q-Q plots can tell us a lot about how a distribution deviates from the normal “bell-curve”.
“The other two common patterns are S-shaped. An S-shape as in [see below] indicates a heavy-tailed distribution,”
We can plot the heavy_tail
data below and arrange all three plots to display together.
Heavy_qq <- Tails %>%
ggplot(aes(sample = heavy_tail)) +
geom_qq(alpha = 0.3,
color = "blue") +
stat_qq_line() +
ggtitle("Q-Q plot of heavy tail distribution")
# ylab(NULL)
Heavy_box <- Tails %>%
ggplot(aes(x = "y",
y = heavy_tail)) +
geom_boxplot(show.legend = FALSE) +
coord_flip() +
scale_y_continuous(expand = c(0, 1),
limit = c(-6, 6)) +
ggtitle("Box-plot of heavy tail distribution") +
xlab(NULL) +
ylab(NULL)
Heavy_hist <- Tails %>%
ggplot(aes(x = heavy_tail)) +
geom_histogram() +
geom_vline(aes(xintercept = median(heavy_tail)),
col = 'blue',
size = 1) +
scale_x_continuous(expand = c(0, 1),
limit = c(-6, 6)) +
ggtitle("Histogram of heavy tail distribution") +
xlab(NULL) +
ylab(NULL)
# arrange these plots ----
library(ggpubr)
ggarrange(Heavy_hist,
Heavy_qq,
Heavy_box,
align = "hv",
ncol = 2,
nrow = 2,
widths = c(5, 5, 5))
The S-curve being referred to is the blue line in the Q-Q plot, and the distribution has a heavy tail if the line of values curves above the reference line on the positive (higher) end of the x-axis, but below the line on the negative (or lower) end of the x-axis.
We can also see the box-plot has many outliers outside the whiskers.
“while an S-shape like [see below] is indicative of a light-tailed distribution.”
Light_qq <- Tails %>%
ggplot(aes(sample = light_tail)) +
geom_qq(alpha = 0.3,
color = "blue") +
stat_qq_line() +
ggtitle("Q-Q plot of light tail distribution")
Light_box <- Tails %>%
ggplot(aes(x = "y",
y = light_tail)) +
geom_boxplot(show.legend = FALSE) +
coord_flip() +
ggtitle("Box-plot of light tail distribution") +
xlab(NULL) +
ylab(NULL)
Light_hist <- Tails %>%
ggplot(aes(x = light_tail)) +
geom_histogram() +
geom_vline(aes(xintercept = median(light_tail)),
col = 'blue',
size = 1) +
ggtitle("Histogram of light tail distribution") +
xlab(NULL) +
ylab(NULL)
# arrange these plots ----
library(ggpubr)
ggarrange(Light_hist,
Light_qq,
Light_box,
align = "hv",
ncol = 2,
nrow = 2,
widths = c(5, 5, 5))
This S-curve crosses the reference line, but in opposing directions from the heavy tail data (the line of values curves below the reference line on the positive (higher) end of the x-axis, and above the reference line on the negative (or lower) end of the x-axis).
The box-plot shows fewer outliers beyond the whiskers here, too.
Data Transformations
Many times data will require transformations before they can be entered into a statistical model. We have seen that variables can deviate from a normal distribution in their skewness or kurtosis, so visualization is important both before and after any transformations are made.
Common transformations are natural log (log
) and log10
. I liked the description of what a log transformation does to a skewed distribution in the VGSM,
“…a log transformation deemphasizes differences at the upper end of the scale and emphasizes those at the lower end.”
We will create two new variables of sbp
and look at their distributions using histograms (set at 25 bins each).
WstClbGrpStdy <- WstClbGrpStdy %>%
dplyr::mutate(log_sbp = log(sbp))
WCGS_log_sbp_hist_15 <- WstClbGrpStdy %>%
ggplot(aes(x = log_sbp)) +
geom_histogram(bins = 25) +
geom_vline(aes(xintercept = median(log_sbp)),
col = 'blue',
size = 1) +
geom_vline(aes(xintercept = mean(log_sbp)),
col = 'red',
size = 1) +
ggtitle("Histogram of log(sbp) (25 bins)") +
xlab("Ln of Systolic blood pressure (log_sbp)") # label
WCGS_hist_15 <- WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 25) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
geom_vline(aes(xintercept = mean(sbp)),
col = 'red',
size = 1) +
ggtitle("Histogram of sbp (25 bins)") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
# arrange these plots ----
library(ggpubr)
ggarrange(WCGS_hist_15,
WCGS_log_sbp_hist_15,
align = "hv",
ncol = 2,
widths = c(5, 5))
We can see the log_sbp
distribution is more symmetrical around the median/mean.
Doesn’t transforming my data make it harder to interpret when I model or graph them?
Maybe, but we just showed sbp
before and after a transformation, so it’s possible to present both distributions without the reader losing track of the plot. The authors also point “a difference on the transformed scale is still a difference.”
Categorical Variables
Categorical or count data has less options for graphing, but it’s still possible to create a bar graph and see how these numbers differ within a variable. Below is a bar graph of behpat
, which is the behavior type as a factor with levels A1
, A2
, B3
, and B4
.
# convert behpat to factor -----
WstClbGrpStdy$behpat <- factor(WstClbGrpStdy$behpat)
# create bar chart -----
WstClbGrpStdy %>% ggplot(aes(x = behpat)) +
geom_bar() +
ggtitle("Bar chart of behavior types") +
xlab("Behavior types behpat")
Outcome and Predictor Variable Scatter Plots
This section introduces multivariate plots with a plot of systolic blood pressure (sbp
) versus weight. I reproduce these plots below in base R with the plot()
function.
plot(WstClbGrpStdy$sbp, WstClbGrpStdy$weight,
main = "Systolic blood pressure vs. weight",
xlab = "Weight",
ylab = "Systolic blood pressure")
And we can create a similar chart in ggplot2
with the geom_point()
function.
WstClbGrpStdy %>%
ggplot(aes(x = weight, y = sbp)) +
geom_point() +
ggtitle("Systolic blood pressure vs. weight") +
xlab("Weight") +
ylab("Systolic blood pressure")
ggsave(filename = "Images/02_scatter_wtXsbp.png")
Adding a LO
cally WE
ighted S
catterplot S
moother (or LOWESS
)
Sometimes it helpful to fit a line through a scatter plot of two continuous variables to see if their relationship is linear. The text uses the lowess sbp weight, bw(0.25)
from Stata, and I’ve reproduced this graph below in ggplto2()
with a ggplot2::geom_smooth(method = "loess")
. Note that this includes a standard error (in gray).
WstClbGrpStdy %>%
ggplot(aes(x = weight, y = sbp)) +
ggplot2::geom_point() +
ggplot2::geom_smooth(method = "loess") +
ggplot2::ggtitle("Systolic blood pressure vs. weight") +
xlab("Weight") +
ylab("Systolic blood pressure")
“This is all just a fancy way of drawing a flexible curve through a cloud of points.”
The upward trends are due to the lack of data at these ends of the distributions.
Multivariate box-plots
Box-plots are also useful when we need to look at the distribution of a variable’s values across the levels or values of another variable. For example, if we wanted to examine systolic blood pressure sbp
across behavior types (in the behpat
variable), we could do this with the code below:
WstClbGrpStdy %>%
ggplot(aes(x = behpat, y = sbp)) +
geom_boxplot() +
ggtitle("Box-plot of Systolic blood pressure and behavior types") +
ylab("Systolic blood pressure in mm Hg (sbp)") +
xlab("Behavior types (behpat)")
BONUS: What if you want to use different statistics in your box-plot?
The four box plots above show the median, interquartile range, and extreme values in sbp
across the behavior types in behpat
. But what if we wanted to compare these summary statistics to the mean and standard deviation of sbp
across the behavior categories?
Fortunately ggplot2
will let me customize each element of the box-plot using the ggplot2::stat_summary()
and ggplot2::stat_boxplot()
functions.
WstClbGrpStdy %>%
group_by(behpat) %>%
summarise(mean_sbp = mean(sbp),
med_sbp = median(sbp))
## # A tibble: 4 x 3
## behpat mean_sbp med_sbp
## <fct> <dbl> <dbl>
## 1 A1 129. 126
## 2 A2 130. 128
## 3 B3 128. 124
## 4 B4 127. 126
WstClbGrpStdy %>%
ggplot(aes(x = behpat,
y = sbp,
color = behpat)) + # different color box for each level of wghtcat
geom_boxplot(aes(x = behpat,
y = sbp),
show.legend = FALSE) + # remove legend
stat_summary(fun.y = mean, # plots mean sbp per level of wghtcat
color = "darkslategray",
geom = "point",
size = 1.5) +
stat_summary(fun.data = mean_sd, # plots sd of sbp per level of wghtcat
geom = "errorbar", # as error bars
linetype = 2, # makes them dashed
color = "darkslategray", # and differnt color
width = 0.3,
size = 0.5) +
stat_boxplot(aes(x = behpat,# customizes boxplot error bar
y = sbp),
show.legend = FALSE,
geom = "errorbar",
linetype = 1,
width = 0.2) +
ggtitle("Box-plot of Systolic blood pressure and behavior types") +
ylab("Systolic blood pressure in mm Hg (sbp)") +
xlab("Behavior types (behpat)")
This shows us how systolic blood pressure varies according to the type of behavior. The box plot shows the distribution of sbp
, while the two stat_summary()
functions show the mean and standard deviations over the four levels of behpat
.
The next section shows how to create a cross-tabulation of weight categories and behavior types in Stata using tabulate behpat wghtcat, column
.
We can do this in R, but first we should convert the wghtcat
variable to a factor and order the levels.
WstClbGrpStdy$wghtcat <- factor(WstClbGrpStdy$wghtcat,
levels = c("< 140", "140-170",
"170-200", "> 200"))
To create cross-tabs for factor variables, we could use base R’s table
function.
table(WstClbGrpStdy$behpat, WstClbGrpStdy$wghtcat)
##
## < 140 140-170 170-200 > 200
## A1 20 125 98 21
## A2 100 612 514 99
## B3 90 610 443 73
## B4 22 191 116 20
But I recommend learning some of dplyr
s handy programming syntax, because then you’ll be able to create your own custom table functions. These were adapted from this thread on StackOverflow.
freq_crosstab <- function(data, var1, var2) {
var1 <- rlang::enquo(var1)
var2 <- rlang::enquo(var2)
data %>%
dplyr::count(!!var1, !!var2) %>%
tidyr::spread(!!var2, n, fill = 0) %>%
dplyr::mutate(Total := rowSums(dplyr::select(., -!!var1)),
Freq = Total / sum(Total),
Freq = paste0(round(Freq*100, 2), "%")) %>%
dplyr::bind_rows(dplyr::bind_cols(!!rlang::quo_name(var1) := "Total",
dplyr::summarize_if(., is.numeric, sum)))
}
freq_crosstab(WstClbGrpStdy, behpat, wghtcat)
## # A tibble: 5 x 7
## behpat `< 140` `140-170` `170-200` `> 200` Total Freq
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 A1 20 125 98 21 264 8.37%
## 2 A2 100 612 514 99 1325 42.01%
## 3 B3 90 610 443 73 1216 38.55%
## 4 B4 22 191 116 20 349 11.07%
## 5 Total 232 1538 1171 213 3154 <NA>
Or add some percents to the frequency columns and make it tall.
freq_tall_table <- function(data,
group_var,
prop_var) {
group_var <- enquo(group_var)
prop_var <- enquo(prop_var)
data %>%
group_by(!!group_var, !!prop_var) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n),
freq = paste0(round(freq*100, 2), "%")) %>%
ungroup
}
freq_tall_table(WstClbGrpStdy, behpat, wghtcat)
## # A tibble: 16 x 4
## behpat wghtcat n freq
## <fct> <fct> <int> <chr>
## 1 A1 < 140 20 7.58%
## 2 A1 140-170 125 47.35%
## 3 A1 170-200 98 37.12%
## 4 A1 > 200 21 7.95%
## 5 A2 < 140 100 7.55%
## 6 A2 140-170 612 46.19%
## 7 A2 170-200 514 38.79%
## 8 A2 > 200 99 7.47%
## 9 B3 < 140 90 7.4%
## 10 B3 140-170 610 50.16%
## 11 B3 170-200 443 36.43%
## 12 B3 > 200 73 6%
## 13 B4 < 140 22 6.3%
## 14 B4 140-170 191 54.73%
## 15 B4 170-200 116 33.24%
## 16 B4 > 200 20 5.73%
Knowing how to use the tidyr
functions is always helpful for rearranging columns and rows.
freq_tall_table(WstClbGrpStdy, behpat, wghtcat) %>%
tidyr::spread(wghtcat, n)
## # A tibble: 16 x 6
## behpat freq `< 140` `140-170` `170-200` `> 200`
## <fct> <chr> <int> <int> <int> <int>
## 1 A1 37.12% NA NA 98 NA
## 2 A1 47.35% NA 125 NA NA
## 3 A1 7.58% 20 NA NA NA
## 4 A1 7.95% NA NA NA 21
## 5 A2 38.79% NA NA 514 NA
## 6 A2 46.19% NA 612 NA NA
## 7 A2 7.47% NA NA NA 99
## 8 A2 7.55% 100 NA NA NA
## 9 B3 36.43% NA NA 443 NA
## 10 B3 50.16% NA 610 NA NA
## 11 B3 6% NA NA NA 73
## 12 B3 7.4% 90 NA NA NA
## 13 B4 33.24% NA NA 116 NA
## 14 B4 5.73% NA NA NA 20
## 15 B4 54.73% NA 191 NA NA
## 16 B4 6.3% 22 NA NA NA
Multivariable Descriptions
Finally, this chapter concludes with a discussion of graphing and displaying pairwise combinations of variables. The scatter plot matrix is shown following a correlation matrix in Stata (correlate sbp age weight height (obs=3154)
). To create a correlation matrix in R, we could use the cor()
function. I like to keep everything as tibble()
or data.frame
in my working environment, so I do a little reformatting to the output.
CorrTable <- WstClbGrpStdy %>%
dplyr::select(sbp,
age,
weight,
height) %>%
cor() %>%
as_tibble() %>%
rownames_to_column() %>%
mutate(rowname =
case_when(rowname == 1 ~ 'sbp',
rowname == 2 ~ 'age',
rowname == 3 ~ 'weight',
rowname == 4 ~ 'height'))
CorrTable
## # A tibble: 4 x 5
## rowname sbp age weight height
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 sbp 1 0.166 0.253 0.0184
## 2 age 0.166 1 -0.0344 -0.0954
## 3 weight 0.253 -0.0344 1 0.533
## 4 height 0.0184 -0.0954 0.533 1
A graph is displayed that shows weight
and sbp
across categories of behavior type. ggplot2
can do this type of graphing with facet_wrap
ing. We can take what we’ve learned about the geom_smooth(method = "loess")
and apply it here, only now we will add a group = behpat
argument and a facet_wrap()
function.
WstClbGrpStdy %>%
ggplot(aes(x = weight,
y = sbp,
group = behpat)) + # separate by this factor
geom_point(aes(color = behpat), # color aesthetic added to this layer
show.legend = FALSE) + # don't need this
geom_smooth(method = "loess") + # add the loess line
facet_wrap(. ~ behpat, # now wrap the plot in a 2x2 layout
nrow = 2,
ncol = 2) +
ggtitle("Scatterplots of SBP vs. weight by behavior pattern")
Great! Now we can look at more complicated combinations of plots, like the scatter plot matrix.
The pairs()
plot shows multiple pairwise correlations between sbp
age
weight
and height
below.
pairs(WstClbGrpStdy[ , c("sbp", "age", "weight", "height")],
main = "A scatter plot matrix between sbp, age, weight, and height")
This chart is lacking in some details, so we will add additional formatting and get a smoothed line through the lower scatter plots.
# WstClbGrpStdy[ , c(14, 1, 20, 9)]
pairs(WstClbGrpStdy[ , c(14, 1, 20, 9)],
lower.panel = panel.smooth,
main = "A scatter plot matrix between sbp, age, weight, and height")
We can also create the scatter plot matrix with ggplot2
using the GGally
package (you might also need the fix found here: devtools::install_github("ggobi/ggally#266")
).
This function needs us to specify the columns we want to use in the plot, and list()
s of options for the lower()
and diag()
plots.
# devtools::install_github("ggobi/ggally#266")
# WstClbGrpStdy[ , c(14, 1, 20, 9)] # use a little column subsetting...
GGally::ggpairs(WstClbGrpStdy[ , c(14, 1, 20, 9)],
lower = list(
continuous = "smooth",
mapping = aes(alpha = 0.1)),
diag = list(
continuous = "barDiag",
mapping = aes(alpha = 0.1))) +
ggtitle("A scatter plot matrix between sbp, age, weight, and height")
Compare this to the results in CorrTable
:
knitr::kable(CorrTable)
rowname | sbp | age | weight | height |
---|---|---|---|---|
sbp | 1.0000 | 0.1657 | 0.2532 | 0.0184 |
age | 0.1657 | 1.0000 | -0.0344 | -0.0954 |
weight | 0.2532 | -0.0344 | 1.0000 | 0.5329 |
height | 0.0184 | -0.0954 | 0.5329 | 1.0000 |
In conclusion
I hope anyone using VGSM or in the TICR program using R (or wanting to) will find this useful. I will be posting the other chapters after adding newer options/packages that have come out since the time I completed my coursework.