Visualizing Data

Example plots from the Visualizing Data lesson. The examples use ggplot, but you can make these graphs with other graphing packages, including R's base graphics. The goal of this lesson is to demonstrate the different graphs and what they are for, not to teach ggplot. We include a brief primer on ggplot below, along with a pointer to the online documentation.

A Brief Primer on ggplot

Graphs in ggplot2 can only be defined on data frames. The variables in the graph (x variables, y variables, the variables that define the color or size of the points) are called aesthetics, and are declared with the aes function.
The ggplot() function declares the graph object. The arguments to ggplot() can include the data frame of interest and the aesthetics. The ggplot() function does not by itself produce a visualization: visualizations are produced by layers.
Layers produce the plots and plot transformations, and are added to a given graph object using the + operator. Each layer can also take a data frame and aesthetics as arguments, in addition to plot-specific parameters. Examples of layers are geom_point for a scatterplot or geom_line for a line plot.
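
As a minimal sketch of how these pieces fit together (the data frame `toy` and its columns are made up for illustration and are not part of the lesson):

```r
library(ggplot2)

# hypothetical toy data, for illustration only
toy = data.frame(x = 1:10, y = (1:10)^2)

# declare the graph object: a data frame plus aesthetics, with no layers yet
p = ggplot(toy, aes(x = x, y = y))

# add layers with the + operator; each layer draws something
p2 = p + geom_point() + geom_line()

length(p$layers)   # the bare graph object has no layers
length(p2$layers)  # one layer per geom added

# printing the object renders the plot
print(p2)
```

The same pattern, a call to ggplot() followed by one or more layers, is used in every example in this lesson.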

The syntax should become clearer below. For online documentation, see http://docs.ggplot2.org/current/index.html.

Preliminaries

Load libraries. Set random seed for reproducible graphs.

You may need to install the packages first; for example, to install the hexbin package use the command install.packages("hexbin").

library(ggplot2)
library(hexbin) # for making hexbin plots
library(GGally) # for the pair plots
library(mvtnorm) # for the multivariate gaussians

set.seed(32534557)

Using Color

You should try to use color to guide the viewer to points of interest in the graph.

scurve = function(x) {2*(1+exp(-x))^-1 - 1}

N = 12 # multiple of 12
x = 1:N
month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
oursales = 100*(scurve(x-1) + 0.15*rnorm(N))
usf = data.frame(x=x, month=month, units_sold=oursales, company="us")
theirsales = 100*(sin(0.05*pi*x) + 0.15*rnorm(N))
tsf = data.frame(x=x, month=month, units_sold=theirsales, company="them")

dataf = rbind(usf, tsf)

Without color emphasis:

ggplot(dataf, aes(x=x, y=units_sold, color=company)) + 
  geom_point() + geom_line() +
  scale_x_continuous("Month", breaks=1:12, labels=month) + # one tick per month
  ggtitle("Sales Volume")

With color emphasis:

ggplot(dataf, aes(x=x, y=units_sold, color=company)) + 
  geom_point() + geom_line() +
  scale_x_continuous("Month", breaks=1:12, labels=month) + # one tick per month
  scale_color_manual(values=c("us" = "darkblue", "them" = "darkgray")) + # our line stands out, theirs recedes
  ggtitle("Sales Volume")

Examining a Single Variable

Create the data set.

N = 1000

# continuous variable: mixture of gaussians
centers = sample(1:4, size=N, replace=TRUE)
x = centers + rnorm(N, sd=1)

# discrete variable: letters of the alphabet
# frequencies of letters in english 
# source: http://en.algoritmy.net/article/40379/Letter-frequency-English
letterFreqs = c(8.167, 1.492, 2.782, 4.253, 12.702, 2.228,
2.015, 6.094, 6.966, 0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929,
0.095, 5.987, 6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074)
letterFreqs = letterFreqs/100

# draw letters proportional to their frequency in English
tokens = sample(letters, size=N, replace=TRUE, prob=letterFreqs)

df = data.frame(x=x, tokens=tokens)

Histograms

# histogram defaults to binwidth range/30. Here we set it explicitly
ggplot(df, aes(x=x)) + geom_histogram(binwidth=0.5)
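
To see roughly what the histogram layer computes, here is a base R sketch that bins a variable by hand using the same bin width. The sample `xs` is a stand-in generated for illustration, and the bin edges chosen here need not match the ones ggplot picks.

```r
set.seed(1)
# stand-in for df$x: a mixture of gaussians, generated for illustration only
xs = sample(1:4, size = 1000, replace = TRUE) + rnorm(1000)

binwidth = 0.5
# bin edges that cover the whole range of the data
breaks = seq(floor(min(xs)), ceiling(max(xs)), by = binwidth)
counts = table(cut(xs, breaks = breaks, include.lowest = TRUE))

# with 30 bins by default, the default bin width is about range/30
default_width = diff(range(xs)) / 30

counts
sum(counts)  # every observation falls in exactly one bin
```

A narrower bin width shows more detail but also more noise; a wider one smooths the shape out.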

Density Plots

ggplot(df, aes(x=x)) + geom_density(adjust=0.5)  # set the smoothing kernel to half the default, for a little more detail
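
The effect of adjust can be checked with base R's density() function, which takes the same kind of bandwidth multiplier. This is a sketch on stand-in data (`xs` is generated here for illustration); we assume, but do not show, that geom_density passes adjust through in the same way.

```r
set.seed(1)
# stand-in for df$x, generated for illustration only
xs = sample(1:4, size = 1000, replace = TRUE) + rnorm(1000)

d_default = density(xs)                # default bandwidth
d_half    = density(xs, adjust = 0.5)  # default bandwidth multiplied by 0.5

d_default$bw
d_half$bw  # half of the default bandwidth
```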

Dotplots

# the easy way to do this is with geom_bar (bar plot)
ggplot(df, aes(x=tokens)) + geom_bar()

# Cleveland prefers dot plots, which are a bit more complicated
# (written for ggplot2 3.3.0 or later; older versions used stat="bin" and fun.ymin/fun.ymax)
zero = function (x) {0}  # a function that always returns 0
ggplot(df, aes(x=tokens)) + geom_point(stat="count") + # one point per letter, at its count
                            stat_summary(aes(y=1), fun=sum, fun.min=zero, fun.max=sum, geom="linerange") + # a line from 0 up to the count
                            theme(axis.text.x=element_text(size=12, color="black")) # make the font more legible

# Cleveland also recommends that we sort the letters by frequency, to make the plot
# easier to read.
# To do that in ggplot, we have to reorder the factor levels in sorted order
# (it's easier in base graphics)
n = length(df$tokens)
unit = numeric(n)-1 # a vector of all negative 1s
df$tokens = reorder(df$tokens, unit, FUN=sum) # now sorted by frequency, descending

ggplot(df, aes(x=tokens)) + geom_point(stat="count") +
                            stat_summary(aes(y=1), fun=sum, fun.min=zero, fun.max=sum, geom="linerange") +
                            theme(axis.text.x=element_text(size=12, color="black"))

# compare with the bar chart
ggplot(df, aes(x=tokens)) + geom_bar()
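
An alternative way to order the levels by frequency is to count the values with table() and set the factor levels explicitly. This is a sketch on a small made-up vector (`toks` is hypothetical), not the approach used above.

```r
# hypothetical small sample, for illustration only
toks = c("e", "t", "a", "e", "t", "e", "o", "e", "a", "t")

freq = sort(table(toks), decreasing = TRUE)       # counts, most frequent first
toks_sorted = factor(toks, levels = names(freq))  # levels follow that order

levels(toks_sorted)
```

Because ggplot lays out a discrete axis in level order, plotting toks_sorted would put the most frequent value first.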

Examining the Relationship Between Two Continuous Variables

Create the data set.

# we'll do this as a function, so we can create sets of different sizes
makedata = function(N) {
  x = rnorm(N, sd=5)
  u = 3+sin(0.05*pi*x)
  v = (1+exp(-x))^-1 + 0.25*rnorm(N) # noisy sigmoid

  centers = sample(1:4, size=N, replace=TRUE)
  w = centers + rnorm(N, sd=1)
  map = c("gp1", "gp2", "gp3", "gp4")
  gp = map[centers] # gp is now a categorical variable

  data.frame(x=x, u=u, v=v, gp=gp, w=w)
}

df = makedata(1000)

Line Plots

ggplot(df, aes(x=x, y=u)) + geom_line()

Scatterplots

ggplot(df, aes(x=x,y=v)) + geom_point()

Smoothing Curves

ggplot(df, aes(x=x,y=v)) + geom_point() + geom_smooth()

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Hexbin Plot

densef = makedata(10000)
# you can see the general shape, but you lose some of the internal detail
ggplot(densef, aes(x=x,y=v)) + geom_point() + geom_smooth()

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

# remember hexbin requires the hexbin package (see library call, above)
# as with geom_histogram, you can custom set the binwidth "(binwidth=c(xwidth, ywidth))"
# but we'll leave it as the default
ggplot(densef, aes(x=x,y=v)) + geom_hex() +
  geom_smooth(color="white", se=FALSE) # add the smoothing curve, turn off the standard error ribbon

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Examining the Relationship Between a Continuous and a Discrete Variable

Conditional Density Plots

ggplot(subset(df, gp %in% c("gp1", "gp2")), aes(w, color=gp)) + geom_density()

# we can do all four categories, but it gets harder to read
ggplot(df, aes(w, color=gp)) + geom_density()

Faceting for Conditional Density Plots

Also known as trellis plots, or faceting in R.

ggplot(df, aes(x=w)) + geom_density() + facet_wrap(~gp)

Box and Whisker Plot

# the default style
ggplot(df, aes(x=gp, y=w)) + geom_boxplot()

# my preferred style, with the points jittered beneath
ggplot(df, aes(x=gp, y=w)) + geom_boxplot(outlier.shape=NA) + # turn off the outlier points
   # add the actual points, slightly jittered along x and made partially transparent
   geom_point(position=position_jitter(width=0.4), alpha=0.2)

General Trellis Plots

ggplot(df, aes(x=x,y=w+u)) + geom_point() + geom_smooth() + facet_wrap(~gp)

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

Examining the Relationship Between Two Discrete Variables

Create the data set.

N=1000
categories = c("gp1", "gp2", "gp3", "gp4")
coins = c(0.4, 0.2, 0.75, 0.9)
catfreq = c(1,2,3,4); catfreq = catfreq/sum(catfreq)

gp = sample(categories, size=N, replace=TRUE, prob=catfreq)
target = character(N) # an empty character vector of length N

# not the most efficient way to do this, but easy to read
for(i in 1:4) {
  ix = which(gp==categories[i]) # find which gp members are in category (their indices)
  nx = length(ix)
  target[ix] = ifelse(runif(nx) <= coins[i], "blue", "gray")
  # category i has target="blue" with probability coins[i]
}

# gp has 4 possible values, target has 2
df = data.frame(gp=gp, target=target)
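
The loop above can also be written without a loop by looking up each row's coin probability from its category. This is a sketch that regenerates its own data under different names (`gp_v`, `target_v` and `p_blue` are made up here) so that it does not overwrite the variables used in the lesson.

```r
set.seed(1)
N = 1000
categories = c("gp1", "gp2", "gp3", "gp4")
coins = c(0.4, 0.2, 0.75, 0.9)
gp_v = sample(categories, size = N, replace = TRUE, prob = c(1, 2, 3, 4) / 10)

# look up each row's probability of "blue" from its category
p_blue = coins[match(gp_v, categories)]
target_v = ifelse(runif(N) <= p_blue, "blue", "gray")

# observed fraction of "blue" per category; should be close to coins
rates = tapply(target_v == "blue", gp_v, mean)
rates
```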

Stacked Bar Charts

ggplot(df, aes(x=gp, fill=target)) + geom_bar() +
  scale_fill_manual(values=c("blue" = "darkblue", "gray"="darkgray")) # set the colors manually

Side-by-side Bar Charts

ggplot(df, aes(x=gp, fill=target)) + geom_bar(position="dodge") +
  scale_fill_manual(values=c("blue" = "darkblue", "gray"="darkgray")) # set the colors manually

Ratio Bar Charts

ggplot(df, aes(x=gp, fill=target)) + geom_bar(position="fill") +
  scale_fill_manual(values=c("blue" = "darkblue", "gray"="darkgray")) # set the colors manually
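
The three bar chart styles display the same cross-tabulation in different ways. As a sketch on a small made-up data set (`toy_df` is hypothetical), the counts behind the stacked and side-by-side charts, and the within-group ratios behind the ratio chart, can be computed directly:

```r
# hypothetical small data set, for illustration only
toy_df = data.frame(gp     = c("gp1", "gp1", "gp1", "gp2", "gp2", "gp2", "gp2"),
                    target = c("blue", "gray", "gray", "blue", "blue", "blue", "gray"))

counts = table(toy_df$gp, toy_df$target)   # counts per (gp, target) pair
ratios = prop.table(counts, margin = 1)    # fractions within each gp; each row sums to 1

counts
ratios
```

The ratio chart makes it easy to compare the fraction of each target value across groups, at the cost of hiding how large each group is.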

Conditional Bar Charts

Also known as trellis plots; called “faceting” in ggplot. Here we use a more complicated data set, where target has four possible values.

# make this a function, because we'll use it again
makecolors = function(gp) {
  n = length(gp)
  categories = sort(unique(gp))
  target=character(n)
  colors = c("blue", "gray", "green", "brown")
  # this assumes gp has at most 4 unique values.
  # I should fix it, but I don't need to, for this knitr doc
  coins = list(c(1,2,3,4)/10, c(2,3,4,1)/10, c(3,4,1,2)/10, c(4,1,2,3)/10)
  for(i in 1:4) {
    ix = which(gp==categories[i]) # same gp vector as before
    nx = length(ix)
    target[ix] = sample(colors, size=nx, replace=TRUE, prob=coins[[i]])
    }
  target
}
df4 = data.frame(gp=gp, target=makecolors(gp))

Now plot.

# Let's try a side-by-side, first. You should try all the others,
# just to see what they look like. What is obvious and what is not obvious from 
# each graph?
ggplot(df4, aes(x=gp, fill=target)) + geom_bar(position="dodge") + 
  scale_fill_manual(values=c("blue"="darkblue", "gray"="darkgray", "green"="darkgreen", "brown"="saddlebrown"))

# now faceted on target
ggplot(df4, aes(x=gp)) + geom_bar(position="dodge") + 
  facet_wrap(~target, scales="free_y") # let each facet scale y on its own

# or on gp
ggplot(df4, aes(x=target, fill=target)) + geom_bar(position="dodge") + 
  scale_fill_manual(values=c("blue"="darkblue", "gray"="darkgray", "green"="darkgreen", "brown"="saddlebrown")) + 
  facet_wrap(~gp, scales="free_y") # let each facet scale y on its own

Examining Many Variables At Once

Pair Plots

# go back to some earlier datasets
df = makedata(1000)
df$color = makecolors(df$gp)

# plot all the columns but x and u
# the lists that go to upper and lower
# tell ggpairs that continuous-continuous plots should be scatterplots,
# continuous-discrete plots should be faceted density plots,
# and discrete-discrete plots should be "bar" plots
ggpairs(df[, c("v", "gp", "w", "color")], 
        axisLabels="internal", # put the axis labels in the diagonals (default is to put them outside)
        upper=list(continuous='points', combo='facetdensity'),
        lower=list(continuous='points', combo='facetdensity'))

Multidimensional Scaling

Multidimensional scaling is a way of visualizing points in a high-dimensional space in 2D (or 3D).

Make the data.

N = 600

c1 = c(0,0,0,0)
c2 = c(0,3,0,0)
c3 = c(0,1,0,3)

# rmvnorm is in package mvtnorm and returns 
# points drawn from a multivariate gaussian with mean mean and 
# covariance matrix sigma (which defaults to the identity)
blue = as.data.frame(rmvnorm(N/3, mean=c1))
blue$color="blue"

brown = as.data.frame(rmvnorm(N/3, mean=c2))
brown$color="brown"

green=as.data.frame(rmvnorm(N/3, mean=c3))
green$color="green"

# each row is a point in 4-d, plus a color
points = rbind(blue, brown, green)
summary(points)

##        V1                  V2                V3           
##  Min.   :-3.149730   Min.   :-3.3464   Min.   :-3.472622  
##  1st Qu.:-0.669435   1st Qu.: 0.1412   1st Qu.:-0.658077  
##  Median : 0.014537   Median : 1.0903   Median :-0.007851  
##  Mean   : 0.003651   Mean   : 1.2747   Mean   : 0.010086  
##  3rd Qu.: 0.682536   3rd Qu.: 2.4193   3rd Qu.: 0.697090  
##  Max.   : 2.947769   Max.   : 5.6548   Max.   : 3.129668  
##        V4             color          
##  Min.   :-2.9670   Length:600        
##  1st Qu.:-0.3033   Class :character  
##  Median : 0.6278   Mode  :character  
##  Mean   : 1.0502                     
##  3rd Qu.: 2.4521                     
##  Max.   : 5.6983

Now plot.

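# plot the data points (columns 1:4), coded by color
# (recent versions of GGally take the color as an aesthetic mapping;
# older versions used ggpairs(points, 1:4, color="color"))
ggpairs(points, columns=1:4, mapping=aes(color=color))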

# get the distance matrix (the distances between all points in the dataset)
pdists = dist(points[,1:4])

# get the 2d projection
project2d = as.data.frame(cmdscale(pdists, k=2))
summary(project2d)

##        V1                V2         
##  Min.   :-4.0021   Min.   :-5.0646  
##  1st Qu.:-1.2852   1st Qu.:-1.0417  
##  Median :-0.2289   Median : 0.1839  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5183   3rd Qu.: 1.1151  
##  Max.   : 4.4798   Max.   : 4.2395

project2d$color=points$color

# plot it
ggplot(project2d, aes(x=V1, y=V2, color=color, shape=color)) + geom_point() + 
  scale_color_manual(values=c("blue"="darkblue", "brown"="saddlebrown", "green"="darkgreen"))
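
To get a rough idea of how much structure a 2D projection retains, cmdscale can also return its eigenvalues and a goodness-of-fit measure when called with eig=TRUE. This is a sketch on stand-in data (`pts` is generated here for illustration and is not the data set above).

```r
set.seed(1)
# stand-in data: 60 points in 4-d, generated for illustration only
pts = matrix(rnorm(60 * 4), ncol = 4)
pts[21:40, 2] = pts[21:40, 2] + 3   # shift one group along the second axis

# with eig=TRUE, cmdscale returns a list instead of a bare matrix
fit = cmdscale(dist(pts), k = 2, eig = TRUE)

dim(fit$points)  # one row per point, two coordinates each
fit$GOF          # goodness of fit of the 2-d projection
```

Values of GOF closer to 1 suggest that the two retained dimensions account for most of the variation in the original distances.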

Visualizing Data

Nina Zumel

February 18, 2015
