Some example data science plots in R using ggplot2, drawn with the WVPlots package. See the WVPlots package repository for code/details.
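The scatterplot calls that follow use a data frame named frm with numeric columns x and y, which this text does not define. A minimal stand-in, assuming synthetic data (the seed, sample size, and functional form are illustrative choices, not taken from the source):

```r
# Hypothetical stand-in for frm: 50 points with a noisy, slightly curved relation
set.seed(2024)
x = rnorm(50)
y = 0.5*x^2 + 2*x + rnorm(length(x))
frm = data.frame(x = x, y = y)
```

Any data frame with two numeric columns should work in its place; the column names are passed to the plotting functions as strings.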


Scatterplot with smoothing line through points.

WVPlots::ScatterHist(frm, "x", "y", title="Example Fit")

Scatterplot with best linear fit through points. Also report the R-squared and significance of the linear fit.

Scatterplot compared to the line x = y. Also report the coefficient of determination between x and y (where y is “true outcome” and x is “predicted outcome”).
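The calls for the linear-fit and x = y variants are not shown in this text. A sketch of what they might look like, assuming ScatterHist's smoothmethod argument accepts "lm" and "identity" (check ?WVPlots::ScatterHist for your installed version; reporting significance may need an extra argument in some versions). The frame frm_fit is a hypothetical stand-in, and the base R lines compute analogous fit statistics for comparison; they are not claimed to be WVPlots' own calculation.

```r
# frm_fit is a hypothetical stand-in; any frame with numeric x and y works
set.seed(2024)
frm_fit = data.frame(x = rnorm(50))
frm_fit$y = 2*frm_fit$x + rnorm(50)

# R-squared of the best linear fit of y on x
fit = lm(y ~ x, data = frm_fit)
rsq_lm = summary(fit)$r.squared

# coefficient of determination of y against the line x = y
# (x plays the role of the prediction, y the true outcome)
rss = sum((frm_fit$y - frm_fit$x)^2)
tss = sum((frm_fit$y - mean(frm_fit$y))^2)
rsq_identity = 1 - rss/tss

if (requireNamespace("WVPlots", quietly = TRUE)) {
  WVPlots::ScatterHist(frm_fit, "x", "y", smoothmethod = "lm",
                       title = "Example Linear Fit")
  WVPlots::ScatterHist(frm_fit, "x", "y", smoothmethod = "identity",
                       title = "Example Relation Plot")
}
```

The identity version can never score higher than the fitted line on the same data, since the linear fit minimizes the residual sum of squares over all lines, including x = y.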

Scatterplot of (x, y) color-coded by category/group, with marginal distributions of x and y conditioned on group.

fmScatterHistC = data.frame(x=rnorm(50),y=rnorm(50))
fmScatterHistC$cat <- fmScatterHistC$x+fmScatterHistC$y>0
WVPlots::ScatterHistC(fmScatterHistC, "x", "y", "cat", title="Example Conditional Distribution")

Scatterplot of (x, y) color-coded by discretized z. The continuous variable z is binned into three groups and then plotted as in ScatterHistC.

frmScatterHistN = data.frame(x=rnorm(50),y=rnorm(50))
frmScatterHistN$z <- frmScatterHistN$x+frmScatterHistN$y
WVPlots::ScatterHistN(frmScatterHistN, "x", "y", "z", title="Example Joint Distribution")

Plot y as a function of x, with a smoothing curve that estimates \(E[y | x]\). If y is a 0/1 variable (binary classification, where 1 is the target class), then the smoothing curve estimates \(P(y = 1 | x)\). Since \(y \in \{0,1\}\) with \(y\) intended to be monotone in \(x\) is the most common use of this graph, BinaryYScatterPlot uses a glm smoother by default (use_glm=TRUE; this is essentially Platt scaling) as the best estimate of \(P(y = 1 | x)\).

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
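The call for this plot is not shown in this text. A sketch under stated assumptions: frm_bin and its column posY are hypothetical, and the base R lines illustrate what a glm smoother amounts to (a logistic regression of the outcome on x), not WVPlots' internal code.

```r
# Hypothetical data: a logical outcome whose probability rises with x
set.seed(2024)
frm_bin = data.frame(x = rnorm(200))
frm_bin$posY = (frm_bin$x + rnorm(200)) > 0

# Base R view of the glm smoother: logistic regression of posY on x
model = glm(posY ~ x, data = frm_bin, family = binomial)
frm_bin$p_hat = predict(model, newdata = frm_bin, type = "response")

if (requireNamespace("WVPlots", quietly = TRUE)) {
  print(WVPlots::BinaryYScatterPlot(frm_bin, "x", "posY", use_glm = TRUE,
                                    title = "Example 'Probability of Y' Plot (GLM smoothing)"))
}
```

Setting use_glm = FALSE instead falls back to ggplot2's own smoother, which does not constrain the curve to be monotone in x.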

Hexbin Plot

if(requireNamespace("hexbin", quietly = TRUE)) {
  df = rbind(data.frame(x=rnorm(1000, mean = 1), y=rnorm(1000, mean = 1, sd = 0.5 )),
             data.frame(x = rnorm(1000, mean = -1, sd = 0.5), y = rnorm(1000, mean = -1, sd = 0.5)))
  print(WVPlots::HexBinPlot(df, "x", "y", "Two gaussians"))
}

Gain Curves

y = abs(rnorm(20)) + 0.1
x = abs(y + 0.5*rnorm(20))

frm = data.frame(model=x, value=y)

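# costs is not defined in this text; assumed here: unit cost, except a pricier first item
frm$costs = 1
frm$costs[1] = 5
frm$rate = with(frm, value/costs)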

frm$isValuable = (frm$value >= as.numeric(quantile(frm$value, probs=0.8)))

Basic curve: each item “costs” the same. The wizard curve sorts items by true value, the model curve sorts them by model score, and the x axis is the fraction of the total population.

WVPlots::GainCurvePlot(frm, "model", "value", title="Example Continuous Gain Curve")
## Warning in ggplot2::geom_abline(mapping = ggplot2::aes(x = pctpop, y = pct_outcome, : Using `intercept` and/or `slope` with `mapping` may not have the desired result as mapping is overwritten if either of these is specified
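The quantities behind the basic gain curve can be sketched in base R. This is a rough illustration of the idea, not WVPlots' implementation; the names gain, model_curve, and wizard_curve are made up for the example.

```r
set.seed(2024)
value = abs(rnorm(20)) + 0.1
model = abs(value + 0.5*rnorm(20))

# cumulative fraction of total value captured, taking items in the given order
gain = function(v, ord) cumsum(v[ord]) / sum(v)

pct_pop = seq_along(value) / length(value)                    # x axis
model_curve = gain(value, order(model, decreasing = TRUE))    # sorted by model score
wizard_curve = gain(value, order(value, decreasing = TRUE))   # sorted by true value

head(data.frame(pct_pop, model_curve, wizard_curve))
```

The wizard curve is an upper bound: no ordering can capture more value in its first k items than sorting by the true value itself.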

We can annotate the point on the model curve at a specific x value.

## Warning in ggplot2::geom_abline(mapping = ggplot2::aes(x = pctpop, y = pct_outcome, : Using `intercept` and/or `slope` with `mapping` may not have the desired result as mapping is overwritten if either of these is specified
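The annotated call is not shown in this text. A sketch of how it might look, assuming WVPlots::GainCurvePlotWithNotation takes a gainx fraction and a labelfun(gx, gy) that returns the label text (check the package documentation for your installed version). The frame frm_gain and the label wording are hypothetical.

```r
set.seed(2024)
value = abs(rnorm(20)) + 0.1
frm_gain = data.frame(model = abs(value + 0.5*rnorm(20)), value = value)

gainx = 0.10  # annotate the top 10% of items as sorted by the model

# label for the annotated point; gx and gy are fractions in [0, 1]
labelfun = function(gx, gy) {
  paste0("The top ", 100*gx, "% of items by model score\n",
         "hold ", round(100*gy), "% of total value")
}

if (requireNamespace("WVPlots", quietly = TRUE)) {
  print(WVPlots::GainCurvePlotWithNotation(frm_gain, "model", "value",
                                           title = "Example Gain Curve with annotation",
                                           gainx = gainx, labelfun = labelfun))
}
```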

When items have different costs, the gain curve should take that into account. The wizard now sorts by value/cost, the model curve is still sorted by the model, and the x axis plots the fraction of total cost rather than the fraction of total count.

WVPlots::GainCurvePlotC(frm, "model", "costs", "value", title="Example Continuous Gain CurveC")
## Warning in ggplot2::geom_abline(mapping = ggplot2::aes(x = pctpop, y = pct_outcome, : Using `intercept` and/or `slope` with `mapping` may not have the desired result as mapping is overwritten if either of these is specified

ROC Plots

WVPlots::ROCPlot(frm, "model", "isValuable", TRUE, title="Example ROC plot")

Precision-Recall Plot

WVPlots::PRPlot(frm, "model", "isValuable", TRUE, title="Example Precision-Recall plot")

Double Density Plot

WVPlots::DoubleDensityPlot(frm, "model", "isValuable", title="Example double density plot")

Double Histogram Plot

WVPlots::DoubleHistogramPlot(frm, "model", "isValuable", title="Example double histogram plot")

Cleveland Style Dotplots
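The calls that follow use a data frame randomDraws with a letter column, which this text does not define. A minimal stand-in, assuming letters drawn with arbitrary unequal weights (these are not real letter frequencies):

```r
set.seed(2024)
N = 1000

# arbitrary decreasing weights, so that some letters are clearly more common
w = rev(seq_along(letters))

# hypothetical stand-in: one row per draw, with the drawn letter
randomDraws = data.frame(draw = seq_len(N),
                         letter = sample(letters, size = N, replace = TRUE,
                                         prob = w/sum(w)))
```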

WVPlots::ClevelandDotPlot(randomDraws, "letter", limit_n = 10,  title = "Top 10 most frequent letters")

WVPlots::ClevelandDotPlot(randomDraws, "letter", sort=0, title="Example Cleveland-style dot plot, unsorted")

WVPlots::ClevelandDotPlot(randomDraws, "letter", sort=1, stem=FALSE, title="Example with increasing sort order + coord_flip, no stem") + ggplot2::coord_flip()

ClevelandDotPlot also accepts an integer-valued x variable. You probably want sort = 0 in this case, so the values stay in numeric order.

N = 1000
ncar_vec = 0:5
prob = c(1.5, 3, 3.5, 2, 1, 0.75); prob = prob/sum(prob)

df = data.frame(num_cars = sample(ncar_vec, size = N, replace = TRUE, prob=prob))
WVPlots::ClevelandDotPlot(df, "num_cars", sort = 0, title = "Distribution of household vehicle ownership")

Shadow Plots

Plot a bar chart of row counts conditioned on the categorical variable condvar, faceted on a second categorical variable, refinevar. Each faceted plot also shows a “shadow plot” of the totals conditioned on condvar alone.

This plot enables comparisons of sub-population totals across both condvar and refinevar simultaneously.

N = 100

# rough proportions of eye colors
eprobs = c(0.37, 0.36, 0.16, 0.11)

eye_color  = sample(c("Brown", "Blue", "Hazel", "Green"), size = N, replace = TRUE, prob = eprobs)
sex = sample(c("Male", "Female"), size = N, replace = TRUE)

# A data frame of eye color by sex
dframe = data.frame(eye_color = eye_color, sex = sex)

WVPlots::ShadowPlot(dframe, "eye_color", "sex", title = "Shadow plot of eye colors by sex")

Shadow Histogram

Plot a histogram of a continuous variable xvar, faceted on a categorical conditioning variable, condvar. Each faceted plot also shows a “shadow plot” of the unconditioned histogram for comparison.

N = 100

dframe = data.frame(x = rnorm(N), gp = "region 2", stringsAsFactors = FALSE)
dframe$gp = with(dframe, ifelse(x < -0.5, "region 1", 
                                ifelse(x > 0.5, "region 3", gp)))

WVPlots::ShadowHist(dframe, "x", "gp", title = "X values by region")

ShadowHist uses the Brewer Dark2 palette by default. You can pass in another Brewer palette to change the color scheme. If you prefer all the histograms to be the same color, or to use a non-Brewer palette, set palette = NULL and pass in the color scheme directly. Here we plot all the histograms in the same color:

ngp = length(unique(dframe$gp))

WVPlots::ShadowHist(dframe, "x", "gp", title = "X values by region", palette = NULL) + 
  ggplot2::scale_fill_manual(values = rep("darkblue", ngp))

ScatterBox Plots

classes = c("a", "b", "c")
means = c(2, 4, 3)
names(means) = classes
label = sample(classes, size=1000, replace=TRUE)
meas = means[label] + rnorm(1000)
frm2 = data.frame(label=label,
                  meas = meas)

WVPlots::ScatterBoxPlot(frm2, "label", "meas", pt_alpha=0.2, title="Example Scatter/Box plot")

WVPlots::ScatterBoxPlotH(frm2, "meas", "label",  pt_alpha=0.2, title="Example Scatter/Box plot")

Discrete Distribution Plot

frmx = data.frame(x = rbinom(1000, 20, 0.5))
WVPlots::DiscreteDistribution(frmx, "x","Discrete example")

Distribution and Count Plot

d <- data.frame(wt=100*rnorm(100))
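The plotting calls for this section are not shown in this text. A sketch, assuming the intended functions are WVPlots::PlotDistCountNormal and WVPlots::PlotDistDensityNormal, each taking the frame, the column name, and a title (the title string here is a placeholder):

```r
# same kind of frame as above: 100 draws from a normal, scaled by 100
set.seed(2024)
d <- data.frame(wt = 100*rnorm(100))

if (requireNamespace("WVPlots", quietly = TRUE)) {
  # histogram of counts with a matching normal curve
  print(WVPlots::PlotDistCountNormal(d, "wt", "example"))
  # density estimate with a matching normal curve
  print(WVPlots::PlotDistDensityNormal(d, "wt", "example"))
}
```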

Smoothed Scatterplots

y = c(1,2,3,4,5,10,15,18,20,25)
x = seq_len(length(y))
df = data.frame(x=x,y=y)

WVPlots::ConditionalSmoothedScatterPlot(df, "x", "y", NULL, title="centered smooth, one group")

WVPlots::ConditionalSmoothedScatterPlot(df, "x", "y", NULL, title="left smooth, one group", align="left")

WVPlots::ConditionalSmoothedScatterPlot(df, "x", "y", NULL, title="right smooth, one group", align="right")

n = length(x)
df = rbind(data.frame(x=x, y=y+rnorm(n), gp="times 1"),
           data.frame(x=x, y=0.5*y + rnorm(n), gp="times 1/2"),
           data.frame(x=x, y=2*y + rnorm(n), gp="times 2"))

WVPlots::ConditionalSmoothedScatterPlot(df, "x", "y", "gp", title="centered smooth, multigroup")

WVPlots::ConditionalSmoothedScatterPlot(df, "x", "y", "gp", title="left smooth, multigroup", align="left")

WVPlots::ConditionalSmoothedScatterPlot(df, "x", "y", "gp", title="right smooth, multigroup", align="right")