R for Data Science Summary: ggplot2


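ggplot2 is the classic plotting package in R and the best-known implementation of the Grammar of Graphics. Its layered approach lets analysts build and customize plots programmatically to a very high degree. The basic template is: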

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

Install and load the package:

install.packages("ggplot2")
library(ggplot2)

This article covers the ggplot2 usage presented in Hadley Wickham's book R for Data Science.

Quick plotting with qplot()

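qplot() is a ggplot2 shortcut designed to feel familiar to users of plot() from the base graphics package. It is not a wrapper around plot(); it builds a layered ggplot object in a single call. Note that qplot() is deprecated in recent ggplot2 releases, so it may emit a warning. Basic usage: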

# Use data from data.frame
qplot(mpg, wt, data = mtcars)
qplot(mpg, wt, data = mtcars, colour = cyl)
qplot(mpg, wt, data = mtcars, size = cyl)
qplot(mpg, wt, data = mtcars, facets = vs ~ am)

# qplot will attempt to guess what geom you want depending on the input
# both x and y supplied = scatterplot
qplot(mpg, wt, data = mtcars)
# just x supplied = histogram
qplot(mpg, data = mtcars)
# just y supplied = scatterplot, with x = seq_along(y)
qplot(y = mpg, data = mtcars)

Note that qplot() guesses which geom to use from which of x and y are supplied.
To choose the geom manually:

qplot(mpg, wt, data = mtcars, geom = "path")
qplot(factor(cyl), wt, data = mtcars, geom = c("boxplot", "jitter"))
qplot(mpg, data = mtcars, geom = "dotplot")
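
For comparison, here is a minimal sketch of how one of the qplot() calls above might be written with ggplot(). The object names p_quick and p_full are illustrative only, and the suppressWarnings() wrapper is there on the assumption that the installed ggplot2 version flags qplot() as deprecated.

```r
library(ggplot2)

# Quick form: everything in one call
p_quick <- suppressWarnings(qplot(mpg, wt, data = mtcars, colour = cyl))

# Layered form: data and mappings go to ggplot(), the geom is added with `+`
p_full <- ggplot(mtcars, aes(x = mpg, y = wt, colour = cyl)) +
  geom_point()
```

Both objects should render the same scatterplot; the layered form is the one the rest of this article builds on.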

The ggplot() function

aes()

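Whereas qplot() controls the x and y variables, data, color, size, facets, and geom all in one call, with ggplot() you normally pass the data to ggplot() and use aes() to map variables to x, y, group, color, and size. Following the layered approach, you then add geom functions with "+"; geom_point(), geom_line(), geom_boxplot(), and the other geoms each draw a different kind of plot. In RStudio you can type geom and press Tab to browse them. Coordinate and facet options are added with "+" in the same way. For example: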

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

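Besides group, the aesthetics that can split the data into groups inside aes() include color, size, alpha, and shape. Setting these outside aes() applies one fixed value to the whole layer and produces no grouping. For example: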

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

The first call colors each point by its class and adds a legend; the second draws every point in blue.
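The difference between mapping and setting can also be checked without looking at the plots. The sketch below uses layer_data() from ggplot2 to inspect the colours each layer actually computes; the names p_mapped and p_fixed are illustrative.

```r
library(ggplot2)

# colour mapped inside aes(): one colour per class
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# colour set outside aes(): a single constant for the whole layer
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Number of distinct colours used by each layer
n_mapped <- length(unique(layer_data(p_mapped)$colour))
n_fixed  <- length(unique(layer_data(p_fixed)$colour))
```

Here n_mapped should equal the number of distinct values of class in mpg, while n_fixed should be 1.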

Facets

The facet functions split one plot into several panels arranged in rows and columns. For example:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
  
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)
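
When only one dimension of the grid is needed, the formula passed to facet_grid() can use a dot as a placeholder for the other dimension. This is a small sketch of that variant; p_cols is an illustrative name.

```r
library(ggplot2)

# One column of panels per value of cyl, no faceting across rows
p_cols <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

# Count the panels that contain data
n_panels <- length(unique(layer_data(p_cols)$PANEL))
```

n_panels should match the number of distinct cyl values in mpg.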

Geoms

Different geom functions draw different kinds of plots. The code below draws a scatterplot and a smoothed curve, respectively:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))
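
Geoms can also be stacked in one plot. In the sketch below the shared x and y mappings are moved into ggplot() so that both layers inherit them, and only the point layer adds a colour mapping. The name p_layers is illustrative.

```r
library(ggplot2)

# Global mappings in ggplot() are inherited by every layer;
# mappings inside a geom apply to that layer only
p_layers <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()

# The first layer (points) still uses every row of mpg
n_points <- nrow(layer_data(p_layers, 1))
```

This avoids repeating aes(x = displ, y = hwy) in each geom while keeping a single smoothed curve for the whole dataset.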

Stats

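Besides geoms, stat functions transform the data before it is drawn. For example, geom_bar() draws bars whose height is the number of rows in each category, and its corresponding stat is stat_count(); geom_col() draws bars whose height is a value already present in the data, and its corresponding stat is stat_identity(). The two snippets below produce the same plot: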

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))
ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))
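
stat_count() computes more than the raw count. As a hedged sketch, the example below maps y to the computed proportion with after_stat(), which assumes ggplot2 3.3.0 or later; group = 1 makes the proportions be taken over the whole dataset rather than within each bar. The name p_prop is illustrative.

```r
library(ggplot2)

# Bar height is the share of all diamonds in each cut, not the count
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# The bar heights should add up to 1
total_prop <- sum(layer_data(p_prop)$y)
```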

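You can also override a geom's default stat. When the data already contains the bar heights, the two snippets below produce the same plot:

# demo holds one precomputed count (freq) per cut
demo <- as.data.frame(table(cut = diamonds$cut), responseName = "freq")
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
ggplot(data = demo) + 
  stat_identity(mapping = aes(x = cut, y = freq), geom = "bar")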

Position adjustments

In a bar chart the color aesthetic sets the border color; the fill color is set with fill. Mapping fill to a categorical variable gives each category its own color:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut))
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

The position argument controls how bars that share an x value are arranged. The default is stack; other options are identity, dodge, and fill:

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
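
To make the effect of position = "fill" concrete, the sketch below checks the computed bar geometry with layer_data(): each stack is rescaled so that its top sits at 1, which is what makes proportions comparable across cuts. The names p_fill, built, and stack_tops are illustrative.

```r
library(ggplot2)

p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

built <- layer_data(p_fill)

# Highest ymax within each x position, i.e. the top of each stack
stack_tops <- tapply(built$ymax, as.integer(built$x), max)
```

Every entry of stack_tops should be 1, whereas with the default "stack" the tops would equal the total counts per cut.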

To keep points from overlapping, set position = "jitter", which adds a small amount of random noise to each point:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

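Because jitter is random, the plot changes slightly on every render. A possible way to make it reproducible is to pass a position_jitter() object with a fixed seed instead of the string "jitter". The width, height, and seed values below are arbitrary choices for illustration.

```r
library(ggplot2)

# Same idea as position = "jitter", but with explicit noise size and a seed
p_jitter <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.1, height = 0.1, seed = 42)
  )

# Building the plot twice should give identical jittered positions
x_first  <- layer_data(p_jitter)$x
x_second <- layer_data(p_jitter)$x
```

geom_jitter() is a shorthand for geom_point(position = "jitter") and accepts width and height directly.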

Coordinate systems

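Coordinate systems change how the whole plot is presented, for example by flipping the axes or by drawing a bar chart in polar coordinates (a Coxcomb or pie-style chart):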

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()
  
# map_data() needs the maps package to be installed
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()
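
bar + coord_polar() maps the bar height to the radius, so the wedges have equal angles. If a conventional pie chart is wanted instead, one common approach is to stack everything into a single bar and map the stacked height to the angle with theta = "y". This is a sketch of that idea; the name pie is illustrative.

```r
library(ggplot2)

# One stacked bar (x is a constant), then the stacked height becomes the angle
pie <- ggplot(data = diamonds, mapping = aes(x = factor(1), fill = cut)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL)

# One wedge per level of cut
n_wedges <- nrow(layer_data(pie))
```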

Labels

labs() sets the plot's title, subtitle, caption, and the titles of axes and legends. For example:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov",
    colour = "Car type"
  )


To label individual data points:

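library(dplyr)  # provides %>%, group_by(), filter(), row_number(), desc()

# Keep the single highest-hwy row in each class
best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)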
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_text(aes(label = model), data = best_in_class)
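
The filter step keeps exactly one row per class: the car with the highest hwy, with ties broken by row order. As an alternative sketch, assuming dplyr 1.0.0 or later, slice_max() expresses the same selection more directly. The name best_alt is illustrative.

```r
library(dplyr)
library(ggplot2)  # for the mpg dataset

# One row per class: the highest hwy, dropping ties after the first
best_alt <- mpg %>%
  group_by(class) %>%
  slice_max(hwy, n = 1, with_ties = FALSE) %>%
  ungroup()

# Per-class maximum computed independently, for comparison
class_max <- tapply(mpg$hwy, mpg$class, max)
```

best_alt can be passed to geom_text() or geom_label() through the data argument in the same way as best_in_class.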

You can also adjust the label position, for example nudging labels slightly above their data points:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_label(aes(label = model), data = best_in_class, nudge_y = 2, alpha = 0.5)

The ggrepel package can place the labels so that they look cleaner and do not overlap:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_point(size = 3, shape = 1, data = best_in_class) +
  ggrepel::geom_label_repel(aes(label = model), data = best_in_class)


Scales

ggplot2 adds default scales automatically. Written out explicitly, the defaults for this plot are:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

To change where the tick marks appear on the y axis:

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5))


You can also make the axes logarithmic with scale_x_log10() and scale_y_log10(), change the palette with scale_colour_brewer(palette = "Set1"), or set colors by hand with scale_colour_manual(values = c(Republican = "red", Democratic = "blue")).
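
The three scale functions just mentioned can be sketched as follows. The datasets (presidential, mpg, diamonds) ship with ggplot2; the object names, the id column, and the choice of drv as the coloured variable are assumptions made for illustration.

```r
library(ggplot2)

# Manual colours keyed by factor level
terms <- presidential
terms$id <- 33 + seq_len(nrow(terms))
p_manual <- ggplot(terms, aes(start, id, colour = party)) +
  geom_point() +
  geom_segment(aes(xend = end, yend = id)) +
  scale_colour_manual(values = c(Republican = "red", Democratic = "blue"))

# ColorBrewer palette for a discrete variable
p_brewer <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  scale_colour_brewer(palette = "Set1")

# Log-transformed axes: data are transformed, axis labels stay in original units
p_log <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +
  scale_x_log10() +
  scale_y_log10()
```

Unlike taking log10() inside aes(), the log scales keep the axis labelled in the original units.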

Legends

Use theme() together with guides() to control the position and appearance of the legend:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE) +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(nrow = 1, override.aes = list(size = 4)))


Zooming

coord_cartesian() zooms in on a region of the plot, for example:

ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))

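Zooming with coord_cartesian() is not the same as setting limits on the scales. The sketch below contrasts the two using layer_data(): with scale limits, values outside the range are turned into NA before drawing, so layers such as geom_smooth() would be fitted to the remaining points only. Object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point()

# Zoom: every row is kept, only the visible window changes
p_zoom <- base + coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))

# Scale limits: out-of-range values are replaced with NA
p_limits <- base +
  scale_x_continuous(limits = c(5, 7)) +
  scale_y_continuous(limits = c(10, 30))

zoom_data  <- layer_data(p_zoom)
limit_data <- suppressWarnings(layer_data(p_limits))

n_na_zoom   <- sum(is.na(zoom_data$x))
n_na_limits <- sum(is.na(limit_data$x))
```

n_na_zoom should be 0, while n_na_limits counts the cars whose displ falls outside 5 to 7.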

The full code for this article has been uploaded to GitHub.