Exploratory Data Analysis (EDA) is usually the first step when you analyze data, because you want to know what information the dataset carries. In this lab, we introduce basic R functions for EDA in both quantitative and graphical approaches using the movie data as an example.
Before start, always do
Let’s first load the powerful package tidyverse
.
library(tidyverse)
Now let’s load the movie dataset.
movie<- read_csv("movie.csv")
## Rows: 5043 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): color, director_name, actor_2_name, genres, actor_1_name, movie_title, actor_3_name, plot_keywords, movie_imdb_link, language, country, content_rating
## dbl (16): num_critic_for_reviews, duration, director_facebook_likes, actor_3_facebook_likes, actor_1_facebook_likes, gross, num_voted_users, cast_total_facebook_likes, facenumber_in_poster, num_us...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# dimension
dim(movie)
# row number
nrow(movie)
# column number
ncol(movie)
# variable names
names(movie)
# structure
str(movie)
# view a few rows
head(movie)
By default, R treat strings as factors (or categorical variables), in many situations (for example, building a regression model) this is what you want because R can automatically create “dummy variables” from the factors.
# mean
mean(movie$imdb_score)
## [1] 6.442138
# other functions include sd(), median(), quantile(), min() and max()
However, if the vector contains missing value, you need an extra option.
mean(movie$budget)
## [1] NA
mean(movie$budget, na.rm = T)
## [1] 39752620
apply()
for multiple variables (matrix)Let’s generate some random numbers and create a matrix.
n<- 1000
x1<- rnorm(n)
x2<- rnorm(n, 3, 2)
x3<- runif(n, 0, 1)
x4<- rbinom(n, 1, 0.2)
X<- cbind(x1, x2, x3, x4)
apply(X, 2, mean) # "2" means "by column"
## x1 x2 x3 x4
## -0.02205543 2.99034062 0.47956920 0.22200000
mean
in the apply()
with
the summary statistics function you defined in previous exercise. See
what you got?sapply
, lapply
,
tapply
on Google and learn when and how each of them is
used.table()
The frequency (count) table is usually for categorical data. For a
single variable table()
produces the frequency table.
table(movie$color)
##
## Black and White Color
## 209 4815
# you can also get the proportion
prop.table(table(movie$color))
##
## Black and White Color
## 0.04160032 0.95839968
You may have two variables to obtain a two-way table.
table(movie$content_rating, movie$color)
##
## Black and White Color
## Approved 20 35
## G 2 110
## GP 0 6
## M 0 5
## NC-17 1 6
## Not Rated 23 93
## Passed 6 3
## PG 25 675
## PG-13 33 1425
## R 68 2042
## TV-14 0 30
## TV-G 2 8
## TV-MA 0 19
## TV-PG 0 13
## TV-Y 0 1
## TV-Y7 0 1
## Unrated 11 51
## X 3 10
# to get proportion table, you may need to specify the margin (row, column, or the total)
prop.table(table(movie$content_rating, movie$color), 2)
##
## Black and White Color
## Approved 0.1030927835 0.0077211560
## G 0.0103092784 0.0242664902
## GP 0.0000000000 0.0013236267
## M 0.0000000000 0.0011030223
## NC-17 0.0051546392 0.0013236267
## Not Rated 0.1185567010 0.0205162144
## Passed 0.0309278351 0.0006618134
## PG 0.1288659794 0.1489080079
## PG-13 0.1701030928 0.3143613501
## R 0.3505154639 0.4504742996
## TV-14 0.0000000000 0.0066181337
## TV-G 0.0103092784 0.0017648356
## TV-MA 0.0000000000 0.0041914847
## TV-PG 0.0000000000 0.0028678579
## TV-Y 0.0000000000 0.0002206045
## TV-Y7 0.0000000000 0.0002206045
## Unrated 0.0567010309 0.0112508273
## X 0.0154639175 0.0022060446
dplyr
packageThe package dplyr
, part of tidyverse
, is
designed for data management and manipulation. It is very convenient to
use. We will introduce some key functions.
mutate()
– Create new columnsSuppose you want to create two new columns: log(budget) and log(gross), you can write
movie1<- mutate(movie, log_budget=log(budget), log_gross=log(gross))
There are two things you need to be careful: (1) the name of new
column should not contain “space”, “()”, “:”, etc., while commonly used
special character (to separate words) are “_” and “.”, just like
log_budget
; (2) you should always check if the original
column names are spelled correctly whenever you use them. Here you can
always use names(movie)
to get all column names of the data
movie
.
%>%
The best thing of using functions in dplyr
is that you
can sequentially do many jobs with different functions, connected by
%>%
, which can be interpreted as “then”.
The code below is equivalent to above.
movie1<- movie %>%
mutate(log_budget=log(budget), log_gross=log(gross))
select()
– Select columnsSuppose you want to select four columns (movie_title
,
num_voted_users
, imdb_score
and
log_budget
) from movie1
, you can write
movie_sub1<- movie1 %>%
select(movie_title, num_voted_users, imdb_score, log_budget)
The order of the variable names you specified in
select()
will be the actual column order in the output
data. Therefore, select()
can also be used to re-arrange
the columns.
movie_sub1
from movie
instead
of movie1
? In other words, you need to combine above two
tasks to one procedure using %>%
.movie_sub2
that contains
everything except num_voted_users
, imdb_score
and log_budget
? (Think about how do you exclude elements
from a vector.)movie1
with the
following order: movie_title
, imdb_score
,
log_gross
, log_budget
, and the rest columns?
(Try function everything()
.)filter()
– Filter by columnsSuppose that you want to filter out movies whose imdb score is greater than 8.5, you can write the code
movie2<- filter(movie1, imdb_score>8.5)
Or (using %>%
)
movie2<- movie1 %>%
filter(imdb_score>8.5)
movie_title
, along with
imdb_score
, log_budget
, genres
,
whose imdb_score
> 8 and log_budget
<10th
percentile?movie_title
whose
director_facebook_likes
are 0, and
content_rating
are either “PG” or “PG-13”?Sometime you may want to remove all missing values or only keep
missing values and investigate what was going on. We can filter missing
values using is.na()
. Below is a simple example.
a<- c(1, 2, 3, NA, NA)
is.na(a)
## [1] FALSE FALSE FALSE TRUE TRUE
Now suppose that you want all movies whose imdb score is greater than
8.5 and log_budget
is missing, you can write code
movie2<- movie_sub1 %>%
filter(imdb_score>8.5, is.na(log_budget))
What if you want those whose log_budget
is NOT
missing?
movie2<- movie_sub1 %>%
filter(imdb_score>8.5, !is.na(log_budget))
The exclamation sign “!” means “NOT”.
arrange()
– Sort by columnsThis will not change any information in the data, but only sort
certain column increasingly or decreasingly, like “Sort” in Excel.
Suppose you want to sort imdb score from high to low for
movie2
, you can write code
movie2 %>%
arrange(desc(imdb_score))
## # A tibble: 36 × 4
## movie_title num_voted_users imdb_score log_budget
## <chr> <dbl> <dbl> <dbl>
## 1 The Shawshank Redemption 1689764 9.3 17.0
## 2 The Godfather 1155770 9.2 15.6
## 3 Kickboxer: Vengeance 246 9.1 16.6
## 4 The Dark Knight 1676169 9 19.0
## 5 The Godfather: Part II 790926 9 16.4
## 6 The Lord of the Rings: The Return of the King 1215718 8.9 18.4
## 7 Schindler's List 865020 8.9 16.9
## 8 Pulp Fiction 1324680 8.9 15.9
## 9 The Good, the Bad and the Ugly 503509 8.9 14.0
## 10 12 Angry Men 447785 8.9 12.8
## # … with 26 more rows
Here we did not create any new dataset, probably because we don’t
need a new data, but just look at the first several highly rated movies.
desc()
means descending order, so if you want ascending
order, you don’t need desc()
.
rename()
– Rename column namesmovie3<- movie1%>%
rename(Score=imdb_score, Title=movie_title)
In EDA, obtaining summary statistics by group is a very common task.
In dplyr
, we can use the combination of
group_by()
and summarise()
to quickly get what
we want.
Suppose we want to know the average and sd of log_budget
for color and black and white categories, respectively, we can write
movie1 %>%
group_by(color) %>%
summarise(mean=mean(log_budget, na.rm=T), sd=sd(log_budget, na.rm=T))
## # A tibble: 3 × 3
## color mean sd
## <chr> <dbl> <dbl>
## 1 Black and White 15.3 2.09
## 2 Color 16.5 1.68
## 3 <NA> 15.2 1.87
To summarize multiple variables, we can use
summarise_all()
.
movie1 %>%
select(color, imdb_score, log_budget, log_gross, duration)%>%
group_by(color) %>%
summarise_all(mean, na.rm=T)
## # A tibble: 3 × 5
## color imdb_score log_budget log_gross duration
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Black and White 7.23 15.3 15.5 112.
## 2 Color 6.41 16.5 16.4 107.
## 3 <NA> 6.07 15.2 16.2 90.7
For categorical variable, we can always get frequency table such as
movie1 %>%
count(color)%>%
mutate(prop=prop.table(n))
## # A tibble: 3 × 3
## color n prop
## <chr> <int> <dbl>
## 1 Black and White 209 0.0414
## 2 Color 4815 0.955
## 3 <NA> 19 0.00377
Or (using summarise
)
movie1 %>%
group_by(color) %>%
summarise(Freq=n()) %>%
mutate(prop=prop.table(Freq))
## # A tibble: 3 × 3
## color Freq prop
## <chr> <int> <dbl>
## 1 Black and White 209 0.0414
## 2 Color 4815 0.955
## 3 <NA> 19 0.00377
Can you explain above code?
Now we can do more comprehensive tasks. Suppose we want to know summaries of budget and gross across certain movies with certain content ratings, as well as the frequency, we can write
movie1 %>%
select(content_rating, log_budget, log_gross) %>%
filter(content_rating %in% c("PG", "PG-13", "R")) %>%
group_by(content_rating) %>%
summarise_all(list(mean=mean, sd=sd), na.rm=T)
## # A tibble: 3 × 5
## content_rating log_budget_mean log_gross_mean log_budget_sd log_gross_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 PG 17.0 17.2 1.48 1.80
## 2 PG-13 17.2 17.1 1.29 1.83
## 3 R 16.2 15.9 1.52 2.30
Who are the top 10 directors in terms of the number of movies they made?
Can you obtain the average imdb_score
of the movies
made by these top 10 directors?
Who are the top 10 directors in terms of the average
imdb_score
?
Download the CustomerData to your working directory. Load it to R.
What is the average and standard deviation of DebtToIncomeRatio for retired vs nonretired customers?
What is the median of HHIncome for different job categories?
Re-do (6) for those whose DebtToIncomeRatio falls in the lower half (lower 50%).
Can you construct a two-way table that shows proportion of loan default across job category?
How is loan default associated with age and household size, respectively?
mean(x)
, min(x)
,
max(x)
, median(x)
,
quantile(x, prob=0.8)
.sd(x)
, var(x)
,
mad()
(median absolute deviation).first(x)
, last(x)
,
nth(x, 2)
n()
, sum(!is.na(x))
(count for
nonmissing values), n_distinct(x)
(count for unique
values), sum(x>10)
(count by conditions).hist(movie$imdb_score, col = "blue")
Density plot is a nonparametric fitting.
plot(density(movie$imdb_score))
You can make the plot more elegant with different options. For example, adding a title, adjusting the axis range, renaming the axis label, and so on…
You can also add curves on top of an existing plot by using
lines()
or abline()
function.
hist(movie$imdb_score, prob=T, col="green", breaks=20, main="Histogram and Density of IMDB Score", xlab="IMDB Score")
lines(density(movie$imdb_score), col="red", lwd=2)
# Add a vertical line that indicates the average of Sepal Length
abline(v=mean(movie$imdb_score), col="blue", lty=2, lwd=1.5)
We can use the option below to set the window to display two figures (arranged in 1 row 2 columns).
# set display arrangement
par(mfrow=c(1,2))
# now we can draw two graphs in one window
hist(movie$gross, col = "blue")
hist(log(movie$budget), col = "blue")
This only applies for R basic plot
Details about figure margins can be found here.
plot(log(movie$budget), log(movie$gross))
# with a few more options
plot(log(movie$budget), log(movie$gross), xlab = "log(budget)", ylab="log(gross)", pch=3, col=3)
This is often a useful visualization tool so you can see the
relationship between X and Y across different groups. The most
convenient way to produce such a scatter plot is to use
ggplot
.
movie1%>%filter(content_rating %in% c("G", "PG", "PG-13", "R"))%>%
ggplot(aes(log(budget), log(gross), color=content_rating))+
geom_point(size=2)
## Warning: Removed 657 rows containing missing values (geom_point).
Can you explain the above code?
You can also split the groups to multiple figures.
movie1%>%filter(content_rating %in% c("G", "PG", "PG-13", "R"))%>%
ggplot(aes(log(budget), log(gross), color=content_rating))+
geom_point(size=2)+
facet_wrap(~content_rating, nrow=1)
## Warning: Removed 657 rows containing missing values (geom_point).
boxplot(log(movie$gross)~movie$content_rating, col=rainbow(5))
Bar chart is produces by using a vector of single data points, which is often a vector of summary statistics. Therefore, you need to preprocess your data, and get summary statistics before drawing the bar chart.
# bar chart for average of the 4 quantitative variables
aveg<- apply(iris[,1:4], 2, mean)
barplot(aveg, ylab = "Average")
ggplot2
is a plotting system for R, based on the grammar
of graphics, which tries to take the good parts of base and lattice
graphics and none of the bad parts. It takes care of many of the fiddly
details that make plotting a hassle (like drawing legends) as well as
providing a powerful model of graphics that makes it easy to produce
complex multi-layered graphics. More details can be found at http://ggplot2.org/. Please
see the optional lab notes for EDA of the movie data with
ggplot
.
apply()
functiondplyr
packages.