Exploratory Data Analysis (EDA) is usually the first step when you analyze data, because you want to know what information the dataset carries. In this lab, we introduce basic R functions for EDA in both quantitative and graphical approaches using the movie data as an example.

Exploratory Data Analysis by Numbers

Basic summary Statistics

Before start, always do

  • set the working directory!
  • create a new R script (unless you are continuing last project)
  • Save the R script.

Let’s first load the powerful package tidyverse.

library(tidyverse)

Now let’s load the movie dataset.

movie<- read_csv("movie.csv")
## Rows: 5043 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): color, director_name, actor_2_name, genres, actor_1_name, movie_title, actor_3_name, plot_keywords, movie_imdb_link, language, country, content_rating
## dbl (16): num_critic_for_reviews, duration, director_facebook_likes, actor_3_facebook_likes, actor_1_facebook_likes, gross, num_voted_users, cast_total_facebook_likes, facenumber_in_poster, num_us...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Useful functions to take a first look

# dimension
dim(movie)
# row number
nrow(movie)
# column number
ncol(movie)
# variable names
names(movie)
# structure
str(movie)
# view a few rows
head(movie)

By default, R treat strings as factors (or categorical variables), in many situations (for example, building a regression model) this is what you want because R can automatically create “dummy variables” from the factors.

Simple summary statistics for a single numerical variable

# mean
mean(movie$imdb_score)
## [1] 6.442138
# other functions include sd(), median(), quantile(), min() and max()

However, if the vector contains missing value, you need an extra option.

mean(movie$budget)
## [1] NA
mean(movie$budget, na.rm = T)
## [1] 39752620

Exercise

  1. Can you write a function that takes a numerical variable as the input and produces a vector of summary statistics: mean, sd, min, 1st Quantile, median, 3rd quantile, and max?
  2. Can you create a matrix (or table) that shows the above summary statistics for “imdb_score”, “log(budget)”, “log(gross)”, and “duration”?

Use apply() for multiple variables (matrix)

Let’s generate some random numbers and create a matrix.

n<- 1000
x1<- rnorm(n)
x2<- rnorm(n, 3, 2)
x3<- runif(n, 0, 1)
x4<- rbinom(n, 1, 0.2)
X<- cbind(x1, x2, x3, x4)
apply(X, 2, mean)  # "2" means "by column"
##          x1          x2          x3          x4 
## -0.02205543  2.99034062  0.47956920  0.22200000

Exercise

  1. Try to replace mean in the apply() with the summary statistics function you defined in previous exercise. See what you got?
  2. (Optional) Try to search sapply, lapply, tapply on Google and learn when and how each of them is used.

Frequency tables with table()

The frequency (count) table is usually for categorical data. For a single variable table() produces the frequency table.

table(movie$color)
## 
## Black and White           Color 
##             209            4815
# you can also get the proportion
prop.table(table(movie$color))
## 
## Black and White           Color 
##      0.04160032      0.95839968

You may have two variables to obtain a two-way table.

table(movie$content_rating, movie$color)
##            
##             Black and White Color
##   Approved               20    35
##   G                       2   110
##   GP                      0     6
##   M                       0     5
##   NC-17                   1     6
##   Not Rated              23    93
##   Passed                  6     3
##   PG                     25   675
##   PG-13                  33  1425
##   R                      68  2042
##   TV-14                   0    30
##   TV-G                    2     8
##   TV-MA                   0    19
##   TV-PG                   0    13
##   TV-Y                    0     1
##   TV-Y7                   0     1
##   Unrated                11    51
##   X                       3    10
# to get proportion table, you may need to specify the margin (row, column, or the total)
prop.table(table(movie$content_rating, movie$color), 2)
##            
##             Black and White        Color
##   Approved     0.1030927835 0.0077211560
##   G            0.0103092784 0.0242664902
##   GP           0.0000000000 0.0013236267
##   M            0.0000000000 0.0011030223
##   NC-17        0.0051546392 0.0013236267
##   Not Rated    0.1185567010 0.0205162144
##   Passed       0.0309278351 0.0006618134
##   PG           0.1288659794 0.1489080079
##   PG-13        0.1701030928 0.3143613501
##   R            0.3505154639 0.4504742996
##   TV-14        0.0000000000 0.0066181337
##   TV-G         0.0103092784 0.0017648356
##   TV-MA        0.0000000000 0.0041914847
##   TV-PG        0.0000000000 0.0028678579
##   TV-Y         0.0000000000 0.0002206045
##   TV-Y7        0.0000000000 0.0002206045
##   Unrated      0.0567010309 0.0112508273
##   X            0.0154639175 0.0022060446

Data transformation and summary with dplyr package

The package dplyr, part of tidyverse, is designed for data management and manipulation. It is very convenient to use. We will introduce some key functions.

mutate() – Create new columns

Suppose you want to create two new columns: log(budget) and log(gross), you can write

movie1<- mutate(movie, log_budget=log(budget), log_gross=log(gross))

There are two things you need to be careful: (1) the name of new column should not contain “space”, “()”, “:”, etc., while commonly used special character (to separate words) are “_” and “.”, just like log_budget; (2) you should always check if the original column names are spelled correctly whenever you use them. Here you can always use names(movie) to get all column names of the data movie.

Use of %>%

The best thing of using functions in dplyr is that you can sequentially do many jobs with different functions, connected by %>%, which can be interpreted as “then”.

The code below is equivalent to above.

movie1<- movie %>% 
  mutate(log_budget=log(budget), log_gross=log(gross))

select() – Select columns

Suppose you want to select four columns (movie_title, num_voted_users, imdb_score and log_budget) from movie1, you can write

movie_sub1<- movie1 %>% 
  select(movie_title, num_voted_users, imdb_score, log_budget)

The order of the variable names you specified in select() will be the actual column order in the output data. Therefore, select() can also be used to re-arrange the columns.

Exercise

  1. Can you get movie_sub1 from movie instead of movie1? In other words, you need to combine above two tasks to one procedure using %>%.
  2. Can you create a new sub data movie_sub2 that contains everything except num_voted_users, imdb_score and log_budget? (Think about how do you exclude elements from a vector.)
  3. Can you re-arrange the columns of movie1 with the following order: movie_title, imdb_score, log_gross, log_budget, and the rest columns? (Try function everything().)

filter() – Filter by columns

Suppose that you want to filter out movies whose imdb score is greater than 8.5, you can write the code

movie2<- filter(movie1, imdb_score>8.5)

Or (using %>%)

movie2<- movie1 %>%
  filter(imdb_score>8.5)

Exercise

  1. How to filter all movies whose imdb score is between 6 and 7?
  2. How to filter all movies that were made by either Steven Spielberg or Christopher Nolan?
  3. can you get the movie titles and their imdb scores, that were made by Steven Spielberg?
  4. can you get the movie_title, along with imdb_score, log_budget, genres, whose imdb_score > 8 and log_budget<10th percentile?
  5. can you get the movie_title whose director_facebook_likes are 0, and content_rating are either “PG” or “PG-13”?

Missing values

Sometime you may want to remove all missing values or only keep missing values and investigate what was going on. We can filter missing values using is.na(). Below is a simple example.

a<- c(1, 2, 3, NA, NA)
is.na(a)
## [1] FALSE FALSE FALSE  TRUE  TRUE

Now suppose that you want all movies whose imdb score is greater than 8.5 and log_budget is missing, you can write code

movie2<- movie_sub1 %>% 
  filter(imdb_score>8.5, is.na(log_budget))

What if you want those whose log_budget is NOT missing?

movie2<- movie_sub1 %>% 
  filter(imdb_score>8.5, !is.na(log_budget))

The exclamation sign “!” means “NOT”.

arrange() – Sort by columns

This will not change any information in the data, but only sort certain column increasingly or decreasingly, like “Sort” in Excel. Suppose you want to sort imdb score from high to low for movie2, you can write code

movie2 %>% 
  arrange(desc(imdb_score))
## # A tibble: 36 × 4
##    movie_title                                    num_voted_users imdb_score log_budget
##    <chr>                                                    <dbl>      <dbl>      <dbl>
##  1 The Shawshank Redemption                               1689764        9.3       17.0
##  2 The Godfather                                          1155770        9.2       15.6
##  3 Kickboxer: Vengeance                                       246        9.1       16.6
##  4 The Dark Knight                                        1676169        9         19.0
##  5 The Godfather: Part II                                  790926        9         16.4
##  6 The Lord of the Rings: The Return of the King          1215718        8.9       18.4
##  7 Schindler's List                                        865020        8.9       16.9
##  8 Pulp Fiction                                           1324680        8.9       15.9
##  9 The Good, the Bad and the Ugly                          503509        8.9       14.0
## 10 12 Angry Men                                            447785        8.9       12.8
## # … with 26 more rows

Here we did not create any new dataset, probably because we don’t need a new data, but just look at the first several highly rated movies. desc() means descending order, so if you want ascending order, you don’t need desc().

rename() – Rename column names

movie3<- movie1%>%
  rename(Score=imdb_score, Title=movie_title)

Summary by groups

In EDA, obtaining summary statistics by group is a very common task. In dplyr, we can use the combination of group_by() and summarise() to quickly get what we want.

Suppose we want to know the average and sd of log_budget for color and black and white categories, respectively, we can write

movie1 %>%
  group_by(color) %>%
  summarise(mean=mean(log_budget, na.rm=T), sd=sd(log_budget, na.rm=T))
## # A tibble: 3 × 3
##   color            mean    sd
##   <chr>           <dbl> <dbl>
## 1 Black and White  15.3  2.09
## 2 Color            16.5  1.68
## 3 <NA>             15.2  1.87

To summarize multiple variables, we can use summarise_all().

movie1 %>%
  select(color, imdb_score, log_budget, log_gross, duration)%>%
  group_by(color) %>%
  summarise_all(mean, na.rm=T)
## # A tibble: 3 × 5
##   color           imdb_score log_budget log_gross duration
##   <chr>                <dbl>      <dbl>     <dbl>    <dbl>
## 1 Black and White       7.23       15.3      15.5    112. 
## 2 Color                 6.41       16.5      16.4    107. 
## 3 <NA>                  6.07       15.2      16.2     90.7

For categorical variable, we can always get frequency table such as

movie1 %>% 
  count(color)%>%
  mutate(prop=prop.table(n))
## # A tibble: 3 × 3
##   color               n    prop
##   <chr>           <int>   <dbl>
## 1 Black and White   209 0.0414 
## 2 Color            4815 0.955  
## 3 <NA>               19 0.00377

Or (using summarise)

movie1 %>% 
  group_by(color) %>%
  summarise(Freq=n()) %>%
  mutate(prop=prop.table(Freq))
## # A tibble: 3 × 3
##   color            Freq    prop
##   <chr>           <int>   <dbl>
## 1 Black and White   209 0.0414 
## 2 Color            4815 0.955  
## 3 <NA>               19 0.00377

Can you explain above code?

Now we can do more comprehensive tasks. Suppose we want to know summaries of budget and gross across certain movies with certain content ratings, as well as the frequency, we can write

movie1 %>%
  select(content_rating, log_budget, log_gross) %>%
  filter(content_rating %in% c("PG", "PG-13", "R")) %>%
  group_by(content_rating) %>%
  summarise_all(list(mean=mean, sd=sd), na.rm=T)
## # A tibble: 3 × 5
##   content_rating log_budget_mean log_gross_mean log_budget_sd log_gross_sd
##   <chr>                    <dbl>          <dbl>         <dbl>        <dbl>
## 1 PG                        17.0           17.2          1.48         1.80
## 2 PG-13                     17.2           17.1          1.29         1.83
## 3 R                         16.2           15.9          1.52         2.30

Exercise

  1. Who are the top 10 directors in terms of the number of movies they made?

  2. Can you obtain the average imdb_score of the movies made by these top 10 directors?

  3. Who are the top 10 directors in terms of the average imdb_score?

  4. Download the CustomerData to your working directory. Load it to R.

  5. What is the average and standard deviation of DebtToIncomeRatio for retired vs nonretired customers?

  6. What is the median of HHIncome for different job categories?

  7. Re-do (6) for those whose DebtToIncomeRatio falls in the lower half (lower 50%).

  8. Can you construct a two-way table that shows proportion of loan default across job category?

  9. How is loan default associated with age and household size, respectively?

Other useful summary functions

  • Measure of location: mean(x), min(x), max(x), median(x), quantile(x, prob=0.8).
  • Measure of spread: sd(x), var(x), mad() (median absolute deviation).
  • Measure of position: first(x), last(x), nth(x, 2)
  • Counts: n(), sum(!is.na(x)) (count for nonmissing values), n_distinct(x) (count for unique values), sum(x>10) (count by conditions).

Exploratory Data Analysis by Visualization

R Basic plot

histogram for individual variables

hist(movie$imdb_score, col = "blue")

Density plot – Fitted curve for histogram

Density plot is a nonparametric fitting.

plot(density(movie$imdb_score))

Combine the histogram and the density chart.

You can make the plot more elegant with different options. For example, adding a title, adjusting the axis range, renaming the axis label, and so on…

You can also add curves on top of an existing plot by using lines() or abline() function.

hist(movie$imdb_score, prob=T, col="green", breaks=20, main="Histogram and Density of IMDB Score", xlab="IMDB Score")
lines(density(movie$imdb_score), col="red", lwd=2)

# Add a vertical line that indicates the average of Sepal Length
abline(v=mean(movie$imdb_score), col="blue", lty=2, lwd=1.5)

We can use the option below to set the window to display two figures (arranged in 1 row 2 columns).

# set display arrangement
par(mfrow=c(1,2))
# now we can draw two graphs in one window
hist(movie$gross, col = "blue")
hist(log(movie$budget), col = "blue")

This only applies for R basic plot

Details about figure margins can be found here.

scatter plot of two variables

plot(log(movie$budget), log(movie$gross))

# with a few more options
plot(log(movie$budget), log(movie$gross), xlab = "log(budget)", ylab="log(gross)", pch=3, col=3)

The figure below tells you which number stands for which shape.

Scatter plot by group

This is often a useful visualization tool so you can see the relationship between X and Y across different groups. The most convenient way to produce such a scatter plot is to use ggplot.

movie1%>%filter(content_rating %in% c("G", "PG", "PG-13", "R"))%>%
  ggplot(aes(log(budget), log(gross), color=content_rating))+
    geom_point(size=2)
## Warning: Removed 657 rows containing missing values (geom_point).

Can you explain the above code?

You can also split the groups to multiple figures.

movie1%>%filter(content_rating %in% c("G", "PG", "PG-13", "R"))%>%
  ggplot(aes(log(budget), log(gross), color=content_rating))+
    geom_point(size=2)+
    facet_wrap(~content_rating, nrow=1)
## Warning: Removed 657 rows containing missing values (geom_point).

boxplot

boxplot(log(movie$gross)~movie$content_rating, col=rainbow(5))

Bar Chart

Bar chart is produces by using a vector of single data points, which is often a vector of summary statistics. Therefore, you need to preprocess your data, and get summary statistics before drawing the bar chart.

# bar chart for average of the 4 quantitative variables
aveg<- apply(iris[,1:4], 2, mean)
barplot(aveg, ylab = "Average")

Exercise

Use ?barplot or Google search to produce following bar chart.


Back to top

Data Visualization with ggplot2

ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics. More details can be found at http://ggplot2.org/. Please see the optional lab notes for EDA of the movie data with ggplot.

Things to remember