Download and Installation

R and RStudio

First, download and install R

Next, download and install RStudio

We will use RStudio throughout the entire course.

Python

Programming with Python

Here are two commonly used environments for Python. Both are included in Anaconda.

  • Jupyter Notebook (a web-based interactive computing environment)
  • Spyder (an Integrated Development Environment (IDE) tailored for data science; it is similar to RStudio, but for Python)

I will focus on R for most of the class. Later this semester, I will demonstrate some machine learning methods in Python. Feel free to explore Python on your own at any time.

The easiest way to explore is to use a generative AI (e.g., ChatGPT). Once you have a piece of R code, ask ChatGPT to convert it to Python. It is extremely helpful.

R Basics

RStudio

RStudio runs on top of R. It is an IDE (Integrated Development Environment) with many advanced features. These lab notes were created with R Markdown, a very useful tool from RStudio.

When you first open RStudio, three panels are showing. However, you need a fourth one, which is the editor window. Click the green plus icon in the top-left corner and select R Script. Write all your code in this editor window, and remember to save it!

Other Panels

  • Console: shows the commands you have run and the corresponding output.
  • Environment: shows what you currently have: data you have loaded, functions you have defined, and other R objects.
  • Files/Plots…: the files in the current working directory, the latest plot you have generated…

Write and run code

Assign values to an object

Type the following code in the editor and run it line by line. To run a line of code, move the cursor to that line and press Ctrl+Enter (Command+Enter on a Mac). To run multiple lines of code, simply highlight those lines and use the same shortcut.

x<- 33
y= 99
x*y+x/y
## [1] 3267.333

“<-” or “=” (they differ only slightly, but I prefer “<-”) assigns the value on the right-hand side (RHS) to the object on the left-hand side (LHS). You choose the name of the LHS object yourself.
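One place where the two operators do behave differently is inside a function call. The sketch below illustrates this with a made-up helper square() and made-up object names val and val2; none of them are part of the lab code.

```r
# Hypothetical helper, used only to illustrate the point
square <- function(val) val^2

# "=" inside a call only names the argument; no object called val is created
square(val = 3)                    # 9
exists("val", inherits = FALSE)    # FALSE, assuming you have not defined val yourself

# "<-" inside a call first creates val2 in your workspace, then passes it on
square(val2 <- 4)                  # 16
val2                               # 4
```

At the top level of a script, as in the lines above with x and y, both operators do the same thing.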

After you run the code, what did you find in the Global Environment (Workspace) window?

Please try more functions:

log(x); exp(x/y); sin(x); cos(y); sqrt(y)

Exercise

Economic Order Quantity Model: \(Q=\sqrt{2DK/h}\)

  • D=5000: annual demand quantity
  • K=$4: fixed cost per order
  • h=$0.5: holding cost per unit
  • Q=?

Packages

R is open-source software, meaning anyone can contribute by writing R packages and sharing them with the community. A package usually consists of several R functions, datasets, and documentation designed for specific tasks. There are over 10,000 packages in CRAN, covering a wide array of functionalities.

You may call yourself a software developer if you can write R packages. If you are interested in writing a package, here is a good book to read: http://r-pkgs.had.co.nz/.

Example: Installing and Using a Package

Let’s start with a basic example. We’ll install and load the tidyverse package, which is a collection of R packages designed for data science. The tidyverse includes packages like ggplot2 for visualization, dplyr for data manipulation, and others.

install.packages("tidyverse")
library(tidyverse)

Once loaded, you can use functions from the tidyverse. For instance, let’s use ggplot2 to create a simple scatter plot:

# Example: Creating a Scatter Plot
# Using built-in 'mtcars' dataset
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs Weight", x = "Weight", y = "Miles per Gallon")

Exercise 1: Exploring a Package

  1. Use the ggplot() function to create a bar plot showing the number of cars in each cylinder category (cyl) using the mtcars dataset. (No idea? Ask ChatGPT)
  2. Customize the plot by adding labels, changing colors, or experimenting with different geom_ functions.

Additional Example: Data Manipulation with dplyr

The dplyr package, also part of the tidyverse, is extremely useful for data manipulation. Here’s an example:

# Example: Filtering and Summarizing Data
# Filter cars with mpg greater than 20 and summarize average horsepower
mtcars %>%
  filter(mpg > 20) %>%
  summarise(avg_hp = mean(hp))
##   avg_hp
## 1   88.5
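For comparison, here is a minimal base R sketch of the same filter-then-summarize step. It needs no packages; the object name fuel_efficient is made up for this illustration.

```r
# Base R version of the dplyr pipeline above
fuel_efficient <- mtcars[mtcars$mpg > 20, ]   # keep only rows with mpg greater than 20
nrow(fuel_efficient)                          # number of cars that pass the filter
mean(fuel_efficient$hp)                       # average horsepower of those cars: 88.5
```

Both versions give the same number. The dplyr version tends to read more naturally once several steps are chained together.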

Exercise 2: Data Manipulation Practice

  1. Using the mtcars dataset, filter the data to include only cars with 6 cylinders.
  2. Calculate the average miles per gallon (mpg) for these cars.
  3. Arrange the filtered data by descending order of horsepower (hp).

Directory Setup

The working directory is the folder where R looks for the data you load and where you save your output and code.

Setting a working directory

To see the current working directory, type getwd() in the console. To set the working directory, use setwd("the path"), or click Session -> Set Working Directory -> Choose Directory and choose the folder where you wish to save your work.
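A minimal sketch of these commands in action. Since your folder names are unknown here, it uses tempdir(), R's per-session scratch folder, as a stand-in for "the path"; replace it with your own course folder.

```r
old_wd <- getwd()     # remember where we started
setwd(tempdir())      # stand-in for setwd("the path")
getwd()               # now reports the scratch folder
list.files()          # files R can see here without a full path
setwd(old_wd)         # switch back to the original folder
```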

Start a project

This creates an R project file in a directory. Think of it as creating an “R version” of a folder. The file lives in the folder you specify. When you click it, RStudio opens with that folder as the working directory. This is the easiest way to organize a project that involves a lot of R code.

Learning Resources

  • Google: simply search “how to … with R”.
  • Stack Overflow: a searchable Q&A site oriented toward programming issues.
  • Cross Validated: a searchable Q&A site oriented toward statistical analysis.
  • R-bloggers: a central hub of content from over 500 bloggers who provide news and tutorials about R.
  • Use the question mark in the R console: this is the most convenient way to learn R functions. More than 80% of my programming time is spent reading the R help documents.
    • Please try “?lm” (type it in your console).
  • Use generative AI such as ChatGPT: this is your virtual TA. You’d better make friends with it!

Data Structure

There are four types of data structures in R: vector, matrix, data frame, and list.

Vector

To assign a list of numbers (a vector) to a variable, put the numbers, separated by commas, inside the c() function. As an example, we can create a new variable called “z” that contains the numbers 3, 5, 7, and 9:

# Define numerical vector z
z<- c(3,5,7,9)
# Define character vector zz
zz<- c("cup", "plate", "pen", "paper")

Note that you can put a # in front of a line to write a comment in your code.

Calculation

A single vector

#length
length(z)
## [1] 4
#Average
mean(z)
## [1] 6
#Standard deviation
sd(z)
## [1] 2.581989
#Median
median(z)
## [1] 6
#Max
max(z)
## [1] 9
#Min
min(z)
## [1] 3
#Summary Stats
summary(z)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     4.5     6.0     6.0     7.5     9.0
#frequency
table(z)
## z
## 3 5 7 9 
## 1 1 1 1
table(zz)
## zz
##   cup paper   pen plate 
##     1     1     1     1

Multiple vectors

# define vector z1
z1 <- c(2,4,6,8)

Elementwise operations (vectors should have the same length; a single number is applied to every element)

z+z1
## [1]  5  9 13 17
z*z1
## [1]  6 20 42 72
z+2
## [1]  5  7  9 11
z/10
## [1] 0.3 0.5 0.7 0.9
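It helps to know what R does when the lengths differ: it recycles the shorter vector instead of stopping. That is why z+2 works, and it is also why mismatched lengths can go unnoticed. A small sketch, where short is a made-up vector for illustration:

```r
z <- c(3, 5, 7, 9)
short <- c(1, 2)    # hypothetical shorter vector
z + short           # short is reused as 1, 2, 1, 2, giving 4 7 8 11
```

If the longer length is not a multiple of the shorter one (for example, adding a vector of length 3 to z), R still returns a result but prints a warning.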

A vector of multiple vectors is still a vector.

# define vector z2
z2 <- c(z, z1)
z2
## [1] 3 5 7 9 2 4 6 8

Row and column bind.

cbind(z, z1)
##      z z1
## [1,] 3  2
## [2,] 5  4
## [3,] 7  6
## [4,] 9  8
rbind(z, z1)
##    [,1] [,2] [,3] [,4]
## z     3    5    7    9
## z1    2    4    6    8

Note: when you bind vectors by row or column, the vectors must have the same length.

Exercise

Economic Order Quantity Model: \(Q=\sqrt{2DK/h}\). Calculate the order quantity for 5 stores, where the parameters are shown below.

  • Annual demand quantity: D= (5000, 3000, 3500, 6000, 4500)
  • Fixed cost per order: K= (4, 5, 5, 3, 4)
  • Holding cost per unit: h=$0.5
  • Q=?

Subsetting

How to extract the second entry of vector z2?

z2[2]
## [1] 5

How to extract all elements greater than 3 from vector z2?

# first, locate these numbers. That is, find the indices
ind <- which(z2>3)
# Then use these indices to subset
z2[ind]
## [1] 5 7 9 4 6 8

This is equivalent to

z2[z2>3]
## [1] 5 7 9 4 6 8
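To see why the shorter form works, it can help to look at the comparison on its own. In this sketch, keep is a made-up name for the intermediate result.

```r
z2 <- c(3, 5, 7, 9, 2, 4, 6, 8)
keep <- z2 > 3    # one TRUE/FALSE per element of z2
keep              # FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
sum(keep)         # TRUE counts as 1, so this is the number of elements above 3: 6
z2[keep]          # keeps the positions where keep is TRUE: 5 7 9 4 6 8
```

So which() turns the TRUE positions into indices, while the logical vector can also be used for subsetting directly.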

How to extract all elements greater than 3 and smaller than 6 from vector z2?

z2[which(z2>3 & z2<6)]
## [1] 5 4

This is equivalent to

z2[z2>3 & z2<6]
## [1] 5 4

Which one is equal to 4?

which(z2==4)
## [1] 6

Find the indices of the smallest and largest numbers.

which.min(z2)
## [1] 5
which.max(z2)
## [1] 4

How to order the vector z2 from smallest to largest?

# from smallest to largest
z2[order(z2)]
## [1] 2 3 4 5 6 7 8 9
# from largest to smallest
z2[order(z2, decreasing = TRUE)]
## [1] 9 8 7 6 5 4 3 2

Exercise:

  1. Find the elements of z2 that are smaller than 3 or greater than 7.
  2. Find the second largest number in vector z2.

Matrix

Create a matrix

Create a matrix using the matrix() function

In addition to rbind() and cbind(), the function matrix() can be used to create a matrix from a given vector.

# define a matrix A
A <- matrix(data = z2, nrow = 4, ncol = 2)
A
##      [,1] [,2]
## [1,]    3    2
## [2,]    5    4
## [3,]    7    6
## [4,]    9    8
class(A)
## [1] "matrix" "array"

In R functions, you may omit the argument names and just supply the inputs in the right order.

A <- matrix(z2, 4, 2)
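The reverse also holds: when you do name the arguments, their order no longer matters. A short sketch (by_position and by_name are made-up object names):

```r
z2 <- c(3, 5, 7, 9, 2, 4, 6, 8)
by_position <- matrix(z2, 4, 2)                    # relies on the order data, nrow, ncol
by_name <- matrix(ncol = 2, data = z2, nrow = 4)   # names given, order shuffled
identical(by_position, by_name)                    # TRUE
```

Naming arguments is usually the safer habit for functions with many inputs.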

By default, the numbers of a vector fill the matrix column by column, but you can fill it row by row using the additional argument byrow=TRUE.

A <- matrix(data = z2, nrow = 4, ncol = 2, byrow = TRUE)

Question: What happens if you specify ncol=3?

Matrix Calculation

Dimension

dim(A)
## [1] 4 2

Elementwise operations for matrices

A+2
##      [,1] [,2]
## [1,]    5    7
## [2,]    9   11
## [3,]    4    6
## [4,]    8   10

Transpose and Multiplication

# Transpose
t(A)
##      [,1] [,2] [,3] [,4]
## [1,]    3    7    2    6
## [2,]    5    9    4    8
# Multiplied by a matrix
t(A) %*% A
##      [,1] [,2]
## [1,]   98  134
## [2,]  134  186
# Multiplied by a vector
A %*% c(2,3)
##      [,1]
## [1,]   21
## [2,]   41
## [3,]   16
## [4,]   36
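A common slip is to use * when matrix multiplication is intended. The sketch below contrasts the two operators on a small made-up matrix M.

```r
M <- matrix(1:4, nrow = 2)   # hypothetical 2 x 2 matrix with columns (1, 2) and (3, 4)
M * M      # elementwise: every entry is squared
M %*% M    # matrix product: rows of the first times columns of the second
```

Here M * M has rows (1, 9) and (4, 16), while M %*% M has rows (7, 15) and (10, 22).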

Exercise

  1. Define the following matrix in R. \[A=\left[\begin{array} {rrr} 2 & 1 & 4\\ -1 & 3 & 12\\ 4 & 1 & 4\\ 15 & 8 & 7 \end{array}\right]\]

  2. Multiply by vector (1, 2, -1).

Subsetting

length(z)
## [1] 4
z[1:3]
## [1] 3 5 7
#All but the first element in z
z[-1]
## [1] 5 7 9
A[2,2]
## [1] 9
A[1, ]
## [1] 3 5
A[2:3, ]
##      [,1] [,2]
## [1,]    7    9
## [2,]    2    4
A[, 2]
## [1] 5 9 4 8

Exercise

  1. Define matrix \(B\) as the 2nd and 4th row of matrix \(A\) in previous exercise.
  2. Define a vector \(x\) that is the first column of matrix \(A\).
  3. Multiply matrix \(B\) by the first 3 elements of vector \(x\).

Data Frame

A data frame is what we usually call a “dataset” in R. It is a table in which each row is an observation and each column represents a variable. A data frame has column names (variable names) and row names.

Convert a matrix to data frame

Function data.frame(A) converts the matrix \(A\) into a data frame.

mydf <- data.frame(A) 
class(mydf)
## [1] "data.frame"
# variable name of data frame
names(mydf)
## [1] "X1" "X2"
names(mydf)<- c("Math", "History")
mydf <- data.frame(z, z1)
names(mydf)
## [1] "z"  "z1"
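Unlike a matrix, a data frame can hold columns of different types. The sketch below builds one directly from made-up data (the student names and scores are invented for illustration) and shows a few common ways to subset it.

```r
# Hypothetical data: one character, one numeric, and one logical column
scores <- data.frame(
  student = c("Ann", "Bob", "Cal"),
  Math = c(90, 85, 77),
  passed = c(TRUE, TRUE, FALSE)
)
str(scores)                            # one line per column, with its type
scores$Math                            # a single column as a vector
scores[scores$Math > 80, ]             # rows where Math is above 80
scores[scores$Math > 80, "student"]    # the same rows, one column only
```

Row and column subsetting uses the same [row, column] pattern as for a matrix.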

Read external data file (.txt and .csv files)

read.csv() is commonly used to import csv files. You can also use the Import Dataset wizard in RStudio. The package “readxl” allows you to read xls/xlsx files. A tidy version of read.csv() is read_csv(), which comes from the readr package and is loaded with the tidyverse. Now import the movie data using both functions and see what the difference is. Make sure the movie data file is in your working directory.

movie1 <- read.csv("movie.csv")
movie2 <- read_csv("movie.csv")
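If you do not have the movie file at hand, the following sketch shows the same comparison on a tiny made-up CSV written to a temporary file. The columns title and rating are invented and are not the columns of the movie data. The read_csv() part only runs if the readr package is installed.

```r
# Write a small hypothetical CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(title = c("A", "B"), rating = c(7.5, 8.1)),
          tmp, row.names = FALSE)

demo1 <- read.csv(tmp)
class(demo1)                        # "data.frame"

if (requireNamespace("readr", quietly = TRUE)) {
  demo2 <- readr::read_csv(tmp)
  print(class(demo2))               # a tibble, which is a data frame with extra classes
}
```

The main visible difference is the class of the result and how it prints in the console.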

Load built-in dataset

#Load mtcars dataset that comes with R
data(mtcars)
?mtcars

Summary of a dataset

#Dimension 
dim(mtcars)
#Preview the first few rows
head(mtcars)
#Variable names
names(mtcars)
#Summary
summary(mtcars)
#Structure
str(mtcars)

Now let’s look into some specific variables.

# mean of mpg
mean(mtcars$mpg)
## [1] 20.09062
# frequency of cyl
table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14
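Counts are often easier to read as proportions. A short sketch using the same variable (cyl_counts is a made-up name):

```r
cyl_counts <- table(mtcars$cyl)
prop.table(cyl_counts)                    # share of cars in each cylinder group
round(100 * prop.table(cyl_counts), 1)    # the same shares as percentages
mean(mtcars$cyl == 4)                     # proportion of 4-cylinder cars: 11/32 = 0.34375
```

The last line works because TRUE counts as 1 and FALSE as 0, so the mean of a logical vector is a proportion.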

List

A list is a container. You can put different types of objects into a list.

mylist<- list(myvector=z, mymatrix=A, mydata=mtcars)

Most R functions return a list that contains several objects.

A Simple Linear Regression Model, and a List of output

#fit a simple linear regression of mpg on horsepower (hp)
lm(mpg~hp, data=mtcars)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Coefficients:
## (Intercept)           hp  
##    30.09886     -0.06823

There are three ways to get the ith element from a list:

  • listname[[i]]
  • listname[["elementname"]]
  • listname$elementname

Note that you use double square brackets for indexing a list.

reg = lm(mpg~hp, data=mtcars)
reg[[1]]
reg[["coefficients"]]
reg$coefficients
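To see what is stored in the regression output and why double brackets matter, the following sketch may help. It refits the same model so that it runs on its own.

```r
reg <- lm(mpg ~ hp, data = mtcars)
names(reg)                      # element names; a misspelled name returns NULL, not an error
class(reg["coefficients"])      # single brackets return a smaller list: "list"
class(reg[["coefficients"]])    # double brackets return the element itself: "numeric"
coef(reg)                       # helper function that extracts the same coefficients
```

Checking names() first is a quick way to avoid typos in element names.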

A Simple Scatter Plot

plot(mtcars$hp, mtcars$mpg, xlab = "horsepower", ylab = "mpg")

Exercise

  1. Define a vector with values (5, 2, 11, 19, 3, -9, 8, 20, 1). Calculate the sum, mean, and standard deviation.

  2. Re-order the vector from largest to smallest, and make it a new vector.

  3. Convert the vector to a 3*3 matrix ordered by column. What is the sum of the first column? What is the number in column 2, row 3? What are the column sums?

  4. Download the CustomerData to your working directory. Load it to R.

  • How many rows and columns are there?
  • Extract all variable names.
  • What is the average “Debt to Income Ratio”?
  • What is the proportion of “Married” customers?

R Basic Plot

Histogram

A histogram is used to visualize the distribution of a single variable. Let’s define a vector \(x\) that has 100 random numbers from a standard normal distribution.

x <- rnorm(100)
hist(x)
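Because rnorm() draws new random numbers every time, your histogram will look a little different from your neighbor's. If you want results that can be reproduced, one option is to set a seed first. A sketch, where the seed value 123 is an arbitrary choice:

```r
set.seed(123)      # any integer works; 123 is arbitrary
a <- rnorm(100)
set.seed(123)      # resetting the same seed replays the same random draws
b <- rnorm(100)
identical(a, b)    # TRUE
```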

If \(x\) is a sequence, e.g., time series, we can draw a plot to visualize how it changes over time.

plot(x, type = 'b')

Scatter plot

A scatter plot is used to visualize how two numeric variables are correlated.

x <- rnorm(100)
y <- -1+1.5*x
plot(x, y)

Now let’s add some random errors to \(y\).

y <- -1+1.5*x+rnorm(100)
plot(x, y)
abline(lm(y~x), lty=2)  # add a linear regression line on top of the scatter plot

More data visualizations are introduced in Lab 2.

Exercise

  1. Draw a scatter plot between mpg and hp in the mtcars dataset.

  2. Add a regression line on top of the scatter plot.

  3. What are the mean and standard deviation of each numeric variable?

Function

  • R programming is essentially applying and writing functions.
  • An R function may require multiple inputs; we call them arguments.
  • Use “?” followed by the function name to learn how to use that function.
  • We introduce how to write simple functions here.

Univariate function

The simplest form of a function is a function with one input. For example, \(y=3+2x+x^2\). To define this function in R, we need (1) a function name, (2) function(), and (3) an input name.

fx<- function(x) 3+2*x+x^2

There are a few different ways to write this function.

fx<- function(x){
  3+2*x+x^2
}
fx<- function(x){
  y<- 3+2*x+x^2
  return(y)
}

Here the function name is simply fx, but it can be anything you like. The input name is x; again, it can be another name. To use this function, we just need function_name(input_name = value).

fx(x=2)  # or you can omit the input name "x="
## [1] 11

Note: the input name x in the function is NOT an object in the environment. So you don’t need to worry if x is already defined with another value in your workspace.
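A quick way to convince yourself of this is the following sketch, which gives x a value in the workspace (100 is an arbitrary number) and then calls the function with a different input.

```r
fx <- function(x) 3 + 2 * x + x^2

x <- 100      # an object called x in the workspace
fx(x = 2)     # uses 2 inside the function: 3 + 4 + 4 = 11
x             # the workspace x is untouched: still 100
```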

Now try the following. Can you make a scatter plot of x vs y?

x<- seq(-10,10,0.1)
y<- fx(x)

Function beyond numeric calculation

A function for vector winsorization

mywinsor<- function(x, lower, upper){
  x[which(x<lower)]<- lower
  x[which(x>upper)]<- upper
  return(x)
}

You just defined a global function for winsorization. Now let’s apply it to a random vector z, where we truncate at lower=-2 and upper=2. Since z is random, your numbers will differ from the output shown below.

z<- rnorm(10, mean=0, sd=3)
mywinsor(x = z, lower = -2, upper = 2)
##  [1]  2.000000 -2.000000  1.657777 -2.000000  2.000000  0.370794  2.000000
##  [8] -1.565166  2.000000  1.492595
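For a result you can check by hand, here is the same function applied to the vector z2 from earlier, truncating at lower=3 and upper=7. The last line is an alternative sketch using the base functions pmin() and pmax(); it is a different technique from the which() approach above, but it gives the same answer.

```r
mywinsor <- function(x, lower, upper) {
  x[which(x < lower)] <- lower   # pull small values up to the lower bound
  x[which(x > upper)] <- upper   # pull large values down to the upper bound
  return(x)
}

z2 <- c(3, 5, 7, 9, 2, 4, 6, 8)
mywinsor(z2, lower = 3, upper = 7)   # 3 5 7 7 3 4 6 7
pmin(pmax(z2, 3), 7)                 # same result without explicit indexing
```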

Exercise

  1. For the function we saw before, \(f(x)=3+2x+x^2\), what is the derivative of \(f(x)\)? Can you define this function, say \(g(x)\), in R?

  2. Can you define the following function in R? What does this function look like? How is this function different from \(f(x)=x^2\)? Can you compare these two functions in a plot? \[ f(x)=\begin{cases} \dfrac{x^2}{2\times 0.2} & \quad \text{if } |x|\le 0.2 \\ |x|-\dfrac{0.2}{2} & \quad \text{if } |x|> 0.2 \end{cases} \]

Loop

There are two ways to write a loop: the while loop and the for loop. Loops are very useful for iterative and repetitive computation.

For example: calculate \(1+1/2+1/3+...+1/100\).

Using while loop

i<- 1
x<- 1
while(i<100){
  i<- i+1
  x<- x+1/i
}
x
## [1] 5.187378

Using for loop

x<- 1
for(i in 2:100){
  x<- x+1/i
}
x
## [1] 5.187378
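For this particular sum, a loop is not strictly necessary, because R's arithmetic is vectorized. The sketch below is an alternative to the two loops above, not a replacement for learning them.

```r
sum(1 / (1:100))                   # the whole sum in one line: 5.187378
cumsum(1 / (1:100))[c(10, 100)]    # running totals after 10 and after 100 terms
```

Loops remain the natural tool when each step depends on the result of the previous one, as in the exercises below.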

Exercise:

  1. Do you think \(1+1/2^2+1/3^2+...+1/n^2\) converges or diverges as \(n\to \infty\)? Use R to verify your answer.
  2. Fibonacci sequence: 1, 1, 2, 3, 5, 8, 13,… What is the next number? What is the 50th number? Create a vector of the first 20 Fibonacci numbers.
  3. In exercise 1 of the Function section, you worked with \(f(x)=3+2x+x^2\). What is the value of \(x\) such that \(f(x)\) is at its minimum? Now try to implement the following algorithm, and see what you get.
    • set \(x_0=10\) (or any random number)
    • calculate \(g(x_0)\), which is the derivative of \(f(x)\)
    • compute \(x_1=x_{0}-0.001*g(x_{0})\), and the absolute difference between \(x_1\) and \(x_0\), i.e., \(d=|x_1-x_0|\). Then reset \(x_0=x_1\).
    • repeat the previous two steps until \(d<0.00001\). What is the current \(x_0\)? How many iterations have been done?

Summary

Do you like R?

YES! (That is great!)

NO! (You may have trouble in the rest of the semester…)

What you need to remember

  • Set working directory.
  • What are the types of data in R? How to define them?
  • How to load an external data file?
  • How to subset/index a vector, a matrix, and a data frame?
  • How to write an R function?
  • How to write a loop in R?
  • Don’t forget about your virtual TA! This is a friend you will need!