Here are two commonly used environments for Python. Both are included in Anaconda.
I will focus on R for most of the class. Later this semester, I will demonstrate some machine learning methods in Python. Please feel free to explore Python on your own at any time.
The easiest way to explore is to use a generative AI (e.g., ChatGPT). Once you have a piece of R code, ask ChatGPT to convert it to Python. It is extremely helpful.
RStudio runs on top of R. It is an IDE (Integrated Development Environment) with many advanced features. These lab notes were created with R Markdown, a very useful tool from RStudio.
When RStudio opens, three panels are showing. However, you need a fourth one, the editor window. Click the green-plus icon in the top-left corner and select R Script. Write all your code in this editor window, and remember to save it!
Type the following code in the editor and run it line by line. To run a line of code, move the cursor to that line and press Ctrl+Enter (Command+Enter on Mac). To run multiple lines of code, highlight those lines and use the same shortcut.
x<- 33
y= 99
x*y+x/y
## [1] 3267.333
“<-” and “=” (they differ very little, but I prefer “<-”) both assign the value on the right-hand side (RHS) to the object on the left-hand side (LHS). You choose the name of the LHS object.
After you run the code, what do you find in the Global Environment (Workspace) window?
Please try more functions:
log(x); exp(x/y); sin(x); cos(y); sqrt(y)
Economic Order Quantity Model: \(Q=\sqrt{2DK/h}\)
R is open-source software, meaning anyone can contribute by writing R packages and sharing them with the community. A package usually consists of several R functions, datasets, and documentation designed for specific tasks. There are over 10,000 packages in CRAN, covering a wide array of functionalities.
You may call yourself a software developer if you can write R packages. If you are interested in writing packages, here is a good book to read: http://r-pkgs.had.co.nz/.
Let’s start with a basic example. We’ll install and load the tidyverse package, which is a collection of R packages designed for data science. The tidyverse includes packages like ggplot2 for visualization, dplyr for data manipulation, and others.
install.packages("tidyverse")
library(tidyverse)
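If you rerun your script, install.packages() downloads the package again every time. Here is a minimal sketch of one way to avoid that. The helper name ensure_package is my own (hypothetical, not a built-in R function); requireNamespace() is the base R function that checks whether a package is installed.

```r
# Install a package only when it is not already available, then load it.
# `ensure_package` is a hypothetical helper name, not part of R.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

ensure_package("stats")        # ships with R, so nothing is downloaded
# ensure_package("tidyverse")  # same call for the package used in this lab
```

After the first install, the call only loads the package.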
Once loaded, you can use functions from the tidyverse. For instance, let’s use ggplot2 to create a simple scatter plot:
# Example: Creating a Scatter Plot
# Using built-in 'mtcars' dataset
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs Weight", x = "Weight", y = "Miles per Gallon")
Use the ggplot() function to create a bar plot showing the number of cars in each cylinder category (cyl) using the mtcars dataset. (No idea? Ask ChatGPT.)
Try other geom_ functions.
The dplyr package, also part of the tidyverse, is extremely useful for data manipulation. Here’s an example:
# Example: Filtering and Summarizing Data
# Filter cars with mpg greater than 20 and summarize average horsepower
mtcars %>%
  filter(mpg > 20) %>%
  summarise(avg_hp = mean(hp))
## avg_hp
## 1 88.5
Using the mtcars dataset, filter the data to include only cars with 6 cylinders. Then summarize miles per gallon (mpg) and horsepower (hp) for these cars.
The working directory is the folder from which you load data and in which you save your output and code.
To see the current working directory, type getwd() in the console. To set the working directory, use setwd("the path"), or click Session -> Set Working Directory -> Choose Directory and choose the folder in which you wish to save your work.
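A small sketch of changing the working directory from code and then switching back. I use tempdir(), R's temporary folder for the current session, as a stand-in for your own project folder; old_wd is just a name I picked.

```r
old_wd <- getwd()    # remember the current working directory
setwd(tempdir())     # switch to the session's temporary folder (stand-in path)
getwd()              # now reports the temporary folder
setwd(old_wd)        # switch back to where we started
```

In your own work, replace tempdir() with the path to the folder that holds your data and scripts.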
An R project creates a project file in a directory. Think of it as an “R version” of a folder. The project file lives in the folder you specify; when you click it, RStudio opens with that folder as the working directory. This is the easiest way to organize a project that involves a lot of R code.
These notes cover four types of data structure in R: vector, matrix, data frame, and list.
To assign a list of numbers (a vector) to a variable, put the numbers inside the c() function, separated by commas. As an example, we can create a new variable called “z” that contains the numbers 3, 5, 7, and 9:
# Define numerical vector z
z<- c(3,5,7,9)
# Define character vector zz
zz<- c("cup", "plate", "pen", "paper")
Note that you can put a # in front of a line to write a comment in your code.
#length
length(z)
## [1] 4
#Average
mean(z)
## [1] 6
#Standard deviation
sd(z)
## [1] 2.581989
#Median
median(z)
## [1] 6
#Max
max(z)
## [1] 9
#Min
min(z)
## [1] 3
#Summary Stats
summary(z)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 4.5 6.0 6.0 7.5 9.0
#frequency
table(z)
## z
## 3 5 7 9
## 1 1 1 1
table(zz)
## zz
## cup paper pen plate
## 1 1 1 1
# define vector z1
z1 <- c(2,4,6,8)
Elementwise operations (two vectors should have the same length; a single number is applied to every element)
z+z1
## [1] 5 9 13 17
z*z1
## [1] 6 20 42 72
z+2
## [1] 5 7 9 11
z/10
## [1] 0.3 0.5 0.7 0.9
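What happens when the two vectors do not have the same length? R reuses ("recycles") the shorter one. A quick sketch, where short is a made-up vector for illustration:

```r
z <- c(3, 5, 7, 9)
short <- c(10, 20)   # hypothetical shorter vector
z + short            # short is reused as 10, 20, 10, 20
## [1] 13 25 17 29
```

This is the same mechanism behind z+2: the single number 2 is reused for every element of z.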
Combining multiple vectors with c() still gives a vector.
# define vector z2
z2 <- c(z, z1)
z2
## [1] 3 5 7 9 2 4 6 8
Row and column bind.
cbind(z, z1)
## z z1
## [1,] 3 2
## [2,] 5 4
## [3,] 7 6
## [4,] 9 8
rbind(z, z1)
## [,1] [,2] [,3] [,4]
## z 3 5 7 9
## z1 2 4 6 8
Note that when you bind vectors by row or column, the vectors should have the same length.
Economic Order Quantity Model: \(Q=\sqrt{2DK/h}\). Calculate the order quantity for 5 stores, where the parameters are shown below.
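Because arithmetic on vectors is elementwise, the formula can be applied to all stores at once. The numbers below are made up purely for illustration (they are not the parameters for this exercise), so treat this as a sketch of the approach only.

```r
# Hypothetical parameters for 5 stores (NOT the values for the exercise)
D <- c(1000, 1200, 800, 1500, 2000)   # demand
K <- c(50, 40, 50, 60, 45)            # fixed ordering cost
h <- c(2, 2, 2.5, 3, 1.5)             # holding cost

Q <- sqrt(2 * D * K / h)              # one order quantity per store
Q
```

Each element of Q uses the matching elements of D, K, and h, so no loop is needed.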
How to extract the second entry of vector z2?
z2[2]
## [1] 5
How to extract all elements greater than 3 from vector z2?
# first, locate these numbers. That is, find the indices
ind <- which(z2>3)
# Then use these indices to subset
z2[ind]
## [1] 5 7 9 4 6 8
This is equivalent to
z2[z2>3]
## [1] 5 7 9 4 6 8
How to extract all elements greater than 3 and smaller than 6 from vector z2?
z2[which(z2>3 & z2<6)]
## [1] 5 4
This is equivalent to
z2[z2>3 & z2<6]
## [1] 5 4
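To see why the version without which() works, it helps to look at the condition by itself. A short sketch (keep is just a name I chose):

```r
z2 <- c(3, 5, 7, 9, 2, 4, 6, 8)   # same values as z2 above
keep <- z2 > 3 & z2 < 6            # one TRUE/FALSE per element
keep
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
z2[keep]                           # keeps the elements where keep is TRUE
## [1] 5 4
sum(keep)                          # TRUE counts as 1, so this counts the matches
## [1] 2
```

which() turns the TRUE positions into indices; subsetting accepts either form.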
Which one is equal to 4?
which(z2==4)
## [1] 6
How to find the indices of the smallest and largest numbers?
which.min(z2)
## [1] 5
which.max(z2)
## [1] 4
How to order the vector z2 from smallest to largest?
# from smallest to largest
z2[order(z2)]
## [1] 2 3 4 5 6 7 8 9
# from largest to smallest
z2[order(z2, decreasing = T)]
## [1] 9 8 7 6 5 4 3 2
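order() returns positions, not values, which is why it is used inside the square brackets. If you only need the sorted values, the base R function sort() does it in one step. A small comparison:

```r
z2 <- c(3, 5, 7, 9, 2, 4, 6, 8)   # same values as z2 above
order(z2)                          # positions of the elements, smallest first
## [1] 5 1 6 2 7 3 8 4
sort(z2)                           # the sorted values themselves
## [1] 2 3 4 5 6 7 8 9
sort(z2, decreasing = TRUE)
## [1] 9 8 7 6 5 4 3 2
```

order() becomes more useful later, when you want to sort the rows of a data frame by one of its columns.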
In addition to rbind() and cbind(), the function matrix() can be used to create a matrix from a given vector.
# define a matrix A
A <- matrix(data = z2, nrow = 4, ncol = 2)
A
## [,1] [,2]
## [1,] 3 2
## [2,] 5 4
## [3,] 7 6
## [4,] 9 8
class(A)
## [1] "matrix" "array"
In R functions, one may omit the argument names and just put the inputs in the right order.
A <- matrix(z2, 4, 2)
By default, the numbers of a vector fill the matrix column by column, but you can fill it row by row with the additional argument byrow=TRUE.
A <- matrix(data = z2, nrow = 4, ncol = 2, byrow = TRUE)
Question: What would happen if you specify ncol=3?
dim(A)
## [1] 4 2
A+2
## [,1] [,2]
## [1,] 5 7
## [2,] 9 11
## [3,] 4 6
## [4,] 8 10
# Transpose
t(A)
## [,1] [,2] [,3] [,4]
## [1,] 3 7 2 6
## [2,] 5 9 4 8
# Multiplied by a matrix
t(A) %*% A
## [,1] [,2]
## [1,] 98 134
## [2,] 134 186
# Multiplied by a vector
A %*% c(2,3)
## [,1]
## [1,] 21
## [2,] 41
## [3,] 16
## [4,] 36
Define the following matrix in R. \[A=\left[\begin{array} {rrr} 2 & 1 & 4\\ -1 & 3 & 12\\ 4 & 1 & 4\\ 15 & 8 & 7 \end{array}\right]\]
Multiply by vector (1, 2, -1).
length(z)
## [1] 4
z[1:3]
## [1] 3 5 7
#All but the first element in z
z[-1]
## [1] 5 7 9
A[2,2]
## [1] 9
A[1, ]
## [1] 3 5
A[2:3, ]
## [,1] [,2]
## [1,] 7 9
## [2,] 2 4
A[, 2]
## [1] 5 9 4 8
A data frame is what we usually call a “dataset” in R. It is a table in which each row is an observation and each column is a variable. A data frame has column names (variable names) and row names.
The function data.frame(A) converts the matrix \(A\) into a data frame.
mydf <- data.frame(A)
class(mydf)
## [1] "data.frame"
# variable name of data frame
names(mydf)
## [1] "X1" "X2"
names(mydf)<- c("Math", "History")
mydf <- data.frame(z, z1)
names(mydf)
## [1] "z" "z1"
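You can also name the columns when you create the data frame, and then use those names to pull out columns or rows. A minimal sketch, where scores is a name I made up for illustration:

```r
z  <- c(3, 5, 7, 9)
z1 <- c(2, 4, 6, 8)

scores <- data.frame(Math = z, History = z1)  # name the columns at creation
names(scores)
## [1] "Math"    "History"
scores$Math                 # one column, returned as a vector
## [1] 3 5 7 9
scores[scores$Math > 4, ]   # rows where Math is greater than 4
```

The last line uses the same logical subsetting as for vectors, applied to the rows.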
read.csv() is commonly used to import csv files. You can also use the Import Dataset Wizard in RStudio. The package “readxl” allows you to read xls/xlsx files. A tidy version of read.csv() is read_csv(), which is loaded with the tidyverse package. Now import the movie data using both functions and see what the difference is. Make sure the movie data file is in your working directory.
movie1 <- read.csv("movie.csv")
movie2 <- read_csv("movie.csv")
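If you do not have the movie file at hand yet, here is a sketch of the same comparison on a temporary csv file written from mtcars. The temporary file is only a stand-in for movie.csv, and the code assumes the readr package (installed with the tidyverse) is available.

```r
library(readr)   # provides read_csv(); installed as part of the tidyverse

tmp <- tempfile(fileext = ".csv")                 # temporary stand-in file
write.csv(head(mtcars), tmp, row.names = FALSE)

df1 <- read.csv(tmp)    # base R: returns a plain data.frame
df2 <- read_csv(tmp)    # readr: returns a tibble and reports the column types

class(df1)
class(df2)
```

Compare the two class() results and the way df1 and df2 print in the console.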
#Load mtcars dataset that comes with R
data(mtcars)
?mtcars
#Dimension
dim(mtcars)
#Preview the first few rows
head(mtcars)
#Variable names
names(mtcars)
#Summary
summary(mtcars)
#Structure
str(mtcars)
Now let’s look into some specific variables.
# mean of mpg
mean(mtcars$mpg)
## [1] 20.09062
# frequency of cyl
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
A list is a container. You can put different types of objects into a list.
mylist<- list(myvector=z, mymatrix=A, mydata=mtcars)
Most R functions return a list that contains several objects.
#fit a simple linear regression of mpg on hp
lm(mpg~hp, data=mtcars)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Coefficients:
## (Intercept) hp
## 30.09886 -0.06823
There are three ways to get the ith element from a list:
listname[[i]]
listname[["elementname"]]
listname$elementname
Note that you use double square brackets for indexing a list.
reg = lm(mpg~hp, data=mtcars)
reg[[1]]
reg[["coefficients"]]
reg$coefficients
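To find out which element names are available in a fitted model, names() lists them. For the coefficients specifically, base R also has the extractor function coef(). A short sketch:

```r
reg <- lm(mpg ~ hp, data = mtcars)

names(reg)         # names of the objects stored in the fitted model
reg$coefficients   # access by element name
coef(reg)          # extractor function that returns the same coefficients
```

Checking names() first is a good habit, because a misspelled element name returns NULL instead of an error.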
plot(mtcars$hp, mtcars$mpg, xlab = "horsepower", ylab = "mpg")
Define a vector with values (5, 2, 11, 19, 3, -9, 8, 20, 1). Calculate the sum, mean, and standard deviation.
Re-order the vector from largest to smallest, and make it a new vector.
Convert the vector to a 3*3 matrix filled by column. What is the sum of the first column? What is the number in column 2, row 3? What are the column sums?
Download the CustomerData to your working directory. Load it into R.
A histogram is used to visualize the distribution of a single variable. Let’s define a vector \(x\) that has 100 random numbers from a standard normal distribution.
x <- rnorm(100)
hist(x)
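Since rnorm() draws new random numbers each time, your histogram will look a little different from your neighbor's. If you want the same numbers every time you run the code, set the random seed first. A sketch (the seed value 123 is arbitrary, and x1 and x2 are names I picked):

```r
set.seed(123)        # any fixed number works; 123 is arbitrary
x1 <- rnorm(100)
set.seed(123)        # resetting the seed reproduces the same draws
x2 <- rnorm(100)
identical(x1, x2)
## [1] TRUE
hist(x1)
```

Without set.seed(), two calls to rnorm(100) give different vectors.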
If \(x\) is a sequence, e.g., a time series, we can draw a plot to visualize how it changes over time.
plot(x, type = 'b')
A scatter plot is used to visualize how two numeric variables are related.
x <- rnorm(100)
y <- -1+1.5*x
plot(x, y)
Now let’s add some random errors to \(y\).
y <- -1+1.5*x+rnorm(100)
plot(x, y)
abline(lm(y~x), lty=2) # add a linear regression line on top of the scatter plot
Draw a scatter plot of mpg against hp in the mtcars dataset.
Add a regression line on top of the scatter plot.
What are the mean and standard deviation of each numeric variable?
The simplest form of a function is a function with one input. For example, \(y=3+2x+x^2\). To define this function in R, we need (1) a function name, (2) function(), and (3) an input name.
fx<- function(x) 3+2*x+x^2
There are a few different ways to write this function.
fx<- function(x){
  3+2*x+x^2
}
fx<- function(x){
  y<- 3+2*x+x^2
  return(y)
}
Here the function name is simply fx, or anything you like. The input name is x; again, it can be any other name. To use this function, we just need function_name(input_name).
fx(x=2) # or you can omit the input name "x="
## [1] 11
Note: the input name x in the function is NOT an object in the environment. So you don’t need to worry if you want to define other values as x.
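A quick check of this point: define an x in the environment, call the function with a different value, and see that the outside x is untouched.

```r
fx <- function(x) 3 + 2 * x + x^2

x <- 100     # an object named x in the global environment
fx(x = 2)    # inside the function x is 2: 3 + 4 + 4
## [1] 11
x            # the global x did not change
## [1] 100
```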
Now try the following. Can you make a scatter plot of x vs y?
x<- seq(-10,10,0.1)
y<- fx(x)
A function for vector winsorization
mywinsor<- function(x, lower, upper){
  x[which(x<lower)]<- lower
  x[which(x>upper)]<- upper
  return(x)
}
You just defined a global function for winsorization. Now let’s apply it to a random vector z, where we truncate at lower = -2 and upper = 2.
z<- rnorm(10, mean=0, sd=3)
mywinsor(x = z, lower = -2, upper = 2)
## [1] 2.000000 -2.000000 1.657777 -2.000000 2.000000 0.370794 2.000000
## [8] -1.565166 2.000000 1.492595
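The same truncation can be written in one line with the base R functions pmax() and pmin(), which take elementwise maxima and minima. This is an alternative sketch, not a replacement for the version above; mywinsor2 and the vector v are names and values I made up.

```r
mywinsor <- function(x, lower, upper) {
  x[which(x < lower)] <- lower
  x[which(x > upper)] <- upper
  return(x)
}

# pmax() raises values below `lower`; pmin() lowers values above `upper`
mywinsor2 <- function(x, lower, upper) pmin(pmax(x, lower), upper)

v <- c(-5, -1, 0, 1.5, 4)              # hypothetical test vector
mywinsor(v, lower = -2, upper = 2)
## [1] -2.0 -1.0  0.0  1.5  2.0
mywinsor2(v, lower = -2, upper = 2)
## [1] -2.0 -1.0  0.0  1.5  2.0
```

Using a fixed vector like v makes it easy to check by eye that both versions agree.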
For the function we saw earlier, \(f(x)=3+2x+x^2\), what is the derivative of \(f(x)\)? Can you define this function, say \(g(x)\), in R?
Can you define the following function in R? What does this function look like? How is this function different from \(f(x)=x^2\)? Can you compare these two functions in a plot? \[ f(x)=\begin{cases} x^2/(2\times 0.2) & \quad \text{if } |x|\le 0.2 \\ |x|-0.2/2 & \quad \text{if } |x|> 0.2 \end{cases} \]
There are two ways to write a loop: the while loop and the for loop. Loops are very useful for iterative and repetitive computation.
For example: calculate \(1+1/2+1/3+...+1/100\).
i<- 1
x<- 1
while(i<100){
  i<- i+1
  x<- x+1/i
}
x
## [1] 5.187378
x<- 1
for(i in 2:100){
  x<- x+1/i
}
x
## [1] 5.187378
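For a sum like this one, R's elementwise arithmetic gives the same answer without a loop. A short sketch for comparison (x_loop is a name I chose so the two results can sit side by side):

```r
# Loop version, as above
x_loop <- 1
for (i in 2:100) {
  x_loop <- x_loop + 1 / i
}

# Vectorized version: build 1, 1/2, ..., 1/100 and add them up
x_vec <- sum(1 / (1:100))

x_vec
## [1] 5.187378
```

Loops are still the right tool when each step depends on the result of the previous one.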
YES! (That is great!)
NO! (You may have trouble in the rest of the semester…)