Exercise 6: Subsets of data and logical operators
Logical vector and index vector
We often want a subset of a data frame, obtained by filtering its rows or columns. To do this we can perform a logical test, which returns TRUE or FALSE for each element. The resulting logical vector can then be used to select observations from the data frame.
For example, let's extract the observations from bodydata with Weight greater than 80. You might wonder why the following code does not work:
isHeavy <- Weight > 80
Error in eval(expr, envir, enclos): object 'Weight' not found
Remember that the variable Weight is part of bodydata, so we have to extract Weight from bodydata first. In R, the with and within functions help with this. In the following code, the with function goes inside bodydata and evaluates the expression Weight > 80.
isHeavy <- with(bodydata, Weight > 80)
Here the logical vector isHeavy is computed by performing a logical operation on the Weight variable within bodydata. The same operation can be written as,
isHeavy <- bodydata$Weight > 80
Take a look at this variable. What is it?
head(isHeavy)
[1] FALSE TRUE FALSE FALSE FALSE TRUE
Yes, it is a vector of TRUE and FALSE values with the same length as Weight. The condition is applied to each element of Weight and results in TRUE if the element is greater than 80 and FALSE otherwise (that is, if it is 80 or less).
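The two ways of computing the logical vector can be compared on a small example. This is only a sketch: the data frame toy and its values are made up for illustration and are not part of bodydata.

```r
# Toy stand-in for bodydata (hypothetical values, for illustration only)
toy <- data.frame(
  Weight = c(72.5, 85.0, 80.0, 91.3, 64.2),
  Height = c(170, 183, 178, 190, 165)
)

# Evaluate the expression inside the data frame
viaWith <- with(toy, Weight > 80)

# Extract the column first, then compare
viaDollar <- toy$Weight > 80

# Both approaches give the same logical vector
identical(viaWith, viaDollar)

# The third observation weighs exactly 80, so its test is FALSE
viaWith
```

Note that the comparison is strict: an observation with Weight exactly 80 is not counted as heavy.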
- Identify the elements
We can identify which observations are heavy with the which() function.
HeavyId <- which(isHeavy)
This returns a vector of row indices for the observations that are heavy, i.e. with Weight greater than 80. So how many are heavy? To find the size of a vector we can use the length function.
length(HeavyId)
[1] 94
Here, 94 observations have Weight larger than 80.
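As a small illustration of which() and length(), the sketch below uses a made-up weight vector (toyWeight is hypothetical, not taken from bodydata). It also shows that summing the logical vector gives the same count, since TRUE counts as 1 and FALSE as 0.

```r
# Hypothetical weights, for illustration only
toyWeight <- c(72.5, 85.0, 80.0, 91.3, 64.2)

toyHeavy   <- toyWeight > 80      # logical vector, one value per element
toyHeavyId <- which(toyHeavy)     # positions of the TRUE values

toyHeavyId                        # positions 2 and 4
length(toyHeavyId)                # number of heavy observations
sum(toyHeavy)                     # same count, computed from the logical vector
```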
Exercise
Identify who is taller than 180 and save this logical vector as an object called isTall.
Answer
isTall <- with(bodydata, Height > 180)
How many observations have Height greater than 180?
Answer
TallId <- which(isTall)
length(TallId)
[1] 76
How many observations are both tall and heavy? As above, you can count them by applying the length function to the indices returned by which.
Answer
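isBoth <- isHeavy * isTall
- How is this computation done?
- Here isHeavy and isTall contain TRUE and FALSE. In arithmetic, TRUE is treated as 1 and FALSE as 0, so the product is 1 only where both vectors are TRUE and 0 elsewhere. The result is a vector of 0s and 1s rather than a logical vector, and sum(isBoth) gives the number of observations that are both tall and heavy.
Alternatively:
isBoth <- which(isHeavy & isTall)
The & operator returns TRUE where both isHeavy and isTall are TRUE and FALSE otherwise, marking the same observations as the product above. which() then converts this logical vector into row indices, so length(isBoth) gives the count and isBoth can be used directly to select rows.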
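The difference between the product and the & operator matters once the result is used for indexing. The following sketch uses short made-up vectors (toyHeavy, toyTall and toyWeight are hypothetical) to show that both mark the same observations, but only the logical vector or the which() indices select the intended elements.

```r
# Hypothetical logical vectors and weights, for illustration only
toyHeavy  <- c(FALSE, TRUE, TRUE, FALSE)
toyTall   <- c(TRUE, TRUE, FALSE, FALSE)
toyWeight <- c(60, 90, 95, 70)

byProduct <- toyHeavy * toyTall   # 0/1 values, not logical
byAnd     <- toyHeavy & toyTall   # logical vector
byWhich   <- which(byAnd)         # positions of the TRUE values

is.logical(byProduct)             # FALSE
is.logical(byAnd)                 # TRUE

# All three agree on how many observations satisfy both conditions
sum(byProduct); sum(byAnd); length(byWhich)

# Indexing: the logical vector and the positions pick the second element ...
toyWeight[byAnd]
toyWeight[byWhich]

# ... but the 0/1 vector is read as positions: 0 is dropped and 1 picks the FIRST element
toyWeight[byProduct]
```

For this reason, the 0/1 product is fine for counting but should not be used directly as a row index.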
Subsetting data frame
Example 1
Let's create a subset of the data called bodydataTallAndHeavy containing only the observations for tall and heavy persons, as defined by isBoth (the row indices from which(isHeavy & isTall)).
bodydataTallAndHeavy <- bodydata[isBoth, ]
For other logical tests, see the help file ?Comparison.
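A few of the other comparison operators described in ?Comparison can be tried in the same way. The sketch below uses a made-up data frame (toy is hypothetical, not bodydata) and the operators >=, == and !=, together with ! for negation.

```r
# Toy stand-in for bodydata (hypothetical values, for illustration only)
toy <- data.frame(
  Weight = c(72.5, 85.0, 80.0, 91.3),
  Height = c(170, 183, 178, 190)
)

atLeast80 <- toy[toy$Weight >= 80, ]     # greater than or equal to 80
exactly178 <- toy[toy$Height == 178, ]   # equal to 178
not80 <- toy[toy$Weight != 80, ]         # not equal to 80
notHeavy <- toy[!(toy$Weight > 80), ]    # negation of "greater than 80"

nrow(atLeast80)
nrow(exactly178)
nrow(not80)
nrow(notHeavy)
```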
Example 2
Let's create a random subset of 50 observations. For this we first sample 50 row indices at random from all rows in bodydata. The sample function is used for this purpose. In the following code, nrow(bodydata) returns the number of rows in bodydata. The sample function is given two arguments here: x, which can be a vector or an integer, and size, which is the size of the sample to be drawn. When x is a single positive integer, sample draws from the numbers 1 to x.
idx <- sample(x = nrow(bodydata), size = 50)
Here, 50 rows are sampled from the total number of rows, and the indices of the selected rows are saved in the vector idx.
Using this vector we can select the observations in bodydata to create a new data set called bodydataRandom as,
bodydataRandom <- bodydata[idx, ]
Here are the first five rows of the bodydataRandom dataset.
head(bodydataRandom, n = 5)
    Weight Height Age Circumference
234   74.1    168  55          90.1
283   53.6    162  23          63.2
91    70.2    175  27          73.5
378   58.0    163  20          71.0
105   55.5    169  20          68.0
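Because sample draws at random, the selected rows differ from run to run. If a reproducible subset is wanted, one option is to call set.seed before sampling. The sketch below uses a made-up data frame and an arbitrary seed value (toy and the seed 123 are assumptions for illustration only).

```r
# Toy data frame with 10 rows (hypothetical values, for illustration only)
toy <- data.frame(id = 1:10, value = (1:10) * 1.5)

# Setting the same seed before each call gives the same random indices
set.seed(123)
idxA <- sample(x = nrow(toy), size = 4)
set.seed(123)
idxB <- sample(x = nrow(toy), size = 4)
identical(idxA, idxB)

# By default sample draws without replacement, so no row is picked twice
anyDuplicated(idxA)

# Use the sampled indices to select rows
toyRandom <- toy[idxA, ]
nrow(toyRandom)
```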
Exercise
Create a subset of the dataset bodydata that includes the observations with Age larger than 55 and Circumference larger than 80. Save this dataset as subdata.
Answer
idx <- with(bodydata, Age > 55 & Circumference > 80)
subdata <- bodydata[idx, ]
subdata
    Weight Height Age Circumference
136   76.4    185  62          94.8
189   73.6    175  60          90.5
207   66.8    168  62          81.5
231   80.0    174  65          98.6
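The same kind of filtering can also be written with base R's subset function, which evaluates the condition inside the data frame much like with does. This is an alternative sketch rather than the approach used in the answer above, and the data frame toy is made up for illustration.

```r
# Toy stand-in for bodydata (hypothetical values, for illustration only)
toy <- data.frame(
  Age           = c(62, 40, 60, 58, 23),
  Circumference = c(94.8, 85.0, 79.0, 90.5, 63.2)
)

# Logical index, as in the answer above
toyIdx <- with(toy, Age > 55 & Circumference > 80)
viaBrackets <- toy[toyIdx, ]

# Same selection with subset()
viaSubset <- subset(toy, Age > 55 & Circumference > 80)

# With no missing values, both approaches select the same rows
all.equal(viaBrackets, viaSubset)
rownames(viaSubset)
```

One difference to keep in mind: if the condition evaluates to NA for some rows, subset drops those rows, whereas indexing with a logical vector containing NA returns rows filled with NA.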
For those who are interested in playing more with data, have a look at http://r4ds.had.co.nz/transform.html