My Battle in Subsetting Data with "NA"s
R is a language and environment for statistical computing and graphics. I use it daily for my research. To use a metaphor that resembles the relationship between R (or any other languages for the same purpose) and the data, it’s like mining precious metals. You could get a piece of land full of gold, with the a series of exploration and strategic mining, treasure will be discovered (in this case, interesting findings).
Missing data has been a big issue in any type of data anlayisis. For me, the first lesson I learned in practice is that in R, NA
is not comparable. The NA
is simply meaning “nothing exists here”, when comparing NA>5
, R will give you NA
. However, R views NA
s the same, when comparing NA==NA
, R returns TRUE
. This simple yet attention-needing rule is very easy to forget, I want to give an example here for anyone who are interested.
To start with, I have a huge matrix with thousands of columns and rows. I need to reduce the matrix by certain criteria. The below matrix is a simplification of the problem.
> data<-matrix(c(4,3,NA,9,8,7,7,5,NA,NA,8,6),ncol=2,nrow=6,dimnames=list(NULL,c("apple","orange")))
> data<-as.data.frame(data)
> data
apple orange
1 4 7
2 3 5
3 NA NA
4 9 NA
5 8 8
6 7 6
I need to get the value with apple > 5
and orange > 4
by using
data1<-data[ data$apple>5 & !is.na(data$apple) & data$orange>4,]
ideally, this will give
>data1
apple orange
5 8 8
6 7 6
However, I got rows with some bizzare NA
s mixed with the information I need
> data1
apple orange
NA NA NA
5 8 8
6 7 6
It took me a while to realize why since I didn’t notice there were also NA
s in “orange”, with one seemingly complete column in a high volume of data it requires extra care when doing QC. Why the NA
appeared here is becuase NA
passed the same boolean apple > 5
(remeber NA
is not comparable) and showed as NA
instead of TRUE
or FALSE
, when it goes to the second boolean !is.na(data$apple)
it returns FALSE
. 1 NA
and 1 FALSE
returns a NA
(from the rule we know we can’t compare NA
to anything but NA
). When the last boolean is processed on orange, the same logic reappeared and left a NA
, that’s why we see NA
in the final result.
One way to get it right is to filter out all the NA
s from columns of interest and then subset, this way there will not be any NA
issue. There is also another way to solve this with the same logic I used before, which is
> data1<-data[ data$apple>5 & !is.na(data$orange) & data$orange>4,]
the reason why this will work is because NA
s in apple is a subset of NA
s in orange, by filtering out NA
s in the orange there’s no need to worry about NA
s in the apple.