My Battle in Subsetting Data with "NA"s

May 25, 2017

R is a language and environment for statistical computing and graphics. I use it daily for my research. To use a metaphor that resembles the relationship between R (or any other languages for the same purpose) and the data, it’s like mining precious metals. You could get a piece of land full of gold, with the a series of exploration and strategic mining, treasure will be discovered (in this case, interesting findings).

Missing data has been a big issue in any type of data anlayisis. For me, the first lesson I learned in practice is that in R, NA is not comparable. The NA is simply meaning “nothing exists here”, when comparing NA>5, R will give you NA. However, R views NAs the same, when comparing NA==NA, R returns TRUE. This simple yet attention-needing rule is very easy to forget, I want to give an example here for anyone who are interested.

To start with, I have a huge matrix with thousands of columns and rows. I need to reduce the matrix by certain criteria. The below matrix is a simplification of the problem.

> data<-matrix(c(4,3,NA,9,8,7,7,5,NA,NA,8,6),ncol=2,nrow=6,dimnames=list(NULL,c("apple","orange")))
> data<-as.data.frame(data)
> data
     apple orange
  1     4      7
  2     3      5
  3     NA    NA
  4     9      NA
  5     8      8
  6     7      6

I need to get the value with apple > 5 and orange > 4 by using

data1<-data[ data$apple>5 & !is.na(data$apple) & data$orange>4,]

ideally, this will give

>data1
      apple orange
  5      8      8
  6      7      6

However, I got rows with some bizzare NAs mixed with the information I need

> data1
     apple orange
  NA    NA     NA
  5      8      8
  6      7      6

It took me a while to realize why since I didn’t notice there were also NAs in “orange”, with one seemingly complete column in a high volume of data it requires extra care when doing QC. Why the NA appeared here is becuase NA passed the same boolean apple > 5 (remeber NA is not comparable) and showed as NA instead of TRUE or FALSE, when it goes to the second boolean !is.na(data$apple) it returns FALSE. 1 NA and 1 FALSE returns a NA (from the rule we know we can’t compare NA to anything but NA). When the last boolean is processed on orange, the same logic reappeared and left a NA, that’s why we see NA in the final result.

One way to get it right is to filter out all the NAs from columns of interest and then subset, this way there will not be any NA issue. There is also another way to solve this with the same logic I used before, which is

> data1<-data[ data$apple>5 & !is.na(data$orange) & data$orange>4,]

the reason why this will work is because NAs in apple is a subset of NAs in orange, by filtering out NAs in the orange there’s no need to worry about NAs in the apple.