# My Battle in Subsetting Data with "NA"s

R is a language and environment for statistical computing and graphics. I use it daily for my research. To use a metaphor that resembles the relationship between R (or any other languages for the same purpose) and the data, it’s like mining precious metals. You could get a piece of land full of gold, with the a series of exploration and strategic mining, treasure will be discovered (in this case, interesting findings).

Missing data has been a big issue in any type of data anlayisis. For me, the first lesson I learned in practice is that in R, `NA` is not comparable. The `NA` is simply meaning “nothing exists here”, when comparing `NA>5`, R will give you `NA`. However, R views `NA`s the same, when comparing `NA==NA`, R returns `TRUE`. This simple yet attention-needing rule is very easy to forget, I want to give an example here for anyone who are interested.

To start with, I have a huge matrix with thousands of columns and rows. I need to reduce the matrix by certain criteria. The below matrix is a simplification of the problem.

``````> data<-matrix(c(4,3,NA,9,8,7,7,5,NA,NA,8,6),ncol=2,nrow=6,dimnames=list(NULL,c("apple","orange")))
> data<-as.data.frame(data)
> data
apple orange
1     4      7
2     3      5
3     NA    NA
4     9      NA
5     8      8
6     7      6
``````

I need to get the value with `apple > 5` and `orange > 4` by using

``````data1<-data[ data\$apple>5 & !is.na(data\$apple) & data\$orange>4,]
``````

ideally, this will give

``````>data1
apple orange
5      8      8
6      7      6
``````

However, I got rows with some bizzare `NA`s mixed with the information I need

``````> data1
apple orange
NA    NA     NA
5      8      8
6      7      6

``````

It took me a while to realize why since I didn’t notice there were also `NA`s in “orange”, with one seemingly complete column in a high volume of data it requires extra care when doing QC. Why the `NA` appeared here is becuase `NA` passed the same boolean `apple > 5` (remeber `NA` is not comparable) and showed as `NA` instead of `TRUE` or `FALSE`, when it goes to the second boolean `!is.na(data\$apple)` it returns `FALSE`. 1 `NA` and 1 `FALSE` returns a `NA` (from the rule we know we can’t compare `NA` to anything but `NA`). When the last boolean is processed on orange, the same logic reappeared and left a `NA`, that’s why we see `NA` in the final result.

One way to get it right is to filter out all the `NA`s from columns of interest and then subset, this way there will not be any `NA` issue. There is also another way to solve this with the same logic I used before, which is

``````> data1<-data[ data\$apple>5 & !is.na(data\$orange) & data\$orange>4,]
``````

the reason why this will work is because `NA`s in apple is a subset of `NA`s in orange, by filtering out `NA`s in the orange there’s no need to worry about `NA`s in the apple.