Monday, October 13, 2014

The %in% operator in R

1. If you want a logical value for each of a few specific elements, telling you whether it appears in a longer vector:

Test whether the elements of a shorter vector are in a longer vector:
6:10 %in% 1:36
## [1] TRUE TRUE TRUE TRUE TRUE

Or, test which elements of a long vector are in a short vector:
1:36 %in% 6:10
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE

Or, with partially overlapping vectors:
6:10 %in% 1:8
## [1]  TRUE  TRUE  TRUE FALSE FALSE

The operator is especially useful with character vectors or factors:
c("d", "e") %in% c("a", "b", "c", "d")
## [1]  TRUE FALSE
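
As a minimal sketch of what is going on underneath: the R help page for match documents %in% as a thin wrapper around match(), and factors are compared by their labels. The names x, tbl, and f below are illustrative only and are not from the examples above.

```r
# Illustrative data (hypothetical names)
x   <- c("d", "e")
tbl <- c("a", "b", "c", "d")

# %in% is documented as match(x, table, nomatch = 0) > 0
via_match <- match(x, tbl, nomatch = 0) > 0
via_in    <- x %in% tbl
identical(via_match, via_in)   # TRUE

# Factors are converted to character before matching,
# so the test is against the factor's labels
f <- factor(tbl)
x %in% f                       # TRUE FALSE
```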

2. If you want to know the indexes of the specific elements inside a larger vector (notice the order of the two vectors):

In this case, we are asking which elements of (1:36 %in% 6:10) are TRUE:
which(1:36 %in% 6:10)
## [1]  6  7  8  9 10

The less useful case would be to ask which(6:10 %in% 1:36). Because the vector 6:10 is five elements long, and every element is found in 1:36, it just returns the positions 1 through 5:
which(6:10 %in% 1:36)
## [1] 1 2 3 4 5
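
Because 1:36 holds the same numbers as its own positions, the example above can hide the difference between a value and an index. Here is a small sketch with made-up data (long, short, mask, and idx are hypothetical names) where the two differ; it also shows that the logical vector from %in% can subset directly, without which():

```r
# Made-up data where values differ from their positions
long  <- seq(10, 100, by = 10)   # 10 20 30 ... 100
short <- c(30, 50, 70)

mask <- long %in% short          # logical vector, same length as `long`
idx  <- which(mask)              # integer positions where mask is TRUE
idx                              # 3 5 7

long[idx]                        # 30 50 70, selected by position
long[mask]                       # 30 50 70, selected by the logical mask
```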

As above, I think this is especially useful with characters or factors. So often in R, I have wanted to find the positions of several values in a larger vector. Naively, I have tried to use something like:
which(c("d", "e") == c("a", "b", "c", "d", "e", "a", "b", "c", "d", "e"))
## [1]  9 10

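This INCORRECTLY gives only the last two positions, because == recycles the shorter vector and compares element by element instead of testing membership. The %in% operator gives the correct answer: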
which(c("a", "b", "c", "d", "e", "a", "b", "c", "d", "e") %in% c("d", "e"))
## [1]  4  5  9 10
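
To see why == returns only 9 and 10, it helps to write out the recycling by hand. This is just a sketch of the behaviour; long, short, and recycled are names I am introducing for illustration:

```r
long  <- c("a", "b", "c", "d", "e", "a", "b", "c", "d", "e")
short <- c("d", "e")

# == silently repeats `short` until it is as long as `long`,
# then compares position by position
recycled <- rep(short, length.out = length(long))
recycled                                     # "d" "e" "d" "e" ... "d" "e"
identical(short == long, recycled == long)   # TRUE
which(recycled == long)                      # 9 10

# %in% instead asks, for each element of `long`, "is it in `short`?"
which(long %in% short)                       # 4 5 9 10
```

Positions 4 and 5 are missed by == only because the repeated "d", "e" pattern happens to be out of step with the "d", "e" in the longer vector there.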

But notice what happens when the two vectors are only partially-overlapping:
# partially overlapping case
which(c("a", "b", "c", "d") %in% c("d", "e"))
## [1] 4
In the partially overlapping case, you get only the 4th position, because “d” is the only element of the first vector that also appears in the second set; “e” is silently ignored. So be careful: this may or may not be what you want, depending on the scenario.
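
If you need to know that something you asked for was not found, one possible approach (a sketch, not the only way) is to use base R's match() or setdiff() alongside %in%. The names haystack and wanted are hypothetical:

```r
haystack <- c("a", "b", "c", "d")
wanted   <- c("d", "e")

# Position of each wanted value in `haystack` (first match only);
# NA marks a value that was not found
match(wanted, haystack)           # 4 NA

# Values that were asked for but are absent from `haystack`
setdiff(wanted, haystack)         # "e"
wanted[!(wanted %in% haystack)]   # "e"
```

Unlike which(haystack %in% wanted), match() returns one result per wanted value, in the order of wanted, so a missing value shows up as NA instead of disappearing.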

