Modes and missing values

library(moder)

What makes moder special is the way it handles missing values. It has two main guiding principles:

  1. Mode functions should return NA if and only if the data cannot answer the user’s question.
  2. Any deviation from point 1 should be the result of an explicit choice by the user.

Both points directly follow from R’s general approach towards missing values. This vignette explains them in some detail. Finally, it spells out the underlying theory. I recommend reading “Get started” first if you haven’t already.

1. Return NA only if question can’t be answered

General points

NA is not some other value that shows up in the data. It is a placeholder for any and all values that we don’t know, so we should treat it accordingly. This is why mean() and median() sometimes return NA:

x1 <- c(7, 7, 7, 8, 8, 9, 9, NA)
mean(x1)
#> [1] NA
median(x1)
#> [1] NA

We don’t know the true value behind NA, so we don’t know how it would influence the mean or the median. This is always true for the mean (but not for the median).1 What about the mode? Missing values will sometimes, but not always, render it unknown. Therefore, mode implementations require special care and nuance.

In x1, we know that 7 is a mode. It appears three times, and even if the NA was actually an 8 or a 9, that value would only appear three times, as well. This means we know one mode, but not necessarily the whole set of modes. mode_first() is happy to return 7, but mode_all() insists on the complete set of modes. It can’t be determined, so the function returns NA:

x1
#> [1]  7  7  7  8  8  9  9 NA
mode_first(x1)
#> [1] 7
mode_all(x1)
#> [1] NA

Some distributions have a clear set of modes even though some of its values are missing. Here, 1 is more frequent than 0 even if each NA masks another 0:

x2 <- c(1, 1, 1, 1, 1, 0, 0, NA, NA)
mode_first(x2)
#> [1] 1
mode_all(x2)
#> [1] 1

Where is mode_single() in all of this? It calls mode_all() internally, so it handles missing values just like this function does. Differences only occur if mode_all() returns multiple modes.

Possible (minimal and maximal) sets of modes

You might want to find all values that are known to be modes, even if missing values make it unclear whether certain other values are. Use mode_possible_min() for this. You can think of the function as a less strict version of mode_all(): It also takes all modes it can find, but unlike mode_all(), it doesn’t insist on the full set of modes.

Either 8 or 9 may be a mode, depending on NA, but 7 is a mode in any case:

x1
#> [1]  7  7  7  8  8  9  9 NA
mode_possible_min(x1)
#> [1] 7

Here, "a" is definitely a mode, "b" is only a mode if the missing value is also "b", but "c" can’t be a mode:

mode_possible_min(c("a", "a", "a", "b", "b", "c", NA))
#> [1] "a"

Other distributions may not have a clear minimum because too many of its values are missing. If both NAs below are FALSE, then FALSE is the mode. Otherwise, TRUE is:

mode_possible_min(c(TRUE, TRUE, FALSE, NA, NA))
#> [1] NA

The mirror image of mode_possible_min() is mode_possible_max(). It returns the greatest possible set of modes, given the number of missing values. This is all about the theoretical maximum, so the function can return values that are not guaranteed to be modes!

As above, "a" is a known mode, "b" may be a mode, but "c" can’t be one:

mode_possible_max(c("a", "a", "a", "b", "b", "c", NA))
#> [1] "a" "b"

There is no clear maximum in x1 because either 8 or 9 can be a mode, but not both together. With one more 7, though, 7 is more frequent anyways:

x1
#> [1]  7  7  7  8  8  9  9 NA
mode_possible_max(x1)
#> [1] NA
mode_possible_max(c(x1, 7))
#> [1] 7

2. Still return non-NA value only if chosen by user

Ignoring NAs

Each moder function that attempts to determine (actual) modes has an na.rm argument. This works exactly like in mean() and median(): It’s FALSE by default, but if the user sets it to TRUE, missing values are removed from x before the statistic is computed.

x1
#> [1]  7  7  7  8  8  9  9 NA
mean(x1, na.rm = TRUE)
#> [1] 7.857143
median(x1, na.rm = TRUE)
#> [1] 8
mode_all(x1, na.rm = TRUE)
#> [1] 7

You should think really carefully before removing NAs. Are you sure the missing values are unnecessary to answer your specific question?

This approach makes sure to inform the user if the data can’t answer their questions. They might still choose to ignore missings, but this will always require an explicit statement. In this way, users will know what they are doing, and can make their own decisions about data analysis.

The first known mode

mode_first() has a logical accept argument that is FALSE by default. If set to TRUE, the function will pick the first value known to be a mode. This means it won’t strictly look for the first-appearing value that is a mode, but accept the first value that is a mode without counting NAs in its favor:

x4 <- c(6, 4, 4, 4, NA, NA, 1)
mode_first(x4)
#> [1] NA
mode_first(x4, accept = TRUE)
#> [1] 4

The NAs might both be 6, and a known 6 appears at the very start, so this might be the first mode. It’s unclear, so we get NA by default. However, 4 is the first value that we know is a mode. Setting accept to TRUE makes the function pragmatically accept 4 — we want to find a mode, after all, and we know one.

Minimum and complete sets of modes

In mode_single() as well, accept is FALSE by default. The purpose is to check whether there is exactly one mode by using the complete set of modes, not the minimum set that just so happens to be known. Set accept to TRUE to avoid returning NA when one mode is known, but not all modes are:

x5 <- c(4, 4, 4, 7, 7, NA)
mode_single(x5)
#> [1] NA
mode_single(x5, accept = TRUE)
#> [1] 4

Theory

Taking NAs seriously enough

Some earlier mode functions for R treat NA like a known, constant, and distinct value. For example, they return 1 for c(1, 1, 1, 2, 2, NA, NA, NA). With one more NA, they return NA because now it is more frequent than 1. This approach is suboptimal because it doesn’t treat NA as missing. In both examples, 2 could very well be more frequent than 1 if enough NAs represent 2. Treating NA like any other value effectively assumes that all NAs represent the same value, and that it’s different from all known ones (although I’m sure this was not the intention).

Rather than forming a distinct and cohesive group, the NAs may represent any number of known or unknown values, so this procedure can be misleading. Consider that NA == NA returns NA: If any two values are unknown, it’s also unknown whether they are equal (Wickham 2019, ch. 3.2.3).

What’s more, categorizing missing values as a separate group is actually an imputation strategy — and perhaps not a very wise one. I think imputation by default is not the job of operations that are supposed to simply determine a statistic. All moder functions treat NAs as genuinely missing values, just like mean(), median(), and language primitives such as == do. Sometimes, this means the functions can just ignore NAs because a known value is more frequent than the next-most-frequent value and all NAs taken together. They will only return NA if the data cannot possibly answer the user’s question.

All choices about imputation will then be left to the user. In this way, moder functions draw a clear line between estimation and imputation.

Not taking NAs too seriously

Certain other mode implementations return NA whenever the input vector contains any NAs at all. This is not necessary — in fact, it’s overly conservative. We have seen above that a distribution may have a clear set of modes even if some of its values are missing. This is a marked difference to mean(), where any missings really do make estimation impossible.

It all depends on the relation between counts: Is the most frequent known value at least as frequent as the sum of the counts of the second-most-frequent known value and all missing values? If so, this value is known to be the mode. Note that mode_first(), mode_all() and mode_single() don’t allow for ties among known values if any other values are missing (they return NA). That is because missings may secretly count for any “tied” known value, which would break the tie.

In a vector like c(1, 1, NA), the mode is clearly 1, no matter what the NA stands for. It is just as clear as in c(1, 1, 2). A mode function that returns NA instead of 1 would be incorrect for c(1, 1, 2), so it would be just as incorrect for c(1, 1, NA).

Again, think of the parallels to base R. NA ^ 2 returns NA, but NA ^ 0 returns 1. In the second case, we don’t need to know the true value behind NA to determine the result. Such knowledge is built into R (Wickham 2019, ch. 3.2.3), and I think function design should emulate this pattern to keep the semantics of NA consistent across contexts.

It’s not just R that works this way: SQL, Python (pandas) and, to a large degree, Julia2 implement the same system of logic, called Kleene logic.

NA return types

If a function that is supposed to return one or more modes returns NA instead, this NA has the same type as the input vector, x. Functions that attempt to count modes or to determine their frequency only ever return these integers or NA_integer_, i.e., a missing value of type integer. However, the differences between NA types are not very important (and perhaps not widely known). They can usually be disregarded.

References

Wickham, H. (2019). Advanced R (Second Edition), CRC Press/Taylor and Francis Group. https://adv-r.hadley.nz/index.html.


  1. The median has some of the same complexities as the mode. In a vector like c(1, 1, NA) or even c(1, 1, 1, 1, 2, 2, NA), the median is known to be 1, regardless of the value behind NA. This raises some questions about the way stats::median.default() handles missing values. See the naidem package for a solution to this issue.↩︎

  2. In Julia, missing^0 returns missing instead of 1. This doesn’t correspond to Kleene logic, but nor does R’s median() function. See footnote 1.↩︎