
Find all duplicated values in R

In the past few weeks I had some free time, with only a few tasks to do, so I tried to optimize some of my functions.
For example, I needed a function that tells me which rows of a data.frame are duplicated. duplicated() is a really fast tool for this, but I needed something more: I had to know not only the repeating instances of the data, but the first one too. This is useful if you have a database loaded as a data.frame, and you want to know where the items identified by a key differ.
The basic idea is that you search for the repeating instances of the keys, then compare them to the whole original df. It can be pretty slow: after getting the indices of the duplicates, you have to select the given rows and columns of the df, but the worst part is the comparison. You cannot just use %in%, because it does not work on rows. After I realized this, I looked inside duplicated() to see what is in it:

> duplicated
function (x, incomparables = FALSE, ...)
UseMethod("duplicated")

Well, it did not help too much, but I learned that this is a generic function, which means that if I call it, it will check its argument's class and try to call the appropriate method for it, in this case duplicated.data.frame(). So, I looked into that one:

> duplicated.data.frame
function (x, incomparables = FALSE, fromLast = FALSE, ...)
{
    if (!identical(incomparables, FALSE))
        .NotYetUsed("incomparables != FALSE")
    if (length(x) != 1L)
        duplicated(do.call("paste", c(x, sep = "\r")), fromLast = fromLast)
    else duplicated(x[[1L]], fromLast = fromLast, ...)
}

That's the one. In the first if, it calls .NotYetUsed(), which was unknown to me, but one can see that it does nothing but raise an error (or a warning) with a message (I don't know yet why and how this works here):

> .NotYetUsed
function (arg, error = TRUE)
{
    msg <- gettextf("argument '%s' is not used (yet)", arg)
    if (error)
        stop(msg, domain = NA, call. = FALSE)
    else warning(msg, domain = NA, call. = FALSE)
}
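So if I pass a non-default incomparables value to duplicated() on a data.frame, the first if fires and the call stops with exactly that message (a small made-up example):

```r
# A non-default 'incomparables' triggers .NotYetUsed(), which stops with
# an error, since error = TRUE is its default
duplicated(data.frame(a = c(1, 1, 2)), incomparables = NA)
# Error: argument 'incomparables != FALSE' is not used (yet)
```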

Since the first if clause did not give me much of a hint, the second became my true guide. It says that if the number of columns of the data.frame (because length.data.frame gives us that number) is not one (1 = 1L; for more information check this), paste the columns together and search for duplicates on that, as the result will be a character vector, so duplicated.character will work on it. If the data.frame consists of only one column, then extract it with "[["; that gives us a vector too, and the appropriate duplicated() method can be used.
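To make the two branches concrete, here is a small sketch with a made-up data.frame:

```r
df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))

# More than one column: paste the rows into one character vector,
# so duplicated.character can handle it
pasted <- do.call("paste", c(df, sep = "\r"))
pasted
# "1\rx" "1\rx" "2\ry"
duplicated(pasted)
# FALSE  TRUE FALSE

# A single column: "[[" extracts it as a plain vector
duplicated(df[[1L]])
# FALSE  TRUE FALSE
```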

Note the trick at the pasting step: it uses do.call(). I couldn't decode it fully, but somewhere in the StackOverflow - R help - R-bloggers triangle I read that it creates one function call from its arguments, while the *apply family creates a function call for each item of its arguments. Somewhere here is the witchery. After collecting all this information, I decided to follow the example of duplicated.data.frame(): search for the duplicated values in the pasted versions, then look up the duplicated ones in the pasted vector with the %in% operator (a.k.a. the match() function). Here is the code, which can be found on GitHub:

allDuplicated <- function(x){
    # This function returns TRUE for the first occurrence, too.
    # Paste with a separator, as duplicated.data.frame() does, so that
    # e.g. c("ab", "c") and c("a", "bc") do not collapse into the same key.
    x.pasted <- do.call("paste", c(x, sep = "\r"))
    d <- x.pasted[duplicated(x.pasted)]
    return(x.pasted %in% d)
}
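A quick check with a made-up data.frame shows the difference from plain duplicated():

```r
df <- data.frame(key = c("a", "b", "a", "c"), val = c(1, 2, 1, 3))

duplicated(df)     # FALSE FALSE  TRUE FALSE -- only the later repeats
allDuplicated(df)  #  TRUE FALSE  TRUE FALSE -- the first occurrence too
```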

As usual, the code is available on GitHub.
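As for the do.call() trick: a small made-up sketch of the difference between building one call and calling once per element:

```r
cols <- list(a = c(1, 2), b = c("x", "y"))

# do.call() builds a single call: paste(c(1, 2), c("x", "y"), sep = "-"),
# which runs element-wise across the list items, i.e. row-wise
do.call("paste", c(cols, sep = "-"))
# "1-x" "2-y"

# sapply() calls paste() once per list element instead, i.e. column-wise
sapply(cols, paste, collapse = "-")
#     a     b
# "1-2" "x-y"
```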

I also did a little benchmarking: I used a data.frame with 2 columns and 25492 rows, which had only 81 unique values. Repeating the duplicated() call 100 times with a for() loop took 8.03 sec, while my allDuplicated() needed 8.83 sec, so it was about a tenth slower. I also measured a 1000-times repetition: duplicated() needed 83.19 sec overall, while allDuplicated() ran in only 81.92 sec. I don't know why; maybe because I looked at a notepad window while duplicated() worked...

I tried it without disturbance, on a generated non-repeating data set of the same dimensions. This time allDuplicated() needed less time to run 1000 times, only 103.63 sec compared to the 104.36 sec of duplicated(). Explanations are welcome!
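For reference, such a measurement can be reproduced along these lines with system.time(); the data set here is made up, not the 25492-row one from the post:

```r
set.seed(1)
# A made-up data set with many repeated rows
df <- data.frame(a = sample(9, 25000, replace = TRUE),
                 b = sample(9, 25000, replace = TRUE))

# Wall-clock time of 100 repetitions of each function
system.time(for (i in 1:100) duplicated(df))
system.time(for (i in 1:100) allDuplicated(df))
```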
