performance - Fastest way to filter a data.frame list column contents in R / Rcpp

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have a data.frame:

df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b", 
"c"))), .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame")

with a list column (each with a character vector):

> str(df)
'data.frame':   3 obs. of  2 variables:
 $ id  : int  1 2 3
 $ vars:List of 3
  ..$ : chr "a"
  ..$ : chr  "a" "b" "c"
  ..$ : chr  "b" "c"

I want to filter the contents of the list column with setdiff(vars, remove_this), and then drop the rows whose vars end up empty:

library(dplyr)
library(tidyr)
res <- df %>% mutate(vars = lapply(df$vars, setdiff, "a"))

which gets me this:

> res
  id vars
1  1     
2  2 b, c
3  3 b, c

But to drop the rows where vars is character(0), I have to do something like:

res %>% unnest(vars) # and then do the equivalent of nest(vars) again after...

Actual datasets:

560K rows and 3800K rows that also have 10 more columns (to carry along).

(this is quite slow, which leads to question...)

What is the Fastest way to do this in `R`?

Is there a faster method using dplyr, data.table, or something else?
How to do this with Rcpp?

UPDATE/EXTENSION:

Can the column modification be done in place rather than by copying the lapply(vars, setdiff, ...) result?
What's the most efficient way to filter out rows where vars is character(0), if it must be a separate step?


1 Answer

深蓝 · Answer 1 · 2021-10-23T19:22:57+0000

Setting aside any algorithmic improvements, the analogous data.table solution should already be faster, because := adds the column by reference and you don't have to copy the whole table just to add a column:

library(data.table)
dt = as.data.table(df)  # or use setDT to convert in place

dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
#   id  vars newcol
#1:  2 a,b,c    b,c
#2:  3   b,c    b,c

You can also delete the original column (at basically zero cost) by adding [, vars := NULL] at the end. Or, if you don't need the original values, you can simply overwrite the initial column, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
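
Putting the overwrite and the filter together, here is a minimal sketch on the toy data from the question. It swaps sapply(newcol, length) for base R's lengths(), which should do the same job with less overhead; I haven't benchmarked it on the real data, and the name res is only for illustration.

```r
library(data.table)

# Toy data from the question, built directly as a data.table
dt <- data.table(id = 1:3,
                 vars = list("a", c("a", "b", "c"), c("b", "c")))

# Overwrite the list column by reference (no copy of the other columns)
dt[, vars := lapply(vars, setdiff, "a")]

# Keep only the rows whose vars is not character(0);
# lengths() returns the length of each list element
res <- dt[lengths(vars) > 0L]
```

The subsetting step does allocate a new table holding the surviving rows, so only the column update is truly in place.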

Now as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:

dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
#   id vars
#1:  2  b,c
#2:  3  b,c
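
To make the one-liner above easier to follow, here is the same chain split into its three steps on the toy data. The intermediate names long and kept are just for illustration.

```r
library(data.table)

dt <- data.table(id = 1:3,
                 vars = list("a", c("a", "b", "c"), c("b", "c")))

# Step 1: flatten to one row per (id, element); the unnamed result is called V1
long <- dt[, unlist(vars), by = id]

# Step 2: drop the unwanted value with one vectorised test over the whole column
kept <- long[!V1 %in% "a"]

# Step 3: collapse back to one list element per id; ids left with no rows
# simply do not appear, which is what takes care of the filtering
res <- kept[, .(vars = list(V1)), by = id]
```

One difference to keep in mind: setdiff also removes duplicated values within each vector, while this version keeps them, so the two approaches only agree when each vars element has no repeats.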

To carry along the other columns, I think it's easiest to simply merge back:

dt[, othercol := 5:7]

# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
#   id vars i.vars othercol
#1:  2  b,c  a,b,c        6
#2:  3  b,c    b,c        7
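
If you'd rather not end up with both vars and i.vars, one possible variant is to drop the old list column before joining and to name the join column explicitly with on =. This is only a sketch under the same assumption that id is unique; the names filtered and others are illustrative.

```r
library(data.table)

dt <- data.table(id = 1:3,
                 vars = list("a", c("a", "b", "c"), c("b", "c")),
                 othercol = 5:7)

# Filtered list column, one row per surviving id
filtered <- dt[, unlist(vars), by = id][!V1 %in% "a", .(vars = list(V1)), by = id]

# Every column except the original list column
others <- dt[, setdiff(names(dt), "vars"), with = FALSE]

# For each row of filtered, look up the matching row of others by id
res <- others[filtered, on = "id"]
```

Here res holds id, othercol and the filtered vars for ids 2 and 3 only.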
