Around data.table

This note is for those who use the tabular data processing library for R, data.table, and who may enjoy seeing the flexibility of its application through various examples.

Inspired by a colleague's good example, and hoping that you have already read his article, I suggest digging deeper into code optimization and performance with data.table.

Introduction: where does data.table come from?


It is best to start getting acquainted with the library from a little distance: namely, from the data structures from which a data.table object (hereinafter, DT) can be obtained.

Array


The code
## arrays ---------

arrmatr <- array(1:20, c(4,5))

class(arrmatr)

typeof(arrmatr)

is.array(arrmatr)

is.matrix(arrmatr)



One of these structures is the array (?base::array). As in other languages, arrays here are multidimensional. Interestingly, a two-dimensional array inherits the properties of the matrix class (?base::matrix), while, importantly, a one-dimensional array does not inherit from the vector class (?base::vector).

The type of the data contained in an object should be checked with the function base::typeof, which returns the internal type description according to R Internals, the low-level convention inherited from the underlying C implementation.

Another command for determining an object's class, base::class, returns for vectors the vector's type (its name differs from the internal one, but it also lets you understand the data type).
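A small illustration of the difference between the two commands (the objects here are illustrative):

```r
# typeof() reports the internal storage mode, class() the S3 class
v  <- 1:3                    # an integer vector
a1 <- array(1:3, 3)          # a one-dimensional array
a2 <- array(1:20, c(4, 5))   # a two-dimensional array

typeof(v)      # "integer" - the internal type is the same for all three
typeof(a2)     # "integer"

class(v)       # "integer"
class(a1)      # "array" - not a vector class
class(a2)      # in recent R versions: c("matrix", "array")

is.vector(a1)  # FALSE - a one-dimensional array is not a vector
is.matrix(a2)  # TRUE  - a two-dimensional array is a matrix
```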

List


From a two-dimensional array, that is, a matrix, you can move on to a list (?base::list).

The code
## lists ------------------

mylist <- as.list(arrmatr)

is.vector(mylist)

is.list(mylist)


In this case, several things happen at once:

  • The second dimension of the matrix collapses, so we get an object that is both a list and a vector.
  • The list thus inherits from both of these classes. Keep in mind that each list element corresponds to a single (scalar) value from a cell of the matrix.

Because a list is also a vector, some functions designed for vectors can be applied to it.
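A small sketch of vector-oriented functions applied to a list:

```r
mylist <- as.list(array(1:20, c(4, 5)))

is.vector(mylist)   # TRUE - a plain list is also a vector
length(mylist)      # 20, one element per matrix cell
mylist[2:4]         # `[` subsetting returns a shorter list
rev(mylist)[[1]]    # 20 - rev() works on lists too
```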

Dataframe


From a list, matrix or vector, you can move on to a data frame (?base::data.frame).

The code
## data.frames ------------

df <- as.data.frame(arrmatr)
df2 <- as.data.frame(mylist)

is.list(df)

df$V6 <- df$V1 + df$V2


What is interesting here: the data frame inherits from the list! The columns of the data frame are the cells of a list. This will matter later, when we use functions that apply to lists.
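Because each column is a list element, list tools such as lapply traverse a data frame column by column; a small sketch:

```r
df <- as.data.frame(array(1:20, c(4, 5)))

# lapply sees the data frame as a list of column vectors
lapply(df, sum)               # per-column sums, returned as a list
vapply(df, mean, numeric(1))  # the same idea, simplified to a numeric vector

# a column really is a list cell
identical(df[[2]], df$V2)     # TRUE
```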

data.table


You can get a DT (?data.table::data.table) from a data frame, list, vector or matrix. For example, like this (in place, without copying).

The code
## data.tables -----------------------
library(data.table)

data.table::setDT(df)

is.list(df)

is.data.frame(df)

is.data.table(df)


Usefully, like the data frame, DT inherits the properties of a list.

DT and memory


Unlike all other objects in base R, DT objects are passed by reference. If you need a copy in a new memory area, you need the data.table::copy function, or you need to make a selection from the old object.

The code
df2 <- df

df[V1 == 1, V2 := 999]

data.table::fsetdiff(df, df2)

df2 <- data.table::copy(df)

df[V1 == 2, V2 := 999]

data.table::fsetdiff(df, df2)
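The reference semantics can also be checked directly with data.table::address(), which returns an object's memory address (a small sketch):

```r
library(data.table)

dt1 <- data.table(x = 1:3)
dt2 <- dt1         # plain assignment: no copy, same object
dt3 <- copy(dt1)   # deep copy into a new memory area

address(dt1) == address(dt2)  # TRUE  - same memory
address(dt1) == address(dt3)  # FALSE - copy() allocated new memory

dt2[, x := x * 10]  # modifying dt2 by reference...
dt1$x               # ...changes dt1 as well: 10 20 30
```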


This concludes the introduction. DT is a continuation of the development of data structures in R, mainly through extending and accelerating the operations performed on objects of the data frame class, while inheritance from the other primitives is preserved.

Some examples of using data.table properties


Like a list ...


Iterating over the rows of a data frame or DT is not a good idea, since loop code in R is much slower than C, whereas walking over the columns, of which there are usually far fewer, is quite acceptable. When walking over the columns, remember that each column is a list element, which usually contains a vector, and operations on vectors are well vectorized in the base functions of the language. You can also use the selection operators specific to lists and vectors: `[[`, `$`.

The code
## operations on data.tables ------------

#using list properties

df$'V1'[1]

df[['V1']]

df[[1]][1]

sapply(df, class)

sapply(df, function(x) sum(is.na(x)))


Vectorization


If you do need to walk through the rows of a large DT, writing a vectorized function is the best solution. But if that does not work out, remember that a loop expressed inside the DT is still faster than a plain R loop, because it runs in C.

Let's try a larger example with 100K rows. We will extract the first letter from the words contained in the column vector w.


The code
library(magrittr)
library(microbenchmark)

## Bigger example ----

rown <- 100000

dt <- 
	data.table(
		w = sapply(seq_len(rown), function(x) paste(sample(letters, 3, replace = T), collapse = ' '))
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

# vectorization

microbenchmark({
	dt[
		, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
		, by = 1:nrow(dt)
	   ]
})

# second

first_l_f <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		do.call(rbind, .) %>%
		`[`(,1)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f(w))
		]
})

# third

first_l_f2 <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		unlist %>%
		matrix(nrow = 3) %>%
		`[`(1,)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f2(w))
		]
})


The first run iterates over the rows:

Unit: milliseconds
expr min
{dt[, `:=` (first_l, unlist(strsplit(w, split = " ", fixed = T))[1]), by = 1:nrow(dt)]} 439.6217
lq mean median uq max neval
451.9998 460.1593 456.2505 460.9147 621.4042 100

The second run, where vectorization is achieved by turning the list into a matrix and taking the elements in the slice with index 1 (the latter being the actual vectorization). Let me correct myself: the vectorization happens at the level of the strsplit function, which can take a vector as input. It turns out that turning the list into a matrix costs far more than the vectorization itself, but even so this version is much faster than the non-vectorized one.

Unit: milliseconds
expr min lq mean median uq max neval
{dt[, `:=` (first_l, .(first_l_f(w)))]} 93.07916 112.1381 161.9267 149.6863 185.9893 442.5199 100

The median speedup is about 3 times.

The third run, where the scheme of conversion into a matrix is changed.
Unit: milliseconds
expr min lq mean median uq max neval
{dt[, `:=` (first_l, .(first_l_f2(w)))]} 32.60481 34.13679 40.4544 35.57115 42.11975 222.972 100

The median speedup is about 13 times.

You have to experiment with this: the more, the better.

Another vectorization example, again with text but closer to real conditions: words of different lengths, and a different number of words per row. We need to get the first 3 words. Like this:



The previous function does not work here, since the vectors have different lengths while we fixed the size of the matrix. Let's redo it, after some digging around the Internet.

The code
# fourth

rown <- 100000

words <-
	sapply(
		seq_len(rown)
		, function(x){
			nwords <- rbinom(1, 10, 0.5)
			paste(
				sapply(
					seq_len(nwords)
					, function(x){
						paste(sample(letters, rbinom(1, 10, 0.5), replace = T), collapse = '')
					}
				)
				, collapse = ' '
			)
		}
	)

dt <- 
	data.table(
		w = words
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

first_l_f3 <- function(sd, n)
{
	l <- strsplit(sd, split = ' ', fixed = T)
	
	maxl <- max(lengths(l))
	
	sapply(l, "length<-", maxl) %>%
		`[`(n,) %>%
		as.character
}

microbenchmark({
	dt[
		, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
		]
})

dt[
	, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
	]



Unit: milliseconds
expr min lq mean median
{dt[, `:=` ((paste0("w_", 1:3)), strsplit(w, split = " ", fixed = T))]} 851.7623 916.071 1054.5 1035.199
uq max neval
1178.738 1356.816 100


The script ran in about 1 second on average. Not bad.

Another, more economical way was found by kablag:

The code
# fifth

rown <- 100000

words <-
	sapply(
		seq_len(rown)
		, function(x){
			nwords <- rbinom(1, 10, 0.5)
			paste(
				sapply(
					seq_len(nwords)
					, function(x){
						paste(sample(letters, rbinom(1, 10, 0.5), replace = T), collapse = '')
					}
				)
				, collapse = ' '
			)
		}
	)

dt <- 
	data.table(
		w = words
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

microbenchmark({
	
	w_split <- dt[
		, data.table::tstrsplit(w, split = ' ', fixed = T, keep = 1L:3L)
		]
	
	dt[
		, `:=` (
			w_1 = as.character(w_split[[1]])
			, w_2 = as.character(w_split[[2]])
			, w_3 = as.character(w_split[[3]])
		)
		]

})



The median is 186 ms, about 5 times cheaper...

Connected by one chain ...


You can work with DT objects using chaining. It looks like appending bracket syntax to the right; in essence, it is syntactic sugar.

The code
# chaining

res1 <- dt[a == 'a'][sample(.N, 100)]

res2 <- dt[, .N, a][, N]

res3 <- dt[, coefficients(lm(e ~ d))[1], a][, .(letter = a, coef = V1)]



Flowing through pipes ...


The same operations can be done through piping. It looks similar but is functionally richer, since you can use any methods, not just DT ones. Let's derive logistic regression coefficients for our synthetic data, with a number of filters applied to the DT.

The code
# piping

samplpe_b <- dt[a %in% head(letters), sample(b, 1)]

res4 <- 
	dt %>%
	.[a %in% head(letters)] %>%
	.[, 
	  {
	  	dt0 <- .SD[1:100]
	  	
	  	quants <- 
	  		dt0[, c] %>%
	  		quantile(seq(0.1, 1, 0.1), na.rm = T)
	  	
	  	.(q = quants)
	  }
	  , .(cond = b > samplpe_b)
	  ] %>%
	glm(
		cond ~ q -1
		, family = binomial(link = "logit")
		, data = .
	) %>%
	summary %>%
	.[[12]]


Statistics, machine learning, etc. inside DT


You can use lambda functions, but sometimes it is better to define them separately, building the whole data analysis pipeline out of them, and then let them work inside the DT. The example below is enriched with all the features covered above, plus a few useful things from the DT arsenal, such as accessing the DT itself by reference from inside the DT.

The code
# function

rm(lm_preds)

lm_preds <- function(
	sd, by, n
)
{
	
	if(
		n < 100 | 
		!by[['a']] %in% head(letters, 4)
	   )
	{
		
		res <-
			list(
				low = NA
				, mean = NA
				, high = NA
				, coefs = NA
			)
		
	} else {

		lmm <- 
			lm(
				d ~ c + b
				, data = sd
			)
		
		preds <- 
			stats::predict.lm(
				lmm
				, sd
				, interval = "prediction"
				)
		
		res <-
			list(
				low = preds[, 2]
				, mean = preds[, 1]
				, high = preds[, 3]
				, coefs = coefficients(lmm)
			)
	}

	res
	
}

res5 <- 
	dt %>%
	.[e < 0] %>%
	.[.[, .I[b > 0]]] %>%
	.[, `:=` (
		low = as.numeric(lm_preds(.SD, .BY, .N)[[1]])
		, mean = as.numeric(lm_preds(.SD, .BY, .N)[[2]])
		, high = as.numeric(lm_preds(.SD, .BY, .N)[[3]])
		, coef_c = as.numeric(lm_preds(.SD, .BY, .N)[[4]][1])
		, coef_b = as.numeric(lm_preds(.SD, .BY, .N)[[4]][2])
		, coef_int = as.numeric(lm_preds(.SD, .BY, .N)[[4]][3])
	)
	, a
	] %>%
	.[!is.na(mean), -'e', with = F]


# plot

plo <- 
	res5 %>%
	ggplot +
	facet_wrap(~ a) +
	geom_ribbon(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, ymin = low
			, ymax = high
			, fill = a
		)
		, size = 0.1
		, alpha = 0.1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = mean
			, color = a
		)
		, size = 1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = d
		)
		, size = 1
		, color = 'black'
	) +
	theme_minimal()

print(plo)



Conclusion


I hope that I was able to paint a coherent, though of course not exhaustive, picture of the data.table object, starting from its properties related to inheritance from R classes and ending with its own tricks and its surroundings of tidyverse elements. I hope this helps you learn and use this library better, for work and for fun.



Thanks!

Full code


The code
## load libs ----------------

library(data.table)
library(ggplot2)
library(magrittr)
library(microbenchmark)


## arrays ---------

arrmatr <- array(1:20, c(4,5))

class(arrmatr)

typeof(arrmatr)

is.array(arrmatr)

is.matrix(arrmatr)


## lists ------------------

mylist <- as.list(arrmatr)

is.vector(mylist)

is.list(mylist)


## data.frames ------------

df <- as.data.frame(arrmatr)

is.list(df)

df$V6 <- df$V1 + df$V2


## data.tables -----------------------

data.table::setDT(df)

is.list(df)

is.data.frame(df)

is.data.table(df)

df2 <- df

df[V1 == 1, V2 := 999]

data.table::fsetdiff(df, df2)

df2 <- data.table::copy(df)

df[V1 == 2, V2 := 999]

data.table::fsetdiff(df, df2)


## operations on data.tables ------------

#using list properties

df$'V1'[1]

df[['V1']]

df[[1]][1]

sapply(df, class)

sapply(df, function(x) sum(is.na(x)))


## Bigger example ----

rown <- 100000

dt <- 
	data.table(
		w = sapply(seq_len(rown), function(x) paste(sample(letters, 3, replace = T), collapse = ' '))
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

# vectorization

# zero - for loop

microbenchmark({
	for(i in 1:nrow(dt))
		{
		dt[
			i
			, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
		]
	}
})

# first

microbenchmark({
	dt[
		, first_l := unlist(strsplit(w, split = ' ', fixed = T))[1]
		, by = 1:nrow(dt)
	   ]
})

# second

first_l_f <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		do.call(rbind, .) %>%
		`[`(,1)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f(w))
		]
})

# third

first_l_f2 <- function(sd)
{
	strsplit(sd, split = ' ', fixed = T) %>%
		unlist %>%
		matrix(nrow = 3) %>%
		`[`(1,)
}

dt[, first_l := NULL]

microbenchmark({
	dt[
		, first_l := .(first_l_f2(w))
		]
})

# fourth

rown <- 100000

words <-
	sapply(
		seq_len(rown)
		, function(x){
			nwords <- rbinom(1, 10, 0.5)
			paste(
				sapply(
					seq_len(nwords)
					, function(x){
						paste(sample(letters, rbinom(1, 10, 0.5), replace = T), collapse = '')
					}
				)
				, collapse = ' '
			)
		}
	)

dt <- 
	data.table(
		w = words
		, a = sample(letters, rown, replace = T)
		, b = runif(rown, -3, 3)
		, c = runif(rown, -3, 3)
		, e = rnorm(rown)
	) %>%
	.[, d := 1 + b + c + rnorm(nrow(.))]

first_l_f3 <- function(sd, n)
{
	l <- strsplit(sd, split = ' ', fixed = T)
	
	maxl <- max(lengths(l))
	
	sapply(l, "length<-", maxl) %>%
		`[`(n,) %>%
		as.character
}

microbenchmark({
	dt[
		, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
		]
})

dt[
	, (paste0('w_', 1:3)) := lapply(1:3, function(x) first_l_f3(w, x))
	]


# chaining

res1 <- dt[a == 'a'][sample(.N, 100)]

res2 <- dt[, .N, a][, N]

res3 <- dt[, coefficients(lm(e ~ d))[1], a][, .(letter = a, coef = V1)]

# piping

samplpe_b <- dt[a %in% head(letters), sample(b, 1)]

res4 <- 
	dt %>%
	.[a %in% head(letters)] %>%
	.[, 
	  {
	  	dt0 <- .SD[1:100]
	  	
	  	quants <- 
	  		dt0[, c] %>%
	  		quantile(seq(0.1, 1, 0.1), na.rm = T)
	  	
	  	.(q = quants)
	  }
	  , .(cond = b > samplpe_b)
	  ] %>%
	glm(
		cond ~ q -1
		, family = binomial(link = "logit")
		, data = .
	) %>%
	summary %>%
	.[[12]]


# function

rm(lm_preds)

lm_preds <- function(
	sd, by, n
)
{
	
	if(
		n < 100 | 
		!by[['a']] %in% head(letters, 4)
	   )
	{
		
		res <-
			list(
				low = NA
				, mean = NA
				, high = NA
				, coefs = NA
			)
		
	} else {

		lmm <- 
			lm(
				d ~ c + b
				, data = sd
			)
		
		preds <- 
			stats::predict.lm(
				lmm
				, sd
				, interval = "prediction"
				)
		
		res <-
			list(
				low = preds[, 2]
				, mean = preds[, 1]
				, high = preds[, 3]
				, coefs = coefficients(lmm)
			)
	}

	res
	
}

res5 <- 
	dt %>%
	.[e < 0] %>%
	.[.[, .I[b > 0]]] %>%
	.[, `:=` (
		low = as.numeric(lm_preds(.SD, .BY, .N)[[1]])
		, mean = as.numeric(lm_preds(.SD, .BY, .N)[[2]])
		, high = as.numeric(lm_preds(.SD, .BY, .N)[[3]])
		, coef_c = as.numeric(lm_preds(.SD, .BY, .N)[[4]][1])
		, coef_b = as.numeric(lm_preds(.SD, .BY, .N)[[4]][2])
		, coef_int = as.numeric(lm_preds(.SD, .BY, .N)[[4]][3])
	)
	, a
	] %>%
	.[!is.na(mean), -'e', with = F]


# plot

plo <- 
	res5 %>%
	ggplot +
	facet_wrap(~ a) +
	geom_ribbon(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, ymin = low
			, ymax = high
			, fill = a
		)
		, size = 0.1
		, alpha = 0.1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = mean
			, color = a
		)
		, size = 1
	) +
	geom_point(
		aes(
			x = c * coef_c + b * coef_b + coef_int
			, y = d
		)
		, size = 1
		, color = 'black'
	) +
	theme_minimal()

print(plo)

