DataFest: Large Data in R

2018-03-29

Data

660 MB file: 2008.csv

Package Libraries

  • reading in the data

    • readr (tidyverse; faster than base read.csv)
    • data.table (a clear speed advantage when reading files larger than ~250 MB)
  • data manipulation and plotting

    • dplyr
    • ggplot2
  • pryr (inspecting memory usage)

  • benchmarking

    • microbenchmark
    • rbenchmark
    • profvis
library(tidyverse)    # readr, dplyr, ggplot2, ...
library(data.table)

# How big are the files we are about to read?
file.info(list.files("data", full.names = TRUE))

Load Data

Comparing read times (seconds, from system.time()):

Function            user   system   elapsed
readr::read_csv    94.01    84.73    181.98
data.table::fread   6.78     0.69      2.72

read_csv

Standard read with read_csv:

air <- read_csv("data/2008.csv")

data.table

Much faster with fread()!

The data.table package is useful for importing data larger than 1-2 GB. It’s often a good idea to convert the imported data.table to a tibble: my_tibble <- as_tibble(my_imported_data.table)

system.time({flights = fread("data/2008.csv",
                             showProgress = FALSE)})
##    user  system elapsed 
##    6.81    1.07    4.58
# class(flights)  # "data.table" "data.frame"

flights2 <- as_tibble(flights)
system.time({flights3 <- fread("data/2008.csv", showProgress = FALSE)})
##    user  system elapsed 
##    6.78    0.69    2.72
system.time({air2 <- read_csv("data/2008.csv")})
##    user  system elapsed 
##   94.01   84.73  181.98

Preserve the data table as a binary file

.Rdata

Loading a saved .Rdata file does not appear to be faster than re-reading with fread():

save(flights, file="data/flights.Rdata")
system.time(load(file="data/flights.Rdata"))
##    user  system elapsed 
##   14.85    0.25   15.81
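
An alternative worth knowing is base R's saveRDS()/readRDS() pair, which serializes a single object; readRDS() returns the object, so you choose the name on load. A minimal sketch (the .rds path is illustrative; timings were not measured here):

# Serialize one object to a binary .rds file
saveRDS(flights, file = "data/flights.rds")
# readRDS() returns the object rather than restoring its original name
flights_rds <- readRDS("data/flights.rds")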

Practical Advice

Use a database when the data are much larger than 1-2 GB.

Look at the database tools in the tidyverse: library(dbplyr), https://dbplyr.tidyverse.org/. A minimal sketch follows.
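
A minimal sketch, assuming the RSQLite package is installed; the database path and summary query are illustrative (dbplyr itself only needs to be installed, not attached):

library(DBI)
library(dplyr)

# Copy the flights table into an on-disk SQLite database
con <- dbConnect(RSQLite::SQLite(), "data/flights.sqlite")
dbWriteTable(con, "flights", flights)

# tbl() creates a lazy reference: dplyr verbs are translated to SQL,
# and no rows come into R until collect()
flights_db <- tbl(con, "flights")
flights_db %>%
  group_by(Month) %>%
  summarise(mean_delay = mean(ArrDelay, na.rm = TRUE)) %>%
  collect()

dbDisconnect(con)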

Sampling the data

Use dplyr's sample_frac() to work with a manageable random subset of the data.

library(pryr)
## 
## Attaching package: 'pryr'
## The following object is masked from 'package:data.table':
## 
##     address
## The following objects are masked from 'package:purrr':
## 
##     compose, partial

flights_sub is a 20% sample of flights:

set.seed(20180329)
flights_sub = flights %>% sample_frac(0.2)
object.size(flights)
## 953666048 bytes
object.size(flights_sub)
## 196613792 bytes
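
Since pryr is attached, its memory helpers can confirm the savings; a quick sketch (object_size() is pryr's analogue of base object.size() and counts shared memory only once):

mem_used()                 # total memory used by R objects in this session
object_size(flights_sub)   # pryr's version of object.size()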

Then free up memory by removing the large objects:

rm(flights)
rm(air)
gc()  # garbage collection
##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells    886464   47.4    1647027   88.0   1647027   88.0
## Vcells 157314353 1200.3  434741301 3316.9 434487206 3314.9

Other Useful Tools

Progress Bars

p = progress_estimated(50, min_time = 0)  # dplyr's progress bar helper
for(i in 1:50)
{
  # Calculate something complicated
  Sys.sleep(0.1)
  
  p$tick()$print()
}
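
An alternative sketch with the progress package, a standalone progress-bar library with a similar API (assumes progress is installed):

library(progress)

pb <- progress_bar$new(total = 50)
for (i in 1:50) {
  Sys.sleep(0.1)  # stand-in for real work
  pb$tick()       # advance the bar one step
}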

Benchmarking

system.time(rnorm(1e6))
##    user  system elapsed 
##    0.09    0.00    0.09
system.time(rnorm(1e4) %*% t(rnorm(1e4)))
##    user  system elapsed 
##    0.26    0.07    0.36

Benchmark

microbenchmark times small expressions precisely by running them many times. We can also use the rbenchmark package (library(rbenchmark)); a sketch follows the microbenchmark output below.

library(microbenchmark)

d = abs(rnorm(1000))
r = microbenchmark(
      exp(log(d)/2),
      d^0.5,
      sqrt(d),
      times = 1000
    )
print(r)
## Unit: microseconds
##           expr    min      lq      mean  median      uq     max neval cld
##  exp(log(d)/2) 93.981 101.591 127.46900 103.494 119.665 818.813  1000   c
##          d^0.5 42.615  46.420  71.56409  47.561  66.967 821.857  1000  b 
##        sqrt(d)  6.088   8.371  30.06372   9.132  15.601 586.335  1000 a
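
A minimal rbenchmark sketch of the same comparison (assumes rbenchmark is installed); its benchmark() function reports total elapsed time over all replications plus a relative column scaled to the fastest expression:

library(rbenchmark)

d = abs(rnorm(1000))
benchmark(exp(log(d)/2), d^0.5, sqrt(d), replications = 1000)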

Profiling

library(profvis)

set.seed(20180329)
flights_small = flights_sub %>% sample_n(100000)
profvis({
  m = lm(AirTime ~ Distance, data = flights_small)
  plot(AirTime ~ Distance, data = flights_small)
  abline(m, col = "red")
})

PNG Graphics

For a scatter plot with this many points, write to a raster png() device; an equivalent vector PDF would be very large and slow to render.

png("time_vs_dist.png", width=1024, height=800)
ggplot(flights_small, aes(y=AirTime,x=Distance)) +
  geom_point(alpha=0.01, size=0.5)
## Warning: Removed 2195 rows containing missing values (geom_point).
dev.off()
## png 
##   2
ggsave("time_vs_dist_ggsave.png")  # ggsave() saves the last plot drawn
## Saving 7 x 5 in image
## Warning: Removed 2195 rows containing missing values (geom_point).