Read Parquet in R

The new LOBSTER engine, running on the AWS cloud, can output order book data in either Parquet or CSV format. For efficiency, it outputs Parquet files by default. For Spark and Python users, loading Parquet files is trivial; for R users, it takes a little more effort. However, Apache Arrow's newly released R arrow package now makes the task much simpler. Here are the steps to install and use this package:

> install.packages("arrow")
> install.packages("data.table")
> arrow::install_arrow()

After the installation completes and you restart your R session, you can load a single file with

> df <- arrow::read_parquet(file_name)
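
If you only need a few columns, arrow::read_parquet also accepts a col_select argument, so the whole file does not have to be materialized. A minimal sketch, assuming a hypothetical file name and column names (adjust them to your own output):

> # read only selected columns from a single Parquet file
> df <- arrow::read_parquet("orderbook.parquet", col_select = c("time", "price"))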

Please refer to the official R arrow documentation and here for the details. Loading a local directory of Parquet partitions, however, requires more effort. This thread provides a solution using R's lapply function:

> df <- data.table::rbindlist(lapply(Sys.glob("parquet_directory/part-*.parquet"), arrow::read_parquet))
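
As an aside, newer releases of the arrow package also provide a dataset API that opens a directory of Parquet partitions in one call. A minimal sketch, assuming your arrow version includes open_dataset and that parquet_directory contains the partition files:

> # open all partitions in the directory as one dataset, then pull it into R
> ds <- arrow::open_dataset("parquet_directory", format = "parquet")
> df <- dplyr::collect(ds)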

The lapply approach above works on local storage, but I doubt it would work against S3 storage (I have not tested it). Personally, I prefer an alternative based on the reticulate package. This approach requires some additional installation:

> install.packages("reticulate")
> arrow::install_pyarrow()

After the installation process completes and the R session restarts, you can read the parquet directory with

> library(reticulate)
> pd <- import("pandas")
> df <- pd$read_parquet(parquet_directory)
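
With reticulate's default conversion, the pandas result arrives in R as an ordinary data.frame. If you would rather bypass pandas, pyarrow can be driven the same way; a minimal sketch, assuming pyarrow is available in the Python environment that reticulate uses:

> pq <- import("pyarrow.parquet")
> df <- pq$read_table(parquet_directory)$to_pandas()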

A full example can be found in this function and is tested by this script on GitHub.
