The new LOBSTER engine, running on the AWS cloud, can output order book data in either Parquet or CSV format. For efficiency, it outputs Parquet files by default. For Spark and Python users, loading the Parquet files is trivial; for R users, it takes a little more effort. However, Apache Arrow's newly released R package, arrow, makes the task much simpler. Here are the steps to install and use this package:
> install.packages("arrow")
> install.packages("data.table")
> arrow::install_arrow()
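You can optionally confirm that arrow's compiled bindings were installed correctly; depending on your arrow version, arrow::arrow_info() gives a fuller report, but the simple check is
> # TRUE if the C++ bindings are usable
> arrow::arrow_available()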
After the installation, restart your R session; you can then load a single file with
> df <- arrow::read_parquet(file_name)
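As a quick sanity check, read_parquet returns a tibble/data.frame you can inspect directly. The file name below is hypothetical; substitute your own LOBSTER output:
> # hypothetical file name, for illustration only
> df <- arrow::read_parquet("AMZN_orderbook_part-0000.parquet")
> head(df)   # first rows of the order book
> dim(df)    # number of rows x columns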
Please refer to the official R arrow documentation and here for the details. Loading a local directory of Parquet partitions, however, requires more effort. This thread provides a solution built on R's lapply function:
> df <- data.table::rbindlist(lapply(Sys.glob("parquet_directory/part-*.parquet"), arrow::read_parquet))
This line works with local storage; I doubt it would work against S3 storage (I have not tested it). A sketch of arrow's dataset API, which may be an alternative, follows below.
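Here is a minimal, untested sketch using arrow's dataset API (available in arrow >= 1.0), which opens a partitioned directory lazily and, in recent arrow releases, also accepts s3:// URIs:
> # open the whole partitioned directory without reading it into memory
> ds <- arrow::open_dataset("parquet_directory")
> # materialise as a data.frame; dplyr verbs can filter before collecting
> df <- dplyr::collect(ds)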
Personally, I prefer the alternative solution using the reticulate package. This approach requires some additional installation:
> install.packages("reticulate")
> arrow::install_pyarrow()
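Before importing pandas, it may be worth checking which Python environment reticulate has bound to and that the required modules are visible:
> # inspect the discovered Python installation
> reticulate::py_config()
> # both should return TRUE
> reticulate::py_module_available("pandas")
> reticulate::py_module_available("pyarrow")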
After the installation process completes, restart your R session; you can then read the Parquet directory with
> library(reticulate)
> pd <- import("pandas")
> df <- pd$read_parquet(parquet_directory)
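If you would rather skip pandas, a sketch that calls pyarrow directly through reticulate (assuming pyarrow is installed in the bound Python environment) looks like this; reticulate converts the result into an R data.frame:
> pq <- import("pyarrow.parquet")
> tbl <- pq$read_table(parquet_directory)  # reads all partitions in the directory
> df <- tbl$to_pandas()                    # converted to an R data.frame by reticulate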
A full example can be found in this function and tested with this script on GitHub.