Convert a Spark DataFrame with Date Columns to a Pandas DataFrame Using Apache Arrow

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. After installing the pyarrow package, we can convert a Spark DataFrame to a Pandas DataFrame efficiently by setting spark.sql.execution.arrow.enabled to true for the Spark session.

from pyspark.sql import SparkSession
import numpy as np
import pandas as pd
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Load data") \
    .config("spark.driver.memory", "4G") \
    .config("spark.sql.execution.arrow.enabled", "true") \
    .getOrCreate()
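
If the session was created elsewhere, the same flag can also be set and checked at runtime. The snippet below is a minimal sketch assuming the spark session created above:

# Enable Arrow-based conversion at runtime (same effect as the builder config above)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Check the current value of the flag
print(spark.conf.get("spark.sql.execution.arrow.enabled"))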

However, with pandas 1.0.1 and pyarrow 0.16.0, I ran into trouble converting a Spark DataFrame to a Pandas DataFrame when it contains a DateType column.

orderbookFile = spark.read.parquet("/home/ruihong/lobster4test/convertedParquets")
orderbookFile.printSchema()
root
 |-- eventId: integer (nullable = true)
 |-- orderId: string (nullable = true)
 |-- effectiveQuantity: long (nullable = true)
 |-- effectivePrice: long (nullable = true)
 |-- attribution: string (nullable = true)
 |-- ask_1: long (nullable = true)
 |-- ask_size_1: long (nullable = true)
 |-- bid_1: long (nullable = true)
 |-- bid_size_1: long (nullable = true)
 |-- time: long (nullable = true)
 |-- side: string (nullable = true)
 |-- date: date (nullable = true)
 |-- symbol: string (nullable = true)

As we can see, the date column's data type is date. Now if we try to convert the order book to a Pandas DataFrame, we run into trouble:

orderbookFile.limit(5).toPandas()


To resolve this problem, we have to cast the date column to the timestamp type.

newBookFile = orderbookFile.withColumn("date_temp", col("date").cast("timestamp")) \
    .drop("date") \
    .withColumnRenamed("date_temp", "date")

newBookFile.printSchema()
root
 |-- eventId: integer (nullable = true)
 |-- orderId: string (nullable = true)
 |-- effectiveQuantity: long (nullable = true)
 |-- effectivePrice: long (nullable = true)
 |-- attribution: string (nullable = true)
 |-- ask_1: long (nullable = true)
 |-- ask_size_1: long (nullable = true)
 |-- bid_1: long (nullable = true)
 |-- bid_size_1: long (nullable = true)
 |-- time: long (nullable = true)
 |-- side: string (nullable = true)
 |-- symbol: string (nullable = true)
 |-- date: timestamp (nullable = true)

newBookFile.limit(5).toPandas()
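
Note that withColumn replaces an existing column of the same name, so the temporary column is not strictly required; the drop/rename steps above simply move the date column to the end of the schema. If a DataFrame has several DateType columns, the same workaround can be applied generically. The helper below is a minimal sketch (cast_dates_to_timestamps is my own name, not part of Spark) that casts every DateType column to timestamp before calling toPandas():

from pyspark.sql.types import DateType
from pyspark.sql.functions import col

def cast_dates_to_timestamps(df):
    # Replace every DateType column with a timestamp-typed version (column position is preserved)
    for field in df.schema.fields:
        if isinstance(field.dataType, DateType):
            df = df.withColumn(field.name, col(field.name).cast("timestamp"))
    return df

cast_dates_to_timestamps(orderbookFile).limit(5).toPandas()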
