Read Parquet in R

The new LOBSTER engine, running on the AWS cloud, can output order book data in either Parquet or CSV format. For the sake of efficiency, it outputs Parquet files by default. For Spark and Python users, loading the Parquet files is trivial. For R users, it takes a little more effort. However, Apache Arrow's newly released R arrow package now makes the task much simpler. Here are the steps for installing and using this package:

> install.packages("arrow")
> install.packages("data.table")
> arrow::install_arrow()

After the installation, restart your R session; you can then load a single file with

>df <- arrow::read_parquet(file_name)

Please refer to the official R arrow documentation and here for the details. However, loading a local directory of Parquet partitions requires a bit more effort. This thread provides a solution based on R's lapply function:

>df <- data.table::rbindlist(lapply(Sys.glob("parquet_directory/part-*.parquet"), arrow::read_parquet))

This line works with local storage; I doubt it would work with S3 storage (I have not tested it). Personally, I prefer an alternative solution using the reticulate package. This approach requires some additional installation:

> install.packages("reticulate")
> arrow::install_pyarrow()

After the installation completes and the R session is restarted, you can read the Parquet directory with

> library(reticulate)
> pd <- import("pandas")
> df <- pd$read_parquet(parquet_directory)
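
Under the hood, the reticulate route simply delegates to pandas. For reference, a plain-Python equivalent is roughly the following sketch (parquet_directory stands for the directory path, as above):

import pandas as pd

# pandas delegates to pyarrow, which treats a directory of Parquet
# partitions as a single dataset.
df = pd.read_parquet("parquet_directory")
print(df.shape)

With s3fs installed, pandas should also accept an s3:// path here, although I have not tested that either.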

A full example can be found in this function and is tested by this script on GitHub.

Posted in R | Leave a comment

Clawing in the Jungle

Arnaud Amsellem has just published an exciting piece of research using LOBSTER data: "Using random forest to model limit order book dynamic".

Posted in LOBSTER Data | Leave a comment

How to Install Cygwin without Admin Privilege

  1. Download the Cygwin setup file here.
  2. Right-click on setup-x86_64.exe and create a shortcut to it.
  3. Modify the shortcut's properties, adding " --no-admin" at the end of the target. Note that there must be a space between ".exe" and "--no-admin".

  4. Double click the shortcut to run the installation -- Ignore Windows 10's warning!

  5. Follow the installation steps.
  6. Choose Just for Me and an installation location you have write access to.
  7. To install additional packages later, re-run the setup shortcut and select them in the package chooser.
Posted in Cygwin | Leave a comment

MathWorks Publishes Machine Learning Application Using LOBSTER Data


https://uk.mathworks.com/help/finance/machine-learning-with-financial-data.html?s_tid=CRUX_lftnav

Posted in LOBSTER Data | Leave a comment

Self Type in Scala

abstract class A (val x: Int){
}
trait B {
    def xPlus1: Int = x +1
}

Of course, this is silly: because B does not know about the x defined in A, the Scala compiler will complain,

error: not found: value x
       def xPlus1: Int = x +1

However, in many applications we do know that trait B will only ever be mixed into a subtype of A (e.g., a Calibration trait that must only be mixed into some Model in a statistical context). In this case, we can tell the compiler explicitly through a self type. In our toy example,

trait B {
self: A =>
    def xPlus1: Int = x+1
}

Now, we test this approach by creating a concrete class C as

scala> case class C(override val x:Int) extends A(x) with B
scala> C(10) xPlus1
res1: Int = 11
Posted in Scala | Leave a comment

Mount AWS S3 Bucket on An EC2 Ubuntu Instance

The current Ubuntu version on AWS EC2 instances is 18.04 LTS. After launching an Ubuntu instance and connecting to it via SSH, one can follow these steps to set up a permanent mount of an S3 bucket.

  1. Install the packages.
    $ sudo apt update
    $ sudo apt-get install s3fs
  2. Create a global access-key password file. The access key can be created in the IAM console, following these guidelines.
    $ echo "0ASDF87XJKN34LK32EXAMPLE:AklasjlUUHNJ98SaadASKEXAMPLE1237890"  | sudo tee  /etc/passwd-s3fs
    $ sudo chmod 600 /etc/passwd-s3fs
  3. As a test, we can first mount the bucket manually from the command line.
    $ sudo s3fs myBucket /path/to/mount/point
    $ sudo ls /path/to/mount/point
    # List of files and folders 

    Note that myBucket is the name of the bucket, excluding the s3:// part. If everything works, we can unmount the bucket and prepare for the auto-mount steps.

    $ sudo fusermount -u /path/to/mount/point
  4. To set up auto-mounting at boot, we first add a line to /etc/fstab:
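    # _netdev delays mounting until the network is up; allow_other lets non-root
    # users access the FUSE mount; umask=0007 removes access for other users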
    s3fs#myBucket       /path/to/mount/point       fuse _netdev,allow_other,umask=0007,uid=1000     0 0

    where uid=1000 is the user's ID, which can be found with

    $ id -u

    Now we can reboot the instance and the S3 bucket will be mounted automatically. Alternatively, we can test the auto-mount without rebooting:

    $ sudo mount -a
Posted in AWS | Leave a comment

Convert A Spark DataFrame with Date Columns to Pandas DataFrame using Apache Arrow

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. After installing the pyarrow package, we can convert a Spark DataFrame to a pandas DataFrame efficiently by setting spark.sql.execution.arrow.enabled to true on the Spark instance:

from pyspark.sql import SparkSession
import numpy as np
import pandas as pd
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Load data") \
    .config("spark.driver.memory", "4G") \
    .config("spark.sql.execution.arrow.enabled", "true") \
    .getOrCreate()

However, with pandas 1.0.1 and pyarrow 0.16.0, I ran into trouble converting a Spark DataFrame to a pandas DataFrame when it contains a DateType column.

orderbookFile = spark.read.parquet("/home/ruihong/lobster4test/convertedParquets")
orderbookFile.printSchema()
root
 |-- eventId: integer (nullable = true)
 |-- orderId: string (nullable = true)
 |-- effectiveQuantity: long (nullable = true)
 |-- effectivePrice: long (nullable = true)
 |-- attribution: string (nullable = true)
 |-- ask_1: long (nullable = true)
 |-- ask_size_1: long (nullable = true)
 |-- bid_1: long (nullable = true)
 |-- bid_size_1: long (nullable = true)
 |-- time: long (nullable = true)
 |-- side: string (nullable = true)
 |-- date: date (nullable = true)
 |-- symbol: string (nullable = true)

As we can see, the date column's data type is date. Now if we try to convert the order book to a pandas DataFrame, we get into trouble:

orderbookFile.limit(5).toPandas()


To resolve this problem, we have to cast the date column to the timestamp type:

newBookFile = orderbookFile.withColumn("date_temp", col("date").cast("timestamp")) \
.drop("date") \
.withColumnRenamed("date_temp", "date")

newBookFile.printSchema()
root
 |-- eventId: integer (nullable = true)
 |-- orderId: string (nullable = true)
 |-- effectiveQuantity: long (nullable = true)
 |-- effectivePrice: long (nullable = true)
 |-- attribution: string (nullable = true)
 |-- ask_1: long (nullable = true)
 |-- ask_size_1: long (nullable = true)
 |-- bid_1: long (nullable = true)
 |-- bid_size_1: long (nullable = true)
 |-- time: long (nullable = true)
 |-- side: string (nullable = true)
 |-- symbol: string (nullable = true)
 |-- date: timestamp (nullable = true)
newBookFile.limit(5).toPandas()
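
If a table has several date columns, the same cast can be applied generically. Below is a minimal sketch under that assumption; cast_dates_to_timestamps is a hypothetical helper of my own, not part of any library.

from pyspark.sql import DataFrame
from pyspark.sql.functions import col
from pyspark.sql.types import DateType

def cast_dates_to_timestamps(df: DataFrame) -> DataFrame:
    # Cast every DateType column to timestamp so that the
    # Arrow-accelerated toPandas() can handle the schema.
    for field in df.schema.fields:
        if isinstance(field.dataType, DateType):
            df = df.withColumn(field.name, col(field.name).cast("timestamp"))
    return df

pandas_df = cast_dates_to_timestamps(orderbookFile).limit(5).toPandas()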

Posted in Python, Spark | Leave a comment

Tips on Linux

How to unzip all 7z files in a directory?

One can use the following bash script to unzip all *.7z files in a directory.

#!/bin/bash
FILES=$1/*.7z
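# nullglob: if no *.7z files match, the loop below is simply skipped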
shopt -s nullglob
for f in $FILES
do
    echo "Processing $f file ..."
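    # extract with 7za; -o sets the output directory (no space between -o and the path)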
    7za x $f -o$2
done

Save the script as unzipAll.sh.

$ chmod +x unzipAll.sh
$ ./unzipAll.sh /path/to/7z/file/directory /path/to/target/directory
Posted in Linux | Leave a comment

Evaluate Trading Strategies by Using LOBSTER Data

-- A short review of a working paper by Balch et al. (2019) from J.P. Morgan Artificial Intelligence Research and Imperial College London

T.H. Balch et al. have published a working paper, "How to Evaluate Trading Strategies: Single Agent Market Replay or Multiple Agent Interactive Simulation?" In the paper, using LOBSTER data, they show how a multi-agent simulator can support two important but distinct methods for assessing a trading strategy: Market Replay (backtesting) and Interactive Agent-Based Simulation (IABS).

In particular, they implement backtesting using three agents: an exchange agent representing the exchange that keeps the order book (e.g., Nasdaq or NYSE), a market replay agent that provides liquidity by replaying historical orders, and an experimental agent representing the trading strategy to be evaluated. While the experimental agent is based on an interactive agent-based simulator named ABIDES, the market replay agent is based on LOBSTER message data. The authors visualize a short segment of the replay data in Figure 2 of the paper.

Figure 2 from the original paper: price-level volume plot. The black line represents the mid price; each point is a price level, with the colour scheme indicating the size (log scale) present at that level.

The experimental agent is configured to participate in the simulation in a manner similar to the market replay agent, with the orders submitted depending on the experiment carried out. It uses a strategic "greed" parameter to determine what size of order to place relative to the available liquidity. In the visualized experiment, the impact agent queries the liquidity within 1% of the inside bid (if selling) or ask (if buying) and, with greed = 1.0, places an order to capture all of it. [Exactly how the greed level is defined is not entirely clear to me.]

Figure 3(a): Observed impact on the mid price from the replay-only experimental agent placing buy market orders at twice the best ask size
Figure 5(a): Observed impact on the mid price by the IABS experimental agent placing buy market orders with greed = 1.0

The two figures above show the typical simulation results observed by the authors. They conclude that in the backtesting environment the price trends rather quickly back to the baseline price, eventually reaching that price and remaining there [the authors also point out that whether the price finally stabilizes at the baseline price seems to depend on the trading side, although looking at their figures I cannot see this clearly]. In the IABS experiments, however, the price stabilizes at a new level in each set of experiments, suggesting that the impact of the order is longer lasting or even permanent.

Posted in LOBSTER Data | Leave a comment

LOBSTER on Spark

A new LOBSTER engine has been built on Spark. The demo Jupyter notebook is available here.

In this short demo, we study the different strategies of market makers, using pyspark > 2.4.4 and LOBSTER data constructed by the new LOBSTER engine running in a cloud setting.

Prerequisite Packages

Several packages need to be installed:

  1. pyspark, or a full Spark installation following https://spark.apache.org/downloads.html.
  2. numpy and pandas, as expected.
  3. seaborn for the graphics.
  4. pyarrow is NOT required at the current stage, but it is better to have it ready.
In [1]:
from pyspark.sql import SparkSession
import numpy as np
import pandas as pd
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
import matplotlib.pyplot as plt
from datetime import date
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

sns.set()
pd.options.display.max_columns = None

date = '2017-11-15'
symbol = 'AMZN'

Initialize Spark

Note that although Apache Arrow is a powerful tool for accelerating the conversion between Spark DataFrames and pandas DataFrames, it hits a problem when there is a date column in the data set. Therefore, the date column has to be cast to a timestamp before converting (see below).

In [2]:
# Enable Apache Arrow to speed up converting Spark DataFrames to pandas DataFrames

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Load data") \
    .config("spark.driver.memory", "4G") \
    .config("spark.sql.execution.arrow.enabled", "true") \
    .getOrCreate()

Because pyarrow has an issue converting the date data type from a Spark DataFrame to a pandas DataFrame, we have to cast date to timestamp:

In [3]:
orderbookFile = spark.read.parquet("/home/ruihong/lobster4test/itch-derived/orderbook") \
.withColumn("date_temp", (col("date").cast("timestamp"))) \
.drop("date") \
.withColumnRenamed("date_temp", "date")


orderbookFile.createOrReplaceTempView("orderbook")

Investigate the schema and some data

In [4]:
orderbookFile.printSchema()
root
 |-- sequenceNumber: long (nullable = true)
 |-- Bid_1_price: double (nullable = true)
 |-- Bid_1_quantity: double (nullable = true)
 |-- Bid_2_price: double (nullable = true)
 |-- Bid_2_quantity: double (nullable = true)
 |-- Offer_1_price: double (nullable = true)
 |-- Offer_1_quantity: double (nullable = true)
 |-- Offer_2_price: double (nullable = true)
 |-- Offer_2_quantity: double (nullable = true)
 |-- time: long (nullable = true)
 |-- eventType: string (nullable = true)
 |-- orderId: string (nullable = true)
 |-- symbol: string (nullable = true)
 |-- side: string (nullable = true)
 |-- effectiveQuantity: long (nullable = true)
 |-- effectivePrice: long (nullable = true)
 |-- attribution: string (nullable = true)
 |-- matchNumber: long (nullable = true)
 |-- liqGrp: integer (nullable = true)
 |-- level: integer (nullable = true)
 |-- date: timestamp (nullable = true)

In [5]:
orderbookFile.limit(5).toPandas().head()
Out[5]:
sequenceNumber Bid_1_price Bid_1_quantity Bid_2_price Bid_2_quantity Offer_1_price Offer_1_quantity Offer_2_price Offer_2_quantity time eventType orderId symbol side effectiveQuantity effectivePrice attribution matchNumber liqGrp level date
0 125201640 1699500.0 100.0 1699400.0 29.0 1700400.0 35.0 1701000.0 100.0 34200025913752 Submission 9518437 AAPL Sell 35 1700400 0 1 2 2017-11-15
1 125201840 1699500.0 100.0 1699400.0 29.0 1700400.0 135.0 1701000.0 100.0 34200025959228 Submission 9518469 AAPL Sell 100 1700400 0 1 2 2017-11-15
2 125203230 1699500.0 100.0 1699400.0 29.0 1700400.0 35.0 1701000.0 100.0 34200026293628 Deletion 9518469 AAPL Sell 100 1700400 0 1 2 2017-11-15
3 125207540 1699500.0 100.0 1699400.0 29.0 1700400.0 135.0 1701000.0 100.0 34200027368600 Submission 9519165 AAPL Sell 100 1700400 0 1 2 2017-11-15
4 125223160 1699500.0 100.0 1699400.0 29.0 1700400.0 135.0 1701000.0 100.0 34200030550133 HiddenOrderExecution 0 AAPL Buy 100 1700100 45711 1 2 2017-11-15
In [6]:
spark.sql(f"select symbol, date, count(*) as numberbook from orderbook group by symbol, date").toPandas().head(20)
Out[6]:
symbol date numberbook
0 ASML 2017-11-15 68743
1 MDLZ 2017-11-16 198361
2 EXC 2017-11-16 134225
3 BIDU 2017-11-16 45312
4 SNPS 2017-11-15 22839
5 GILD 2017-11-16 93417
6 CHTR 2017-11-16 37381
7 MXIM 2017-11-16 77134
8 EA 2017-11-16 55896
9 CMCSA 2017-11-16 401262
10 FOX 2017-11-16 332061
11 ROST 2017-11-16 81371
12 EXPE 2017-11-16 38199
13 INTU 2017-11-16 33338
14 REGN 2017-11-16 27900
15 BMRN 2017-11-15 25781
16 INCY 2017-11-15 24077
17 GILD 2017-11-15 111269
18 COST 2017-11-16 48440
19 WDAY 2017-11-16 23195
In [7]:
orderBookSparkDf = spark.sql(f"select * from orderbook where date='{date}' and symbol='{symbol}'")
orderBookPandasDf = orderBookSparkDf.toPandas()
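# 'time' is nanoseconds after midnight, so divide by 1e9 to get seconds before adding to the date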
orderBookPandasDf['timestamp'] = pd.to_datetime(orderBookPandasDf['date']) + pd.to_timedelta(orderBookPandasDf['time']/1e9, unit='s')

orderBookPandasDf.head()
Out[7]:
sequenceNumber Bid_1_price Bid_1_quantity Bid_2_price Bid_2_quantity Offer_1_price Offer_1_quantity Offer_2_price Offer_2_quantity time eventType orderId symbol side effectiveQuantity effectivePrice attribution matchNumber liqGrp level date timestamp
0 125143950 11270000.0 507.0 11267800.0 100.0 11271600.0 76.0 11284800.0 20.0 34200016481947 HiddenOrderExecution 0 AMZN Buy 3 11270500 45384 2 2 2017-11-15 2017-11-15 09:30:00.016481947
1 125330250 11270000.0 507.0 11267800.0 100.0 11271600.0 76.0 11284800.0 20.0 34200071354238 HiddenOrderExecution 0 AMZN Buy 3 11270500 46240 2 2 2017-11-15 2017-11-15 09:30:00.071354238
2 125406470 11270000.0 507.0 11267800.0 100.0 11271600.0 76.0 11284800.0 20.0 34200082349479 HiddenOrderExecution 0 AMZN Buy 36 11270100 46364 2 2 2017-11-15 2017-11-15 09:30:00.082349479
3 125443000 11270000.0 507.0 11267800.0 100.0 11271600.0 76.0 11284800.0 20.0 34200141846795 HiddenOrderExecution 0 AMZN Buy 69 11270500 46549 2 2 2017-11-15 2017-11-15 09:30:00.141846795
4 125443010 11270000.0 507.0 11267800.0 100.0 11271600.0 69.0 11284800.0 20.0 34200141846795 Execution 9441073 AMZN Sell 7 11271600 46550 2 2 2017-11-15 2017-11-15 09:30:00.141846795

Plot the bid and ask

In [8]:
import matplotlib.dates as mdates
timeFormatter = mdates.DateFormatter('%H:%M')

plt.plot(orderBookPandasDf['timestamp'], orderBookPandasDf['Bid_1_price']/10000, color='g', label='Bid')
plt.plot(orderBookPandasDf['timestamp'], orderBookPandasDf['Offer_1_price']/10000, color='r', label='Ask')
plt.gca().xaxis.set_major_formatter(timeFormatter)
plt.legend()
plt.show()
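
Beyond the raw quotes, the same pandas DataFrame lends itself to quick summary statistics. Below is a minimal sketch (the mid and spread columns are my own additions, and prices are divided by 10000 as in the plot above):

# Mid price and quoted spread in dollars (prices are stored as price * 10000).
orderBookPandasDf['mid'] = (orderBookPandasDf['Bid_1_price']
                            + orderBookPandasDf['Offer_1_price']) / 2 / 10000
orderBookPandasDf['spread'] = (orderBookPandasDf['Offer_1_price']
                               - orderBookPandasDf['Bid_1_price']) / 10000

# Average quoted spread per minute over the trading day.
spread_per_minute = (orderBookPandasDf
                     .set_index('timestamp')['spread']
                     .resample('1min')
                     .mean())

plt.plot(spread_per_minute.index, spread_per_minute.values, label='1-minute average spread')
plt.gca().xaxis.set_major_formatter(timeFormatter)
plt.legend()
plt.show()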

Posted in LOBSTER Data, Python, Spark | Leave a comment