On Making Money with Python and Data Science

Introduction

On my current research sabbatical with Metis, I’ve finally had the opportunity (long overdue!) to apply the data science and machine learning techniques that I teach daily to the financial sector. In this series of posts, I’ll show you how to get started building your own financial models!

First, I want to thank the team at Quandl for their amazing, easy-to-use platform, and the Quantopian community for great resources and inspiration!

This is by no means an “advanced” guide, and while here, I should mention:

The information provided here and accompanying material is for informational purposes only. It should not be considered legal or financial advice. You should consult with an attorney or other professional to determine what may be best for your individual needs.

Jonathan does not make any guarantee or other promise as to any results that may be obtained from using his content. No one should make any investment decision without first consulting his or her own financial advisor and conducting his or her own research and due diligence. To the maximum extent permitted by law, Jonathan Balaban disclaims any and all liability in the event any information, commentary, analysis, opinions, advice and/or recommendations prove to be inaccurate, incomplete or unreliable, or result in any investment or other losses.

Oh, and also, past performance is not a reliable indicator of future results and blah blah blah…

Cool, with that out of the way, let’s take a look at definitions and setup!

Definitions and Assumptions

What is a Trading Algorithm?

From Quantopian:

A trading algorithm is a computer program that defines a set of rules for buying and selling assets. Most trading algorithms make decisions based on mathematical or statistical models that are derived from research conducted on historical data.
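To make "a set of rules" concrete, here is a minimal sketch of one toy rule: stay in the stock while a fast moving average sits above a slow one. The function name, the window lengths, and the made-up price series are all my own assumptions for illustration; this is not a Quantopian example and not a strategy recommendation.

```python
import pandas as pd

def crossover_signal(close, fast=5, slow=20):
    """Toy rule: 1 (hold) when the fast moving average is above the slow one, else 0 (stay out)."""
    fast_ma = close.rolling(fast).mean()
    slow_ma = close.rolling(slow).mean()
    # Rows where the slow window is still incomplete compare as False, so they get 0
    return (fast_ma > slow_ma).astype(int)

# Made-up, steadily rising prices, used only to exercise the rule
prices = pd.Series(range(1, 61), dtype=float)
signal = crossover_signal(prices)
print(signal.value_counts())
```

A real algorithm would layer position sizing, transaction costs, and backtesting on top of a rule like this.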

What platforms are we using?

I’m modeling in Python using Anaconda, Jupyter Notebooks, and PyCharm, and it’s easiest to follow along using these tools. However, you can use the Quantopian platform’s built-in kernels, or even modify code to R or other languages if you so desire.

I’m also on a Mac, and will be sharing UNIX commands throughout; Windows users, Bing is your friend!

What assets are we focusing on?

Apple (AAPL) is a good stock to build on, as it’s currently (September 2018) the most valuable company in the world, has a relatively stable stock price, and has huge amounts of volume, news, and sentiment associated with the brand.

Keep in mind: the principles covered here may work differently for an equity that’s smaller, in a different sector, etc.

Setup

To get the Quantopian platform on our local machine, run the following in terminal:

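# create conda py35 since that's the newest version that works
conda create -n py35 python=3.5

# activate the new env first so zipline installs into it
# (on older conda versions: source activate py35)
conda activate py35

conda install -c quantopian/label/ci -c quantopian zipline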

To get Quandl working, follow the account creation instructions and API documentation to start loading in financial data. Also, save your API key, as you’ll need it to load anything meaningful.
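One way to save the key without pasting it into a notebook is to read it from an environment variable. This is a rough sketch: the variable name QUANDL_API_KEY and the helper load_quandl_key are my own choices, not anything Quandl requires.

```python
import os

def load_quandl_key(env_var="QUANDL_API_KEY"):
    """Return the API key stored in an environment variable (the name is a hypothetical choice)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Usage, once the variable is set in your shell (e.g. export QUANDL_API_KEY=...):
# quandl.ApiConfig.api_key = load_quandl_key()
```

This keeps the key out of any notebook you later share or commit.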

Load Data

Let’s get started with our codebase:

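import pandas as pd
import numpy as np
import patsy

# workaround: older pandas_datareader releases import is_list_like from a
# location that newer pandas removed, so patch it back in before importing
pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data
import quandl
quandl.ApiConfig.api_key = "##############"  # paste your own Quandl API key here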

Now let’s pull some Apple stock:

df = quandl.get("WIKI/" + 'AAPL', start_date="2014-01-01")

Take a look at the columns and note the one called “Split Ratio”. This is a very important column; it marks the dates on which stock splits happen. In 2014, Apple decided on a 7-for-1 split, and we can use Python and pandas to find the date that happened:

len(df)
df['Split Ratio'].value_counts()
df[df['Split Ratio'] == 7.0]

Our culprit is 2014-06-09. Let’s pull only stock prices after that date to keep things simple:

aapl_split = quandl.get("WIKI/" + 'AAPL', start_date="2014-06-10")
aapl_split.head()
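The split lookup can also be done programmatically rather than by eye. Here is a minimal sketch on a small synthetic frame; the prices are made up and only the “Split Ratio” column name mirrors the Quandl data. It assumes a ratio of 1.0 means "no split that day".

```python
import pandas as pd

# Synthetic stand-in for the Quandl frame; prices are invented for illustration
dates = pd.bdate_range("2014-06-05", periods=5)
toy = pd.DataFrame(
    {"Close": [700.0, 700.0, 100.0, 101.0, 102.0],
     "Split Ratio": [1.0, 1.0, 7.0, 1.0, 1.0]},
    index=dates,
)

# Any day whose ratio differs from 1.0 is a split day
split_days = toy.index[toy["Split Ratio"] != 1.0]
last_split = split_days.max()

# Keep only rows strictly after the most recent split
post_split = toy.loc[toy.index > last_split]
print(last_split.date(), len(post_split))
```

The same three lines should work on the real df, provided its index is a DatetimeIndex.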

By the way, I found a nice list of all S&P 500 tickers on GitHub. If you’d like to expand your analysis to sets of equities, you can load them into a list like so:

sp500 = pd.read_csv('https://raw.githubusercontent.com/datasets/s-and-p-500-companies/master/data/constituents.csv')

tickers = sp500.Symbol.tolist()
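If you do expand to a set of equities, one possible pattern is to collect each ticker's closing prices into a single DataFrame. The sketch below is only an illustration: closes_for and its fetch argument are hypothetical names of mine, and the fake fetcher stands in for a real data call so the example runs offline.

```python
import pandas as pd

def closes_for(tickers, fetch):
    """Build one DataFrame of closing prices, one column per ticker.

    `fetch` is any callable mapping a ticker string to a DataFrame with a 'Close' column.
    Tickers that fail to load are skipped with a message rather than stopping the loop.
    """
    frames = {}
    for ticker in tickers:
        try:
            frames[ticker] = fetch(ticker)["Close"]
        except Exception as exc:  # broad on purpose: this is a sketch, not production code
            print(f"skipping {ticker}: {exc}")
    return pd.DataFrame(frames)

def fake_fetch(ticker):
    """Offline stand-in for a real data source; returns made-up prices."""
    if ticker == "BAD":
        raise ValueError("no data")
    idx = pd.bdate_range("2018-01-01", periods=3)
    return pd.DataFrame({"Close": [1.0, 2.0, 3.0]}, index=idx)

closes = closes_for(["AAA", "BAD", "BBB"], fake_fetch)
print(closes)
```

With Quandl set up, the fetcher could be something like lambda t: quandl.get("WIKI/" + t, start_date="2014-06-10"), starting with a small slice of tickers to stay within API limits.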

Key Statistics

Augmented Dickey-Fuller

One thing we should check for is the presence of a unit root, which can be done with the Augmented Dickey-Fuller test. In a nutshell, the presence of a unit root implies the series is non-stationary: AAPL’s price carries a stochastic trend that persists rather than reverting to a fixed mean, and that shapes how we extract patterns and use them for prediction.

# run ADF to determine unit root
import statsmodels.tsa.stattools as ts
cadf = ts.adfuller(aapl_split.Close)

print('Augmented Dickey Fuller:')
print('Test Statistic =',cadf[0])
print('p-value =',cadf[1])
print('Critical Values =',cadf[4])

Augmented Dickey Fuller:
Test Statistic = -0.731194982176
p-value = 0.838503045276
Critical Values = {'1%': -3.4372231474483499, '5%': -2.8645743628401763, '10%': -2.5683856650361054}

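We compare the test statistic above with the critical values; if it’s lower than the critical value at our chosen significance level, we reject the null hypothesis that there is a unit root. As you can see, our test statistic sits well above all three critical values and the p-value is large, so we fail to reject the null: the data are consistent with a unit root for AAPL. This is useful to know, as it tells us the trend in AAPL is persistent, and we’ll need to account for it (for example, by modeling price changes rather than raw levels) when we use these patterns for prediction.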
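To get a feel for what the test statistic measures, here is a rough NumPy sketch of the plain Dickey-Fuller regression. It is a simplification, not what statsmodels runs: it has no lagged difference terms (the "augmented" part). It is applied to a simulated random walk and to that walk's first differences. The function name and the simulated data are my own assumptions.

```python
import numpy as np

def dickey_fuller_stat(series):
    """t-statistic on rho in the regression: diff(y)[t] = alpha + rho * y[t-1] + error[t]."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])  # intercept and lagged level
    beta, _, _, _ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # OLS covariance of the coefficients
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=1000))  # has a unit root by construction

stat_levels = dickey_fuller_stat(random_walk)
stat_diffs = dickey_fuller_stat(np.diff(random_walk))  # differences are white noise
print("levels:", stat_levels)
print("differences:", stat_diffs)
```

The statistic for the differenced series should come out far more negative than the one for the levels, which is the pattern you’d look for when comparing against the critical values.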

Correlation with other Equities

Apple is considered a luxury technology brand. What if we can determine a strong correlation with other equities?

Note that correlation does not imply causation and there might be the question of which stock is a first-mover, but patterns and relationships are always a good thing for boosting our model performance.

I recommend you look at three stocks, and how AAPL correlates:

Microsoft (MSFT)
Intel (INTC)
Tiffany & Co. (TIF)

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

MSFT = quandl.get("WIKI/" + 'MSFT', start_date="2014-06-10")
INTC = quandl.get("WIKI/" + 'INTC', start_date="2014-06-10")
TIF = quandl.get("WIKI/" + 'TIF', start_date="2014-06-10")

For the sake of time here we’ll focus just on the Intel data; let’s plot our closing prices for AAPL and INTC:

# pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.jointplot(x=INTC.Close, y=aapl_split.Close, kind="reg");

We can also take a look at our correlation value:

np.corrcoef(INTC.Close, aapl_split.Close)

We note an r-value of 0.7434; not bad for prediction, but we need to keep an important fact in mind: if we already know INTC’s closing price for a given day, we can just look up AAPL’s for that day too! So, let’s check the correlation between INTC’s closing price and AAPL’s close seven trading days later for a more viable metric:

# seven day lead
np.corrcoef(INTC.Close[:-7], aapl_split.Close[7:])

For this run, we note an r-value of 0.7332; still pretty good!
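The slicing trick above pairs values by position, so it relies on both series covering exactly the same trading days. As an alternative, here is a small sketch that uses pandas’ shift and corr, which align on the date index. The helper name lagged_corr and the simulated series are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def lagged_corr(leader, follower, lag):
    """Pearson r between the leader's value `lag` rows earlier and the follower's value today."""
    # shift() pushes the leader forward in time; corr() aligns on the index and drops NaN pairs
    return leader.shift(lag).corr(follower)

rng = np.random.default_rng(42)
idx = pd.bdate_range("2018-01-01", periods=250)
leader = pd.Series(rng.normal(size=250).cumsum(), index=idx)
# Simulated follower that echoes the leader seven rows later, plus a little noise
follower = leader.shift(7) + rng.normal(scale=0.1, size=250)

for lag in (0, 7):
    print(lag, round(lagged_corr(leader, follower, lag), 3))
```

On the real data, lagged_corr(INTC.Close, aapl_split.Close, 7) should be equivalent to the slice-based call whenever the two date indexes match.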

Google Trends

We can compare how Twitter and other repositories of sentiment affect stock prices. For now, let’s see if Google Trends correlates with AAPL. Make sure to specify the time range or use this link for the exact search (note I added a few days into April to deal with half-week issues), then load the CSV into Python:

# Google Trends

aapl_trends = pd.read_csv('/Users/jb/Desktop/multiTimeline.csv', header=1)

aapl_trends.tail()

Note the weekly format; we need to convert our daily stock price dataset to weekly using pandas’ resample():

# 'W' bins end on Sunday; convention= only applies to a PeriodIndex, so it is dropped here
aapl_split_week = aapl_split.resample('W').last()
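To see what the weekly resample does, here is a tiny sketch on ten synthetic business days; the values are just counters, not prices.

```python
import pandas as pd

# Two full Monday-to-Friday weeks, with Close simply counting 0..9
days = pd.bdate_range("2018-01-01", periods=10)
daily = pd.DataFrame({"Close": range(10)}, index=days)

# 'W' groups days into weeks ending on Sunday; last() keeps the final row in each week
weekly = daily.resample("W").last()
print(weekly)
```

Each weekly row carries Friday’s value and is labeled with the Sunday that closes the week. Before correlating against another weekly series, it’s worth confirming that both have the same length and cover the same weeks, since np.corrcoef pairs values purely by position.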

Now let’s check the correlation between Google’s weekly search-interest score (a relative 0 to 100 index, not a raw count of searches) and the closing stock price for the last business day of that week:

# trend and price corr
np.corrcoef(aapl_trends['Apple: (Worldwide)'], aapl_split_week.Close)

Oy! We get a minuscule 0.0454, which makes sense when we think about it: AAPL news/activity/chatter doesn’t necessarily imply positive things for the stock price. Something like sentiment, which has a polarity to it, should provide a stronger signal, but we’ll look at that another time.

Final Thoughts

We’ve only scratched the surface of what can be done in the Exploratory Data Analysis (EDA) portion of financial analysis, but in the next post we’ll transition to building predictive models and letting advanced packages do the heavy lifting for us.

I hope you’ve found this helpful, and I’d love to hear from you in the comments:

Any issues running this code? Sometimes environments and versions can screw things up…
What packages and techniques do you use?
What visualizations are most helpful for understanding a stock price’s movement?
What factors do you think will maximize model predictions?

And finally, if you happen to know a modeling technique that consistently makes tons of money, please direct message me the deets.
