Introduction
On my current research sabbatical with Metis, I’ve had the amazing (and long overdue) opportunity to apply the data science and machine learning techniques that I teach daily to the financial sector. In this series of posts, I’ll show you how to get started building your own financial models!
First, I want to thank the team at Quandl for their amazing, easy-to-use platform, and the Quantopian community for great resources and inspiration!
This is by no means an “advanced” guide, and while we’re here, I should mention:
The information provided here and accompanying material is for informational purposes only. It should not be considered legal or financial advice. You should consult with an attorney or other professional to determine what may be best for your individual needs.
Jonathan does not make any guarantee or other promise as to any results that may be obtained from using his content. No one should make any investment decision without first consulting his or her own financial advisor and conducting his or her own research and due diligence. To the maximum extent permitted by law, Jonathan Balaban disclaims any and all liability in the event any information, commentary, analysis, opinions, advice and/or recommendations prove to be inaccurate, incomplete or unreliable, or result in any investment or other losses.
Oh, and also, past performance is not a reliable indicator of future results and blah blah blah…
Cool, with that out of the way, let’s take a look at definitions and setup!
Definitions and Assumptions
What is a Trading Algorithm?
From Quantopian:
A trading algorithm is a computer program that defines a set of rules for buying and selling assets. Most trading algorithms make decisions based on mathematical or statistical models that are derived from research conducted on historical data.
What platforms are we using?
I’m modeling in Python using Anaconda, Jupyter Notebooks, and PyCharm, and it’s easiest to follow along using these tools. However, you can use the Quantopian platform’s built-in kernels, or even port the code to R or other languages if you so desire.
I’m also on a Mac, and will be sharing UNIX commands throughout; Windows users, Bing is your friend!
What assets are we focusing on?
Apple (AAPL) is a good stock to build on, as it’s currently (September 2018) the most valuable company in the world, has a relatively stable stock price, and has huge amounts of volume, news, and sentiment associated with the brand.
Keep in mind: the principles covered here may work differently for an equity that’s smaller, in a different sector, etc.
Setup
To get the Quantopian platform on our local machine, run the following in terminal:
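# create a conda environment named py35, since Python 3.5 is the newest version that works
conda create -n py35 python=3.5
# activate it so zipline installs into the new environment
conda activate py35
conda install -c quantopian/label/ci -c quantopian zipline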
To get Quandl working, follow the account creation instructions and API documentation to start loading in financial data. Also, save your API key, as you’ll need it to load anything meaningful.
Load Data
Let’s get started with our codebase:
import pandas as pd
import numpy as np
import patsy

# workaround so pandas_datareader can find is_list_like on newer pandas
pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas_datareader import data
import quandl
quandl.ApiConfig.api_key = "##############"
Now let’s pull some Apple stock:
df = quandl.get("WIKI/" + 'AAPL', start_date="2014-01-01")
Take a look at the columns and note the one called “Split Ratio”. This is a very important metric; it denotes where stock splits happen. In 2014, Apple decided on a 7:1 split, and we can use Python and pandas to find the date that happened:
len(df)
df['Split Ratio'].value_counts()
df[df['Split Ratio'] == 7.0]
Our culprit is 2014-06-09. Let’s pull only stock prices after that date to keep things simple:
aapl_split = quandl.get("WIKI/" + 'AAPL', start_date="2014-06-10")
aapl_split.head()
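To see what that filter is doing without hitting the API, here is a minimal sketch on a made-up frame. The prices and the `split_dates` helper are hypothetical; only the “Split Ratio” column name and the 7.0 ratio come from the data above, and the sketch assumes a ratio of 1.0 means “no split”.

```python
import pandas as pd

# Toy stand-in for the Quandl frame: values are invented for illustration.
toy = pd.DataFrame(
    {
        "Close": [640.0, 645.6, 93.7, 94.3],
        "Split Ratio": [1.0, 1.0, 7.0, 1.0],
    },
    index=pd.to_datetime(["2014-06-05", "2014-06-06", "2014-06-09", "2014-06-10"]),
)

def split_dates(frame, column="Split Ratio"):
    """Return the index labels of rows whose split ratio differs from 1.0."""
    return frame.index[frame[column] != 1.0]

splits = split_dates(toy)
print(splits)

# Keep only rows strictly after the most recent split.
after_split = toy.loc[toy.index > splits.max()]
print(after_split)
```

Filtering on “not equal to 1.0” rather than “equal to 7.0” means the same helper should also pick up splits with other ratios if you extend the analysis to other tickers.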
By the way, I found a nice list of all S&P 500 tickers on GitHub. If you’d like to expand your analysis to sets of equities, you can load them into a list like so:
f500 = pd.read_csv('https://raw.githubusercontent.com/datasets/s-and-p-500-companies/master/data/constituents.csv')
tickers = f500.Symbol.tolist()
Key Statistics
Augmented Dickey-Fuller
One thing we should check for is the presence of a unit root, which can be done with the Augmented Dickey-Fuller test. In a nutshell, the presence of a unit root implies the series is non-stationary: there’s an underlying (stochastic) trend driving AAPL rather than a fixed mean it keeps returning to, and that trend is something our models will need to account for.
# run ADF to determine unit root
import statsmodels.tsa.stattools as ts
cadf = ts.adfuller(aapl_split.Close)

print('Augmented Dickey Fuller:')
print('Test Statistic =', cadf[0])
print('p-value =', cadf[1])
print('Critical Values =', cadf[4])
Augmented Dickey Fuller:
Test Statistic = -0.731194982176
p-value = 0.838503045276
Critical Values = {'1%': -3.4372231474483499, '5%': -2.8645743628401763, '10%': -2.5683856650361054}
We compare the test statistic above with the critical values; if it’s lower (more negative) than the critical value at our chosen threshold, we reject the null hypothesis that there is a unit root. As you can see with our large p-value, we fail to reject the null: the data are consistent with a unit root for AAPL. This is useful to know, as it tells us the price carries a persistent trend that our models have to handle (for example, by working with price changes rather than raw prices).
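The comparison against the critical values can be written out explicitly. This sketch hard-codes the test statistic and critical values printed above (rounded) rather than re-running the test; `adf_decision` is a hypothetical helper name, not part of statsmodels.

```python
# Values copied (rounded) from the ADF output printed above.
test_statistic = -0.7312
critical_values = {"1%": -3.4372, "5%": -2.8646, "10%": -2.5684}

def adf_decision(stat, crit):
    """Map each significance level to True if the unit-root null is rejected.

    The ADF test is left-tailed: we reject only when the statistic is
    more negative than the critical value.
    """
    return {level: stat < value for level, value in crit.items()}

decision = adf_decision(test_statistic, critical_values)
for level, rejected in decision.items():
    verdict = "reject" if rejected else "fail to reject"
    print(f"{level}: {verdict} the null of a unit root")
```

With the AAPL numbers, the statistic is less negative than all three critical values, so the null is not rejected at any of the listed levels.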
Correlation with other Equities
Apple is considered a luxury technology brand. What if we can determine a strong correlation with other equities?
Note that correlation does not imply causation and there might be the question of which stock is a first-mover, but patterns and relationships are always a good thing for boosting our model performance.
I recommend you look at three stocks, and how AAPL correlates:
- Microsoft (MSFT)
- Intel (INTC)
- Tiffany & Co. (TIF)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

MSFT = quandl.get("WIKI/" + 'MSFT', start_date="2014-06-10")
INTC = quandl.get("WIKI/" + 'INTC', start_date="2014-06-10")
TIF = quandl.get("WIKI/" + 'TIF', start_date="2014-06-10")
For the sake of time here we’ll focus just on the Intel data; let’s plot our closing prices for AAPL and INTC:
# pass x and y by keyword; newer seaborn releases no longer accept them positionally
sns.jointplot(x=INTC.Close, y=aapl_split.Close, kind="reg");
We can also take a look at our correlation value:
np.corrcoef(INTC.Close, aapl_split.Close)
We note an r-value of 0.7434; not bad for prediction, but we need to keep an important fact in mind: if we know INTC’s closing price, we can just look up AAPL’s! So, let’s check the correlation with INTC’s closing price seven days in advance (seven rows, i.e. trading days, in our data) for a more viable metric:
# seven day lead
np.corrcoef(INTC.Close[:-7], aapl_split.Close[7:])
For this run, we note an r-value of 0.7332; still pretty good!
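A note on reading these numbers: np.corrcoef returns a 2x2 matrix, and the r-value is its off-diagonal entry. Here is a minimal sketch that wraps the slicing and indexing in one place, run on synthetic data rather than the real prices. `lagged_corr` is a hypothetical helper, and the lag is counted in rows (trading days), not calendar days.

```python
import numpy as np

def lagged_corr(leader, follower, lag):
    """Pearson r between leader[t] and follower[t + lag], with lag in rows."""
    leader = np.asarray(leader, dtype=float)
    follower = np.asarray(follower, dtype=float)
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag > 0:
        leader, follower = leader[:-lag], follower[lag:]
    # corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
    return np.corrcoef(leader, follower)[0, 1]

# Synthetic example: follower is the leader delayed by 7 rows plus noise.
rng = np.random.default_rng(0)
leader = np.cumsum(rng.normal(size=500))
follower = np.roll(leader, 7) + rng.normal(scale=0.1, size=500)

print(lagged_corr(leader, follower, 0))
print(lagged_corr(leader, follower, 7))
```

With the real data, the equivalent call would be lagged_corr(INTC.Close, aapl_split.Close, 7), assuming both frames cover the same trading days.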
Google Trends
We can compare how Twitter and other repositories of sentiment affect stock prices. For now, let’s see if Google Trends correlates with AAPL. Make sure to specify the time range or use this link for the exact search (note I added a few days into April to deal with half-week issues), then load the CSV into Python:
# Google Trends
aapl_trends = pd.read_csv('/Users/jb/Desktop/multiTimeline.csv', header=1)
aapl_trends.tail()
Note the weekly format, so we need to convert our stock price dataset using pandas.resample():
# convention= only applies to a PeriodIndex, so it is dropped for our date index
aapl_split_week = aapl_split.resample('W').last()
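If the weekly resampling step feels opaque, this small sketch shows it on ten business days of made-up prices. The values are invented; the point is that 'W' bins end on Sunday and .last() keeps the final observation in each bin, which here is Friday’s close.

```python
import pandas as pd

# Ten consecutive business days of made-up closing prices.
days = pd.bdate_range("2018-09-03", periods=10)  # Mon 3 Sep to Fri 14 Sep 2018
toy_close = pd.Series(range(100, 110), index=days, name="Close")

# 'W' bins end on Sunday; .last() keeps the final observation in each bin.
weekly = toy_close.resample("W").last()
print(weekly)
```

Note that the resulting labels are Sundays, not the Fridays the prices came from, which is worth checking against the week labels in any other weekly dataset before lining the two up.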
Now let’s check the correlation and plot for the sum of Google search requests over a given week, with the closing stock price for the last business day of that week:
# trend and price corr
np.corrcoef(aapl_trends['Apple: (Worldwide)'], aapl_split_week.Close)
Oy! We get a minuscule 0.0454, which makes sense as we think about it: AAPL news/activity/chatter doesn’t imply positive things for the stock price. Something like sentiment that has a polarity to it should provide a stronger signal, but we’ll look at that another time.
Final Thoughts
We’ve only scratched the surface of what can be done in the Exploratory Data Analysis (EDA) portion of financial analysis, but in the next post we’ll transition to building predictive models and letting advanced packages do the heavy lifting for us.
I hope you’ve found this helpful, and I’d love to hear from you in the comments:
- Any issues running this code? Sometimes environments and versions can screw things up…
- What packages and techniques do you use?
- What visualizations are most helpful for understanding a stock price’s movement?
- What factors do you think will maximize model predictions?
And finally, if you happen to know a modeling technique that consistently makes tons of money, please direct message me the deets!