{"id":981,"date":"2021-10-21T15:13:13","date_gmt":"2021-10-21T15:13:13","guid":{"rendered":"https:\/\/firemymoneymanager.com\/?p=981"},"modified":"2022-04-01T01:53:39","modified_gmt":"2022-04-01T01:53:39","slug":"principal-component-analysis-predict-stock-returns","status":"publish","type":"post","link":"https:\/\/firemymoneymanager.com\/principal-component-analysis-predict-stock-returns\/","title":{"rendered":"Can principal component analysis predict stock returns? [2021]"},"content":{"rendered":"\n
In this article we will take a look at principal component analysis. Principal component analysis (or PCA) is a tool used in many disciplines to find patterns in data. It can either be used as part of a machine learning algorithm, or it can be used on its own. <\/p>\n\n\n\n
Wikipedia defines principal component analysis like this:<\/p>\n\n\n\n
Principal component analysis<\/strong> (PCA<\/strong>) is the process of computing the principal components and using them to perform a change of basis<\/a> on the data, sometimes using only the first few principal components and ignoring the rest.<\/p>Wikipedia<\/cite><\/blockquote>\n\n\n\n
Essentially, it uses the eigenvectors and eigenvalues of the data’s covariance matrix to find a small set of orthogonal directions that together capture most of the variance in the data. We won’t get too much into the math behind it, but we have linked to some useful articles below. <\/p>\n\n\n\n
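For readers who want to see the eigenvector step concretely, here is a minimal sketch using NumPy only. The data is randomly generated for illustration (it is not stock data), and the variable names are our own assumptions rather than anything used later in the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 samples of 3 features that all share one common driver
base = rng.normal(size=(200, 1))
data = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

# Center the data and compute the covariance matrix of the features
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Eigen-decomposition of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance captured by each principal component
explained_ratio = eigenvalues / eigenvalues.sum()

# Change of basis: coordinates of each sample along the first two components
scores = centered @ eigenvectors[:, :2]

print(explained_ratio)
print(scores.shape)
```

Because the three features share a single driver, the first component should capture nearly all of the variance, which is the situation in which dropping the remaining components loses little information.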
Several academic papers have suggested that this type of analysis can generate factors which predict asset prices. In this article we will determine if that’s still true.<\/p>\n\n\n\n
Suggested readings<\/h2>\n\n\n\n
- Performance measurement with the arbitrage pricing theory<\/a><\/li>
- Risk and return in equilibrium APT<\/a><\/li>
- Linear algebra review (if you want to understand the math)<\/a><\/li>
- Getting started: using Python to find alpha [2021]<\/a><\/li>
- Do CAPM efficient portfolios really outperform random ones? [2021]<\/a><\/li><\/ul>\n\n\n\n
<\/p>\n\n\n\n
Modeling stock returns with 2 factor PCA<\/h2>\n\n\n\n
We begin with a basic model of stock returns. We will limit this model to three months: two input months and one month to test the results. <\/p>\n\n\n\n
There are different ways to use PCA on stock data. We can use the stocks as features and the stock prices at certain dates as samples, or we can use the dates as features and each stock as a sample. Each way provides valuable information.<\/p>\n\n\n\n
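The two orientations differ only in which axis of the return matrix is treated as the sample axis. The sketch below illustrates this with a small made-up return matrix and a hypothetical helper, pca_scores, written with NumPy’s SVD; it is an illustration under those assumptions, not the code used in the rest of this article.

```python
import numpy as np


def pca_scores(matrix, n_components):
    """Treat rows as samples and columns as features; return each row's
    coordinates along the top principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance explained
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(1)

# Made-up monthly returns: 12 dates (rows) by 5 stocks (columns)
returns = rng.normal(scale=0.05, size=(12, 5))

# Dates as samples, stocks as features: one point per date
by_date = pca_scores(returns, 2)

# Stocks as samples, dates as features: one point per stock
by_stock = pca_scores(returns.T, 2)

print(by_date.shape)   # (12, 2)
print(by_stock.shape)  # (5, 2)
```

Transposing the matrix is all it takes to switch between the two views, but the resulting components answer different questions: the first groups dates that behaved alike, the second groups stocks that behaved alike.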
In the papers linked above, the authors build a matrix with time periods as columns and stocks as rows to perform their PCA. We use the following code to create a similar matrix:<\/p>\n\n\n\n
\nimport pandas as pd\n\npd.set_option('display.max_rows', 100)\npd.set_option('display.max_columns', 100)\n\n# Keep only currently listed tickers on the major US exchanges from the SF1 table\ntickers = pd.read_csv(\"tickers.csv\")\ntickers = tickers.loc[tickers.exchange.isin(['NYSE', 'NYSEARCA', 'NYSEMKT', 'NASDAQ'])]\ntickers = tickers.loc[tickers.isdelisted == 'N']\ntickers = tickers.loc[tickers.table == 'SF1']\n\n# Read one price file per ticker, skipping tickers that have no file\nframes = []\nfor t in tickers['ticker']:\n    print(t)\n    try:\n        data = pd.read_csv(t + \".csv\",\n                           header=None,\n                           usecols=[0, 1, 12],\n                           names=['ticker', 'date', 'adj_close'])\n        frames.append(data)\n    except FileNotFoundError:\n        pass\n\n# DataFrame.append was removed in pandas 2.0, so concatenate once at the end\nalldata = pd.concat(frames)\ndf = alldata.set_index('date')\n\n# One column per ticker, one row per date\ntable = df.pivot(columns='ticker')\ntable.columns = [col[1] for col in table.columns]\n\n#table = table['2010-01-04':'2020-08-18']\ntable = table['2014-01-01':]\n\ntable.index = pd.to_datetime(table.index)\n# Missing prices are filled with 1; note this can create artificial returns for tickers with gaps\ntable.fillna(1, inplace=True)\n\n# Month-end percent changes, transposed so stocks are rows and dates are columns\nt = table.resample('BM').last().pct_change().transpose()\n\nt = t[['2019-06-28','2019-07-31','2019-08-30']]\n\nx = t[['2019-06-28','2019-07-31']]\ny = t['2019-08-30']<\/pre>\n\n\n\nThis gives us one row per stock and one column for each of the three month-end dates in 2019, with each cell holding that month’s percent change. Now let’s use sklearn’s PCA class to run a two-component PCA. In this code we first scale the data using the StandardScaler, then run the PCA to get an array of each stock’s coordinates along the two principal components:<\/p>\n\n\n\n
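from sklearn import preprocessing as pre\nfrom sklearn import decomposition as dec\n\n# Standardize each input month, then project every stock onto two principal components\nx = pre.StandardScaler().fit_transform(x)\npca = dec.PCA(n_components = 2)\nvectors = pca.fit_transform(x)<\/pre>\n\n\n\nThat’s it! The actual PCA is done. The vectors <\/em>array now contains one row per stock, holding that stock’s coordinates along the two principal components. The component directions themselves are stored in pca.components_. Because x has only two columns, keeping two components rotates the data without discarding any information. <\/p>\n\n\n\n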
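To make it clearer what sklearn returns at this step, here is a small self-contained sketch on made-up data. The names demo_x, demo_pca and demo_scores are hypothetical stand-ins, and the random numbers are not real returns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Stand-in for 50 stocks with two monthly returns each (random, not real data)
demo_x = rng.normal(size=(50, 2))

scaled = StandardScaler().fit_transform(demo_x)
demo_pca = PCA(n_components=2)

# fit_transform returns the projected samples, one row per stock
demo_scores = demo_pca.fit_transform(scaled)

print(demo_scores.shape)                    # (50, 2)
# The principal directions live on the fitted object, one row per component
print(demo_pca.components_.shape)           # (2, 2)
# Fraction of variance captured by each component
print(demo_pca.explained_variance_ratio_)
```

With two input columns and two components, the explained variance ratios sum to one, so the output is the same data expressed in a rotated coordinate system.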
We would really like (at least with this first test) to visualize the data to see what exactly the PCA function did. Let’s plot the vectors (on the x and y axes) along with whether the returns are positive (shown in red) or negative (shown in black) over the period. If the PCA function successfully found explanatory components, we should see some separation of the data:<\/p>\n\n\n\n