Data Mining Project

I haven’t been posting as much as I’d like lately. I’ve been pretty caught up with everything: I’ve been reading a couple of books to learn R, the command line interface (I recently switched from Windows to Linux), and data mining, all while taking multiple courses. I’ve also been working on a business with two of my friends and on my data mining project.

For my data mining project I will try to predict the direction (and, to some extent, the magnitude) of daily open prices of SPY. It’s not actually restricted to SPY, since I will be able to run the R code on any security of my choosing; however, I am familiar with SPY, it has plentiful data for testing, and the fact that it’s an ETF means I am less likely to over-fit a model than I would be with an individual security. I chose opening prices for my data set (some of my indicators will use OHLC data, but the target is the opening price, not the closing price) because it is much easier for me to place at-open trades than at-close trades: I am often not near internet access at closing time.

I am also trying to predict the price one trading day ahead (the next open) rather than over some longer horizon, because I believe that shorter time frames are easier to predict. This belief comes from the observation that daily volatility is easier to predict than weekly or monthly volatility using GARCH modelling techniques. I chose not to go into intra-day data because it is extremely expensive and I do not have the resources to purchase it. I could use open prices to predict the same day’s close, and then the current day’s close to predict the next day’s open, but given my lack of internet access around the close, I have decided against this.

I am still not finished with my project, but I am nearing completion. The hardest part, which was learning R and introductory data mining, is now over, and I just have to code up the testing procedures. I’ve decided to treat this as a classification problem with three classes: bull, neutral, and bear. Bull will be when SPY increases by at least 1%. Bear will be when SPY decreases by at least 1%. Neutral will be everything else. I reason that this will not only be a more useful prediction (since 0.5% profits will be eaten up quickly by commissions), but it will also be easier to predict, because I assume that signals will be clearer near the extremes than in the middle (this assumption has no data backing it, to my knowledge). I am hesitant to use static values, since a fixed 1% cutoff makes the model less useful on other markets, but I’m not sure what market-normalized measure to use instead.
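To make the three classes concrete, here is a minimal sketch of the labelling step. It is written in Python rather than the R used for the project, the function name `label_moves` is made up for illustration, and the prices are invented rather than real SPY data. It assumes each day is labelled by the open-to-open return to the next day.

```python
def label_moves(opens, threshold=0.01):
    """Label each day by the open-to-open return to the following day.

    Assumption for illustration: a move of at least +threshold is "bull",
    at most -threshold is "bear", and anything in between is "neutral".
    """
    labels = []
    for today, tomorrow in zip(opens, opens[1:]):
        ret = tomorrow / today - 1.0  # simple return from this open to the next
        if ret >= threshold:
            labels.append("bull")
        elif ret <= -threshold:
            labels.append("bear")
        else:
            labels.append("neutral")
    return labels


# Made-up opening prices, not real SPY data.
opens = [100.0, 101.5, 101.0, 99.0, 99.2]
print(label_moves(opens))  # ['bull', 'neutral', 'bear', 'neutral']
```

The last price has no label because there is no following open to compare it with. The cutoff is a parameter, so a different value can be tried on other markets.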

I chose classification because I assume (without evidence) that the secondary variables do not have a scalable quantitative relationship with price movements. What I mean by this is that an indicator has to cross some threshold value before it gains predictive power (which may or may not be quantitatively related to price movements), and below that threshold all predictive power is lost.

I am unsure how to use direction-less magnitude predictors (volatility predictors). These predictors say something about whether the market is in a bull/bear state or a neutral state, but not about the particular direction. I’m thinking about splitting this into two prediction tasks (whether the next day will be low or high volatility, and whether the next day will be bull or bear), but this creates problems of its own, specifically that the sum of the parts does not always equal the whole.
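One possible way to stitch the two tasks back together, sketched in Python under my own assumptions (the function `combine` is hypothetical, not something from the project): treat the volatility prediction as a gate, and only consult the direction prediction when a large move is expected.

```python
def combine(high_vol, up):
    """Map two binary predictions onto the three classes.

    Assumption for illustration: a low-volatility call means "neutral"
    regardless of direction; a high-volatility call defers to direction.
    """
    if not high_vol:
        return "neutral"
    return "bull" if up else "bear"


# Enumerate all four combinations of the two predictions.
for high_vol in (False, True):
    for up in (False, True):
        print(high_vol, up, "->", combine(high_vol, up))
```

This also shows where the parts can fail to add up: a bull or bear call is only right when both the volatility and the direction predictions are right, so errors in the two tasks compound.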

There is also the problem that bear markets are often characterized by high volatility, while bull markets are often characterized by low volatility. The model should be able to catch this, but my concern is that a trading system built on this model will sit idle during bull markets. Maybe I will go long SPY or XIV while the model is inactive, and exit when it becomes active again.

My next step is to define an objective function, that is, something to grade the different models by. I will probably use a system’s CAGR/MDD, its Sharpe ratio, or some other commonly used backtesting metric, but I am also considering common data mining metrics such as precision and recall.
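For reference, here is a rough Python sketch of the candidate metrics using their textbook definitions. The helper names are made up, 252 trading days per year is an assumption, the Sharpe ratio here assumes a zero risk-free rate, and the sample numbers are invented.

```python
import math


def cagr(equity, periods_per_year=252):
    """Compound annual growth rate of an equity curve (one value per period)."""
    years = (len(equity) - 1) / periods_per_year
    return (equity[-1] / equity[0]) ** (1.0 / years) - 1.0


def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def precision_recall(actual, predicted, cls):
    """Precision and recall for one class, e.g. "bull"."""
    tp = sum(a == cls and p == cls for a, p in zip(actual, predicted))
    predicted_pos = sum(p == cls for p in predicted)
    actual_pos = sum(a == cls for a in actual)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall


# Invented data, purely to show the calls.
equity = [100.0, 110.0, 99.0, 120.0]
actual = ["bull", "neutral", "bear", "bull"]
predicted = ["bull", "bull", "bear", "neutral"]
print(cagr(equity) / max_drawdown(equity))          # CAGR/MDD
print(precision_recall(actual, predicted, "bull"))  # (0.5, 0.5)
```

The backtesting metrics grade the trading system as a whole, while precision and recall grade each class separately, so the two kinds of metric can rank the same models differently.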

