How To Clean Price Data for Backtesting

Cleaning data for backtesting is not easy but its very necessary to get meaningful results.    Mis-adjusted price splits can skew the price data and mislead the unwary backtester into thinking they’re found the holy grail when the strategy merely happens to catch the good side of a bad gap.

Here’s the steps to screen out dirty data and produce a clean dataset:

1. Pick at least 3 candidate data vendors.

2. Format the data for comparison.

3. Write a program to do a smart comparison and run it on the 3 candidate data sets.

4. Analyze the mis-compares to see which set is in error.   if 2 of 3 sets agree, assume that’s the correct value and the outlier is wrong.

5. Send feedback to the data vendors so they can fix the errors.

6. Select the set of historical price data to use for backtesting and lock it down to prevent changes during the backtesting.

7. Feed the golden price data to the backtesting engine.

This process took me several weeks of work but was worth it to get accurate results.  There’s little point of going to the work of backtesting if the underlying data is riddled with errors.

Read on for details if you are going to attempt this on your own or if you just want to see what preparations go into serious backtesting.When trading live and personally looking at charts, its easy enough to spot dirty data.  Large gaps on a price chart are eye catching and you can check the news on a recent chart.  Obviously some price gaps are real but a few are errors — usually a split or special dividend not adjusted properly.  An alert and patient human can sort that out case by case.  With computerized backtesting covering over a decade, dirty data is much more prevalent, harder to detect, and harder to fix. 

Step 1: Pick candidate data vendors.   The data that comes with the backtesting engines is a natural first choice and yes it needs to be cleaned!    I have access to TradeStation and WordenTelechart (a.k.a. Blocks, Backscanner, StockFinder).   That’s two sets of data but I wanted an independent source so I did a study of the historical price data “food chain”.   To summarize, all data originates from the exchanges and data companies capture the data as reported by the exchanges.  The captured data is offered for resale (by agreement with the exchanges).  CSI Data impressed me because they supply data for the big websites such as Yahoo, MSN, Google and I believe that millions of eyes on the data will help root out errors.   Also, CSI Data offered delisted data, which gives a more accurate view for backtesting.  (CSI Data has since priced the delisted data out of reach of most private individuals, once they realized how much institutions would pay for it.  I feel very lucky to have gotten 14 years of delisted data for a mere four figures.) 

Step 2: Dump out the data.   To get the historical price data ready for comparison, you need to dump it out of the software tools.   Telechart makes it easiest with the Export to Text capability under the Databank menu item.    CSI Data is extremely flexible about writing data too.   TradeStation has no facility for writing out data.  I had to create a simple Easy Language script that writes the Close, High, Low, Open, and Volume for each day to a text file for each symbol.  The ticker symbol is the name of the text file and is not listed inside the files.  Take care to match the format in all three sets of data.   I wrote mine in this order: Date, C, H, L, O, V.  Its very important to always use split-adjusted data.  Another word to the wise:  each vendor uses a different units of measure for volume - you need to adjust accordingly.

Step 3: Write a program to compare the data.   With programs like tkdiff available for free on the web, I thought this step would be easy, but I thought wrong!   First of all, with nearly 7500 tickers, it takes way too long to manually load the files into tkdiff.  I needed to automate the comparison.  Once I automated, I found that minor differences of a penny or two happen almost weekly.   (This can happen, for example, if one vendor takes the Close as the last executed price and another vendor uses the middle of the last spread between bid and ask price as the Close.)  I very quickly decided that I didn’t want to know about small differences in the data between the three vendors.  That’s not going to have a material effect on backtesting results.   What does matter, however, is the huge gaps that cropped up from time to time.    To identify those, I wrote a fuzzy diff program.  It compares the data from two vendors and flags mis-compares, sorting into major errors ( values that are more than $0.04 off), minor errors (less than $0.04) and no errors.    I ran the fuzzy diff twice:  once comparing CSI and TradeStation, and again comparing CSI and Worden data sets.

Step 4: Analyze the mis-compares.   The previous step yielded a list of price data points that didn’t compare between vendors.   I painstakingly researched each of the major errors, looking at the two price charts and researching the news on the ticker around the time of the error.    In most cases, it was clear which was the faulty data set.

Step 5: Feedback to the data vendors.   I decided to be a good citizen and report most of the data errors I had found.   CSI Data had the fewest errors, promptly responded to my feedback and generally defended the correctness of their data.    TradeStation has a dedicated staff that quickly verified and fix all errors I reported.   Worden had the most errors and didn’t respond at all when I pointed them out.    Needless to say, I am very wary of using the Worden data now.

Step 6: Select the final historical price data.   In the end, I came up with a final list of stock tickers.   A few tickers had glaring mis-compares and no obvious culprit so I deleted them from my list.   I designated the cleaned set of CSI data (plus the un-checked delisted data) as my golden historical price data set.   I keep that locked down in a seperate directory to avoid inadvertent changes.  (The vendors update data at least daily and may fix errors.   Even a fix is unwelcome once backtesting begins — each strategy needs to perform on exactly the same data to make comparisons between strategies.)   

Step 7: Feed the golden price data to the backtesting engine.   TradeStation, my backtesting engine of choice, runs on its own data by default.   To use an external dataset, you have to use the Symbol Lookup->3rd Party tab to point to the data and also set up attribute and data order parameter files to tell TradeStation how to read it.   See the TradeStation help files under 3rd Party Data for complete instructions.

This arduous process helps you get accurate, high-quality results from backtesting.

December 16th, 2008 Filed under Backtesting Set Up

Tags: , , , , , , , , , , ,


Related posts:
  • My BackTesting Engine Evaluation in 2007
  • BackTesting Moving Averages
  • ATR Trailing Stop Definition
  • Curve-Fitting Definition
  • Naming Convention Definition
  • 2 Responses to “How To Clean Price Data for Backtesting”

    1. Pat Thorn | 4/02/10

      I am a programmer new to Forex and was most interested in your appraisal of MACD as an indicator, i have read a lot about this indicator on the web. I have found if i combine MACD with Fast Stochastic it is amazing my results so far, still analysing, haven’t bitten the bullet yet. I am going to start this coming monday.

      Is your backtesting data the actual tick or price at the time it happens as opposed to the Open,Close, High and Low per time frame? From wat i have read i believe it is.

      Being a programmer that knows a bit about data mining i can appreciate the amount of time and effort you have expended. I congratulate your tenacity.

      If my startegy as described above works, i have you to thank for that.
      Cheers

    2. backtester | 4/02/10

      Hey Pat
      Thanks for your gracious comments. I wish you well with your trading, whatever strategy you choose.

      My backtesting data is all end of day, just Open, High, Low, Close. Mostly my backtests are entering Market on Open except stop losses so it is admittedly coarse.

      I’ve got more info on MACD (as it applies to stocks) at http://www.truthaboutmacd.com Have not yet applied MACD to Forex.

      Glad to hear you are backtesting and assessing the situation thoroughly before risking cash.

      Cheers!
      Jackie

    Share Your Thoughts