Improving my NIFTY 50 Drawdown Prediction Model
A little bit of economic intuition and feature engineering can go a long way
In a previous article, I shared my initial work on the random forest model that I developed to predict drawdowns (returns below -2%) in the upcoming month for the NIFTY 50:
As I said there, it's a work in progress, and I plan to continually improve the model by adding more features, using more robust testing and validation techniques, and optimizing hyperparameters.
In this article, I’ve explained a recent improvement I made to the model and the thought process behind it. Let’s get right into it.
Feature Engineering
I took a step back and started with economic intuition. Here are the features and the intuition that led me to use each of them:
Price Momentum: Short-term and long-term momentum can be effective predictors of future returns
CAPE: The cyclically adjusted P/E ratio can be a proxy for business cycles, and extremely high values can indicate an imminent reversion to the mean (thereby predicting a drawdown)
USD/INR Exchange Rate: USD/INR exchange rate can drive Foreign Institutional Investors in and out of the market. A rising USD/INR exchange rate will mean lower dollar returns for foreign investors.
I also plan to add another feature which I believe is useful:
Buffett Indicator: The ratio of total market cap to GDP. Another variable that we can expect to mean revert. High values can signal an upcoming drawdown.
Now, I want to start with some statistical analyses to understand the relationship between these features and next month's market returns.
Linear relationship between features and output variable
We have monthly data from 1995 to 2025 (30+ years), and we use it to compute the Pearson correlation coefficient for each feature. We also run an OLS regression to get the R-squared and the beta for each variable.
Table 1: Correlation Coefficient, R-squared, and beta for features
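To make these three numbers concrete, here is a minimal sketch of how they can be computed with NumPy. The data below is synthetic; the article's actual inputs are monthly feature values and next-month NIFTY 50 returns, which are not reproduced here.

```python
import numpy as np

def linear_stats(x, y):
    """Pearson correlation, OLS beta, and R-squared of y regressed on x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]                            # Pearson correlation
    beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # OLS slope
    r_squared = r ** 2                 # for simple (one-feature) OLS, R^2 = r^2
    return r, beta, r_squared

# Hypothetical example: a noisy linear relationship over ~30 years of months
rng = np.random.default_rng(0)
x = rng.normal(size=360)
y = 0.5 * x + rng.normal(size=360)
r, beta, r2 = linear_stats(x, y)
```

Note that with a single feature, R-squared is just the square of the correlation coefficient, which is why the table reports all three together.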
One thing that I have always struggled with is understanding how correlation is different from beta. The explanation that I've read is that correlation tells us how strongly two variables move together, while beta tells you how much the output variable changes for a unit change in the feature (input) variable.
Further, R-squared is a metric from OLS regression that measures the amount of variance in the output variable that is explained by the feature. A value of 0.29 means that 29% of the variance in the output is explained by the feature.
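The two measures are linked by the identity beta = r × (sd_y / sd_x), which explains how a real correlation can coexist with a near-zero beta: when the feature's scale is large relative to the output's (as with absolute CAPE levels versus monthly returns), the slope shrinks. A small synthetic demonstration, with made-up values rather than the model's data:

```python
import numpy as np

# A wide-ranging feature (think CAPE levels around 20) driving a
# small-scale output (think monthly returns). Values are hypothetical.
rng = np.random.default_rng(1)
cape = 20 + 5 * rng.standard_normal(360)
ret = 0.002 * (cape - 20) + 0.01 * rng.standard_normal(360)

r = np.corrcoef(cape, ret)[0, 1]     # correlation: clearly positive
beta = np.polyfit(cape, ret, 1)[0]   # OLS slope: tiny, because sd_ret << sd_cape
```

Here the correlation comes out strong while the beta is close to zero, mirroring what the table shows for absolute CAPE.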
High correlation with low beta points to cases where the correlation is real but of little predictive use: the variables move together, yet changes in the feature translate into only tiny changes in the output. We somewhat see this with CAPE, where there's a positive correlation, but the beta is close to 0 and the R-squared is insignificant.
Therefore, I tried using CAPE momentum (i.e., change in CAPE) as a feature instead of absolute CAPE. The economic intuition here is that an increase in P/E is not always problematic, especially if it is driven by a higher pace of earnings growth. But if the pace of P/E growth outpaces earnings growth, the P/E can rise to unreasonably high values. I found that 1-month and 3-month CAPE momentum had a strong linear relationship with next month's return.
Table 2: Correlation Coefficient, R-squared, and beta for CAPE Momentum
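Computing CAPE momentum is a one-liner with pandas. The series below is a stand-in with made-up values, and the column names are my own; the real model uses the monthly CAPE series for the NIFTY 50.

```python
import pandas as pd

# Hypothetical monthly CAPE values (not the actual dataset)
cape = pd.Series([22.0, 22.5, 23.4, 23.1, 24.0, 25.2],
                 index=pd.period_range("2024-01", periods=6, freq="M"))

features = pd.DataFrame({
    "cape_mom_1m": cape.pct_change(1),  # month-over-month change
    "cape_mom_3m": cape.pct_change(3),  # change vs. 3 months ago
})
```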
Results on Validation and Test Data
I use 5-fold cross-validation to train my random forest model. One caveat of using 5-fold cross-validation on time series data is that there will be instances where the model is trained on, say, 2005 to 2015 and validated on 2004. This might seem senseless given that our data is sequential, but it's actually not too bad given the nature of our objective, which is classification. Since every row (every month) already carries features built from the past 3, 6, or 12 months, the correct sequence of information is preserved within each individual row, and that's all that matters to us. If we were building some sort of sequential model, say an LSTM, then 5-fold cross-validation wouldn't make sense for training.
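The setup described above can be sketched with scikit-learn as follows. The feature matrix and labels here are synthetic placeholders, and the hyperparameters (n_estimators, the shuffled KFold) are my assumptions, not the article's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-ins: ~30 years of monthly rows, 6 engineered features,
# binary labels (1 = drawdown in the next month)
rng = np.random.default_rng(42)
X = rng.normal(size=(360, 6))
y = (rng.random(360) < 0.3).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)  # order-agnostic folds
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
```

Because each row already encodes its own history, shuffling rows across folds does not leak future information into the features themselves, which is the point made above.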
The previous model was trained on price momentum features, absolute CAPE values, and USD/INR momentum.
The new model will use only the following features:
Price Return: 3-month and 6-month
USD/INR momentum: 3-month and 6-month
CAPE momentum: 1-month and 6-month
Old Model Results:
Training Data (5-fold cross-validation) Accuracy: 68%
Test Data Accuracy: 81.61%
New Model Results:
Training Data (5-fold cross-validation) Accuracy: 74%
Test Data Accuracy: 82.76%
We see a significant improvement in training results and a slight improvement in test results. It's actually a good sign that reducing the number of features improves training performance: it suggests that much of the dropped data was just noise for the model.
This is also a rather rare case where test accuracy is higher than training accuracy. This is mostly luck in which particular months fall in the test set and how well the model classifies them. But our test period is still quite long (Dec 2017 to Feb 2025), and it includes bull markets, sideways markets, the COVID crash, and a fair bit of volatility.
As a next step, I am collecting historical data to add the Buffett Indicator as a feature. I will share an update once that's done.
Quant India
For first time readers - this market predictor is part of my systematic equities strategy ‘Quant India’ for Indian retail investors.
Its methodology, portfolio, code, and performance are all publicly available. You can use the web app, which is updated every month, to check all of these: