Why decision trees make flat predictions
When you use a decision tree based model such as RandomForest, XGBoost, or LightGBM, it sometimes produces a flat prediction.
This article explains why that happens.
Problem
Under a particular condition, a decision tree model always produces a flat prediction.
First, we prepare monotonically increasing data: y = 0.5x.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# dataset
x = np.linspace(0, 10, 100)
y = x * 0.5

# Set data on graph
plt.plot(x, y, label="y=0.5x")
# Set legend
plt.legend()
# Show graph
plt.show()
Once the data is prepared, plotting it shows a straight line like the one below.
We split this data sequentially: the first 70% becomes training data and the remaining 30% becomes test data, so the test range lies entirely outside the training range.
# split data
train_rate = 0.7
train_size = int(len(x) * train_rate)
x_train, x_test = x[:train_size], x[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

# Show graph
plt.plot(x_train, y_train, label="y=0.5x(train)")
plt.plot(x_test, y_test, label="y=0.5x(test)")
plt.legend()
plt.show()
The linear data is now separated into training and test sets.
Next we train a model on the training data and predict on the test data.
We use RandomForest, a decision tree based model.
# Train and Predict
reg = RandomForestRegressor(random_state=0)
reg.fit(x_train.reshape(-1, 1), y_train)
y_pred = reg.predict(x_test.reshape(-1, 1))

# Show graph
plt.plot(x_train, y_train, label="y=0.5x(train)")
plt.plot(x_test, y_test, label="y=0.5x(test)")
plt.plot(x_test, y_pred, label="y_pred")
plt.legend()
plt.show()
We get a result like the graph below.
The prediction is completely flat and does not match the actual test data.
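You can confirm the flatness numerically. Every test input lies past the largest training value, so all predictions collapse to one value (this uses y_pred from the code above):

# All predicted values collapse to (almost) one constant near the training maximum.
print(y_pred.min(), y_pred.max())  # both around 3.5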
Comparison with Linear Regression
If we use a Linear Regression model instead, the result is not flat: the linear model extrapolates the trend.
# Train and Predict
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train.reshape(-1, 1), y_train)
y_pred = reg.predict(x_test.reshape(-1, 1))

# Show graph
plt.plot(x_train, y_train, label="y=0.5x(train)")
plt.plot(x_test, y_test, label="y=0.5x(test)")
plt.plot(x_test, y_pred, label="y_pred")
plt.legend()
plt.show()
Why the prediction becomes flat
The flat prediction is caused by the architecture of decision trees.
A decision tree follows the rules below.
RandomForest and XGBoost are a little more complicated, but the basic architecture is the same. (A toy sketch follows the list.)
- Check the input value and choose a branch at each node.
- The end of each branch (a leaf) holds a prediction value.
- Both the thresholds and the leaf values are fixed by the training data.
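To make this concrete, here is a toy sketch of a trained regression tree. The thresholds and leaf values below are hypothetical numbers of the kind that would be learned from the y=0.5x training data, not output from a real library.

# Toy sketch of a trained regression tree (hypothetical thresholds and leaves).
# Every number below is fixed at training time and never changes afterwards.
def toy_tree_predict(x):
    if x < 3.0:               # threshold learned from training data
        if x < 1.5:
            return 0.4        # leaf value: mean of training targets in this region
        return 1.1
    else:
        if x < 5.0:
            return 2.0
        return 3.3            # the largest value this tree can ever output

print(toy_tree_predict(7))    # 3.3
print(toy_tree_predict(100))  # still 3.3: inputs beyond the training range land in the last leaf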
The important point is "leaf values are fixed by the training data".
Leaf values are the prediction values, and they are determined once, during the training phase.
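You can inspect these fixed values in scikit-learn by looking inside one tree of the forest. A minimal sketch, re-fitting the same model as above (reg_rf is a name introduced here):

# Re-fit the RandomForest from the earlier example and inspect one of its trees.
reg_rf = RandomForestRegressor(random_state=0)
reg_rf.fit(x_train.reshape(-1, 1), y_train)

tree = reg_rf.estimators_[0].tree_
print(tree.threshold)            # split thresholds (leaves are marked with -2)
print(tree.value.ravel())        # node and leaf values, fixed at training time
print(tree.value.ravel().max())  # the largest value this tree can ever output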
After training, no matter how large the input is, the decision route ends at the same leaf, so the output is always capped at the maximum leaf value.
In the y = 0.5x example above, the maximum y value in the training data is about 3.5, reached around x = 7.
So the trained model can never predict a value above about 3.5.
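A quick check with reg_rf from the inspection sketch above shows the cap directly:

# No matter how large the input gets, the prediction stays at the cap (~3.5).
print(reg_rf.predict(np.array([[7.0]])))    # ~3.5
print(reg_rf.predict(np.array([[100.0]])))  # still ~3.5
print(reg_rf.predict(np.array([[1e6]])))    # still ~3.5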
In other words, a decision tree cannot predict values outside the range of its training data; it cannot extrapolate.
How to solve it
So are decision trees useless?
No. They simply cannot predict values in an unbounded range, so the fix is to choose a prediction target whose range is bounded.
Use differences
Assume x advances in discrete steps, and use the difference between y at step t and step t-1 as the prediction target.
For y = 0.5x this difference is constant, so it always stays within the training range.
import pandas as pd

df_data = pd.DataFrame()
df_data["x"] = x
df_data["y"] = y
# difference between y at step t and step t-1
df_data["y_diff"] = df_data["y"] - df_data["y"].shift(1)
df_data = df_data.dropna()

# split data
df_train, df_test = df_data.iloc[:train_size], df_data.iloc[train_size:]

# Show graph
plt.plot(df_train["x"], df_train["y_diff"], label="y=0.5x(train) diff")
plt.plot(df_test["x"], df_test["y_diff"], label="y=0.5x(test) diff")
plt.legend()
plt.show()
The difference range of the test data is the same as that of the training data, so it is a suitable prediction target.
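We can confirm this numerically with df_train and df_test from the code above:

# For y = 0.5x on this grid, every difference is the same constant (~0.0505),
# so the train and test ranges match exactly.
print(df_train["y_diff"].min(), df_train["y_diff"].max())
print(df_test["y_diff"].min(), df_test["y_diff"].max())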
We then reconstruct y by cumulatively adding the predicted differences to the last known y value, which yields a good prediction.
# Train and Predict difference
reg = RandomForestRegressor(random_state=0)
reg.fit(np.array(df_train["x"]).reshape(-1, 1), df_train["y_diff"])
y_diff_pred = reg.predict(np.array(df_test["x"]).reshape(-1, 1))

# Reconstruct y by adding the predicted differences cumulatively,
# starting from the last y value in the training data
y_pred = []
for i in range(len(df_test["x"])):
    if i == 0:
        y_pred.append(df_train["y"].iloc[-1] + y_diff_pred[i])
    else:
        y_pred.append(y_pred[i - 1] + y_diff_pred[i])
print(y_pred)

# Show graph
plt.plot(df_train["x"], df_train["y"], label="y=0.5x(train)")
plt.plot(df_test["x"], df_test["y"], label="y=0.5x(test)")
plt.plot(df_test["x"], y_pred, label="y=0.5x(pred)")
plt.legend()
plt.show()
Finally
- A flat prediction result comes from the decision tree architecture.
- The predictable range of a decision tree is determined by its training data.
- By predicting a difference or a rate instead of the raw value, you can handle a target with an unbounded range (see the sketch after this list).
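This article demonstrated differences; here is a minimal sketch of the rate variant as well, using hypothetical exponential data y = 1.05^x, where the ratio y_t / y_(t-1) is a bounded constant. All names and numbers here are illustrative assumptions, not from the example above.

# Minimal sketch: predict the ratio y_t / y_(t-1) instead of y itself.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical exponential data: y = 1.05**x, so the true ratio is a constant 1.05.
x2 = np.arange(100)
y2 = 1.05 ** x2
ratio = y2[1:] / y2[:-1]  # bounded target: ratio at steps t = 1..99

split = 70
reg2 = RandomForestRegressor(random_state=0)
reg2.fit(x2[1:split].reshape(-1, 1), ratio[:split - 1])
ratio_pred = reg2.predict(x2[split:].reshape(-1, 1))

# Reconstruct y by cumulative multiplication from the last training value.
last = y2[split - 1]
y2_pred = []
for r in ratio_pred:
    last = last * r
    y2_pred.append(last)
print(y2_pred[:3])          # predicted continuation
print(y2[split:split + 3])  # actual values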