機械学習

Decision tree makes flat prediction.

Decision tree makes flat prediction.

Decision tree makes flat prediction.
When you use decision tree model like RandomForest or XGBoost or LightGBM, sometimes it makes flat result.
This article explains why it happens.


Problem

If we prepare particular condition, decision tree always makes flat result.


Decision tree always makes flat result

First we prepare increasing data like y=0.5x .

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# dataset
x = np.linspace(0, 10, 100)
y = x * 0.5

# Set data on graph
plt.plot(x, y, label="y=0.5x")
# Set legend
plt.legend()
# Show graph
plt.show()

Once data is prepared, it shows linear graph like below.

Linear data

We split this data into 70% training data and remaining test data.

# split data
train_rate = 0.7
train_size = int(len(x) * train_rate)
x_train, x_test, y_train, y_test = x[0:train_size], x[train_size:len(x)], y[0:train_size], y[train_size:len(y)]

# Show graph
plt.plot(x_train, y_train, label="y=0.5x(train)")
plt.plot(x_test, y_test, label="y=0.5x(test)")
plt.legend()
plt.show()

Then linear data is separated into training data and test data.

Linear data (training and test)

And we train model with trainig data and predict with test data.
We use a kind of Decision tree model RandomForest.

# Train and Predict
reg = RandomForestRegressor(random_state=0)
reg = reg.fit(np.array(x_train).reshape(-1,1), y_train)
y_pred = reg.predict(np.array(x_test).reshape(-1,1))

# Show graph
plt.plot(x_train, y_train, label="y=0.5x(train)")
plt.plot(x_test, y_test, label="y=0.5x(test)")
plt.plot(x_test, y_pred, label="y_pred")
plt.legend()
plt.show()

Then we get result like below.
Prediction result shows flat. It is different from actual test data.

Result of RandomForest


Comparing to Linear Regression

If we use Linear Regression model, it does not show flat result.

# Train and Predict
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg = reg.fit(np.array(x_train).reshape(-1,1), y_train)
y_pred = reg.predict(np.array(x_test).reshape(-1,1))

# Show graph
plt.plot(x_train, y_train, label="y=0.5x(train)")
plt.plot(x_test, y_test, label="y=0.5x(test)")
plt.plot(x_test, y_pred, label="y_pred")
plt.legend()
plt.show()

Result of LinearRegression


Reason why predicted data become flat

Reason why predicted data become flat is architecture of Decision tree.

Rule of Decision tree is below.
RandomForest and XGBoost are little bit more compricated.
But base of the architecture is same.

  • Check input value and choose branch.
  • End of branch has prediction result.
  • Threshold values and edge values are fixed by training data.

Important thing is "edge values are fixed by training data".
Edge values are prediction values.
They are fixed in training phase.
After training, even we input bigger data, decision route and result is always maximum ome.

In last example of y=0.5x, maximum value of training data is 3.5 in case of x=7.
So trained model can't predict more than 3.5.

It means Decision tree can't predict different range from training data.


How to solve

So is Decision tree useless ?
No, it is not correct.

It can't predict unlimited range value.
So we can use limited range value as prediction target.


Use difference

Assume x is separated value and use difference between y of t and t-1.
So we can get stable difference for the example of y=0.5x.

import pandas as pd
df_data = pd.DataFrame()
df_data["x"] = x
df_data["y"] = y
df_data["y_diff"] = df_data["y"] - df_data["y"].shift(1)
df_data = df_data.dropna()

# split data
df_train, df_test = df_data.iloc[0:train_size], df_data.iloc[train_size:] 

# Show graph
plt.plot(df_train["x"], df_train["y_diff"], label="y=0.5x(train) diff")
plt.plot(df_test["x"], df_test["y_diff"], label="y=0.5x(test) diff")
plt.legend()
plt.show()

Difference of Linear data

Difference range of test data is same as range of training data.
So it is suitable for prediction target.
And we add predicted difference to original y.
Then we get good prediction.

# Train and Predict difference
reg = RandomForestRegressor(random_state=0)
reg = reg.fit(np.array(df_train["x"]).reshape(-1,1), df_train["y_diff"])
y_diff_pred = reg.predict(np.array(df_test["x"]).reshape(-1,1))

# Add difference
y_pred = []
for i in range(0,len(df_test["x"])):
    if i == 0:
        y_pred.append(df_train["y"].iloc[-1] + y_diff_pred[i])
    else:
        y_pred.append(y_pred[i-1]+y_diff_pred[i])

print(y_pred)

# Show graph
plt.plot(df_train["x"], df_train["y"], label="y=0.5x(train)")
plt.plot(df_test["x"], df_test["y"], label="y=0.5x(test)")
plt.plot(df_test["x"], y_pred, label="y=0.5x(pred)")
plt.legend()
plt.show()

Result of RandomForest with using difference


Finally

  • Flat prediction result is due to Decision tree architecture.
  • Predictable range of Decision tree depends on training data.
  • With using difference or rate, you can predict unlimited range target.


コチラもオススメ

KRSW

駆け出し機械学習エンジニア。機械学習、DB、WEBと浅く広い感じ。 Junior machine learning engineer. Not a specialist but a generalist who knows DB, WEB too.

If you felt this article is useful, please share.

にほんブログ村 IT技術ブログへ

-機械学習
-,

Translate »

Copyright© ITips , 2020 All Rights Reserved.