决策树

决策树是一种常用的分类与回归方法，以树状结构表示数据分类的过程。每个内部节点表示一个特征或属性上的测试，每个分支代表这个特征的一个可能值，而每个叶节点代表一种分类结果。

算法原理

决策树的构建基于以下基本思想：

选择一个最优特征，将训练数据集分割成子集
对子集重复上述过程，直到所有样本属于同一类别，或者没有更多特征可用
生成一颗从根节点到叶节点的树，每个叶节点代表一个分类结果

特征选择度量

决策树算法中，特征选择的关键是使用不同的评估标准：

信息增益（Information Gain）

基于熵（entropy）概念，计算划分前后熵的减少量：

Gain(D, a) = H(D) - \sum_{v} \frac{|D^v|}{|D|} H(D^v)

其中， $H(D)$ 是数据集 $D$ 的熵：

H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k

增益率（Gain Ratio）

为了克服信息增益偏向多值属性的问题，C4.5算法使用增益率：

GainRatio(D, a) = \frac{Gain(D, a)}{SplitInfo(D, a)}

其中， $SplitInfo(D, a)$ 是:

SplitInfo(D, a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}

基尼指数（Gini Index）

CART算法使用基尼指数来选择特征：

Gini(D) = 1 - \sum_{k=1}^{K} p_k^2

特征 $a$ 的基尼指数：

Gini\_index(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} Gini(D^v)

常见决策树算法

ID3算法：使用信息增益选择特征，无法处理连续特征
C4.5算法：使用增益率选择特征，可以处理连续特征和缺失值
CART算法：使用基尼指数，可以进行分类和回归，生成二叉树

代码实现

使用scikit-learn库

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn import tree

# 加载数据集
iris = load_iris()
X = iris.data
y = iris.target

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 创建决策树分类器
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)

# 训练模型
clf.fit(X_train, y_train)

# 预测
y_pred = clf.predict(X_test)

# 计算准确率
accuracy = accuracy_score(y_test, y_pred)
print(f"决策树分类器的准确率: {accuracy:.4f}")

# 可视化决策树
plt.figure(figsize=(15, 10))
tree.plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("决策树可视化")
plt.savefig('decision_tree_vis.png')
plt.show()

从零实现简单决策树

import numpy as np
from collections import Counter

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # 分割特征的索引
        self.threshold = threshold  # 分割阈值
        self.left = left            # 左子树
        self.right = right          # 右子树
        self.value = value          # 叶子节点的值

class DecisionTree:
    def __init__(self, max_depth=100, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None
      
    def fit(self, X, y):
        self.n_features = X.shape[1]
        self.root = self._grow_tree(X, y)
      
    def _grow_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))
      
        # 停止条件
        if (depth >= self.max_depth or n_samples < self.min_samples_split or n_classes == 1):
            leaf_value = self._most_common_label(y)
            return Node(value=leaf_value)
      
        # 寻找最佳分割
        best_feature, best_threshold = self._best_split(X, y, n_features)
      
        # 创建子树
        left_idxs = X[:, best_feature] <= best_threshold
        right_idxs = X[:, best_feature] > best_threshold
      
        left = self._grow_tree(X[left_idxs], y[left_idxs], depth + 1)
        right = self._grow_tree(X[right_idxs], y[right_idxs], depth + 1)
      
        return Node(best_feature, best_threshold, left, right)
  
    def _best_split(self, X, y, n_features):
        best_gain = -1
        split_feature, split_threshold = None, None
      
        for feature in range(n_features):
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                gain = self._information_gain(X, y, feature, threshold)
                if gain > best_gain:
                    best_gain = gain
                    split_feature = feature
                    split_threshold = threshold
                  
        return split_feature, split_threshold
  
    def _information_gain(self, X, y, feature, threshold):
        # 计算父节点的熵
        parent_entropy = self._entropy(y)
      
        # 根据阈值划分左右子树
        left_idxs = X[:, feature] <= threshold
        right_idxs = X[:, feature] > threshold
      
        if np.sum(left_idxs) == 0 or np.sum(right_idxs) == 0:
            return 0
      
        # 计算子节点的熵
        n = len(y)
        n_l, n_r = np.sum(left_idxs), np.sum(right_idxs)
        e_l, e_r = self._entropy(y[left_idxs]), self._entropy(y[right_idxs])
        child_entropy = (n_l / n) * e_l + (n_r / n) * e_r
      
        # 计算信息增益
        information_gain = parent_entropy - child_entropy
        return information_gain
  
    def _entropy(self, y):
        hist = np.bincount(y)
        ps = hist / len(y)
        return -np.sum([p * np.log2(p) for p in ps if p > 0])
  
    def _most_common_label(self, y):
        counter = Counter(y)
        return counter.most_common(1)[0][0]
  
    def predict(self, X):
        return np.array([self._traverse_tree(x, self.root) for x in X])
  
    def _traverse_tree(self, x, node):
        if node.value is not None:
            return node.value
      
        if x[node.feature] <= node.threshold:
            return self._traverse_tree(x, node.left)
        return self._traverse_tree(x, node.right)

决策树示例图解

下面是一个决策树的示例图解：

该图展示了一个基于特征条件的决策路径，从根节点开始，根据特征值的判断，沿着相应的分支向下，最终到达叶节点获得分类结果。

剪枝技术

为了避免过拟合，决策树通常需要进行剪枝：

预剪枝：在决策树生成过程中就进行剪枝，例如设置最大深度、最小样本数等
后剪枝：先生成完整的决策树，然后自底向上进行剪枝，如果剪掉某个节点后错误率降低，则进行剪枝

# scikit-learn中预剪枝示例
clf = DecisionTreeClassifier(
    max_depth=4,                  # 树的最大深度
    min_samples_split=5,          # 分割内部节点所需的最小样本数
    min_samples_leaf=2,           # 叶节点所需的最小样本数
    max_features='sqrt',          # 寻找最佳分割时考虑的特征数
    random_state=42
)

决策树的优缺点

优点

易于理解和解释，可以被可视化
需要很少的数据预处理，不需要归一化特征
可以处理数值型和类别型特征
能够处理多输出问题
使用白盒模型，结果容易解释

缺点

容易过拟合，特别是树很深的时候
可能创建一个过于复杂的树，不能很好地泛化数据
决策树可能不稳定，数据小变动可能导致树结构的大变化
寻找最优决策树是NP完全问题，实际使用的是启发式算法
有些概念难以学习，如XOR、奇偶校验或多重异或

应用场景

决策树广泛应用于：

医疗诊断
信用风险评估
客户流失预测
欺诈检测
推荐系统

小结

决策树是一种直观且强大的分类与回归方法，通过层层分割特征空间来构建预测模型。尽管它有过拟合的风险，但通过剪枝和集成方法（如随机森林）可以有效地改善这个问题。决策树算法因其可解释性和易用性，在机器学习实践中非常受欢迎。

算法原理​

特征选择度量​

信息增益（Information Gain）​

增益率（Gain Ratio）​

基尼指数（Gini Index）​

常见决策树算法​

代码实现​

使用scikit-learn库​

从零实现简单决策树​

决策树示例图解​

剪枝技术​

决策树的优缺点​

优点​

缺点​

应用场景​

小结​