SVM Tuning in Detail
Among the kernel functions of the support vector machine (SVM), the Gaussian kernel (RBF) is the most commonly used. In theory, RBF is never worse than the linear kernel, but in practice it comes with several important hyperparameters that have to be tuned, and if they are tuned poorly the result can be worse than a linear kernel. That is why, in practice, we choose the linear kernel whenever it already gives good results. When the linear kernel is not good enough we turn to RBF, and before enjoying its good performance on nonlinear data we have to choose its main hyperparameters. This post summarizes how to tune the RBF-kernel SVM in scikit-learn.
1 Overview of the main SVM RBF hyperparameters
For the SVM classification model there are two such hyperparameters: the penalty coefficient C and the RBF kernel coefficient γ. With NuSVC, the penalty coefficient C is replaced by nu, an upper bound on the fraction of training errors; since C and nu play equivalent roles, this post only discusses the classification SVM with the penalty coefficient C.
1.1 The SVM classification model
(1) The penalty coefficient C
The penalty coefficient C is the coefficient of the slack variables ξ discussed in the previous post. In the optimization objective it balances the complexity of the support-vector model against the misclassification rate, and can be thought of as a regularization-related coefficient (a larger C means weaker regularization). When C is large, misclassified points contribute a large loss, which means we are unwilling to give up on far-away outliers; more samples end up as support vectors, the support-vector/hyperplane model becomes more complex, and overfitting becomes more likely. When C is small, we essentially ignore those outliers, fewer samples are chosen as support vectors, and the final model is simpler. The default value in scikit-learn is 1.
(2) The RBF kernel coefficient γ
The other hyperparameter is the RBF kernel parameter γ. Recall the RBF kernel K(x, z) = exp(−γ‖x − z‖²) with γ > 0: γ mainly determines how much influence a single sample has on the separating hyperplane. When γ is small, a single sample has little influence on the hyperplane and is less likely to be chosen as a support vector; when γ is large, a single sample has a larger influence and is more likely to be chosen as a support vector, i.e. the model ends up with more support vectors. The default value in scikit-learn is 1/n_features.
(3) C and γ taken together
Looking at the penalty coefficient C and the kernel coefficient γ together (a small sketch follows the list below):
- When C is large and γ is large, there are more support vectors, the model is more complex, and overfitting is more likely.
- When C is small and γ is small, the model is simpler and there are fewer support vectors.
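To make this concrete, here is a minimal sketch (my own illustration, not from the original post) that fits SVC on a small toy dataset for a few (C, gamma) pairs and prints how many support vectors each model keeps, so the effect of the two hyperparameters on model complexity can be inspected directly:
from sklearn.datasets import make_circles
from sklearn.svm import SVC
X_demo, y_demo = make_circles(noise=0.2, factor=0.5, random_state=1)
for C in (0.1, 10):
    for gamma in (0.01, 1):
        clf = SVC(C=C, gamma=gamma).fit(X_demo, y_demo)
        # n_support_ holds the number of support vectors per class
        print("C=%s gamma=%s -> support vectors: %d" % (C, gamma, clf.n_support_.sum()))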
1.2 The SVM regression model
The RBF-kernel SVM regression model is a bit more involved than the classification model, because in addition to the penalty coefficient C and the RBF kernel coefficient γ there is a third hyperparameter, the distance-loss threshold ϵ. With NuSVR, ϵ is replaced by the error-fraction upper bound nu; since ϵ and nu play equivalent roles, this post only discusses the regression SVM with the distance threshold ϵ.
For the penalty coefficient C and the kernel coefficient γ, the regression model behaves essentially the same way as the classification model.
The distance-loss threshold ϵ determines the loss incurred by a sample's distance from the regression hyperplane: points whose error is within ϵ incur no loss. When ϵ is large, the loss is smaller and more points fall inside this loss-free range, so the model is simpler; when ϵ is small, the loss grows and the model becomes more complex. The default value in scikit-learn is 0.1.
Looking at C, γ, and ϵ together (a small SVR sketch follows below): when C is large, γ is large, and ϵ is small, there are more support vectors and the model is more complex and somewhat more prone to overfitting; when C is small, γ is small, and ϵ is large, the model becomes simpler and there are fewer support vectors.
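As a quick check of the role of ϵ, here is a minimal sketch (my own illustration on toy random data, not from the original post) that fits SVR with two values of ϵ and prints the number of support vectors each model keeps:
import numpy as np
from sklearn.svm import SVR
rng = np.random.RandomState(0)
X_reg = rng.randn(100, 1)
y_reg = np.sin(X_reg).ravel() + 0.1 * rng.randn(100)
for eps in (0.01, 0.5):
    reg = SVR(C=1.0, gamma=1.0, epsilon=eps).fit(X_reg, y_reg)
    # a wider epsilon-tube leaves more points inside it with zero loss, so fewer support vectors remain
    print("epsilon=%s -> support vectors: %d" % (eps, len(reg.support_)))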
2 The main approach to SVM RBF tuning
For the RBF-kernel SVM, the main tuning approach is cross-validation. In scikit-learn this mostly means grid search, i.e. the GridSearchCV class.
It is also possible to tune with the cross_val_score function (a rough sketch follows), but I personally find it less convenient than GridSearchCV, so this post only uses GridSearchCV to tune the RBF-kernel SVM.
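For completeness, the cross_val_score route would look roughly like this (my own sketch, not from the original post): loop over the candidate (C, gamma) pairs by hand and keep the combination with the best mean cross-validation score.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
X_cv, y_cv = make_circles(noise=0.2, factor=0.5, random_state=1)
best_score, best_params = -1.0, None
for C in (0.1, 1, 10):
    for gamma in (1, 0.1, 0.01):
        # mean accuracy over 4-fold cross-validation for this (C, gamma) pair
        score = cross_val_score(SVC(C=C, gamma=gamma), X_cv, y_cv, cv=4).mean()
        if score > best_score:
            best_score, best_params = score, {"C": C, "gamma": gamma}
print(best_params, best_score)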
When using the GridSearchCV class for SVM RBF tuning, the parameters to pay attention to are:
- estimator: the model, here an SVC or SVR with the Gaussian kernel.
- param_grid: the grid of hyperparameters to search. For an SVC classification model, for example, param_grid could be defined as {"C": [0.1, 1, 10], "gamma": [0.1, 0.2, 0.3]}, which gives 9 hyperparameter combinations for the grid search, from which the combination with the best cross-validation score is selected.
- cv: the number of folds for S-fold cross-validation, i.e. into how many parts the training set is split. The default is 3; with more samples, cv can be increased accordingly.
After the grid search finishes, you can retrieve the best fitted model (best_estimator_), the best parameter combination from param_grid (best_params_), and the corresponding best score (best_score_).
The following concrete classification example walks through the SVM RBF tuning process.
3 A worked example of SVM RBF classification tuning
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.svm import SVC
from sklearn.datasets import make_moons, make_circles, make_classification
Next, generate some random data to classify. To make the problem a bit harder, some noise is added; the data are also standardized right after generation.
X, y = make_circles(noise=0.2, factor=0.5, random_state=1)
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)
The data can be visualized as follows:
from matplotlib.colors import ListedColormap
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
ax = plt.subplot()
ax.set_title("Input data")
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright)
ax.set_xticks(())
ax.set_yticks(())
plt.tight_layout()
plt.show()
Now run an RBF-kernel SVM classification on this dataset, using a grid search over the 9 combinations formed by C=(0.1, 1, 10) and gamma=(1, 0.1, 0.01), with 4-fold cross-validation. This is only an example; in real applications you may need a much larger grid of parameter combinations.
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(), param_grid={"C":[0.1, 1, 10], "gamma":[1, 0.1, 0.01]}, cv=4)
grid.fit(X, y)
print("The best parameters are %s with a score of %0.2f" %(grid.best_params_, grid.best_score_))
The final output is:
The best parameters are {'C': 10, 'gamma': 0.1} with a score of 0.91
In other words, among the 9 given hyperparameter combinations, C=10 and gamma=0.1 scores highest in the grid search, and that is the final parameter choice.
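Continuing the example above, the refitted best model is also directly available through the best_estimator_ attribute of the fitted GridSearchCV object and can be used for prediction right away:
best_clf = grid.best_estimator_      # SVC refitted on the full data with C=10, gamma=0.1
print(best_clf.predict(X[:5]))       # predict with the tuned model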
That concludes the tuning example. It is still worth visualizing the resulting SVM classifiers: train each of the 9 combinations separately, then color a dense grid of points by their predicted class to see the decision regions. The code is as follows:
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
for i, C in enumerate((0.1, 1, 10)):
    for j, gamma in enumerate((1, 0.1, 0.01)):
        plt.subplot(3, 3, i * 3 + j + 1)  # one panel per (C, gamma) combination
        clf = SVC(C=C, gamma=gamma)
        clf.fit(X, y)
        # color the dense grid by the predicted class to show the decision regions
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
        plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
        plt.xlim(xx.min(), xx.max())
        plt.ylim(yy.min(), yy.max())
        plt.xticks(())
        plt.yticks(())
        plt.xlabel("gamma=" + str(gamma) + " C=" + str(C))
plt.show()
4 Other tuning tips for the SVM library
The parameters of the scikit-learn classes have been summarized above; here are a few additional tuning tips.
1) It is generally recommended to scale/standardize the data before training; the test data must of course be transformed in the same way (a small sketch follows this list).
2) When the number of features is very large, or the number of samples is much smaller than the number of features, a linear kernel already works well, and only the penalty coefficient C needs to be chosen.
3) When choosing a kernel, if a linear fit is not good enough, the default Gaussian kernel 'rbf' is generally recommended. In that case the hard part is tuning the penalty coefficient C and the kernel parameter γ, choosing suitable values through several rounds of cross-validation.
4) In theory the Gaussian kernel is never worse than the linear kernel, but that theory rests on spending much more time on tuning. So in practice, whenever a linear kernel can solve the problem, we prefer the linear kernel.
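For point 1), here is a minimal sketch (my own illustration, not from the original post) of the recommended pattern: fit the scaler on the training set only, and apply that same fitted scaler to the test set before predicting.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
X_all, y_all = make_circles(noise=0.2, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, random_state=1)
scaler = StandardScaler().fit(X_train)                       # learn mean/std from the training set only
clf = SVC(C=10, gamma=0.1).fit(scaler.transform(X_train), y_train)
print(clf.score(scaler.transform(X_test), y_test))           # test data scaled with the same fitted scaler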
5 Appendix: official documentation
In [1]: from sklearn.svm import SVC
In [2]: SVC?
Init signature: SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
Docstring:
C-Support Vector Classification.
The implementation is based on libsvm. The fit time complexity
is more than quadratic with the number of samples which makes it hard
to scale to dataset with more than a couple of 10000 samples.
The multiclass support is handled according to a one-vs-one scheme.
For details on the precise mathematical formulation of the provided
kernel functions and how `gamma`, `coef0` and `degree` affect each
other, see the corresponding section in the narrative documentation:
:ref:`svm_kernels`.
Read more in the :ref:`User Guide <svm_classification>`.
Parameters
----------
C : float, optional (default=1.0)
Penalty parameter C of the error term.
kernel : string, optional (default='rbf')
Specifies the kernel type to be used in the algorithm.
It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or
a callable.
If none is given, 'rbf' will be used. If a callable is given it is
used to pre-compute the kernel matrix from data matrices; that matrix
should be an array of shape ``(n_samples, n_samples)``.
degree : int, optional (default=3)
Degree of the polynomial kernel function ('poly').
Ignored by all other kernels.
gamma : float, optional (default='auto')
Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
If gamma is 'auto' then 1/n_features will be used instead.
coef0 : float, optional (default=0.0)
Independent term in kernel function.
It is only significant in 'poly' and 'sigmoid'.
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior
to calling `fit`, and will slow down that method.
shrinking : boolean, optional (default=True)
Whether to use the shrinking heuristic.
tol : float, optional (default=1e-3)
Tolerance for stopping criterion.
cache_size : float, optional
Specify the size of the kernel cache (in MB).
class_weight : {dict, 'balanced'}, optional
Set the parameter C of class i to class_weight[i]*C for
SVC. If not given, all classes are supposed to have
weight one.
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a
per-process runtime setting in libsvm that, if enabled, may not work
properly in a multithreaded context.
max_iter : int, optional (default=-1)
Hard limit on iterations within solver, or -1 for no limit.
decision_function_shape : 'ovo', 'ovr', default='ovr'
Whether to return a one-vs-rest ('ovr') decision function of shape
(n_samples, n_classes) as all other classifiers, or the original
one-vs-one ('ovo') decision function of libsvm which has shape
(n_samples, n_classes * (n_classes - 1) / 2).
.. versionchanged:: 0.19
decision_function_shape is 'ovr' by default.
.. versionadded:: 0.17
*decision_function_shape='ovr'* is recommended.
.. versionchanged:: 0.17
Deprecated *decision_function_shape='ovo' and None*.
random_state : int, RandomState instance or None, optional (default=None)
The seed of the pseudo random number generator to use when shuffling
the data. If int, random_state is the seed used by the random number
generator; If RandomState instance, random_state is the random number
generator; If None, the random number generator is the RandomState
instance used by `np.random`.
Attributes
----------
support_ : array-like, shape = [n_SV]
Indices of support vectors.
support_vectors_ : array-like, shape = [n_SV, n_features]
Support vectors.
n_support_ : array-like, dtype=int32, shape = [n_class]
Number of support vectors for each class.
dual_coef_ : array, shape = [n_class-1, n_SV]
Coefficients of the support vector in the decision function.
For multiclass, coefficient for all 1-vs-1 classifiers.
The layout of the coefficients in the multiclass case is somewhat
non-trivial. See the section about multi-class classification in the
SVM section of the User Guide for details.
coef_ : array, shape = [n_class-1, n_features]
Weights assigned to the features (coefficients in the primal
problem). This is only available in the case of a linear kernel.
`coef_` is a readonly property derived from `dual_coef_` and
`support_vectors_`.
intercept_ : array, shape = [n_class * (n_class-1) / 2]
Constants in decision function.
Examples
--------
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([1, 1, 2, 2])
>>> from sklearn.svm import SVC
>>> clf = SVC()
>>> clf.fit(X, y) #doctest: +NORMALIZE_WHITESPACE
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> print(clf.predict([[-0.8, -1]]))
[1]
See also
--------
SVR
Support Vector Machine for Regression implemented using libsvm.
LinearSVC
Scalable Linear Support Vector Machine for classification
implemented using liblinear. Check the See also section of
LinearSVC for more comparison element.
File: c:\anaconda3\lib\site-packages\sklearn\svm\classes.py
Type: ABCMeta
In [4]: from sklearn.svm import SVR
In [5]: SVR?
Init signature: SVR(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)
Docstring:
Epsilon-Support Vector Regression.
The free parameters in the model are C and epsilon.
The implementation is based on libsvm.
Read more in the :ref:`User Guide <svm_regression>`.
Parameters
----------
C : float, optional (default=1.0)
Penalty parameter C of the error term.
epsilon : float, optional (default=0.1)
Epsilon in the epsilon-SVR model. It specifies the epsilon-tube
within which no penalty is associated in the training loss function
with points predicted within a distance epsilon from the actual
value.
kernel : string, optional (default='rbf')
Specifies the kernel type to be used in the algorithm.
It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or
a callable.
If none is given, 'rbf' will be used. If a callable is given it is
used to precompute the kernel matrix.
degree : int, optional (default=3)
Degree of the polynomial kernel function ('poly').
Ignored by all other kernels.
gamma : float, optional (default='auto')
Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
If gamma is 'auto' then 1/n_features will be used instead.
coef0 : float, optional (default=0.0)
Independent term in kernel function.
It is only significant in 'poly' and 'sigmoid'.
shrinking : boolean, optional (default=True)
Whether to use the shrinking heuristic.
tol : float, optional (default=1e-3)
Tolerance for stopping criterion.
cache_size : float, optional
Specify the size of the kernel cache (in MB).
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a
per-process runtime setting in libsvm that, if enabled, may not work
properly in a multithreaded context.
max_iter : int, optional (default=-1)
Hard limit on iterations within solver, or -1 for no limit.
Attributes
----------
support_ : array-like, shape = [n_SV]
Indices of support vectors.
support_vectors_ : array-like, shape = [nSV, n_features]
Support vectors.
dual_coef_ : array, shape = [1, n_SV]
Coefficients of the support vector in the decision function.
coef_ : array, shape = [1, n_features]
Weights assigned to the features (coefficients in the primal
problem). This is only available in the case of a linear kernel.
`coef_` is readonly property derived from `dual_coef_` and
`support_vectors_`.
intercept_ : array, shape = [1]
Constants in decision function.
sample_weight : array-like, shape = [n_samples]
Individual weights for each sample
Examples
--------
>>> from sklearn.svm import SVR
>>> import numpy as np
>>> n_samples, n_features = 10, 5
>>> np.random.seed(0)
>>> y = np.random.randn(n_samples)
>>> X = np.random.randn(n_samples, n_features)
>>> clf = SVR(C=1.0, epsilon=0.2)
>>> clf.fit(X, y) #doctest: +NORMALIZE_WHITESPACE
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.2, gamma='auto',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
See also
--------
NuSVR
Support Vector Machine for regression implemented using libsvm
using a parameter to control the number of support vectors.
LinearSVR
Scalable Linear Support Vector Machine for regression
implemented using liblinear.
File: c:\anaconda3\lib\site-packages\sklearn\svm\classes.py
Type: ABCMeta