SVM Tuning in Detail
Among the kernel functions of the support vector machine (SVM), the Gaussian kernel (RBF) is the most commonly used. In theory, RBF is never worse than the linear kernel, but in practice it comes with several important hyperparameters that have to be tuned, and if they are tuned poorly the result can be worse than a linear kernel. That is why, in practice, we choose the linear kernel whenever it already gives good results. When the linear kernel is not good enough we turn to RBF, and before enjoying its good performance on nonlinear data we have to choose its main hyperparameters. This post summarizes how to tune the RBF-kernel SVM in scikit-learn.
1 Overview of the main SVM RBF hyperparameters
For the SVM classification model there are two such hyperparameters: the penalty coefficient C and the RBF kernel coefficient γ. With NuSVC, the penalty coefficient C is replaced by nu, an upper bound on the fraction of training errors; since C and nu play equivalent roles, this post only discusses the classification SVM with the penalty coefficient C.
1.1 The SVM classification model
(1) The penalty coefficient C
The penalty coefficient C is the coefficient of the slack variables ξ discussed in the previous post. In the optimization objective it balances the complexity of the support-vector model against the misclassification rate, and can be thought of as a regularization-related coefficient (a larger C means weaker regularization). When C is large, misclassified points contribute a large loss, which means we are unwilling to give up on far-away outliers; more samples end up as support vectors, the support-vector/hyperplane model becomes more complex, and overfitting becomes more likely. When C is small, we essentially ignore those outliers, fewer samples are chosen as support vectors, and the final model is simpler. The default value in scikit-learn is 1.
(2) The RBF kernel coefficient γ
The other hyperparameter is the RBF kernel parameter γ. Recall the RBF kernel K(x, z) = exp(−γ‖x − z‖²) with γ > 0: γ mainly determines how much influence a single sample has on the separating hyperplane. When γ is small, a single sample has little influence on the hyperplane and is less likely to be chosen as a support vector; when γ is large, a single sample has a larger influence and is more likely to be chosen as a support vector, i.e. the model ends up with more support vectors. The default value in scikit-learn is 1/n_features.
(3) C and γ taken together
Looking at the penalty coefficient C and the kernel coefficient γ together (a small sketch follows the list below):
- When C is large and γ is large, there are more support vectors, the model is more complex, and overfitting is more likely.
- When C is small and γ is small, the model is simpler and there are fewer support vectors.
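To make this concrete, here is a minimal sketch (my own illustration, not from the original post) that fits SVC on a small toy dataset for a few (C, gamma) pairs and prints how many support vectors each model keeps, so the effect of the two hyperparameters on model complexity can be inspected directly:
from sklearn.datasets import make_circles
from sklearn.svm import SVC
X_demo, y_demo = make_circles(noise=0.2, factor=0.5, random_state=1)
for C in (0.1, 10):
    for gamma in (0.01, 1):
        clf = SVC(C=C, gamma=gamma).fit(X_demo, y_demo)
        # n_support_ holds the number of support vectors per class
        print("C=%s gamma=%s -> support vectors: %d" % (C, gamma, clf.n_support_.sum()))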
1.2 The SVM regression model
The RBF-kernel SVM regression model is a bit more involved than the classification model, because in addition to the penalty coefficient C and the RBF kernel coefficient γ there is a third hyperparameter, the distance-loss threshold ϵ. With NuSVR, ϵ is replaced by the error-fraction upper bound nu; since ϵ and nu play equivalent roles, this post only discusses the regression SVM with the distance threshold ϵ.
For the penalty coefficient C and the kernel coefficient γ, the regression model behaves essentially the same way as the classification model.
The distance-loss threshold ϵ determines the loss incurred by a sample's distance from the regression hyperplane: points whose error is within ϵ incur no loss. When ϵ is large, the loss is smaller and more points fall inside this loss-free range, so the model is simpler; when ϵ is small, the loss grows and the model becomes more complex. The default value in scikit-learn is 0.1.
Looking at C, γ, and ϵ together (a small SVR sketch follows below): when C is large, γ is large, and ϵ is small, there are more support vectors and the model is more complex and somewhat more prone to overfitting; when C is small, γ is small, and ϵ is large, the model becomes simpler and there are fewer support vectors.
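As a quick check of the role of ϵ, here is a minimal sketch (my own illustration on toy random data, not from the original post) that fits SVR with two values of ϵ and prints the number of support vectors each model keeps:
import numpy as np
from sklearn.svm import SVR
rng = np.random.RandomState(0)
X_reg = rng.randn(100, 1)
y_reg = np.sin(X_reg).ravel() + 0.1 * rng.randn(100)
for eps in (0.01, 0.5):
    reg = SVR(C=1.0, gamma=1.0, epsilon=eps).fit(X_reg, y_reg)
    # a wider epsilon-tube leaves more points inside it with zero loss, so fewer support vectors remain
    print("epsilon=%s -> support vectors: %d" % (eps, len(reg.support_)))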
2 The main approach to SVM RBF tuning
For the RBF-kernel SVM, the main tuning approach is cross-validation. In scikit-learn this mostly means grid search, i.e. the GridSearchCV class.
It is also possible to tune with the cross_val_score function (a rough sketch follows), but I personally find it less convenient than GridSearchCV, so this post only uses GridSearchCV to tune the RBF-kernel SVM.
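For completeness, the cross_val_score route would look roughly like this (my own sketch, not from the original post): loop over the candidate (C, gamma) pairs by hand and keep the combination with the best mean cross-validation score.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
X_cv, y_cv = make_circles(noise=0.2, factor=0.5, random_state=1)
best_score, best_params = -1.0, None
for C in (0.1, 1, 10):
    for gamma in (1, 0.1, 0.01):
        # mean accuracy over 4-fold cross-validation for this (C, gamma) pair
        score = cross_val_score(SVC(C=C, gamma=gamma), X_cv, y_cv, cv=4).mean()
        if score > best_score:
            best_score, best_params = score, {"C": C, "gamma": gamma}
print(best_params, best_score)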
When using the GridSearchCV class for SVM RBF tuning, the parameters to pay attention to are:
- estimator: the model, here an SVC or SVR with the Gaussian kernel.
- param_grid: the grid of hyperparameters to search. For an SVC classification model, for example, param_grid could be defined as {"C": [0.1, 1, 10], "gamma": [0.1, 0.2, 0.3]}, which gives 9 hyperparameter combinations for the grid search, from which the combination with the best cross-validation score is selected.
- cv: the number of folds for S-fold cross-validation, i.e. into how many parts the training set is split. The default is 3; with more samples, cv can be increased accordingly.
After the grid search finishes, you can retrieve the best fitted model (best_estimator_), the best parameter combination from param_grid (best_params_), and the corresponding best score (best_score_).
The following concrete classification example walks through the SVM RBF tuning process.
3 A worked example of SVM RBF classification tuning
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
from sklearn.svm import SVC
from sklearn.datasets import make_moons, make_circles, make_classification
Next, generate some random data to classify. To make the problem a bit harder, some noise is added; the data are also standardized right after generation.
X, y = make_circles(noise=0.2, factor=0.5, random_state=1)
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)
The data can be visualized as follows:
from matplotlib.colors import ListedColormap
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
ax = plt.subplot()
ax.set_title("Input data")
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm_bright)
ax.set_xticks(())
ax.set_yticks(())
plt.tight_layout()
plt.show()
Now run an RBF-kernel SVM classification on this dataset, using a grid search over the 9 combinations formed by C=(0.1, 1, 10) and gamma=(1, 0.1, 0.01), with 4-fold cross-validation. This is only an example; in real applications you may need a much larger grid of parameter combinations.
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(), param_grid={"C":[0.1, 1, 10], "gamma":[1, 0.1, 0.01]}, cv=4)
grid.fit(X, y)
print("The best parameters are %s with a score of %0.2f" %(grid.best_params_, grid.best_score_))
The final output is:
The best parameters are {'C': 10, 'gamma': 0.1} with a score of 0.91
In other words, among the 9 given hyperparameter combinations, C=10 and gamma=0.1 scores highest in the grid search, and that is the final parameter choice.
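Continuing the example above, the refitted best model is also directly available through the best_estimator_ attribute of the fitted GridSearchCV object and can be used for prediction right away:
best_clf = grid.best_estimator_      # SVC refitted on the full data with C=10, gamma=0.1
print(best_clf.predict(X[:5]))       # predict with the tuned model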
That concludes the tuning example. It is still worth visualizing the resulting SVM classifiers: train each of the 9 combinations separately, then color a dense grid of points by their predicted class to see the decision regions. The code is as follows:
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
for i, C in enumerate((0.1, 1, 10)):
    for j, gamma in enumerate((1, 0.1, 0.01)):
        plt.subplot(3, 3, i * 3 + j + 1)  # one panel per (C, gamma) combination
        clf = SVC(C=C, gamma=gamma)
        clf.fit(X, y)
        # color the dense grid by the predicted class to show the decision regions
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
        plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
        plt.xlim(xx.min(), xx.max())
        plt.ylim(yy.min(), yy.max())
        plt.xticks(())
        plt.yticks(())
        plt.xlabel("gamma=" + str(gamma) + " C=" + str(C))
plt.show()
4 Other tuning tips for the SVM library
The parameters of the scikit-learn classes have been summarized above; here are a few additional tuning tips.
1) It is generally recommended to scale/standardize the data before training; the test data must of course be transformed in the same way (a small sketch follows this list).
2) When the number of features is very large, or the number of samples is much smaller than the number of features, a linear kernel already works well, and only the penalty coefficient C needs to be chosen.
3) When choosing a kernel, if a linear fit is not good enough, the default Gaussian kernel 'rbf' is generally recommended. In that case the hard part is tuning the penalty coefficient C and the kernel parameter γ, choosing suitable values through several rounds of cross-validation.
4) In theory the Gaussian kernel is never worse than the linear kernel, but that theory rests on spending much more time on tuning. So in practice, whenever a linear kernel can solve the problem, we prefer the linear kernel.
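For point 1), here is a minimal sketch (my own illustration, not from the original post) of the recommended pattern: fit the scaler on the training set only, and apply that same fitted scaler to the test set before predicting.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
X_all, y_all = make_circles(noise=0.2, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, random_state=1)
scaler = StandardScaler().fit(X_train)                       # learn mean/std from the training set only
clf = SVC(C=10, gamma=0.1).fit(scaler.transform(X_train), y_train)
print(clf.score(scaler.transform(X_test), y_test))           # test data scaled with the same fitted scaler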
5 Appendix: official documentation
In [1]: from sklearn.svm import SVC
In [2]: SVC?
Init signature: SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', random_state=None)
Docstring:
C-Support Vector Classification.
The implementation is based on libsvm. The fit time complexity
is more than quadratic with the number of samples which makes it hard
to scale to dataset with more than a couple of 10000 samples.
The multiclass support is handled according to a one-vs-one scheme.
For details on the precise mathematical formulation of the provided
kernel functions and how `gamma`, `coef0` and `degree` affect each
other, see the corresponding section in the narrative documentation:
:ref:`svm_kernels`.
Read more in the :ref:`User Guide <svm_classification>`.
Parameters
----------
C : float, optional (default=1.0)
Penalty parameter C of the error term.
kernel : string, optional (default='rbf')
Specifies the kernel type to be used in the algorithm.
It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or
a callable.
If none is given, 'rbf' will be used. If a callable is given it is
used to pre-compute the kernel matrix from data matrices; that matrix
should be an array of shape ``(n_samples, n_samples)``.
degree : int, optional (default=3)
Degree of the polynomial kernel function ('poly').
Ignored by all other kernels.
gamma : float, optional (default='auto')
Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
If gamma is 'auto' then 1/n_features will be used instead.
coef0 : float, optional (default=0.0)
Independent term in kernel function.
It is only significant in 'poly' and 'sigmoid'.
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior
to calling `fit`, and will slow down that method.
shrinking : boolean, optional (default=True)
Whether to use the shrinking heuristic.
tol : float, optional (default=1e-3)
Tolerance for stopping criterion.
cache_size : float, optional
Specify the size of the kernel cache (in MB).
class_weight : {dict, 'balanced'}, optional
Set the parameter C of class i to class_weight[i]*C for
SVC. If not given, all classes are supposed to have
weight one.
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a
per-process runtime setting in libsvm that, if enabled, may not work
properly in a multithreaded context.
max_iter : int, optional (default=-1)
Hard limit on iterations within solver, or -1 for no limit.
decision_function_shape : 'ovo', 'ovr', default='ovr'
Whether to return a one-vs-rest ('ovr') decision function of shape
(n_samples, n_classes) as all other classifiers, or the original
one-vs-one ('ovo') decision function of libsvm which has shape
(n_samples, n_classes * (n_classes - 1) / 2).
.. versionchanged:: 0.19
decision_function_shape is 'ovr' by default.
.. versionadded:: 0.17
*decision_function_shape='ovr'* is recommended.
.. versionchanged:: 0.17
Deprecated *decision_function_shape='ovo' and None*.
random_state : int, RandomState instance or None, optional (default=None)
The seed of the pseudo random number generator to use when shuffling
the data. If int, random_state is the seed used by the random number
generator; If RandomState instance, random_state is the random number
generator; If None, the random number generator is the RandomState
instance used by `np.random`.
Attributes
----------
support_ : array-like, shape = [n_SV]
Indices of support vectors.
support_vectors_ : array-like, shape = [n_SV, n_features]
Support vectors.
n_support_ : array-like, dtype=int32, shape = [n_class]
Number of support vectors for each class.
dual_coef_ : array, shape = [n_class-1, n_SV]
Coefficients of the support vector in the decision function.
For multiclass, coefficient for all 1-vs-1 classifiers.
The layout of the coefficients in the multiclass case is somewhat
non-trivial. See the section about multi-class classification in the
SVM section of the User Guide for details.
coef_ : array, shape = [n_class-1, n_features]
Weights assigned to the features (coefficients in the primal
problem). This is only available in the case of a linear kernel.
`coef_` is a readonly property derived from `dual_coef_` and
`support_vectors_`.
intercept_ : array, shape = [n_class * (n_class-1) / 2]
Constants in decision function.
Examples
--------
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([1, 1, 2, 2])
>>> from sklearn.svm import SVC
>>> clf = SVC()
>>> clf.fit(X, y) #doctest: +NORMALIZE_WHITESPACE
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> print(clf.predict([[-0.8, -1]]))
[1]
See also
--------
SVR
Support Vector Machine for Regression implemented using libsvm.
LinearSVC
Scalable Linear Support Vector Machine for classification
implemented using liblinear. Check the See also section of
LinearSVC for more comparison element.
File: c:\anaconda3\lib\site-packages\sklearn\svm\classes.py
Type: ABCMeta
In [4]: from sklearn.svm import SVR
In [5]: SVR?
Init signature: SVR(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)
Docstring:
Epsilon-Support Vector Regression.
The free parameters in the model are C and epsilon.
The implementation is based on libsvm.
Read more in the :ref:`User Guide <svm_regression>`.
Parameters
----------
C : float, optional (default=1.0)
Penalty parameter C of the error term.
epsilon : float, optional (default=0.1)
Epsilon in the epsilon-SVR model. It specifies the epsilon-tube
within which no penalty is associated in the training loss function
with points predicted within a distance epsilon from the actual
value.
kernel : string, optional (default='rbf')
Specifies the kernel type to be used in the algorithm.
It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or
a callable.
If none is given, 'rbf' will be used. If a callable is given it is
used to precompute the kernel matrix.
degree : int, optional (default=3)
Degree of the polynomial kernel function ('poly').
Ignored by all other kernels.
gamma : float, optional (default='auto')
Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
If gamma is 'auto' then 1/n_features will be used instead.
coef0 : float, optional (default=0.0)
Independent term in kernel function.
It is only significant in 'poly' and 'sigmoid'.
shrinking : boolean, optional (default=True)
Whether to use the shrinking heuristic.
tol : float, optional (default=1e-3)
Tolerance for stopping criterion.
cache_size : float, optional
Specify the size of the kernel cache (in MB).
verbose : bool, default: False
Enable verbose output. Note that this setting takes advantage of a
per-process runtime setting in libsvm that, if enabled, may not work
properly in a multithreaded context.
max_iter : int, optional (default=-1)
Hard limit on iterations within solver, or -1 for no limit.
Attributes
----------
support_ : array-like, shape = [n_SV]
Indices of support vectors.
support_vectors_ : array-like, shape = [nSV, n_features]
Support vectors.
dual_coef_ : array, shape = [1, n_SV]
Coefficients of the support vector in the decision function.
coef_ : array, shape = [1, n_features]
Weights assigned to the features (coefficients in the primal
problem). This is only available in the case of a linear kernel.
`coef_` is readonly property derived from `dual_coef_` and
`support_vectors_`.
intercept_ : array, shape = [1]
Constants in decision function.
sample_weight : array-like, shape = [n_samples]
Individual weights for each sample
Examples
--------
>>> from sklearn.svm import SVR
>>> import numpy as np
>>> n_samples, n_features = 10, 5
>>> np.random.seed(0)
>>> y = np.random.randn(n_samples)
>>> X = np.random.randn(n_samples, n_features)
>>> clf = SVR(C=1.0, epsilon=0.2)
>>> clf.fit(X, y) #doctest: +NORMALIZE_WHITESPACE
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.2, gamma='auto',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
See also
--------
NuSVR
Support Vector Machine for regression implemented using libsvm
using a parameter to control the number of support vectors.
LinearSVR
Scalable Linear Support Vector Machine for regression
implemented using liblinear.
File: c:\anaconda3\lib\site-packages\sklearn\svm\classes.py
Type: ABCMeta