Gradient boosting
From Wikipedia, the free encyclopedia
Gradient boosting is a machine-learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. The gradient boosting method can also be used for classification problems by reducing them to regression with a suitable loss function. The method was invented by Jerome H. Friedman in 1999 and was published in a series of two papers, the first of which introduced the method, and the second of which described an important tweak to the algorithm that improves its accuracy and performance. Gradient boosting is a special case of the functional gradient descent view of boosting.[3][4]
Informal introduction
(This section follows the exposition of gradient boosting due to Li.)
Like other boosting methods, gradient boosting combines weak learners into a single strong learner, in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to learn a model F that predicts values \hat{y} = F(x), minimizing the mean squared error (\hat{y} - y)^2 averaged over some training set of true values y.

At each stage m of gradient boosting, it may be assumed that there is some imperfect model F_m (at the outset, a very weak model that just predicts the mean of y in the training set could be used). The gradient boosting algorithm does not change F_m in any way; instead, it improves on it by constructing a new model that adds an estimator h to provide a better model F_{m+1}(x) = F_m(x) + h(x). The question is now, how to find h? The gradient boosting solution starts with the observation that a perfect h would imply

F_{m+1}(x) = F_m(x) + h(x) = y

or, equivalently,

h(x) = y - F_m(x).

Therefore, gradient boosting will fit h to the residual y - F_m(x). Like in other boosting variants, each F_{m+1} learns to correct its predecessor F_m.
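To make this residual-fitting step concrete, the following is a minimal sketch, assuming NumPy and scikit-learn's DecisionTreeRegressor as the weak learner; the helper name boost_one_stage is illustrative, not part of any library:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_one_stage(F_m, X, y, max_depth=3):
    """One least-squares boosting stage: fit h to the residual y - F_m(X)
    and return the improved model F_{m+1}(x) = F_m(x) + h(x)."""
    residual = y - F_m(X)                      # what the current model gets wrong
    h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
    return lambda X_new: F_m(X_new) + h.predict(X_new)

# Toy usage: start from the constant model that predicts the training mean.
X = np.random.rand(200, 3)
y = np.sin(X[:, 0] * 6) + X[:, 1]
F = lambda X_new: np.full(len(X_new), y.mean())
for _ in range(10):                            # a few stages already reduce the error
    F = boost_one_stage(F, X, y)
print("training MSE:", np.mean((y - F(X)) ** 2))
```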
A generalization of this idea to loss functions other than squared error (and to classification and ranking problems) follows from the observation that residuals y - F(x) are the negative gradients of the squared error loss function (1/2)(y - F(x))^2 with respect to F(x). So, gradient boosting is a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.

Algorithm
In many supervised learning problems one has an output variable y and a vector of input variables x connected together via a joint probability distribution P(x, y). Using a training set {(x_1, y_1), ..., (x_n, y_n)} of known values of x and corresponding values of y, the goal is to find an approximation \hat{F}(x) to a function F*(x) that minimizes the expected value of some specified loss function L(y, F(x)):

F* = \arg\min_F \mathbb{E}_{x,y}[L(y, F(x))].

The gradient boosting method assumes a real-valued y and seeks an approximation \hat{F}(x) in the form of a weighted sum of functions h_i(x) from some class ℋ, called base (or weak) learners:

F(x) = \sum_{i=1}^{M} \gamma_i h_i(x) + \text{const}.

In accordance with the empirical risk minimization principle, the method tries to find an approximation \hat{F}(x) that minimizes the average value of the loss function on the training set. It does so by starting with a model consisting of a constant function F_0(x) and incrementally expanding it in a greedy fashion:

F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma),

F_m(x) = F_{m-1}(x) + \arg\min_{f \in ℋ} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + f(x_i)),

where f is restricted to be a function from the class ℋ of base learner functions.
However, the problem of choosing at each step the best f for an arbitrary loss function L is a hard optimization problem in general, and so we'll "cheat" by solving a much easier problem instead.
The idea is to apply a steepest descent step to this minimization problem. If we only cared about predictions at the points of the training set, and f were unrestricted, we'd update the model per the following equations, where we view L(y, f) not as a functional of f, but as a function of the vector of values f(x_1), ..., f(x_n):

F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_{F_{m-1}} L(y_i, F_{m-1}(x_i)),

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i, F_{m-1}(x_i) - \gamma \nabla_{F_{m-1}} L(y_i, F_{m-1}(x_i))\bigr).
But as f must come from a restricted class of functions (that's what allows us to generalize), we'll just choose the one that most closely approximates the gradient of L. Having chosen f, the multiplier γ is then selected using line search just as shown in the second equation above.
In pseudocode, the generic gradient boosting method is:

Input: training set {(x_i, y_i)}_{i=1}^{n}, a differentiable loss function L(y, F(x)), number of iterations M.

Algorithm:
1. Initialize model with a constant value:
   F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma).
2. For m = 1 to M:
   1. Compute so-called pseudo-residuals:
      r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}  for i = 1, ..., n.
   2. Fit a base learner h_m(x) to the pseudo-residuals, i.e. train it using the training set {(x_i, r_{im})}_{i=1}^{n}.
   3. Compute multiplier \gamma_m by solving the following one-dimensional optimization problem:
      \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)).
   4. Update the model:
      F_m(x) = F_{m-1}(x) + \gamma_m h_m(x).
3. Output F_M(x).
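The pseudocode above translates almost line for line into code. The following is a minimal sketch rather than a production implementation, assuming NumPy and scikit-learn's DecisionTreeRegressor as the base learner; the pluggable loss here is absolute error, whose negative gradient (pseudo-residual) is sign(y - F):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, max_depth=3):
    """Generic gradient boosting for absolute-error loss L(y, F) = |y - F|.
    Returns the initial constant and the list of (gamma_m, h_m) stages."""
    F0 = np.median(y)                       # arg min_gamma sum |y_i - gamma| is the median
    F = np.full_like(y, F0, dtype=float)
    stages = []
    for m in range(M):
        r = np.sign(y - F)                  # pseudo-residuals: negative gradient of |y - F|
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        h_pred = h.predict(X)
        # One-dimensional line search for gamma_m over a small grid
        # (a simple stand-in for an exact solver).
        grid = np.linspace(0.0, 2.0, 41)
        losses = [np.abs(y - (F + g * h_pred)).mean() for g in grid]
        gamma = grid[int(np.argmin(losses))]
        F = F + gamma * h_pred              # update the model
        stages.append((gamma, h))
    return F0, stages

def predict(F0, stages, X):
    return F0 + sum(g * h.predict(X) for g, h in stages)
```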
Gradient tree boosting
Gradient boosting is typically used with decision trees (especially CART trees) of a fixed size as base learners. For this special case, Friedman proposes a modification to the gradient boosting method which improves the quality of fit of each base learner.
Generic gradient boosting at the m-th step would fit a decision tree h_m(x) to pseudo-residuals. Let J be the number of its leaves. The tree partitions the input space into J disjoint regions R_{1m}, ..., R_{Jm} and predicts a constant value in each region. Using the indicator notation, the output of h_m(x) for input x can be written as the sum:

h_m(x) = \sum_{j=1}^{J} b_{jm} \mathbf{1}_{R_{jm}}(x),

where b_{jm} is the value predicted in the region R_{jm}.[7] Then the coefficients b_{jm} are multiplied by some value \gamma_m, chosen using line search so as to minimize the loss function, and the model is updated as follows:

F_m(x) = F_{m-1}(x) + \gamma_m h_m(x),   \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)).
Friedman proposes to modify this algorithm so that it chooses a separate optimal value \gamma_{jm} for each of the tree's regions, instead of a single \gamma_m for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients b_{jm} from the tree-fitting procedure can then simply be discarded and the model update rule becomes:

F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} \mathbf{1}_{R_{jm}}(x),   \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma).
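For illustration, a sketch of this per-region update under absolute-error loss, where the arg min over each leaf is simply the median of that leaf's residuals. It uses scikit-learn's tree.apply to recover the leaf index of each training point; it is an assumption-laden sketch, not Friedman's reference implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def treeboost_step(F, X, y, max_depth=3):
    """One TreeBoost step for L(y, F) = |y - F|: fit a tree to the pseudo-residuals,
    then replace each leaf's value by the median residual of the points in that leaf."""
    residual = y - F
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, np.sign(residual))
    leaf_id = tree.apply(X)                           # which region R_{jm} each x_i falls into
    gamma = {j: np.median(residual[leaf_id == j])     # per-leaf optimal update gamma_{jm}
             for j in np.unique(leaf_id)}
    update = np.array([gamma[j] for j in leaf_id])
    return F + update, tree, gamma
```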
Size of trees
J, the number of terminal nodes in trees, is the method's parameter which can be adjusted for a data set at hand. It controls the maximum allowed level of interaction between variables in the model. With J = 2 (decision stumps), no interaction between variables is allowed. With J = 3 the model may include effects of the interaction between up to two variables, and so on.

Hastie et al.[6] comment that typically 4 ≤ J ≤ 8 work well for boosting and results are fairly insensitive to the choice of J in this range, J = 2 is insufficient for many applications, and J > 10 is unlikely to be required.

Regularization
Fitting the training set too closely can lead to degradation of the model's generalization ability. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure.
One natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of trees in the model when the base learner is a decision tree). Increasing M reduces the error on the training set, but setting it too high may lead to overfitting. An optimal value of M is often selected by monitoring prediction error on a separate validation data set. Besides controlling M, several other regularization techniques are used.
Shrinkage
An important part of the gradient boosting method is regularization by shrinkage, which consists in modifying the update rule as follows:

F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x),   0 < \nu \le 1,

where the parameter \nu is called the "learning rate".
Empirically it has been found that using small learning rates (such as \nu < 0.1) yields dramatic improvements in the model's generalization ability over gradient boosting without shrinking (\nu = 1).[6] However, it comes at the price of increased computational time both during training and querying: a lower learning rate requires more iterations.
Stochastic gradient boosting
Soon after the introduction of gradient boosting, Friedman proposed a minor modification to the algorithm, motivated by Breiman's bagging method. Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement. Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.
Subsample size is some constant fraction f of the size of the training set. When f = 1, the algorithm is deterministic and identical to the one described above. Smaller values of f introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller datasets at each iteration. Friedman[2] obtained that 0.5 ≤ f ≤ 0.8 leads to good results for small and moderately sized training sets. Therefore, f is typically set to 0.5, meaning that one half of the training set is used to build each base learner.
Also, like in bagging, subsampling allows one to define an out-of-bag estimate of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.
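As a usage sketch (assuming scikit-learn's GradientBoostingRegressor, which exposes the shrinkage and subsampling knobs described above as learning_rate and subsample):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=500,     # M: number of boosting iterations
    learning_rate=0.05,   # shrinkage parameter nu; smaller values need more iterations
    subsample=0.5,        # stochastic gradient boosting: fit each tree on half the data
    max_depth=3,          # tree size, controlling the interaction order
    random_state=0,
).fit(X, y)

# With subsample < 1, oob_improvement_[m] estimates the loss reduction contributed by
# stage m on the held-out fraction; it can suggest a reasonable number of iterations.
best_M = int(np.argmax(np.cumsum(model.oob_improvement_))) + 1
print("suggested number of iterations:", best_M)
```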
Number of observations in leaves
Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees' terminal nodes (this parameter is called n.minobsinnode in the R gbm package). It is used in the tree-building process by ignoring any splits that lead to nodes containing fewer than this number of training set instances. Imposing this limit helps to reduce variance in predictions at leaves.
Penalize Complexity of Tree
Another useful regularization technique for gradient boosted trees is to penalize the complexity of the learned model. The model complexity can be defined as proportional to the number of leaves in the learned trees. The joint optimization of loss and model complexity corresponds to a post-pruning algorithm that removes branches which fail to reduce the loss by a threshold. Other kinds of regularization, such as an ℓ2 penalty on the leaf values, can also be added to avoid overfitting.
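A minimal sketch of what an ℓ2 penalty on leaf values does in the squared-error case: the optimal leaf value becomes the residual sum shrunk by a regularization constant lambda. The helper name l2_leaf_value is illustrative, not from any library:

```python
import numpy as np

def l2_leaf_value(residuals, lam):
    """Optimal constant leaf value for squared-error loss with an L2 penalty lam * value^2:
    minimizing sum((r - v)^2) + lam * v^2 gives v = sum(r) / (len(r) + lam)."""
    return residuals.sum() / (len(residuals) + lam)

r = np.array([1.0, 2.0, 3.0])
print(l2_leaf_value(r, lam=0.0))   # 2.0, the plain mean
print(l2_leaf_value(r, lam=3.0))   # 1.0, shrunk towards zero
```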
Usage
Recently, gradient boosting has gained some popularity in the field of learning to rank. The commercial web search engines Yahoo[11] and Yandex[12] use variants of gradient boosting in their machine-learned ranking engines.
Names
The method goes by a wide variety of names. The title of the original publication refers to it as a "Gradient Boosting Machine" (GBM). That same publication and a later one by J. Friedman also use the names "Gradient Boost", "Stochastic Gradient Boosting" (emphasizing the random subsampling technique), "Gradient Tree Boosting" and "TreeBoost" (for the specialization of the method to the case of decision trees as base learners).
A popular open-source implementation[9] for R calls it a "Generalized Boosting Model". Sometimes the method is referred to as "functional gradient boosting" or "Gradient Boosted Models", and its tree version is also called "Gradient Boosted Decision Trees" (GBDT) or "Gradient Boosted Regression Trees" (GBRT). Commercial implementations from Salford Systems use the names "Multiple Additive Regression Trees" (MART) and TreeNet, both trademarked.
Random forest
From Wikipedia, the free encyclopedia
This article is about the machine learning technique. For other kinds of random tree, see Random tree.
Random forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests correct for decision trees' habit of overfitting to their training set.
The algorithm for inducing a random forest was developed by Leo Breiman[1] and Adele Cutler, and "Random Forests" is their trademark. The method combines Breiman's "bagging" idea and the random selection of features, introduced independently by Ho and by Amit and Geman,[5] in order to construct a collection of decision trees with controlled variance.
The selection of a random subset of features is an example of the random subspace method, which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.
History
The early development of random forests was influenced by the work of Amit and Geman,[5] who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho was also influential in the design of random forests. In this method a forest of trees is grown, and variation among the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was first introduced by Dietterich.
The introduction of random forests proper was first made in a paper by Leo Breiman. This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular:
1. Using out-of-bag error as an estimate of the generalization error.
2. Measuring variable importance through permutation.
The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation.
Algorithm
Preliminaries: decision tree learning
Main article: Decision tree learning
Decision trees are a popular method for various machine learning tasks. Tree learning "come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie et al., because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate.
In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, because they have low bias but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.
Tree bagging
Main article: Bootstrap aggregating
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, …, xn with responses Y = y1, …, yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:
For b = 1, …, B:
1. Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
2. Train a decision or regression tree fb on Xb, Yb.
After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x':

\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x'),

or by taking the majority vote in the case of decision trees.
This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
The number of samples/trees, B, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees B can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample.[9] The training and test error tend to level off after some number of trees have been fit.
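A minimal sketch of the bagging loop above, assuming NumPy and scikit-learn's DecisionTreeRegressor; the function names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, B=100, random_state=0):
    """Fit B regression trees, each on a bootstrap sample (drawn with replacement)."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # sample n indices with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X_new):
    """Average the predictions of the individual trees."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```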
From bagging to random forests
Main article: Random subspace method
The above procedure describes the original bagging algorithm for trees. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging".
The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated.
Typically, for a dataset with p features, √p features are used in each split.
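For instance, scikit-learn's RandomForestClassifier exposes this choice through its max_features parameter; a brief usage sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=25, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,      # B: number of bagged trees
    max_features="sqrt",   # feature bagging: consider sqrt(p) features at each split
    oob_score=True,        # estimate generalization error from out-of-bag samples
    random_state=0,
).fit(X, y)

print("out-of-bag accuracy estimate:", forest.oob_score_)
```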
Extensions

Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. These are trained using bagging and the random subspace method, like in an ordinary random forest, but additionally the top-down splitting in the tree learner is randomized. Instead of computing the locally optimal feature/split combination (based on, e.g., information gain or the Gini impurity), for each feature under consideration a random value is selected in the feature's empirical range (in the tree's training set, i.e., the bootstrap sample). The best of these is then chosen as the split.[10]

Properties
Variable importance
Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper and is implemented in the R package randomForest.
The first step in measuring the variable importance in a data set is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).
To measure the importance of the j-th feature after training, the values of the j-th feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the j-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences. Features which produce large values for this score are ranked as more important than features which produce small values.
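A simplified sketch of this permutation procedure (using a held-out set in place of per-tree out-of-bag samples, so it is an approximation of the method described above, not Breiman's exact recipe):

```python
import numpy as np

def permutation_importance(forest, X_val, y_val, seed=0):
    """Importance of feature j = drop in accuracy after permuting column j."""
    rng = np.random.default_rng(seed)
    baseline = forest.score(X_val, y_val)
    scores = []
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # permute the j-th feature's values
        scores.append(baseline - forest.score(X_perm, y_val))
    return np.array(scores)                            # larger = more important
```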
This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Methods such as partial permutations[11][12] and growing unbiased trees can be used to solve the problem. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.
Relationship to nearest neighbors
A relationship between random forests and the k-nearest neighbor algorithm (k-NN) was pointed out by Lin and Jeon in 2002.[15] It turns out that both can be viewed as so-called weighted neighborhoods schemes. These are models built from a training set {(x_i, y_i)}_{i=1}^{n} that make predictions \hat{y} for new points x' by looking at the "neighborhood" of the point, formalized by a weight function W:

\hat{y} = \sum_{i=1}^{n} W(x_i, x') \, y_i.
Here, W(x_i, x') is the non-negative weight of the i-th training point relative to the new point x'. For any particular x', the weights must sum to one. Weight functions are given as follows:
In k-NN, the weights are W(x_i, x') = 1/k if x_i is one of the k points closest to x', and zero otherwise.
In a tree, W(x_i, x') = 1/k' if x_i is one of the k' training points that fall into the same leaf as x', and zero otherwise.
Since a forest averages the predictions of a set of m trees with individual weight functions W_j, its predictions are

\hat{y} = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} W_j(x_i, x') \, y_i = \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} W_j(x_i, x') \right) y_i.

This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of x' in this interpretation are the points x_i which fall in the same leaf as x' in at least one tree of the forest. In this way, the neighborhood of x' depends in a complex way on the structure of the trees, and thus on the structure of the training set. Lin and Jeon show that the shape of the neighborhood used by a random forest adapts to the local importance of each feature.[15]
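A sketch of how these forest weights could be computed explicitly from a fitted scikit-learn forest, using each estimator's apply() method to find leaf co-membership (illustrative code; the function name forest_weights is not a library API):

```python
import numpy as np

def forest_weights(forest, X_train, x_new):
    """Weight W(x_i, x') averaged over trees: in each tree, training points sharing
    the leaf of x' get weight 1/(number of such points), all others get 0."""
    x_new = np.asarray(x_new).reshape(1, -1)
    weights = np.zeros(len(X_train))
    for tree in forest.estimators_:
        train_leaves = tree.apply(X_train)            # leaf index of each training point
        new_leaf = tree.apply(x_new)[0]               # leaf that x' falls into
        in_leaf = train_leaves == new_leaf
        weights[in_leaf] += 1.0 / in_leaf.sum()       # per-tree weight 1/k'
    return weights / len(forest.estimators_)          # average over the m trees
```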
Unsupervised learning with random forests

As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabeled data: the idea is to construct an RF predictor that distinguishes the "observed" data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The RF dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" RF dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. The RF dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.

Variants
Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes classifiers.