# 简介

• 目标函数非凸
• 参数空间维度高
• 数据集样本量大

• 多框架开箱即用
• 热门研究领域

. . .

• 超参黑盒
• 场景黑盒

## 基于牛顿迭代法

$f\left( x \right) = 0$ 求根问题，将 $f(x)$ 泰勒展开： $$y = f\left( {{x_0}} \right) + f'\left( {{x_0}} \right)\left( {x - {x_0}} \right)$$ 令 $f\left( {{x_0}} \right) + f'\left( {{x_0}} \right)\left( {x - {x_0}} \right) = 0$ 进而得迭代公式： $${x_{n + 1}} = {x_n} - \frac{{f\left( {{x_n}} \right)}}{{f'\left( {{x_n}} \right)}}$$ . . .

a large variety of quasi-Newton methods have been developed that seek to approximate the inverse Hessian. Among these, the most popular is L-BFGS, which uses the information in the gradients over time to form the approximation implicitly (i.e. the full matrix is never computed).

# 梯度下降

## 动机

Strategy #1：随机搜索

bestloss = float("inf") # Python assigns the highest possible float value
for num in xrange(1000):
W = np.random.randn(10, 3073) * 0.0001 # generate random parameters
loss = L(X_train, Y_train, W) # get the loss over the entire training set
if loss < bestloss: # keep track of the best solution
bestloss = loss
bestW = W
print 'in attempt %d the loss was %f, best %f' % (num, loss, bestloss)


. . .

Strategy #2：局部随机搜索

W = np.random.randn(10, 3073) * 0.001 # generate random starting W
bestloss = float("inf")
for i in xrange(1000):
step_size = 0.0001
Wtry = W + np.random.randn(10, 3073) * step_size
loss = L(Xtr_cols, Ytr, Wtry)
if loss < bestloss:
W = Wtry
bestloss = loss
print 'iter %d loss is %f' % (i, bestloss)


. . .

Strategy #3：梯度下降 . . .

## 变体

$$\theta = \theta - \eta \cdot \nabla_\theta J( \theta)$$

Method #1：批量梯度下降

Method #2：随机梯度下降（SGD） SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily (From Wikipedia)

Method #3：Mini-batch 梯度下降

## 痛点

SGD 已经足够好了吗？

. . .

• 如何选择学习率？学习率太小会有什么问题？学习率太大又会有什么问题？
• 学习率在训练过程应该如何调整？预先定义一个调度器？学习无进展时衰减？
• 所有参数的学习率都一样会有问题吗？或者说不同参数是不是应该按照不同尺度去调整？
• 当迭代进行到极小值点或鞍点时，梯度为零，应该如何逃离这些次优位置？

# Momentum

## 动机 ## 数学形式

Momentum 引入惯性加速 SGD 下降、减少震荡： $$\begin{split} v_t &= \gamma v_{t-1} + \eta \nabla_\theta J( \theta) \ \theta &= \theta - v_t \end{split}$$ 其中 $v_t$ 累计历史动量；$\gamma$ 动量衰减因子，一般设 $0.9$。

## 直观解释

. . .

$\nabla_\theta J( \theta)$ 梯度 like **加速度**；Momentum 引入 $v_t$ 作为**速度**；$\theta$ 参数值 like **位置**；

# NAG

## 动机

NAG (Nesterov accelerated gradient) 如何解决这个问题？

. . .

• 下一步位置的梯度跟当前动量反向，减小当前动量，比如小球要再次冲到对面上坡
• 下一步位置的梯度跟当前动量同向，增大当前动量，比如小球正在冲下山坡

## 直观解释

NAG 跟 Momentun 相比，从参数更新公式看，区别仅在于当前累计到历史动量的梯度值是 $\nabla_\theta J( \theta)$ 还是 $\nabla_\theta J( \theta_\text{next})$，也即计算梯度的位置不同。

. . . # Adagrad

## 动机

Now that we are able to adapt our updates to the slope of our error function and speed up SGD in turn, we would also like to adapt our updates to each individual parameter to perform larger or smaller updates depending on their importance (we might want to perform a larger update for parameters related to the rarely occurring features).

Adagrad is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.

Dean et al. have found that Adagrad greatly improved the robustness of SGD. Moreover, Pennington et al. used Adagrad to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.

## 数学形式

Previously, we performed an update for all parameters $\theta$ at once as every parameter $\theta_i$ used the same learning rate $\eta$. As Adagrad uses a different learning rate for every parameter $\theta_i$ at every time step $t$, we first show Adagrad’s per-parameter update. For brevity, we use $g^{(t)}$ to denote the gradient at time step $t$. $g^{(t)}i = \nabla\theta J( \theta^{(t)}i )$ is then the partial derivative of the objective function w.r.t. to the parameter $\theta_i$ at time step $t$. Adagrad modifies the general learning rate $\eta$ of SGD at each time step $t$ for every parameter $\theta_i$ based on the past gradients that have been computed for $\theta_i$: $$\eta_i = \frac{\eta}{\sqrt{G^{(t)}{i,i} + \epsilon}}$$ where $G^{(t)} \in \mathbb{R}^{d \times d}$ is a diagonal matrix where each diagonal element $G_{i,i}=\sum_{t'}^{t} (g^{t'}_i)^2$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$. It’s obviously that the larger $G_{i,i}$ is, and the slower $\theta_i$ changed; and $\epsilon$ is a smoothing term that avoids division by zero (usually on the order of $1e-8$); Interestingly, without the square root operation, the algorithm performs much worse. One of Adagrad’s main benefits is that it eliminates the need to manually tune the learning rate $\eta$. Most implementations use a default value of $\eta = 0.01$ and leave it at that.

Now the parameters updating rule as follows: $$\theta^{(t+1)}_i = \theta^{(t)}_i - \eta_i \cdot g^{(t)}_i$$ As $G^{(t)}$ contains the sum of the squares of the past gradients w.r.t. to all parameters $\theta$ along its diagonal, we can now vectorize our implementation: $$\theta^{(t+1)} = \theta^{(t)} - \dfrac{\eta}{\sqrt{G^{(t)} + \epsilon}} g^{(t)}$$ Adagrad’s main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. The following algorithms aim to resolve this flaw.

# Adadelta

## 动机

Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size $w$.

How to record the $w$ past gradients? Instead of inefficiently storing $w$ previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E[g^2]^{(t+1)}$ at time step $t+1$ then depends only on the previous average and the current squared gradient $(g^2)^{(t+1)}$: $$E[g^2]^{(t+1)} = \gamma E[g^2]^{(t)} + (1 - \gamma)(g^{(t+1)})^2$$ where $1 - \gamma$ controls how much of current gradient will be accumulated into the running average; we set $\gamma$ around 0.9. $E[g^2]^{(t)}$ is a vector with the same dimensions. The term $E[g^2]^{(t)}$ is the exponentially decaying average of squared gradients: in stark contrast to AdaGrad, we keep adding new information from squared gradients $(g^2)^{(t)}$, but we also make sure to decrease the effect of old gradients.

We can now simply replace the diagonal matrix $G^{(t)}$ in the Adagrad with the decaying average over past squared gradients $E[g^2]^{(t)}$. Then the parameters updating rule is: \begin{align} \begin{split} \theta^{(t+1)} &= \theta^{(t)} + \Delta \theta^{(t)} \ \Delta \theta_t &= - \dfrac{\eta}{\sqrt{E[g^2]^{(t)} + \epsilon}} \odot g^{(t)} = - \dfrac{\eta}{\text{RMS}[g]^{(t)}} \odot g^{(t)} \end{split} \end{align} where $\odot$ is a element-wise vector multiplication; $\text{RMS}[\cdot]$ denotes the root mean square metric.

The authors note that the units in this update (as well as in SGD, Momentum, or Adagrad) do not match, i.e. the update should have the same hypothetical units as the parameter. To realize this, they first define another exponentially decaying average $E[\Delta \theta^2]^{(t)}$’s RMS $\text{RMS}[\Delta \theta]^{(t)}$ to substitute the above learning rate $\eta$, this time not of squared gradients but of squared parameter updates: $$E[\Delta \theta^2]^{(t)} = \rho E[\Delta \theta^2]^{(t-1)} + (1 - \rho) (\Delta \theta^{(t)})^2$$ Since $\text{RMS}[\Delta \theta]^{(t)}=\sqrt{E[\Delta \theta^2]^{(t)}}$ is unknown in time step $t$, we approximate it with the RMS of parameter updates until the previous time step $\text{RMS}[\Delta \theta]^{(t-1)}$. Then, AdaDelta becomes: \begin{align} \begin{split} \theta^{(t+1)} &= \theta^{(t)} + \Delta \theta^{(t)} \ \Delta \theta^{(t)} &= - \dfrac{\text{RMS}[\Delta \theta]^{(t-1)}}{\text{RMS}[g]^{(t)}} \odot g^{(t)} \end{split} \end{align} With Adadelta, we do not even need to set a default learning rate $\eta$, as it has been eliminated from the update rule.

# RMSprop

## 简介

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera Class.

RMSprop and Adadelta have both been developed independently around the same time stemming from the need to resolve Adagrad’s radically diminishing learning rates. RMSprop in fact is identical to the first update vector of Adadelta that we derived above: \begin{align} \begin{split} \theta^{(t+1)} &= \theta^{(t)} + \Delta \theta^{(t)} \ \Delta \theta^{(t)} &= - \dfrac{\eta}{\sqrt{E[g^2]^{(t)} + \epsilon}} \odot g^{(t)} \ E[g^2]^{(t)} &= \gamma E[g^2]^{(t-1)} + (1-\gamma) (g^{(t)})^2 \end{split} \end{align} RMSprop as well divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests $\gamma$ to be set to 0.9, while a good default value for the learning rate $\eta$ is 0.001.

# Adam

## 简介

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients ($v^{(t)}$ in the following formulation) like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients ($m^{(t)}$ in the following), similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.

We compute the decaying averages of past and past squared gradients $m^{(t)}$ and $v^{(t)}$ respectively as follows: $$m^{(t)} = \beta_1 m^{(t-1)} + (1 - \beta_1)g^{(t)} \ v^{(t)} = \beta_2 v^{(t-1)} + (1 - \beta_2)(g^{(t)})^2$$ $m^{(t)}$ and $v^{(t)}$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method.

Then the parameters updating rule as follows: $$\theta^{(t+1)} = \theta^{(t)} - \frac {\eta} {\sqrt{{v}^{(t)}} + \epsilon} \odot {m}^{(t)}$$ But as $m^{(t)}$ and $v^{(t)}$ are initialized as vectors of 0’s, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. $\beta_1$ and $\beta_2$ are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates: \begin{align} \begin{split} \hat{m}^{(t)} &= \dfrac{{m}^{(t)}}{1 - \beta^t_1} \ \hat{v}^{(t)} &= \dfrac{{v}^{(t)}}{1 - \beta^t_2} \end{split} \end{align} They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule: $$\theta^{(t+1)} = \theta^{(t)} - \frac {\eta} {\sqrt{\hat{v}^{(t)}} + \epsilon} \odot \hat{m}^{(t)}$$ The authors propose default values of 0.9 for $\beta_1$, 0.999 for $\beta_2$, and $10^{-8}$ for $\epsilon$. They show empirically that Adam works well in practice and compares favorably to other adaptive learning-method algorithms.

# AdaMax

## 简介

The second moment $v^{(t)}$ in the Adam update rule actually use the $\ell_2$ norm of the gradient: $$v^{(t)} = \beta_2 v^{(t-1)} + (1 - \beta_2)(g^{(t)})^2$$ We can generalize this update to the $\ell_p$ norm and also scale $\beta_2$: $$v^{(t)} = \beta_2^p v^{(t-1)} + (1 - \beta_2^p)(g^{(t)})^p$$ Norms for large $p$ values generally become numerically unstable, which is why $\ell_1$ and $\ell_2$ norms are most common in practice. However, $\ell_\infty$ also generally exhibits stable behavior. For this reason, the authors propose AdaMax (Kingma and Ba, 2015) and show that $v^{(t)}$ with $\ell_\infty$ converges to the following more stable value: $$v^{(t)} = \beta_2^\infty v^{(t-1)} + (1 - \beta_2^\infty)(g^{(t)})^\infty = \max(\beta_2 v^{(t-1)}, g^{(t)})$$ where the $\max$ is a element-wise operator.

We can now plug this into the Adam update equation by replacing its denominator with the above $v^{(t)}$ to obtain the AdaMax update rule: $$\theta^{(t+1)} = \theta^{(t)} - \frac {\eta} {v^{(t)}} \odot \hat{m}^{(t)}$$ Note that $v^{(t)}$ relies on the maxmax operation, it is not as suggestible to bias towards zero as $v^{(t)}$ and $m^{(t)}$ in Adam, which is why we do not need to compute a bias correction for it.

Good default values are 0.9 for $\beta_1$, 0.999 for $\beta_2$, and $\eta = 0.002$.

# Nadam

## 简介

Adam can be viewed as a combination of RMSprop and Momentum: RMSprop contributes the exponentially decaying average of past squared gradients $v^{(t)}$, while momentum accounts for the exponentially decaying average of past gradients $m^{(t)}$. And we have also seen that Nesterov accelerated gradient (NAG) is superior to vanilla Momentum. Nadam (Nesterov-accelerated Adaptive Moment Estimation) thus combines Adam and NAG. In order to incorporate NAG into Adam, we need to modify its momentum $m^{(t)}$.

First, let us recall the Momentum update rule: $$\begin{split} g^{(t)} & = \nabla_\theta J( \theta) \ m^{(t)} &= \gamma m^{(t-1)} + \eta g^{(t)} \ \theta^{(t+1)} &= \theta^{(t)} - m^{(t)} \end{split}$$ where $J$ is our objective function, $\gamma$ is the momentum decay term, and $\eta$ is our step size. Note that momentum involves taking a step in the direction of the previous momentum vector and a step in the direction of the current gradient.

NAG then allows us to perform a more accurate step in the gradient direction by updating the parameters with the momentum step before computing the gradient. We thus only need to modify the gradient $g^{(t)}$ to arrive at NAG: $$\begin{split} \theta_\text{next} &= \theta^{(t)} - \gamma m^{(t-1)} \ g^{(t)} & = \nabla_\theta J(\theta_\text{next}) \ m^{(t)} &= \gamma m^{(t-1)} + \eta g^{(t)} \ \theta^{(t+1)} &= \theta^{(t)} - m^{(t)} \end{split}$$ where $\theta_\text{next}$ is an approximation of the next position of the parameters. We can arrange those formulations as following: $$\begin{split} g^{(t)} & = \nabla_\theta J(\theta^{(t)} - \gamma m^{(t-1)}) \ \theta^{(t+1)} &= \theta^{(t)} - (\gamma m^{(t-1)} + \eta g^{(t)}) \end{split}$$ We can see that the momentum step $m^{(t-1)}$ is applied twice – one time for updating the gradient $g^{(t)}$ and a second time for updating the parameters $\theta^{(t+1)}$. Many proposes to modify NAG by the following way: to apply the look-ahead momentum vector $m^{(t)}$ instead of the previous momentum vector $m^{(t-1)}$ in the above NAG directly to update the current parameters. $$\begin{split} g^{(t)} & = \nabla_\theta J(\theta) \ m^{(t)} &= \gamma m^{(t-1)} + \eta g^{(t)} \ \theta^{(t+1)} &= \theta^{(t)} - (\gamma m^{(t)} + \eta g^{(t)}) \end{split}$$

Similarly, we can add Nesterov momentum into Adam by replacing the previous momentum vector with the current momentum vector. First, the Adam update rule is the following: $$\begin {split} g^{(t)} & = \nabla_\theta J(\theta) \ m^{(t)} &= \beta_1 m^{(t-1)} + (1 - \beta_1) g^{(t)} \ v^{(t)} &= \beta_2 v^{(t-1)} + (1 - \beta_2) g^{(t)} \ \hat{m}^{(t)} &= \frac {m^{(t)}} {1 - \beta^t_1} \ \hat{v}^{(t)} &= \frac {v^{(t)}} {1 - \beta^t_2} \ \theta^{(t+1)} &= \theta^{(t)} - \frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon} \odot \hat{m}^{(t)} \ &= \theta^{(t)} - {\frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon}} \odot {\frac {m^{(t)}} {1 - \beta^t_1}} \ &= \theta^{(t)} - {\frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon}} \odot {\frac {\beta_1 m^{(t-1)} + (1 - \beta_1) g^{(t)}} {1 - \beta^t_1}} \ &= \theta^{(t)} - {\frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon}} \odot \left( {\frac {\beta_1 m^{(t-1)}} {1 - \beta^t_1}} + {\frac {(1 - \beta_1) g^{(t)}} {1 - \beta^t_1}} \right) \ &= \theta^{(t)} - {\frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon}} \odot \left( {\frac { m^{(t-1)}} {1 - \beta^{t-1}_1}} \cdot {\frac {\beta_1 ({1 - \beta^{t-1}_1})} {{1 - \beta^t_1}}} + {\frac {(1 - \beta_1) g^{(t)}} {1 - \beta^t_1}} \right) \ &= \theta^{(t)} - {\frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon}} \odot \left( { \hat{m}^{(t-1)}} \cdot {\frac {\beta_1 ({1 - \beta^{t-1}_1})} {{1 - \beta^t_1}}} + {\frac {(1 - \beta_1) g^{(t)}} {1 - \beta^t_1}} \right) \ \end {split}$$ We can now add Nesterov momentum just as we did previously by simply replacing this bias-corrected estimate of the momentum vector of the previous time step $\hat{m}^{(t-1)}$ with the bias-corrected estimate of the current momentum vector $\hat{m}^{(t)}$, which gives us the Nadam update rule: $$\theta{(t+1)} = \theta^{(t)} - \frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon} \odot \left( \beta_1 \hat{m}^{(t)} + {\frac {(1 - \beta_1) g^{(t)}} {1 - \beta^t_1}} \right)$$

# AMSGrad

## 简介

Reddi et al. (2018) shows that the exponential moving average of past squared gradients as a reason for the poor generalization behaviour of adaptive learning rate methods, e.g. Adadelta, RMSprop, Adam and so on. Recall that the introduction of the exponential average was well-motivated: It should prevent the learning rates to become infinitesimally small as training progresses, the key flaw of the Adagrad algorithm. However, this short-term memory of the gradients becomes an obstacle in other scenarios.

In settings where Adam converges to a suboptimal solution, it has been observed that some minibatches provide large and informative gradients, but as these minibatches only occur rarely, exponential averaging diminishes their influence, which leads to poor convergence. To fix this behaviour, the authors propose a new algorithm, AMSGrad that uses the maximum of past squared gradients $v^{(t)}$ rather than the exponential average to update the parameters.

# Visualization of algorithms

## 简介

The following two animations (Image credit: Alec Radford) provide some intuitions towards the optimization behaviour of most of the presented optimization methods. Also have a look here for a description of the same images by Karpathy and another concise overview of the algorithms discussed. In the above image, we see their behaviour on the contours of a loss surface (the Beale function) over time. Note that Adagrad, Adadelta, and RMSprop almost immediately head off in the right direction and converge similarly fast, while Momentum and NAG are led off-track, evoking the image of a ball rolling down the hill. NAG, however, is quickly able to correct its course due to its increased responsiveness by looking ahead and heads to the minimum. The above image shows the behaviour of the algorithms at a saddle point, i.e. a point where one dimension has a positive slope, while the other dimension has a negative slope, which pose a difficulty for SGD as we mentioned before. Notice here that SGD, Momentum, and NAG find it difficulty to break symmetry, although the two latter eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope.

As we can see, the adaptive learning-rate methods, i.e. Adagrad, Adadelta, RMSprop, and Adam are most suitable and provide the best convergence for these scenarios.

Note: If you are interested in visualizing these or other optimization algorithms, refer to this useful tutorial.

# Which optimizer to use?

## 简介

If your input data is sparse, then you likely achieve the best results using one of the adaptive learning-rate methods. An additional benefit is that you won’t need to tune the learning rate but likely achieve the best results with the default value.

In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta uses the RMS of parameter updates in the numinator update rule. Adam, finally, adds bias-correction and momentum to RMSprop. RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Adam with bias-correction slightly outperform RMSprop towards the end of optimization as gradients become sparser.

Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually achieves to find a minimum, but it might take significantly longer than with some of the optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning rate methods.