
Decoupled Neural Interfaces Using Synthetic Gradients (Summary)

KAU 2020. 11. 6. 22:01

Neural networks and the problem of locking

In a typical model, the whole network is executed as one sequence, and the gradients are then backpropagated.

 

Updates normally happen only after the output has been obtained and the error has been backpropagated.

 

This is where the problem arises: while that propagation is going on, Layer 1 sits idle the whole time.

In other words, it is locked.

 

For a simple network this is not an issue.

 

But for complex networks operating over irregular timescales, it can become a real problem.

 

A module can spend far too long waiting for the gradients to be backpropagated back to it, and the whole setup can become difficult to manage.
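
To make the locking concrete, here is a minimal sketch of ordinary end-to-end backprop (plain PyTorch, hypothetical layer sizes): Layer 1 finishes its forward work almost immediately, but it can only be updated after the whole forward pass and the whole backward pass have completed.

```python
import torch
import torch.nn as nn

# Hypothetical three-layer network used only to illustrate locking.
layer1 = nn.Linear(10, 20)
layer2 = nn.Linear(20, 20)
layer3 = nn.Linear(20, 1)

x, y = torch.randn(32, 10), torch.randn(32, 1)

h1 = torch.relu(layer1(x))   # layer1's forward work ends here...
h2 = torch.relu(layer2(h1))
out = layer3(h2)

loss = nn.functional.mse_loss(out, y)
loss.backward()              # ...but it only receives gradients, and can only
                             # be updated, once the full backward pass returns.
                             # Until then, layer1 is locked.
```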

 

 

So we are going to decouple the interfaces: each module can then be updated independently and is no longer locked.

 

So, how can one decouple neural interfaces - that is decouple the connections between network modules - and still allow the modules to learn to interact? In this paper, we remove the reliance on backpropagation to get error gradients, and instead learn a parametric model which predicts what the gradients will be based upon only local information. We call these predicted gradients synthetic gradients.

 

In other words, instead of obtaining error gradients through backpropagation, a parametric model is trained to predict what the gradients will be using only local information.

These predicted gradients are called synthetic gradients.

 

The synthetic gradient model takes in the activations from a module and produces what it predicts will be the error gradients - the gradient of the loss of the network with respect to the activations.

 

The synthetic gradient model takes the activations of a module and produces a prediction of what the error gradients will be.

 

Going back to our simple feed-forward network example, if we have a synthetic gradient model we can do the following:

... and use the synthetic gradients (blue) to update Layer 1 before the rest of the network has even been executed.
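
As a rough sketch of that idea (hypothetical names and sizes; the synthetic gradient model here is a single linear layer initialised to zero, a simple variant in the spirit of the paper): Layer 1 asks a local model for a predicted gradient of the loss with respect to its own activations, backpropagates that prediction, and updates straight away, before Layer 2 onwards has even run.

```python
import torch
import torch.nn as nn

layer1 = nn.Linear(10, 20)
sg_model = nn.Linear(20, 20)      # predicts dLoss/dh1 from h1 alone
nn.init.zeros_(sg_model.weight)   # zero init: the first synthetic gradients are 0
nn.init.zeros_(sg_model.bias)

opt1 = torch.optim.SGD(layer1.parameters(), lr=0.01)

x = torch.randn(32, 10)
h1 = torch.relu(layer1(x))            # forward through layer1 only
synthetic_grad = sg_model(h1)         # predicted gradient of the loss w.r.t. h1

h1.backward(synthetic_grad.detach())  # backprop the *predicted* gradient
opt1.step()                           # layer1 updates now, fully decoupled:
opt1.zero_grad()                      # layer2, the loss, etc. have not run yet
```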

 

The synthetic gradient model itself is trained to regress the target gradients.

These target gradients are either the true gradients backpropagated from the loss, or other synthetic gradients backpropagated from a synthetic gradient model further downstream.
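
A minimal sketch of that regression step, under the same hypothetical setup: once a target gradient for h1 is available (here the true gradient backpropagated from the loss; in a deeper stack it could itself be a downstream synthetic gradient), the synthetic gradient model is fitted to it with a simple L2 loss.

```python
import torch
import torch.nn as nn

layer1, layer2 = nn.Linear(10, 20), nn.Linear(20, 1)
sg_model = nn.Linear(20, 20)
opt_sg = torch.optim.SGD(sg_model.parameters(), lr=0.01)

x, y = torch.randn(32, 10), torch.randn(32, 1)
h1 = torch.relu(layer1(x))
loss = nn.functional.mse_loss(layer2(h1), y)

# Target gradient: the true gradient of the loss w.r.t. h1.
target_grad, = torch.autograd.grad(loss, h1)

# Regress the synthetic gradient model towards the target.
sg_loss = nn.functional.mse_loss(sg_model(h1.detach()), target_grad.detach())
sg_loss.backward()
opt_sg.step()
opt_sg.zero_grad()
```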

 

 

And this is not limited to a plain feed-forward network: the same decoupled update can be applied at the connection between any two modules.

 

A rough illustration looks like the figure below.

 

However in practice, we can only afford to unroll for a limited number of steps due to memory constraints and the need to actually compute an update to our core model frequently.

In other words, because of memory constraints and the need to compute updates to the core model frequently, we can only unroll for a limited number of steps.

 

This is called truncated backpropagation through time, and shown below for a truncation of three steps:

The change in colour of the core illustrates an update to the core, that the weights have been updated.

In other words, the change in the core's colour marks the point at which its weights are updated.

 

In this example, truncated BPTT seems to address some issues with training - we can now update our core weights every three steps and only need three cores in memory. However, the fact that there is no backpropagation of error gradients over more than three steps means that the update to the core will not be directly influenced by errors made more than two steps in the future.

This limits the temporal dependency that the RNN can learn to model.

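
A sketch of truncated BPTT with a three-step truncation, matching the description above (hypothetical RNN core, data and names): gradients flow freely inside each three-step window, but the graph is cut at every window boundary, so no gradient ever crosses it.

```python
import torch
import torch.nn as nn

core = nn.RNNCell(input_size=8, hidden_size=16)   # hypothetical RNN core
readout = nn.Linear(16, 8)
opt = torch.optim.SGD(list(core.parameters()) + list(readout.parameters()), lr=0.01)

xs = torch.randn(12, 32, 8)   # 12 timesteps, batch of 32
ys = torch.randn(12, 32, 8)
h = torch.zeros(32, 16)

TRUNC = 3
for t0 in range(0, xs.size(0), TRUNC):
    h = h.detach()            # cut the graph: no gradient crosses this boundary
    loss = 0.0
    for t in range(t0, t0 + TRUNC):
        h = core(xs[t], h)
        loss = loss + nn.functional.mse_loss(readout(h), ys[t])
    loss.backward()           # backprop only within the three-step window
    opt.step()                # the core's weights update every three steps
    opt.zero_grad()
```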

 

What if instead of doing no backpropagation between the boundary of BPTT we used DNI and produce synthetic gradients, which model what the error gradients of the future will be? We can incorporate a synthetic gradient model into the core so that at every time step, the RNN core produces not only the output but also the synthetic gradients. In this case, the synthetic gradients would be the predicted gradients of the all future losses with respect to the hidden state activation of the previous timestep. The synthetic gradients are only used at the boundaries of truncated BPTT where we would have had no gradients before.

This can be performed during training very efficiently - it merely requires us to keep an extra core in memory as illustrated below.

In other words, all this costs is keeping one extra core in memory.

 

In the figure, the green lines are gradients computed with respect to the core's parameters, while the dotted lines are computed only from the state of the input.
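
Building on the truncated-BPTT sketch above (same hypothetical names), here is roughly how the boundary synthetic gradient could be wired in: at the last hidden state of each window, a synthetic gradient model predicts the gradient of all future losses with respect to that state, and the prediction is injected exactly where truncated BPTT would otherwise supply no gradient at all.

```python
import torch
import torch.nn as nn

core = nn.RNNCell(input_size=8, hidden_size=16)
readout = nn.Linear(16, 8)
sg_model = nn.Linear(16, 16)      # predicts dL_future/dh at the boundary
nn.init.zeros_(sg_model.weight)
nn.init.zeros_(sg_model.bias)
opt = torch.optim.SGD(list(core.parameters()) + list(readout.parameters()), lr=0.01)

xs, ys = torch.randn(12, 32, 8), torch.randn(12, 32, 8)
h = torch.zeros(32, 16)

TRUNC = 3
for t0 in range(0, xs.size(0), TRUNC):
    h = h.detach()
    loss = 0.0
    for t in range(t0, t0 + TRUNC):
        h = core(xs[t], h)
        loss = loss + nn.functional.mse_loss(readout(h), ys[t])

    # Predicted gradient of all *future* losses w.r.t. the last hidden state.
    boundary_sg = sg_model(h.detach()).detach()
    # Adding (h * g).sum() to the loss injects g as an extra gradient on h,
    # precisely at the truncation boundary.
    (loss + (h * boundary_sg).sum()).backward()
    opt.step()
    opt.zero_grad()
    # Training sg_model itself (regressing it towards the next window's
    # backpropagated gradient plus its own bootstrapped prediction) is
    # omitted here for brevity.
```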

 

 

The results below show what happens when DNI is applied to an RNN.

 

By using DNI and synthetic gradients with an RNN, we are approximating doing backpropagation across an infinitely unrolled RNN. In practice, this results in RNNs which can model longer temporal dependencies. Here’s an example result showing this from the paper.

Penn Treebank test error during training (lower is better):

This graph shows the application of an RNN trained on next character prediction on Penn Treebank, a language modelling problem. On the y-axis the bits-per-character (BPC) is given, where smaller is better. The x-axis is the number of characters seen by the model as training progresses. The dotted blue, red and grey lines are RNNs trained with truncated BPTT, unrolled for 8 steps, 20 steps and 40 steps - the higher the number of steps the RNN is unrolled before performing backpropagation through time, the better the model is, but the slower it trains. When DNI is used on the RNN unrolled 8 steps (solid blue line) the RNN is able to capture the long term dependency of the 40-step model, but is trained twice as fast (both in terms of data and wall clock time on a regular desktop machine with a single GPU).

To reiterate, adding synthetic gradient models allows us to decouple the updates between two parts of a network. DNI can also be applied to hierarchical RNN models - a system of two (or more) RNNs running at different timescales. As we show in the paper, DNI significantly improves the training speed of these models by increasing the update rate of the higher-level modules.

 

References

https://deepmind.com/blog/article/decoupled-neural-networks-using-synthetic-gradients

 
