





















An in-depth analysis of multi-layer perceptrons (mlps) and recurrent neural networks (rnns), two types of deep neural networks. Mlps are used to learn arbitrary decision boundaries, while rnns are designed to process sequences of variable length. The structure of mlps and rnns, the convolution operation in convolutional neural networks (cnns), and the long short-term memory (lstm) model, a popular form of gated rnn. It also discusses the gradient descent method, l2-regularization, and dropout, techniques used to combat overfitting and improve the learning process.
Typology: Lecture notes






















Definition 1.1 - Deep Representation Learning Representation Learning is a set of techniques in machine learning where a system can automatically learn representations needed for feature detection from the raw data without the need for hand-designed feature descriptions. Deep Representation Learning is then learning to classify using this feature detection.
Remark 2.1 - Biological Inspiration In the natural world Neurons are the basic working units of the brain. Neurons can be split into three main areas
i). Dendrites - Receive inputs from other neurons.
ii). Axon - Carries information.
iii). Axon Terminals & Synapses - Send information to other neurons.
Artificial Neural Networks seek to mimic this structure.
Definition 2.1 - Neuro-Plasticity Neuro-Plasticity is the ability of a neural system to adapt its structure to accommodate new information (i.e. Learn). This can take several forms including growth & function changes.
Definition 2.2 - Feed-Forward Network A Feed-Forward Network is an artificial neural network where the connections between nodes are uni-directional. Data is provided to the input layer and an output is returned from the output layer; no layer is visited twice.
Proposition 2.1 - Neural Networks as an Ensemble of Sub-Networks A Neural Network can be considered to represent many sub-networks. These sub-networks are switched between depending on which components are picked and how they are defined.
Definition 2.3 - Perceptron
Here, $\eta \in \mathbb{R}^+$ is known as the Learning Rate. Remember that $f^*(\cdot) \in \{1, -1\}$.
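Proposition 2.4 - Training Process for a Single-Layer Perceptron Let $\{(\mathbf{x}_1, f^*(\mathbf{x}_1)), \dots, (\mathbf{x}_N, f^*(\mathbf{x}_N))\}$ be a set of training data. To learn a good set of weights $\mathbf{w}$ do the following process.
i). Initialise the weight vector $\mathbf{w} = \mathbf{0}$.
ii). Consider the next training datum $(\mathbf{x}_i, f^*(\mathbf{x}_i))$.
iii). Calculate the prediction $f(\mathbf{x}_i)$.
iv). Compare the prediction $f(\mathbf{x}_i)$ with the ground truth $f^*(\mathbf{x}_i)$.
v). Update the weight vector $\mathbf{w} = \mathbf{w} + \Delta\mathbf{w}$ where
$$\Delta\mathbf{w} = \begin{cases} \eta\, f^*(\mathbf{x}_i)\, \mathbf{x}_i & \text{if } f^*(\mathbf{x}_i) \neq f(\mathbf{x}_i) \\ \mathbf{0} & \text{otherwise} \end{cases}$$
vi). Repeat ii)-v) until the training set is exhausted.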
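The training process above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the threshold rule for the prediction (the sign of the weighted sum), the constant bias input appended to each datum, the `epochs` parameter (repeated passes over the data) and the toy AND data set are all assumptions made for the example, not part of the notes.

```python
def predict(w, x):
    """Perceptron prediction in {1, -1}: sign of the weighted sum (assumed rule)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= 0 else -1


def train_perceptron(data, eta=1.0, epochs=1):
    """Apply w <- w + eta * f*(x) * x whenever the prediction is wrong.

    `data` is a list of (x, target) pairs with target in {1, -1}.
    `epochs` (an assumption) repeats the single pass described in the notes.
    """
    w = [0.0] * len(data[0][0])              # step i): start from the zero vector
    for _ in range(epochs):
        for x, target in data:               # step ii): next training datum
            if predict(w, x) != target:      # steps iii)-iv): predict and compare
                # step v): move the weights towards the correct side
                w = [wi + eta * target * xi for wi, xi in zip(w, x)]
    return w


# Toy linearly separable data (logical AND); the last component is a bias input.
and_data = [([0, 0, 1], -1), ([0, 1, 1], -1), ([1, 0, 1], -1), ([1, 1, 1], 1)]
w = train_perceptron(and_data, eta=1.0, epochs=20)
```

Because the toy data is linearly separable, the updates stop once every datum is classified correctly; on data that is not separable the weights would keep changing.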
Remark 2.3 - Learning Arbitrary Decision Boundaries An arbitrary decision boundary (i.e. anything non-linear) can be learned by using a Multi-Layer Perceptron with non-linear activation functions.
Definition 2.4 - Multi-Layer Perceptron A Multi-Layer Perceptron has the same general structure as a perceptron but with multiple calculations occurring and multiple output values. Below is a diagram of an MLP of depth N (i.e. there are N layers of computation)
[Diagram: an MLP with an input layer (nodes $f_0^0, f_1^0, f_2^0, \dots$), a 1st hidden layer (nodes $f_0^1, f_1^1, f_2^1, \dots$), more hidden layers, and an output layer (nodes $f_0^N, f_1^N, f_2^N, \dots$).]
Note that each layer can have a different width (i.e. number of nodes in the layer). For each consecutive pair of layers $\mathbf{f}^i, \mathbf{f}^j$ (of widths $n_i, n_j$ respectively) there is an associated weight matrix $W \in \mathbb{R}^{n_i \times n_j}$ such that $\mathbf{f}^j = W^T \mathbf{f}^i$. The values from the output layer are then passed to an activation function to make a classification.
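As a rough illustration of the layer-to-layer relation $\mathbf{f}^j = W^T \mathbf{f}^i$, the sketch below computes one layer of an MLP with plain Python lists. The function names and the choice of ReLU as the activation are assumptions for the example; the notes do not fix a particular activation.

```python
def relu(v):
    """Rectified linear unit, used here only as an example activation."""
    return v if v > 0 else 0.0


def layer_forward(W, f_in, activation=relu):
    """Compute activation(W^T f_in) for a weight matrix W of shape n_in x n_out."""
    n_in, n_out = len(W), len(W[0])
    # Entry i of W^T f_in sums over the incoming nodes k.
    signals = [sum(W[k][i] * f_in[k] for k in range(n_in)) for i in range(n_out)]
    return [activation(s) for s in signals]


# A layer of width 3 feeding a layer of width 2.
W = [[1.0, -1.0],
     [0.0, 1.0],
     [1.0, -2.0]]
f_next = layer_forward(W, [1.0, 2.0, 3.0])
```

Stacking several such calls, each with its own weight matrix, gives the forward pass of a deeper network.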
Remark 2.4 - Using MLPs An MLP with a single hidden layer is sufficient to represent any boolean or continuous function, although the layer may be exponentially wider than the input.
An MLP with two hidden layers is sufficient to represent any mathematical function.
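Proposition 2.5 - MLPs as Computation Graphs
$$s_i^j := \left[(W^j)^T \mathbf{f}^{j-1}\right]_i \quad \text{(weighted sum of the $i$th node of the $j$th hidden layer)} \implies \frac{\partial s_i^j}{\partial w_{ki}^j} = f_k^{j-1}$$
where $w_{ki}^j$ is the weight connecting node $k$ of layer $j-1$ to node $i$ of layer $j$.
$$f_i^j := g_i^j(s_i^j) \quad \text{($i$th output value of the $j$th hidden layer)} \implies \frac{\partial f_i^j}{\partial s_i^j} = g_i^{j\prime}(s_i^j) \quad \text{(depends on the definition of $g_i^j$)}$$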
Proposition 2.6 - Output Layer for Classification Problem To use an MLP for classification we require the output layer to represent a probability distribution for the possible classes (i.e. each node has a value in [0, 1] and the sum of all nodes is 1).
We can force outputs to reflect this distribution using a Softmax Neuron Group in the last layer with activation function
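$$g_j^N(s_j^N) := \frac{e^{s_j^N}}{\sum_{i \in \text{Group}} e^{s_i^N}}$$
This has gradients
$$\frac{\partial g_j^N}{\partial s_j^N} = f_j^N (1 - f_j^N), \qquad \frac{\partial g_j^N}{\partial s_i^N} = -f_j^N f_i^N \quad (i \neq j)$$
All outputs $f_i^N := g_i^N(s_i^N)$ range between 0 and 1, while the sum $\sum_{i \in \text{Group}} f_i^N = 1$.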
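The softmax group and its gradients can be checked numerically with a short sketch. The function names are hypothetical, and subtracting the maximum signal before exponentiating is a standard numerical-stability trick that is not mentioned in the notes (it does not change the result).

```python
import math


def softmax(signals):
    """Softmax over a group of signals s_i; outputs lie in (0, 1) and sum to 1."""
    m = max(signals)                          # stability shift; cancels in the ratio
    exps = [math.exp(s - m) for s in signals]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(signals):
    """jac[j][i] = d f_j / d s_i: f_j (1 - f_j) if i == j, else -f_j f_i."""
    f = softmax(signals)
    n = len(f)
    return [[f[j] * (1 - f[j]) if i == j else -f[j] * f[i] for i in range(n)]
            for j in range(n)]


probs = softmax([2.0, 1.0, 0.1])
jac = softmax_jacobian([2.0, 1.0, 0.1])
```

Each row of the Jacobian sums to zero, which reflects the constraint that the outputs always sum to 1.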
2.3.1 Convolution DNN (CNNs)
Definition 2.5 - Convolution ∗ The Convolution operation takes an input x and a kernel ω. The kernel is then applied to each element of the input. The actual operation depends on the set up of the input and the kernel (generally on their dimension).
$$(X * \omega)(i, j) = \sum_m \sum_n X_{i+m,\, j+n}\, \omega(m, n)$$
The kernel is first placed so that its top-left entry overlaps the top-left entry of the input; the operation is applied; the kernel is then shifted right by one cell; this repeats until the right-side of the kernel would pass the right-side of the input, at which point it starts again from the left but one cell lower. This means that (X ∗ ω) is smaller than X in every dimension in which the kernel is wider than one cell.
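A minimal sketch of the sliding-kernel operation for a 2D input follows, written with nested Python lists. The function name is hypothetical, and the sketch assumes no padding and a shift of one cell at a time, as in the description above.

```python
def convolve2d_valid(X, kernel):
    """(X * w)(i, j) = sum_m sum_n X[i+m][j+n] * w[m][n], no padding, shift of 1."""
    H, W = len(X), len(X[0])
    kH, kW = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kH + 1):               # move the kernel down one cell at a time
        row = []
        for j in range(W - kW + 1):           # move the kernel right one cell at a time
            row.append(sum(X[i + m][j + n] * kernel[m][n]
                           for m in range(kH) for n in range(kW)))
        out.append(row)
    return out


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = convolve2d_valid(X, kernel)
```

Here a 3×3 input and a 2×2 kernel give a 2×2 output, which illustrates the shrinkage described above.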
Proposition 2.7 - Zero Padding Applying a kernel shrinks the data, which we don't want as eventually the data would disappear. To avoid this Zero Padding is used. Zero Padding adds zeros around the outside of the input so that the output, after the kernel is applied, is the same size as the original input data.
Proposition 2.8 - Higher-Dimensional Data When we have multiple data readings per instance (e.g. for each pixel of an RGB image) we consider each of these data fields to be a channel. When we apply a convolution, the kernel must have the same depth as the number of channels, so it is applied across both space & channels (ie a single convolution kernel for an image is a 3D tensor). An extra dimension is added if we want to apply multiple filters/convolutions. The number of filters applied is equal to the number of channels in the next layer.
Proof 2.1 - Convolution in Practice In practice we do not convolve densely (ie every pixel), rather we skip certain pixels. The number of pixels skipped is referred to as the stride of the layer. This results in downsampled convolutions.
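The effect of zero padding and stride on the output size can be sketched as below. The helper names are hypothetical. The size formula $\lfloor (n + 2p - k)/s \rfloor + 1$ uses the common convention in which a stride of $s = 1$ means the kernel visits every position, which is an assumption about how the notes count skipped pixels.

```python
def zero_pad(X, p):
    """Return a copy of the 2D list X with p rows/columns of zeros on every side."""
    width = len(X[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]                  # top border
    padded += [[0] * p + list(row) + [0] * p for row in X]    # left/right borders
    padded += [[0] * width for _ in range(p)]                 # bottom border
    return padded


def conv_output_size(n, k, padding=0, stride=1):
    """Output length along one dimension for input length n and kernel length k."""
    return (n + 2 * padding - k) // stride + 1


same_size = conv_output_size(5, 3, padding=1, stride=1)    # padding keeps the size
downsampled = conv_output_size(7, 3, padding=1, stride=2)  # stride 2 roughly halves it
```

With a 3-wide kernel, one ring of zeros is enough to keep the output the same size as the input, while a stride of 2 produces a downsampled output.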
Remark 2.7 - Care needs to be taken when backpropagating CNNs with zero padding or stride greater than 1.
Remark 2.8 - Typically CNNs start with convolution layers and then end with a couple of fully connected layers.
Remark 2.9 - Training CNNs
The most expensive part of training CNNs is training the convolutional layers. The fully-connected layers at the end are relatively inexpensive as they have a small number of features.
When performing gradient descent, every gradient step requires a complete run of feed-forward propagation and backward propagation through the entire network.
Proposition 2.9 - Residual Networks (ResNet) ResNets are an innovation in CNNs where filters are applied to the input and the result is then merged back (by addition) with the input before being passed to the activation function. This leads to faster convergence by searching for weights which deviate only slightly from the identity. This allows for deeper networks.
Figure 1: Diagram of how the prediction $o_t$ from step t is calculated. Note that this process has a Hidden-to-Hidden recursive step (ie $h_{t-1}$ to $h_t$). Note if $x_t$ or $h_t$ are multi-dimensional then the arrows are considered as a fully-connected layer. $x_t$ is the $t$th input, $h_{t-1}$ is the hidden value of the previous step, θ is the hidden-to-hidden weights, W is the input-to-hidden weights and V is the hidden-to-output weights.
Figure 2: Diagram of an RNN which accepts a sequence $x_1, \dots, x_T$ of 2D data (ie $x_t = (x_{t,1}, x_{t,2})$). $W, \theta, V \in \mathbb{R}^{2 \times 2}$ are the weight matrices between each pair of data. Note that this graph is recurrent, with the dashed arrows denoting how $h_{t-1}$ is receiving inputs from the previous step in the graph, and $h_t$ is sending its value to the next step of the graph.
2.3.2 Recurrent DNN (RNNs)
Definition 2.8 - Recurrent DNN (RNN) Recurrent DNNs are designed for processing sequential data (similar to how CNNs are designed for grid-like data). Most RNN architectures are designed to process sequences of variable length.
Each hidden layer of an RNN acts on a different element in the sequence (ie layer t acts on xt) and takes the output of the previous layer ht− 1 as an input too.
Every layer uses the same weight values, and these weights can be considered as three categories: Input-to-Hidden W, which interacts with the inputs to the layer $x_t$; Hidden-to-Hidden θ, which interacts with the hidden value of the previous layer $h_{t-1}$; and Hidden-to-Output V, which produces the outputted prediction $o_t$ from the calculated hidden value $h_t$. Traditionally θ also denotes the set of all weights used to compute the hidden value (ie θ := (θ, W))
ht = f (ht− 1 , xt; θ)
To calculate the outputted prediction ot we apply the Hidden-to-Output weights V to the hidden value ht. ot = g(ht; V )
Figure 1 provides an abstract diagram for a layer of a recurrent neural network.
Figure 4: Diagram of relationship between data (x, y) and the loss function L, for an RNN which makes predictions at every time step and has an output-to-hidden recursive relationship between steps ($o_{t-1}$ to $h_t$).
Figure 5: Diagram of relationship between data (x, y) and the loss function L, for an RNN which only makes a prediction at time step t + 1 and has a hidden-to-hidden recursive relationship between steps ($h_{t-1}$ to $h_t$).
Type I) is the most common as it is Turing complete. Type II) is less powerful as only the information captured in the prediction $o_t$ is passed to future steps (and $o_t$ tends to be low-dimensional). Type III) is only used to produce summaries (e.g. classify a sentence from its constituent words).
Training an RNN
Proposition 2.12 - Training a Type I) RNN Consider the Type I) RNN defined in Proposition 2.12. The value for each hidden value ht and predicted value ot can be considered as the following functions
$$h_t = g(\theta h_{t-1} + W x_t + b)$$
$$o_t = g(V h_t + c)$$
where g are activation functions and b, c ∈ R are bias terms. The total loss is the sum of all losses at all time steps.
$$L := \sum_{t \in T} L(o_t, y_t)$$
To calculate the gradient of L the generalised back-propagation algorithm is applied to the unrolled network. (The parameters are V, θ, W, b, c and the nodes are xt, ht, ot, Lt for all t ∈ T ).
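The forward pass through the unrolled network can be sketched as follows. Several choices here are assumptions made for illustration only: tanh for the hidden activation, softmax for the output activation, cross-entropy as the per-step loss, bias terms stored as vectors, targets given as class indices, and all function names.

```python
import math


def matvec(M, v):
    """Matrix-vector product for list-based matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]


def rnn_forward(xs, ys, theta, W, V, b, c, h0):
    """Run h_t = tanh(theta h_{t-1} + W x_t + b), o_t = softmax(V h_t + c).

    Returns the predictions and the total loss, i.e. the sum over time steps
    of the cross-entropy between o_t and the target class index y_t.
    """
    h, outputs, total_loss = h0, [], 0.0
    for x, y in zip(xs, ys):                  # the same weights are reused at every step
        pre = [a + u + bi for a, u, bi in zip(matvec(theta, h), matvec(W, x), b)]
        h = [math.tanh(p) for p in pre]
        o = softmax([a + ci for a, ci in zip(matvec(V, h), c)])
        outputs.append(o)
        total_loss += -math.log(o[y])
    return outputs, total_loss


zeros2x2 = [[0.0, 0.0], [0.0, 0.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [0, 1, 0]
outputs, loss = rnn_forward(xs, ys, zeros2x2, zeros2x2, zeros2x2,
                            [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

The loop makes the sequential dependence explicit: step t cannot start until $h_{t-1}$ is available, and every $h_t$ would need to be kept for a later backward pass.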
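This gives us gradients
$$\frac{\partial L}{\partial L(o_t, y_t)} = 1$$
$$(\nabla_{o_t} L)_i = \frac{\partial L}{\partial [o_t]_i} = \frac{\partial L}{\partial L(o_t, y_t)} \cdot \frac{\partial L(o_t, y_t)}{\partial [o_t]_i} = [o_t]_i - [y_t]_i$$
$\frac{\partial L}{\partial o_t}$ is calculated using softmax and cross-entropy. To work out the gradients wrt each prediction $o_t$ we start at the end of the sequence $(o_T, y_T)$ and work towards $(o_1, y_1)$.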
Due to unrolling the model, the instances of the weights W, θ, V at each time step are treated separately (W 1 , θ 1 , V 1 ,... , WT , θT , VT ) when calculating gradients. Thus, once the partial derivatives have been calculated for each of these instances we merge them to get the gradient wrt each of W, θ, V by taking averages.
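Proposition 2.13 - Gradients for RNN - Type 1 Let T be the size of the sequence $(x_1, \dots, x_T)$ (not transpose). With θ the hidden-to-hidden weights, W the input-to-hidden weights, V the hidden-to-output weights, and tanh as the hidden activation,
$$\nabla_c L = \sum_{t=1}^{T} \nabla_{o_t} L$$
$$\nabla_b L = \sum_{t=1}^{T} \left(\frac{\partial h_t}{\partial b_t}\right)^T \nabla_{h_t} L$$
$$\nabla_V L = \sum_{t=1}^{T} \sum_i \frac{\partial L}{\partial [o_t]_i} \nabla_{V_t} [o_t]_i = \sum_t (\nabla_{o_t} L)\, h_t^T$$
$$\nabla_\theta L = \sum_{t=1}^{T} \sum_i \frac{\partial L}{\partial [h_t]_i} \nabla_{\theta_t} [h_t]_i = \sum_t \mathrm{diag}(1 - h_t^2)\, (\nabla_{h_t} L)\, h_{t-1}^T$$
$$\nabla_W L = \sum_{t=1}^{T} \sum_i \frac{\partial L}{\partial [h_t]_i} \nabla_{W_t} [h_t]_i = \sum_t \mathrm{diag}(1 - h_t^2)\, (\nabla_{h_t} L)\, x_t^T$$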
Remark 2.13 - Computing the Gradient in an RNN is expensive Computing the gradient of L in an RNN is very expensive as it requires performing a forward propagation pass of the unrolled network (even though the weights V, W, θ at each time step are shared, they are treated as separate when unrolling), followed by a back-propagation pass through time (BPTT). The run-time of this cannot be reduced by parallelisation as the forward pass is sequential (and all values computed in the forward pass need to be stored for reuse by the backward pass). The run-time and memory cost are both O(T).
More Types of RNN
Definition 2.9 - Bi-Directional RNN Bi-Directional RNNs are able to use information from the past and the future. Effectively, a Bi-Directional RNN is actually two RNNs: one moves forward through the sequence; the other moves backwards through the sequence; both have their own weight sets. The two RNNs are joined at the prediction step (ie calculating $o_t$), which takes into account the hidden values of both RNNs for the same input. Suppose the forward direction RNN has hidden values $h_1, \dots, h_T$ and the backwards direction RNN has hidden values $k_1, \dots, k_T$, then
$$o_t = g(V_h h_t + V_k k_t + b)$$
to forget what has been learnt up to that point.
$$[f_t]_i = \sigma\left([b_f]_i + \sum_j [U_f]_{i,j} [x_t]_j + \sum_j [W_f]_{i,j} [h_{t-1}]_j\right)$$
or, in vector form,
$$f_t = \sigma(b_f + U_f x_t + W_f h_{t-1})$$
where σ(·) is a sigmoid function, $U_f$ is the input-to-hidden weights learnt for f and $W_f$ is the hidden-to-hidden weights for f.
External Input Gate (Orange) - The function $i_t(x_t, h_{t-1}; \cdot)$ is used to gate another input. It is computed in the same way as the Forget Gate (using a sigmoid)
$$i_t = \sigma(b_i + U_i x_t + W_i h_{t-1})$$
where $U_i$ is the input-to-hidden weights learnt for i and $W_i$ is the hidden-to-hidden weights for i.
State Gate (Green) - The function $c_t(f, i, j, c_{t-1})$ is a State Gate which updates the memory of the LSTM.
$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma(b_j + U_j x_t + W_j h_{t-1})$$
where ∘ is element-wise multiplication.
Output Gate - The function $o_t(x_t, h_{t-1}; \cdot)$ calculates the output using the previous hidden value (before any forgetting occurs).
ot = σ (bo + Uoxt + Woht− 1 )
Hidden State - The hidden state of this layer ht is calculated by updating the value of the Output Gate ot with the memory ct
ht = tanh(ct) ◦ ot
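The gate equations can be combined into a single step function, sketched below with plain Python lists. The function names and the layout of `params` (a dictionary mapping each gate name to a `(b, U, W)` triple) are assumptions for the example. Following the equations in these notes, the candidate term inside the State Gate uses a sigmoid; many references use tanh there instead.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(b, U, x, W, h):
    """Return b + U x + W h for list-based vectors and matrices."""
    return [bi
            + sum(u * xj for u, xj in zip(Ui, x))
            + sum(w * hj for w, hj in zip(Wi, h))
            for bi, Ui, Wi in zip(b, U, W)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'j', 'o' to (b, U, W) triples."""
    gate = {name: [sigmoid(v) for v in affine(b, U, x, W, h_prev)]
            for name, (b, U, W) in params.items()}
    # Memory update: forget part of the old memory, add the gated candidate.
    c = [f * cp + i * j
         for f, cp, i, j in zip(gate["f"], c_prev, gate["i"], gate["j"])]
    # Hidden state: squashed memory, gated by the output gate.
    h = [math.tanh(ck) * o for ck, o in zip(c, gate["o"])]
    return h, c


# All-zero parameters for 1 input feature and 2 hidden units (toy values).
zeros = ([0.0, 0.0], [[0.0], [0.0]], [[0.0, 0.0], [0.0, 0.0]])
params = {name: zeros for name in ("f", "i", "j", "o")}
h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -1.0], params)
```

With all-zero parameters every gate evaluates to 0.5, so the new memory is half the old memory plus 0.25, which makes the role of each gate easy to trace by hand.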
Remark 2.15 - Simpler LSTMs GRUs are simpler versions of LSTMs which are deemed to be just as powerful. A GRU just has an Update Gate, which gates future information, and a Reset Gate, which controls which parts of the state get used. A single equation can be used to calculate the hidden value for a GRU.
Remark 2.16 - For Revision Focus on the notion of RNNs; and the power & use-cases of the different flavours. Don’t worry about derivatives. See chapter 10.
2.4.1 Overfitting
Proposition 2.15 - High Parameter Space Overfitting Multi-Layer perceptrons often have millions of neurons, with millions of parameters & connections. These models have very high degrees of freedom and thus are prone to overfitting.
Proposition 2.16 - Identifying Overfitting When training a neural network we can plot the value of the Loss Function on the training data and on the testing data after each epoch. Overfitting occurs if the distance between these two curves increases over time.
A natural idea from this is to keep a copy of the model with the smallest generalisation gap (the distance between the two curves) and revert to it after all the training.
Remark 2.17 - Overfitting can always be addressed by using more data, more representative data and/or strategically sampling data. Deep Learning techniques are particularly good at extracting new information from new data, whereas some older algorithms reach a ceiling where they can no longer learn from new information.
Proposition 2.17 - Obtaining More Data (Data Augmentation) Data for deep learning is expensive to collect as it requires ground truth annotations. Data Augmentation is a set of techniques used to increase the size of our data pool, without having to collect more data, by slightly modifying existing data. This can be done as an online process during training.
For images these techniques include: cropping, adding noise, rotating, translation, hue shift etc.
Definition 2.12 - Regularisation A Regularisation is any modification made to a learning algorithm which is intended to reduce its generalisation error, but not its training error.
Definition 2.13 - L-Regularisation L-Regularisation constrains the weight space (i.e. makes certain weight values more likely to be learned by the system).
Definition 2.14 - L 1 -Regularisation L 1 -Regularisation targets a local minimum with sparse weights to combat overfitting (ie using very few nodes). This is done by introducing a penalty to the cost function for every weight, based on the absolute value of the weight. This penalises non-zero weight values. This gives a cost function of the form
$$J(X; W) = \underbrace{\sum_{x \in X} L(f(x; W), f^*(x))}_{\text{normal cost function}} + \sum_{w \in W} \mu |w|$$
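A small sketch of the penalised cost and of the extra term the penalty contributes to each weight's gradient. The function names are hypothetical, and using the sign of the weight (with 0 at w = 0) as the derivative of |w| is a common subgradient convention assumed here, not something stated in the notes.

```python
def l1_cost(data_losses, weights, mu):
    """Normal cost (sum of per-sample losses) plus mu * |w| for every weight."""
    return sum(data_losses) + mu * sum(abs(w) for w in weights)


def l1_penalty_gradient(weights, mu):
    """Contribution of the penalty to the gradient: mu * sign(w), taken as 0 at w = 0."""
    return [mu * ((w > 0) - (w < 0)) for w in weights]


cost = l1_cost([0.5, 0.25], [1.0, -2.0, 0.0], mu=0.1)
penalty_grad = l1_penalty_gradient([1.0, -2.0, 0.0], mu=0.1)
```

The penalty gradient has the same magnitude for every non-zero weight, regardless of its size, which is what pushes small weights all the way to zero and produces sparse solutions.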
Practical Performance. Greater depth gives greater performance than greater width. Note that large networks require more training time and larger data sets.
Oscillation Argument. There are functions f that can be represented by a deep ReLU network with a polynomial number of neurons, whereas a shallow network would require exponentially many units.
Proposition 2.18 - Network Distillation Is it possible to learn an ensemble of deep networks, but then compress these deep networks into a single shallow (or more efficient) network? Sometimes.
Proposition 2.19 - Mixture of Experts We could learn a series of networks which each deal with a specific subtask of a problem and then use another network to decide which of these networks (or order of networks) to use in each instance.
Definition 3.1 - Cost/Loss Function, J A Cost Function J(·; ·) is a real-valued measure of how inaccurate a classifier is for a given input configuration (test data & weights). Greater values imply the classifier is less accurate. Here are some common cost functions
Expected Loss $J(X; \mathbf{w}) = \mathbb{E}[L(f(\mathbf{x}, \mathbf{w}), f^*(\mathbf{x}))]$
Empirical Risk $J(X; \mathbf{w}) = \sum_{\mathbf{x} \in X} L(f(\mathbf{x}, \mathbf{w}), f^*(\mathbf{x}))$
Here $L(x, x^*)$ is a measure of loss (distance) between two values. This is defined by the user on a case by case basis. Popular definitions are: $|x - x^*|$, $(x - x^*)^2$ & $\mathbb{1}\{x \neq x^*\}$
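Definition 3.2 - Gradient Descent Gradient Descent aims to learn a set of weight values $\mathbf{w}$ which produce a local minimum for a given cost function J. The update rule for gradient descent is
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \underbrace{\eta \cdot \nabla J(X; \mathbf{w}_t)}_{\Delta\mathbf{w}}$$
$\nabla J(X; \mathbf{w}_t)$ is the vector of partial derivatives of the cost function wrt the weights. It points in the direction of steepest ascent, so stepping against it gives the direction of the greatest descent. We can calculate the $i$th component of $\Delta\mathbf{w}$ after observing $(\mathbf{x}, f^*(\mathbf{x}))$. For a linear model $f(\mathbf{x}; \mathbf{w}_t) = \mathbf{w}_t^T \mathbf{x}$ with loss $\frac{1}{2}(f - f^*)^2$ this is
$$[\Delta\mathbf{w}]_i = \eta\, x_i \big(\underbrace{\mathbf{w}_t^T \mathbf{x}}_{f(\mathbf{x}; \mathbf{w}_t)} - f^*(\mathbf{x})\big)$$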
Definition 3.3 - Online Gradient Descent
i). initialise all weights W randomly.
ii). for t = 0, 1, ... do:
(a) pick next training sample (x, f∗).
(b) forward-backward pass to compute ∇J.
(c) update weights W ← W − η∇J.
(d) if (stopping criteria met) break loop.
iii). return final weights W.
Remark 3.1 - Using Single Samples Using single samples to find the minimum point of the cost function will only roughly approximate aspects of the cost function gradient in online mode, leading to a very noisy gradient descent which may not find the global minimum at all.
Thus, online learning is generally not a good approach. In addition, if the learning rate η is set too large then we may overshoot the global minimum, and if the learning rate η is set too small then it takes a very long time to find a minimum.
Proposition 3.1 - Using Multiple Samples As using a single sample is bad, we try using multiple samples at once and using the average ∇J. There are two approaches
Deterministic Gradient Descent (DGD) where all training samples (X, F∗) are used. Given a small enough learning rate η this will progress to the true local minimum, but at high computational cost.
Stochastic Gradient Descent (MiniBatch) where a small subset of training samples (X, F∗) are used. This is still good at finding a minimum, and much less computationally costly.
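For the average of ∇J we use
$$\nabla J = \frac{1}{M} \sum_{j=1}^{M} \nabla_W L(\underbrace{f(x_j, W)}_{\text{prediction}}, f^*(x_j))$$
where M is the number of samples used.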
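The averaged-gradient update can be sketched for a linear model with squared loss. Everything specific here is an assumption for illustration: the model and loss, the zero initialisation (the notes initialise randomly), the fixed number of steps in place of a stopping criterion, the function names and the toy data. Setting `batch_size=1` gives the online variant and `batch_size=len(data)` gives the deterministic variant.

```python
import random


def grad_single(w, x, target):
    """Gradient of 0.5 * (w.x - target)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - target
    return [err * xi for xi in x]


def minibatch_gd(data, eta=0.1, batch_size=2, steps=500, seed=0):
    """Gradient descent using the average gradient over a random mini-batch."""
    rng = random.Random(seed)                 # seeded so runs are repeatable
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # subset of the training samples
        grads = [grad_single(w, x, t) for x, t in batch]
        avg = [sum(g[i] for g in grads) / batch_size for i in range(len(w))]
        w = [wi - eta * gi for wi, gi in zip(w, avg)]
    return w


# Noise-free samples of f*(x) = 2x - 1; the second input component is a bias of 1.
data = [([x, 1.0], 2.0 * x - 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w = minibatch_gd(data)
```

On this noise-free toy problem the weights approach (2, -1); with noisy data a mini-batch run would instead fluctuate around the minimum.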
Proposition 3.2 - Setting the Learning Rate η Setting the learning rate η can be hard so a process called Simulated Annealing is used to test out several learning rates.
Let η 0 be an initial (high) learning rate and ητ be a final (smaller) learning rate. Simulated Annealing transitions from η 0 to ητ.
i). initialise all weights W randomly.
ii). for k = 0, ..., τ do:
(a) $\eta_k := \left(1 - \frac{k}{\tau}\right)\eta_0 + \frac{k}{\tau}\eta_\tau$
(b) for t = 0, 1, ... do:
i. pick next training sample (x, f∗).
ii. forward-backward pass to compute ∇J.
iii. update weights W ← W − $\eta_k$∇J.
iv. if (stopping criteria met) break loop.
iii). return final weights W.
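The learning-rate schedule in step (a) is a linear interpolation between the initial and final rates, and can be sketched as below. The function name and the example values are assumptions.

```python
def annealed_learning_rate(k, tau, eta_0, eta_tau):
    """eta_k = (1 - k/tau) * eta_0 + (k/tau) * eta_tau for k = 0, ..., tau."""
    return (1 - k / tau) * eta_0 + (k / tau) * eta_tau


# Five stages moving from a high rate of 0.5 down to a final rate of 0.1.
schedule = [annealed_learning_rate(k, 4, 0.5, 0.1) for k in range(5)]
```

Each value in `schedule` would be used as the learning rate for one run of the inner training loop.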
Remember that $J(\cdot, \cdot)$ is the cost function; $s_j^l := (\mathbf{w}^l)^T \mathbf{f}^{l-1}$ is the signal of a layer; $g_j^l(\cdot)$ is the activation function of a layer; $f_j^l := g_j^l(s_j^l)$ is the output of the layer.
Definition 3.5 - Auto-Differentiation using a Feedforward Computational Graph Consider two nodes in a computational graph x, y and suppose you want to find the partial derivative $\frac{\partial x}{\partial y}$.
i). Establish all the paths from y to x in the graph.
ii). Calculate the partial derivatives of each step of these paths (i.e. if there is a path y → a → x calculate $\frac{\partial a}{\partial y}, \frac{\partial x}{\partial a}$).
iii). Apply the chain rule along each path (i.e. for y → a → x calculate $\frac{\partial a}{\partial y} \cdot \frac{\partial x}{\partial a}$).
iv). Sum these calculations together to get the final result $\frac{\partial x}{\partial y}$.
v). Substitute variables to make computation easier.
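Example 3.2 - Auto-Differentiation using a Feedforward Computational Graph Consider the graph in Example 3.1 and wanting to calculate $\frac{\partial a}{\partial f}$.
i). There are three paths from f to a in the graph: (1) f → d → b → a; (2) f → e → b → a; and, (3) f → e → c → a.
ii). We need to calculate the following partial derivatives: $\frac{\partial d}{\partial f}, \frac{\partial b}{\partial d}, \frac{\partial a}{\partial b}$ for (1); $\frac{\partial e}{\partial f}, \frac{\partial b}{\partial e}, \frac{\partial a}{\partial b}$ for (2); and, $\frac{\partial e}{\partial f}, \frac{\partial c}{\partial e}, \frac{\partial a}{\partial c}$ for (3).
(1): $\frac{\partial d}{\partial f} = 1$, $\frac{\partial b}{\partial d} = 1$, $\frac{\partial a}{\partial b} = c$
(2): $\frac{\partial e}{\partial f} = g$, $\frac{\partial b}{\partial e} = 1$, $\frac{\partial a}{\partial b} = c$
(3): $\frac{\partial e}{\partial f} = g$, $\frac{\partial c}{\partial e} = 1$, $\frac{\partial a}{\partial c} = b$
iii). Applying the chain rule to each path gives
(1) $\frac{\partial d}{\partial f} \frac{\partial b}{\partial d} \frac{\partial a}{\partial b} = 1 \cdot 1 \cdot c = c$
(2) $\frac{\partial e}{\partial f} \frac{\partial b}{\partial e} \frac{\partial a}{\partial b} = g \cdot 1 \cdot c = gc$
(3) $\frac{\partial e}{\partial f} \frac{\partial c}{\partial e} \frac{\partial a}{\partial c} = g \cdot 1 \cdot b = gb$
iv). Summing the terms together we get $\frac{\partial a}{\partial f} = c + gc + gb$
v). By substitution we get a final expression
$$\frac{\partial a}{\partial f} = 2 + 5g + 2fg + 2fg^2$$
So when f = 4, g = 2 we have that a = 150 and $\frac{\partial a}{\partial f} = 60$.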
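The path-sum calculation can be checked numerically. The graph of Example 3.1 is not reproduced here, so the node definitions in `forward` below are an assumption: they are one graph (d = f + 3, e = f·g, b = d + e, c = e + 2, a = b·c) that is consistent with the partial derivatives, the final expression and the values a = 150 and ∂a/∂f = 60 quoted in the example.

```python
def forward(f, g):
    """Evaluate the assumed graph: d = f + 3, e = f*g, b = d + e, c = e + 2, a = b*c."""
    d = f + 3
    e = f * g
    b = d + e
    c = e + 2
    a = b * c
    return a, b, c


def da_df_by_paths(f, g):
    """Sum, over the three paths from f to a, the product of the partials along each path."""
    _, b, c = forward(f, g)
    path1 = 1 * 1 * c      # f -> d -> b -> a
    path2 = g * 1 * c      # f -> e -> b -> a
    path3 = g * 1 * b      # f -> e -> c -> a
    return path1 + path2 + path3


a, _, _ = forward(4, 2)
grad = da_df_by_paths(4, 2)
closed_form = 2 + 5 * 2 + 2 * 4 * 2 + 2 * 4 * 2 ** 2   # 2 + 5g + 2fg + 2fg^2 at f=4, g=2
```

A finite-difference estimate, `(forward(4 + 1e-6, 2)[0] - a) / 1e-6`, gives a quick independent check of the same derivative.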
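Proposition 3.5 - Using Hierarchical Dependency By the chain rule we have that $\frac{\partial x}{\partial z} = \frac{\partial x}{\partial y} \frac{\partial y}{\partial z}$. So, if $\frac{\partial x}{\partial y}$ is already known then we just need to multiply that value by $\frac{\partial y}{\partial z}$ to get $\frac{\partial x}{\partial z}$.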
This can be utilised to ease the computational load of a calculation. In particular, calculating the derivatives one layer at a time is a good strategy.
Remark 3.2 - Usefulness of Auto-Differentiation Auto-Differentiation allows us to mathematically quantify the effect one variable has on another, which is good. However, the number of paths in a network grows exponentially with the number of nodes, thus this can be computationally hard. (Hierarchical Dependence can be used to mitigate this)
Remark 3.3 - Backpropagation Algorithm - Intuition The Backpropagation Algorithm combines reverse auto-differentiation with gradient descent. Reverse auto-differentiation is used to find the relationship between the cost function and each weight; and gradient descent to perform stepwise adjustments on weights.
The Backpropagation Algorithm seeks to compute the discrepancy between the network’s output and the target value; then propagate this discrepancy backwards through the network to determine the influence of each weight on this discrepancy, by considering the influence of each path.
Proposition 3.6 - Backpropagation Algorithm - Overall Strategy
i). Read the input & perform a forward pass through the network. (This will calculate all $s_j^l, f_j^l$.)
ii). Calculate the cost function between each final layer neuron and its target $J(f_j^*, f_j^N)$.
iii). Calculate the error derivatives $\delta_j^{N+1}$ of the cost function J wrt each final layer neuron $f_j^N$
$$\delta_j^{N+1} := \frac{\partial J}{\partial f_j^N}$$
iv). Compute the error derivative $\delta_j^N$ of the cost function wrt the signals of the last layer
$$\delta_j^N := \frac{\partial J}{\partial s_j^N} = g_j^{N\prime}(s_j^N) \cdot \delta_j^{N+1}$$
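Steps iii) and iv) can be sketched for one concrete choice of cost and activation. The squared-error cost $J = \frac{1}{2}(f_j^N - f_j^*)^2$ and the sigmoid activation are assumptions made for the example, as is the function name; other choices change the two derivative expressions but not the structure.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def output_layer_deltas(signals, targets):
    """Return (delta^{N+1}, delta^N) for sigmoid outputs and squared-error cost.

    delta^{N+1}_j = dJ/df_j = f_j - f*_j
    delta^N_j     = g'(s_j) * delta^{N+1}_j, with g'(s) = f (1 - f) for a sigmoid
    """
    outputs = [sigmoid(s) for s in signals]                   # forward pass
    delta_out = [f - t for f, t in zip(outputs, targets)]     # wrt neuron outputs
    delta_sig = [f * (1 - f) * d for f, d in zip(outputs, delta_out)]  # wrt signals
    return delta_out, delta_sig


delta_out, delta_sig = output_layer_deltas([0.0], [1.0])
```

For a signal of 0 and a target of 1 the output is 0.5, so the error derivative wrt the output is -0.5 and wrt the signal is 0.25 × (-0.5) = -0.125.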