Deep Learning (MU-Sem 8-COMP & Sem 7-ECS) (Recurrent Neural Networks)

* Recurrent neural networks, or RNNs, are a family of neural networks for processing sequential data. Much as a convolutional network is a neural network that is specialized for processing a grid of values X such as an image, a recurrent neural network is a neural network that is specialized for processing a sequence of values x^{(1)}, ..., x^{(\tau)}.
* Just as convolutional networks can readily scale to images with large width and height, and some convolutional networks can process images of variable size, recurrent networks can scale to much longer sequences than would be practical for networks without sequence-based specialization. Most recurrent networks can also process sequences of variable length.
* Suppose we trained a feedforward network that processes sentences of fixed length. A traditional fully connected feedforward network would have separate parameters for each input feature, so it would need to learn all of the rules of the language separately at each position in the sentence. By comparison, a recurrent neural network shares the same parameters across time steps.

5.1.1 Sequence Learning Problems

* In most of the networks covered so far (the Fully Connected Neural Network, i.e. FCNN, and the Convolutional Neural Network, i.e. CNN):
o the output at any time step is independent of the previous layer input/output;
o the input was always of a fixed length/size.
* In general, for fully connected and convolutional neural networks, the output at time step t is independent of any of the previous input/output(s).
* In "Sequence Learning Problems", these two properties of FCNNs and CNNs do not hold: the output at any time step depends on the previous input/output, and the length of the input is not fixed.
* Some examples of sequence learning are:

> Ex. 1: Part of speech tagging
* Given a sequence of words, the idea is to predict the part-of-speech tag for each word (whether that word is a pronoun, noun, verb, article, adjective, and so on).

Fig. 5.1.1 : Sequence learning - Part of speech tagging

(MU-New Syllabus w.e.f academic year 22-23)(M78-135) Tech-Neo Publications...A SACHIN SHAH Venture

* The output here depends not only on the current input but also on the previous input(s). For example, if the current input is "movie" and the previous input was "awesome", which is an adjective, the model would be more or less confident that the next word is going to be a noun. The confidence in predicting that "movie" is a noun would be higher if the previous word is an adjective. There is this dependency where the current output depends not only on the current input but on the previous input as well.

> Ex. 2: Sentiment Analysis
* It is not mandatory to produce an output at each step (time step/layer). For example, for sentiment analysis, the model looks at all the words in a sentence and predicts a single final output.
* This could be considered as a model in which an output is produced at every time step, but the model ignores those and reports only the final output (and this final output is dependent on all the previous inputs as well). This is also termed a Sequence Classification Problem.

> Ex. 3: Sequence learning problems using video and speech data
* Speech Recognition : Think of speech as a sequence of phonemes and give the speech signal as the input; the idea would be to map each signal to its respective phoneme in the language.
* Video Labeling : A video is a sequence of frames (there might be some processing on these frames). One task could be to label every frame in the video (say, which of the 12 steps of Surya namaskar a frame corresponds to).

5.1.2 Unfolding Computational Graphs

* A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss.
* The idea is to unfold a recursive or recurrent computation into a computational graph that has a repetitive structure, typically corresponding to a chain of events.
* Unfolding this graph results in the sharing of parameters across a deep network structure. For example, consider the classical form of a dynamical system:

    s^{(t)} = f(s^{(t-1)}; \theta)    (5.1.1)

  where s^{(t)} is called the state of the system. Equation (5.1.1) is recurrent because the definition of s at time t refers back to the same definition at time t - 1. For a finite number of time steps \tau, the graph can be unrolled or unfolded by applying the definition \tau - 1 times. For example, if we unfold Equation 5.1.1 for \tau = 3 time steps, we obtain

    s^{(3)} = f(s^{(2)}; \theta)    (5.1.2)
            = f(f(s^{(1)}; \theta); \theta)    (5.1.3)

...

* Equation 5.1.4 can be drawn in two different ways:
o One way to draw the RNN is with a diagram containing one node for every component that might exist in a physical implementation of the model, such as a biological neural network. In this view, the network defines a circuit that operates in real time, with physical parts whose current state can influence their future state, as in the left of Fig. 5.1.3.
o The other way to draw the RNN is as an unfolded computational graph, in which each component is represented by many different variables, with one variable per time step, representing the state of the component at that point in time. Each variable for each time step is drawn as a separate node of the computational graph, as in the right of Fig. 5.1.3.
* What we call unfolding is the operation that maps a circuit, as in the left side of Fig. 5.1.3, to a computational graph with repeated pieces, as in the right side. The unfolded graph now has a size that depends on the sequence length.
* We can represent the unfolded recurrence after t steps with a function g^{(t)}:

    h^{(t)} = g^{(t)}(x^{(t)}, x^{(t-1)}, x^{(t-2)}, ..., x^{(2)}, x^{(1)})    (5.1.5)
            = f(h^{(t-1)}, x^{(t)}; \theta)    (5.1.6)

* The function g^{(t)} takes the whole past sequence (x^{(t)}, x^{(t-1)}, x^{(t-2)}, ..., x^{(2)}, x^{(1)}) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g^{(t)} into repeated application of a function f.
* The unfolding process thus introduces two major advantages:
1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of a transition from one state to another state, rather than in terms of a variable-length history of states.
2. It is possible to use the same transition function f with the same parameters at every time step.
* These two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths, rather than needing to learn a separate model g^{(t)} for all possible time steps.
* Learning a single, shared model allows generalization to sequence lengths that did not appear in the training set, and allows the model to be estimated with far fewer training examples than would be required without parameter sharing.
* Both the recurrent graph and the unfolded graph have their uses. The recurrent graph is succinct. The unfolded graph provides an explicit description of which computations to perform. The unfolded graph also helps to illustrate the idea of information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.
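The two advantages of unfolding can be illustrated with a minimal NumPy sketch. It assumes a tanh transition of the form f(h, x; \theta) = tanh(W h + U x + b); that functional form, the helper names `transition` and `unfold`, and the layer sizes are illustrative assumptions, not something the text specifies. The point is only that one set of parameters is reused at every step, so the same model runs on sequences of any length.

```python
import numpy as np

def transition(h, x, theta):
    """One application of f: (previous state, current input) -> next state.
    The tanh form is an assumed example of f, not the only possible choice."""
    W, U, b = theta
    return np.tanh(W @ h + U @ x + b)

def unfold(xs, h0, theta):
    """Unfold h(t) = f(h(t-1), x(t); theta) over a whole sequence.
    The same theta is reused at every time step (parameter sharing)."""
    h = h0
    states = []
    for x in xs:                      # one iteration per node of the unfolded graph
        h = transition(h, x, theta)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
hidden, inputs = 4, 3                 # hypothetical sizes
theta = (rng.normal(scale=0.5, size=(hidden, hidden)),   # W: state-to-state
         rng.normal(scale=0.5, size=(hidden, inputs)),   # U: input-to-state
         np.zeros(hidden))                                # b: bias
h0 = np.zeros(hidden)

short_seq = rng.normal(size=(5, inputs))
long_seq = rng.normal(size=(12, inputs))
print(unfold(short_seq, h0, theta).shape)   # one state per step of the short sequence
print(unfold(long_seq, h0, theta).shape)    # same parameters, longer sequence
```

Unfolding for three steps gives exactly the nested form f(f(f(h0, x1), x2), x3), which mirrors the way Equation (5.1.3) expands the recurrence.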
5.1.3 Recurrent Neural Networks

* RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. Another way to think about RNNs is that they have a "memory" which captures information about what has been calculated so far.
* Armed with the graph unfolding and parameter sharing ideas of Section 5.1.2, we can design a wide variety of recurrent neural networks.

Fig. 5.1.4 : The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values. A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities. The loss L internally computes \hat{y} = softmax(o) and compares this to the target y. The RNN has input-to-hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V. Equation 5.7 defines forward propagation in this model. (Left) The RNN and its loss drawn with recurrent connections. (Right) The same seen as a time-unfolded computational graph, where each node is now associated with one particular time instance.

* Some examples of important design patterns for recurrent neural networks include the following:
o Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in Fig. 5.1.4.
o Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in Fig. 5.1.4.
o Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in Fig. 5.1.5.

...

* ... where the parameters are the bias vectors b and c along with the weight matrices U, V and W, respectively for input-to-hidden, hidden-to-output and hidden-to-hidden connections. This is an example of a recurrent network that maps an input sequence to an output sequence of the same length. The total loss for a given sequence of x values paired with a sequence of y values would then be just the sum of the losses over all the time steps. For example, if L^{(t)} is the negative log-likelihood of y^{(t)} given x^{(1)}, ..., x^{(t)}, then

    L(\{x^{(1)}, ..., x^{(\tau)}\}, \{y^{(1)}, ..., y^{(\tau)}\})
        = \sum_t L^{(t)}    (5.1.12)
        = - \sum_t \log p_{model}(y^{(t)} | \{x^{(1)}, ..., x^{(t)}\})    (5.1.13)

  where p_{model}(y^{(t)} | \{x^{(1)}, ..., x^{(t)}\}) is given by reading the entry for y^{(t)} from the model's output vector \hat{y}^{(t)}.
* Computing the gradient of this loss function with respect to the parameters is an expensive operation. The gradient computation involves performing a forward propagation pass moving left to right through our illustration of the unrolled graph in Fig. 5.1.5, followed by a backward propagation pass moving right to left through the graph. The runtime is O(\tau) and cannot be reduced by parallelization because the forward propagation graph is inherently sequential; each time step may only be computed after the previous one. States computed in the forward pass must be stored until they are reused during the backward pass, so the memory cost is also O(\tau).
* The back-propagation algorithm applied to the unrolled graph with O(\tau) cost is called back-propagation through time, or BPTT, and is discussed in a further section. The network with recurrence between hidden units is thus very powerful but also expensive to train.

5.1.4 Bidirectional RNNs

* The recurrent networks considered up to now have a "causal" structure, meaning that the state at time t only captures information from the past, x^{(1)}, ..., x^{(t-1)}, and the present input x^{(t)}.
* Some of the models discussed also allow information from past y values to affect the current state when the y values are available. However, in many applications we want to output a prediction of y^{(t)} which may depend on the whole input sequence.
* For example, in speech recognition, when there are interpretations of the current sound that are both acoustically plausible, we may have to look far into the future (and the past) to disambiguate them. This is also true of handwriting recognition and many other sequence-to-sequence learning tasks.
* Bidirectional recurrent neural networks (or bidirectional RNNs) were invented to address that need. They have been extremely successful in applications where that need arises, such as handwriting recognition, speech recognition and bioinformatics. As the name suggests, bidirectional RNNs combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence.

...

    \nabla_W L = \sum_t \sum_i (\partial L / \partial h_i^{(t)}) \nabla_{W^{(t)}} h_i^{(t)}
               = \sum_t diag(1 - (h^{(t)})^2) (\nabla_{h^{(t)}} L) h^{(t-1)\top}    (5.1.23)
    \nabla_U L = \sum_t \sum_i (\partial L / \partial h_i^{(t)}) \nabla_{U^{(t)}} h_i^{(t)}    (5.1.24)
               = \sum_t diag(1 - (h^{(t)})^2) (\nabla_{h^{(t)}} L) x^{(t)\top}    (5.1.25)

* We do not need to compute the gradient with respect to x^{(t)} for training because it does not have any parameters as ancestors in the computational graph defining the loss.
* The disadvantage of BPTT is that when the number of time steps increases, the computation also increases. This will make the overall model noisy. The high cost of a single parameter update makes BPTT impossible to use for a large number of iterations.
* This is where Truncated Backpropagation comes in. Truncated Backpropagation Through Time (TBPTT) is a slightly modified version of the BPTT algorithm for recurrent neural networks. In this, the sequence is processed one time step at a time and periodically the BPTT update is performed for a fixed number of time steps.
* The goal of BPTT is to compute the partial derivatives of the error with respect to the synaptic weights, known as the "gradients". The network improves its performance by learning through "gradient descent"; nudging the synaptic weights in the negative direction of the gradient reduces the network's error. When using the chain rule to compute the gradients of the recurrent parameters for a standard RNN (e.g. h_t = \sigma(W_r h_{t-1} + W_i x_t)), we have an intermediate term \partial\varepsilon / \partial h_t that computes the gradient of the error \varepsilon with respect to the activity states:

    \partial\varepsilon / \partial h_t
        = (\partial\varepsilon / \partial h_T) (\partial h_T / \partial h_t)
        = (\partial\varepsilon / \partial h_T) \prod_{k=t}^{T-1} (\partial h_{k+1} / \partial h_k)
        = (\partial\varepsilon / \partial h_T) \prod_{k=t}^{T-1} diag(\sigma'(a_{k+1})) W_r    (5.1.26)

  where a_{k+1} = W_r h_k + W_i x_{k+1} is the pre-activation at step k + 1.
* In Eqn. (5.1.26) we have an iterated product of matrices, including the matrix denoting the network's recurrent synaptic weights W_r. Iterated products of real numbers can explode to infinity or vanish to zero; iterated products of matrices can explode or vanish along some vector direction (in particular, the directions corresponding to the eigenvectors with the leading eigenvalues of the recurrent weight matrix).

...

* This change in weight is added to the old set of weights at every training iteration. The issue is that when the change in weight is the result of many such multiplications, its value becomes very small.
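The behaviour of the iterated product in Eqn. (5.1.26) can be checked numerically. The sketch below is a simplified illustration under stated assumptions: it drops the diag(\sigma') factor (equivalent to assuming a linear activation), uses a symmetric recurrent matrix rescaled to a chosen spectral radius, and the helper names `scaled_recurrent_matrix` and `backprop_norms` are hypothetical. With a sigmoid or tanh activation the derivative factor is at most 1, so it can only shrink the product further.

```python
import numpy as np

def scaled_recurrent_matrix(size, spectral_radius, seed=0):
    """Random symmetric matrix rescaled so its largest |eigenvalue| equals spectral_radius."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(size, size))
    W = (A + A.T) / 2                               # symmetric, so eigenvalues are real
    radius = np.max(np.abs(np.linalg.eigvalsh(W)))
    return W * (spectral_radius / radius)

def backprop_norms(W, steps, seed=1):
    """Norm of a gradient vector after each of `steps` multiplications by W^T,
    i.e. after being propagated back through that many linear time steps."""
    rng = np.random.default_rng(seed)
    g = rng.normal(size=W.shape[0])
    g /= np.linalg.norm(g)                          # start from a unit-norm gradient
    norms = []
    for _ in range(steps):
        g = W.T @ g
        norms.append(np.linalg.norm(g))
    return norms

small = backprop_norms(scaled_recurrent_matrix(8, 0.5), steps=50)
large = backprop_norms(scaled_recurrent_matrix(8, 1.5), steps=50)
print(f"spectral radius 0.5: gradient norm after 50 steps = {small[-1]:.3e}")  # vanishes
print(f"spectral radius 1.5: gradient norm after 50 steps = {large[-1]:.3e}")  # explodes
```

A spectral radius below 1 drives the gradient norm towards zero, while a radius above 1 makes it grow along the leading eigenvector direction, which is the matrix analogue of the scalar case described in the text.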
* Suppose you are predicting a sentence, say "I am going to France", and you want to predict the last word of "I am going to France, the language spoken there is ___".
* Over many iterations the weight changes become negligibly small, and this leads to the weights effectively not being updated.
* Methods that are proposed to overcome the vanishing gradient problem are:
o Residual neural networks (ResNets)
o Multi-level hierarchy
o Long short term memory (LSTM)
o Faster hardware
o ReLU
o Batch normalization

Exploding Gradient

* Exploding gradients are a problem when large error gradients accumulate and result in very large updates to neural network model weights during training. The gradient gives the direction and magnitude of the update during training, and it is used to adjust the network weights in the right direction by the right amount.
* When the error gradient is propagated back, its components may grow exponentially.
* When large error gradients accumulate, the model may become unstable and unable to learn from training data. At an extreme, the values of weights can become so large as to overflow and result in NaN values.
* Some indications that a model is suffering from exploding gradients during training:
o The model does not learn much on the training data, resulting in poor loss.
o The model is unstable, resulting in large changes in loss for each update.
o After some time, the model loss goes to NaN during training.
* There are many approaches to fix exploding gradients; some of the best are:
o Use an LSTM network
o Use gradient clipping
o Use regularization
o Redesign the neural network

5.1.7 Truncated Backpropagation Through Time (Truncated BPTT)

* The problem with BPTT is that the update in weights requires going forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient. A slight variant of this algorithm is called Truncated Backpropagation Through Time (TBPTT), where forward and backward passes are run through chunks of the sequence instead of the whole sequence.
* It is similar to using mini-batches in gradient descent, i.e. the gradients are calculated for each step, but the weights get updated in batches periodically (as opposed to once every input sample). This helps reduce the complexity of the training process and helps remove noise from the weight updates.
* In backpropagation training, there is a forward pass and a backward pass through the entire sequence to compute the loss and the gradient. By taking a window, we also shorten the training duration.

Fig. 5.1.7 : Truncated backpropagation through time

* Truncated BPTT is much faster than simple BPTT, and also less complex, because we do not take into account the contribution of the gradients from faraway steps. The downside of this approach is that dependencies longer than the chunk length are not learned during the training process. Another disadvantage is that vanishing gradients are hard to detect: from looking at the learning curve one can assume that the gradient vanishes, but maybe the task itself is difficult.

5.2 LONG SHORT-TERM MEMORY NETWORKS

* Consequently, the output is dependent on the previous predictions, which are already known.
However, RNNs have limited capacity to bridge more than a certain number of steps. Mainly this is due to the vanishing gradient problem: as more layers with certain activation functions are added, the gradient of the loss function approaches zero. The LSTM neural ...

...

* The state of information depends on these interactions. If there are no interactions, the information will run along without changes.
* The LSTM block removes or adds information to the cell state through the gates, which allow optional information to cross.
* Forget gate : The forget gate decides which information should be thrown away or kept from the cell state, so that information that is no longer useful is removed. Two inputs, x^t (input at the particular time) and h^{t-1} (previous cell output), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function, which outputs values between 0 and 1. Where the output is close to 0, the corresponding piece of information in the cell state is forgotten; where it is close to 1, the information is retained for future use.
* Input gate : It controls what new information will be added to the cell state from the current input. This gate also protects the memory contents from perturbation by irrelevant input. A sigmoid activation function is used to generate the gate values, which lie between 0 and 1.
* Output gate : It controls which information to reveal from the updated cell state to the output in a single time step.
In other words, the output gate determines what the value of the next hidden state should be in each time step.

Gated RNN

* One way to deal with long-term dependencies is to design a model that operates at multiple time scales, so that some parts of the model operate at fine-grained time scales and can handle small details, while other parts operate at coarse time scales and transfer information from the distant past to the present more efficiently. Various strategies for building both fine and coarse time scales are possible. These include the addition of skip connections across time, "leaky units" that integrate signals with different time constants, and the removal of some of the connections used to model fine-grained time scales.
* Like leaky units, gated RNNs are based on the idea of creating paths through time that have derivatives that neither vanish nor explode. Leaky units did this with connection weights that were either manually chosen constants or were parameters. Gated RNNs generalize this to connection weights that may change at each time step.
* Leaky units allow the network to accumulate information (such as evidence for a particular feature or category) over a long duration. However, once that information has been used, it might be useful for the neural network to forget the old state. For example, if a sequence is made of sub-sequences and we want a leaky unit to accumulate evidence inside each sub-sequence, we need a mechanism to forget the old state by setting it to zero. Instead of manually deciding when to clear the state, we want the neural network to learn to decide when to do it. This is what gated RNNs do.
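The forget, input and output gates described for the LSTM, and the idea of a network that learns when to clear its own state, can be collected into a single time step. This is a minimal NumPy sketch of the commonly used LSTM formulation, which the text does not write out; the parameter layout (one weight matrix per gate acting on the concatenation of h^{t-1} and x^t), the dictionary keys and the function names are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step. Returns the new hidden state, new cell state and gate values."""
    z = np.concatenate([h_prev, x])                  # gates see previous output and current input
    f = sigmoid(params["Wf"] @ z + params["bf"])     # forget gate: what to keep of the old cell state
    i = sigmoid(params["Wi"] @ z + params["bi"])     # input gate: how much new information to admit
    g = np.tanh(params["Wg"] @ z + params["bg"])     # candidate values to add to the cell state
    o = sigmoid(params["Wo"] @ z + params["bo"])     # output gate: what to reveal as hidden state
    c = f * c_prev + i * g                           # remove old information, then add new
    h = o * np.tanh(c)                               # next hidden state
    return h, c, (f, i, o)

def init_params(hidden, inputs, seed=0):
    """Small random weights and zero biases for the four blocks (hypothetical initialisation)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "figo":
        params["W" + gate] = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
        params["b" + gate] = np.zeros(hidden)
    return params

hidden, inputs = 4, 3                                # hypothetical sizes
params = init_params(hidden, inputs)
h, c = np.zeros(hidden), np.zeros(hidden)
rng = np.random.default_rng(1)
for x in rng.normal(size=(6, inputs)):               # run over a 6-step input sequence
    h, c, gates = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```

Because the gate values are computed from the current input and previous output, the weight on the old cell state can change at every time step. A forget gate near 1 with an input gate near 0 carries the cell state forward unchanged, while a forget gate near 0 clears it, which is the learned "forgetting" that fixed leaky units cannot provide.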