































(a) Department of Psychology, Department of Neuroscience, Department of Electrical Engineering, Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY
(b) Center for Cognitive Neuroscience, University of California, Los Angeles, CA; Modeling, Simulation, Computer Science, UCF, USA

Abstract

Encoding and decoding models are widely used in systems, cognitive, and computational neuroscience to make sense of brain-activity data. However, the interpretation of their results requires care. Decoding models can help reveal whether particular information is present in a brain region in a format the decoder can exploit. Encoding models make comprehensive predictions about representational spaces. In the context of sensory experiments, where stimuli are experimentally controlled, encoding models enable us to test and compare brain-computational theories. Encoding and decoding models typically include fitted linear-model components. Sometimes the weights of the fitted linear combinations are interpreted as reflecting, in an encoding model, the contribution of different sensory features to the representation or, in a decoding model, the contribution of different measured brain responses to a decoded feature. Such interpretations can be problematic when the predictor variables or their noise components are correlated and when priors (or penalties) are used to regularize the fit. Encoding and decoding models are evaluated in terms of their generalization performance. The correct interpretation depends on the level of generalization a model achieves (e.g. to new response measurements for the same stimuli, to new stimuli from the same population, or to stimuli from a different population). Significant decoding or encoding performance of a single model (at whatever level of generality) does not provide strong constraints for theory. Many models must be tested and inferentially compared for analyses to drive theoretical progress.

Highlights
stimuli (e.g. multivariate analysis of variance), instead of a decoding model. However, a univariate encoding model would in general have less sensitivity, because it does not account for the noise correlations between different response channels (Averbeck et al. 2006). A multivariate analysis of variance would account for the noise correlations, but might have less specificity. In fact, it might fail to control false-positives at the nominal level if its assumptions of multivariate normality did not hold (as is often the case), making it invalid as a statistical test (Kriegeskorte 2011). Decoding provides a natural approach to modelling the noise correlations (e.g. using a multivariate normal model as in the Fisher linear discriminant), without relying on the model assumptions for the validity of the test: Violations of the decoding model assumptions will hurt decoding performance. We, thus, err on the safe side of concluding that there is no information about the stimuli in the responses. In sum, it is not the direction of the decoding model (“brain reading”) that makes it a compelling test for information, but the statistical nature of the problem (noise correlations) and the fact that decoders are tested on independent data.

Figure 1 | Linear encoding and decoding models. (A) Encoding and decoding model the relationship between stimuli and responses in opposite directions. An encoding model (top) predicts brain responses as a linear combination of stimulus properties (black circles). A decoding model (bottom) predicts stimulus properties as a linear combination of brain responses. (B) Example of a linear decoding model using a 2-dimensional feature space consisting of two voxels. Voxel 1 contains relevant information about which of two classes (green, blue) the stimulus belongs to. Voxel 2 contains no information about the stimulus class. The two dimensions jointly define the linear discriminatory boundary.
Note that the weights assigned to each voxel are defined by the vector w, which is orthogonal to the decision boundary. Because the noise is correlated between the voxels, a linear decoder will assign significant negative weight to voxel 2, using this voxel (which contains only noise) to cancel the noise in the voxel which contains signal. As a result, interpreting the absolute weights of linear decoders requires care and additional analyses.

Linear decodability indicates “explicit” information

For decoding to succeed, the information must be present in the brain region in a format that the decoder can exploit. Linear decoders, the most widely used class, require that the distributions of patterns be linearly separable to some extent. This is a weakness in that we might fail to detect information encoded in a more complex format. However, it is a strength in that it provides clues to the format of the information we do detect. The simpler the decoder, the more focused its sensitivity will be. From the perspective of understanding the brain computations, it
is attractive to use decoding operations that single neurons can implement. These include linear readout, but also simple nonlinear forms of readout such as radial basis function decoding (Poggio & Girosi 1990). Linear decodability indicates that a downstream neuron receiving input from a sufficient portion of the pattern could read out the information in question (DiCarlo & Cox 2007). Information amenable to direct readout by downstream neurons is sometimes referred to as “explicit” in the code (Kriegeskorte 2011, DiCarlo et al. 2012, Hong et al. 2016).

The level of generalization beyond the training set must be considered when interpreting a decoding result

Fitting a model always poses the risk of overfitting, i.e. of optimizing the fit to the training data at the expense of predictive performance on independent data. Overfitting can lead to high decoding accuracy on the training set, even if the response patterns contain no information about the stimulus. Decoders therefore need to be tested for generalization to independent data (Hastie et al. 2009, Kriegeskorte 2015). In our example, we might test the decoder on an independent set of response measurements for the same two particular images of a cat and a dog. If decoding accuracy on this independent test set is significant, we can reject the null hypothesis that the response patterns contain no information about the stimulus (Mur et al. 2009, Pereira et al. 2009).

However, detecting information about which of two images has been presented tells us almost nothing about the nature of the code. The two images must have distinct response patterns in the retina and V1 (low-level representations) as well as in the visuo-semantic regions of the ventral stream (high-level representations). We would therefore expect a linear decoder to work on new measurements for the same images in any of these regions. This reflects the fact that all the regions contain image information.
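The overfitting risk described above can be illustrated with a small simulation. The following is a minimal sketch, not taken from the original analyses: the sample sizes, the number of response channels, and the choice of a least-squares linear decoder are all assumptions made for illustration. The simulated response patterns are pure noise, so they contain no stimulus information by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: few trials, many response channels.
n_train, n_test, n_channels = 40, 2000, 200

# Pure-noise "response patterns": the labels are independent of the responses.
X_train = rng.standard_normal((n_train, n_channels))
y_train = np.repeat([-1.0, 1.0], n_train // 2)
X_test = rng.standard_normal((n_test, n_channels))
y_test = np.repeat([-1.0, 1.0], n_test // 2)

# Least-squares linear decoder. With more channels than trials, lstsq
# returns the minimum-norm weights that fit the training labels exactly.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def accuracy(X, y, weights):
    """Proportion of trials whose predicted sign matches the label."""
    return float(np.mean(np.sign(X @ weights) == y))


train_acc = accuracy(X_train, y_train, w)
test_acc = accuracy(X_test, y_test, w)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

Because the decoder has more parameters than training trials, it separates the training set perfectly, whereas accuracy on the independent test set stays near chance. Only the latter supports an inference about information in the responses.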
In the retina, for example, we expect the two images to elicit distinct response patterns, while the manifolds of response patterns corresponding to the two categories are hopelessly entangled (DiCarlo & Cox 2007, Chung et al. 2016, Chung et al. 2018). Given responses to just two images, we can demonstrate the presence of information, but have no empirical basis for characterizing the nature of the code (see Kriegeskorte et al. 2007 for a study limited by this drawback).

In order to learn whether the region we are decoding from contains a low-level image encoding or a high-level categorical encoding, we can train the decoder on one set of cat and dog images and test it on another set of images of different cats and dogs (Anzellotti et al. 2014, Freedman et al. 2002, Kriegeskorte 2011). To support the interpretation that “cats” and “dogs” are linearly separable in the representation (rather than the weaker claim that there is image information), it is not sufficient to increase the number of particular images of cats and dogs, while training and testing on the same images. The linear decoder has many parameters (one for each response channel) and is expected to overfit even to a larger set of particular images. Even for the retinal representation, we therefore expect a cat/dog decoder to generalize to new measurements performed on the same images. We must test the linear decoder for generalization to different cats and dogs. Note, however, that interpreting linear decodability as linear separability of the two classes in the neuronal representational space would further require the decoding accuracy to be so high that errors can be attributed to the measurement noise rather than the neural representation. In practice, we typically face ambiguity. For example, decoding accuracy may be significantly
this has never been shown for V1 and would be puzzling, because we expect visual representations to be specialized for natural stimuli.
Decoding provides an intuitive and compelling demonstration of the presence of information in a brain region. Decoding models bring several benefits:
interpretation of the reconstruction results: good-looking reconstructions no longer indicate that all the detail they provide is encoded in the brain region. The reconstruction has to be compared to the presented stimulus, and the complexity of the output space (which is reduced by the prior over the outputs) needs to be considered in the interpretation. An important question is what we can learn from stimulus reconstructions. The goal of learning about the content and format of the code may not be ideally served by striving for the most natural-looking reconstruction.

Decoding models predicting behavioral responses from brain responses can be interpreted as brain-computational models

Decoding is usually used as a tool of analysis that reveals aspects of the content and format of the information encoded in a brain region. The decoding model, thus, is not interpreted as a model of brain computation. In the context of sensory systems, a decoder maps from brain responses to stimuli. Since stimulus processing by the brain operates in the opposite direction, it is difficult to interpret a decoder as a model of brain information processing. However, if a decoder is used to predict behavioral responses, e.g. judgments of categorical or continuous stimulus variables (possibly including errors and reaction times on individual trials), then the decoder can be interpreted as a model (at a high level of abstraction) of the brain computations generating the behavioral responses from the encoding of the stimuli in the decoded brain region (Shadlen et al. 1996, Williams et al. 2007, Walther et al. 2012).

Encoding models: testing comprehensive representational predictions

Encoding models attempt to predict brain response patterns from descriptions of the experimental conditions (Figure 1A; Paninski et al. 2007, Kay et al. 2008, Dumoulin & Wandell 2008, Mitchell et al. 2008, Naselaris et al. 2011, Khaligh-Razavi & Kriegeskorte 2014, Yamins et al.
2014, Naselaris & Kay 2015, van Gerven 2017). Encoding models, thus, operate in the opposite direction to decoding models. If our goal is merely to demonstrate that a brain region contains information about the experimental conditions, then the direction the model should operate in is a technical issue: One direction may be more convenient for capturing the relevant statistical dependencies (e.g. noise correlations among responses), but a model operating in either direction could support the inference that particular information is present in the code. If our goal is to test computational theories, however, then the direction that the model operates in matters, because it determines whether the model can be interpreted as a brain-computational model.

Encoding models predicting brain responses from sensory stimuli can serve as brain-computational models

Whereas a decoding model typically serves to test for the presence of particular information in a brain region, an encoding model can provide a process model, at some level of abstraction, of the brain computations that produce the neuronal code. An encoding model makes
measuring distances between noisy data points, we can use the crossnobis (crossvalidated Mahalanobis) estimator (Nili et al. 2014, Walther et al. 2016). The resulting crossnobis RDM provides a full characterization of the linearly decodable information in the representational space (Diedrichsen & Kriegeskorte 2017). Comparing representations in models and brains at the level of RDMs obviates the need for fitting a linear model to predict each measured response (thus reducing the need for training data) and enables the analysis to naturally handle noise correlations between responses (which are typically ignored when encoding models separately predict each of the measured response channels).

A third approach to the evaluation of encoding models is to predict the distribution of activity profiles. In pattern component modelling (PCM, Diedrichsen et al. 2011), this distribution is characterized by the second moment of the activity profiles. Like the RDM, this is a stimulus-by-stimulus summary statistic of the stimulus-response matrix. Each entry of the second-moment matrix corresponds to the inner product between two response patterns.

All three approaches can be construed as testing hypotheses about the representational space induced by the activity profiles (Diedrichsen & Kriegeskorte 2017). Consider a linear encoding model using a Gaussian prior on the weights. Such a model predicts a Gaussian distribution of activity profiles. The predicted distribution of activity profiles is captured by its second moment. For representational similarity analysis, similarly, the RDM is a function of the second moment of the activity profiles. This core mathematical commonality between the methods notwithstanding, each is best suited for a particular set of questions. Linear models predicting raw measurements lend themselves to univariate brain mapping, revealing which voxels or neurons are accounted for by a particular nonlinear encoding model.
RSA lends itself to characterizing the geometry of the representational space, naturally handles noise correlations among responses, and reduces the need for training data. PCM can have greater sensitivity for adjudicating among models than the other two methods, at the expense of relying on stronger assumptions. The three methods are best viewed as part of a single toolbox of representational model analyses, whose elements can be combined as needed to address particular questions.

The level of generalization beyond the training set must be considered when interpreting an encoding result

Encoders, like decoders, are tested by evaluating how well they predict independent data, whether the predicted quantities are the raw brain-activity measurements, the representational dissimilarity matrix, or the second-moment matrix of the activity profiles. For encoders, as for decoders, the interpretation depends on both the prediction accuracy and the level of generalization beyond the training set that the model achieves.

Encoding models typically require the fitting of parameters, so overfitting needs to be accounted for in any inferential procedure. In the simplest type of univariate linear encoding model, we can rely on Gaussian assumptions and perform inference without a separate test set (e.g. Friston et al. 1994). However, more interesting models require independent test sets, for example when parameters are fitted using priors over the weights and when the model is a brain-computational model to be tested for generalization to new conditions.
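The statement that the RDM is a function of the second moment of the activity profiles can be checked numerically. The sketch below is illustrative only: it uses a random, hypothetical stimulus-by-channel response matrix `U`, squared Euclidean distance as the dissimilarity measure, and a normalization by the number of channels, all of which are assumptions rather than choices prescribed by the text. It does not implement crossvalidation or noise whitening, so it is not a crossnobis estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical response matrix: rows are stimuli, columns are response channels.
n_stimuli, n_channels = 5, 30
U = rng.standard_normal((n_stimuli, n_channels))

# Second-moment matrix: each entry is the inner product between two
# response patterns (normalized by the number of channels by assumption).
G = U @ U.T / n_channels

# RDM computed directly from the patterns (squared Euclidean distances).
diff = U[:, None, :] - U[None, :, :]
rdm_direct = np.sum(diff ** 2, axis=-1) / n_channels

# The same RDM computed from the second moment alone:
# d_ij = G_ii + G_jj - 2 * G_ij
g_diag = np.diag(G)
rdm_from_G = g_diag[:, None] + g_diag[None, :] - 2.0 * G

print(np.allclose(rdm_direct, rdm_from_G))
```

The two matrices agree, which is the sense in which a model that predicts the second-moment matrix also predicts the (squared Euclidean) RDM.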
A key consideration is how much flexibility to allow in fitting each model representation to a brain representation. One extreme is to allow no flexibility and assume that the model representation precisely predicts the geometry of the representational space (Kriegeskorte et al. 2008). This case is most naturally handled by RSA and PCM, but could also be implemented with linear encoders by using a prior that prevents any distortion of the representational geometry. The other extreme is to allow arbitrary linear remixing of the units of the nonlinear encoding model. This case is most naturally handled with linear encoding models, but can also be implemented with PCM and RSA (Diedrichsen et al. 2011, Khaligh-Razavi & Kriegeskorte 2014, Khaligh-Razavi et al. 2016). In practice, some compromise is desirable, which we can think of as a prior on the mapping from the brain-computational model to the measured brain responses. We might use a 0-mean Gaussian prior on the weights (e.g. Kay et al. 2008). Alternatively, we can limit flexibility more aggressively, by allowing each unit (or each feature map or layer) a single weight (not a separate weight for each response). Such weighted representational models (e.g. Khaligh-Razavi et al. 2014) are naturally implemented with RSA and PCM. Each brain-computational model in this case predicts a superset of the features spanning the brain representational space (disallowing linear mixtures), but does not predict the prevalence of each of the features in the neuronal population. The lowest level of generalization beyond the training set is generalization to new measurements for the same experimental conditions. This is sufficient, if the experimental conditions exhaustively cover the domain we would like to draw inferences about (consider the case of the representation of the five fingers in motor cortex: Diedrichsen et al. 2012). 
However, in a domain such as sensory systems, the goal is typically to evaluate to what extent a brain-computational model can predict brain representations of arbitrary stimuli. This requires a higher level of generalization beyond the training set. A vision model, for example, might be trained with responses to one sample of natural images and tested for generalization to responses to an independent (and nonoverlapping) sample from the same distribution of natural images. Because the set of all natural images is so rich, this is a challenging generalization task (as illustrated by the difficulty of computer vision). An even more stringent test of the assumptions implicit in a model is to train the model on a sample from one population of images and test it on a sample from a different population of images (e.g. Eickenberg et al. 2017). The prediction accuracy can be assessed in terms of whether it is significantly above chance level, whether it significantly differs from that of competing encoding models, and how close it comes to the noise ceiling (the highest achievable accuracy given the noise in the data, Nili et al. 2014).

We can generalize claims about an encoding model to the extent that its predictions generalize. If we want to conclude that the model can predict responses for the stimuli presented, we need not test the model with different stimuli (only with different response measurements elicited by the same stimuli). If we want to conclude that the model can predict responses to arbitrary natural stimuli, we need to test it with new arbitrary natural stimuli. The population of conditions from which the test set is sampled defines the scope of the claims we can make (Hastie et al. 2009, Kriegeskorte 2015).

We focus here on encoding and decoding models that are fitted to individual subjects’ brains, so as to be able to exploit fine-grained idiosyncratic patterns of activity that are unique to each subject.
Within-subject prediction accuracy may support generalization to a population of stimuli, but it doesn’t support generalization to the population of subjects. In some fields, such as low-
Figure 2 | The feature fallacy. Different linear encoding models spanning the same space of activity profiles may not be distinguishable. There are many alternative sets of feature vectors {f1, f2, …} that span the same space of activity profiles. In the absence of a prior on the weights of the linear model, all these sets can equally explain a given set of brain responses. The ambiguity is reduced, but not resolved when a prior on the weights is assumed (Diedrichsen et al. 2018). If we define a prior on the weights, then each model predicts a probability density over the space of activity profiles. This probabilistic prediction may be distinct for two sets of basis features, even if they span the same space. For example, if the weight prior is a 0-mean isotropic Gaussian, then each model assigns probabilities to different activity profiles according to a Gaussian distribution over the space of profiles. Two linear models may span the same space, but predict distinct distributions of activity profiles. However, even with a Gaussian weight prior, there are still (infinitely) many equivalent models that make identical probabilistic predictions. We illustrate this by example. (a) Three models (A, B, C) each contain two feature vectors as predictors (A: {fA1, fA2}, B: {fB1, fB2}, C: {fC1, fC2}). The three models all span the same 2-dimensional space of activity profiles. For each model, we assume a 0-mean isotropic Gaussian weight prior. (b) All three models predict the same nonisotropic Gaussian probability density over the space of activity profiles (indicated by a single iso-probability-density contour: the ellipse). Model A (gray) predicts the density by modeling it with two orthogonal features that capture the principal-component axes, with features having different norms to capture the anisotropy. Model B predicts the same density by modeling with two correlated features of similar norm.
Model C falls somewhere in between, combining feature correlation and different feature norms to capture the same Gaussian density over the activity profiles. Note that there are many other models that span the same space, but will not induce the same probability density over activity profiles when complemented by a 0-mean isotropic Gaussian weight prior. A given linear encoding model’s success at predicting brain responses provides evidence for the induced distribution of activity profiles, but not for the particular features chosen to express that distribution.
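The equivalence described in this caption can be reproduced in a few lines. The sketch below rests on several assumptions that are not in the original: the simulated data, the use of ridge regression as the counterpart of a 0-mean isotropic Gaussian weight prior, the penalty value `lam`, and the matrices `R` (a rotation) and `M` (an invertible, non-orthogonal remixing) are all hypothetical choices for illustration. Rotating the features leaves the second moment F Fᵀ unchanged, whereas the non-orthogonal remixing spans the same space but changes it.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, n_features = 50, 20, 2

# Hypothetical feature set A and simulated responses of one channel.
F_train = rng.standard_normal((n_train, n_features))
F_test = rng.standard_normal((n_test, n_features))
y_train = F_train @ np.array([1.0, -0.5]) + rng.standard_normal(n_train)

# Rotation: same space, same second moment F F^T.
theta = np.pi / 5
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
# Invertible but non-orthogonal remixing: same space, different second moment.
M = np.array([[3.0, 0.0],
              [1.0, 0.5]])


def ridge_predict(F_tr, y, F_te, lam):
    """Ridge fit (isotropic penalty lam) on training data, prediction on test data."""
    k = F_tr.shape[1]
    w = np.linalg.solve(F_tr.T @ F_tr + lam * np.eye(k), F_tr.T @ y)
    return F_te @ w


lam = 10.0
pred_original = ridge_predict(F_train, y_train, F_test, lam)
pred_rotated = ridge_predict(F_train @ R, y_train, F_test @ R, lam)
pred_remixed = ridge_predict(F_train @ M, y_train, F_test @ M, lam)

# Without a penalty (no prior), even the remixed features predict identically.
ols_original = ridge_predict(F_train, y_train, F_test, 0.0)
ols_remixed = ridge_predict(F_train @ M, y_train, F_test @ M, 0.0)

print("rotated == original (ridge):", np.allclose(pred_original, pred_rotated))
print("remixed == original (ridge):", np.allclose(pred_original, pred_remixed))
print("remixed == original (no penalty):", np.allclose(ols_original, ols_remixed))
```

Under these assumptions, the rotated feature set is indistinguishable from the original by prediction accuracy, with or without the prior, which is why predictive success cannot single out one particular set of features.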
An encoding model predicts the response of each measurement channel (e.g. a neuron or fMRI voxel) on the basis of properties of the experimental condition (e.g. sensory stimulus). Continuous brain maps can be obtained by fitting such a model to each response in turn. Responses are typically predicted as linear combinations of the features, rendering this approach closely related to classical univariate brain mapping. However, encoding models have several important benefits:
Figure 3 | Weights of decoding and encoding models are difficult to interpret. Three examples (rows) illustrate the difficulty of interpreting decoder weights for a pair of voxels. In the first example (top row), only the top voxel contains signal (stimulus information, red) and the two voxels have independent noise. This scenario is unproblematic: both univariate mapping (second column from the left) and decoder weight maps detect the informative voxel (red). In the second example (second row), both voxels contain the same signal. Here, univariate mapping and weight maps often work. However, the LASSO decoder, because of its preference for a sparse solution, may choose one of the voxels arbitrarily. In the third example (third row), only the top voxel contains signal and both voxels contain correlated noise. Univariate mapping correctly identifies the informative voxel. Linear decoders will give negative weight (blue) to the uninformative voxel, so as to cancel the noise.

The single-model-significance fallacy

When our goal is merely to detect information in a brain region, we don’t interpret the model as a model of brain computation. This lowers the requirements for the model: It need not operate in the direction of information flow and it need not be neurobiologically plausible. The model is merely a statistical tool to sense a dependency. The choice of model in this scenario will affect our sensitivity, and its structure is not entirely irrelevant to the interpretation. For example, decodability by a linear model tells us something about the format of the encoding. However, a single model will suffice to demonstrate the presence of information in a brain region.

When our goal is to gain insight into the computations the brain performs, a model has a more prominent role: it is meant to capture, at some level of abstraction, the computations occurring in the brain.
We then require the model to operate in the causal direction and to be neurobiologically plausible (albeit abstracted). Examples of such models include encoding models of sensory responses and decoding models that predict behavioral responses from brain activity. Psychophysical models, which skip the brain entirely and predict behavioral responses directly from stimuli, also fall into this class. In all these cases, finding that a model explains significant variance is a very low bar and tells us little as to whether the model captures the computational process.
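To illustrate why significant explained variance is such a low bar, consider the following minimal sketch. Everything in it is assumed for illustration: the simulated exponential stimulus-response relationship, the sample sizes, the polynomial models, and the permutation test are hypothetical stand-ins, not the procedure of any cited study. A linear model is deliberately misspecified, yet it predicts independent test data far better than chance.

```python
import numpy as np

rng = np.random.default_rng(3)


def simulate(n):
    """Simulated data from a nonlinear process (assumed for illustration)."""
    x = rng.uniform(0.0, 3.0, n)
    y = np.exp(x) + rng.standard_normal(n)
    return x, y


x_train, y_train = simulate(200)
x_test, y_test = simulate(200)


def fit_predict(degree):
    """Fit a polynomial of the given degree on training data, predict test data."""
    coefs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coefs, x_test)


def r_squared(y, y_hat):
    """Proportion of test-set variance explained by the predictions."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)


pred_linear = fit_predict(1)   # misspecified model
pred_cubic = fit_predict(3)    # more flexible competitor
r2_linear = r_squared(y_test, pred_linear)
r2_cubic = r_squared(y_test, pred_cubic)

# Permutation test: is the linear model's test-set correlation above chance?
n_perm = 1000
observed = np.corrcoef(pred_linear, y_test)[0, 1]
null = np.array([np.corrcoef(pred_linear, rng.permutation(y_test))[0, 1]
                 for _ in range(n_perm)])
p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)

print(f"linear model: R^2 = {r2_linear:.2f}, permutation p = {p_value:.4f}")
print(f"cubic model:  R^2 = {r2_cubic:.2f}")
```

The misspecified linear model passes the significance test, and only the comparison with a competing model reveals that it misses part of the explainable variance. A full analysis would additionally compare the models inferentially and relate each to a noise ceiling, which this sketch omits.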
The single-model-significance fallacy is to interpret the fact that a single model explains significant variance as evidence in favor of the model. A simple example that is widely understood is linear correlation. A significant linear correlation does demonstrate a dependency between two variables, but it does not demonstrate that the dependency is linear. Similarly, the fact that a complex encoding model explains significant variance in the responses of a brain region to a test set of novel stimuli does demonstrate that the brain region contains information about the stimuli, but it does not demonstrate that the encoding model captures the process that computes the encoding. Even a bad model can explain significant variance, especially if it has a large number of parameters fitted to the data. In order to learn about the underlying brain computations, we need to (a) consider multiple models, (b) assess what proportion of the explainable (i.e. nonnoise) variance each explains at a given level of generalization, and (c) compare the models inferentially.

Representational interpretations require additional assumptions

Decoding and encoding models are often motivated by the goal to understand how the brain represents the world, as well as the animal’s decisions, goals, plans, actions, and motor dynamics. Significant variance explained by encoding and decoding models demonstrates the presence of information. Interpreting this information as a representation (Dennett 1987) implies the additional claim that the brain activity serves the purpose to convey the information to other parts of the brain (Kriegeskorte & Bandettini 2007, Kriegeskorte 2011, Diedrichsen & Kriegeskorte 2017). This functional interpretation is so compelling in the context of sensory systems that we sometimes jump too easily from findings of information to representational interpretations (de Wit et al. 2016, Ritchie et al. 2017).
In addition to the presence of the information, its functional role as a representation implies that the information is read out by other regions, affecting downstream processing and ultimately behavior. Combining encoding and decoding models with stimulus- and response-based experimentation can help disambiguate the causal implications (Weichwald et al. 2015). Ideally, experimental control of neural activity should also be used to test whether activity has particular downstream or behavioral consequences (Afraz et al. 2006, Raizada & Kriegeskorte 2010). To the extent that we rely on prior assumptions to justify a representational interpretation, it is important to reflect on these and consider if there is evidence from previous studies to support them.
how these methods provide complementary tools in a single toolbox for representational analyses.