Distributed and overlapping cortical networks represent visual categories
Haiguang Wen1, Junxing Shi1, Kuan Han1, and Zhongming Liu2

1Electrical and Computer Engineering, Purdue University, West Lafayette, IN, United States, 2Electrical and Computer Engineering and Biomedical Engineering, Purdue University, West Lafayette, IN, United States


The principle of cortical representations when thousands of real-life objects and categories are involved remains unclear. Here, we built a computational model of the human visual system by using a deep neural network and predicted the cortical responses to natural visual stimuli. In particular, we trained the model by using fMRI data obtained while subjects watched very long (>10 hours) natural movie stimuli that contained thousands of visual object categories. Based on the model, we systematically analyzed the activation patterns in the brain induced by different kinds of object categories. We found that the categorical information was represented by distributed and overlapping cortical networks, as opposed to discrete and distinct areas. Three cortical networks represented such broad categories as biological objects, non-biological objects, and background scenes. More fine-grained categorical representations in the brain suggest that visual objects share more (spatially) similar cortical representations if they share more similar semantic meanings.


How does the brain represent different objects in the rich visual world? Scientists commonly use a cherry-picking strategy to map the cortical activations with only a few objects or categories, e.g. faces and houses, and thus offer a narrowly focused view of categorical representations. Here, we explored a new, high-throughput strategy to investigate the cortical representations of thousands of visual objects and categories, and to offer unique insights into the distributed network basis of categorical representations. Central to this strategy is a deep learning model, i.e. a deep residual network (ResNet), which has enabled computers to recognize and segment natural pictures with human-like performance1.


We acquired a 3-tesla fMRI dataset from three human subjects, who watched over 10 hours of natural video clips over a period of weeks; the fMRI data were minimally preprocessed. We built voxel-wise encoding models by using ResNet as a fully observable model of the human visual system. Specifically, the encoding model included two parts: 1) extracting hierarchically organized features from individual movie frames based on ResNet, and 2) projecting the extracted features onto cortical fMRI responses using voxel-wise linear regression models. In ResNet, 50 layers of computational units were stacked into a hierarchical architecture1, decomposing visual information into features with increasing levels of complexity and abstraction. We used a 10.5-hour video-fMRI dataset to train the encoding model and an 8-min video-fMRI dataset to test it. The model was evaluated by calculating the correlation between the measured and model-predicted fMRI responses during the testing movie. After establishing the encoding model, we used it to predict the distributed cortical responses to 84 categories of natural pictures (around 1,000 pictures per category), which were obtained through Google Search and carefully screened. By averaging the predicted responses within categories, we then obtained the brain activation patterns corresponding to individual categories (Fig. 2). The significance of the activations of individual voxels was tested against the distribution of the responses to ~50,000 random natural images. We also evaluated the similarity (spatial correlation) of such activation patterns between every pair of categories. Based on the similarity matrix, we performed a modularity analysis to group these activation patterns into different clusters. For each cluster, we obtained an average activation pattern and related it to the semantic content within the cluster (Fig. 3).
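The second stage of the encoding model (features to voxel responses) and its evaluation can be illustrated with a minimal sketch. It is not the implementation used in this work: the ridge penalty `alpha`, the function names `fit_voxelwise_ridge` and `voxelwise_correlation`, and the synthetic data are all assumptions made for illustration, and the ResNet feature extraction is replaced by random feature vectors.

```python
import numpy as np


def fit_voxelwise_ridge(features, responses, alpha=1.0):
    """Fit one linear model per voxel so that responses ~= features @ weights.

    features:  (n_timepoints, n_features), e.g. network-derived features
    responses: (n_timepoints, n_voxels), fMRI time series
    alpha:     ridge penalty (an assumption; the text only says "linear regression")
    Both inputs are assumed to be centered, so no intercept is fitted.
    """
    n_features = features.shape[1]
    gram = features.T @ features + alpha * np.eye(n_features)
    # A single solve handles all voxels: each column of the result is one voxel's weights.
    return np.linalg.solve(gram, features.T @ responses)


def voxelwise_correlation(predicted, measured):
    """Pearson correlation between predicted and measured responses, per voxel."""
    p = predicted - predicted.mean(axis=0)
    m = measured - measured.mean(axis=0)
    return (p * m).sum(axis=0) / np.sqrt((p ** 2).sum(axis=0) * (m ** 2).sum(axis=0))


# Synthetic stand-in data (hypothetical; not the study's data).
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 400, 100, 20, 50
true_weights = rng.standard_normal((n_features, n_voxels))
train_features = rng.standard_normal((n_train, n_features))
test_features = rng.standard_normal((n_test, n_features))
train_responses = train_features @ true_weights + rng.standard_normal((n_train, n_voxels))
test_responses = test_features @ true_weights + rng.standard_normal((n_test, n_voxels))

# Train on one set of time points, evaluate on held-out time points.
weights = fit_voxelwise_ridge(train_features, train_responses, alpha=1.0)
accuracy = voxelwise_correlation(test_features @ weights, test_responses)
print(accuracy.shape, round(float(accuracy.mean()), 3))
```

The per-voxel correlation on held-out data plays the role of the prediction accuracy described above; a real pipeline would also need to account for hemodynamic delay, which this sketch omits.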


Trained with >10-hour video-fMRI data, the voxel-wise encoding models could reliably predict the cortical responses to novel testing movie stimuli with high accuracy for nearly the entire visual cortex (r = 0.64 ± 0.1) (Fig. 1). Applying the encoding models to 84 categories of natural pictures (around 1,000 pictures per category), we predicted the activation pattern corresponding to each category (Fig. 2). For example, the human face was represented by a network including the fusiform face area (FFA), the occipital face area (OFA), and the posterior superior temporal sulcus (pSTS); building was represented by activations at the bilateral parahippocampal place area (PPA), with de-activations at FFA (Fig. 2). Some categories showed similar co-activation networks. For example, living room, restaurant, and building all had significant activations in PPA; cup, cellphone, and car also shared similar cortical representations; human face, cat, and bird had distinct but overlapping representations (Fig. 2). All 84 cortical activation patterns could be grouped into three clusters based on their spatial similarities (Fig. 3). Interestingly, these clusters were found to bear different semantic content: the first cluster included non-biological objects (e.g. hat, computer, shoes, cellphone, car); the second cluster included biological objects (e.g. person, cat, horse, bird, fish); the third cluster included background scenes (e.g. bedroom, restaurant, street, market, office) (Fig. 3). The similarity of co-activation networks also reflected the semantic similarity of the corresponding objects; for example, the similarity between cat and dog (r = 0.9 ± 0.04) was higher than that between cat and bird (r = 0.7 ± 0.08) (p = 0.0012, two-way t-test). The cortical representations of biological objects and background scenes were anti-correlated in the inferior temporal cortex (Fig. 3).
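The grouping of category patterns by spatial similarity can be sketched as follows. This is a hedged illustration only: the specific modularity algorithm is not stated here, so the sketch merely averages predicted responses within categories, computes the spatial correlation matrix, and scores a candidate partition with Newman's modularity Q. Clipping negative correlations to zero, the function names `category_patterns` and `modularity`, and the synthetic data are assumptions.

```python
import numpy as np


def category_patterns(predicted):
    """Average predicted responses over images.

    predicted: (n_categories, n_images, n_voxels) -> (n_categories, n_voxels)
    """
    return predicted.mean(axis=1)


def modularity(similarity, labels):
    """Newman modularity Q of a partition on the graph of positive similarities."""
    adjacency = np.clip(similarity, 0.0, None)  # assumption: ignore negative correlations
    np.fill_diagonal(adjacency, 0.0)            # no self-connections
    degree = adjacency.sum(axis=1)
    total = degree.sum()
    same_group = labels[:, None] == labels[None, :]
    expected = np.outer(degree, degree) / total  # null-model edge weight
    return float(((adjacency - expected) * same_group).sum() / total)


# Synthetic stand-in data (hypothetical): 3 clusters of 3 categories each.
rng = np.random.default_rng(1)
n_clusters, per_cluster, n_images, n_voxels = 3, 3, 50, 200
n_categories = n_clusters * per_cluster
prototypes = rng.standard_normal((n_clusters, n_voxels))
true_labels = np.repeat(np.arange(n_clusters), per_cluster)   # 0,0,0,1,1,1,2,2,2
category_means = prototypes[true_labels] + 0.5 * rng.standard_normal((n_categories, n_voxels))
predicted = category_means[:, None, :] + rng.standard_normal((n_categories, n_images, n_voxels))

patterns = category_patterns(predicted)
similarity = np.corrcoef(patterns)  # spatial correlation between every pair of categories

# Compare the generating partition with one that mixes the clusters.
mixed_labels = np.tile(np.arange(n_clusters), per_cluster)    # 0,1,2,0,1,2,0,1,2
q_true = modularity(similarity, true_labels)
q_mixed = modularity(similarity, mixed_labels)
print(round(q_true, 3), round(q_mixed, 3))
```

A community-detection procedure would search over partitions to maximize Q; here the score is only evaluated for two fixed partitions to show that a partition aligned with the similarity structure yields the higher value.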


Here we provide evidence that the categorical information is represented by distributed and overlapping cortical networks, as opposed to discrete and distinct areas. Three cortical networks represent such broad categories as biological objects, non-biological objects, and background scenes. More fine-grained categorical representations in the brain suggest that visual objects share more (spatially) similar cortical representations if they share more similar semantic meanings. These findings, as well as those from our previous studies2, advocate the use of brain-inspired artificial intelligence models and functional neural imaging and recording to advance our understanding of the brain itself.


This work was supported in part by NIH R01MH104402.


1. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

2. Wen, H., Shi, J., Zhang, Y., Lu, K. H., & Liu, Z. (2016). Neural Encoding and Decoding with Deep Learning for Dynamic Natural Vision. arXiv preprint arXiv:1608.03425.


Fig 1. Cortical response predictability with encoding models trained and tested for the same subject. The prediction accuracy of individual voxels was quantified as the correlation between the predicted response and the measured fMRI.

Fig 2. Co-activation networks represented different kinds of categories. (a), (b), (c) Co-activation networks of example categories of background scenes or locations, man-made objects, and humans or animals, respectively. The colorbar shows the activation value.

Fig 3. Similarity of co-activation networks of 84 categories. (a) Cross-correlation matrix of the co-activation networks of 84 categories. Each element in the matrix represents the similarity of a pair of categories. The matrix is well separated into three clusters. (b) For each of the three clusters, the average activation pattern is shown and related to non-biological objects, background scenes, and biological objects.

Proc. Intl. Soc. Mag. Reson. Med. 25 (2017)