Here you can find information about some of my past and present research lines. For a more complete list, please check my publications.
Jump to: scene recognition, multi-modal CNNs for RGB-D recognition, context-based and multi-modal food modeling, scalability in summarization and adaptation
Understanding the surrounding scene is a fundamental task in computer vision and robotics. Typically, a scene (e.g. coast, office, bedroom) is a very abstract representation composed of many less abstract and localized semantic entities (e.g. sky, rock, table, car). Thus, scene recognition involves complex reasoning from low-level local features to high-level scene categories. In our research we have studied mainly two approaches to scene recognition: multi-scale convolutional neural networks (CNNs) and intermediate representations in the semantic manifold.
Scene recognition with convolutional neural networks
Since objects are important components of scenes, accurate recognition requires knowledge about both scenes and objects. More recently, deep CNNs have been trained with scene-centric data (e.g. Places, Places2), resulting in scene-CNNs (often referred to as Places-CNNs), which can complement earlier object-centric CNNs trained with ImageNet (often referred to as ImageNet-CNNs). In this context, we studied ImageNet-CNNs and Places-CNNs in multi-scale settings (extracting features from patches at different scales and then combining them into a single representation).
Scaling-induced bias. Depending on the scale, patches may represent full scenes or look more like objects. Thus, Places-CNNs and ImageNet-CNNs perform very differently depending on the scale range. In addition, the performance is also very sensitive to the particular scale. Note that the scaling operation transforms the actual content of the patches (shifting from scenes to objects as we zoom in), progressively resembling less and less the training data. Thus, scaling induces a bias in the distribution of features, and the consequence is a drop in accuracy.
Previous multi-scale CNN architectures for scene recognition typically use a single CNN model as a generic feature extractor for all scales. As discussed above, this is not optimal due to the scaling-induced bias. Thus, we propose using scale-specific CNNs, and study two methods: hybrid architectures that use Places-CNN or ImageNet-CNN depending on the scale, and fine-tuning with scale-specific patches to reduce the bias.
Our work can also be seen as a way to combine scene-centric and object-centric knowledge (e.g. Places and ImageNet). A previous attempt with CNNs is MIT’s Hybrid-CNNs, which are trained combining both ImageNet and Places data and categories. The problem is that the two sources are combined at the same scale, so the resulting models may suffer from the same bias problem. In contrast, our architecture generates hybrid object-scene representations for scene recognition while taking the scale into account (Places for global scales, ImageNet for local scales).
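As a rough illustration of the scale-dependent hybrid idea, the sketch below crops patches at several scales, routes each scale to a scene-centric or an object-centric feature extractor, pools the patch features per scale and concatenates the results. Everything here is an assumption made for illustration: the two extractors are trivial stand-ins for Places-CNN and ImageNet-CNN, `global_threshold` is a hypothetical parameter, and patches are not resized to a network input size as they would be in a real pipeline.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]                      # grayscale image as rows of pixels
Extractor = Callable[[Image], List[float]]     # patch -> feature vector


def crop_patches(image: Image, patch_size: int, stride: int) -> List[Image]:
    """Crop square patches of side `patch_size` on a regular grid."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patches.append(
                [row[left:left + patch_size] for row in image[top:top + patch_size]]
            )
    return patches


def hybrid_multiscale_features(
    image: Image,
    patch_sizes: Sequence[int],
    scene_extractor: Extractor,
    object_extractor: Extractor,
    global_threshold: int,
) -> List[float]:
    """Concatenate per-scale features, choosing the extractor by scale."""
    features: List[float] = []
    for size in patch_sizes:
        # Large patches look like scenes, small patches look like objects.
        extractor = scene_extractor if size >= global_threshold else object_extractor
        patches = crop_patches(image, size, stride=size)
        if not patches:
            raise ValueError(f"patch size {size} is larger than the image")
        per_patch = [extractor(patch) for patch in patches]
        # Average pooling over all patches of this scale.
        dim = len(per_patch[0])
        pooled = [sum(f[d] for f in per_patch) / len(per_patch) for d in range(dim)]
        features.extend(pooled)
    return features


# Stand-in extractors (hypothetical, NOT real CNNs).
def toy_scene_extractor(patch: Image) -> List[float]:
    pixels = [p for row in patch for p in row]
    return [sum(pixels) / len(pixels)]


def toy_object_extractor(patch: Image) -> List[float]:
    return [max(p for row in patch for p in row)]


image = [[float(4 * i + j) for j in range(4)] for i in range(4)]
descriptor = hybrid_multiscale_features(
    image, [4, 2], toy_scene_extractor, toy_object_extractor, global_threshold=4
)
```

The resulting descriptor has one pooled block per scale, so a classifier trained on top of it sees scene-like evidence from the global scale and object-like evidence from the local scale.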
In addition to conventional RGB scene recognition, we also explored multi-modal CNNs for RGB-D scene recognition (see multi-modal CNNs for RGB-D recognition).
- L. Herranz, S. Jiang, X. Li, “Scene recognition with CNNs: objects, scales and dataset bias”, Proc. International Conference on Computer Vision and Pattern Recognition (CVPR16), Las Vegas, Nevada, USA, June 2016 [link] [poster].
Co-occurrence modeling for scene recognition in the semantic manifold
Before the availability of large datasets and the explosion of CNN representations, the large semantic gap in scene recognition was typically addressed in two steps using some sort of localized mid-level representation (e.g. objects). Mid-level entities are inferred from local visual features, and then scene categories from mid-level representations. The problem with intermediate representations is that they require defining a mid-level vocabulary and costly local annotation. Alternatively, mid-level concepts can be considered unknown (e.g. latent topics, parts), but in this case they need to be discovered jointly while modeling the scenes (e.g. LDA and variants), which is costly and difficult to scale to large datasets.
In contrast, we focus on the semantic manifold/semantic multinomial representation, which models patches directly with scene labels in a weakly supervised fashion (since all patches in the image share the same scene label). However, this process also creates the specific problem of scene category co-occurrences, where related categories are given significant probability, making the description not very discriminative. Fortunately these patterns are consistent across categories, so scene categories can be modeled and disambiguated with a second classifier. To obtain good performance it is critical to remove co-occurrence noise before this second classifier. In our work we exploit different unsupervised methods to do that (MRF, sparse coding, kernelized noise filters).
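To make the representation more concrete, here is a minimal sketch of how an image-level semantic multinomial could be computed: each patch gets a posterior over scene categories, and the patch posteriors are pooled into a single probability vector. The patch scores are hypothetical classifier outputs, and simple averaging is an assumption chosen for brevity, not necessarily the pooling used in our papers. The example also shows the co-occurrence effect, with two related categories receiving similar probability.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw classifier scores into a posterior over scene categories."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_multinomial(patch_scores: List[List[float]]) -> List[float]:
    """Average the per-patch posteriors into one image-level multinomial."""
    posteriors = [softmax(scores) for scores in patch_scores]
    num_categories = len(posteriors[0])
    return [
        sum(p[c] for p in posteriors) / len(posteriors)
        for c in range(num_categories)
    ]


# Hypothetical scores for categories (bedroom, living room, coast).
# The first two categories are visually related, so patches score both highly.
patch_scores = [[2.0, 1.8, -1.0], [1.5, 1.9, -0.5], [2.2, 2.0, -2.0]]
smn = semantic_multinomial(patch_scores)
```

A second classifier would then be trained on vectors like `smn`, ideally after filtering out the co-occurrence noise.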
- X. Song, S. Jiang, L. Herranz, “Multi-scale multi-feature context modeling for scene recognition in the semantic manifold”, IEEE Transactions on Image Processing (accepted March 2017).
- X. Song, S. Jiang, L. Herranz, Y. Kong, K. Zheng, “Category co-occurrence modeling for large scale scene recognition”, Pattern Recognition, vol. 59, pp. 98-111, Nov. 2016 [link] [poster].
- X. Song, S. Jiang, L. Herranz, “Joint Multi-feature Spatial Context for Scene Recognition on the Semantic Manifold”, Proc. International Conference on Computer Vision and Pattern Recognition (CVPR15), pp. 1312-1320, Boston, Massachusetts, USA, June 2015 [link] [poster].
Multi-modal CNNs for RGB-D visual recognition
With the availability of low-cost depth sensors (e.g. Kinect, RealSense), RGB-D visual recognition has emerged as a new research topic. The additional depth channel can provide useful information about the physical properties of the observed objects and scenes. A major difference with respect to RGB visual recognition is the much more limited amount of data.
RGB-D scene recognition
To study multi-modal visual recognition we focused on RGB-D scene recognition. Current works typically rely on transferring large RGB models (e.g. Places-CNN) to the depth modality and fine-tuning them with depth data. In our work we proposed an architecture and training strategy to learn depth CNNs only from depth data, which nevertheless outperform fine-tuned Places-CNN and overall achieve state-of-the-art RGB-D scene recognition.
- X. Song, S. Jiang, L. Herranz, C. Chen, “Learning Effective RGB-D Representations for Scene Recognition”, IEEE Transactions on Image Processing, 2018 (accepted) [arxiv].
- X. Song, S. Jiang, L. Herranz, “Combining Models from Multiple Sources for RGB-D Scene Recognition”, Proc. International Joint Conference on Artificial Intelligence (IJCAI17), Melbourne, Australia, August 2017 (acceptance rate <26%).
- X. Song, L. Herranz, S. Jiang, “Depth CNNs for RGB-D scene recognition: learning from scratch better than transferring from RGB-CNNs”, Proc. AAAI Conference on Artificial Intelligence (AAAI17), San Francisco, California, USA, February 2017 (acceptance rate <25%) [poster] [models].
Context-based and multi-modal food modeling
Food photos are widely used in food logs for diet and health monitoring and in social networks to share social and gastronomic experiences. We focus on the frequent scenario in which photos are taken in restaurants (we often use the term dish in that scenario). Dish recognition in general is very challenging, due to different cuisines, cooking styles and the intrinsic difficulty of modeling food from its visual appearance. To solve such complex problems, humans in practice also leverage prior knowledge and contextual information. In particular, the geocontext (e.g. GPS coordinates) has been widely exploited for outdoor landmark recognition.
Similarly, we exploit knowledge about menus and location of restaurants and test images.
We first adapt the conventional landmark recognition framework, which discards unlikely categories located far from the test image (we use the term shortlist approach), and use it as a baseline.
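The shortlist baseline can be sketched as follows: restaurants within a given radius of the query contribute their menus to a candidate set, and the visual classifier's scores are only compared among those candidates. The data layout, the `radius_km` parameter and the example scores are assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional, Set, Tuple


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def shortlist_dishes(
    query: Tuple[float, float], restaurants: List[dict], radius_km: float
) -> Set[str]:
    """Collect the dishes served by restaurants within `radius_km` of the query."""
    candidates: Set[str] = set()
    for restaurant in restaurants:
        distance = haversine_km(query[0], query[1], restaurant["lat"], restaurant["lon"])
        if distance <= radius_km:
            candidates.update(restaurant["menu"])
    return candidates


def classify_with_shortlist(
    scores: Dict[str, float], candidates: Set[str]
) -> Optional[str]:
    """Pick the best-scoring dish among the shortlisted categories only."""
    allowed = {dish: s for dish, s in scores.items() if dish in candidates}
    return max(allowed, key=allowed.get) if allowed else None


# Hypothetical restaurants and visual classifier scores.
restaurants = [
    {"lat": 39.9000, "lon": 116.4000, "menu": {"dumplings", "noodles"}},
    {"lat": 40.5000, "lon": 116.4000, "menu": {"pizza"}},
]
visual_scores = {"pizza": 0.9, "dumplings": 0.6, "noodles": 0.3}
candidates = shortlist_dishes((39.9005, 116.4005), restaurants, radius_km=0.5)
prediction = classify_with_shortlist(visual_scores, candidates)
```

In this toy case the visually most likely dish is discarded because the only restaurant serving it is far from the query.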
In a first work we focus on the visual classifier under geolocalized settings (i.e. we know the geolocation of the query, but only at test time). Since we know the nearby restaurants and consequently the candidate dish categories, the problem becomes much simpler, because only a fraction of the categories in the dataset is relevant (a few tens vs. several hundreds). However, the visual classifier is trained on the whole dataset, while at test time only a fraction of the categories is relevant. Thus the classifier is suboptimal and overly complex, since it is trained to discriminate between many more categories than necessary. To reduce this mismatch between training and test conditions, we propose geolocalized classifiers. The optimal geolocalized classifier is the one trained for each query using only the training data of the relevant categories. Since we have no information about test queries at training time, and training at test time is too expensive in practice, we propose using restaurants as anchor points and training classifiers geolocalized to each restaurant. At test time we approximate the optimal geolocalized classifier by selecting and combining the geolocalized classifiers corresponding to the restaurants near the query. We proposed two ways to implement this idea: geolocalized voting of pairwise models and combination of bundled classifiers.
- R. Xu, L. Herranz, S. Jiang, S. Wang, X. Song, R. Jain, “Geolocalized Modeling for Dish Recognition”, IEEE Transactions on Multimedia, vol. 17, no. 8, pp. 1187-1199, August 2015 [link] [poster].
- R. Xu, S. Jiang, L. Herranz, “Dishes: a restaurant-oriented food dataset”, Institute of Computing Technology, Chinese Academy of Sciences [link].
Probabilistic modeling of restaurant context
While the shortlist approach is very effective in simplifying the problem by reducing the number of categories, it treats all the shortlisted restaurants and categories equally, regardless of their distance to the query. We reformulate the problem using a probabilistic model connecting dishes, restaurants and locations. This model allows us to include smooth models for the neighborhood and restaurants, which can be combined using probabilities. In addition, it allows us to perform inference over any of the latent variables, leading to three different tasks: dish recognition, restaurant recognition and location refinement.
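One way to picture such a model is the sketch below, which replaces the hard distance cut of the shortlist with a smooth weighting. It assumes a particular factorization that is only illustrative and not necessarily the one in our papers: a Gaussian kernel over distance gives a probability for each restaurant, each restaurant has a uniform distribution over its menu, and the result is combined with a visual likelihood per dish. The names `bandwidth_km`, `menus` and `visual_likelihood` are hypothetical.

```python
import math
from typing import Dict, Set


def restaurant_posterior(
    distances_km: Dict[str, float], bandwidth_km: float
) -> Dict[str, float]:
    """Smooth neighborhood model: closer restaurants get higher probability."""
    weights = {
        name: math.exp(-0.5 * (d / bandwidth_km) ** 2)
        for name, d in distances_km.items()
    }
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("all restaurants are too far away for this bandwidth")
    return {name: w / total for name, w in weights.items()}


def dish_posterior(
    visual_likelihood: Dict[str, float],
    menus: Dict[str, Set[str]],
    distances_km: Dict[str, float],
    bandwidth_km: float,
) -> Dict[str, float]:
    """Combine visual evidence with a location-dependent prior over dishes."""
    p_restaurant = restaurant_posterior(distances_km, bandwidth_km)
    scores = {}
    for dish, likelihood in visual_likelihood.items():
        # Prior of the dish: marginalize over restaurants, with a uniform
        # distribution over the dishes of each menu (an assumption).
        prior = sum(
            p_restaurant[name] / len(menu)
            for name, menu in menus.items()
            if dish in menu
        )
        scores[dish] = likelihood * prior
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no candidate dish has non-zero probability")
    return {dish: s / total for dish, s in scores.items()}


# Hypothetical example: restaurant A is close to the query, B is far away.
menus = {"A": {"dumplings", "noodles"}, "B": {"pizza"}}
distances_km = {"A": 0.1, "B": 1.0}
visual_likelihood = {"dumplings": 0.3, "noodles": 0.3, "pizza": 0.4}
posterior = dish_posterior(visual_likelihood, menus, distances_km, bandwidth_km=0.3)
```

Unlike the shortlist, the far restaurant is not discarded outright: its dishes simply receive a small prior, so strong visual evidence could still outweigh the distance.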
- L. Herranz, S. Jiang, R. Xu, “Modeling Restaurant Context for Food Recognition”, IEEE Transactions on Multimedia, vol. 19, no. 2, pp. 430-440, Feb. 2017 [link].
- L. Herranz, R. Xu, S. Jiang, “A probabilistic framework for food recognition in restaurants”, Proc. International Conference on Multimedia and Expo 2015 (ICME15), pp. 1-6, Torino, Italy, June 2015 [link] [slides] [poster].
Multi-modal food modeling
In this work (mostly carried out by Dr. Weiqing Min) we explore multi-modal analysis of food with applications in retrieval and exploration. The system leverages images, recipe names, ingredients and other attributes.
- W. Min, S. Jiang, J. Sang, H. Wang, L. Herranz, “Being a Super Cook: Joint Food Attributes and Multi-Modal Content Modeling for Recipe Retrieval and Exploration”, IEEE Transactions on Multimedia (accepted December 2016) [link] [dataset].
Scalability in summarization and adaptation
Video is a rich type of content conveying visual and audio information during a certain temporal interval. However, this temporal nature poses many challenges to its effective distribution, management and visualization. These challenges are of particular interest in devices with limited screen resolution, processing capabilities or network capacity, such as smartphones and tablets. In addition, the amount of digital video content available in multimedia platforms has increased dramatically in recent years (e.g. YouTube).
Video adaptation addresses the problem of delivering video content adapted to the particular user’s environment, including terminal, network and user’s preferences. Scalable video coding (SVC) organizes the video stream packets in such a way that selecting a certain subset (i.e. bitstream extraction) results in an adapted bitstream that is still compliant with the coding format and decodable into an adapted version (e.g. half spatial resolution, half frame rate and lower bit rate).
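Conceptually, bitstream extraction can be sketched as a filter over packets labeled with layer identifiers. The packet structure below is a simplified assumption (real scalable codecs carry this information in packet headers and have additional dependencies between layers), and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Packet:
    frame: int        # index of the frame the packet belongs to
    spatial: int      # spatial layer (0 = base resolution)
    temporal: int     # temporal layer (0 = base frame rate)
    quality: int      # quality layer (0 = base quality)
    size_bytes: int


def extract_bitstream(
    packets: List[Packet], max_spatial: int, max_temporal: int, max_quality: int
) -> List[Packet]:
    """Keep only the packets needed for the target operating point.

    The original packet order is preserved, and no packet is re-encoded.
    """
    return [
        p
        for p in packets
        if p.spatial <= max_spatial
        and p.temporal <= max_temporal
        and p.quality <= max_quality
    ]


# Toy stream: 4 frames, 2 spatial layers, odd frames in temporal layer 1.
stream = [
    Packet(frame=f, spatial=s, temporal=f % 2, quality=0, size_bytes=100 * (s + 1))
    for f in range(4)
    for s in range(2)
]

# Half spatial resolution and half frame rate.
adapted = extract_bitstream(stream, max_spatial=0, max_temporal=0, max_quality=0)
adapted_size = sum(p.size_bytes for p in adapted)
```

Because adaptation only selects packets, it is far cheaper than transcoding the video for each terminal or network condition.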
Video summarization (or video abstraction) addresses the problem of creating compact yet informative visual representations (i.e. summaries or abstracts) from which the user can quickly grasp the underlying semantic content of the original video. Thus, large collections of videos can be browsed in a fraction of the time needed to watch the whole content. Examples of (static and dynamic) video summaries are storyboards (i.e. a collection of key frames from the video) and video skims (short clips made with key segments of the original video).
During my Ph.D. I explored summarization and adaptation from a unified perspective, focusing in particular on scalable representations. A first observation is that we can connect these two areas by considering video summaries also as adapted versions of the original video, in which the temporal structure is re-arranged according to some semantic analysis.
- L. Herranz, “A scalable approach to video summarization and adaptation”, Ph.D thesis, October 2010, Universidad Autónoma de Madrid [slides].
Inspired by scalable video representations, which are encoded once and decoded many times depending on the particular external constraints, we developed scalable summaries: representations from which summaries are obtained in an analyze-once, generate-many fashion. Scalable summaries can be adapted to a wide range of summary lengths (e.g. number of images, duration), which is the main constraint for a summary. Some examples are scalable storyboards, scalable video skims and scalable comic-like summaries. There is a natural trade-off between amount of information and length (and indirectly browsing time and display area).
Conventional summarization algorithms try to maximize information coverage for a given length budget. In contrast, algorithms for scalable summaries need to balance the information coverage at multiple lengths simultaneously, while representing the potential summaries in a compact description from which summaries of any length can be extracted. Note that, as in scalable video, the most computationally demanding task, i.e. content analysis and summarization, is performed only once, while the generation of the summary is on-demand using a simple and efficient mechanism (using similar bitstream extraction techniques, see next section).
These scalable summary representations are typically a reorganization of the packets by semantic importance for summarization (e.g. a ranked list of bitstream packets). Thus, a summary is extracted simply by truncating the ranked list to select a subset of packets (depending on the required summary length).
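The analyze-once, generate-many mechanism can be sketched as follows. The importance scores are hypothetical outputs of the content analysis stage, and ranking by a single fixed score is a simplification of the actual ranking algorithms. The ranked list is computed once, and a summary of any length is then obtained by truncation.

```python
from typing import Dict, List


def rank_segments(importance: Dict[int, float]) -> List[int]:
    """Analysis step (run once): order segment indices by decreasing importance."""
    return sorted(importance, key=lambda seg: (-importance[seg], seg))


def extract_summary(ranked: List[int], length: int) -> List[int]:
    """Generation step (run on demand): truncate the ranked list.

    The selected segments are returned in temporal order, assuming that
    segment indices follow the timeline of the original video.
    """
    return sorted(ranked[:length])


# Hypothetical importance scores for four segments of a video.
importance = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.7}
ranked = rank_segments(importance)

short_summary = extract_summary(ranked, 2)
longer_summary = extract_summary(ranked, 3)
```

A useful property of this representation is that summaries are nested: every segment in a shorter summary also appears in any longer one.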
- L. Herranz, J. Calic, J. M. Martínez, M. Mrak, “Scalable Comic-Like Video Summaries and Layout Disturbance”, IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 1290-1297, August 2012 [link].
- L. Herranz, J. M. Martínez, “A framework for scalable summarization of video”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 9, pp. 1265-1270, September 2010 [link].
- L. Herranz and J.M. Martínez, “An efficient summarization algorithm based on clustering and bitstream extraction”, Proc. International Conference on Multimedia and Expo 2009 (ICME09), pp. 654-657, New York, USA, July 2009 [link] [poster].
- L. Herranz, J.M. Martínez, “Generation of scalable summaries based on iterative GOP ranking”, Proc. International Conference on Image Processing (ICIP08), pp. 2544-2547, San Diego, California, October 2008 [link] [poster].
Integrated summarization and adaptation
Another advantage of considering summarization and adaptation jointly is that summaries can also be adapted to the terminal and network constraints. Thus, the actual bitstream of an adapted summary is extracted using the same bitstream extraction tools. We studied this approach in a wavelet-based scalable video codec, H.264/MPEG-4 AVC and its scalable extension MPEG-4 SVC. The bitstream extraction tools need to take some additional precautions to preserve a valid packet order and internal decoder state, so that the bitstream remains compliant with the standard. This is achieved by dynamically modifying certain parameters in packet headers.
- L. Herranz, J.M. Martínez, “Combining MPEG tools to generate video summaries adapted to the terminal and network”, Computer Journal, vol. 56, no. 5, pp. 529-553, May 2013 [link].
- L. Herranz, J. M. Martínez, “On the use of hierarchical prediction structures for efficient summary generation of H.264/AVC bitstreams”, Signal Processing: Image Communication, vol. 24, no. 8, pp. 615-629, September 2009 [link].
- L. Herranz and J. M. Martínez, “An integrated approach to summarization and adaptation using H.264/MPEG-4 SVC”, Signal Processing: Image Communication, vol. 24, no. 6, pp. 499-509, July 2009 [link].
- L. Herranz, J.M. Martínez, “Integrated summarization and adaptation using H.264/MPEG-4 SVC”, Proc. International Conference on Visual Information Engineering (VIE08), pp. 729-734, Xi’an, China, July 2008 [link].
- L. Herranz, “Integrating semantic analysis and scalable video coding for efficient content-based adaptation”, Multimedia Systems, vol. 13, no. 2, pp. 103-118, August 2007 [link].
- L. Herranz, J.M. Martínez, “Use cases of scalable video based summarization within MPEG-21 DIA”, Proc. International Conference on Semantic and Digital Media Technology (SAMT07), LNCS, vol 4816, pp. 256-259, Springer Verlag, Genoa, Italy, December 2007 [link].
- L. Herranz, “A framework for online semantic adaptation of scalable video”, Proc. International Workshop on Semantic Media Adaptation and Personalization (SMAP06), pp. 13-18, Athens, Greece, December 2006 [link].
Applications of scalable summaries
Scalable representations can be applied to adapt the length of the summary to user preferences (e.g. user A prefers 10-second video skims, user B prefers 20-second ones) and terminal displays (default storyboards may contain 10 images on a PC and 5 images on a smartphone). Scalable storyboards can be used in navigation to efficiently implement semantic zooming (i.e. zooming into a temporal segment reveals more images from the same segment, increasing the semantic detail rather than the visual detail). However, an additional usability problem arises in this setting: transitions from one scale to another can be difficult to follow (e.g. images appearing, disappearing, changing their size and location). The user interface and the representation should therefore take these usability factors into account in order to provide smooth and useful scale transitions.
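Semantic zooming on top of a ranked list can be sketched in a few lines. The example assumes that the scalable storyboard is stored as a list of frame indices sorted by decreasing importance (hypothetical values below), and that the display shows a fixed number of images regardless of the zoom level.

```python
from typing import List


def storyboard(ranked_frames: List[int], start: int, end: int, num_images: int) -> List[int]:
    """Select the most important frames inside the window [start, end).

    Frames are returned in temporal order, ready to be laid out on screen.
    """
    in_window = [f for f in ranked_frames if start <= f < end]
    return sorted(in_window[:num_images])


# Hypothetical ranked list of key frames for a 100-frame video.
ranked_frames = [50, 10, 80, 30, 60, 20, 90, 40, 70, 0]

overview = storyboard(ranked_frames, start=0, end=100, num_images=4)
zoomed = storyboard(ranked_frames, start=0, end=50, num_images=4)
```

Zooming into the first half of the video keeps the frames already shown in the overview for that segment and fills the remaining slots with less important frames from the same segment, which is what makes the zoom semantic rather than visual.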
- L. Herranz, S. Jiang, “Scalable storyboards in handheld devices: applications and evaluation metrics”, Multimedia Tools and Applications, vol. 75, no. 20, pp. 12597-12625, October 2016 [link].
- L. Herranz, “Multiscale browsing through video collections in smartphones using scalable storyboards”, Proc. International Conference on Multimedia and Expo 2012 (ICME12), Workshop on Social Multimedia Computing (SMC2012), pp. 278-283, Melbourne, Australia, July 2012 [link].