Hi! I am a senior research manager in Artificial Intelligence at Samsung R&D Institute Brazil, and this is my personal web page.
On this page, you can find news and materials related to me and my research.
You can also find more information about me via the links below.
[August 2020] Code release on GitHub: The source code of the Eva tool has been released on my GitHub page. The tool was developed during my master's degree.
[July 2020] Top-cited paper: A paper I published in the Pattern Recognition journal in 2017, in collaboration with colleagues from UFMG, currently ranks 3rd among the most cited papers in the journal, according to the updated Google Scholar metrics for papers published from 2015 to 2019.
[July 2020] Top-cited paper: A paper I published at CVPR Workshops in 2015, in collaboration with colleagues from UFMG, currently ranks 6th among the most cited papers from CVPR Workshops, according to the updated Google Scholar metrics for papers published from 2015 to 2019.
[October 2018] Best Paper Award at SIBGRAPI 2018: One of my papers at SIBGRAPI 2018 received the Best Paper Award in the Image Processing/Computer Vision/Pattern Recognition category. The paper, "Bag of Attributes for Video Event Retrieval", was written in collaboration with colleagues from Unifesp (São José dos Campos).
[September 2018] Patent granted by the USPTO: "Method for multiclass classification in open-set scenarios and uses thereof", Patent 14/532,580.
[September 2017] Best results for Flood Detection in Satellite Images (FDSI) in the MediaEval 2017 challenge: In collaboration with colleagues from UFMG, Unicamp, UEFS, and Samsung Brazil, we achieved strong results in the two sub-tasks of the Satellite task of MediaEval 2017.
For Flood Detection in Satellite Images (FDSI), we took 1st place.
For Disaster Image Retrieval from Social Media (DIRSM), we obtained the highest average precision (AP@480) in two runs (textual only and textual+visual). Our working-notes paper explains the approaches used.
[August 2017] Paper accepted at the Elsevier Pattern Recognition Letters journal: "TWM: A framework for creating highly compressible videos targeted to computer vision tasks", in collaboration with F. Andaló and V. Testoni. The method in this paper is related to US Patent 9,699,476.
[July 2017] Patent granted by the USPTO: "System and method for video context-based composition and compression from normalized spatial resolution objects", Patent 9,699,476.
[November 2016] Paper published in the Springer Machine Learning journal: "Nearest neighbors distance ratio open-set classifier" proposes a new multiclass open-set classifier that is robust to classes that are unknown at training time but appear during testing (the open-set scenario).
[July 2016] Two papers accepted at SIBGRAPI 2016: the following papers were accepted at the Conference on Graphics, Patterns and Images (SIBGRAPI), which will take place in São José dos Campos, SP, in October 2016:
[June 2016] Unicamp Inventors Award 2016: The research team of my postdoc project received the Unicamp Inventors Award 2016 in the "Licensed Technology" category for the patents filed during the research collaboration between Unicamp and Samsung. The project was about feature engineering and open-set recognition.
Postdoc (12/2012 - 12/2013) - Unicamp/Samsung: Pattern recognition and classification by feature engineering, *-fusion, open-set recognition, and meta-recognition
Principal investigator: Anderson de Rezende Rocha
Awards:
Unicamp Inventors Award 2016 - category Licensed Technology, for the patents filed during the research collaboration between Unicamp and Samsung.
Effectively encoding visual properties from multimedia content is challenging.
One popular approach to deal with this challenge is the visual dictionary model.
In this model, an image is handled as an unordered set of local features and represented by the so-called bag-of-(visual-)words vector.
In this thesis, we work on three research problems related to the visual dictionary model.
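The bag-of-words pipeline just described can be sketched in a few lines. This is a minimal illustration, not code from the thesis: random toy data stands in for real local descriptors (e.g., SIFT), and the function name `bag_of_words` is chosen here for clarity.

```python
import numpy as np

def bag_of_words(local_feats, dictionary):
    """Map each local feature to its nearest visual word (Euclidean)
    and return the normalized word histogram (order is discarded)."""
    # pairwise squared distances: (n_feats, n_words)
    d2 = ((local_feats[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                     # nearest word per feature
    hist = np.bincount(words, minlength=len(dictionary)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(16, 8))   # 16 visual words, 8-D descriptors
feats = rng.normal(size=(60, 8))        # 60 local descriptors from one image
vec = bag_of_words(feats, dictionary)
print(vec.shape)            # (16,)
print(round(vec.sum(), 6))  # 1.0
```

In a real pipeline, the dictionary would be learned (e.g., by clustering local features from a training set) rather than drawn at random.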
The first research problem concerns the generalization power of dictionaries: the ability to represent images from one dataset well even when using a dictionary created from another dataset, or from small samples of a dataset.
We perform experiments on closed datasets, as well as in a Web environment. The results suggest that samples that are diverse in appearance are enough to generate a good dictionary.
The second research problem relates to the importance of the spatial information of visual words in the image space, which can be crucial for distinguishing types of objects and scenes. Traditional pooling methods usually discard the spatial configuration of visual words in the image.
We propose a pooling method, named Word Spatial Arrangement (WSA), which encodes the relative positions of visual words in the image and has the advantage of generating more compact feature vectors than most existing spatial pooling strategies.
Experiments on image retrieval show that WSA outperforms the most popular spatial pooling method, Spatial Pyramids.
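As a rough illustration of the general idea of encoding relative word positions (a simplified sketch, not the exact WSA formulation; the quadrant scheme and function name are assumptions made here):

```python
import numpy as np

def relative_quadrant_encoding(points, words, n_words):
    """For each keypoint, count how many other keypoints fall in each of
    the four quadrants around it, then accumulate the counts per visual
    word. Output: a flattened (n_words, 4) matrix, L1-normalized."""
    enc = np.zeros((n_words, 4))
    for i, (x, y) in enumerate(points):
        dx = points[:, 0] - x
        dy = points[:, 1] - y
        mask = np.ones(len(points), bool)
        mask[i] = False                              # skip the point itself
        q = (dx[mask] >= 0).astype(int) * 2 + (dy[mask] >= 0).astype(int)
        enc[words[i]] += np.bincount(q, minlength=4)
    total = enc.sum()
    return (enc / total).ravel() if total else enc.ravel()

rng = np.random.default_rng(1)
pts = rng.uniform(size=(30, 2))          # keypoint coordinates in [0,1]^2
wds = rng.integers(0, 5, size=30)        # visual-word id of each keypoint
v = relative_quadrant_encoding(pts, wds, n_words=5)
print(v.shape)  # (20,)
```

Note the vector length grows only by a small constant factor (4 per word here), which is the kind of compactness advantage claimed over Spatial Pyramids.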
The third research problem investigated in this thesis is the lack of semantic information in the visual dictionary model.
We show that the problem of having no semantics in the space of low-level descriptions is reduced when we move to the bag-of-words representation.
However, even in the bag-of-words space, we show that there is little separability between distance distributions of different semantic concepts.
Therefore, we go one step further and propose a representation based on visual words that carry more semantics, according to human visual perception.
We propose a bag-of-prototypes model, in which the prototypes are the elements carrying more semantics.
This approach goes in the direction of reducing the so-called semantic gap problem. We propose a dictionary based on scenes, which is used for video representation in experiments on video geocoding, the task of assigning a geographic location to a given video.
The evaluation was performed in the context of the Placing Task of the MediaEval challenge, and the proposed bag-of-scenes model showed promising performance.
Master's (03/2007 - 03/2009): Comparative study of descriptors for content-based image retrieval on the Web
Advisor: Ricardo da Silva Torres
Awards:
The growth in size of image collections and their worldwide availability have increased the demand for image retrieval systems.
A promising approach to address this demand is to retrieve images based on their content (Content-Based Image Retrieval).
This approach considers visual properties of the image, such as color, texture, and the shape of objects, for indexing and retrieval.
The main component of a content-based image retrieval system is the image descriptor, which is responsible for
encoding image properties into feature vectors. Given two feature vectors, the descriptor compares them and computes a distance value
that quantifies the difference between the images represented by those vectors. In a content-based image retrieval system, these distance
values are used to rank database images by their distance to a given query image.
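The descriptor abstraction described above (an extractor plus a distance function, used to rank a database) can be sketched as follows. The toy gray-level histogram and L1 distance are stand-ins for the real descriptors compared in the dissertation:

```python
import numpy as np

def extract(image):
    """Toy extractor: a normalized gray-level histogram standing in for a
    real color/texture/shape descriptor."""
    hist, _ = np.histogram(image, bins=8, range=(0, 256))
    return hist / hist.sum()

def distance(a, b):
    """L1 distance between two feature vectors."""
    return float(np.abs(a - b).sum())

def rank(query_image, database_images):
    """Rank database images by distance to the query (closest first)."""
    q = extract(query_image)
    dists = [distance(q, extract(img)) for img in database_images]
    return np.argsort(dists)

rng = np.random.default_rng(0)
db = [rng.integers(0, 256, size=(16, 16)) for _ in range(5)]
order = rank(db[2], db)
print(order[0])  # 2 -- the query itself ranks first (distance zero)
```

Swapping `extract` and `distance` for other implementations is exactly the axis along which descriptors are compared in this study.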
This dissertation presents a comparative study of image descriptors considering the Web as the environment of use: an environment with a
huge number of images with heterogeneous content. The comparative study was conducted using two approaches. The first
considers the asymptotic complexity of the feature-extraction algorithms and distance functions, the size of the feature vectors generated
by the descriptors, and the environment in which each descriptor was validated. The second compares the descriptors in practical experiments
on four different image databases. The evaluation considers the time required for feature extraction, the time for computing distance values,
the storage requirements, and the effectiveness of each descriptor. Color, texture, and shape descriptors were compared. The experiments were
performed with each kind of descriptor independently and, based on these results, a set of descriptors was evaluated on an image database
containing more than 230 thousand heterogeneous images, reflecting the content found on the Web. Descriptor effectiveness
on the heterogeneous database was evaluated in experiments with real users. This dissertation also presents a tool for
running experiments to evaluate image descriptors.
Undergraduate research project (07/2006 - 12/2006): Content-based image retrieval using spatial relationship descriptors
Advisor: Ricardo da Silva Torres Awards:
3rd place, CTIC 2007 - Undergraduate Research Projects Contest - Brazilian Computer Society (SBC)
Best Undergraduate Research Project 2006 - Institute of Computing - University of Campinas (Unicamp)
The growth in size of image collections has increased the demand for image retrieval systems. These systems use a great variety of techniques,
one of the most important being content-based image retrieval (CBIR). CBIR is based on image properties such as color, texture, shape, and
spatial relationships. The last of these can be fundamental for image recognition and retrieval, benefiting several application domains,
such as geographic and medical imaging. This work presents a comparative study of spatial relationship descriptors. The experiments compare
several descriptors using efficiency and effectiveness as the evaluation criteria. New spatial relationship descriptors are also proposed.
The results indicate that the proposed descriptors are superior to the existing ones.
Generating novel views from a single already-captured image is a hard task in computer vision and graphics, in particular when the input image has dynamic parts such as people or moving objects. In this paper, we tackle this problem by proposing a new framework, called CycleMPI, that is capable of learning a multiplane image representation from single images through a cyclic training strategy for self-supervision. Our framework does not require stereo data for training, so it can be trained with massive visual data from the Internet, resulting in better generalization capability even for very challenging cases. Although our method does not require stereo data for supervision, it reaches results on stereo datasets comparable to the state of the art in a zero-shot scenario. We evaluated our method on the RealEstate10K and Mannequin Challenge datasets for view synthesis and present qualitative results on the Places II dataset.
Adaptive Multiplane Image Generation From a Single Internet Picture
LUVIZON, D. C. ; CARVALHO, G. S. P. ; dos Santos, A. A. ; CONCEICAO, J. S. ; FLORES-CAMPANA, J. L. ; DECKER, L. G. ; SOUZA, M. R. ; PEDRINI, H. ; JOIA, A. ; PENATTI, O. A. B.
In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, p. 2556-2565.
In the last few years, several works have tackled the problem of novel view synthesis from a pair of stereo images or even from a single picture. However, previous methods are computationally expensive, especially for high-resolution images. In this paper, we address the problem of generating an efficient multiplane image (MPI) from a single high-resolution picture. We present the adaptive-MPI representation, which allows rendering novel views with low computational requirements. To this end, we propose an adaptive slicing algorithm that produces an MPI with a variable number of image planes. We also present a new lightweight CNN for depth estimation, which is learned by knowledge distillation from a larger network. Occluded regions in the adaptive-MPI are also inpainted by a lightweight CNN. We show that our method is capable of producing high-quality predictions with one order of magnitude fewer parameters than previous approaches. In addition, we show the robustness of our method for novel view synthesis on challenging pictures from the Internet.
A comparison of graph-based semi-supervised learning for data augmentation
OLIVEIRA, W. D. G. ; PENATTI, O. A. B. ; BERTON, L.
In: Conference on Graphics, Patterns, and Images (SIBGRAPI), 2020.
In supervised learning, the accuracy of an algorithm usually improves with the size of the labeled dataset used to train the classifier. However, in many real-life scenarios, obtaining enough labeled data is costly or even impossible. In such circumstances, Data Augmentation (DA) techniques are usually employed, generating more labeled data for training machine learning algorithms. Common DA techniques are applied to already-labeled data, generating simple variations of it: for image classification, for example, image samples are rotated, cropped, flipped, or transformed by other operators to generate variations of the input samples while keeping their original labels. Other options are using neural network algorithms that create new synthetic data or employing Semi-Supervised Learning (SSL) to label existing unlabeled data. In this paper, we perform a comparison among graph-based semi-supervised learning (GSSL) algorithms for augmenting the labeled dataset. The main advantage of GSSL is that the training set can be increased by adding non-annotated images, thereby benefiting from the huge amount of unlabeled data available. Experiments are performed on five datasets for recognition of handwritten digits and letters (MNIST and EMNIST), animals (Dogs vs. Cats), clothes (Fashion-MNIST), and remote sensing images (Brazilian Coffee Scenes), in which we compare different possibilities for DA, including GSSL, Generative Adversarial Networks (GANs), and traditional Image Transformations (IT) applied to the input labeled data. We also evaluate the impact of these techniques on different convolutional neural networks (CNNs). Results indicate that, although all DA techniques performed well, GSSL was more robust to different image properties, presenting less accuracy variation across datasets.
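A minimal, self-contained sketch of graph-based label propagation, the core idea behind GSSL: build a k-NN graph and iteratively spread label mass from the few labeled nodes to the unlabeled ones. This is a toy stand-in for the algorithms compared in the paper; the graph construction and clamping scheme here are deliberate simplifications.

```python
import numpy as np

def propagate_labels(X, y, n_neighbors=3, n_iter=50):
    """Toy graph-based label propagation: y uses -1 for unlabeled nodes.
    Labeled nodes are clamped to their known class at every iteration."""
    n = len(X)
    classes = np.unique(y[y >= 0])
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    W = np.zeros((n, n))                       # symmetric k-NN adjacency
    for i in range(n):
        W[i, np.argsort(d[i])[:n_neighbors]] = 1.0
    W = np.maximum(W, W.T)
    labeled = y >= 0
    F = np.zeros((n, len(classes)))
    F[labeled] = np.eye(len(classes))[y[labeled]]
    for _ in range(n_iter):
        F = W @ F / W.sum(axis=1, keepdims=True)   # average neighbor mass
        F[labeled] = np.eye(len(classes))[y[labeled]]  # clamp known labels
    return classes[F.argmax(axis=1)]

# two far-apart clusters, one labeled point in each
blob = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [.5, .5]])
X = np.vstack([blob, blob + 10.0])
y = np.full(10, -1)
y[0], y[5] = 0, 1
pred = propagate_labels(X, y)
print((pred[:5] == 0).all() and (pred[5:] == 1).all())  # True
```

The propagated pseudo-labels can then augment the training set of a supervised classifier, which is the DA use of GSSL studied in the paper.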
Object-based Temporal Segment Relational Network for Activity Recognition
MELO, V. H. C. ; SANTOS, J. B. ; CAETANO, C. ; SENA, J. ; PENATTI, O. A. B. ; SCHWARTZ, W. R.
In: Conference on Graphics, Patterns, and Images (SIBGRAPI), 2018, p. 103-109.
Video understanding is the next frontier of computer vision, in which activity recognition plays a major role. Despite recent improvements
in holistic activity recognition, further research on part-based models, such as context, may allow us to better understand what is important for
activities and thus improve current activity recognition models. This work tackles contextual cues obtained from object detections, positing
that the objects relevant to an action are related to their spatial arrangement with respect to the agent. Based on that, we propose the Egocentric
Pyramid to encode such spatial relationships. We further extend it by proposing a data-centric approach named Temporal Segment Relational
Network (TSRN). Our experiments support the hypothesis that object spatiality provides an important clue for activity recognition. In
addition, our data-centric approach shows that, besides such spatial features, there may be other important information that further enhances
object-based activity recognition, such as co-occurrence, relative size, and temporal information.
Bag of Attributes for Video Event Retrieval
DUARTE, L. A. ; PENATTI, O. A. B. ; ALMEIDA, J.
In: Conference on Graphics, Patterns, and Images (SIBGRAPI), 2018, p. 447-454.
In this paper, we present the Bag-of-Attributes (BoA) model for video representation, aimed at video event retrieval. The BoA model is based
on a semantic feature space for representing videos, resulting in high-level video feature vectors. To create the semantic space, i.e., the
attribute space, we can train a classifier using a labeled image dataset, obtaining a classification model that can be understood as a high-level
codebook. This model is used to map low-level frame vectors into high-level vectors (e.g., classifier probability scores). Then, we apply
pooling operations to the frame vectors to create the final bag of attributes for the video. In the BoA representation, each dimension
corresponds to one category (or attribute) of the semantic space. Other interesting properties are compactness, flexibility regarding the
classifier, and the ability to encode multiple semantic concepts in a single video representation. Our experiments considered the semantic space
created by state-of-the-art convolutional neural networks pre-trained on the 1000 object categories of ImageNet. These deep neural networks were
used to classify each video frame, and different coding strategies were then used to encode the probability distribution from the softmax
layer into a frame vector. Next, different pooling strategies were used to combine the frame vectors into the BoA representation for a video.
Results using BoA were comparable or superior to the baselines on the task of video event retrieval using the EVVE dataset, with the advantage
of providing a much more compact representation.
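The frame-pooling step of the BoA model can be sketched as follows, assuming the per-frame softmax scores are already available. Random Dirichlet vectors stand in here for real classifier outputs, and the function name is chosen for this sketch:

```python
import numpy as np

def bag_of_attributes(frame_probs, pooling="avg"):
    """Pool per-frame classifier probability vectors (e.g., softmax over
    ImageNet classes) into one video-level vector; each dimension
    corresponds to one semantic attribute."""
    frame_probs = np.asarray(frame_probs)
    if pooling == "avg":
        return frame_probs.mean(axis=0)
    if pooling == "max":
        return frame_probs.max(axis=0)
    raise ValueError(f"unknown pooling: {pooling}")

# toy "video": 20 frames x 1000 attribute scores, rows sum to 1 like softmax
rng = np.random.default_rng(0)
scores = rng.dirichlet(np.ones(1000), size=20)
boa = bag_of_attributes(scores, pooling="avg")
print(boa.shape)            # (1000,)
print(round(boa.sum(), 6))  # 1.0
```

Average pooling keeps the result a probability distribution over attributes, while max pooling highlights attributes that appear strongly in any single frame.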
Exploiting ConvNet Diversity for Flooding Identification
NOGUEIRA, K. ; FADEL, S. G. ; DOURADO, I. C. ; WERNECK, R. de O. ; MUNOZ, J. A. V. ; PENATTI, O. A. B. ; CALUMBY, R. T. ; LI, L. T. ; SANTOS, J. A. ; TORRES, R. da S.
In: IEEE Geoscience and Remote Sensing Letters, volume 15, issue 9, p. 1446-1450, 2018.
Flooding is the world's most costly type of natural disaster in terms of both economic losses and human casualties. A first and essential
step towards flood monitoring is identifying the areas most vulnerable to flooding, giving authorities relevant regions on which to
focus. In this work, we propose several methods to perform flooding identification in high-resolution remote sensing images using deep learning.
Specifically, some of the proposed techniques are based upon unique networks, such as dilated and deconvolutional ones, while others were conceived
to exploit the diversity of distinct networks in order to extract the maximum performance from each classifier. Evaluation of the proposed methods
was conducted on a high-resolution remote sensing dataset. Results show that the proposed algorithms outperformed state-of-the-art baselines,
providing improvements ranging from 1 to 4% in terms of the Jaccard Index.
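The Jaccard Index used in this evaluation is simply the intersection-over-union of the predicted and reference flood masks; a minimal sketch:

```python
import numpy as np

def jaccard_index(pred, target):
    """Intersection over union of two binary masks (flood / no flood)."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = (pred | target).sum()
    return (pred & target).sum() / union if union else 1.0

a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
print(round(jaccard_index(a, b), 3))  # intersection 2, union 4 -> 0.5
```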
TWM: A framework for creating highly compressible videos targeted to computer vision tasks
ANDALÓ, F. A. ; PENATTI, O. A. B. ; TESTONI, V.
In: Elsevier Pattern Recognition Letters, volume 114, p. 63-72, 2017.
We present a simple yet effective framework - Transmitting What Matters (TWM) - to generate highly compressible videos containing only relevant
information targeted to specific computer vision tasks, such as faces for the task of face expression recognition, license plates for the task
of optical character recognition, among others. TWM takes advantage of the final desired computer vision task to compose video frames only with
the necessary data. The video frames are compressed and can be stored or transmitted to powerful servers where extensive and time-consuming tasks
are performed. Experiments explore the trade-offs between distortion and bitrate for a wide range of compression levels, and the impact generated
by compression artifacts on the accuracy of the desired vision task. We show that, for two computer vision tasks implemented by different methods,
it is possible to dramatically reduce the amount of data that must be stored or transmitted, without compromising accuracy. With PSNR_YUV quality
above 41 dB, the bitrate was reduced by up to four times, while a detection task was affected by only ~1 pixel and a classification task by
1-2 percentage points.
Kuaa: A unified framework for design, deployment, execution, and recommendation of machine learning experiments
WERNECK, R. de O. ; DE ALMEIDA, W. R. ; STEIN, B. V. ; PAZINATO, D. V. ; MENDES JUNIOR, P. R. ; PENATTI, O. A. B. ; TORRES, R. da S. ; ROCHA, A.
In: Future Generation Computer Systems, volume 78, part 1, p. 59-76, 2018.
In this work, we propose Kuaa, a workflow-based framework that can be used for designing, deploying, and executing machine learning experiments
in an automated fashion. This framework is able to provide a standardized environment for exploratory analysis of machine learning solutions,
as it supports the evaluation of feature descriptors, normalizers, classifiers, and fusion approaches in a wide range of tasks involving machine
learning. Kuaa is also capable of recommending machine learning workflows to users. These recommendations allow users
to identify, evaluate, and possibly reuse previously defined successful solutions. We propose the use of similarity measures (e.g., Jaccard,
Sørensen, and Jaro-Winkler) and learning-to-rank methods (LRAR) in the implementation of the recommendation service. Experimental results show
that Jaro-Winkler yields the highest effectiveness, with performance comparable to that of LRAR, presenting the best
alternative machine learning experiments to the user. In both cases, the recommendations are very promising, and the developed framework
can help users in a variety of daily exploratory machine learning tasks.
Nearest neighbors distance ratio open-set classifier
MENDES JUNIOR, P. R. ; DE SOUZA, R. M. ; WERNECK, R. de O. ; STEIN, B. V. ; PAZINATO, D. V. ; DE ALMEIDA, W. R. ; PENATTI, O. A. B. ; TORRES, R. da S. ; ROCHA, A.
In this paper, we propose a novel multiclass classifier for the open-set recognition scenario, in which there are no a priori training samples
for some classes that might appear during testing. Many applications are inherently open set; consequently,
successful closed-set solutions in the literature are not always suitable for real-world recognition problems. The proposed open-set classifier
builds upon the Nearest-Neighbor (NN) classifier: nearest neighbors are simple, parameter-independent, multiclass, and widely used for
closed-set problems. The proposed Open-Set NN (OSNN) method incorporates the ability to recognize samples belonging to classes that are
unknown at training time, making it suitable for open-set recognition. In addition, we explore evaluation measures for open-set problems that
properly measure the resilience of methods to unknown classes during testing. For validation, we consider large, freely available benchmarks with
different open-set recognition regimes and demonstrate that the proposed OSNN significantly outperforms its counterparts in the literature.
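A minimal sketch of the nearest-neighbors distance ratio idea (a simplified reading of OSNN, not the authors' exact formulation; the threshold value here is arbitrary):

```python
import numpy as np

def osnn_predict(x, train_X, train_y, threshold=0.8):
    """Distance-ratio rule: take the nearest training sample t and the
    nearest sample u of a *different* class. If d(x,t)/d(x,u) exceeds
    the threshold, the two classes are ambiguously close and x is
    rejected as 'unknown' (-1); otherwise x gets the class of t."""
    d = np.linalg.norm(train_X - x, axis=1)
    t = d.argmin()
    other = train_y != train_y[t]
    u = np.where(other)[0][d[other].argmin()]
    ratio = d[t] / d[u] if d[u] > 0 else 0.0
    return train_y[t] if ratio <= threshold else -1

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y = np.array([0, 0, 1, 1])
print(osnn_predict(np.array([0., 0.5]), X, y))   # 0  (clearly class 0)
print(osnn_predict(np.array([2.5, 3.0]), X, y))  # -1 (between classes -> unknown)
```

The rejection option is what distinguishes this from a plain NN classifier: a test sample far from all classes, or equidistant from two classes, is not forced into a known label.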
Towards Better Exploiting Convolutional Neural Networks for Remote Sensing Scene Classification
NOGUEIRA, K. ; PENATTI, O. A. B. ; SANTOS, J. A. dos
In: Pattern Recognition, volume 61, p. 539-556, 2017.
We present an analysis of three possible strategies for exploiting the power of existing convolutional neural networks (ConvNets or CNNs) in
different scenarios from the ones they were trained: full training, fine tuning, and using ConvNets as feature extractors. In many applications,
especially including remote sensing, it is not feasible to fully design and train a new ConvNet, as this usually requires a considerable amount
of labeled data and demands high computational costs. Therefore, it is important to understand how to better use existing ConvNets. We perform
experiments with six popular ConvNets using three remote sensing datasets. We also compare the ConvNets in each strategy with existing descriptors
and with state-of-the-art baselines. Results indicate that fine-tuning tends to be the best-performing strategy; in fact, using the features from
the fine-tuned ConvNet with a linear SVM obtains the best results. We also achieved state-of-the-art results on the three datasets used.
Bag of Genres for Video Retrieval
DUARTE, L. A. ; PENATTI, O. A. B. ; ALMEIDA, J.
In: Conference on Graphics, Patterns, and Images (SIBGRAPI), 2016, p. 257-264.
Often, videos are composed of multiple concepts or even genres. For instance, news videos may contain sports, action, nature, etc.
Therefore, encoding the distribution of such concepts/genres in a compact and effective representation is a challenging task. In this sense,
we propose the Bag of Genres representation, which is based on a visual dictionary defined by a genre classifier. Each visual word corresponds
to a region in the classification space. The Bag of Genres video vector contains a summary of the activations of each genre in the video
content. We evaluate the proposed method for video genre retrieval using the dataset of the MediaEval Tagging Task of 2012 and for video event
retrieval using the EVVE dataset. Results show that the proposed method achieves performance comparable or superior to that of state-of-the-art
methods, with the advantage of providing a much more compact representation than existing features.
Transmitting What Matters - Task-oriented video composition and compression
ANDALÓ, F. A. ; PENATTI, O. A. B. ; TESTONI, V.
In: Conference on Graphics, Patterns, and Images (SIBGRAPI), 2016, p. 72-79.
We present a simple yet effective framework - Transmitting What Matters (TWM) - to generate compressed videos containing only
relevant objects targeted to specific computer vision tasks, such as faces for the task of face expression recognition, license plates
for the task of optical character recognition, among others. TWM takes advantage of the final desired computer vision task to compose
video frames only with the necessary data. The video frames are compressed and can be stored or transmitted to powerful servers where
extensive and time-consuming tasks can be performed. We experimentally present the trade-offs between distortion and bitrate for a
wide range of compression levels, and the impact generated by compression artifacts on the accuracy of the desired vision task.
We show that, for one selected computer vision task, it is possible to dramatically reduce the amount of required data to be stored or
transmitted, without compromising accuracy.
Detection of Fragmented Rectangular Enclosures in Very-High-Resolution Remote Sensing Images
ZINGMAN, I. ; SAUPE, D. ; PENATTI, O. A. B. ; LAMBERS, K.
In: IEEE Transactions on Geoscience and Remote Sensing, volume 54, number 8, p. 4580-4593, 2016.
We develop an approach for the detection of ruins of livestock enclosures (LEs) in alpine areas captured by high-resolution
remotely sensed images. These structures are usually approximately rectangular in shape and appear in images as faint,
fragmented contours against a complex background. We address this problem by introducing a rectangularity feature that quantifies
the degree of alignment of an optimal subset of extracted linear segments with a contour of rectangular shape. The
rectangularity feature has high values not only for perfectly regular enclosures but also for ruined ones with distorted
angles, fragmented walls, or even a completely missing wall. Furthermore, it has a zero value for spurious structures with
less than three sides of a perceivable rectangle. We show how the detection performance can be improved by learning a
linear combination of the rectangularity and size features from just a few available representative examples and a large
number of negatives. Our approach enabled the detection of previously unknown enclosures in the Silvretta Alps.
A comparative performance analysis is provided. Among other features, our comparison includes the state-of-the-art features
that were generated by pretrained deep convolutional neural networks (CNNs). The deep CNN features, although learned from a
very different type of images, provided the basic ability to capture the visual concept of the LEs. However, our handcrafted
rectangularity-size features showed considerably higher performance.
Pixel-Level Tissue Classification for Ultrasound Images
PAZINATO, D. V. ; STEIN, B. V. ; DE ALMEIDA, W. R. ; WERNECK, R. de O. ; MENDES JUNIOR, P. R. ; PENATTI, O. A. B. ; TORRES, R. da S. ; MENEZES, F. H. ; ROCHA, A.
In: IEEE Journal of Biomedical and Health Informatics (J-BHI), volume 20, number 1, p. 256-267, 2016.
Background: Pixel-level tissue classification for ultrasound images, commonly applied to carotid images, is usually based on defining thresholds for isolated pixel values. Ranges of pixel values are defined for the classification of each tissue. The classification of pixels is then used to determine the carotid plaque composition and, consequently, the risk of diseases (e.g., strokes) and whether or not surgery is necessary. Threshold-based methods date from the early 2000s but are still widely used for virtual histology.
Methodology/Principal Findings: We propose the use of descriptors that take into account information about the neighborhood of a pixel when classifying it. We experimentally evaluated different descriptors (statistical moments, texture-based, gradient-based, local binary patterns, etc.) on a dataset of five types of tissues: blood, lipids, muscle, fibrous tissue, and calcium. The pipeline of the proposed classification method is based on image normalization, multiscale feature extraction, including a newly proposed descriptor, and machine learning classification. We also analyzed, with the aid of medical specialists, the correlation between the proposed pixel classification method in the ultrasound images and the real histology.
Conclusions/Significance: The classification accuracy obtained by the proposed method with the novel descriptor on the ultrasound tissue images (around 73%) is significantly above that of the state-of-the-art threshold-based methods (around 54%). The results are validated by statistical tests. The correlation between the virtual and real histology confirms the quality of the proposed approach, showing that it is a robust ally for virtual histology in ultrasound images.
Mid-level Image Representations for Real-time Heart View Plane Classification of Echocardiograms
PENATTI, O. A. B. ; WERNECK, R. de O. ; DE ALMEIDA, W. R. ; STEIN, B. V. ; PAZINATO, D. V. ; MENDES JUNIOR, P. R. ; TORRES, R. da S. ; ROCHA, A.
In: Computers in Biology and Medicine, volume 66, p. 66-81, 2015.
In this paper, we explore mid-level image representations for real-time heart view plane classification of 2D echocardiogram ultrasound images.
The proposed representations rely on bags of visual words, successfully used by the computer vision community in visual recognition problems.
An important element of the proposed representations is image sampling with large regions, which drastically reduces the execution time of the image
characterization procedure. Through an extensive set of experiments, we evaluate the proposed approach against different image descriptors for classifying
four heart view planes. The results show that our approach is effective and efficient for the target problem, making it suitable for use in real-time setups.
The proposed representations are also robust to different image transformations, e.g., downsampling, noise filtering, and different machine learning classifiers,
keeping classification accuracy above 90%. Feature extraction can be performed at 30 fps, or 60 fps in some cases. This paper also includes an in-depth review
of the literature in the area of automatic echocardiogram view classification, giving the reader a thorough comprehension of this field of study.
Do Deep Features Generalize from Everyday Objects to Remote Sensing and Aerial Scenes Domains?
PENATTI, O. A. B. ; NOGUEIRA, K. ; SANTOS, J. A. dos
In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR EarthVision Workshop), p. 44-51, 2015.
In this paper, we evaluate the generalization power of deep features (ConvNets) in two new scenarios: aerial and remote sensing image classification.
We evaluate experimentally ConvNets trained for recognizing everyday objects for the classification of aerial and remote sensing images.
ConvNets obtained the best results for aerial images, while for remote sensing, they performed well but were outperformed by low-level color
descriptors, such as BIC. We also present a correlation analysis, showing the potential for combining/fusing different ConvNets with other descriptors
or even for combining multiple ConvNets. A preliminary set of experiments fusing ConvNets obtains state-of-the-art results for the well-known UCMerced dataset.
Unsupervised Manifold Learning for Video Genre Retrieval
ALMEIDA, J. ; PEDRONETTE, D. C. G. ; PENATTI, O. A. B.
In: Iberoamerican Congress on Pattern Recognition (CIARP), Puerto Vallarta, Mexico, 2014, p. 604-612 (LNCS 8827)
This paper investigates the perspective of exploiting pairwise similarities to improve the performance of visual features for video genre retrieval.
We employ manifold learning based on the reciprocal neighborhood and on the authority of ranked lists to improve the retrieval of videos considering their genre.
A comparative analysis of different visual features is conducted and discussed. We experimentally show, on a dataset of 14,838 videos from the MediaEval benchmark,
that we can achieve considerable improvements in the results. We also evaluate how the late fusion of different visual features using the same manifold
learning scheme can improve the retrieval results.
Efficient and Effective Hierarchical Feature Propagation
SANTOS, J. A. dos ; PENATTI, O. A. B. ; GOSSELIN, P-H. ; FALCÃO, A. X. ; PHILIPP-FOLIGUET, S. ; TORRES, R. da S.
In: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS), volume 7, number 12, p. 4632-4643, 2014.
Many methods have been recently proposed to deal with the large amount of data provided by the new remote sensing technologies. Several of those methods rely on the use
of segmented regions. However, a common issue in region-based applications is the definition of the appropriate representation scale of the data, a problem usually addressed by exploiting
multiple scales of segmentation. The use of multiple scales, however, raises new challenges related to the definition of effective and efficient mechanisms for extracting features.
In this paper, we address the problem of extracting features from a hierarchy by proposing two approaches that exploit the existing relationships among regions at different scales.
The H-Propagation propagates any histogram-based low-level descriptors. The BoW-Propagation approach uses the bag-of-visual-word model to propagate features along multiple scales.
The proposed methods are very efficient, as features need to be extracted only at the base of the hierarchy, and they yield results comparable to those of low-level extraction approaches.
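The histogram-propagation idea can be sketched in a few lines: descriptors are computed only for the finest regions, and coarser regions inherit the combined histograms of their children. This is an illustrative simplification under assumed inputs (the region ids, the hierarchy encoding, and the plain sum-then-normalize rule are mine; the actual H-Propagation may, for instance, weight children differently).

```python
import numpy as np

def h_propagation(base_histograms, hierarchy):
    """Propagate histogram descriptors from the finest scale upward.

    base_histograms: dict region_id -> histogram, extracted only at the
    base of the hierarchy.
    hierarchy: dict parent_id -> list of child ids, ordered so every
    child appears (as a base region or earlier parent) before its parent.
    """
    features = dict(base_histograms)
    for parent, children in hierarchy.items():
        # A parent's histogram is the normalized sum of its children's,
        # so no pixel-level feature extraction happens above the base.
        combined = np.sum([features[c] for c in children], axis=0)
        features[parent] = combined / combined.sum()
    return features

# Usage: four base regions merged into two, then into one coarse region.
base = {
    "r1": np.array([4.0, 0.0]), "r2": np.array([0.0, 4.0]),
    "r3": np.array([2.0, 2.0]), "r4": np.array([4.0, 0.0]),
}
hierarchy = {"p1": ["r1", "r2"], "p2": ["r3", "r4"], "root": ["p1", "p2"]}
feats = h_propagation(base, hierarchy)
print(feats["root"])  # histogram at the coarsest scale
```

The efficiency claim falls out directly: each propagation step is a small vector addition, regardless of how many pixels the coarse region covers.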
Unsupervised Distance Learning By Reciprocal kNN Distance for Image Retrieval
PEDRONETTE, D. C. G. ; PENATTI, O. A. B. ; CALUMBY, R. T. ; TORRES, R. da S.
In: ACM International Conference on Multimedia Retrieval (ICMR), Glasgow, Scotland, 2014, p. 345:345-345:352.
This paper presents a novel unsupervised learning approach that takes into account the intrinsic dataset structure, which is represented in terms of the reciprocal neighborhood
references found in different ranked lists. The proposed Reciprocal kNN Distance defines a more effective distance between two images, and is used to improve the effectiveness of image
retrieval systems. Several experiments were conducted for different image retrieval tasks involving shape, color, and texture descriptors. The proposed approach is also evaluated on multimodal
retrieval tasks, considering visual and textual descriptors. Experimental results demonstrate the effectiveness of the proposed approach. The Reciprocal kNN Distance yields better results in terms of
effectiveness than various state-of-the-art algorithms.
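The notion of reciprocal neighborhood references can be made concrete with a toy sketch: two images are strongly related when each appears in the other's top-k ranked list and they share top-k neighbors. The scoring rule below is my own illustrative choice, not the paper's actual Reciprocal kNN Distance formulation.

```python
def reciprocal_knn_distance(ranked_lists, x, y, k=5):
    """Illustrative distance based on reciprocal kNN references.

    ranked_lists: dict image_id -> list of image ids ordered from most
    to least similar. The more x and y reference each other and share
    top-k neighbors, the smaller the returned distance.
    """
    top_x, top_y = ranked_lists[x][:k], ranked_lists[y][:k]
    score = 0
    # Direct reciprocal reference: each image in the other's top-k.
    if y in top_x and x in top_y:
        score += k
    # Shared top-k neighbors strengthen the relation.
    for z in top_x:
        if z in top_y:
            score += 1
    return 1.0 / (1.0 + score)  # higher score -> smaller distance

# Usage: "a" and "b" reference each other; "e" is unrelated to "a".
ranked = {
    "a": ["b", "c", "d"],
    "b": ["a", "c", "d"],
    "e": ["f", "g", "h"],
}
d_ab = reciprocal_knn_distance(ranked, "a", "b", k=3)
d_ae = reciprocal_knn_distance(ranked, "a", "e", k=3)
print(d_ab, d_ae)  # the related pair gets a much smaller distance
```

Note that the input is only a set of ranked lists, no feature vectors, which is what lets this family of methods run as a post-processing step over any descriptor.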
Unsupervised Manifold Learning Using Reciprocal kNN Graphs in Image Re-Ranking and Rank Aggregation Tasks
PEDRONETTE, D. C. G. ; PENATTI, O. A. B. ; TORRES, R. da S.
In: Image and Vision Computing, volume 32, number 2, p. 120-130, 2014.
In this paper, we present an unsupervised distance learning approach for improving the effectiveness of image retrieval tasks. We propose a Reciprocal kNN Graph algorithm
that considers the relationships among ranked lists in the context of a k-reciprocal neighborhood. The similarity is propagated among neighbors considering the geometry of the
dataset manifold. The proposed method can be used both for re-ranking and rank aggregation tasks. Unlike traditional diffusion process methods, which require matrix multiplication
operations, our algorithm takes only a subset of ranked lists as input, presenting linear complexity in terms of computational and storage requirements. We conducted a large evaluation
protocol involving shape, color, and texture descriptors, various datasets, and comparisons with other post-processing approaches. The re-ranking and rank aggregation algorithms yield
better results in terms of effectiveness than various state-of-the-art algorithms recently proposed in the literature, achieving bull's eye and MAP scores of 100% on the
well-known MPEG-7 shape dataset.
Visual word spatial arrangement for image retrieval and classification
PENATTI, O. A. B. ; SILVA, F. B. ; VALLE, E. ; GOUET-BRUNET, V ; TORRES, R. da S.
In: Pattern Recognition, volume 47, number 2, p. 705-720, 2014.
We present word spatial arrangement (WSA), an approach to represent the spatial arrangement of visual words under the bag-of-visual-words model.
It relies on a simple idea: encoding the relative position of visual words by splitting the image space into quadrants using each detected point as origin.
WSA generates compact feature vectors and is flexible enough to be used for image retrieval and classification and to work with hard or soft assignment,
requiring no pre-/post-processing for spatial verification. Experiments in the retrieval scenario show the superiority of WSA in relation to Spatial Pyramids.
Experiments in the classification scenario show a reasonable compromise between those methods, with Spatial Pyramids generating larger feature vectors,
while WSA provides adequate performance with much more compact features. As WSA encodes only the spatial information of visual words and not their frequency
of occurrence, the results indicate the importance of such information for visual categorization.
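The quadrant-splitting idea described above can be sketched as follows: for every detected point, the image plane is divided into four quadrants with that point as origin, and the positions of all other points are counted per quadrant, accumulated by visual word. The count-based encoding and quadrant numbering here are my simplifying assumptions; the published WSA descriptor defines its own encoding.

```python
def wsa(points, n_words):
    """Word Spatial Arrangement (WSA) sketch.

    points: list of (x, y, word) interest points with assigned visual
    words; image coordinates, with y growing downward.
    Returns one 4-counter vector per visual word.
    """
    feat = [[0, 0, 0, 0] for _ in range(n_words)]
    for (x0, y0, w0) in points:
        for (x, y, w) in points:
            if (x, y) == (x0, y0):
                continue
            # Quadrant of (x, y) relative to origin (x0, y0).
            if x >= x0 and y < y0:
                q = 0  # top-right
            elif x < x0 and y < y0:
                q = 1  # top-left
            elif x < x0 and y >= y0:
                q = 2  # bottom-left
            else:
                q = 3  # bottom-right
            feat[w0][q] += 1
    return feat

# Usage: three points, two visual words.
points = [(1, 1, 0), (3, 1, 0), (2, 3, 1)]
feat = wsa(points, n_words=2)
print(feat)
```

The output stays compact: 4 counters per visual word, versus the per-level histogram replication a Spatial Pyramid requires.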
A rank aggregation framework for video multimodal geocoding
LI, L. T. ; PEDRONETTE, D. C. G. ; ALMEIDA, J. ; PENATTI, O. A. B. ; CALUMBY, R. T. ; TORRES, R. da S.
This paper proposes a rank aggregation framework for video multimodal geocoding. Textual and visual descriptions associated with videos are used to
define ranked lists. These ranked lists are later combined, and the resulting ranked list is used to define appropriate locations for videos.
An architecture that implements the proposed framework is designed. In this architecture, there are specific modules for each modality (e.g., textual and visual)
that can be developed and evolved independently. Another component is a data fusion module responsible for seamlessly combining the ranked lists defined for
each modality. We have validated the proposed framework in the context of the MediaEval 2012 Placing Task, whose objective is to automatically assign
geographical coordinates to videos. Obtained results show how our multimodal approach improves the geocoding results when compared to methods that rely on a
single modality (either textual or visual descriptors). We also show that the proposed multimodal approach yields comparable results to the best submissions to
the Placing Task in 2012 using no extra information besides the available development/training data. Another contribution of this work is related to the
proposal of a new effectiveness evaluation measure. The proposed measure is based on distance scores that summarize how effective a designed/tested approach is,
considering its overall result for a test dataset.
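One simple way to combine per-modality ranked lists, sketched below, is a Borda-count-style fusion: each candidate location earns points inversely proportional to its position in each list, and the fused ranking sorts by total score. This is only one possible aggregation function offered as an illustration; the paper's framework is agnostic to the specific fusion rule used.

```python
def aggregate_ranks(ranked_lists):
    """Borda-count-style rank aggregation over per-modality rankings.

    ranked_lists: list of ranked lists, each ordered from most to least
    relevant for one modality (e.g., textual, visual).
    """
    scores = {}
    for ranking in ranked_lists:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            # First place earns n points, last place earns 1.
            scores[item] = scores.get(item, 0) + (n - pos)
    return sorted(scores, key=lambda item: -scores[item])

# Usage: hypothetical textual and visual rankings of candidate locations.
textual = ["loc1", "loc2", "loc3"]
visual = ["loc2", "loc3", "loc1"]
fused = aggregate_ranks([textual, visual])
print(fused)
```

Because each modality only has to emit a ranked list, the modality-specific modules can indeed evolve independently of the fusion step, as the architecture described above intends.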
Image and Video Representations based on Visual Dictionaries
PENATTI, O. A. B. ; VALLE, E. ; TORRES, R. da S.
In: Workshop of Thesis and Dissertations (WTD), 26th Conference on Graphics, Patterns, and Images (SIBGRAPI), Arequipa, Peru, 2013.
The thesis explores three research topics involving the popular approach used for representing visual content: the visual dictionaries.
The first topic concerns the generality of visual dictionaries: does a dictionary based on one dataset generalize to another dataset?
Our findings create the opportunity to greatly alleviate the burden of generating dictionaries.
The second topic is related to the importance of the spatial information of visual words in the image space for distinguishing types of
scenes and objects. We propose an efficient and effective spatial pooling approach which presents promising results for image retrieval.
And the third topic refers to the semantic information in the visual dictionary model. We claim that a bag-of-prototypes model, where the
prototypes are visual words carrying semantics, is promising for improving image and video representations. Employing this model, we propose
a semantically enriched dictionary based on scenes, which was effectively used for video geocoding. Defended on November 29th, 2012,
the thesis has already generated 6 publications, including a best paper award. One of the proposed approaches has also obtained one of
the best results in the Placing Task of the MediaEval challenge in the last two years.
Domain-specific Image Geocoding: A Case Study on Virginia Tech Building Photos
LI, L. T. ; PENATTI, O. A. B. ; FOX, E. A. ; TORRES, R. da S.
In: Joint Conference on Digital Libraries (JCDL), Indianapolis, Indiana, USA, 2013, p. 363-366.
The use of map-based browser services is of great relevance in numerous digital libraries. The implementation of such services, however, demands the use
of geocoded data collections. This paper investigates the use of image content local representations in geocoding tasks. Performed experiments demonstrate
that some of the evaluated descriptors yield effective results in the task of geocoding VT building photos. This study is the first step toward geocoding multimedia
material related to the VT April 16, 2007 school shooting tragedy.
Remote Sensing Image Representation based on Hierarchical Histogram Propagation
SANTOS, J. A. dos ; PENATTI, O. A. B. ; TORRES, R. da S. ; GOSSELIN, P-H. ; PHILIPP-FOLIGUET, S. ; FALCÃO, A. X.
In: IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Melbourne, Australia, 2013.
Many methods have been recently proposed to deal with the large amount of data provided by high-resolution remote sensing technologies.
Several of these methods rely on the use of image segmentation algorithms for delineating target objects. However, a common issue in geographic
object-based applications is the definition of the appropriate data representation scale, a problem that can be addressed by exploiting multiscale segmentation.
The use of multiple scales, however, raises new challenges related to the definition of effective and efficient mechanisms for extracting features.
In this paper, we address the problem of extracting histogram-based features from a hierarchy of regions for multiscale classification. The strategy,
called H-Propagation, exploits the existing relationships among regions in a hierarchy to iteratively propagate features along multiple scales. The proposed
method speeds up the feature extraction process and yields good results when compared with global low-level extraction approaches.
Multimedia Multimodal Geocoding
LI, L. T. ; PEDRONETTE, D. C. G. ; ALMEIDA, J. ; PENATTI, O. A. B. ; CALUMBY, R. T. ; TORRES, R. da S.
In: ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS),
Redondo Beach, California, USA, 2012. p. 474-477.
This work is developed in the context of the placing task of the MediaEval 2011 initiative. The objective is to geocode
(or geotag) a set of videos, i.e., automatically assign geographical coordinates to them. This paper presents an architecture
for multimodal geocoding that exploits both visual and textual descriptions associated with videos. This paper
also describes our efforts regarding the implementation of this architecture aiming to demonstrate its applicability.
Conducted experiments show how our multimodal approach enhances the results compared to relying on a single modality (text or visual).
A Visual Approach for Video Geocoding using Bag-of-Scenes
PENATTI, O. A. B. ; LI, L. T. ; ALMEIDA, J. ; TORRES, R. da S.
In: ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, China, 2012, p. 53:1-53:8.
This paper presents a novel approach for video representation, called bag-of-scenes. The proposed method is based on dictionaries of scenes,
which provide a high-level representation for videos. Scenes are elements with much more semantic information than local features, especially for geotagging
videos using visual content. Thus, each component of the representation model has self-contained semantics and, hence, can be directly related to a
specific place of interest. Experiments were conducted in the context of the MediaEval 2011 Placing Task. The reported results compare our strategy
to those of other participants that used only visual content to accomplish this task. Despite the very simple way we generate the
visual dictionary, by taking photos at random, the results show that our approach achieves high accuracy relative to state-of-the-art solutions.
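The bag-of-scenes representation can be sketched as a nearest-prototype assignment: each video frame is matched to its most similar scene prototype, and the video becomes a histogram over scenes. The toy 2-D descriptors and Euclidean matching below are assumptions for illustration, not the descriptors used in the paper.

```python
import numpy as np

def bag_of_scenes(frame_features, scene_dictionary):
    """Bag-of-scenes sketch: a video becomes a histogram over scene
    prototypes, each of which carries self-contained semantics (a place).

    frame_features: (n_frames, d) array of frame descriptors.
    scene_dictionary: (n_scenes, d) array of scene prototype descriptors.
    """
    hist = np.zeros(len(scene_dictionary))
    for f in frame_features:
        # Hard-assign each frame to its nearest scene prototype.
        dists = np.linalg.norm(scene_dictionary - f, axis=1)
        hist[int(np.argmin(dists))] += 1
    return hist / len(frame_features)

# Usage: two toy scene prototypes and four toy frame descriptors.
scenes = np.array([[0.0, 0.0], [1.0, 1.0]])
frames = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [0.0, 0.2]])
video_hist = bag_of_scenes(frames, scenes)
print(video_hist)  # half the frames match each scene prototype
```

Since each bin corresponds to one scene (a place), a strong bin in the histogram can be read directly as evidence for that location, which is what makes the representation suitable for geocoding.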
Improving Texture Description in Remote Sensing Image Multi-Scale Classification Tasks By Using Visual Words
SANTOS, J. A. dos ; PENATTI, O. A. B. ; TORRES, R. da S. ; P-H. GOSSELIN ; PHILIPP-FOLIGUET, S. ; FALCÃO, A. X.
In: International Conference on Pattern Recognition (ICPR), Tsukuba Science City, Japan, 2012, p. 3090-3093.
Although texture features are important for region-based classification of remote sensing images, the literature shows that
texture descriptors usually have poor performance when compared and combined with color descriptors. In this paper, we propose
a bag-of-visual-words (BOW) "propagation" approach to extract texture features from a hierarchy of regions. This strategy improves
the features' efficacy by encoding texture independently of the region shape. Experiments show that the proposed approach improves
the classification results when compared with global descriptors using the bounding box padding strategy.
Comparative Study of Global Color and Texture Descriptors for Web Image Retrieval
PENATTI, O. A. B. ; VALLE E. ; TORRES, R. da S.
In: Journal of Visual Communication and Image Representation, volume 23, number 2, p. 359-380, 2012.
This paper presents a comparative study of color and texture descriptors considering the Web as the environment of use.
We take into account the diversity and large-scale aspects of the Web considering a large number of descriptors
(24 color and 28 texture descriptors, including both traditional and recently proposed ones). The evaluation is made on
two levels: a theoretical analysis in terms of algorithms complexities and an experimental comparison considering
efficiency and effectiveness aspects. The experimental comparison contrasts the performances of the descriptors in
small-scale datasets and in a large heterogeneous database containing more than 230 thousand images. Although there is a
significant correlation between descriptor performances in the two settings, there are notable deviations, which must be
taken into account when selecting the descriptors for large-scale tasks. An analysis of the correlation is provided for
the best descriptors, which hints at the best opportunities of their use in combination.
Encoding spatial arrangement of visual words
PENATTI, O. A. B. ; VALLE, E. ; TORRES, R. da S.
In: Iberoamerican Congress on Pattern Recognition (CIARP), Pucón, Chile, 2011, p. 240-247 (LNCS 7042)
This paper presents a new approach to encode spatial-relationship information of visual words in the well-known
visual dictionary model. The most popular current approach to describe images based on visual words is by means
of bags-of-words, which do not encode any spatial information. We propose a graceful way to capture spatial-relationship
information of visual words that encodes the spatial arrangement of every visual word in an image. Our experiments show
the importance of the spatial information of visual words for image classification and show the gain in classification
accuracy when using the new method. The proposed approach creates opportunities for further improvements in image
description under the visual dictionary model.
User-oriented evaluation of color descriptors for Web image retrieval
PENATTI, O. A. B. ; TORRES, R. da S.
In: European Conference on Research and Advanced Technology for Digital Libraries (ECDL), Glasgow, Scotland, 2010, p. 486-489.
This paper proposes a methodology for effectiveness evaluation in content-based image retrieval systems. The methodology
is based on the opinion of real users. This paper also presents the results of using this methodology to evaluate color
descriptors for Web image retrieval. The experiments were performed using a database containing more than 230 thousand
heterogeneous images, representative of the existing content on the Web.
Eva - An Evaluation Tool for Comparing Descriptors in Content-based Image Retrieval Tasks
PENATTI, O. A. B. ; TORRES, R. da S.
In: 11th ACM SIGMM International Conference on Multimedia Information Retrieval (MIR), Philadelphia, Pennsylvania, USA, 2010, p. 413-416.
This paper presents Eva, a tool for evaluating image descriptors for content-based image retrieval. Eva integrates the most common stages of an image retrieval process and
provides functionalities to facilitate the comparison of image descriptors in the context of content-based image retrieval.
Eva supports the management of image descriptors and image collections and creates a standardized environment to run comparative experiments using them.
Evaluating the Potential of Texture and Color Descriptors for Remote Sensing Image Retrieval and Classification
SANTOS, J. A. dos ; PENATTI, O. A. B. ; TORRES, R. da S.
In: Proceedings of International Conference on Computer Vision Theory and Applications (VISAPP), Angers, France, 2010, p. 203-210.
Classifying Remote Sensing Images (RSI) is a hard task. There are automatic approaches whose results normally need to be revised.
The identification and polygon extraction tasks usually rely on applying classification strategies that exploit visual aspects related to spectral and
texture patterns identified in RSI regions. Many image descriptors have been proposed in the literature for content-based image retrieval purposes
that can be useful for RSI classification. This paper presents a comparative study to evaluate the potential of using successful color and texture image
descriptors for remote sensing retrieval and classification. Seven descriptors that encode texture information and twelve color descriptors that can be used to encode
spectral information were selected. We perform experiments to evaluate the effectiveness of these descriptors, considering image retrieval and classification tasks.
To evaluate descriptors in classification tasks, we also propose a methodology based on the KNN classifier. Experiments demonstrate that Joint Auto-Correlogram (JAC), Color Bitmap,
Invariant Steerable Pyramid Decomposition (SID) and Quantized Compound Change Histogram (QCCH) yield the best results.
Color Descriptors for Web Image Retrieval: a Comparative Study
PENATTI, O. A. B. ; TORRES, R. da S.
In: XXI Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI), Campo Grande, MS, Brazil, 2008, p. 163-170.
This paper presents a comparative study of color descriptors for content-based image retrieval on the Web. Several image descriptors were compared theoretically and the
most relevant ones were implemented and tested in two different databases. The main goal was to find out the best descriptors for Web image retrieval. Descriptors are compared
according to the complexities of their extraction and distance functions, the compactness of their feature vectors, and their ability to retrieve relevant images.
Recuperação de Imagens: Desafios e Novos Rumos (Image retrieval: Challenges and new trends)
TORRES, R. da S. ; ZEGARRA, J. A. M. ; SANTOS, J. A. ; FERREIRA, C. D. ; PENATTI, O. A. B. ; ANDALO, F. A. ; ALMEIDA JUNIOR, J. G.
In: XXXV Seminário Integrado de Software e Hardware, Belém, PA, Brazil, 2008, p. 223-237.
Huge image collections have been created, managed, and stored in image databases. Given the large size of these collections, it is essential to provide
efficient and effective mechanisms to retrieve images. This is the objective of the so-called content-based image retrieval (CBIR) systems.
Traditionally, these systems are based on objective criteria to represent and compare images. However, users of CBIR systems tend to use subjective elements
to compare images. The use of these elements has improved the effectiveness of content-based image retrieval systems. This paper discusses approaches that
incorporate semantic information into the content-based image retrieval process, highlighting some new challenges in this area.
Spatial relationship descriptor based on partitions (Descritor de Relacionamento Espacial Baseado em Partições)
PENATTI, O. A. B. ; TORRES, R. da S.
In: Electronic Magazine of Undergraduate Research Projects (REIC-SBC), v. VII, p. 3, 2007. In Portuguese
In this work, we propose a new spatial relationship descriptor for content-based image retrieval.
Spatial relationships can be fundamental for image recognition and retrieval, benefiting
geographic and medical applications, for example. The new descriptor is based on partitioning the
space under analysis into quadrants and on counting the occurrences of points of the object of interest in each quadrant.
The experiments compare the proposed descriptor with descriptors from the literature. The results show that the new
descriptor is more effective than important descriptors from the literature.
LI, L. T. ; MUNOZ, J. A. V. ; ALMEIDA, J. ; CALUMBY, R. T. ; PENATTI, O. A. B. ; DOURADO, I. C. ; NOGUEIRA, K. ; MENDES JR, P. R. ; PEREIRA, L. A. M. ; PEDRONETTE, D. C. G. ; SANTOS, J. A. dos ; GONÇALVES, M. A. ; TORRES, R. da S.
In: Working Notes Proceedings of the MediaEval Workshop, Wurzen, Germany, 2015, v. 1463.