J. Urbano Merino | TU Delft Repository

The Treatment of Ties in Rank-Biased Overlap

Conference paper (2024) - M. Corsi (author), Julián Urbano (author)

Rank-Biased Overlap (RBO) is a similarity measure for indefinite rankings: it is top-weighted, and can be computed when only a prefix of the rankings is known or when they have only some items in common. It is widely used for instance to analyze differences between search engines ...

Mitigating Mainstream Bias in Recommendation via Cost-sensitive Learning

Conference paper (2023) - Roger Zhe Li (author), Julián Urbano (author), A. Hanjalic (author)

Mainstream bias, where some users receive poor recommendations because their preferences are uncommon or simply because they are less active, is an important aspect to consider regarding fairness in recommender systems. Existing methods to mitigate mainstream bias do not explicit ...

Report on the Dagstuhl Seminar on Frontiers of Information Access Experimentation for Research and Education

Journal article (2023) - Christine Bauer (author), Ben Carterette (author), Nicola Ferro (author), Norbert Fuhr (author), Joeran Beel (author), Timo Breuer (author), Charles L. A. Clarke (author), Laura Dietz (author), Julián Urbano (author), More Authors...

This report documents the program and the outcomes of Dagstuhl Seminar 23031 "Frontiers of Information Access Experimentation for Research and Education", which brought together 38 participants from 12 countries. The seminar addressed technology-enhanced information access (infor ...

How do Metric Score Distributions affect the Type I Error Rate of Statistical Significance Tests in Information Retrieval?

Conference paper (2021) - Julián Urbano (author), M. Corsi (author), A. Hanjalic (author)

Statistical significance tests are the main tool that IR practitioners use to determine the reliability of their experimental evaluation results. The question of which test behaves best with IR evaluation data has been around for decades, and has seen all kinds of results and rec ...

New Insights into Metric Optimization for Ranking-based Recommendation

Conference paper (2021) - Roger Zhe Li (author), Julián Urbano (author), A. Hanjalic (author)

Direct optimization of IR metrics has often been adopted as an approach to devise and develop ranking-based recommender systems. Most methods following this approach (e.g. TFMAP, CLiMF, Top-N-Rank) aim at optimizing the same metric being used for evaluation, under the assumption ...

Leave No User Behind

Towards Improving the Utility of Recommender Systems for Non-mainstream Users

Conference paper (2021) - Roger Zhe Li (author), Julián Urbano (author), A. Hanjalic (author)

In a collaborative-filtering recommendation scenario, biases in the data will likely propagate in the learned recommendations. In this paper we focus on the so-called mainstream bias: the tendency of a recommender system to provide better recommendations to users who have a mains ...

Music Tempo Estimation

Are We Done Yet?

Journal article (2020) - Hendrik Schreiber (author), Julián Urbano (author), Meinard Müller (author)

With the advent of deep learning, global tempo estimation accuracy has reached a new peak, which presents a great opportunity to evaluate our evaluation practices. In this article, we discuss presumed and actual applications, the pros and cons of commonly used metrics, and the su ...

Towards stochastic simulations of relevance profiles

Conference paper (2019) - Kevin Roitero (author), Andrea Brunello (author), Julián Urbano (author), Stefano Mizzaro (author)

Recently proposed methods allow the generation of simulated scores representing the values of an effectiveness metric, but they do not investigate the generation of the actual lists of retrieved documents. In this paper we address this limitation: we present an approach that expl ...

One deep music representation to rule them all? A comparative analysis of different representation learning strategies

Journal article (2019) - Jaehun Kim (author), Julián Urbano (author), C.C.S. Liem (author), A. Hanjalic (author)

Inspired by the success of deploying deep learning in the fields of Computer Vision and Natural Language Processing, this learning paradigm has also found its way into the field of Music Information Retrieval. In order to benefit from deep learning in an effective, but also effic ...

Statistical Significance Testing in Information Retrieval

An Empirical Analysis of Type I, Type II and Type III Errors

Conference paper (2019) - Julián Urbano (author), H.A. De Lima (author), A. Hanjalic (author)

Statistical significance testing is widely accepted as a means to assess how well a difference in effectiveness reflects an actual difference between systems, as opposed to random noise because of the selection of topics. According to recent surveys on SIGIR, CIKM, ECIR and TOIS ...

The AcousticBrainz Genre Dataset

Music Genre Recognition with Annotations from Multiple Sources

Conference paper (2019) - Dmitry Bogdanov (author), Alastair Porter (author), Hendrik Schreiber (author), Julián Urbano (author), Sergio Oramas (author)

This paper introduces the AcousticBrainz Genre Dataset, a large-scale collection of hierarchical multi-label genre annotations from different metadata sources. It allows researchers to explore how the same music pieces are annotated differently by different communities following ...

A New Perspective on Score Standardization

Conference paper (2019) - Julián Urbano (author), H.A. De Lima (author), A. Hanjalic (author)

In test collection based evaluation of IR systems, score standardization has been proposed to compare systems across collections and minimize the effect of outlier runs on specific topics. The underlying idea is to account for the difficulty of topics, so that systems are scored ...

THE ACOUSTICBRAINZ GENRE DATASET

MULTI-SOURCE, MULTI-LEVEL, MULTI-LABEL, AND LARGE-SCALE

Conference paper (2019) - Dmitry Bogdanov (author), Alastair Porter (author), Hendrik Schreiber (author), Julián Urbano (author), Sergio Oramas (author)

This paper introduces the AcousticBrainz Genre Dataset, a large-scale collection of hierarchical multi-label genre annotations from different metadata sources. It allows researchers to explore how the same music pieces are annotated differently by different communities following ...

Mapping by Observation

Building a User-Tailored Conducting System From Spontaneous Movements

Journal article (2019) - Alvaro Sarasua (author), Julián Urbano (author), Emilia Gómez (author)

Metaphors are commonly used in interface design within Human-Computer Interaction (HCI). Interface metaphors provide users with a way to interact with the computer that resembles a known activity, giving instantaneous knowledge or intuition about how the interaction works. A widely used one in Digital Musical Instruments (DMIs) is the conductor-orchestra metaphor, where the orchestra is considered as an instrument controlled by the movements of the conductor. We propose a DMI based on the conductor metaphor that allows control of tempo and dynamics and adapts its mapping specifically for each user by observing spontaneous conducting movements (i.e., movements performed on top of fixed music without any instructions). We refer to this as mapping by observation given that, even though the system is trained specifically for each user, this training is not done explicitly and consciously by the user. More specifically, the system adapts its mapping based on the tendency of the user to anticipate or fall behind the beat and on the Motion Capture descriptors that best correlate to loudness during spontaneous conducting. We evaluate the proposed system in an experiment with twenty-four (24) participants where we compare it with a baseline that does not perform this user-specific adaptation. The comparison is done in a context where the user does not receive instructions and, instead, is allowed to discover by playing. We evaluate objective and subjective measures from tasks where participants have to make the orchestra play at different loudness levels or in synchrony with a metronome. Results of the experiment show that the usability of the system that automatically learns its mapping from spontaneous movements is better both in terms of providing a more intuitive control over loudness and a more precise control over beat timing. Interestingly, the results also show a strong correlation between measures taken from the data used for training and the improvement introduced by the adapting system. This indicates that it is possible to estimate in advance how useful the observation of spontaneous movements is to build user-specific adaptations. This opens interesting directions for creating more intuitive and expressive DMIs, particularly in public installations.

Are Nearby Neighbors Relatives?

Testing Deep Music Embeddings

Journal article (2019) - Jaehun Kim (author), Julián Urbano (author), C.C.S. Liem (author), A. Hanjalic (author)

Deep neural networks have frequently been used to directly learn representations useful for a given task from raw input data. In terms of overall performance metrics, machine learning solutions employing deep representations frequently have been reported to greatly outperform tho ...

The MediaEval 2018 AcousticBrainz Genre Task

Content-based Music Genre Recognition from Multiple Sources

Conference paper (2018) - Dmitry Bogdanov (author), Alastair Porter (author), Julián Urbano (author), Hendrik Schreiber (author)

This paper provides an overview of the AcousticBrainz Genre Task organized as part of the MediaEval 2018 Benchmarking Initiative for Multimedia Evaluation. The task is focused on content-based music genre recognition using genre annotations from multiple sources and large-scale m ...

Statistical Analysis of Results in Music Information Retrieval: Why and How

Abstract (2018) - Julián Urbano (author), Arthur Flexer (author)

Nearly since the beginning, the ISMIR and MIREX communities have promoted rigor in experimentation through the creation of datasets and the practice of statistical hypothesis testing to determine the reliability of the improvements observed with those datasets. In fact, MIR researchers have adopted a certain way of going about statistical testing, namely non-parametric approaches like the Friedman test and multiple comparisons corrections like Tukey’s. In a way, they have become a standard of reporting and judging results for researchers, reviewers, committees, journal editors, etc. It is nowadays more frequent to require statistically significant improvements over a baseline with a well-established dataset. But hypothesis testing can be very misleading if not well understood. To many researchers, especially newcomers, even the simpler analyses and tests are seen as a black box where one puts performance scores and gets a p-value which, as they are told, must be smaller than 0.05. Therefore, significance tests are in part responsible for determining what gets published, what research lines to follow, and what project to fund, so it is very important to understand what they really mean and how they should be carried out and interpreted. We will also focus on experimental validity, and will show how a lack of internal or external validity, even if experiments are reliable and repeatable and hypothesis testing is done correctly, can render even your best results invalid. Problems discussed include adversarial examples or the lack of inter-rater agreement when annotating ground truth data. The goal of this tutorial is to help MIR researchers and developers get a better understanding of how these statistical methods work and how they should be interpreted. Starting from the very beginning of the evaluation process, it will show that statistical analysis is always required, but that too much focus on it, or the incorrect approach, is just harmful.
The tutorial will attempt to provide better insight into statistical analysis of results, present better solutions and guidelines, and point the attendees to the larger but ignored problems of evaluation and reproducibility in MIR.

Stochastic Simulation of Test Collections

Evaluation Scores

Conference paper (2018) - Julián Urbano (author), Thomas Nagler (author)

Part of Information Retrieval evaluation research is limited by the fact that we do not know the distributions of system effectiveness over the populations of topics and, by extension, their true mean scores. The workaround usually consists in resampling topics from an existing c ...

A Semi-Automatic and Low-Cost Method to Learn Patterns for Named Entity Recognition

Journal article (2018) - M. Marrero Llinares (author), Julián Urbano (author)

Named Entity Recognition is a basic task in Information Extraction that aims at identifying entities of interest within full text documents. The patterns used to recognize entities can be rule based, as in the popular JAPE system. However, hand-crafting effective patterns is ofte ...

The MediaEval 2017 AcousticBrainz Genre Task

Content-based Music Genre Recognition from Multiple Sources

Conference paper (2017) - Dmitry Bogdanov (author), Alastair Porter (author), Julián Urbano (author), Hendrik Schreiber (author)

This paper provides an overview of the AcousticBrainz Genre Task organized as part of the MediaEval 2017 Benchmarking Initiative for Multimedia Evaluation. The task is focused on content-based music genre recognition using genre annotations from multiple sources and large-scale ...