<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://edgar.meij.pro/feed.xml" rel="self" type="application/atom+xml"/><link href="https://edgar.meij.pro/" rel="alternate" type="text/html" hreflang="en"/><updated>2025-02-08T10:35:53+00:00</updated><id>https://edgar.meij.pro/feed.xml</id><title type="html">blank</title><subtitle>Welcome to Edgar Meij&apos;s homepage.</subtitle><entry><title type="html">Dense Retrieval Adaptation using Target Domain Description (ICTIR 2023)</title><link href="https://edgar.meij.pro/blog/2023/dense-retrieval-adaptation-target-domain-description-ictir-2023/" rel="alternate" type="text/html" title="Dense Retrieval Adaptation using Target Domain Description (ICTIR 2023)"/><published>2023-07-23T00:00:00+00:00</published><updated>2023-07-23T00:00:00+00:00</updated><id>https://edgar.meij.pro/blog/2023/dense-retrieval-adaptation-target-domain-description-ictir-2023</id><content type="html" xml:base="https://edgar.meij.pro/blog/2023/dense-retrieval-adaptation-target-domain-description-ictir-2023/"><![CDATA[<p>In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution is different from the source domain. Existing methods in this area focus on unsupervised domain adaptation where they have access to the target document collection or supervised (often few-shot) domain adaptation where they additionally have access to (limited) labeled data in the target domain. There also exists research on improving zero-shot performance of retrieval models with no adaptation. This paper introduces a new category of domain adaptation in IR that is as-yet unexplored. Here, similar to the zero-shot setting, we assume the retrieval model does not have access to the target document collection. 
In contrast, it does have access to a brief textual description that explains the target domain. We define a taxonomy of domain attributes in retrieval tasks to understand different properties of a source domain that can be adapted to a target domain. We introduce a novel automatic data construction pipeline that produces a synthetic document collection, query set, and pseudo relevance labels, given a textual domain description. Extensive experiments on five diverse target domains show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.</p>]]></content><author><name></name></author><category term="blog"/><category term="conference"/><category term="publication"/><summary type="html"><![CDATA[In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution is different from the source domain. Existing methods in this area focus on unsupervised domain adaptation where they have access to the target document collection or supervised (often few-shot) domain adaptation where they additionally have access to (limited) labeled data in the target domain. There also exists research on improving zero-shot performance of retrieval models with no adaptation. This paper introduces a new category of domain adaptation in IR that is as-yet unexplored. Here, similar to the zero-shot setting, we assume the retrieval model does not have access to the target document collection. In contrast, it does have access to a brief textual description that explains the target domain. We define a taxonomy of domain attributes in retrieval tasks to understand different properties of a source domain that can be adapted to a target domain. We introduce a novel automatic data construction pipeline that produces a synthetic document collection, query set, and pseudo relevance labels, given a textual domain description. 
Extensive experiments on five diverse target domains show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.]]></summary></entry><entry><title type="html">ECIR 23 Tutorial: Neuro-Symbolic Approaches for Information Retrieval</title><link href="https://edgar.meij.pro/blog/2023/ecir-23-tutorial-neuro-symbolic-approaches-information-retrieval/" rel="alternate" type="text/html" title="ECIR 23 Tutorial: Neuro-Symbolic Approaches for Information Retrieval"/><published>2023-03-08T00:00:00+00:00</published><updated>2023-03-08T00:00:00+00:00</updated><id>https://edgar.meij.pro/blog/2023/ecir-23-tutorial-neuro-symbolic-approaches-information-retrieval</id><content type="html" xml:base="https://edgar.meij.pro/blog/2023/ecir-23-tutorial-neuro-symbolic-approaches-information-retrieval/"><![CDATA[<p>This tutorial will provide an overview of recent advances in neuro-symbolic approaches for information retrieval. A decade ago, knowledge graph and semantic annotation technologies led to active research on how to best leverage symbolic knowledge. At the same time, neural methods have proven to be versatile and highly effective. From a neural network perspective, the same representation approach can serve document ranking or knowledge graph reasoning. End-to-end training makes it possible to optimize complex methods for downstream tasks. We are at the point where both the symbolic and the neural research advances are coalescing into neuro-symbolic approaches. 
The underlying research questions are how to best combine symbolic and neural approaches, what kind of symbolic/neural approaches are most suitable for which use case, and how to best integrate both ideas to advance the state of the art in information retrieval.</p>]]></content><author><name></name></author><category term="blog"/><category term="publication"/><category term="tutorial-publication"/><summary type="html"><![CDATA[This tutorial will provide an overview of recent advances in neuro-symbolic approaches for information retrieval. A decade ago, knowledge graph and semantic annotation technologies led to active research on how to best leverage symbolic knowledge. At the same time, neural methods have proven to be versatile and highly effective. From a neural network perspective, the same representation approach can serve document ranking or knowledge graph reasoning. End-to-end training makes it possible to optimize complex methods for downstream tasks. We are at the point where both the symbolic and the neural research advances are coalescing into neuro-symbolic approaches. 
The underlying research questions are how to best combine symbolic and neural approaches, what kind of symbolic/neural approaches are most suitable for which use case, and how to best integrate both ideas to advance the state of the art in information retrieval.]]></summary></entry><entry><title type="html">Entity Retrieval from Multilingual Knowledge Graphs (MRL 2022)</title><link href="https://edgar.meij.pro/blog/2022/entity-retrieval-multilingual-knowledge-graphs-mrl-2022/" rel="alternate" type="text/html" title="Entity Retrieval from Multilingual Knowledge Graphs (MRL 2022)"/><published>2022-12-08T00:00:00+00:00</published><updated>2022-12-08T00:00:00+00:00</updated><id>https://edgar.meij.pro/blog/2022/entity-retrieval-multilingual-knowledge-graphs-mrl-2022</id><content type="html" xml:base="https://edgar.meij.pro/blog/2022/entity-retrieval-multilingual-knowledge-graphs-mrl-2022/"><![CDATA[<p>Knowledge Graphs (KGs) are structured databases that capture real-world entities and their relationships. The task of entity retrieval from a KG aims at retrieving a ranked list of entities relevant to a given user query. While English-only entity retrieval has attracted considerable attention, user queries, as well as the information contained in the KG, may be represented in multiple—and possibly distinct—languages. Furthermore, KG content may vary between languages due to different information sources and points of view. Recent advances in language representation have enabled natural ways of bridging gaps between languages. In this paper, we therefore propose to utilise language models (LMs) and diverse entity representations to enable truly multilingual entity retrieval. We propose two approaches: (i) an array of monolingual retrievers and (ii) a single multilingual retriever, trained using queries and documents in multiple languages. 
We show that while our approach is on par with the significantly more complex state-of-the-art method for the English task, it can be successfully applied to virtually any language with an LM. Furthermore, it allows languages to benefit from one another, yielding significantly better performance, both for low- and high-resource languages.</p>]]></content><author><name></name></author><category term="blog"/><category term="conference"/><category term="publication"/><summary type="html"><![CDATA[Knowledge Graphs (KGs) are structured databases that capture real-world entities and their relationships. The task of entity retrieval from a KG aims at retrieving a ranked list of entities relevant to a given user query. While English-only entity retrieval has attracted considerable attention, user queries, as well as the information contained in the KG, may be represented in multiple—and possibly distinct—languages. Furthermore, KG content may vary between languages due to different information sources and points of view. Recent advances in language representation have enabled natural ways of bridging gaps between languages. In this paper, we therefore propose to utilise language models (LMs) and diverse entity representations to enable truly multilingual entity retrieval. We propose two approaches: (i) an array of monolingual retrievers and (ii) a single multilingual retriever, trained using queries and documents in multiple languages. We show that while our approach is on par with the significantly more complex state-of-the-art method for the English task, it can be successfully applied to virtually any language with an LM. 
Furthermore, it allows languages to benefit from one another, yielding significantly better performance, both for low- and high-resource languages.]]></summary></entry><entry><title type="html">Similarity-based Multi-Domain Dialogue State Tracking with Copy Mechanisms for Task-based Virtual Personal Assistants (WWW 2022)</title><link href="https://edgar.meij.pro/blog/2022/similarity-based-multi-domain-dialogue-state-tracking-copy-mechanisms-task-based-virtual-personal-assistants-www-2022/" rel="alternate" type="text/html" title="Similarity-based Multi-Domain Dialogue State Tracking with Copy Mechanisms for Task-based Virtual Personal Assistants (WWW 2022)"/><published>2022-04-09T00:00:00+00:00</published><updated>2022-04-09T00:00:00+00:00</updated><id>https://edgar.meij.pro/blog/2022/similarity-based-multi-domain-dialogue-state-tracking-copy-mechanisms-task-based-virtual-personal-assistants-www-2022</id><content type="html" xml:base="https://edgar.meij.pro/blog/2022/similarity-based-multi-domain-dialogue-state-tracking-copy-mechanisms-task-based-virtual-personal-assistants-www-2022/"><![CDATA[<p>Task-based Virtual Personal Assistants (VPAs) rely on multi-domain Dialogue State Tracking (DST) models to monitor goals throughout a conversation. Previously proposed models show promising results on established benchmarks, but they have difficulty adapting to unseen domains due to domain-specific parameters in their model architectures. We propose a new Similarity-based Multi-domain Dialogue State Tracking model (SM-DST) that uses retrieval-inspired and fine-grained contextual token-level similarity approaches to efficiently and effectively track dialogue state. The key difference from state-of-the-art DST models is that SM-DST uses a single model with shared parameters across domains and slots. Because we base SM-DST on similarity, it allows the transfer of tracking information between semantically related domains as well as to unseen domains without retraining. 
Furthermore, we leverage copy mechanisms that consider the system’s response and the dialogue state from previous turn predictions, allowing it to more effectively track dialogue state for complex conversations. We evaluate SM-DST on three variants of the MultiWOZ DST benchmark datasets. The results demonstrate that SM-DST significantly and consistently outperforms state-of-the-art models across all datasets by 5-18% and 3-25% absolute in the few- and zero-shot settings, respectively.</p>]]></content><author><name></name></author><category term="blog"/><category term="conference"/><category term="publication"/><summary type="html"><![CDATA[Task-based Virtual Personal Assistants (VPAs) rely on multi-domain Dialogue State Tracking (DST) models to monitor goals throughout a conversation. Previously proposed models show promising results on established benchmarks, but they have difficulty adapting to unseen domains due to domain-specific parameters in their model architectures. We propose a new Similarity-based Multi-domain Dialogue State Tracking model (SM-DST) that uses retrieval-inspired and fine-grained contextual token-level similarity approaches to efficiently and effectively track dialogue state. The key difference from state-of-the-art DST models is that SM-DST uses a single model with shared parameters across domains and slots. Because we base SM-DST on similarity, it allows the transfer of tracking information between semantically related domains as well as to unseen domains without retraining. Furthermore, we leverage copy mechanisms that consider the system’s response and the dialogue state from previous turn predictions, allowing it to more effectively track dialogue state for complex conversations. We evaluate SM-DST on three variants of the MultiWOZ DST benchmark datasets. 
The results demonstrate that SM-DST significantly and consistently outperforms state-of-the-art models across all datasets by 5-18% and 3-25% absolute in the few- and zero-shot settings, respectively.]]></summary></entry><entry><title type="html">Understanding Financial Information Seeking Behavior from User Interactions with Company Filings (WWW companion 2022)</title><link href="https://edgar.meij.pro/blog/2022/understanding-financial-information-seeking-behavior-user-interactions-company-filings-www-companion-2022/" rel="alternate" type="text/html" title="Understanding Financial Information Seeking Behavior from User Interactions with Company Filings (WWW companion 2022)"/><published>2022-04-05T00:00:00+00:00</published><updated>2022-04-05T00:00:00+00:00</updated><id>https://edgar.meij.pro/blog/2022/understanding-financial-information-seeking-behavior-user-interactions-company-filings-www-companion-2022</id><content type="html" xml:base="https://edgar.meij.pro/blog/2022/understanding-financial-information-seeking-behavior-user-interactions-company-filings-www-companion-2022/"><![CDATA[<p>Publicly-traded companies are required to regularly file financial statements and disclosures. Analysts, investors, and regulators leverage these filings to support decision making, with high financial and legal stakes. Despite their ubiquity in finance, little is known about the information-seeking behavior of users accessing such filings. In this work, we present the first study of this behavior. We analyze 14 years of logs of users accessing company filings of more than 600K distinct companies on the U.S. Securities and Exchange Commission’s (SEC) Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system, the primary resource for accessing company filings. We provide an analysis of the information-seeking behavior for this high-impact domain. We find that little behavioral history is available for the majority of users, while frequent users have rich histories. 
Most sessions focus on filings belonging to a small number of companies, and individual users are interested in a limited number of companies. Out of all sessions, 66% contain filings from one or two companies, and 50% of frequent users are interested in six companies or fewer. Understanding user interactions with EDGAR can suggest ways to enhance the user journey in browsing filings, e.g., via filing recommendation. Our work provides a stepping stone for the academic community to tackle retrieval and recommendation tasks for the finance domain.</p>]]></content><author><name></name></author><category term="blog"/><category term="conference"/><category term="publication"/><summary type="html"><![CDATA[Publicly-traded companies are required to regularly file financial statements and disclosures. Analysts, investors, and regulators leverage these filings to support decision making, with high financial and legal stakes. Despite their ubiquity in finance, little is known about the information-seeking behavior of users accessing such filings. In this work, we present the first study of this behavior. We analyze 14 years of logs of users accessing company filings of more than 600K distinct companies on the U.S. Securities and Exchange Commission’s (SEC) Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system, the primary resource for accessing company filings. We provide an analysis of the information-seeking behavior for this high-impact domain. We find that little behavioral history is available for the majority of users, while frequent users have rich histories. Most sessions focus on filings belonging to a small number of companies, and individual users are interested in a limited number of companies. Out of all sessions, 66% contain filings from one or two companies, and 50% of frequent users are interested in six companies or fewer. 
Understanding user interactions with EDGAR can suggest ways to enhance the user journey in browsing filings, e.g., via filing recommendation. Our work provides a stepping stone for the academic community to tackle retrieval and recommendation tasks for the finance domain.]]></summary></entry><entry><title type="html">Recent papers</title><link href="https://edgar.meij.pro/blog/2022/recent-papers/" rel="alternate" type="text/html" title="Recent papers"/><published>2022-01-02T00:00:00+00:00</published><updated>2022-01-02T00:00:00+00:00</updated><id>https://edgar.meij.pro/blog/2022/recent-papers</id><content type="html" xml:base="https://edgar.meij.pro/blog/2022/recent-papers/"><![CDATA[<p>Please look at <a href="https://scholar.google.com/citations?hl=en&amp;user=vb6YKF0AAAAJ&amp;view_op=list_works&amp;sortby=pubdate">Google Scholar</a> to see an up-to-date list of my papers, and be sure to check out <a href="https://techatbloomberg.com/ai">https://techatbloomberg.com/ai</a> as well!</p>]]></content><author><name></name></author><category term="blog"/><category term="conference"/><category term="journal"/><category term="publication"/><category term="workshop"/><summary type="html"><![CDATA[Please look at Google Scholar to see an up-to-date list of my papers, and be sure to check out https://techatbloomberg.com/ai as well!]]></summary></entry><entry><title type="html">Improving Dialogue State Tracking with Turn-based Loss Function and Sequential Data Augmentation (EMNLP 
2021)"/><published>2021-11-08T00:00:00+00:00</published><updated>2021-11-08T00:00:00+00:00</updated><id>https://edgar.meij.pro/blog/2021/improving-dialogue-state-tracking-turn-based-loss-function-sequential-data-augmentation-emnlp-2021</id><content type="html" xml:base="https://edgar.meij.pro/blog/2021/improving-dialogue-state-tracking-turn-based-loss-function-sequential-data-augmentation-emnlp-2021/"><![CDATA[<p>While state-of-the-art Dialogue State Tracking (DST) models show promising results, all of them rely on a traditional cross-entropy loss function during the training process, which may not be optimal for improving the joint goal accuracy. Although several approaches recently propose augmenting the training set by copying user utterances and replacing the real slot values with other possible or even similar values, they are not effective at improving the performance of existing DST models. To address these challenges, we propose a Turn-based Loss Function (TLF) that penalises the model more for inaccurately predicting a slot value in early turns than in later turns, in order to improve joint goal accuracy. We also propose a simple but effective Sequential Data Augmentation (SDA) algorithm to generate more complex user utterances and system responses to effectively train existing DST models. Experimental results on two standard DST benchmark collections demonstrate that our proposed TLF and SDA techniques significantly improve the effectiveness of the state-of-the-art DST model, with an approximately 7-8% relative reduction in error, and achieve new state-of-the-art joint goal accuracies of 59.50 and 54.90 on MultiWOZ 2.1 and MultiWOZ 2.2, respectively.</p>]]></content><author><name></name></author><category term="blog"/><category term="conference"/><category term="publication"/><summary type="html"><![CDATA[While state-of-the-art Dialogue State Tracking (DST) models show promising results, all of them rely on a traditional cross-entropy loss function during the training process, which may not be optimal for improving the joint goal accuracy. Although several approaches recently propose augmenting the training set by copying user utterances and replacing the real slot values with other possible or even similar values, they are not effective at improving the performance of existing DST models. To address these challenges, we propose a Turn-based Loss Function (TLF) that penalises the model more for inaccurately predicting a slot value in early turns than in later turns, in order to improve joint goal accuracy. We also propose a simple but effective Sequential Data Augmentation (SDA) algorithm to generate more complex user utterances and system responses to effectively train existing DST models. Experimental results on two standard DST benchmark collections demonstrate that our proposed TLF and SDA techniques significantly improve the effectiveness of the state-of-the-art DST model, with an approximately 7-8% relative reduction in error, and achieve new state-of-the-art joint goal accuracies of 59.50 and 54.90 on MultiWOZ 2.1 and MultiWOZ 2.2, respectively.]]></summary></entry><entry><title type="html">News Article Retrieval in Context for Event-centric Narrative Creation</title><link href="https://edgar.meij.pro/blog/2021/news-article-retrieval-in-contextfor-event-centric-narrative-creation/" rel="alternate" type="text/html" title="News Article Retrieval in Context for Event-centric Narrative Creation"/><published>2021-07-11T00:00:00+00:00</published><updated>2021-07-11T00:00:00+00:00</updated><id>https://edgar.meij.pro/blog/2021/news-article-retrieval-in-contextfor-event-centric-narrative-creation</id><content type="html" xml:base="https://edgar.meij.pro/blog/2021/news-article-retrieval-in-contextfor-event-centric-narrative-creation/"><![CDATA[<p>Writers such as journalists often use automatic tools to find relevant content to include in their narratives. In this paper, we focus on supporting writers in the news domain to develop event-centric narratives. Given an incomplete narrative that specifies a main event and a context, we aim to retrieve news articles that discuss relevant events that would enable the continuation of the narrative. We formally define this task and propose a retrieval dataset construction procedure that relies on existing news articles to simulate incomplete narratives and relevant articles. Experiments on two datasets derived from this procedure show that state-of-the-art lexical and semantic rankers are not sufficient for this task. We show that combining those with a ranker that ranks articles by reverse chronological order outperforms those rankers alone. 
We also perform an in-depth quantitative and qualitative analysis of the results that sheds light on the characteristics of this task.</p> <p>See <a href="https://doi.org/10.1145/3471158.3472247">https://doi.org/10.1145/3471158.3472247</a> for more details.</p>]]></content><author><name></name></author><category term="blog"/><category term="conference"/><category term="publication"/><summary type="html"><![CDATA[Writers such as journalists often use automatic tools to find relevant content to include in their narratives. In this paper, we focus on supporting writers in the news domain to develop event-centric narratives. Given an incomplete narrative that specifies a main event and a context, we aim to retrieve news articles that discuss relevant events that would enable the continuation of the narrative. We formally define this task and propose a retrieval dataset construction procedure that relies on existing news articles to simulate incomplete narratives and relevant articles. Experiments on two datasets derived from this procedure show that state-of-the-art lexical and semantic rankers are not sufficient for this task. We show that combining those with a ranker that ranks articles by reverse chronological order outperforms those rankers alone. 
We also perform an in-depth quantitative and qualitative analysis of the results that sheds light on the characteristics of this task.]]></summary></entry><entry><title type="html">Contextualizing Trending Entities in News Stories</title><link href="https://edgar.meij.pro/blog/2021/contextualizing-trending-entities-in-news-stories/" rel="alternate" type="text/html" title="Contextualizing Trending Entities in News Stories"/><published>2021-03-01T00:00:00+00:00</published><updated>2021-03-01T00:00:00+00:00</updated><id>https://edgar.meij.pro/blog/2021/contextualizing-trending-entities-in-news-stories</id><content type="html" xml:base="https://edgar.meij.pro/blog/2021/contextualizing-trending-entities-in-news-stories/"><![CDATA[<p>Trends are those keywords, phrases, or names that are mentioned most often on social media or in news in a particular timeframe. They are an effective way for human news readers to both discover and stay focused on the most relevant information of the day. In this work, we consider trends that correspond to an entity in a knowledge graph and introduce the new and as-yet unexplored task of identifying other entities that may help explain why an entity is trending. We refer to these retrieved entities as contextual entities. Some of them are more important than others in the context of the trending entity, and we thus determine a ranking of entities according to how useful they are in contextualizing the trend. We propose two solutions for ranking contextual entities. The first one is fully unsupervised and based on Personalized PageRank, calculated over a trending entity-specific graph of other entities where the edges encode a notion of directional similarity based on embedded background knowledge. Our second method is based on learning to rank and combines the intuitions behind the unsupervised model with signals derived from hand-crafted features in a supervised setting. 
We compare our models on this novel task by using a new, purpose-built test collection created using crowdsourcing. Our methods improve over the strongest baseline in terms of Precision at 1 by 7% (unsupervised) and 13% (supervised). We find that the salience of a contextual entity and how coherent it is with respect to the news story are strong indicators of relevance in both unsupervised and supervised settings.</p> <p>See <a href="https://doi.org/10.1145/3437963.3441765">https://doi.org/10.1145/3437963.3441765</a>.</p>]]></content><author><name></name></author><category term="blog"/><category term="conference"/><category term="publication"/><category term="information-retrieval"/><category term="learning-to-rank"/><category term="trend-detection"/><summary type="html"><![CDATA[Trends are those keywords, phrases, or names that are mentioned most often on social media or in news in a particular timeframe. They are an effective way for human news readers to both discover and stay focused on the most relevant information of the day. In this work, we consider trends that correspond to an entity in a knowledge graph and introduce the new and as-yet unexplored task of identifying other entities that may help explain why an entity is trending. We refer to these retrieved entities as contextual entities. Some of them are more important than others in the context of the trending entity, and we thus determine a ranking of entities according to how useful they are in contextualizing the trend. We propose two solutions for ranking contextual entities. The first one is fully unsupervised and based on Personalized PageRank, calculated over a trending entity-specific graph of other entities where the edges encode a notion of directional similarity based on embedded background knowledge. Our second method is based on learning to rank and combines the intuitions behind the unsupervised model with signals derived from hand-crafted features in a supervised setting. 
We compare our models on this novel task by using a new, purpose-built test collection created using crowdsourcing. Our methods improve over the strongest baseline in terms of Precision at 1 by 7% (unsupervised) and 13% (supervised). We find that the salience of a contextual entity and how coherent it is with respect to the news story are strong indicators of relevance in both unsupervised and supervised settings.]]></summary></entry><entry><title type="html">Report on the first workshop on bias in automatic knowledge graph construction at AKBC 2020</title><link href="https://edgar.meij.pro/blog/2020/report-workshop-bias-automatic-knowledge-graph-construction-akbc-2020/" rel="alternate" type="text/html" title="Report on the first workshop on bias in automatic knowledge graph construction at AKBC 2020"/><published>2020-12-08T00:00:00+00:00</published><updated>2020-12-08T00:00:00+00:00</updated><id>https://edgar.meij.pro/blog/2020/report-workshop-bias-automatic-knowledge-graph-construction-akbc-2020</id><content type="html" xml:base="https://edgar.meij.pro/blog/2020/report-workshop-bias-automatic-knowledge-graph-construction-akbc-2020/"><![CDATA[<p>We report on the First Workshop on Bias in Automatic Knowledge Graph Construction (KG-BIAS), which was co-located with the Automated Knowledge Base Construction (AKBC) 2020 conference. Identifying and possibly remediating any sort of bias in knowledge graphs, or in the methods used to construct or query them, has clear implications for downstream systems accessing and using the information in such graphs. However, this topic remains relatively unstudied, so our main aim for organizing this workshop was to bring together a group of people from a variety of backgrounds with an interest in the topic, in order to arrive at a shared definition and roadmap for the future. 
Through a program that included two keynotes, an invited paper, three peer-reviewed full papers, and a plenary discussion, we have made initial inroads towards a common understanding and shared research agenda for this timely and important topic.</p>]]></content><author><name></name></author><category term="blog"/><category term="journal"/><category term="publication"/><summary type="html"><![CDATA[We report on the First Workshop on Bias in Automatic Knowledge Graph Construction (KG-BIAS), which was co-located with the Automated Knowledge Base Construction (AKBC) 2020 conference. Identifying and possibly remediating any sort of bias in knowledge graphs, or in the methods used to construct or query them, has clear implications for downstream systems accessing and using the information in such graphs. However, this topic remains relatively unstudied, so our main aim for organizing this workshop was to bring together a group of people from a variety of backgrounds with an interest in the topic, in order to arrive at a shared definition and roadmap for the future. Through a program that included two keynotes, an invited paper, three peer-reviewed full papers, and a plenary discussion, we have made initial inroads towards a common understanding and shared research agenda for this timely and important topic.]]></summary></entry></feed>