NLP+CSS Workshops

Uncertainty over Uncertainty: Investigating the Assumptions, Annotations, and Text Measurements of Economic Policy Uncertainty

Methods and applications are inextricably linked in science, and in particular in the domain of text-as-data. In this paper, we examine one such text-as-data application, an established economic index that measures economic policy uncertainty from keyword occurrences in news. This index, which is shown to correlate with firm investment, employment, and excess market returns, has had substantive impact in both the private sector and academia. Yet, as we revisit and extend the original authors’ annotations and text measurements we find interesting text-as-data methodological research questions: (1) Are annotator disagreements a reflection of ambiguity in language? (2) Do alternative text measurements correlate with one another and with measures of external predictive validity? We find for this application that (1) some annotator disagreements about economic policy uncertainty can be attributed to ambiguity in language, and (2) switching measurements from keyword-matching to supervised machine learning classifiers results in low correlation, a concerning implication for the validity of the index.
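To make the keyword-matching style of measurement concrete, here is a minimal sketch of how such an index counts articles. The term sets and articles below are purely illustrative stand-ins, not the lexicon used by the original index authors.

```python
# Sketch of keyword-based text measurement in the style of an economic
# policy uncertainty index. Term sets are illustrative, not the actual lexicon.
ECONOMY = {"economy", "economic"}
POLICY = {"policy", "regulation", "congress", "legislation"}
UNCERTAINTY = {"uncertain", "uncertainty"}

def is_epu_article(text: str) -> bool:
    """An article counts toward the index if it contains at least one
    term from each of the three categories."""
    tokens = set(text.lower().split())
    return all(tokens & category for category in (ECONOMY, POLICY, UNCERTAINTY))

# Toy corpus: the index value is the share of matching articles.
articles = [
    "Congress debates new policy amid economic uncertainty",
    "Local team wins championship game",
]
share = sum(map(is_epu_article, articles)) / len(articles)
```

A supervised classifier would replace `is_epu_article` with a learned model, which is exactly the switch whose low correlation with keyword-matching the abstract flags.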

Evaluating the Calibration of Knowledge Graph Embeddings for Trustworthy Link Prediction

Little is known about the trustworthiness of predictions made by knowledge graph embedding (KGE) models. In this paper we take initial steps toward this direction by investigating the calibration of KGE models, or the extent to which they output confidence scores that reflect the expected correctness of predicted knowledge graph triples. We first conduct an evaluation under the standard closed-world assumption (CWA), in which predicted triples not already in the knowledge graph are considered false, and show that existing calibration techniques are effective for KGE under this common but narrow assumption. Next, we introduce the more realistic but challenging open-world assumption (OWA), in which unobserved predictions are not considered true or false until ground-truth labels are obtained. Here, we show that existing calibration techniques are much less effective under the OWA than the CWA, and provide explanations for this discrepancy. Finally, to motivate the utility of calibration for KGE from a practitioner’s perspective, we conduct a unique case study of human-AI collaboration, showing that calibrated predictions can improve human performance in a knowledge graph completion task.
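One common way to quantify the gap between confidence scores and expected correctness is expected calibration error (ECE). The sketch below is a minimal, generic implementation with toy data; it is not taken from the paper's evaluation code.

```python
# Minimal sketch of expected calibration error (ECE): bin predictions by
# confidence and average the gap between mean confidence and accuracy.
def expected_calibration_error(confidences, labels, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, labels):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, label))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(l for _, l in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Overconfident toy predictions: mean confidence 0.9, but accuracy only 0.8.
ece = expected_calibration_error([0.9] * 5, [1, 1, 1, 1, 0])
```

Under the open-world assumption described above, the difficulty is that the `labels` for unobserved triples are simply not available until ground truth is obtained.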

Knowledge Graphs: An Information Retrieval Perspective

In this survey, we provide an overview of the literature on knowledge graphs (KGs) in the context of information retrieval (IR). Modern IR systems can benefit from information available in KGs in multiple ways, independent of whether the KGs are publicly available or proprietary ones. We provide an overview of the components required when building IR systems that leverage KGs and use a task-oriented organization of the material that we discuss. As an understanding of the intersection of IR and KGs is beneficial to many researchers and practitioners, we consider prior work from two complementary angles: leveraging KGs for information retrieval and enriching KGs using IR techniques. We start by discussing how KGs can be employed to support IR tasks, including document and entity retrieval. We then proceed by describing how IR—and language technology in general—can be utilized for the construction and completion of KGs. This includes tasks such as entity recognition, typing, and relation extraction. We discuss common issues that appear across the tasks that we consider and identify future directions for addressing them. We also provide pointers to datasets and other resources that should be useful for both newcomers and experienced researchers in the area. See http://www.nowpublishers.com/article/Details/INR-063 for more details.

CrossBERT: A Triplet Neural Architecture for Ranking Entity Properties

Task-based Virtual Personal Assistants (VPAs) such as the Google Assistant, Alexa, and Siri are increasingly being adopted for a wide variety of tasks. These tasks are grounded in real-world entities and actions (e.g., book a hotel, organise a conference, or request funds). In this work we tackle the task of automatically constructing actionable knowledge graphs in response to a user query in order to support a wider variety of increasingly complex assistant tasks. We frame this as an entity property ranking task given a user query with annotated properties. We propose a new method for property ranking, CrossBERT. CrossBERT builds on the Bidirectional Encoder Representations from Transformers (BERT) and creates a new triplet network structure on cross query-property pairs that is used to rank properties. We also study the impact of using external evidence for query entities from textual entity descriptions. We perform experiments on two standard benchmark collections, the NTCIR-13 Actionable Knowledge Graph Generation (AKGG) task and Entity Property Identification (EPI) task. The results demonstrate that CrossBERT significantly outperforms the best performing runs from AKGG and EPI, as well as previous state-of-the-art BERT-based models. In particular, CrossBERT significantly improves Recall and NDCG by approximately 2-12% over the BERT models across the two used datasets.
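The triplet network structure mentioned above is typically trained with a margin-based ranking objective. The sketch below shows that objective in isolation; the `score_pos`/`score_neg` inputs stand in for the BERT cross-encoder scores of relevant and irrelevant (query, property) pairs, which are not reproduced here.

```python
# Sketch of the margin (hinge) ranking loss used by triplet architectures:
# push the relevant property's score above the irrelevant one's by `margin`.
# The scores themselves would come from a cross-encoder in CrossBERT's setup.
def triplet_loss(score_pos: float, score_neg: float, margin: float = 1.0) -> float:
    return max(0.0, margin - (score_pos - score_neg))

# Margin already satisfied -> zero loss, nothing to learn from this triplet.
loss_ok = triplet_loss(2.0, 1.0)
# Margin violated -> positive loss, gradient would push the scores apart.
loss_bad = triplet_loss(1.5, 1.0)
```

At inference time no loss is needed: properties are simply sorted by their scores against the query.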

Proceedings of the KG-BIAS Workshop 2020 at AKBC 2020

The KG-BIAS 2020 workshop touches on biases and how they surface in knowledge graphs (KGs), biases in the source data used to create KGs, and methods for measuring or remediating bias in KGs, but also on identifying other biases, such as how and which languages are represented in automatically constructed KGs or how personal KGs might incur inherent biases. The goal of this workshop is to uncover how various types of biases are introduced into KGs, investigate how to measure such biases, and propose methods to remediate them.


Novel Entity Discovery from Web Tables

When working with any sort of knowledge base (KB), one has to make sure it is as complete and as up-to-date as possible. Both tasks are non-trivial, as they require recall-oriented efforts to determine which entities and relationships are missing from the KB; as such, they require a significant amount of labor. Tables on the Web, on the other hand, are abundant and have the distinct potential to assist with these tasks. In particular, we can leverage the content in such tables to discover new entities, properties, and relationships. Because web tables typically only contain raw textual content, we first need to determine which cells refer to which known entities, a task we dub table-to-KB matching. This first task aims to infer table semantics by linking table cells and heading columns to elements of a KB. We propose a feature-based method and on two public test collections we demonstrate substantial improvements over the state-of-the-art in terms of precision whilst also improving recall. The second task builds upon these linked entities and properties to not only identify novel ones in the same table but also to bootstrap their type and additional relationships. We refer to this process as novel entity discovery and, to the best of our knowledge, it is the first endeavor on mining the unlinked cells in web tables. Our method identifies not only out-of-KB (“novel”) information but also novel aliases for in-KB (“known”) entities. When evaluated using three purpose-built test collections, we find that our proposed approaches obtain a marked improvement in terms of precision over our baselines whilst keeping recall stable.

  • [PDF] [DOI] S. Zhang, E. Meij, K. Balog, and R. Reinanda, “Novel entity discovery from web tables,” in Proceedings of the web conference 2020, New York, NY, USA, 2020, p. 1298–1308.
    [Bibtex]
    @inproceedings{WWW:2020:Zhang,
    Address = {New York, NY, USA},
    Author = {Zhang, Shuo and Meij, Edgar and Balog, Krisztian and Reinanda, Ridho},
    Booktitle = {Proceedings of The Web Conference 2020},
    Date-Added = {2020-06-03 06:23:41 +0100},
    Date-Modified = {2020-06-03 06:24:53 +0100},
    Doi = {10.1145/3366423.3380205},
    Isbn = {9781450370233},
    Keywords = {tabular data extraction, Novel entity discovery, entity linking, KBP},
    Location = {Taipei, Taiwan},
    Numpages = {11},
    Pages = {1298--1308},
    Publisher = {Association for Computing Machinery},
    Series = {WWW '20},
    Title = {Novel Entity Discovery from Web Tables},
    Url = {https://doi.org/10.1145/3366423.3380205},
    Year = {2020},
    Bdsk-Url-1 = {https://doi.org/10.1145/3366423.3380205}}
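The split between linked cells and novel-entity candidates can be illustrated with a deliberately naive baseline: an exact alias-dictionary lookup. The paper's actual method is feature-based and far richer; the alias table and cell values below are assumptions for illustration only.

```python
# Naive sketch of table-to-KB matching: link table cells to KB entities via
# an alias dictionary; anything unlinked becomes a candidate for novel
# entity discovery. Aliases and IDs below are illustrative (Wikidata-style).
KB_ALIASES = {
    "nyc": "Q60",             # New York City
    "new york city": "Q60",
    "london": "Q84",
}

def link_cells(cells):
    linked, novel = {}, []
    for cell in cells:
        entity = KB_ALIASES.get(cell.strip().lower())
        if entity:
            linked[cell] = entity
        else:
            novel.append(cell)    # out-of-KB ("novel") candidate
    return linked, novel

linked, novel = link_cells(["NYC", "London", "Springfield"])
```

Note how the lookup also surfaces novel aliases: a cell like "NYC" links to a known entity under a surface form the KB may not yet record, matching the "known entity, novel alias" case in the abstract.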


Identifying Notable News Stories

The volume of news content has increased significantly in recent years and systems to process and deliver this information in an automated fashion at scale are becoming increasingly prevalent. One critical component that is required in such systems is a method to automatically determine how notable a certain news story is, in order to prioritize these stories during delivery. One way to do so is to compare each story in a stream of news stories to a notable event. In other words, the problem of detecting notable news can be defined as a ranking task; given a trusted source of notable events and a stream of candidate news stories, we aim to answer the question: “Which of the candidate news stories is most similar to the notable one?”. We employ different combinations of features and learning to rank (LTR) models and gather relevance labels using crowdsourcing. In our approach, we use structured representations of candidate news stories (triples) and we link them to corresponding entities. Our evaluation shows that the features in our proposed method outperform standard ranking methods, and that the trained model generalizes well to unseen news stories.

  • [PDF] A. Saravanou, G. Stefanoni, and E. Meij, “Identifying notable news stories,” in Advances in information retrieval, Cham, 2020, p. 352–358.
    [Bibtex]
    @inproceedings{ECIR:2020:Saravanou,
    Abstract = {The volume of news content has increased significantly in recent years and systems to process and deliver this information in an automated fashion at scale are becoming increasingly prevalent. One critical component that is required in such systems is a method to automatically determine how notable a certain news story is, in order to prioritize these stories during delivery. One way to do so is to compare each story in a stream of news stories to a notable event. In other words, the problem of detecting notable news can be defined as a ranking task; given a trusted source of notable events and a stream of candidate news stories, we aim to answer the question: ``Which of the candidate news stories is most similar to the notable one?''. We employ different combinations of features and learning to rank (LTR) models and gather relevance labels using crowdsourcing. In our approach, we use structured representations of candidate news stories (triples) and we link them to corresponding entities. Our evaluation shows that the features in our proposed method outperform standard ranking methods, and that the trained model generalizes well to unseen news stories.},
    Address = {Cham},
    Author = {Saravanou, Antonia and Stefanoni, Giorgio and Meij, Edgar},
    Booktitle = {Advances in Information Retrieval},
    Date-Added = {2020-06-03 06:36:13 +0100},
    Date-Modified = {2020-06-03 06:47:12 +0100},
    Editor = {Jose, Joemon M. and Yilmaz, Emine and Magalh{\~a}es, Jo{\~a}o and Castells, Pablo and Ferro, Nicola and Silva, M{\'a}rio J. and Martins, Fl{\'a}vio},
    Isbn = {978-3-030-45442-5},
    Pages = {352--358},
    Publisher = {Springer International Publishing},
    Title = {Identifying Notable News Stories},
    Year = {2020}}
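The ranking formulation above can be sketched with a single similarity feature: overlap between the entities linked from a candidate story and those of the notable event. The paper combines many features in a learning-to-rank model; Jaccard overlap and the entity IDs here are illustrative assumptions.

```python
# Sketch of ranking candidate news stories against a notable event by
# Jaccard overlap of their linked entities (one toy feature, standing in
# for the feature combinations the LTR model would learn over).
def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

notable_event = {"Q30", "Q76", "Q11696"}      # illustrative entity IDs
candidates = {
    "story_a": {"Q30", "Q76"},                # shares two entities
    "story_b": {"Q145", "Q9682"},             # shares none
}
ranked = sorted(candidates,
                key=lambda s: jaccard(candidates[s], notable_event),
                reverse=True)
```

The crowdsourced relevance labels mentioned in the abstract would supply the training signal for weighting features like this one.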

Improving the Utility of Knowledge Graph Embeddings with Calibration

This paper addresses machine learning models that embed knowledge graph entities and relationships toward the goal of predicting unseen triples, which is an important task because most knowledge graphs are by nature incomplete. We posit that while offline link prediction accuracy using embeddings has been steadily improving on benchmark datasets, such embedding models have limited practical utility in real-world knowledge graph completion tasks because it is not clear when their predictions should be accepted or trusted. To this end, we propose to calibrate knowledge graph embedding models to output reliable confidence estimates for predicted triples. In crowdsourcing experiments, we demonstrate that calibrated confidence scores can make knowledge graph embeddings more useful to practitioners and data annotators in knowledge graph completion tasks. We also release two resources from our evaluation tasks: An enriched version of the FB15K benchmark and a new knowledge graph dataset extracted from Wikidata.

  • [PDF] T. Safavi, D. Koutra, and E. Meij, Improving the utility of knowledge graph embeddings with calibration, 2020.
    [Bibtex]
    @misc{ARXIV:2020:Safavi,
    Archiveprefix = {arXiv},
    Author = {Tara Safavi and Danai Koutra and Edgar Meij},
    Date-Added = {2020-06-03 06:34:40 +0100},
    Date-Modified = {2020-06-03 06:47:20 +0100},
    Eprint = {2004.01168},
    Primaryclass = {cs.AI},
    Title = {Improving the Utility of Knowledge Graph Embeddings with Calibration},
    Year = {2020}}
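One common, minimal way to produce the calibrated confidence estimates discussed above is post-hoc temperature scaling on held-out labeled triples. The paper evaluates several calibration techniques; the grid-search fit and toy scores below are a generic sketch, not the authors' procedure.

```python
import math

# Sketch of temperature scaling: divide raw triple scores by T before the
# sigmoid, choosing T to minimize negative log-likelihood on held-out data.
def nll(scores, labels, temperature):
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s / temperature))
        p = min(max(p, 1e-12), 1 - 1e-12)     # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(scores)

def fit_temperature(scores, labels):
    grid = [0.25 * k for k in range(1, 41)]   # candidate T in (0, 10]
    return min(grid, key=lambda t: nll(scores, labels, t))

# Overconfident toy model: large score magnitudes but noisy labels, so the
# fitted temperature should exceed 1, softening the confidences.
scores = [4.0, 3.5, -4.0, 3.8, -3.6]
labels = [1, 0, 0, 1, 1]
T = fit_temperature(scores, labels)
```

Softened confidences like these are what the crowdsourcing experiments expose to annotators, so that a score of 0.9 can be read as "right about nine times in ten."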

Special issue on knowledge graphs and semantics in text analysis and retrieval

Knowledge graphs are an effective way to store semantics in a structured format that is easily used by computer systems. In the past few decades, work across different research communities led to scalable knowledge acquisition techniques for building large-scale knowledge graphs. The result is the emergence of large publicly available knowledge graphs (KGs) such as Wikidata, DBpedia, Freebase, and others. While knowledge graphs are designed to support a wide set of different applications, this special issue focuses on the use case of text retrieval and analysis.

Utilizing knowledge graphs for text analysis requires effective alignment techniques that associate segments of unstructured text with entries in the knowledge graph, for example using entity extraction and linking algorithms. A wide range of approaches that combine query-document representations and machine learning repeatedly demonstrate significant improvements for such tasks across diverse domains. The goal of this special issue is to summarize recent progress in research and practice in constructing, grounding, and utilizing knowledge graphs and similar semantic resources for text retrieval and analysis applications. The scope includes acquisition, alignment, and utilization of knowledge graphs and other semantic resources for the purpose of optimizing end-to-end performance of information retrieval systems.

For this special issue we selected six articles out of 23 submissions. Each article was reviewed by at least three reviewers and underwent at least one revision. More literature on how to effectively use knowledge graphs in information retrieval can be found in the proceedings of the KG4IR Workshop series.

  • [PDF] [DOI] L. Dietz, C. Xiong, J. Dalton, and E. Meij, “Special issue on knowledge graphs and semantics in text analysis and retrieval,” Information retrieval journal, 2019.
    [Bibtex]
    @article{IRJ:2019:Dietz,
    Author = {Dietz, Laura and Xiong, Chenyan and Dalton, Jeff and Meij, Edgar},
    Date-Added = {2019-03-12 20:19:31 +0000},
    Date-Modified = {2019-03-12 20:19:39 +0000},
    Day = {04},
    Doi = {10.1007/s10791-019-09354-z},
    Issn = {1573-7659},
    Journal = {Information Retrieval Journal},
    Month = {Mar},
    Title = {Special issue on knowledge graphs and semantics in text analysis and retrieval},
    Url = {https://doi.org/10.1007/s10791-019-09354-z},
    Year = {2019},
    Bdsk-Url-1 = {https://doi.org/10.1007/s10791-019-09354-z}}

Related Entity Finding on Highly-heterogeneous Knowledge Graphs

In this paper, we study the problem of domain-specific related entity finding on highly-heterogeneous knowledge graphs where the task is to find related entities with respect to a query entity. As we are operating in the context of knowledge graphs, our solutions will need to be able to deal with heterogeneous data with multiple objects and a high number of relationship types, and be able to leverage direct and indirect connections between entities. We propose two novel graph-based related entity finding methods: one based on learning to rank and the other based on subgraph propagation in a Bayesian framework. We perform contrastive experiments with a publicly available knowledge graph and show that both our proposed models manage to outperform a strong baseline based on supervised random walks. We also investigate the results of our proposed methods and find that they improve results for different types of query entities.

  • [PDF] R. Reinanda, E. Meij, J. Pantony, and J. Dorando, “Related entity finding on highly-heterogeneous knowledge graphs,” in ASONAM, 2018.
    [Bibtex]
    @inproceedings{ASONAM:2018:Reinanda,
Author = {Reinanda, Ridho and Meij, Edgar and Pantony, Joshua and Dorando, Jonathan},
    Booktitle = {ASONAM},
    Date-Added = {2018-09-27 21:43:39 +0100},
    Date-Modified = {2018-09-27 21:55:03 +0100},
    Series = {{ASONAM} '18},
    Title = {Related Entity Finding on Highly-heterogeneous Knowledge Graphs},
    Year = {2018},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/3209978.3210031},
    Bdsk-Url-2 = {https://doi.org/10.1145/3209978.3210031}}