WWW 2020

Novel Entity Discovery from Web Tables

When working with any sort of knowledge base (KB), one has to make sure it is as complete and as up-to-date as possible. Both tasks are non-trivial, as they require recall-oriented efforts to determine which entities and relationships are missing from the KB. As such, they require a significant amount of labor. Tables on the Web, on the other hand, are abundant and have the distinct potential to assist with these tasks. In particular, we can leverage the content in such tables to discover new entities, properties, and relationships. Because web tables typically only contain raw textual content, we first need to determine which cells refer to which known entities, a task we dub table-to-KB matching. This first task aims to infer table semantics by linking table cells and heading columns to elements of a KB. We propose a feature-based method and, on two public test collections, we demonstrate substantial improvements over the state-of-the-art in terms of precision whilst also improving recall. The second task builds upon these linked entities and properties to not only identify novel ones in the same table but also to bootstrap their type and additional relationships. We refer to this process as novel entity discovery and, to the best of our knowledge, it is the first endeavor on mining the unlinked cells in web tables. Our method identifies not only out-of-KB (“novel”) information but also novel aliases for in-KB (“known”) entities. When evaluated using three purpose-built test collections, we find that our proposed approaches obtain a marked improvement in terms of precision over our baselines whilst keeping recall stable.

  • [PDF] [DOI] S. Zhang, E. Meij, K. Balog, and R. Reinanda, “Novel entity discovery from web tables,” in Proceedings of The Web Conference 2020, New York, NY, USA, 2020, p. 1298–1308.
    [Bibtex]
    @inproceedings{WWW:2020:Zhang,
    Address = {New York, NY, USA},
    Author = {Zhang, Shuo and Meij, Edgar and Balog, Krisztian and Reinanda, Ridho},
    Booktitle = {Proceedings of The Web Conference 2020},
    Date-Added = {2020-06-03 06:23:41 +0100},
    Date-Modified = {2020-06-03 06:24:53 +0100},
    Doi = {10.1145/3366423.3380205},
    Isbn = {9781450370233},
    Keywords = {tabular data extraction, Novel entity discovery, entity linking, KBP},
    Location = {Taipei, Taiwan},
    Numpages = {11},
    Pages = {1298--1308},
    Publisher = {Association for Computing Machinery},
    Series = {WWW '20},
    Title = {Novel Entity Discovery from Web Tables},
    Url = {https://doi.org/10.1145/3366423.3380205},
    Year = {2020},
    Bdsk-Url-1 = {https://doi.org/10.1145/3366423.3380205}}

Special issue on knowledge graphs and semantics in text analysis and retrieval

Knowledge graphs are an effective way to store semantics in a structured format that is easily used by computer systems. In the past few decades, work across different research communities led to scalable knowledge acquisition techniques for building large-scale knowledge graphs. The result is the emergence of large publicly available knowledge graphs (KGs) such as Wikidata, DBpedia, Freebase, and others. While knowledge graphs are designed to support a wide set of different applications, this special issue focuses on the use case of text retrieval and analysis.

Utilizing knowledge graphs for text analysis requires effective alignment techniques that associate segments of unstructured text with entries in the knowledge graph, for example using entity extraction and linking algorithms. A wide range of approaches that combine query-document representations and machine learning repeatedly demonstrate significant improvements for such tasks across diverse domains. The goal of this special issue is to summarize recent progress in research and practice in constructing, grounding, and utilizing knowledge graphs and similar semantic resources for text retrieval and analysis applications. The scope includes acquisition, alignment, and utilization of knowledge graphs and other semantic resources for the purpose of optimizing end-to-end performance of information retrieval systems.

For this special issue we selected six articles out of 23 submissions. Each article was reviewed by at least three reviewers and underwent at least one revision. More literature on how to effectively use knowledge graphs in information retrieval can be found in the proceedings of the KG4IR Workshop series.

  • [PDF] [DOI] L. Dietz, C. Xiong, J. Dalton, and E. Meij, “Special issue on knowledge graphs and semantics in text analysis and retrieval,” Information Retrieval Journal, 2019.
    [Bibtex]
    @article{IRJ:2019:Dietz,
    Author = {Dietz, Laura and Xiong, Chenyan and Dalton, Jeff and Meij, Edgar},
    Date-Added = {2019-03-12 20:19:31 +0000},
    Date-Modified = {2019-03-12 20:19:39 +0000},
    Day = {04},
    Doi = {10.1007/s10791-019-09354-z},
    Issn = {1573-7659},
    Journal = {Information Retrieval Journal},
    Month = {Mar},
    Title = {Special issue on knowledge graphs and semantics in text analysis and retrieval},
    Url = {https://doi.org/10.1007/s10791-019-09354-z},
    Year = {2019},
    Bdsk-Url-1 = {https://doi.org/10.1007/s10791-019-09354-z}}

Weakly-supervised Contextualization of Knowledge Graph Facts

Knowledge graphs (KGs) model facts about the world; they consist of nodes (entities such as companies and people) that are connected by edges (relations such as founderOf). Facts encoded in KGs are frequently used by search applications to augment result pages. When presenting a KG fact to the user, providing other facts that are pertinent to that main fact can enrich the user experience and support exploratory information needs. KG fact contextualization is the task of augmenting a given KG fact with additional and useful KG facts. The task is challenging because of the large size of KGs; discovering other relevant facts even in a small neighborhood of the given fact results in an enormous number of candidates. We introduce a neural fact contextualization method (NFCM) to address the KG fact contextualization task. NFCM first generates a set of candidate facts in the neighborhood of a given fact and then ranks the candidate facts using a supervised learning-to-rank model. The ranking model combines features that we automatically learn from data and that represent the query-candidate facts with a set of hand-crafted features we devised or adjusted for this task. In order to obtain the annotations required to train the learning-to-rank model at scale, we generate training data automatically using distant supervision on a large entity-tagged text corpus. We show that ranking functions learned on this data are effective at contextualizing KG facts. Evaluation using human assessors shows that NFCM significantly outperforms several competitive baselines.
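The two-stage shape described above (generate candidates in the neighborhood of a fact, then rank them) can be illustrated with a minimal Python sketch. This is not NFCM itself: the toy KG, the `Fact` type, the one-hop neighborhood definition, and the hand-set `RELATION_WEIGHT` scores are all assumptions made for illustration, standing in for the learned ranking model from the paper.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str


# Toy KG; every entity and relation here is made up for illustration.
KG = [
    Fact("alice", "founderOf", "acme"),
    Fact("alice", "bornIn", "paris"),
    Fact("acme", "headquarteredIn", "london"),
    Fact("bob", "ceoOf", "acme"),
    Fact("bob", "bornIn", "oslo"),
    Fact("carol", "bornIn", "rome"),
]

# Hypothetical hand-set scores standing in for a learned ranking model.
RELATION_WEIGHT = {"ceoOf": 0.9, "headquarteredIn": 0.6, "bornIn": 0.3}


def generate_candidates(query: Fact, kg: list[Fact]) -> list[Fact]:
    """Return facts that share at least one entity with the query fact."""
    query_entities = {query.subject, query.obj}
    return [
        fact
        for fact in kg
        if fact != query and {fact.subject, fact.obj} & query_entities
    ]


def contextualize(query: Fact, kg: list[Fact]) -> list[Fact]:
    """Rank neighborhood candidates by the toy relation score, best first."""
    candidates = generate_candidates(query, kg)
    return sorted(
        candidates,
        key=lambda fact: RELATION_WEIGHT.get(fact.relation, 0.0),
        reverse=True,
    )


ranked = contextualize(Fact("alice", "founderOf", "acme"), KG)
for fact in ranked:
    print(fact)
```

Even in this tiny graph the candidate set grows with every entity that touches the query fact, which hints at why the ranking stage carries most of the weight in a real KG.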

  • [PDF] [DOI] N. Voskarides, E. Meij, R. Reinanda, A. Khaitan, M. Osborne, G. Stefanoni, P. Kambadur, and M. de Rijke, “Weakly-supervised contextualization of knowledge graph facts,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, New York, NY, USA, 2018, p. 765–774.
    [Bibtex]
    @inproceedings{SIGIR:2018:Voskarides,
    Acmid = {3210031},
    Address = {New York, NY, USA},
    Author = {Voskarides, Nikos and Meij, Edgar and Reinanda, Ridho and Khaitan, Abhinav and Osborne, Miles and Stefanoni, Giorgio and Kambadur, Prabhanjan and de Rijke, Maarten},
    Booktitle = {The 41st {International ACM SIGIR Conference on Research} \& {Development in Information Retrieval}},
    Date-Added = {2018-07-26 18:23:41 +0000},
    Date-Modified = {2018-09-27 21:55:17 +0100},
    Doi = {10.1145/3209978.3210031},
    Isbn = {978-1-4503-5657-2},
    Keywords = {distant supervision, fact contextualization, knowledge graphs},
    Location = {Ann Arbor, MI, USA},
    Numpages = {10},
    Pages = {765--774},
    Publisher = {ACM},
    Series = {SIGIR '18},
    Title = {Weakly-supervised Contextualization of Knowledge Graph Facts},
    Url = {http://doi.acm.org/10.1145/3209978.3210031},
    Year = {2018},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/3209978.3210031},
    Bdsk-Url-2 = {https://doi.org/10.1145/3209978.3210031}}

The Second Workshop on Knowledge Graphs and Semantics for Text Retrieval, Analysis, and Understanding (KG4IR)

Semantic technologies such as controlled vocabularies, thesauri, and knowledge graphs have been used throughout the history of information retrieval for a variety of tasks. Recent advances in knowledge acquisition, alignment, and utilization have given rise to a body of new approaches for utilizing knowledge graphs in text retrieval tasks. It is therefore time to consolidate the community efforts and study how such technologies can be employed in information retrieval systems in the most effective way. It is also time to start and deepen the dialogue between researchers and practitioners in order to ensure that breakthroughs, technologies, and algorithms in this space are widely disseminated. The goal of this workshop, co-located with SIGIR 2018, is to bring together and grow a community of researchers and practitioners who are interested in using, aligning, and constructing knowledge graphs and similar semantic resources for information retrieval applications. See https://kg4ir.github.io/ for more info.

  • [PDF] [DOI] L. Dietz, C. Xiong, J. Dalton, and E. Meij, “The second workshop on knowledge graphs and semantics for text retrieval, analysis, and understanding (KG4IR),” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, New York, NY, USA, 2018, p. 1423–1426.
    [Bibtex]
    @inproceedings{SIGIR:2018:Dietz-WS,
    Acmid = {3210196},
    Address = {New York, NY, USA},
    Author = {Dietz, Laura and Xiong, Chenyan and Dalton, Jeff and Meij, Edgar},
    Booktitle = {The 41st International ACM SIGIR Conference on Research \& Development in Information Retrieval},
    Date-Added = {2018-07-26 18:25:34 +0000},
    Date-Modified = {2018-07-26 18:31:50 +0000},
    Doi = {10.1145/3209978.3210196},
    Isbn = {978-1-4503-5657-2},
    Keywords = {entity linking, entity retrieval, entity-oriented search, information retrieval, knowledge graphs},
    Location = {Ann Arbor, MI, USA},
    Numpages = {4},
    Pages = {1423--1426},
    Publisher = {ACM},
    Series = {SIGIR '18},
    Title = {The Second Workshop on Knowledge Graphs and Semantics for Text Retrieval, Analysis, and Understanding (KG4IR)},
    Url = {http://doi.acm.org/10.1145/3209978.3210196},
    Year = {2018},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/3209978.3210196},
    Bdsk-Url-2 = {https://doi.org/10.1145/3209978.3210196}}

Utilizing Knowledge Graphs for Text-Centric Information Retrieval

The past decade has witnessed the emergence of several publicly available and proprietary knowledge graphs (KGs). The depth and breadth of content in these KGs made them not only rich sources of structured knowledge by themselves, but also valuable resources for search systems. A surge of recent developments in entity linking and entity retrieval methods gave rise to a new line of research that aims at utilizing KGs for text-centric retrieval applications. This tutorial is the first to summarize and disseminate the progress in this emerging area to industry practitioners and researchers.

  • [PDF] [DOI] L. Dietz, A. Kotov, and E. Meij, “Utilizing knowledge graphs for text-centric information retrieval,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, New York, NY, USA, 2018, p. 1387–1390.
    [Bibtex]
    @inproceedings{SIGIR:2018:Dietz-Tut,
    Acmid = {3210187},
    Address = {New York, NY, USA},
    Author = {Dietz, Laura and Kotov, Alexander and Meij, Edgar},
    Booktitle = {The 41st International ACM SIGIR Conference on Research \& Development in Information Retrieval},
    Date-Added = {2018-07-26 18:24:31 +0000},
    Date-Modified = {2018-07-26 18:31:50 +0000},
    Doi = {10.1145/3209978.3210187},
    Isbn = {978-1-4503-5657-2},
    Keywords = {entity linking, entity retrieval, information retrieval, knowledge graphs},
    Location = {Ann Arbor, MI, USA},
    Numpages = {4},
    Pages = {1387--1390},
    Publisher = {ACM},
    Series = {SIGIR '18},
    Title = {Utilizing Knowledge Graphs for Text-Centric Information Retrieval},
    Url = {http://doi.acm.org/10.1145/3209978.3210187},
    Year = {2018},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/3209978.3210187},
    Bdsk-Url-2 = {https://doi.org/10.1145/3209978.3210187}}

Overview of The First Workshop on Knowledge Graphs and Semantics for Text Retrieval and Analysis (KG4IR)

Knowledge graphs have been used throughout the history of information retrieval for a variety of tasks. Advances in knowledge acquisition and alignment technology in the last few years have given rise to a body of new approaches for utilizing knowledge graphs in text retrieval tasks. This report presents the motivation, output, and outlook of the first workshop on Knowledge Graphs and Semantics for Text Retrieval and Analysis which was co-located with SIGIR 2017 in Tokyo, Japan. We aim to assess where we stand today, what future directions are, and which preconditions could lead to further performance increases.

  • [PDF] [DOI] L. Dietz, C. Xiong, and E. Meij, “Overview of the first workshop on knowledge graphs and semantics for text retrieval and analysis (KG4IR),” SIGIR Forum, vol. 51, iss. 3, p. 139–144, 2018.
    [Bibtex]
    @article{Forum:2018:Dietz,
    Acmid = {3190601},
    Address = {New York, NY, USA},
    Author = {Dietz, Laura and Xiong, Chenyan and Meij, Edgar},
    Date-Added = {2018-07-26 18:22:37 +0000},
    Date-Modified = {2018-07-26 18:22:48 +0000},
    Doi = {10.1145/3190580.3190601},
    Issn = {0163-5840},
    Issue_Date = {December 2017},
    Journal = {SIGIR Forum},
    Month = 2,
    Number = {3},
    Numpages = {6},
    Pages = {139--144},
    Publisher = {ACM},
    Title = {Overview of The First Workshop on Knowledge Graphs and Semantics for Text Retrieval and Analysis (KG4IR)},
    Url = {http://doi.acm.org/10.1145/3190580.3190601},
    Volume = {51},
    Year = {2018},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/3190580.3190601},
    Bdsk-Url-2 = {https://doi.org/10.1145/3190580.3190601}}

The First Workshop on Knowledge Graphs and Semantics for Text Retrieval and Analysis (KG4IR)

Knowledge graphs have been used throughout the history of information retrieval for a variety of tasks. Advances in knowledge acquisition and alignment technology in the last few years have given rise to a body of new approaches for utilizing knowledge graphs in text retrieval tasks. This report presents the motivation, output, and outlook of the first workshop on Knowledge Graphs and Semantics for Text Retrieval and Analysis which was co-located with SIGIR 2017 in Tokyo, Japan. We aim to assess where we stand today, what future directions are, and which preconditions could lead to further performance increases. See https://kg4ir.github.io/ for more info.

  • [PDF] [DOI] L. Dietz, C. Xiong, and E. Meij, “The first workshop on knowledge graphs and semantics for text retrieval and analysis (KG4IR),” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2017, p. 1427–1428.
    [Bibtex]
    @inproceedings{SIGIR:2017:Dietz,
    Acmid = {3084371},
    Address = {New York, NY, USA},
    Author = {Dietz, Laura and Xiong, Chenyan and Meij, Edgar},
    Booktitle = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    Date-Added = {2018-07-26 18:17:39 +0000},
    Date-Modified = {2018-07-26 18:17:51 +0000},
    Doi = {10.1145/3077136.3084371},
    Isbn = {978-1-4503-5022-8},
    Keywords = {entities, information retrieval, knowledge graphs},
    Location = {Shinjuku, Tokyo, Japan},
    Numpages = {2},
    Pages = {1427--1428},
    Publisher = {ACM},
    Series = {SIGIR '17},
    Title = {The First Workshop on Knowledge Graphs and Semantics for Text Retrieval and Analysis (KG4IR)},
    Url = {http://doi.acm.org/10.1145/3077136.3084371},
    Year = {2017},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/3077136.3084371},
    Bdsk-Url-2 = {https://doi.org/10.1145/3077136.3084371}}

ECIR 2017

Generating descriptions of entity relationships

Large-scale knowledge graphs (KGs) store relationships between entities that are increasingly being used to improve the user experience in search applications. The structured nature of the data in KGs is typically not suitable to show to an end user, and applications that utilize KGs therefore benefit from human-readable textual descriptions of KG relationships. We present a method that automatically generates textual descriptions of entity relationships by combining textual and KG information. Our method creates sentence templates for a particular relationship and then generates a textual description of a relationship instance by selecting the best template and filling it with appropriate entities. Experimental results show that a supervised variation of our method outperforms other variations, as it best captures the semantic similarity between a relationship instance and a template whilst providing more contextual information.

  • [PDF] N. Voskarides, E. Meij, and M. de Rijke, “Generating descriptions of entity relationships,” in ECIR 2017: 39th European Conference on Information Retrieval, 2017.
    [Bibtex]
    @inproceedings{ECIR:2017:voskarides,
    Author = {Voskarides, Nikos and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {ECIR 2017: 39th European Conference on Information Retrieval},
    Date-Added = {2017-01-10 21:27:37 +0000},
    Date-Modified = {2017-01-10 21:27:58 +0000},
    Month = {April},
    Publisher = {Springer},
    Series = {LNCS},
    Title = {Generating descriptions of entity relationships},
    Year = {2017}}

WSDM 2017

Utilizing Knowledge Bases in Text-centric Information Retrieval (WSDM 2017)

The past decade has witnessed the emergence of several publicly available and proprietary knowledge graphs (KGs). The increasing depth and breadth of content in KGs make them not only rich sources of structured knowledge by themselves but also valuable resources for search systems. A surge of recent developments in entity linking and retrieval methods gave rise to a new line of research that aims at utilizing KGs for text-centric retrieval applications. This makes it an ideal time to pause and report current findings to the community, summarize successful approaches, and solicit new ideas. This tutorial is the first to disseminate the progress in this emerging field to researchers and practitioners.

CIKM 2016

Document Filtering for Long-tail Entities

Filtering relevant documents with respect to entities is an essential task in the context of knowledge base construction and maintenance. It entails processing a time-ordered stream of documents that might be relevant to an entity in order to select only those that contain vital information. State-of-the-art approaches to document filtering for popular entities are entity-dependent: they rely on and are also trained on the specifics of differentiating features for each specific entity. Moreover, these approaches tend to use so-called extrinsic information, such as Wikipedia page views and related entities, which is typically available only for popular head entities. Entity-dependent approaches based on such signals are therefore ill-suited as filtering methods for long-tail entities.