The explosive increase in the amount of unstructured textual data being produced in all kinds of domains (web, news, social media, corporate, etc.) calls for advanced methodologies for making sense of this data. Historically, automatic techniques for making sense of unstructured text consisted of rather crude classifications at the document level, into categories such as “sports,” “finance,” etc. Later, advances in machine learning, text mining, and natural language processing enabled a more fine-grained analysis, giving rise to approaches such as named entity recognition (NER). NER effectively moves the granularity of the analysis from the document level to the phrase level and enables identifying the type of a phrase, such as “person,” “location,” etc. Once limited to fairly narrow domains, NER techniques are now mainstream.
Recent advances have enabled an even more precise manner of analysis, where phrases (consisting of a single term or a sequence of terms) are automatically linked to entries in a knowledge base. This process is commonly known as entity linking. Entity linking facilitates advanced forms of searching and browsing in various domains and contexts. It can be used, for instance, to anchor textual resources in background knowledge; authors or readers of a piece of text may find that entity links supply useful pointers. Another application can be found in search engines, where it is increasingly common to link queries to entities and present entity-specific overviews. More and more, users want to find the actual entities that satisfy their information need, rather than merely the documents that mention them, a process known as entity retrieval.
It is common to consider entities from a general-purpose knowledge base such as Wikipedia or Freebase, since they provide sufficient coverage for most tasks and applications. Wikipedia is therefore a common target for entity linking and a fertile ground for research on entity retrieval. Its rich structure (including article link structure and categorization, infoboxes, etc.) makes it possible to advance over plain document retrieval. Approaches for linking and retrieving entities are not Wikipedia-specific, however. Recent developments in the Web of Data enable the use of domain- or task-specific entities. Alternatively, legacy or corporate knowledge bases can be used to provide entities. Entity linking and retrieval is also gaining popularity in the public domain, as witnessed by Wolfram Alpha, QWhisper, the Google Knowledge Graph, digital personal assistants such as Siri and Google Now, and entity-oriented vertical search results for places, products, etc. Examples of such collections and applications will be used throughout the tutorial as illustrative cases. Entity linking and retrieval is an emerging research area and no textbooks exist on the subject. With two relevant tracks in the research track and four relevant workshops, WWW2013 is the ideal venue for this tutorial.