Information Extraction
NGSLT - Short course
Lecturer
Dr. Mark Stevenson
Department of Computer Science, University of Sheffield
Course web page
Date: April 21-24, 2008
Location:
Room 231a, Reykjavik University, Ofanleiti 2, Reykjavik 103, Iceland (see travel information).
Format: Lectures and exercises
Time slots for lectures:
- Monday April 21st: 9:30-11:30, 13:00-15:00
- Tuesday April 22nd: 9:30-11:30, 13:00-15:00
- Wednesday April 23rd: 9:30-11:30, 13:00-15:00
- Thursday April 24th: 9:30-13:00
ECTS credits: 4
Assessment: pass/fail grade
Minimum number of registered students: 5
Participants
Goals
- Introduce Information Extraction (IE) as a language technology
- Outline the basic approaches to evaluating IE systems
- Describe a variety of approaches to IE using knowledge-based and
machine learning methodologies (both supervised and unsupervised)
- Discuss the relations between complexity of linguistic descriptions
and the difficulty of IE on particular tasks
Summary of contents
Information Extraction (IE) is an important language technology which aims to identify specific types of information in documents. IE has been applied to a variety of domains, including the mining of text such as news or biomedical articles, and the Semantic Web. For example, IE systems have been created which identify the movements of executives within companies from newspaper reports, and which identify interactions between proteins from scientific journals.
This course will consist of (1) an overview of IE systems and their components,
including a description of early approaches which relied on
hand-written rules, (2) a description of evaluation methodologies commonly
used for IE systems including the Message Understanding Conferences,
(3) the use of machine learning algorithms to assist in the
development and adaptation of IE systems, thereby avoiding the need
for expert domain knowledge which is often difficult to obtain, and
(4) analysis of various linguistic considerations which affect the
difficulty of IE tasks.
Literature
The material in the course will be based on a number of research papers. The following list includes some sample papers:
- J. Cowie and Y. Wilks. Information Extraction. In R. Dale, H. Moisl and H. Somers (eds.) Handbook of Natural Language Processing. New York: Marcel Dekker, 2000.
- S. Soderland. Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, 34(1):233-272, 1999.
- R. Yangarber, R. Grishman, S. Huttunen, and P. Tapanainen. Unsupervised Discovery of Scenario-Level Patterns for Information Extraction. In 6th Applied Natural Language Processing Conference, 2000.
- M. Stevenson and M. Greenwood. A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pp. 379-386, Ann Arbor, Michigan, June 2005.
- K. Sudo, S. Sekine, and R. Grishman. An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 2003.
- M. Stevenson and M. Greenwood. Comparing Information Extraction Pattern Models. In "Information Extraction Beyond the Document" workshop at COLING/ACL-2006, 2006, pp. 12-19.
Pre-course preparation
Students registered for this course need to study the following papers and slides before the course starts:
Background on Information Extraction
Background on some relevant Language Understanding Tools
- McDonald, R. and Nivre, J. (2007) Introduction to Data Driven
Dependency Parsing, Lecture 1. Tutorial at ESSLLI 2007
- Pedersen, T., Patwardhan, S. and Michelizzi, J. (2004) WordNet::Similarity - Measuring the Relatedness of
Concepts. Proceedings of the Fifth Annual Meeting of the North
American Chapter of the Association for Computational Linguistics
(NAACL-04)
- Budanitsky, A. and Hirst, G. (2006) Evaluating
WordNet-based measures of semantic distance. Computational
Linguistics, 32(1), pp. 13-47.
Prerequisites
The course has no special prerequisites over and above what is required for
admission to NGSLT.
Course coordinators
Hrafn Loftsson, Reykjavik University, hrafn@ru.is
Eiríkur Rögnvaldsson, University of Iceland, eirikur@hi.is
Last modified April 14, 2008