Logical and Relational Learning
Luc De Raedt

This textbook covers logical and relational learning in depth, and hence provides an introduction to inductive logic programming (ILP), multi-relational data mining (MRDM) and (statistical) relational learning (SRL). These subfields of data mining and machine learning are concerned with the analysis of complex and structured data sets that arise in numerous applications, such as bio- and chemoinformatics, network analysis, Web mining, and natural language processing, within the rich representations offered by relational databases and computational logic.
The author introduces the machine learning and representational foundations of the field and explains important techniques in detail using classic case studies centered around well-known logical and relational systems.
The book is suitable for use in graduate courses and should be of interest to graduate students and researchers in computer science, databases and artificial intelligence, as well as practitioners of data mining and machine learning. It contains numerous figures and exercises, and slides are available for many chapters.
Cognitive Technologies

H. Prendinger, M. Ishizuka (Eds.), Life-Like Characters: Tools, Affective Functions, and Applications. IX, 477 pages, 2004
H. Helbig, Knowledge Representation and the Semantics of Natural Language. XVIII, 646 pages, 2006
P. M. Nugues, An Introduction to Language Processing with Perl and Prolog: An Outline of Theories, Implementation, and Application with Special Consideration of English, French, and German. XX, 513 pages, 2006
W. Wahlster (Ed.), SmartKom: Foundations of Multimodal Dialogue Systems. XVIII, 644 pages, 2006
B. Goertzel, C. Pennachin (Eds.), Artificial General Intelligence. XVI, 509 pages, 2007
O. Stock, M. Zancanaro (Eds.), PEACH — Intelligent Interfaces for Museum Visits. XVIII, 316 pages, 2007
V. Torra, Y. Narukawa, Modeling Decisions: Information Fusion and Aggregation Operators. XIV, 284 pages, 2007
P. Manoonpong, Neural Preprocessing and Control of Reactive Walking Machines: Towards Versatile Artificial Perception–Action Systems. XVI, 185 pages, 2007
S. Patnaik, Robot Cognition and Navigation: An Experiment with Mobile Robots. XVI, 290 pages, 2007
M. Cord, P. Cunningham (Eds.), Machine Learning Techniques for Multimedia: Case Studies on Organization and Retrieval. XVI, 290 pages, 2008
L. De Raedt, Logical and Relational Learning. XVI, 388 pages, 2008

Cognitive Technologies
Managing Editors: D. M. Gabbay, J. Siekmann
Editorial Board: A. Bundy, J. G. Carbonell, M. Pinkal, H. Uszkoreit, M. Veloso, W. Wahlster, M. J. Wooldridge
Advisory Board: Luigia Carlucci Aiello, Franz Baader, Wolfgang Bibel, Leonard Bolc, Craig Boutilier, Ron Brachman, Bruce G. Buchanan, Anthony Cohn, Artur d’Avila Garcez, Luis Fariñas del Cerro, Koichi Furukawa, Georg Gottlob, Patrick J. Hayes, James A. Hendler, Anthony Jameson, Nick Jennings, Aravind K. Joshi, Hans Kamp, Martin Kay, Hiroaki Kitano, Robert Kowalski, Sarit Kraus, Maurizio Lenzerini, Hector Levesque, John Lloyd, Alan Mackworth, Mark Maybury, Tom Mitchell, Johanna D. Moore, Stephen H. Muggleton, Bernhard Nebel, Sharon Oviatt, Luis Pereira, Lu Ruqian, Stuart Russell, Erik Sandewall, Luc Steels, Oliviero Stock, Peter Stone, Gerhard Strube, Katia Sycara, Milind Tambe, Hidehiko Tanaka, Sebastian Thrun, Junichi Tsujii, Kurt VanLehn, Andrei Voronkov, Toby Walsh, Bonnie Webber

Luc De Raedt
Logical and Relational Learning
With 77 Figures and 10 Tables

Author: Prof. Dr. Luc De Raedt, Dept. of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Heverlee, Belgium. luc.deraedt@cs.kuleuven.be
Managing Editors: Prof. Dov M. Gabbay, Augustus De Morgan Professor of Logic, Department of Computer Science, King’s College London, Strand, London WC2R 2LS, UK; Prof. Dr. Jörg Siekmann, Forschungsbereich Deduktions- und Multiagentensysteme, DFKI, Stuhlsatzenweg 3, Geb. 43, 66123 Saarbrücken, Germany

ISBN: 978-3-540-20040-6    e-ISBN: 978-3-540-68856-3
Cognitive Technologies ISSN: 1611-2482
Library of Congress Control Number: 2008933706
ACM Computing Classification (1998): I.2.6, I.2.4, H.2.8, I.5.1

© 2008 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Cover design: KuenkelLopka GmbH
Printed on acid-free paper
springer.com

To Lieve, Maarten and Soetkin

Preface

I use the term logical and relational learning to refer to the subfield of artificial intelligence, machine learning and data mining that is concerned with learning in expressive logical or relational representations. It is the union of inductive logic programming, (statistical) relational learning and multi-relational data mining, which have all contributed techniques for learning from data in relational form. Even though some early contributions to logical and relational learning are about forty years old now, it was only with the advent of inductive logic programming in the early 1990s that the field became popular. Whereas initial work was often concerned with logical (or logic programming) issues, the focus rapidly shifted to the discovery of new and interpretable knowledge from structured data, often in the form of rules, and soon important successes were realized in applications in domains such as bio- and chemoinformatics and computational linguistics. Today, the challenges and opportunities of dealing with structured data and knowledge have been taken up by the artificial intelligence community at large and form the motivation for a lot of ongoing research. Indeed, graph, network and multi-relational data mining are now popular themes in data mining, and statistical relational learning is receiving a lot of attention in the machine learning and uncertainty in artificial intelligence communities. In addition, the range of tasks for which logical and relational techniques have been developed now covers almost all machine learning and data mining tasks. On the one hand, these developments have resulted in a new role and novel views on logical and relational learning; on the other hand, they have also made it increasingly difficult to obtain an overview of the field as a whole.
This book aims to address these needs by providing a new synthesis of logical and relational learning. It constitutes an attempt to summarize some of the key results about logical and relational learning; it covers a wide range of topics and techniques, and it describes them in a uniform and accessible manner. While the author has tried to select a representative set of topics and techniques from the field of logical and relational learning, he also realizes that he is probably biased by his own research interests and views on the field. Furthermore, rather than providing detailed accounts of the many specific systems and techniques, the book focuses on the underlying principles, which should enable the reader to easily access and understand the relevant literature on logical and relational learning. To this end, suggestions for further reading are provided at the end of each chapter. The book is intended for graduate students and researchers in artificial intelligence and computer science, especially those in machine learning, data mining, uncertainty in artificial intelligence, and computational logic, with an interest in learning from relational data. The book is the first textbook on logical and relational learning and is suitable for use in graduate courses, though it can also be used for self-study and as a reference. It contains many different examples and exercises. Teaching material will become available from the author’s website. The author would also appreciate receiving feedback, suggestions for improvement and needed corrections by email to luc.deraedt@cs.kuleuven.be.

The book starts with an introductory chapter clarifying the nature, motivations and history of logical and relational learning. Chapter 2 provides a gentle introduction to logic and logic programming, which will be used throughout the book as the representation language.
Chapter 3 introduces the idea of learning as search and provides a detailed account of some fundamental machine learning algorithms that will play an important role in later chapters. In Chapter 4, a detailed study of a hierarchy of different representations used in machine learning and data mining is given, and two techniques (propositionalization and aggregation) for transforming expressive representations into simpler ones are introduced. Chapter 5 is concerned with the theoretical basis of the field. It studies the generality relation in logic and the relation between induction and deduction, and introduces the most important frameworks and operators for generality. In Chapter 6, a methodology for developing logical and relational learning systems is presented and illustrated using a number of well-known case studies that learn relational rules, decision trees and frequent queries. The methodology starts from existing learning approaches and upgrades them towards the use of rich representations. Whereas the first six chapters are concerned with the foundations of logical and relational learning, the chapters that follow introduce more advanced techniques. Chapter 7 focuses on learning the definitions of multiple relations, that is, on learning theories. This chapter covers abductive reasoning, the use of integrity constraints, program synthesis, and the use of an oracle. Chapter 8 covers statistical relational learning, which combines probabilistic models with logical and relational learning. The chapter starts with a gentle introduction to graphical models before turning to probabilistic logics. The use of kernels and distances for logical and relational learning is addressed in Chapter 9, and in Chapter 10 computational issues such as efficiency considerations and learnability results are discussed. Finally, Chapter 11 summarizes the most important lessons learned about logical and relational learning.
The author suggests reading it early on, possibly even directly after Chapter 1.

An introductory course on logical and relational learning covers most of the material in Chapters 1 to 4, Sects. 5.1–5.4 and 5.9, and Chapters 6 and 11. The other chapters do not depend on one another, and hence further chapters can be selected according to the interests and preferences of the reader. Given the interest in statistical relational learning, the author certainly recommends Chapter 8. Advanced sections and exercises are marked with a * or even with **. They are more challenging, but can be skipped without loss of continuity.

This book could not have been written without the help and encouragement of many people. The author is indebted to a number of co-workers who contributed ideas, techniques, surveys and views that have found their way into this book, including: Maurice Bruynooghe, for influencing the use of logic in this book and numerous suggestions for improvement; Hendrik Blockeel, Luc Dehaspe and Wim Van Laer, for contributions to the upgrading methodology described in Chapter 6; Kristian Kersting, for joint work on statistical relational learning presented in Chapter 8; and Jan Ramon, for his work on distances in Chapter 9. This book has also taken inspiration from a number of joint overview papers and tutorials that the author delivered in collaboration with Hendrik Blockeel, Sašo Džeroski, Kristian Kersting, Nada Lavrač and Stephen Muggleton. The author would also like to thank the editor at Springer, Ronan Nugent, for his patience, help, and support during all phases of this book-writing project. The author is grateful for the feedback and encouragement he received on the many earlier versions of this book.
He would like to thank especially: the reading clubs at the University of Bristol (headed by Peter Flach, and involving Kerstin Eder, Robert Egginton, Steve Gregory, Susanne Hoche, Simon Price, Simon Rawles and Ksenia Shalonova) and at the Katholieke Universiteit Leuven (Hendrik Blockeel, Björn Bringmann, Maurice Bruynooghe, Fabrizio Costa, Tom Croonenborghs, Anton Dries, Kurt Driessens, Daan Fierens, Christophe Costa Florencio, Elisa Fromont, Robby Goetschalkx, Bernd Gutmann, Angelika Kimmig, Niels Landwehr, Wannes Meert, Siegfried Nijssen, Stefan Raeymaekers, Leander Schietgat, Jan Struyf, Ingo Thon, Anneleen van Assche, Joaquin Vanschoren, Celine Vens, and Albrecht Zimmermann); several colleagues who gave feedback or discussed ideas in this book (James Cussens, Paolo Frasconi, Thomas Gaertner, Tamas Horvath, Manfred Jaeger, Martijn van Otterlo); and the students in Freiburg who used several previous versions of this book. Last but not least I would like to thank my wife, Lieve, and my children, Soetkin and Maarten, for their patience and love during the many years it took to write this book. I dedicate this book to them.

Destelbergen, May 2008
Luc De Raedt

Contents

1 Introduction
  1.1 What Is Logical and Relational Learning?
  1.2 Why Is Logical and Relational Learning Important?
    1.2.1 Structure Activity Relationship Prediction
    1.2.2 A Web Mining Example
    1.2.3 A Language Learning Example
  1.3 How Does Relational and Logical Learning Work?
  1.4 A Brief History

2 An Introduction to Logic
  2.1 A Relational Database Example
  2.2 The Syntax of Clausal Logic
  2.3 The Semantics of Clausal Logic — Model Theory
  2.4 Inference with Clausal Logic — Proof Theory
  2.5 Prolog and SLD-Resolution
  2.6 Historical and Bibliographic Remarks

3 An Introduction to Learning and Search
  3.1 Representing Hypotheses and Instances
  3.2 Boolean Data
  3.3 Machine Learning
  3.4 Data Mining
  3.5 A Generate-and-Test Algorithm
  3.6 Structuring the Search Space
  3.7 Monotonicity
  3.8 Borders
  3.9 Refinement Operators
  3.10 A Generic Algorithm for Mining and Learning
  3.11 A Complete General-to-Specific Algorithm
  3.12 A Heuristic General-to-Specific Algorithm
  3.13 A Branch-and-Bound Algorithm
  3.14 A Specific-to-General Algorithm
  3.15 Working with Borders*
    3.15.1 Computing a Single Border
    3.15.2 Computing Two Borders
    3.15.3 Computing Two Borders Incrementally
    3.15.4 Operations on Borders
  3.16 Conclusions
  3.17 Bibliographical Notes

4 Representations for Mining and Learning
  4.1 Representing Data and Hypotheses
  4.2 Attribute-Value Learning
  4.3 Multiple-Instance Learning: Dealing with Sets
  4.4 Relational Learning
  4.5 Logic Programs
  4.6 Sequences, Lists, and Grammars
  4.7 Trees and Terms
  4.8 Graphs
  4.9 Background Knowledge
  4.10 Designing It Yourself
  4.11 A Hierarchy of Representations*
    4.11.1 From AV to BL
    4.11.2 From MI to AV
    4.11.3 From RL to MI
    4.11.4 From LP to RL
  4.12 Propositionalization
    4.12.1 A Table-Based Approach
    4.12.2 A Query-Based Approach
  4.13 Aggregation
  4.14 Conclusions
  4.15 Historical and Bibliographical Remarks

5 Generality and Logical Entailment
  5.1 Generality and Logical Entailment Coincide
  5.2 Propositional Subsumption
  5.3 Subsumption in Logical Atoms
    5.3.1 Specialization Operators
    5.3.2 Generalization Operators*
    5.3.3 Computing the lgg and the glb
  5.4 Θ-Subsumption
    5.4.1 Soundness and Completeness
    5.4.2 Deciding Θ-Subsumption
    5.4.3 Equivalence Classes
  5.5 Variants of Θ-Subsumption*
    5.5.1 Object Identity*
    5.5.2 Inverse Implication*
  5.6 Using Background Knowledge
    5.6.1 Saturation and Bottom Clauses
    5.6.2 Relative Least General Generalization*
    5.6.3 Semantic Refinement*
  5.7 Aggregation*
  5.8 Inverse Resolution
  5.9 A Note on Graphs, Trees, and Sequences
  5.10 Conclusions
  5.11 Bibliographic Notes

6 The Upgrading Story
  6.1 Motivation for a Methodology
  6.2 Methodological Issues
    6.2.1 Representing the Examples
    6.2.2 Representing the Hypotheses
    6.2.3 Adapting the Algorithm
    6.2.4 Adding Features
  6.3 Case Study 1: Rule Learning and Foil
    6.3.1 Foil’s Problem Setting
    6.3.2 Foil’s Algorithm
  6.4 Case Study 2: Decision Tree Learning and Tilde
    6.4.1 The Problem Setting
    6.4.2 Inducing Logical Decision Trees
  6.5 Case Study 3: Frequent Item-Set Mining and Warmr
    6.5.1 Relational Association Rules and Local Patterns
    6.5.2 Computing Frequent Queries
  6.6 Language Bias
    6.6.1 Syntactic Bias
    6.6.2 Semantic Bias
  6.7 Conclusions
  6.8 Bibliographic Notes

7 Inducing Theories
  7.1 Introduction to Theory Revision
    7.1.1 Theories and Model Inference
    7.1.2 Theory Revision
    7.1.3 Overview of the Rest of This Chapter
  7.2 Towards Abductive Logic Programming
    7.2.1 Abduction
    7.2.2 Integrity Constraints
    7.2.3 Abductive Logic Programming
  7.3 Shapiro’s Theory Revision System
    7.3.1 Interaction
    7.3.2 The Model Inference System
  7.4 Two Propositional Theory Revision Systems*
    7.4.1 Learning a Propositional Horn Theory Efficiently
    7.4.2 Heuristic Search in Theory Revision
  7.5 Inducing Constraints
    7.5.1 Problem Specification
    7.5.2 An Algorithm for Inducing Integrity Constraints
  7.6 Conclusions
  7.7 Bibliographic Notes

8 Probabilistic Logic Learning
  8.1 Probability Theory Review
  8.2 Probabilistic Logics
    8.2.1 Probabilities on Interpretations
    8.2.2 Probabilities on Proofs
  8.3 Probabilistic Learning
    8.3.1 Parameter Estimation
    8.3.2 Structure Learning
  8.4 First-Order Probabilistic Logics
    8.4.1 Probabilistic Interpretations
    8.4.2 Probabilistic Proofs
  8.5 Probabilistic Logic Learning
    8.5.1 Learning from Interpretations
    8.5.2 Learning from Entailment
    8.5.3 Learning from Proof Trees and Traces
  8.6 Relational Reinforcement Learning*
    8.6.1 Markov Decision Processes
    8.6.2 Solving Markov Decision Processes
    8.6.3 Relational Markov Decision Processes
    8.6.4 Solving Relational Markov Decision Processes
  8.7 Conclusions
  8.8 Bibliographic Notes

9 Kernels and Distances for Structured Data
  9.1 A Simple Kernel and Distance
  9.2 Kernel Methods
    9.2.1 The Max Margin Approach
    9.2.2 Support Vector Machines
    9.2.3 The Kernel Trick
  9.3 Distance-Based Learning
    9.3.1 Distance Functions
    9.3.2 The k-Nearest Neighbor Algorithm
    9.3.3 The k-Means Algorithm
  9.4 Kernels for Structured Data
    9.4.1 Convolution and Decomposition
    9.4.2 Vectors and Tuples
    9.4.3 Sets and Multisets
    9.4.4 Strings
    9.4.5 Trees and Atoms
    9.4.6 Graph Kernels*
  9.5 Distances and Metrics
    9.5.1 Generalization and Metrics
    9.5.2 Vectors and Tuples
    9.5.3 Sets
    9.5.4 Strings
    9.5.5 Atoms and Trees
    9.5.6 Graphs
  9.6 Relational Kernels and Distances
  9.7 Conclusions
  9.8 Bibliographical and Historical Notes

10 Computational Aspects of Logical and Relational Learning
  10.1 Efficiency of Relational Learning
    10.1.1 Coverage as θ-Subsumption
    10.1.2 θ-Subsumption Empirically
    10.1.3 Optimizing the Learner for θ-Subsumption
  10.2 Computational Learning Theory*
    10.2.1 Notions of Learnability
    10.2.2 Positive Results
    10.2.3 Negative Results
  10.3 Conclusions
  10.4 Historical and Bibliographic Notes

11 Lessons Learned
  11.1 A Hierarchy of Representations
  11.2 From Upgrading to Downgrading
  11.3 Propositionalization and Aggregation
  11.4 Learning Tasks
  11.5 Operators and Generality
  11.6 Unification and Variables
  11.7 Three Learning Settings
  11.8 Knowledge and Background Knowledge
  11.9 Applications

References
Author Index
Index

1 Introduction

The field of logical and relational learning, which is introduced in this chapter, is motivated by the limitations of traditional symbolic machine learning and data mining systems that largely work with propositional representations.
These limitations are clarified using three case studies: predicting the activity of chemical compounds from their structure; link mining, where properties of websites are discovered; and learning a simple natural language interface for a database. After sketching how logical and relational learning works, the chapter ends with a short sketch of the history of this subfield of machine learning and data mining, as well as a brief overview of the rest of this book.

1.1 What Is Logical and Relational Learning?

Artificial intelligence has many different subfields. Logical and relational learning combines principles and ideas of two of the most important subfields of artificial intelligence: machine learning and knowledge representation. Machine learning is the study of systems that improve their behavior over time with experience. In many cases, especially in a data mining context, the experience consists of a set of observations or examples in which one searches for patterns, regularities or classification rules that provide valuable new insights into the data and that should ideally be readily interpretable by the user. The learning process then typically involves a search through various generalizations of the examples. In the past fifty years, a wide variety of machine learning techniques have been developed (see Mitchell [1997] or Langley [1996] for an overview). However, most of the early techniques were severely limited from a knowledge representation perspective. Indeed, most of the early techniques (such as decision trees [Quinlan, 1986], Bayesian networks [Pearl, 1988], perceptrons [Nilsson, 1990], or association rules [Agrawal et al., 1993]) could only handle data and generalizations in a limited representation language, which was essentially propositional. Propositional representations (based on boolean or propositional logic) cannot elegantly represent domains involving multiple entities as well as the relationships amongst them.
One such domain is that of social networks, where the persons are the entities and their social interactions are characterized by their relationships. The representational limitations of the early machine learning systems carried over to the resulting learning and data mining techniques, which were in turn severely limited in their application domain. Various machine learning researchers, such as Ryszard Michalski [1983] and Gordon Plotkin [1970], soon realized these limitations and started to employ more expressive knowledge representation frameworks for learning. Research focused on using frameworks that were able to represent a variable number of entities as well as the relationships that hold amongst them. Such representations are called relational. When they are grounded in or derived from first-order logic, they are called logical representations. The interest in learning using these expressive representation formalisms soon resulted in the emergence of a new subfield of artificial intelligence that I now describe as logical and relational learning. Logical and relational learning is thus viewed in this book as the study of machine learning and data mining within expressive knowledge representation formalisms encompassing relational or first-order logic. It specifically targets learning problems involving multiple entities and the relationships amongst them. Throughout the book we shall mostly be using logic as a representation language for describing data and generalizations, because logic is inherently relational, it is expressive, understandable, and interpretable, and it is well understood. It provides solid theoretical foundations for many developments within artificial intelligence and knowledge representation. At the same time, it enables one to specify and employ background knowledge about the domain, which is often also a key factor determining success in many applications of artificial intelligence.

1.2 Why Is Logical and Relational Learning Important?
To answer this question, let us look at three important applications of logical and relational learning. The first is concerned with learning to classify a set of compounds as active or inactive [Srinivasan et al., 1996], the second with analyzing a website [Craven and Slattery, 2001], and the third with learning a simple natural language interface to a database system [Mooney, 2000]. In these applications there are typically a variable number of entities as well as relationships amongst them. This makes it very hard, if not impossible, to use more traditional machine learning methods that work with fixed feature vectors or attribute-value representations. Using relational or logical representations, these problems can be alleviated, as we will show throughout the rest of this book.

1.2.1 Structure-Activity Relationship Prediction

Consider the compounds shown in Fig. 1.1. Two of the molecules are active and two are inactive. The learning task now is to find a pattern that discriminates the actives from the inactives. This type of task is an important one in computational chemistry. It is often called structure-activity relationship (SAR) prediction, and it forms an essential step in understanding various processes related to drug design and discovery [Srinivasan and King, 1999b], toxicology [Helma, 2005], and so on. The figure also shows a so-called structural alert, which allows one to distinguish the actives from the inactives because the structural alert is a substructure (subgraph) that matches both of the actives but none of the inactives. At the same time, the structural alert is readily interpretable and provides useful insights into the factors determining the activity.

[Fig. 1.1. Predicting mutagenicity. The figure shows two active compounds (nitrofurazone and 4-nitropenta[cd]pyrene), two inactive compounds (6-nitro-7,8,9,10-tetrahydrobenzo[a]pyrene and 4-nitroindole), and the structural alert that matches both actives but neither inactive. Reprinted from [Srinivasan et al., 1996], p. 288, © 1996, with permission from Elsevier.]

Traditional machine learning methods employ the single-tuple single-table assumption, which assumes that the data can be represented using attribute-value pairs. Within this representation, each example (or compound) corresponds to a single row or tuple in a table, and each feature or attribute to a single column; cf. Table 1.1. For each example and attribute, the cells of the table specify the value of the attribute for the specific example, for instance, whether or not a benzene ring is present. We shall call representations that can easily be mapped into the single-tuple single-table format propositional.

Table 1.1. A table-based representation

Compound  Attribute1  Attribute2  Attribute3  Class
1         true        false       true        active
2         true        true        true        active
3         false       false       true        inactive
4         true        false       true        inactive

From a user perspective, the difficulty with this type of representation is the mismatch between the graphical and structured two-dimensional representation of the molecules and the flat representation in the table. In order to use the flat representation, the user must first determine the features or attributes of interest. In structure-activity relationship prediction, these are sometimes called fingerprints. Finding these features is in itself a non-trivial task, as there exists a vast number of potentially interesting features. Furthermore, the result of the learning process will critically depend on the quality of the employed features. Even though there exist a number of specialized tools to tackle this kind of task (involving the use of libraries of fingerprints), the question arises as to whether there exist general-purpose machine learning systems able to cope with such structured representations directly. The answer to this question is affirmative, as logical and relational learners directly deal with structured data.

Example 1.1.
The graphical structure of one of the compounds can be represented by means of the following tuples, which we call facts:

active(f1) ←
logmutag(f1, 0.64) ←
lumo(f1, −1.785) ←
logp(f1, 1.01) ←
atom(f1, f1_1, c, 21, 0.187) ←
atom(f1, f1_2, c, 21, −0.143) ←
atom(f1, f1_3, c, 21, −0.143) ←
atom(f1, f1_4, c, 21, −0.013) ←
atom(f1, f1_5, o, 52, −0.043) ←
...
bond(f1, f1_1, f1_2, 7) ←
bond(f1, f1_2, f1_3, 7) ←
bond(f1, f1_3, f1_4, 7) ←
bond(f1, f1_4, f1_5, 7) ←
bond(f1, f1_8, f1_9, 2) ←
bond(f1, f1_8, f1_10, 2) ←
bond(f1, f1_1, f1_11, 1) ←
bond(f1, f1_11, f1_12, 2) ←
bond(f1, f1_11, f1_13, 1) ←

In this encoding, each entity is given a name and the relationships among the entities are captured. For instance, in the above example, the compound is named f1 and its atoms f1_1, f1_2, .... Furthermore, the relation atom/5 of arity 5 states properties of the atoms: the molecule they occur in (e.g., f1), the element (e.g., c denoting a carbon) and the type (e.g., 21), as well as the charge (e.g., 0.187). The relationships amongst the atoms are then captured by the relation bond/4, which represents the bindings amongst the atoms. Finally, there are also overall properties or attributes of the molecule, such as their logp and lumo values. Further properties of the compounds could be mentioned, such as the functional groups or ring structures they contain:

ring_size_5(f1, [f1_5, f1_1, f1_2, f1_3, f1_4]) ←
hetero_aromatic_5_ring(f1, [f1_5, f1_1, f1_2, f1_3, f1_4]) ←
...

The first tuple states that there is a ring of size 5 in the compound f1 that involves the atoms f1_5, f1_1, f1_2, f1_3 and f1_4; the second one states that this is a hetero-aromatic ring.
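To make the relational encoding concrete, here is a minimal sketch (not from the book's own tooling) that stores a fragment of these facts as plain Python relations, i.e., sets of tuples, and answers a small conjunctive query over them. Atom identifiers follow the f1_1, f1_2, ... naming above; the query itself is invented for illustration.

```python
# Relational encoding of (part of) compound f1. Argument order follows the
# text: atom(Compound, AtomId, Element, Type, Charge) and
# bond(Compound, Atom1, Atom2, BondType).
atom = {
    ("f1", "f1_1", "c", 21, 0.187),
    ("f1", "f1_2", "c", 21, -0.143),
    ("f1", "f1_3", "c", 21, -0.143),
    ("f1", "f1_4", "c", 21, -0.013),
    ("f1", "f1_5", "o", 52, -0.043),
}
bond = {
    ("f1", "f1_1", "f1_2", 7),
    ("f1", "f1_2", "f1_3", 7),
    ("f1", "f1_3", "f1_4", 7),
    ("f1", "f1_4", "f1_5", 7),
}
ring_size_5 = {
    ("f1", ("f1_5", "f1_1", "f1_2", "f1_3", "f1_4")),
}

# A simple conjunctive query over these relations: which atoms of f1 belong
# to a ring of size 5 and carry a negative partial charge?
negative_ring_atoms = {
    a_id
    for (mol, ring) in ring_size_5
    for (m, a_id, elem, typ, charge) in atom
    if m == mol and a_id in ring and charge < 0
}
```

The point of the sketch is only that the data stay structured: nothing forces the compound into a single fixed-length row, and queries are joins over the relations.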
Using this representation it is possible to describe the structural alert in the form of a rule:

active(M) ← ring_size_5(M, R), element(A1, R), bond(M, A1, A2, 2)

which actually reads as: Molecule M is active IF it contains a ring of size 5 called R and atoms A1 and A2 that are connected by a double (2) bond such that A1 also belongs to the ring R. (This form of description is sometimes called 'Sternberg English' in inductive logic programming, after the computational biologist Michael Sternberg, who has been involved in several pioneering scientific applications of logical learning.)

The previous example illustrates the use of logical representations for data mining. It is actually based on the well-known mutagenicity application of relational learning due to Srinivasan et al. [1996], where the structural alert was discovered using the inductive logic programming system Progol [Muggleton, 1995] and the representation employed above. The importance of this type of application is clear when considering that the results were published in the scientific literature in the application domain [King and Srinivasan, 1996], that they were obtained using a general-purpose inductive logic programming algorithm, and that they were transparent to the experts in the domain. The combination of these factors has seldom been achieved in artificial intelligence.

1.2.2 A Web Mining Example

While structure-activity relationship prediction involves the mining of a (potentially large) set of small graphs, link mining and discovery is concerned with the analysis of a single large graph or network [Getoor, 2003, Getoor and Diehl, 2005]. To illustrate link mining, we consider the best-known example of a network, that is, the Internet, even though link mining is applicable in other contexts as well, for instance, in social networks, protein networks, and bibliographic databases. The following example is inspired by the influential WebKB example of Craven and Slattery [2001].

Example 1.2.
Consider the website of a typical university. It contains several web pages that each describe a wide variety of entities. These entities belong to a wide variety of classes, such as student, faculty, staff, department, course, project, and publication. Let us assume that for each such web page, there is a corresponding tuple in our relational database. Consider, for instance, the following facts (where the urls denote particular URLs):

faculty(url1, stephen) ←
faculty(url2, john) ←
course(url3, logic_for_learning) ←
project(url4, april2) ←
department(url5, computer_science) ←
student(url6, hiroaki) ←
...

In addition, there are relationships among these entities. For instance, various pages refer to one another, and such a link, for example the one from url6 to url1, denotes a particular relationship, in this case the relationship between the student hiroaki and his adviser stephen. This can again be modeled as facts in a relational database:

adviser(stephen, hiroaki) ←
teaches(john, logic_for_learning) ←
belongsTo(stephen, computer_science) ←
follows(hiroaki, logic_for_learning) ←
...

Again, the structure of the problem can elegantly be represented using a relational database. This representation can easily be extended with additional background knowledge in the form of rules:

studentOf(Lect, Stud) ← teaches(Lect, Course), follows(Stud, Course)

which expresses that Stud is a studentOf Lect IF Lect teaches a Course and Stud follows the Course. Further information could be contained in the data set, such as extracts of the text appearing on the different links or pages. There are several interesting link mining tasks in this domain. It is for instance possible to learn to predict the classes of web pages or the nature of the relationships encoded by links between web pages; cf. [Getoor, 2003, Craven and Slattery, 2001, Chakrabarti, 2002, Baldi et al., 2003].
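In database terms, the studentOf rule above is simply a join of the teaches and follows relations on the shared Course variable. A minimal sketch, assuming only the two facts listed in the example:

```python
# Facts from the example: teaches(Lecturer, Course), follows(Student, Course).
teaches = {("john", "logic_for_learning")}
follows = {("hiroaki", "logic_for_learning")}

# studentOf(Lect, Stud) <- teaches(Lect, Course), follows(Stud, Course)
# i.e., join the two relations on the shared Course variable.
student_of = {
    (lect, stud)
    for (lect, course) in teaches
    for (stud, course2) in follows
    if course == course2
}
```

Evaluating the rule body against the facts derives the new tuple studentOf(john, hiroaki); this is exactly how background-knowledge rules extend the database.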
To address such tasks using logical or relational learning, one has to start from a set of examples of relations that are known to hold. For instance, the fact that stephen is the adviser of hiroaki is a (positive) example stating that the link from hiroaki to stephen belongs to the relation adviser. If, for a given university website, say that of the University of Freiburg, all hyperlinks were labeled (by hand) with the corresponding relationships, one could then learn general rules using logical and relational learning that would allow one to predict the labels of unseen hyperlinks. These rules could then be applied to determine the labels of the hyperlinks at another university website, say that of the University of Leuven. An example of a rule that might be discovered in this context is:

adviser(Prof, Stud) ← webpage(Stud, Url), student(Url), contains(Url, adviser), contains(Url, Prof)

which expresses that Prof is an adviser of Stud IF Stud has a webpage with Url of type student that contains the words adviser and Prof. To tackle such problems, one often combines logical and relational learning with probabilistic models. This topic will be introduced in Chapter 8. Link mining problems in general, and the above example in particular, cannot easily be represented using the single-tuple single-table assumption (as in Table 1.1) without losing information.

Exercise 1.3. Try to represent the link mining example within the single-tuple single-table assumption and identify the problems with this approach.

Exercise 1.4. Sketch other application domains that cannot be modeled under this assumption.

1.2.3 A Language Learning Example

A third illustration of an application domain that requires dealing with knowledge as well as structured data is natural language processing.
Empirical natural language processing is now a major trend within the computational linguistics community [Manning and Schütze, 1999], and several logical and relational learning scientists have contributed interesting techniques and applications; cf. [Cussens and Džeroski, 2000, Mooney, 2000]. As an illustration of this line of work, let us look at one of the applications of Raymond Mooney's group, which pioneered the use of logical and relational learning techniques for language learning. More specifically, we sketch how to learn to parse database queries in natural language, closely following Zelle and Mooney [1996]. (For ease of exposition, we use a slightly simplified notation.) The induced semantic parser is the central component of a question-answering system.

Example 1.5. Assume you are given a relational database containing information about geography. The database contains information about various basic entities such as countries, cities, states, rivers and places. In addition, it contains facts about the relationships among them, for instance, capital(C, Y), which specifies that C is the capital of Y, loc(X, Y), which states that X is located in Y, nextTo(X, Y), which states that X is located next to Y, and many other relationships. The task could then be to translate queries formulated in natural language to database queries that can be executed by the underlying database system. For instance, the query in natural language

What are the major cities in Kansas?

could be translated to the database query

answer(C, (major(C), city(C), loc(C, S), equal(S, kansas)))

This last query can then be passed on to the database system and executed. The database system then generates all entities C that are major, a city, and located in kansas.
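As a sketch of what executing the translated query amounts to, the conjunction can be evaluated over toy relations. The geography facts below (which cities exist, and which count as major) are invented purely for illustration:

```python
# Hypothetical geography facts: city(C), major(C), loc(C, S).
city = {"wichita", "topeka", "kansas_city", "lawrence"}
major = {"wichita", "kansas_city", "denver"}
loc = {
    ("wichita", "kansas"),
    ("topeka", "kansas"),
    ("kansas_city", "kansas"),
    ("denver", "colorado"),
}

# answer(C, (major(C), city(C), loc(C, S), equal(S, kansas))):
# collect every C for which the whole conjunction succeeds.
answer = {c for (c, s) in loc if s == "kansas" and c in city and c in major}
```

The database system enumerates exactly the bindings of C that satisfy all four conjuncts, which is what the learned parser's output query asks for.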
Zelle and Mooney's learning system [1996] starts from examples, which consist of queries in natural language and in database format, from an elementary shift-reduce parser, and from some background knowledge about the domain. The task is then to learn control knowledge for the parser. Essentially, the parser has to learn the conditions under which to apply the different operators in the shift-reduce parser. The control knowledge is represented using a set of clauses (IF rules), as in the previous two case studies. The control rules need to take into account knowledge about the stack used by the parser, the structure of the database, and the semantics of the language. This is again hard (if not impossible) to represent under the single-tuple single-table assumption.

1.3 How Does Relational and Logical Learning Work?

Symbolic machine learning and data mining techniques essentially search a space of possible patterns, models or regularities. Depending on the task, different search algorithms and principles apply. For instance, consider the structure-activity relationship prediction task and assume that one is searching for all structural alerts that occur in at least 20% of the actives and at most 2% of the inactives. In this case, a complete search strategy is applicable. On the other hand, if one is looking for a structural alert that separates the actives from the inactives and could be used for classification, a heuristic search method such as hill climbing is more appropriate. Data mining is often viewed as the process of computing the set of patterns Th(Q, D, L) [Mannila and Toivonen, 1997], which can be defined as follows (cf. also Chapter 3). The search space consists of all patterns expressible within a language of patterns L. For logical and relational learning this will typically
be a set of rules or clauses of the type we encountered in the case studies of the previous section; the data set D consists of the examples that need to be generalized; and, finally, the constraint Q specifies which patterns are of interest. The constraint typically depends on the data mining task tackled, for instance, finding a single structural alert in a classification setting, or finding all alerts satisfying particular frequency thresholds. So, the set Th(Q, D, L) can be defined as the set of all patterns h ∈ L that satisfy the constraint Q(h, D) with respect to the data set D. A slightly different perspective is given by the machine learning view, which is often formulated as that of finding a particular function h (again belonging to a language of possible functions L) that minimizes a loss function l(h, D) on the data. Using this view, the natural language application of the previous subsection can be modeled more easily, as the goal is to learn a function mapping statements in natural language to database queries. An adequate loss function is the accuracy of the function, that is, the fraction of database queries that are correctly predicted. The machine learning and data mining views can be reconciled, for instance, by requiring that the constraint Q(h, D) succeeds only when l(h, D) is minimal; cf. Chapter 3. Central in the definition of the constraint Q(h, D) or the loss function l(h, D) is the covers relation between the data and the rules. It specifies when a rule covers an example, or, equivalently, when an example satisfies a particular rule. There are various possible ways to represent examples and rules, and these result in different possible choices for the covers relation (cf. Chapter 4). The most popular choice is that of learning from entailment. It is also the setting employed in the case studies above.

Example 1.6. To illustrate the notion of coverage, let us reconsider Ex. 1.1 and let us also simplify it a bit.
An example could now be represented by the rule:

active(m1) ← atom(m1, m11, c), ..., atom(m1, m1n, c), bond(m1, m11, m12, 2), ..., bond(m1, m11, m13, 1), ring_size_5(m1, [m15, m11, m12, m13, m14]), ...

Consider now the rule

active(M) ← ring_size_5(M, R), atom(M, M1, c)

which actually states that Molecule M is active IF it contains a ring of size 5 called R and an atom M1 that is a carbon (c). The rule covers the example because the conditions in the rule are satisfied by the example when setting M = m1, R = [m15, m11, m12, m13, m14] and M1 = m11. The reader familiar with logic (see also Chapter 2) will recognize that the example e is a logical consequence of the rule r, which is sometimes written as r ⊨ e.

Now that we know what to compute, we can look at how this can be realized. The computation of the solutions proceeds typically by searching the space of possible patterns or hypotheses L. One way of realizing this is to employ a generate-and-test algorithm, though this is too naive to be efficient. Therefore symbolic machine learning and data mining techniques typically structure the space L according to generality. One pattern or hypothesis is more general than another if all examples that are covered by the latter pattern are also covered by the former.

Example 1.7. For instance, the rule

active(M) ← ring_size_5(M, R), element(A1, R), bond(M, A1, A2, 2)

is more general than the rule

active(M) ← ring_size_5(M, R), element(A1, R), bond(M, A1, A2, 2), atom(M, A2, o, 52, C)

which reads as: Molecule M is active IF it contains a ring of size 5 called R and atoms A1 and A2 that are connected by a double (2) bond such that A1 also belongs to the ring R and atom A2 is an oxygen of type 52. The former rule is more general (or, equivalently, the latter one is more specific) because the latter one also requires that the atom connected to the ring of size 5 be an oxygen of atom type 52.
Therefore, all molecules satisfying the latter rule will also satisfy the former one. The generality relation is quite central during the search for solutions. The reason is that the generality relation can often be used 1) to prune the search space, and 2) to guide the search towards the more promising parts of the space. The generality relation is employed by the large majority of logical and relational learning systems, which often search the space in a general-to-specific fashion. This type of system starts from the most general rule (the unconditional rule, which states that all molecules are active in our running example), and then repeatedly specializes it using a so-called refinement operator. Refinement operators map rules onto a set of specializations; cf. Chapters 3 and 5.

Example 1.8. Consider the rule active(Mol) ← atom(Mol, Atom, c, Type, Charge), which states that a molecule is active if it contains a carbon atom. Refinements of this rule include:

active(Mol) ← atom(Mol, Atom, c, 21, Charge)
active(Mol) ← atom(Mol, Atom, c, T, Charge), atom(Mol, Atom2, h, T2, Charge2)
active(Mol) ← atom(Mol, Atom, c, T, Charge), ring_size_5(Mol, Ring)
...

The first refinement states that the carbon atom must be of type 21, the second one requires that there be carbon as well as hydrogen atoms, and the third one that there be a carbon atom and a ring of size 5. Many more specializations are possible, and, in general, the operator depends on the description language and generality relation used. The generality relation can be used to prune the search. Indeed, assume that we are looking for rules that cover at least 20% of the active molecules and at most 1% of the inactive ones. If our current rule (say the second one in Ex. 1.7) only covers 18% of the actives, then we can prune away all specializations of that rule, because specialization can only decrease the number of covered examples.
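A general-to-specific search with a refinement operator and this kind of frequency-based pruning can be sketched as follows. This is an illustrative toy, not the book's algorithm: rules are simplified to sets of boolean conditions on a molecule, and the molecules and thresholds are invented. In effect it computes Th(Q, D, L) for a conjunction of frequency constraints Q:

```python
# Toy data: each molecule is described by the set of conditions it satisfies.
actives = [{"ring5", "carbon", "double_bond"},
           {"ring5", "carbon"},
           {"carbon"}]
inactives = [{"carbon", "double_bond"},
             {"carbon"}]
conditions = ["ring5", "carbon", "double_bond"]

def covers(rule, example):
    # A rule (a set of conditions) covers an example if all conditions hold.
    return rule <= example

def frequency(rule, examples):
    return sum(covers(rule, e) for e in examples) / len(examples)

def search(min_pos=0.5, max_neg=0.3):
    """Start from the empty (most general) rule and repeatedly add one
    condition (the refinement operator). A rule covering too few actives
    is pruned together with all its specializations, since specialization
    can only shrink coverage (anti-monotonicity)."""
    solutions, frontier, seen = [], [frozenset()], {frozenset()}
    while frontier:
        rule = frontier.pop()
        if frequency(rule, actives) < min_pos:
            continue  # prune: no specialization can regain coverage
        if frequency(rule, inactives) <= max_neg:
            solutions.append(rule)
        for cond in conditions:  # refinement: add one condition
            refined = rule | {cond}
            if refined not in seen:
                seen.add(refined)
                frontier.append(refined)
    return solutions

rules = search()
```

On this toy data the search returns the rules {ring5} and {ring5, carbon}, while every rule containing double_bond is pruned as soon as its coverage of the actives drops below the threshold, without ever enumerating its specializations.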
Conversely, if our current rule covers 2% of the inactives, then all generalizations of the rule cover at least as many inactives (as generalization can only increase the number of covered examples), and therefore these generalizations can safely be pruned away; cf. Chapter 3 for more details. Using logical description languages for learning provides us not only with a very expressive representation, but also with an excellent theoretical foundation for the field. This becomes clear when looking at the generality relation. It turns out that the generality relation coincides with logical entailment. Indeed, the above examples of the generality relation clearly show that the more general rule logically entails the more specific one. (This property holds when learning from entailment. In other settings, such as learning from interpretations, the property is reversed: the more specific hypothesis then entails the more general one; cf. Chapter 5.) So, the more specific rule is a logical consequence of the more general one, or, formulated differently, the more general rule logically entails the more specific one. Consider the simpler example: flies(X) ← bird(X) (if X is a bird, then X flies), which logically entails and which is clearly more general than the rule flies(X) ← bird(X), normal(X) (only normal birds fly). This property of the generalization relation provides us with an excellent formal basis for studying inference operators for learning. Indeed, because one rule is more general than another if the former entails the latter, deduction is closely related to specialization, as deductive operators can be used as specialization operators. At the same time, as we will see in detail in Chapter 5, one can obtain generalization (or inductive inference) operators by inverting deductive inference operators.

1.4 A Brief History

Logical and relational learning typically employs a form of reasoning known as inductive inference. This form of reasoning generalizes specific facts into general laws. It is commonly applied within the natural sciences, and therefore has been studied in the philosophy of science by several philosophers since Aristotle. For instance, Francis Bacon investigated an inductive methodology for scientific inquiry. The idea is that knowledge can be obtained by careful experimenting, observing, generalizing and testing of hypotheses. This is also known as empiricism, and various aspects of it have been studied by many other philosophers, including Hume, Mill, Peirce, Popper and Carnap. Inductive reasoning is fundamentally different from deductive reasoning in that the conclusions of inductive reasoning do not follow logically from their premises (the observations) but are merely cogent; that is, they can only be true with a certain probability. The reader may notice that this method is actually very close in spirit to that of logical and relational learning today. The key difference seems to be that logical and relational learning investigates computational approaches to inductive reasoning. Computational models of inductive reasoning and scientific discovery have been investigated since the very beginning of artificial intelligence. Several cognitive scientists, such as a team involving the Nobel prize winner Herbert A. Simon (see [Langley et al., 1987] for an overview), developed several models that explain how specific scientific theories could be obtained. Around the same time, other scientists (including Bruce Buchanan, Nobel prize winner Joshua Lederberg, Ed Feigenbaum, and Tom Mitchell [Buchanan and Mitchell, 1978]) started to develop learning systems that could assist scientists in discovering new scientific laws. Their system Meta-Dendral produced some new results in chemistry, which were amongst the first scientific discoveries made by an artificial intelligence system to be published in the scientific literature of the application domain.
These two lines of research have actually motivated many developments in logical and relational learning, although there is also a crucial difference between them. Whereas the mentioned approaches were domain-specific, the goal of logical and relational learning is to develop general-purpose inductive reasoning systems that can be applied across different application domains. The example concerning structure-activity relationships is a perfect illustration of the results of these developments in logical and relational learning. An interesting philosophical account of the relationship between these developments is given by Gillies [1996]. Supporting the scientific discovery process across different domains requires a solution to two important computational problems. First, as scientific theories are complex by their very nature, an expressive formalism is needed to represent them. Second, the inductive reasoning process should be able to employ the available background knowledge to obtain meaningful hypotheses. These two problems can to a large extent be solved by using logical representations for learning. The insight that various types of logical and relational representations can be useful for inductive reasoning and machine learning can be considered an outgrowth of two parallel developments in computer science and artificial intelligence. First, and most importantly, since the mid-1960s a number of researchers have proposed to use (variants of) predicate logic as a formalism for studying machine learning problems. This was motivated by severe limitations of the early machine learning systems, which essentially worked with propositional representations. Ranan Banerji [1964] was amongst the earliest advocates of the use of logic for machine learning. The logic he proposed was motivated by a pattern recognition task.
Banerji's work on logical descriptions provided inspiration for developing logical learning systems such as Confucius [Cohen and Sammut, 1982] and Marvin [Sammut and Banerji, 1986], which already incorporated the first inverse resolution operators; cf. Chapter 5. These systems learned incrementally and were able to employ the already learned concepts during further learning tasks. Around the same time, Ryszard Michalski [1983] developed his influential AQ and Induce systems, which address the traditional classification task that made machine learning so successful. Michalski's work stressed the importance of both learning readable descriptions and using background knowledge. He developed his own variant of a logical description language, the Variable Valued Logic (VVL), which is able to deal with structured data and relations. At the same time, within VVL, he suggested that induction be viewed as the inverse of deduction and proposed several inference rules for realizing this. This view can be traced back in the philosophy of science [Jevons, 1874] and still forms the basis for much of the theory of generalization, which is extensively discussed in Chapter 5. Theoretical properties of generalization and specialization were also studied by researchers such as Plotkin [1970], Reynolds [1970], Vere [1975] and Buntine [1988]. Especially Plotkin's Ph.D. work on θ-subsumption and relative subsumption, two generalization relations for clausal logic, has been very influential and still constitutes the main framework for generalization in logical learning. It will be extensively studied in Chapter 5. Second, there is the work on automatic programming [Biermann et al., 1984], which was concerned with synthesizing programs from examples of their input-output behavior, and where researchers such as Biermann and Feldman [1972], Summers [1977] and Shapiro [1983] contributed very influential systems and approaches.
Whereas Alan Biermann's work was concerned with synthesizing Turing machines, Phil Summers studied functional programs (LISP) and Ehud Shapiro studied the induction of logic programs, and hence contributed an inductive logic programming system avant la lettre. Shapiro's Model Inference System is still one of the most powerful program synthesis and inductive inference systems today. It will be studied in Chapter 7. In the mid-1980s, various researchers, including Bergadano et al. [1988], Emde et al. [1983], Morik et al. [1993], Buntine [1987] and Ganascia and Kodratoff [1986], contributed inductive learning systems that used relational or logical description languages. Claude Sammut [1993] gives an interesting account of this period in logical and relational learning. A breakthrough occurred when researchers started to realize that both problems (in automatic programming and machine learning) could be studied simultaneously within the framework of computational logic. It was the contribution of Stephen Muggleton [1991] to define the new research field of inductive logic programming (ILP) [Muggleton and De Raedt, 1994, Lavrač and Džeroski, 1994] as the intersection of inductive concept learning and logic programming, and to bring together researchers in these areas in the inductive logic programming workshops [Muggleton, 1992b] that have been organized annually since 1991. A new subfield of machine learning was born and attracted many scientists, especially in Japan, Australia and Europe (where two European projects on inductive logic programming were quite influential [De Raedt, 1996]). Characteristic of the early 1990s was that inductive logic programming developed firm theoretical foundations for logical learning, built on logic programming concepts.
In parallel, various well-known inductive logic programming systems were developed, including Foil [Quinlan, 1990], Golem [Muggleton and Feng, 1992], Progol [Muggleton, 1995], Claudien [De Raedt and Dehaspe, 1997], Mobal [Morik et al., 1993] and Linus [Lavrač and Džeroski, 1994]. Also, the first successes in real-life applications of inductive logic programming were realized by Ross King, Stephen Muggleton, Ashwin Srinivasan, and Michael Sternberg [King et al., 1992, King and Srinivasan, 1996, King et al., 1995, Muggleton et al., 1992]; see [Džeroski, 2001] for an overview. Due to the success of these applications and the difficulty of making true progress in program synthesis, the field soon focused on machine learning and data mining rather than on automatic programming. A wide variety of systems and techniques were developed that upgraded traditional machine learning systems towards the use of logic; cf. Chapter 6. During the mid-1990s, both the data mining and the uncertainty in artificial intelligence communities started to realize the limitations of the key representation formalisms they were using. Within the data mining community, the itemset representation employed in association rules [Agrawal et al., 1993] corresponds essentially to a Boolean or propositional logic, and the Bayesian network formalism [Pearl, 1988] defines a probability distribution over propositional worlds. These limitations motivated researchers to look again at more expressive representations derived from relational or first-order logic. Indeed, within the data mining community, the work on Warmr [Dehaspe et al., 1998], which discovers frequent queries and relational association rules from a relational database, was quite influential. It was successfully applied to a structure-activity relationship prediction task, and motivated several researchers to look into graph mining [Washio et al., 2005, Inokuchi et al., 2003, Washio and Motoda, 2003].
Researchers in data mining soon started to talk about (multi-)relational data mining (MRDM) (cf. [Džeroski and Lavrač, 2001]), and an annual series of workshops on multi-relational data mining was initiated [Džeroski et al., 2002, 2003, Džeroski and Blockeel, 2004, 2005]. A similar development took place in the uncertainty in artificial intelligence community. Researchers started to develop expressive probabilistic logics [Poole, 1993a, Breese et al., 1994, Haddawy, 1994, Muggleton, 1996], and started to study learning in these frameworks soon afterward [Sato, 1995, Friedman et al., 1999]. In the past few years, these researchers have gathered in the statistical relational learning (SRL) workshops [Getoor and Jensen, 2000, 2003, De Raedt et al., 2005, 2007a]. A detailed overview of and introduction to this area is contained in Chapter 8, and two recent volumes in this area are [Getoor and Taskar, 2007, De Raedt et al., 2008]. Statistical relational learning is one of the most exciting and promising areas for relational and logical learning today. To conclude, the field of logical and relational learning has a long history and is now being studied under different names: inductive logic programming, multi-relational data mining and (statistical) relational learning. The approach taken in this book is to stress the similarities between these trends rather than the differences, because the problems studied are essentially the same even though the formalisms employed may be different. The author hopes that this may contribute to a better understanding of this exciting field.

2 An Introduction to Logic

The data mining and machine learning approaches discussed in this book employ relational or logical representations. This chapter introduces relational database representations and their formalization in logic. Logic is employed because it is an expressive, well-understood and elegant formalism for representing knowledge.
We focus on definite clause logic, which forms the basis of computational logic and logic programming. The chapter introduces the syntax as well as the model-theoretic and proof-theoretic semantics of definite clause logic. Throughout this book an attempt is made to introduce all the necessary concepts and terminology in an intuitive and yet precise manner, without resorting to unnecessary formal proofs or theory. The reader interested in a more detailed theoretical account of clausal logic may want to consult [Lloyd, 1987, Genesereth and Nilsson, 1987, Flach, 1994, Hogger, 1990, Kowalski, 1979]; for a more detailed introduction to the programming language Prolog we refer the reader to [Flach, 1994, Bratko, 1990, Sterling and Shapiro, 1986].

2.1 A Relational Database Example

This book is concerned with mining relational data. The data to be mined will typically reside in a relational database. An example relational database, written in the logical notation employed throughout this book, is given in the following example.

Example 2.1. The database contains information about publications, authors and citations. The entry authorOf(lloyd, logic for learning) ← for the relation authorOf/2 represents the fact that lloyd is the author of the publication logic for learning. Similarly, reference(logic for learning, foundations of lp) ← is a fact stating that foundations of lp is in the bibliography of logic for learning.
reference(logic for learning, foundations of lp) ←
reference(logic for learning, learning logical definitions from relations) ←
reference(logic for learning, ai a modern approach) ←
reference(ai a modern approach, foundations of lp) ←
reference(ai a modern approach, ilp theory and methods) ←
reference(ilp theory and methods, learning logical definitions from relations) ←
reference(ilp theory and methods, foundations of lp) ←
authorOf(quinlan, learning logical definitions from relations) ←
authorOf(lloyd, foundations of lp) ←
authorOf(lloyd, logic for learning) ←
authorOf(russell, ai a modern approach) ←
authorOf(norvig, ai a modern approach) ←
authorOf(muggleton, ilp theory and methods) ←
authorOf(deraedt, ilp theory and methods) ←

Even though the database contains only a small extract of a real database, it will be used to introduce some of the key concepts of the knowledge representation formalism employed in this book. The formalism employed is that of clausal logic because it is an expressive representation formalism, it is well understood, and it provides us with the necessary theory and tools for relational learning. At the same time, it is closely related to relational database formalisms. This is illustrated by the bibliographic database just introduced and will become more apparent throughout this section (and Sect. 4.4). In logic, a relation is called a predicate p/n (e.g., authorOf/2), where p denotes the name of the predicate (authorOf) and n the arity (2), which indicates the number of arguments the predicate takes. The arguments of predicates are terms. Constants are a particular type of term that start with a lowercase character, for instance, quinlan and logic for learning. They refer to a particular object in the domain of discourse, such as the book “Logic for Learning” [Lloyd, 2003] or the author “J.R. Quinlan”. All entries in Ex.
2.1 are facts, which are expressions of the form p(t1, ..., tn) ←, where p/n is a predicate and the ti are terms. Facts correspond to tuples in a relational database. They express unconditional truths. For instance, the fact authorOf(lloyd, logic for learning) ← expresses that lloyd is the author of the publication logic for learning. As in relational databases, it is possible to query for the information that resides in the database.

Example 2.2. The query ← authorOf(lloyd, logic for learning) asks whether lloyd is the author of logic for learning. A theorem prover (such as Prolog) would return the answer yes, implying that the fact is true, that is, that it logically follows from the specified database. The query ← authorOf(lloyd, Pub) asks whether there is a publication authored by lloyd. The term Pub is called a variable, and by convention, variables start with an uppercase character. The theorem prover would answer yes and would also return the substitutions {Pub/logic for learning} and {Pub/foundations of lp}, which state the conditions under which the original query is true. For those readers familiar with SQL, a popular database language, we also specify the corresponding query in SQL:

SELECT Pub FROM authorOf WHERE Author = 'lloyd'

The query ← authorOf(Author, logic for learning), authorOf(Author, foundations of lp) asks whether there is an author who wrote logic for learning as well as foundations of lp. The single substitution that would be returned for this query is {Author/lloyd}. In SQL, this query is written as:

SELECT t1.Author
FROM authorOf t1, authorOf t2
WHERE t1.Author = t2.Author
  AND t1.Pub = 'logic for learning'
  AND t2.Pub = 'foundations of lp'

So, the “,” between the two atoms in the query corresponds to AND, and two occurrences of the same variable must be instantiated to the same constant in the substitution, which corresponds to a “join” in relational databases.
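The query-answering behavior just described can be sketched in a few lines of Python. This is an illustrative sketch, not part of the book: atoms are encoded as tuples, variables as uppercase strings, and the function names match and answer are our own choices. Answering a conjunctive query amounts to backtracking over the facts while accumulating a consistent substitution.

```python
# Illustrative sketch (not from the book): answering conjunctive queries over
# ground facts by backtracking, mirroring Example 2.2. Atoms are tuples
# (predicate, arg1, ..., argn); variables start with an uppercase letter.

FACTS = {
    ("authorOf", "lloyd", "logic for learning"),
    ("authorOf", "lloyd", "foundations of lp"),
    ("authorOf", "russell", "ai a modern approach"),
}

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def match(atom, fact, subst):
    """Extend subst so that atom (possibly with variables) equals the ground fact."""
    if len(atom) != len(fact):
        return None
    s = dict(subst)
    for a, f in zip(atom, fact):
        if is_var(a):
            if a in s and s[a] != f:   # same variable, conflicting bindings
                return None
            s[a] = f
        elif a != f:                   # constants must agree literally
            return None
    return s

def answer(query, subst=None):
    """Yield every substitution that makes each atom of the query a known fact."""
    subst = subst or {}
    if not query:
        yield subst
        return
    first, rest = query[0], query[1:]
    for fact in FACTS:
        s = match(first, fact, subst)
        if s is not None:
            yield from answer(rest, s)

# ← authorOf(lloyd, Pub)
print(sorted(s["Pub"] for s in answer([("authorOf", "lloyd", "Pub")])))
# → ['foundations of lp', 'logic for learning']
```

Note how a query with a repeated variable, such as ← authorOf(Author, logic for learning), authorOf(Author, foundations of lp), is answered by the same mechanism: the shared binding for Author enforces exactly the "join" discussed above.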
Suppose now that instead of being interested in knowing which publications refer to which other publications, we are interested in knowing which authors cite one another. They can be listed using the following query:

← authorOf(A, P), reference(P, Q), authorOf(B, Q)

Many answers would be generated for this query, including:

{A/lloyd, P/logic for learning, Q/foundations of lp, B/lloyd}
{A/lloyd, P/logic for learning, Q/ai a modern approach, B/russell}
...

If the query is posed a number of times, it is useful to define a new predicate cites/2 using the clause:

cites(A, B) ← authorOf(A, P), reference(P, Q), authorOf(B, Q)

This clause states that A cites B if A is the authorOf P, P references Q, and B is the authorOf Q. This clause corresponds to the definition of a view, or an intensional relation in relational database terminology, which can be implemented by the following statement in SQL (the reader unfamiliar with SQL can safely skip the SQL queries):

CREATE VIEW cites AS
SELECT a1.Author, a2.Author
FROM authorOf a1, authorOf a2, reference r
WHERE a1.Pub = r.Pub1 AND a2.Pub = r.Pub2

The effect is that the predicate cites can now be queried as if it contained the following facts:

cites(lloyd, lloyd) ←
cites(lloyd, russell) ←
cites(russell, lloyd) ←
cites(russell, muggleton) ←
cites(russell, deraedt) ←
cites(muggleton, quinlan) ←
cites(muggleton, lloyd) ←
cites(lloyd, quinlan) ←
cites(lloyd, norvig) ←
cites(norvig, lloyd) ←
cites(norvig, muggleton) ←
cites(norvig, deraedt) ←
cites(deraedt, quinlan) ←
cites(deraedt, lloyd) ←

Exercise 2.3. Pose a query to identify the authors who cite themselves. Use both the original database (consisting of reference/2 and authorOf/2) and the one where the cites predicate has been defined. Also define a new predicate self citation/1 that succeeds for those authors who cite themselves.
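The cites/2 view is, relationally speaking, a three-way join. As an illustrative sketch (not from the book, and using only a small hand-picked extract of the database), the same view can be computed in Python with a set comprehension whose equality conditions play the role of the shared logical variables P and Q:

```python
# Illustrative sketch (not from the book): computing the cites/2 view as a
# relational join over a small extract of the bibliographic database.
author_of = {
    ("lloyd", "logic for learning"),
    ("lloyd", "foundations of lp"),
    ("russell", "ai a modern approach"),
    ("norvig", "ai a modern approach"),
}
reference = {
    ("logic for learning", "foundations of lp"),
    ("logic for learning", "ai a modern approach"),
}

# cites(A, B) ← authorOf(A, P), reference(P, Q), authorOf(B, Q)
cites = {(a, b)
         for (a, p) in author_of
         for (p2, q) in reference if p2 == p     # shared variable P
         for (b, q2) in author_of if q2 == q}    # shared variable Q

print(sorted(cites))
# → [('lloyd', 'lloyd'), ('lloyd', 'norvig'), ('lloyd', 'russell')]
```

On this extract, lloyd cites himself (via foundations of lp) as well as russell and norvig (via ai a modern approach), in line with the materialized cites facts above.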
2.2 The Syntax of Clausal Logic

Whereas the previous section introduced some logical concepts in a rather informal manner, the present section introduces them more formally and precisely. A term t is a constant, a variable, or a compound term f(t1, ..., tn) composed of a function symbol f/n (where f is the name of the function and n is its arity) and n terms ti. Simple terms are, for instance, logic for learning (a constant) and Pub (a variable). We will use the convention that constants and function symbols start with a lowercase character and variables start with an uppercase character. Compound terms were not introduced so far because they do not belong to the relational subset of clausal logic. They are used to represent structured objects. Examples of compound terms include card(j, hearts), where card/2 is a function symbol, and j and hearts are constants. It can be used to denote the Jack of Hearts card. Compound terms can also be nested. For instance, one can denote the father of John using the term fatherOf(john). The grandfather of John would then be denoted as fatherOf(fatherOf(john)), that is, as the father of the father of John. This notation is also used to represent the natural numbers in logic: the number 0 is represented by 0, the number 1 by succ(0), the successor of 0, the number 2 by succ(succ(0)), etc. Lists are represented using the functor cons/2 (in Prolog this functor is often written as ·/2). The first argument of cons/2 denotes the first element of the list, the second argument denotes the rest of the list, and the constant nil is used to denote the empty list. For instance, the list [1, 2] is represented as cons(1, cons(2, nil)). (The list functor cons/2 is usually written in Prolog in infix notation. Using this notation, cons(A, B) is written as [A | B], and cons(A, cons(B, C)) as [A | [B | C]], or [A, B | C] for short.)
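The successor numerals and cons-lists above are ordinary nested terms, and it can help to see them materialized in a host language. The following is an illustrative sketch (not from the book): compound terms are encoded as Python tuples whose first component is the functor name, and the helper names succ, cons, numeral_value and to_list are our own.

```python
# Illustrative sketch (not from the book): compound terms as nested tuples
# (functor, arg1, ..., argn), showing the successor encoding of the naturals
# and the cons/nil encoding of lists described in the text.
def succ(t):            # succ/1 as a term constructor
    return ("succ", t)

def cons(head, tail):   # cons/2 as a term constructor
    return ("cons", head, tail)

NIL = "nil"             # the empty list constant
ZERO = "0"              # the constant 0

two = succ(succ(ZERO))      # the numeral succ(succ(0))
xs = cons(1, cons(2, NIL))  # the list [1, 2], i.e., cons(1, cons(2, nil))

def numeral_value(t):
    """Decode a successor numeral back into a Python int."""
    n = 0
    while t != ZERO:
        _, t = t        # peel off one application of succ/1
        n += 1
    return n

def to_list(t):
    """Decode a cons-term into a Python list."""
    out = []
    while t != NIL:
        _, head, t = t  # unpack ("cons", head, tail)
        out.append(head)
    return out

print(numeral_value(two))  # → 2
print(to_list(xs))         # → [1, 2]
```

The nesting depth of the tuple directly mirrors the nesting depth of the term, which is exactly why ground terms can represent arbitrarily structured objects.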
2.2 The Syntax of Clausal Logic 21 An atom is a formula of the form p(t1 , .., tn ), where p/n is a predicate symbol and the ti are terms. For instance, authorOf(lloyd, logic for learning), largerRank(ace, X), faceCard(j, hearts), and pair(card(j, hearts), card(j, diamonds)) are atoms. Although terms and atoms possess a similar syntax, their meaning is quite diﬀerent. These diﬀerences are clearest when looking at ground terms and atoms: ground terms represent objects in the domain of discourse, whereas ground atoms represent a particular relationship among the objects denoted by the terms appearing in the atom. Therefore, ground atoms possess a truthvalue, that is, they are either true or false, whereas ground terms (and objects) do not posses truthvalues. Indeed, it does not make sense to talk about the truthvalue of the constant green, the persons “Mary Ann’ or fatherOf(john), or the list con(a, cons(b, nil)). However, the atom parent(fatherOf(john), john) (for the predicate parent/2) would typically be true. By now, we are able to deﬁne clauses, which are the key constructs in clausal logic. Clausal logic is a subset of ﬁrstorder logic that forms the basis of logic programming and the programming language Prolog. It is frequently employed within artiﬁcial intelligence due to its uniform and simple representation, and the existence of eﬃcient inference engines or theorem provers for clausal logic. Deﬁnition 2.4. A clause is an expression of the form h1 ; · · · ; hn ← b1 , · · · , bm where the hi and bj are logical atoms. The symbol “,” stands for conjunction (and), the symbol “;” for disjunction (or), and “←” for implication (if). Furthermore, all variables are universally quantiﬁed (though this is not explicitly written). So, the clause female(X); male(X) ← human(X) speciﬁes that for all X, when X is human, X is also male or female. 
h1; · · · ; hn is referred to as the conclusion part, or head, of the clause, and b1, · · · , bm as the condition part, or body. It will often be convenient to employ the notation body(c) for {b1, · · · , bm} and head(c) for {h1, · · · , hn}, where c is the clause h1; · · · ; hn ← b1, · · · , bm. This set notation can be extended to the overall clause. For instance, the above clause c is represented by the set {h1, · · · , hn, ¬b1, · · · , ¬bm} of the literals it contains, where a literal is either a logical atom a or a negated atom ¬a. The set notation for clauses also reflects that a clause is a disjunction h1 ∨ · · · ∨ hn ∨ ¬b1 ∨ · · · ∨ ¬bm of literals. There exist several special types of clauses:

• facts, where n = 1 and m = 0,
• definite clauses, where n = 1,
• Horn clauses, where n = 1 or n = 0, and
• denials, where n = 0.

The clauses that are most often employed in logical and relational learning are facts and definite clauses, which we already encountered in the previous section. Denials represent negative information; for instance, ← human(dracula) specifies that dracula is not human, that is, that human(dracula) is false. Denials are used as queries or goals. The reasons for this will become clear soon. Typically, the database consists of a set of clauses. A set of clauses {c1, · · · , cn} specifies that all clauses in the set are true, that is, the set denotes the conjunction c1 ∧ · · · ∧ cn of the clauses ci. We will sometimes refer to such sets of clauses as (clausal) theories, databases, or knowledge bases. A substitution θ = {V1/t1, · · · , Vn/tn} is an assignment of terms t1, ..., tn to variables V1, ..., Vn, for instance, {A/lloyd, Pub/logic for learning}.
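Applying a substitution is a purely syntactic operation: every occurrence of a bound variable is replaced, simultaneously, by its assigned term. As an illustrative sketch (not from the book; the tuple encoding of terms and the name apply_subst are our own):

```python
# Illustrative sketch (not from the book): applying a substitution θ to a
# term or atom. Variables are uppercase strings; compound terms and atoms
# are tuples (functor, arg1, ..., argn); everything else is a constant.
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def apply_subst(t, theta):
    """Simultaneously replace every variable in t by its binding in theta."""
    if is_var(t):
        return theta.get(t, t)           # unbound variables stay untouched
    if isinstance(t, tuple):             # compound term or atom: recurse
        return tuple(apply_subst(arg, theta) for arg in t)
    return t                             # constants are unchanged

theta = {"X": "j", "Y": "diamonds"}
print(apply_subst(("card", "X", "Y"), theta))  # → ('card', 'j', 'diamonds')
```

The recursion into subterms is what makes the replacement work for nested terms such as pair(card(j, X), card(j, Y)) as well.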
The instantiated formula Fθ, where F is a term, atom, or clause and θ = {V1/t1, · · · , Vn/tn} is a substitution, is the formula obtained by simultaneously replacing all variables V1, ..., Vn in F by the terms t1, ..., tn. For instance, the formula card(X, Y)θ, where θ = {X/j, Y/diamonds}, denotes card(j, diamonds).

2.3 The Semantics of Clausal Logic — Model Theory

Now that the syntax of clausal logic has been defined, its semantics can be introduced. The semantics of logic is based on the concept of an interpretation. Interpretations are defined using the notion of a domain, which contains the set of objects that exist (in the interpretation). Roughly speaking, an interpretation of a set of clauses is an assignment that maps

• constants onto objects in the domain,
• function symbols f/n onto n-ary functions defined on the domain; these functions map an n-tuple of objects onto an object in the domain,
• predicates p/n onto n-ary relations defined on the domain; these relations represent a set of n-tuples of objects.

For instance, consider the constant john. It could represent a particular person named John Doe Jr. The function symbol fatherOf/1 could then represent the function that maps every person to his or her father. It could map John Doe Jr. onto John Doe Sr. The predicate parent/2 could then map to the parent relationship. These mappings then define the truth-values of atoms in the interpretation. For instance, the atom parent(fatherOf(john), john) would map to true, because John Doe Sr. (fatherOf(john)) is a parent of John Doe Jr. (john) in our interpretation, assuming that the tuple (John Doe Sr., John Doe Jr.) is in the relation denoted by parent. Because it is inconvenient and complicated to work with general interpretations, and because, when working with clausal logic, this is not really needed, we will restrict our attention to a special class of interpretations named after the French logician Jacques Herbrand.
A Herbrand interpretation uses the Herbrand domain as its domain. The Herbrand domain is defined as the set of all ground terms that can be constructed using the constants and function symbols that occur in the set of clauses considered.

Example 2.5. The Herbrand universe of the bibliographic database of Ex. 2.1 consists of all constants appearing in the facts, i.e., {quinlan, lloyd, muggleton, . . . , logic for learning, . . . , ilp theory and methods}.

Example 2.6. Now consider the clauses defining the natural numbers:

nat(0) ←
nat(succ(X)) ← nat(X)

The first clause states that 0 is a natural number; the second one that succ(X) is a natural number if X is one. The Herbrand universe in this case is the infinite set {0, succ(0), succ(succ(0)), . . .}.

A Herbrand interpretation now maps each ground term to itself. So each ground term refers to itself; for instance, john now refers to the object john instead of to some person such as John Doe. In a similar manner, predicates are mapped onto relations over the Herbrand universe. The result is that a Herbrand interpretation can be viewed as the set of ground atoms that are true in the interpretation. This is the view that we will employ throughout this book when talking about interpretations.

Definition 2.7. A Herbrand interpretation of a set of clauses is a set of ground atoms (over the constant, function and predicate symbols occurring in the set of clauses). All ground atoms in the interpretation are assumed to be true, and all others are assumed to be false.

Example 2.8. One possible Herbrand interpretation I1 over the bibliographic database of Ex. 2.1 is

{authorOf(russell, logic for learning), authorOf(russell, quinlan)}

The interpretation I2 consists of all the ground atoms occurring in the bibliographic database.

Example 2.9. Similarly, for the natural numbers specified above, I3, I4, and I5 defined below are Herbrand interpretations.
I3 = {}
I4 = {nat(0), nat(succ(0))}
I5 = {nat(0), nat(succ(0)), nat(succ(succ(0))), . . .}

Some interpretations do not really reflect the properties of the clauses that we have written down. For instance, the first interpretations in both the bibliographic database and the definition of the natural numbers are intuitively impossible given the clauses that we have written down. This motivates the following definition.

Definition 2.10. A Herbrand interpretation I is a Herbrand model for a set of clauses C if and only if for all clauses h1; · · · ; hn ← b1, · · · , bm ∈ C and for all ground substitutions θ: {b1θ, · · · , bmθ} ⊆ I → {h1θ, · · · , hnθ} ∩ I ≠ ∅.

So, a Herbrand interpretation I is a model for a clause c if for all substitutions θ for which body(c)θ is true in I, head(c)θ is also true in I. If an interpretation is a model for a clause (or set of clauses), the interpretation satisfies the clause (or set of clauses); otherwise, the interpretation violates the clause (or set of clauses). Clausal theories that possess a Herbrand model are called satisfiable; those that do not possess one are called unsatisfiable. When the context is clear, we will talk about interpretations and models rather than Herbrand interpretations and Herbrand models.

Example 2.11. Reconsider our bibliographic database and the interpretation I1. I1 is not a model for the database because there exists a clause f, namely authorOf(russell, ai a modern approach) ←, such that body(f) = {} ⊆ I1 but head(f) = {authorOf(russell, ai a modern approach)} ∩ I1 = ∅. At the same time, it is easy to see that I2 is a model for the bibliographic database.

Example 2.12. For the natural numbers, I3 is not a model, because of the fact nat(0) ←. Neither is I4, because of the recursive clause c, for which there exists a substitution θ = {X/succ(0)} with body(c)θ = {nat(succ(0))} ⊆ I4 but head(c)θ = {nat(succ(succ(0)))} ∩ I4 = ∅.
Finally, I5 is a model for the two clauses defining nat/1.

Model theory is the basis for reasoning about the declarative semantics of logical formulae, which relies on the notion of logical entailment. Logical entailment defines when one formula is a consequence of, or follows from, another one.

Definition 2.13. Let C be a set of clauses and c be a clause. C logically entails c, notation C |= c, if and only if all models of C are also models of c.

In this definition, not only the Herbrand interpretations (and models) but all interpretations and models are considered. The following theorems, however, allow us to focus our attention on Herbrand interpretations and models to reason about the satisfiability of a particular formula.

Theorem 2.14. (Proposition 3.30 from [Nienhuys-Cheng and de Wolf, 1997]) Let T be a set of clauses. Then T has a model if and only if T has a Herbrand model.

To generalize the applicability of this result to entailment between two formulae C and c, the theorem below can be used.

Theorem 2.15. (This follows from Proposition 3.31 of [Nienhuys-Cheng and de Wolf, 1997]) Let C be a set of clauses and c be a clause. Then C |= c if and only if C ∧ ¬c is unsatisfiable, that is, if C ∧ ¬c |= □.

The empty clause ←, sometimes written as □, is the clause whose head and body are both empty. This clause is unsatisfiable because its body is always true and its head is never true, regardless of the interpretation.

Example 2.16. Reconsider our bibliographic database B consisting of all facts listed in Ex. 2.1. Then B |= authorOf(lloyd, logic for learning) ← because B ∧ ¬authorOf(lloyd, logic for learning) is unsatisfiable, as authorOf(lloyd, logic for learning) and its negation are contradictory.

Example 2.17. Similarly, let N consist of the two clauses defining nat/1. Then N |= nat(0) and also N |= nat(succ(0)).

An algorithm to generate a (Herbrand) model of a set of clauses is listed in Algo. 2.1.
It assumes that all clauses are range-restricted, i.e., that all variables appearing in the head of a clause also appear in its body. Range-restrictedness implies that all facts are ground.

Algorithm 2.1 Generating a model of a set of clauses
M := ∅
while M is not a model of C do
  if there is a denial ← b1, · · · , bm in C that is violated by M then
    backtrack
  end if
  select h1; · · · ; hn ← b1, · · · , bm from C and θ (choice point)
    such that {b1θ, · · · , bmθ} ⊆ M and {h1θ, · · · , hnθ} ∩ M = ∅
  add one of the hiθ to M (choice point)
end while

The algorithm starts with the empty model and repeatedly expands it by nondeterministically adding facts belonging to the head of a violated clause. This process continues until M either becomes a model of the theory or a denial is violated. When a denial is violated, the model cannot be expanded by adding facts, which explains why the algorithm then backtracks to earlier choice points. Observe that the algorithm can fail when no model exists, and also that it can loop infinitely while attempting to construct an infinite model. There also exists a very elegant and short implementation of this algorithm in Prolog. It is called Satchmo and was published by Manthey and Bry [1987]; it is described in detail in [Flach, 1994].

Example 2.18. Consider the following clausal theory:

human(X) ← male(X)
human(X) ← female(X)
female(X); male(X) ← human(X)
human(john) ←

One possible trace of Algo. 2.1 generates the following sequence of interpretations:

{}
{human(john)}                 using the last clause
{human(john), female(john)}   using the third clause

Another model for this example is {human(john), male(john)}.

Example 2.19. Reconsider the bibliographic database together with the clause defining the predicate cites/2. Algo. 2.1 would generate the interpretation consisting of all ground facts for the predicates reference/2, cites/2 and authorOf/2 that were listed in the earlier examples.

Example 2.20.
Reconsider the definition of the natural numbers using the predicate nat/1. For these clauses, the algorithm does not terminate, because it attempts to generate the infinite model I5 defined above.

Exercise 2.21. Does the following clausal theory have a model? If so, generate one.

← student(X), vampire(X)
← student(X), professor(X)
← female(X), male(X)
being(dracula) ←
clever(X); student(X) ← being(X)
female(X); male(X); vampire(X) ← being(X)
student(X); professor(X); vampire(X) ← being(X)

Exercise 2.22. Use the theorem prover Satchmo, implemented in Prolog (cf. [Manthey and Bry, 1987] or [Flach, 1994]), to generate more models of the theory in the previous example. (This exercise requires access to a Prolog implementation. Some excellent Prolog implementations, such as SWI-Prolog and YAP-Prolog, are freely available.)

Some theories (sets of clauses) have multiple models. Furthermore, one model can be a subset of another model, which motivates the notion of a minimal model. A Herbrand model is minimal if and only if none of its proper subsets is a model.

Example 2.23. Reconsider the theory listed in Ex. 2.18. The two models listed there are minimal. However, the model {human(john), male(john), female(john)} is not minimal, and this interpretation would not even be a model if the clause ← female(X), male(X) belonged to the theory.

When restricting one's attention to definite clauses, the subset of clausal logic most often applied in logic programming, the minimal model is unique, which explains why it is called the least Herbrand model. The least Herbrand model of a set of clauses C will be denoted as M(C). The least Herbrand model is an important concept because it captures the semantics of the set of clauses. It consists of all ground atoms that are logically entailed by the definite clause theory. This can be formally specified as:

Property 2.24.
Let C be a set of definite clauses and f be a ground fact. Then C |= f if and only if f ∈ M(C).

The least Herbrand model also captures the intuitive meaning of the theory. This is best illustrated by the bibliographic database, where the least Herbrand model is the one sketched in Ex. 2.19, which indeed contains exactly those facts that would be generated using the SQL statement, and which one would intuitively expect. When applied to (range-restricted) definite clause theories, Algo. 2.1 will indeed generate the least Herbrand model of the theory. However, in this case one can also use the simplified version sketched in Algo. 2.2.

Algorithm 2.2 Computing the least Herbrand model of a definite clause theory
M0 := ∅
M1 := {f | f ← is a fact in C}
i := 1
while Mi ≠ Mi−1 do
  Mi+1 := ∅
  for all h ← b1, · · · , bn ∈ C do
    for all θ such that {b1θ, · · · , bnθ} ⊆ Mi do
      add hθ to Mi+1
    end for
  end for
  Mi+1 := Mi ∪ Mi+1
  i := i + 1
end while

The key difference between this algorithm and the previous one is that it processes all violated clauses in parallel to generate the next model. It can be optimized by adding {b1θ, · · · , bnθ} ∩ (Mi − Mi−1) ≠ ∅ as an additional constraint in the inner for loop. This way the same conclusions hθ will not be regenerated in every iteration, which removes some redundant computations.

Example 2.25. Suppose the definite clause theory is:

ancestor(X, Y) ← parent(X, Y)
ancestor(X, Y) ← parent(X, Z), ancestor(Z, Y)
parent(rose, luc) ←
parent(leo, rose) ←

The algorithm then computes the following sequence of models:

M0 = ∅
M1 = {parent(rose, luc), parent(leo, rose)}
M2 = M1 ∪ {ancestor(rose, luc), ancestor(leo, rose)}
M3 = M2 ∪ {ancestor(leo, luc)}
M4 = M3

Exercise 2.26.
Specify the least Herbrand model of the following theory:

plus(0, X, X) ← nat(X)
plus(succ(X), Y, succ(Z)) ← plus(X, Y, Z)
nat(0) ←
nat(succ(X)) ← nat(X)

2.4 Inference with Clausal Logic — Proof Theory

Now that the semantics of clausal logic has been defined, we discuss how to perform inference in clausal logic. Deductive inference is concerned with deciding whether one formula F logically entails another one G, that is, deciding whether F |= G. When working with clausal logic, the typical inference procedure is based on the deductive inference rule known as resolution. The resolution principle was introduced by Robinson [1965]. It is employed by the large majority of theorem provers for first-order logic. Even though many variants of the resolution principle exist, we will restrict our attention to resolution for Horn clauses and SLD-resolution, because this form of resolution underlies the programming language Prolog, on which the vast majority of logical learning approaches is based. Let us start by introducing resolution for propositional logic. In propositional logic, all predicates have arity 0, which implies that there are no constants, variables or structured terms that need to be taken into account. Given the clauses

l ← b1, · · · , bn  and  h ← c1, · · · , ci−1, l, ci, · · · , cm

the propositional resolution operator infers the resolvent

h ← c1, · · · , ci−1, b1, · · · , bn, ci, · · · , cm    (2.1)

The rule is sometimes displayed as

l ← b1, · · · , bn  and  h ← c1, · · · , ci−1, l, ci, · · · , cm
――――――――――――――――――――――――――――――――――――    (2.2)
h ← c1, · · · , ci−1, b1, · · · , bn, ci, · · · , cm

which indicates that if the clauses above the line are present, the one below the line may be inferred. We sometimes use the notation

l ← b1, · · · , bn and h ← c1, · · · , ci−1, l, ci, · · · , cm ⊢res h ← c1, · · · , ci−1, b1, · · · , bn, ci, · · · , cm    (2.3)

to denote a particular resolution step.
This inference step is graphically illustrated in Fig. 2.1, where the operator takes the typical V form. Notice that this rule also holds when h = {}, that is, when working with a denial instead of a definite clause.

  l ← b1, ..., bn        h ← c1, ···, ci−1, l, ci, ···, cm
  ────────────────────────────────────────────────────────
  h ← c1, ···, ci−1, b1, ..., bn, ci, ···, cm

  Fig. 2.1. The propositional resolution operator

Example 2.27.

  mammal ← rabbit ∧ animal ← mammal ⊢res animal ← rabbit

The resolution operator is sound, which means that whenever c1 ∧ c2 ⊢res c holds, c1 ∧ c2 |= c holds as well. When more than one resolution step is performed, one talks about resolution derivations or proofs. A resolution proof of a clause c from a theory T = {c1, ..., cn}, notation T ⊢ c, is a sequence of resolution steps c1,1 ∧ c1,2 ⊢res c1, ···, ck,1 ∧ ck,2 ⊢res ck, where ck = c and ci,j ∈ T ∪ {c1, ···, ci−1}. Resolution proofs can be graphically represented as trees; see Fig. 2.2 for an example.

  pos ← foursided, red      foursided ← rectangle
  ────────────────────────────────────────────────
      pos ← rectangle, red      rectangle ← square
      ──────────────────────────────────────────
          pos ← square, red

  Fig. 2.2. A resolution derivation

Example 2.28. A resolution proof of the clause pos ← square, red (red squares are positive) is shown in Fig. 2.2. It assumes the following theory T is given:

  foursided ← rectangle
  rectangle ← square
  pos ← foursided, red

A popular technique in mathematics when proving theorems is to assume that the theorem is false and to show that this assumption leads to a contradiction. This technique also applies to logical inference and theorem proving, where it is known as proving by refutation. The idea of refutation was already formulated in Theorem 2.15, where it was stated that C |= c if and only if C ∧ ¬c |= □. A proof by refutation is a resolution derivation that ends in the empty clause □ (also written ←), which is unsatisfiable. Resolution is not a complete operator for definite clause logic (even in the propositional case).
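The proof-by-refutation idea just introduced can be sketched as a small propositional prover. This is an illustrative Python sketch under an assumed clause representation (pairs of head and body list), not the book's code: to decide whether the theory of Ex. 2.28 entails pos ← square, red, the negated clause contributes the facts square ← and red ← and the denial ← pos, and the prover resolves the denial against clause heads until the empty clause is reached.

```python
def refute(goal, clauses):
    """SLD-style propositional refutation: try to resolve the denial
    ← goal against the clauses until the empty clause is derived.
    Naive sketch: may loop forever on cyclic theories."""
    if not goal:
        return True  # empty clause reached: the denial is refuted
    first, rest = goal[0], goal[1:]
    for head, body in clauses:
        # resolving ← first, rest with head ← body yields ← body, rest
        if head == first and refute(body + rest, clauses):
            return True
    return False

# Theory of Ex. 2.28 plus the facts obtained by negating pos ← square, red
theory = [
    ("foursided", ["rectangle"]),
    ("rectangle", ["square"]),
    ("pos", ["foursided", "red"]),
    ("square", []),
    ("red", []),
]
print(refute(["pos"], theory))  # → True
```

Each recursive call corresponds to one resolution step of the refutation; the succession of goals mirrors the chain of denials ← pos, ← foursided, red, ← rectangle, red, ← square, red, ← red, and finally the empty clause.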
Completeness would require that whenever C |= c there also exists a resolution derivation of c from C.

Example 2.29. Indeed,

  mammal ← rabbit |= mammal ← rabbit, brown

but it is impossible to derive the second clause from the first one by resolution.

Fortunately, resolution is refutation complete, as stated in the following property.

Property 2.30. (Refutation completeness; cf. Theorem 5.18 of [Nienhuys-Cheng and de Wolf, 1997]) Let C be a set of clauses. Then C is unsatisfiable, that is, C |= □, if and only if there exists a resolution derivation of the empty clause □ starting from C.

Due to the soundness and refutation completeness of resolution (for propositional Horn clauses) we now have an effective procedure for deciding logical entailment. Deciding whether a set of Horn clauses logically entails a Horn clause, that is, whether C |= h ← b1, ···, bn, can be realized as follows:

• negate the clause h ← b1, ···, bn, which yields the clauses T = {← h, b1 ←, ···, bn ←};
• try to derive the empty clause from C ∧ T, that is, decide whether C ∧ T ⊢ □.

Example 2.31. Reconsider the clauses in Ex. 2.28. We now prove by refutation that T |= pos ← square, red. To this end, we first negate the clause, which yields the clauses:

  square ←   and   red ←   and   ← pos

Figure 2.3 shows how to derive □ from these clauses and T.

So far, we have introduced resolution for propositional logic, but resolution becomes more interesting when clauses contain terms. The key difference is that unification must then be used. Unification is needed to make a literal in one clause match a literal in the other clause. For instance, when trying to resolve father(X, Y) ← parent(X, Y), male(X) with ← father(luc, maarten), it is necessary to unify the literals father(X, Y) and father(luc, maarten) using the substitution {X/luc, Y/maarten}, which yields the clause ← parent(luc, maarten), male(luc). Unification was already implicitly used in Algos. 2.1 and 2.2.
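A unification routine along these lines can be sketched in Python. The representation is an assumption for illustration only: terms are nested tuples such as ("father", "luc", "X"), uppercase strings are variables, lowercase strings are constants or functors, and, as in standard Prolog implementations, the occurs check is omitted for brevity.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def walk(t, theta):
    # follow variable bindings already recorded in theta
    while is_var(t) and t in theta:
        t = theta[t]
    return t

def unify(t1, t2, theta=None):
    """Return a unifier (substitution dict) of t1 and t2 extending
    theta, or None if the terms do not unify. No occurs check."""
    theta = dict(theta or {})
    t1, t2 = walk(t1, theta), walk(t2, theta)
    if t1 == t2:
        return theta
    if is_var(t1):
        theta[t1] = t2
        return theta
    if is_var(t2):
        theta[t2] = t1
        return theta
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        for a, b in zip(t1, t2):
            theta = unify(a, b, theta)
            if theta is None:
                return None
        return theta
    return None  # clash between distinct constants or functors

# unify father(luc, X) with father(Y, soetkin)
print(unify(("father", "luc", "X"), ("father", "Y", "soetkin")))
# → {'Y': 'luc', 'X': 'soetkin'}
```

The resulting dictionary plays the role of the substitution {Y/luc, X/soetkin}; applying it to either atom yields father(luc, soetkin).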
Formally, a unifier of two expressions f1 and f2 (terms or atoms) is a substitution θ such that f1θ = f2θ.

  ← pos                pos ← foursided, red
  ← foursided, red     foursided ← rectangle
  ← rectangle, red     rectangle ← square
  ← square, red        square ←
  ← red                red ←
  □

  Fig. 2.3. A proof by refutation

Example 2.32. For instance, to unify father(luc, X) and father(Y, soetkin) one can use the substitution {Y/luc, X/soetkin}, and to unify the atoms plus(succ(succ(0)), succ(X), succ(Y)) and plus(A, B, succ(succ(C))) one can use θ1 = {A/succ(succ(0)), B/succ(X), Y/succ(C)}, or θ2 = {A/succ(succ(0)), B/succ(0), X/0, Y/suc