Artificial Intelligence


1 Artificial Intelligence M.Sc. Program in Electronic Engineering, Academic year 2017/2018. Instructor: Giorgio Fumera. Course web site: Pattern Recognition and Applications Lab, Department of Electrical and Electronic Engineering, University of Cagliari 1

2 Syllabus 1. Introduction to AI: historical notes 2. Solving problems by searching formulating search problems search strategies and their properties uninformed search strategies informed search strategies, heuristics 3. Knowledge representation and inference introduction to logic logical languages: propositional logic, predicate logic inference algorithms: propositional and first-order inference application: expert systems 4. The Lisp Language 5. Machine learning learning from examples, supervised learning, classification decision trees artificial neural networks 2

3 Introduction to AI: historical notes 3

4 What is intelligence? Broad definition: a set of capabilities that allow humans to learn, think, understand, communicate, be self-conscious, build abstract models of the world, plan, adapt to novel external conditions, etc. (some of these capabilities are also exhibited by animals, e.g., associative memory, reacting to stimuli, communicating). Different aspects of intelligence have been studied for a long time by several disciplines: logic, psychology, neurophysiology, etc. 4

5 Intelligent machines? The idea of building intelligent machines has also long been envisaged, e.g.: Leonardo da Vinci's robot knight (about 1495) the Mechanical Turk automaton chess player (late 18th cent.) science fiction, e.g.: the HAL 9000 computer of A.C. Clarke's novel and S. Kubrick's 2001: A Space Odyssey movie (1968) the replicants of R. Scott's Blade Runner movie (1982) 5

6 Artificial intelligence Artificial Intelligence (AI) was born in the 1950s from the confluence of two broad and articulated earlier research efforts: understanding human intelligence building machines capable of autonomously performing complex tasks that are deemed to require intelligence 6

7 Earlier investigations on human intelligence Goal: understanding different aspects of human intelligence. High-level manifestation: rationality Logic: Aristotle (4th cent. BC), G.W. Leibniz (17th-18th cent.), G. Boole (19th cent.), etc. High-level manifestation: behavior and mind Psychology and cognitive science (since the 19th cent.) Low-level biological support: the brain Neuroanatomy and neurophysiology (since the 19th cent.): McCulloch and Pitts's model of the neuron (1943), D.O. Hebb's theory of neurons as basic units of thought, etc. 7

8 Earlier investigations on technology Goal: building machines capable of autonomously performing complex tasks. Automata (e.g., the Jacquard loom, 1804) Cybernetics (feedback and control): N. Wiener, W.R. Ashby (1940s) Statistics and probability as tools to deal with uncertainty in reasoning and decision-making: T. Bayes (18th cent.); K. Pearson, R.A. Fisher, A. Wald, J. Neyman (late 19th-20th cent.) 8

9 Earlier investigations on technology: the computer Precursors: mechanical devices by B. Pascal and G.W. Leibniz (17th cent.), C. Babbage's analytical engine (19th cent.) Contributions from engineering: electromechanical and electronic devices (1940s): K. Zuse, J.P. Eckert, J.W. Mauchly, J. von Neumann Contributions from logic and mathematics: computability theory, the foundation of computer science: A.M. Turing, A. Church (1930s) 9

10 Earlier investigations on thinking computers Alan M. Turing's ( ) contributions: the first investigations into the nature of computing the logical computing machine (Turing machine): a universal computer envisioning intelligent computers: Computing Machinery and Intelligence, Mind, Vol. LIX, No. 263, , 1950; an operational definition of intelligence: the Turing Test Are electronic computers the right tool for building intelligent machines? 10

11 The birth of AI In the 1950s many researchers from different disciplines are investigating human intelligence, and others are building machines capable of performing complex tasks. In the summer of 1956 some of them meet at a workshop at Dartmouth College (USA), and found a new discipline named artificial intelligence, whose aim is to build intelligent machines. The founders: J. McCarthy, M. Minsky, A. Newell, H. Simon, C. Shannon, O. Selfridge, R. Solomonoff, and others. 11

12 AI early explorations: 1950s and 1960s Goals: identifying specific tasks that require intelligence, and figuring out how to get machines to do them. Great interest in mimicking high-level human thought and mental abilities, e.g.: reasoning understanding natural language understanding images Some investigations also on low-level abilities: recognizing speech sounds distinguishing objects in images reading cursive script Main problem: how do humans do that? 12

13 AI early explorations: 1950s and 1960s Starting point: toy problems (easy to formalize and investigate), and some real-world ones game playing: 15-puzzle, checkers, chess, etc. theorem proving natural language processing (NLP) recognizing objects in images 13

14 AI early explorations: 1950s and 1960s Dominant viewpoint: the essence of intelligence is deemed to be symbol processing. Early AI research focused therefore on a symbolic approach, aimed at simulating high-level manifestations of human intelligence. Main tools: heuristic search syntax analysis/generation symbolic knowledge representation (symbols, lists, graphs) symbolic knowledge processing: new programming languages (LISP, etc.) 14

15 Heuristic search methods Symbol processing approach, applied to problems like: game playing: 15-puzzle, checkers, chess (the Drosophila of AI ) geometric analogy problems theorem proving mechanizing problem solving: A. Newell and H. Simon s General Problem Solver (1959) 15

16 Heuristic search methods Common approach: knowledge representation: lists of symbols (main feature of the LISP language: 1958, J. McCarthy) search methods: search tree, heuristics; an example: search tree for the 8-puzzle problem 16

17 Natural language processing Aim: understanding, generating and translating natural language. A difficult problem, due for instance to different linguistic levels to take into account: morphology: word parts (e.g.: walking = walk + -ing) syntax (grammar): rules that define well-formed sentences (e.g.: John hit the ball: Yes; ball the hit John: No) semantics: meaning of a sentence pragmatics: context and background knowledge, e.g.: John went to the bank John threw the ball to the window and broke it John threw the glass to the wall and broke it 17

18 Natural language processing Main focus of earlier research: the syntactic level (symbol processing approach). Seminal work: N. Chomsky, Syntactic Structures. Grammar definition: syntax rules for analyzing/generating sentences; main tool: the parse tree. Applications: question answering (original goal: computer interfaces) machine translation: early optimism, but a very hard task 18

19 Non-symbolic approach A secondary approach (at the time) was a non-symbolic one, aimed at simulating low-level manifestations/capabilities of intelligence, like perception (mainly visual perception). This approach gave rise to: the pattern recognition discipline, which later emerged as a relevant branch of AI artificial neural networks, which became one of the main AI tools (now re-flourishing as deep learning) 19

20 Pattern recognition Goal: classifying signals (images, sounds, electronic signals, etc.) into one of several categories. first problem addressed: image classification first application: optical character recognition (OCR) Main approaches: template matching learning: image pre-processing (noise filtering, line thickening, edge enhancement,...), feature extraction (e.g., shape), classification rules learnt from examples 20

21 Artificial neural networks (ANNs) Non-symbolic (low-level), connectionist approach. The origins: McCulloch and Pitts's mathematical model of the neuron (1943) the perceptron by F. Rosenblatt (1957): a potential model of human learning, cognition and memory a network of McCulloch-Pitts neural elements a learning algorithm for adjusting connection weights from examples First applications: pattern (image) recognition: OCR, aerial images 21

22 Great expansion: mid 1960s to early 1980s From toy/lab problems to real-world and commercial applications: computer vision mobile robots game playing speech recognition, NLP knowledge representation and reasoning Relevant public funding: DARPA s Strategic Computing Program (USA) Fifth Generation Computer Systems (Japan) ESPRIT (Europe) 22

23 Great expansion: mid 1960s to early 1980s Computer vision: MIT Summer Vision project (1966) low-level, hierarchical image processing (hints from biology); image filters; line, corner, surface detection; 3D reconstruction early application: guiding a robot arm to manipulate blocks high-level vision: finding objects in scenes (templates, parts) two main approaches emerge: whole scene reconstruction: difficult perceiving to guide robot action (purposive vision): easier 23

24 Great expansion: mid 1960s to early 1980s Mobile robots: sensors, actuators, computer vision, environment modeling, planning route finding: heuristic search, A* algorithm first autonomous vehicles 24

25 Great expansion: mid 1960s to early 1980s Game playing: progress in chess programs that attain human-level (not master) capability investigations into human ability: accumulated knowledge vs massive search 25

26 Great expansion: mid 1960s to early 1980s Progress in NLP, on less ambitious goals than in the 1950s: improvements in grammars machine translation with humans in the loop speech recognition (easy) and understanding (difficult) declarative (logical languages, inference algorithms) vs procedural ( hard-wired ) knowledge dialog systems (1971, T. Winograd's SHRDLU: blocks world, procedural knowledge) 26

27 Great expansion: mid 1960s to early 1980s Knowledge representation and reasoning: consulting/decision support/expert systems; main idea: solving domain-specific problems by embedding expert knowledge in the form of IF-THEN rules applications: chemistry, medical diagnosis, geology, military; since the 1990s: business 27

28 Great expansion: mid 1960s to early 1980s Summing up, until the 1970s AI research is mainly based on the symbol processing conception of human intelligence main approach: mimicking high-level human abilities through heuristic search and symbolic processing ( good old-fashioned AI, GOFAI) many successful applications through a pragmatic approach in specific tasks but very limited achievements with respect to early expectations for a general AI 28

29 Mid 1980s: the AI winter Real-world tasks turned out to require much more intelligence than that achievable by heuristic search and symbolic processing (GOFAI). Two main issues emerge: computational complexity: combinatorial explosion human problem-solving relies on a large body of implicit background knowledge (including common sense) The non-symbolic, connectionist approach (artificial neural networks) exhibits limitations as well. Main consequences: drop of interest in AI scaling back AI s goals reduction of research funding 29

30 Mid 1980s to 1990s: technical and theoretical advances The AI winter was overcome thanks to new results in several fields, based on solid theoretical foundations from: mathematics statistics and probability theory control engineering This enabled concrete progress in real-world tasks, albeit still far from initial expectations: knowledge representation and reasoning machine learning computer vision Intelligent Agent architectures 30

31 Mid 1980s to 1990s: technical and theoretical advances Advances in search algorithms: evolutionary approach, genetic algorithms (inspired by evolution). Advances in knowledge representation and reasoning: new paradigms, e.g.: fuzzy logic, soft computing (inspired by the human mind) semantic networks, ontologies (e.g., WordNet, BabelNet) probabilistic reasoning to overcome the limits of logic (probabilistic graphical models, Bayesian networks, learning) 31

32 Mid 1980s to 1990s: technical and theoretical advances The rise of machine learning: huge amount of data in digital form become available main idea: automatically inferring knowledge (patterns, rules, etc.) from data instead of eliciting it from domain experts data analysis methods: data mining, etc. theoretical foundations: statistics novel techniques: inductive logic programming, decision trees, resurgence of ANNs (1986: back-propagation algorithm), support vector machines, ensemble methods, etc. many application fields: computer vision, natural language processing, etc. 32

33 Mid 1980s to 1990s: technical and theoretical advances Computer vision: two main approaches persist: scene analysis, purposive vision main achievements: surface, depth; tracking, object recognition fruitful exchanges with research on animal/human vision novel techniques: hierarchical models, ANNs, deep neural networks extensive application of machine learning techniques 33

34 Mid 1980s to 1990s: technical and theoretical advances Intelligent Agent architectures: sensor networks autonomous, cooperating robots; emergent behavior the intelligent agent paradigm A toy (?) example: soccer-playing robots 34

35 Mid 1990s today: main achievements The original goal of building intelligent machines is still far from being reached. Nevertheless, several real-world problems can be successfully addressed, many commercial applications have been developed, and many startup companies are exploiting AI techniques. 35

36 Mid 1990s today: main achievements Some examples: games: master level has been achieved in checkers, chess and (very recently) Go computer vision (object recognition, scene understanding, etc.) driverless automobiles, space vehicles automatic language translation pervasive applications: home automation, route finding in maps (search algorithms), recommender systems (machine learning, social/collaborative filtering), characters in video games, etc. medicine (e.g., diagnosis) business rule management systems automated (high-frequency) trading 36

37 Summary of the main approaches to AI The approaches pursued so far to build intelligent machines can be categorized along two main dimensions: human performance vs rationality, and thought (mind) vs behavior:
systems that think like humans (cognitive modeling approach)
systems that think rationally ( laws of thought approach)
systems that act like humans (Turing test approach)
systems that act rationally (rational agent approach)
The rational agent approach is the most general one, and is amenable to scientific/technological development, although it may not be useful enough for understanding human intelligence. 37

38 A snapshot of current AI research Research topics: machine learning knowledge representation and reasoning reasoning under uncertainty or imprecision natural language understanding / translation multi-agent systems planning heuristic search robotics vision pattern recognition... 38

39 A snapshot of current AI research Associations: Association for the Advancement of Artificial Intelligence (AAAI) European Coordinating Committee for Artificial Intelligence (ECCAI), Italian Association for Artificial Intelligence (AI*IA) Conferences: Int. Joint Conf. on Artificial Intelligence, ijcai.org Scientific journals: Artificial Intelligence J. of Artificial Intelligence Research, 39

40 Philosophical issues A long-standing question: Can machines be intelligent? Two main hypotheses: Weak AI: machines can emulate intelligence (act intelligently) Strong AI: machines can be intelligent (if they act intelligently, they are intelligent, e.g.: Turing test) Another long-standing question: Is (human) mind a machine? 40

41 Philosophical issues Some of the arguments raised against Weak AI: machines can never do X (X = make mistakes, learn from experience, have a sense of humor, enjoy ice cream, etc.) machines are formal systems, and formal systems cannot establish the truth of every mathematical sentence (Gödel's incompleteness theorem), whereas humans (in principle) can human behavior cannot be captured by a set of rules A. Turing's viewpoint (Mind, 1950): Can a machine think? is an ill-posed question. Consider this one: Can a machine fly/swim? airplanes fly, but not as birds fly ships swim in Russian, but not in English or in Italian... 41

42 Philosophical issues Some arguments against Strong AI: even if machines can emulate intelligence, they cannot be self-conscious the relationship between mental states and body (brain) states (free will, consciousness, intentions): dualism (R. Descartes, 17th cent.) vs materialism ( brains cause minds ) a machine running the right program (e.g., for natural language understanding) does not necessarily have a mind (the Chinese room thought experiment, J. Searle, 1980) intelligence is an emergent behavior that can only be supported by biological brains 42

43 Ethical issues Some ethical issues raised against AI: even if we could build intelligent machines, should we? consequences for humans: loss of jobs, loss of the sense of being unique, end of the human race, etc. accountability (e.g., driverless cars)... Ethical concerns are currently re-flourishing, as many believe that human-level AI is now within reach (e.g., this is the current focus of the Future of Life Institute) 43

44 Some recent projects Human Brain Project (EU) Overall goal: understanding the human brain and its diseases, and emulating its computational capabilities RoboLaw (EU) Regulating Emerging Robotic Technologies in Europe: Robotics facing Law and Ethics 44

45 Solving problems by searching 45

46 Some motivating problems Consider the following problems, and assume that your goal is to design a rational agent, in the form of a computer program, capable of autonomously solving them. Remember: a rational agent is a system that acts rationally, according to a well-defined objective. 46

47 Some motivating problems Missionaries and cannibals A classic AI toy-problem: three missionaries and three cannibals must cross a river on a boat that can only hold two people, without leaving more cannibals than missionaries on either side of the river. How can all six get across the river safely? 47

48 Some motivating problems Game playing: 15-puzzle Another classic AI toy problem: transform an array of tiles from an initial configuration into a given, desired configuration, by a sequence of moves of a tile into an adjacent empty cell. A more challenging goal: find the shortest such sequence. An example: initial configuration desired configuration 48

49 Some motivating problems Game playing: checkers and chess Two historical problems addressed by many researchers since the early days of AI. Chess has been named the Drosophila of AI. 49

50 Some motivating problems Robot navigation: a real-world problem addressed in mid 1960s. Left: Shakey the robot (1968). Right: a navigation problem for Shakey: finding a route from R to G, possibly the shortest one, avoiding obstacles (in black). 50

51 Some motivating problems Route finding in maps An example: finding a route from Arad to Bucharest using the information shown in the map below. A more challenging version: find the shortest route. [Figure 3.2: a simplified road map of part of Romania, with inter-city distances in km.] 51

52 Common features of the above problems Although the above problems may seem very different from each other, they share some high-level features that allow one to solve them using the same approach. Main feature: a clear goal can be defined, in terms of a set of desired world states. Once the goal is defined, the task is to search for a sequence of actions that lead to a goal state. Hence the name search problem. This requires one to suitably define the actions and the states to be considered. 52

53 A framework for search problems Problems exhibiting the above characteristics can be formalized as follows: 1. Goal formulation: what are the desired world states? 2. Problem formulation: given the goal: what are the actions to consider? what are the states to consider? The crucial point in this step is to find a proper level of abstraction, by removing every irrelevant detail. Under the above formulation: the solution of a problem is a sequence of actions that lead to a goal state the process of looking for a solution is called search 53

54 Goal and problem formulation: examples 15-puzzle initial configuration desired configuration goal: getting to the desired tile configuration (possibly, by the shortest sequence of moves) states: the 16! possible tile configurations actions: moving the n-th tile (n = 1,..., 15) to one of its adjacent cells (two to four), if empty 54

55 Goal and problem formulation: examples Route finding in maps [Figure 3.2: a simplified road map of part of Romania.] goal: getting from a given city to a destination one (possibly, through the shortest route) states: being in each possible city actions: moving between two adjacent cities 55

56 Goal and problem formulation: examples Chess goal: to checkmate (this goal is achieved in many possible chessboard configurations) states: each possible chessboard configuration actions: all legal moves 56

57 Properties of search problems Static vs dynamic: does the environment change over time? Examples: the 15-puzzle and chess are static; robot navigation is dynamic, if the position of obstacles changes over time Fully vs partially observable: is the current state completely known? Examples: the 15-puzzle and chess are fully observable; robot navigation is partially observable, if sensors are not perfect Discrete vs continuous sets of states and actions. Examples: the 15-puzzle and chess are discrete, robot navigation is continuous Deterministic vs non-deterministic: is the outcome (the resulting state) of any sequence of actions certain? Examples: the 15-puzzle is deterministic, chess is not (due to the opponent's move, which is unknown when deciding one's own) 57

58 Examples of real-world problems Many challenging real-world problems can be formulated as search problems. Some examples: traveling salesperson problem: finding the shortest tour that allows one to visit every city of a given map exactly once (applications to planning, logistics, microchip manufacture, DNA sequencing, etc.) route-finding: routing in computer networks, airline travel planning, etc. VLSI design: cell layout, channel routing 58

59 Search problems: formal definition How can one devise algorithms, and implement them in some programming language, to solve search problems? First, a rigorous problem definition is needed. The goal and problem formulation sketched above can be formally defined in terms of four components: the initial state the set of possible actions the goal test the path cost 59

60 Search problems: formal definition A description of the initial state: the state where the search starts A description of the possible actions, defined as a successor function SF which, given a state s, returns the set of ordered pairs (a, s'), where a is a legal action in state s and s' is the resulting state. The above components implicitly define the state space: it can be represented as a graph whose nodes correspond to states and edges to actions. A path is a sequence of states connected by a sequence of actions A goal test function: given a state s, it determines whether or not s is a goal state A path cost function: it assigns a numeric cost to each path. In many problems it is defined as the sum of the costs of the individual actions (step costs) along the path 60
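As a concrete sketch, the four components can be collected in a small Python interface. All the names here (SearchProblem, successors, is_goal, step_cost) are illustrative assumptions, not part of the course material:

```python
class SearchProblem:
    """Formal search problem: initial state, successor function SF,
    goal test, and path cost (here: the sum of the step costs)."""

    def __init__(self, initial_state):
        self.initial_state = initial_state  # the state where the search starts

    def successors(self, state):
        """Successor function SF: return the pairs (a, s') of legal
        actions in `state` and the corresponding resulting states."""
        raise NotImplementedError

    def is_goal(self, state):
        """Goal test: is `state` a goal state?"""
        raise NotImplementedError

    def step_cost(self, state, action, next_state):
        """Cost of a single action; unit cost by default."""
        return 1
```

A specific problem (8-puzzle, route finding, etc.) would then be obtained by subclassing and filling in successors and is_goal.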

61 Example: 8-puzzle A simpler version of the 15-puzzle problem. [Figure 3.4: a typical instance of the 8-puzzle (start state and goal state).] State description: the location of each tile on the board Initial state: any given board configuration (note, however, that only half of all configurations can be reached from any given one) Goal test: checking whether the input state matches the desired one Path cost: if the goal is to reach the desired configuration by the shortest sequence of moves, each action costs 1 move, and the path cost is the number of steps in the path 61

62 Example: 8-puzzle Actions: different descriptions are possible, e.g.: moving the n-th tile (n = 1,..., 8) to the adjacent empty cell, if any (e.g., move tile 3 right ): 32 actions moving the blank to one of the adjacent cells (e.g., move the blank down ): 4 actions Using, e.g., the latter description, the successor function SF returns all the states that can be reached from a given state, with the corresponding action descriptions; for instance: SF ( ) = {( move the blank down, ),...} The resulting state space is a graph made up of 9!/2 nodes (the board configurations reachable from any given initial state). Note that in this problem it is neither convenient nor necessary to compute and store the state space beforehand in computer memory: it is implicitly defined by the initial state and SF. 62
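The move-the-blank description can be sketched in Python as follows (a hypothetical implementation, with states represented as 9-tuples read row by row and 0 standing for the blank):

```python
# Moving the blank by one cell shifts its index in the flat 9-tuple.
MOVES = {"up": -3, "down": 3, "left": -1, "right": 1}

def successors(state):
    """Return all (action, next_state) pairs obtained by moving the blank."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    result = []
    for action, delta in MOVES.items():
        # Skip moves that would take the blank off the 3x3 board.
        if action == "up" and row == 0:
            continue
        if action == "down" and row == 2:
            continue
        if action == "left" and col == 0:
            continue
        if action == "right" and col == 2:
            continue
        target = blank + delta
        tiles = list(state)
        tiles[blank], tiles[target] = tiles[target], tiles[blank]
        result.append((action, tuple(tiles)))
    return result
```

For a state with the blank in the center the function returns four pairs; with the blank in a corner, only two.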

63 Example: route finding in maps [Figure 3.2: a simplified road map of part of Romania.] State description: the name of the city where one currently is Initial state: any given city in the map Goal test: checking whether an input city is the destination Actions can be described as moving from a city to an adjacent one. The corresponding successor function returns all the cities adjacent to a given one (no action description is necessary in this particular case), e.g.: SF(Arad) = {Timisoara, Sibiu, Zerind} 63

64 Example: route finding in maps Path cost: if the goal is to reach the destination through the shortest route, the step cost is the length (e.g., in km) of the road between two adjacent cities, and the path cost is the sum of the corresponding step costs In this kind of problem the state space (graph) corresponds to the map (if one views each road as standing for two actions, e.g., moving from Arad to Sibiu and moving from Sibiu to Arad). Accordingly, storing in computer memory the information on the map amounts to explicitly storing the whole state space. 64
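Since here the state space coincides with the map, it can be stored explicitly, e.g., as a weighted adjacency dict. The fragment below is only a sketch; the distances are the ones commonly quoted for this textbook map, so treat the exact figures as assumptions:

```python
# Fragment of the road map as a weighted adjacency dict (distances in km).
ROADS = {
    "Arad": {"Zerind": 75, "Timisoara": 118, "Sibiu": 140},
    "Sibiu": {"Arad": 140, "Oradea": 151, "Fagaras": 99, "Rimnicu Vilcea": 80},
    "Fagaras": {"Sibiu": 99, "Bucharest": 211},
    "Rimnicu Vilcea": {"Sibiu": 80, "Pitesti": 97},
    "Pitesti": {"Rimnicu Vilcea": 97, "Bucharest": 101},
}

def sf(city):
    """Successor function: the cities adjacent to `city`."""
    return sorted(ROADS[city])

def path_cost(path):
    """Sum of the step costs (road lengths) along a path of cities."""
    return sum(ROADS[a][b] for a, b in zip(path, path[1:]))
```

With these figures, the route Arad, Sibiu, Rimnicu Vilcea, Pitesti, Bucharest costs 140 + 80 + 97 + 101 = 418 km.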

65 Solving a search problem From now on we shall consider the simplest kind of search problem: static, fully observable, discrete, deterministic (e.g., 8/15-puzzle and route finding in maps). The key feature of this kind of problem is that the search for a solution can be made offline, i.e., before starting the execution of the corresponding actions. The main steps for solving such problems are thus the following: 1. goal and problem formulation (discussed above) 2. searching for a solution (offline) 3. executing the actions In the following we shall focus on the search step. 65

66 Search algorithms A possible technique for solving a search problem is to iteratively construct and expand a set of partial solutions, each one starting from the initial state. The set of partial solutions can be represented by a tree data structure, named search tree: the root node corresponds to the initial state every other node is generated by applying the successor function to another node already in the tree This approach gives rise to a family of search algorithms. 66

67 Search tree A search tree represents a set of partial solutions (sequences of actions) starting from the initial state: each node corresponds to a state an edge from a parent node A to a child node B corresponds to the action that leads from state A to state B a leaf node corresponds to the end state of a partial solution each sequence of nodes from the root to a leaf is a possible path in the state space depth of a node: the number of actions in the path from the root to that node (the root node has zero depth) the set of leaf nodes is called fringe Note that the same state can appear in different nodes, if it belongs to different paths (partial solutions). 67

68 Search tree: an example [Figure 3.2: a simplified road map of part of Romania.] Root node: the starting city, Arad Six leaf nodes (the fringe, in white): six partial solutions, e.g., Arad - Sibiu - Arad, Arad - Sibiu - Fagaras, ... Edges: moving from a city to an adjacent one 68

69 State space and search tree [Figure 3.2: a simplified road map of part of Romania.] Note that a search tree is different from the state space. One key difference is that every node of the search tree corresponds to a single state of the state space, but every state may appear in several nodes of the search tree, if it belongs to different partial solutions (paths). In the example above, this is the case for the state Arad. 69

70 Sketch of a general tree-search algorithm Every tree-search algorithm works as follows: 1. construct the root node R of the search tree, associate the initial state with R, and set the fringe equal to {R} (the initial state is the only partial solution at this point) 2. repeat the following steps: 2.1 if the fringe is empty, then no solution has been found and the algorithm stops 2.2 choose one of the partial solutions, i.e., one leaf node N from the fringe (in the first iteration only R can be selected) 2.3 if N contains a goal state, then the search is successfully completed: the algorithm stops and returns the sequence of actions in the path from R to N as the solution 2.4 expand the state in N: apply SF to the state in N; for each state generated by SF, construct a new leaf node, add it to the tree as a child of N, and add it to the fringe; finally, remove N from the fringe 70

71 Sketch of a general tree-search algorithm Note that the above tree-search algorithm is independent of the search problem. The key point is the choice of a leaf node in step 2.2: different criteria can be used for this choice each criterion defines a specific search strategy each search strategy leads to a different search algorithm 71

72 Tree-search algorithm: an example Example: route finding in maps, getting from Arad to Bucharest. [Figure 3.2: a simplified road map of part of Romania.] 72

73 Tree-search algorithm: an example Step 1: the root node is constructed, corresponding to the initial state (Arad). 73

74 Tree-search algorithm: an example Step 2: the first iteration starts. Step 2.1: the fringe is not empty Step 2.2: the fringe contains a single leaf node, the root node (Arad), which is therefore selected Step 2.3 (goal test): Arad is not the desired state Step 2.4: the chosen leaf (the root node) has to be expanded 74

75 Tree-search algorithm: an example

Step 2.4.1: the successor function is applied to the state Arad, which generates all the states reachable from Arad
Step 2.4.2: the newly generated states are added as child nodes to the root node, and to the fringe
Step 2.4.3: the expanded node is removed from the fringe

Leaf nodes, not yet expanded (the fringe), are shown in white; non-leaf nodes, already expanded (no longer in the fringe), are shaded.

76 Tree-search algorithm: an example

Step 2: a new iteration starts.
Step 2.1: the fringe is not empty
Step 2.2: the fringe contains three leaf nodes: which one to choose? This is decided by a search strategy (see later). For instance, assume that Sibiu is chosen.
Step 2.3 (goal test): Sibiu is not the desired state
Step 2.4: the chosen leaf (Sibiu) has to be expanded

77 Tree-search algorithm: an example

Step 2.4.1: the successor function is applied to the state Sibiu, which generates all the states reachable from Sibiu (including Arad)
Step 2.4.2: the newly generated states are added as child nodes to the expanded node, and to the fringe
Step 2.4.3: the expanded node is removed from the fringe

A new iteration starts...

78 General tree-search algorithm

A more concise (but still informal) description:

function Tree-Search(problem, strategy) returns a solution, or failure
    construct the root node using the initial state of problem
    loop do
        if there are no leaf nodes then return failure
        choose a leaf node according to strategy
        if the chosen leaf node contains a goal state
            then return the corresponding solution
        expand the chosen leaf node

79 Implementation hints In the following, a more formal version of the above tree-search algorithm is presented in pseudo-code, together with the corresponding data structures. This version is independent of the search problem, and of the programming language. Details may change depending on the specific programming language used for implementing the algorithm. 79

80 Data structure: nodes of the search tree

The example below depicts the information that needs to be stored in computer memory to represent a node of the search tree:
- the state associated to the node
- the parent node (this is needed to easily reconstruct the path from the root node when a goal state is found)
- the action that led to this node from the parent one
- the path cost from the root node to this one

Additional information may be useful, e.g., the children nodes and the depth.

[Figure 3.10: Nodes are the data structures from which the search tree is constructed. Each has a parent, a state, and various bookkeeping fields. Arrows point from child to parent.]

81 Data structures: nodes of the search tree

The information outlined above can be conveniently stored in a record data structure, like the C language's struct, containing the following fields:
- State: a problem-dependent representation of the corresponding state
- Parent-Node: a pointer to the parent node
- Children-Nodes: pointers to the children nodes
- Action: a description of the action that led from the parent node to this one
- Path-Cost: the total cost of the actions on the path from the root to this node
- Depth: the number of actions in the path from the root to this node

82 Data structures: fringe of the search tree At each step of the tree-search algorithm one of the leaf nodes (i.e., the nodes in the fringe of the search tree) must be selected to be expanded. The choice among all the leaf nodes is made according to a given search strategy, as discussed later. The leaf nodes must therefore be quickly accessible. To this aim a suitable data structure should be used, e.g., based on pointers. A convenient solution is to store pointers to each node in the fringe in a queue, a first-in first-out (FIFO) data structure. Newly generated nodes are therefore inserted into the queue in the order in which they will be expanded by the chosen search strategy. This way, the next node to be expanded is always the first node in the current fringe. 82

83 Data structures: the search problem

Some information specific to the search problem to be solved must also be stored. One possibility is to use another record data structure:
- Initial-State: a problem-dependent representation of the initial state (possibly using another data structure)
- Goal-Test: a function that checks whether an input state is a goal state
- Successor-Fn: the function SF (see above) that returns a set of pairs (action, state) from a given state
- Step-Cost: a function that returns the cost of carrying out a given action in a given node

84 A note on the implementation of data structures

In the above record data structure, the fields Goal-Test, Successor-Fn and Step-Cost contain functions. The implementation of the above data structures depends on the chosen programming language: for instance, in the C language the values of these fields can be pointers to functions. Different implementations are of course possible, e.g., to avoid storing the above functions as values of record fields.

Moreover, the data structure used to represent the states of the search problem, as well as the goal-test, successor and step-cost functions, are all problem-specific: they have to be defined based on the search problem at hand.

85 Implementation of the tree-search algorithm

A possible implementation of the tree-search algorithm is shown in the next slide. Note that this implementation is problem-independent, and is an example of modular programming style:
- it can be used for any search problem
- all the problem-specific details (e.g., the data structure for representing states and the goal-test function) are represented or implemented separately

Function and field names are written in Small Capitals. The notation Field-Name[record] denotes the value of the field Field-Name of the record record.

86 Implementation of the tree-search algorithm

function Tree-Search(problem, Enqueue) returns a solution, or failure
    fringe ← an empty queue
    fringe ← Enqueue(Make-Node(Initial-State[problem]), fringe)
    loop do
        if Empty?(fringe) then return failure
        node ← Remove-First(fringe)
        if Goal-Test[problem](State[node]) succeeds then return Solution(node)
        fringe ← Enqueue(Expand(node, problem), fringe)

The search strategy is assumed to be defined through the function Enqueue, which is passed to Tree-Search as an argument (e.g., a pointer to a function in the C language). As explained above, Enqueue inserts the newly generated nodes in a queue, in the order in which they have to be expanded according to the corresponding search strategy.

87 Implementing auxiliary functions: node expansion

This function expands an input node and adds the children nodes to the search tree and to the fringe:

function Expand(node, problem) returns a set of nodes
    successors ← the empty set
    for each (action, result) in Successor-Fn[problem](State[node]) do
        n ← a new Node
        State[n] ← result
        Parent-Node[n] ← node
        Action[n] ← action
        Path-Cost[n] ← Path-Cost[node] + Step-Cost[problem](node, action)
        Depth[n] ← Depth[node] + 1
        Children-Nodes[n] ← the empty set
        add n to Children-Nodes[node]
        add n to successors
    return successors

88 Other auxiliary functions

- Make-Node(s): returns a new instance of the node data structure, storing the state s in its State field
- Remove-First(q): removes the first element from the queue q and returns it
- Solution(n): returns the sequence of actions (the values of the Action fields) from the root of the tree to node n, following the pointers in the Parent-Node fields from n backwards to the root
- Enqueue(nodes, q): inserts in the queue q each node in the set nodes, in a position defined by the search strategy (a different implementation of Enqueue must be defined for each search strategy)

89 A note on the state space representation

For some search problems the state space can be very large. For instance, the state space of the 8-puzzle game has size 9!/2.

[Figure 3.4: A typical instance of the 8-puzzle, with its start and goal states.]

Fortunately, it is not necessary to store the whole state space in memory (e.g., represented as a graph): each state can indeed be obtained from the initial one through the successor function SF (remember that the tree-search algorithm iteratively constructs partial solutions as sequences of actions from the initial state). In other words, as already pointed out previously, the initial state and SF implicitly define the state space.

90 A note on the state space representation

As a particular case, in problems like route finding on maps the state space is relatively small, and storing the information about the SF (and cost) function amounts to storing the whole state space.

For instance, encoding the information in the map below (which cities are directly connected to any given city, and the cost of moving between any two adjacent cities) amounts to defining SF.

[Figure 3.2: A simplified road map of part of Romania, with driving distances between adjacent cities.]

91 Measuring problem-solving performance

The performance of a tree-search algorithm can be evaluated according to two main criteria:
- effectiveness: how good is the solution found (if any)?
- efficiency: what is the cost of finding a solution (if any)?

Effectiveness can in turn be evaluated in terms of:
- completeness: is the algorithm guaranteed to find a solution, when there is one?
- optimality: is the solution found by the algorithm the one with minimum path cost?

Efficiency is related to the computational complexity of a search algorithm, which in turn depends on two aspects:
- time complexity: how long does it take to find a solution?
- space complexity: how much memory is needed to perform the search?

Often a trade-off between effectiveness and efficiency is required.

92 Search strategies

The essence of search algorithms is to choose one of the partial solutions (leaf nodes) to follow up at each step.

An example: [Figure: a search tree for the route-finding problem on the Romania map of Figure 3.2.] Which of the six partial solutions should one choose?

Two kinds of strategies exist, depending on the available information about which choice is better than another:
- no information: uninformed search strategies must be used
- some information: informed search strategies can be used

93 Uninformed search strategies

Rationale: in the absence of any information about the best partial solution, systematically explore the state space.

Main strategies:
- breadth-first
- depth-first
- uniform-cost
- depth-limited
- iterative-deepening depth-first
- bidirectional

94 Breadth-first search (BFS)

BFS expands first the shallowest leaf node. If there is more than one leaf node at the lowest depth, one of them is randomly chosen.

In other words, after expanding the root node, first all nodes at depth 1 are expanded, then all nodes at depth 2, and so on.

An example: [Figure: a search tree for the route-finding problem on the Romania map of Figure 3.2.] Shallowest leaf nodes: Timisoara, Zerind.

95 Example of breadth-first search (1/6)

Route finding on maps: getting from Arad to Bucharest.

[Figure 3.2: A simplified road map of part of Romania, with driving distances between adjacent cities.]

The nodes of the search tree will be numbered to refer to them without ambiguity (several nodes can share the same state). The steps of the algorithm are numbered according to the general tree-search algorithm shown above.

96 Example of breadth-first search (2/6)

1. root node: the initial state, Arad; fringe = { Arad }
2.1 the fringe is not empty
2.2 the shallowest node in the fringe (the root) is chosen (Arad)
2.3 Arad is not a goal state

[Search tree so far: the root node Arad (1).]

97 Example of breadth-first search (3/6)

2.4 Arad is removed from the fringe and expanded, generating three nodes having identical depth
- the newly generated nodes must be inserted in the fringe (a queue) in the order in which they will be expanded by BFS; by definition, in BFS a newly generated node has depth equal to or higher than all the other nodes in the fringe; therefore, the newly generated nodes are inserted at the end of the fringe
- since the newly generated nodes always have identical depth, they are inserted in the fringe in any order between themselves

[Search tree so far: Arad (1) with children Zerind (2), Sibiu (3), Timisoara (4).]

Current fringe: { Zerind (2), Sibiu (3), Timisoara (4) }.

98 Example of breadth-first search (4/6)

2.1 the fringe is not empty
2.2 the first node in the fringe is selected (it is guaranteed to be one of the shallowest leaf nodes)
2.3 the corresponding state, Zerind, is not the desired state
2.4 Zerind is removed from the fringe and expanded, generating two nodes that are inserted at the end of the fringe

[Search tree so far: Arad (1) with children Zerind (2), Sibiu (3), Timisoara (4); Zerind (2) with children Oradea (5), Arad (6).]

Current fringe: { Sibiu (3), Timisoara (4), Oradea (5), Arad (6) }.

99 Example of breadth-first search (5/6)

2.1 the fringe is not empty
2.2 the first node in the fringe is selected
2.3 the corresponding state, Sibiu, is not the desired state
2.4 Sibiu is removed from the fringe and expanded, generating four nodes that are inserted at the end of the fringe

[Search tree so far: Sibiu (3) now has children Oradea (7), Arad (8), Rimnicu Vilcea (9), Fagaras (10).]

Current fringe: { Timisoara (4), Oradea (5), Arad (6), Oradea (7), Arad (8), Rimnicu Vilcea (9), Fagaras (10) }.

100 Example of breadth-first search (6/6)

2.1 the fringe is not empty
2.2 the first node in the fringe is selected
2.3 the corresponding state, Timisoara, is not the desired state
2.4 Timisoara is removed from the fringe and expanded, generating two nodes that are inserted at the end of the fringe

[Search tree so far: Timisoara (4) now has children Lugoj (11), Arad (12).]

Current fringe: { Oradea (5), Arad (6), Oradea (7), Arad (8), Rimnicu Vilcea (9), Fagaras (10), Lugoj (11), Arad (12) }. And so on.

101 Exercise

Apply the BFS algorithm to the 8-puzzle problem, considering the initial and goal states below, and expanding the first four nodes of the search tree (i.e., execute the first four iterations of the general tree-search algorithm).

[Figure 3.4: A typical instance of the 8-puzzle, with its start and goal states.]

102 Properties of breadth-first search

In terms of effectiveness, it can be easily shown that BFS is:
- complete: a solution is always found, if one exists
- non-optimal: it is not guaranteed that the solution with minimum path cost is found (if any), unless the path cost is a non-decreasing function of depth

Computational complexity can be evaluated as follows.

103 Computational complexity of algorithms

The execution time of a given algorithm depends on several factors not intrinsic to the algorithm itself (the implementation in a specific language, the hardware on which the program is executed, etc.). Time complexity is therefore evaluated not in terms of the execution time, but in terms of the number of elementary operations carried out by the algorithm, assuming that each of them can be executed in constant time. Depending on the algorithm, these can be additions, multiplications, comparisons, etc.

The first step in computing the time complexity of a given algorithm is therefore to identify what its elementary operations are. For instance:
- sorting algorithms (selection sort, quick sort, etc.): comparison between a pair of values
- converting a number to base two: computing the quotient and remainder of a division

104 Computational complexity of algorithms

The number of elementary operations carried out by an algorithm depends on either the value or the size of its input, e.g.:
- to sort a sequence of numbers, the number of comparisons depends on the size of the sequence
- to convert a number to base two, the number of divisions depends on its value

Even if the number of elementary operations depends on more than one factor, to ease computations a single factor is considered and the others are kept constant.

105 Computational complexity of algorithms As an example, consider the well-known selection sort algorithm, to sort a sequence of values in a certain order. It can be described as follows: 1. find the minimum value in the sequence 2. swap it with the value in the first position 3. repeat the steps above for the remainder of the sequence (starting at the second position, then at the third one, up to the penultimate position) 105

106 Computational complexity of algorithms

A possible implementation of selection sort in the C language is reported below, for sorting an array of integers in non-decreasing order:

void selection_sort(int a[], int length)
{
    int i, j, ind_min, tmp;

    for (i = 0; i < length - 1; i++) {
        ind_min = i;
        for (j = i + 1; j < length; j++)
            if (a[j] < a[ind_min])
                ind_min = j;
        if (ind_min != i) {
            tmp = a[i];
            a[i] = a[ind_min];
            a[ind_min] = tmp;
        }
    }
}

107 Computational complexity of algorithms

Each comparison in the nested loop can be considered as the elementary operation. For a sequence of length n the outer loop is repeated n − 1 times. At iteration k (k = 1, ..., n − 1), n − k comparisons are made to find the lowest element starting from the k-th position; then a swap is possibly made.

It is now easy to compute the exact number of comparisons, which is given by:

(n − 1) + (n − 2) + ... + 1 = n(n − 1)/2

Time complexity therefore depends on the size of the input, i.e., the sequence length. Note that all problem instances of a given size (sequences of a given length) have the same time complexity.

108 Computational complexity of algorithms

Let n denote the value of the factor on which the number of elementary operations carried out by an algorithm depends (e.g., the length of a sequence to be sorted, or the value of a number to be converted to base two), and f(n) the corresponding number of elementary operations.

In practice evaluating f(n) can be difficult. Moreover, f(n) may also depend on the problem instance, i.e., on the specific value of the input of the algorithm. Time complexity is therefore evaluated only for some categories of problem instances. The most interesting category is usually the one that corresponds to the highest number of elementary operations, i.e., the worst case. Other categories of interest are:
- average-case complexity, corresponding to the average number of elementary operations over all possible problem instances
- best-case complexity, corresponding to the problem instances that require the lowest number of elementary operations

109 Computational complexity of algorithms

To evaluate and compare algorithms it is useful to consider their asymptotic time complexity, i.e., the behaviour of f(n) as n → ∞. To this aim an upper bound g(n) of f(n) is determined using asymptotic analysis, known as big O notation.

A function f(n) is said to be O(g(n)) ("order g") if there exist some n0 > 0 and c > 0 such that f(n) ≤ c g(n) for each n ≥ n0. Of course, the tightest upper bound g(n) is of interest.

For instance, it is easy to see that any polynomial of degree p, a_p n^p + a_{p-1} n^(p-1) + ... + a_1 n + a_0, is O(n^p). As an example, this implies that the time complexity of selection sort, given by n(n − 1)/2, is O(n^2).
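The polynomial claim (not worked out on the slide) follows in one line: for n ≥ 1, every power n^k with k ≤ p satisfies n^k ≤ n^p, hence

```latex
f(n) = a_p n^p + a_{p-1} n^{p-1} + \dots + a_1 n + a_0
     \le \bigl( |a_p| + |a_{p-1}| + \dots + |a_1| + |a_0| \bigr)\, n^p
     \qquad \text{for all } n \ge 1,
```

so the definition of O(n^p) is satisfied by taking c = |a_p| + ... + |a_0| and n0 = 1.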

110 Computational complexity of algorithms

Well-known categories of (increasing) asymptotic complexity are the following:
- O(1): constant time algorithms (their execution time is identical for all problem instances)
- O(log n): logarithmic time (e.g., binary search in sorted sequences)
- O(n): linear time
- O(n^p), for a given integer p: polynomial time (e.g., selection sort, with p = 2)
- O(k^n), for a given k > 1: exponential time (e.g., the simplex algorithm in linear programming)

111 Computational complexity of search algorithms Getting back to the general tree-search algorithm, its elementary operation can be identified as generation of a new node during the expansion of a node in step 2.4. It follows that worst-case time complexity can be evaluated by counting the highest number of nodes that are generated before a solution is found (if any). Space complexity can be computed analogously, taking into account that the main data that have to be stored are the nodes of the search tree. A node can thus be considered as the elementary data, assuming a constant amount of memory for each node. Accordingly, worst-case space complexity can be evaluated as the maximum number of nodes that have to be simultaneously stored in memory. 111

112 Breadth-first search: computational complexity

In the specific case of BFS, it is not difficult to see that computational complexity depends on two main factors:
- the number of successors b of each node of the search tree (named the branching factor)
- the depth d of the shallowest solution (the one found by BFS)

Since different nodes can have different branching factors (see, e.g., the 8-puzzle and route finding on maps), to simplify computations a constant branching factor is considered. For instance, for b = 2 we have a binary tree.

[Figure: breadth-first search on a binary tree; at each stage, the node to be expanded next is indicated by a marker.]

113 Breadth-first search: computational complexity

Computational complexity can now be evaluated as a function of the single factor d (corresponding to n in the previous discussion).

Time complexity: in the worst case the goal state is in the last node that is selected in step 2.2 to be expanded, among all the ones at depth d. This means that all the other nodes at depth d are expanded before. The number of generated nodes can be computed by evaluating the number of nodes that are generated at each depth:

Depth   Number of generated nodes
0       1 (root node)
1       b
2       b^2
3       b^3
...     ...
d       b^d
d + 1   b^(d+1) - b

Total: 1 + b + b^2 + b^3 + ... + b^d + (b^(d+1) - b)

114 Breadth-first search: computational complexity

To evaluate space complexity it suffices to notice that all generated nodes must remain in memory until a solution is found. It follows that space complexity equals time complexity.

The worst-case time and space complexity of BFS, for a search tree of constant branching factor b and shallowest solution at depth d, is therefore given by:

1 + b + b^2 + b^3 + ... + b^d + (b^(d+1) - b)

It easily follows that the asymptotic complexity of BFS is O(b^(d+1)), i.e., it is exponential. Intuitively, algorithms like BFS, characterized by an exponential complexity, are not very efficient.

115 Breadth-first search: computational complexity

As an example of what an exponential complexity means, consider the following scenario for a search problem:
- branching factor b = 10 (real-world problems can exhibit larger values)
- time for generating one node: 10^-4 s
- storage required for a single node: 1,000 bytes

The corresponding worst-case time and space complexity of BFS, for different values of the depth d of the shallowest solution, is the following:

Depth   Nodes      Time          Memory
2       1,100      0.11 sec.     1 megabyte
4       111,100    11 sec.       106 megabytes
6       10^7       19 minutes    10 gigabytes
8       10^9       31 hours      1 terabyte
10      10^11      129 days      101 terabytes
12      10^13      35 years      10 petabytes
14      10^15      3,523 years   1 exabyte

116 Properties of breadth-first search

To sum up, BFS exhibits the following properties:
- it is complete: a solution is always found, if one exists
- it is non-optimal: it is not guaranteed that the solution with minimum path cost is found (if any), unless the path cost is a non-decreasing function of depth
- its time and space complexity are exponential in the depth d of the shallowest solution, O(b^(d+1))

117 Depth-first search (DFS)

Contrary to BFS, DFS expands first the deepest leaf node (in case there are several such nodes, a random choice is made). This amounts to exploring first one of the possible paths, then another one, and so on.

An example: [Figure: a search tree for the route-finding problem on the Romania map of Figure 3.2.] Deepest leaf nodes: Arad, Fagaras, Oradea, Rimnicu Vilcea.

118 Example of depth-first search (1/5)

Route finding on maps: getting from Arad to Bucharest.

[Figure 3.2: A simplified road map of part of Romania, with driving distances between adjacent cities.]

119 Example of depth-first search (2/5)

1. root node: the initial state, Arad; fringe = { Arad }
2.1 the fringe is not empty
2.2 the deepest node in the fringe (the root) is chosen (Arad)
2.3 Arad is not a goal state

[Search tree so far: the root node Arad (1).]

120 Example of depth-first search (3/5)

2.4 Arad is removed from the fringe and expanded, generating three nodes having identical depth
- the newly generated nodes must be inserted in the fringe (a queue) in the order in which they will be expanded by DFS; by definition, in DFS a newly generated node has depth equal to or higher than all the other nodes in the fringe; therefore, the newly generated nodes are inserted at the front of the fringe
- since the newly generated nodes always have identical depth, they are inserted in the fringe in any order between themselves

[Search tree so far: Arad (1) with children Zerind (2), Sibiu (3), Timisoara (4).]

Current fringe: { Zerind (2), Sibiu (3), Timisoara (4) }.

121 Example of depth-first search (4/5)

2.1 the fringe is not empty
2.2 the first node in the fringe is selected (it is guaranteed to be one of the deepest leaf nodes)
2.3 the corresponding state, Zerind, is not the desired state
2.4 Zerind is removed from the fringe and expanded, generating two nodes that are inserted at the front of the fringe

[Search tree so far: Zerind (2) with children Oradea (5), Arad (6).]

Current fringe: { Oradea (5), Arad (6), Sibiu (3), Timisoara (4) }.

122 Example of depth-first search (5/5)

2.1 the fringe is not empty
2.2 the first node in the fringe is selected
2.3 the corresponding state, Oradea, is not the desired state
2.4 Oradea is removed from the fringe and expanded, generating two nodes that are inserted at the front of the fringe

[Search tree so far: Oradea (5) with children Sibiu (7), Zerind (8).]

Current fringe: { Sibiu (7), Zerind (8), Arad (6), Sibiu (3), Timisoara (4) }. And so on.

123 Exercise

Apply the DFS algorithm to the 8-puzzle problem, considering the initial and goal states below, and expanding the first four nodes of the search tree (i.e., execute the first four iterations of the general tree-search algorithm).

[Figure 3.4: A typical instance of the 8-puzzle, with its start and goal states.]

124 Some remarks on depth-first search

A drawback of DFS is that it can get stuck going down very long paths. Infinite paths can also occur, due for instance to trivial loops like Arad - Zerind - Arad - Zerind - ... (loops can be avoided by suitable changes to the general tree-search algorithm; see later).

On the other hand, DFS has modest memory requirements:
- if all the paths starting from a given node have been fully explored (if they are not infinite) and no solution has been found, the sub-tree having such a node as the root can be removed from memory
- therefore, only a single path from the root to a leaf node needs to be stored in memory, together with the unexpanded sibling nodes for each node on that path

An example is given in the following for a binary search tree, assuming that each path has depth 3, and that node M contains a goal state (shaded nodes are the ones not yet generated).

125 Example of depth-first search

The order of expansion is top-down, left to right.

[Figure 3.16: Depth-first search on a binary tree. The unexplored region is shown in light gray. Explored nodes with no descendants in the frontier are removed from memory. Nodes at depth 3 have no successors, and M is the only goal node.]

126 Depth-first search: computational complexity

Computational complexity of DFS can be evaluated by considering:
- an identical branching factor b for all nodes of the search tree, similarly to BFS
- the maximum depth of the search tree, denoted by m
- the depth of the shallowest solution, which equals m in the worst case

Moreover, in the worst case the goal state is in the last path that is explored by DFS.

To compute time complexity notice that the above assumptions imply that all nodes up to depth m are generated before the solution is found. To compute space complexity, remember that only a single path from the root to a leaf node, along with the remaining unexpanded sibling nodes for each node in the path, must be stored (see the example above).

127 Depth-first search: computational complexity

Computational complexity can therefore be evaluated by considering first the number of nodes generated (time complexity), and the number of nodes simultaneously kept in memory (space complexity), at each depth:

Depth   N. of generated nodes   N. of stored nodes
0       1 (root node)           1 (root node)
1       b                       b
2       b^2                     b
...     ...                     ...
m       b^m                     b

Total:  1 + b + ... + b^m = O(b^m)      1 + mb = O(m)

Time complexity is therefore exponential, as that of BFS, but space complexity is linear.

128 Properties of depth-first search

DFS exhibits the following properties:
- it is complete, unless there are infinite paths
- it is non-optimal: a deeper, suboptimal solution can be found along a path that is explored before another one containing the optimal solution at a smaller depth
- its time complexity is exponential in the maximum depth of the search tree, but its space complexity is only linear

129 Other strategies

- Uniform-cost: expands the leaf node with the lowest path cost
- Depth-limited: depth-first search with a predefined depth limit (avoids infinite paths, but is not complete)
- Iterative-deepening depth-first: repeated depth-limited search with depth limit 1, 2, 3, ..., until a solution is found (avoids infinite paths, and is complete)
- Bidirectional: simultaneously searching forward from the initial state and backwards from the goal state, until the two searches meet

[Figure 3.20: A schematic view of a bidirectional search that is about to succeed when a branch from the start node meets a branch from the goal node.]

130 Avoiding repeated states A critical issue of the search process: wasting time by expanding the same state several times (along different paths). This may happen, e.g., when: actions are reversible (15-puzzle, route finding, etc.) more than one path from the root can lead to the same state (15-puzzle, route finding, etc.) 130

131 Avoiding repeated states
Four main alternative solutions of increasing complexity exist. When a node n is expanded:
1. if reversible actions exist, discard the newly generated node containing the same state as the parent node of n
2. discard all child nodes containing states already present in the same path from the root to n
3. discard all child nodes containing states already present in the current search tree (ineffective for DFS, which does not store all generated nodes)
4. discard all child nodes containing states previously inserted in the search tree, even if not present in the current one 131
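As an illustration, solutions 2 and 4 above might be sketched as follows; the Node record and function names are hypothetical, not the course's reference code:

```python
# Illustrative sketch of repeated-state checks 2 and 4 above.
class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent

def on_path_to_root(node, state):
    # Solution 2: compare a candidate state against every ancestor of node.
    while node is not None:
        if node.state == state:
            return True
        node = node.parent
    return False

def expand_no_repeats(node, successors, explored):
    # Solution 4: 'explored' is the set of every state ever inserted in the tree.
    children = []
    for s in successors(node.state):
        if s not in explored:
            explored.add(s)
            children.append(Node(s, node))
    return children

root = Node("A")
child = Node("B", root)
print(on_path_to_root(child, "A"))  # True: "A" is on the path back to the root
```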

132 Avoiding repeated states
The above solutions require comparing every newly generated node with some other nodes. This has relevant implications in terms of computational complexity:
solution 1 requires a single comparison
solution 2 requires a number of comparisons equal to the depth of the expanded node n
solution 3 instead requires a comparison with all the nodes in the search tree: this means that its time complexity is exponential for strategies with exponential space complexity (like BFS)
even worse, solution 4 has exponential time and space complexity, since it also requires storing all previously generated nodes (this prevents exploiting the low space complexity of DFS)
Note that a more effective implementation of the above strategies can be obtained by removing the repeated state(s) with the highest path cost, to avoid sub-optimal choices. 132

133 Effectiveness of uninformed search: an example
One may think that the high computational complexity of uninformed search strategies is an issue only for real-world problems, not for toy ones. Consider again the 8-puzzle, apparently a very simple toy problem:
[Figure 3.4: a typical instance of the 8-puzzle, showing a start state and the goal state.]
How long does it take to solve it using, e.g., BFS? 133

134 Effectiveness of uninformed search: an example
Some facts about the 8-puzzle:
the state space contains 9! = 362,880 distinct states (only 9!/2 = 181,440 are reachable from any given initial state)
it can be shown that the average solution depth (over all possible pairs of initial and goal states) is about 22
the average branching factor b (over all possible states) is about 3 (note that from each state 2 to 4 actions can be performed)
How many nodes does BFS generate and store, when the shallowest solution has depth d = 22 (i.e., in the average case)? The worst-case time and space complexity of BFS is O(b^(d+1)), which in this case amounts to 3^23, about 9.4 x 10^10 nodes. For instance, taking into account that representing a state requires at least log2(9!) = 19 bits (rounded up), storing 3^23 states requires about 1.8 x 10^12 bits, i.e., more than 200 GB. 134
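The figures above can be double-checked with a few lines, assuming b = 3 and d = 22 as stated:

```python
import math

# Back-of-the-envelope check of the 8-puzzle BFS estimate (assumed b=3, d=22).
b, d = 3, 22
nodes = b ** (d + 1)                                       # worst-case BFS nodes
bits_per_state = math.ceil(math.log2(math.factorial(9)))   # 19 bits per state
total_gb = nodes * bits_per_state / 8 / 10**9
print(nodes, bits_per_state, round(total_gb))  # 94143178827 19 224
```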

135 Exercise 1. Implement the general tree-search algorithm, and the related data structures, in a programming language of your choice 2. Implement the additional, specific functions for breadth-first, depth-first and uniform-cost search 3. Implement the additional, specific data structures and functions for the 8-puzzle problem, and the route finding problem in the Romania map 4. Run the above search algorithms on specific problem instances, and evaluate the number of generated and stored nodes 135
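One possible starting point for steps 1-2 of the exercise (one way to do it, not an official solution): a single tree-search skeleton in which only the fringe policy distinguishes breadth-first, depth-first and uniform-cost search. The successors(state) function is assumed to yield (child_state, step_cost) pairs.

```python
from collections import deque
import heapq

# Sketch of the general tree-search algorithm with a pluggable fringe policy.
def tree_search(initial, successors, is_goal, strategy="bfs"):
    if strategy == "ucs":
        fringe = [(0, [initial])]                 # priority queue on path cost
        pop = lambda: heapq.heappop(fringe)
        push = lambda e: heapq.heappush(fringe, e)
    else:
        fringe = deque([(0, [initial])])
        pop = fringe.popleft if strategy == "bfs" else fringe.pop  # FIFO / LIFO
        push = fringe.append
    while fringe:
        cost, path = pop()
        if is_goal(path[-1]):                     # goal test on expansion
            return path, cost
        for child, step in successors(path[-1]):
            push((cost + step, path + [child]))
    return None

succ = {"A": [("B", 1), ("C", 10)], "B": [("C", 1)]}
print(tree_search("A", lambda s: succ.get(s, []), lambda s: s == "C", "ucs"))
# (['A', 'B', 'C'], 2): uniform-cost finds the cheapest path, BFS the shallowest
```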

136 Informed search Uninformed search is based on systematically exploring the search space, and does not exploit any available information about which nodes are more promising than others towards the solution. When such knowledge is available, it can be exploited to improve the effectiveness and the efficiency of tree search. The main idea is to use the available problem-specific knowledge to identify the best node to expand at each step of the general tree-search algorithm, instead of using uninformed criteria like expanding the shallowest or deepest node. This general approach is named best-first search. 136

137 Best-first search Best-first search is based on quantitatively evaluating how promising a given node n is towards a solution, through a suitable node evaluation function f (n) (usually lower values of f correspond to better nodes). Different definitions of f (n) lead to different, specific best-first search strategies, for instance: greedy search A*-search and its many variants (iterative-deepening A*, memory-bounded A*, etc.) 137

138 Best-first search Once a suitable f(n) (i.e., a specific best-first search strategy) has been defined, the corresponding search algorithm can be implemented using the same general tree-search algorithm presented above. Best-first search can be easily implemented by keeping the nodes in the fringe sorted by increasing values of f(n), e.g., with a priority queue: this allows the node n with the lowest f(n) (the best node) to be selected for expansion at each iteration. 138
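A common way to keep the fringe ordered without re-sorting at every iteration is a priority queue; a minimal sketch (the f-values are illustrative):

```python
import heapq

# Sketch: the fringe as a priority queue keyed on f(n), so the best node is
# popped in logarithmic time instead of re-sorting the whole fringe.
fringe = []
for name, f in [("Timisoara", 447), ("Sibiu", 393), ("Zerind", 449)]:
    heapq.heappush(fringe, (f, name))
best = heapq.heappop(fringe)   # always the entry with the lowest f
print(best)  # (393, 'Sibiu')
```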

139 Best-first search To define f(n), a very useful piece of information is the cost of the actions that will lead from any given node n to a goal state. Although in non-trivial problems the exact cost is usually unknown, often an estimate can be easily computed. The estimated cost is formalized as a function h(n) of the nodes. Note that, by definition, h(n) = 0 if n contains a goal state (this is the only case in which the cost is exactly known). For historical reasons h(n) is named heuristic function, and search strategies based on it are named heuristic search. Heuristic search is one of the earliest achievements of AI (dating back to the 1950s), and is still widely used in real-world problems and investigated by researchers in AI. 139

140 Heuristic functions: an example
Consider the problem of route finding in maps, e.g., finding the shortest route from Arad to Bucharest using the information on the map below:
[Figure 3.2: a simplified road map of part of Romania, with route lengths.]
Since the goal is to find the shortest route, the cost of the actions is evaluated as the route length (e.g., in km). Defining a heuristic function for this problem therefore amounts to estimating the distance between any given city and the destination (in the problem instance above, Bucharest). 140

141 Heuristic functions: an example
An easy-to-compute estimate for this kind of problem is the straight-line distance. If the destination is Bucharest, the heuristic function h(n) can therefore be defined as the straight-line distance from the city of node n to Bucharest.
The values of h(n) (considering Bucharest as the destination) are reported below, since they will be used later on [Figure 3.22: values of h_SLD, the straight-line distances to Bucharest]:

  Arad 366       Hirsova 151       Rimnicu Vilcea 193
  Bucharest 0    Iasi 226          Sibiu 253
  Craiova 160    Lugoj 244         Timisoara 329
  Drobeta 242    Mehadia 241       Urziceni 80
  Eforie 161     Neamt 234         Vaslui 199
  Fagaras 176    Oradea 380        Zerind 374
  Giurgiu 77     Pitesti 100
141

142 Greedy best-first search This is the simplest best-first search strategy: expanding the node that appears closest to the solution. As explained above, the exact cost is usually unknown; this strategy is therefore implemented using the estimated cost, i.e., the heuristic function h(n). Accordingly, the node evaluation function is simply defined as: f(n) = h(n) This strategy is called greedy since it favours partial solutions that appear (since h(n) is only an estimate) to be closest to the solution but, as it will become clearer later, this is not an optimal choice. 142

143 Greedy best-first search: an example Consider again the problem of finding the shortest route from Arad to Bucharest, using the straight-line distance heuristic. The following slides show the search tree built by the greedy search strategy, until a solution is found, including the value of f(n) for each node; the node selected for expansion is highlighted by an arrow. Remember that using the general tree-search algorithm a solution is found when a node containing a goal state is selected to be expanded, not when it is generated by the expansion of its parent node. 143

144 Greedy best-first search: an example
[Figure 3.23: stages in a greedy best-first tree search for Bucharest with the straight-line distance heuristic h_SLD; nodes are labeled with their h-values.
(a) The initial state: Arad (366).
(b) After expanding Arad: Sibiu (253), Timisoara (329), Zerind (374).
(c) After expanding Sibiu: Arad (366), Fagaras (176), Oradea (380), Rimnicu Vilcea (193).
(d) After expanding Fagaras: Sibiu (253), Bucharest (0); the node Bucharest is then selected for expansion, and a solution is found.] 147

148 Properties of greedy best-first search It can be shown that greedy best-first search exhibits the following properties:
it is complete, unless there are infinite paths (including trivial loops)
it is non-optimal: for instance, carefully looking at the above example it can be seen that a shorter route between Arad and Bucharest exists (through Sibiu and Rimnicu Vilcea) than the one found by the algorithm
worst-case time and space complexity are exponential in the maximum depth m of the search tree, for a given branching factor b: O(b^m) 148

149 A* search A* is the most relevant best-first search strategy. It was devised in the 1960s for robot navigation tasks. Many variants of A* have been proposed since then to tune the trade-off between its effectiveness and efficiency. 149

150 A* search Greedy search chooses for expansion the node n which appears closest to a solution, i.e., such that the estimated cost of the actions from n to a goal state is minimum. However, it disregards the cost of the actions from the root to n. A* uses instead an estimate of the total cost of the action sequence from the root to a goal state through n, defined as the sum of the path cost of n (which is exactly known) and the estimated cost from n to the solution. The corresponding node evaluation function is defined as:
f(n) = g(n) + h(n)
where g(n) is the path cost from the root to n, and h(n) is the estimated cost from n to the solution. 150
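Both evaluation functions can be tried on a fragment of the Romania example. This is an illustrative sketch, not the course's reference implementation; the road lengths and straight-line distances are the textbook values used in these slides:

```python
import heapq

# Sketch of best-first search on a fragment of the Romania map.
roads = {
    "Arad": {"Sibiu": 140, "Timisoara": 118, "Zerind": 75},
    "Sibiu": {"Arad": 140, "Fagaras": 99, "Oradea": 151, "Rimnicu Vilcea": 80},
    "Fagaras": {"Sibiu": 99, "Bucharest": 211},
    "Rimnicu Vilcea": {"Sibiu": 80, "Pitesti": 97, "Craiova": 146},
    "Pitesti": {"Rimnicu Vilcea": 97, "Bucharest": 101, "Craiova": 138},
}
sld = {"Arad": 366, "Bucharest": 0, "Craiova": 160, "Fagaras": 176,
       "Oradea": 380, "Pitesti": 100, "Rimnicu Vilcea": 193, "Sibiu": 253,
       "Timisoara": 329, "Zerind": 374}

def best_first(start, goal, f):
    # fringe entries: (f-value, path cost g, path); the lowest f is expanded first
    fringe = [(f(start, 0), 0, [start])]
    while fringe:
        _, g, path = heapq.heappop(fringe)
        n = path[-1]
        if n == goal:                  # goal test on expansion, not on generation
            return path, g
        for m, step in roads.get(n, {}).items():
            heapq.heappush(fringe, (f(m, g + step), g + step, path + [m]))
    return None

greedy = best_first("Arad", "Bucharest", lambda n, g: sld[n])      # f = h
astar = best_first("Arad", "Bucharest", lambda n, g: g + sld[n])   # f = g + h
print(greedy)  # (['Arad', 'Sibiu', 'Fagaras', 'Bucharest'], 450)
print(astar)   # (['Arad', 'Sibiu', 'Rimnicu Vilcea', 'Pitesti', 'Bucharest'], 418)
```

Greedy search follows the smallest h through Fagaras (450 km), while A* finds the cheaper route through Rimnicu Vilcea and Pitesti (418 km), matching the worked examples in these slides.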

151 A* search: an example The following slides show the search tree built by A* for the same problem of the previous example (Arad-Bucharest, using the straight-line distance heuristic). The value of f(n) = g(n) + h(n) is also shown for each node. Note that after the fourth iteration (the expansion of the node containing the state Fagaras) a leaf node containing the goal state Bucharest is generated. However, it is not selected for expansion at the next iteration (and thus a solution has not been found yet), since it is not the node with the minimum value of f(n). During the expansion of the node containing the state Pitesti in the fifth iteration, a different node containing the goal state Bucharest is generated. This latter node is selected for expansion at the next iteration, since it has the minimum value of f(n), and since it contains a goal state, a solution is found and A* terminates. 151

152 A* search: an example
[Figure 3.24: stages in an A* search for Bucharest; nodes are labeled with f = g + h, where the h-values are the straight-line distances to Bucharest.
(a) The initial state: Arad (366 = 0 + 366).
(b) After expanding Arad: Sibiu (393 = 140 + 253), Timisoara (447 = 118 + 329), Zerind (449 = 75 + 374).
(c) After expanding Sibiu: Arad (646 = 280 + 366), Fagaras (415 = 239 + 176), Oradea (671 = 291 + 380), Rimnicu Vilcea (413 = 220 + 193).
(d) After expanding Rimnicu Vilcea: Craiova (526 = 366 + 160), Pitesti (417 = 317 + 100), Sibiu (553 = 300 + 253).
(e) After expanding Fagaras: Sibiu (591 = 338 + 253), Bucharest (450 = 450 + 0).
(f) After expanding Pitesti: Bucharest (418 = 418 + 0), Craiova (615 = 455 + 160), Rimnicu Vilcea (607 = 414 + 193); the node Bucharest (418) is then selected for expansion, and a solution is found.] 157

158 Properties of A* search It can be shown that A* exhibits the following properties:
it is optimal (the proof is given in the following), provided that the heuristic is admissible, i.e., it never overestimates the cost to the solution (e.g., the straight-line distance is an admissible heuristic for route finding in maps)
it is complete, and is also optimally efficient (i.e., it expands the minimum number of nodes) for any admissible heuristic, among algorithms that extend search paths from the root
worst-case time and space complexity are exponential in the depth of the shallowest solution m, for a given branching factor b: O(b^m); nevertheless, A* is often much more efficient (i.e., it generates a much smaller number of nodes) than other uninformed and informed search strategies 158

159 Proof of A* optimality Assume the fringe contains one leaf node n′ with a suboptimal goal state, and no leaf node with an optimal goal state. Can A* ever select n′ to be expanded, thus returning it as a suboptimal solution? First, note that some leaf node n′′ ≠ n′ in the path toward an optimal solution must exist in the fringe. We have to consider therefore the following scenario:
[Diagram: a search tree whose fringe contains n′, which ends a suboptimal solution path, and n′′, which lies on the path toward an optimal solution not yet generated.] 159

160 Proof of A* optimality Denoting with C* the cost of an optimal solution, the above assumptions imply:
1. h(n′) = 0 (n′ contains a goal state)
2. f(n′) = g(n′) + h(n′) = g(n′) > C* (n′ contains a sub-optimal goal state)
3. f(n′′) = g(n′′) + h(n′′) ≤ C* (n′′ is in the path toward an optimal solution, and h is admissible, thus f(n′′) does not overestimate the cost of any solution reachable through n′′)
In turn, expressions 2 and 3 imply:
f(n′) > C* ≥ f(n′′)
This means that n′ cannot be selected to be expanded, and thus A* cannot return a suboptimal solution. 160

161 Improving A* search Good heuristics (discussed later) can reduce time and memory requirements, especially with respect to uninformed search However, in many practical problems even A* is infeasible: memory requirements are the main drawback. Alternative approaches have been devised: using non-optimal A* variants that find suboptimal solutions quickly using A* variants with reduced memory requirements and a small increase in execution time, but still optimal 161

162 Defining heuristic functions Intuitively, the more accurate the estimate of the cost to the solution provided by the heuristic function for a given node, the more efficient a best-first algorithm is. Defining a good (i.e., accurate) heuristic is therefore crucial for informed search. Moreover, heuristics have to be admissible to guarantee the optimality of A*. 162

163 Defining heuristic functions: examples
We have seen that a possible heuristic for route finding in maps is the straight-line distance. Consider now the 8-puzzle problem. Remember that about 3^23 (roughly 10^11) nodes are generated on average by breadth-first (uninformed) search: therefore a good heuristic can be of great practical help also in this toy problem.
[Figure 3.4: a typical instance of the 8-puzzle.]
As an exercise, try to devise admissible heuristic functions for the 8-puzzle problem. 163

164 Defining heuristic functions
Well-known admissible heuristics for the k-puzzle are the following:
number of misplaced tiles (in the following, h1(n))
sum of the distances of each tile from its goal position (city block or Manhattan distance, h2(n))
For instance, the value of h1 and h2 for the start state below on the left, with respect to the goal state on the right, is given by:
h1(start state) = 8 (all 8 tiles are misplaced)
h2(start state) = 3 + 1 + 2 + 2 + 2 + 3 + 3 + 2 = 18 (tiles 1 to 8)
[Figure 3.4: a typical instance of the 8-puzzle.] 164
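A possible encoding of h1 and h2, with states as 9-tuples read row by row and 0 marking the blank (an illustration, not the course's code); the start and goal states are the textbook instance shown above:

```python
# Sketch of the two classic admissible 8-puzzle heuristics.
def h1(state, goal):
    # number of misplaced tiles (the blank is not a tile, so it is skipped)
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h2(state, goal):
    # sum of the Manhattan (city block) distances of each tile from its goal square
    total = 0
    for tile in range(1, 9):
        i, j = state.index(tile), goal.index(tile)
        total += abs(i // 3 - j // 3) + abs(i % 3 - j % 3)
    return total

start = (7, 2, 4, 5, 0, 6, 8, 3, 1)   # 7 2 4 / 5 _ 6 / 8 3 1
goal = (0, 1, 2, 3, 4, 5, 6, 7, 8)    # _ 1 2 / 3 4 5 / 6 7 8
print(h1(start, goal), h2(start, goal))  # 8 18
```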

165 Defining/choosing heuristic functions
For some problems it may not be straightforward to define a heuristic function. In that case a general criterion is to set h(n) equal to the exact cost of a relaxed version of the problem at hand. Some examples:
k-puzzle: by relaxing the constraint that tiles can move only to a free adjacent square, and allowing them to move to any adjacent square, one obtains h2(n) (see above)
k-puzzle: similarly, allowing tiles to move to any square (even non-adjacent and occupied ones), one obtains h1(n)
route finding in maps: by relaxing the constraint that an adjacent city can be reached only through the corresponding route, and allowing one to move straight to it, one obtains the straight-line distance heuristic 165

166 Defining/choosing heuristic functions
On the other hand, for some problems it can be possible to define several admissible heuristics h1, ..., hp (e.g., h1 and h2 for the 8-puzzle). In this case one should choose or define a single heuristic h which dominates all the other ones, i.e.: for each node n, h(n) ≥ h_i(n), i = 1, ..., p. It is easy to see that such a heuristic is admissible, and provides a more accurate estimate of the cost to the solution than h1, ..., hp. To this aim, h can be defined as follows:
if there is a dominating heuristic among h1, ..., hp, choose it as the heuristic for the problem at hand
otherwise, for a given node n use the following heuristic: h(n) = max{h1(n), ..., hp(n)}, which by definition dominates h1, ..., hp 166
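The max-combination can be sketched in a couple of lines; the two component heuristics here are stand-in lambdas over a toy node record, not real puzzle code:

```python
# Sketch: combining several admissible heuristics by taking their maximum.
def combine(*heuristics):
    # The combined heuristic dominates each component by construction.
    return lambda n: max(h(n) for h in heuristics)

h1 = lambda n: n["misplaced"]   # stand-in for, e.g., the misplaced-tiles count
h2 = lambda n: n["manhattan"]   # stand-in for, e.g., the Manhattan-distance sum
h = combine(h1, h2)
print(h({"misplaced": 8, "manhattan": 18}))  # 18
```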

167 Evaluating heuristic functions
To evaluate the quality of heuristic functions the concept of effective branching factor (denoted as b*) is used:
let N be the number of nodes generated by A* for a given problem, and d be the solution depth
b* is defined as the branching factor of a uniform tree of depth d containing N nodes, which is the solution of the equation: N = 1 + b* + (b*)^2 + ... + (b*)^d
The lower the value of b*, the better the heuristic. Since b* depends on the problem instance, it is usually evaluated empirically as the average over a set of instances. 167
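Since the equation above has no closed-form solution for b*, it can be solved numerically; a bisection sketch (the function name and tolerance are arbitrary choices):

```python
# Sketch: recovering b* numerically from N = 1 + b* + (b*)^2 + ... + (b*)^d.
def effective_branching_factor(n_generated, depth, tol=1e-9):
    total = lambda b: sum(b ** i for i in range(depth + 1))
    lo, hi = 1.0, float(n_generated)   # the root b* lies in this interval
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < n_generated:   # total(b) grows monotonically with b
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Sanity check: a uniform tree with b = 3 and d = 2 has 1 + 3 + 9 = 13 nodes.
print(round(effective_branching_factor(13, 2), 3))  # 3.0
```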

168 Evaluating heuristic functions
As an example, the table below reports the results of an empirical evaluation of the effective branching factor of heuristics h1 and h2 for the 8-puzzle (used in A*), and, for comparison, of one of the most efficient uninformed search strategies, iterative-deepening depth-first search (IDS). The comparison is made on 600 randomly generated problem instances with solution depth d = 4, 8, ..., 24 (100 instances for each depth value). The symbol - means that IDS could not terminate due to memory overflow. It is clear that h2 is significantly better than h1, and that uninformed search is unfeasible even for the 8-puzzle.

        search cost (expanded nodes)       effective branching factor
  d     IDS         A* (h1)    A* (h2)     IDS      A* (h1)    A* (h2)
  4     112         13         12          2.87     1.48       1.45
  8     6,384       39         25          2.80     1.33       1.24
  12    3,644,035   227        73          2.78     1.42       1.24
  16    -           1,301      211         -        1.45       1.25
  20    -           7,276      676         -        1.47       1.27
  24    -           39,135     1,641       -        1.48       1.26
168

169 Knowledge representation and inference 169

170 Some motivating problems Consider the following problems, and assume that your goal is to design rational agents, in the form of computer programs, capable of autonomously solving them. 170

171 Some motivating problems Automatic theorem proving An example: write a computer program capable of proving or refuting the following statement: Goldbach's conjecture (1742) For any even number p ≥ 4, there exists at least one pair of prime numbers q and r (identical or not) such that q + r = p. 171

172 Some motivating problems Game playing An example: write a computer program capable of playing the wumpus game, a text-based computer game (by G. Yob, c. 1972) used in a modified version as an AI toy problem.
[Figure: a typical wumpus world; the agent starts in the bottom left corner, facing right.]
the wumpus world: a cave made up of connected rooms, bottomless pits, a heap of gold, and the wumpus, a beast that eats anyone who enters its room
goal: starting from room (1,1), find the gold and go back to (1,1), without falling into a pit or running into the wumpus
the content of any room is known only after entering it
in rooms neighboring the wumpus and the pits a stench and a breeze are perceived, respectively 172

173 Knowledge-based systems Humans usually solve problems like the ones above by combining knowledge and reasoning. Knowledge-based systems aim at mechanizing these two high-level human capabilities: representing knowledge about the world reasoning to derive new knowledge (and to guide action) 173

174 An example Sketch of a possible reasoning process for deciding the next move in the wumpus game, starting from the configuration shown above (not all moves are shown).
[Figures 7.3-7.4: the first steps taken by the agent in the wumpus world. (a) The initial situation, after percept [None, None, None, None, None]. (b) After one move, with percept [None, Breeze, None, None, None]. (c) After the third move, with percept [Stench, None, None, None, None]. Legend: A = Agent, B = Breeze, G = Glitter/Gold, OK = safe square, P = Pit, S = Stench, V = Visited, W = Wumpus.] 174

175 Main approaches to AI system design
Procedural: the desired behavior (actions) is encoded directly as program code (no explicit knowledge representation and reasoning).
Declarative: explicit representation, in a knowledge base, of
background knowledge (e.g., the rules of the wumpus game)
knowledge about one specific problem instance (e.g., what the agent knows about a specific wumpus cave it is exploring)
the agent's goal
Actions are then derived by reasoning. 175

176 Architecture of knowledge-based systems Knowledge base Reasoning module (Inference engine) update update actions update Sensors Actuators Environment Main feature: separation between knowledge representation and reasoning the knowledge base contains all the agent's knowledge about its environment, in declarative form the inference engine implements a reasoning process to derive new knowledge and to make decisions 176

177 Knowledge representation and reasoning Logic is one of the main tools used in AI for knowledge representation: logical languages propositional logic predicate (first-order) logic reasoning: inference rules and algorithms Some of the main contributions: Aristotle (4th cent. BC): the laws of thought G. Boole ( ): Boolean algebra (propositional logic) G. Frege ( ): predicate logic K. Gödel ( ): incompleteness theorem 177

178 Main applications Automatic theorem provers Logic programming languages (Prolog, etc.) Expert systems 178

179 A short introduction to logic What is logic? Propositions, argumentations Logical (formal) languages Logical reasoning 179

180 Logic Definition (a possible one) Logic is the study of the conditions under which an argumentation (reasoning) is correct. The above definition involves the following concepts: argumentation: a set of statements consisting of some premises and one conclusion. A famous example: All men are mortal; Socrates is a man; then, Socrates is mortal correctness: an argumentation is correct when its conclusion cannot be false whenever all its premises are true proof: a procedure to assess correctness 180

181 Propositions Natural language: very complex, vague, difficult to formalize. Logic considers argumentations made up of only a subset of statements: propositions (or declarative statements). Definition A proposition is a statement expressing a concept that can be either true or false. Example Socrates is a man Two and two makes four If the Earth had been flat, then Columbus would have not reached America A counterexample: Read that book! 181

182 Simple and complex propositions Definition A proposition is: simple, if it does not contain simpler propositions complex, if it is made up of simpler propositions connected by logical connectives Example Simple propositions: Socrates is a man Two and two makes four Complex propositions: A tennis match can be won or lost If the Earth had been flat, then Columbus would have not reached America 182

183 Argumentations When can a proposition be considered true or false? This is a philosophical question. Logic does not address this question: it only analyzes the structure of an argumentation. Example All men are mortal; Socrates is a man; then, Socrates is mortal. Is the structure of this argumentation correct, whatever its actual propositions are (i.e., regardless of whether they are true or false)? Informally, the structure of this argumentation is: all P are Q; x is P; then x is Q. 183

184 Formal languages Logic provides formal languages for representing (the structure of) propositions, in the form of sentences. A formal language is defined by a syntax and a semantics. Definition syntax (grammar): rules that define which sentences are well formed semantics: rules that define the meaning of sentences Examples of formal languages: arithmetic: propositions about numbers programming languages: instructions to be executed by a computer (for imperative languages like C) 184

185 Natural vs logical (formal) languages In natural languages: syntax is not rigorously defined semantics defines the content of a statement, i.e., what it refers to in the real world Example (syntax) The book is on the table: syntactically correct statement, with a clear semantics Book the on is table the: syntactically incorrect statement, no meaning can be attributed to it Colorless green ideas sleep furiously: 1 syntactically correct, but what does it mean? 1 N. Chomsky, Syntactic Structures, 1957. 185

186 Natural vs logical (formal) languages Logical languages: syntax: formally defined semantics: rules that define the truth value of each well-formed sentence with respect to each possible model (a possible world represented by that sentence) Example (arithmetic) Syntax: x + y = 4 is a well-formed sentence, x4y+ = is not Model: the symbol 4 represents the natural number four, x and y any natural number, + the sum operator, etc. Semantics: x + y = 4 is true for x = 1 and y = 3, x = 2 and y = 2, etc. 186

187 Logical entailment Logical reasoning is based on the relation of logical entailment between sentences, which defines when a sentence follows logically from another one. Definition The sentence α entails the sentence β if and only if, in every model in which α is true, β is also true. In symbols: α ⊨ β Example (from arithmetic) x + y = 4 ⊨ x = 4 − y, because in every model (i.e., for any assignment of numbers to x and y) in which x + y = 4 is true, x = 4 − y is also true. 187

188 Logical inference Definition logical inference: the process of deriving conclusions from premises inference algorithm: a procedure that derives sentences (conclusions) from other sentences (premises), in a given formal language. Formally, the fact that an inference algorithm A derives a sentence α from a set of sentences (a knowledge base) KB is written as: KB ⊢A α 188

189 Properties of inference algorithms Definition soundness (truth-preservation): an inference algorithm is sound if it derives only sentences entailed by the premises, i.e.: if KB ⊢A α, then KB ⊨ α completeness: an inference algorithm is complete if it derives all the sentences entailed by the premises, i.e.: if KB ⊨ α, then KB ⊢A α A sound algorithm derives conclusions that are guaranteed to be true in any world in which the premises are true. 189

190 Properties of inference algorithms Inference algorithms operate only at the syntactic level: sentences are physical configurations of an agent (e.g., bits in registers) inference algorithms construct new physical configurations from old ones logical reasoning should ensure that new configurations represent aspects of the world that actually follow from the ones represented by old configurations [Figure 7.6 (follows+entails.eps), Chapter 7, Logical Agents: sentences in the representation are related by entailment, the aspects of the real world they represent are related by the follows relation, and semantics connects the two levels.] 190

191 Applications of inference algorithms In AI, inference is used to answer two main kinds of questions: does a given conclusion α logically follow from the agent's knowledge KB? (i.e., does KB ⊨ α hold?) what are all the conclusions that logically follow from the agent's knowledge? (i.e., find all α such that KB ⊨ α) Example (the wumpus world) does a breeze in room (2,1) entail the presence of a pit in room (2,2)? what conclusions can be derived about the presence of pits and of the wumpus in each room, from the current knowledge? 191

192 Inference algorithms: model checking The definition of entailment can be directly applied to construct a simple inference algorithm: Definition Model checking: given a set of premises KB and a sentence α, enumerate all possible models and check whether α is true in every model in which KB is true. Example (arithmetic) KB : {x + y = 4} α : y = 4 − x Is the inference {x + y = 4} ⊨ y = 4 − x correct? Model checking: enumerate all possible pairs of numbers x, y, and check whether y = 4 − x is true whenever x + y = 4 is. 192

193 The issue of grounding A knowledge base KB (set of sentences considered true) is just syntax (a physical configuration of the agent): what is the connection between a KB and the real world? how does one know that KB is true in the real world? This is the same philosophical question met before. For humans: a set of beliefs (set of statements considered true) is a physical configuration of our brain how do we know that our beliefs are true in the real world? A simple answer can be given for agents (e.g., computer programs or robots): the connection is created by sensors, e.g.: perceiving a breeze in the wumpus world learning, e.g., when a breeze is perceived, there is a pit in some adjacent room Of course, perception and learning are fallible. 193

194 Architecture of knowledge-based systems revisited Knowledge base Reasoning module (Inference engine) update update actions update Sensors Actuators Environment If logical languages are used: knowledge base: a set of sentences in a given logical language inference engine: an inference algorithm for the same logical language 194

195 Logical languages Propositional logic the simplest logical language an extension of Boolean algebra (G. Boole, ) Predicate (or first-order) logic more expressive and concise than propositional logic seminal work: G. Frege ( ) 195

196 Propositional logic: syntax Atomic sentences either a propositional symbol that denotes a given proposition (usually written in capitals), e.g.: P, Q, ... or a propositional symbol with fixed meaning: True and False Complex sentences consist of atomic or (recursively) complex sentences connected by logical connectives (corresponding to natural language connectives like and, or, not, etc.) Logical connectives (only the commonly used ones are shown; different notations exist): ∧ (and) ∨ (or) ¬ (not) ⇒ (implies) ⇔ (if and only if / logical equivalence) 196

197 Propositional logic: syntax A formal grammar in Backus-Naur Form (BNF):
Sentence → AtomicSentence | ComplexSentence
AtomicSentence → True | False | Symbol
Symbol → P | Q | R | ...
ComplexSentence → ¬ Sentence | ( Sentence ∧ Sentence ) | ( Sentence ∨ Sentence ) | ( Sentence ⇒ Sentence ) | ( Sentence ⇔ Sentence )
197

198 Propositional logic: semantics Semantics of logical languages: meaning of a sentence: its truth value with respect to a particular model model: a possible assignment of truth values to all the propositional symbols that appear in the sentence Example The sentence P ∨ (Q ∧ R) has 2³ = 8 possible models. One model is {P = True, Q = False, R = True}. Note: models are abstract mathematical objects with no unique connection to the real world (e.g., P may stand for any proposition in natural language). 198

199 Propositional logic: semantics Atomic sentences: True is true in every model False is false in every model the truth value of every propositional symbol (atomic sentence) must be specified in the model Complex sentences: their truth value is recursively defined as a function of the simpler sentences and of the truth table of the logical connectives they contain 199
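The recursive definition above can be turned into a short evaluator. The tuple-based sentence representation and the function name below are illustrative choices, not part of the slides; a sketch only:

```python
# Sentences as nested tuples, e.g. ("or", "P", ("and", "Q", "R"))
# for P ∨ (Q ∧ R); a model is a dict mapping symbols to truth values.

def evaluate(sentence, model):
    """Recursively compute the truth value of a sentence in a model."""
    if isinstance(sentence, str):                      # atomic sentence
        if sentence == "True":
            return True
        if sentence == "False":
            return False
        return model[sentence]                         # propositional symbol
    op, *args = sentence
    if op == "not":
        return not evaluate(args[0], model)
    if op == "and":
        return evaluate(args[0], model) and evaluate(args[1], model)
    if op == "or":
        return evaluate(args[0], model) or evaluate(args[1], model)
    if op == "implies":                                # ¬α ∨ β
        return (not evaluate(args[0], model)) or evaluate(args[1], model)
    if op == "iff":                                    # same truth value
        return evaluate(args[0], model) == evaluate(args[1], model)
    raise ValueError(f"unknown connective: {op}")
```

The recursion bottoms out at the atomic sentences, exactly as in the definition: True and False have fixed values, and every other symbol is looked up in the model.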

200 Truth tables of commonly used connectives
P Q | ¬P | P ∧ Q | P ∨ Q | P ⇒ Q | P ⇔ Q
false false | true | false | false | true | true
false true | true | false | true | true | false
true false | false | false | true | false | false
true true | false | true | true | true | true
200

201 Example Determining the truth value of P ∨ (Q ∧ R) in all possible models:
P Q R | Q ∧ R | P ∨ (Q ∧ R)
false false false | false | false
false false true | false | false
false true false | false | false
false true true | true | true
true false false | false | true
true false true | false | true
true true false | false | true
true true true | true | true
201

202 Propositional logic and natural language The truth table of and, or and not is intuitive, but captures only a subset of their meaning in natural language. Example He fell down and broke his leg. Here and includes a temporal and a causal relation (He broke his leg and fell down does not have the same meaning) A tennis match can be won or lost. Exclusive (disjunctive) or, usually denoted in logic by ⊕ 202

203 Propositional logic and natural language The truth table of P ⇒ Q may not fit one's intuitive understanding of P implies Q or if P then Q 5 is odd implies Tokyo is the capital of Japan: meaningless in natural language, true in propositional logic (P ⇒ Q does not assume causation or relevance between P and Q) 5 is even implies 10 is even: can be considered false in natural language, but is true in propositional logic (P ⇒ Q is true whenever P is false) 203

204 Propositional logic and natural language Correct interpretation of P ⇒ Q: P is a sufficient but not necessary condition for Q to be true Therefore, the only way for P ⇒ Q to be false is when P is true and Q is false. In other words: if P is true, then I am claiming that Q is true; otherwise, I am making no claim (so, I cannot make a false claim). As a particular case, the meaning of the implication connective corresponds to the subset operator in mathematics: given two sets P and Q such that P ⊆ Q, the sentence if x ∈ P then x ∈ Q (where x denotes a given object) is clearly true; by denoting the proposition x ∈ P with P and x ∈ Q with Q, the above sentence can be represented exactly as P ⇒ Q. 204

205 Exercise 1. Define a set of propositional symbols to represent the wumpus world: the position of the agent, wumpus, pits, etc. 2. Define the model corresponding to the configuration below 3. Define the part of the initial agent's KB corresponding to its knowledge about the cave configuration in the figure below 4. Write a sentence for the proposition: If the wumpus is in room (3,1) then there is a stench in rooms (2,1), (4,1) and (3,2) [Figure 7.2 (wumpus-world.eps), Chapter 7, Logical Agents: a typical wumpus world; the agent is in the bottom left corner, facing right.] 205

206 Solution (1/4) A possible choice of propositional symbols: A 1,1 ("the agent is in room (1,1)"), A 1,2, ..., A 4,4 W 1,1 ("the wumpus is in room (1,1)"), W 1,2, ..., W 4,4 P 1,1 ("there is a pit in room (1,1)"), P 1,2, ..., P 4,4 G 1,1 ("the gold is in room (1,1)"), G 1,2, ..., G 4,4 B 1,1 ("there is a breeze in room (1,1)"), B 1,2, ..., B 4,4 S 1,1 ("there is stench in room (1,1)"), S 1,2, ..., S 4,4 206

207 Solution (2/4) Model corresponding to the considered configuration: A 1,1 is true; A 1,2, A 1,3, ... are false W 3,1 is true; W 1,1, W 1,2, ... are false P 1,3, P 3,3, P 4,4 are true; P 1,1, P 1,2, ... are false G 3,2 is true; G 1,1, G 1,2, ... are false B 1,2, B 1,4, ... are true; B 1,1, B 1,3, ... are false S 2,1, S 3,2, S 4,1 are true; S 1,1, S 1,2, ... are false 207

208 Solution (3/4) What the agent knows in the starting configuration: I am in room (1,1) (starting position of the game) I am alive: there are no pits nor the wumpus in this room there is no gold in this room I do not perceive a breeze nor a stench The corresponding agent's KB in propositional logic (the set of sentences the agent believes to be true): A 1,1, ¬A 1,2, ¬A 1,3, ..., ¬A 4,4 (16 sentences) ¬W 1,1 ¬G 1,1 ¬B 1,1, ¬S 1,1 208

209 Solution (4/4) One might translate the considered proposition using the implication connective (⇒): W 3,1 ⇒ (S 2,1 ∧ S 4,1 ∧ S 3,2) However, since there is only one wumpus, the opposite implication also holds: (S 2,1 ∧ S 4,1 ∧ S 3,2) ⇒ W 3,1 An equivalent, more concise way to express both sentences: (S 2,1 ∧ S 4,1 ∧ S 3,2) ⇔ W 3,1 209

210 Inference: model checking Goal of inference: given a KB and a sentence α, deciding whether KB ⊨ α. A simple inference algorithm: model checking (see above). Application to propositional logic: enumerate all possible models for the sentences in KB ∪ {α} check whether α is true in every model in which KB is true Implementation: truth tables. 210

211 Model checking: an example Determine whether {P ∨ Q, P ⇒ R, Q ⇒ R} ⊨ P ∨ R, using model checking.
P Q R | P ∨ Q | P ⇒ R | Q ⇒ R | P ∨ R
false false false | false | true | true | false
false false true | false | true | true | true
false true false | true | true | false | false
false true true | true | true | true | true
true false false | true | false | true | true
true false true | true | true | true | true
true true false | true | false | false | true
true true true | true | true | true | true
Answer: yes, because the conclusion is true in every model in which all the premises are true (the fourth, sixth and eighth rows). 211
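The same check can be automated with a direct implementation of model checking. A minimal sketch (the representation of premises and conclusion as Python predicates over a model is an illustrative choice):

```python
from itertools import product

def entails(premises, conclusion, symbols):
    """Model checking: return True iff the conclusion is true in every
    model in which all the premises are true."""
    for values in product([False, True], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if all(p(model) for p in premises) and not conclusion(model):
            return False               # found a countermodel
    return True

# {P ∨ Q, P ⇒ R, Q ⇒ R} ⊨ P ∨ R, as in the truth table above
premises = [lambda m: m["P"] or m["Q"],
            lambda m: (not m["P"]) or m["R"],   # P ⇒ R, i.e. ¬P ∨ R
            lambda m: (not m["Q"]) or m["R"]]   # Q ⇒ R
print(entails(premises, lambda m: m["P"] or m["R"], ["P", "Q", "R"]))  # True
```

Each iteration of the loop corresponds to one row of the truth table; the function returns False as soon as a row makes all premises true and the conclusion false.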

212 Properties of model checking Soundness: yes, it directly implements the definition of entailment Completeness: yes, since it works for any finite KB and α, and the corresponding set of models is finite Computational complexity: O(2^n), where n is the number of propositional symbols appearing in KB and α Its exponential computational complexity makes model checking infeasible when the number of propositional symbols is high. Example In the exercise about the wumpus world, 96 propositional symbols have been used: the corresponding truth table is made up of 2^96 rows. 212

213 Inference: general concepts Two sentences α and β are logically equivalent (α ≡ β) if they are true in the same models, i.e., if and only if α ⊨ β and β ⊨ α An example: (P ∧ Q) ≡ (Q ∧ P) (see the truth tables) A sentence is valid if it is true in all models. Such sentences are also called tautologies (an example: P ∨ ¬P) A sentence is satisfiable if it is true in at least one model An example: P ∧ Q 213

214 Inference: general concepts Two useful properties related to the above concepts: for any α and β, α ⊨ β if and only if the sentence (α ⇒ β) is valid; for instance, given a set KB of premises and a possible conclusion α, the model checking inference algorithm works by checking whether (KB ⇒ α) is valid satisfiability is related to the standard mathematical proof technique of reductio ad absurdum (proof by refutation or by contradiction): α ⊨ β if and only if (α ∧ ¬β) is unsatisfiable 214
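The refutation property can be checked mechanically with a brute-force satisfiability test. The sketch below is an illustration of the idea only (not an efficient SAT solver), again with sentences as predicates over a model:

```python
from itertools import product

def satisfiable(sentence, symbols):
    """Brute force: is the sentence true in at least one model?"""
    return any(sentence(dict(zip(symbols, values)))
               for values in product([False, True], repeat=len(symbols)))

def entails_by_refutation(alpha, beta, symbols):
    # α ⊨ β  iff  α ∧ ¬β is unsatisfiable (reductio ad absurdum)
    return not satisfiable(lambda m: alpha(m) and not beta(m), symbols)

# (P ∧ Q) ⊨ P holds, while P ⊨ (P ∧ Q) does not
print(entails_by_refutation(lambda m: m["P"] and m["Q"],
                            lambda m: m["P"], ["P", "Q"]))   # True
```

The same helper also checks validity: a sentence is valid exactly when its negation is unsatisfiable.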

215 Inference rules Practical inference algorithms are based on inference rules. An inference rule represents a standard pattern of inference: it implements a simple reasoning step, whose soundness can be easily proven, that can be applied to a set of premises having a specific structure to derive a conclusion. Inference rules are written with the premises above a horizontal bar and the conclusion below it: premises / conclusion 215

216 Examples of inference rules In the following, α and β denote any propositional sentences.
And-Elimination: α1 ∧ α2 / αi, i = 1, 2
And-Introduction: α1, α2 / α1 ∧ α2
Or-Introduction: α1 / α1 ∨ α2 (α2 can be any sentence)
First De Morgan's law: ¬(α1 ∧ α2) / ¬α1 ∨ ¬α2
Second De Morgan's law: ¬(α1 ∨ α2) / ¬α1 ∧ ¬α2
Double Negation: ¬(¬α) / α
Modus Ponens: α ⇒ β, α / β
The first five rules above generalize easily to any set of sentences α1, ..., αn. 216

217 Soundness of inference rules Since inference rules usually involve a few sentences, their soundness can be easily proven using model checking. An example: Modus Ponens
α (premise) β (conclusion) | α ⇒ β (premise)
false false | true
false true | true
true false | false
true true | true
The premises α ⇒ β and α are both true only in the last row, where the conclusion β is also true. 217

218 Inference algorithms Given a set of premises KB and a hypothetical conclusion α, the goal of an inference algorithm A is to find a proof KB ⊢A α (if any), i.e., a sequence of applications of inference rules that leads from KB to α. 218

219 Inference algorithms: an example In the initial configuration of the wumpus game, the agent's KB includes: (a) ¬B 1,1 (current percept: no breeze in room (1,1)) (b) ¬B 1,1 ⇒ (¬P 1,2 ∧ ¬P 2,1) (one of the rules of the game) The agent can be interested in knowing whether room (1,2) contains a pit, i.e., whether KB ⊨ ¬P 1,2: applying Modus Ponens to (a) and (b), it derives: (c) ¬P 1,2 ∧ ¬P 2,1 applying And-Elimination to (c), it derives ¬P 1,2 So, it can conclude that room (1,2) does not contain a pit. 219

220 Properties of inference algorithms Three main issues: is a given inference algorithm sound (correct)? is it complete? what is its computational complexity? It is not difficult to see that, if the considered inference rules are sound, so is an inference algorithm based on them. Completeness is more difficult to prove: it depends on the set of available inference rules, and on the ways in which they are applied. 220

221 Properties of inference algorithms What about computational complexity? Note that finding a proof KB ⊢A α, given a set of inference rules R, can also be formulated as a search problem: initial state: the set of sentences KB state space: any set of sentences made up of the union of KB and of the sentences that can be derived by applying to KB any sequence of rules in R operators: the inference rules in R goal state: set(s) of sentences including α 221

222 Properties of inference algorithms This suggests that the computational complexity can be very high: the solution depth may be high (some proofs require a large number of steps) the branching factor can be high several inference rules can be applicable to a given KB each of them can be applicable to several sets of sentences an example: the 16 sentences of the agent's KB at the beginning of the wumpus game, A 1,1, ¬A 1,2, ¬A 1,3, ..., ¬A 4,4, allow And-Introduction to be applied in Σ_{k=2}^{16} C(16, k) = 65,519 different ways Efficiency can be improved by ignoring propositions that are irrelevant to the conclusion α. For instance, to prove a conclusion involving only P and Q, propositions like R, S and T can be ignored. 222
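The branching-factor count above (one application of generalized And-Introduction for each subset of k ≥ 2 of the 16 sentences) can be sanity-checked in a few lines:

```python
from math import comb

# Each subset of k >= 2 of the 16 initial sentences gives one possible
# application of (generalized) And-Introduction.
ways = sum(comb(16, k) for k in range(2, 17))
print(ways)   # 65519, i.e. 2**16 minus the empty set and the 16 singletons
```

The closed form 2^16 − 16 − 1 follows from the fact that the subsets of all sizes sum to 2^16.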

223 Horn clauses In many domains of practical interest, the whole KB can be expressed in the form of if... then... propositions that can be encoded as Horn clauses, i.e., implications where: the antecedent is a conjunction (∧) of atomic sentences (non-negated propositional symbols) the consequent is a single atomic sentence P 1 ∧ ... ∧ P n ⇒ Q For instance, S 2,1 ∧ S 4,1 ∧ S 3,2 ⇒ W 3,1 is a Horn clause. As particular cases, also atomic sentences (i.e., propositional symbols) and their negations can be rewritten as Horn clauses. Indeed, since (P ⇒ Q) ≡ (¬P ∨ Q): P ≡ (True ⇒ P) ¬P ≡ (P ⇒ False) 223

224 Forward and backward chaining Two practical inference algorithms exist in the particular case when: the KB can be expressed as a set of Horn clauses the conclusion is an atomic and non-negated sentence These algorithms, named forward and backward chaining, exhibit the following characteristics: they are complete they use a single inference rule (Modus Ponens) they exhibit a computational complexity linear in the size of the KB 224

225 Forward chaining Given a KB made up of Horn clauses, forward chaining (FC) derives all the entailed atomic (non-negated) sentences: function Forward-Chaining (KB) repeat apply MP in all possible ways to sentences in KB add to KB the derived sentences not already present (if any) until no new sentence has been derived return KB 225

226 Forward chaining FC is an example of data-driven reasoning: it starts from the known data, and derives their consequences. For instance, in the Wumpus game FC could be used to update the agent s knowledge about the environment (the presence of pits in each room, etc.), based on the new percepts after each move. The inference engine of expert systems (described later) is inspired by the FC inference algorithm. 226

227 Forward chaining: an example Consider the KB shown below, made up of Horn clauses: 1. P ⇒ Q 2. L ∧ M ⇒ P 3. B ∧ L ⇒ M 4. A ∧ P ⇒ L 5. A ∧ B ⇒ L 6. A 7. B (cont.) 227

228 Forward chaining: an example By applying FC one obtains: 8. the only implication whose premises (individual propositional symbols) are all in the KB is 5: MP derives L and adds it to the current KB 9. now the premises of 3 are all true: MP derives M and adds it to the KB 10. the premises of 2 have become all true: MP derives P and adds it to the KB 11. the premises of 1 and 4 are now all true: MP derives Q from 1 and adds it to the KB, but disregards 4 since its consequent (L) is already present in the KB 12. no new sentences can be derived from 1-11: FC ends and returns the updated KB containing the original sentences 1-7 and the ones derived in the above steps: {L, M, P, Q} 228
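The trace above can be reproduced with a minimal FC sketch for Horn clauses. The representation of rules as (antecedents, consequent) pairs is an illustrative choice:

```python
def forward_chaining(facts, rules):
    """Derive all atomic sentences entailed by a Horn-clause KB.
    facts: set of known symbols; rules: (antecedents, consequent) pairs."""
    known = set(facts)
    changed = True
    while changed:                       # repeat until no new sentence is derived
        changed = False
        for antecedents, consequent in rules:
            if consequent not in known and all(a in known for a in antecedents):
                known.add(consequent)    # one application of Modus Ponens
                changed = True
    return known

# The KB of the example: 1. P⇒Q  2. L∧M⇒P  3. B∧L⇒M  4. A∧P⇒L  5. A∧B⇒L, facts A, B
rules = [({"P"}, "Q"), ({"L", "M"}, "P"), ({"B", "L"}, "M"),
         ({"A", "P"}, "L"), ({"A", "B"}, "L")]
print(sorted(forward_chaining({"A", "B"}, rules)))  # ['A', 'B', 'L', 'M', 'P', 'Q']
```

Each pass over the rules corresponds to one round of steps 8-11 above: first L is derived, then M, then P, then Q.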

229 Backward chaining For a given KB made up of Horn clauses, and a given atomic, non-negated sentence α, FC can be used to prove whether or not KB ⊨ α. To this aim, one has to check whether α is present among the derived sentences. However, backward chaining (BC) is more effective for this goal. BC recursively applies MP backwards. It exploits the fact that KB ⊨ α if and only if: either α ∈ KB (this terminates the recursion) or KB contains some implication β 1 ∧ ... ∧ β n ⇒ α, and (recursively) KB ⊨ β 1, ..., KB ⊨ β n The sentence α to be proven is also called the query. 229

230 Backward chaining function Backward-Chaining (KB, α) if α ∈ KB then return True let B be the set of sentences of KB having α as the consequent for each β ∈ B let β 1, β 2, ... be the propositional symbols in the antecedent of β if Backward-Chaining (KB, β i) = True for all β i's then return True return False 230

231 Backward chaining BC is a form of goal-directed reasoning. For instance, in the Wumpus game it could be used to answer queries like: given the current agent's knowledge, is moving upward the best action? The computational complexity of BC is even lower than that of FC, since BC focuses only on relevant sentences. The Prolog logic programming language is based on the predicate logic version of the BC inference algorithm (described later). 231

232 Backward chaining: an example Consider a KB representing the rules followed by a financial institution for deciding whether to grant a loan to an individual. The following propositional symbols are used: OK: the loan should be approved COLLAT : the collateral for the loan is satisfactory PYMT : the applicant is able to repay the loan REP: the applicant has a good financial reputation APP: the appraisal on the collateral is sufficiently greater than the loan amount RATING: the applicant has a good credit rating INC: the applicant has a good, steady income 232

233 Backward chaining: an example The KB is made up of the five rules (implications) on the left, and of the data about a specific applicant encoded by the four sentences on the right (all of them are Horn clauses): 1. COLLAT ∧ PYMT ∧ REP ⇒ OK 2. APP ⇒ COLLAT 3. RATING ⇒ REP 4. INC ⇒ PYMT 5. BAL ∧ REP ⇒ OK 6. APP 7. RATING 8. INC 9. ¬BAL Should the loan be approved for this specific applicant? This amounts to proving whether OK is entailed by the KB, i.e., whether KB ⊨ OK. 233

234 Backward chaining: an example The BC recursive proof KB ⊢BC OK can be conveniently represented as an AND-OR graph, a tree-like graph in which: multiple links joined by an arc indicate a conjunction (every link must be proven) multiple links without an arc indicate a disjunction (any link can be proven) 234

235 Backward chaining: an example The first call Backward-Chaining(KB, OK) is represented by the tree root, corresponding to the sentence to be proven. Since OK ∉ KB, the implications having OK as the consequent are searched for. There are two such sentences: 1 and 5. The BC procedure tries to prove all the antecedents of at least one of them. Considering first 5, a recursive call to Backward-Chaining is made for each of its two antecedents, BAL and REP, represented by an AND-link. 235

236 Backward chaining: an example Consider the call Backward-Chaining(KB, REP): since REP ∉ KB, and the only implication having REP as the consequent is 3, another recursive call is made for the antecedent of 3, RATING. The call Backward-Chaining(KB, RATING) returns True, since RATING ∈ KB, and thus the call Backward-Chaining(KB, REP) also returns True. 236

237 Backward chaining: an example However, the call Backward-Chaining(KB, BAL) returns False, since BAL ∉ KB and there are no implications having BAL as the consequent. Therefore, the first call Backward-Chaining(KB, OK) is not able to prove OK through this AND-link. The other sentence in the KB having OK as the consequent, 1, is now considered, and another AND-link is generated, with one recursive call for each of the three antecedents of 1: COLLAT, PYMT and REP. 237

238 Backward chaining: an example The call Backward-Chaining(KB, COLLAT) generates in turn another recursive call to prove the antecedent of the only implication having COLLAT as the consequent, 2. The call Backward-Chaining(KB, APP) returns True, since APP ∈ KB, and thus Backward-Chaining(KB, COLLAT) also returns True. 238

239 Backward chaining: an example Similarly, the calls Backward-Chaining(KB, PYMT) and Backward-Chaining(KB, REP) return True. The corresponding AND-link is then proven, which finally allows the first call Backward-Chaining(KB, OK) to return True. The proof KB ⊢BC OK is then successfully completed. 239
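The whole walkthrough can be condensed into a minimal BC sketch. The (antecedents, consequent) rule representation is an illustrative choice; since BC as described handles only non-negated atomic queries, the negated fact about BAL is simply not in the fact set:

```python
def backward_chaining(facts, rules, query, visited=frozenset()):
    """Goal-directed proof of an atomic query from a Horn-clause KB."""
    if query in facts:
        return True
    if query in visited:                  # guard against cyclic rules
        return False
    for antecedents, consequent in rules:
        if consequent == query and all(
                backward_chaining(facts, rules, a, visited | {query})
                for a in antecedents):
            return True
    return False

# The loan KB: rules 1-5 plus the facts APP, RATING, INC
rules = [({"COLLAT", "PYMT", "REP"}, "OK"), ({"APP"}, "COLLAT"),
         ({"RATING"}, "REP"), ({"INC"}, "PYMT"), ({"BAL", "REP"}, "OK")]
facts = {"APP", "RATING", "INC"}
print(backward_chaining(facts, rules, "OK"))   # True: the loan is approved
print(backward_chaining(facts, rules, "BAL"))  # False, as in the walkthrough
```

The recursion mirrors the AND-OR graph: the loop over rules is the OR (any implication with the right consequent may succeed), and the all(...) over antecedents is the AND-link.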

240 Resolution algorithm FC and BC exhibit a low computational complexity. They are also complete, but limited to: KBs made up of Horn clauses conclusions consisting of a non-negated propositional symbol It turns out that a complete inference algorithm for full propositional logic also exists: the resolution algorithm, which uses a single inference rule, itself named resolution. Given any KB and any sentence α, the resolution algorithm proves whether or not KB ⊨ α. Its computational complexity is, however, much higher than that of FC and BC. The predicate logic version of the resolution algorithm is used in automatic theorem provers, to assist mathematicians in developing complex proofs. 240

241 Exercise 1 Construct the agent's initial KB for the wumpus game. The KB should contain: the rules of the game: the agent starts in room (1,1); there is a breeze in rooms adjacent to pits, etc. rules to decide the agent's move at each step of the game Note that the KB must be updated at each step of the game: 1. adding percepts in the current room (from sensors) 2. reasoning to derive new knowledge about the position of pits and wumpus 3. reasoning to decide the next move 4. updating the agent's position 241

242 Exercise 1 Rules of the wumpus game: the agent starts in room (1,1): A 1,1 ∧ ¬A 1,2 ∧ ... ∧ ¬A 4,4 there is a breeze in rooms adjacent to pits: P 1,1 ⇒ (B 2,1 ∧ B 1,2), P 1,2 ⇒ (B 1,1 ∧ B 2,2 ∧ B 1,3), ... (one proposition in natural language, sixteen sentences in propositional logic, one for each room) there is only one wumpus: (W 1,1 ∨ W 1,2 ∨ W 1,3 ∨ ... ∨ W 4,4), W 1,1 ⇒ (¬W 1,2 ∧ ¬W 1,3 ∧ ... ∧ ¬W 4,4), ... (one proposition in natural language, sixteen sentences in propositional logic, one for each room)... Often, one concise proposition in natural language needs to be represented by many complex sentences in propositional logic. 242

243 Exercise 1 How to update the KB to account for the change of the agent's position after each move? E.g., A 1,1 is true in the starting position, and becomes false after the first move: adding ¬A 1,1 makes the KB contradictory, since A 1,1 is still present, and inference rules do not allow removing sentences Solution: using a different propositional symbol for each time step, e.g., A^t i,j, t = 1, 2, ... initial KB: A^1 1,1, ¬A^1 1,2, ..., ¬A^1 4,4 if the agent moves to (1,2), the following sentences must be added to the KB: ¬A^2 1,1, A^2 1,2, ¬A^2 1,3, ..., ¬A^2 4,4; and so on Things get complicated... 243

244 Exercise 2
The following argumentation (an example of syllogism) is intuitively correct; prove its correctness using propositional logic: All men are mortal; Socrates is a man; then, Socrates is mortal.
Three distinct propositional symbols must be used: P (All men are mortal), Q (Socrates is a man), R (Socrates is mortal). Therefore:
- premises: {P, Q}
- conclusion: R
Do the premises entail the conclusion, i.e., {P, Q} ⊨ R? Model checking easily allows one to prove that the answer is no: in the model {P = True, Q = True, R = False}, the premises are true but the conclusion is false. What's wrong?
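The model-checking argument can be verified by brute-force enumeration of all eight truth assignments. A minimal Python sketch (the representation of sentences as functions over a model is an illustrative choice, not part of the slides):

```python
from itertools import product

def entails(premises, conclusion):
    """Model checking: enumerate all truth assignments of P, Q, R and look
    for a model where every premise is true but the conclusion is false."""
    symbols = ["P", "Q", "R"]
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if all(p(model) for p in premises) and not conclusion(model):
            print("Counterexample:", model)
            return False
    return True

# P = "All men are mortal", Q = "Socrates is a man", R = "Socrates is mortal"
premises = [lambda m: m["P"], lambda m: m["Q"]]
conclusion = lambda m: m["R"]
print(entails(premises, conclusion))  # False: {P: True, Q: True, R: False}
```

The counterexample found is exactly the model mentioned in the slide, confirming that {P, Q} ⊭ R in propositional logic.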

245 Limitations of propositional logic
Main problems: limited expressive power, lack of conciseness.
Example (wumpus world): even small knowledge bases (in natural language) require a large number of propositional symbols and sentences.
Example (syllogisms): inferences involving the structure of atomic sentences (All men are mortal, ...) cannot be made.

246 From propositional to predicate logic
The description of many domains of interest for real-world applications (e.g., mathematics, philosophy, AI) involves the following elements of natural language:
- nouns denoting objects (or persons), e.g.: wumpus and pits; Socrates and Plato; the numbers one, two, etc.
- verbs denoting properties of individual objects and relations between them, e.g.: Socrates is a man, five is prime, four is lower than five, the sum of two and two equals four
- some relations between objects that can be represented as functions, e.g.: father of, two plus two
- facts involving some or all the objects, e.g.: all squares neighboring the wumpus are smelly; some numbers are prime
These elements cannot be represented in propositional logic, and require the more expressive predicate logic.

247 Predicate logic: models
A model in predicate logic consists of:
- a domain of discourse: a set of objects, e.g.: the set of natural numbers; a set of individuals: Socrates, Plato, ...
- relations between objects; each relation is represented as the set of tuples of objects that are related, e.g.:
  being greater than (binary relation): {(2,1), (3,1), ...}
  being a prime number (unary relation): {2, 3, 5, 7, 11, ...}
  being the sum of (ternary relation): {(1,1,2), (1,2,3), ...}
  being the father of (binary relation): {(John, Mary), ...}
  (unary relations are also called properties)
- functions that map tuples of objects to a single object, e.g.:
  plus: (1,1) ↦ 2, (1,2) ↦ 3, ...
  father of: Mary ↦ John, ...
Note that relations and functions are defined extensionally, i.e., by explicitly enumerating the corresponding tuples.

248 Predicate logic: syntax
The basic elements are symbols that represent objects, relations and functions:
- constant symbols denote objects, e.g.: One, Two, Three, John, Mary
- predicate symbols denote relations, e.g.: GreaterThan, Prime, Sum, Father
- function symbols denote functions, e.g.: Plus, FatherOf

249 Predicate logic: syntax
A formal grammar in Backus-Naur Form (BNF):
Sentence → AtomicSentence | (Sentence Connective Sentence) | Quantifier Variable, ... Sentence | ¬Sentence
AtomicSentence → Predicate(Term, ...)
Term → Function(Term, ...) | Constant | Variable
Connective → ∧ | ∨ | ⇒ | ⇔
Quantifier → ∀ | ∃
Constant → John | Mary | One | Two | ...
Variable → a | x | s | ...
Predicate → GreaterThan | Father | ...
Function → Plus | FatherOf | ...

250 Semantics of predicate logic: interpretations
Remember that semantics defines the truth of well-formed sentences with respect to a particular model. In predicate logic this requires an interpretation: a definition of which objects, relations and functions are referred to by the symbols. Examples:
- One, Two and Three denote the natural numbers 1, 2, 3
- John and Mary denote the individuals John and Mary
- GreaterThan denotes the binary relation > between numbers
- Father denotes the fatherhood relation between individuals
- Plus denotes the function mapping a pair of numbers to their sum

251 Semantics: terms
Terms are logical expressions denoting objects. A term can be:
- simple: a constant symbol, e.g.: One, Two, John
- complex: a function symbol applied (possibly, recursively) to other terms, e.g.: FatherOf(Mary), Plus(One, Two), Plus(One, Plus(One, One))
Note:
- assigning a constant symbol to every object in the domain is not required (domains can even be infinite)
- an object can be denoted by more than one constant symbol

252 Semantics: atomic sentences
The simplest kind of sentence: a predicate symbol applied to a list of terms. Examples:
- GreaterThan(Two, One), Prime(Two), Prime(Plus(Two, Two)), Sum(One, One, Two)
- Father(John, Mary), Father(FatherOf(John), FatherOf(Mary))

253 Semantics: atomic sentences
Definition: an atomic sentence is true, in a given model and under a given interpretation, if the relation referred to by its predicate symbol holds between the objects referred to by its arguments (terms).
Example: according to the above model and interpretation:
- GreaterThan(Two, One) is true
- Prime(Two) is true
- Prime(Plus(One, One)) is true
- Sum(One, One, Two) is true
- Father(John, Mary) is true

254 Semantics: complex sentences
Complex sentences are obtained as in propositional logic, using logical connectives. Examples:
- Prime(Two) ∧ Prime(Three)
- Sum(One, One, Two) ⇒ GreaterThan(One, Two)
- ¬(GreaterThan(Two, One))
- Father(John, Mary) ∨ Father(Mary, John)
Their semantics (truth value) is determined as in propositional logic. Examples: the second and third sentences above are false, the others are true.

255 Semantics: quantifiers
Quantifiers allow one to express propositions involving collections of objects, without enumerating them explicitly. Two main quantifiers are used in predicate logic:
- universal quantifier (∀), e.g.: All men are mortal; All rooms neighboring the wumpus are smelly; All even numbers are not prime
- existential quantifier (∃), e.g.: Some numbers are prime; Some rooms contain pits; Some men are philosophers
Quantifiers require a new kind of term: variable symbols, usually denoted with lowercase letters.

256 Semantics: universal quantifier
Example: assume that the domain is the set of natural numbers.
- All natural numbers are greater than or equal to one: ∀x GreaterOrEqual(x, One)
- All natural numbers are either even or odd: ∀x Even(x) ∨ Odd(x)

257 Semantics: universal quantifier
The semantics of a sentence ∀x α(x), where α(x) is a sentence containing the variable x, is: α(x) is true for each domain element in place of x.
Example: if the domain is the set of natural numbers, ∀x GreaterOrEqual(x, One) means that the following (infinitely many) sentences are all true: GreaterOrEqual(One, One), GreaterOrEqual(Two, One), ...

258 Semantics: universal quantifier
Consider the proposition: all even numbers greater than two are not prime. A common mistake is to represent it as follows:
∀x Even(x) ∧ GreaterThan(x, Two) ∧ ¬Prime(x)
The above sentence actually means: all numbers are even, greater than two, and not prime, which is different from the original one (and is also false). The correct sentence can be obtained by noting that the original proposition can be restated as: for all x, if x is even and greater than two, then it is not prime, which is represented by an implication:
∀x (Even(x) ∧ GreaterThan(x, Two)) ⇒ ¬Prime(x)
In general, propositions where all refers to the elements of the domain that satisfy some condition must be represented using an implication.
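The difference between the two formalizations can be checked on a finite domain. A small Python sketch (the predicate names and the domain 1..20 are illustrative), where ⇒ is rewritten as ¬antecedent ∨ consequent:

```python
def even(x): return x % 2 == 0
def greater_than_two(x): return x > 2
def prime(x): return x > 1 and all(x % d for d in range(2, x))

domain = range(1, 21)

# Wrong reading: "all numbers are even, greater than two, and not prime"
wrong = all(even(x) and greater_than_two(x) and not prime(x) for x in domain)
# Correct reading: "if x is even and greater than two, then x is not prime"
correct = all(not (even(x) and greater_than_two(x)) or not prime(x) for x in domain)

print(wrong, correct)  # False True
```

The conjunctive version already fails at x = 1, while the implication version holds for every element of the domain.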

259 Semantics: universal quantifier
Consider again this sentence:
∀x (Even(x) ∧ GreaterThan(x, Two)) ⇒ ¬Prime(x)
Claiming that it is true means that sentences like the following are also true:
(Even(One) ∧ GreaterThan(One, Two)) ⇒ ¬Prime(One)
Note that the antecedent of the implication is false (the number one is not even, nor is it greater than the number two). This is not contradictory, since implications with false antecedents are true by definition (see again the truth table of ⇒).

260 Semantics: existential quantifier
Example: assume that the domain is the set of natural numbers.
- Some numbers are prime: ∃x Prime(x). This is read as: there exists some x such that x is prime.
- Some numbers are not greater than three, and are even: ∃x ¬GreaterThan(x, Three) ∧ Even(x)

261 Semantics: existential quantifier
Consider a proposition like the following: some odd numbers are prime. A common mistake is to represent it using an implication:
∃x Odd(x) ⇒ Prime(x)
The above sentence actually means: there exists some number such that, if it is odd, then it is prime, which is weaker than the original proposition, since it would be true (by definition of ⇒) even if there were no odd numbers at all (i.e., if the antecedent Odd(x) were false for all domain elements). The correct sentence can be obtained by noting that the original proposition can be restated as: there exists some x such that x is odd and x is prime:
∃x Odd(x) ∧ Prime(x)
In general, propositions introduced by some must be represented using a conjunction.
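The weakness of the implication reading can be seen on a domain that contains no odd numbers at all: the implication version is vacuously true, while the conjunctive version is false. A small Python sketch (the domain of even numbers is an illustrative choice):

```python
def odd(x): return x % 2 == 1
def prime(x): return x > 1 and all(x % d for d in range(2, x))

evens = [2, 4, 6, 8]  # a domain containing no odd numbers at all

# Wrong: true as soon as the antecedent fails for some element
wrong = any((not odd(x)) or prime(x) for x in evens)
# Correct: requires an element that is both odd and prime
correct = any(odd(x) and prime(x) for x in evens)

print(wrong, correct)  # True False
```

This mirrors the slide: "some odd numbers are prime" should certainly be false over {2, 4, 6, 8}, and only the conjunction gives that verdict.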

262 Semantics: nested quantifiers
A sentence can contain more than one quantified variable. If the quantifier is the same for all variables, e.g.:
∀x(∀y(∀z ... α[x, y, z, ...] ...))
then the sentence can be rewritten more concisely as:
∀x, y, z ... α[x, y, z, ...]
For instance, the sentence if a number is greater than another number, then the latter is lower than the former can be written in predicate logic as:
∀x, y GreaterThan(x, y) ⇒ LowerThan(y, x)

263 Semantics: nested quantifiers
If a sentence contains both universally and existentially quantified variables, its meaning depends on the order of quantification. In particular, ∀x(∃y α[x, y]) and ∃y(∀x α[x, y]) are not equivalent, i.e., they are not true under the same models. For instance, ∀x ∃y Loves(x, y) means (i.e., is true under a model in which) everybody loves somebody. Instead, ∃y ∀x Loves(x, y) means there is someone who is loved by everyone.
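The non-equivalence can be demonstrated on a tiny model, representing Loves extensionally as a set of pairs (the three people and the relation below are invented for illustration):

```python
# Loves as a set of (lover, loved) pairs
people = ["Alice", "Bob", "Carol"]
loves = {("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Alice")}

# ∀x ∃y Loves(x, y): everybody loves somebody
forall_exists = all(any((x, y) in loves for y in people) for x in people)
# ∃y ∀x Loves(x, y): somebody is loved by everybody
exists_forall = any(all((x, y) in loves for x in people) for y in people)

print(forall_exists, exists_forall)  # True False
```

This model makes ∀x ∃y Loves(x, y) true (each person loves the next one around the cycle) while ∃y ∀x Loves(x, y) is false, so the two sentences cannot be equivalent.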

264 Semantics: connections between ∀ and ∃
∀ and ∃ are connected with each other through negation. For instance, asserting that every natural number is greater than or equal to one is the same as asserting that there does not exist some natural number which is not greater than or equal to one. In general, since ∀ is a conjunction over all domain objects and ∃ is a disjunction, they obey De Morgan's rules (shown below on the left in the usual form involving two propositional variables):
¬P ∧ ¬Q ≡ ¬(P ∨ Q)        ∀x(¬α[x]) ≡ ¬(∃x α[x])
¬(P ∧ Q) ≡ (¬P) ∨ (¬Q)    ¬(∀x α[x]) ≡ ∃x(¬α[x])
P ∧ Q ≡ ¬(¬P ∨ ¬Q)        ∀x α[x] ≡ ¬(∃x(¬α[x]))
P ∨ Q ≡ ¬(¬P ∧ ¬Q)        ∃x α[x] ≡ ¬(∀x(¬α[x]))
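On a finite domain the four quantifier dualities reduce to Python's all/any over a predicate, so they can be checked directly (the domain and the predicate α are illustrative):

```python
domain = range(10)
def alpha(x): return x % 3 == 0  # an arbitrary property for the check

# ∀x ¬α[x]  ≡  ¬∃x α[x]
assert all(not alpha(x) for x in domain) == (not any(alpha(x) for x in domain))
# ¬∀x α[x]  ≡  ∃x ¬α[x]
assert (not all(alpha(x) for x in domain)) == any(not alpha(x) for x in domain)
# ∀x α[x]  ≡  ¬∃x ¬α[x]
assert all(alpha(x) for x in domain) == (not any(not alpha(x) for x in domain))
# ∃x α[x]  ≡  ¬∀x ¬α[x]
assert any(alpha(x) for x in domain) == (not all(not alpha(x) for x in domain))
print("all four dualities hold")
```

The correspondence is exact because all is a finite conjunction and any a finite disjunction, just as ∀ and ∃ are (possibly infinite) conjunctions and disjunctions over the domain.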

265 Exercises
Represent the following propositions using sentences in predicate logic (including the definition of the domain):
1. All men are mortal; Socrates is a man; Socrates is mortal
2. All rooms neighboring a pit are breezy (wumpus game)
3. Peano-Russell's axioms of arithmetic, which define the natural numbers (non-negative integers):
P1 zero is a natural number
P2 the successor of any natural number is a natural number
P3 zero is not the successor of any natural number
P4 no two distinct natural numbers have the same successor
P5 any property which belongs to zero, and to the successor of every natural number which has the property, belongs to all natural numbers

266 Exercises
4. Represent the following propositions using sentences in predicate logic, assuming that the goal is to prove that West is a criminal (using suitable inference algorithms, see below). The law says that it is a crime for an American to sell weapons to hostile countries. The country Nono, an enemy of America, has some missiles, and all of its missiles were sold to it by Colonel West, who is American.
Note that in a knowledge-based system the first proposition above encodes the general knowledge about the problem at hand (rule memory, analogously to the rules of chess and of the wumpus game), whereas the second proposition encodes a specific problem instance (working memory, analogously to a specific chess or wumpus game).

267 Solution of exercise 1
Model and symbols:
- domain: any set including all men
- constant symbols: Socrates
- predicate symbols: Man and Mortal, unary predicates; e.g., Man(Socrates) means that Socrates is a man
The sentences are:
∀x Man(x) ⇒ Mortal(x)
Man(Socrates)
Mortal(Socrates)

268 Solution of exercise 2
A possible choice of model and symbols:
- domain: row and column coordinates
- constant symbols: 1, 2, 3, 4
- predicate symbols:
  Pit, binary predicate; e.g., Pit(1, 2) means that there is a pit in room (1,2)
  Adjacent, predicate with four terms; e.g., Adjacent(1, 1, 1, 2) means that room (1,1) is adjacent to room (1,2)
  Breezy, binary predicate; e.g., Breezy(2, 2) means that there is a breeze in room (2,2)

269 Solution of exercise 2
One possible sentence is the following:
∀x, y (Breezy(x, y) ⇔ (∃p, q Adjacent(x, y, p, q) ∧ Pit(p, q)))
Note that the sentence above, being a biconditional, also expresses the fact that rooms with no adjacent pits are not breezy. Another possible sentence:
∀x, y (Pit(x, y) ⇒ (∀p, q Adjacent(x, y, p, q) ⇒ Breezy(p, q)))
In this case there is no logical equivalence: if all the rooms adjacent to a given one are breezy, the latter does not necessarily contain a pit.

270 Solution of exercise 3
A possible choice of model and symbols:
- domain: any set including all natural numbers (e.g., the set of real numbers)
- constant symbols: Z, denoting the number zero
- predicate symbols:
  N, unary predicate denoting the fact of being a natural number; e.g., N(Z) means that zero is a natural number
  Eq, binary predicate denoting equality; e.g., Eq(Z, Z) means that zero equals zero
  P, denoting any given property
- function symbols: S, mapping a natural number to its successor; e.g., S(Z) denotes one, S(S(Z)) denotes two

271 Solution of exercise 3
P1 N(Z)
P2 ∀x N(x) ⇒ N(S(x))
P3 ¬(∃x Eq(Z, S(x)))
P4 ∀x, y Eq(S(x), S(y)) ⇒ Eq(x, y)
P5 (P(Z) ∧ ∀x((N(x) ∧ P(x)) ⇒ P(S(x)))) ⇒ (∀x (N(x) ⇒ P(x)))

272 Solution of exercise 4
A possible choice of model and symbols:
- domain: a set including different individuals (among which Colonel West), nations (among which America and Nono), and missiles
- constant symbols: West, America and Nono
- predicate symbols:
  Country(·), American(·), Missile(·), Weapon(·), Hostile(·) (respectively: being a country, an American citizen, a missile, a weapon, hostile)
  Enemy(<who>, <to whom>) (being enemies)
  Owns(<who>, <what>) (owning something)
  Sells(<who>, <what>, <to whom>) (selling something to someone)
- no function symbols are necessary

273 Solution of exercise 4
The law says that it is a crime for an American to sell weapons to hostile nations:
∀x, y, z (American(x) ∧ Country(y) ∧ Hostile(y) ∧ Weapon(z) ∧ Sells(x, z, y)) ⇒ Criminal(x)
The second proposition can be conveniently split into simpler ones:
- Nono is a country...: Country(Nono)
- ...Nono is an enemy of America (which is also a country)...: Enemy(Nono, America) ∧ Country(America)
- ...Nono has some missiles...: ∃x Missile(x) ∧ Owns(Nono, x)
- ...all Nono's missiles were sold to it by Colonel West: ∀x (Missile(x) ∧ Owns(Nono, x)) ⇒ Sells(West, x, Nono)

274 Solution of exercise 4
A human would intuitively say that the above propositions in natural language imply that West is a criminal. However, it is not difficult to see that the above sentences in predicate logic are not sufficient to prove this. The reason is that humans exploit background (or common-sense) knowledge that is not explicitly stated in the above propositions. In particular, there are two missing links:
- an enemy nation is hostile
- a missile is a weapon
To use such additional knowledge, it must be explicitly represented by sentences in predicate logic:
∀x (Country(x) ∧ Enemy(x, America)) ⇒ Hostile(x)
∀x Missile(x) ⇒ Weapon(x)

275 Knowledge engineering
Knowledge engineering is the process of constructing the KB. It consists of investigating a specific domain, identifying the relevant concepts (knowledge acquisition), and formally representing them. This requires interaction between:
- a domain expert (DE)
- a knowledge engineer (KE), who is expert in knowledge representation and inference, but usually not in the domain of interest
A possible approach, suitable for special-purpose KBs (in predicate logic), is the following.

276 Knowledge engineering
1. Identify the task: what range of queries will the KB support? What kind of facts will be available for each problem instance?
2. Knowledge acquisition: eliciting from the domain expert the general knowledge about the domain (e.g., the rules of chess)
3. Choice of a vocabulary: what concepts have to be represented as objects, predicates, functions? The result is the domain's ontology, which affects the complexity of the representation and the inferences that can be made. E.g., in the wumpus game pits can be represented either as objects or as unary predicates on squares.

277 Knowledge engineering
4. Encoding the domain's general knowledge acquired in step 2 (this may require revising the vocabulary of step 3)
5. Encoding a specific problem instance (e.g., a specific chess game)
6. Posing queries to the inference procedure and getting answers
7. Debugging the KB, based on the results of step 6

278 Inference in predicate logic
Inference algorithms are more complex than in propositional logic, due to quantifiers and functions. Basic tools: two inference rules for sentences with quantifiers (Universal and Existential Instantiation), which derive sentences without quantifiers. This reduces first-order inference to propositional inference, with complete but semidecidable inference procedures:
- algorithms exist that find a proof KB ⊢ α in a finite number of steps for every entailed sentence KB ⊨ α
- no algorithm is capable of recognizing, in a finite number of steps, every non-entailed sentence KB ⊭ α
Therefore, since one does not know that a sentence is entailed until the proof is done, while a proof procedure is running one does not know whether it is about to find a proof or whether it will never find one.

279 Inference in predicate logic
Modus Ponens can be generalized to predicate logic, leading to the first-order versions of the FC and BC algorithms, which are complete and decidable, but limited to Horn clauses. The resolution rule can also be generalized to predicate logic, leading to the first-order version of the complete but semidecidable resolution algorithm.

280 Inference rules for quantifiers
Let θ denote a substitution list {v1/t1, ..., vn/tn}, where:
- v1, ..., vn are variable names
- t1, ..., tn are terms (either constant symbols, variables, or functions recursively applied to terms)
and let α be any sentence in which one or more variables appear. Let Subst(θ, α) denote the sentence obtained by applying the substitution θ to the sentence α. An example:
Subst({y/One}, ∀x, y Eq(S(x), S(y)) ⇒ Eq(x, y))
produces
∀x Eq(S(x), S(One)) ⇒ Eq(x, One)

281 Inference rules for quantifiers
Universal Instantiation:
∀v α
―――――――――――
Subst({v/t}, α)
where t can be any term without variables. In other words, since a sentence ∀x α[x] states that α is true for every domain element in place of x, one can derive that α is true for any given element t. An example: from ∀x N(x) ⇒ N(S(x)) one can derive:
N(Z) ⇒ N(S(Z)), for θ = {x/Z}
N(S(S(Z))) ⇒ N(S(S(S(Z)))), for θ = {x/S(S(Z))}
and so on.

282 Inference rules for quantifiers
Existential Instantiation:
∃v α
―――――――――――
Subst({v/t}, α)
where t must be a constant symbol that does not appear elsewhere in the KB. A sentence ∃v α[v] states that there is some object satisfying a condition. The above rule just gives a name to one such object, but that name must not belong to another object, because we do not know which objects satisfy the condition. For instance, from ∃x Missile(x) ∧ Owns(Nono, x) one can derive Missile(M) ∧ Owns(Nono, M), provided that M has not already been used in other sentences; one cannot derive, instead, Missile(West) ∧ Owns(Nono, West).

283 Inference rules for quantifiers
A more general form of Existential Instantiation must be applied when an existential quantifier appears in the scope of a universal quantifier:
∀x, ... ∃y, ... α[x, ..., y, ...]
For instance, from ∀x ∃y Loves(x, y) (everybody loves somebody) it is not correct to derive ∀x Loves(x, A) (everybody loves A), since the latter sentence means that everybody loves the same person.

284 Inference rules for quantifiers
Instead of a constant symbol, a new function symbol must be introduced, known as a Skolem function, with as many arguments as there are universally quantified variables. Therefore, from:
∀x, ... ∃y, ... α[x, ..., y, ...]
the correct application of Existential Instantiation derives:
∀x, ... α[x, ..., F1(x), ...]
For instance, from ∀x ∃y Loves(x, y) one can correctly derive ∀x Loves(x, F(x)), where F maps any individual x to someone loved by x.

285 Inference algorithms and quantifiers
First-order inference algorithms usually apply Existential Instantiation as a pre-processing step: every existentially quantified sentence is replaced by a single instantiated sentence. It can be proven that the resulting KB is inferentially equivalent to the original one, i.e., it is satisfiable exactly when the original one is. Accordingly, the resulting KB contains only sentences without variables, and sentences where all the variables are universally quantified. Another useful pre-processing step is renaming all the variables in the KB to avoid name clashes between variables used in different sentences. For instance, the variables in ∀x P(x) and ∃x Q(x) are not related to each other, and renaming either of them (say, ∃y Q(y)) produces an equivalent sentence.

286 Unification
Another widely used tool in first-order inference algorithms is unification: the process of finding a substitution (if any) that makes two sentences (where at least one contains variables) identical. For instance, ∀x, y Knows(x, y) and ∀z Knows(John, z) can be unified by different substitutions. Assuming that Bill is one of the constant symbols, two possible unifiers are:
{x/John, y/Bill, z/Bill}
{x/John, y/z}
Among all possible unifiers, the one of interest for first-order inference algorithms is the most general unifier, i.e., the one that places the fewest restrictions on the values of the variables. The only constraint is that every occurrence of a given variable must be replaced by the same term. In the above example, the most general unifier is {x/John, y/z}, as it does not restrict the values of y and z.
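The computation of a most general unifier can be sketched in a few lines of Python. This is a minimal, illustrative implementation, not the one used by real provers: variables are lowercase strings, compound terms (atoms and functions) are tuples, and the occurs check is omitted for brevity.

```python
def is_var(t):
    """A variable is a lowercase string; constants start with an uppercase letter."""
    return isinstance(t, str) and t[0].islower()

def substitute(theta, t):
    """Apply substitution theta to a term, following variable chains."""
    if is_var(t):
        return substitute(theta, theta[t]) if t in theta else t
    if isinstance(t, tuple):  # compound term: (functor, arg1, ..., argn)
        return (t[0],) + tuple(substitute(theta, a) for a in t[1:])
    return t

def unify(s, t, theta=None):
    """Return a most general unifier of s and t, or None if none exists.
    Note: no occurs check, so e.g. unify('x', ('F', 'x')) would loop on use."""
    if theta is None:
        theta = {}
    s, t = substitute(theta, s), substitute(theta, t)
    if s == t:
        return theta
    if is_var(s):
        theta[s] = t
        return theta
    if is_var(t):
        theta[t] = s
        return theta
    if isinstance(s, tuple) and isinstance(t, tuple) and len(s) == len(t) and s[0] == t[0]:
        for a, b in zip(s[1:], t[1:]):
            theta = unify(a, b, theta)
            if theta is None:
                return None
        return theta
    return None

# Knows(x, y) vs Knows(John, z): the MGU leaves y and z unrestricted
print(unify(("Knows", "x", "y"), ("Knows", "John", "z")))        # {'x': 'John', 'y': 'z'}
# Knows(John, x) vs Knows(z, Mother(z)): function terms are handled too
print(unify(("Knows", "John", "x"), ("Knows", "z", ("Mother", "z"))))
# Different predicate symbols: no unifier
print(unify(("Knows", "x", "y"), ("Loves", "John", "z")))        # None
```

The algorithm always binds a variable to whatever it must equal and never introduces gratuitous bindings, which is exactly why the result is the *most general* unifier {x/John, y/z} rather than, say, {x/John, y/Bill, z/Bill}.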

287 Unification: an example
Consider the sentence ∀x Knows(John, x) (John knows everyone). Assume that the KB also contains the following sentences (note that different variable names are used in different sentences):
1. Knows(John, Jane)
2. ∀y Knows(y, Bill)
3. ∀z Knows(z, Mother(z))
4. Knows(Elizabeth, Bill)
The most general unifier with Knows(John, x) is:
1. {x/Jane}
2. {y/John, x/Bill}
3. {z/John, x/Mother(John)}
4. no unifier exists, as the constant symbols John and Elizabeth in the first argument are different
(Note that ∀x Knows(John, x) also implies that Knows(John, John) is true, i.e., John knows himself.)

288 First-order inference: an example
Consider a domain made up of two individuals denoted with the constant symbols John and Richard, and the following KB:
∀x King(x) ∧ Greedy(x) ⇒ Evil(x) (1)
∀y Greedy(y) (2)
King(John) (3)
Brother(Richard, John) (4)
Intuitively, this KB entails Evil(John), i.e., KB ⊨ Evil(John). The corresponding inference KB ⊢ Evil(John) can be obtained by using the above inference rules, as shown in the following.

289 First-order inference: an example
Applying Universal Instantiation to (1) produces:
(5) King(John) ∧ Greedy(John) ⇒ Evil(John), with {x/John}
(6) King(Richard) ∧ Greedy(Richard) ⇒ Evil(Richard), with {x/Richard}
Applying Universal Instantiation to (2) produces:
(7) Greedy(John), with {y/John}
(8) Greedy(Richard), with {y/Richard}
Applying And-Introduction to (3) and (7) produces:
(9) King(John) ∧ Greedy(John)
Applying Modus Ponens to (5) and (9) produces:
(10) Evil(John)

290 Generalized Modus Ponens
All but the last inference steps in the above example can be seen as pre-processing steps whose aim is to prepare the application of Modus Ponens. Moreover, some of these steps (Universal Instantiation using the symbol Richard) are clearly useless for deriving the consequent of implication (1), i.e., Evil(John). Indeed, the above steps can be combined into a single first-order inference rule, Generalized Modus Ponens (GMP): given atomic sentences (non-negated predicates) pi, pi′, i = 1, ..., n, and q, and a substitution θ such that Subst(θ, pi) = Subst(θ, pi′) for all i:
(p1 ∧ p2 ∧ ... ∧ pn ⇒ q), p1′, p2′, ..., pn′
――――――――――――――――――――――――
Subst(θ, q)

291 Generalized Modus Ponens
In the previous example, GMP allows Evil(John) to be derived in a single step, and avoids unnecessary applications of inference rules like Universal Instantiation to sentences (1) and (2) with {x/Richard} or {y/Richard}. In particular, GMP can be applied to sentences (1), (2) and (3), with θ = {x/John, y/John}: this immediately derives Evil(John).

292 Horn clauses in predicate logic
GMP allows the forward chaining (FC) and backward chaining (BC) inference algorithms to be generalized to predicate logic. This in turn requires generalizing the concept of Horn clause. A Horn clause in predicate logic is an implication α ⇒ β in which:
- α is a conjunction of non-negated predicates
- β is a single non-negated predicate
- all variables (if any) are universally quantified, and the quantifiers appear at the beginning of the sentence
An example: ∀x (P(x) ∧ Q(x)) ⇒ R(x). Single (possibly negated) predicates are also Horn clauses:
P(t1, ..., tn) ≡ (True ⇒ P(t1, ..., tn))
¬P(t1, ..., tn) ≡ (P(t1, ..., tn) ⇒ False)

293 Forward chaining in predicate logic
Similarly to propositional logic, FC consists of repeatedly applying GMP in all possible ways, adding to the initial KB all newly derived atomic sentences, until no new sentence can be derived. FC is normally triggered by the addition of new sentences into the KB, to derive all their consequences. For instance, it can be used in the wumpus game when new percepts are added to the KB after each agent's move.

294 Forward chaining in predicate logic
A simple (but inefficient) implementation of FC:

function Forward-Chaining(KB)
  local variable: new
  repeat
    new ← { } (the empty set)
    for each sentence s = (p1 ∧ ... ∧ pn ⇒ q) in KB do
      for each θ such that Subst(θ, p1 ∧ ... ∧ pn) = Subst(θ, p1′ ∧ ... ∧ pn′) for some p1′, ..., pn′ in KB do
        q′ ← Subst(θ, q)
        if q′ is not in KB and not in new then add q′ to new
    add new to KB
  until new is empty
  return KB
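The pseudocode above can be sketched as a runnable Python program, with several simplifications that are ours rather than the slides': atoms are flat tuples (predicate, arg1, ..., argn) with no function symbols, variables are lowercase strings, and the Skolem constant is written M. It derives Criminal(West) from the Colonel West KB (Country omitted, as in the next slide):

```python
def is_var(t):
    return isinstance(t, str) and t[0].islower()

def subst(theta, atom):
    """Apply a substitution to a flat atom (predicate, arg1, ..., argn)."""
    return tuple(theta.get(a, a) for a in atom)

def match(pattern, fact, theta):
    """Extend theta so that pattern matches the ground fact, or return None."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    theta = dict(theta)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if theta.get(p, f) != f:  # same variable must get the same value
                return None
            theta[p] = f
        elif p != f:
            return None
    return theta

def forward_chain(facts, rules):
    """Repeatedly apply GMP in all possible ways until nothing new is derived."""
    kb = set(facts)
    while True:
        new = set()
        for premises, conclusion in rules:
            thetas = [{}]
            for prem in premises:  # find every theta satisfying all premises
                thetas = [t2 for t in thetas for fact in kb
                          if (t2 := match(prem, fact, t)) is not None]
            for theta in thetas:
                q = subst(theta, conclusion)
                if q not in kb:
                    new.add(q)
        if not new:
            return kb
        kb |= new

rules = [
    ([("American", "x"), ("Hostile", "y"), ("Weapon", "z"), ("Sells", "x", "z", "y")],
     ("Criminal", "x")),
    ([("Missile", "x"), ("Owns", "Nono", "x")], ("Sells", "West", "x", "Nono")),
    ([("Enemy", "x", "America")], ("Hostile", "x")),
    ([("Missile", "x")], ("Weapon", "x")),
]
facts = [("American", "West"), ("Enemy", "Nono", "America"),
         ("Owns", "Nono", "M"), ("Missile", "M")]

print(("Criminal", "West") in forward_chain(facts, rules))  # True
```

Running it reproduces the trace of the next slide: the first iteration derives Sells(West, M, Nono), Hostile(Nono) and Weapon(M); the second derives Criminal(West); the third derives nothing new and the loop stops.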

295 Forward chaining: an example
The sentences in the exercise about Colonel West can be written as Horn clauses, after applying Existential Instantiation and then And-Elimination to ∃x Missile(x) ∧ Owns(Nono, x) (the predicate Country is omitted for the sake of simplicity; the universal quantifiers are not shown, to keep the notation uncluttered):
(1) (American(x) ∧ Hostile(y) ∧ Weapon(z) ∧ Sells(x, z, y)) ⇒ Criminal(x)
(2) (Missile(x) ∧ Owns(Nono, x)) ⇒ Sells(West, x, Nono)
(3) Enemy(x, America) ⇒ Hostile(x)
(4) Missile(x) ⇒ Weapon(x)
(5) American(West)
(6) Enemy(Nono, America)
(7) Owns(Nono, M)
(8) Missile(M)

296 Forward chaining: an example
The FC algorithm carries out two iterations of the repeat-until loop on the above KB; no new sentences can be derived after the second one.
First iteration:
- GMP to (2), (7) and (8), with {x/M}: (9) Sells(West, M, Nono)
- GMP to (3) and (6), with {x/Nono}: (10) Hostile(Nono)
- GMP to (4) and (8), with {x/M}: (11) Weapon(M)
Second iteration:
- GMP to (1), (5), (10), (11) and (9), with {x/West, y/Nono, z/M}: (12) Criminal(West)

297 Backward chaining in predicate logic
The first-order BC algorithm works similarly to its propositional version: it starts from a sentence (query) to be proven and recursively applies GMP backward. Note that every substitution made to unify an atomic sentence with the consequent of an implication must be propagated back to every antecedent. If the consequent of an implication unifies with more than one atomic sentence, at least one unification must allow the consequent to be proven. For a possible implementation of BC, see the course textbook.

298 Backward chaining: an example
A proof by BC can be represented as an And-Or graph, as in propositional logic. The proof of the query Criminal(West), using the previous sentences (1)-(8) as the KB, corresponds to the following tree (cf. Figure 9.7 of the course textbook), to be read depth first, left to right; each leaf shows the bindings of its successful unification:
Criminal(West)
- American(West), { }
- Weapon(y): proven via Missile(y), {y/M}
- Sells(West, M, z), {z/Nono}: proven via Missile(M), { } and Owns(Nono, M), { }
- Hostile(Nono): proven via Enemy(Nono, America), { }
To prove Criminal(West), the four conjuncts below it must all be proven: some are in the knowledge base, others require further backward chaining.

299 Backward chaining: an example
If the predicate Country is used, sentence (1) becomes:
∀x, y, z (American(x) ∧ Country(y) ∧ Hostile(y) ∧ Weapon(z) ∧ Sells(x, z, y)) ⇒ Criminal(x)
The sentences Country(America) and Country(Nono) must also be added to the KB. In this case the additional conjunct Country(y) appears in the And link below Criminal(West). Two sentences in the KB unify with Country(y): Country(America) and Country(Nono). If the unification with Country(America) is attempted first, the conjunct Hostile(America) cannot be proven, and the proof fails. In such a case a backtracking step can be applied, i.e., one of the other possible unifications can be attempted. Here, the unification with Country(Nono) allows the proof to be completed.

300 The resolution algorithm
Completeness theorem for predicate logic (Kurt Gödel, 1930): for every first-order sentence α entailed by a given KB (KB ⊨ α) there exists some inference algorithm that derives α (KB ⊢ α) in a finite number of steps. The opposite does not hold: predicate logic is semidecidable. A complete inference algorithm for predicate logic is resolution (1965), based on:
- converting sentences into Conjunctive Normal Form
- the resolution inference rule
- proof by contradiction: to prove KB ⊨ α, prove that KB ∧ ¬α is unsatisfiable (contradictory)
- refutation-completeness: if KB ∧ ¬α is unsatisfiable, then resolution derives a contradiction in a finite number of steps

301 Applications of forward chaining
Encoding condition-action rules to recommend actions, based on a data-driven approach:
- production systems (production: condition-action rule)
- expert systems

302 Applications of backward chaining
Logic programming languages: Prolog
- rapid prototyping
- symbol processing: compilers, natural language parsers
- developing expert systems
Example of a Prolog clause (note that Prolog variables start with an uppercase letter):
criminal(X) :- american(X), weapon(Y), sells(X,Y,Z), hostile(Z).
Running a program = proving a sentence (query) by BC, e.g.:
?- criminal(west). produces Yes
?- criminal(A). produces A = west, Yes

303 Applications of the resolution algorithm
Main application: theorem provers, used for assisting (not replacing) mathematicians:
- proof checking
- verification and synthesis of hardware and software
- hardware design (e.g., entire CPUs)
- programming languages (syntax)
- software engineering (verifying program specifications, e.g., of the RSA public key encryption algorithm)

304 Beyond classical logic
Classical logic is based on two principles:
- bivalence: there exist only two truth values, true and false
- determinateness: each proposition has exactly one truth value
But how should one deal with propositions like the following?
- Tomorrow will be a sunny day: is this true or false, today?
- John is tall: is this completely true (or false)? This kind of problem is addressed by fuzzy logic.
- Goldbach's conjecture: every even number greater than two is the sum of a pair of prime numbers. Can we say this is either true or false, even if no proof has been found yet?

305 Expert systems
One of the main applications of knowledge-based systems:
- encoding human experts' problem-solving knowledge in specific application domains for which no algorithmic solution exists (e.g., medical diagnosis)
- commonly used as decision support systems
- problem-independent architecture for knowledge representation and reasoning
- knowledge representation: IF...THEN... rules

306 Expert systems: historical notes
- Main motivation: limitations of the general problem-solving approaches pursued in AI until the 1960s
- First expert systems: 1970s
- Widespread use in the 1980s: many commercial applications
- Used in niche/focused domains since the 1990s

307 Main current applications of expert systems Medical diagnosis an example: UK NHS Direct symptom checker (now closed) Geology, botany (e.g.: rock and plant classification) Help desk Finance Military strategies Software engineering (e.g., design patterns) 307

308 Expert system architecture [Figure: main components. User Interface, connecting the User to the system; Knowledge Base, split into Facts (Working memory) and Rules (Rule memory); Inference Engine; Explanation Module. The Knowledge Engineer and the Domain Expert build the Knowledge Base.] 308

309 Designing the Knowledge Base of expert systems Two main, distinct roles: knowledge engineer domain expert Main issues: defining suitable data structures for representing facts (problem instances) in working memory suitably eliciting experts' knowledge (general knowledge) and encoding it as IF...THEN... rules (rule memory) 309

310 How expert systems work The inference engine implements a forward chaining-like algorithm, triggered by the addition of new facts in the working memory: while there is some active rule do select one active rule (using conflict resolution strategies) execute the actions of the selected rule Three kinds of actions exist: modifying one fact in the working memory adding one fact to the working memory removing one fact from the working memory 310
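The inference loop above can be sketched as a toy production system in Python. The rule encoding and the "first active rule" conflict-resolution strategy are illustrative assumptions, not CLIPS syntax:

```python
# Toy production system: working memory is a set of facts; each rule is
# (name, condition over the facts, action returning the new fact set).
facts = {"light-is-red"}

rules = [
    ("stop-rule", lambda f: "light-is-red" in f,
                  lambda f: f | {"action-stop"}),
    ("go-rule",   lambda f: "light-is-green" in f,
                  lambda f: f | {"action-go"}),
]

fired = set()
while True:
    # a rule is active if its condition holds and it has not fired yet
    active = [(n, c, a) for n, c, a in rules if c(facts) and n not in fired]
    if not active:
        break
    name, cond, act = active[0]  # conflict resolution: first active rule
    facts = act(facts)           # execute the action (here: add one fact)
    fired.add(name)

print(sorted(facts))  # ['action-stop', 'light-is-red']
```

Real shells such as CLIPS also support modifying and removing facts, and richer conflict-resolution strategies (salience, recency).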

311 Expert system shells: CLIPS C Language Integrated Production System Developed since 1984 by the Johnson Space Center, NASA Now maintained independently from NASA as public domain, open source, free software Currently used by government, industry, and academia Main features: interpreted, functional language (object-oriented extension) specific data structures and instructions/functions for expert system implementation interfaces to other languages (C, Python, etc.) 311

312 Other expert system shells Some free shells: Jess (Java platform) Drools (Business Rules Management System) Several commercial shells are also available. 312

313 The Lisp language 313

314 Programming paradigms Many high-level programming languages have been devised since the 1950s, each one for more or less specific purposes: "number crunching": FORTRAN (1957, first high-level language) artificial intelligence (symbol manipulation): Lisp (1958) business: COBOL (1959) system programming: C (1972) databases: SQL (early 1970s)... One of the main distinguishing features: programming paradigms procedural (imperative) declarative 314

315 Procedural languages Main characteristics: Von Neumann s computing paradigm program: a sequence of instructions expressing actions to be executed on data Examples: machine language, Assembly FORTRAN, Pascal, C, Python, etc. object-oriented languages: SmallTalk, C++, Java, etc. 315

316 Declarative languages Main characteristics: higher abstraction level than procedural languages programs: recursive expressions, representing what to obtain, not how Examples: logical languages: SQL, Prolog functional languages: Lisp, Erlang, Haskell, OCaml, Scheme 316

317 Characteristics of procedural languages Computational model: explicit manipulation of memory cells (variables), which causes side effects control structures: sequential, conditional, iterative execution Pros: most programmers are used to procedural thinking program structure is close to executor structure (hardware): efficient translation and execution Cons: side effects are difficult to manage understanding programs and verifying correctness is difficult 317

318 Drawbacks of procedural languages: examples 1. Are the following C statements equivalent? w = x + f(y); w = f(y) + x; 2. What is the operation carried out by the following C procedure? void mystery (int n) { int i, j; bool b; for (i = 2; i <= n; i++) { j = 2; b = true; while (b == true && j <= i/2) if (i % j != 0) j++; else b = false; if (b == true) printf ("%d ", i); } } 3. The previous C procedure prints all prime numbers from 2 to its argument: is that true? 318

319 Drawbacks of procedural languages: examples 1. Not necessarily: x can be a global variable modified by f. This is an example of side effect: the outcome of a statement depends on its context. 2. Difficult to understand, due to nested loops and shared variables. 3. It is not easy to verify that a program behaves as it is supposed (or claimed) to, for the same reasons above. 319

320 Characteristics of (ideal) functional languages Program structure: recursive expressions (including function calls) which are evaluated to produce a value mathematics-like style no variables recursion takes the place of iteration Pros: absence of variables no side effects referential transparency: the value of expressions is context-independent (like mathematical expressions) it is easier to understand and verify program correctness Cons: thinking recursively may be non-intuitive recursion requires higher processing cost (memory and time) 320

321 Functional-style programming: an example A C function for computing the factorial of a natural number, written in procedural style: int factorial (int n) { int i, fact = 1; for (i=2; i <= n; i++) fact = fact * i; return fact; } The same function written in functional style (no explicit alteration of variables by assignment, recursion used instead of iteration): int factorial (int n) { if (n == 0) return 1; else return n * factorial (n - 1); } 321

322 Characteristics of real programming languages The only pure procedural language: machine language / Assembly. All the other existing languages combine to various degrees procedural and functional features for taking advantage of both. A hypothetical procedural-to-functional line: machine language/Assembly, FORTRAN, C, Python, Lisp, "pure" Lisp (from pure imperative languages to pure functional languages) 322

323 Functional languages: some industrial applications Lisp: expert systems, Apple Macintosh, simulation, astronomy, embedded languages (AutoLisp), rapid prototyping Erlang: Ericsson, T-Mobile, Facebook, EDF OCaml: financial analysis, industrial robot programming, embedded software analysis Haskell: aerospace systems, hardware design, Web programming 323

324 A short history of Lisp Lisp is a functional, interpreted language invented by John McCarthy, one of AI's fathers (Dartmouth Workshop) 1958: key ideas mathematical neatness in a practical programming language computing with symbolic expressions rather than numbers main data structure: list (e.g., symbolic expressions) hence the name: List processing main list operations expressed as a few built-in functions recursion (function composition) and conditional expression to form more complex functions representing Lisp programs as Lisp data first implementations and application to AI problems, then spreading into a variety of computers and dialects 1970s: main dialects: MacLisp, InterLisp, Scheme 1970s: Lisp Machines, minicomputers designed to run Lisp 1984: the Common Lisp dialect is defined; it later becomes the ANSI standard (1994) 324

325 Summary Lisp programs, expressions, and main data types Expressions: atoms (numbers and symbols) and lists (function calls, special expressions) Main built-in functions mathematics: +, -, *, /, sqrt, exp, cos,... list processing: cons, first, rest,... predicates: equal, atom, listp, null Main special expressions: quote; and, or, not; cond User-defined Lisp functions: the special expression defun Procedural features the special expression setq (binding symbols to values) I/O functions: format, read user-defined data structures: the special expressions defstruct and setf 325

326 Lisp programs and expressions Program: set of recursive calls to expressions (functional style). Expressions are evaluated by the interpreter, and return a value. Lisp expressions can be written: in the Listener window (the interactive interpreter): they are immediately evaluated and their value printed to screen in the editor window: they will be evaluated on user's request Two kinds of expressions: atoms: either numbers or symbols lists: ordered sequences of (recursively) atoms and lists, enclosed in round brackets 326

327 Expressions: atoms Main atomic expressions: numbers, which evaluate to themselves; e.g.: 12, -5, 2e-5, 3/5 complex numbers: #C(1 -2) (denotes 1 - 2i) symbols: any sequence of characters excluding brackets and spaces, that is not a number; e.g.: blob, a-symbol, +, 2d, b^2-4*a*c Only the following symbol expressions are legal: predefined symbols, which evaluate to themselves, e.g.: t and nil, used to represent the Boolean values True and False (nil also denotes the empty list ()) symbols previously bound to a value (e.g., using setq), which is returned as the result of their evaluation 327

328 Expressions: lists A list is an ordered sequence of zero or more atoms or (nested) lists, enclosed in round brackets, and separated by spaces. Examples: (1 2 3) (a b) (+ 1 2) (1 xyz (w 4) (4 (t g (f)) qwerty)) () (the empty list, equivalent to the atom nil) 328

329 Expressions: lists Legal list expressions: the empty list () function calls: the first element must be a symbol associated to the name of a built-in or user-defined function the other elements (if any) must be expressions (either atoms or lists), whose values are passed to the function as arguments special expressions: each one has an ad hoc evaluation rule the first element must be a symbol associated to the name of a built-in or user-defined special expression every other element (if any) must be an expression (either an atom or a list); either the expression itself or its value is passed as the argument, depending on the special expression at hand 329

330 Main built-in functions Mathematical functions: +, -, *, /, =, sqrt, cos, sin, exp,... Examples: (+ 1 2) (* (- 2 3) 5) (= 3 (+ 2 1)) (sqrt -1) 330

331 Main built-in functions List processing functions are the Lisp core. Three main functions for recursive list processing: (cons <expression> <list>) returns a list obtained by adding the value of <expression> to the front of the value of <list>, which must evaluate to a list; e.g.: (cons 1 nil) returns (1) (cons 1 (cons 2 nil)) returns (1 2) (first <list>) (also named car) returns the first element of a list, e.g.: (first (cons 1 (cons 2 nil))) returns 1 (rest <list>) (also named cdr) returns a list without its first element, e.g.: (rest (cons 1 (cons 2 nil))) returns (2) Any other operation on lists can be obtained by a suitable, recursive combination of cons, first and rest. 331

332 Lisp data types Lisp data are represented as Lisp expressions: this is one of the main distinctive features from other languages. Every Lisp value is either an atom (number or symbol) or a list. Rationale: many AI applications involve abstract data structures which can be represented as symbols and lists of symbols, e.g.: sentences in logical languages: (Prime Two), (GreaterThan Two One), (FatherOf John Mary) symbolic mathematical expressions: (integral (power x two)) for the integral of x^2 trees, e.g.: (A (B ((D) (E))) (C)) [figure: the corresponding tree, with root A] 332

333 Atoms and lists: expressions or data? How to distinguish a Lisp expression (to be evaluated) from Lisp data (not to be evaluated)? An example: how to construct the list (Prime Two)? (cons Prime (cons Two nil)) does not work: since cons is a function, Prime and Two are treated as expressions to be evaluated. 333

334 The special expression quote The special expression quote allows one to distinguish Lisp expressions from Lisp data. (quote <expression>) returns the <expression> itself, not the result of its evaluation. An example: the list (Prime Two) can be constructed by (cons (quote Prime) (cons (quote Two) nil)) or directly by (quote (Prime Two)) Since quote is often used in Lisp programs, the shorthand notation '<expression> has been introduced, e.g.: 'x is equivalent to (quote x) '(Prime Two) is equivalent to (quote (Prime Two)) 334

335 Main built-in functions (equal <e1> <e2>) compares the values of two expressions (atom <expr>) checks whether the value of <expr> is an atom (this is true also for the empty list ()) (listp <expr>) checks whether the value of <expr> is a list (this is true also for the symbol nil) (null <expr>) checks whether the value of <expr> is the empty list (this is true also for the symbol nil) (second <expr>), (third <expr>),... : returns the second, third,... element of a list (nth <n> <list>) returns the n-th element of a list (n = 0 denotes the first element) (list <expr1> <expr2>...) returns a list made up of the values of <expr1>, <expr2>,... (length <list>) evaluates the length of a list (append <list1> <list2>...) concatenates lists 335

336 Main built-in special expressions (and <e1> <e2>...) checks whether none of the arguments evaluates to nil; if so, it returns the value of the last expression, otherwise it returns nil (or <e1> <e2>...) checks whether at least one of the arguments does not evaluate to nil; if so, it returns the value of the first such expression, otherwise it returns nil (not <expr>) checks whether the value of <expr> is different from nil 336

337 The cond special expression This expression implements a set of nested conditional expressions: if <test1> then evaluate <expr1> else if <test2> then evaluate <expr2> else if <test3> then evaluate <expr3>... Syntax: (cond (<test1> <expr1>) (<test2> <expr2>) (<test3> <expr3>)... ) 337

338 The cond special expression Semantics: <test1>, <test2>,... are considered as conditional expressions: their logical value is false if they evaluate to nil, true otherwise <expr1>, <expr2>,... are any Lisp expressions the interpreter evaluates in sequence <test1>, <test2>,..., until it finds one, say <testk>, whose value is not nil; then the corresponding <exprk> is evaluated and its value is returned as the value of the whole cond expression if all the conditions <test1>, <test2>,... evaluate to nil, the cond expression returns nil the expression (symbol) t can be used as the last <test> expression, to implement a default condition 338

339 The cond special expression Examples: (cond ((> 2 1) t)) checks whether 2 is greater than 1 (cond ((= 1 (- 2 1)) (list 'a 'b)) ((equal 'x (first '(x y))) 0) (t 'z)) if 2 - 1 = 1, return the list (a b); otherwise, if the first element of the list (x y) is the symbol x, return 0; otherwise return the symbol z 339

340 Defining new Lisp functions Built-in functions and special expressions (including cond and quote) allow one to define new Lisp functions. This requires the special expression defun. Syntax: (defun <function-name> (<arg1> <arg2>...) <expression>) (cont.) 340

341 Defining new Lisp functions (cont.) <function-name> must be a symbol (not a predefined one like t or nil) <arg1>, <arg2>,... must be symbols that will be bound to the arguments of the function when it is called, i.e., the arguments become the values of the corresponding symbols <expression> is the function body: it can be any Lisp expression, which can use the symbols <arg1>, <arg2>,... as expressions to be evaluated the value returned by the function call is defined as the value of <expression> Note that in a pure functional language the function body is made up of a single expression: since no variables can be used, the value of any expression but the last one would be lost. 341

342 Example In functional languages, functions are intrinsically recursive: the expression in their body can contain recursive calls to special expressions or other functions (including itself). When defining recursive functions (that call themselves), care must be taken to include a terminal condition to stop recursion. An example: the factorial function: (defun factorial (n) (cond ((= n 0) 1) (t (* n (factorial (- n 1)))))) 342

343 Other Lisp functions Evaluating Lisp expressions: the built-in function eval. Examples: (eval 1) returns 1 (eval '(+ 1 2)) returns 3 after (setq x '(+ 1 2)): (eval 'x) returns (+ 1 2), (eval x) returns 3 Applying functions to arguments using their symbol names (useful, e.g., to pass functions as arguments of other functions): the built-in functions apply and funcall. Examples: (defun my-function (f x y) (funcall f x y)) (my-function '+ 1 2) returns 3 (my-function 'list 1 2) returns (1 2) 343

344 Structure of Lisp programs Lisp programs are usually made up of several function definitions. A program is invoked by calling one of its functions (analogous to the main function of C programs). 344

345 Procedural features in Lisp: binding values to symbols For the sake of efficiency, and to ease the definition of certain programs, procedural features have been introduced in Lisp. Variables are implemented in Lisp by binding symbols to values (note that this happens implicitly for function arguments). The special expression setq can be used to this aim. Syntax: (setq <symbol> <expression>) <symbol> is the desired symbol (it is not evaluated), which is bound to the value of <expression> subsequent evaluations of <symbol> produce the value of <expression> setq returns the value of <expression> 345

346 Procedural features in Lisp: binding values to symbols Examples: (setq x 1) binds the symbol x to the value of the expression 1, which is again 1: from now on, the expression x will evaluate to 1 (setq y '(a b c)) binds the symbol y to the value of the expression '(a b c), which is the list (a b c): from now on, the expression y will evaluate to (a b c) 346

347 Procedural features in Lisp: local variables Local variables can be used in function definitions, by enclosing the function body inside the special expression let (it can be used only inside function definitions). This also allows the function body to be a sequence of expressions. Syntax: (defun <function-name> (<arg1> <arg2>...) (let (<var1> <var2>...) <expression1> <expression2>... )) <var1>, <var2>,... must be symbols: they will be bound to values (through setq), and can be used by <expression1>,... there is no conflict with symbols having the same names outside the function (if any) the value returned by let, and thus by the function, is the one of the last expression inside let 347

348 Procedural features in Lisp: local variables An example: a function that checks whether a list has zero, one or more elements, and returns the symbols zero, one and many, respectively, by using length. Without local variables, length must be called twice: (defun how-many-elements (l) (cond ((= (length l) 0) 'zero) ((= (length l) 1) 'one) (t 'many))) Using a local variable, length is called only once: (defun how-many-elements (l) (let (n) (setq n (length l)) (cond ((= n 0) 'zero) ((= n 1) 'one) (t 'many)))) 348

349 Other Lisp features Other built-in, atomic data types (self-evaluating symbols): characters, e.g.: #\A strings, e.g.: "Lisp"... User-defined data types: record-structures (analogous to structs in the C language) made up of slots (fields), defined by the special expression defstruct (every instance is an atom) I/O functions (procedural features, produce side effects): format (printing to screen) read (reading from the keyboard) 349

350 Machine learning 350

351 Historical notes Machine learning was inspired by human learning capabilities Early ideas (1950s): programming a computer to learn from experience, avoiding detailed programming efforts (e.g., learning to play checkers by analyzing expert players' games) Early applications in pattern recognition and computer vision: humans can easily carry out tasks in these domains, but they don't know how (no algorithmic solution is known) optical character recognition (OCR) from noisy images, as an alternative to template matching recognition of aerial images Main tools until the 1980s: artificial neural networks, considered at the fringe of AI in those days 351

352 Historical notes Since the 1990s: automated data gathering, inexpensive storage (sensors, commerce, financial transactions, Web interaction, biological data, news feeds, etc.) Machine learning becomes a mainstream AI approach to automatically extract useful knowledge from data: from knowledge-driven to data-driven approach Methodologies strongly rooted in statistics 352

353 Main current application areas Many computer vision and pattern recognition tasks: OCR handwriting recognition document recognition (forms, invoices, etc.) object detection, localization, recognition, tracking, re-identification medical image analysis content-based image retrieval scene understanding biometric identity recognition (fingerprint, face, iris, etc.)

354 Main current application areas Text categorization, Web page categorization, topic detection in text documents Natural language understanding and translation Speech recognition and understanding Data mining Automated (high-frequency) trading Computer security (spam/phishing detection, intrusion detection in computer systems/networks, botnet detection, malware detection, etc.) Bioinformatics (predicting protein secondary structure, binding strength between molecules, etc.) Recommender systems, on-line advertisement (modelling users' behavior)

355 Main learning paradigms Unsupervised learning: clustering, pattern discovery. No examples of the correct outcome are available. Application examples: data mining, recommender systems, topic discovery from documents. Supervised learning: classification, regression (prediction). Labelled examples are available. Application examples: object detection/recognition from images (e.g., faces), text/document categorization. Reinforcement learning: finding an optimal action policy. Only the outcome of an entire course of action is available (e.g.: either success or failure). Application examples: robot control, game playing. 355

356 Supervised classification (categorical prediction) Learning from examples: given a predefined set of classes, and a set of labelled samples, learn how to assign novel samples to the correct class. This kind of problem occurs in applications like: OCR (printed or handwritten characters) object detection (e.g., face, car, pedestrian) from images (e.g., automatic camera focus, video surveillance) biometric identity recognition text/image/document categorization into predefined topics (e.g., news tagging, image tagging, spam filtering) computer security: detecting malicious network traffic, malware, etc. speech recognition

357 Supervised classification: problem formulation Defining the task: what are the instances to be classified? (e.g., images, e-mails) what are the classes? (e.g., the 26 letters of the alphabet, spam and legitimate messages) Choosing a representation for the instances, e.g.: vectors of attribute values (Boolean, numerical, etc.), sentences in a logical language, strings, graphs,... Choosing a representation for the classification rules to be learnt, e.g.: Boolean functions, numerical functions, IF...THEN... rules, sentences in a logical language,... Devising a learning algorithm to produce a classifier from labelled examples (training set), capable of generalizing to unseen instances 357

358 Supervised classification: problem formulation Formal description: set of class labels: Y = {1,..., m} widely used representation for instances: a vector of d attribute values from given domains X1,..., Xd (the term feature is used in the pattern recognition field): x = (x1,..., xd) ∈ X = X1 × ... × Xd training set: a set of n examples (labelled instances) T = {(x1, y1),..., (xn, yn)} ∈ (X × Y)^n hypothesis space: a set of decision functions (classifiers) H = {f : X → Y} Goal: devising a learning algorithm to choose a classifier f ∈ H capable of correctly predicting the label of unseen instances. 358

359 Main issues How to represent samples, e.g., what attributes to choose? (known as feature engineering in pattern recognition) How to choose a suitable hypothesis space H? (by using prior knowledge, if any, or by guessing) How to devise a learning algorithm capable of finding a good hypothesis in H with low computational complexity, given that the true hypothesis is unknown? How many examples are needed to find a good hypothesis? (labelling samples usually requires human effort) How to evaluate the goodness of a hypothesis? 359

360 Inductive learning Learning from examples is a form of inductive learning: from specific observations to general principles. This is the opposite of deductive learning: from general principles to specific consequences. A typical form of deductive reasoning is the one used in mathematics: from axioms to theorems. Inductive learning is used to formulate scientific theories in disciplines like physics, and is a widely studied problem in the philosophy of science. Inductive learning is an ill-posed problem: usually, many different hypotheses agree with the observations at hand. What is the correct hypothesis (if any)? 360

361 Learning from examples: general principles Some general principles can be followed to learn a hypothesis from a set of examples, in supervised classification problems. The goal is to find a hypothesis capable of correctly predicting the class of unseen instances (not belonging to the training set), i.e., exhibiting a good generalization capability. To this aim, a trade-off must be attained between two factors: consistency with the observations: the hypothesis should correctly classify the examples in the training set minimal complexity: the hypothesis should be as simple as possible (a principle also known as Occam's razor, from the 14th century logician William of Occam) 361

362 Learning from examples: general principles Given a hypothesis space H = {f : X → Y} and a training set T, what is the best hypothesis in H? According to the above principles, it is defined as the simplest hypothesis among the ones consistent with T. Formally, a hypothesis f is consistent with T if: f(xi) = yi, for every (xi, yi) ∈ T. 362
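The consistency condition can be checked directly; a minimal sketch in Python, with a toy training set over a single Boolean attribute and classes {1, 2} (all names are illustrative):

```python
def consistent(f, T):
    """Return True iff hypothesis f classifies every training example
    correctly: f(x) == y for each (x, y) in T."""
    return all(f(x) == y for x, y in T)

# Toy training set: instances are 1-element attribute vectors
T = [((True,), 2), ((False,), 1)]
f = lambda x: 2 if x[0] else 1  # a hypothesis consistent with T
print(consistent(f, T))  # True
```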

363 Learning from examples: main issues How to evaluate the complexity of a hypothesis? What is the computational complexity of finding the simplest consistent hypothesis? What if no consistent hypothesis exists? Enforcing consistency may lead to over-fitting: perfectly fitting training instances, but poorly fitting unseen ones; this is analogous to: interpolating a set of n points without approximations, using a very complex function (e.g., a polynomial of degree n - 1) preparing for exams by rote learning What if the true (unknown) hypothesis does not belong to H? 363

364 Learning from examples: main issues When developing a learning algorithm, the representation for instances and hypotheses must be taken into account. For instance: possible representation for the instances: fixed-size attribute vectors (either numerical, e.g., the gray level of image pixels; or categorical, e.g., Boolean values to denote the presence or absence of a given word in an e-mail) strings or graphs (e.g., used to represent fingerprint or face images in biometric identity recognition) logical sentences possible representations for the hypothesis space: Boolean functions (used, e.g., in two-class problems where instances are represented as fixed-size vectors of Boolean attributes, like in spam recognition) mathematical functions (e.g., when instances are represented as fixed-size vectors of numerical attributes, like the gray-level value of pixels in image classification tasks) logical sentences 364

365 Two different, well-known classifiers Decision Trees introduced in the 1950s in psychology as models of high-level human learning (verbal and concept learning) explicit representation of IF-THEN classification rules structured as a tree originally devised for categorical attributes widely used today in several applications (e.g., computer vision, bioinformatics) Artificial Neural Networks introduced in the 1950s as models of low-level human brain functions (neurons, networks of neurons) implicit, distributed encoding of classification rules suited to numerical attributes widely used since the late 1980s current evolution: deep neural networks, and in particular convolutional neural networks for computer vision tasks 365

366 Decision Trees For discrete attributes having a finite number of values, a Decision Tree (DT) is a tree in which: every non-leaf node is associated to one attribute every edge corresponds to one possible value of the attribute in the parent node every leaf node is associated to a class label every attribute appears at most once in a path from the root to a leaf node 366

367 Example The following is an example of a DT for a problem with: two classes, Y = {1, 2} d = 3 Boolean attributes, X1, X2, X3 ∈ {true, false} [Figure: a decision tree with root node X2, internal nodes X1 and X3, and leaves labelled with the classes 1 and 2.] 367

368 Decision Trees A DT represents a set of IF-THEN rules. Every path from the root node to a leaf corresponds to a rule: IF X(1) = v1 AND ... AND X(k) = vk THEN Y = ... where: k is the number of non-leaf nodes in the path X(i), i = 1,..., k, are distinct attributes from {X1,..., Xd} associated to non-leaf nodes in the path vi ∈ X(i), i = 1,..., k, is the value of attribute X(i) associated to the corresponding edge For instance, the leftmost leaf node in the tree of the previous page corresponds to the rule: IF X2 = true AND X1 = true AND X3 = true THEN Y = 2 368
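The path-to-rule correspondence can be sketched in Python by walking every root-to-leaf path. The nested-tuple tree below is a hypothetical encoding, chosen to be consistent with the rule given for the leftmost leaf:

```python
def rules(tree, path=()):
    """Collect one (conditions, class) rule per root-to-leaf path.
    A tree is either a class label (leaf) or (attribute, {value: subtree})."""
    if not isinstance(tree, tuple):  # leaf: emit the rule for this path
        return [(path, tree)]
    attr, branches = tree
    out = []
    for value, subtree in branches.items():
        out += rules(subtree, path + ((attr, value),))
    return out

# Hypothetical tree: root X2; its true branch tests X1, then X3
tree = ("X2", {True: ("X1", {True: ("X3", {True: 2, False: 1}),
                             False: 1}),
               False: 2})
for conditions, label in rules(tree):
    print("IF", " AND ".join(f"{a} = {v}" for a, v in conditions),
          "THEN Y =", label)
```

The first printed rule is IF X2 = True AND X1 = True AND X3 = True THEN Y = 2, matching the leftmost path of the example.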

369 Learning algorithms for Decision Trees Goal: constructing the simplest DT consistent with the training set. A possible, intuitive measure of the complexity of a DT: the number of its nodes. Two issues emerge: how to construct a consistent tree? how to find the simplest (smallest) consistent tree, without explicitly building all the consistent ones? 369

370 Learning algorithms for Decision Trees What is the number of distinct DTs that can be constructed given the attributes X1,..., Xd and the m class labels? the maximum depth of a DT is d (each attribute can appear at most once in a path from the root to a leaf) in the simplest case (two-class problems, all attributes have two values), a full DT of depth d contains 2^(d+1) - 1 nodes... ...and different DTs can be obtained by permuting the d attributes in each path from the root to a leaf... The hypothesis space is therefore huge: a learning algorithm cannot use a brute force approach, i.e., explicitly searching for (constructing) all consistent DTs and then selecting the smallest one. 370

371 Learning algorithms for Decision Trees Disregarding for a moment the size of a DT, a very simple algorithm exists for constructing a consistent DT: for each training example, build a distinct path from the root to a leaf node include all the d attributes in the path, in any order label every edge in the path with the value of the corresponding attribute in the example at hand label the leaf node with the class label of that example Main drawback: such a DT just memorizes training examples. It does not capture the main distinctive characteristics of the classes, and is thus likely to exhibit a poor generalization capability. 371

372 Learning algorithms for Decision Trees A trade-off: constructing a reasonably small (not necessarily the smallest), consistent DT. This approach could be suboptimal, but allows one to define efficient learning algorithms for DTs. A well-known, simple learning algorithm named ID3 has been proposed by J.R. Quinlan in 1986. 372

373 Sketch of the ID3 learning algorithm ID3 is a top-down, recursive algorithm: from root to leaves. Each recursive step consists in building one node N of the DT, taking into account the subset T' of the examples in the training set T that reach that node (i.e., whose attribute values correspond to the ones in the path from the root to that node). To obtain a small and consistent DT: if all examples in T' belong to the same class, a leaf node is inserted with the label of that class otherwise, a non-leaf node is inserted, with the most discriminative attribute, i.e., the one whose values split the examples in T' as much as possible according to their class; then ID3 is called recursively to build the sub-trees corresponding to each value of the chosen attribute An example can be found on the course web site, in the companion file Decision_Tree_Learning.pdf. 373
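The recursion above can be sketched in Python for Boolean attributes and the entropy-based attribute selection discussed on the following slides. All helper names and the majority-class fallbacks are illustrative, not ID3's exact formulation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the class distribution estimated from a list of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def cond_entropy(T, a):
    """H(Y | X_a), estimated by splitting the examples on attribute a."""
    n = len(T)
    subs = [[y for x, y in T if x[a] == v] for v in (0, 1)]
    return sum(len(s) / n * entropy(s) for s in subs if s)

def id3(T, attrs):
    """Return a leaf (class label) or a node (attribute, {value: subtree})."""
    labels = [y for _, y in T]
    if len(set(labels)) == 1:  # all examples in one class: leaf node
        return labels[0]
    if not attrs:              # no attribute left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = min(attrs, key=lambda a: cond_entropy(T, a))
    rest = [a for a in attrs if a != best]
    branches = {}
    for v in (0, 1):
        sub = [(x, y) for x, y in T if x[best] == v]
        # empty branch: fall back to the majority class at this node
        branches[v] = id3(sub, rest) if sub else Counter(labels).most_common(1)[0][0]
    return (best, branches)

# Toy training set: the class copies attribute 0; attribute 1 is noise
T = [((0, 0), '-'), ((0, 1), '-'), ((1, 0), '+'), ((1, 1), '+')]
print(id3(T, [0, 1]))  # (0, {0: '-', 1: '+'})
```

On the toy set, ID3 correctly selects attribute 0 (conditional entropy 0) over attribute 1 (conditional entropy 1) and stops with two pure leaves.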

374 Evaluating the discriminant capability of attributes The original measure used in ID3 to evaluate the discriminant capability of attributes is entropy. It will be described in the following, in the simplest case of: a two-class problem, with class labels Y = {+, -} ("positive" and "negative", e.g., spam and legitimate e-mails) a set of Boolean attributes, Xi ∈ {0, 1}, i = 1,..., d The following notation will be used: p and n denote the number of positive and negative examples in the training set for any given attribute X, pj and nj denote the number of positive and negative examples such that X = j, j ∈ {0, 1} 374

375 Evaluating the discriminant capability of attributes
The most discriminative attribute is the one whose values exactly split the training examples according to their class:
- X = 1 (or X = 0) for all positive examples, i.e., p_1 = p (or p_0 = p)
- X = 0 (or X = 1) for all negative examples, i.e., n_0 = n (or n_1 = n)
The least discriminative attribute is instead the one for which p_0 = n_0 and p_1 = n_1. In other words, the value of such an attribute does not provide any information on the class label.

376 Evaluating the discriminant capability of attributes
In practice, no perfectly discriminative attribute may exist, but one can search for the attribute that discriminates better than the others. To evaluate how close an attribute X comes to splitting the examples according to their class, for each value of X one can estimate the class distribution from the corresponding training examples, and then evaluate the entropy of the resulting distributions: the lower the entropy, the higher the discriminative capability.

377 Evaluating the discriminant capability of attributes
The entropy of a discrete random variable V which can take on k values v_1, ..., v_k is defined as:

H(V) = − Σ_{i=1}^{k} P(V = v_i) log_2 P(V = v_i)

It is known that H(V) ∈ [0, log_2 k]:
- if all values are equiprobable (maximum uncertainty), i.e., P(V = v_i) = 1/k for all v_i, then H(V) = log_2 k (equal to 1 for k = 2)
- if only one value can occur (no uncertainty), i.e., P(V = v_i) = 1 for a given v_i, and P(V = v_j) = 0 for all j ≠ i, then H(V) = 0
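The two limit cases above are easy to verify numerically. A minimal sketch (the function name is illustrative), using the convention 0 · log_2 0 = 0:

```python
from math import log2

def entropy(probs):
    """H(V) = -sum_i P(v_i) * log2 P(v_i), with 0 * log2(0) taken as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)
```

For a uniform distribution over k values the entropy is log_2 k (1 bit for k = 2, 2 bits for k = 4); for a degenerate distribution it is 0.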

378 Evaluating the discriminant capability of attributes
Considering the class label Y ∈ {+, −} as a random variable, its conditional entropy given the value of a given (Boolean) attribute X is defined as follows:

H(Y|X) = P(X = 0) H(Y|X = 0) + P(X = 1) H(Y|X = 1)

From the definition of entropy, one obtains:

H(Y|X = 0) = − P(Y = +|X = 0) log_2 P(Y = +|X = 0) − P(Y = −|X = 0) log_2 P(Y = −|X = 0)
H(Y|X = 1) = − P(Y = +|X = 1) log_2 P(Y = +|X = 1) − P(Y = −|X = 1) log_2 P(Y = −|X = 1)

379 Evaluating the discriminant capability of attributes
The probability distributions involved in the expression of H(Y|X) can be estimated from the training examples as follows:

P(X = 0) = (p_0 + n_0) / (p + n)        P(X = 1) = (p_1 + n_1) / (p + n)
P(Y = +|X = 0) = p_0 / (p_0 + n_0)      P(Y = −|X = 0) = n_0 / (p_0 + n_0)
P(Y = +|X = 1) = p_1 / (p_1 + n_1)      P(Y = −|X = 1) = n_1 / (p_1 + n_1)

Finally, the most discriminative attribute X* for a given node, among the attributes that have not yet been used in the same path, is defined as:

X* = arg min_{X_i} H(Y|X_i)
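The estimates above can be combined into a small attribute-selection routine working directly on the counts p_j, n_j. This is a sketch with illustrative names; the count table layout is an assumption:

```python
from math import log2

def binary_entropy(p, n):
    """Entropy of a two-class distribution estimated from counts p, n."""
    total = p + n
    if total == 0:
        return 0.0
    h = 0.0
    for c in (p, n):
        if c > 0:
            h -= (c / total) * log2(c / total)
    return h

def conditional_entropy(counts):
    """H(Y|X) from counts[j] = (p_j, n_j) for each value j of attribute X."""
    total = sum(p + n for p, n in counts.values())
    return sum(((p + n) / total) * binary_entropy(p, n)
               for p, n in counts.values())

def most_discriminative(count_table):
    """X* = arg min over attributes of the estimated H(Y|X)."""
    return min(count_table, key=lambda x: conditional_entropy(count_table[x]))
```

A perfectly discriminative attribute yields H(Y|X) = 0, while an attribute with p_j = n_j for each value yields H(Y|X) = 1.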

380 Issues and extensions of Decision Trees
- To prevent over-fitting it may be useful to prune a DT, i.e., to replace some subtrees with a leaf node, even if this results in losing consistency
- Irrelevant attributes with many values (e.g., the birth date of a patient with respect to a given disease) may exhibit a high discriminant capability on training examples, causing severe over-fitting: the entropy-based measure can be modified to avoid this drawback
- To deal with numerical attributes with many or infinitely many values, a test like X ≤ t can be used instead of X = v, where t is a node-specific value to be chosen by the learning algorithm

381 Evaluating the generalization capability
Given a hypothesis f, in statistical terms the simplest measure of its generalization capability is its error probability, i.e., the probability of misclassifying a random instance (X, Y): P(f(X) ≠ Y).
However, in real classification problems the probability distribution of (X, Y) is unknown (e.g., what is the probability of receiving a specific spam e-mail?). Therefore, P(f(X) ≠ Y) is estimated as the fraction of misclassified instances in a given set of labelled ones, which is named the error rate.
What labelled instances should one use to estimate the error rate?

382 Evaluating the generalization capability
A simple but naive solution is to estimate the error rate from the same training set T = {(x_1, y_1), ..., (x_n, y_n)} which was previously used to find the hypothesis f:

(1/n) Σ_{i=1}^{n} I(f(x_i) ≠ y_i)

where I(A) = 1 if A = true, and I(A) = 0 if A = false. This estimate is called the resubstitution error.

383 Evaluating the generalization capability
The resubstitution error is however an optimistically biased estimate of the error probability, since learning algorithms enforce consistency, i.e., zero errors on T, which is prone to over-fitting.
A better estimate could be obtained using the hold-out technique: learning a classifier from a subset of n' < n labelled examples, and computing its error rate on the remaining n − n' instances, called the testing set. However:
- the hold-out estimate can be unstable, since the error rate depends on the set of instances on which it is computed
- a classifier learnt from a smaller training set can exhibit a higher error probability
- on the other hand, if the testing set size n − n' is too small, the estimate of the error rate is unreliable

384 Evaluating the generalization capability
The drawbacks of the hold-out estimate can be overcome by the k-fold cross-validation technique:
- subdivide the available labelled examples T into k disjoint and equally sized partitions T_1, ..., T_k
- for each T_i, learn a classifier f^(i) using the remaining partitions T \ T_i as the training set, and compute the number of errors e_i made by f^(i) on T_i
- compute the error rate as (1/n) Σ_{i=1}^{k} e_i
As the limit case, leave-one-out cross-validation consists in using k = n partitions of one sample each (in this case, e_i ∈ {0, 1}).
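The three steps above can be sketched as follows. This is a minimal illustration, assuming examples are (x, y) pairs and `learn` maps a training list to a classifier; the function name and signature are illustrative:

```python
import random

def k_fold_error_rate(examples, k, learn, seed=0):
    """Estimate the error rate of `learn` by k-fold cross-validation.

    examples: list of (x, y) pairs
    learn: function mapping a training list to a classifier f, f(x) -> label
    """
    data = examples[:]
    random.Random(seed).shuffle(data)        # random, disjoint partitions
    folds = [data[i::k] for i in range(k)]   # k roughly equal folds
    errors = 0
    for i, test_fold in enumerate(folds):
        # Train on all folds except T_i, count errors e_i on T_i.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        clf = learn(train)
        errors += sum(1 for x, y in test_fold if clf(x) != y)
    return errors / len(data)                # (1/n) * sum_i e_i
```

A perfect classifier gives an estimated error rate of 0, while a classifier that is always wrong gives 1, regardless of how the folds are drawn.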

385 Artificial Neural Networks: historical notes
- Inspired by findings in neuroanatomy and neurophysiology (19th to early 20th cent.)
- 1943: McCulloch and Pitts' model of neurons as logic units
- 1949: Hebb's model of changes in synaptic strength and cell assemblies as the origin of adaptation, learning and thinking
- 1957: Rosenblatt's perceptron:
  - network of McCulloch and Pitts neural elements
  - training procedures for adjusting connection weights
  - applications to pattern recognition: error correction training procedure (perceptron learning algorithm)

386 Artificial Neural Networks: historical notes
- 1970s: limits of perceptrons (limited expressive power, lack of mathematical rigor) led to a drop of interest in neural networks (M.L. Minsky and S.A. Papert, Perceptrons, MIT Press, 1969)
- Mid 1980s: renaissance of neural networks, the connectionist approach:
  - an efficient learning algorithm: back-propagation
  - theoretical support: statistics, computational learning theory
  - seminal works:
    D.E. Rumelhart, J.L. McClelland (Eds.), Parallel Distributed Processing, MIT Press, 1986
    D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors, Nature 323, 533-536, 1986

387 Artificial Neural Networks: historical notes
Since the 1990s:
- different kinds of ANNs:
  - feed-forward Multi-Layer Perceptrons
  - Radial Basis Function networks
  - recurrent networks: Hopfield networks, associative memories
  - Boltzmann machines
- applications in several fields: computer vision, pattern recognition, control systems, etc.
- today: great interest in deep neural networks

388 Brains and neurons: some facts
- Basic unit: the neuron (nerve cell)
[Figure: anatomy of a neuron - cell body (soma), nucleus, dendrites, axon, axonal arborization, synapses with other cells]
- About 10^11 neurons, 10^14 connections between them
- Apparently simple neuron behavior: firing (up to 10^3 Hz) in response to specific patterns of input signals
- Massive parallelism
- Robustness to noise
- Learning capability: novel connections between neurons and changing connection strength in response to external stimuli

389 The perceptron
McCulloch and Pitts' model of neurons as logic units (1943):
[Figure: a perceptron unit with inputs x_1, ..., x_d weighted by w_1, ..., w_d, bias input x_0 = −1 weighted by w_0, and output y]
- x_1, ..., x_d: input signals; x_0 = −1: bias input
- w_0, ..., w_d: connection weights (real numbers)
- a = Σ_{i=0}^{d} w_i x_i: input
- y = g(a): activation (output); g is called the activation function

390 The perceptron
The activation function of the perceptron is defined as the Heaviside function (or step function):

g(a) = 1, if a ≥ 0
g(a) = 0, if a < 0

391 Expressive capability of perceptrons
What kind of input-output behavior can a perceptron exhibit?
- It can act as a logic gate: suitable weight values can implement Boolean functions like AND and OR. For instance, for d = 2 and x_1, x_2 ∈ {0, 1}, it is easy to see that:
  - w_1 = w_2 = 1, w_0 = 1.5: AND
  - w_1 = w_2 = 1, w_0 = 0.5: OR
- In general, a perceptron can represent a binary function of Boolean and/or numerical variables. This allows it to be used as a two-class classifier, considering the input signals as the attribute values
- In particular, in the case of numerical attributes, a perceptron implements a linear discriminant function in attribute space:

  y = 1, if Σ_{i=0}^{d} w_i x_i ≥ 0
  y = 0, if Σ_{i=0}^{d} w_i x_i < 0
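The AND and OR weight assignments above can be checked directly. A minimal sketch (function names are illustrative), using the convention x_0 = −1, so a = Σ_i w_i x_i − w_0:

```python
def perceptron(weights, w0):
    """A perceptron unit: y = 1 if sum_i w_i*x_i - w0 >= 0 (bias x0 = -1)."""
    def g(xs):
        a = sum(w * x for w, x in zip(weights, xs)) - w0
        return 1 if a >= 0 else 0
    return g

# The weight values from the slide above:
AND = perceptron([1, 1], 1.5)
OR = perceptron([1, 1], 0.5)
```

For AND, the input a = x_1 + x_2 − 1.5 is non-negative only when both inputs are 1; for OR, a = x_1 + x_2 − 0.5 is non-negative whenever at least one input is 1.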

392 The perceptron as a learning machine
It is known that experience and learning change (among other things) the connection strength between neurons in the human brain. Can the perceptron emulate this behaviour?
To this aim, the connection weights should be modified by some learning algorithm to reproduce a desired input-output behaviour, according to a given set of examples. Such an algorithm was devised by F. Rosenblatt in 1957.

393 The perceptron learning algorithm
Notation:
- attribute vector: x = (x_0, x_1, ..., x_d)
- weight vector: w = (w_0, w_1, ..., w_d)
- perceptron input: a(x, w) = w_0 x_0 + w_1 x_1 + ... + w_d x_d
- perceptron output (activation): y = g(a) = 1 if a ≥ 0, 0 if a < 0
- training set: T = {(x^k, t^k)}, k = 1, ..., n, where t^k ∈ {−1, +1} (the "target") is the class label of x^k

394 The perceptron learning algorithm
An error function can be defined to evaluate the extent to which, for a given w, the perceptron fails to correctly classify a given example (x^k, t^k):

E(x^k, w) = − t^k a(x^k, w)

It is easy to see that:
- E(x^k, w) > 0 if and only if x^k is misclassified
- if E(x^k, w) > 0, the higher E(x^k, w), the larger the changes to be made on w to correctly classify x^k
Finding a hypothesis consistent with the training examples can now be formulated as the optimization problem of finding the weight values w* that minimize the error function on misclassified examples:

w* = arg min_w Σ E(x^k, w), where the sum runs over the (x^k, t^k) ∈ T such that E(x^k, w) > 0

395 Perceptron learning algorithm
Rosenblatt's learning algorithm follows an on-line gradient descent approach: it starts from a randomly chosen w, then it repeatedly scans the training examples and, whenever a misclassified example x^k is found, it updates w to reduce E(x^k, w):

w_i ← w_i − η ∂E(x^k, w)/∂w_i, i = 0, ..., d

where η > 0 is an arbitrary constant (the learning rate). From the definition of E(x^k, w) the above weight update rule can be rewritten as:

w_i ← w_i + η x_i^k t^k, i = 1, ..., d
w_0 ← w_0 − η t^k

396 Perceptron learning algorithm

function Perceptron-Learning(T) returns weight values w
  randomly choose the initial weight values w
  repeat
    for each (x^k, t^k) ∈ T do
      if E(x^k, w) > 0 then
        w_i ← w_i − η ∂E(x^k, w)/∂w_i, i = 0, ..., d
    end for
  until a stopping condition is satisfied
  return w

Each for loop over the training examples is named an epoch.
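The pseudocode above can be sketched as runnable Python. This is a minimal illustration with assumed names and data layout (examples as (x, t) pairs with t ∈ {−1, +1}); for simplicity it treats a = 0 as an error, which also enforces strict separation:

```python
import random

def perceptron_learning(examples, d, eta=1.0, max_epochs=1000, seed=0):
    """Rosenblatt's learning algorithm, with bias input x0 = -1.

    examples: list of (x, t) with x a length-d tuple and t in {-1, +1}.
    Stops after an epoch with no weight update, or after max_epochs.
    """
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(d + 1)]   # w[0] is the bias weight
    for _ in range(max_epochs):
        updated = False
        for x, t in examples:
            a = sum(w[i + 1] * x[i] for i in range(d)) - w[0]
            if t * a <= 0:                 # misclassified (a = 0 counted too)
                updated = True
                w[0] -= eta * t            # bias update: x0 = -1
                for i in range(d):
                    w[i + 1] += eta * t * x[i]
        if not updated:
            return w                       # consistent hypothesis found
    return w
```

On a linearly separable set such as AND (with targets in {−1, +1}), the convergence theorem below guarantees termination with a consistent hypothesis.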

397 Perceptron learning algorithm
Perceptron convergence theorem (F. Rosenblatt, 1960): if the training set T is linearly separable, the perceptron learning algorithm always converges to a consistent hypothesis after a finite number of epochs, for any η > 0.
Since convergence is guaranteed for any η > 0, usually η = 1 is chosen.
In practice one does not know in advance whether T is linearly separable: a suitable stopping condition has to be used. For instance, if T is not linearly separable, it has been observed that after a certain number of epochs the weights start "oscillating": the detection of this behavior can be used as the stopping condition.

398 Limits of the perceptron
The perceptron turned out to be an oversimplified model of neurons: it is not useful for understanding the human brain. More accurate models are still being developed by neuroscientists, but their complexity is too high for AI applications.
However, even for AI applications the perceptron has too limited an expressive capability, e.g.:
- it cannot represent some Boolean functions, like XOR (this can easily be proven geometrically: no line can separate the points (0, 0) and (1, 1) from (0, 1) and (1, 0) on a plane)
- in supervised classification problems the classes are often non-linearly separable, and the class boundary can be highly non-linear

399 Perceptron networks
Neurons in the human brain are organized into highly interconnected networks, which enables them to produce very complex input-output behaviors. To mimic them, artificial neural networks (ANNs) made up of interconnected perceptrons (possibly with recurrent connections) can be used.
An example: a perceptron network with two inputs, one output, and several so-called hidden units (bias inputs are not shown for simplicity):
[Figure: a network with inputs x_1 and x_2, a set of hidden units, and one output unit producing y]

400 Perceptron networks
Perceptron networks exhibit a non-linear input-output behaviour, i.e., they can implement a non-linear discriminant function in attribute space. For instance, for Boolean inputs x_1, x_2 it is easy to see that the simple perceptron network below implements the XOR function (see next slide):
[Figure: a two-layer network with inputs x_1 and x_2, bias input x_0 = −1, hidden units u_1 and u_2, and output y]

401 Perceptron networks
To prove that the above perceptron network implements the XOR function, note that:
- unit u_1 outputs the value 1 only when x_1 = x_2 = 1
- unit u_2 outputs the value 1 only when x_1 = x_2 = 0
- the output y equals 1 only when both u_1 and u_2 output the value 0, i.e., when either x_1 = 0, x_2 = 1 or x_1 = 1, x_2 = 0
This can be seen more easily from a plot of the (x_1, x_2) plane, where the dashed lines represent the discriminant functions of units u_1 and u_2:
[Figure: the unit square in the (x_1, x_2) plane, with two parallel dashed lines separating the point (1, 1) and the point (0, 0) from the points (0, 1) and (1, 0)]
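The argument above can be checked with concrete weights. The specific weight values below are illustrative (one choice among many): u_1 fires only when both inputs are 1, u_2 fires only when both are 0, and the output fires only when neither hidden unit does:

```python
def step(a):
    """Heaviside activation: 1 if a >= 0, else 0."""
    return 1 if a >= 0 else 0

def xor_network(x1, x2):
    """Two-layer perceptron network computing XOR (illustrative weights)."""
    u1 = step(x1 + x2 - 1.5)       # fires only when x1 = x2 = 1
    u2 = step(-x1 - x2 + 0.5)      # fires only when x1 = x2 = 0
    return step(-u1 - u2 + 0.5)    # fires only when u1 = u2 = 0
```

The hidden units carve the plane with the two parallel lines of the previous slide; the output unit then combines the two half-plane tests into the non-linearly separable XOR region.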

402 Issues of perceptron networks
Perceptron networks exhibit the following main problems:
- no target output can be defined for hidden units: Rosenblatt's learning algorithm cannot be applied to set the weight values of their input connections
- alternative learning algorithms devised in the 1970s exhibited too high a computational complexity
- how to define a suitable network architecture, i.e., the number of perceptrons and the connections between them, for a given application? Can the architecture itself be chosen by the learning algorithm, together with the connection weights?
These difficulties contributed to a drop of interest in ANNs in the 1970s.

403 The renaissance of artificial neural networks
A practical solution to the above issues was found in the 1980s:
- learning the network architecture is too difficult: it is better to use a predefined architecture, and to learn only the connection weights
- efficient learning algorithms can be devised for:
  - specific architectures, like feed-forward networks
  - continuous activation functions, instead of the step function

404 Continuous activation functions
Two widely used activation functions for ANN units, rooted in the statistical setting of supervised classification problems:
- logistic ("sigmoid") function: g(a) = 1 / (1 + e^(−a)) ∈ (0, 1)
- hyperbolic tangent: g(a) = tanh(a) = (e^a − e^(−a)) / (e^a + e^(−a)) ∈ (−1, 1)
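Both functions are one-liners; the two are also closely related, since tanh(a) = 2 g(2a) − 1 where g is the logistic function. A minimal sketch (the function name is illustrative):

```python
from math import exp, tanh

def logistic(a):
    """Logistic (sigmoid): g(a) = 1 / (1 + e^-a), with values in (0, 1)."""
    return 1.0 / (1.0 + exp(-a))

# tanh(a) = (e^a - e^-a) / (e^a + e^-a), with values in (-1, 1),
# is available directly as math.tanh.
```

The logistic function is 0.5 at a = 0 and saturates towards 0 and 1 for large negative and positive inputs.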

405 The feed-forward multi-layer (FF-ML) architecture
A widely used architecture for supervised classification problems:
- units are arranged into layers (hence the name "multi-layer"):
  - input layer: fictitious units corresponding to the inputs
  - output layer: one unit for two-class problems, m units for m-class problems with m > 2
  - one or more hidden layers
- there are no recurrent connections (hence the name "feed-forward"), and:
  - every hidden and output unit receives inputs only from units of the previous layer
  - every input and hidden unit sends its output only to units of the next layer
Usually FF-ML networks are fully connected: every input and hidden unit is connected to all units of the next layer.

406 FF-ML networks: an example
The figure below shows a fully connected FF-ML network with two inputs (x_1 and x_2), two hidden layers of three and four units each, and one output unit (it can be used as a two-class classifier):
[Figure: layer 1 (input): x_1, x_2; layer 2 (hidden): three units; layer 3 (hidden): four units; layer 4 (output): one unit producing y]

407 The feed-forward multi-layer (FF-ML) architecture
Since the network output is continuous, a post-processing step is necessary to convert it into a class label (which is a discrete, categorical value).
For two-class problems a single output unit is used. To classify a new instance x, the network output y(x) is usually converted by thresholding it. For instance, if the logistic activation function is used, x is labelled as 0 if y(x) < 0.5, and as 1 if y(x) ≥ 0.5.
For m-class problems (m > 2), m output units are used:
- for any instance of the k-th class, the target output is defined as y_k = 1, and y_p = 0 for p ≠ k
- an instance x is assigned to the class k* corresponding to the highest output value: k* = arg max_{k=1,...,m} y_k(x)

408 Expressive capability of FF-ML networks
It has been shown that an FF-ML network with a suitable number of hidden units with logistic or tanh activation function, and output units with linear activation function g(a) = a, can represent:
- every Boolean function, using one hidden layer
- every bounded and continuous function, with arbitrarily small approximation error, using one hidden layer
- any bounded, discontinuous function, with arbitrarily small approximation error, using two hidden layers
However, in the worst case an exponential number (in the number of inputs) of hidden units is required.

409 Expressive capability of FF-ML networks
In practical classification problems the target function (i.e., the relationship between attribute values and class label) is unknown. This means that also the most suitable FF-ML architecture (number of hidden layers and of hidden units) is unknown.
An iterative trial-and-error design approach is usually adopted:
- start with a small FF-ML network (e.g., one hidden layer and a few hidden units)
- evaluate increasingly complex architectures (by increasing the number of hidden units and/or hidden layers) until a desired generalization capability is attained

410 The back-propagation learning algorithm
An efficient learning algorithm was devised in the 1980s for FF ANNs, including multi-layer networks: back-propagation.
It exploits the continuous and differentiable activation function of each unit, which makes the network output a continuous and differentiable function of the network inputs. Accordingly:
- a continuous and differentiable error function can be defined to evaluate the difference between the network output and the target output of a given example
- the error function can be minimized over a training set using a gradient descent-like procedure

411 The back-propagation learning algorithm
For an FF-ML network with d inputs and one output y, the simplest error function for an example (x^k, t^k) is the squared error:

E(x^k, w) = (1/2) [y(x^k) − t^k]^2

where w is a vector made up of all the connection weights (suitably ordered).
Note that t^k must be defined according to the activation function of the output unit:
- logistic function: g(a) ∈ (0, 1), therefore t^k ∈ {0, 1}
- tanh function: g(a) ∈ (−1, +1), therefore t^k ∈ {−1, +1}

412 The back-propagation learning algorithm
Similarly to the perceptron learning algorithm, the goal of the back-propagation algorithm is to find the weight values that minimize the error function on the whole training set T:

w* = arg min_w Σ_{k=1}^{n} E(x^k, w)

This is achieved by a gradient descent procedure, starting from a randomly chosen w, and carrying out several epochs over the training examples, in two alternative modalities:
- batch: compute the error on T using the current w, Σ_{k=1}^{n} E(x^k, w), then update w to reduce it
- on-line (usually faster): for each example x^k compute the error E(x^k, w) using the current w, then update w to reduce it

413 The back-propagation learning algorithm
From now on, a fully connected FF-ML network will be considered. The following notation will be used:
- E(x^k, w) will be written as E_k
- w_ji denotes the weight of the connection from unit u_i to unit u_j (note that u_i can be either a hidden or an input unit: in the latter case its output corresponds to the value of one attribute, i.e., one of the network inputs)
- a_j and z_j denote respectively the input and the output of unit u_j
[Figure: a unit u_j receiving the output z_i of unit u_i through the weight w_ji, and producing the output z_j]

414 The back-propagation learning algorithm
In the on-line back-propagation modality, according to the gradient descent procedure the update rule for the k-th example is:

w_ji ← w_ji − η ∂E_k/∂w_ji

Since the weight w_ji affects E_k through the input a_j of u_j, it is convenient to compute the above derivative as:

∂E_k/∂w_ji = (∂E_k/∂a_j) (∂a_j/∂w_ji)

415 The back-propagation learning algorithm
Remember that a_j = Σ_p w_jp z_p, where p ranges over all the units u_p of the previous layer. It easily follows that:

∂a_j/∂w_ji = z_i

which is a known quantity, given the network input x^k.
The term ∂E_k/∂a_j corresponds to the contribution of the input of u_j to E_k: it is called the error of u_j, and is denoted by δ_j. The partial derivative ∂E_k/∂w_ji can therefore be written as:

∂E_k/∂w_ji = δ_j z_i

Now the problem is: how to compute the term δ_j?

416 The back-propagation learning algorithm
Consider first the case when u_j is the output unit (thus u_i is a hidden unit), i.e., y = z_j:
[Figure: hidden unit u_i feeding the output unit u_j through the weight w_ji; the network output is y = z_j]
By definition of E_k, and remembering that z_j = g(a_j):

E_k := (1/2) (y − t^k)^2 = (1/2) (z_j − t^k)^2 = (1/2) [g(a_j) − t^k]^2

Therefore:

δ_j := ∂E_k/∂a_j = [g(a_j) − t^k] dg(a_j)/da_j

417 The back-propagation learning algorithm
If the logistic activation function is used, which is defined as g(a_j) = [1 + exp(−a_j)]^(−1), it is easy to see that:

dg(a_j)/da_j = g(a_j) [1 − g(a_j)] = z_j (1 − z_j)

Therefore:

δ_j = (z_j − t^k) z_j (1 − z_j)

Putting everything together, when w_ji is the weight of a connection to the output unit:

∂E_k/∂w_ji = δ_j z_i = (z_j − t^k) z_j (1 − z_j) z_i

Note that all the above terms can be computed, given the network input x^k.

418 The back-propagation learning algorithm
Consider now the case when u_j is a hidden unit (thus u_i can be either an input unit or another hidden unit):
[Figure: unit u_i feeding the hidden unit u_j through the weight w_ji; u_j feeds the units u_p of the next layer through the weights w_pj]
In this case the term δ_j can be rewritten as follows, taking into account that a_j affects E_k through the inputs a_p of all the units of the next layer to which u_j is connected:

δ_j := ∂E_k/∂a_j = Σ_p (∂E_k/∂a_p) (∂a_p/∂a_j)

419 The back-propagation learning algorithm
Remember that:
- a_p = Σ_q w_pq z_q, where q ranges over the units of the previous layer with respect to u_p, including u_j
- by definition, z_j = g(a_j)
Therefore, if the logistic activation function is used:

∂a_p/∂a_j = (∂a_p/∂z_j) (dz_j/da_j) = w_pj dz_j/da_j = w_pj z_j (1 − z_j)

Note now that the terms ∂E_k/∂a_p correspond by definition to the errors δ_p of the units of the next layer with respect to u_j. Therefore:

δ_j = z_j (1 − z_j) Σ_p w_pj δ_p

420 The back-propagation learning algorithm
Putting everything together, when w_ji is the weight of a connection to a hidden unit:

∂E_k/∂w_ji = δ_j z_i = z_j (1 − z_j) z_i Σ_p w_pj δ_p

Given the network input x^k, all the above terms can be computed, provided that all the errors δ_p of the next layer (with respect to u_j) have already been computed.

421 The back-propagation learning algorithm
To sum up, the basic steps of the on-line back-propagation algorithm, for a given training example (x^k, t^k), are the following:
1. compute the network output y(x^k) by propagating the input x^k forward to the output layer
2. update the connection weights by propagating the errors δ_j backwards, from the connections to the output layer down to the connections to the first hidden layer (hence the name "back-propagation"):
   - for connections w_ji to the output unit u_j: ∂E_k/∂w_ji = δ_j z_i = (z_j − t^k) z_j (1 − z_j) z_i
   - for connections w_ji to a hidden unit u_j: ∂E_k/∂w_ji = δ_j z_i = z_j (1 − z_j) z_i Σ_p w_pj δ_p

422 The back-propagation learning algorithm
A formal description of the algorithm:

function Back-Propagation(T) returns weight values w
  randomly choose the initial weight values w
  repeat
    for each (x^k, t^k) ∈ T do
      compute the network output y(x^k) (forward propagation)
      update the weights w (back-propagation)
    end for
  until a stopping condition is satisfied
  return w

Also in this case each loop over the training examples is named an epoch. Usually the error function decreases across some epochs at the beginning of the training procedure, then it reaches an almost constant value. A typical stopping condition consists in detecting when Σ_k E_k remains almost constant for some consecutive epochs.
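The forward and backward steps above can be sketched for the simplest case: one hidden layer of logistic units, one logistic output unit, squared error, and a fixed number of epochs as the stopping condition. This is a compact illustration with assumed names and data layout, not a production implementation; it uses the bias convention x_0 = −1 (w[0] is the bias weight of each unit):

```python
import math
import random

def train_backprop(examples, d, n_hidden, eta=0.5, epochs=5000, seed=0):
    """On-line back-propagation for a fully connected network with one
    hidden layer of logistic units and one logistic output unit.

    examples: list of (x, t) with x a length-d tuple and t in {0, 1}.
    Returns a classifier thresholding the network output at 0.5.
    """
    rng = random.Random(seed)
    g = lambda a: 1.0 / (1.0 + math.exp(-a))          # logistic activation
    # hidden[j][i+1]: weight from input i to hidden unit j; [0] is the bias
    hidden = [[rng.uniform(-1, 1) for _ in range(d + 1)]
              for _ in range(n_hidden)]
    out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, t in examples:
            # 1. forward propagation
            z = [g(sum(w[i + 1] * x[i] for i in range(d)) - w[0])
                 for w in hidden]
            y = g(sum(out[j + 1] * z[j] for j in range(n_hidden)) - out[0])
            # 2. backward propagation of the errors delta
            delta_out = (y - t) * y * (1 - y)          # (z_j - t) z_j (1 - z_j)
            deltas = [z[j] * (1 - z[j]) * out[j + 1] * delta_out
                      for j in range(n_hidden)]        # z_j (1-z_j) w_pj delta_p
            # weight updates: w_ji <- w_ji - eta * delta_j * z_i (bias z = -1)
            out[0] += eta * delta_out
            for j in range(n_hidden):
                out[j + 1] -= eta * delta_out * z[j]
                hidden[j][0] += eta * deltas[j]
                for i in range(d):
                    hidden[j][i + 1] -= eta * deltas[j] * x[i]
    def predict(x):
        z = [g(sum(w[i + 1] * x[i] for i in range(d)) - w[0]) for w in hidden]
        y = g(sum(out[j + 1] * z[j] for j in range(n_hidden)) - out[0])
        return 1 if y >= 0.5 else 0
    return predict
```

On a simple separable problem such as AND (with targets in {0, 1}, matching the logistic output range), a few thousand on-line epochs are typically enough for the thresholded output to reproduce the target function.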

423 Issues of the back-propagation algorithm
The error function Σ_k E_k usually has many local minima: gradient descent does not guarantee convergence to the smallest error on the training examples. A multi-start strategy can be used to mitigate this problem: the algorithm is run several times starting from different random weights, and the solution with minimum error is chosen.
Like all classifiers, ANNs are also prone to over-fitting. In ANNs over-fitting can be detected by running the back-propagation algorithm on a training set made up of a subset of the available examples, and by evaluating the error function also on (a subset of) the remaining examples, called the validation set: if the error on the validation examples starts increasing, over-fitting is occurring.
To prevent over-fitting, an early-stopping strategy can be used: the back-propagation algorithm is stopped when the error function starts increasing on the validation examples.

424 Artificial neural networks vs Decision trees
Expressiveness:
- DTs are easily interpretable: they represent high-level IF-THEN rules by design
- ANNs are not interpretable: they are black boxes that implement classification rules distributed over all the connection weights
Kind of decision function (for numerical attributes):
- DTs produce axis-parallel splits of the attribute space
- ANNs can produce more flexible class boundaries
Generalization capability:
- ANNs (with a suitable architecture) can outperform DTs
- ANNs are usually more robust than DTs to noise on class labels and attribute values

425 Deep neural networks (DNN)
DNNs are a recent, very popular extension of ANNs (but early ideas date back to the 1970s), inspired by the structure of the human brain. Basically, they are multi-layer networks with many hidden layers.
[Figure: an example of a (not so deep) DNN with several hidden layers]
Ad hoc modifications to activation functions and training procedures have been introduced to avoid drawbacks of the standard ones (e.g., very slow convergence).

426 Deep neural networks
DNNs are widely employed for computer vision tasks, for which specialized architectures have been proposed, named convolutional neural networks (CNNs). In this case the CNN input is a raw image, e.g., an array of pixel values. Accordingly, the CNN architecture is not fully connected: its connections exploit the spatial adjacency between pixels. In particular, the input units are arranged into an array.
[Figure: an example of a convolutional neural network architecture]


More information

Informed Search Algorithms. Chapter 4

Informed Search Algorithms. Chapter 4 Informed Search Algorithms Chapter 4 Outline Informed Search and Heuristic Functions For informed search, we use problem-specific knowledge to guide the search. Topics: Best-first search A search Heuristics

More information

Overview. Path Cost Function. Real Life Problems. COMP219: Artificial Intelligence. Lecture 10: Heuristic Search

Overview. Path Cost Function. Real Life Problems. COMP219: Artificial Intelligence. Lecture 10: Heuristic Search COMP219: Artificial Intelligence Lecture 10: Heuristic Search Overview Last time Depth-limited, iterative deepening and bi-directional search Avoiding repeated states Today Show how applying knowledge

More information

COMP219: Artificial Intelligence. Lecture 10: Heuristic Search

COMP219: Artificial Intelligence. Lecture 10: Heuristic Search COMP219: Artificial Intelligence Lecture 10: Heuristic Search 1 Class Tests Class Test 1 (Prolog): Tuesday 8 th November (Week 7) 13:00-14:00 LIFS-LT2 and LIFS-LT3 Class Test 2 (Everything but Prolog)

More information

COMP219: Artificial Intelligence. Lecture 10: Heuristic Search

COMP219: Artificial Intelligence. Lecture 10: Heuristic Search COMP219: Artificial Intelligence Lecture 10: Heuristic Search 1 Class Tests Class Test 1 (Prolog): Friday 17th November (Week 8), 15:00-17:00. Class Test 2 (Everything but Prolog) Friday 15th December

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence hapter 1 hapter 1 1 Iterative deepening search function Iterative-Deepening-Search( problem) returns a solution inputs: problem, a problem for depth 0 to do result Depth-Limited-Search(

More information

TDT4136 Logic and Reasoning Systems

TDT4136 Logic and Reasoning Systems TDT4136 Logic and Reasoning Systems Chapter 3 & 4.1 - Informed Search and Exploration Lester Solbakken solbakke@idi.ntnu.no Norwegian University of Science and Technology 18.10.2011 1 Lester Solbakken

More information

Graphs vs trees up front; use grid too; discuss for BFS, DFS, IDS, UCS Cut back on A* optimality detail; a bit more on importance of heuristics,

Graphs vs trees up front; use grid too; discuss for BFS, DFS, IDS, UCS Cut back on A* optimality detail; a bit more on importance of heuristics, Graphs vs trees up front; use grid too; discuss for BFS, DFS, IDS, UCS Cut back on A* optimality detail; a bit more on importance of heuristics, performance data Pattern DBs? General Tree Search function

More information

Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) Computer Science Department

Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) Computer Science Department Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) Computer Science Department (CHAPTER-3-PART3) PROBLEM SOLVING AND SEARCH Searching algorithm Uninformed

More information

Informed Search. Topics. Review: Tree Search. What is Informed Search? Best-First Search

Informed Search. Topics. Review: Tree Search. What is Informed Search? Best-First Search Topics Informed Search Best-First Search Greedy Search A* Search Sattiraju Prabhakar CS771: Classes,, Wichita State University 3/6/2005 AI_InformedSearch 2 Review: Tree Search What is Informed Search?

More information

COMP9414/ 9814/ 3411: Artificial Intelligence. 5. Informed Search. Russell & Norvig, Chapter 3. UNSW c Alan Blair,

COMP9414/ 9814/ 3411: Artificial Intelligence. 5. Informed Search. Russell & Norvig, Chapter 3. UNSW c Alan Blair, COMP9414/ 9814/ 3411: Artificial Intelligence 5. Informed Search Russell & Norvig, Chapter 3. COMP9414/9814/3411 15s1 Informed Search 1 Search Strategies General Search algorithm: add initial state to

More information

COSC343: Artificial Intelligence

COSC343: Artificial Intelligence COSC343: Artificial Intelligence Lecture 18: Informed search algorithms Alistair Knott Dept. of Computer Science, University of Otago Alistair Knott (Otago) COSC343 Lecture 18 1 / 1 In today s lecture

More information

Artificial Intelligence. Informed search methods

Artificial Intelligence. Informed search methods Artificial Intelligence Informed search methods In which we see how information about the state space can prevent algorithms from blundering about in the dark. 2 Uninformed vs. Informed Search Uninformed

More information

Informed search algorithms

Informed search algorithms CS 580 1 Informed search algorithms Chapter 4, Sections 1 2, 4 CS 580 2 Outline Best-first search A search Heuristics Hill-climbing Simulated annealing CS 580 3 Review: General search function General-Search(

More information

Problem solving and search

Problem solving and search Problem solving and search hapter 3 hapter 3 1 Outline Problem-solving agents Problem types Problem formulation Example problems asic search algorithms hapter 3 3 Restricted form of general agent: Problem-solving

More information

CS-E4800 Artificial Intelligence

CS-E4800 Artificial Intelligence CS-E4800 Artificial Intelligence Jussi Rintanen Department of Computer Science Aalto University January 12, 2017 Transition System Models The decision-making and planning at the top-level of many intelligent

More information

Informed search algorithms

Informed search algorithms Artificial Intelligence Topic 4 Informed search algorithms Best-first search Greedy search A search Admissible heuristics Memory-bounded search IDA SMA Reading: Russell and Norvig, Chapter 4, Sections

More information

Searching and NetLogo

Searching and NetLogo Searching and NetLogo Artificial Intelligence @ Allegheny College Janyl Jumadinova September 6, 2018 Janyl Jumadinova Searching and NetLogo September 6, 2018 1 / 21 NetLogo NetLogo the Agent Based Modeling

More information

Planning, Execution & Learning 1. Heuristic Search Planning

Planning, Execution & Learning 1. Heuristic Search Planning Planning, Execution & Learning 1. Heuristic Search Planning Reid Simmons Planning, Execution & Learning: Heuristic 1 Simmons, Veloso : Fall 2001 Basic Idea Heuristic Search Planning Automatically Analyze

More information

Artificial Intelligence, CS, Nanjing University Spring, 2018, Yang Yu. Lecture 2: Search 1.

Artificial Intelligence, CS, Nanjing University Spring, 2018, Yang Yu. Lecture 2: Search 1. rtificial Intelligence, S, Nanjing University Spring, 2018, Yang Yu Lecture 2: Search 1 http://lamda.nju.edu.cn/yuy/course_ai18.ashx Problem in the lecture 7 2 4 51 2 3 5 6 4 5 6 8 3 1 7 8 Start State

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Search Marc Toussaint University of Stuttgart Winter 2015/16 (slides based on Stuart Russell s AI course) Outline Problem formulation & examples Basic search algorithms 2/100 Example:

More information

Artificial Intelligence CS 6364

Artificial Intelligence CS 6364 Artificial Intelligence CS 6364 Professor Dan Moldovan Section 4 Informed Search and Adversarial Search Outline Best-first search Greedy best-first search A* search Heuristics revisited Minimax search

More information

Clustering (Un-supervised Learning)

Clustering (Un-supervised Learning) Clustering (Un-supervised Learning) Partition-based clustering: k-mean Goal: minimize sum of square of distance o Between each point and centers of the cluster. o Between each pair of points in the cluster

More information

Solving Problems by Searching. Artificial Intelligence Santa Clara University 2016

Solving Problems by Searching. Artificial Intelligence Santa Clara University 2016 Solving Problems by Searching Artificial Intelligence Santa Clara University 2016 Problem Solving Agents Problem Solving Agents Use atomic representation of states Planning agents Use factored or structured

More information

Outline. Solving Problems by Searching. Introduction. Problem-solving agents

Outline. Solving Problems by Searching. Introduction. Problem-solving agents Outline Solving Problems by Searching S/ University of Waterloo Sept 7, 009 Problem solving agents and search Examples Properties of search algorithms Uninformed search readth first Depth first Iterative

More information

Solving Problems using Search

Solving Problems using Search Solving Problems using Search Artificial Intelligence @ Allegheny College Janyl Jumadinova September 11, 2018 Janyl Jumadinova Solving Problems using Search September 11, 2018 1 / 35 Example: Romania On

More information

PROBLEM SOLVING AND SEARCH IN ARTIFICIAL INTELLIGENCE

PROBLEM SOLVING AND SEARCH IN ARTIFICIAL INTELLIGENCE Artificial Intelligence, Computational Logic PROBLEM SOLVING AND SEARCH IN ARTIFICIAL INTELLIGENCE Lecture 3 Informed Search Sarah Gaggl Dresden, 22th April 2014 Agenda 1 Introduction 2 Uninformed Search

More information

Informed Search. Dr. Richard J. Povinelli. Copyright Richard J. Povinelli Page 1

Informed Search. Dr. Richard J. Povinelli. Copyright Richard J. Povinelli Page 1 Informed Search Dr. Richard J. Povinelli Copyright Richard J. Povinelli Page 1 rev 1.1, 9/25/2001 Objectives You should be able to explain and contrast uniformed and informed searches. be able to compare,

More information

Planning and search. Lecture 1: Introduction and Revision of Search. Lecture 1: Introduction and Revision of Search 1

Planning and search. Lecture 1: Introduction and Revision of Search. Lecture 1: Introduction and Revision of Search 1 Planning and search Lecture 1: Introduction and Revision of Search Lecture 1: Introduction and Revision of Search 1 Lecturer: Natasha lechina email: nza@cs.nott.ac.uk ontact and web page web page: http://www.cs.nott.ac.uk/

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence CSC348 Unit 3: Problem Solving and Search Syedur Rahman Lecturer, CSE Department North South University syedur.rahman@wolfson.oxon.org Artificial Intelligence: Lecture Notes The

More information

Solving Problems by Searching

Solving Problems by Searching Solving Problems by Searching Berlin Chen 2004 Reference: 1. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Chapter 3 1 Introduction Problem-Solving Agents vs. Reflex Agents Problem-solving

More information

Artificial Intelligence Uninformed search

Artificial Intelligence Uninformed search Artificial Intelligence Uninformed search A.I. Uninformed search 1 The symbols&search hypothesis for AI Problem-solving agents A kind of goal-based agent Problem types Single state (fully observable) Search

More information

Foundations of Artificial Intelligence

Foundations of Artificial Intelligence Foundations of Artificial Intelligence 4. Informed Search Methods Heuristics, Local Search Methods, Genetic Algorithms Joschka Boedecker and Wolfram Burgard and Frank Hutter and Bernhard Nebel Albert-Ludwigs-Universität

More information

Artificial Intelligence: Search Part 1: Uninformed graph search

Artificial Intelligence: Search Part 1: Uninformed graph search rtificial Intelligence: Search Part 1: Uninformed graph search Thomas Trappenberg January 8, 2009 ased on the slides provided by Russell and Norvig, hapter 3 Search outline Part 1: Uninformed search (tree

More information

Outline. Informed search algorithms. Best-first search. Review: Tree search. A search Heuristics. Chapter 4, Sections 1 2 4

Outline. Informed search algorithms. Best-first search. Review: Tree search. A search Heuristics. Chapter 4, Sections 1 2 4 Outline Best-first search Informed search algorithms A search Heuristics Chapter 4, Sections 1 2 Chapter 4, Sections 1 2 1 Chapter 4, Sections 1 2 2 Review: Tree search function Tree-Search( problem, fringe)

More information

Introduction to Artificial Intelligence (G51IAI)

Introduction to Artificial Intelligence (G51IAI) Introduction to Artificial Intelligence (G51IAI) Dr Rong Qu Heuristic Searches Blind Search vs. Heuristic Searches Blind search Randomly choose where to search in the search tree When problems get large,

More information

Introduction to Artificial Intelligence (G51IAI) Dr Rong Qu. Blind Searches

Introduction to Artificial Intelligence (G51IAI) Dr Rong Qu. Blind Searches Introduction to Artificial Intelligence (G51IAI) Dr Rong Qu Blind Searches Blind Searches Function GENERAL-SEARCH (problem, QUEUING-FN) returns a solution or failure nodes = MAKE-QUEUE(MAKE-NODE(INITIAL-STATE[problem]))

More information

Solving problems by searching. Chapter 3

Solving problems by searching. Chapter 3 Solving problems by searching Chapter 3 Outline Problem-solving agents Problem types Problem formulation Example problems Basic search algorithms 2 Example: Romania On holiday in Romania; currently in

More information

Optimal Control and Dynamic Programming

Optimal Control and Dynamic Programming Optimal Control and Dynamic Programming SC Q 7- Duarte Antunes Outline Shortest paths in graphs Dynamic programming Dijkstra s and A* algorithms Certainty equivalent control Graph Weighted Graph Nodes

More information

Today s s lecture. Lecture 2: Search - 1. Examples of toy problems. Searching for Solutions. Victor R. Lesser

Today s s lecture. Lecture 2: Search - 1. Examples of toy problems. Searching for Solutions. Victor R. Lesser Today s s lecture Lecture 2: Search - 1 Victor R. Lesser CMPSCI 683 Fall 2004 Why is search the key problem-solving technique in AI? Formulating and solving search problems. Understanding and comparing

More information

Solving problems by searching

Solving problems by searching Solving problems by searching Chapter 3 CS 2710 1 Outline Problem-solving agents Problem formulation Example problems Basic search algorithms CS 2710 - Blind Search 2 1 Goal-based Agents Agents that take

More information

AI: problem solving and search

AI: problem solving and search : problem solving and search Stefano De Luca Slides mainly by Tom Lenaerts Outline Problem-solving agents A kind of goal-based agent Problem types Single state (fully observable) Search with partial information

More information

CS414-Artificial Intelligence

CS414-Artificial Intelligence CS414-Artificial Intelligence Lecture 6: Informed Search Algorithms Waheed Noor Computer Science and Information Technology, University of Balochistan, Quetta, Pakistan Waheed Noor (CS&IT, UoB, Quetta)

More information

Algorithm. December 7, Shortest path using A Algorithm. Phaneendhar Reddy Vanam. Introduction. History. Components of A.

Algorithm. December 7, Shortest path using A Algorithm. Phaneendhar Reddy Vanam. Introduction. History. Components of A. December 7, 2011 1 2 3 4 5 6 7 The A is a best-first search algorithm that finds the least cost path from an initial configuration to a final configuration. The most essential part of the A is a good heuristic

More information

ARTIFICIAL INTELLIGENCE SOLVING PROBLEMS BY SEARCHING. Chapter 3

ARTIFICIAL INTELLIGENCE SOLVING PROBLEMS BY SEARCHING. Chapter 3 ARTIFICIAL INTELLIGENCE SOLVING PROBLEMS BY SEARCHING Chapter 3 1 PROBLEM SOLVING We want: To automatically solve a problem We need: A representation of the problem Algorithms that use some strategy to

More information

COMP219: Artificial Intelligence. Lecture 7: Search Strategies

COMP219: Artificial Intelligence. Lecture 7: Search Strategies COMP219: Artificial Intelligence Lecture 7: Search Strategies 1 Overview Last time basic ideas about problem solving; state space; solutions as paths; the notion of solution cost; the importance of using

More information

Problem Solving and Search. Chapter 3

Problem Solving and Search. Chapter 3 Problem olving and earch hapter 3 Outline Problem-solving agents Problem formulation Example problems asic search algorithms In the simplest case, an agent will: formulate a goal and a problem; Problem-olving

More information

Informed Search and Exploration

Informed Search and Exploration Informed Search and Exploration Chapter 4 (4.1-4.3) CS 2710 1 Introduction Ch.3 searches good building blocks for learning about search But vastly inefficient eg: Can we do better? Breadth Depth Uniform

More information

CAP 4630 Artificial Intelligence

CAP 4630 Artificial Intelligence CAP 4630 Artificial Intelligence Instructor: Sam Ganzfried sganzfri@cis.fiu.edu 1 http://www.ultimateaiclass.com/ https://moodle.cis.fiu.edu/ 2 Solving problems by search 3 8-puzzle 4 8-queens 5 Search

More information

Outline for today s lecture. Informed Search I. One issue: How to search backwards? Very briefly: Bidirectional search. Outline for today s lecture

Outline for today s lecture. Informed Search I. One issue: How to search backwards? Very briefly: Bidirectional search. Outline for today s lecture Outline for today s lecture Informed Search I Uninformed Search Briefly: Bidirectional Search (AIMA 3.4.6) Uniform Cost Search (UCS) Informed Search Introduction to Informed search Heuristics 1 st attempt:

More information

Warm- up. IteraAve version, not recursive. class TreeNode TreeNode[] children() boolean isgoal() DFS(TreeNode start)

Warm- up. IteraAve version, not recursive. class TreeNode TreeNode[] children() boolean isgoal() DFS(TreeNode start) Warm- up We ll o-en have a warm- up exercise for the 10 minutes before class starts. Here s the first one Write the pseudo code for breadth first search and depth first search IteraAve version, not recursive

More information

Problem solving and search

Problem solving and search Problem solving and search Chapter 3 Chapter 3 1 Problem formulation & examples Basic search algorithms Outline Chapter 3 2 On holiday in Romania; currently in Arad. Flight leaves tomorrow from Bucharest

More information

CS:4420 Artificial Intelligence

CS:4420 Artificial Intelligence S:4420 rtificial Intelligence Spring 2018 Uninformed Search esare Tinelli The University of Iowa opyright 2004 18, esare Tinelli and Stuart Russell a a These notes were originally developed by Stuart Russell

More information

521495A: Artificial Intelligence

521495A: Artificial Intelligence 521495A: Artificial Intelligence Search Lectured by Abdenour Hadid Associate Professor, CMVS, University of Oulu Slides adopted from http://ai.berkeley.edu Agent An agent is an entity that perceives the

More information

Basic Search. Fall Xin Yao. Artificial Intelligence: Basic Search

Basic Search. Fall Xin Yao. Artificial Intelligence: Basic Search Basic Search Xin Yao Fall 2017 Fall 2017 Artificial Intelligence: Basic Search Xin Yao Outline Motivating Examples Problem Formulation From Searching to Search Tree Uninformed Search Methods Breadth-first

More information

Problem solving and search: Chapter 3, Sections 1 5

Problem solving and search: Chapter 3, Sections 1 5 Problem solving and search: hapter 3, Sections 1 5 1 Outline Problem-solving agents Problem types Problem formulation Example problems asic search algorithms 2 Problem-solving agents estricted form of

More information

Problem solving and search: Chapter 3, Sections 1 5

Problem solving and search: Chapter 3, Sections 1 5 Problem solving and search: hapter 3, Sections 1 5 S 480 2 Outline Problem-solving agents Problem types Problem formulation Example problems asic search algorithms Problem-solving agents estricted form

More information

Uninformed Search. Problem-solving agents. Tree search algorithms. Single-State Problems

Uninformed Search. Problem-solving agents. Tree search algorithms. Single-State Problems Uninformed Search Problem-solving agents Tree search algorithms Single-State Problems Breadth-First Search Depth-First Search Limited-Depth Search Iterative Deepening Extensions Graph search algorithms

More information

Informed Search (Ch )

Informed Search (Ch ) Informed Search (Ch. 3.5-3.6) Informed search In uninformed search, we only had the node information (parent, children, cost of actions) Now we will assume there is some additional information, we will

More information

Chapter 3 Solving problems by searching

Chapter 3 Solving problems by searching 1 Chapter 3 Solving problems by searching CS 461 Artificial Intelligence Pinar Duygulu Bilkent University, Slides are mostly adapted from AIMA and MIT Open Courseware 2 Introduction Simple-reflex agents

More information

Why Search. Things to consider. Example, a holiday in Jamaica. CSE 3401: Intro to Artificial Intelligence Uninformed Search

Why Search. Things to consider. Example, a holiday in Jamaica. CSE 3401: Intro to Artificial Intelligence Uninformed Search CSE 3401: Intro to Artificial Intelligence Uninformed Search Why Search Required Readings: R & N Chapter 3, Sec. 1-4. Lecture slides adapted from those of Fahiem Bacchus. Successful Success in game playing

More information

CSE 40171: Artificial Intelligence. Uninformed Search: Search Spaces

CSE 40171: Artificial Intelligence. Uninformed Search: Search Spaces CSE 40171: Artificial Intelligence Uninformed Search: Search Spaces 1 Homework #1 has been released. It is due at 11:59PM on 9/10. 2 Course Roadmap Introduction Problem Solving Machine Learning (week 1)

More information

Foundations of Artificial Intelligence

Foundations of Artificial Intelligence Foundations of Artificial Intelligence 16. State-Space Search: Greedy BFS, A, Weighted A Malte Helmert University of Basel March 28, 2018 State-Space Search: Overview Chapter overview: state-space search

More information

Chapter 3: Solving Problems by Searching

Chapter 3: Solving Problems by Searching Chapter 3: Solving Problems by Searching Prepared by: Dr. Ziad Kobti 1 Problem-Solving Agent Reflex agent -> base its actions on a direct mapping from states to actions. Cannot operate well in large environments

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on Artificial Intelligence,

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany, Course on Artificial Intelligence, Course on Artificial Intelligence, winter term 2012/2013 0/25 Artificial Intelligence Artificial Intelligence 2. Informed Search Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL)

More information

Solving Problems by Searching (Blindly)

Solving Problems by Searching (Blindly) Solving Problems by Searching (Blindly) R&N: Chap. 3 (many of these slides borrowed from Stanford s AI Class) Problem Solving Agents Decide what to do by finding a sequence of actions that lead to desirable

More information

ARTIFICIAL INTELLIGENCE (CSC9YE ) LECTURES 2 AND 3: PROBLEM SOLVING

ARTIFICIAL INTELLIGENCE (CSC9YE ) LECTURES 2 AND 3: PROBLEM SOLVING ARTIFICIAL INTELLIGENCE (CSC9YE ) LECTURES 2 AND 3: PROBLEM SOLVING BY SEARCH Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Problem solving by searching Problem formulation Example problems Search

More information

Automated Planning & Artificial Intelligence

Automated Planning & Artificial Intelligence Automated Planning & Artificial Intelligence Uninformed and Informed search in state space Humbert Fiorino Humbert.Fiorino@imag.fr http://membres-lig.imag.fr/fiorino Laboratory of Informatics of Grenoble

More information

Informatics 2D: Tutorial 2 (Solutions)

Informatics 2D: Tutorial 2 (Solutions) Informatics 2D: Tutorial 2 (Solutions) Adversarial Search and Informed Search Week 3 1 Adversarial Search This exercise was taken from R&N Chapter 5. Consider the two-player game shown in Figure 1. Figure

More information

16.1 Introduction. Foundations of Artificial Intelligence Introduction Greedy Best-first Search 16.3 A Weighted A. 16.

16.1 Introduction. Foundations of Artificial Intelligence Introduction Greedy Best-first Search 16.3 A Weighted A. 16. Foundations of Artificial Intelligence March 28, 2018 16. State-Space Search: Greedy BFS, A, Weighted A Foundations of Artificial Intelligence 16. State-Space Search: Greedy BFS, A, Weighted A Malte Helmert

More information

Lecture Plan. Best-first search Greedy search A* search Designing heuristics. Hill-climbing. 1 Informed search strategies. Informed strategies

Lecture Plan. Best-first search Greedy search A* search Designing heuristics. Hill-climbing. 1 Informed search strategies. Informed strategies Lecture Plan 1 Informed search strategies (KA AGH) 1 czerwca 2010 1 / 28 Blind vs. informed search strategies Blind search methods You already know them: BFS, DFS, UCS et al. They don t analyse the nodes

More information

Problem solving and search

Problem solving and search Problem solving and search hapter 3 hapter 3 1 Outline Problem-solving agents Problem types Problem formulation Example problems asic search algorithms hapter 3 3 Example: omania On holiday in omania;

More information

Pathfinding Decision Making

Pathfinding Decision Making DM810 Computer Game Programming II: AI Lecture 7 Pathfinding Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Resume Best first search Dijkstra Greedy search

More information

Informed (heuristic) search (cont). Constraint-satisfaction search.

Informed (heuristic) search (cont). Constraint-satisfaction search. CS 1571 Introduction to AI Lecture 5 Informed (heuristic) search (cont). Constraint-satisfaction search. Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Administration PS 1 due today Report before

More information

Outline. Best-first search

Outline. Best-first search Outline Best-first search Greedy best-first search A* search Heuristics Admissible Heuristics Graph Search Consistent Heuristics Local search algorithms Hill-climbing search Beam search Simulated annealing

More information

Set 2: State-spaces and Uninformed Search. ICS 271 Fall 2015 Kalev Kask

Set 2: State-spaces and Uninformed Search. ICS 271 Fall 2015 Kalev Kask Set 2: State-spaces and Uninformed Search ICS 271 Fall 2015 Kalev Kask You need to know State-space based problem formulation State space (graph) Search space Nodes vs. states Tree search vs graph search

More information

Seminar: Search and Optimization

Seminar: Search and Optimization Seminar: Search and Optimization 4. asic Search lgorithms Martin Wehrle Universität asel October 4, 2012 asics lind Search lgorithms est-first Search Summary asics asics lind Search lgorithms est-first

More information