Performance of Clause Selection Heuristics for Saturation-Based Theorem Proving. Stephan Schulz, Martin Möhrmann

Agenda
- Introduction
- Heuristics for saturating theorem proving
  - Saturation with the given-clause algorithm
  - Clause selection heuristics
- Experimental setup
- Results and analysis
  - Comparison of heuristics
  - Potential for improvement: how good are we?
- Conclusion

Introduction
- Heuristics are crucial for first-order theorem provers
  - Practical experience is clear
  - Proof search happens in an infinite search space
  - Proofs are rare
- A lot of collected developer experience (folklore)...
  - ...but no (published) systematic evaluation
  - ...and no (published) recent evaluation at all

Saturating Theorem Proving
- Search state is a set of first-order clauses
- Inferences add new clauses
  - Existing clauses are premises
  - An inference generates a new clause
  - If the clause set is unsatisfiable, the empty clause can eventually be derived
- Redundancy elimination (rewriting, subsumption, ...) simplifies the search state
- Inference rules try to minimize necessary consequences
  - Restricted by term orderings
  - Restricted by literal orderings
- Question: in which order do we compute potential consequences?
  - Given-clause algorithm
  - Controlled by the clause selection heuristic

The Given-Clause Algorithm
[Diagram: the given clause g is picked from U, simplified against P, then used for generating inferences with clauses from P; newly generated clauses are cheaply simplified and added to U]
- Aim: move everything from U (unprocessed clauses) to P (processed clauses)
- Invariant: all generating inferences with premises from P have been performed
- Invariant: P is interreduced
- Clauses added to U are simplified with respect to P
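To make the control flow concrete, here is a minimal, runnable Python sketch of a given-clause loop. It is not E's implementation: clauses are frozensets of signed integers (propositional literals), generation is binary resolution, redundancy elimination is plain subsumption, and all names are my own; only the structure of the loop mirrors the diagram above.

# Minimal runnable sketch of the given-clause loop on propositional clauses.
# Clauses are frozensets of non-zero integers (negative = negated atom).
# Generation is binary resolution, redundancy elimination is subsumption;
# the clause selection heuristic is passed in as a function over U.

def resolvents(c1, c2):
    """All binary resolvents of two clauses."""
    return [frozenset((c1 - {lit}) | (c2 - {-lit})) for lit in c1 if -lit in c2]

def subsumed(clause, clause_set):
    """True if some clause in clause_set is a subset of `clause`."""
    return any(c <= clause for c in clause_set)

def given_clause_loop(axioms, select):
    processed = []                                  # P: all P x P inferences done
    unprocessed = [frozenset(c) for c in axioms]    # U
    while unprocessed:
        g = select(unprocessed)                     # <-- the clause selection choice point
        unprocessed.remove(g)
        if subsumed(g, processed):                  # redundancy elimination
            continue
        if not g:                                   # empty clause derived: unsatisfiable
            return True
        for c in processed:                         # generating inferences between g and P
            for r in resolvents(g, c):
                if not subsumed(r, processed):
                    unprocessed.append(r)
        processed.append(g)
    return False                                    # saturated without a proof

# Example: {p}, {not-p or q}, {not-q} is unsatisfiable; select the smallest clause first.
if __name__ == "__main__":
    clauses = [{1}, {-1, 2}, {-2}]
    print(given_clause_loop(clauses, lambda U: min(U, key=len)))   # prints True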

Choice Point: Clause Selection
[Diagram: the same given-clause loop, with the selection of the given clause g from U marked as the choice point]
- Aim: move everything from U to P
- Without generation: the only choice point!
- With generation: still the major dynamic choice point!
- With simplification: still the major dynamic choice point!

The Size of the Problem
[Diagram: the given-clause loop, with the unprocessed set U highlighted]
- |U| ∼ |P|²; |U| ≈ 3·10⁷ after 300 s
- How do we make the best choice among millions?

Basic Clause Selection Heuristics
- Basic idea: clauses are ordered by a heuristic evaluation
  - The heuristic assigns a numerical value to each clause
  - Clauses with smaller (better) evaluations are processed first
- Example: evaluation by symbol counting
  - {f(X) ≃ a, P(a) ≃ $true, g(Y) ≃ f(a)} evaluates to 10
  - Motivation: small clauses are general; the empty clause has 0 symbols; best-first search
- Example: FIFO evaluation
  - Clause evaluation based on generation time (always prefer older clauses)
  - Motivation: simulate breadth-first search, find shortest proofs
- Combine best-first/breadth-first search (see the sketch below)
  - E.g. pick 4 out of every 5 clauses according to size, the last according to age
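As an illustration of these two base evaluations and their interleaving, here is a small Python sketch. The clause representation (nested tuples for terms, pairs of terms for equational literals, a dict with "literals" and "age" fields for clause records) is my own convention, not E's; it reproduces the weight of 10 for the example clause above.

# Toy clause representation (not E's): a term is a string or a tuple
# (function_symbol, arg1, ...); a literal is a pair (lhs, rhs) of terms.

def term_weight(term):
    """Symbol counting: every function/constant/variable occurrence counts as 1."""
    if isinstance(term, tuple):
        return 1 + sum(term_weight(arg) for arg in term[1:])
    return 1

def clause_weight(literals):
    """Symbol-count evaluation of a clause: smaller is better (more general)."""
    return sum(term_weight(lhs) + term_weight(rhs) for lhs, rhs in literals)

def select_next(unprocessed, step, pick_given_ratio=5):
    """Interleave best-first (by weight) and breadth-first (by age) selection:
    out of every `pick_given_ratio` picks, one is made by age, the rest by weight."""
    if step % pick_given_ratio == pick_given_ratio - 1:
        return min(unprocessed, key=lambda c: c["age"])                  # FIFO pick
    return min(unprocessed, key=lambda c: clause_weight(c["literals"]))  # size pick

# The example clause from the slide: f(X) ~ a, P(a) ~ $true, g(Y) ~ f(a)
example = [(("f", "X"), "a"), (("P", "a"), "$true"), (("g", "Y"), ("f", "a"))]
print(clause_weight(example))   # 10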

Clause Selection Heuristics in E
- Many symbol-counting variants
  - E.g. assign different weights to symbol classes (predicates, functions, variables)
  - E.g. goal-directed: lower weight for symbols occurring in the original conjecture
  - E.g. ordering-aware/calculus-aware: higher weight for symbols in inference terms
- Arbitrary combinations of base evaluation functions (see the sketch below)
  - E.g. 5 priority queues ordered by different evaluation functions, weighted round-robin selection
- E can simulate nearly all other approaches to clause selection!
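A sketch of the combination mechanism described above, assuming nothing about E's actual option syntax or data structures: several priority queues, each ordered by its own evaluation function, consumed in a weighted round-robin. The class and function names are mine; a real prover would additionally mark selected clauses so that stale copies in the other queues are skipped, which is omitted here for brevity.

import heapq
from itertools import count

class WeightedRoundRobin:
    """Clause selection from several priority queues in a weighted rotation,
    e.g. 4 picks by symbol count for every 1 pick by age (pick-given ratio 4:1)."""

    def __init__(self, evaluations):
        # evaluations: list of (picks_per_round, evaluation_function) pairs
        self.queues = [[] for _ in evaluations]
        self.evals = [f for _, f in evaluations]
        self.schedule = [i for i, (picks, _) in enumerate(evaluations)
                         for _ in range(picks)]    # e.g. [0, 0, 0, 0, 1]
        self.step = 0
        self.tiebreak = count()                    # FIFO tie-breaking within each queue

    def insert(self, clause):
        # every clause is entered into every queue under that queue's evaluation
        for q, evaluate in zip(self.queues, self.evals):
            heapq.heappush(q, (evaluate(clause), next(self.tiebreak), clause))

    def select(self):
        # pop from the next non-empty queue in the weighted rotation;
        # stale entries (clauses already selected from another queue) are not handled here
        for _ in range(len(self.schedule)):
            q = self.queues[self.schedule[self.step % len(self.schedule)]]
            self.step += 1
            if q:
                return heapq.heappop(q)[2]
        return None

# Usage: interleave a size-based and an age-based evaluation with ratio 4:1.
selector = WeightedRoundRobin([(4, len), (1, lambda clause: 0)])   # toy evaluations
for c in ("p(a)", "q(f(X), b)", "r"):
    selector.insert(c)
print(selector.select())   # "r" (smallest under the first evaluation)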

Folklore on Clause Selection/Evaluation
- "FIFO is obviously fair, but awful" (everybody)
- "Preferring small clauses is good" (everybody)
- "Interleaving best-first (small) and breadth-first (FIFO) is better; the optimal pick-given ratio is 5" (Otter)
- "Processing all initial clauses early is good" (Waldmeister)
- "Preferring clauses with orientable equations is good" (DISCOUNT)
- "Goal-direction is good" (E)
- Can we confirm or refute these claims?

Experimental Setup
- Prover: E 1.9.1-pre
- 14 different heuristics
  - 13 selected to test folklore claims (interleaving 1 or 2 evaluations)
  - plus the modern evolved heuristic (interleaving 5 evaluations)
- TPTP release 6.3.0
  - Only (assumed) provable first-order problems
  - 13774 problems: 7082 FOF and 6692 CNF
- Compute environment
  - StarExec cluster: single-threaded runs on Xeon E5-2609 (2.4 GHz)
  - 300 second time limit, no memory limit (about 64 GB of physical memory per core)

Meet the Heuristics

Heuristic        Rank  Successes (total)  Unique  Within 1 s  Within 1 s (% of total)
FIFO             14    4930 (35.8%)       17      3941        79.9%
SC12             13    4972 (36.1%)       5       4155        83.6%
SC11             9     5340 (38.8%)       0       4285        80.2%
SC21             10    5326 (38.7%)       17      4194        78.7%
RW212            11    5254 (38.1%)       13      5764        79.8%
2SC11/FIFO       7     7220 (52.4%)       24      5846        79.7%
5SC11/FIFO       5     7331 (53.2%)       3       5781        78.3%
10SC11/FIFO      3     7385 (53.6%)       1       5656        77.6%
15SC11/FIFO      6     7287 (52.9%)       6       5006        82.5%
GD               12    4998 (36.3%)       12      5856        78.4%
5GD/FIFO         4     7379 (53.6%)       62      4213        80.2%
SC11-PI          8     6071 (44.1%)       13      4313        86.3%
10SC11/FIFO-PI   2     7467 (54.2%)       31      5934        80.4%
Evolved          1     8423 (61.2%)       593     6406        76.1%

Successes Over Time
[Plot: cumulative successes (4000 to 9000) over run time (0 to 250 s) for Evolved, 10SC11/FIFO-PI, 10SC11/FIFO, 15SC11/FIFO, 5SC11/FIFO, 2SC11/FIFO, SC11-PI, SC11, SC21, SC12, and FIFO]

Folklore Put to the Test
- "FIFO is awful, preferring small clauses is good": mostly confirmed
  - In general, only a modest advantage for symbol counting (36% for FIFO vs. 39% for the best SC)
  - Exception: UEQ (32% vs. 63%)
- "Interleaving best-first/breadth-first is better": confirmed
  - 54% for interleaving vs. 39% for the best SC
  - The influence of different pick-given ratios is surprisingly small
  - UEQ is again an outlier (60% for 2:1 vs. 70% for 15:1)
  - The optimal pick-given ratio is 10 (for E)
- "Processing all initial clauses early is good": confirmed
  - The effect is less pronounced for interleaved heuristics
- "Preferring clauses with orientable equations is good": not confirmed
  - There is no evidence in our data, not even for UEQ
- "Goal-direction is good": partially confirmed
  - GD on its own performs similarly to SC
  - GD shines in combination with FIFO

Selected Results
- Good heuristics do make a difference
  - 71% more solutions with Evolved vs. FIFO
  - 58% more solutions with Evolved vs. the best SC
- Success comes early
  - 80% of all proofs are found in less than 1 s
  - ...with little variation between strategies (spread: 76% to 84%)
- Cooperation beats portfolio/strategy scheduling
  - SC11 solves 5340 problems, FIFO solves 4930 problems
  - The union of the two covers 6329 problems
  - ...but 10SC11/FIFO alone solves 7385
- Evolving Evolved paid off
  - Significantly better than the best naive heuristic
  - Nearly 10 times as many unique solutions as the second-best heuristic

Measuring Absolute Performance
- Definition: given-clause utilization
  - A useful given clause appears in the proof object
  - A useless given clause does not contribute to the proof
  - The given-clause utilization ratio (GCUR) of a proof search is the ratio of useful given clauses to all given clauses
- The given-clause utilization ratio measures heuristic quality (see the sketch below)
  - A perfect heuristic has a GCUR of 1
  - A failed heuristic has a GCUR of 0
  - A better heuristic picks fewer useless clauses
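A minimal sketch of how this ratio could be computed after a run, assuming one has the list of clauses selected as given clauses and the set of clause identifiers occurring in the proof object; the identifiers below are made up for illustration.

def given_clause_utilization(given_clauses, proof_clauses):
    """GCUR: fraction of the selected given clauses that appear in the proof object."""
    if not given_clauses:
        return 0.0
    useful = sum(1 for c in given_clauses if c in proof_clauses)
    return useful / len(given_clauses)

# Made-up clause identifiers: 3 of the 8 selected given clauses were useful.
print(given_clause_utilization(
    given_clauses=["c3", "c7", "c9", "c12", "c15", "c21", "c22", "c30"],
    proof_clauses={"c1", "c3", "c9", "c22"},
))   # 0.375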

Given-Clause Utilization Ratio
[Plot: GCUR (0.0 to 1.0) per problem for Evolved, 10SC11/FIFO, SC11, and FIFO, with problems on the x-axis sorted by ratio]
- 2000 non-trivial problems solved by all 4 heuristics
- GCUR rank corresponds to global performance
- GCUR rates are low even for these easy problems
- Significant potential for improvement!

Future Work
- Evolve/develop better individual heuristics and collections of heuristics
  - (Even) more complex heuristics?
  - Evolve for diversity/swarm success
- Learn better heuristics from proofs/proof searches
  - Feature-based learning?
  - Pattern-based learning?
  - Deep learning?
- Evaluate how the results transfer to other situations
  - Otter loop? AVATAR?

Conclusion
- First-order proof search is a hard problem that critically depends on complex heuristics
- Developer folklore is useful...
  - ...but needs to be documented
  - ...but needs to be re-verified
- Given-clause utilization is useful to gauge the quality of heuristics
- Given-clause utilization is low for current heuristics
  - ...especially for harder problems
- Huge potential for further improvements

Thank you!