Fuzz Testing Projects in Massive Courses

Learning@Scale 2016 Fuzz Testing Projects in Massive Courses Sumukh Sridhara, Brian Hou, Jeffrey Lu and John DeNero UC Berkeley April 26th 2016 Edinburgh, UK www.bestppt.com

Programming Projects in MOOCs IMPLEMENT Programming Projects 1 Primarily Instructional FEEDBACK 2 Instructor Solution Exists REVISE 3 Automated Feedback 2

Feedback Goals 1 Help students arrive at a correct answer 2 Students can help themselves 3 Every missed bug is a missed learning opportunity 3

Targeted Comprehensive 1 Isolates One Issue 100% Tests every case Guide Student Attention Hard To Engineer Many Targeted Tests Hard To Compute www.bestppt.com

Fuzz Testing

Fuzz Testing Testing the behavior of the program on many random inputs Complementary to Manual Testing Historically used for security

Fuzz Testing (Security) Generate Random Inputs User Program Verify Security Constraints 7

Fuzz Testing (Programming Projects) Generate Random Inputs Student Program Verify Correctness 8

Fuzz Testing (Programming Projects) Generate Random Inputs Student Program Instructor Program Student Output Expected Output 9

How to compare output? How many inputs are required? How to improve upon Fuzz Tests?

Creating & Distributing Tests Generate Inputs Input domain is known Generating Random Inputs is easy Student Program Instructor Program Verify Output 11

Implementation Generate Inputs q5(93, q5(93, 97, 97, 3) q5(23, 17, 3) Test #1007) Random input generation All students were provided with an identical set of tests Large number of inputs needed 12

Creating & Distributing Tests Generate Inputs Input domain is known Generating Random Inputs is easy Student Program Instructor Program Verify Output 13

Creating & Distributing Tests Generate Inputs Raw Output Check Hashing Student Program Instructor Program Program Tracing Verify Output 14

RQ1: How to Compare Outputs? Verify Output Raw comparison of output Compare against precomputed result q5(93, q5(93, 97, 97, 3) q5(23, 17, 3) Test #1007) q5(93, q5(93, 97, 97, 3) q5(23, 17, 3) Student output 7) Output Vector 15

RQ1: How to Compare Outputs? Hashing Output Hash combination of outputs Compare against precomputed result q5(93, q5(93, 97, 97, 3) q5(23, 17, 3) Test #1007) q5(93, q5(93, 97, 97, 3) q5(23, 17, 3) Student output 7) Output Vector Combine & Hash 477587826 16

Evaluation CS61A @ UC Berkeley cs61a.org In Person CS1 Course with 1400 Students Enrolled 17

Student Autograder OK Server Analytics Feedback okpy.org 18

Collected Dataset Students Completing Project 1,331 Code Snapshots 486,482 Average Snapshots per Student 349 Incorrect Attempts at Target Question 48,079 19

Output Vectors Generated Inputs Output Vector Correct Student Attempt Student Attempt Student Attempt Student Attempt 20

Output Vectors Output Vector 21

Frequency of Incorrect Outputs 1600 1200 800 400 0 22

RQ2: How many inputs are needed? 23

R2: How many inputs are needed? 1 Input 24

RQ2: How many inputs are needed? 2 Inputs 25

RQ2: How many inputs are needed? 3 Inputs 26

RQ2: How many inputs are needed? 4 Inputs 27

RQ2: How many inputs are needed? 7000 Number of False Positives 6000 5000 4000 3000 2000 1000 0 1 2 5 10 25 50 100 200 500 1000 Number of Inputs (Log Scale) 28

Fuzz Testing Effectiveness 52% 48% 656 Students (48.4%) passed all of the targeted tests but still had an error caught by the Fuzz Tests Targeted + Fuzz Tests Targeted Tests 29

RQ3: Improving Fuzz Tests Cumulative Percent Fixed Targeted Test Fuzz Test Attempts Required To Fix Error 30

How to Improve On Fuzz Tests? As a result of the Fuzz Test: 19% 7% 46% 46% of students reported spending 1 to 4 hours debugging 19% of students reported spending more than 4 hours debugging 28% Obfuscating output made it harder for instructors to help students No Time < 1 Hour >4 Hour >1 Hour 31

Improvements Program Inspection Incorrect result after playing 1 game(s): - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - score0 score1 Turn Summary - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Turn 0: 0 0 Player 0 rolls 0 dice: +1 1 0 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Turn 1: 1 0 Player 1 rolls 7 six- sided dice: +37 3, 4, 6, 3, 3, 4, 6 1 37 Dice sum: 29 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -... - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Incorrect implementation of game at turn 1. Please read over the trace to find your error. (error_id: 1189294328) 32

Thank you sumukh@berkeley.edu @sumukhsridhara okpy.org cs61a.org www.bestppt.com Fuzz Testing Projects in Massive Courses Learning@Scale 2016