Data driven student feedback for Programming Intensive MOOCs

1 Data driven student feedback for Programming Intensive MOOCs
Towards global scale CS education
Jonathan Huang
Stanford University
Andy Nguyen, Chris Piech, Leonidas Guibas

2 Steve Jobs, Stanford, 2005:
"... all of my working-class parents' savings were being spent on my college tuition ... the minute I dropped out [of college] I could stop taking the required classes that didn't interest me, and begin dropping in on the ones that looked interesting."

3 Course selection is better online
MOOC = Massive Open Online Course

4 Untapped potential
[Charts omitted: only axis ticks survive (counts in thousands; years 2011 to 2021; school years 03-04 to 13-14; dollar amounts in thousands).]
CS does not count towards high school math/science requirements in 36 of 50 states
*** source:

5 Stanford ML class: 400 students -> 100,000 students

6 Stanford ML class: 10 TAs -> 2,500 TAs (???)

7 Ease of global scale feedback on a spectrum
Short response to long response: Multiple choice, Programming Assignments (today), Proofs, Essay questions

8 Feedback for Coding Assignments: Easy?
Linear Regression submission (Homework 1) for Coursera's ML class

Test Inputs ->

    function [theta, J_history] = gradientdescent(X, y, theta, alpha, num_iters)
    %GRADIENTDESCENT Performs gradient descent to learn theta
    %   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
    %   taking num_iters gradient steps with learning rate alpha
    m = length(y); % number of training examples
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta-alpha*1/m*(X'*(X*theta-y));
        J_history(iter) = computecost(X, y, theta);
    end

-> Test Outputs: Correct / Incorrect?

9 The "but it works!!" solution

    function [theta, J_history] = gradientdescent(X, y, theta, alpha, num_iters)
    m = length(y);
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        hypo = X*theta;
        newmat = hypo - y;
        trans1 = (X(:,1));
        trans1 = trans1';
        newmat1 = trans1 * newmat;
        temp1 = sum(newmat1);
        temp1 = (temp1 *alpha)/m;
        A = [temp1];
        theta(1) = theta(1) - A;
        trans2 = (X(:,2))';
        newmat2 = trans2*newmat;
        temp2 = sum(newmat2);
        temp2 = (temp2 *alpha)/m;
        B = [temp2];
        theta(2)= theta(2) - B;
        J_history(iter) = computecost(X, y, theta);
    end
    theta(1) = theta(1);
    theta(2)= theta(2);

Better: theta = theta-(alpha/m)*X'*(X*theta-y)
Why??

    Correctness   Good
    Efficiency    Good
    Style         Poor
    Elegance      Poor

10 Let's do this: for a class of 100,000, and in real time.
But we can't require too much instructor effort to make this work for new programming problems, new programming languages, and new courses.

11 We now have massive datasets
[Chart omitted: # students, with ticks 1K and 10K; labels "20M" and "Intro CS".]
Visualization of 40,000 implementations of linear regression submitted to Coursera's ML course [Moocshop, 2013]

12 Codewebs Engine
Efficient index for "code phrases" of a MOOC dataset
Shared structure discovery amongst many student submissions
Applications such as bug finding (w/o execution) and MOOC-scale feedback
Results on real MOOC with > 1 million submissions

13 First, an example application

Correct:

    function [theta, J_history] = gradientdescent(X, y, theta, alpha, num_iters)
    %GRADIENTDESCENT Performs gradient descent to learn theta
    %   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
    %   taking num_iters gradient steps with learning rate alpha
    m = length(y); % number of training examples
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta-alpha*1/m*(X'*(X*theta-y));
        J_history(iter) = computecost(X, y, theta);
    end

Incorrect:

    function [theta, J_history] = gradientdescent(X, y, theta, alpha, num_iters)
    %GRADIENTDESCENT Performs gradient descent to learn theta
    %   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
    %   taking num_iters gradient steps with learning rate alpha
    m = length(y); % number of training examples
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta-alpha*1/m*sum(X'*(X*theta-y));
        J_history(iter) = computecost(X, y, theta);
    end

14 First, an example application

Correct:   theta = theta-alpha*1/m*(X'*(X*theta-y));
Incorrect: theta = theta-alpha*1/m*sum(X'*(X*theta-y));

Syntax based approach: Attach this message to everyone containing that exact expression (covers 99 submissions)

Message: "Dear Lisa Simpson, consider the dimension of the expression: X'*(X*theta-y) and what happens after you call sum on it"

15 The extraneous sum bug takes many forms

    theta = theta-alpha*1/m*sum(X'*(X*theta-y));
    theta = theta-alpha*1/m*sum(((theta'*X')-y')*X);
    theta = theta-alpha*1/m*sum(transpose(X*theta-y)*X);

(Easier) Output based approach: Attach message to everyone who matched the extraneous sum bug in unit test output (covers 1091 submissions)
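To see why the extra sum is a bug, the following is a minimal sketch in plain Python (not the Octave used in the course) that mirrors one gradient step with and without it. The helper names and the tiny dataset are made up for illustration. The gradient X'*(X*theta-y) has one entry per parameter, and summing it collapses those entries into a single scalar that is then subtracted from every parameter.

```python
def matvec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]

def gradient(X, y, theta):
    """Plain-Python analogue of X'*(X*theta - y): one entry per parameter."""
    residual = [p - t for p, t in zip(matvec(X, theta), y)]
    return matvec(transpose(X), residual)

def step_correct(X, y, theta, alpha):
    m = len(y)
    g = gradient(X, y, theta)
    return [t - alpha / m * gi for t, gi in zip(theta, g)]

def step_extraneous_sum(X, y, theta, alpha):
    m = len(y)
    s = sum(gradient(X, y, theta))  # collapses the gradient vector to one scalar
    return [t - alpha / m * s for t in theta]  # identical update for every parameter

# Hypothetical toy data: an intercept column plus one feature.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
theta = [0.0, 0.0]

good = step_correct(X, y, theta, 0.1)
bad = step_extraneous_sum(X, y, theta, 0.1)
print(good)  # the two parameters move by different amounts
print(bad)   # both parameters move by the same amount
```

The buggy step still runs without error, which is one reason the mistake is easy to miss without inspecting dimensions.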

16 Codewebs approach to feedback

    theta = theta-alpha*1/m*sum(X'*(X*theta-y));

Step 1: Find equivalent ways of writing buggy expression using Codewebs engine
Step 2: Write a thoughtful/meaningful hint or explanation
Step 3: Propagate feedback message to any submission containing equivalent expression

# submissions covered by single message:

    Output based   1091
    Codewebs       1208
    Combined       1604

17 Codewebs approach to feedback
~47% improvement over just using an output based feedback system!!
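The ~47% figure can be checked from the three coverage counts. The sketch below assumes that "Combined" means the union of the submissions covered by the output based and Codewebs approaches; the slides do not state this explicitly, and the variable names are illustrative.

```python
output_based = 1091  # submissions matched through unit test output
codewebs = 1208      # submissions matched through equivalent expressions
combined = 1604      # assumed to be the union of the two sets

# Inclusion-exclusion: submissions matched by both approaches.
overlap = output_based + codewebs - combined

# Relative gain of the combined system over output based matching alone.
improvement = (combined - output_based) / output_based

print(overlap)                   # 695
print(round(100 * improvement))  # 47
```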

18 Abstract syntax tree representations

    function A = warmupexercise()
    A = [];
    A = eye(5);
    endfunction

AST for A = eye(5):

    ASSIGN
        IDENT (A)
        INDEX_EXP
            IDENT (eye)
            ARGUMENT_LIST
                CONST (5)

ASTs ignore: Whitespace, Comments
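As a small illustration of ASTs ignoring whitespace and comments, the sketch below uses Python's standard ast module on Python source. This is a stand-in for the idea only; the slides work with Octave ASTs, and eye is just an unresolved name here since the code is parsed but never run.

```python
import ast

# Two spellings of the same statements, differing only in layout and comments.
source_a = "A = []\nA = eye(5)\n"
source_b = "A   =   []   # start empty\n\nA = eye( 5 )  # 5x5 identity\n"

def structure(source):
    """Return a textual dump of the AST, without line or column positions."""
    return ast.dump(ast.parse(source))

same = structure(source_a) == structure(source_b)
different = structure(source_a) != structure("A = []\nA = eye(6)\n")
print(same, different)
```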

19 Indexing documents by phrases

Example documents:
    "blue sky"
    "yellow submarine"
    "The bright and blue butterfly hangs on the breeze"
    "We all something something yellow submarine"

    term/phrase   document list
    best          {1,3}
    blue          {2,4,6}
    bright        {7,8,10,11,12}
    heat          {1,5,13}
    kernel        {2,5,6,9,56}
    sky           {1,2}
    submarine     {2,3,4}
    woes          {10,19,38}
    yellow        {2,4}

20 Indexing documents by phrases
What basic queries should an AST search engine support?
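A minimal sketch of the inverted index idea from these two slides, using the four example documents. The document ids 1 to 4 are assigned here for illustration and do not correspond to the ids in the table above; the query is a simple "contains every word" intersection rather than a true ordered-phrase match.

```python
from collections import defaultdict

# Hypothetical ids for the four example documents.
documents = {
    1: "blue sky",
    2: "yellow submarine",
    3: "The bright and blue butterfly hangs on the breeze",
    4: "We all something something yellow submarine",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def query(index, words):
    """Return ids of documents containing every word (posting list intersection)."""
    postings = [index.get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()

index = build_index(documents)
print(query(index, ["yellow", "submarine"]))
print(query(index, ["blue"]))
```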

21 Code Phrases
Subtrees and subforests of an AST

    BINARY_EXP (*)
        POSTFIX (')
            IDENT (X)
        BINARY_EXP (-)
            BINARY_EXP (*)
                IDENT (X)
                IDENT (theta)
            IDENT (y)

22 Code Phrases
Context within a larger subtree (same AST as on slide 21)

23 Code Phrases
Context within a larger subtree

    BINARY_EXP (*)
        POSTFIX (')
            IDENT (X)
        BINARY_EXP (-)
            [replacement site]
            IDENT (y)

24 The Codewebs index

Example programs:

    10 print hello
    20 goto 10

    10 for i=1:10
    20 x = x+1
    30 end

    Code phrase hash                            AST list
    2ccf02adb1cbabfb347d3b5d0a05b249855a7583    {1,3}
    b3bc37a318c2b895b3e644a12cfc6ebcfa5a06bd    {2,4,6}
    b3353c96e2cee8ee6c3ba260e037a93ca0ba3a5e    {7,8,10,11,12}
    2c bf01338c8cdb15cf9d844d65f04645           {1,5,13}
    313f48d3f5888afc5d5aa28ab1393d94661edd31    {2,5,6,9,56}
    61d4bfccaa97cca a297cc5acd281ea3a9          {1,2}
    467b4d400aab42d3bf96a119c4620e74d6fe57b3    {2,3,4}
    1ae95f6fa24bc25871cdc55cb472abdd68db93de    {10,19,38}

25 The Codewebs index (same table as on slide 24)

    def buildindex():
        for A in ASTs:
            for every code phrase x contained in A:    (subtrees, subforests, contexts)
                Compute hashcode h[x]                  (very expensive, esp. for large ASTs!)
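The buildindex pseudocode can be made concrete under some simplifying assumptions: ASTs are nested tuples of the form (label, child, ...), only subtrees are indexed (no subforests or contexts), and each phrase is hashed from scratch with SHA-1 from Python's hashlib. None of these choices is confirmed by the slides; hashing every phrase independently is exactly the cost the slide warns about.

```python
import hashlib
from collections import defaultdict

def subtrees(node):
    """Yield every subtree of a nested-tuple AST of the form (label, child, ...)."""
    yield node
    for child in node[1:]:
        yield from subtrees(child)

def phrase_hash(node):
    """Hash one code phrase from scratch (illustrative choice: SHA-1 of its repr)."""
    return hashlib.sha1(repr(node).encode("utf-8")).hexdigest()

def build_index(asts):
    """Map each code phrase hash to the set of AST ids containing that phrase."""
    index = defaultdict(set)
    for ast_id, root in asts.items():
        for phrase in subtrees(root):
            index[phrase_hash(phrase)].add(ast_id)
    return index

# Hypothetical mini-corpus of three ASTs.
product = ("BINARY_EXP(*)", ("IDENT(X)",), ("IDENT(theta)",))
residual = ("BINARY_EXP(-)", product, ("IDENT(y)",))
asts = {1: residual, 2: ("POSTFIX(')", residual), 3: product}

index = build_index(asts)
print(sorted(index[phrase_hash(residual)]))  # ASTs containing X*theta-y
print(sorted(index[phrase_hash(product)]))   # ASTs containing X*theta
```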

26 Hashing Code Phrases
Step 1. Create postorder listing of nodes.

    AST:
        BINARY_EXP (-)
            BINARY_EXP (*)
                IDENT (X)
                IDENT (theta)
            IDENT (y)

    Postorder: IDENT (X), IDENT (theta), BINARY_EXP (*), IDENT (y), BINARY_EXP (-)

Step 2. Hash postorder list via: [formula not preserved]

27 Recycling hash computations
Observation: Can hash sublist of postorder to get hash of code phrases!

28 Recycling hash computations
Idea of DP: Store prefix hashes and prime powers for all [expression not preserved]
O(n) in time and space

29 Recycling hash computations
After precomputation, can get any other hash in constant time!
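Because the hash formula itself did not survive in this transcript, the following sketch substitutes a standard polynomial rolling hash to illustrate the prefix-hash idea. The modulus, base, and token encoding are assumptions chosen for the example. After one O(n) pass that stores prefix hashes and powers of the base, the hash of any contiguous sublist of the postorder listing is available in constant time.

```python
MOD = (1 << 61) - 1  # assumed modulus (a Mersenne prime)
BASE = 1_000_003     # assumed prime base

def token_code(token):
    """Deterministic integer code for an AST node label."""
    return int.from_bytes(token.encode("utf-8"), "big") % MOD

def precompute(tokens):
    """O(n) pass: prefix[k] is the hash of tokens[:k], power[k] is BASE**k mod MOD."""
    prefix, power = [0], [1]
    for tok in tokens:
        prefix.append((prefix[-1] * BASE + token_code(tok)) % MOD)
        power.append((power[-1] * BASE) % MOD)
    return prefix, power

def sublist_hash(prefix, power, i, j):
    """Hash of tokens[i:j] in constant time from the precomputed tables."""
    return (prefix[j] - prefix[i] * power[j - i]) % MOD

postorder = ["IDENT(X)", "IDENT(theta)", "BINARY_EXP(*)", "IDENT(y)", "BINARY_EXP(-)"]
prefix, power = precompute(postorder)

# The subtree X*theta occupies positions 0..2 of the postorder listing.
fast = sublist_hash(prefix, power, 0, 3)
slow = precompute(postorder[0:3])[0][-1]  # hashing the sublist from scratch
print(fast == slow)
```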

30 Indexing is fast in practice
[Plot omitted: time for indexing 1000 ASTs; runtime (seconds) against average AST size (# nodes).]

31 Application: Statistical Bug Finding

    function [theta, J_history] = gradientdescent(X, y, theta, alpha, num_iters)
    %GRADIENTDESCENT Performs gradient descent to learn theta
    %   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
    %   taking num_iters gradient steps with learning rate alpha
    m = length(y); % number of training examples
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta-alpha*1/m*sum(X'*(X*theta-y));
        J_history(iter) = computecost(X, y, theta);
    end

83% of ASTs containing this code phrase were buggy!
Solution: X'*(X*theta-y);

32 Is sum(X'*(X*theta-y)) likely to be a bug?
Query -> Index -> unit test results of matching ASTs: Fail, Fail, Fail, Pass, Fail, Fail
83% of ASTs containing this code phrase were buggy!

33 Is sum(X'*(X*theta-y)) likely to be a bug?
83% of ASTs containing this code phrase were buggy!

34 Is sum(X'*(X*theta-y)) likely to be a bug?
83% of ASTs containing this code phrase were buggy!
Compute bug probability for all subforests, return smallest bugs found
Many ways to formulate probabilistic bug localization (we compute probability on local contexts)
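The 83% figure matches the six unit test outcomes shown on the slide (five failures, one pass). A minimal sketch of that calculation, with a hypothetical function name, is below; it ignores the local-context refinement mentioned on the slide.

```python
def bug_probability(results):
    """Fraction of submissions containing the phrase that failed the unit tests."""
    if not results:
        raise ValueError("no submissions contain this phrase")
    failures = sum(1 for r in results if r == "Fail")
    return failures / len(results)

# Outcomes of the ASTs returned by the index query on the slide.
results = ["Fail", "Fail", "Fail", "Pass", "Fail", "Fail"]
p = bug_probability(results)
print(f"{p:.0%}")  # 83%
```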

35 Bug Detection Accuracy
[Scatter plot omitted: bug detection F-score (Baseline) against bug detection F-score (Codewebs), with "better" marking the increasing direction of each axis. Labeled problems: linear regression with gradient descent; neural net training with backpropagation; logistic regression objective.]
Each point represents a single coding problem. Bubble size = Average # nodes per submitted AST

36 More than one way to skin a cat
Canonicalization: apply semantic preserving transformation rules to ASTs to increase matching probability

37 Difficulties with Canonicalization

    X*(Y+Z)
    (Y+Z)*X
    X*(Z+Y)
    X*Z+X*Y
    Z*X+X*Y
    Z*X+Y*X
    X*1*(Y+Z)
    X*[1;1]'*[Y;Z]
    X*transpose([1;1])*[Y;Z]
    X*ones(1,2)*[Y;Z]
    X*ones(1,length([Y;Z]))*[Y;Z]
    X*repmat(1,2,1)'*[Y;Z]
    X*repmat(1,size([Y;Z],1),1)'*[Y;Z]

1. Impossible to predict all ways of writing the same thing
2. Canonicalization rules not typically generalizable across languages

38 Codewebs Approach
Use data to determine canonicalization rules
Customize rules to each assignment
Don't need to be perfect, we're not building a compiler!

39 Here's the idea.

    def residual(X, theta, y):
        hypothesis = X * theta
        solution = hypothesis - y
        return solution

    def residual(X, theta, y):
        hypothesis = (theta' * X')'
        solution = hypothesis - y
        return solution

40 Counter Example

    def foo():
        solution = solveproblem()
        print(solution)
        return solution

    def foo():
        solution = solveproblem()
        print(!solution)
        return solution

Agreement can be context dependent

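41 [code phrase A] ?= [code phrase B]

42 [code phrase A] ?= [code phrase B]
Query -> Index (once per phrase), then join on context

43 Unit test results over the shared contexts:

    Phrase A:   Fail  Fail  Pass  Pass  Fail  Pass
    Phrase B:   Fail  Fail  Pass  Pass  Fail  Pass

100% probability of equivalence!
** fine print: need to account for sample size in general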
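A minimal sketch of the "join on context" test, assuming each phrase's index query returns a mapping from a context identifier to the unit test outcome of the submission in which the phrase appeared in that context. The context ids, function name, and return shape are made up for illustration. The sample size is returned alongside the agreement rate because, as the fine print says, a high rate over very few shared contexts is weak evidence.

```python
def equivalence_probability(outcomes_a, outcomes_b):
    """Agreement rate of unit test outcomes over contexts shared by both phrases.

    Each argument maps a context id to "Pass" or "Fail".
    Returns (agreement rate or None, number of shared contexts).
    """
    shared = sorted(set(outcomes_a) & set(outcomes_b))  # the join on context
    if not shared:
        return None, 0
    agree = sum(1 for c in shared if outcomes_a[c] == outcomes_b[c])
    return agree / len(shared), len(shared)

# Outcomes from the slide, keyed by hypothetical context ids.
row = ["Fail", "Fail", "Pass", "Pass", "Fail", "Pass"]
phrase_a = {f"ctx{i}": r for i, r in enumerate(row)}
phrase_b = dict(phrase_a)
phrase_b["ctx_only_b"] = "Fail"  # appears for one phrase only, so the join drops it

prob, n = equivalence_probability(phrase_a, phrase_b)
print(prob, n)  # 1.0 6
```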

44 Workflow
Human provides a reference expression with the labels residual, alphaoverm, and prediction attached to its sub-expressions:

    theta = theta-alpha*1/m*(X'*(X*theta-y));

45 Discovered equivalence classes

{m}:
    m
    rows (X)
    rows (y)
    size (X, 1)
    length (y)
    size (y, 1)
    length (x (:, 1))
    length (X)
    size (X) (1)

{alphaoverm}:
    alpha / {m}
    1 / {m} * alpha
    alpha.* (1 / {m})
    alpha./ {m}
    alpha * inv ({m})
    alpha * (1./ {m})
    1 * alpha / {m}
    alpha * pinv ({m})
    alpha * 1./ {m}
    alpha.* 1 / {m}
    1.* alpha./ {m}
    alpha * (1 / {m})
    .01 / {m}
    alpha.* (1./ {m})
    alpha * {m} ^ -1

{hypothesis}:
    (X * theta)
    (theta' * X')'
    [X] * theta
    (X * theta (:))
    theta(1) + theta (2) * X (:, 2)
    sum(x.*repmat(theta',{m},1), 2)

{residual}:
    (X * theta - y)
    (theta' * X' - y')'
    ({hypothesis} - y)
    ({hypothesis}' - y')'
    [{hypothesis} - y]
    sum({hypothesis} - y, 2)
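One way to picture how such equivalence classes are used is to rewrite every member of a class to the class name, so that differently written submissions become identical. The sketch below does this on whitespace-stripped strings, which is a deliberate simplification: Codewebs operates on ASTs, and a string replace cannot safely handle a bare identifier such as m (it would also match inside alpha), so that member is left out of the example table.

```python
import re

# A subset of the {m} class from the slide; the bare identifier "m" is omitted
# because plain string replacement would also rewrite the "m" inside other names.
EQUIVALENCE_CLASSES = {
    "{m}": ["size (X, 1)", "length (y)", "rows (X)", "size (y, 1)"],
}

def normalize(expr):
    """Remove all whitespace so spacing differences do not matter."""
    return re.sub(r"\s+", "", expr)

def canonicalize(expr, classes):
    """Replace each known class member with its class name, longest members first."""
    out = normalize(expr)
    for name, members in classes.items():
        for member in sorted(map(normalize, members), key=len, reverse=True):
            out = out.replace(member, name)
    return out

a = canonicalize("alpha / length(y)", EQUIVALENCE_CLASSES)
b = canonicalize("alpha / size (X, 1)", EQUIVALENCE_CLASSES)
print(a, b)  # alpha/{m} alpha/{m}
```

After this rewrite, both expressions match the pattern alpha / {m} listed under {alphaoverm}, which is how nested classes can build on one another.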

46 Canonicalization improves bug detection accuracy
[Plot omitted: F-score (higher is better) against # unique ASTs considered; curves: with canonicalization, without canonicalization.]

47 How many submissions can we give feedback to with fixed effort?
[Plot omitted: # submissions covered (out of 40,000) against # equivalence classes, with 25 ASTs marked and with 200 ASTs marked; annotated points: Canonicalization, 25 marked ASTs; No Canonicalization, 200 marked ASTs.]

48 If we can find shared structure, we can facilitate feedback
Example phrases: apartheid; inductive hypothesis; base case; Afrikaner Calvinism; elections in South Africa; gradient; residual; learning rate
In education: ASTs, proofs, essays, architecture, poems

49 Data is revolutionizing many fields.
And it can revolutionize education too!

50 Thank you!!
