Automated Grading of DFA Constructions

Rajeev Alur and Loris D'Antoni (Department of Computer Science, University of Pennsylvania), Sumit Gulwani (Microsoft Research, Redmond), Dileep Kini and Mahesh Viswanathan (Department of Computer Science, University of Illinois at Urbana-Champaign)

Abstract

One challenge in making online education more effective is to develop automatic grading software that can provide meaningful feedback. This paper provides a solution to automatic grading of the standard computation-theory problem that asks a student to construct a deterministic finite automaton (DFA) from the given description of its language. We focus on how to assign partial grades for incorrect answers. Each student's answer is compared to the correct DFA using a hybrid of three techniques devised to capture different classes of errors. First, in an attempt to catch syntactic mistakes, we compute the edit distance between the two DFA descriptions. Second, we consider the entropy of the symmetric difference of the languages of the two DFAs, and compute a score that estimates the fraction of the number of strings on which the student answer is wrong. Our third technique is aimed at capturing mistakes in reading of the problem description. For this purpose, we consider a description language MOSEL, which adds syntactic sugar to the classical Monadic Second Order Logic, and allows defining regular languages in a concise and natural way. We provide algorithms, along with optimizations, for transforming MOSEL descriptions into DFAs and vice versa. These allow us to compute the syntactic edit distance of the incorrect answer from the correct one in terms of their logical representations. We report an experimental study that evaluates hundreds of answers submitted by (real) students by comparing grades/feedback computed by our tool with human graders. Our conclusion is that the tool is able to assign partial grades in a meaningful way, and should be preferred over the human graders for both scalability and consistency.

1 Introduction

There has been a lot of interest recently in offering college-level education to students worldwide via information technology. Several websites such as EdX (https://www.edx.org/), Coursera (https://www.coursera.org/), and Udacity (http://www.udacity.com/) are increasingly providing online courses on numerous topics, from computer science to psychology. Several challenges arise with this new teaching paradigm. Since these courses, often referred to as massive open online courses (MOOCs), are typically taken by several thousands of students located around the world, it is particularly hard for the instruction staff to provide useful personalized feedback for practice problem sets and homework assignments. Our focus in this paper is on the problem of deterministic finite automata (DFA) construction. The importance of DFA in computer science education hardly needs justification. Besides being part of the standardized computer science curriculum, the concept of DFA is rich in structure and potential applications. It is useful in diverse settings such as control theory, text editors and lexical analyzers, and models of software interfaces. We focus on grading assignments in which a student is asked to provide a DFA construction corresponding to a regular language description.
Our main goal is that of automatically measuring how far off the student solution is from the correct answer. This measure can then be used for two purposes: assigning a partial grade, and providing feedback on why the answer is incorrect. Figure 1 shows five solutions from the ones we collected as part of an experiment involving students at UIUC. The solutions are for the following regular language description:

L = {s | s contains the substring "ab" exactly twice}

For this problem the alphabet is Σ = {a, b}. Current technologies for this kind of problem [Aut, 2010] simply check whether the DFA proposed by the student is semantically equivalent to the correct one. For this particular example such a technique would only point out that the first solution A1 is correct, while all the other ones are wrong. Such feedback, however, does not tell us how wrong each solution is.

[Figure 1: Example of DFA grading (state diagrams omitted). The dark states are final. Column 1 contains the name of the DFA depicted in column 2; column 3 shows the grade computed by our tool together with the corresponding feedback. A1: accepts the correct language, grade 10/10. A2: accepts the strings that contain 'ab' at least twice instead of exactly twice, grade 5/10. A3: misses the final state 5, grade 9/10. A4: behaves correctly on most of the strings, grade 6/10. A5: accepts the strings that contain 'ab' at least twice instead of exactly twice, grade 5/10.]

The four DFAs A2, A3, A4, and A5 in Figure 1 are representative of different mistakes. We first concentrate on A2. In this attempt the DFA accepts the language

L1 = {s | s contains the substring "ab" at least twice}

This example shows a common mistake in this type of assignment: the student misunderstood the problem. We need an automated technique that is able to recognize this kind of mistake. The necessary ingredient to address this task is a procedure that, given a DFA A, can synthesize a description of the language L(A) accepted by A. Here a question that immediately arises is: what should the description language for L(A) be? Ideally we would like to describe L(A) in English, but such a description cannot be easily subjected to automated analysis. A better option is a logical language that is not only efficient to reason about, but one which also provides a rich set of primitives at a level of abstraction that is close to how language descriptions are normally stated in English. For this purpose, we extend a well-known logic, called monadic second-order logic (MSO) [Thomas, 1996; Büchi and Landweber, 1969], that can describe regular languages, and we introduce MOSEL, an MSO-equivalent declarative logic enriched with syntactic sugar. In MOSEL, the languages L and L1 can be described by the formulas |indOf 'ab'| = 2 and |indOf 'ab'| ≥ 2 respectively. Thanks to this formal representation, we can compute how far apart two MOSEL descriptions are from each other and translate such a value into a grade. To compute the distance between two descriptions we use an algorithm for computing the edit distance between trees [Bille, 2005]. We design two algorithms: the first one computes the DFA corresponding to a MOSEL description, and conversely the second one computes the MOSEL description of the language accepted by a DFA.
Despite the high computational complexity of such algorithms, through several optimizations we were able to make them work on examples used to learn automata. We executed the first algorithm on all the DFA assignments appearing in [Hopcroft et al., 2006], achieving running times below 1 second. On the same set of assignments we were able to execute the second algorithm on 95% of the problems, achieving running times below 5 seconds.

The approach presented in the previous paragraph is able to capture a particular class of mistakes. However, several DFAs, such as A3 in Figure 1, do not fall in this class. A3 has the full structure of the correct DFA, but state 5 is not marked as final. A possible MOSEL description of A3 is |indOf 'ab'| = 2 ∧ endWt 'b', where the second conjunct indicates that all strings must end with a b. This description is syntactically far from the description of L, causing the corresponding grade to be too low. This example shows that there should be a metric that tells how far A3 is from a correct DFA. To address this class of mistakes we introduce a notion of DFA edit distance that, given a DFA A and a regular language R, computes how many states and transitions of A we need to modify in order to obtain a DFA A′ that accepts R. Such a computation naturally translates into a grade.

The previous techniques cover two broad classes of mistakes. However, in several cases they are still not enough. The language accepted by the DFA A4 in Figure 1 has a complicated MOSEL description, and the number of operations needed to "fix" A4 is quite high (more than 5) because we need to add a new state and redirect several edges. However, this solution is on the right track and behaves correctly on most of the strings. The student just did not notice that in state 4 the machine does not necessarily read the symbol a, causing strings such as ababb to be rejected. Hence, A4 correctly rejects all the strings that are not in L, but also rejects a "few" more. Following this intuition we introduce a notion of language density and use it to approximate the percentage of strings in Σ∗ on which a DFA A misbehaves. Again, such a quantity naturally translates into a grade. We finally combine the three techniques to compute a unique grade.

DFA A5, despite being syntactically different from A2, computes exactly the same language as A2. This similarity might be hard to notice for a human. While our tool, using the same approach as for A2, assigned the same grade to both attempts, we observed in our experiments that the same human grader assigned different grades to them.

We evaluated our tool on DFAs submitted by students at UIUC and compared the grades generated by the tool to those provided by human graders. First, we identified several instances in which two identical DFAs were graded differently by the same grader, while this was not the case for the tool. Second, we observed that the tool produces grades comparable to those produced by humans. In order to check such properties, we used statistical metrics to compare the tool with two human graders, and manually inspected the cases in which there was a discrepancy between the grades assigned by the tool and by the human. The resulting data suggests that the tool grades as well as a human, and we often found that, in case of a discrepancy, the grade of the human was less fair than that of the tool.
2 MOSEL: Declarative Descriptions of Regular Languages

This section provides preliminary background on DFAs, defines the language MOSEL, and presents algorithms for transforming MOSEL descriptions into DFAs and vice versa.

2.1 Background on DFAs

A deterministic finite automaton (DFA) over an alphabet Σ is a tuple A = (Q, q0, δ, F) where Q is a finite set of states, q0 ∈ Q is the initial state, δ : Q × Σ → Q is the transition function, and F ⊆ Q is the set of accepting states. We define the transitive closure of δ as, for all a ∈ Σ and s ∈ Σ∗, δ∗(q, ε) = q and δ∗(q, as) = δ∗(δ(q, a), s).

[...]

3 An Algorithm for Grading DFA Constructions

We next address the problem of grading a student attempt. Given a target language LT and a student solution As, we need a metric that tells us how far As is from a correct solution. Based on our experience related to teaching and grading DFA constructions, we identified three classes of mistakes:

Problem Syntactic Mistake: the student gives a solution for a different problem (see (2) and (5) in Figure 1);

Solution Syntactic Mistake: the student omits a transition or a final state (see (3) in Figure 1); and

Problem Semantic Mistake: the solution is wrong on a small fraction of the strings (see (4) in Figure 1).

We investigated three approaches that try to address each of these classes. First, we use the classic notion of tree edit distance [Gao et al., 2010] to compute the difference between two MOSEL formulas. Second, we introduce a notion of DFA edit distance to capture the distance between DFAs. Last, we use the concept of regular language density to compute the difference between two languages when viewed as sets.

3.1 Problem Syntactic Distance

The following metric captures the case in which the MOSEL description of the language corresponding to the student DFA As is close to the MOSEL description of the target language LT. This metric computes how syntactically close two MOSEL descriptions are. We consider MOSEL formulas as the ordered trees induced by their parse trees. Given a MOSEL formula φ, we call Tφ its parse tree. Given two ordered trees t1 and t2, their tree edit distance TED(t1, t2) is defined as the minimum number of edits that can transform t1 into t2. Given a tree t, an edit is one of the following operations:

relabel: change the label of a node n;

node deletion: given a node n with parent n′, 1) remove n, and 2) place the children of n as children of n′, inserting them in the "place" left by n; and

node insertion: given a node n, 1) replace a consecutive subsequence C of children of n with a new node n′, and 2) let C be the children of n′.

We use the algorithm in [Gao et al., 2010] to compute TED. Next, we compute the distance D(φ1, φ2) between two formulas φ1 and φ2 as TED(Tφ1, Tφ2). Finally, we compute

WTED(φ1, φ2) = D(φ1, φ2) / |Tφ2|

where |T| is the number of nodes in T. In this way, for the same number of edits, fewer points are deducted for languages with a bigger description. See Figure 5 for an example of a parse tree. Since we are ultimately interested in grading DFAs, given a DFA As we use the procedure proposed in § 2.4 to compute the formula φAs corresponding to As.

[Figure 5: Parse tree for φ = |indOf 'a'| % 2 = 0 (diagram omitted).]

Example 1 Consider the language L corresponding to φ = |indOf 'ab'| % 2 = 1 ∧ begWt 'a' over the alphabet Σ = {a, b}. Let's assume the student provides the DFA A′ that implements the language φ′ = |indOf 'ab'| % 2 = 1 ∨ begWt 'a', where ∧ has been replaced by ∨. The problem syntactic distance will yield the following values:

TED(φ′, φ) = 1    WTED(φ′, φ) = 1/9

In this case applying one node relabeling is enough to "fix" Tφ′. We omit the parse tree of φ, which contains 9 nodes.
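The WTED computation can be illustrated with a short sketch. The snippet below assumes the third-party Python package zss, an implementation of the Zhang-Shasha ordered tree edit distance whose default unit costs match the edit operations listed above; the nested-tuple encoding of the parse trees of φ and φ′ is only a plausible guess at the MOSEL grammar, so the node counts are illustrative rather than definitive.

```python
# Sketch of the problem syntactic distance of Section 3.1 (not the tool's code).
# Assumes the `zss` package (Zhang-Shasha ordered tree edit distance).
from zss import Node, simple_distance

def build(t):
    """t = (label, [children]); returns a zss Node plus its node count."""
    label, kids = t
    node, node_count = Node(label), 1
    for k in kids:
        child, child_count = build(k)
        node.addkid(child)
        node_count += child_count
    return node, node_count

# Assumed parse trees for phi  = (|indOf 'ab'| % 2 = 1) AND begWt 'a'
# and                     phi' = (|indOf 'ab'| % 2 = 1) OR  begWt 'a'.
MOD_COUNT = ("=", [("%", [("|indOf|", [("'ab'", [])]), ("2", [])]), ("1", [])])
BEG_WITH = ("begWt", [("'a'", [])])
phi = ("and", [MOD_COUNT, BEG_WITH])
phi_prime = ("or", [MOD_COUNT, BEG_WITH])

def wted(student_formula, target_formula):
    """WTED(phi1, phi2) = TED(T_phi1, T_phi2) / |T_phi2|."""
    t1, _ = build(student_formula)
    t2, n2 = build(target_formula)
    return simple_distance(t1, t2) / n2

print(wted(phi_prime, phi))  # one relabel (or -> and), normalized by |T_phi|
```

With this particular encoding Tφ happens to have 9 nodes and the only required edit is relabeling the root, so the sketch reproduces the values of Example 1; a different parse-tree shape would change only the normalizing node count.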
Since, for each language L, there exist infinitely many MOSEL formulas describing L, in our algorithm we set a time-out in the enumeration function and only consider the formulas discovered within that time span. Given a DFA A, S(A, s) is the set of formulas describing L(A) discovered in s seconds. Given two DFAs A1 and A2, we compute

T-WTED(A1, A2, s) = min{WTED(φ1, φ2) | φi ∈ S(Ai, s)}

Finally, consider a formula φ. Note that φ ∧ true, φ ∧ φ, etc. are formulas that are equivalent to φ, but their representation is not minimal. This would cause the grade to be too high in some cases. Our technique avoids enumerating such non-minimal formulas. This not only provides a better metric, but also makes the search process more efficient.

3.2 Solution Syntactic Difference

The following metric captures the case in which the student DFA As is syntactically close to a correct one, by computing how many edits are needed to transform As so that it accepts the correct language LT. We define the notion of DFA edit distance. Given two DFAs A1 and A2, the difference between A1 and A2, DFA-D(A1, A2), is the minimum number of edits that can transform A1 into some DFA A′1 such that L(A′1) = L(A2). Given a DFA A, an edit is one of the following operations:

transition redirection: given a state q and a symbol a ∈ Σ, update δ(q, a) = q′ to δ(q, a) = q′′ where q′ ≠ q′′;

state insertion: insert a new disconnected state q, with δ(q, a) = q for every a ∈ Σ; and

state relabeling: given a state q, add it to or remove it from the set of final states.

Notice that, since we check for language equivalence instead of syntactic equivalence, the operation of state deletion is not necessary in order for two automata to always admit a finite edit difference. For example, a DFA A1 may be language equivalent to a DFA A2, but contain an extra state which is unreachable. Due to this fact the difference will be symmetric. To take into consideration the severity of a mistake based on the difficulty of the problem, we compute the quantity

WDFA-D(A1, A2) = DFA-D(A1, A2) / (k + t)

where k and t are, respectively, the number of states and transitions of A2.

Example 2 Consider the DFA A3 in Figure 1, where state 5 is mistakenly marked as non-final. A1 is the correct solution for the problem. In this case

DFA-D(A3, A1) = 1    WDFA-D(A3, A1) = 1/(6 + 12) = 1/18

since applying one state relabeling will "fix" A3.

In the tool we compute this metric by trying all the possible edits and checking for equivalence with a technique similar to the one presented in Section 2.3. A similar distance notion is the graph edit distance [Bille, 2005]; however, that metric does not take into account the language accepted by the DFA.
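The edit-based search just described can be sketched as follows: generate all DFAs one edit away, test language equivalence against the target via reachability in the product automaton, and iterate breadth-first until an equivalent DFA is found. The tuple encoding of DFAs and all identifiers below are ours rather than the tool's; a real implementation would memoize visited automata and use the equivalence check of Section 2.3.

```python
# Sketch of the DFA edit distance DFA-D of Section 3.2 (not the tool's code).
# A DFA is a tuple (states, sigma, delta, start, finals) with total delta[(q, a)] = q'.
from itertools import count

def equivalent(d1, d2):
    """Language equivalence via reachability in the product automaton."""
    _, sigma, delta1, s1, f1 = d1
    _, _, delta2, s2, f2 = d2
    seen, stack = set(), [(s1, s2)]
    while stack:
        p, q = stack.pop()
        if (p, q) in seen:
            continue
        seen.add((p, q))
        if (p in f1) != (q in f2):      # a reachable disagreement on acceptance
            return False
        stack.extend((delta1[(p, a)], delta2[(q, a)]) for a in sigma)
    return True

def single_edits(dfa):
    """Yield every DFA obtained from `dfa` by one edit operation."""
    states, sigma, delta, start, finals = dfa
    for q in states:                    # state relabeling: toggle finality of q
        yield (states, sigma, delta, start, finals ^ {q})
    for (q, a), q1 in delta.items():    # transition redirection
        for q2 in states:
            if q2 != q1:
                yield (states, sigma, {**delta, (q, a): q2}, start, finals)
    fresh = max(states) + 1             # state insertion: new state with self-loops
    loops = {(fresh, a): fresh for a in sigma}
    yield (states | {fresh}, sigma, {**delta, **loops}, start, finals)

def dfa_distance(student, target, max_edits=3):
    """Minimum number of edits making `student` language-equivalent to `target`."""
    frontier = [student]
    for k in count():
        if any(equivalent(d, target) for d in frontier):
            return k
        if k == max_edits:
            return None                 # give up beyond the search bound
        frontier = [e for d in frontier for e in single_edits(d)]

# WDFA-D(student, target) would divide the result by k + t of the target DFA.
```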
3.3 Problem Semantic Difference

The following metric captures the case in which the DFA As behaves correctly on most inputs, by computing what percentage of the input strings is correctly accepted/rejected by As. Given two languages L1 and L2, we define the density difference to be

DEN-DIF(L1, L2) = lim_{n→+∞} |((L1 \ L2) ∪ (L2 \ L1)) ∩ Σ^n| / max(|L2 ∩ Σ^n|, 1)

where Σ^n denotes the set of strings in Σ∗ of length n. Informally, for every n, the expression E(n) inside the limit computes the number of strings of length n that are misclassified by L1 divided by the number of strings of length n in L2. The max in the denominator is used to avoid division by 0. Unfortunately, the density difference is not always defined, as the limit may not exist.

Example 3 Consider the languages LA corresponding to |all| % 2 = 0 and LB corresponding to true (i.e. Σ∗) over the alphabet Σ = {a, b}. The limit DEN-DIF(LA, LB) is not defined, since the expression inside the limit keeps oscillating between 0 and 1.

In practice we compute the approximated density

A-DEN-DIF(L1, L2) = (1 / (2k + 1)) · Σ_{n=0}^{2k} |((L1 \ L2) ∪ (L2 \ L1)) ∩ Σ^n| / max(|L2 ∩ Σ^n|, 1)

where k is the number of states of the minimal DFA representing L2. This approximation is not precise, but it is very helpful for capturing the cases in which the student forgot a finite number of strings in the solution (for example, only ε).

Example 4 Consider the DFAs A1 and A4 in Figure 1 and their respective languages L(A1) and L(A4). In this case A-DEN-DIF(L(A4), L(A1)) = 0.09. This value is the one used to compute the grade shown in Figure 1.

Similar notions of density have been proposed in the literature [Bodirsky et al., 2004; Kozik, 2005]. The density DEN(L) of a regular language L over the alphabet Σ is defined as the limit

DEN(L) = lim_{n→+∞} |L ∩ Σ^n| / |Σ^n|

When this limit is defined it is also computable [Bodirsky et al., 2004]. The conditional language density DEN(L1|L2) of a given language L1 in a given language L2, such that L1 ⊆ L2, is the limit

DEN(L1|L2) = lim_{n→+∞} |L1 ∩ Σ^n| / |L2 ∩ Σ^n|

Again, there are languages for which these densities are not defined, but when they are, they can also be computed [Kozik, 2005]. These definitions have good theoretical foundations, but, unlike our metric, they are undefined for most DFAs. Also, our penalty scheme is fair in that students should at least be testing their DFAs on small strings, if not large ones.
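Each term of the sum in A-DEN-DIF can be computed exactly: a dynamic program over the product of the two DFAs counts, for every length n, how many strings drive the pair of automata to each pair of states, from which both |((L1 \ L2) ∪ (L2 \ L1)) ∩ Σ^n| and |L2 ∩ Σ^n| follow. The sketch below uses the same (our own) DFA encoding as the earlier sketch and is not the tool's implementation.

```python
# Sketch of the approximated density difference A-DEN-DIF of Section 3.3
# (not the tool's code).  DFA encoding: (states, sigma, delta, start, finals).

def approx_density_diff(d1, d2):
    """A-DEN-DIF(L(d1), L(d2)), with d2 ideally the minimal DFA for the target."""
    _, sigma, delta1, s1, f1 = d1
    states2, _, delta2, s2, f2 = d2
    horizon = 2 * len(states2)            # sum over lengths n = 0 .. 2k

    # counts[(p, q)] = number of strings of the current length n that drive
    # d1 to state p and d2 to state q (dynamic programming on the product).
    counts = {(s1, s2): 1}                # length 0: only the empty string
    total = 0.0
    for n in range(horizon + 1):
        misclassified = sum(c for (p, q), c in counts.items() if (p in f1) != (q in f2))
        in_target = sum(c for (p, q), c in counts.items() if q in f2)
        total += misclassified / max(in_target, 1)
        nxt = {}                          # advance the counts by one more symbol
        for (p, q), c in counts.items():
            for a in sigma:
                key = (delta1[(p, a)], delta2[(q, a)])
                nxt[key] = nxt.get(key, 0) + c
        counts = nxt
    return total / (horizon + 1)
```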
3.4 Combining the Approaches

The aforementioned approaches need to be combined in order to compute the final grade. We are aware of the many machine learning techniques that could be used for combining the three features, but we decided instead to use a simple combination function for the following reasons: 1) in the future we would like to extract feedback information from the computed grade, and 2) in general, only one of the three features succeeds in computing a positive grade. Next, we provide the general schema of the combining function. First, each deduction v, which ranges between 0 and 1, is scaled to a new value v′ using a formula of the form v′ := (v + c)^2 - c^2, where c is a constant. We used a training set of 60 manually graded attempts to identify the constants c for the combining function. Finally, we pick the metric which awards the highest score.

4 Experimental Evaluation

The aim of our experiment is to evaluate to what extent the grades given by our tool and those given by human instructors agree. To do so we collected around 800 attempts at DFA construction questions by students taking a theory of computation course for the first time. For each problem we had two instructors and our tool grade each attempt separately. In order to see how well the tool does, we compare statistics that reveal variation between human graders and variation between a human grader and our tool. To measure the extent of agreement between two graders we employ Pearson's correlation coefficient. The correlation coefficient is a number between -1 and 1. A value of 1 indicates that the paired points are linearly related with a positive slope; the closer this quantity is to 1, the more the two measurements being compared tend to vary together in the same direction.

In order to obtain a basis for comparing the correlation coefficients, we also see how a naive grader would perform with respect to the human graders. There could be many ways to define a naive grader. A simple one that we consider uses the following grading scheme: (i) it awards near-maximum (9 or 10) marks to correct solutions, and (ii) for incorrect solutions it deducts marks based on the number of states that are lacking or in excess and adds a small random noise. We summarize the resulting calculations in Table 1.

Table 1: Comparing grades given by humans and tool. The grades were between 0 and 10. H1 and H2 denote the two human graders, T the tool, and N the naive grader. A-B denotes the difference between graders A and B when grading each individual attempt. For each problem Li the table shows, in order: the number of student attempts (Tot.), the number of distinct attempts (Dis.), the average difference, the standard deviation, and the Pearson correlation between single-attempt grades.

L1 = {s | s starts with a and has an odd number of ab substrings}
L2 = {s | s has more than 2 a's or more than 2 b's}
L3 = {s | all odd positions of s contain the symbol a}
L4 = {s | s begins with ab and |s| is not divisible by 3}
L5 = {s | s contains the substring ab exactly twice}
L6 = {s | s contains the substring aa and ends with ab}

          Attempts          Average                  Standard deviation        Pearson correlation
Problem   Tot.   Dis.    H1-H2   H1-T   H1-N      H1-H2   H1-T   H1-N      H1-H2   H1-T   H1-N
L1        131    108      0.99   0.54   0.22       2.06   1.99   2.62       0.87   0.83   0.65
L2        110    100     -0.66   0.85   0.26       1.80   2.44   2.71       0.90   0.80   0.75
L3         96     75     -0.52   0.86  -1.38       1.61   2.67   3.84       0.90   0.74   0.31
L4         92     68      0.40   1.32   0.36       1.68   2.78   2.48       0.81   0.71   0.61
L5         52     46      0.02   0.19   0.29       2.01   1.88   3.23       0.71   0.79   0.49
L6         38     31     -0.50  -1.34  -1.50       2.42   2.90   3.70       0.76   0.63   0.34

4.1 Detailed Analysis

In the following we only consider the first problem, where the language is L1 = {s | s starts with a and has an odd number of ab substrings}. The first column of the averages in Table 1 reads 0.99 for H1-H2, meaning that H1 has awarded, on average, about 1 point more than H2. The next two columns show that H1 is on average closer to the naive grader N and to the tool T than it is to H2. However, the standard deviation for H1-N (2.62) is greater than that for H1-T (1.99), which means that the grades given by our tool show a lot less variation, and are in fact closer to H1 more often than N. The Pearson correlation coefficients show that the degree of correlation between the tool T and H1 (0.83) is clearly better than that between N and H1 (0.65), and at the same time comes very close to the degree of correlation between the two human graders H1 and H2 (0.87).

We say that two graders agree on an attempt with a threshold of t if the grades given by the two graders do not differ by more than t. The plot below shows three curves.

[Plot omitted: percentage of attempts within a given grade difference (y-axis, 20-100%) against the grade difference threshold (x-axis, 0-10), with one curve each for H1-N, H1-H2, and H1-T.]

Each curve compares two graders and displays how the percentage of problems on which they agree increases as the threshold varies from 0 to 10. The three curves compare T, H2, and N against H1, and it is easy to see that our tool T comes to an agreement much faster than N. More surprisingly, the tool also comes to an agreement faster than H2.
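Both statistics used in this comparison are easy to reproduce. The sketch below computes Pearson's correlation coefficient for a pair of graders and the agreement-within-threshold percentages shown in the plot; the grade lists are made up purely for illustration.

```python
# Sketch of the two agreement statistics of Section 4 (illustrative only).
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length grade lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def agreement_curve(xs, ys, max_grade=10):
    """Percentage of attempts on which |x - y| <= t, for each threshold t."""
    n = len(xs)
    return [100.0 * sum(abs(x - y) <= t for x, y in zip(xs, ys)) / n
            for t in range(max_grade + 1)]

h1 = [10, 7, 5, 9, 3, 8]    # grades assigned by human H1 (made-up data)
tool = [10, 6, 6, 9, 2, 7]  # grades assigned by the tool (made-up data)
print(pearson(h1, tool))
print(agreement_curve(h1, tool))
```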
[Figure 6: Selected attempts (1)-(5) for the language L1 = {s | s starts with a and has an odd number of ab substrings}; state diagrams omitted.]

Figure 6 shows five cases in which the human graders and the tool show some discrepancy. In case (1) the computed language L′1 is described by the formula |indOf 'b'| % 2 = 1 ∧ begWt 'a'. Since the MOSEL descriptions of L1 and L′1 are similar, the tool gives a high grade for this attempt. However, L′1 has an easier construction than L1. We attenuate these "false high grades" by deducting extra points when the minimal DFA corresponding to the student solution has fewer states than the target DFA. We point out that H1 graded (1) and its minimal DFA version with two different grades, 5 and 3 respectively. A similar grading inconsistency was observed with A2 and A5 of Figure 1. Case (2) shows a DFA for which the approximated language density is low, causing the tool to award this attempt 7 points. One can argue that this grade is too high for such a DFA. However, the same DFA as (2) was submitted by multiple students, and, while the tool always awarded 7 points, both human graders were inconsistent: H1 graded five identical attempts with 4, 5, 7, 7, and 7 points, while H2 awarded 1, 1, 2, 3, and 8 points. Case (3) shows a DFA for which the DFA edit distance yields a too "generous" grade. For this DFA it is enough to remove the transition δ(0, b) in order to obtain a DFA that accepts L1. However, in this case the mistake is deeper than a simple typo. Case (4) shows a DFA for which the human awarded too high a score. Even though this DFA has several syntactic mistakes, H1 awarded the same grade as for attempt (5), where only one final state is missing. The grader was probably misled by the visual similarity of the two DFAs. For case (5), where state 3 was mistakenly marked as non-final, both H1 and H2 lacked consistency: H1 awarded grades ranging from 8 to 10 for 7 identical attempts.

4.2 Strengths of the Tool

Inspecting the data for the 131 attempts for the language L1 we observed the following: 1) in 6 cases one of the human graders mistakenly assigned the maximum score to an incorrect attempt; 2) in more than 20 of the 34 cases in which T and H1 disagreed by at least 3 points, H1, after reviewing the attempt, agreed with the grade computed by T; and 3) in more than 20 cases at least one of the human graders was inconsistent, i.e., two syntactically equivalent attempts were graded differently by the same grader.

4.3 Limitations

The tool suffers from two types of limitations: behavioral and structural. The former concerns the failure of the grading techniques on some particular examples. An example is attempt (3) of Figure 6, where a small DFA edit distance did not reflect the severity of the mistake. As for the structural limitations, our techniques are crafted to perform well on problems appearing in theory of computation textbooks [Sipser, 1996; Hopcroft et al., 2006]. Such techniques do not scale to DFAs with large alphabets or many states. Moreover, the MOSEL edit distance fails when the language does not admit a succinct description. For example, languages described by small regular expressions, such as a∗b∗, do not always have a succinct MOSEL description.