diff --git a/Koli_2017/Koli_2017_Stanger.tex b/Koli_2017/Koli_2017_Stanger.tex index 1c9ec70..3168adc 100644 --- a/Koli_2017/Koli_2017_Stanger.tex +++ b/Koli_2017/Koli_2017_Stanger.tex @@ -43,11 +43,11 @@ Courses that teach SQL usually include one or more assessments that test students' ability to create a database using SQL data definition (DDL) statements, and to interact with the database using SQL data manipulation (DML) statements. Manually grading the code submitted for such assessments can be a slow, tedious, and potentially error-prone process. Automated or semi-automated grading has been shown to improve turnaround time and consistency, and is generally received positively by students \cite{Douce.C-2005a-Automatic,Russell.G-2004a-Improving,Dekeyser.S-2007a-Computer,Prior.J-2004a-Backwash}. If the grading can be done in real time, the grading tool can even become part of a larger, interactive SQL learning environment (e.g., \cite{Kenny.C-2005a-Automated,Kleiner.C-2013a-Automated,Mitrovic.A-1998a-Learning,Russell.G-2004a-Improving,Sadiq.S-2004a-SQLator}). -While there have been many prior efforts to automatically grade SQL DML (see \cref{sec-literature}), there appear to be no similar systems designed to automatically grade SQL \emph{DDL}. There are generally two main aspects that need to be considered when grading an SQL schema implementation. First, is the DDL code (i.e., \texttt{CREATE} statements) syntactically correct? This is already dealt with quite effectively by the syntax checkers built into every SQL DBMS (although it is fair to say that the errors produced by such checkers can sometimes be obscure and unhelpful). Any student who submits syntactically invalid code cannot expect to score well. A related aspect is code style (e.g., naming, formatting, indentation), but we do not consider this here. +While there have been many prior efforts to automatically grade SQL DML (see \cref{sec-literature}), there appear to be no similar systems designed to automatically grade SQL \emph{DDL}. There are generally two main aspects that need to be considered when grading an SQL schema implementation. First, is the DDL code (i.e., \texttt{CREATE} statements) syntactically correct? This is already dealt with quite effectively by the syntax checkers built into every SQL DBMS (although it is fair to say that the errors produced by such checkers can sometimes be obscure and unhelpful). A student who submits code containing syntax errors cannot expect to score well! A related aspect is code style (e.g., naming, formatting, indentation), but we do not consider this here. -Second, does the schema meet the requirements of the problem being solved? A database schema is normally designed and implemented within the context of a specific set of requirements, so verifying that the implemented SQL schema fulfils these requirements is an effective way to grade the implementation, and also provides a useful framework for providing feedback to students. The requirements for a database schema can usually be loosely divided into \emph{structure} (e.g., tables, columns, data types), \emph{integrity} (e.g., keys, constraints), and \emph{behaviour} (e.g., sequences, triggers). +Second, does the schema meet the requirements of the problem being solved? 
A database schema is normally designed and implemented within the context of a specific set of requirements, so verifying that the implemented SQL schema fulfills these requirements is an effective way to grade the implementation, and also provides a useful framework for providing feedback to students. The requirements for a database schema can usually be loosely divided into \emph{structure} (e.g., tables, columns, data types), \emph{integrity} (e.g., keys, constraints), and \emph{behavior} (e.g., sequences, triggers). -In this paper we describe a system that semi-automates the grading of SQL schema implementations. The system takes as input a machine-readable specification of the assessment requirements and a live instance of a submitted student schema, and checks whether the schema conforms to the requirements. Rather than attempt to parse and check the \texttt{CREATE TABLE} statements directly, the system instead issues queries on the schema's metadata (catalog), and compare the results of these queries against the machine-readable specification. The process effectively becomes one of unit testing the schema using the specification as a framework. We use the PHPunit database unit testing framework to carry out this process, albeit in a somewhat unorthodox way (see \cref{sec-design}). +In this paper we describe a novel system that semi-automates the grading of SQL schema implementations. The system takes as input a machine-readable specification of the assessment requirements and a live instance of a submitted student schema, and verifies whether the schema conforms to the requirements. Rather than attempt to parse and check the \texttt{CREATE} statements directly, the system instead verifies the structure of the schema by issuing queries on the schema's metadata (catalog), and verifies the behavior of integrity constraints by inserting known legal and illegal values. The results of these checks are compared against the machine-readable specification. This process is effectively one of unit testing the schema using the specification as a framework. We use the PHPUnit database unit testing framework to carry out this process, albeit in a somewhat unorthodox way (see \cref{sec-design}). The remainder of the paper is structured as follows. In the next section we discuss related work and identify gaps, while \cref{sec-motivation} discusses the motivation for our approach. \Cref{sec-design} discusses the design of our system, and \cref{sec-evaluation} evaluates its effectiveness. We conclude in \cref{sec-conclusion}. @@ -79,19 +79,19 @@ \citeauthor{Bhangdiya.A-2015a-XDa-TA}'s \emph{XDa-TA}\footnote{\url{http://www.cse.iitb.ac.in/infolab/xdata/}} extended the idea of automated grading of SQL by adding the ability to generate data sets designed to catch common errors. These data sets were automatically derived from a set of correct SQL queries \cite{Bhangdiya.A-2015a-XDa-TA,Chandra.B-2015a-Data}. Later work \cite{Chandra.B-2016a-Partial} added support for awarding partial marks. -\citeauthor{Gong.A-2015a-CS-121-Automation}'s ``CS 121 Automation Tool'' \cite{Gong.A-2015a-CS-121-Automation} was a tool designed to semi-automate the grading of SQL assessments, again focusing on SQL DML statements. Interestingly, the system appears to be extensible and could thus potentially be modified to support grading of \texttt{CREATE TABLE} statements.
+\citeauthor{Gong.A-2015a-CS-121-Automation}'s ``CS 121 Automation Tool'' \cite{Gong.A-2015a-CS-121-Automation} was a tool designed to semi-automate the grading of SQL assessments, again focusing on SQL DML statements. Interestingly, the system appears to be extensible and could thus potentially be modified to support grading of SQL DDL statements. There is relatively little work on unit testing of databases. Most authors working in this area have focused on testing database \emph{applications} rather than the database itself (e.g., \cite{Binnig.C-2008a-Multi-RQP,Chays.D-2008a-Query-based,Marcozzi.M-2012a-Test,Haller.K-2010a-Test}). \citeauthor{Ambler.S-2006a-Database} discusses how to test the functionality of a database \cite{Ambler.S-2006a-Database}, while \citeauthor{Farre.C-2008a-SVTe} test the ``correctness'' of a schema \cite{Farre.C-2008a-SVTe}, focusing mainly on consistency of constraints. Neither consider whether the database schema meets the specified requirements. -To our knowledge there has been no work on automated grading of SQL \texttt{CREATE TABLE} statements. While dealing with these is simpler than dealing with \emph{SELECT} statements, the ability to at least semi-automate the grading of SQL schema definitions should reap rewards in terms of more consistent application of grading criteria, and faster turnaround time. +To our knowledge there has been no work on automated grading of SQL DDL statements. While dealing with these is simpler than dealing with \emph{SELECT} statements, the ability to at least semi-automate the grading of SQL schema definitions should reap rewards in terms of more consistent application of grading criteria, and faster turnaround time. -Only a couple of the systems discussed in this section [which?] have considered a more ``functional'' approach to checking SQL code, i.e., verifying that the code written fulfils the requirements of the problem, rather than focusing on the code itself. Given the relatively static nature of an SQL schema, we feel this is the most appropriate way of approaching an automated grading system. This sounds like it should be a useful application of formal methods \cite{Spivey.J-1989a-An-introduction}, but work with formal methods and databases seems to have focused either on \emph{generating} a valid schema from a specification (e.g., \cite{Vatanawood.W-2004a-Formal,Lukovic.I-2003a-Proceedings,Choppella.V-2006a-Constructing}), or on verifying schema transformation and evolution \cite{Bench-Capon.T-1998a-Report}. +Only a couple of the systems discussed in this section [which?] have considered a more ``functional'' approach to checking SQL code, i.e., verifying that the code written fulfills the requirements of the problem, rather than focusing on the code itself. Given the relatively static nature of an SQL schema, we feel this is the most appropriate way of approaching an automated grading system. This sounds like it should be a useful application of formal methods \cite{Spivey.J-1989a-An-introduction}, but work with formal methods and databases has mainly focused either on \emph{generating} a valid schema from a specification (e.g., \cite{Vatanawood.W-2004a-Formal,Lukovic.I-2003a-Proceedings,Choppella.V-2006a-Constructing}), or on verifying schema transformation and evolution \cite{Bench-Capon.T-1998a-Report}. 
\section{Motivation} \label{sec-motivation} -Since 1989, our department has offered a mandatory database concepts paper of some form, typically one semester during the second year\footnote{New Zealand Bachelor's degrees normally comprise three years of study.} of study, building on a brief introduction to basic data management concepts in the first year. Typical of papers of this nature, it covered core topics such as the relational model, relational algebra, data integrity, SQL (DDL and DML), and a varying mixture of other database topics such as transactions, concurrency control, triggers, and security. Assessment of SQL skills was typically carried out using a mixture of assignments and tests. +Since 1989, our department has offered a mandatory database concepts course of some form, typically one semester during the second year\footnote{New Zealand Bachelor's degrees normally comprise three years of study.} of study, building on a brief introduction to basic data management concepts in the first year. Typical of courses of this nature, it covered core topics such as the relational model, relational algebra, data integrity, SQL (DDL and DML), and a varying mixture of other database topics such as transactions, concurrency control, triggers, and security. Assessment of SQL skills was typically carried out using a mixture of assignments and tests. From 2001 to 2003, we assessed students' SQL DDL skills with an in-lab practical examination. Students were given a fictional scenario specification, and had 100 minutes in which to modify a provided schema template with additional tables, constraints, etc. The test was generally easier to grade than a more ``realistic'' practical assignment, as the scenario specification tended to be quite tightly specified and thus less open to (mis)interpretation. However, the test experience was quite stressful to students due to the limited timeframe and limited access to online references. We did not attempt to automate the grading of these tests. @@ -106,13 +106,13 @@ \end{figure} -Prior to 2001, the specifications for the scenarios used in this assignment were deliberately somewhat loosely defined and often contained areas that were under-specified or ambiguous. At the time we also had some data modeling content in the paper, so this approach enabled students to explore the different ways that a conceptual model could be converted into an implementation. This of course led to variation across student submissions, due to differing interpretations of the under-specified elements. Automation of grading was problematic, especially when students chose different database structures, or different names for tables and columns, than what we expected. We therefore did not make any significant attempt to automate grading under this approach. +Prior to 2001, the specifications for the scenarios used in this assignment were deliberately somewhat loosely defined and often contained areas that were under-specified or ambiguous. At the time we also had some data modeling content in the course, so this approach enabled students to explore the different ways that a conceptual model could be converted into an implementation. This of course led to variation across student submissions, due to differing interpretations of the under-specified elements. Automation of grading was problematic, especially when students chose different database structures, or different names for tables and columns, than what we expected. 
We therefore did not make any significant attempt to automate grading under this approach. -By 2004 the data modeling content was being migrated into other papers, so we began to tighten up the scenario specifications to reduce ambiguity. In 2013 we took the step of ``freezing'' the specification, i.e., students were not permitted to make arbitrary changes to the specification without strong justification, and even then only if the changes did not impact how client programs interacted with the database. For example, if a column needed to be restricted to a discrete set of values, they could choose to go beyond the specification and enforce this requirement using a separate lookup table, as long as this did not change the structure of the original table. +By 2004 the data modeling content was being migrated into other courses, so we began to tighten up the scenario specifications to reduce ambiguity. In 2013 we took the step of ``freezing'' the specification, i.e., students were not permitted to make arbitrary changes to the specification without strong justification, and even then only if the changes did not impact how client programs interacted with the database. For example, if a column needed to be restricted to a discrete set of values, they could choose to go beyond the specification and enforce this requirement using a separate lookup table, as long as this did not change the structure of the original table. The in-scenario rationale for freezing the specification was that other developers were independently using the same specification to code end-user applications. Any significant variation from the specification would therefore break those applications. This approach tested not only the student's ability to write SQL DDL, but also their ability to correctly interpret and convert a natural language database specification into a corresponding SQL schema. -This approach seemed effective, but maintaining consistent grading standards across all submissions was difficult, due the often large number of distinct gradable elements implied by the specification. This required a complex and highly-detailed rubric to be constructed so that no element was missed, and the grading process consequently took a significant amount of time. In 2012 a significant change to the structure of the paper resulted in higher grading workloads and increased time pressure, prompting interest in the possibility of at least semi-automating the grading of this assignment. This was now more feasible than in earlier years due to the tightly constrained nature of the project specification. +This approach seemed effective, but maintaining consistent grading standards across all submissions was difficult, due to the often large number of distinct gradable elements implied by the specification. This required a complex and highly-detailed rubric to be constructed so that no element was missed, and the grading process consequently took a significant amount of time. In 2012 a significant change to the structure of the course resulted in higher grading workloads and increased time pressure, prompting interest in the possibility of at least semi-automating the grading of this assignment. This was now more feasible than in earlier years due to the tightly constrained nature of the project specification. Another motivation for automation was that it can sometimes be difficult for novices to know whether they are on the right track while implementing a specification.
If a limited version of the grading tool were available to students, it could be used to provide feedback on their progress. @@ -138,7 +138,7 @@ \node[anchor=south east] (web) at ($(driver.north east) + (0,3mm)$) {\shortstack{Web app \\ \footnotesize(student mode)}}; \coordinate[below=3mm of web.south] (web port); - \node[below=5mm of driver] (phpunit) {PHPunit}; + \node[below=5mm of driver] (phpunit) {PHPUnit}; \node[left=5mm of phpunit] (spec) {\shortstack{Assign. \\ spec.}}; @@ -235,11 +235,11 @@ \end{table} -Each table specification also requires the definition of two distinct sets of tests to be run on the database. The first set verifies the structural elements of the table (columns, data types, etc.), thus verifying that a schema meets the minimum requirements of the assignment. When in student mode, only this set of tests is run. An empty data fixture (specified using an XML document) is also required to support this set of tests. +Each table specification also requires the definition of two distinct sets of tests to be run on the database. The first set verifies the structural elements of the table (columns, data types, etc.), thus verifying that a schema meets the minimum requirements of the assignment. This is done by issuing queries against the metadata (catalog) of the schema for elements like tables, columns, and data types. When in student mode, only this set of tests is run. An empty data fixture (specified using an XML document) is also required to support this set of tests. -The second set of tests verifies the behavioural elements of the table, i.e., it's constraints. The only integrity constraints that are directly tested are nullability (\texttt{NOT NULL}), and primary and foreign keys. The behaviour of all other constraints is tested by specifying appropriate lists of legal and illegal values, consistent with typical unit testing practice. When in staff mode, both this set of tests and the set of structural tests are run. A known-valid data fixture is also required to support this set of tests. +The second set of tests verifies the integrity elements of the table, i.e., its constraints. The only integrity constraints that are directly tested are nullability (\texttt{NOT NULL}), and primary and foreign keys, which are again verified using queries against the schema metadata. The remaining constraints are tested behaviorally by inserting lists of known legal and illegal values, an approach consistent with normal unit testing practice. When in staff mode, both this set of tests and the set of structural tests are run. A known-valid data fixture is also required to support this set of tests. -The way that the system runs the tests is somewhat unusual. In typical unit testing, the tests are essentially standalone code units that are automatically executed in an indeterminate order by the unit testing framework. The framework handles dependencies and collation of test results internally, reporting only the final results back to the client. Our system effectively inverts (or perhaps subverts) this approach. The main driver explicitly creates test suites itself and executes them directly, listening for the results of each test and collating them. This is because we need to be able to control the order in which tests are executed. If the structural tests fail, there is little point in running the behavioural tests, as they will only generate a stream of errors. Similarly, if a column is missing, there is no point in running the data type and length tests.
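+
+As a concrete (and purely hypothetical) illustration of the two kinds of test, consider a \texttt{PRODUCT} table whose \texttt{unit\_price} column must be non-negative; the table, columns, and values below are invented for illustration and do not come from an actual assignment specification. A structural test issues a catalog query of roughly the following form (shown here against the standard \texttt{INFORMATION\_SCHEMA}; the catalog views actually queried depend on the DBMS), while the constraint on \texttt{unit\_price} is exercised by attempting to insert known legal and illegal rows:
+\begin{verbatim}
+-- Hypothetical table under test:
+--   CREATE TABLE Product (
+--     Product_code  NUMERIC(5)    NOT NULL PRIMARY KEY,
+--     Description   VARCHAR(50)   NOT NULL,
+--     Unit_price    NUMERIC(7,2)  NOT NULL CHECK (Unit_price >= 0) );
+
+-- Structural test: do the expected columns exist, with the
+-- expected data types, precisions, and nullability?
+SELECT column_name, data_type, numeric_precision, is_nullable
+  FROM information_schema.columns
+ WHERE table_name = 'PRODUCT';
+
+-- Integrity test: the CHECK constraint is tested behaviorally.
+-- A known legal row must be accepted ...
+INSERT INTO Product VALUES (10001, 'Widget', 19.95);
+-- ... and a known illegal row must be rejected by the DBMS.
+INSERT INTO Product VALUES (10002, 'Faulty widget', -1.00);
+\end{verbatim}
+In the system itself, checks of this kind are not hand-written SQL; they are carried out by PHPUnit test methods driven by the table specification, and it is the pass or fail outcome of each check that is reported.
+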
+The way that the system runs the tests is somewhat unusual. In typical unit testing, the tests are essentially standalone code units that are automatically executed in an indeterminate order by the unit testing framework. The framework handles dependencies and collation of test results internally, reporting only the final results back to the client. Our system effectively inverts (or perhaps subverts) this approach. The main driver explicitly creates test suites itself and executes them directly, listening for the results of each test and collating them. This is because we need to be able to control the order in which tests are executed. If the structural tests fail, there is little point in running the integrity tests, as they will only generate a stream of errors. Similarly, if a column is missing, there is no point in running the data type and length tests. It is quite feasible to add table properties and tests beyond those already mentioned, as the coding of the specification is entirely under the teacher's control and not constrained in any way by the system. All the teacher needs to do is add the custom properties to the table specification, and add tests that use those properties. @@ -255,12 +255,12 @@ \tcbset{boxsep=0pt, boxrule=0pt, arc=0pt, left=0pt, right=0pt, top=0.5pt, bottom=0.5pt} -Students can check their schema (\textsf{Student's schema} in \cref{fig-architecture}) by loading it under their personal database acccount (thus creating the tables), then entering their database login credentials into a web application (\textsf{Web app} in \cref{fig-architecture}). This calls the main driver in student mode and accesses the student's schema directly. Only the structural tests are run, and the output is displayed in the web browser. \Cref{fig-student-output} shows an example of the output produced by student mode. +Students can check their schema (\textsf{Student's schema} in \cref{fig-architecture}) by loading it under their personal database account (thus creating the tables), then entering their database login credentials into a web application (\textsf{Web app} in \cref{fig-architecture}). This calls the main driver in student mode and accesses the student's schema directly. Only the structural tests are run, and the output is displayed in the web browser. \Cref{fig-student-output} shows an example of the output produced by student mode. \begin{figure} \includegraphics[width=0.95\columnwidth,keepaspectratio]{images/web_output.png} - \caption{Example of student mode output (web app). Grey indicates informative notes, green indicates passed tests, and red indicates failed tests.} + \caption{Example of student mode output (web app). \textcolor{gray}{Gray} indicates informative notes, \textcolor{green!45!black}{green} indicates passed tests, and \textcolor{red!90!black}{red} indicates failed tests.} \label{fig-student-output} \end{figure} @@ -335,9 +335,9 @@ \section{Evaluation} \label{sec-evaluation} -Unfortunately, the system was implemented as a practical solution to a perceived problem, rather than with any formal evaluation in mind. We therefore did not carry out any user evaluations with students that used the system. +Unfortunately, the system was implemented as a practical solution to a perceived problem, rather than with any formal evaluation in mind. We therefore did not carry out any user evaluations with students that used the system. 
From the teaching perspective, we found that the total amount of time taken to grade the relevant assignment was reduced only a little, as we still needed to convert the system's output into corresponding grades. There were also several submissions that did not meet the minimum requirements and therefore had to be manually graded. On a more positive note, the system automatically ensured that all gradable elements were checked, which improved consistency, and made the experience subjectively much less stressful. -However, we did have accees to data regarding student performance on the relevant assignment, which was of the nature discussed in \cref{sec-motivation}. We collated data for the period 2009--2016, which encompassed several different permutations of scenario and available system modes, as summarized in \cref{tab-data}. The assignment counted for 15\% of the total grade in 2009 and 2010, and 10\% in subsequent years. +We also had historical data regarding student performance on the relevant assignment. We collated data for the period 2009--2016, which encompassed several different permutations of scenario and available system modes, as summarized in \cref{tab-data}. The assignment counted for 15\% of the total grade in 2009 and 2010, and 10\% in subsequent years. \begin{table} @@ -366,43 +366,52 @@ \end{table} -The two horizontal rules in \cref{tab-data} indicate two major transitions during this period. The first marks a significant reorganization of the paper's curriculum (and also a switch from first to second semester\footnote{Semesters at the University of Otago run from March to June and July to October.}) in 2012, the year we developed the prototype of our system. The second marks a shift from second semester back to first semester in 2014. Note that the system was not used at all in 2015 due to different staff teaching the paper. Student mode was not made available in 2016 due to technical issues. These differences provide us with a natural experiment with some interesting points for comparison. +The two horizontal rules in \cref{tab-data} indicate two major transitions during this period. The first marks a significant reorganization of the course's curriculum (and also a switch from first to second semester\footnote{Semesters at the University of Otago run from March to June and July to October.}) in 2012, the year we developed the prototype of our system. The second marks a shift from second semester back to first semester in 2014. Note that the system was not used at all in 2015 due to different staff teaching the course. Student mode was not made available in 2016 due to technical issues. These differences provide us with a natural experiment with some interesting points for comparison. The mean grade for the assignment drifted slowly downwards from 2009 to 2012. This changed dramatically in 2013, however, the year we rolled out student mode. The grades are not normally distributed (they typically have negative skew), so we performed a Wilcoxon signed-rank test. This showed that the increase in the mean was highly statistically significant (\(p \approx 10^{-9}\)). The 2013 mean was also significantly higher than both 2010 (\(p \approx 0.0002\)) and 2011 (\(p \approx 10^{-6}\)), but not significantly higher than 2009. The mean decreased significantly again in 2014 (\(p \approx 0.0012\)), the second year that the system was used, and even more dramatically in 2015 (\(p \approx 0.0005\)), when the system was not used at all. 
There was no significant change from 2015 to 2016. Even more interesting, if we compare the performance between the years that student mode was available (2013--2014, mean 81.7\%) and the years it was not (2009--2012 and 2015--2016, mean 71.6\%), there is again a highly statistically significant decrease in the mean (\(p \approx 10^{-8}\)). This strongly suggests that the introduction of student mode may have had a positive impact on students' ability to complete the assignment more effectively. -There are some potential confounding factors to consider, however. First, not only was 2013 the first year that student mode was available, it was also the first year that the assignment specification was ``frozen'' (as discussed in \cref{sec-motivation}). It could be argued that this improved grades due to students having less flexibility, and thus less opportunity for misinterpretation, than in previous years. However, the assignment specification was also ``frozen'' from 2014--2016, and there is consderable variation in the grades achieved over this period, especially in 2015. It therefore seems unlikely that this affected assignment performance. +There are some potential confounding factors to consider, however. First, not only was 2013 the first year that student mode was available, it was also the first year that the assignment specification was ``frozen'' (as discussed in \cref{sec-motivation}). It could be argued that this improved grades due to students having less flexibility, and thus less opportunity for misinterpretation, than in previous years. However, the assignment specification was also ``frozen'' from 2014--2016, and there is considerable variation in the grades achieved over this period, especially in 2015. It therefore seems unlikely that this affected assignment performance. -Second, the switch to second semester in 2012--2013 could have negatively impacted students' performance by increasing the length of time between their exposure to basic data management concepts in first year, and their entry into the second year database paper. In effect, they had longer to forget relevant material they learned in first year. If so, we could reasonably expect the grades in second semester offerings of the paper to be lower. However, grades for second semester offerings of the paper (2012--2013, mean 76.9\%) were significantly \emph{higher} (\(p \approx 0.015\)) than those for second semester offerings (2009--2011 and 2014--2016, mean 72.9\%). This should not be surprising, given that 2013 (second semester) had the highest grades of the entire period. This effectively rules out semester changes as a factor in the performance differences. +Second, the switch to second semester in 2012--2013 could have negatively impacted students' performance by increasing the length of time between their exposure to basic data management concepts in first year, and their entry into the second year database course. In effect, they had longer to forget relevant material they learned in first year. If so, we could reasonably expect the grades in second semester offerings of the course to be lower. However, grades for second semester offerings of the course (2012--2013, mean 76.9\%) were significantly \emph{higher} (\(p \approx 0.015\)) than those for first semester offerings (2009--2011 and 2014--2016, mean 72.9\%). This should not be surprising, given that 2013 (second semester) had the highest grades of the entire period.
This effectively rules out semester changes as a factor in the performance differences. Third, perhaps the years with higher grades used less complex scenarios. To test this, we computed a collection of different database complexity metrics \cite{Jamil.B-2010a-SMARtS,Piattini.M-2001a-Table,Pavlic.M-2008a-Database,Calero.C-2001a-Database,Sinha.B-2014a-Estimation} for each of the four scenarios used across the period. These showed that the ``BDL'', ``used cars'', and ``student records'' scenarios were all of similar complexity, while the ``postgrad'' scenario was about \(\frac{2}{3}\) the complexity of the others. It therefore seems unlikely that scenario complexity is a factor in the performance differences. It is also interesting to note that the ``used cars'' scenario was used in 2014 and 2015, and yet the 2015 grades were significantly \emph{lower} than those for 2014. The only clear difference is that our system was not used in 2015. -Fourth, class size could be a factor. We might plausibly expect a smaller class to have a more collegial atmosphere that promotes better learning. However, if we look at the sizes of the classes in \cref{tab-data}, we can see no discernable pattern between class size and performance. Indeed, both the best (2013) and worst (2012, 2015) performances came from classes of similar size (75, 77, and 71, respectively). +Fourth, class size could be a factor. We might plausibly expect a smaller class to have a more collegial atmosphere that promotes better learning. However, if we look at the sizes of the classes in \cref{tab-data}, we can see no discernible pattern between class size and performance. Indeed, both the best (2013) and worst (2012, 2015) performances came from classes of similar size (75, 77, and 71, respectively). -Fifth, it could be that better performance occurred in years where the students were just more capable in general. We obtained GPA data for the students enrolled in each year, and computed the median as an indication of the general capability of the class. Looking at \cref{tab-data}, we can immediately see that the year with the best results (2013) was also the year with the second-lowest median GPA (3.2). Constrast this with the poor performance in 2012, where the median GPA was 3.4. Indeed, in both years that student mode was available, median GPA was lower than many other years, yet performance was better even than years with higher median GPA. This argues against the idea that we simply had a class full of very capable students in the years with better performance. +Fifth, it could be that better performance occurred in years where the students were just more capable in general. We obtained GPA data for the students enrolled in each year, and computed the median as an indication of the general capability of the class. Looking at \cref{tab-data}, we can immediately see that the year with the best results (2013) was also the year with the second-lowest median GPA (3.2). Contrast this with the poor performance in 2012, where the median GPA was 3.4. Indeed, in both years that student mode was available, median GPA was lower than many other years, yet performance was better even than years with higher median GPA. This argues against the idea that we simply had a class full of very capable students in the years with better performance. Finally, perhaps the different weightings of the assignment (15\% in 2009--2010 vs.\ 10\% in 2011--2016) affected student motivation. 
It could be argued that the higher weighting in 2009--2010 provided a greater incentive for students to work more, as the potential reward was greater. If so, we should expect better performance in 2009--2010. Indeed, we do find this: the mean for 2009--2010 is 75.1\%, while that for 2011--2016 is 73.9\%, a statistically significant decrease (\(p \approx 0.034\)). However, since this change occurred well before our system was even implemented, it cannot be a factor in the improved performance seen in 2013 and 2014. -% Anecdotal evidence from students? -% Didn't substantially reduce grading time, but did improve consistency, as there was much less opportunity to miss or forget something. +\section{Known issues \& future work} +\label{sec-issues} +While the use of student mode in 2013 and 2014 appears to have had a beneficial effect on student performance, there are some outstanding issues with the system that need to be addressed. -\section{Discussion \& future work} -\label{sec-discussion} +There is currently no way to control or suppress the messages generated by PHPUnit test assertions. You can prepend a meaningful message of your own, but PHPUnit will still generate a message like ``Failed asserting that 0 matches expected 1'', which can be confusing for students. One particularly tricky case was when a student failed to specify a precision for a numeric column, e.g., they declared a column as just \texttt{NUMERIC} rather than \texttt{NUMERIC(5)}. In this situation, most DBMSs will assign the maximum precision (e.g., 38 significant digits for Oracle). The student then sees a message ``Failed asserting that 38 matches expected 5'', and has no idea where the ``38'' came from (this case is illustrated below). The only way to address this issue would be to change the way PHPUnit works internally. -% known issues: -% There's currently no control over the messages generated by PHPUnit assertions. You can put a meaningful message up front, but PHPUnit will still always generate something like ``Failed asserting that 0 matches expected 1.'' This can be particularly misleading when you, e.g., don't specify a precision for a numeric column, and the DBMS uses the default precision (e.g., Oracle's NUMBER defaults to 38 significant digits). -% A partial schema causes a large number of errors, as tables don't exist. This could be alleviated by more careful exception handling? -% Students in the first iteration tended to misuse the web application as a ``schema compiler'', fixing only one issue before re-submitting, rather than attempting to fix as many of the reported problems as possible. The system wasn't written to handle concurrent requests (as it wasn't expected that the request rate would be that high), leading to waits and timeouts. A workaround was to enable logging, and warn students who were abusing the system. -% Current version doesn't automatically assign marks, but it would be a simple extension to do so. We've already implemented the core functionality to support this. -% As of 2017, the main introduction to database content and SQL has moved to first year. With class sizes of 100--200, automation of grading is essential. We will look at rolling out a new version of the system to this class in 2018. +A schema that fails to meet the minimum requirements will generate a large number of errors due to missing tables. This could be alleviated by more careful exception handling in the main driver.
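+
+To make the precision problem above concrete, the sketch below shows the kind of declaration and structural check involved; the table and column names are invented, and the catalog query is written against the standard \texttt{INFORMATION\_SCHEMA} purely for illustration (the catalog views actually used depend on the DBMS):
+\begin{verbatim}
+-- The specification calls for a five-digit code:
+--   Product_code NUMERIC(5) NOT NULL
+-- A student instead declares the column without a precision:
+CREATE TABLE Product ( Product_code NUMERIC NOT NULL );
+
+-- Structural test: look up the declared precision in the catalog
+-- and compare it with the expected value (5).
+SELECT numeric_precision
+  FROM information_schema.columns
+ WHERE table_name = 'PRODUCT' AND column_name = 'PRODUCT_CODE';
+
+-- Because the DBMS substitutes its default maximum precision
+-- (38 significant digits in the Oracle case described above),
+-- the test fails with the unhelpful message
+-- "Failed asserting that 38 matches expected 5".
+\end{verbatim}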
+ +In 2013, students tended to misuse the web application as a ``schema compiler'': they would submit their schema, fix only the first issue reported (rather than attempting to fix as many as possible), immediately re-submit, and so on. The system was single-threaded and thus not designed to handle concurrent requests, as we had not expected so high a request rate. Consequently the web application suffered from waits and timeouts. An initial workaround was to log all student access and warn those who were abusing the system, and we later enforced a short delay between attempts by the same student. Things improved in 2014 after we made efforts to manage student expectations, but longer term, the system should be redesigned to handle concurrent loads. Porting it to Java could make this easier. + +The system currently only semi-automates the grading process; it is still up to the teacher to interpret the test results and assign appropriate marks. We have already implemented the core functionality required to assign penalties to different kinds of error, and it would be a relatively simple extension to write these penalties directly into a student management database. + +The system currently does not support behavioral aspects such as sequences, triggers, or stored procedures, but it would not be difficult to add support for these, if desired. + +It would be interesting to extend our system into a larger interactive environment for supporting the teaching of SQL DDL, as has been previously done with several systems for SQL DML (e.g., \cite{Kenny.C-2005a-Automated,Kleiner.C-2013a-Automated,Mitrovic.A-1998a-Learning,Russell.G-2004a-Improving,Sadiq.S-2004a-SQLator}). + +From 2017, we no longer have a dedicated second year database course, and the main introduction to database concepts and SQL is now part of our first year ``Foundations of Information Systems'' course.\footnote{\url{http://www.otago.ac.nz/courses/papers/index.html?papercode=comp101}} This new course attracted about 160 students in first semester, and about 100 students in second semester. Classes of this size strengthen the argument for automated grading, and we are currently exploring whether our system could be used in this course in 2018. If this goes ahead, we will also carry out user evaluations with the students. \section{Conclusion} \label{sec-conclusion} +In this paper we have described a novel system that semi-automates the process of grading SQL data definition code. To our knowledge, all prior work has focused purely on data manipulation statements such as \texttt{SELECT}, so extending this to data definition code is the key contribution of our system. The system also takes a novel approach to checking the code: rather than attempting to parse the code itself, it instead verifies the behavior of the schema against a machine-readable specification. Finally, our system uses a unit testing approach to carry out the verification, which has not been used before in this context. + +While no user evaluations of the system were carried out, analysis of student performance in a relevant database assessment shows a statistically significant increase in mean grades in the two years that the system was available to students. Although we cannot be absolutely certain, analysis of all the factors involved strongly suggests that use of our system had a positive impact on student performance in this assessment. This is an encouraging result, which we will explore further in future work. + \bibliographystyle{ACM-Reference-Format} \bibliography{Koli_2017_Stanger}