Programming languages are designed to allow humans to instruct computers to perform automated tasks. The rules of syntax and semantics for the language give the programmer and the language implementation (e.g. compiler or interpreter) a common understanding of what code is valid and what it means, respectively. Unfortunately, a number of useful programming language features can give rise to injection vulnerabilities. We will provide a brief overview here, especially for the benefit of students with no background in programming.
It is useful to be aware that most language implementations scan program text from left to right. Syntax errors can be reported as soon as they are encountered, whereas other errors (run-time errors) may only be detected when the (syntactically valid) code is executed.
Take for example the following hypothetical code:
    # Print a greeting to the user:
    name = read();
    greeting = "Hello, $name";
    print(greeting);
The first line is a comment, to assist human readers in understanding the purpose or function of the code, and to be ignored by the compiler/interpreter.
The remaining lines show three statements to be run in succession. First, user input is read into a variable called "name". Then, the variable is combined with a greeting and stored in another variable named "greeting". Lastly, the contents of the "greeting" variable are printed for display.
Various special characters ("#", ";", "#", "(", ")") are used here to mark or delimit particular features of the code. These will be discussed in more detail in the following subsections.
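For readers who would like to run something similar, the hypothetical example can be approximated in Python. This is only a sketch: it assumes Python's standard string.Template class as a stand-in for the "$name" substitution, the function name make_greeting is our own invention, and the call to read() is replaced by a function argument so that the code can run without user interaction.

```python
from string import Template


def make_greeting(name: str) -> str:
    """Combine a name with a fixed greeting (hypothetical helper)."""
    # "$name" marks where the value of the variable should be substituted.
    greeting = Template("Hello, $name")
    return greeting.substitute(name=name)


# In the original example the name would come from user input via read().
print(make_greeting("Ada"))  # prints "Hello, Ada"
```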
In general terms, delimiters are special characters in a programming language that serve to denote boundaries within code. Common types of boundary include the start and end of string literals, blocks of code, parameter lists, and statements.
String literals (character strings) are one of the most common types of delimited code. In many languages, these are delimited using pairs of quotation marks, with the text between being treated verbatim by the programming language, much like quoted speech in written English. As the language compiler/interpreter is reading the input, after it encounters a string delimiter, it will treat the following text as literal character data, looking for the closing delimiter. If it fails to find one before the input ends, it will raise a syntax error.
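The effect of a missing closing delimiter can be observed in Python, which reports an unterminated string as a syntax error. The sketch below uses the built-in compile() function to parse text without running it; the helper name is_valid_source is an assumption made for illustration.

```python
def is_valid_source(source: str) -> bool:
    """Return True if Python accepts `source` as syntactically valid."""
    try:
        # compile() parses the text without running it.
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(is_valid_source('greeting = "Hello"'))  # True: the string is closed
print(is_valid_source('greeting = "Hello'))   # False: closing quote is missing
```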
In some languages, single- (') and double- (") quotation marks have different functions. For example, in C and related languages, double-quotes delimit strings, and single-quotes delimit individual character values. In other languages (such as the Linux command shell), single-quoted strings are treated strictly literally, while double-quoted strings may be subject to further processing such as the substitution of values for variables.
Other language elements may have delimiters too, such as blocks of code within "{" and "}", lists of function parameters/arguments within "(" and ")", true/false conditions within "(" and ")", and vector/array/list indexes within "[" and "]". Using an asymmetric pair of related characters makes it easy to tell an opening delimiter from a closing one, which helps readability.
While paired delimiters are often used to mark regions of a certain type, single-character delimiters are also common. For example, statements in many languages are terminated by semicolons (";").
A similar lexical element in some languages is the separator. These occur between language elements of a certain type, as opposed to after them as with statement terminators. An example is the comma used in SQL to separate items in a list, as in:
insert into Student (Student_ID, Name, Address) values (123, 'MORRIS, Horace', '99 Some St');
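The difference between a separator and a terminator can be sketched in Python by building both kinds of text from lists. The list contents below are taken from the examples in this section; the variable names are our own.

```python
columns = ["Student_ID", "Name", "Address"]
statements = ["name = read()", "print(name)"]

# A separator goes between items, so the last item has none after it.
column_list = ", ".join(columns)

# A terminator goes after every item, including the last.
program = "".join(statement + ";" for statement in statements)

print(column_list)  # Student_ID, Name, Address
print(program)      # name = read();print(name);
```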
Some programming languages provide special "escape" characters, which serve to change the meaning of the character(s) following. A common example is the backslash ("\"), which in many languages can be used to denote special characters such as tabs ("\t") and line-breaks ("\n") without having to use those characters literally ("\\" can be used to express a single literal backslash).
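Python follows the backslash convention described above, so it can be used to check what each escape sequence actually produces. The sketch also shows Python's raw-string prefix (r"..."), a feature of that language which switches escape processing off.

```python
tab = "\t"          # one tab character
newline = "\n"      # one line-feed character
backslash = "\\"    # one literal backslash

print(len(tab), len(newline), len(backslash))  # 1 1 1

# A raw string (r"...") turns escape processing off, so the backslash and
# the letter "t" remain two separate characters.
raw = r"\t"
print(len(raw))  # 2
```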
Escape characters can also be used to indicate that a variable or other expression should be evaluated rather than simply being treated literally. The "$" sign is often used for this purpose, although ":", "@" and "%" are also encountered.
Computer programs are often expressed as collections of statements, where each statement represents a discrete instruction or command being given.
In the hypothetical example above, semicolon characters (";") are used to mark the end of a statement, much like a full stop (".") in written English. In fact, some computer languages do use full stops for this purpose, though semicolons (C, C++, Java, C#, Pascal) are more common in modern languages. In other languages (e.g. Python, Tcl), statements may be terminated simply by the end of a line (ASCII Carriage Return and/or Line Feed characters).
In most programming languages, a sequential flow of control is the default, meaning that statements are run one after the other, in order. In most languages, changing the order will change the behaviour of the program.
Program code can be dense, complex, and at times cryptic. The design rationale and development process that resulted in the code will often not be apparent from the final code. For this reason, most programming languages allow comments to be inserted into the code for the benefit of other programmers. The computer will ignore the comments, either treating them as a "no-op", skipping over them, or removing them from further processing.
Comments can often be written in single-line or multi-line (block) form. Single-line comments take effect until the end of the line (i.e. until the next line break character), and may be introduced by delimiter characters such as "#" (Python, Tcl, Unix shell), "//" (C and friends), and "--" (SQL, Lua).
Multi-line comment syntax often uses asymmetric delimiters, as in C-like languages ("/*" marks the start, and "*/" marks the end).
Restrictions may exist on nesting comments within other comments (e.g. a single-line comment within a block comment).
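One way to confirm that comments are ignored is to compare how a language parses code with and without them. The Python sketch below does this with the standard ast module; it covers only Python's single-line "#" comments, and the helper name same_program is an assumption. It also shows that a comment character inside a string literal is treated as ordinary text.

```python
import ast


def same_program(a: str, b: str) -> bool:
    """Compare two pieces of source code by their parsed structure."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


# The comment is discarded, so both versions parse to the same program.
print(same_program("x = 1  # set the counter", "x = 1"))  # True

# Inside a string literal, "#" is ordinary character data, not a comment.
print(same_program('x = "a # b"', 'x = "a "'))  # False
```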
Another class of special character is used when performing an inexact match on a string or filename. Instead of having to specify an exhaustive list of filenames, a wildcard character (such as "*") can be used instead (much like a "wild card" in certain card games, or the blank letter tiles in Scrabble). The wildcard can function either as a generic placeholder, or it can be expanded by the system to a list of matching names.
For example, the following shell command will remove all the (non-hidden) files within the current working directory:
rm *
Some languages distinguish between single-character and multi-character wildcards. For example, SQL's LIKE expression uses "_" to match any single character and "%" to match any string of any length (including the empty string). Similarly, wildcards in the style of MS-DOS support "?" for single character matches and "*" for any string.
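Both wildcard styles can be tried out from Python. The sketch below assumes the standard fnmatch module for "?" and "*" matching, and an in-memory SQLite database (via the standard sqlite3 module) for LIKE patterns. The filenames and table rows are made-up sample data, and the helper name like is our own.

```python
import sqlite3
from fnmatch import fnmatchcase

names = ["notes.txt", "notes1.txt", "notes10.txt", "image.png"]

# "?" matches exactly one character; "*" matches any run of characters.
single = [n for n in names if fnmatchcase(n, "notes?.txt")]
multi = [n for n in names if fnmatchcase(n, "*.txt")]
print(single)  # ['notes1.txt']
print(multi)   # ['notes.txt', 'notes1.txt', 'notes10.txt']

# The same idea in SQL, using a throwaway in-memory SQLite database.
db = sqlite3.connect(":memory:")
db.execute("create table Product (Name text)")
db.execute("insert into Product values "
           "('calculator'), ('chocolate'), ('dark chocolate bar')")


def like(sql):
    """Run a query and return the first column of each row as a list."""
    return [row[0] for row in db.execute(sql)]


# "%" matches any string, including the empty string.
print(like("select Name from Product "
           "where Name like '%chocolate%' order by Name"))
# ['chocolate', 'dark chocolate bar']

# "_" matches exactly one character.
print(like("select Name from Product where Name like 'c_lculator'"))
# ['calculator']
```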
For even more powerful string pattern matching, the regular expression language goes far beyond the capabilities of basic wildcards.
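As a small illustration of the extra power, a regular expression can require "one or more digits", which the basic wildcards above cannot express. The sketch uses Python's standard re module, with made-up filenames as sample data.

```python
import re

names = ["notes1.txt", "notes10.txt", "notesA.txt", "notes.txt"]

# "notes", then one or more digits, then a literal ".txt".
# The backslash escapes the dot, which would otherwise match any character.
pattern = re.compile(r"notes\d+\.txt")

matches = [n for n in names if pattern.fullmatch(n)]
print(matches)  # ['notes1.txt', 'notes10.txt']
```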
String concatenation is the process of combining multiple character string values into one, e.g. "spider" + "web" = "spiderweb".
Again, the syntax varies, but some common ways to perform concatenation in various languages are:
- "+" operator (Java, Python) (usually overloaded with numeric addition)
- "||" operator (SQL) (not to be confused with the logical OR operation in C-like languages)
- "&" operator (Visual Basic, Ada)
- CONCAT() function
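Two of these forms can be compared side by side from Python: the "+" operator in Python itself, and the SQL "||" operator as implemented by SQLite. The use of an in-memory SQLite database is an assumption made so that the sketch is self-contained.

```python
import sqlite3

# Python: "+" joins strings, and is also numeric addition.
word = "spider" + "web"
total = 1 + 2
print(word, total)  # spiderweb 3

# SQL: "||" joins strings (run here via an in-memory SQLite database).
db = sqlite3.connect(":memory:")
sql_word = db.execute("select 'spider' || 'web'").fetchone()[0]
print(sql_word)  # spiderweb
```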
Some programs generate code to be run, often in a different language. One of the most common examples of this is the use of embedded SQL database access code within a host language such as Java. The code may be parameterised: for example, a search string might be substituted into an SQL statement in order to perform the desired search.
The naive way to combine the parameter value would be to concatenate a variable's value with the literal text of the search query. Suppose we were dealing with a search query of the following form:
select Name, Price from Product where Name = 'calculator';
To allow searching for any product, we would need to generalise the product name, i.e. the "calculator" string, using a variable in place of the literal string.
Often these sorts of queries will also make use of wildcard pattern matching, e.g.
select Name, Price from Product where Name like '%chocolate%';
When embedded in a host program (in Java, in this case), the SQL command itself might have to be expressed as a character string:
String sql = "select Name, Price from Product where Name = '" + productName + "'";
Note also the tricky quoting: SQL uses single-quotes to delimit strings, and Java uses double-quotes. Three values are being concatenated: the main part of the query, the product name variable, and a literal single-quote to terminate the SQL string.
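The same concatenation can be sketched in Python to see the SQL text that the host program actually produces. The function name build_query, the in-memory SQLite database, and the sample row (including its price) are all assumptions made for illustration. The sketch shows only the mechanics of naive concatenation and is not a recommended way to build queries.

```python
import sqlite3


def build_query(product_name: str) -> str:
    """Naively splice a value into SQL text (hypothetical helper)."""
    return ("select Name, Price from Product where Name = '"
            + product_name + "'")


sql = build_query("calculator")
print(sql)  # select Name, Price from Product where Name = 'calculator'

# Run the generated text against a throwaway in-memory database.
db = sqlite3.connect(":memory:")
db.execute("create table Product (Name text, Price real)")
db.execute("insert into Product values ('calculator', 19.95)")
print(db.execute(sql).fetchall())  # [('calculator', 19.95)]
```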
This approach is highly vulnerable to injection attacks, as we will see.