Monday, June 1, 2009

Sql parser issues that i had.

For the past few weeks, I have been trying to get the sql parser working. But the main issue I had was implementing the expression parsing logic in Javacc grammar. I had trouble with defining the non-terminals and was running into a lot of left recursion issues in Javacc. I finally moved all the logic to the Java classes and the parser simply passes all the tokens to the ExpressionBuilder class that I have written. This simplies the implementation quite a bit since I am not as well versed in Javacc as Java. 
I will post the SQL Parser shortly on Blitz DB page.

Wednesday, April 8, 2009

Anatomy of a "Select" Statement

One of the most common DML statements that is used in SQL world is a "Select - From - Where" statements. These statements consists of three different parts - Select Clause, From Clause and Where Clause. Every Select statement has two mandatory parts - Select and From clause and the Where Clause is optional. The Select statement returns a table as a result in the form of Resultset.

Select clause:
SELECT clause is the set of instructions to the SQL Engine pertaining to the Projection operation. The atoms specified here are projected in the resultset. Select clause consists of the Select keyword which marks the start of the Select Clause. The Select keyword can then be followed by DISTINCT or ALL keyword (these keywords are optional and if DISTINCT or ALL is not specified, default ALL is assumed by the SQL engine). DISTINCT specifies that no duplicate row should be returned whereas ALL specifies that even duplicate rows should be returned. After this is the list of selectable atoms and we call it as the "select-list". Any common SQL Expression can be used in place of a selectable atom. For eg., Common SQL expression types are
1. column reference
2. An expression that can contain a sub query
3. A constant value - either numeric or character

Each selection list item can be optionally followed by a "AS" keyword and an alias name for the selection item.
For eg., Select col1 as C1 from table1
In this example, C1 is the alias for the col1.

From Clause:
From clause is the set of table valued expressions from which the data is to be processed. From clause consists of the FROM keyword followed by the list of tables or table expression. The table expression that can be used here are subqueries that return a table as a result or a reference to a table in the database. Again the table expression can be aliased using AS keyword or by simply providing the alias name next to the table reference.
Eg.,
Select A.Col1 from Table_1 as A
The same query can also be written as
Select A.Col1 from Table_1 A

Where Clause:
Where Clause specifies a set of conditions that needs to be met by each of the resulting tuple. Where Clause contains the filtering condition that is a boolean expression which evaluates to either true or false. These boolean operations are typically relational operators like greater than, less than, equal to, etc. In addition to these operations, other clauses like IN clause and BETWEEN clause can also be used in the Where clause section of the SQL statement.

Typically the filter conditions are specified with the field name on the right hand side followed by the operation then followed by the other operand.
For eg.,
col1 != 5 is a valid condition.

There are other ways of writing filter condition too especially for the operations such as LIKE, BETWEEN, IN, etc.
Eg.,
abc BETWEEN '1' and '10'
abc LIKE "%1"

For IN clause, you can provide a list of values inside the values section.
Eg.,
abc IN ('1', '2', '3','4')
These list of values can also come from a subquery.
abc IN (Select a from tbl1 where a != 5)

Saturday, March 28, 2009

Implementing a SQL Parser using JavaCC - 3

The first sql grammar that we will write is a 'create table' statement. Lets first start by writing the methods which parses different types of sql statements like select, insert, delete, update, etc.

The key thing is the BNF production declaration is on the right hand side of the colon ':'. The BNF production declaration format is :
<Java method return type> JavaMethodName(Java Method Parameters) :
{
//Any variable used in the method is declared here.
}
{
// this section is the method body. Any java code should be surrounded by {}
}

Using the above format lets write a BNF production which looks for different types of sql statements like 'create' and 'select'

List<SQLExpression>parseInput():{
List<SQLExpression>sqlExprs = new LinkedList<SQLExpression>();
SQLExpression sqlExpr;
}
{
(sqlExpr = createSQLExpression(){
sqlExprs.add(sqlExpr);
}[<SEMICOLON>])+<EOF>
{
return sqlExprs;
}
}

In the above code snippet, the name of the BNF production is parseInput. The method doesn't have any input parameters and the return type is a list of SQLExpression objects.
The first {} block contains the declaration of all the variables used in the method.
The second {} block is the Method body. Inside the method body, you can call any other BNF production and also can make use any of the defined tokens like , , etc. You can also use various regular expressions and BNF production calls directly. createSQLExpression() is actually a BNF production call. The control is handed over to the BNF production that is called.

Any java code should be enclosed between {}. So, sqlExprs.add(sqlExpr); is written inside {} block. Remember this should be a valid Java statement which would compile without any problem. All these BNF productions are converted into Java methods and are added to the Parser file that we declared between the PARSER_BEGIN and PARSER_END section of the grammar file. Therefore, any global variable that is declared in the Java Compilation unit can be used here.

For grouping regular expressions, we can use (). This is seen here:
(sqlExpr = createSQLExpression(){
sqlExprs.add(sqlExpr);
}[<SEMICOLON>])+

Here the call to the createSQLExpression BNF production is grouped with token and this combination is allowed one or more repetitions which is specified by '+' after the grouping.

The whole grammar file which can parse create table statement is given below. Compile the below .jj file and copy it into appropriate java package and the supporting classes like SQLExpression, BlitzSQLParserException, CreateAtom, etc should also be provided for proper working of the parser. Disclaimer: Few of the helper BNF production methods used here are taken from Axion sql grammar file.
options{
IGNORE_CASE = true;
STATIC = false;
UNICODE_INPUT = true;
// some performance optimizations
OPTIMIZE_TOKEN_MANAGER = true;
ERROR_REPORTING = false;
}
PARSER_BEGIN(BlitzSQLParser)
package net.java.blitz.db.sql.parser;
import java.util.List;
import java.util.LinkedList;
import net.java.blitz.db.sql.expression.*;
/**
* @author karthikeyan subramanian
* Do not edit this file directly. This file is generated from BlitzSQLGrammar.jj
*
*/
public class BlitzSQLParser{}PARSER_END(BlitzSQLParser)
// ----------------------------------------------------------------------------
// TOKENS
// ----------------------------------------------------------------------------
SKIP:{
" "
| "\n"
| "\r"
| "\t"
}
SKIP:{
<LINE_COMMENT:"--"(~["\n", "\r"])*("\n"
| "\r"
| "\r\n")>
}
SKIP:{
<BLOCK_COMMENT:"/*"(~["*"])*"*"("*"
| (~["*", "/"](~["*"])*"*"))*"/">
}
TOKEN:// KEYWORDS
{
<ADD:"add">
| <ALL:"all">
| <ALTER:"alter">
| <ALWAYS:"always">
| <AND:"and">
| <AS:"as">
| <ASC:"asc">
| <BEGIN:"begin">
| <BETWEEN:"between">
| <BOTH:"both">
| <BY:"by">
| <CAST:"cast">
| <CASCADE:"cascade">
| <CASE:"case">
| <CHECK:"check">
| <CREATE:"create">
| <COLUMN:"column">
| <CONSTRAINT:"constraint">
| <CURRENT_TIMESTAMP:"current_timestamp">
| <CURRENT_DATE:"current_date">
| <CURRENT_TIME:"current_time">
| <CYCLE:"cycle">
| <DATABASE:"database">
| <DATA:"data">
| <DAY:"day">
| <DEFAULT_:"default">
| <DEFERRED:"deferred">
| <DEFERRABLE:"deferrable">
| <DEFRAG:"defrag">
| <DELETE:"delete">
| <DESC:"desc">
| <DISTINCT:"distinct">
| <DROP:"drop">
| <ELSE:"else">
| <END:"end">
| <ESCAPE:"escape">
| <EXCEPTION:"exception">
| <EXISTS:"exists">
| <EXPLAIN:"explain">
| <EXTRACT:"extract">
| <EXTERNAL:"external">
| <FALSE:"false">
| <FIRST:"first">
| <FOR:"for">
| <FOREIGN:"foreign">
| <FROM:"from">
| <FULL:"full">
| <GENERATED:"generated">
| <GROUP:"group">
| <HAVING:"having">
| <HOUR:"hour">
| <IDENTITY:"identity">
| <IF:"if">
| <INCREMENT:"increment">
| <IMMEDIATE:"immediate">
| <IN:"in">
| <INITIALLY:"initially">
| <INDEX:"index">
| <INNER:"inner">
| <INSERT:"insert">
| <INTO:"into">
| <IS:"is">
| <JOIN:"join">
| <KEY:"key">
| <LEADING:"leading">
| <LEFT:"left">
| <LIKE:"like">
| <LIMIT:"limit">
| <LINK:"link">
| <MAXVALUE:"maxvalue">
| <MATCHED:"matched">
| <MERGE:"merge">
| <MINUTE:"minute">
| <MINVALUE:"minvalue">
| <MILLISECOND:"millisecond">
| <MONTH:"month">
| <NEXT:"next">
| <NO:"no">
| <NOT:"not">
| <NULL:"null">
| <OFFSET:"offset">
| <ON:"on">
| <OR:"or">
| <ORDER:"order">
| <ORGANIZATION:"organization">
| <OUTER:"outer">
| <POSITION:"position">
| <PRIMARY:"primary">
| <QUARTER:"quarter">
| <RIGHT:"right">
| <REFERENCES:"references">
| <RENAME:"rename">
| <RESTART:"restart">
| <SECOND:"second">
| <SELECT:"select">
| <SEQUENCE:"sequence">
| <SET:"set">
| <SOUNDS:"sounds">
| <START:"start">
| <SUBSTRING:"substring">
| <SYSDATE:"sysdate">
| <TABLE:"table">
| <THEN:"then">
| <TO:"to">
| <TRAILING:"trailing">
| <TRIM:"trim">
| <TRUE:"true">
| <TRUNCATE:"truncate">
| <UNIQUE:"unique">
| <UPDATE:"update">
| <UPSERT:"upsert">
| <USER:"user">
| <USING:"using">
| <VALUES:"values">
| <VALUE:"value">
| <VIEW:"view">
| <WEEK:"week">
| <WHEN:"when">
| <WHERE:"where">
| <WITH:"with">
| <YEAR:"year">
}
TOKEN:// DATA TYPES
{
<BIT:"bit">
| <BYTE:"byte">
| <INT:"int">
| <REAL:"real">
| <CLOB:"clob">
| <BLOB:"blob">
| <CHAR:"char">
| <CHARACTER:"character">
| <DATE:"date">
| <TIME:"time">
| <FLOAT:"float">
| <BIGINT:"bigint">
| <LONG:"long">
| <RAW:"raw">
| <STRING:"string">
| <BINARY:"binary">
| <NUMERIC:"numeric">
| <NUMBER:"number">
| <DECIMAL:"decimal">
| <DEC:"dec">
| <BOOLEAN:"boolean">
| <TINYINT:"tinyint">
| <INTEGER:"integer">
| <VARCHAR:"varchar">
| <VARCHAR2:"varchar2">
| <LONGVARCHAR:"longvarchar">
| <TEXT:"text">
| <SMALLINT:"smallint">
| <SHORT:"short">
| <VARBINARY:"varbinary">
| <LONGVARBINARY:"longvarbinary">
| <IMAGE:"image">
| <VARYING:"varying">
| <LARGE:"large">
| <TIMESTAMP:"timestamp">
| <OBJECT:"object">
| <JAVA_OBJECT:"java_object">
| <DOUBLE:"double">
}
TOKEN:// LITERALS
{
<INTEGER_LITERAL:(["0"-"9"])+>
| <FLOATING_POINT_LITERAL:(["0"-"9"])+"."(["0"-"9"])+(<EXPONENT>)?
| "."(["0"-"9"])+(<EXPONENT>)?
| (["0"-"9"])+<EXPONENT>
| (["0"-"9"])+(<EXPONENT>)?>
| <#EXPONENT:["e", "E"](["+", "-"])?(["0"-"9"])+>
| <STRING_LITERAL:"'"(~["'"])*("''"(~["'"])*)*"'">
}
TOKEN:// IDENTIFIERS
{
<ID:(<LETTER>)+("_"
| "$"
| "#"
| <DIGIT>
| <LETTER>)*>
| <#LETTER:["A"-"Z", "a"-"z"]>
| <#DIGIT:["0"-"9"]>
}
TOKEN:// SEPARATORS AND OPERATORS
{
<ASSIGN:":=">
| <COMMA:",">
| <CONCAT:"||">
| <SEMICOLON:";">
| <DOT:".">
| <LESS:"<">
| <LESSEQUAL:"<=">
| <GREATER:">">
| <GREATEREQUAL:">=">
| <EQUAL:"=">
| <NOTEQUAL:"!=">
| <NOTEQUAL2:"<>">
| <JOINPLUS:"(+)">
| <OPENPAREN:"(">
| <CLOSEPAREN:")">
| <ASTERISK:"*">
| <SLASH:"/">
| <PLUS:"+">
| <MINUS:"-">
| <QUESTIONMARK:"?">
}
TOKEN:// START QUOTED IDENTIFIER
{
<START_QUOTED_IDENTIFIER:"\"">:STATE_QuotedIdentStart
}
<STATE_QuotedIdentStart>TOKEN:{
<QUOTED_IDENTIFIER:<ID>>:STATE_QuotedIdentEnd
}
<STATE_QuotedIdentEnd>TOKEN:// IDENTIFIER ESCAPE CHAR
{
<END_QUOTED_IDENTIFIER:"\"">:DEFAULT
}
// --------------------------------------------------------------
// SQL Parsing methods.
// --------------------------------------------------------------
// --------------------------------------------------------------
// Call this method to create a List of SQLExpression from the
// given input.
// --------------------------------------------------------------
List<SQLExpression>parseInput():{
List<SQLExpression>sqlExprs = new LinkedList<SQLExpression>();
SQLExpression sqlExpr;
}
{
(sqlExpr = createSQLExpression(){
sqlExprs.add(sqlExpr);
}[<SEMICOLON>])+<EOF>
{
return sqlExprs;
}
}
SQLExpression createSQLExpression():{
SQLExpression sqlExpr = new SQLExpression();
}
{
(parseSelectStatement()
| parseCreateStatement()){
return sqlExpr;
}
}
// --------------------------------------------------------------
// Helper methods
// --------------------------------------------------------------
String SqlIdentifier():{
Token t = null;
}
{
t = SqlQuotedId(){
return t.image;
}
}
Token SqlQuotedId():{
Token t = null;
}
{
(t = <ID>
| <START_QUOTED_IDENTIFIER>t = <QUOTED_IDENTIFIER><END_QUOTED_IDENTIFIER>){
return t;
}
}
Object[]SqlColumnDef():{
Object[]tuple = new Object[6];
}
{
tuple[0] = SqlIdentifier()(SqlExactNumericType(tuple)
| (LOOKAHEAD(2)SqlCharStringType(tuple)
| SqlBinaryStringType(tuple)
| SqlApproximateNumericType(tuple)
| SqlBooleanType(tuple)
| SqlDataTimeType(tuple))) {
return tuple;
}
}
void SqlCharStringType(Object[]tuple):{}{
LOOKAHEAD(2)((<CHAR>
| <CHARACTER>){
tuple[1] = "char";
tuple[2] = "1";
}
[SqlCharLength(tuple)])
| LOOKAHEAD(2)((<CHAR><VARYING>
| <CHARACTER><VARYING>
| <VARCHAR>
| <VARCHAR2>
| <STRING>){
tuple[1] = "varchar";
tuple[2] = "1";
}
[SqlCharLength(tuple)])
| ((LOOKAHEAD(2)<LONG>
| <LONG><VARCHAR>
| <LONGVARCHAR>
| <TEXT>){
tuple[1] = "varchar";
tuple[2] = ""+Integer.MAX_VALUE;
}
)
| ((<CHAR><LARGE><OBJECT>
| <CHARACTER><LARGE><OBJECT>
| <CLOB>){
tuple[1] = "clob";
}
[SqlPrecision(tuple)])
}
void SqlBinaryStringType(Object[]tuple):{}{
(<BYTE>{
tuple[1] = "byte";
}
[SqlPrecision(tuple)])
| LOOKAHEAD(2)((<BINARY>
| <VARBINARY>
| <RAW>){
tuple[1] = "varbinary";
}
[SqlPrecision(tuple)])
| ((LOOKAHEAD(2)<LONG><VARBINARY>
| <LONGVARBINARY>
| <LONG><RAW>
| <IMAGE>){
tuple[1] = "varbinary";
tuple[2] = ""+Integer.MAX_VALUE;
}
[SqlCharLength(tuple)])
| ((<BINARY><LARGE><OBJECT>
| <BLOB>){
tuple[1] = "blob";
}
[SqlPrecision(tuple)])
}
void SqlExactNumericType(Object[]tuple):{
Integer precision = new Integer(BigDecimalType.DEFAULT_PRECISION);
Integer scale = new Integer(BigDecimalType.DEFAULT_SCALE);
}
{
LOOKAHEAD(3)((<NUMERIC>
| <DECIMAL>
| <DEC>
| <NUMBER>){
tuple[1] = "numeric";
tuple[2] = precision.toString();
tuple[3] = scale.toString();
}
[SqlPrecisionAndScale(tuple)]{
precision = Integer.valueOf(tuple[2].toString());
scale = Integer.valueOf(tuple[3].toString());
if (scale.compareTo(precision)>0){
throw new BlitzParserException("scale value exceeds precision value");
}
else if (precision.compareTo(new Integer(BigDecimalType.MAX_PRECISION))>0){
throw new BlitzParserException("precision value exceeds maximum ("+BigDecimalType.MAX_PRECISION+")");
}
}
)
| ((<INTEGER>
| <INT>){
tuple[1] = "integer";
}
)
| ((<SMALLINT>
| <SHORT>){
tuple[1] = "short";
}
)
| ((<BIGINT>){
tuple[1] = "bigint";
}
)
}
void SqlApproximateNumericType(Object[]tuple):{}{
LOOKAHEAD(3)(<FLOAT>{
tuple[1] = "float";
}
[SqlPrecision(tuple)])
| (<REAL>{
tuple[1] = "float";
}
)
| (<DOUBLE>{
tuple[1] = "double";
}
)
}
void SqlBooleanType(Object[]tuple):{}{
((<BOOLEAN>
| <BIT>){
tuple[1] = "boolean";
}
)
}
void SqlDataTimeType(Object[]tuple):{}{
(<DATE>{
tuple[1] = "date";
}
)
| (<TIME>{
tuple[1] = "time";
}
[SqlPrecision(tuple)])
| (<TIMESTAMP>{
tuple[1] = "timestamp";
}
[SqlPrecision(tuple)])
}
void SqlCharLength(Object[]tuple):{}{
// If length is omitted, then a length of 1 (one) is implicit.
<OPENPAREN>tuple[2] = SqlPositiveInteger()<CLOSEPAREN>
}
void SqlPrecisionAndScale(Object[]tuple):{}{
<OPENPAREN>tuple[2] = SqlPositiveInteger()(<COMMA>tuple[3] = SqlUnsignedInteger())*<CLOSEPAREN>
}
void SqlPrecision(Object[]tuple):{}{
<OPENPAREN>tuple[2] = SqlUnsignedInteger()<CLOSEPAREN>
}
String SqlPositiveInteger():{
String valStr = null;
}
{
valStr = SqlUnsignedInteger(){
int val = (valStr != null)?Integer.valueOf(valStr).intValue():-1;
if (val == 0){
throw new BlitzParserException("value must be a positive integer");
}
return valStr;
}
}
String SqlUnsignedInteger():{
Token t = null;
}
{
t = <INTEGER_LITERAL>{
if (t != null){
return t.image;
}
return null;
}
}
CreateAtom createTable():{
String tblName;
String schemaName;
CreateAtom createAtom = new CreateAtom();
Object[] obj;
}
{
<CREATE><TABLE>[SqlIdentifier()<DOT>{
schemaName = token.image;
}
]SqlIdentifier(){
tblName = token.image;
}
[<IF><NOT><EXISTS>{
createAtom.setIfNotExists(true);
}]
<OPENPAREN>[(obj = SqlColumnDef(){
createAtom.addColumnDefinition(new ColumnDefinition(obj));
}
)](<COMMA>obj = SqlColumnDef(){
createAtom.addColumnDefinition(new ColumnDefinition(obj));
}
)*<CLOSEPAREN>
}

Friday, March 27, 2009

Implementing a SQL Parser using JavaCC - 2

Lets start with a JavaCC Grammar file for parsing the SQL queries. JavaCC grammar file for SQL Parsing is divided into various sections and they are as given below:
1. Options section
This is the first and foremost section of the JavaCC grammar file. This section contains various JavaCC configuration options.
Eg:
options {
IGNORE_CASE = true;
STATIC = false;
UNICODE_INPUT = true;

// some performance optimizations
ERROR_REPORTING = false;
}
2. The next section is the block that is surrounded by PARSER_BEGIN(parserName) and PARSER_END(parserName). In this block, a java compilation unit can be inserted. In the simplest case, it can be a very simple class declaration without any method or variable inside it. The package declaration used in the section will be used in all the JavaCC generated files like parser, token manager, token manager constants, char stream, etc.
 PARSER_BEGIN(BlitzSQLParser)
package net.java.blitz.db.sql.parser;
import java.util.List;
import java.util.LinkedList;
import net.java.blitz.db.sql.expression.*;
/**
* @author karthikeyan subramanian
* Do not edit this file directly. This file is generated from BlitzSQLGrammar.jj
*
*/
public class BlitzSQLParser{
}
PARSER_END(BlitzSQLParser)
3. This section is then followed by the set of regular expression production that define how the parser should behave when a match is found. Four special regular expressions are available for this purpose. Now we define our regular expressions here. (Disclosure: Lot of SQL related tokens are borrowed from AxionSQLParser from Axion.) Not all these tokens are used in this SQL Parser. They can be useful for future use.
// ----------------------------------------------------------------------------
// TOKENS
// ----------------------------------------------------------------------------
SKIP:{
" "
| "\n"
| "\r"
| "\t"
}
SKIP:{
<LINE_COMMENT:"--"(~["\n", "\r"])*("\n"
| "\r"
| "\r\n")>
}
SKIP:{
<BLOCK_COMMENT:"/*"(~["*"])*"*"("*"
| (~["*", "/"](~["*"])*"*"))*"/">
}
TOKEN:// KEYWORDS
{
<ADD:"add">
| <ALL:"all">
| <ALTER:"alter">
| <ALWAYS:"always">
| <AND:"and">
| <AS:"as">
| <ASC:"asc">
| <BEGIN:"begin">
| <BETWEEN:"between">
| <BOTH:"both">
| <BY:"by">
| <CAST:"cast">
| <CASCADE:"cascade">
| <CASE:"case">
| <CHECK:"check">
| <CREATE:"create">
| <COLUMN:"column">
| <CONSTRAINT:"constraint">
| <CURRENT_TIMESTAMP:"current_timestamp">
| <CURRENT_DATE:"current_date">
| <CURRENT_TIME:"current_time">
| <CYCLE:"cycle">
| <DATABASE:"database">
| <DATA:"data">
| <DAY:"day">
| <DEFAULT_:"default">
| <DEFERRED:"deferred">
| <DEFERRABLE:"deferrable">
| <DEFRAG:"defrag">
| <DELETE:"delete">
| <DESC:"desc">
| <DISTINCT:"distinct">
| <DROP:"drop">
| <ELSE:"else">
| <END:"end">
| <ESCAPE:"escape">
| <EXCEPTION:"exception">
| <EXISTS:"exists">
| <EXPLAIN:"explain">
| <EXTRACT:"extract">
| <EXTERNAL:"external">
| <FALSE:"false">
| <FIRST:"first">
| <FOR:"for">
| <FOREIGN:"foreign">
| <FROM:"from">
| <FULL:"full">
| <GENERATED:"generated">
| <GROUP:"group">
| <HAVING:"having">
| <HOUR:"hour">
| <IDENTITY:"identity">
| <IF:"if">
| <INCREMENT:"increment">
| <IMMEDIATE:"immediate">
| <IN:"in">
| <INITIALLY:"initially">
| <INDEX:"index">
| <INNER:"inner">
| <INSERT:"insert">
| <INTO:"into">
| <IS:"is">
| <JOIN:"join">
| <KEY:"key">
| <LEADING:"leading">
| <LEFT:"left">
| <LIKE:"like">
| <LIMIT:"limit">
| <LINK:"link">
| <MAXVALUE:"maxvalue">
| <MATCHED:"matched">
| <MERGE:"merge">
| <MINUTE:"minute">
| <MINVALUE:"minvalue">
| <MILLISECOND:"millisecond">
| <MONTH:"month">
| <NEXT:"next">
| <NO:"no">
| <NOT:"not">
| <NULL:"null">
| <OFFSET:"offset">
| <ON:"on">
| <OR:"or">
| <ORDER:"order">
| <ORGANIZATION:"organization">
| <OUTER:"outer">
| <POSITION:"position">
| <PRIMARY:"primary">
| <QUARTER:"quarter">
| <RIGHT:"right">
| <REFERENCES:"references">
| <RENAME:"rename">
| <RESTART:"restart">
| <SECOND:"second">
| <SELECT:"select">
| <SEQUENCE:"sequence">
| <SET:"set">
| <SOUNDS:"sounds">
| <START:"start">
| <SUBSTRING:"substring">
| <SYSDATE:"sysdate">
| <TABLE:"table">
| <THEN:"then">
| <TO:"to">
| <TRAILING:"trailing">
| <TRIM:"trim">
| <TRUE:"true">
| <TRUNCATE:"truncate">
| <UNIQUE:"unique">
| <UPDATE:"update">
| <UPSERT:"upsert">
| <USER:"user">
| <USING:"using">
| <VALUES:"values">
| <VALUE:"value">
| <VIEW:"view">
| <WEEK:"week">
| <WHEN:"when">
| <WHERE:"where">
| <WITH:"with">
| <YEAR:"year">
}
TOKEN:// DATA TYPES
{
<BIT:"bit">
| <BYTE:"byte">
| <INT:"int">
| <REAL:"real">
| <CLOB:"clob">
| <BLOB:"blob">
| <CHAR:"char">
| <CHARACTER:"character">
| <DATE:"date">
| <TIME:"time">
| <FLOAT:"float">
| <BIGINT:"bigint">
| <LONG:"long">
| <RAW:"raw">
| <STRING:"string">
| <BINARY:"binary">
| <NUMERIC:"numeric">
| <NUMBER:"number">
| <DECIMAL:"decimal">
| <DEC:"dec">
| <BOOLEAN:"boolean">
| <TINYINT:"tinyint">
| <INTEGER:"integer">
| <VARCHAR:"varchar">
| <VARCHAR2:"varchar2">
| <LONGVARCHAR:"longvarchar">
| <TEXT:"text">
| <SMALLINT:"smallint">
| <SHORT:"short">
| <VARBINARY:"varbinary">
| <LONGVARBINARY:"longvarbinary">
| <IMAGE:"image">
| <VARYING:"varying">
| <LARGE:"large">
| <TIMESTAMP:"timestamp">
| <OBJECT:"object">
| <JAVA_OBJECT:"java_object">
| <DOUBLE:"double">
}
TOKEN:// LITERALS
{
<INTEGER_LITERAL:(["0"-"9"])+>
| <FLOATING_POINT_LITERAL:(["0"-"9"])+"."(["0"-"9"])+(<EXPONENT>)?
| "."(["0"-"9"])+(<EXPONENT>)?
| (["0"-"9"])+<EXPONENT>
| (["0"-"9"])+(<EXPONENT>)?>
| <#EXPONENT:["e", "E"](["+", "-"])?(["0"-"9"])+>
| <STRING_LITERAL:"'"(~["'"])*("''"(~["'"])*)*"'">
}
TOKEN:// IDENTIFIERS
{
<ID:(<LETTER>)+("_"
| "$"
| "#"
| <DIGIT>
| <LETTER>)*>
| <#LETTER:["A"-"Z", "a"-"z"]>
| <#DIGIT:["0"-"9"]>
}
TOKEN:// SEPARATORS AND OPERATORS
{
<ASSIGN:":=">
| <COMMA:",">
| <CONCAT:"||">
| <SEMICOLON:";">
| <DOT:".">
| <LESS:"<">
| <LESSEQUAL:"<=">
| <GREATER:">">
| <GREATEREQUAL:">=">
| <EQUAL:"=">
| <NOTEQUAL:"!=">
| <NOTEQUAL2:"<>">
| <JOINPLUS:"(+)">
| <OPENPAREN:"(">
| <CLOSEPAREN:")">
| <ASTERISK:"*">
| <SLASH:"/">
| <PLUS:"+">
| <MINUS:"-">
| <QUESTIONMARK:"?">
}
TOKEN:// START QUOTED IDENTIFIER
{
<START_QUOTED_IDENTIFIER:"\"">:STATE_QuotedIdentStart
}
<STATE_QuotedIdentStart>TOKEN:{
<QUOTED_IDENTIFIER:<ID>>:STATE_QuotedIdentEnd
}
<STATE_QuotedIdentEnd>TOKEN:// IDENTIFIER ESCAPE CHAR
{
<END_QUOTED_IDENTIFIER:"\"">:DEFAULT
}
The SKIP regular expression production specifies that the tokens in the input that match this regular expressions should be skipped or ignored whereas TOKEN specifies that the matching tokens are to be considered as a valid TOKEN and should be processed.

(To be continued)...

Thursday, March 26, 2009

Implementing a SQL Parser using JavaCC - 1

JavaCC is the Compiler compiler implementation written in Java. What this means is JavaCC lets you define the grammar for any language and JavaCC will generate java classes that lets you to read, analyze and break the input into language units. Essentially, JavaCC generates parser and lexical analyser for you from the grammar file that you provided.

Every language is made up of set of alphabets. These alphabets are composed into meaningful units called words. These words are then composed into meaningful sentences. For implementing a SQL Parser, lets see how these language units compare
Alphatets - No particular equivalent in SQL
Words - Valid SQL Tokens like SELECT, INSERT, UPDATE, DELETE, CREATE, etc
Sentence - Valid ordering of various SQL Tokens i.e., Valid SQL Statement.
For eg., SELECT A from TABLE_1 is valid whereas SELECT INSERT VIEW is gibberish and is not valid.

Wednesday, March 25, 2009

SQL Parser

Database engine consists of a number of different sub components like query parser, optimizer, indexing facility and buffer manager. The first point of entry is the parser. The statements that are submitted to the database needs to be parsed and understood by the Database. When the database parses the statements, it can decipher the intended operation and can carry out appropriate actions. Javacc is a compiler compiler that can be used to create parser for a given grammar. I am looking into Javacc to create a new SQL Parser. I am planning to derive lot of grammar from Axion grammar. Initially I am planning to write SQL grammar for create, insert and delete statements. I will write more about my experiences in the following posts.

Blitz

I am currently working on creating a new java relational database engine called Blitz. You may wonder why I am into this now. Well, there are a few reasons. I want to create a database engine which can be small but powerful and also pretty extensible. So there are few databases out there like HSQL, H2, Axion, Derby,etc. So the whole difference lies in the fact that Blitz is meant to perform better for data analysis, mining and ware housing operations. The plan is to provide row store as well as column store. Also there is plan to provide data federation capabilities. Blitz is meant to take advantage of the Java 6 language features and also make use of the concurrency utils unlike some of the databases that came out before the concurrency packages were introduced. This will let Blitz to take advantage of multi core processors and can hopefully perform better. Only time will tell how much this will work!