Saturday, March 28, 2009

Implementing a SQL Parser using JavaCC - 3

The first piece of SQL grammar that we will write is for the 'create table' statement. Let's first start by writing the method that parses the different types of SQL statements like select, insert, delete, update, etc.

The key thing is that the part before the colon ':' looks like a Java method signature, while the body of the BNF production follows the colon. The BNF production declaration format is:
<Java method return type> JavaMethodName(Java method parameters) :
{
    // Any variable used in the method is declared here.
}
{
    // This section is the method body. Any embedded Java code must be surrounded by {}.
}

Using the above format, let's write a BNF production which looks for the different types of SQL statements like 'create' and 'select':

List<SQLExpression> parseInput() :
{
    List<SQLExpression> sqlExprs = new LinkedList<SQLExpression>();
    SQLExpression sqlExpr;
}
{
    (
        sqlExpr = createSQLExpression() { sqlExprs.add(sqlExpr); }
        [<SEMICOLON>]
    )+
    <EOF>
    { return sqlExprs; }
}

In the above code snippet, the name of the BNF production is parseInput. The method doesn't have any input parameters and the return type is a list of SQLExpression objects.
The first {} block contains the declaration of all the variables used in the method.
The second {} block is the method body. Inside the method body, you can call any other BNF production and also make use of any of the defined tokens such as <SEMICOLON>, <EOF>, etc. You can also use regular expressions and BNF production calls directly. createSQLExpression() is actually a BNF production call; control is handed over to the production that is called.

Any Java code must be enclosed in {}. That is why sqlExprs.add(sqlExpr); is written inside a {} block. Remember that this must be a valid Java statement that compiles without any problem. All of these BNF productions are converted into Java methods and added to the parser class that we declared between the PARSER_BEGIN and PARSER_END sections of the grammar file. Therefore, any field declared in that Java compilation unit can be used here.
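For instance, a field declared on the parser class between PARSER_BEGIN and PARSER_END can be read and updated from any action block. The statementCount field below is a hypothetical example and is not part of the grammar in this post:

PARSER_BEGIN(BlitzSQLParser)
package net.java.blitz.db.sql.parser;
import java.util.List;
import java.util.LinkedList;
import net.java.blitz.db.sql.expression.*;
public class BlitzSQLParser{
    // hypothetical field, visible to every embedded Java action in this grammar
    private int statementCount = 0;
}
PARSER_END(BlitzSQLParser)

With that in place, the action in parseInput could just as well read { sqlExprs.add(sqlExpr); statementCount++; }, because every production is generated into the same BlitzSQLParser class.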

For grouping regular expressions and production calls, we can use (). This is seen here:
(
    sqlExpr = createSQLExpression() { sqlExprs.add(sqlExpr); }
    [<SEMICOLON>]
)+

Here the call to the createSQLExpression BNF production is grouped with an optional <SEMICOLON> token, and this combination is allowed one or more repetitions, which is specified by the '+' after the grouping.
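Besides '+', the same bracketing supports the other EBNF operators: '(...)*' for zero or more repetitions, '[...]' (or '(...)?') for an optional part, and '|' for a choice between alternatives. As a hypothetical illustration (columnList is not part of the grammar in this post), a parenthesised, comma-separated list of identifiers could be written as:

void columnList() :
{}
{
    <OPENPAREN> SqlIdentifier() ( <COMMA> SqlIdentifier() )* <CLOSEPAREN>
}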

The whole grammar file that can parse a 'create table' statement is given below. Run JavaCC on the .jj file below, place the generated sources in the appropriate Java package, and provide the supporting classes such as SQLExpression, BlitzSQLParserException, CreateAtom, etc. for the parser to work properly. Disclaimer: a few of the helper BNF productions used here are taken from the Axion SQL grammar file.
options{
IGNORE_CASE = true;
STATIC = false;
UNICODE_INPUT = true;
// some performance optimizations
OPTIMIZE_TOKEN_MANAGER = true;
ERROR_REPORTING = false;
}
PARSER_BEGIN(BlitzSQLParser)
package net.java.blitz.db.sql.parser;
import java.util.List;
import java.util.LinkedList;
import net.java.blitz.db.sql.expression.*;
/**
* @author karthikeyan subramanian
* Do not edit this file directly. This file is generated from BlitzSQLGrammar.jj
*
*/
public class BlitzSQLParser{
}
PARSER_END(BlitzSQLParser)
// ----------------------------------------------------------------------------
// TOKENS
// ----------------------------------------------------------------------------
SKIP:{
" "
| "\n"
| "\r"
| "\t"
}
SKIP:{
<LINE_COMMENT:"--"(~["\n", "\r"])*("\n"
| "\r"
| "\r\n")>
}
SKIP:{
<BLOCK_COMMENT:"/*"(~["*"])*"*"("*"
| (~["*", "/"](~["*"])*"*"))*"/">
}
TOKEN:// KEYWORDS
{
<ADD:"add">
| <ALL:"all">
| <ALTER:"alter">
| <ALWAYS:"always">
| <AND:"and">
| <AS:"as">
| <ASC:"asc">
| <BEGIN:"begin">
| <BETWEEN:"between">
| <BOTH:"both">
| <BY:"by">
| <CAST:"cast">
| <CASCADE:"cascade">
| <CASE:"case">
| <CHECK:"check">
| <CREATE:"create">
| <COLUMN:"column">
| <CONSTRAINT:"constraint">
| <CURRENT_TIMESTAMP:"current_timestamp">
| <CURRENT_DATE:"current_date">
| <CURRENT_TIME:"current_time">
| <CYCLE:"cycle">
| <DATABASE:"database">
| <DATA:"data">
| <DAY:"day">
| <DEFAULT_:"default">
| <DEFERRED:"deferred">
| <DEFERRABLE:"deferrable">
| <DEFRAG:"defrag">
| <DELETE:"delete">
| <DESC:"desc">
| <DISTINCT:"distinct">
| <DROP:"drop">
| <ELSE:"else">
| <END:"end">
| <ESCAPE:"escape">
| <EXCEPTION:"exception">
| <EXISTS:"exists">
| <EXPLAIN:"explain">
| <EXTRACT:"extract">
| <EXTERNAL:"external">
| <FALSE:"false">
| <FIRST:"first">
| <FOR:"for">
| <FOREIGN:"foreign">
| <FROM:"from">
| <FULL:"full">
| <GENERATED:"generated">
| <GROUP:"group">
| <HAVING:"having">
| <HOUR:"hour">
| <IDENTITY:"identity">
| <IF:"if">
| <INCREMENT:"increment">
| <IMMEDIATE:"immediate">
| <IN:"in">
| <INITIALLY:"initially">
| <INDEX:"index">
| <INNER:"inner">
| <INSERT:"insert">
| <INTO:"into">
| <IS:"is">
| <JOIN:"join">
| <KEY:"key">
| <LEADING:"leading">
| <LEFT:"left">
| <LIKE:"like">
| <LIMIT:"limit">
| <LINK:"link">
| <MAXVALUE:"maxvalue">
| <MATCHED:"matched">
| <MERGE:"merge">
| <MINUTE:"minute">
| <MINVALUE:"minvalue">
| <MILLISECOND:"millisecond">
| <MONTH:"month">
| <NEXT:"next">
| <NO:"no">
| <NOT:"not">
| <NULL:"null">
| <OFFSET:"offset">
| <ON:"on">
| <OR:"or">
| <ORDER:"order">
| <ORGANIZATION:"organization">
| <OUTER:"outer">
| <POSITION:"position">
| <PRIMARY:"primary">
| <QUARTER:"quarter">
| <RIGHT:"right">
| <REFERENCES:"references">
| <RENAME:"rename">
| <RESTART:"restart">
| <SECOND:"second">
| <SELECT:"select">
| <SEQUENCE:"sequence">
| <SET:"set">
| <SOUNDS:"sounds">
| <START:"start">
| <SUBSTRING:"substring">
| <SYSDATE:"sysdate">
| <TABLE:"table">
| <THEN:"then">
| <TO:"to">
| <TRAILING:"trailing">
| <TRIM:"trim">
| <TRUE:"true">
| <TRUNCATE:"truncate">
| <UNIQUE:"unique">
| <UPDATE:"update">
| <UPSERT:"upsert">
| <USER:"user">
| <USING:"using">
| <VALUES:"values">
| <VALUE:"value">
| <VIEW:"view">
| <WEEK:"week">
| <WHEN:"when">
| <WHERE:"where">
| <WITH:"with">
| <YEAR:"year">
}
TOKEN:// DATA TYPES
{
<BIT:"bit">
| <BYTE:"byte">
| <INT:"int">
| <REAL:"real">
| <CLOB:"clob">
| <BLOB:"blob">
| <CHAR:"char">
| <CHARACTER:"character">
| <DATE:"date">
| <TIME:"time">
| <FLOAT:"float">
| <BIGINT:"bigint">
| <LONG:"long">
| <RAW:"raw">
| <STRING:"string">
| <BINARY:"binary">
| <NUMERIC:"numeric">
| <NUMBER:"number">
| <DECIMAL:"decimal">
| <DEC:"dec">
| <BOOLEAN:"boolean">
| <TINYINT:"tinyint">
| <INTEGER:"integer">
| <VARCHAR:"varchar">
| <VARCHAR2:"varchar2">
| <LONGVARCHAR:"longvarchar">
| <TEXT:"text">
| <SMALLINT:"smallint">
| <SHORT:"short">
| <VARBINARY:"varbinary">
| <LONGVARBINARY:"longvarbinary">
| <IMAGE:"image">
| <VARYING:"varying">
| <LARGE:"large">
| <TIMESTAMP:"timestamp">
| <OBJECT:"object">
| <JAVA_OBJECT:"java_object">
| <DOUBLE:"double">
}
TOKEN:// LITERALS
{
<INTEGER_LITERAL:(["0"-"9"])+>
| <FLOATING_POINT_LITERAL:(["0"-"9"])+"."(["0"-"9"])+(<EXPONENT>)?
| "."(["0"-"9"])+(<EXPONENT>)?
| (["0"-"9"])+<EXPONENT>
| (["0"-"9"])+(<EXPONENT>)?>
| <#EXPONENT:["e", "E"](["+", "-"])?(["0"-"9"])+>
| <STRING_LITERAL:"'"(~["'"])*("''"(~["'"])*)*"'">
}
TOKEN:// IDENTIFIERS
{
<ID:(<LETTER>)+("_"
| "$"
| "#"
| <DIGIT>
| <LETTER>)*>
| <#LETTER:["A"-"Z", "a"-"z"]>
| <#DIGIT:["0"-"9"]>
}
TOKEN:// SEPARATORS AND OPERATORS
{
<ASSIGN:":=">
| <COMMA:",">
| <CONCAT:"||">
| <SEMICOLON:";">
| <DOT:".">
| <LESS:"<">
| <LESSEQUAL:"<=">
| <GREATER:">">
| <GREATEREQUAL:">=">
| <EQUAL:"=">
| <NOTEQUAL:"!=">
| <NOTEQUAL2:"<>">
| <JOINPLUS:"(+)">
| <OPENPAREN:"(">
| <CLOSEPAREN:")">
| <ASTERISK:"*">
| <SLASH:"/">
| <PLUS:"+">
| <MINUS:"-">
| <QUESTIONMARK:"?">
}
TOKEN:// START QUOTED IDENTIFIER
{
<START_QUOTED_IDENTIFIER:"\"">:STATE_QuotedIdentStart
}
<STATE_QuotedIdentStart>TOKEN:{
<QUOTED_IDENTIFIER:<ID>>:STATE_QuotedIdentEnd
}
<STATE_QuotedIdentEnd>TOKEN:// IDENTIFIER ESCAPE CHAR
{
<END_QUOTED_IDENTIFIER:"\"">:DEFAULT
}
// --------------------------------------------------------------
// SQL Parsing methods.
// --------------------------------------------------------------
// --------------------------------------------------------------
// Call this method to create a List of SQLExpression from the
// given input.
// --------------------------------------------------------------
List<SQLExpression>parseInput():{
List<SQLExpression>sqlExprs = new LinkedList<SQLExpression>();
SQLExpression sqlExpr;
}
{
(sqlExpr = createSQLExpression(){
sqlExprs.add(sqlExpr);
}[<SEMICOLON>])+<EOF>
{
return sqlExprs;
}
}
SQLExpression createSQLExpression():{
SQLExpression sqlExpr = new SQLExpression();
CreateAtom createAtom;
}
{
// Only the 'create table' statement is implemented so far; further statement
// types (select, insert, ...) can be added later as '|' alternatives.
createAtom = createTable(){
// the returned CreateAtom could be attached to sqlExpr once SQLExpression exposes a setter for it
return sqlExpr;
}
}
// --------------------------------------------------------------
// Helper methods
// --------------------------------------------------------------
String SqlIdentifier():{
Token t = null;
}
{
t = SqlQuotedId(){
return t.image;
}
}
Token SqlQuotedId():{
Token t = null;
}
{
(t = <ID>
| <START_QUOTED_IDENTIFIER>t = <QUOTED_IDENTIFIER><END_QUOTED_IDENTIFIER>){
return t;
}
}
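// A column definition is returned as an Object[] "tuple":
// tuple[0] = column name, tuple[1] = type name, tuple[2] = length/precision,
// tuple[3] = scale; the remaining slots are not used by the productions below.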
Object[]SqlColumnDef():{
Object[]tuple = new Object[6];
}
{
tuple[0] = SqlIdentifier()(SqlExactNumericType(tuple)
| (LOOKAHEAD(2)SqlCharStringType(tuple)
| SqlBinaryStringType(tuple)
| SqlApproximateNumericType(tuple)
| SqlBooleanType(tuple)
| SqlDataTimeType(tuple))) {
return tuple;
}
}
void SqlCharStringType(Object[]tuple):{}{
LOOKAHEAD(2)((<CHAR>
| <CHARACTER>){
tuple[1] = "char";
tuple[2] = "1";
}
[SqlCharLength(tuple)])
| LOOKAHEAD(2)((<CHAR><VARYING>
| <CHARACTER><VARYING>
| <VARCHAR>
| <VARCHAR2>
| <STRING>){
tuple[1] = "varchar";
tuple[2] = "1";
}
[SqlCharLength(tuple)])
| ((LOOKAHEAD(2)<LONG>
| <LONG><VARCHAR>
| <LONGVARCHAR>
| <TEXT>){
tuple[1] = "varchar";
tuple[2] = ""+Integer.MAX_VALUE;
}
)
| ((<CHAR><LARGE><OBJECT>
| <CHARACTER><LARGE><OBJECT>
| <CLOB>){
tuple[1] = "clob";
}
[SqlPrecision(tuple)])
}
void SqlBinaryStringType(Object[]tuple):{}{
(<BYTE>{
tuple[1] = "byte";
}
[SqlPrecision(tuple)])
| LOOKAHEAD(2)((<BINARY>
| <VARBINARY>
| <RAW>){
tuple[1] = "varbinary";
}
[SqlPrecision(tuple)])
| ((LOOKAHEAD(2)<LONG><VARBINARY>
| <LONGVARBINARY>
| <LONG><RAW>
| <IMAGE>){
tuple[1] = "varbinary";
tuple[2] = ""+Integer.MAX_VALUE;
}
[SqlCharLength(tuple)])
| ((<BINARY><LARGE><OBJECT>
| <BLOB>){
tuple[1] = "blob";
}
[SqlPrecision(tuple)])
}
void SqlExactNumericType(Object[]tuple):{
Integer precision = new Integer(BigDecimalType.DEFAULT_PRECISION);
Integer scale = new Integer(BigDecimalType.DEFAULT_SCALE);
}
{
LOOKAHEAD(3)((<NUMERIC>
| <DECIMAL>
| <DEC>
| <NUMBER>){
tuple[1] = "numeric";
tuple[2] = precision.toString();
tuple[3] = scale.toString();
}
[SqlPrecisionAndScale(tuple)]{
precision = Integer.valueOf(tuple[2].toString());
scale = Integer.valueOf(tuple[3].toString());
if (scale.compareTo(precision)>0){
throw new BlitzParserException("scale value exceeds precision value");
}
else if (precision.compareTo(new Integer(BigDecimalType.MAX_PRECISION))>0){
throw new BlitzParserException("precision value exceeds maximum ("+BigDecimalType.MAX_PRECISION+")");
}
}
)
| ((<INTEGER>
| <INT>){
tuple[1] = "integer";
}
)
| ((<SMALLINT>
| <SHORT>){
tuple[1] = "short";
}
)
| ((<BIGINT>){
tuple[1] = "bigint";
}
)
}
void SqlApproximateNumericType(Object[]tuple):{}{
LOOKAHEAD(3)(<FLOAT>{
tuple[1] = "float";
}
[SqlPrecision(tuple)])
| (<REAL>{
tuple[1] = "float";
}
)
| (<DOUBLE>{
tuple[1] = "double";
}
)
}
void SqlBooleanType(Object[]tuple):{}{
((<BOOLEAN>
| <BIT>){
tuple[1] = "boolean";
}
)
}
void SqlDataTimeType(Object[]tuple):{}{
(<DATE>{
tuple[1] = "date";
}
)
| (<TIME>{
tuple[1] = "time";
}
[SqlPrecision(tuple)])
| (<TIMESTAMP>{
tuple[1] = "timestamp";
}
[SqlPrecision(tuple)])
}
void SqlCharLength(Object[]tuple):{}{
// If length is omitted, then a length of 1 (one) is implicit.
<OPENPAREN>tuple[2] = SqlPositiveInteger()<CLOSEPAREN>
}
void SqlPrecisionAndScale(Object[]tuple):{}{
<OPENPAREN>tuple[2] = SqlPositiveInteger()(<COMMA>tuple[3] = SqlUnsignedInteger())*<CLOSEPAREN>
}
void SqlPrecision(Object[]tuple):{}{
<OPENPAREN>tuple[2] = SqlUnsignedInteger()<CLOSEPAREN>
}
String SqlPositiveInteger():{
String valStr = null;
}
{
valStr = SqlUnsignedInteger(){
int val = (valStr != null)?Integer.valueOf(valStr).intValue():-1;
if (val == 0){
throw new BlitzParserException("value must be a positive integer");
}
return valStr;
}
}
String SqlUnsignedInteger():{
Token t = null;
}
{
t = <INTEGER_LITERAL>{
if (t != null){
return t.image;
}
return null;
}
}
CreateAtom createTable():{
String tblName = null;
String schemaName = null;
CreateAtom createAtom = new CreateAtom();
Object[] obj;
}
{
<CREATE><TABLE>
[<IF><NOT><EXISTS>{
createAtom.setIfNotExists(true);
}]
// the optional schema qualifier ("schema.table") needs LOOKAHEAD(2) because
// both the qualifier and the table name start with an identifier
[LOOKAHEAD(2)schemaName = SqlIdentifier()<DOT>]
tblName = SqlIdentifier()
// tblName and schemaName can be stored on the CreateAtom once setters exist for them
<OPENPAREN>
// at least one column definition is required
obj = SqlColumnDef(){
createAtom.addColumnDefinition(new ColumnDefinition(obj));
}
(<COMMA>obj = SqlColumnDef(){
createAtom.addColumnDefinition(new ColumnDefinition(obj));
}
)*<CLOSEPAREN>
{
return createAtom;
}
}
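Once JavaCC has generated the parser, driving it from Java is straightforward. Below is a minimal sketch; ParserDemo and the sample statement are purely illustrative, and it assumes the generated sources plus the supporting expression classes are on the classpath:

import java.io.StringReader;
import java.util.List;
import net.java.blitz.db.sql.parser.BlitzSQLParser;
import net.java.blitz.db.sql.expression.SQLExpression;

public class ParserDemo {
    public static void main(String[] args) throws Exception {
        String sql = "CREATE TABLE employee (id INTEGER, name VARCHAR(30), salary NUMERIC(10, 2));";
        // Reader-based constructor generated by JavaCC (STATIC = false allows multiple instances)
        BlitzSQLParser parser = new BlitzSQLParser(new StringReader(sql));
        List<SQLExpression> expressions = parser.parseInput();
        System.out.println("Parsed " + expressions.size() + " statement(s)");
    }
}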

Friday, March 27, 2009

Implementing a SQL Parser using JavaCC - 2

Let's start with a JavaCC grammar file for parsing SQL queries. A JavaCC grammar file for SQL parsing is divided into various sections, which are given below:
1. Options section
This is the first and foremost section of the JavaCC grammar file and it contains various JavaCC configuration options. In the example below, IGNORE_CASE = true makes token matching case-insensitive, STATIC = false generates a parser that can be instantiated more than once, UNICODE_INPUT = true treats the input as a Unicode character stream, and ERROR_REPORTING = false trades detailed parse-error messages for a little extra speed.
Eg:
options {
IGNORE_CASE = true;
STATIC = false;
UNICODE_INPUT = true;

// some performance optimizations
ERROR_REPORTING = false;
}
2. The next section is the block that is surrounded by PARSER_BEGIN(parserName) and PARSER_END(parserName). In this block, a Java compilation unit can be inserted. In the simplest case, it can be a very simple class declaration without any methods or variables inside it. The package declaration used in this section will be used in all the JavaCC-generated files such as the parser, token manager, token manager constants and char stream.
 PARSER_BEGIN(BlitzSQLParser)
package net.java.blitz.db.sql.parser;
import java.util.List;
import java.util.LinkedList;
import net.java.blitz.db.sql.expression.*;
/**
* @author karthikeyan subramanian
* Do not edit this file directly. This file is generated from BlitzSQLGrammar.jj
*
*/
public class BlitzSQLParser{
}
PARSER_END(BlitzSQLParser)
3. This is then followed by the set of regular expression productions that define how the token manager should behave when a match is found. Four kinds of regular expression productions are available for this purpose: TOKEN, SPECIAL_TOKEN, SKIP and MORE. Now we define our regular expressions here. (Disclosure: a lot of the SQL-related tokens are borrowed from the AxionSQLParser grammar in Axion.) Not all of these tokens are used in this SQL parser yet, but they may be useful later.
// ----------------------------------------------------------------------------
// TOKENS
// ----------------------------------------------------------------------------
SKIP:{
" "
| "\n"
| "\r"
| "\t"
}
SKIP:{
<LINE_COMMENT:"--"(~["\n", "\r"])*("\n"
| "\r"
| "\r\n")>
}
SKIP:{
<BLOCK_COMMENT:"/*"(~["*"])*"*"("*"
| (~["*", "/"](~["*"])*"*"))*"/">
}
TOKEN:// KEYWORDS
{
<ADD:"add">
| <ALL:"all">
| <ALTER:"alter">
| <ALWAYS:"always">
| <AND:"and">
| <AS:"as">
| <ASC:"asc">
| <BEGIN:"begin">
| <BETWEEN:"between">
| <BOTH:"both">
| <BY:"by">
| <CAST:"cast">
| <CASCADE:"cascade">
| <CASE:"case">
| <CHECK:"check">
| <CREATE:"create">
| <COLUMN:"column">
| <CONSTRAINT:"constraint">
| <CURRENT_TIMESTAMP:"current_timestamp">
| <CURRENT_DATE:"current_date">
| <CURRENT_TIME:"current_time">
| <CYCLE:"cycle">
| <DATABASE:"database">
| <DATA:"data">
| <DAY:"day">
| <DEFAULT_:"default">
| <DEFERRED:"deferred">
| <DEFERRABLE:"deferrable">
| <DEFRAG:"defrag">
| <DELETE:"delete">
| <DESC:"desc">
| <DISTINCT:"distinct">
| <DROP:"drop">
| <ELSE:"else">
| <END:"end">
| <ESCAPE:"escape">
| <EXCEPTION:"exception">
| <EXISTS:"exists">
| <EXPLAIN:"explain">
| <EXTRACT:"extract">
| <EXTERNAL:"external">
| <FALSE:"false">
| <FIRST:"first">
| <FOR:"for">
| <FOREIGN:"foreign">
| <FROM:"from">
| <FULL:"full">
| <GENERATED:"generated">
| <GROUP:"group">
| <HAVING:"having">
| <HOUR:"hour">
| <IDENTITY:"identity">
| <IF:"if">
| <INCREMENT:"increment">
| <IMMEDIATE:"immediate">
| <IN:"in">
| <INITIALLY:"initially">
| <INDEX:"index">
| <INNER:"inner">
| <INSERT:"insert">
| <INTO:"into">
| <IS:"is">
| <JOIN:"join">
| <KEY:"key">
| <LEADING:"leading">
| <LEFT:"left">
| <LIKE:"like">
| <LIMIT:"limit">
| <LINK:"link">
| <MAXVALUE:"maxvalue">
| <MATCHED:"matched">
| <MERGE:"merge">
| <MINUTE:"minute">
| <MINVALUE:"minvalue">
| <MILLISECOND:"millisecond">
| <MONTH:"month">
| <NEXT:"next">
| <NO:"no">
| <NOT:"not">
| <NULL:"null">
| <OFFSET:"offset">
| <ON:"on">
| <OR:"or">
| <ORDER:"order">
| <ORGANIZATION:"organization">
| <OUTER:"outer">
| <POSITION:"position">
| <PRIMARY:"primary">
| <QUARTER:"quarter">
| <RIGHT:"right">
| <REFERENCES:"references">
| <RENAME:"rename">
| <RESTART:"restart">
| <SECOND:"second">
| <SELECT:"select">
| <SEQUENCE:"sequence">
| <SET:"set">
| <SOUNDS:"sounds">
| <START:"start">
| <SUBSTRING:"substring">
| <SYSDATE:"sysdate">
| <TABLE:"table">
| <THEN:"then">
| <TO:"to">
| <TRAILING:"trailing">
| <TRIM:"trim">
| <TRUE:"true">
| <TRUNCATE:"truncate">
| <UNIQUE:"unique">
| <UPDATE:"update">
| <UPSERT:"upsert">
| <USER:"user">
| <USING:"using">
| <VALUES:"values">
| <VALUE:"value">
| <VIEW:"view">
| <WEEK:"week">
| <WHEN:"when">
| <WHERE:"where">
| <WITH:"with">
| <YEAR:"year">
}
TOKEN:// DATA TYPES
{
<BIT:"bit">
| <BYTE:"byte">
| <INT:"int">
| <REAL:"real">
| <CLOB:"clob">
| <BLOB:"blob">
| <CHAR:"char">
| <CHARACTER:"character">
| <DATE:"date">
| <TIME:"time">
| <FLOAT:"float">
| <BIGINT:"bigint">
| <LONG:"long">
| <RAW:"raw">
| <STRING:"string">
| <BINARY:"binary">
| <NUMERIC:"numeric">
| <NUMBER:"number">
| <DECIMAL:"decimal">
| <DEC:"dec">
| <BOOLEAN:"boolean">
| <TINYINT:"tinyint">
| <INTEGER:"integer">
| <VARCHAR:"varchar">
| <VARCHAR2:"varchar2">
| <LONGVARCHAR:"longvarchar">
| <TEXT:"text">
| <SMALLINT:"smallint">
| <SHORT:"short">
| <VARBINARY:"varbinary">
| <LONGVARBINARY:"longvarbinary">
| <IMAGE:"image">
| <VARYING:"varying">
| <LARGE:"large">
| <TIMESTAMP:"timestamp">
| <OBJECT:"object">
| <JAVA_OBJECT:"java_object">
| <DOUBLE:"double">
}
TOKEN:// LITERALS
{
<INTEGER_LITERAL:(["0"-"9"])+>
| <FLOATING_POINT_LITERAL:(["0"-"9"])+"."(["0"-"9"])+(<EXPONENT>)?
| "."(["0"-"9"])+(<EXPONENT>)?
| (["0"-"9"])+<EXPONENT>
| (["0"-"9"])+(<EXPONENT>)?>
| <#EXPONENT:["e", "E"](["+", "-"])?(["0"-"9"])+>
| <STRING_LITERAL:"'"(~["'"])*("''"(~["'"])*)*"'">
}
TOKEN:// IDENTIFIERS
{
<ID:(<LETTER>)+("_"
| "$"
| "#"
| <DIGIT>
| <LETTER>)*>
| <#LETTER:["A"-"Z", "a"-"z"]>
| <#DIGIT:["0"-"9"]>
}
TOKEN:// SEPARATORS AND OPERATORS
{
<ASSIGN:":=">
| <COMMA:",">
| <CONCAT:"||">
| <SEMICOLON:";">
| <DOT:".">
| <LESS:"<">
| <LESSEQUAL:"<=">
| <GREATER:">">
| <GREATEREQUAL:">=">
| <EQUAL:"=">
| <NOTEQUAL:"!=">
| <NOTEQUAL2:"<>">
| <JOINPLUS:"(+)">
| <OPENPAREN:"(">
| <CLOSEPAREN:")">
| <ASTERISK:"*">
| <SLASH:"/">
| <PLUS:"+">
| <MINUS:"-">
| <QUESTIONMARK:"?">
}
TOKEN:// START QUOTED IDENTIFIER
{
<START_QUOTED_IDENTIFIER:"\"">:STATE_QuotedIdentStart
}
<STATE_QuotedIdentStart>TOKEN:{
<QUOTED_IDENTIFIER:<ID>>:STATE_QuotedIdentEnd
}
<STATE_QuotedIdentEnd>TOKEN:// IDENTIFIER ESCAPE CHAR
{
<END_QUOTED_IDENTIFIER:"\"">:DEFAULT
}
The SKIP regular expression production specifies that any input matching these regular expressions should be skipped or ignored, whereas TOKEN specifies that the matching text forms a valid token that is passed on to the parser for processing.
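A quick way to see the difference is to pull tokens straight from the generated parser without invoking any production: whitespace and comments never show up because they are consumed by the SKIP productions. A minimal sketch, assuming the standard classes JavaCC generates (the Reader-based constructor, Token and BlitzSQLParserConstants); TokenDump itself is just an illustrative driver:

import java.io.StringReader;
import net.java.blitz.db.sql.parser.*;

public class TokenDump {
    public static void main(String[] args) {
        BlitzSQLParser parser = new BlitzSQLParser(
                new StringReader("select a -- a line comment is skipped\nfrom t"));
        // getNextToken() is generated on the parser; EOF is defined in the constants interface
        for (Token t = parser.getNextToken(); t.kind != BlitzSQLParserConstants.EOF; t = parser.getNextToken()) {
            System.out.println(t.kind + " : " + t.image);
        }
    }
}

Running it prints only the SELECT, ID, FROM and ID tokens; the spaces, the newline and the line comment are silently skipped.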

(To be continued)...

Thursday, March 26, 2009

Implementing a SQL Parser using JavaCC - 1

JavaCC is a compiler compiler written in Java. What this means is that JavaCC lets you define the grammar for a language, and it will generate Java classes that let you read, analyze and break the input into language units. Essentially, JavaCC generates a parser and a lexical analyser for you from the grammar file that you provide.

Every language is made up of a set of symbols (an alphabet). These symbols are composed into meaningful units called words, and these words are then composed into meaningful sentences. For implementing a SQL parser, let's see how these language units compare:
Alphabet - No particular equivalent in SQL
Words - Valid SQL Tokens like SELECT, INSERT, UPDATE, DELETE, CREATE, etc
Sentences - A valid ordering of various SQL tokens, i.e., a valid SQL statement.
For example, SELECT A from TABLE_1 is valid, whereas SELECT INSERT VIEW is gibberish and is not valid.

Wednesday, March 25, 2009

SQL Parser

A database engine consists of a number of different sub-components like the query parser, optimizer, indexing facility and buffer manager. The first point of entry is the parser. The statements that are submitted to the database need to be parsed and understood by the database. When the database parses the statements, it can decipher the intended operation and carry out the appropriate actions. JavaCC is a compiler compiler that can be used to create a parser for a given grammar. I am looking into JavaCC to create a new SQL parser. I am planning to derive a lot of the grammar from the Axion grammar. Initially I am planning to write SQL grammar for the create, insert and delete statements. I will write more about my experiences in the following posts.

Blitz

I am currently working on creating a new Java relational database engine called Blitz. You may wonder why I am into this now. Well, there are a few reasons. I want to create a database engine which is small but powerful and also pretty extensible. There are already a few databases out there like HSQL, H2, Axion, Derby, etc. The difference is that Blitz is meant to perform better for data analysis, mining and warehousing operations. The plan is to provide a row store as well as a column store, and there are also plans to provide data federation capabilities. Blitz is meant to take advantage of the Java 6 language features and make use of the concurrency utilities, unlike some of the databases that came out before the concurrency packages were introduced. This will let Blitz take advantage of multi-core processors and hopefully perform better. Only time will tell how well this will work!

Who am I

I am a software developer and a Java enthusiast. I am a computer science graduate and have been working on Java-related technologies for 3-4 years. I am interested in databases and data warehousing, and have worked on a couple of open source projects like Axion (a Java embedded database) and ETL Integrator (an ETL tool from Sun Microsystems). I have done a significant amount of work on ETL Integrator (I worked full time at Sun on this product) and have a very big soft spot for it. Anyway, that's pretty much about my technical background. Now I work at Yahoo! on a display advertising platform product called APT. I like to play guitar and I do like video games just like most other men! I also like to travel a bit.