A source code to HTML syntax-highlighter

If you write code and have a Web site where you publish it, a program that converts your code into an HTML-formatted, Web-ready document with syntax highlighting that you can paste into your page is a necessity. This is exactly what this program does for Pascal, PHP, Javascript and Visual Basic source code. The resulting document maintains the original formatting of the source code and surrounds syntactic features with HTML tags that a CSS file uses for syntax highlighting.

It is a common experience for a programmer to write source code in a programming environment and notice that the environment displays it in a way that emphasizes the syntactic features of the language: special symbols of the language, strings, numbers, comments as well as other special sequences of characters. This is called "syntax highlighting". This sort of display results from the fact that the text is parsed by the environment according to the syntactic rules of the language. However, despite the syntax highlighting provided by the environment, the source code remains plain ANSI text, ready to be compiled or interpreted but without any embedded syntactic markup.

This article presents a program that allows you to quickly create an HTML file from a given source code file while maintaining the source formatting and adding colored syntax highlighting. The output of the program is a document that can be pasted into an existing HTML page.

Approach

Let's start with the definition of a parser. A parser breaks data into smaller elements (tokens) according to a set of rules that describe its structure. In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a text made of a sequence of tokens to determine its grammatical structure with respect to a given (more or less) formal grammar. For example, a mailing address can be tokenized into a street address, a city, a postal code, a province and a country. A lexical analyzer only produces the tokens: it generally does nothing with combinations of tokens, a task left to the parser.

I started with the undocumented TParser class found in classes.pas in RAD Studio 2009. It is used internally in RAD Studio to parse (tokenize) VCL Form files (which have a .dfm extension on Win32, and .nfm on .NET). TParser performs a lexical analysis of the input stream. It breaks the stream up into floating point numbers, strings, Pascal identifiers or punctuation. Any other non-space characters are considered to be punctuation. Binary data, appearing as hexadecimal digits, is also handled by TParser. Binary data is enclosed in curly braces, and is handled specially by TParser [from Delphi3000].

TParser cannot handle Pascal comments, but the Pascal-to-HTML converter developed by Marco Cantu and Tim Gooch in their book entitled "Delphi Developer's Handbook" (pp. 128-142) extends the TParser class and can handle them. It looked somewhat closer to my requirement since Cantu and Gooch used it to produce syntax-highlighted Pascal code on a Web page. Their approach, described in the sections that follow, is clever and simple.

The GtroCodeParser project

The GtroCodeParser program provides an easy-to-use user interface described below. The main area of the UI is a TMemo component. In the programming environment, you select the code, copy it to the clipboard and transfer it to the TMemo component by clicking on the LoadFromClipboard button.

[Figure: User interface of the GtroCodeParser program]

When the user clicks on the "Convert to [language]" button, the Convert() procedure of the host program is called. It creates the source and destination streams and loads the content of the TMemo component into the source stream. Additionally, it creates a buffer of AnsiChars and creates either the TGtroPascalParser, the TGtroPHPParser or the TGtroJavascriptParser object depending on the language selected by the user. Once one of these objects is created, it calls the Convert() method of the base class and starts parsing the source.

The logic of these classes is fairly simple: copy the original code to a stream and create a stream for the HTML-formatted code. Then, read the input stream token by token, interpret each token as a delimiter, a symbol, a string or a comment, and format it for the Web (embed it within HTML tags). With the program, the code is copied from the clipboard to the Memo component, converted to HTML by clicking on the "Convert to [language]" button, and the new content of the Memo component is copied back to the clipboard. Once this is done, the user can paste it in a text editor.

The code is then nested within <pre class="code"> ... </pre> HTML tags. The syntactic features that are considered are as shown in this table:

Syntactic features and their HTML rendering

Syntactic feature | Token | Value | Description | HTML tag
End of file | toEOF | #0 | Indicates that the end of file has been reached. | no action
Symbol | toSymbol | #1 | Symbols start with a letter or the _ character. When one is detected, the characters that follow are read until a non-letter is found. This sequence is then transferred to the parser with the toSymbol token for comparison with the list of keywords of the language. If there is a match, the sequence of characters is converted to an Ansi string and surrounded by <b> tags. | <b>
String | toString | #2 | Strings are normally surrounded by string delimiters. When one is detected, the characters that follow are read until the closing delimiter is found. This sequence is then transferred to the parser as an Ansi string with the toString token. | <em>
Integer | toInteger | #3 | Numbers can be integers or floating point numbers. | no special tag
Float | toFloat | #4 | Numbers can be integers or floating point numbers. | no special tag
Comment | toComment | #5 | Comments are of two kinds: end-of-line and block comments. When an end-of-line comment delimiter is detected, all the characters following it to the end of the line are returned to the parser. When a block comment delimiter is detected, all the characters following it are read until the closing comment delimiter is found (the comment may span several lines). This sequence is then transferred to the parser as an Ansi string with the toComment token. | <i>
Variable | toVar | #6 | When names of variables are easily recognized, as in PHP (they start with the $ character), the characters that follow are read until a non-letter character is found. The sequence of characters thus produced is transferred to the parser as an Ansi string with the toVar token. | <var>

The rendering of these tags in the browser is left to the user: it depends on how his CSS file styles the <b>, <i>, <em> and <var> tags within the "code" class.

Beware!
Do not paste the output produced by this program directly in a WYSIWYG editor because all the HTML tags that the program has added to the document will be interpreted as text and their effect will be lost. Paste it in a text or a code editor.

Creating an abstract class

I created an abstract class that I called TGtroCodeParser, a renamed carbon copy of Marco Cantu's TCodeParser class. It does no parsing at all but defines the virtual/abstract methods that are needed to do the parsing in descendant classes. It takes control of the parsing process with its Convert() public method. Additionally, it performs housekeeping chores like creating the streams, filling the buffer, maintaining pointers on the input stream and producing the output stream. Descendant classes override the methods declared in this class and generate the parsed output.

The Convert method

The Convert() method of the base class is at the core of the parsing process: it controls the parsing by calling the NextToken() method of the descendant classes and uses the result provided by this method to format the output for the Web. What it does will be discussed in detail later.

Formatting methods

Formatting the code for the Web is performed in the base class. The most important of these methods is TokenString(), which generates an AnsiString from the data received from NextToken(), whereas the methods called BeforeKeyword(), AfterKeyword(), BeforeString(), AfterString(), BeforeComment() and AfterComment() simply enclose the strings produced by TokenString() within specific HTML tags.
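The role of the Before...() and After...() pairs can be pictured with a small stand-alone sketch. It is not the code of the base class: the TTokenKind type and the WrapToken() function are hypothetical names introduced here, and the tags are simply those listed in the table of syntactic features.

```pascal
program WrapTokenSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical token kinds, one per highlighted syntactic feature
  TTokenKind = (tkKeyword, tkString, tkComment, tkVariable, tkOther);

// Returns Text enclosed in the HTML tag associated with Kind.
// Kinds without a tag (numbers, punctuation) are returned unchanged.
function WrapToken(Kind: TTokenKind; const Text: AnsiString): AnsiString;
const
  Tags: array[TTokenKind] of AnsiString = ('b', 'em', 'i', 'var', '');
begin
  if Tags[Kind] = '' then
    Result := Text
  else
    Result := '<' + Tags[Kind] + '>' + Text + '</' + Tags[Kind] + '>';
end;

begin
  WriteLn(WrapToken(tkKeyword, 'begin'));     // <b>begin</b>
  WriteLn(WrapToken(tkComment, '// a note')); // <i>// a note</i>
  WriteLn(WrapToken(tkOther, '42'));          // 42
end.
```

In the real classes the opening and closing tags are written by separate virtual methods, which lets a descendant class change the markup of one feature without touching the others.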

Other housekeeping methods

Essentially, the input to the program is hosted by a TStringList component which is transferred to the source stream as a sequence of Ansi characters where each line is separated by carriage-return line-feed (#13#10) characters. Each of these Ansi characters is considered one by one by the parsing process and, as such, some management of the input stream is needed during parsing. Details of this management are provided in Annex A.

Creating one descendant class for each language

Using this approach, a descendant class must be created for each programming language whose code will be parsed. At the moment, two such descendant classes are described hereunder: the TGtroPascalParser and the TGtroPHPParser. Two other classes have been implemented but they are not described here: the TGtroJavascriptParser and the TGtroVBasicParser.

How does the parser work?

When a user decides to convert his source code to HTML, he chooses the appropriate programming language and launches the parser by clicking on the "Convert to [language]" button. The conversion process is now initiated. Two streams and a buffer of AnsiChar are created, the file containing the keywords of the language is loaded, and the input stream is filled with the code to be parsed.

The code is transferred by groups of 4096 bytes to an AnsiChar buffer (it is an AnsiString) which is inspected character by character by the parser. The Convert() method of the base class takes control of the parsing and calls the NextToken() method of the descendant classes until all the content of the stream has been inspected. The Web-ready document is then put in the output stream and passed to the host program where it is displayed in a TMemo component.
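To give an idea of what reading a stream by groups of 4096 bytes looks like, here is a minimal sketch. It is not the code of the program: CountChunks() and BufSize are hypothetical names, and terminating the valid data with #0 is an assumption based on the way the parser tests for #0 to detect the end of the buffer.

```pascal
program ChunkedReadSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

const
  BufSize = 4096; // size of one group of bytes

// Hypothetical helper: reads Source chunk by chunk and returns the number
// of non-empty chunks. A real parser would scan Buffer after each Read.
function CountChunks(Source: TStream): Integer;
var
  Buffer: array[0..BufSize] of AnsiChar; // one extra slot for the #0 marker
  Count: Integer;
begin
  Result := 0;
  repeat
    Count := Source.Read(Buffer, BufSize);
    Buffer[Count] := #0; // assumption: #0 marks the end of the valid data
    if Count > 0 then
      Inc(Result);
  until Count < BufSize;
end;

var
  Data: AnsiString;
  Source: TMemoryStream;
begin
  Data := StringOfChar(AnsiChar('x'), 10000); // 10000 bytes of dummy code
  Source := TMemoryStream.Create;
  try
    Source.WriteBuffer(Pointer(Data)^, Length(Data));
    Source.Position := 0;
    WriteLn(CountChunks(Source)); // 3, i.e. 4096 + 4096 + 1808 bytes
  finally
    Source.Free;
  end;
end.
```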

Let's get into the details now, remembering that "the devil is in the details".

Keywords

Each programming language has a set of keywords (reserved words for Delphi) which are to be highlighted by the parser. These keywords are stored in separate files in a format that can be understood and loaded by TStringList.LoadFromFile. These files are called Pascal.txt, PHP.txt, Javascript.txt and VBasic.txt and they need to be in the same directory as the host program.
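The keyword lookup itself can be sketched in a few lines. This is only an illustration: IsKeyword() is a hypothetical helper, and the one-keyword-per-line layout is an assumption about the keyword files, emulated here with the Text property so that the example runs without any file.

```pascal
program KeywordLookupSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

// Hypothetical helper: True when Candidate is in the keyword list.
function IsKeyword(Keywords: TStrings; const Candidate: string): Boolean;
begin
  Result := Keywords.IndexOf(Candidate) >= 0; // IndexOf returns -1 if absent
end;

var
  Keywords: TStringList;
begin
  Keywords := TStringList.Create;
  try
    // The program would call Keywords.LoadFromFile('Pascal.txt') here.
    // Assumption: the file holds one keyword per line, as emulated below.
    Keywords.Text := 'begin'#13#10'end'#13#10'procedure';
    WriteLn(IsKeyword(Keywords, 'begin'));
    WriteLn(IsKeyword(Keywords, 'Begin'));      // also found, see the note below
    WriteLn(IsKeyword(Keywords, 'MyVariable')); // not a keyword
  finally
    Keywords.Free;
  end;
end.
```

TStringList compares strings without regard to case by default, which suits Pascal. For a case-sensitive language such as Javascript, setting Keywords.CaseSensitive to True before the lookup would avoid false matches.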

The NextToken() tokenizer method

This method reads characters from the input stream and attempts to group them into tokens of the types it knows. For example, if the character is a letter, the tokenizer keeps reading letters until it finds a non-letter character, then returns the sequence to the parser (here the Convert() method). If the character is a string or a comment delimiter, it keeps reading characters until it finds the closing delimiter.

It is essentially the one developed by Cantu and Gooch. Its role is to interpret the Ansi characters that it dissects one by one. The lexical analysis part of the parsing process is performed by the NextToken() method of the descendant classes as shown in the diagram that follows:

[Figure: Flow diagram of the NextToken() method]

The diagram above shows how each token is dissected. First, the AnsiString buffer is filled if needed. Then the PAnsiChar pointer P is set to the value of SourcePtr, which points directly to the character under investigation in the buffer. Next, TokenPtr is set to P to remember where P was located at the beginning of the method.

Then the character pointed to by P (it is P^) is compared with the various delimiters of the language in a case statement, as shown above. When a delimiter is found, P is increased until a condition is met. SourcePtr is then set to the final value of P. If the delimiter was a string delimiter, another PAnsiChar pointer S is used. In this case, P and S are increased until the closing delimiter is met and the pointer StringPtr is set to S. The pointer arithmetic represented by the difference between SourcePtr and TokenPtr on the one hand, and between StringPtr and TokenPtr on the other hand, will be used in the Convert() method of the base class, which called NextToken() and will complete the parsing process by formatting the result for the Web.

This process is repeated for each character in the buffer and the result is passed to the Convert() method of the base class.

function TGtroPascalParser.NextToken: AnsiChar;
// called by Create, Convert
var
  I: Integer;
  P, S: PAnsiChar;
begin
  SkipBlanks;  // fills buffer when needed
  P:= SourcePtr;
  TokenPtr:= P;
  case P^ of

    // P^ is a letter or '_': looking for a symbol
    'A'..'Z', 'a'..'z', '_':
      begin
        Inc(P);
        while P^ in ['A'..'Z', 'a'..'z', '0'..'9', '_', '.'] do Inc(P);
        Result:= toSymbol;
      end;

    '#', '''': // => toString
      begin
        S:= P;
        while True do
          case P^ of
            '#': // control character written as #nn
            begin
              Inc(P);
              I:= 0;
              while P^ in ['0'..'9'] do
              begin
                I:= I * 10 + (Ord(P^) - Ord('0'));
                Inc(P);
              end;
              S^:= AnsiChar(Chr(I));
              Inc(S);
            end; // '#'

            '''': // looking for an AnsiString
            begin
              Inc(P);
              while True do
              begin
                case P^ of
                  #0, #10, #13:
                    Error('Invalid Ansistring');

                  '''':
                    begin
                      Inc(P);
                      if P^ <> '''' then Break; // closing quote found
                    end;
                end; // case P^
                S^:= P^;
                Inc(S);
                Inc(P);
              end; // while True
            end; // ''''
          else
            Break;
          end; // case P^ (body of the outer while True)
        StringPtr:= S;
        Result:= Classes.toString;
      end; // '#', ''''

    '$': // hexadecimal number
      begin
        Inc(P);
        while P^ in ['0'..'9', 'A'..'F', 'a'..'f'] do Inc(P);
        Result:= toInteger;
      end;

    '-', '0'..'9':
      begin
        Inc(P);
        while P^ in ['0'..'9'] do Inc(P);
        Result:= toInteger;
        while P^ in ['0'..'9', '.', 'e', 'E', '+', '-'] do
        begin
          Inc(P);
          Result:= toFloat;
        end;
      end;

    '{': // block comment
      begin
        // look for closing brace
        while (P^ <> '}') and (P^ <> toEOF) do
          Inc(P);
        // move to the next
        if (P^ <> toEOF) then
          Inc(P);
        Result:= toComment;
      end;

    else
      if (P^ = '/') and (P^ <> toEOF) and ((P+1)^ = '/') then
      begin // end-of-line comment
        while (P^ <> #13) and (P^ <> toEOF) do Inc(P);
        Result:= toComment;
      end
      else // anything else...
      begin
        Result:= P^;
        if Result <> toEOF then Inc(P);
      end;
  end; // case P^ of
  SourcePtr:= P;
  FToken:= Result; // FToken controls the main loop in Convert
end;

The Convert() method

The Convert() method of the base class performs the syntactic analysis. It takes the token and the block of characters that NextToken() has produced and converts it into a syntax highlighted HTML-coded string ready to be added to the output.

It has been left nearly unchanged; the statements that were modified, such as the handling of a trailing dot through SDot, appear in the code that follows:

procedure TGtroCodeParser.Convert;
// parses the entire source file
var
  S, SDot: AnsiString;
  i, L: Integer;
begin
  InitFile; // virtual
  ShowLineNumbers(0);
  NextToken; // get the first token
  FLine:= 1;
  Position:= 0;
  while Token <> toEOF do
  begin

    while SourceLine > FLine do
    begin // if the source code line has changed,
      OutStr:= OutStr + #13#10;
      ShowLineNumbers(FLine + 1);
      Inc(FLine);
      Position:= Position + 2; // 2 characters: cr+lf
    end;

    // add proper white spaces (formatting)
    while SourcePos > Position do
    begin
      OutStr:= OutStr + ' ';
      Inc(Position);
    end;

    // check the token
    case Token of

      toSymbol:
      begin
        SDot:= '';
        S:= TokenString;
        // set a trailing dot aside so that it is not part of the lookup
        if S[Length(S)] = '.' then
        begin
          S:= Copy(S, 1, Length(S)-1); // string indexes start at 1
          SDot:= '.';
        end;
        if FKeywords.IndexOf(S) < 0 then // not a keyword
          OutStr:= OutStr + S + SDot // + SDot added 17 June 2010
        else
        begin
          BeforeKeyword;
          OutStr:= OutStr + S;
          AfterKeyword;
          OutStr:= OutStr + SDot;
        end;
      end;

      Classes.toString:
      begin
        BeforeString; // virtual
        if (Length(TokenString) = 1) and (Ord(TokenString [1]) < 32) then
        begin
          OutStr:= OutStr + '#' +
            IntToStr(Ord(TokenString [1]));
          if Ord(TokenString [1]) < 10 then
            Position:= Position + 1
          else
            Position:= Position + 2;
        end
        else
        begin
          OutStr:= OutStr + MakeStringLegal(TokenString);
          Position:= Position + 2; // 2 characters: the string delimiters
        end;
        AfterString;
      end;

      toInteger:
        OutStr:= OutStr + TokenString;

      toFloat:
        OutStr:= OutStr + TokenString;

      toComment:
      begin
        BeforeComment;
        OutStr:= OutStr + MakeCommentLegal(TokenString);
        AfterComment;
      end;
      else // any other token
        OutStr:= OutStr + CheckSpecialToken(Token);
    end; // case Token of
    // increase the current position
    Position:= Position + Length(TokenString);
    // move to the next token
    NextToken;
  end; // while Token <> toEOF do
  // add final code
  EndFile; // virtual
  // add the AnsiString to the stream
  Dest.WriteBuffer(Pointer(OutStr)^, Length(OutStr));
end;

First, it starts the output string OutStr that will eventually end up in the output stream. Then it calls the NextToken() method. From the result of that method, it evaluates the Token returned and fills OutStr with HTML-formatted strings as shown below.

[Figure: Flow diagram of the Convert() method]

Once the first token has been analyzed and its results added to OutStr, the process is repeated until the end of file is retrieved.

[Figure: Flow diagram of the TokenString() method]

The TokenString() method

When TokenString() is called, the parser does not yet hold a string that Delphi can manipulate. It has a Token and a block of characters delimited by pointers: one pointing to the beginning of the block in the buffer, the other pointing to its end. TokenString() performs the task of making such a string available.

It considers two cases: if the token is toString, the length L of the block is the difference between StringPtr and TokenPtr; for any other token, it is the difference between SourcePtr and TokenPtr.

The statement "SetString(Result, TokenPtr, L);" then produces an AnsiString that Delphi can manipulate.
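The pointer arithmetic involved can be tried in isolation. The sketch below is not the TokenString() method itself: BlockToString() is a hypothetical helper that takes the two pointers as parameters instead of reading them from the fields of the parser.

```pascal
program TokenStringSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: copies the characters from TokenStart (inclusive)
// up to TokenEnd (exclusive) into a new AnsiString.
function BlockToString(TokenStart, TokenEnd: PAnsiChar): AnsiString;
var
  L: Integer;
begin
  L := TokenEnd - TokenStart; // pointer difference = number of AnsiChars
  SetString(Result, TokenStart, L);
end;

var
  Buffer: AnsiString;
  Start, P: PAnsiChar;
begin
  Buffer := 'begin x := 1;';
  Start := PAnsiChar(Buffer); // plays the role of TokenPtr
  P := Start;
  // advance P over the letters of the first symbol
  while P^ in ['A'..'Z', 'a'..'z', '_'] do
    Inc(P);
  // P now plays the role of SourcePtr
  WriteLn(BlockToString(Start, P)); // begin
end.
```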

Three more methods are used:

A new descendant class

After having used the parser developed by Cantu and Gooch to highlight Delphi code in my Web pages, it would have been natural to develop new descendant classes for C++ or Visual Basic. I did not because I don't use these languages. However, since I maintain a PHP-based Web site, PHP was the real candidate.

PHP is a server-based interpreted language that is used very broadly for Web sites. Its purpose and structure are very different from that of Pascal. What do I do normally with PHP:

This process required a new sub-class of TGtroCodeParser that I have called TGtroPHPParser. The code of TGtroPascalParser was copied to the new class and adaptations had to be made because of differences between the languages, as described in the sections that follow.

Modifications to NextToken()

The adaptation of the NextToken() method to the PHP programming language required major modifications to the code of NextToken() for Pascal. The statements that were added or modified are discussed, by line number, in the sections that follow the code of the method:

   [1]  function TGtroPHPParser.NextToken: AnsiChar;
   [2]  // called by Convert
   [3]  var
   [4]    P, S: PAnsiChar;
   [5]    i: Integer;
   [6]  begin
   [7]    SkipBlanks;  // fills buffer when needed
   [8]    P:= SourcePtr;
   [9]    TokenPtr:= P;
  [10]    case P^ of // Case Level 0
  [11]  
  [12]      'A'..'Z', 'a'..'z', '_':
  [13]      begin  // to Symbol: p^ is a letter or '_': looking for a symbol
  [14]        Inc(P);
  [15]        while P^ in ['A'..'Z', 'a'..'z', '0'..'9', '_', '.'] do
  [16]          Inc(P);
  [17]        Result:= toSymbol;
  [18]      end; // toSymbol
  [19]  
  [20]      '$': // toVar: a PHP variable
  [21]      begin
  [22]        Inc(P);
  [23]        while P^ in ['A'..'Z', 'a'..'z', '0'..'9', '_', '.'] do Inc(P);
  [24]        Result:= toVar;
  [25]      end; // toVar
  [26]  
  [27]      '''', '"': // toString: p^is '''' or '"'
  [28]      begin
  [29]        S:= P;
  [30]        while True do // while loop Level 1
  [31]          case P^ of // case Level 1
  [32]            '"':
  [33]            begin
  [34]              Inc(P);
  [35]              while True do // While loop Level 2
  [36]              begin
  [37]                case P^ of // Case Level 2
  [38]                  '"':
  [39]                  begin
  [40]                    Inc(P);
  [41]                    if P^ <> '"' then Break; // exit while loop Level 2
  [42]                  end; // '"':
  [43]  
  [44]                  '\': // escape character \"
  [45]                  begin
  [46]                    S^:= P^;
  [47]                    Inc(S);
  [48]                    Inc(P);
  [49]                    if not (P^ in ESC) then Break; // exit while loop Level 2
  [50]                  end;
  [51]                end; 
  [52]                S^:= P^;
  [53]                Inc(S);
  [54]                Inc(P);
  [55]              end; // End of while loop Level 2
  [56]              IsQuote:= False; //  executed after the Break
  [57]            end; // End of toString w/ delimiter '"'
  [58]  
  [59]            '''': //looking for a Ansistring
  [60]            begin
  [61]              Inc(P);
  [62]              while True do // While loop Level 2
  [63]              begin
  [64]                case P^ of // Case Level 2
  [65]  
  [66]                  #0, #10, #13:
  [67]                    Error('Invalid Ansistring');
  [68]  
  [69]                  '''':
  [70]                  begin
  [71]                    Inc(P);
  [72]                    if P^ <> '''' then Break; // exit the while loop Level 2
  [73]                  end;
  [74]                end; // End of case Level 2
  [75]                S^:= P^;
  [76]                Inc(S);
  [77]                Inc(P);
  [78]              end; // End of While loop Level 2
  [79]              IsQuote:= True; //  executed after the Breaking from loop Level 2
  [80]            end; // End of toString w/ delimiter ''''
  [81]          else // Else for case Level 1
  [82]            Break; // exit while loop Level 1
  [83]  
  [84]        end; // End of case Level 1 (body of while loop Level 1)
  [85]        StringPtr:= S;
  [86]        Result:= Classes.toString;
  [87]      end; // End of toString
  [88]  
  [89]      '-', '0'..'9': // Case Level 0
  [90]      begin
  [91]        Inc(P);
  [92]        while P^ in ['0'..'9'] do Inc(P);
  [93]        Result:= toInteger;
  [94]        while P^ in ['0'..'9', '.', 'e', 'E', '+', '-'] do
  [95]        begin
  [96]          Inc(P);
  [97]          Result:= toFloat;
  [98]        end;
  [99]      end; // End of toFloat
 [100]  
 [101]      '#': // single line comment with "#"
 [102]      begin
 [103]        while (P^ <> #13) and (P^ <> toEOF) do Inc(P);
 [104]        Result:= toComment;
 [105]      end; // End of toComment
 [106]  
 [107]    else // for case Level 0
 [108]  
 [109]      if (P^ = '/') and (P^ <> toEOF) and ((P+1)^ = '/') then
 [110]      begin // single line comment  with //
 [111]        while (P^ <> #13) and (P^ <> toEOF) do Inc(P);
 [112]        Result:= toComment;
 [113]      end
 [114]      else
 [115]      if (P^ = '/') and (P^ <> toEOF) and ((P+1)^ = '*') then
 [116]      begin   // block comment with /* ... */
 [117]        Inc(P);
 [118]        Inc(P);
 [119]        while (P^ <> toEOF) and not ((P^ = '*') and ((P+1)^ = '/')) do
 [120]          Inc(P);
 [121]        if (P^ <> toEOF) then
 [122]        begin
 [123]          Inc(P);
 [124]          Inc(P);
 [125]        end;
 [126]        Result:= toComment;
 [127]      end // End of toComment
 [128]      else // anything else...
 [129]      begin
 [130]        Result:= P^;
 [131]        if Result <> toEOF then Inc(P);
 [132]      end; // End of P^
 [133]  
 [134]    end; // End of case Level 0
 [135]    SourcePtr:= P;
 [136]    FToken:= Result; // FToken is used by Convert
 [137]  end;

The new toVar token

The first modifications shown on lines [20] to [25] have to do with the fact that PHP variables are easily recognized because they all start with the character "$". It has forced the definition of a new token called "toVar" which is now defined as #6 in the base class. A new section of the "case Token of" of Convert() had to be added to handle this new token. The code of this new section follows:

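      toVar:
      begin
        SDot:= '';
        S:= TokenString;
        // set a trailing dot aside so that it stays outside the <var> tag
        if S[Length(S)] = '.' then
        begin
          S:= Copy(S, 1, Length(S)-1); // string indexes start at 1
          SDot:= '.';
        end;
        BeforeVar; // put the <var> tag before
        OutStr:= OutStr + S;
        AfterVar; // append the </var> closing tag
        OutStr:= OutStr + SDot; // put the dot back after the closing tag
      end;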

An additional string delimiter

In addition to ', PHP recognizes " as a string delimiter. This has prompted the modifications shown on lines [32] to [57] in the NextToken() method. With few exceptions, this section of the code handles the strings the same way as they are treated with the old ' string delimiter. One of the exceptions is the treatment of the escape sequences of PHP on lines [44] to [50] of NextToken(). The list of escape sequences that follows is extracted from a reference entitled "Escaped sequences":

Sequence | Meaning
\" | Print the next character as a double quote, not a string closer
\' | Print the next character as a single quote, not a string closer
\n | Print a new line character (remember our print statements?)
\t | Print a tab character
\r | Print a carriage return (not used very often)
\$ | Print the next character as a dollar, not as part of a variable
\\ | Print the next character as a backslash, not an escape character

As a result, a constant called ESC = ['"', '''', 'r', 'n', 't', '$', '\']; has been defined at the beginning of the PHPConverter.pas unit and is used on line [49] of the NextToken() method to interpret the escape sequences correctly. There was no need for similar code in the treatment of single-quoted strings since PHP does not interpret these sequences there, apart from \' and \\.

A new boolean private variable IsQuote also had to be created. It is updated on lines [56] and [79] and used to modify the behaviour of MakeStringLegal() as shown below:

function TGtroPHPParser.MakeStringLegal(S: AnsiString): AnsiString;
const
  SglQuote = ''''; // single quote
  DblQuote = '"';  // double quote
var
  i: Integer;
  Quote: AnsiString;
begin
  // IsQuote is True when the string was delimited by single quotes
  Quote:= SglQuote;
  if not IsQuote then
    Quote:= DblQuote;

  if Length(S) < 1 then
  begin
    Result:= Quote + Quote; // here is the culprit!
    Exit;
  end;

  // if the first character is not special, add the open quote
  if S[1] > #31 then
    Result:= Quote  // here is the culprit!
  else
    Result:= '';

  // for each character of the AnsiString
  for i:= 1 to Length(S) do
    case S[i] of

      // special characters (characters below the value 32)
      #0..#31: begin
        Position:= Position + Length(IntToStr(Ord(S[i])));
        // if preceding characters are plain ones,
        // close the AnsiString
        if (i > 1) and (S[i-1] > #31) then
          Result:= Result + Quote;
        // add the special character
        Result:= Result + '#' + IntToStr(Ord(S[i]));
        // if the following characters are plain ones,
        // open the AnsiString (i < Length(S) so that a plain last
        // character also gets its opening quote)
        if (i < Length(S)) and (S[i+1] > #31) then
          Result:= Result + Quote;
      end;
    else
      Result:= Result + CheckSpecialToken(S[i]);
    end;

  // if the last character was not special, add closing quote
  if (S[Length(S)] > #31) then
    Result:= Result + Quote;
end;

Block comment /* */

Lines [115] to [127] of NextToken() handle the block comments delimited by /* and */. The code is about the same as that dealing with the { } block comment delimiters in the Pascal parser, except that /* and */ are two-character delimiters that cannot be matched directly by a label of the main case statement of the method. They are dealt with in the "else" clause of the case statement.
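Because the closing delimiter has two characters, the scan has to test both characters together at each position. A minimal stand-alone sketch of that idea follows. SkipBlockComment() is a hypothetical function, not a method of the parser, and it assumes that the text ends with #0, as every AnsiString does.

```pascal
program BlockCommentSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Given P pointing at the '/' of an opening "/*", returns a pointer just
// past the closing "*/", or at the terminating #0 if the comment is never
// closed.
function SkipBlockComment(P: PAnsiChar): PAnsiChar;
begin
  Inc(P, 2); // skip the opening /*
  // stop only on '*' immediately followed by '/'
  while (P^ <> #0) and not ((P^ = '*') and ((P + 1)^ = '/')) do
    Inc(P);
  if P^ <> #0 then
    Inc(P, 2); // skip the closing */
  Result := P;
end;

var
  Source, Comment: AnsiString;
  Start, Stop: PAnsiChar;
begin
  Source := '/* a * b / c */ echo 1;';
  Start := PAnsiChar(Source);
  Stop := SkipBlockComment(Start);
  SetString(Comment, Start, Stop - Start);
  WriteLn(Comment); // /* a * b / c */
end.
```

The lone * and / inside the comment do not end the scan because the loop only stops when both characters of the closing delimiter are found side by side.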

Conclusion

This article contains nothing new as far as programming is concerned: it is simply an extension of the approach taken by Cantu and Gooch that can operate on Pascal/Delphi, PHP, Javascript and Visual Basic source code and produce an output which will render syntax highlighting on a Web page. In addition, it provides the documentation that was missing on the TParser class of RAD Studio and provides a tutorial on the way to use pointer arithmetic and nested case statements and "while True" loops.

At the moment, the program can handle Pascal, PHP, Javascript and Visual Basic source code even though only the path to Delphi and PHP syntax highlighting is covered in this article. However, I intend to add other programming languages to it when needed. The program and the code of the program will be updated as soon as new languages are implemented. You can download the code of the program and the executable from here.

This compressed file contains all the files needed to launch the GtroCodeParser project in Delphi, including the executable, GtroCodeParser.exe, dated 26 July 2010 19:10. The executable does not need any installation and its default language is Delphi, but it requires that the files containing the keywords of each language be in the same directory as the executable. These files are included in the folder.

Beware!
Do not paste the output produced by this program directly into a WYSIWYG editor: all the HTML tags that the program has added to the document will be interpreted as text and their effect will be lost. Paste it into a text or code editor instead.

The content of the article is heavily based on work done by Marco Cantu and Tim Gooch in their book entitled "Delphi Developer's Handbook" (pp. 128-142). Indebtedness is hereby acknowledged.

Annexes

Annex A - Stream and buffer management

In this program, the code to be highlighted is stored in a memory stream called Source. Because a stream of bytes is inconvenient to parse directly, its content is read into a buffer of 4096 Ansi characters. The buffer is an AnsiString with two pointers: FBuffer, which points to the beginning of the buffer, and FBufEnd, which points to its end.

Two methods of the base class handle the management of the stream and the buffer: SkipBlanks() and ReadBuffer().

A third method, ShowLineNumbers(), which handles the display of line numbers in the output, is also presented below.

The SkipBlanks() method

The SkipBlanks() method is called at the beginning of NextToken(), i.e., each time a token is retrieved. It consists of a "while True do" loop that examines the character pointed to by SourcePtr. If it is #0, the end of the buffer has been reached and ReadBuffer() is called to refill it; if the character is still #0 after the refill, the source is exhausted and the method exits. If it is the line-feed character #10, the line counter FSourceLine is incremented. Blank spaces (#32) and other control characters are skipped, and as soon as a printable character is found, control leaves the method.

  [1]  procedure TGtroCodeParser.SkipBlanks;
  [2]  // called by NextToken
  [3]  begin
  [4]    while True do
  [5]    begin
  [6]      case SourcePtr^ of
  [7]        #0: // end-of-buffer marker detected
  [8]          begin
  [9]            ReadBuffer; // time to read a new sequence of characters into the buffer
 [10]            if SourcePtr^ = #0 then Exit;
 [11]            Continue;
 [12]          end;
 [13]        #10: Inc(FSourceLine); // Linefeed character detected
 [14]        '!'..'ÿ' : Exit; // any printable character detected
 [15]      end; // case
 [16]      Inc(SourcePtr);
 [17]    end; // while True
 [18]  end;

The ReadBuffer() method

The code is essentially Borland's. If its purpose were simply to fill the buffer with new Ansi characters, the plethora of pointers and some of the code below (notably lines [15] to [19]) would not be necessary. They are needed because of the requirement that no line of code should straddle two consecutive buffers.

  [1]  procedure TGtroCodeParser.ReadBuffer;
  [2]  // Called by CheckBuffer when SourcePtr^ is #0
  [3]  var
  [4]    Count: Integer;
  [5]  begin
  [6]    Inc(FOrigin, SourcePtr - FBuffer); // SourcePtr is initialized to FBuffer in constructor
  [7]    FSourceEnd[0]:= FSaveChar; // resets FSourceEnd[0]
  [8]    Count:= FBufPtr - SourcePtr; // is initially zero
  [9]    if Count <> 0 then // moves the Count unparsed bytes to the start of the buffer
 [10]      Move(SourcePtr[0], FBuffer[0], Count);
 [11]    FBufPtr:= FBuffer + Count;
 [12]    Inc(FBufPtr, Source.Read(FBufPtr[0], FBufEnd - FBufPtr));
 [13]    SourcePtr:= FBuffer; // brings back SourcePtr at the beginning of buffer
 [14]    FSourceEnd:= FBufPtr;
 [15]    if FSourceEnd = FBufEnd then // check for partial line at end of buffer
 [16]    begin
 [17]      FSourceEnd:= LineStart(FBuffer, FSourceEnd - 1); // LineStart declared in classes.pas
 [18]      if FSourceEnd = FBuffer then Error('Line too long');
 [19]    end;
 [20]    FSaveChar:= FSourceEnd[0]; // saves the character that #0 is about to overwrite
 [21]    FSourceEnd[0]:= #0; // marks the end of the parsable part of the buffer
 [22]  end;

The requirement that no line straddles two consecutive buffers is met by the statements in lines [15] to [19]: when the buffer is full, LineStart() finds the start of the last, possibly partial, line in the buffer and FSourceEnd is set to point there, just past the last line-feed character. FSourceEnd[0] is then saved and replaced by #0 so that the parsing of the buffer stops there, leaving the remaining partial line unparsed. The next time the buffer is filled, line [7] restores the saved character, lines [8] to [10] move the partial line to the beginning of the buffer, and the rest of the buffer is filled with characters from the input stream. This process is repeated on each fill of the buffer.
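The carry-over of a partial line can be sketched in isolation. The program below is not Borland's code: FillBuffer, the 16-character buffer and the integer indexes Used and Cut are hypothetical stand-ins for ReadBuffer() and its pointers, and an AnsiString plays the role of the Source stream. It only assumes that lines end with #10.

```pascal
program PartialLineSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}
{$ASSERTIONS ON}

const
  BufSize = 16; // deliberately tiny; the article's buffer holds 4096 characters

var
  Source: AnsiString;   // stands in for the Source stream
  SrcPos: Integer = 1;  // next unread character of Source
  Buffer: array[0..BufSize - 1] of AnsiChar;
  Used: Integer = 0;    // characters currently held in Buffer
  Cut: Integer = 0;     // characters of Buffer already handed to the parser

// Refills Buffer and returns how many leading characters may be parsed.
function FillBuffer: Integer;
var
  Rest, N: Integer;
begin
  // move the unparsed tail (a partial line) to the start of the buffer
  Rest := Used - Cut;
  if Rest > 0 then
    Move(Buffer[Cut], Buffer[0], Rest);
  // top the buffer up from the source
  N := Length(Source) - SrcPos + 1;
  if N > BufSize - Rest then
    N := BufSize - Rest;
  if N > 0 then
    Move(Source[SrcPos], Buffer[Rest], N);
  Inc(SrcPos, N);
  Used := Rest + N;
  // a full buffer may end in the middle of a line: cut after the last #10
  Cut := Used;
  if Used = BufSize then
  begin
    while (Cut > 0) and (Buffer[Cut - 1] <> #10) do
      Dec(Cut);
    if Cut = 0 then
    begin
      WriteLn('Line too long');
      Halt(1);
    end;
  end;
  Result := Cut;
end;

var
  Count: Integer;
  Chunk: AnsiString;
begin
  Source := 'alpha' + #10 + 'beta' + #10 + 'gamma delta' + #10 + 'end' + #10;
  repeat
    Count := FillBuffer;
    if Count > 0 then
    begin
      SetString(Chunk, PAnsiChar(@Buffer[0]), Count);
      // holds here because the sample source itself ends with a line feed
      Assert(Chunk[Count] = #10);
      Write(Chunk);
    end;
  until Count = 0;
end.
```

With this sample, the first fill reads 16 characters but hands over only the two complete lines 'alpha' and 'beta'; the five characters of 'gamma' are carried to the front of the buffer and completed by the second fill.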

The ShowLineNumbers() method

This method has been added to display the line numbers of each line of code displayed in the output. It has been designed to display these line numbers right-justified in their field (as shown below).

  [1]  procedure TGtroCodeParser.ShowLineNumbers(LineNumber: Integer);
  [2]  var
  [3]    I, L: Integer;
  [4]  begin
  [5]    if HTMLParser.NumLinesOn then
  [6]    begin
  [7]      L:= Length(IntToStr(FLine + 1));
  [8]      for i:= 0 to NumLines - L + 1 do
  [9]        OutStr:= OutStr + ' ';
 [10]      OutStr:= OutStr + '[' + IntToStr(FLine + 1) + ']  '; // append the bracketed line number
 [11]    end;
 [12]  end;
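The right-justification can be tried out on its own with a minimal sketch. LineLabel and its Width parameter are hypothetical names, not members of TGtroCodeParser; the sketch only assumes that the padding equals the reserved number of digits minus the number of digits actually needed.

```pascal
program LineNumberSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

// Hypothetical helper: returns '[n]' preceded by enough spaces for the
// closing bracket to land in the same column on every line. Width is the
// number of digits reserved for the largest line number.
function LineLabel(LineNumber, Width: Integer): string;
var
  Digits: string;
  Pad: Integer;
begin
  Str(LineNumber, Digits); // integer to decimal text, no extra unit needed
  Pad := Width - Length(Digits);
  if Pad < 0 then
    Pad := 0; // number wider than the field: no padding at all
  Result := StringOfChar(' ', Pad) + '[' + Digits + ']  ';
end;

var
  I: Integer;
begin
  // one-digit and two-digit numbers end in the same column
  for I := 8 to 11 do
    WriteLn(LineLabel(I, 3), 'code');
end.
```

With a width of 3, lines 8 and 9 receive two leading spaces and lines 10 and 11 receive one, which is the layout used by the listings in this article.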

Warning!
This code was developed for the pleasure of it. Anyone who decides to use it does so at their own risk and agrees not to hold the author responsible for its failure.


Last modified: September 3rd 2014 12:12:25.