1.20. Regular Matching

Felix provides support for matching strings against regular expressions.

Unlike commonly used regexp libraries, regular expressions are not strings: instead a first class syntax is used to define them.

Felix allows you to name regular expressions with the syntax:

  regexp <name> = <regexp> ;
The name is an identifier. A string used in a regexp stands for a match of each character of the string in sequence. The following symbols are special, and are given from weakest to strongest binding order:
symbolsyntaxmeaning
|infixalternatives
*postfix0 or more occurences
+postfix1 or more occurences
?postfix0 or 1 occurences
<juxtaposition>infixconcatenation
<name>atomicre denoted by the name in a REGEXP definition
<string>atomicsequence of chars of the string
[<charset>]atomicany char of the charset
[^<charset>]atomicany char not in the charset
.atomicany char other than end of line
_atomicany char
eofatomicend marker
(<regexp>)atomicbrackets
The usual notation for character sets is employed:
symbolmeaning
<string>any character in the string
<char>-<char>any between or including the two chars
Note that a char is represented by a one character string.

Start felix section to tut/examples/tut_beg121.flx[1 /2 ] Next Last
     1: #line 896 "./lpsrc/flx_tutorial.pak"
     2: #import <flx.flxh>
     3: regexp lower = ["abcdefghijklmnopqrstuvwxyz"];
     4: regexp upper = ["ABCDEFGHIJKLMNOPQRSTUVWXYZ"];
     5: regexp digit = ["0123456789"];
     6: regexp alpha = lower | upper | "_";
     7: regexp id = alpha (alpha | digit) *;
     8: 
End felix section to tut/examples/tut_beg121.flx[1]
Regular expressions are used in regular match expressions. These are like ordinary matches, except that the pattern is a regular expression, and the argument must be a string. If more than one pattern matches the string, the first one is used.
Start felix section to tut/examples/tut_beg121.flx[2 /2 ] Prev First
     9: #line 910 "./lpsrc/flx_tutorial.pak"
    10: print
    11:   regmatch "identifier" with
    12:   | digit+ => "Number"
    13:   | id =>  "Identifier"
    14:   endmatch
    15: ;
    16: endl;
    17: 
    18: print
    19:   regmatch "9999" with
    20:   | digit+ => "Number"
    21:   | id =>  "Identifier"
    22:   endmatch
    23: ;
    24: endl;
    25: 
    26: print
    27:   regmatch "999xxx" with
    28:   | digit+ => "Number"
    29:   | id =>  "Identifier"
    30:   | _* => "Neither"
    31:   endmatch
    32: ;
    33: endl;
    34: 
    35: 
End felix section to tut/examples/tut_beg121.flx[2]
Note that whilst conceptually regular matches are applied first to last, the actual implementation uses a finite state machine and guarranteed to be linear in the length of the input, and, in particular independent of the number of regular expressions. Each character is processed at most once.

Note: the generated code is *extremely* fast, within one or two memory fetches of the fastest possible code. here is the generated code for the inner loop of a regmatch:

  while(state && start != end)
    state = matrix[*start++][state];