2.20. Regular Matching

Felix provides support for matching strings against regular expressions.

Unlike commonly used regexp libraries, regular expressions are not strings: instead a first class syntax is used to define them.

Felix allows you to name regular expressions with the syntax:

  regexp <name> = <regexp> ;
The name is an identifier. A string used in a regexp stands for a match of each character of the string in sequence. The following symbols are special, and are given from weakest to strongest binding order:
symbolsyntaxmeaning
|infixalternatives
*postfix0 or more occurences
+postfix1 or more occurences
?postfix0 or 1 occurences
<juxtaposition>infixconcatenation
<name>atomicre denoted by the name in a REGEXP definition
<string>atomicsequence of chars of the string
[<charset>]atomicany char of the charset
[^<charset>]atomicany char not in the charset
.atomicany char other than end of line
_atomicany char
eofatomicend marker
(<regexp>)atomicbrackets
The usual notation for character sets is employed:
symbolmeaning
<string>any character in the string
<char>-<char>any between or including the two chars
Note that a char is represented by a one character string.

Start felix section to tut/tutorial/tut-1.20-0.flx[1 /2 ] Next Last
     1: #line 1103 "./lpsrc/flx_tutorial.pak"
     2: #import <flx.flxh>
     3: regexp lower = ["abcdefghijklmnopqrstuvwxyz"];
     4: regexp upper = ["ABCDEFGHIJKLMNOPQRSTUVWXYZ"];
     5: regexp digit = ["0123456789"];
     6: regexp alpha = lower | upper | "_";
     7: regexp id = alpha (alpha | digit) *;
End felix section to tut/tutorial/tut-1.20-0.flx[1]
Regular expressions are used in regular match expressions. These are like ordinary matches, except that the pattern is a regular expression, and the argument must be a string. If more than one pattern matches the string, the first one is used.
Start felix section to tut/tutorial/tut-1.20-0.flx[2 /2 ] Prev First
     8: #line 1118 "./lpsrc/flx_tutorial.pak"
     9: print
    10:   regmatch "identifier" with
    11:   | digit+ => "Number"
    12:   | id =>  "Identifier"
    13:   endmatch
    14: ;
    15: endl;
    16: 
    17: print
    18:   regmatch "9999" with
    19:   | digit+ => "Number"
    20:   | id =>  "Identifier"
    21:   endmatch
    22: ;
    23: endl;
    24: 
    25: print
    26:   regmatch "999xxx" with
    27:   | digit+ => "Number"
    28:   | id =>  "Identifier"
    29:   | _* => "Neither"
    30:   endmatch
    31: ;
    32: endl;
End felix section to tut/tutorial/tut-1.20-0.flx[2]
Start data section to tut/tutorial/tut-1.20-0.expect[1 /1 ]
     1: Identifier
     2: Number
     3: Neither
End data section to tut/tutorial/tut-1.20-0.expect[1]
Note that whilst conceptually regular matches are applied first to last, the actual implementation uses a finite state machine and guarranteed to be linear in the length of the input, and, in particular independent of the number of regular expressions. Each character is processed at most once.

Note: the generated code is *extremely* fast, within one or two memory fetches of the fastest possible code. here is the generated code for the inner loop of a regmatch:

  while(state && start != end)
    state = matrix[*start++][state];