Lexical Analysis With Flex, for Flex 2.6.2: Are certain equivalent patterns faster than others?

To: Adoram Rogel <adoram@orna.hybridge.com>
Subject: Re: Flex 2.5.2 performance questions
In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT.
Date: Wed, 18 Sep 96 10:51:02 PDT
From: Vern Paxson <vern>

[Note, the most recent flex release is 2.5.4, which you can get from
ftp.ee.lbl.gov.  It has bug fixes over 2.5.2 and 2.5.3.]

> 1. Using the pattern
>    ([Ff](oot)?)?[Nn](ote)?(\.)?
>    instead of
>    (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.)))
>    (in a very complicated flex program) caused the program to slow from
>    300K+/min to 100K/min (no other changes were done).

These two are not equivalent.  For example, the first can match "footnote."
but the second can only match "footnote".  This is almost certainly the
cause in the discrepancy - the slower scanner run is matching more tokens,
and/or having to do more backing up.

> 2. Which of these two are better: [Ff]oot or (F|f)oot ?

From a performance point of view, they're equivalent (modulo presumably
minor effects such as memory cache hit rates; and the presence of trailing
context, see below).  From a space point of view, the first is slightly
preferable.

> 3. I have a pattern that look like this:
>    pats {p1}|{p2}|{p3}|...|{p50}     (50 patterns ORd)
>
>    running yet another complicated program that includes the following rule:
>    <snext>{and}/{no4}{bb}{pats}
>
>    gets me to "too complicated - over 32,000 states"...

I can't tell from this example whether the trailing context is variable-length
or fixed-length (it could be the latter if {and} is fixed-length).  If it's
variable length, which flex -p will tell you, then this reflects a basic
performance problem, and if you can eliminate it by restructuring your
scanner, you will see significant improvement.

>    so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about
>    10 patterns and changed the rule to be 5 rules.
>    This did compile, but what is the rule of thumb here ?

The rule is to avoid trailing context other than fixed-length, in which for
a/b, either the 'a' pattern or the 'b' pattern have a fixed length.  Use
of the '|' operator automatically makes the pattern variable length, so in
this case '[Ff]oot' is preferred to '(F|f)oot'.

> 4. I changed a rule that looked like this:
>    <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN...
>
>    to the next 2 rules:
>    <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;}
>    <snext8>{and}{bb}/{ROMAN}         { BEGIN...
>
>    Again, I understand the using [^...] will cause a great performance loss

Actually, it doesn't cause any sort of performance loss.  It's a surprising
fact about regular expressions that they always match in linear time
regardless of how complex they are.

>    but are there any specific rules about it ?

See the "Performance Considerations" section of the man page, and also
the example in MISC/fastwc/.

		Vern