Next: Is backing up a big deal?, Previous: deleteme00, Up: FAQ [Contents][Index]
To: Adoram Rogel <adoram@orna.hybridge.com> Subject: Re: Flex 2.5.2 performance questions In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT. Date: Wed, 18 Sep 96 10:51:02 PDT From: Vern Paxson <vern> [Note, the most recent flex release is 2.5.4, which you can get from ftp.ee.lbl.gov. It has bug fixes over 2.5.2 and 2.5.3.] > 1. Using the pattern > ([Ff](oot)?)?[Nn](ote)?(\.)? > instead of > (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.))) > (in a very complicated flex program) caused the program to slow from > 300K+/min to 100K/min (no other changes were done). These two are not equivalent. For example, the first can match "footnote." but the second can only match "footnote". This is almost certainly the cause in the discrepancy - the slower scanner run is matching more tokens, and/or having to do more backing up. > 2. Which of these two are better: [Ff]oot or (F|f)oot ? From a performance point of view, they're equivalent (modulo presumably minor effects such as memory cache hit rates; and the presence of trailing context, see below). From a space point of view, the first is slightly preferable. > 3. I have a pattern that look like this: > pats {p1}|{p2}|{p3}|...|{p50} (50 patterns ORd) > > running yet another complicated program that includes the following rule: > <snext>{and}/{no4}{bb}{pats} > > gets me to "too complicated - over 32,000 states"... I can't tell from this example whether the trailing context is variable-length or fixed-length (it could be the latter if {and} is fixed-length). If it's variable length, which flex -p will tell you, then this reflects a basic performance problem, and if you can eliminate it by restructuring your scanner, you will see significant improvement. > so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about > 10 patterns and changed the rule to be 5 rules. > This did compile, but what is the rule of thumb here ? The rule is to avoid trailing context other than fixed-length, in which for a/b, either the 'a' pattern or the 'b' pattern have a fixed length. Use of the '|' operator automatically makes the pattern variable length, so in this case '[Ff]oot' is preferred to '(F|f)oot'. > 4. I changed a rule that looked like this: > <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN... > > to the next 2 rules: > <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;} > <snext8>{and}{bb}/{ROMAN} { BEGIN... > > Again, I understand the using [^...] will cause a great performance loss Actually, it doesn't cause any sort of performance loss. It's a surprising fact about regular expressions that they always match in linear time regardless of how complex they are. > but are there any specific rules about it ? See the "Performance Considerations" section of the man page, and also the example in MISC/fastwc/. Vern
Next: Is backing up a big deal?, Previous: deleteme00, Up: FAQ [Contents][Index]