Ask Reuben

Regex Compile

Is it better to use string.matches or util.Regexp.matches?

An interesting question was posed: with the new regular expression functionality added in 4.00, is it better to use string.matches or util.Regexp.matches?

The difference is that the string.matches implementation is a single line:

LET result = string.matches(pattern)

whilst the code using the util.Regexp class is a little more complex:

DEFINE re util.Regexp
LET re = util.Regexp.compile(pattern)
LET result = re.matches(subject)

The intuitive answer is that there must be a reason the compile and match steps are separated, and we can validate this using the profiler, which I covered in an earlier Ask Reuben article.

In this example, I execute each technique 10,000 times …

IMPORT util

DEFINE s STRING
DEFINE r util.Regexp
DEFINE result BOOLEAN
CONSTANT TEST_PATTERN = `^-?(0|[1-9]\\d*)(\\.\\d+)?$`
MAIN
    DEFINE i INTEGER
    LET s = "FOO100"
    -- Compile the pattern once, outside the loop
    LET r = util.Regexp.compile(TEST_PATTERN)

    FOR i = 1 TO 10000
        CALL test_string()
        CALL test_regexp()
    END FOR
END MAIN

FUNCTION test_string()
    LET result = s.matches(TEST_PATTERN)
END FUNCTION

FUNCTION test_regexp()
    LET result = r.matches(s)
END FUNCTION

… you will get results similar to …

Flat profile (order by self)
count %total %child %self name
10000 82.3 0.0 82.3 base.String.matches
10000 87.2 82.3 4.9 regexp_performance.test_string
1 100.0 95.7 4.3 regexp_performance.main
10000 8.4 4.2 4.2 regexp_performance.test_regexp
10000 4.2 0.0 4.2 util.Regexp.matches
1 0.1 0.0 0.1 util.Regexp.compile
1 0.0 0.0 0.0 .rts_forInit

The important thing to note is how much time was spent in the two functions, test_string() and test_regexp(). The less time, the better.

The %total column shows that far less time is spent in test_regexp(), which uses the util.Regexp class (8.4%), than in test_string() (87.2%).

Why is this? With string.matches, both the compile and match steps are carried out on every iteration. With the util.Regexp class, the compile step is carried out only once, outside the loop, and only the match step is carried out on each iteration. This is very similar to database queries, where using cursors means the “how am I going to do it” calculation is carried out once rather than each time the SQL statement is executed.

So the moral of the story: any time you are going to reuse the same regular expression pattern multiple times, investigate using the util.Regexp class so that the pattern is compiled only once.
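
As a minimal sketch of that compile-once approach outside a benchmark, the following checks a small list of strings against one pattern. The array name items, its sample values, and the digits-only pattern are assumptions made for illustration; only util.Regexp.compile and matches are taken from the example above.

```4gl
IMPORT util

MAIN
    DEFINE re util.Regexp
    DEFINE items DYNAMIC ARRAY OF STRING  -- hypothetical sample data
    DEFINE i INTEGER

    LET items[1] = "100"
    LET items[2] = "FOO100"
    LET items[3] = "42"

    -- Compile the (illustrative) pattern once, before the loop
    LET re = util.Regexp.compile("^[0-9]+$")

    -- Only the match step runs on each iteration
    FOR i = 1 TO items.getLength()
        IF re.matches(items[i]) THEN
            DISPLAY items[i], " matches"
        ELSE
            DISPLAY items[i], " does not match"
        END IF
    END FOR
END MAIN
```

If the pattern were only ever tested once, the one-line string.matches form would likely be the simpler choice; the separate compile step pays off when the same pattern is applied repeatedly.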

This also highlights how you can use the profiler to examine the performance of different coding techniques. Create a simple test and see how much relative time is spent in each function.