Ask Reuben

Working With Large Strings

What is the best way to create a large string?

In the last Ask Reuben article I looked at using STRING over using CHAR and one of the reasons is performance, particular with large strings, and being able to pass by reference rather than by value.  That is rather than popping and pushing 1000’s of bytes of CHAR data onto and off a stack, simply popping and pushing an address in memory of a STRING onto and off the stack.

In an earlier Ask Reuben article I also discussed the merits of using , (append) and || (concatenate) to build larger strings and suggested using SFMT.  My main focus was on code readability, that SFMT was easier to read, rather than getting bogged down trying to figure out if a comma (,) was inside a string or being used to join two strings tother, and wether quote (“) was inside a string or being used to delimit a string.  I didn’t mention the base.String.append method or the base.StringBuffer class and its base.StringBuffer.append method which can also be used to join two pieces of text together.

When manipulating large strings, for performance reasons the best weapon in your tool-box is using the base.StringBuffer class and its append() method.  The reason being that it makes the most of passing by reference, so instead of  passing 1000 ‘s bytes of data, only an address in memory is passed around.

Instead of taking my word, or the documentations word, how can you verify this performance gain?  The answer to that is to use the profiler.  The profiler measures what percentage of time a Genero program spends in each individual function.   The technique therefore is to write a little test program that does two or more tests of the various techniques and see which of these tests is performed quicker.

CONSTANT TEST_ITERATIONS = 1000
CONSTANT STRING_LENGTH = 1000

MAIN
    DEFINE i INTEGER
    FOR i = 1 TO TEST_ITERATIONS
        CALL do_tests()
    END FOR
END MAIN

FUNCTION do_tests()
    CALL test_char_comma()
    CALL test_char_pipe()
    CALL test_char_sfmt()
    CALL test_string_comma()
    CALL test_string_pipe()
    CALL test_string_sfmt()
    CALL test_string_append()
    CALL test_stringbuffer_append()
END FUNCTION

FUNCTION test_char_comma()
    DEFINE c CHAR(STRING_LENGTH)
    DEFINE i INTEGER

    LET c = "X"
    FOR i = 2 TO STRING_LENGTH
        LET c = c CLIPPED, "X"
    END FOR
    -- DISPLAY c
END FUNCTION

FUNCTION test_char_pipe()
    DEFINE c CHAR(STRING_LENGTH)
    DEFINE i INTEGER

    LET c = "X"
    FOR i = 2 TO STRING_LENGTH
        LET c = c CLIPPED || "X"
    END FOR
    -- DISPLAY c
END FUNCTION

FUNCTION test_char_sfmt()
    DEFINE c CHAR(STRING_LENGTH)
    DEFINE i INTEGER

    LET c = "X"
    FOR i = 2 TO STRING_LENGTH
        LET c = SFMT("%1%2", c CLIPPED, "X")
    END FOR
    -- DISPLAY c
END FUNCTION

FUNCTION test_string_comma()
    DEFINE s STRING
    DEFINE i INTEGER

    LET s = "X"
    FOR i = 2 TO STRING_LENGTH
        LET s = s, "X"
    END FOR
    -- DISPLAY s
END FUNCTION

FUNCTION test_string_pipe()
    DEFINE s STRING
    DEFINE i INTEGER

    LET s = "X"
    FOR i = 2 TO STRING_LENGTH
        LET s = s || "X"
    END FOR
    -- DISPLAY s
END FUNCTION

FUNCTION test_string_sfmt()
    DEFINE s STRING
    DEFINE i INTEGER

    LET s = "X"
    FOR i = 2 TO STRING_LENGTH
        LET s = SFMT("%1%2", s, "X")
    END FOR
    -- DISPLAY s
END FUNCTION

FUNCTION test_string_append()
    DEFINE s STRING
    DEFINE i INTEGER

    LET s = "X"
    FOR i = 2 TO STRING_LENGTH
        LET s = s.append("X")
    END FOR
    -- DISPLAY s
END FUNCTION

FUNCTION test_stringbuffer_append()
    DEFINE s STRING
    DEFINE sb base.StringBuffer
    DEFINE i INTEGER

    LET sb = base.StringBuffer.create()
    FOR i = 1 TO STRING_LENGTH
        CALL sb.append("X")
    END FOR
    LET s = sb.toString()
    -- DISPLAY s
END FUNCTION

In the code above, the function do_tests() runs all of the individual tests.  Each of the individual tests has a name such as test_char_comma() that indicates the technique being used to create a large string.  Each test is building a string of 1000 characters (size as indicated by the constant MAX_STRING_LENGTH) by adding another character at the end.

I run each test a 1000 times  (as indicated by the constant TEST_ITERATIONS) to give a good sample size.  That way you can be sure that your results are not impacted by some random server event.

One gotcha, particular when testing database operations is caching.  So be wary of that and if need be have an init type operation that will ensure the physical area of the database is read into memory before your test begins.

I then look in the profiler output for detail of the function do_tests(), that is in what functions called from do_tests() did the Genero program spend most or the least of its time.

100.00    0.00  100.00   1000/1000     <-- main
[16]	  100.00    0.00  100.00	1000     *** do_tests
	   41.73    5.78   35.95   1000/1000     --> test_char_comma
	   22.37    0.45   21.92   1000/1000     --> test_string_comma
	   10.24    6.74    3.51   1000/1000     --> test_char_sfmt
	    8.67    4.12    4.55   1000/1000     --> test_char_pipe
	    6.58    3.09    3.49   1000/1000     --> test_string_sfmt
	    5.04    2.88    2.17   1000/1000     --> test_string_pipe
	    4.81    2.87    1.95   1000/1000     --> test_string_append
	    0.55    0.36    0.19   1000/1000     --> test_stringbuffer_append

… this output in the left most column tells me what percentage of time was spent doing each individual test.  The smaller the number the better as this tells me less percentage time was spent in the function.

So we can see that for comparative methods, using string was an improvement over char as the percentage time spent in each was lower…

41.73 ... test_char_comma
22.37 ... test_string_comma

10.24 ... test_char_sfmt
6.58  ... test_string_sfmt

8.67 ... test_char_pipe
5.04 ... test_string_pipe

… but the big winner was to use base.StringBuffer.append to create the large string.  It was approximately between 9 (4.81/0.55) and 75 (41.73/0.55) times quicker than the other approaches

41.73 ...  test_char_comma (worst)
...
4.81 ... test_string_append (next best)
0.55 ...  test_stringbuffer_append (best)

If you ran the above test on your system, the exact percentages in the results may vary, but I would expect the overall ranking to be similar, that is string is faster than char, and that base.StringBuffer.append is the fastest way to put together a vary large string.

The profiler will typically be used to find a bottleneck in your program (what function is taking up most of the time), but you can also use it like this to measure two or more different algorithms and see which is quicker, in this case to verify that base.StringBuffer and its append method is probably the best when working with large strings.