A story about optimization, LLVM, and the SentinelInputRange

Developers want code that runs fast and users do too. However, it is easy to fall into the trap of early optimization; this story shows that even in non-trivial cases, modern compilers can do wonders. On the other hand, the myth of the “sufficiently smart compiler” is still very relevant. The best way is to look at what is actually generated for an example and that is what I’ll do in this article. Enter the SentinelInputRange.

A range?

D introduces the concept of a range. For a C++ programmer, a range can be seen as a pair of iterators. For a Java programmer, it can be seen as an InputStream. It is defined as an entity with a property front (which returns the current item in the range), a method popFront, which pops an item from the range, and a property empty, a boolean that becomes true when all elements of the range have been popped. It is obviously invalid to call popFront or access front on an empty range.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// Some range :
struct MyRange {
    private T[] data;

    @property
    T front() {
        return data[0];
    }

    void popFront() {
        data = data[1 .. $];
    }

    @property
    bool empty() {
        return data.length == 0;
    }
}

In D, foreach can handle ranges properly, but to understand our problem let’s consume it manually to print its elements:

1
2
3
4
while(!r.empty) {
    writeln(r.front);
    r.popFront();
}

BTW, as of now, “r” is what I’ll use for a range in D code.

A SentinelInputRange?

A very common way to consume a range is to use a switch statement on the value of front, performing different actions depending on the value encountered. As the discussion on D newsgroup started with a lexer in mind, I’ll reuse that example.

A lexer is a piece of code that consume characters and create tokens. This is the first thing a compiler does once it has loaded your file. For instance, a lexer will recognize that « if » or « is » are keywords, and create the according token; conversely « igloo » is not a keyword and will be recognised as an identifier. Using a range, it is done as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
if(r.empty) {
    // Output some error.
}

switch(r.front) {
    case 'i' :
        r.popFront();
        if(r.empty) return lexIdentifier(r, "i");
        switch(r.front) {
            case 'f' :
                r.popFront();
                return lexKeyword(r, "if");

            case 's' :
                r.popFront();
                return lexKeyword(r, "is");

            default :
                return lexIdentifier(r, "i");
        }

    default :
         // Do something else.
}

Now imagine that the characters come from a C string. C string are consecutive characters in memory followed by a 0 that indicates the end of the string. We call that 0 a sentinel and it allow for something much better for our code above:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
switch(*s) {
    case '\0' :
        // Output some error.

    case 'i' :
        s++;
        switch(*s) {
            case 'f' :
                s++;
                return lexKeyword(s, "if");

            case 's' :
                s++;
                return lexKeyword(s, "is");

            default :
                return lexIdentifier(s, "i");
        }

    default :
         // Do something else.
}

See how awesome it is? As the end of the input is indicated by a special value, we can handle the stream without checking for the end all the time. When we have a non 0 value, we can directly pop the pointer as we know that something comes after. This is impossible to do with ranges, as accessing front before checking for empty is invalid. We end up doing way more checks, hence the code is slower.

But wait a minute? Can’t I create a range that wraps a C string?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
struct CString {
    private char* s;

    @property
    char front() {
        return *s;
    }

    void popFront() {
         s++;
    }

    @property
    bool empty() {
        return *s == '\0';
    }
}

So far so good, but our lexer doesn’t know about CString and so can’t take advantage of its nice capabilities (without breaking encapsulation). It has to check empty before accessing front. To solve this a new type of Range has been proposed: the SentinelInputRange. This new range would define a sentinel property in addition to the other features of a range to indicate to the consumer that this value is the end of the range. The consumer can now take advantage of this, isn’t it great?

In fact it isn’t, as our lexer would either have to have 2 code-paths for SentinelInputRange and other ranges (extra complexity) or forget about one of the cases (limiting either its speed or its generality).

We will discover that in fact, we can all of it for free if we use LLVM.

Let’s summon LLVM to help us!

LLVM is a compiler infrastructure. It defines an intermediate language called LLVM IR, that allow the components of LLVM to talk to each other. Typical components are front-ends that transform programming language such as D or C++ into LLVM IR, the optimizer that consume LLVM IR and produce another LLVM IR that is equivalent in functionality but faster and finally back-ends that transform LLVM IR into assembly.

Let’s compile the example into LLVM IR, using the range lexer and the CString struct. Here is a plausible output (I used SDC but then edited it for more readability) :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
%CString = type { i8* }

@.cstr = private unnamed_addr constant [2 x i8] c"i\00"
@.cstr1 = private unnamed_addr constant [3 x i8] c"if\00"
@.cstr2 = private unnamed_addr constant [3 x i8] c"is\00"

define void @lex(%CString %arg.r) {
body:
  %r = alloca %CString
  store %CString %arg.r, %CString* %r
  %0 = call i1 @empty(%CString* %r)
  br i1 %0, label %then, label %else

then:                                             ; preds = %body
  call void @lexError()
  ret void

else:                                             ; preds = %body
  %1 = call i8 @front(%CString* %r)
  %cond = icmp eq i8 %1, 105
  br i1 %cond, label %.case, label %default

.case:                                            ; preds = %else
  call void @popFront(%CString* %r)
  %2 = call i1 @empty(%CString* %r)
  br i1 %2, label %then1, label %else2

then1:                                            ; preds = %.case
  %3 = load %CString* %r
  call void @lexIdentifier(%CString %3, { i64, i8* } { i64 1, i8* getelementptr inbounds ([2 x i8]* @.cstr, i32 0, i32 0) })
  ret void

else2:                                            ; preds = %.case
  %4 = call i8 @front(%CString* %r)
  switch i8 %4, label %default3 [
    i8 102, label %.case4
    i8 115, label %.case5
  ]

.case4:                                           ; preds = %else2
  call void @popFront(%CString* %r)
  %5 = load %CString* %r
  call void @lexKeyword(%CString %5, { i64, i8* } { i64 2, i8* getelementptr inbounds ([3 x i8]* @.cstr1, i32 0, i32 0) })
  ret void

.case5:                                           ; preds = %else2
  call void @popFront(%CString* %r)
  %6 = load %CString* %r
  call void @lexKeyword(%CString %6, { i64, i8* } { i64 2, i8* getelementptr inbounds ([3 x i8]* @.cstr2, i32 0, i32 0) })
  ret void

default3:                                         ; preds = %else2
  %7 = load %CString* %r
  call void @lexIdentifier(%CString %7, { i64, i8* } { i64 1, i8* getelementptr inbounds ([2 x i8]* @.cstr, i32 0, i32 0) })
  ret void

default:                                          ; preds = %else
  call void @lexSomething(%CString %arg.r)
  ret void
}

define i1 @empty(%CString* nocapture %this) nounwind readonly {
body:
  %0 = getelementptr inbounds %CString* %this, i32 0, i32 0
  %1 = load i8** %0
  %2 = load i8* %1
  %3 = icmp eq i8 %2, 0
  ret i1 %3
}

define i8 @front(%CString* nocapture %this) nounwind readonly {
body:
  %0 = getelementptr inbounds %CString* %this, i32 0, i32 0
  %1 = load i8** %0
  %2 = load i8* %1
  ret i8 %2
}

define void @popFront(%CString* nocapture %this) nounwind {
body:
  %0 = getelementptr inbounds %CString* %this, i32 0, i32 0
  %1 = load i8** %0
  %2 = getelementptr inbounds i8* %1, i64 1
  store i8* %2, i8** %0
  ret void
}

declare void @lexError()
declare void @lexIdentifier(%CString, { i64, i8* })
declare void @lexKeyword(%CString, { i64, i8* })
declare void @lexSomething(%CString)

The first interesting step we are concerned about is inlining. LLVM will put the calls to front, popFront and empty directly into lex as they are very simple. See what lex becomes after that operation:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
define void @lex(%CString %arg.r) {
body:
  %r = alloca %CString
  store %CString %arg.r, %CString* %r
  %0 = getelementptr inbounds %CString* %r, i32 0, i32 0
  %1 = load i8** %0
  %2 = load i8* %1
  %3 = icmp eq i8 %2, 0
  br i1 %3, label %then, label %else

then:                                             ; preds = %body
  call void @lexError()
  ret void

else:                                             ; preds = %body
  %4 = getelementptr inbounds %CString* %r, i32 0, i32 0
  %5 = load i8** %4
  %6 = load i8* %5
  %cond = icmp eq i8 %6, 105
  br i1 %cond, label %.case, label %default

.case:                                            ; preds = %else
  %7 = getelementptr inbounds %CString* %r, i32 0, i32 0
  %8 = load i8** %7
  %9 = getelementptr inbounds i8* %8, i64 1
  store i8* %9, i8** %7
  %10 = getelementptr inbounds %CString* %r, i32 0, i32 0
  %11 = load i8** %10
  %12 = load i8* %11
  %13 = icmp eq i8 %12, 0
  br i1 %13, label %then1, label %else2

then1:                                            ; preds = %.case
  %14 = load %CString* %r
  call void @lexIdentifier(%CString %14, { i64, i8* } { i64 1, i8* getelementptr inbounds ([2 x i8]* @.cstr, i32 0, i32 0) })
  ret void

else2:                                            ; preds = %.case
  %15 = getelementptr inbounds %CString* %r, i32 0, i32 0
  %16 = load i8** %15
  %17 = load i8* %16
  switch i8 %17, label %default3 [
    i8 102, label %.case4
    i8 115, label %.case5
  ]

.case4:                                           ; preds = %else2
  %18 = getelementptr inbounds %CString* %r, i32 0, i32 0
  %19 = load i8** %18
  %20 = getelementptr inbounds i8* %19, i64 1
  store i8* %20, i8** %18
  %21 = load %CString* %r
  call void @lexKeyword(%CString %21, { i64, i8* } { i64 2, i8* getelementptr inbounds ([3 x i8]* @.cstr1, i32 0, i32 0) })
  ret void

.case5:                                           ; preds = %else2
  %22 = getelementptr inbounds %CString* %r, i32 0, i32 0
  %23 = load i8** %22
  %24 = getelementptr inbounds i8* %23, i64 1
  store i8* %24, i8** %22
  %25 = load %CString* %r
  call void @lexKeyword(%CString %25, { i64, i8* } { i64 2, i8* getelementptr inbounds ([3 x i8]* @.cstr2, i32 0, i32 0) })
  ret void

default3:                                         ; preds = %else2
  %26 = load %CString* %r
  call void @lexIdentifier(%CString %26, { i64, i8* } { i64 1, i8* getelementptr inbounds ([2 x i8]* @.cstr, i32 0, i32 0) })
  ret void

default:                                          ; preds = %else
  call void @lexSomething(%CString %arg.r)
  ret void
}

This is fairly simple, calls have been removed and replaced with the function instructions themselves. The second thing that LLVM will do for us is to simplify the code by removing redundant loads, propagating constants, and so on:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
define void @lex(%CString %arg.r) {
body:
  %r = alloca %CString
  store %CString %arg.r, %CString* %r
  %0 = getelementptr inbounds %CString* %r, i32 0, i32 0
  %1 = load i8** %0
  %2 = load i8* %1
  %3 = icmp eq i8 %2, 0
  br i1 %3, label %then, label %else

then:                                             ; preds = %body
  call void @lexError()
  ret void

else:                                             ; preds = %body
  %cond = icmp eq i8 %2, 105
  br i1 %cond, label %.case, label %default

.case:                                            ; preds = %else
  %4 = getelementptr inbounds i8* %1, i64 1
  store i8* %4, i8** %0
  %5 = load i8* %4
  %6 = icmp eq i8 %5, 0
  br i1 %6, label %then1, label %else2

then1:                                            ; preds = %.case
  %7 = load %CString* %r
  call void @lexIdentifier(%CString %7, { i64, i8* } { i64 1, i8* getelementptr inbounds ([2 x i8]* @.cstr, i32 0, i32 0) })
  ret void

else2:                                            ; preds = %.case
  switch i8 %5, label %default3 [
    i8 102, label %.case4
    i8 115, label %.case5
  ]

.case4:                                           ; preds = %else2
  %8 = getelementptr inbounds i8* %4, i64 1
  store i8* %8, i8** %0
  %9 = load %CString* %r
  call void @lexKeyword(%CString %9, { i64, i8* } { i64 2, i8* getelementptr inbounds ([3 x i8]* @.cstr1, i32 0, i32 0) })
  ret void

.case5:                                           ; preds = %else2
  %10 = getelementptr inbounds i8* %4, i64 1
  store i8* %10, i8** %0
  %11 = load %CString* %r
  call void @lexKeyword(%CString %11, { i64, i8* } { i64 2, i8* getelementptr inbounds ([3 x i8]* @.cstr2, i32 0, i32 0) })
  ret void

default3:                                         ; preds = %else2
  %12 = load %CString* %r
  call void @lexIdentifier(%CString %12, { i64, i8* } { i64 1, i8* getelementptr inbounds ([2 x i8]* @.cstr, i32 0, i32 0) })
  ret void

default:                                          ; preds = %else
  call void @lexSomething(%CString %arg.r)
  ret void
}

Now comes the interesting part: LLVM will analyse and simplify the control flow graph. This pass will effectively do the sentinel optimization for us. Remember, calls to CString methods have been inlined, so all the required information is in the lex function now. The compiler can see that we branch on empty and then switch on front, but now it does know that both are in fact branching based on the same value and merge everything!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
define void @lex(%CString %arg.r) {
body:
  %r = alloca %CString
  store %CString %arg.r, %CString* %r
  %0 = getelementptr inbounds %CString* %r, i32 0, i32 0
  %1 = load i8** %0
  %2 = load i8* %1
  switch i8 %2, label %default [
    i8 0, label %then
    i8 105, label %.case
  ]

then:                                             ; preds = %body
  call void @lexError()
  ret void

.case:                                            ; preds = %body
  %3 = getelementptr inbounds i8* %1, i64 1
  store i8* %3, i8** %0
  %4 = load i8* %3
  switch i8 %4, label %default3 [
    i8 0, label %then1
    i8 102, label %.case4
    i8 115, label %.case5
  ]

then1:                                            ; preds = %.case
  %5 = load %CString* %r
  call void @lexIdentifier(%CString %5, { i64, i8* } { i64 1, i8* getelementptr inbounds ([2 x i8]* @.cstr, i32 0, i32 0) })
  ret void

.case4:                                           ; preds = %.case
  %6 = getelementptr inbounds i8* %3, i64 1
  store i8* %6, i8** %0
  %7 = load %CString* %r
  call void @lexKeyword(%CString %7, { i64, i8* } { i64 2, i8* getelementptr inbounds ([3 x i8]* @.cstr1, i32 0, i32 0) })
  ret void

.case5:                                           ; preds = %.case
  %8 = getelementptr inbounds i8* %3, i64 1
  store i8* %8, i8** %0
  %9 = load %CString* %r
  call void @lexKeyword(%CString %9, { i64, i8* } { i64 2, i8* getelementptr inbounds ([3 x i8]* @.cstr2, i32 0, i32 0) })
  ret void

default3:                                         ; preds = %.case
  %10 = load %CString* %r
  call void @lexIdentifier(%CString %10, { i64, i8* } { i64 1, i8* getelementptr inbounds ([2 x i8]* @.cstr, i32 0, i32 0) })
  ret void

default:                                          ; preds = %body
  call void @lexSomething(%CString %arg.r)
  ret void
}

At this point, LLVM continues its optimizations, but I will stop dumping them; the point is made.

Conclusion

LLVM is able to provide the benefits of the SentinelInputRange to a function that is written for a generic range, rendering the whole concept of SentinelInputRange obsolete. While many compilers are unable to perform such optimizations, it is certainly not worth increasing the standard library complexity to work around their deficiencies.

Update: I want to thank John Colvin for his contribution to the article and the English language in general.

7 thoughts on “A story about optimization, LLVM, and the SentinelInputRange

  1. Very interesting, I wouldnt have assumed the optimization is that good.
    I wonder if standard gcc/g++ is that smart as well, do you have anything on that for comparison?

  2. As far as I know, GCC isn’t as good as LLVM in that specific case. It is able to optimize in some cases but not all of them. But it is an active project so I assume it will be able to do it soon. I think that if GCC isn’t as good, the right thing to do is to fix GCC, not clutter the standard lib.

  3. I wouldn’t call not relying on compiler-specific’s optimisation “early optimisation”.
    Because if you’re not using LLVM, this is still an optimisation!

    • I has to be noted that if GCC is not as good at this game, it can do similar stuff in many cases.

      The lesson here isn’t that the optimization is not worth it on some compilers, but that doing such optimization is doable with current technology. It means that we should clutter language design because some compiler are lacking that capability.

      The right solution is to use the sentinel input range and to implement similar optimization in dmd/work with other compiler community to get similar optimization in the pipe. Everybody would benefit from it in general.

      LLVM show that the right solution here lies in the compiler optimizer, and not in the library design.

  4. Even after LLVM optimization you have a comparison with ” first at line 22 (final snippet), and only then comparison with other values (s or f). i.e. you have 3 comparisons if the value is ‘s’ (115). With the C style code you have only 2 comparisons for value ‘s’ (115).
    To be honest I see this difference even without looking in LLVM, just by looking in D code.
    In other words I am trying to say that LLVM optimization does not help.
    p.s. I assume LLM switch is implemented as a sequence of if, else if, else if

      • In fact, if your switch is large, it is implemented as jump table like this in pseudo code :

        addr = jumpTable[front];
        goto addr;

        casexxx:
        caseyyy:
        casezzz:
        default:

        With jump table that contains addresses of labels. SO the front value is only tested once.

        With default set of passes, LLVM keep a different branch for the 0 case, even if it is clearly similar to the default one. This can be avoided by breaking in the default case and joining both branches in the code (that is actually a smart thing to do in all the cases and not related to the optimization we are talking about here – all range would benefit from that change, not only the sentinel ranges).

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>