anon_comenter9's comments

anon_comenter9 · on Aug 6, 2011

Actually health care costs are growing far larger and faster than the cost of social security.

anon_comenter9 · on July 2, 2011

Unless this is relying on the compiler optimizing this, the dispatch looks to be not good

if(vm->pProgram->instr[instr_idx] == MOV) arg0 = arg1; else if(vm->pProgram->instr[instr_idx] == PUSH) stack_push(vm->pStack, arg0); else if(vm->pProgram->instr[instr_idx] == POP) stack_pop(vm->pStack, arg0); else if(vm->pProgram->instr[instr_idx] == INC) ++(arg0); else if(vm->pProgram->instr[instr_idx] == DEC) --(arg0); else if(vm->pProgram->instr[instr_idx] == ADD) arg0 += arg1; else if(vm->pProgram->instr[instr_idx] == SUB) arg0 -= arg1; else if(vm->pProgram->instr[instr_idx] == MUL) arg0 = arg1; else if(vm->pProgram->instr[instr_idx] == DIV) arg0 /= arg1; else if(vm->pProgram->instr[instr_idx] == MOD) vm->pMemory->remainder = arg0 % arg1; else if(vm->pProgram->instr[instr_idx] == REM) arg0 = vm->pMemory->remainder; else if(vm->pProgram->instr[instr_idx] == NOT) arg0 = ~(arg0); else if(vm->pProgram->instr[instr_idx] == XOR) arg0 ^= arg1; else if(vm->pProgram->instr[instr_idx] == OR) arg0 |= arg1; else if(vm->pProgram->instr[instr_idx] == AND) arg0 &= arg1; else if(vm->pProgram->instr[instr_idx] == SHL) arg0 <<= arg1; else if(vm->pProgram->instr[instr_idx] == SHR) arg0 >>= arg1; else if(vm->pProgram->instr[instr_idx] == CMP) vm->pMemory->FLAGS = ((arg0 == arg1) | (arg0 > arg1) << 1); else if(vm->pProgram->instr[instr_idx] == JMP) instr_idx = arg0 - 1; else if(vm->pProgram->instr[instr_idx] == JE && (vm->pMemory->FLAGS & 0x1)) instr_idx = arg0 - 1; else if(vm->pProgram->instr[instr_idx] == JNE && !(vm->pMemory->FLAGS & 0x1)) instr_idx = arg0 - 1; else if(vm->pProgram->instr[instr_idx] == JG && (vm->pMemory->FLAGS & 0x2)) instr_idx = arg0 - 1; else if(vm->pProgram->instr[instr_idx] == JGE && (vm->pMemory->FLAGS & 0x3)) instr_idx = arg0 - 1; else if(vm->pProgram->instr[instr_idx] == JL && !(vm->pMemory->FLAGS & 0x3)) instr_idx = arg0 - 1; else if(vm->pProgram->instr[instr_idx] == JLE && !(vm->pMemory->FLAGS & 0x2)) instr_idx = *arg0 - 1;

jws · on July 3, 2011

Update: I changed it to a switch statement. The euler1 program went from 0.081s to 0.091s with gcc on a core i3.

The cascading if statements are indeed faster than a switch.

EOU - original comment follows…

Without the benefit of profiling, I can suggest it may not be as bad as it looks.

The relative frequency of the opcodes could make this faster than the switch. See MOV, PUSH, POP in the front? Theses likely have tiny, inline implementations and are also probably a large percentage of the opcodes implemented. They may all fit in the same cache line and give the pipelines on deep pipeline, branch predicting machines lots of good stuff to work on.

Likewise, the infrequently occurring opcode comparisons in the back part of the statement, by definition, are almost always false, which should tell the branch predictors which way to go to again keep the pipelines full.

A switch on the other hand is pretty much a guaranteed pipeline flush.

But… if you asked me to code this control structure without being able to profile and to get it fast the first time… I'd use a switch. (well, maybe with a couple if statements out front if I knew I had a heavily lopsided distribution.)

Then if I'd go reread that article that came by HN a couple weeks ago and turn the control structure inside out and see how much better it was.

palish · on July 3, 2011

You're very wise. Thank you for sharing.

It's satisfying that you were able to say "There's a good chance you're wrong, here's why" and then be confirmed by testing. You obviously have a great deal of experience with low-level programming; relatively rare, nowadays. Mine comes from graphics programming.

Also, which article are you referring to? Any keywords I can search for?

jws · on July 3, 2011

I think I am remembering this one: https://qht.co/item?id=2593095 The Common CPU Interpreter Loop Revisited

But after looking at tinyvm it may be a special animal. The native implementation of its virtual opcodes are only one or two instructions. Locality, cacheing, and pipelining are working at their finest here and it might be best to just let them do their thing.

iam · on July 3, 2011

I was quite curious as to this, so I looked into the code (using GCC 4.4.5 on a 32-bit VM). Switch statements can be optimized in 2 ways (doing a binary search - log n checks/jumps, or doing a jump table, no checks and 2 jumps). Usually the jump table is done for dense cases and binary search for sparse cases.

I will post the disassembly of the code as-is first. Then the same exact code but using a switch statement (change all the if/else into a 'case' and nothing else).

  // loop header (break out of loop if we hit END)
   1a:   8b 04 99                mov    eax,DWORD PTR [ecx+ebx*4]
  1d:   83 f8 ff                cmp    eax,0xffffffff
  20:   0f 84 62 01 00 00       je     188 <run_vm+0x188>
  26:   89 fa                   mov    edx,edi
  // inside of the loop begins here (theres a jmp 28 instruction at the bottom that i omitted here)
  28:   8b 52 0c                mov    edx,DWORD PTR [edx+0xc]
  // now edx == &vm->pProgram->args
  // FAIL: vm->pProgram->args is loop invariant and shouldn't be reloaded every iteration
  // this is either a failure to prove the load was redundant or a failure to allocate this its own register without spilling
  2b:   83 f8 01                cmp    eax,0x1
  // why is this here? instruction scheduling magic probably.
  2e:   8b 14 32                mov    edx,DWORD PTR [edx+esi*1]
  // esi is the instr_idx 
  // new edx is vm->pProgram->args[instr_idx]
  31:   8b 32                   mov    esi,DWORD PTR [edx]
  // esi is arg0     == vm->pProgram->args[instr_idx][0]
  33:   8b 52 04                mov    edx,DWORD PTR [edx+0x4]
  // edx is arg1     == vm->pProgram->args[instr_idx][1]
  36:   89 55 d4                mov    DWORD PTR [ebp-0x2c],edx
  // FAIL: spill vm->pProgram->args[instr_idx] to stack.
  // why is this a fail? it's already on the stack, so we can reconstruct it slightly slower without spilling
  39:   0f 84 01 01 00 00       je     140 <run_vm+0x140>
  // finally the je for the cmp

  // CMP/JE pairs for ELSE IFs follow until end
  3f:   83 f8 02                cmp    eax,0x2
  42:   0f 84 48 01 00 00       je     190 <run_vm+0x190>
  48:   83 f8 03                cmp    eax,0x3
  4b:   0f 84 5f 01 00 00       je     1b0 <run_vm+0x1b0>
  51:   83 f8 04                cmp    eax,0x4
  54:   0f 84 0e 01 00 00       je     168 <run_vm+0x168>
  5a:   83 f8 05                cmp    eax,0x5
  5d:   8d 76 00                lea    esi,[esi+0x0]
  60:   0f 84 12 01 00 00       je     178 <run_vm+0x178>
  66:   83 f8 06                cmp    eax,0x6
  69:   0f 84 61 01 00 00       je     1d0 <run_vm+0x1d0>
  6f:   83 f8 07                cmp    eax,0x7
  72:   0f 84 70 01 00 00       je     1e8 <run_vm+0x1e8>
  78:   83 f8 08                cmp    eax,0x8

eax is the bytecode for the current instruction (vm->pProgram->args[instr_idx]). Notice that the first time it inlines the code (for the MOV instruction). The rest of the time it's a cmp/jcc pair (I omitted the rest of the asm for clarity, but it's cmp/jcc all the way down).

So that alone makes it that a MOV bytecode would get executed almost right away. Then for everything else, PUSH, POP, etc you have to go through a linearly increasing sets of cmp/jcc pairs. So in this case if the probability distribution of your interpreted program meant that most opcodes were MOV, PUSH, POP, you could get pretty good performance. But once it goes through a few cmp/jcc pairs you have lost the performance benefit of using if/else, it will get slower than using a switch.

There are also a few potential fails here (look at my FAIL comments). The value for &vm->pProgram->args gets recalculated every loop iteration, this is probably a register allocation fail (not enough registers on x86). And vm->pProgram->args[instr_idx] gets spilled to stack. That's really annoying and probably unnecessary.

---------

Now let's take a look at the switch statement! I was curious to see if it would generate a binary search table or a jump table. In theory it should be the latter since the opcodes are 0-25 inclusive (with -1 for END, which is actually 0xFFFFFFFF and is not easily a candidate to become a jump table member). x

  //loop header
  1a:   8b 04 99                mov    eax,DWORD PTR [ecx+ebx*4]
  1d:   83 f8 ff                cmp    eax,0xffffffff
  20:   74 48                   je     6a <run_vm+0x6a>
  22:   8d b6 00 00 00 00       lea    esi,[esi+0x0]
  // loop beginning (there's an omitted jmp 28 lower on)
  28:   8b 7e 0c                mov    edi,DWORD PTR [esi+0xc]
  // now edi == &vm->pProgram->args
  // FAIL: vm->pProgram->args is loop invariant and shouldn't be reloaded every iteration
  // this is either a failure to prove the load was redundant or a failure to allocate this its own register without spilling
  2b:   83 f8 19                cmp    eax,0x19
  // compare eax to 25 (JLE)
  // compiler was unable to prove that the instruction is in the range [-1, 25]. this will always cause a >25 check every loop iteration (bad)
  // i wanted to try adding a __builtin_unreachable() as a default switch case to fix this, unfortunately i don't have gcc 4.5
  // also maybe doing a switch on eax & 0x20 so it knows its less than 32, 32 might be small enough to generate 7 blank jump table entries
  2e:   8b 14 17                mov    edx,DWORD PTR [edi+edx*1]
  // new edx is vm->pProgram->args[instr_idx]
  31:   8b 3a                   mov    edi,DWORD PTR [edx]
  // arg0
  33:   8b 52 04                mov    edx,DWORD PTR [edx+0x4]
  // arg1
  36:   89 55 d4                mov    DWORD PTR [ebp-0x2c],edx
  // FAIL: spill vm->pProgram->args[instr_idx] to stack.
  // why is this a fail? we can always get it from [esi+0xc] like at the beginning of the loop
  39:   77 1d                   ja     58 <run_vm+0x58>

  // aha, thats what the cmp eax, 0x19 was for previously.. refer to previous comment
  3b:   ff 24 85 00 00 00 00    jmp    DWORD PTR [eax*4+0x0]

  // jmp table badassery. but why did it take so long to get here?
  // this is a JMP r/m32 (jump near, absolute indirect)
  // the 0x0 gets set to a jump table offset by the linker later (i decompiled the .o file only)

  // the jmp table itself will look like JMP address-back-into-run_vm-as-a-constant

This is basically the same code as before except that: there's an additional cmp/ja instruction pair to check that the instruction code is in the [0,25] range (and of course the jmp to the jmp table instead of the if/else pairs). This makes it always slower for a branch that will never get taken, and ruins the branch prediction of the first few iterations.

Also the indirect jmp does make it a bit harder to do instruction prefetching, but this should be truly straightforward since the value of 'eax' is known for a good 8 instructions from the beginning of the loop (and the prefetching is trivial for the direct jumps). Another consideration is the L1 i-cache, which is 32K on a Nehalem. A quick google search for the micro-ops cache reveals it to be 12K. Since the loop is fairly tiny, both of these easily fit into those caches with probably at least 1 order of magnitude left over.

Supposing that the redundant cmp/ja is removed, we are now on a more even field. It should be faster for pretty much anything except the first case (instruction is a MOV).

  jmp [indirect abs]
  jmp direct abs

vs

  cmp eax, 0x0 // is EAX a mov?
  jg somewhere_else // skip the following section if its not a MOV

The second code will always be faster if we're always hitting MOVs, don't ask me to prove this, but it's a gut feeling that there's less micro ops involved in the second case. The equal point will probably be when the instrucitons are always PUSH (1) or POP (2). But after that it will always be faster. So I think overall unless all the instructions are MOV,PUSH,POP (haha unlikely) the jmp table solution should be faster with cmp/ja removed.

That being said, I think I've spent too much time on this as-is and will not be trying to remove the cmp/ja.

-----------

Summary: GCC doesn't generate the best code, and there's some things that are hard for even the compiler to optimize. Sometimes you have to massage the compiler into doing what you want, and it doesn't mean the theoretical approach to using switches over chained if/elses is flawed.

It's my gut feeling that if we can get the optimizer in this case to work slightly better, the switch statement should always be faster than the if/elses.

It also would've been interested to see the x86_64 disassembly, but of course then we have the downside of using 64-bit everywhere which leads to larger instruction encodings and the upside of less register spills.

teyc · on July 2, 2011

For the benefit of those who don't understand this, have a look at http://en.wikipedia.org/wiki/Threaded_code where it explains the different ways one can process byte codes.

The key issue here is that the efficiency of dispatch loop determines the performance of a VM.

anon_comenter9 · on July 3, 2011

Exactamundo!

haberman · on July 2, 2011

That is surprisingly bad. It reminds me of when I first heard that you could get fast sine/cosine approximations with table lookups -- not fully understanding this idea I wrote a program that generated the following (I was in high school, young and naive):

    double sin(double x) {
      if (x >= 0.00 && x < 0.01) return 0;
      if (x >= 0.01 && x < 0.02) return 0.01;
      // ...
    }

gcr · on July 2, 2011

At least you wrote the program to generate it rather than typing all that by hand. I think you got the gist of the idea. In any case, would all those branches be faster than the comparatively expensive floating-point arithmetic?

Retric · on July 3, 2011

No, at best you could do a binary search using if's to simulate a lookup table, but you are going to trash the pipeline on a modern CPU if you want a reasonably complex lookup table you are much better off with something like:

float lookup_Sine(x){ int I = 100 * (x mod 360); return sine_table(I); }

Anyway, for a modern CPU hitting main memory takes a lot of cycles. So you need something a lot more complex than a simple trig function to make it worth it.

atombender · on July 3, 2011

Yes, that's very odd. At least they could have used a case statement, which a C compiler will be able to optimize into an efficient jump table.

I remember reading about a VM (I remember it as being Google's V8, but that one compiles directly to machine code so probably I'm misremembering; maybe it was Strongtalk?) which basically ditched the classic bytecode loop (grab next instruction, check what it is, evaluate it, repeat) and instead implemented each bytecode type as a function and somehow munged the instruction pointer or modified the VM's machine code itself so that the CPU would move directly to the next instruction's function without jumping. Damn, I wish I could remember the idea. Anyone know what I'm talking about?

anon_comenter9 · on July 3, 2011

You are probably talking about threaded code which is described in "The Structure and Performance of Efficient Interpreters" http://www.jilp.org/vol5/v5paper12.pdf

With real code examples

jws · on July 3, 2011

The bodies of these opcodes are only single machine instruction or two themselves and are inlined. This changes the game.

atombender · on July 3, 2011

Thanks, that's an interesting paper. I have never heard of that paper, I'm thinking specifically of a VM that implemented the feature. Not sure if we are talking about the same algorithm, either, although it sounds similar.

anon_comenter9 · on July 3, 2011

Thinking about it, that sounds like inline threading

http://blog.mozilla.com/dmandelin/2008/08/27/inline-threadin...

spitfire · on July 2, 2011

That just makes the case for switch expressions even stronger. If you had those this would be a damn simple switch statement and it'd provide lots of hints to the compiler for speedups. (All but the modified jump instructions can go in a straight switch statement).

I'm very tempted to add type tagging into this and write a burroughs b5000 emulator. That'd be a neat side project.

palish · on July 2, 2011

Yeah, I noticed the same thing. But this is a good starting point.

From here, he could VirtualProtect() some memory and write the equivalent x86 instructions.

matthew-wegner · on July 2, 2011

Perhaps that's why there is a FastVM fork?

windsurfer · on July 2, 2011

Why can't the compiler optimize this?

to3m · on July 2, 2011

Theoretically, it could do, at least in this case, I think? Very risky to rely on that though! I doubt compiler writers have people who haven't heard of the switch statement in mind when they decide which optimizations to put in.

(The "fast" branch uses a switch statement - I guess the author is going for minimum number of lines in this one? I'm not sure saving 2-3 lines compared to a squeezed-together switch statement is a great tradeoff, though. You could get better results by just putting everything in one source file. Easier deployment, too. I bet it wouldn't affect readability significantly.)

ori_b · on July 2, 2011

It can. However, this is a very uncommon pattern, so it's not worth trying very hard to detect. On top of that, I'd expect the compiler to bail after the pattern gets to a certain size.