
Personally, I find it painful that the compiler detects this kind of undefined behavior, and silently uses it for optimization, rather than stopping and emitting an error. In the printf example, the compiler could trivially emit an error saying "NULL check on p after dereference of p", and that would catch a large class of bugs. (Some static analysis tools check for exactly that.) Similarly, a loop access statically determined to fall outside the bounds of the array should produce an error.
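
For reference, the pattern under discussion looks roughly like the sketch below. The function name and the -1 sentinel are made up for illustration; the point is only that the check comes after the dereference, so a compiler may assume p is non-NULL by the time it reaches the check and drop the branch.

```c
#include <stddef.h>

/* Hypothetical example: the NULL check arrives too late to be useful. */
int read_value(const int *p) {
    int v = *p;          /* dereference: undefined behavior if p is NULL */
    if (p == NULL)       /* a compiler may treat this branch as dead code */
        return -1;
    return v;
}
```

With a valid pointer the function simply returns the pointed-to value; the interesting question in this thread is what should happen at compile time, not at run time.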


Excuses are provided in http://blog.llvm.org/2011/05/what-every-c-programmer-should-... . But they are just that, excuses.

To summarize at least one of them, the compiler doesn't really see it as “detecting undefined behavior and optimizing accordingly”. It sees it as doing the right thing for all defined behaviors. The sort of imprecise analysis it does leads it to consider plenty of possible undefined behaviors, many of which cannot happen in real executions. It ignores these as a matter of course, but reporting them would not tell the programmer anything they don't already know, and would be perceived as noise.

On the example for (int i=1; i==0; i++) …, the compiler does not infer that i eventually overflows (undefined behavior). It infers that i is always positive, and thus that the condition is always false.


How about using statistics/machine learning and showing these spam warnings only when people want them? Yes, it is hard, but this is not an excuse!

Besides, fixing spammy warnings shouldn't be more difficult than fixing actual spam! I mean, come on, with spam you have intelligent adversaries and compilers haven't reached that level. Not yet, anyway...


-Wi_want_undefined_behavior_spam_please_thanks


Overflows are not undefined. They are overflows. Maybe I want to overflow on purpose. Your for loop (int is signed) will complete assuming the body of the loop doesn't manipulate i, and given enough time.


pascal_cuoq is correct. Here's an excellent introduction to the subject:

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...


unsigned overflows are well defined, but signed ones are not


This has to be one of the most irritatingly pedantic aspects of C, as the vast majority of systems use 2's complement and so would overflow in the same way, but the compiler writers think it's an "opportunity for optimisation" and I think this ends up causing more trouble than the optimisations are worth. The only sane interpretation of something like

    if(x + 1 < x)
     ...
is an overflow check, but silently "optimising away" that code because of the assumption that signed integers will never overflow is just horribly hostile and absolutely idiotic behaviour in my opinion. A sensible and pragmatic way to fix this would be to update the standard to define signed overflow, and maybe add a macro that is defined only on non-2s-complement platforms.
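
Under the current rules, an overflow check that survives optimization has to be phrased in terms of the limits rather than the wrapped result. A minimal sketch, with hypothetical helper names, assuming nothing beyond <limits.h>:

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true if x + 1 would overflow a signed int.
   The comparison is against INT_MAX, so no overflowing addition is performed. */
bool increment_would_overflow(int x) {
    return x == INT_MAX;
}

/* General form for x + y, again without ever computing an overflowing sum. */
bool add_would_overflow(int x, int y) {
    if (y > 0)
        return x > INT_MAX - y;   /* INT_MAX - y cannot overflow when y > 0 */
    return x < INT_MIN - y;       /* INT_MIN - y cannot overflow when y <= 0 */
}
```

GCC and Clang also provide __builtin_add_overflow for the same purpose, if portability to other compilers is not a concern.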


There is one solution that would keep the crazy semantics of C, but would still allow for 2's complement arithmetic to be well-defined when one wants to.

C99 defines int8_t, if it exists, to be a 2's complement signed integer of exactly 8 bits. Same for 16, 32, etc. The standard could very well define behavior on overflow for these (that is, turn them into actual types instead of typedefs), and leave int, long, etc alone. I think this would be a viable, realistic, solution. Integer conversions would probably still be a pain, though.


That makes sense (sort of). Better to use unsigned if you are trying to do modular arithmetic.

Signed integers have some weirdness attached. The number that's one followed by all zeroes in binary (INT_MIN in limits.h) is defined as negative, because the sign bit is set. But, the rules for 2's complement arithmetic predict that -INT_MIN == INT_MIN. So it's not a normal number.
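
Both points can be checked without touching signed overflow, by doing the arithmetic in unsigned. This is only a sketch with made-up function names, and the second function assumes the usual case of a two's complement int with the same width as unsigned int:

```c
#include <limits.h>
#include <stdbool.h>

/* Unsigned arithmetic wraps modulo UINT_MAX + 1; this is defined behavior. */
bool unsigned_max_plus_one_is_zero(void) {
    unsigned int u = UINT_MAX;
    return u + 1u == 0u;
}

/* Negate INT_MIN's bit pattern in unsigned arithmetic, where wraparound is
   defined. On the assumed platform the result is the same bit pattern,
   which is the "-INT_MIN == INT_MIN" prediction. */
bool int_min_negates_to_itself(void) {
    unsigned int bits = (unsigned int)INT_MIN;
    return 0u - bits == bits;
}
```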


Similarly amusing, INT_MIN / -1 will throw a "division by zero" on Intel CPUs, even though there isn't a zero anywhere in sight. INT_MIN * -1 is fine, of course (according to the CPU, even if not the language spec).


INT_MIN / -1 works fine for me on amd64 with gcc 4.9. It produces INT_MIN, just as you would expect. INT_MIN * -1 is also INT_MIN.

  #include <limits.h>
  #include <stdio.h>
  
  void main(void) {
          int x = INT_MIN;
          printf("INT_MIN = %d\n", x);
          printf("INT_MIN * -1 = %d\n", x * -1);
          printf("INT_MIN / -1 = %d\n", x / -1);
  }


    $ clang --version
    Debian clang version 3.5-1~exp1 (trunk) (based on LLVM 3.5)
    Target: x86_64-pc-linux-gnu
    Thread model: posix
    $ clang -o intmin_c -std=c99 intmin.c 
    $ ./intmin_c
         INT_MIN = -2147483648
    INT_MIN * -1 = -2147483648
    Floating point exception

    $ gcc --version
    gcc-4.7.real (Debian 4.7.2-5) 4.7.2
    (...)
    $ gcc -o intmin -std=c99 intmin.c 
    $ ./intmin
         INT_MIN = -2147483648
    INT_MIN * -1 = -2147483648
    INT_MIN / -1 = -2147483648

    $ tcc -v
    tcc version 0.9.25
    $ tcc -run intmin.c 
         INT_MIN = -2147483648
    INT_MIN * -1 = -2147483648
    Floating point exception

    $ cat intmin.c 
    #include <limits.h>
    #include <stdio.h>
      
    int main(void) { // Change from void to int for c99
      int x = INT_MIN;
      printf("     INT_MIN = %d\n", x);
      printf("INT_MIN * -1 = %d\n", x * -1);
      printf("INT_MIN / -1 = %d\n", x / -1);
      return 0; // return 0 - c99
    }
So, is this a gcc thing?


Your arithmetic is being optimized out by the compiler; https://ideone.com/KmTSUB crashes for example.


Looks like you're right (re my sibling comment above):

    $ gcc -std=c99 -S intmin.c -o intmin.s
    $ clang -std=c99 -S intmin.c -o intmin.sc
    $ grep -i div intmin.s*
    intmin.sc:      idivl   %esi
(No idivl in the gcc version)


This is a good reminder that "undefined behavior" includes "doing what I want it to do".

It's also a good example of how undefined behavior allows for optimizations. The compiler is able to evaluate your expression at compile time rather than emitting a division instruction, even though this changes how the program behaves.

C sure is fun.


Because it's undefined, the compiler is allowed to do anything: including giving the expected result.


In some programs I like to have something like this to catch potential bugs:

    #include <assert.h>
    #include <limits.h>

    int abs_s(int x) {
        assert(x != INT_MIN);   /* -INT_MIN is not representable as an int */
        return (x >= 0) ? x : -x;
    }
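
An alternative sketch that avoids the assert is to return the magnitude as an unsigned value, so that INT_MIN has a representable answer. abs_u is a hypothetical name, and the code assumes unsigned int can hold INT_MAX + 1, which is true on common platforms:

```c
#include <limits.h>

/* Magnitude of x as an unsigned int. The negation happens in unsigned
   arithmetic, where wraparound is defined, so INT_MIN is not a special case. */
unsigned int abs_u(int x) {
    unsigned int ux = (unsigned int)x;   /* conversion is defined modulo UINT_MAX + 1 */
    return (x < 0) ? 0u - ux : ux;
}
```

The trade-off is that callers now have to deal with an unsigned result, which brings back the integer conversion pain mentioned earlier in the thread.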


This assumes two's complement, though. With one's complement even `INT_MIN` would be fine to negate ;)

(On that note: Are there even machines left to write code for that don't use two's complement? Or 8 bits per byte?)


It has been mentioned on HN before that there are DSPs that use the same size for byte, short, and int, and that size is not 8 bits.


Pascal is the developer of Frama-C, a static analyser for C.


You can come up with specific cases and decide that they should be handled a different way. You're absolutely right that this NULL check elimination should generate a warning. But it's really really hard to come up with a general algorithm that correctly differentiates between "your code implies that this check can never be true, watch out!" and "your code implies that this check can never be true, so we correctly removed it during optimization".

For a trivial NULL-related example, the standard C free() function does a NULL check on its parameter. free(NULL) is legal and does nothing. A naive "does this check for NULL after dereferencing the pointer?" checker would therefore warn for this code:

    printf("the pointer's value is %d", *p);
    free(p);
To a human, this obviously shouldn't be warned about, while the other example should be. But how does the computer tell them apart? It's hard.


Super, super pedantic point: that probably wouldn't happen with your example because most of the time, free is defined in a shared library somewhere else, and the compiler wouldn't be able to inspect its code. Even if it's in the same source file, most compilers don't optimize across non-inlined function boundaries.

But! You made a good point, and it would apply to a function that did a null-check which was inlined. It's easy for us to imagine a function which 1) does a null-check, 2) gets inlined, and 3) is used in places in the code which dereference the pointer before calling the function.


While compilers won't optimise functions they don't yet know about, they do know the standard library and the respective guarantees and constraints. This includes removing calls to memset for things that are never read again (horrible for passwords or keys in memory, which is why there is a SecureZeroMemory or related function in operating systems) or other things.

As a somewhat trivial example:

  #include <stdio.h>
  #include <string.h>

  int main() {
  	printf("%lu\n", strlen("Test"));
  	return 0;
  }
will get compiled by MSVC (with /O2) to

  ; Line 5
	push	4
	push	OFFSET ??_C@_04OJNJKCBM@?$CFlu?6?$AA@
	call	_printf
	add	esp, 8
  ; Line 6
	xor	eax, eax
  ; Line 7
	ret	0
The compiler knows about strlen and notices that there is no need at runtime to calculate the length of a constant string and just puts in the result.
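
On the memset point: when a platform function such as SecureZeroMemory, explicit_bzero, or the optional C11 memset_s is not available, a commonly used fallback is to write through a volatile pointer. The sketch below uses a hypothetical name, secure_zero, and whether volatile accesses to a non-volatile object are strictly guaranteed to survive optimization is debated, so treat it as an idiom rather than a guarantee:

```c
#include <stddef.h>

/* Hypothetical helper: zero a buffer through a volatile pointer so that
   each store is treated as an access the compiler should not remove. */
void secure_zero(void *buf, size_t len) {
    volatile unsigned char *p = (volatile unsigned char *)buf;
    while (len--)
        *p++ = 0;
}
```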


Great example, however this does not apply to free - or I would be very surprised if it did. It is relatively common for people to override malloc and free at runtime, so I would be very surprised if the compiler treated malloc or free as a compiler intrinsic, and inlined the code. I would not be surprised, however, if the compiler used the semantics of malloc or free to reason about the surrounding code. The original point, however, was about inlined code leading to generated code that no reasonable person would write (null check after use). So I still think that would not happen for free.


I just tested clang, and free(malloc(42)); gets completely optimized out, as does free(NULL);. free(argv) doesn't, so it's not quite that clever, at least.


The standard library thing is interesting. Modern compilers actually understand that stuff a lot. I don't think they'll reach into the implementation, but they understand the guaranteed semantics. For example, they know that this is undefined behavior:

    free(ptr);
    printf("after free, it contains %d\n", *ptr);
And the nasal demons shall flow freely. The compiler will certainly know that there's a NULL check in there. However, the smarts are different enough that it will also know not to warn you about it. So, yes, this is an example that won't happen in reality, although there's no reason it couldn't.


Sorry, I don't understand what's so hard about this problem? Why not just emit a warning when the compiler exploits undefined behavior to make some line of code unreachable? By "line of code" I mean code that's written by the user, not code after macroexpansions, inlinings or whatever. So the warning would mean that either you have a bug, or you can safely delete some code. Both of these are helpful.


Technically the compiler doesn't exploit the undefined behaviour. It exploits the assumption that it cannot happen and thus it's free to assume everywhere that only defined behaviour happens. Which means, the optimisations are for optimising the defined cases with no regard at all to the undefined behaviour.

You'll notice in a lot of cases that the exploitation of UB looks different for the same cases with different compilers or even compiler versions. This is because the compiler doesn't see »Oh, UB, I can optimise that« but rather »In this case I can do this which remains valid for all defined cases«.

Also, as others have pointed out, even if the compiler would emit a warning, it would be way too much noise because such things happen all the time.


> Also, as others have pointed out, even if the compiler would emit a warning, it would be way too much noise because such things happen all the time.

How so? For example, this code:

    printf("the pointer's value is %d", *p);
    free(p);
would not cause a warning under my proposal, even if free() contains a NULL check. The source code contains no unreachable lines, only the inlined/macroexpanded code does. On the other hand, most "gotcha" examples proposed so far do have unreachable source lines, and would lead to warnings.

Can you give an example of useful code that contains unreachable lines before macroexpansion and inlining? What's wrong with emitting a warning so the programmer can delete the useless line?

> You'll notice in a lot of cases that the exploitation of UB looks different for the same cases with different compilers or even compiler versions.

That's OK. The problem is with each individual compiler deleting code without warning. If compiler X deletes a line of my code, then it should warn me about it. If compiler Y doesn't delete that line, it doesn't have to warn me.


"NULL check on p after dereference of p"

Issuing an error or warning about this would flood stuff with warnings due to inlining/macros, you name it.

This happens all the time.

Basically, distinguishing between the things that are accidents, and things that are on purpose and expected to be optimized away, is very very very hard.


The warning could be "NULL check after unchecked dereference" with a pragma to disable the warning for macros within an annotated method


Adding "unchecked" doesn't make it any easier, and debugging functions and the like rarely check things.

  /* Assumes you have a valid foo */
  int printfoo(struct foo *bar)
  {
    /* Print the main part of our foo */
    printf("First field: %d\n", bar->first);
    /* Get the substructure value */
    int foosub = get_foosub(bar);
    printf("Second field: %d\n", foosub);
    return 0;
  }

  /* Doesn't assume you have a valid foo */
  int get_foosub(struct foo *bar)
  {
     if (bar != NULL)
       return bar->second;
     assert(0);  /* not reached with a valid foo */
     return -1;
  }

In any case, people have spent a long time trying to make warnings like this work without massive false positive rates. It's just not easy.


> In the printf example, the compiler could trivially emit an error saying "NULL check on p after dereference of p"

If you know that p is not NULL, the code is just fine. The compiler has to compile it.


If you know that p is not NULL then why are you checking for it? Either way, something is wrong here. Either you made a mistake with where you check for NULL, or you are performing extraneous operations for no reason.


Usually this sort of thing comes up in code that's the result of several rounds of function inlining. Nobody would write that kind of code by hand, but it arises from the indirect results of several chains of function calls. In this regard, it's a very important optimization to be able to perform.

Kent Dybvig's classic response to "Who writes that kind of code?" is "Macros do." Inlining has much the same effect as macros.


I can see how macros can throw a wrench into this, but couldn't compilers tell the difference between when someone wrote code that might have a bug and when it inlines a function that is a bit paranoid for that situation?


Yes, by allowing the optimizer the freedom to exploit undefined behavior, but coupling that with static analysis (or a stricter language semantics) that can catch bugs before they get to the optimizer.


> If you know that p is not NULL then why are you checking for it?

These dummy tests can happen if you have a lot of macros, for example. It is still valid C code as long as p is not NULL. A compiler generating an error on this particular example wouldn't be a correct C compiler.

The compiler can reject this code if it can prove that p can be NULL, but in many cases that's impossible at compile time.

I agree that a warning would be nice though.

edit: clarity.


LLVM does something similar if you pass -fsanitize=undefined: it tries to insert code that will crash the program when it invokes something that has undefined behaviour. It cannot emit a compiler error, because it's perfectly fine to have a function with undefined behaviour in your program, as long as you don't call it.
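
A sketch of that last point, with made-up function names. The code below contains a function whose body always overflows, yet no undefined behavior occurs as long as the call is never reached at run time. Building with -fsanitize=undefined instruments the addition with a runtime check rather than rejecting the program:

```c
#include <limits.h>

/* Undefined behavior if ever executed: INT_MAX + 1 overflows a signed int. */
int overflow_if_called(void) {
    int x = INT_MAX;
    return x + 1;
}

/* Reaches the function above only when `trigger` is nonzero, so a run
   with trigger == 0 stays entirely within defined behavior. */
int maybe_overflow(int trigger) {
    if (trigger)
        return overflow_if_called();
    return 0;
}
```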


I don't understand why you say "it cannot emit a compiler error" -- just because something is allowed doesn't mean it's impossible to emit a compiler error for it (c.f. -Werror). Why can't there be a different option to emit a compiler error whenever -fsanitize=undefined would cause the compiler to add program-crashing code? Personally I would definitely use such an option as I can't imagine a purpose for having undefined behavior anywhere in my code. If I have a function that's never being called then I either forgot to call that function somewhere, or I have a useless function that I should remove from my code. Edit: or perhaps I'm implementing a library -- regardless, I can't imagine why I would want to compile successfully with undefined behavior in my code.


Do you want this to produce an error?

    int Successor(int x) { return x + 1; }
Because that is undefined behavior for x == INT_MAX, and if the compiler is detecting undefined behavior and emitting errors, this would be an obvious candidate.

How about this simple function?

    int Divide(int x, int y) { return x / y; }
This is undefined behavior for y == 0, or for x == -INT_MAX - 1 and y == -1. Shall it produce an error?

(Never mind why you're writing such simple functions. Imagine they do something more complex and just do the division or addition or whatever as part of their work.)

Here's a fun one:

    void StrClear(char *str) {
        memset(str, 0, strlen(str));
    }
This will invoke undefined behavior if passed a string constant as its parameter, or if passed an array that doesn't have a 0 in it. There is no portable (i.e. without invoking undefined behavior) way to check whether the parameter is a string constant or doesn't have a 0, so it is impossible to assert away the undefined behavior for this.

Many real, practical, production-worthy C functions will invoke undefined behavior with some inputs. Turning undefined behavior into a compile-time error will cause virtually all C code to not compile.
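
For the arithmetic cases, the undefined inputs can at least be rejected at run time, at the cost of a different signature. A minimal sketch for the Divide example, where CheckedDivide and its out-parameter are hypothetical and not part of the original function:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical checked variant of Divide: returns false instead of
   performing a division whose behavior is undefined. */
bool CheckedDivide(int x, int y, int *result) {
    if (y == 0)
        return false;                 /* division by zero */
    if (x == INT_MIN && y == -1)
        return false;                 /* quotient does not fit in int */
    *result = x / y;
    return true;
}
```

This moves the problem to the caller rather than removing it, and nothing comparable exists for the StrClear case, which is the commenter's point.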


"There is no portable way to check whether the parameter is a string constant"

Well, the compiler should warn you if you try to pass a (char const *) to a function expecting a (char *). Use -Wwrite-strings to make string literals have type (char const[]) rather than (char[]).


> Why can't there be a different option to emit a compiler error whenever -fsanitize=undefined would cause the compiler to add program-crashing code?

The simple answer is "the halting problem".

If you can build a compiler that knows with certainty what runtime behavior would result from any program (including whether undefined behavior occurs), then you could solve the halting problem, but the halting problem is provably undecidable. So such a compiler cannot exist even in theory for the general case.

Yes, you can template-match a bunch of special cases, but the user can always write new code that doesn't match any of your "known to be defined behavior" patterns but still executes only defined behavior. Guaranteed!



