RFC: Enforcing Bounds Safety in C (-fbounds-safety) (llvm.org)
48 points by panic on May 25, 2023 | hide | past | favorite | 49 comments


Kind of wish people would stop inventing new notation. There's already Microsoft SAL. Just use the existing annotations (like _In_reads_(N) here) instead of forcing people to adopt two of them in their codebases! https://learn.microsoft.com/en-us/cpp/code-quality/understan...


C99 also lets you do this:

  void foo(size_t elem_count, int elems[elem_count]);
Unfortunately this constrains the argument order. C23 fixes that with something like this:

  void foo(size_t elem_count; int elems[elem_count], size_t elem_count);
(Might be slightly wrong, going from memory.)


C23 did not get around to introducing this kind of forward declaration (and, although GCC has had them for a long time, Clang has not implemented that extension either). I think I remember Gustedt (the chair for C11 and C17) openly advocating against the ages-old practice[1] of putting lengths after pointers, I suspect in part because of this problem, and Meneide (the current one) also puts lengths first in his proposed encoding APIs. In any case, the proposal[2] (ETA: by Martin Uecker, who’s also spoken up both in the linked thread and here) is still under consideration, it just didn’t get into C23.

[1] With main() being the exception.

[2] Latest version being https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3121.pdf.


WG14 can still vote it into C23.


Oh. With all the talk about national body comments I thought a freeze of some sort was already in effect. (Which did make the “nope” paper on integer constant expressions[1] from Clang more than a bit surprising.) Thanks for the correction!

[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3125.pdf


The freeze is in effect, but issues can still be addressed with NB comments. There was an NB comment related to this topic, and WG14 asked me to create a new revision of this paper.


> Unfortunately this constrains the argument order

In both C and C++, it drives me absolutely bonkers that declaration order matters so much.

Point of personal preference, but they're the last languages I use with any regularity where you can't just reference a symbol that will be declared later in the file. In this case, the compiler should be able to figure it out without any of that hinting; the scope of potential valid bindings is in the parentheses. It's a very constrained scope!


Requiring things to be defined in the right order is a bit annoying, as the computer can surely figure it out... but a lot of us are working with really large codebases that are very slow to compile, and relieving the compiler of having to know the entire file--or in some cases entire modules... madness!!--before being able to understand what any of it means really helps.


If compilation speed is an issue, you're already sunk in C and C++ because the #include evaluation rules leave very little room to optimize the massive redundancy of computing headers over and over (and every header includes all its headers, etc.).


In C headers are not a problem. They are simple and effective. C++ made the mistake of putting implementations into headers for generic programming.


C can have the same problem: it has inline functions.


Sure, you could also put a lot of complicated macros into headers. But in practice it is not a problem, because this is not how you would do it in C. In C++, however, you put the definition of a class into the header, and templates basically have to live in headers. So it is normal for C++ code to have a lot of code in headers.


You're basically saying pooling code between a large group might be worthwhile even if it requires you to take on a lot of book-keeping the computer could in principle help with. I don't think I've seen anyone else put it this way before.

Then again, we have so much compute going to LLMs lately, I wonder if we need to revisit our assumptions about how much hardware we have available for compilation. What might a language look like with the assumption we can spread LLM-training levels of hardware among everyone working on a large codebase? Or a compiler that can consult some sort of ML-based cache for reordering decisions? There's a new space of options opening up here.


Yes, and with pointer-to-array syntax the bounds checking already works today:

https://godbolt.org/z/oc6MTWjYd

For your syntax we will probably just also add it as an option.


One potential difference is variance: AFAIU, if

  void f(char (*pa)[7]);
  void g(int x, char (*pa)[x]);
  void h(int x, char p[static x]);
  char a[10];
then f(&a) is a constraint violation, and I believe g(7,&a) might actually be UB, whereas h(7,a) is fine. On the other hand, in the latter case

  void h(int x, char p[static x]) { p = (char *)&x; }
is legal, so extending that with bounds checks is also not without its problems.


Calling g that way is UB, which is why one can use it easily for checking. h is fine because a is larger than 7; if it were smaller, the call could be diagnosed. If you overwrite the pointer inside the function, then the bounds will be lost (which is different from pa = ..., where the bounds still need to match).


This RFC recognizes that, and does not require additional annotation.

It's important to recognize that in plain C the above is simply an aid to someone looking at the header and is no different from

    void foo(size_t elem_count, int elems[]);
As such

    int j;
    foo(20, &j);
is fine. Under this RFC this is an error, and the call site would trap at runtime (if not picked up at compile time).

The RFC also comments on (and handles)

    void foo(int elems[] __counted_by(elem_count), size_t elem_count);
Look for references to "delayed parsing". It also handles similar in structs.


While most C compilers ignore this, the size in C can definitely be used for checking, and this is intended; e.g. see the C2X charter: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2086.htm

And compilers do this already to some degree: https://godbolt.org/z/c7xdazKrd


The moment the size isn't trivially available the warnings cease. I was using that example just to keep the code brief.

This is a problem because it's trivially easy to have code along the lines of

    void foo(int size, int buffer[size]) {
        ...
        int new_base = some_math;
        int new_size = some_other_math;
        foo(new_size, buffer+new_base);
        ...
    }
That computes the size or base incorrectly, and the compiler will do nothing to stop you walking off the ends.

Obviously this RFC just uses the existing syntax if it's there and doesn't need explicit annotations, but the existing syntax requires the size parameter first which many APIs don't have. gcc has an extension to pre-declare a parameter name and type but you can't just use macros to make that work in other or older compilers. They also don't support sizes that aren't just a direct reference to a parameter (e.g. you can't make void some_matrix_func(int size, float matrix[size*size])).


Making more use of the size information for checking is ongoing work. Inside a function it can be propagated by the compiler.

The old GCC extension can also be hidden behind a macro, so this is backwards compatible:

void foo(HIDE(int x;) char buf[HIDE(x)], int x);


As far as I can tell SAL is a static analyzer aid, and only applies to specifically annotated values.

This RFC applies to all pointers in all cases, and bounds checking occurs on every pointer access at runtime. So while the annotations clearly overlap, the language semantics are vastly different.


There are no language semantics for either; they're just declarative annotations about run-time behavior. Either compiler can add checks at compile or run time for the annotations. There's really no reason to have two parallel sets of annotations for the same constraint.


Oh, your issue is the different syntax not the feature itself?

The syntax this RFC uses is based on the standardized syntax for attributes in C.


In what sense is __counted_by "standardized"? And is it really worth deviating from what's already there?


Sorry that was really badly phrased.

GCC and Clang have a standard syntax for attributes, which is what this feature uses. Moreover the position semantics of the __attribute__ syntax matches the positional semantics of the standardized C++ attribute syntax, which is (fingers crossed) finally being picked up in C23.

So with that mea culpa let's address the rest.

__counted_by is obviously not standardized - this is an RFC for clang, not a spec proposal.

__counted_by is a macro that expands to:

    __attribute__((__counted_by__(T)))
Which is the standard syntax for all attributes in clang and gcc.

So the question is why the position is different from under SAL. SAL considers the bounds of a pointer to be a feature of the declaration, not the type. This RFC considers pointer bounds to be an attribute of the type itself. Because of that, the syntax of the attribute needs to be in the location for type attributes. You might reasonably respond with "a one-off match of this other syntax would have been ok", but the SAL syntax also means that you can only specify one set of bounds per declaration. By having bounds be an attribute of the type you can specify the bounds at each level of indirection in a declaration, so you can have something like:

    void dump_string_list(int N, const char * __null_terminated * __counted_by(N) thing);
The other part is that this RFC also allows for opt in wide pointers so you can do

    void something_else(int N, int * __indexable * __counted_by(N) thing);
Which would presumably not be usable for an existing API, but for internal logic that isn't subject to ABI constraints would allow the adoption of bounds safety without significant source changes. It also highlights the impact on syntactic consistency of having this be an attribute of the type.

Now, given the necessary positional difference, reusing the same token as used for MS's SAL would make these incompatible, so code bases would need to use different macro names for each anyway.

Finally, if it was considered really valuable, then someone could implement the same implicit adoption that occurs when SAL attributes are present, as already occurs if you do something like

    void f(int N, int buffer[N])
while retaining the significantly more powerful and flexible attributes provided by this RFC.


You’d think maybe they could make the annotations less ugly


Who's they? Both annotations seem ugly. And ugliness is hardly a reason to litter everyone's code with a parallel set of names.


What is the ugliness you're talking about? They're clear and explicitly named so I'm not sure what you could change that would make things better


After Oracle v. Google, can a project consider a protocol designed by Microsoft safe to use even if they're doing their own implementation of the contract enforcement engine?


Given that SCOTUS essentially ruled that clean-room implementation is more or less automatically fair use, why not?


Making sure, firstly, that your implementation is clean room and, secondly, that you can prove that in court, is quite a hassle, though, especially if your implementers visited various conferences and standard meetings where they met people working on the feature being reimplemented.


Clang already supports a bunch of MS compiler features. I don't see why this is any different?


I recall reading that SAL was patent-encumbered, but I can't for the life of me find the actual patent. I remember it coming up in the LLVM forums though...


Oh wow, thanks, I had no idea. Googling brings this up, though: https://patents.google.com/patent/US7584458B2/en

If this patent is genuinely enforceable, surely the names of the keywords aren't relevant?


Been there! https://www.doc.ic.ac.uk/~phjk/BoundsChecking.html (adding gcc -fbounds-checking, in 1995). We didn't need special annotations.


How did you make

    struct TerribleArray {
       int count;
       int *elements;
    };
correctly bounds-check that elements is at least count elements long, and that indexing into elements is restricted to 0..<count?


It's bounded by the size of the object, e.g. the struct. You know this because it was either allocated with malloc or on the stack. There's a paper at the link (one of the most referenced on this topic) which explains everything in detail.


So many years of effort that could have been avoided had C introduced wide pointers into the standard 24 years ago.

This will fail like all the other annotation attempts have failed.


To see how much WG14 cares about security, note that they even ignored the language author's proposal.


I do not think security was an issue at this time. I assume you mean this proposal: https://www.bell-labs.com/usr/dmr/www/vararray.pdf

I like it and it is on my list to implement as a prototype for GCC.

And WG14 takes security seriously, but I think you overestimate the power WG14 has to simply change things. WG14 is supposed to standardize existing practice, not reinvent the language. So you should complain to compiler vendors. Or contribute to the development of open-source compilers.


I will believe that when C finally gets, at the very least, vocabulary types for safer strings and arrays, instead of reboots of functions with pointer/count.

Plenty of compiler vendors have extensions for safer C. Microsoft introduced a similar annotation mechanism to the one being discussed here when Windows XP SP2 came to be, in 2004.

Apparently 50 years weren't enough to make it happen.

Contrast this with how WG21 looks into security and code safety: plenty of papers on the subject, especially after the cybersecurity bills started to come about.


Strictly speaking, this wouldn't be C any more. But it would be a bit more memory-safe.


Out-of-bounds accesses are undefined behaviour. A compiler is free, within the bounds of the standard, to do whatever it wants for them, including doing dynamic checks.


C compilers have been doing stuff like that for decades… trivial example: many compilers will insert an implicit return statement if you don't write one yourself.

Without this, execution would just continue into whatever function happens to be next in memory, if there is one.

And yet… it's still considered the C language.


I think they meant more the new syntax.


It's still conforming C. It's not strictly conforming C, but no compiler extensions are. I'm not aware of any compiler that fully enforces strictly conforming C. It may not even be possible, since a strictly conforming program "shall not produce output dependent on any unspecified, undefined, or implementation-defined behavior"[1], which would require the compiler to reject any program that produces such output even though some cases can only be detected at runtime. And since strictly conforming programs can only use features of the language and standard library defined in the C standard, the compiler can't even insert runtime checks to exit the program if such behavior is encountered (since that would be a language extension).

[1] C17 standard 4.5


Implementation extensions have been around for decades. This seems as C as it can get! Also, the compiler has license to replace UB with anything, might as well be helpful.


Existing compilers that know nothing of the new annotations can still compile the code (without the checks obviously) using the macro definitions provided in the article.


If adopted by the C committees, would this then no longer "[not] be C any more"?



