mbox series

[0/4] -ffreestanding/-fno-builtin-* patches

Message ID 20200817220212.338670-1-ndesaulniers@google.com (mailing list archive)
Headers show
Series -ffreestanding/-fno-builtin-* patches | expand

Message

Nick Desaulniers Aug. 17, 2020, 10:02 p.m. UTC
-ffreestanding typically inhibits "libcall optimizations" where calls to
certain library functions can be replaced by the compiler in certain
cases to calls to other library functions that may be more efficient.
This can be problematic for embedded targets that don't provide full
libc implementations.

-ffreestanding inhibits all such optimizations, which is the safe
choice, but generally we want the optimizations that are performed. The
Linux kernel does implement a fair amount of libc routines. Instead of
-ffreestanding (which makes more sense in smaller images like kexec's
purgatory image), prefer -fno-builtin-* flags to disable the compiler
from emitting calls to functions which may not be defined.

If you see a linkage failure due to a missing symbol that's typically
defined in a libc, and not explicitly called from the source code, then
the compiler may have done such a transform.  You can either implement
such a function (ie. in lib/string.c) or disable the transform outright
via -fno-builtin-* flag (where * is the name of the library routine, ie.
-fno-builtin-bcmp).

Patch 1 unbreaks the build with ToT clang, which has been red all
weekend, by adding -fno-builtin-stpcpy.
Patch 2 is a revert but adds -fno-builtin-bcmp.
Patch 3 does the same for x86 purgatory.
Patch 4 removes -ffreestanding from i386.

The first patch makes sense for Kbuild, the second maybe akpm@, the
third and forth for x86.  Not sure who should pick up the series (they
can be merged out of order, technically) but I really need the first
patch soon. The 3 latter patches are cleanups.

Nick Desaulniers (4):
  Makefile: add -fno-builtin-stpcpy
  Revert "lib/string.c: implement a basic bcmp"
  x86/boot: use -fno-builtin-bcmp
  x86: don't build CONFIG_X86_32 as -ffreestanding

 Makefile               |  7 +++++++
 arch/x86/Makefile      |  3 ---
 arch/x86/boot/Makefile |  1 +
 arch/x86/boot/string.c |  8 --------
 include/linux/string.h |  3 ---
 lib/string.c           | 20 --------------------
 6 files changed, 8 insertions(+), 34 deletions(-)

Comments

H. Peter Anvin Aug. 17, 2020, 10:44 p.m. UTC | #1
On 2020-08-17 15:02, Nick Desaulniers wrote:
> -ffreestanding typically inhibits "libcall optimizations" where calls to
> certain library functions can be replaced by the compiler in certain
> cases to calls to other library functions that may be more efficient.
> This can be problematic for embedded targets that don't provide full
> libc implementations.
> 
> -ffreestanding inhibits all such optimizations, which is the safe
> choice, but generally we want the optimizations that are performed. The
> Linux kernel does implement a fair amount of libc routines. Instead of
> -ffreestanding (which makes more sense in smaller images like kexec's
> purgatory image), prefer -fno-builtin-* flags to disable the compiler
> from emitting calls to functions which may not be defined.
> 
> If you see a linkage failure due to a missing symbol that's typically
> defined in a libc, and not explicitly called from the source code, then
> the compiler may have done such a transform.  You can either implement
> such a function (ie. in lib/string.c) or disable the transform outright
> via -fno-builtin-* flag (where * is the name of the library routine, ie.
> -fno-builtin-bcmp).
> 

This is arguably exactly the wrong way around.

The way this *should* be done is by opt-in, not opt-out, which by almost
definition ends up being a game of whack-a-mole, like in this case
stpcpy(). Furthermore, it is unlikely that people will remember what
options to flip when and if stpcpy() happens to be implemented in the
kernel.

The problem here is twofold:

1. The user would be expected to know what kind of the optimizations the
compiler can do on what function, which is private knowledge to the
compiler.

2. The only way to override -fno-builtin is by a header file with macros
overriding the function names with __builtin, but that doesn't tell the
compiler proper anything about the execution environment.

So the Right Thing is for the compiler authors to change the way
-ffreestanding works.  -ffreestanding means, by definition, that there
are no library calls (other than libgcc or whatever else is supplied
with the compiler) that the compiler can call. That is currently an
all-or-nothing choice, or at least one choice per C standard implemented.

Instead, a compile job with -ffreestanding should be able to provide a
list of standard C functions that the compiler may call, and thus the
compiler actually can do the right thing depending on which exact
functions it would consider calling. This list is probably most easily
supplied in the form of a header file with #pragma directives.

	-hpa
Nick Desaulniers Aug. 18, 2020, 5:56 p.m. UTC | #2
On Mon, Aug 17, 2020 at 3:44 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 2020-08-17 15:02, Nick Desaulniers wrote:
> > -ffreestanding typically inhibits "libcall optimizations" where calls to
> > certain library functions can be replaced by the compiler in certain
> > cases to calls to other library functions that may be more efficient.
> > This can be problematic for embedded targets that don't provide full
> > libc implementations.
> >
> > -ffreestanding inhibits all such optimizations, which is the safe
> > choice, but generally we want the optimizations that are performed. The
> > Linux kernel does implement a fair amount of libc routines. Instead of
> > -ffreestanding (which makes more sense in smaller images like kexec's
> > purgatory image), prefer -fno-builtin-* flags to disable the compiler
> > from emitting calls to functions which may not be defined.
> >
> > If you see a linkage failure due to a missing symbol that's typically
> > defined in a libc, and not explicitly called from the source code, then
> > the compiler may have done such a transform.  You can either implement
> > such a function (ie. in lib/string.c) or disable the transform outright
> > via -fno-builtin-* flag (where * is the name of the library routine, ie.
> > -fno-builtin-bcmp).
> >
>
> This is arguably exactly the wrong way around.
>
> The way this *should* be done is by opt-in, not opt-out, which by almost
> definition ends up being a game of whack-a-mole, like in this case
> stpcpy(). Furthermore, it is unlikely that people will remember what
> options to flip when and if stpcpy() happens to be implemented in the
> kernel.
>
> The problem here is twofold:
>
> 1. The user would be expected to know what kind of the optimizations the
> compiler can do on what function, which is private knowledge to the
> compiler.
>
> 2. The only way to override -fno-builtin is by a header file with macros
> overriding the function names with __builtin, but that doesn't tell the
> compiler proper anything about the execution environment.
>
> So the Right Thing is for the compiler authors to change the way
> -ffreestanding works.

Sir, this is an Arby's

There are things all across the compilation landscape that make we
want to pontificate or even throw a tantrum in an Arby's.  Would I?
Well, no, I'm just trying to flip burgers or shovel the elephant
sh...or w/e they do at Arby's (I've never actually been; I detest
roast beef).

Would it be interesting to have a way of opting in, as you describe,
such that your compiler knew exactly what kind of embedded environment
it was targeting?  Maybe, but I'd argue that opting out is just the
other side of the same coin. Heads, I win; tails, you lose. That the
opt in or opt out list is shorter for a given project is not
particularly interesting.  Should we change the semantics of a fairly
commonly used compiler flag that multiple toolchains are in agreement
of, then fix all of the breakage in all of the code that relied on
those semantics?  I'm afraid that ship may have already
sailed...probably 20 or 30 years ago.

> -ffreestanding means, by definition, that there
> are no library calls (other than libgcc or whatever else is supplied
> with the compiler) that the compiler can call. That is currently an
> all-or-nothing choice, or at least one choice per C standard implemented.

Yes?

>
> Instead, a compile job with -ffreestanding should be able to provide a
> list of standard C functions that the compiler may call, and thus the
> compiler actually can do the right thing depending on which exact
> functions it would consider calling. This list is probably most easily
> supplied in the form of a header file with #pragma directives.
>
>         -hpa
>
>

Here we have a one line patch for keeping the build green.  If there's
some compiler feature you'd like implemented, let's sit down sometime
and work out the details.  I'll even buy you a beer.  But right now,
sir, the Arby's is on fire.  Please take your soapbox outside.

--
Thanks,
~Nick Desaulniers
H. Peter Anvin Aug. 18, 2020, 7:02 p.m. UTC | #3
On 2020-08-18 10:56, Nick Desaulniers wrote:
>>
>> The problem here is twofold:
>>
>> 1. The user would be expected to know what kind of the optimizations the
>> compiler can do on what function, which is private knowledge to the
>> compiler.
>>
>> 2. The only way to override -fno-builtin is by a header file with macros
>> overriding the function names with __builtin, but that doesn't tell the
>> compiler proper anything about the execution environment.
>>
>> So the Right Thing is for the compiler authors to change the way
>> -ffreestanding works.
> 
> Sir, this is an Arby's
> 
> There are things all across the compilation landscape that make we
> want to pontificate or even throw a tantrum in an Arby's.  Would I?
> Well, no, I'm just trying to flip burgers or shovel the elephant
> sh...or w/e they do at Arby's (I've never actually been; I detest
> roast beef).
> 
> Would it be interesting to have a way of opting in, as you describe,
> such that your compiler knew exactly what kind of embedded environment
> it was targeting?  Maybe, but I'd argue that opting out is just the
> other side of the same coin. Heads, I win; tails, you lose. That the
> opt in or opt out list is shorter for a given project is not
> particularly interesting.  Should we change the semantics of a fairly
> commonly used compiler flag that multiple toolchains are in agreement
> of, then fix all of the breakage in all of the code that relied on
> those semantics?  I'm afraid that ship may have already
> sailed...probably 20 or 30 years ago.
> 
>> -ffreestanding means, by definition, that there
>> are no library calls (other than libgcc or whatever else is supplied
>> with the compiler) that the compiler can call. That is currently an
>> all-or-nothing choice, or at least one choice per C standard implemented.
> 
> Yes?
> 

I'm not saying "change the semantics", nor am I saying that playing
whack-a-mole *for a limited time* is unreasonable. But I would like to go back
to the compiler authors and get them to implement such a #pragma: "this
freestanding implementation *does* support *this specific library function*,
and you are free to call it." The only way we can get what we really need from
the compilers is by speaking up and requesting it, and we have done so very
successfully recently; further back we tended to get a lot of
language-lawyering, but these days both the gcc and the clang teams have been
wonderfully responsive.

	-hpa
Linus Torvalds Aug. 18, 2020, 7:13 p.m. UTC | #4
On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> I'm not saying "change the semantics", nor am I saying that playing
> whack-a-mole *for a limited time* is unreasonable. But I would like to go back
> to the compiler authors and get them to implement such a #pragma: "this
> freestanding implementation *does* support *this specific library function*,
> and you are free to call it."

I'd much rather just see the library functions as builtins that always
do the right thing (with the fallback being "just call the standard
function").

IOW, there's nothing wrong with -ffreestanding if you then also have
__builtin_memcpy() etc, and they do the sane compiler optimizations
for memcpy().

What we want to avoid is the compiler making *assumptions* based on
standard names, because we may implement some of those things
differently.

And honestly, a compiler that uses 'bcmp' is just broken. WTH? It's
the year 2020, we don't use bcmp. It's that simple. Fix your damn
broken compiler and use memcmp. The argument that memcmp is more
expensive than bcmp is garbage legacy thinking from four decades ago.

It's likely the other way around, where people have actually spent
time on memcmp, but not on bcmp.

If somebody really *wants* to use bcmp, give them the "Get off my
lawn" flag, and leave them alone. But never ever should "use bcmp" be
any kind of default behavior. That's some batshit crazy stuff.

               Linus
Nick Desaulniers Aug. 18, 2020, 7:25 p.m. UTC | #5
On Tue, Aug 18, 2020 at 12:19 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > I'm not saying "change the semantics", nor am I saying that playing
> > whack-a-mole *for a limited time* is unreasonable. But I would like to go back
> > to the compiler authors and get them to implement such a #pragma: "this
> > freestanding implementation *does* support *this specific library function*,
> > and you are free to call it."
>
> I'd much rather just see the library functions as builtins that always
> do the right thing (with the fallback being "just call the standard
> function").
>
> IOW, there's nothing wrong with -ffreestanding if you then also have
> __builtin_memcpy() etc, and they do the sane compiler optimizations
> for memcpy().
>
> What we want to avoid is the compiler making *assumptions* based on
> standard names, because we may implement some of those things
> differently.
>
> And honestly, a compiler that uses 'bcmp' is just broken. WTH? It's
> the year 2020, we don't use bcmp. It's that simple. Fix your damn
> broken compiler and use memcmp. The argument that memcmp is more
> expensive than bcmp is garbage legacy thinking from four decades ago.
>
> It's likely the other way around, where people have actually spent
> time on memcmp, but not on bcmp.
>
> If somebody really *wants* to use bcmp, give them the "Get off my
> lawn" flag, and leave them alone. But never ever should "use bcmp" be
> any kind of default behavior. That's some batshit crazy stuff.
>
>                Linus

You'll have to ask Clement about that.  I'm not sure I ever saw the
"faster bcmp than memcmp" implementation, but I was told "it exists"
when I asked for a revert when all of our kernel builds went red.
Nick Desaulniers Aug. 18, 2020, 7:35 p.m. UTC | #6
On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> I'm not saying "change the semantics", nor am I saying that playing
> whack-a-mole *for a limited time* is unreasonable.

Ah, ok then.  Sorry I mischaracterized your position.

> But I would like to go back
> to the compiler authors and get them to implement such a #pragma: "this
> freestanding implementation *does* support *this specific library function*,
> and you are free to call it."

I think the first thing that would be helpful is some more detailed
write up of the problem statement, and analysis of why the current
provided tools are close but not enough.  Maybe filing a Clang bug
would be helpful to get more feedback from additional toolchain folks
than just me (https://bugs.llvm.org/enter_bug.cgi?product=clang "new
bug" component).  Alternatively, if you're planning on attending
plumbers next week, I plan to propose a "kernel toolchain" mailing
list for folks with whatever background to discuss future GNU C
extensions and how they might be used in kernel development.  That
might be more appropriate than a Clang bug, but it doesn't exist yet,
and feedback might be that it's a terrible idea for some reason.

> The only way we can get what we really need from
> the compilers is by speaking up and requesting it, and we have done so very
> successfully recently; further back we tended to get a lot of
> language-lawyering, but these days both the gcc and the clang teams have been
> wonderfully responsive.

Just trying to avoid the shoe, again.  I'd really like to get to the
point where we can untangle putting out fires from implementing
feature requests though.
Nick Desaulniers Aug. 18, 2020, 7:58 p.m. UTC | #7
On Tue, Aug 18, 2020 at 12:25 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Tue, Aug 18, 2020 at 12:19 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > >
> > > I'm not saying "change the semantics", nor am I saying that playing
> > > whack-a-mole *for a limited time* is unreasonable. But I would like to go back
> > > to the compiler authors and get them to implement such a #pragma: "this
> > > freestanding implementation *does* support *this specific library function*,
> > > and you are free to call it."
> >
> > I'd much rather just see the library functions as builtins that always
> > do the right thing (with the fallback being "just call the standard
> > function").
> >
> > IOW, there's nothing wrong with -ffreestanding if you then also have
> > __builtin_memcpy() etc, and they do the sane compiler optimizations
> > for memcpy().
> >
> > What we want to avoid is the compiler making *assumptions* based on
> > standard names, because we may implement some of those things
> > differently.

That's asking for trouble; please don't implement routines with
identifiers from libc but with differing function signatures, and then
proceed to *not* use -ffreestanding.  You can't have it both ways
(optimizations from *not* using -ffreestanding, then breaking all
kinds of assumptions based on conventions used across userspace), at
least not with the tools you currently have.

> >
> > And honestly, a compiler that uses 'bcmp' is just broken. WTH? It's
> > the year 2020, we don't use bcmp. It's that simple. Fix your damn
> > broken compiler and use memcmp. The argument that memcmp is more
> > expensive than bcmp is garbage legacy thinking from four decades ago.
> >
> > It's likely the other way around, where people have actually spent
> > time on memcmp, but not on bcmp.
> >
> > If somebody really *wants* to use bcmp, give them the "Get off my
> > lawn" flag,

I wrote a paper in college on the philosophy and symbolism in "Gran
Torino."  Would recommend (the movie, not the paper).

> > and leave them alone. But never ever should "use bcmp" be
> > any kind of default behavior. That's some batshit crazy stuff.
> >
> >                Linus
>
> You'll have to ask Clement about that.  I'm not sure I ever saw the
> "faster bcmp than memcmp" implementation, but I was told "it exists"
> when I asked for a revert when all of our kernel builds went red.

Also, to Clement's credit, every patch I've ever seen from Clement is
backed up by data; typically fleetwide profiles at Google.  "we spend
a lot of time in memcmp, particularly comparing the result against
zero and no other value; hmm...how do we spend less time in
memcmp...oh, well there's another library function with slightly
different semantics we can call instead."  I don't think anyone would
consider the optimization batshit crazy given the number of cycles
saved across the fleet.  That an embedded project didn't provide an
implementation, is a footnote that can be fixed in the embedded
project, either by using -ffreestanding or -fno-builtin-bcmp, which is
what this series proposes to do.
Arvind Sankar Aug. 18, 2020, 8:24 p.m. UTC | #8
On Tue, Aug 18, 2020 at 12:13:22PM -0700, Linus Torvalds wrote:
> On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > I'm not saying "change the semantics", nor am I saying that playing
> > whack-a-mole *for a limited time* is unreasonable. But I would like to go back
> > to the compiler authors and get them to implement such a #pragma: "this
> > freestanding implementation *does* support *this specific library function*,
> > and you are free to call it."
> 
> I'd much rather just see the library functions as builtins that always
> do the right thing (with the fallback being "just call the standard
> function").
> 
> IOW, there's nothing wrong with -ffreestanding if you then also have
> __builtin_memcpy() etc, and they do the sane compiler optimizations
> for memcpy().
> 
> What we want to avoid is the compiler making *assumptions* based on
> standard names, because we may implement some of those things
> differently.
> 

-ffreestanding as it stands today does have __builtin_memcpy and
friends. But you need to then use #define memcpy __builtin_memcpy etc,
which is messy and also doesn't fully express what you want. #pragma, or
even just allowing -fbuiltin-foo options would be useful.

The two compilers have some peculiarities, which means you really can't
have functions with the same name that do something else if you want to
use builtins at all, and can also lead to missed optimizations.

For eg, __builtin_strchr(s,'\0') can be optimized to strlen. gcc will
optimize it that way even if -ffreestanding is used (so strlen has to
mean strlen), while clang won't, so it misses a potential optimization.
This is admittedly a silly example, but you could imagine something like
strncpy being optimized to memcpy+memset if the source length was
previously computed.

PS: clang optimizes sprintf, but doesn't provide __builtin_sprintf?
Nick Desaulniers Aug. 18, 2020, 8:27 p.m. UTC | #9
On Tue, Aug 18, 2020 at 1:24 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> On Tue, Aug 18, 2020 at 12:13:22PM -0700, Linus Torvalds wrote:
> > On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > >
> > > I'm not saying "change the semantics", nor am I saying that playing
> > > whack-a-mole *for a limited time* is unreasonable. But I would like to go back
> > > to the compiler authors and get them to implement such a #pragma: "this
> > > freestanding implementation *does* support *this specific library function*,
> > > and you are free to call it."
> >
> > I'd much rather just see the library functions as builtins that always
> > do the right thing (with the fallback being "just call the standard
> > function").
> >
> > IOW, there's nothing wrong with -ffreestanding if you then also have
> > __builtin_memcpy() etc, and they do the sane compiler optimizations
> > for memcpy().
> >
> > What we want to avoid is the compiler making *assumptions* based on
> > standard names, because we may implement some of those things
> > differently.
> >
>
> -ffreestanding as it stands today does have __builtin_memcpy and
> friends. But you need to then use #define memcpy __builtin_memcpy etc,
> which is messy and also doesn't fully express what you want. #pragma, or
> even just allowing -fbuiltin-foo options would be useful.
>
> The two compilers have some peculiarities, which means you really can't
> have functions with the same name that do something else if you want to
> use builtins at all, and can also lead to missed optimizations.
>
> For eg, __builtin_strchr(s,'\0') can be optimized to strlen. gcc will
> optimize it that way even if -ffreestanding is used (so strlen has to
> mean strlen), while clang won't, so it misses a potential optimization.
> This is admittedly a silly example, but you could imagine something like
> strncpy being optimized to memcpy+memset if the source length was
> previously computed.
>
> PS: clang optimizes sprintf, but doesn't provide __builtin_sprintf?

https://bugs.llvm.org/show_bug.cgi?id=47224
Nick Desaulniers Aug. 18, 2020, 8:58 p.m. UTC | #10
On Tue, Aug 18, 2020 at 1:27 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Tue, Aug 18, 2020 at 1:24 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > On Tue, Aug 18, 2020 at 12:13:22PM -0700, Linus Torvalds wrote:
> > > On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > > >
> > > > I'm not saying "change the semantics", nor am I saying that playing
> > > > whack-a-mole *for a limited time* is unreasonable. But I would like to go back
> > > > to the compiler authors and get them to implement such a #pragma: "this
> > > > freestanding implementation *does* support *this specific library function*,
> > > > and you are free to call it."
> > >
> > > I'd much rather just see the library functions as builtins that always
> > > do the right thing (with the fallback being "just call the standard
> > > function").
> > >
> > > IOW, there's nothing wrong with -ffreestanding if you then also have
> > > __builtin_memcpy() etc, and they do the sane compiler optimizations
> > > for memcpy().
> > >
> > > What we want to avoid is the compiler making *assumptions* based on
> > > standard names, because we may implement some of those things
> > > differently.
> > >
> >
> > -ffreestanding as it stands today does have __builtin_memcpy and
> > friends. But you need to then use #define memcpy __builtin_memcpy etc,
> > which is messy and also doesn't fully express what you want. #pragma, or
> > even just allowing -fbuiltin-foo options would be useful.

I do really like the idea of -fbuiltin-foo.  For example, you'd specify:

-ffreestanding -fbuiltin-bcmp

as an example. `-ffreestanding` would opt you out of ALL libcall
optimizations, `-fbuiltin-bcmp` would then opt you back in to
transforms that produce bcmp.  That way you're informing the compiler
more precisely about the environment you'd be targeting.  It feels
symmetric to existing `-fno-` flags (clang makes -f vs -fno- pretty
easy when there is such symmetry).  And it's already convention that
if you specify multiple conflicting compiler flags, then the latter
one specified "wins."  In that sense, turning back on specific
libcalls after disabling the rest looks more ergonomic to me.

Maybe Eli or David have thoughts on why that may or may not be as
ergonomic or possible to implement as I imagine?

> >
> > The two compilers have some peculiarities, which means you really can't
> > have functions with the same name that do something else if you want to
> > use builtins at all, and can also lead to missed optimizations.
> >
> > For eg, __builtin_strchr(s,'\0') can be optimized to strlen. gcc will
> > optimize it that way even if -ffreestanding is used (so strlen has to
> > mean strlen), while clang won't, so it misses a potential optimization.
> > This is admittedly a silly example, but you could imagine something like
> > strncpy being optimized to memcpy+memset if the source length was
> > previously computed.
> >
> > PS: clang optimizes sprintf, but doesn't provide __builtin_sprintf?
>
> https://bugs.llvm.org/show_bug.cgi?id=47224
> --
> Thanks,
> ~Nick Desaulniers
Arvind Sankar Aug. 18, 2020, 9:41 p.m. UTC | #11
On Tue, Aug 18, 2020 at 01:58:51PM -0700, Nick Desaulniers wrote:
> On Tue, Aug 18, 2020 at 1:27 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > On Tue, Aug 18, 2020 at 1:24 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > >
> > > On Tue, Aug 18, 2020 at 12:13:22PM -0700, Linus Torvalds wrote:
> > > > On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > > > >
> > > > > I'm not saying "change the semantics", nor am I saying that playing
> > > > > whack-a-mole *for a limited time* is unreasonable. But I would like to go back
> > > > > to the compiler authors and get them to implement such a #pragma: "this
> > > > > freestanding implementation *does* support *this specific library function*,
> > > > > and you are free to call it."
> > > >
> > > > I'd much rather just see the library functions as builtins that always
> > > > do the right thing (with the fallback being "just call the standard
> > > > function").
> > > >
> > > > IOW, there's nothing wrong with -ffreestanding if you then also have
> > > > __builtin_memcpy() etc, and they do the sane compiler optimizations
> > > > for memcpy().
> > > >
> > > > What we want to avoid is the compiler making *assumptions* based on
> > > > standard names, because we may implement some of those things
> > > > differently.
> > > >
> > >
> > > -ffreestanding as it stands today does have __builtin_memcpy and
> > > friends. But you need to then use #define memcpy __builtin_memcpy etc,
> > > which is messy and also doesn't fully express what you want. #pragma, or
> > > even just allowing -fbuiltin-foo options would be useful.
> 
> I do really like the idea of -fbuiltin-foo.  For example, you'd specify:
> 
> -ffreestanding -fbuiltin-bcmp
> 
> as an example. `-ffreestanding` would opt you out of ALL libcall
> optimizations, `-fbuiltin-bcmp` would then opt you back in to
> transforms that produce bcmp.  That way you're informing the compiler
> more precisely about the environment you'd be targeting.  It feels
> symmetric to existing `-fno-` flags (clang makes -f vs -fno- pretty
> easy when there is such symmetry).  And it's already convention that
> if you specify multiple conflicting compiler flags, then the latter
> one specified "wins."  In that sense, turning back on specific
> libcalls after disabling the rest looks more ergonomic to me.
> 
> Maybe Eli or David have thoughts on why that may or may not be as
> ergonomic or possible to implement as I imagine?
> 

Note that -fno-builtin-foo seems to mean slightly different things in
clang and gcc. From experimentation, clang will neither optimize a call
to foo, nor perform an optimization that introduces a call to foo. gcc
will avoid optimizing calls to foo, but it can still generate new calls
to foo while optimizing something else. Which means that
-fno-builtin-{bcmp,stpcpy} only solves things for clang, not gcc. It's
just that gcc doesn't seem to have implemented those optimizations.
Dávid Bolvanský Aug. 18, 2020, 9:51 p.m. UTC | #12
So -fno-builtin-stpcpy is not ideal solution then.. gcc may implement this opt too and here we go again, red builds.

We should either provide stpcpy implementation or fix few places and do not use sprintf there.

> Dňa 18. 8. 2020 o 23:41 užívateľ Arvind Sankar <nivedita@alum.mit.edu> napísal:
> 
> On Tue, Aug 18, 2020 at 01:58:51PM -0700, Nick Desaulniers wrote:
>>> On Tue, Aug 18, 2020 at 1:27 PM Nick Desaulniers
>>> <ndesaulniers@google.com> wrote:
>>> 
>>> On Tue, Aug 18, 2020 at 1:24 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>>>> 
>>>> On Tue, Aug 18, 2020 at 12:13:22PM -0700, Linus Torvalds wrote:
>>>>> On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>>>>> 
>>>>>> I'm not saying "change the semantics", nor am I saying that playing
>>>>>> whack-a-mole *for a limited time* is unreasonable. But I would like to go back
>>>>>> to the compiler authors and get them to implement such a #pragma: "this
>>>>>> freestanding implementation *does* support *this specific library function*,
>>>>>> and you are free to call it."
>>>>> 
>>>>> I'd much rather just see the library functions as builtins that always
>>>>> do the right thing (with the fallback being "just call the standard
>>>>> function").
>>>>> 
>>>>> IOW, there's nothing wrong with -ffreestanding if you then also have
>>>>> __builtin_memcpy() etc, and they do the sane compiler optimizations
>>>>> for memcpy().
>>>>> 
>>>>> What we want to avoid is the compiler making *assumptions* based on
>>>>> standard names, because we may implement some of those things
>>>>> differently.
>>>>> 
>>>> 
>>>> -ffreestanding as it stands today does have __builtin_memcpy and
>>>> friends. But you need to then use #define memcpy __builtin_memcpy etc,
>>>> which is messy and also doesn't fully express what you want. #pragma, or
>>>> even just allowing -fbuiltin-foo options would be useful.
>> 
>> I do really like the idea of -fbuiltin-foo.  For example, you'd specify:
>> 
>> -ffreestanding -fbuiltin-bcmp
>> 
>> as an example. `-ffreestanding` would opt you out of ALL libcall
>> optimizations, `-fbuiltin-bcmp` would then opt you back in to
>> transforms that produce bcmp.  That way you're informing the compiler
>> more precisely about the environment you'd be targeting.  It feels
>> symmetric to existing `-fno-` flags (clang makes -f vs -fno- pretty
>> easy when there is such symmetry).  And it's already convention that
>> if you specify multiple conflicting compiler flags, then the latter
>> one specified "wins."  In that sense, turning back on specific
>> libcalls after disabling the rest looks more ergonomic to me.
>> 
>> Maybe Eli or David have thoughts on why that may or may not be as
>> ergonomic or possible to implement as I imagine?
>> 
> 
> Note that -fno-builtin-foo seems to mean slightly different things in
> clang and gcc. From experimentation, clang will neither optimize a call
> to foo, nor perform an optimization that introduces a call to foo. gcc
> will avoid optimizing calls to foo, but it can still generate new calls
> to foo while optimizing something else. Which means that
> -fno-builtin-{bcmp,stpcpy} only solves things for clang, not gcc. It's
> just that gcc doesn't seem to have implemented those optimizations.
David Laight Aug. 18, 2020, 9:53 p.m. UTC | #13
From: Nick Desaulniers
> Sent: 18 August 2020 21:59
> 
> On Tue, Aug 18, 2020 at 1:27 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
...
> > > -ffreestanding as it stands today does have __builtin_memcpy and
> > > friends. But you need to then use #define memcpy __builtin_memcpy etc,
> > > which is messy and also doesn't fully express what you want. #pragma, or
> > > even just allowing -fbuiltin-foo options would be useful.
> 
> I do really like the idea of -fbuiltin-foo.  For example, you'd specify:
> 
> -ffreestanding -fbuiltin-bcmp
> 
> as an example. `-ffreestanding` would opt you out of ALL libcall
> optimizations, `-fbuiltin-bcmp` would then opt you back in to
> transforms that produce bcmp.  That way you're informing the compiler
> more precisely about the environment you'd be targeting.  It feels
> symmetric to existing `-fno-` flags (clang makes -f vs -fno- pretty
> easy when there is such symmetry).  And it's already convention that
> if you specify multiple conflicting compiler flags, then the latter
> one specified "wins."  In that sense, turning back on specific
> libcalls after disabling the rest looks more ergonomic to me.
> 
> Maybe Eli or David have thoughts on why that may or may not be as
> ergonomic or possible to implement as I imagine?

You might want -fbuiltin-bcmp=my_bcmp_function so that you can specify
the actual elf symbol name to be used.

I was recently trying to compile an application so that it would
run on a system with an old libc.
Avoiding explicit calls to new functions wasn't a problem, but I
couldn't do anything about the memcpy() calls generated by gcc
for structure copies.
Due to the silly glibc fiasco with memcpy going backwards I'd
either need to force out a reference to the old version of memcpy
or a reference to memove - but neither is possible.

I then tried a C++ program and gave up because 'char traits'
was all different. I then realised I need to remove all references
to std::string to get any kind of efficient object code :-)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Nick Desaulniers Aug. 18, 2020, 9:59 p.m. UTC | #14
On Tue, Aug 18, 2020 at 2:41 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> On Tue, Aug 18, 2020 at 01:58:51PM -0700, Nick Desaulniers wrote:
> > On Tue, Aug 18, 2020 at 1:27 PM Nick Desaulniers
> > <ndesaulniers@google.com> wrote:
> > >
> > > On Tue, Aug 18, 2020 at 1:24 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > > >
> > > > On Tue, Aug 18, 2020 at 12:13:22PM -0700, Linus Torvalds wrote:
> > > > > On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > > > > >
> > > > > > I'm not saying "change the semantics", nor am I saying that playing
> > > > > > whack-a-mole *for a limited time* is unreasonable. But I would like to go back
> > > > > > to the compiler authors and get them to implement such a #pragma: "this
> > > > > > freestanding implementation *does* support *this specific library function*,
> > > > > > and you are free to call it."
> > > > >
> > > > > I'd much rather just see the library functions as builtins that always
> > > > > do the right thing (with the fallback being "just call the standard
> > > > > function").
> > > > >
> > > > > IOW, there's nothing wrong with -ffreestanding if you then also have
> > > > > __builtin_memcpy() etc, and they do the sane compiler optimizations
> > > > > for memcpy().
> > > > >
> > > > > What we want to avoid is the compiler making *assumptions* based on
> > > > > standard names, because we may implement some of those things
> > > > > differently.
> > > > >
> > > >
> > > > -ffreestanding as it stands today does have __builtin_memcpy and
> > > > friends. But you need to then use #define memcpy __builtin_memcpy etc,
> > > > which is messy and also doesn't fully express what you want. #pragma, or
> > > > even just allowing -fbuiltin-foo options would be useful.
> >
> > I do really like the idea of -fbuiltin-foo.  For example, you'd specify:
> >
> > -ffreestanding -fbuiltin-bcmp
> >
> > as an example. `-ffreestanding` would opt you out of ALL libcall
> > optimizations, `-fbuiltin-bcmp` would then opt you back in to
> > transforms that produce bcmp.  That way you're informing the compiler
> > more precisely about the environment you'd be targeting.  It feels
> > symmetric to existing `-fno-` flags (clang makes -f vs -fno- pretty
> > easy when there is such symmetry).  And it's already convention that
> > if you specify multiple conflicting compiler flags, then the latter
> > one specified "wins."  In that sense, turning back on specific
> > libcalls after disabling the rest looks more ergonomic to me.
> >
> > Maybe Eli or David have thoughts on why that may or may not be as
> > ergonomic or possible to implement as I imagine?
> >
>
> Note that -fno-builtin-foo seems to mean slightly different things in
> clang and gcc. From experimentation, clang will neither optimize a call
> to foo, nor perform an optimization that introduces a call to foo. gcc
> will avoid optimizing calls to foo, but it can still generate new calls
> to foo while optimizing something else. Which means that
> -fno-builtin-{bcmp,stpcpy} only solves things for clang, not gcc. It's
> just that gcc doesn't seem to have implemented those optimizations.

Can you please share some godbolt links that demonstrate these observations?
Dávid Bolvanský Aug. 18, 2020, 10:05 p.m. UTC | #15
Here:
https://godbolt.org/z/qjo5P6

st 19. 8. 2020 o 0:00 Nick Desaulniers <ndesaulniers@google.com> napísal(a):
>
> On Tue, Aug 18, 2020 at 2:41 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > On Tue, Aug 18, 2020 at 01:58:51PM -0700, Nick Desaulniers wrote:
> > > On Tue, Aug 18, 2020 at 1:27 PM Nick Desaulniers
> > > <ndesaulniers@google.com> wrote:
> > > >
> > > > On Tue, Aug 18, 2020 at 1:24 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > > > >
> > > > > On Tue, Aug 18, 2020 at 12:13:22PM -0700, Linus Torvalds wrote:
> > > > > > On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > > > > > >
> > > > > > > I'm not saying "change the semantics", nor am I saying that playing
> > > > > > > whack-a-mole *for a limited time* is unreasonable. But I would like to go back
> > > > > > > to the compiler authors and get them to implement such a #pragma: "this
> > > > > > > freestanding implementation *does* support *this specific library function*,
> > > > > > > and you are free to call it."
> > > > > >
> > > > > > I'd much rather just see the library functions as builtins that always
> > > > > > do the right thing (with the fallback being "just call the standard
> > > > > > function").
> > > > > >
> > > > > > IOW, there's nothing wrong with -ffreestanding if you then also have
> > > > > > __builtin_memcpy() etc, and they do the sane compiler optimizations
> > > > > > for memcpy().
> > > > > >
> > > > > > What we want to avoid is the compiler making *assumptions* based on
> > > > > > standard names, because we may implement some of those things
> > > > > > differently.
> > > > > >
> > > > >
> > > > > -ffreestanding as it stands today does have __builtin_memcpy and
> > > > > friends. But you need to then use #define memcpy __builtin_memcpy etc,
> > > > > which is messy and also doesn't fully express what you want. #pragma, or
> > > > > even just allowing -fbuiltin-foo options would be useful.
> > >
> > > I do really like the idea of -fbuiltin-foo.  For example, you'd specify:
> > >
> > > -ffreestanding -fbuiltin-bcmp
> > >
> > > as an example. `-ffreestanding` would opt you out of ALL libcall
> > > optimizations, `-fbuiltin-bcmp` would then opt you back in to
> > > transforms that produce bcmp.  That way you're informing the compiler
> > > more precisely about the environment you'd be targeting.  It feels
> > > symmetric to existing `-fno-` flags (clang makes -f vs -fno- pretty
> > > easy when there is such symmetry).  And it's already convention that
> > > if you specify multiple conflicting compiler flags, then the latter
> > > one specified "wins."  In that sense, turning back on specific
> > > libcalls after disabling the rest looks more ergonomic to me.
> > >
> > > Maybe Eli or David have thoughts on why that may or may not be as
> > > ergonomic or possible to implement as I imagine?
> > >
> >
> > Note that -fno-builtin-foo seems to mean slightly different things in
> > clang and gcc. From experimentation, clang will neither optimize a call
> > to foo, nor perform an optimization that introduces a call to foo. gcc
> > will avoid optimizing calls to foo, but it can still generate new calls
> > to foo while optimizing something else. Which means that
> > -fno-builtin-{bcmp,stpcpy} only solves things for clang, not gcc. It's
> > just that gcc doesn't seem to have implemented those optimizations.
>
> Can you please share some godbolt links that demonstrate these observations?
> --
> Thanks,
> ~Nick Desaulniers
Arvind Sankar Aug. 18, 2020, 10:25 p.m. UTC | #16
On Mon, Aug 17, 2020 at 03:02:08PM -0700, Nick Desaulniers wrote:
> -ffreestanding typically inhibits "libcall optimizations" where calls to
> certain library functions can be replaced by the compiler in certain
> cases to calls to other library functions that may be more efficient.
> This can be problematic for embedded targets that don't provide full
> libc implementations.
> 
> -ffreestanding inhibits all such optimizations, which is the safe
> choice, but generally we want the optimizations that are performed. The
> Linux kernel does implement a fair amount of libc routines. Instead of
> -ffreestanding (which makes more sense in smaller images like kexec's
> purgatory image), prefer -fno-builtin-* flags to disable the compiler
> from emitting calls to functions which may not be defined.
> 
> If you see a linkage failure due to a missing symbol that's typically
> defined in a libc, and not explicitly called from the source code, then
> the compiler may have done such a transform.  You can either implement
> such a function (ie. in lib/string.c) or disable the transform outright
> via -fno-builtin-* flag (where * is the name of the library routine, ie.
> -fno-builtin-bcmp).
> 
> Patch 1 unbreaks the build with ToT clang, which has been red all
> weekend, by adding -fno-builtin-stpcpy.
> Patch 2 is a revert but adds -fno-builtin-bcmp.
> Patch 3 does the same for x86 purgatory.
> Patch 4 removes -ffreestanding from i386.
> 
> The first patch makes sense for Kbuild, the second maybe akpm@, the
> third and forth for x86.  Not sure who should pick up the series (they
> can be merged out of order, technically) but I really need the first
> patch soon. The 3 latter patches are cleanups.
> 
> Nick Desaulniers (4):
>   Makefile: add -fno-builtin-stpcpy
>   Revert "lib/string.c: implement a basic bcmp"
>   x86/boot: use -fno-builtin-bcmp
>   x86: don't build CONFIG_X86_32 as -ffreestanding
> 
>  Makefile               |  7 +++++++
>  arch/x86/Makefile      |  3 ---
>  arch/x86/boot/Makefile |  1 +
>  arch/x86/boot/string.c |  8 --------
>  include/linux/string.h |  3 ---
>  lib/string.c           | 20 --------------------
>  6 files changed, 8 insertions(+), 34 deletions(-)
> 
> -- 
> 2.28.0.220.ged08abb693-goog
> 

Another thing that needs to be fixed is that at least lib/string.c needs
to be compiled with -ffreestanding.

gcc-10 optimizes the generic memset implementation in there into a call
to memset. Now that's on x86 which doesn't use the generic
implementation, but this is just waiting to bite us.

https://godbolt.org/z/6EhG15
Nick Desaulniers Aug. 18, 2020, 10:59 p.m. UTC | #17
On Tue, Aug 18, 2020 at 3:25 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> Another thing that needs to be fixed is that at least lib/string.c needs
> to be compiled with -ffreestanding.
>
> gcc-10 optimizes the generic memset implementation in there into a call
> to memset. Now that's on x86 which doesn't use the generic
> implementation, but this is just waiting to bite us.
>
> https://godbolt.org/z/6EhG15

I'll let you send the patch for that this time.  (It's too bad godbolt
doesn't have newer versions of GCC for cross compilation...cant test
aarch64 gcc-10, for example.)  It would be interesting for sure to see
resulting differences in disassembly observed in lib/string.o with
-ffreestanding.

But, oof, that's not good.  Certainly impressive and powerful loop
idiom recognition, but wouldn't you consider it a bug that this
optimization should probably first check that it's not replacing part
of a loop with a potentially recursive call to itself?

Admittedly, we've had the same shenanigans with memcpy implemented in
terms of calls to __builtin_memcpy being lowered to infinitely
recursive calls...which feels like the same kind of bug.  ("You wanted
infinite recursion in the kexec purgatory image, right?" "No,
compiler, I did not.")  example: https://godbolt.org/z/MzrTaM
(probably should fix this in both implementations; at the least I feel
like Clang's -Winfinite-recursion should try to help us out here).

Feels almost like it may be difficult to provide an implementation of
memset without stepping on a landmine.  One thing I'd be curious about
is whether all of lib/string.c would need -ffreestanding, or if you
could move just memset to its own TU then use -ffreestanding on that.
A free standing environment must always provide a core set of
functions like memset, memcpy, memcmp, memmove, according to
https://gcc.gnu.org/onlinedocs/gcc/Standards.html.  Maybe those four
should be in a separate TU compiled as -ffreestanding, so that they
can never be lowered to calls to themselves (potentially infinitely
recursive)?
Nick Desaulniers Aug. 18, 2020, 11:22 p.m. UTC | #18
On Tue, Aug 18, 2020 at 3:05 PM Dávid Bolvanský
<david.bolvansky@gmail.com> wrote:
>
> st 19. 8. 2020 o 0:00 Nick Desaulniers <ndesaulniers@google.com> napísal(a):
> >
> > On Tue, Aug 18, 2020 at 2:41 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > >
> > > Note that -fno-builtin-foo seems to mean slightly different things in
> > > clang and gcc. From experimentation, clang will neither optimize a call
> > > to foo, nor perform an optimization that introduces a call to foo. gcc
> > > will avoid optimizing calls to foo, but it can still generate new calls
> > > to foo while optimizing something else. Which means that
> > > -fno-builtin-{bcmp,stpcpy} only solves things for clang, not gcc. It's
> > > just that gcc doesn't seem to have implemented those optimizations.
> >
> > Can you please share some godbolt links that demonstrate these observations?
> Here:
> https://godbolt.org/z/qjo5P6

Ok, when I implemented this version that used -fno-builtin-stpcpy, I
initially+locally had it added to CLANG_FLAGS rather than
KBUILD_CFLAGS, but changed it to KBUILD_CFLAGS because I believed that
BOTH compilers would not lower calls to foo given -fno-builtin-foo.
Since we have evidence that's not the case, maybe that's the final
solution and my final proposal (v3). A summary:

1. v1 "implement stpcpy"
https://lore.kernel.org/lkml/20200815002417.1512973-1-ndesaulniers@google.com/T/#u
"Please don't provide more unsafe string functions to the kernel"
2. v2 "KBUILD_CFLAGS += -fno-builtin-stpcpy"
https://lore.kernel.org/lkml/20200817220212.338670-1-ndesaulniers@google.com/T/#t
"-fno-builtin-* doesn't work like that on GCC"
3. v3 "CLANG_FLAGS += -fno-builtin-stpcpy" TODO

I'll argue that providing an implementation of stpcpy while hiding the
declaration from include/lib/string.h "for the possibility that GCC
may one day perform the same libcall optimization" as YAGNI, that we
may cross that bridge by resurrecting v1 (with the removal of the hunk
against include/lib/string.h).  This also defers adding more unsafe
string functions in the kernel.

Thoughts before I send the patch and write that up?
Arvind Sankar Aug. 18, 2020, 11:51 p.m. UTC | #19
On Tue, Aug 18, 2020 at 03:59:45PM -0700, Nick Desaulniers wrote:
> On Tue, Aug 18, 2020 at 3:25 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > Another thing that needs to be fixed is that at least lib/string.c needs
> > to be compiled with -ffreestanding.
> >
> > gcc-10 optimizes the generic memset implementation in there into a call
> > to memset. Now that's on x86 which doesn't use the generic
> > implementation, but this is just waiting to bite us.
> >
> > https://godbolt.org/z/6EhG15
> 
> I'll let you send the patch for that this time.  (It's too bad godbolt
> doesn't have newer versions of GCC for cross compilation...cant test
> aarch64 gcc-10, for example.)  It would be interesting for sure to see
> resulting differences in disassembly observed in lib/string.o with
> -ffreestanding.

https://lore.kernel.org/lkml/20200818234307.3382306-1-nivedita@alum.mit.edu/

> 
> But, oof, that's not good.  Certainly impressive and powerful loop
> idiom recognition, but wouldn't you consider it a bug that this
> optimization should probably first check that it's not replacing part
> of a loop with a potentially recursive call to itself?

Difficult to check that in general, but I would have thought they'd at
least add a check for memset directly calling memset. It looks like they
considered that but then decided to go with -ffreestanding disabling the
optimization. Even gcc 4.x does it :)

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888

> 
> Admittedly, we've had the same shenanigans with memcpy implemented in
> terms of calls to __builtin_memcpy being lowered to infinitely
> recursive calls...which feels like the same kind of bug.  ("You wanted
> infinite recursion in the kexec purgatory image, right?" "No,
> compiler, I did not.")  example: https://godbolt.org/z/MzrTaM
> (probably should fix this in both implementations; at the least I feel
> like Clang's -Winfinite-recursion should try to help us out here).
> 
> Feels almost like it may be difficult to provide an implementation of
> memset without stepping on a landmine.  One thing I'd be curious about
> is whether all of lib/string.c would need -ffreestanding, or if you
> could move just memset to its own TU then use -ffreestanding on that.
> A free standing environment must always provide a core set of
> functions like memset, memcpy, memcmp, memmove, according to
> https://gcc.gnu.org/onlinedocs/gcc/Standards.html.  Maybe those four
> should be in a separate TU compiled as -ffreestanding, so that they
> can never be lowered to calls to themselves (potentially infinitely
> recursive)?
> -- 
> Thanks,
> ~Nick Desaulniers

I think all of it should be freestanding. Since eg the compiler could
recognize strcpy and turn it into a call to strcpy.

It turns out that at least memcpy can also be recognized, but gcc only
does that if the arguments have the restrict qualifier, so the version
in the kernel doesn't get broken.
Arvind Sankar Aug. 19, 2020, 12:20 a.m. UTC | #20
On Tue, Aug 18, 2020 at 03:59:45PM -0700, Nick Desaulniers wrote:
> On Tue, Aug 18, 2020 at 3:25 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > Another thing that needs to be fixed is that at least lib/string.c needs
> > to be compiled with -ffreestanding.
> >
> > gcc-10 optimizes the generic memset implementation in there into a call
> > to memset. Now that's on x86 which doesn't use the generic
> > implementation, but this is just waiting to bite us.
> >
> > https://godbolt.org/z/6EhG15
> 
> Admittedly, we've had the same shenanigans with memcpy implemented in
> terms of calls to __builtin_memcpy being lowered to infinitely
> recursive calls...which feels like the same kind of bug.  ("You wanted
> infinite recursion in the kexec purgatory image, right?" "No,
> compiler, I did not.")  example: https://godbolt.org/z/MzrTaM
> (probably should fix this in both implementations; at the least I feel
> like Clang's -Winfinite-recursion should try to help us out here).
> 

What's the other implementation we need to worry about? purgatory (at
least on x86) has freestanding already.
David Laight Aug. 19, 2020, 8:26 a.m. UTC | #21
From: Arvind Sankar
> Sent: 18 August 2020 23:26
...
> gcc-10 optimizes the generic memset implementation in there into a call
> to memset. Now that's on x86 which doesn't use the generic
> implementation, but this is just waiting to bite us.
> 
> https://godbolt.org/z/6EhG15

Gah, if I want a call to memset() I'll write a call to memset().
If I'm trying to zero a small number of bytes I'll write a loop
because it'll be faster.

I wish compilers wouldn't even consider these sorts of 'optimisations'.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Clement Courbet Aug. 19, 2020, 12:19 p.m. UTC | #22
On Tue, Aug 18, 2020 at 9:58 PM Nick Desaulniers <ndesaulniers@google.com> wrote:
On Tue, Aug 18, 2020 at 12:25 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Tue, Aug 18, 2020 at 12:19 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And honestly, a compiler that uses 'bcmp' is just broken. WTH? It's
> > the year 2020, we don't use bcmp. It's that simple. Fix your damn
> > broken compiler and use memcmp. The argument that memcmp is more
> > expensive than bcmp is garbage legacy thinking from four decades ago.
> >
> > It's likely the other way around, where people have actually spent
> > time on memcmp, but not on bcmp.
> >
> >                Linus
>
> You'll have to ask Clement about that.  I'm not sure I ever saw the
> "faster bcmp than memcmp" implementation, but I was told "it exists"
> when I asked for a revert when all of our kernel builds went red.

If **is** possible to make bcmp much faster then memcmp. We have one
such implementation internally (it's scheduled to be released as part of
llvm-libc some time this year), but most libc implementations just alias to
memcmp.

Below is a graph showing the impact of releasing this compiler optimization
with our optimized bcmp on the google fleet (the cumulative memcmp+bcmp usage
of all programs running on google datacenters, including the kernel). Scale and
dates have been redacted for obvious reasons, but note that the graph starts at
y=0, so you can compare the values relative to each other. Note how as memcmp
is progressively being replaced by bcmp (more and more programs being
recompiled with the compiler patch), the cumulative usage of memory
comparison drops significantly.
 
https://drive.google.com/file/d/1p8z1ilw2xaAJEnx_5eu-vflp3tEOv0qY/view?usp=sharing

The reasons why bcmp can be faster are:
 - typical libc implementations use the hardware to its full capacity, e.g. for
bcmp we can use vector loads and compares, which can process up to 64 bytes
(avx512) in one instruction. It's harder to implement memcmp with these for
little-endian architectures as there is no vector bswap. Because the kernel
only uses GPRs I can see how that might not perfectly fit the kernel use case.
But the kernel really is a special case, the compiler is written for most
programs, not specifically for the kernel, and most programs should benefit from
this optimization.
 - bcmp() does not have to look at the bytes in order, e.g. it can look at the
first and last . This is useful when comparing buffers that have common
prefixes (as happens in mostly sorted containers, and we have data that shows
that this is a quite common instance).
 

> Also, to Clement's credit, every patch I've ever seen from Clement is
> backed up by data; typically fleetwide profiles at Google.  "we spend
> a lot of time in memcmp, particularly comparing the result against
> zero and no other value; hmm...how do we spend less time in
> memcmp...oh, well there's another library function with slightly
> different semantics we can call instead."  I don't think anyone would
> consider the optimization batshit crazy given the number of cycles
> saved across the fleet.  That an embedded project didn't provide an
> implementation, is a footnote that can be fixed in the embedded
> project, either by using -ffreestanding or -fno-builtin-bcmp, which is
> what this series proposes to do.
Rasmus Villemoes Aug. 20, 2020, 2:56 p.m. UTC | #23
On 18/08/2020 23.41, Arvind Sankar wrote:
> On Tue, Aug 18, 2020 at 01:58:51PM -0700, Nick Desaulniers wrote:
>> On Tue, Aug 18, 2020 at 1:27 PM Nick Desaulniers
>> <ndesaulniers@google.com> wrote:
>>>
>>> On Tue, Aug 18, 2020 at 1:24 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>>>>
>>>> On Tue, Aug 18, 2020 at 12:13:22PM -0700, Linus Torvalds wrote:
>>>>> On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>>>>>
>>>>>> I'm not saying "change the semantics", nor am I saying that playing
>>>>>> whack-a-mole *for a limited time* is unreasonable. But I would like to go back
>>>>>> to the compiler authors and get them to implement such a #pragma: "this
>>>>>> freestanding implementation *does* support *this specific library function*,
>>>>>> and you are free to call it."
>>>>>
>>>>> I'd much rather just see the library functions as builtins that always
>>>>> do the right thing (with the fallback being "just call the standard
>>>>> function").
>>>>>
>>>>> IOW, there's nothing wrong with -ffreestanding if you then also have
>>>>> __builtin_memcpy() etc, and they do the sane compiler optimizations
>>>>> for memcpy().
>>>>>
>>>>> What we want to avoid is the compiler making *assumptions* based on
>>>>> standard names, because we may implement some of those things
>>>>> differently.
>>>>>
>>>>
>>>> -ffreestanding as it stands today does have __builtin_memcpy and
>>>> friends. But you need to then use #define memcpy __builtin_memcpy etc,
>>>> which is messy and also doesn't fully express what you want. #pragma, or
>>>> even just allowing -fbuiltin-foo options would be useful.
>>
>> I do really like the idea of -fbuiltin-foo.  For example, you'd specify:
>>
>> -ffreestanding -fbuiltin-bcmp
>>
>> as an example. `-ffreestanding` would opt you out of ALL libcall
>> optimizations, `-fbuiltin-bcmp` would then opt you back in to
>> transforms that produce bcmp.  That way you're informing the compiler
>> more precisely about the environment you'd be targeting.  It feels
>> symmetric to existing `-fno-` flags (clang makes -f vs -fno- pretty
>> easy when there is such symmetry).  And it's already convention that
>> if you specify multiple conflicting compiler flags, then the latter
>> one specified "wins."  In that sense, turning back on specific
>> libcalls after disabling the rest looks more ergonomic to me.
>>
>> Maybe Eli or David have thoughts on why that may or may not be as
>> ergonomic or possible to implement as I imagine?
>>
> 
> Note that -fno-builtin-foo seems to mean slightly different things in
> clang and gcc. From experimentation, clang will neither optimize a call
> to foo, nor perform an optimization that introduces a call to foo. gcc
> will avoid optimizing calls to foo, but it can still generate new calls
> to foo while optimizing something else. Which means that
> -fno-builtin-{bcmp,stpcpy} only solves things for clang, not gcc. It's
> just that gcc doesn't seem to have implemented those optimizations.
> 

I think it's more than that. I've always read gcc's documentation

'-fno-builtin'
'-fno-builtin-FUNCTION'
     Don't recognize built-in functions that do not begin with
     '__builtin_' as prefix. ...

     GCC normally generates special code to handle certain built-in
     functions more efficiently; for instance, calls to 'alloca' may
     become single instructions which adjust the stack directly, and
     calls to 'memcpy' may become inline copy loops.
     ...

to mean exactly that observed above and nothing more, i.e. that
-fno-builtin-foo merely means that gcc stops treating a call of a
function named foo to mean a call to a function implementing the
standard function by that name (and hence allows it to e.g. replace a
memcpy(d, s, 1) by byte load+store). It does not mean to prevent
emitting calls to foo, and I don't think it ever will - it's a bit sad
that clang has chosen to interpret these options differently.

Thinking out load, it would be useful if both compilers grew

  -fassume-provided-std-foo

and

  -fno-assume-provided-std-foo

options to tell the compiler that a function named foo with standard
semantics can be assumed (or not) to be provided by the execution
environment; i.e. one half of what -f(no-)builtin-foo apparently does
for clang currently.

And yes, the positive -fbuiltin-foo would also be quite useful in order
to get the compiler to recognize a few important functions (memcpy,
memcmp) while using -ffreestanding (or just plain -fno-builtin) to tell
it to avoid assuming anything about most std functions - I've worked on
a VxWorks target where snprintf() didn't have the correct "return what
would be written" semantics but rather behaved like the kernel's
non-standard scnprintf(), and who knows what other odd quirks that libc had.

Rasmus
Arvind Sankar Aug. 20, 2020, 5:56 p.m. UTC | #24
On Thu, Aug 20, 2020 at 04:56:02PM +0200, Rasmus Villemoes wrote:
> On 18/08/2020 23.41, Arvind Sankar wrote:
> > 
> > Note that -fno-builtin-foo seems to mean slightly different things in
> > clang and gcc. From experimentation, clang will neither optimize a call
> > to foo, nor perform an optimization that introduces a call to foo. gcc
> > will avoid optimizing calls to foo, but it can still generate new calls
> > to foo while optimizing something else. Which means that
> > -fno-builtin-{bcmp,stpcpy} only solves things for clang, not gcc. It's
> > just that gcc doesn't seem to have implemented those optimizations.
> > 
> 
> I think it's more than that. I've always read gcc's documentation
> 
> '-fno-builtin'
> '-fno-builtin-FUNCTION'
>      Don't recognize built-in functions that do not begin with
>      '__builtin_' as prefix. ...
> 
>      GCC normally generates special code to handle certain built-in
>      functions more efficiently; for instance, calls to 'alloca' may
>      become single instructions which adjust the stack directly, and
>      calls to 'memcpy' may become inline copy loops.
>      ...
> 
> to mean exactly that observed above and nothing more, i.e. that
> -fno-builtin-foo merely means that gcc stops treating a call of a
> function named foo to mean a call to a function implementing the
> standard function by that name (and hence allows it to e.g. replace a
> memcpy(d, s, 1) by byte load+store). It does not mean to prevent
> emitting calls to foo, and I don't think it ever will - it's a bit sad
> that clang has chosen to interpret these options differently.

That documentation is misleading, as it also goes on to say:
"...nor can you change the behavior of the functions by linking with a
different library"
which implies that you _can_ change the behavior if you use the option,
and which is what your "i.e." is saying as well.

My point is that this is not completely true: in gcc, foo by default is
defined to be __builtin_foo, and -fno-builtin-foo simply removes this
definition. So the effect is just that calls to foo in the original
source will be left alone.

But in order for an optimization that introduces a new call to foo to be
valid, foo _must_ have standard semantics: strchr(s,'\0') is not s +
strlen(s) unless strlen has standard semantics. This is an oversight in
gcc's optimizations: it converts to s + __builtin_strlen(s), which then
(normally) becomes s + strlen(s).

Check out this horror: https://godbolt.org/z/a1r9fK

Clang will disable this optimization if -fno-builtin-strlen is
specified.

Clang's interpretation is more useful for embedded, since you can use
-fno-builtin-foo and avoid calling __builtin_foo directly, and be
guaranteed that there will be no calls to foo that you didn't write
explicitly (outside of memcpy/memset/memcmp). In this case you are free
to implement foo with non-standard semantics, or avoid implementing it
altogether, and be reasonably confident that it will all work.

> 
> Thinking out load, it would be useful if both compilers grew
> 
>   -fassume-provided-std-foo
> 
> and
> 
>   -fno-assume-provided-std-foo
> 
> options to tell the compiler that a function named foo with standard
> semantics can be assumed (or not) to be provided by the execution
> environment; i.e. one half of what -f(no-)builtin-foo apparently does
> for clang currently.

Not following: -fno-assume-provided-std-foo sounds like it would have
exactly the same semantics as Clang's -fno-builtin-foo, except maybe in
addition it should cause the compiler to error on seeing __builtin_foo
if it can't implement that without calling foo.

> 
> And yes, the positive -fbuiltin-foo would also be quite useful in order
> to get the compiler to recognize a few important functions (memcpy,
> memcmp) while using -ffreestanding (or just plain -fno-builtin) to tell
> it to avoid assuming anything about most std functions - I've worked on
> a VxWorks target where snprintf() didn't have the correct "return what
> would be written" semantics but rather behaved like the kernel's
> non-standard scnprintf(), and who knows what other odd quirks that libc had.
> 
> Rasmus
Dávid Bolvanský Aug. 20, 2020, 6:05 p.m. UTC | #25
Yeah, gcc is doing weird things here : (

It is kinda sad that same flag does different things with gcc and clang.



> Dňa 20. 8. 2020 o 19:56 užívateľ Arvind Sankar <nivedita@alum.mit.edu> napísal:
> 
> On Thu, Aug 20, 2020 at 04:56:02PM +0200, Rasmus Villemoes wrote:
>>> On 18/08/2020 23.41, Arvind Sankar wrote:
>>> 
>>> Note that -fno-builtin-foo seems to mean slightly different things in
>>> clang and gcc. From experimentation, clang will neither optimize a call
>>> to foo, nor perform an optimization that introduces a call to foo. gcc
>>> will avoid optimizing calls to foo, but it can still generate new calls
>>> to foo while optimizing something else. Which means that
>>> -fno-builtin-{bcmp,stpcpy} only solves things for clang, not gcc. It's
>>> just that gcc doesn't seem to have implemented those optimizations.
>>> 
>> 
>> I think it's more than that. I've always read gcc's documentation
>> 
>> '-fno-builtin'
>> '-fno-builtin-FUNCTION'
>>     Don't recognize built-in functions that do not begin with
>>     '__builtin_' as prefix. ...
>> 
>>     GCC normally generates special code to handle certain built-in
>>     functions more efficiently; for instance, calls to 'alloca' may
>>     become single instructions which adjust the stack directly, and
>>     calls to 'memcpy' may become inline copy loops.
>>     ...
>> 
>> to mean exactly that observed above and nothing more, i.e. that
>> -fno-builtin-foo merely means that gcc stops treating a call of a
>> function named foo to mean a call to a function implementing the
>> standard function by that name (and hence allows it to e.g. replace a
>> memcpy(d, s, 1) by byte load+store). It does not mean to prevent
>> emitting calls to foo, and I don't think it ever will - it's a bit sad
>> that clang has chosen to interpret these options differently.
> 
> That documentation is misleading, as it also goes on to say:
> "...nor can you change the behavior of the functions by linking with a
> different library"
> which implies that you _can_ change the behavior if you use the option,
> and which is what your "i.e." is saying as well.
> 
> My point is that this is not completely true: in gcc, foo by default is
> defined to be __builtin_foo, and -fno-builtin-foo simply removes this
> definition. So the effect is just that calls to foo in the original
> source will be left alone.
> 
> But in order for an optimization that introduces a new call to foo to be
> valid, foo _must_ have standard semantics: strchr(s,'\0') is not s +
> strlen(s) unless strlen has standard semantics. This is an oversight in
> gcc's optimizations: it converts to s + __builtin_strlen(s), which then
> (normally) becomes s + strlen(s).
> 
> Check out this horror: https://godbolt.org/z/a1r9fK
> 
> Clang will disable this optimization if -fno-builtin-strlen is
> specified.
> 
> Clang's interpretation is more useful for embedded, since you can use
> -fno-builtin-foo and avoid calling __builtin_foo directly, and be
> guaranteed that there will be no calls to foo that you didn't write
> explicitly (outside of memcpy/memset/memcmp). In this case you are free
> to implement foo with non-standard semantics, or avoid implementing it
> altogether, and be reasonably confident that it will all work.
> 
>> 
>> Thinking out load, it would be useful if both compilers grew
>> 
>>  -fassume-provided-std-foo
>> 
>> and
>> 
>>  -fno-assume-provided-std-foo
>> 
>> options to tell the compiler that a function named foo with standard
>> semantics can be assumed (or not) to be provided by the execution
>> environment; i.e. one half of what -f(no-)builtin-foo apparently does
>> for clang currently.
> 
> Not following: -fno-assume-provided-std-foo sounds like it would have
> exactly the same semantics as Clang's -fno-builtin-foo, except maybe in
> addition it should cause the compiler to error on seeing __builtin_foo
> if it can't implement that without calling foo.
> 
>> 
>> And yes, the positive -fbuiltin-foo would also be quite useful in order
>> to get the compiler to recognize a few important functions (memcpy,
>> memcmp) while using -ffreestanding (or just plain -fno-builtin) to tell
>> it to avoid assuming anything about most std functions - I've worked on
>> a VxWorks target where snprintf() didn't have the correct "return what
>> would be written" semantics but rather behaved like the kernel's
>> non-standard scnprintf(), and who knows what other odd quirks that libc had.
>> 
>> Rasmus
H. Peter Anvin Aug. 20, 2020, 10:41 p.m. UTC | #26
On 2020-08-18 13:58, Nick Desaulniers wrote:
> On Tue, Aug 18, 2020 at 1:27 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
>>
>> On Tue, Aug 18, 2020 at 1:24 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>>>
>>> On Tue, Aug 18, 2020 at 12:13:22PM -0700, Linus Torvalds wrote:
>>>> On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>>>>
>>>>> I'm not saying "change the semantics", nor am I saying that playing
>>>>> whack-a-mole *for a limited time* is unreasonable. But I would like to go back
>>>>> to the compiler authors and get them to implement such a #pragma: "this
>>>>> freestanding implementation *does* support *this specific library function*,
>>>>> and you are free to call it."
>>>>
>>>> I'd much rather just see the library functions as builtins that always
>>>> do the right thing (with the fallback being "just call the standard
>>>> function").
>>>>
>>>> IOW, there's nothing wrong with -ffreestanding if you then also have
>>>> __builtin_memcpy() etc, and they do the sane compiler optimizations
>>>> for memcpy().
>>>>
>>>> What we want to avoid is the compiler making *assumptions* based on
>>>> standard names, because we may implement some of those things
>>>> differently.
>>>>
>>>
>>> -ffreestanding as it stands today does have __builtin_memcpy and
>>> friends. But you need to then use #define memcpy __builtin_memcpy etc,
>>> which is messy and also doesn't fully express what you want. #pragma, or
>>> even just allowing -fbuiltin-foo options would be useful.
> 
> I do really like the idea of -fbuiltin-foo.  For example, you'd specify:
> 
> -ffreestanding -fbuiltin-bcmp
> 
> as an example. `-ffreestanding` would opt you out of ALL libcall
> optimizations, `-fbuiltin-bcmp` would then opt you back in to
> transforms that produce bcmp.  That way you're informing the compiler
> more precisely about the environment you'd be targeting.  It feels
> symmetric to existing `-fno-` flags (clang makes -f vs -fno- pretty
> easy when there is such symmetry).  And it's already convention that
> if you specify multiple conflicting compiler flags, then the latter
> one specified "wins."  In that sense, turning back on specific
> libcalls after disabling the rest looks more ergonomic to me.
> 
> Maybe Eli or David have thoughts on why that may or may not be as
> ergonomic or possible to implement as I imagine?
> 

I would prefer this to be a #pragma for a header file, rather than
having a very long command line for everything...

	-hpa
Arvind Sankar Aug. 20, 2020, 11:17 p.m. UTC | #27
On Thu, Aug 20, 2020 at 03:41:33PM -0700, H. Peter Anvin wrote:
> 
> I would prefer this to be a #pragma for a header file, rather than
> having a very long command line for everything...
> 
> 	-hpa
> 

There is @option_file, though.
Linus Torvalds Aug. 20, 2020, 11:33 p.m. UTC | #28
On Thu, Aug 20, 2020 at 10:56 AM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> Clang's interpretation is more useful for embedded, since you can use
> -fno-builtin-foo and avoid calling __builtin_foo directly, and be
> guaranteed that there will be no calls to foo that you didn't write
> explicitly (outside of memcpy/memset/memcmp). In this case you are free
> to implement foo with non-standard semantics, or avoid implementing it
> altogether, and be reasonably confident that it will all work.

Honestly, I think concentrating on whether __builtin_foo() works or
not misses the point entirely.

That has never _ever_ been a problem for us, and I doubt it has been a
problem for anybody else either.

If you use __builtin_memcpy() in your source tree, then why would you
possibly ever want to disable it? And if you _don't_ use it, then
again - why would you ever want to disable it?

No, the real (and only) problem has always been about the compilers
magically and silently "recognizing" certain source patterns, and
turning them into function calls behind the users back.

And that has nothing to do with __builtin_foo(). At least it
_shouldn't_ have anything to do with it.

So this is things like the compiler silently seeing "oh, you called
your function 'free()', so we know that the stores you did to it are
dead and we'll remove them".

Or this is the compiler doing "Oh, you did four stores of zero in a
row, and and you asked for size optimizations, so we'll turn those
into a 'bzero()' call".

Or the compiler doing completely broken things, and turning a
"!memcmp()" expression into a "!bcmp()" because the compilier
incorrectly assumes it's faster.

Notice? Not a single one of those had any __builtin_xyz() pattern in
them. Quite the reverse. The compiler took something completely
different, and assumed builtin semantics without us having told it to.

So I think "-fno-builtin-xyz" is barking *completely* up the wrong
tree. It's missing the problem. The problem is not "I have some
builtin patterns, here, you can use them".

It's the same as all the vector intrinsics. Those don't hurt anybody -
as long as they only get used when people use the intrinsics. If the
compiler starts to suddenly use vector intrinsics without being told
to, *THAT* can be a problem. But there is never any reson to turn off
any particular intrinsic otherwise.

If you don't want it used, don't use it. And if you do use it, the
compiler generates the vector code sequence. It's that simple.

So to me, a compiler flag like "-fno-builtin-memcpy" is completely
insane. The flag adds absolutely no value.

The real value would be "-fno-magic-bcmp" which turns off stupid
optimizations that magically turn non-bcmp things into bcmp. But it
should not turn off *actual* __builtin_bcmp() if such a thing exists
and people want to explicitly use it.

             Linus
Rasmus Villemoes Aug. 21, 2020, 6:45 a.m. UTC | #29
On 20/08/2020 19.56, Arvind Sankar wrote:
> On Thu, Aug 20, 2020 at 04:56:02PM +0200, Rasmus Villemoes wrote:
>> On 18/08/2020 23.41, Arvind Sankar wrote:
>>>
>>> Note that -fno-builtin-foo seems to mean slightly different things in
>>> clang and gcc. From experimentation, clang will neither optimize a call
>>> to foo, nor perform an optimization that introduces a call to foo. gcc
>>> will avoid optimizing calls to foo, but it can still generate new calls
>>> to foo while optimizing something else. Which means that
>>> -fno-builtin-{bcmp,stpcpy} only solves things for clang, not gcc. It's
>>> just that gcc doesn't seem to have implemented those optimizations.
>>>
>>
>> I think it's more than that. I've always read gcc's documentation
>>
>> '-fno-builtin'
>> '-fno-builtin-FUNCTION'
>>      Don't recognize built-in functions that do not begin with
>>      '__builtin_' as prefix. ...
>>
>>      GCC normally generates special code to handle certain built-in
>>      functions more efficiently; for instance, calls to 'alloca' may
>>      become single instructions which adjust the stack directly, and
>>      calls to 'memcpy' may become inline copy loops.
>>      ...
>>
>> to mean exactly that observed above and nothing more, i.e. that
>> -fno-builtin-foo merely means that gcc stops treating a call of a
>> function named foo to mean a call to a function implementing the
>> standard function by that name (and hence allows it to e.g. replace a
>> memcpy(d, s, 1) by byte load+store). It does not mean to prevent
>> emitting calls to foo, and I don't think it ever will - it's a bit sad
>> that clang has chosen to interpret these options differently.
> 
> That documentation is misleading, as it also goes on to say:
> "...nor can you change the behavior of the functions by linking with a
> different library"
> which implies that you _can_ change the behavior if you use the option,
> and which is what your "i.e." is saying as well.
> 
> My point is that this is not completely true: in gcc, foo by default is
> defined to be __builtin_foo, and -fno-builtin-foo simply removes this
> definition. So the effect is just that calls to foo in the original
> source will be left alone.

Yes, this is a much better way of putting it. And with -fbuiltin-foo in
effect, the compiler just needs to transform the code in some way as-if
the standard function by that name was called, which it can of course
decide to implement by emitting such a call, but it can also open-code
it - or synthesize it using other std functions.

> But in order for an optimization that introduces a new call to foo to be
> valid, foo _must_ have standard semantics: strchr(s,'\0') is not s +
> strlen(s) unless strlen has standard semantics.

Correct. So I agree that -fno-builtin-strlen should prevent the compiler
from generating calls to strlen() that don't appear in the code.

This is an oversight in
> gcc's optimizations: it converts to s + __builtin_strlen(s), which then
> (normally) becomes s + strlen(s).
> 
> Check out this horror: https://godbolt.org/z/a1r9fK
> 
> Clang will disable this optimization if -fno-builtin-strlen is
> specified.
>
> Clang's interpretation is more useful for embedded, since you can use
> -fno-builtin-foo and avoid calling __builtin_foo directly, and be
> guaranteed that there will be no calls to foo that you didn't write
> explicitly (outside of memcpy/memset/memcmp). In this case you are free
> to implement foo with non-standard semantics, or avoid implementing it
> altogether, and be reasonably confident that it will all work.

Yeah, except that the list of -fno-builtin-foo one would have to pass is
enourmous, so for targets with a somewhat wonky libc, I'd much rather be
able to do a blanket -fno-builtin, and then manually check their memcpy,
memset and memcmp implementations and opt back in with
-fbuiltin-mem{cpy,set,cmp} so that small constant-size memcpys do get
properly open-coded.

The advice in gcc's documentation of just #definining memcpy() to
__builtin_memcpy() doesn't work in the real world (for example it breaks
C++ code that uses std::memcpy(...)).

>> Thinking out load, it would be useful if both compilers grew
>>
>>   -fassume-provided-std-foo
>>
>> and
>>
>>   -fno-assume-provided-std-foo
>>
>> options to tell the compiler that a function named foo with standard
>> semantics can be assumed (or not) to be provided by the execution
>> environment; i.e. one half of what -f(no-)builtin-foo apparently does
>> for clang currently.
> 
> Not following: -fno-assume-provided-std-foo sounds like it would have
> exactly the same semantics as Clang's -fno-builtin-foo, except maybe in
> addition it should cause the compiler to error on seeing __builtin_foo
> if it can't implement that without calling foo.

Yeah, I think you've convinced me there's no use for a separate option
to prevent inventing calls to foo() - I was mostly thinking of it as a
way to avoid having to provide each and every libc function that may
have been half-way standardized at some point. But if one doesn't
provide, say, bcmp, the code base certainly doesn't use bcmp itself, so
one doesn't lose anything by just using -fno-builtin-bcmp; there are no
explicit bcmp() uses that fail to get optimized.

Rasmus
Arvind Sankar Aug. 21, 2020, 5:29 p.m. UTC | #30
On Thu, Aug 20, 2020 at 04:33:03PM -0700, Linus Torvalds wrote:
> On Thu, Aug 20, 2020 at 10:56 AM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > Clang's interpretation is more useful for embedded, since you can use
> > -fno-builtin-foo and avoid calling __builtin_foo directly, and be
> > guaranteed that there will be no calls to foo that you didn't write
> > explicitly (outside of memcpy/memset/memcmp). In this case you are free
> > to implement foo with non-standard semantics, or avoid implementing it
> > altogether, and be reasonably confident that it will all work.
> 
> Honestly, I think concentrating on whether __builtin_foo() works or
> not misses the point entirely.
> 
> That has never _ever_ been a problem for us, and I doubt it has been a
> problem for anybody else either.
> 
> If you use __builtin_memcpy() in your source tree, then why would you
> possibly ever want to disable it? And if you _don't_ use it, then
> again - why would you ever want to disable it?
> 
> No, the real (and only) problem has always been about the compilers
> magically and silently "recognizing" certain source patterns, and
> turning them into function calls behind the users back.
> 
> And that has nothing to do with __builtin_foo(). At least it
> _shouldn't_ have anything to do with it.
> 

There seems to be some confusion here. The recognition and
__builtin_foo() go hand-in-hand: memcpy() is special _because_ the
compiler defines it to be __builtin_memcpy(); and the compiler turns the
patterns into __builtin_foo() calls, which just end up as a call to
foo() if they can't be inlined. The no-builtin- options _don't_ disable
__builtin_ functions. They remove the default definition of foo() as
__builtin_foo().

Take the problem that instigated this thread. __builtin_stpcpy() doesn't
work in the kernel because the fallback, stpcpy(), isn't implemented.
The optimization is doing:
  sprintf(buf,"%s",s)
  -> __builtin_sprintf(buf,"%s",s)
  -> __builtin_stpcpy(buf,s)-buf
  -> stpcpy(buf,s)-buf

Now, further below, you basically say this is an example of the compiler
taking something non-stpcpy() and turning it into stpcpy(), and you ask
for a no-magic-stpcpy that would stop this optimization. That's what
clang's no-builtin-stpcpy already does. The only extra thing it does is
that the compiler will also not touch an explicit call to stpcpy(), but
you can still call __builtin_stpcpy() if you really want it.

This is what was going on in that LZ4 memcpy() issue: the compiler was
faithfully compiling code like memcpy(d,s,8) into a call to memcpy()
because we told it not to define memcpy() as __builtin_memcpy(), by
compiling for a freestanding environment.

This is why I'm saying clang's no-builtin-foo option is useful for
embedded: it doesn't prevent the programmer using __builtin_foo(), it
prevents the _compiler_ using __builtin_foo() on its own.

> So this is things like the compiler silently seeing "oh, you called
> your function 'free()', so we know that the stores you did to it are
> dead and we'll remove them".
> 
> Or this is the compiler doing "Oh, you did four stores of zero in a
> row, and and you asked for size optimizations, so we'll turn those
> into a 'bzero()' call".

This one is slightly different from the previous one. The first case is
really a call to __builtin_free().

This one is turning something that wasn't a function call into
__builtin_bzero(), and I would hope that no-builtin-bzero would stop it
as well. OTOH, the compiler is free to turn it into memset(), just like
it could for structure/array initializers.

The memcpy/memset/memcmp family is a bit of an edge case: the compiler
requires them to be defined even for freestanding environments, so you
can't in general stop the compiler from turning something into memset().
(That -ffreestanding stops gcc from turning loops into memset() is a
pragmatic recognition that some people are going to try to implement
memset() in C.)

> 
> Or the compiler doing completely broken things, and turning a
> "!memcmp()" expression into a "!bcmp()" because the compilier
> incorrectly assumes it's faster.

Stop it with the bcmp-shaming already. bcmp _can_ be implemented faster
than memcmp, and it's no great loss if it isn't, since then it'll just
be an alias to memcmp in any sensible libc.

> 
> Notice? Not a single one of those had any __builtin_xyz() pattern in
> them. Quite the reverse. The compiler took something completely
> different, and assumed builtin semantics without us having told it to.
> 
> So I think "-fno-builtin-xyz" is barking *completely* up the wrong
> tree. It's missing the problem. The problem is not "I have some
> builtin patterns, here, you can use them".

Nope: in a hosted environment, xyz() _is_ __builtin_xyz(), and that is
almost always a good thing for 99% of the code out there: you tell it to
use builtin semantics by choosing to compile for a hosted environment.

If you want something in between freestanding and hosted, you absolutely
need some way to tell the compiler exactly which xyz()'s can be treated
as __builtin_xyz() and which ones shouldn't. The no-builtin- flags allow
you to start from a hosted environment and turn off the specialness of
the functions that you don't want to be special. The proposed builtin-
flags would allow you to start from freestanding and turn on the
specialness of the functions that you do want to be special.

> 
> It's the same as all the vector intrinsics. Those don't hurt anybody -
> as long as they only get used when people use the intrinsics. If the
> compiler starts to suddenly use vector intrinsics without being told
> to, *THAT* can be a problem. But there is never any reson to turn off
> any particular intrinsic otherwise.
> 
> If you don't want it used, don't use it. And if you do use it, the
> compiler generates the vector code sequence. It's that simple.
> 
> So to me, a compiler flag like "-fno-builtin-memcpy" is completely
> insane. The flag adds absolutely no value.
> 
> The real value would be "-fno-magic-bcmp" which turns off stupid
> optimizations that magically turn non-bcmp things into bcmp. But it
> should not turn off *actual* __builtin_bcmp() if such a thing exists
> and people want to explicitly use it.
> 
>              Linus
Linus Torvalds Aug. 21, 2020, 5:54 p.m. UTC | #31
On Fri, Aug 21, 2020 at 10:29 AM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> The no-builtin- options _don't_ disable
> __builtin_ functions. They remove the default definition of foo() as
> __builtin_foo().

Oh, ok, then it's fine.

> Take the problem that instigated this thread. __builtin_stpcpy() doesn't
> work in the kernel because the fallback, stpcpy(), isn't implemented.

Right.

The only problem is the - bogus - recognition of (or rewriting of)
code patterns that the compiler recohnmizes.

__builtin_stpcpy() itself is fine if somebody wants to use it - and
has the fallback.

But implicitly recognizing some code sequence and turning it into
something else is the problem.

> This is why I'm saying clang's no-builtin-foo option is useful for
> embedded: it doesn't prevent the programmer using __builtin_foo(), it
> prevents the _compiler_ using __builtin_foo() on its own.

And that's fine. But it's apparently not what gcc does.

> > So this is things like the compiler silently seeing "oh, you called
> > your function 'free()', so we know that the stores you did to it are
> > dead and we'll remove them".
> >
> > Or this is the compiler doing "Oh, you did four stores of zero in a
> > row, and and you asked for size optimizations, so we'll turn those
> > into a 'bzero()' call".
>
> This one is slightly different from the previous one. The first case is
> really a call to __builtin_free().

No, the first case is a disgrace and a compiler bug.

We've had a situation where gcc complained about a static function
called "free()", without any header file inclusion, and then
complained about it not matching its idea of what "free()" is.

Which is pure and utter garbage.

It's like you have a local variable "int free", and the compiler says
"hey, this doesn't match the prototype that I know this name should
have". It's BS. You just saw the user not just *use* that name, but
*define* it, and do it in a scope where the complaint is irrelevant
and stupid, and when we hadn't even included the header that would
have resulted in conflicts.

IOW, it's an example of a compiler that thinks "it knows better".

It's a broken compiler. And it's an example of the kind of breakage
that compilers absolutely shouldn't do.

The second example is from clang doesn't something horribly horribly stupid.

> This one is turning something that wasn't a function call into
> __builtin_bzero(), and I would hope that no-builtin-bzero would stop it
> as well. OTOH, the compiler is free to turn it into memset(), just like
> it could for structure/array initializers.

The whole "the compiler is free to do X" argument is pure BS. Stop
repeating that bogus argument.

Of COURSE a compiler can do whatever the hell it wants.

That doesn't change the fact that certain things are broken beyond
words and utterly stupid, and a compiler that does them is a *BAD*
compiler.

Turning four stores into a memset() is garbage. Just admit it, instead
of trying to say that it's allowed.

Technically a compiler can decode to simply not compile the kernel at
all, because we do things that are outside a lot of standards.

So the "technically allowed" is not an argument.

             Linus
Linus Torvalds Aug. 21, 2020, 6:02 p.m. UTC | #32
On Fri, Aug 21, 2020 at 10:54 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Aug 21, 2020 at 10:29 AM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > This is why I'm saying clang's no-builtin-foo option is useful for
> > embedded: it doesn't prevent the programmer using __builtin_foo(), it
> > prevents the _compiler_ using __builtin_foo() on its own.
>
> And that's fine. But it's apparently not what gcc does.

Oh, testing it seems to say that that is exactly what gcc does too. I
must have misunderstood some comment in this thread to mean otherwise.

                Linus
Arvind Sankar Aug. 21, 2020, 7:14 p.m. UTC | #33
On Fri, Aug 21, 2020 at 11:02:08AM -0700, Linus Torvalds wrote:
> On Fri, Aug 21, 2020 at 10:54 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Fri, Aug 21, 2020 at 10:29 AM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > >
> > > This is why I'm saying clang's no-builtin-foo option is useful for
> > > embedded: it doesn't prevent the programmer using __builtin_foo(), it
> > > prevents the _compiler_ using __builtin_foo() on its own.
> >
> > And that's fine. But it's apparently not what gcc does.
> 
> Oh, testing it seems to say that that is exactly what gcc does too. I
> must have misunderstood some comment in this thread to mean otherwise.
> 
>                 Linus

How are you testing it?

https://godbolt.org/z/eahdGn
Linus Torvalds Aug. 21, 2020, 7:23 p.m. UTC | #34
On Fri, Aug 21, 2020 at 12:15 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> How are you testing it?
>
> https://godbolt.org/z/eahdGn

Ugh. I tested the reverse thing - that the builtin is still available
for manual use despite the -fno-builtin.

Because I - stupidly - assumed that fno-builtin would do *something*.

But if it's available for manual use _and_ it's used implicitly by the
compiler, then that was clearly very naive of me.

                     Linus
Arvind Sankar Aug. 21, 2020, 7:57 p.m. UTC | #35
On Fri, Aug 21, 2020 at 10:54:57AM -0700, Linus Torvalds wrote:
> On Fri, Aug 21, 2020 at 10:29 AM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > This one is slightly different from the previous one. The first case is
> > really a call to __builtin_free().
> 
> No, the first case is a disgrace and a compiler bug.
> 
> We've had a situation where gcc complained about a static function
> called "free()", without any header file inclusion, and then
> complained about it not matching its idea of what "free()" is.
> 
> Which is pure and utter garbage.
> 
> It's like you have a local variable "int free", and the compiler says
> "hey, this doesn't match the prototype that I know this name should
> have". It's BS. You just saw the user not just *use* that name, but
> *define* it, and do it in a scope where the complaint is irrelevant
> and stupid, and when we hadn't even included the header that would
> have resulted in conflicts.
> 
> IOW, it's an example of a compiler that thinks "it knows better".
> 
> It's a broken compiler. And it's an example of the kind of breakage
> that compilers absolutely shouldn't do.

That's -Wbuiltin-declaration-mismatch, and can be turned off, and it
won't warn if you have -fno-builtin-free. I don't completely agree with
you, though warning for static functions is a bit overzealous. For an
external function, especially something more obscure like stpcpy(), I
appreciate the warning.

> 
> The second example is from clang doesn't something horribly horribly stupid.

Calm down man :)

> 
> > This one is turning something that wasn't a function call into
> > __builtin_bzero(), and I would hope that no-builtin-bzero would stop it
> > as well. OTOH, the compiler is free to turn it into memset(), just like
> > it could for structure/array initializers.
> 
> The whole "the compiler is free to do X" argument is pure BS. Stop
> repeating that bogus argument.
> 
> Of COURSE a compiler can do whatever the hell it wants.
> 
> That doesn't change the fact that certain things are broken beyond
> words and utterly stupid, and a compiler that does them is a *BAD*
> compiler.
> 
> Turning four stores into a memset() is garbage. Just admit it, instead
> of trying to say that it's allowed.
> 

Look, four stores into memset(), yeah that's a bit weird. I didn't think
you meant "four" literally. But in any case, that has nothing to do with
the topic at hand. It would be just as bad if it was a 16-byte structure
being initialized with an out-of-line memset() call.

But coming back to the actual topic: it is fine if the compiler turns
four stores into __builtin_memset(). A size-16 or -32 __builtin_memset()
will get inlined anyway. There's a lot of garbage here if you look
closely: check out what gcc does to initialize a 7-character array with
zeros at -Os.
Peter Zijlstra Aug. 21, 2020, 8:03 p.m. UTC | #36
On Fri, Aug 21, 2020 at 03:57:12PM -0400, Arvind Sankar wrote:
> closely: check out what gcc does to initialize a 7-character array with
> zeros at -Os.

You do know that 's' stands for 'stupid', right?

-Os is infamous for generating atrocious crap.
Linus Torvalds Aug. 21, 2020, 9:39 p.m. UTC | #37
On Fri, Aug 21, 2020 at 12:57 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> Look, four stores into memset(), yeah that's a bit weird. I didn't think
> you meant "four" literally. But in any case, that has nothing to do with
> the topic at hand. It would be just as bad if it was a 16-byte structure
> being initialized with an out-of-line memset() call.

Actually, I mis-remembered. It wasn't four stores.

It was two.

We have this lovely "sas_ss_reset()" function that initializes three
fields in a structure (two to zero, one to '2').

And we used it in a critical place that didn't allow function calls
(because we have magic rules with the SMAP instructions).

And clang turned the initalization into a memset().  Which then
triggered our "you can't do that here" check on the generated code.

This is the kind of special rules we sometimes can have for code
generation, where the compiler really doesn't understand that no, you
can't just replace this code sequence with a function call, because
there are things going on around it that really mean that the code
should be generated the way we wrote it.

> But coming back to the actual topic: it is fine if the compiler turns
> four stores into __builtin_memset(). A size-16 or -32 __builtin_memset()
> will get inlined anyway. There's a lot of garbage here if you look
> closely: check out what gcc does to initialize a 7-character array with
> zeros at -Os.

Yeah. The reason we had to make -Os be a non-preferred thing is
because gcc has some really sad things it does when optimizing for
size.

I happen to believe that I$ matters, but -Os made _such_ a mess of
things that it was untenable to use ;(

              Linus
Nick Desaulniers Aug. 22, 2020, 12:12 a.m. UTC | #38
On Fri, Aug 21, 2020 at 2:39 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Aug 21, 2020 at 12:57 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > Look, four stores into memset(), yeah that's a bit weird. I didn't think
> > you meant "four" literally. But in any case, that has nothing to do with
> > the topic at hand. It would be just as bad if it was a 16-byte structure
> > being initialized with an out-of-line memset() call.
>
> Actually, I mis-remembered. It wasn't four stores.
>
> It was two.
>
> We have this lovely "sas_ss_reset()" function that initializes three
> fields in a structure (two to zero, one to '2').
>
> And we used it in a critical place that didn't allow function calls
> (because we have magic rules with the SMAP instructions).
>
> And clang turned the initalization into a memset().  Which then
> triggered our "you can't do that here" check on the generated code.
>
> This is the kind of special rules we sometimes can have for code
> generation, where the compiler really doesn't understand that no, you
> can't just replace this code sequence with a function call, because
> there are things going on around it that really mean that the code
> should be generated the way we wrote it.

For more context for folks at home eating popcorn and enjoying the
show: https://github.com/ClangBuiltLinux/linux/issues/876#issuecomment-613049480.
And that was specifically with KASAN enabled and doesn't appear to be
common behavior in clang otherwise (higher threshold). Why the
heuristics change for when it seems to be more profitable to roll
assignment of contiguous members of the same struct to the same value
into a memset, and 2 longs seems to be the threshold for KASAN, I
don't know.  But I agree that should be fixed on the compiler side,
which is why I haven't been pushing the kernel workaround.

Everyone's just too busy right now; folks working on kernel sanitizers
have their hands full with KCSAN or MTE (on armv8.5) or default
initialization, and I'm trying to keep the build green (ie. this
series, well the v2 below, and
https://lore.kernel.org/lkml/20200812004308.1448603-1-nivedita@alum.mit.edu/
would be much appreciated), and get ready for plumbers, and wrap up my
intern's project, and do yearly performance reviews at my day job, and
million other things.

I've filed https://bugs.llvm.org/show_bug.cgi?id=47280 to discuss more
the ability to opt into builtins from a freestanding environment.

Now that Arvind has provided an excellent analysis of how the builtin
functionality works (bookmarked:
https://lore.kernel.org/lkml/20200821172935.GA1411923@rani.riverdale.lan/),
were there still objections to add the -fno-builtin-stpcpy flags for
clang to the Makefile?  I would like to get
https://lore.kernel.org/lkml/20200819191654.1130563-1-ndesaulniers@google.com/T/#m76c445f9645f62bc6ffc48ca26949725235688a0
landed so the build is not red for another week.
David Laight Aug. 22, 2020, 12:20 p.m. UTC | #39
> For more context for folks at home eating popcorn and enjoying the
> show: https://github.com/ClangBuiltLinux/linux/issues/876#issuecomment-613049480.
> And that was specifically with KASAN enabled and doesn't appear to be
> common behavior in clang otherwise (higher threshold). Why the
> heuristics change for when it seems to be more profitable to roll
> assignment of contiguous members of the same struct to the same value
> into a memset, and 2 longs seems to be the threshold for KASAN, I
> don't know.  But I agree that should be fixed on the compiler side,
> which is why I haven't been pushing the kernel workaround.

Given x86 has is a simple 3-instruction loop for memset
that will do 1 write/clock (the max on current cpu) I
doubt it is ever worth not inlining memset().
The only real special case is lengths < 8.

For KASAN I wonder if something is stopping it inlining memset()?
So what usually happens is the two stores get converted to memset()
and then the memset() gets inlined back to two stores?

OTOH all this faffing for memset and memcpy is probably a
waste of time.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Masahiro Yamada Aug. 24, 2020, 3:57 p.m. UTC | #40
On Wed, Aug 19, 2020 at 6:41 AM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> On Tue, Aug 18, 2020 at 01:58:51PM -0700, Nick Desaulniers wrote:
> > On Tue, Aug 18, 2020 at 1:27 PM Nick Desaulniers
> > <ndesaulniers@google.com> wrote:
> > >
> > > On Tue, Aug 18, 2020 at 1:24 PM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> > > >
> > > > On Tue, Aug 18, 2020 at 12:13:22PM -0700, Linus Torvalds wrote:
> > > > > On Tue, Aug 18, 2020 at 12:03 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > > > > >
> > > > > > I'm not saying "change the semantics", nor am I saying that playing
> > > > > > whack-a-mole *for a limited time* is unreasonable. But I would like to go back
> > > > > > to the compiler authors and get them to implement such a #pragma: "this
> > > > > > freestanding implementation *does* support *this specific library function*,
> > > > > > and you are free to call it."
> > > > >
> > > > > I'd much rather just see the library functions as builtins that always
> > > > > do the right thing (with the fallback being "just call the standard
> > > > > function").
> > > > >
> > > > > IOW, there's nothing wrong with -ffreestanding if you then also have
> > > > > __builtin_memcpy() etc, and they do the sane compiler optimizations
> > > > > for memcpy().
> > > > >
> > > > > What we want to avoid is the compiler making *assumptions* based on
> > > > > standard names, because we may implement some of those things
> > > > > differently.
> > > > >
> > > >
> > > > -ffreestanding as it stands today does have __builtin_memcpy and
> > > > friends. But you need to then use #define memcpy __builtin_memcpy etc,
> > > > which is messy and also doesn't fully express what you want. #pragma, or
> > > > even just allowing -fbuiltin-foo options would be useful.
> >
> > I do really like the idea of -fbuiltin-foo.  For example, you'd specify:
> >
> > -ffreestanding -fbuiltin-bcmp
> >
> > as an example. `-ffreestanding` would opt you out of ALL libcall
> > optimizations, `-fbuiltin-bcmp` would then opt you back in to
> > transforms that produce bcmp.  That way you're informing the compiler
> > more precisely about the environment you'd be targeting.  It feels
> > symmetric to existing `-fno-` flags (clang makes -f vs -fno- pretty
> > easy when there is such symmetry).  And it's already convention that
> > if you specify multiple conflicting compiler flags, then the latter
> > one specified "wins."  In that sense, turning back on specific
> > libcalls after disabling the rest looks more ergonomic to me.
> >
> > Maybe Eli or David have thoughts on why that may or may not be as
> > ergonomic or possible to implement as I imagine?
> >
>
> Note that -fno-builtin-foo seems to mean slightly different things in
> clang and gcc. From experimentation, clang will neither optimize a call
> to foo, nor perform an optimization that introduces a call to foo. gcc
> will avoid optimizing calls to foo, but it can still generate new calls
> to foo while optimizing something else. Which means that
> -fno-builtin-{bcmp,stpcpy} only solves things for clang, not gcc. It's
> just that gcc doesn't seem to have implemented those optimizations.


To prevent transformation from foo() into bar(),
there are two ways in Clang to do that;
-fno-builtin-foo, and -fno-builtin-bar.
There is only one in GCC; -fno-buitin-foo.

Is this correct?



I just played the optimization
from printf("helloworld\n") to puts("helloworld").

https://godbolt.org/z/5s4ded


-fno-builtin-puts cannot prevent clang
from emitting puts.
Is it because clang does not support
-fno-builtin-puts?

It is not clear to find out
which -fno-builtin-* is supported
because compilation succeeds anyway...


--
Best Regards
Masahiro Yamada
Arvind Sankar Aug. 24, 2020, 5:34 p.m. UTC | #41
On Tue, Aug 25, 2020 at 12:57:22AM +0900, Masahiro Yamada wrote:
> 
> 
> To prevent transformation from foo() into bar(),
> there are two ways in Clang to do that;
> -fno-builtin-foo, and -fno-builtin-bar.
> There is only one in GCC; -fno-buitin-foo.
> 
> Is this correct?
> 

It looked that way from previous experimentation, but...

> 
> 
> I just played the optimization
> from printf("helloworld\n") to puts("helloworld").
> 
> https://godbolt.org/z/5s4ded
> 
> 
> -fno-builtin-puts cannot prevent clang
> from emitting puts.
> Is it because clang does not support
> -fno-builtin-puts?

Ugh. clang doesn't have __builtin_puts() but it optimizes printf() into
puts(). It doesn't have __builtin_putchar() but will optimize
printf("c") into putchar('c').

And it has .. half of __builtin_fwrite()? fwrite() of a single byte gets
optimized into fputc(), fprintf() of a string literal gets optimized
into fwrite(), -fno-builtin-fwrite prevents both optimizations, but
__builtin_fwrite() gives a compile error.

I give up.
Nick Desaulniers Aug. 25, 2020, 7:10 a.m. UTC | #42
On Mon, Aug 24, 2020 at 10:34 AM Arvind Sankar <nivedita@alum.mit.edu> wrote:
>
> On Tue, Aug 25, 2020 at 12:57:22AM +0900, Masahiro Yamada wrote:
> >
> >
> > To prevent transformation from foo() into bar(),
> > there are two ways in Clang to do that;
> > -fno-builtin-foo, and -fno-builtin-bar.
> > There is only one in GCC; -fno-buitin-foo.
> >
> > Is this correct?
> >
>
> It looked that way from previous experimentation, but...
>
> >
> >
> > I just played the optimization
> > from printf("helloworld\n") to puts("helloworld").
> >
> > https://godbolt.org/z/5s4ded
> >
> >
> > -fno-builtin-puts cannot prevent clang
> > from emitting puts.
> > Is it because clang does not support
> > -fno-builtin-puts?
>
> Ugh. clang doesn't have __builtin_puts() but it optimizes printf() into
> puts(). It doesn't have __builtin_putchar() but will optimize
> printf("c") into putchar('c').

Bah, merely a <strikethrough>flesh
wound</strikethrough><strikethrough>compiler bug</strikethrough>rather
long TODO in the compiler.
https://github.com/llvm/llvm-project/blob/be2bc7d4cef2edd66c7fb74b70adf62fc68754db/clang/include/clang/Basic/Builtins.def#L943

Anyways, give me a week and I'll hack through the rest of them
https://reviews.llvm.org/D86508.  Certainly made HPA's point hit home,
that's a lot of functionality to implement or disable in an
environment.

Masahiro, are you implying that we shouldn't take the
-fno-builtin-stpcpy patch, because Clang is inconsistent? (That can be
fixed.) Even though -fno-builtin-stpcpy works here as intended?
https://lore.kernel.org/lkml/20200817220212.338670-2-ndesaulniers@google.com/

Otherwise we need to provide an implementation of this symbol in the kernel.
https://lore.kernel.org/lkml/20200815020946.1538085-1-ndesaulniers@google.com/

Please, pick your poison.
Nick Desaulniers Aug. 25, 2020, 7:31 a.m. UTC | #43
On Tue, Aug 25, 2020 at 12:10 AM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Mon, Aug 24, 2020 at 10:34 AM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > On Tue, Aug 25, 2020 at 12:57:22AM +0900, Masahiro Yamada wrote:
> > >
> > >
> > > To prevent transformation from foo() into bar(),
> > > there are two ways in Clang to do that;
> > > -fno-builtin-foo, and -fno-builtin-bar.
> > > There is only one in GCC; -fno-buitin-foo.
> > >
> > > Is this correct?
> > >
> >
> > It looked that way from previous experimentation, but...
> >
> > >
> > >
> > > I just played the optimization
> > > from printf("helloworld\n") to puts("helloworld").
> > >
> > > https://godbolt.org/z/5s4ded
> > >
> > >
> > > -fno-builtin-puts cannot prevent clang
> > > from emitting puts.
> > > Is it because clang does not support
> > > -fno-builtin-puts?
> >
> > Ugh. clang doesn't have __builtin_puts() but it optimizes printf() into
> > puts(). It doesn't have __builtin_putchar() but will optimize
> > printf("c") into putchar('c').
>
> Bah, merely a <strikethrough>flesh
> wound</strikethrough><strikethrough>compiler bug</strikethrough>rather
> long TODO in the compiler.
> https://github.com/llvm/llvm-project/blob/be2bc7d4cef2edd66c7fb74b70adf62fc68754db/clang/include/clang/Basic/Builtins.def#L943
>
> Anyways, give me a week and I'll hack through the rest of them
> https://reviews.llvm.org/D86508.  Certainly made HPA's point hit home,
> that's a lot of functionality to implement or disable in an
> environment.
>
> Masahiro, are you implying that we shouldn't take the
> -fno-builtin-stpcpy patch, because Clang is inconsistent? (That can be
> fixed.) Even though -fno-builtin-stpcpy works here as intended?
> https://lore.kernel.org/lkml/20200817220212.338670-2-ndesaulniers@google.com/

Sorry, the above link ^ should be this hunk (beyond tired, getting up
in 4.5hrs for plumbers):

diff --git a/Makefile b/Makefile
index c4470a4e131f..6a08cdfa58ae 100644
--- a/Makefile
+++ b/Makefile
@@ -577,6 +577,7 @@ ifneq ($(LLVM_IAS),1)
 CLANG_FLAGS    += -no-integrated-as
 endif
 CLANG_FLAGS    += -Werror=unknown-warning-option
+CLANG_FLAGS    += -fno-builtin-stpcpy
 KBUILD_CFLAGS  += $(CLANG_FLAGS)
 KBUILD_AFLAGS  += $(CLANG_FLAGS)
 export CLANG_FLAGS

>
> Otherwise we need to provide an implementation of this symbol in the kernel.
> https://lore.kernel.org/lkml/20200815020946.1538085-1-ndesaulniers@google.com/
>
> Please, pick your poison.
> --
> Thanks,
> ~Nick Desaulniers
Masahiro Yamada Aug. 25, 2020, 12:28 p.m. UTC | #44
On Tue, Aug 25, 2020 at 4:10 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Mon, Aug 24, 2020 at 10:34 AM Arvind Sankar <nivedita@alum.mit.edu> wrote:
> >
> > On Tue, Aug 25, 2020 at 12:57:22AM +0900, Masahiro Yamada wrote:
> > >
> > >
> > > To prevent transformation from foo() into bar(),
> > > there are two ways in Clang to do that;
> > > -fno-builtin-foo, and -fno-builtin-bar.
> > > There is only one in GCC; -fno-buitin-foo.
> > >
> > > Is this correct?
> > >
> >
> > It looked that way from previous experimentation, but...
> >
> > >
> > >
> > > I just played the optimization
> > > from printf("helloworld\n") to puts("helloworld").
> > >
> > > https://godbolt.org/z/5s4ded
> > >
> > >
> > > -fno-builtin-puts cannot prevent clang
> > > from emitting puts.
> > > Is it because clang does not support
> > > -fno-builtin-puts?
> >
> > Ugh. clang doesn't have __builtin_puts() but it optimizes printf() into
> > puts(). It doesn't have __builtin_putchar() but will optimize
> > printf("c") into putchar('c').
>
> Bah, merely a <strikethrough>flesh
> wound</strikethrough><strikethrough>compiler bug</strikethrough>rather
> long TODO in the compiler.
> https://github.com/llvm/llvm-project/blob/be2bc7d4cef2edd66c7fb74b70adf62fc68754db/clang/include/clang/Basic/Builtins.def#L943
>
> Anyways, give me a week and I'll hack through the rest of them
> https://reviews.llvm.org/D86508.  Certainly made HPA's point hit home,
> that's a lot of functionality to implement or disable in an
> environment.
>
> Masahiro, are you implying that we shouldn't take the
> -fno-builtin-stpcpy patch, because Clang is inconsistent? (That can be
> fixed.) Even though -fno-builtin-stpcpy works here as intended?
> https://lore.kernel.org/lkml/20200817220212.338670-2-ndesaulniers@google.com/
>
> Otherwise we need to provide an implementation of this symbol in the kernel.
> https://lore.kernel.org/lkml/20200815020946.1538085-1-ndesaulniers@google.com/
>
> Please, pick your poison.



I am not a compiler expert.

Nor am I sure if I am the right person who makes this decision.
But, if so, I would choose the latter.
(implement stpcpy() in the kernel)

I was addressed last night, so
I should write up my thoughts.

I do not think -fno-builtin-stpcpy is a
general solution.

-fno-builtin-stpcpy will work for now
because only Clang implements the transformation
from 'sprintf(dest, "%s", str)' into
'stpcpy(dest, str) - dest'.

If GCC implements it some day,
we would run into a problem because
in GCC, it is not -fno-builtin-stpcpy, but
-fno-builtin-sprintf that disables that optimization.

In this regard, 'KBUILD_CFLAGS += -fno-builtin-sprintf'
would be more future-proof, but it is potentially
an overkill.
We want to disable optimization from sprintf() to stpcpy(),
but we may still benefit from the optimization from
sprintf() into something else.


Linus is uncomfortable with this kind of compiler magic.
If we take compiler's freedom away,
-ffreestanding is a big hammer to solve this problem.

If we welcome the compiler's optimization,
we should implement stpcpy(), bcmp(), and whatever
until we solve all link errors.



--
Best Regards
Masahiro Yamada
Nick Desaulniers Aug. 25, 2020, 2:02 p.m. UTC | #45
On Tue, Aug 25, 2020 at 5:30 AM Masahiro Yamada <masahiroy@kernel.org> wrote:
>
> On Tue, Aug 25, 2020 at 4:10 PM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > Masahiro, are you implying that we shouldn't take the
> > -fno-builtin-stpcpy patch, because Clang is inconsistent? (That can be
> > fixed.) Even though -fno-builtin-stpcpy works here as intended?
> > https://lore.kernel.org/lkml/20200817220212.338670-2-ndesaulniers@google.com/
> >
> > Otherwise we need to provide an implementation of this symbol in the kernel.
> > https://lore.kernel.org/lkml/20200815020946.1538085-1-ndesaulniers@google.com/
> >
> > Please, pick your poison.
>
>
>
> I am not a compiler expert.
>
> Nor am I sure if I am the right person who makes this decision.
> But, if so, I would choose the latter.
> (implement stpcpy() in the kernel)
>
> I was addressed last night, so
> I should write up my thoughts.
>
> I do not think -fno-builtin-stpcpy is a
> general solution.
>
> -fno-builtin-stpcpy will work for now
> because only Clang implements the transformation
> from 'sprintf(dest, "%s", str)' into
> 'stpcpy(dest, str) - dest'.
>
> If GCC implements it some day,
> we would run into a problem because
> in GCC, it is not -fno-builtin-stpcpy, but
> -fno-builtin-sprintf that disables that optimization.
>
> In this regard, 'KBUILD_CFLAGS += -fno-builtin-sprintf'
> would be more future-proof, but it is potentially
> an overkill.
> We want to disable optimization from sprintf() to stpcpy(),
> but we may still benefit from the optimization from
> sprintf() into something else.
>
>
> Linus is uncomfortable with this kind of compiler magic.
> If we take compiler's freedom away,
> -ffreestanding is a big hammer to solve this problem.
>
> If we welcome the compiler's optimization,
> we should implement stpcpy(), bcmp(), and whatever
> until we solve all link errors.

Speculating that -ffreestanding is untenable, submitted v3:
https://lore.kernel.org/lkml/20200825135838.2938771-1-ndesaulniers@google.com/T/#u
Masahiro Yamada Aug. 26, 2020, 1:28 p.m. UTC | #46
On Tue, Aug 25, 2020 at 11:02 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> On Tue, Aug 25, 2020 at 5:30 AM Masahiro Yamada <masahiroy@kernel.org> wrote:
> >
> > On Tue, Aug 25, 2020 at 4:10 PM Nick Desaulniers
> > <ndesaulniers@google.com> wrote:
> > >
> > > Masahiro, are you implying that we shouldn't take the
> > > -fno-builtin-stpcpy patch, because Clang is inconsistent? (That can be
> > > fixed.) Even though -fno-builtin-stpcpy works here as intended?
> > > https://lore.kernel.org/lkml/20200817220212.338670-2-ndesaulniers@google.com/
> > >
> > > Otherwise we need to provide an implementation of this symbol in the kernel.
> > > https://lore.kernel.org/lkml/20200815020946.1538085-1-ndesaulniers@google.com/
> > >
> > > Please, pick your poison.
> >
> >
> >
> > I am not a compiler expert.
> >
> > Nor am I sure if I am the right person who makes this decision.
> > But, if so, I would choose the latter.
> > (implement stpcpy() in the kernel)
> >
> > I was addressed last night, so
> > I should write up my thoughts.
> >
> > I do not think -fno-builtin-stpcpy is a
> > general solution.
> >
> > -fno-builtin-stpcpy will work for now
> > because only Clang implements the transformation
> > from 'sprintf(dest, "%s", str)' into
> > 'stpcpy(dest, str) - dest'.
> >
> > If GCC implements it some day,
> > we would run into a problem because
> > in GCC, it is not -fno-builtin-stpcpy, but
> > -fno-builtin-sprintf that disables that optimization.
> >
> > In this regard, 'KBUILD_CFLAGS += -fno-builtin-sprintf'
> > would be more future-proof, but it is potentially
> > an overkill.
> > We want to disable optimization from sprintf() to stpcpy(),
> > but we may still benefit from the optimization from
> > sprintf() into something else.
> >
> >
> > Linus is uncomfortable with this kind of compiler magic.
> > If we take compiler's freedom away,
> > -ffreestanding is a big hammer to solve this problem.
> >
> > If we welcome the compiler's optimization,
> > we should implement stpcpy(), bcmp(), and whatever
> > until we solve all link errors.
>
> Speculating that -ffreestanding is untenable, submitted v3:
> https://lore.kernel.org/lkml/20200825135838.2938771-1-ndesaulniers@google.com/T/#u
> --
> Thanks,
> ~Nick Desaulniers

FYI:
Andrew Morton picked up this patch.