diff mbox

[v2] Avoid reusing string buffer when doing string expansion

Message ID 20150204020059.GA7069@macpro.local (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Luc Van Oostenryck Feb. 4, 2015, 2:01 a.m. UTC
In get_string_constant(), the code tried to reuse the storage for the string
but only if the expansion of the string was not bigger than its unexpanded form.
But this string can be shared with other expressions and reusing the buffer will
result in later corruption

A minimal exemple would be something like:
const char a[] = BACKSLASH;
const char b[] = BACKSLASH;

The expansion for 'a' will correctly produce the two-char string consisting
of a backslash char followed by a null char.
But then the expansion of 'b' will expand this once more,
producing the expansion of "\0": the two-char string: { '\0', '\0' }.

The fix is to not reuse the storage for the string if any king of expansion
have been done.

Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
---
 char.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Comments

Christopher Li Feb. 4, 2015, 5:30 a.m. UTC | #1
On Tue, Feb 3, 2015 at 6:01 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>
> In get_string_constant(), the code tried to reuse the storage for the string
> but only if the expansion of the string was not bigger than its unexpanded form.
> But this string can be shared with other expressions and reusing the buffer will
> result in later corruption
>
> A minimal exemple would be something like:
> const char a[] = BACKSLASH;
> const char b[] = BACKSLASH;
>
> The expansion for 'a' will correctly produce the two-char string consisting
> of a backslash char followed by a null char.
> But then the expansion of 'b' will expand this once more,
> producing the expansion of "\0": the two-char string: { '\0', '\0' }.

Are you sure about this behavior? You mean you see "b" has the string
size as 2. I haven't understand how this can happen.

> The fix is to not reuse the storage for the string if any king of expansion
> have been done.

That is a bit over kill. We only need to avoid reuse storage if the
destination part of the string is come from a preprocessor macro.
It is pretty common string contain escape sequence. We don't
want to allocate extra memory copy if it is not part of a macro
expansion.

>
> Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
> ---
>  char.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/char.c b/char.c
> index 08ca2230..2e21bb77 100644
> --- a/char.c
> +++ b/char.c
> @@ -123,11 +123,15 @@ struct token *get_string_constant(struct token *token, struct expression *expr)
>                 len = MAX_STRING;
>         }
>
> -       if (len >= string->length)      /* can't cannibalize */
> +       /* The input string can be shared with other expression and so
> +        * its storage can't be reused if any kind of expansion have been done on it.
> +        */
> +       if ((len != string->length) || memcmp(buffer, string->data, len)) {

I don' think this check take into account the preprocessor macro has
been used or not. In other words, any general "hello world\n" which
contain the escape character will produce a different buffer, there for,
a new copy of the string. Which is not necessary. That is a pretty
common case.

I am working on patch to address it in the preprocessor macro.
The idea is that just mark the string as immutable if it is part of the
macro expansion. I will see how it goes.

Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luc Van Oostenryck Feb. 4, 2015, 6:22 a.m. UTC | #2
On Tue, Feb 03, 2015 at 09:30:15PM -0800, Christopher Li wrote:
> On Tue, Feb 3, 2015 at 6:01 PM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
> >
> > In get_string_constant(), the code tried to reuse the storage for the string
> > but only if the expansion of the string was not bigger than its unexpanded form.
> > But this string can be shared with other expressions and reusing the buffer will
> > result in later corruption
> >
> > A minimal exemple would be something like:
> > const char a[] = BACKSLASH;
> > const char b[] = BACKSLASH;
> >
> > The expansion for 'a' will correctly produce the two-char string consisting
> > of a backslash char followed by a null char.
> > But then the expansion of 'b' will expand this once more,
> > producing the expansion of "\0": the two-char string: { '\0', '\0' }.
> 
> Are you sure about this behavior? You mean you see "b" has the string
> size as 2. I haven't understand how this can happen.

Using the show_data() / sparse -vdata on:
===
#define BACKSLASH "\\"
const char a[] = BACKSLASH;
===

gives, correctly:
===
symbol a:
	char const [addressable] [toplevel] b[0]
	bit_size = 16
	val = "\\"
=== 

But if the macro is used several times:
===
#define BACKSLASH "\\"
const char a[] = BACKSLASH;
const char b[] = BACKSLASH;
const char c[] = "<" BACKSLASH ">";
===

the, we get:
===
symbol a:
	char const [addressable] [toplevel] a[0]
	bit_size = 16
	val = "\0"
symbol b:
	char const [addressable] [toplevel] b[0]
	bit_size = 16
	val = "\0"
symbol c:
	char const [addressable] [toplevel] c[0]
	bit_size = 32
	val = "<\0>"
===

And even worse:
===
#define BACKSLASH "(\\)"
const char m[] = BACKSLASH;
const char n[] = BACKSLASH;
const char k[] = "<" BACKSLASH ">";
===

gives:
===
symbol m:
	char const [addressable] [toplevel] m[0]
	bit_size = 24
	val = "()"
symbol n:
	char const [addressable] [toplevel] n[0]
	bit_size = 24
	val = "()"
symbol k:
	char const [addressable] [toplevel] k[0]
	bit_size = 40
	val = "<()>"
===

> > The fix is to not reuse the storage for the string if any king of expansion
> > have been done.
> 
> That is a bit over kill. We only need to avoid reuse storage if the
> destination part of the string is come from a preprocessor macro.
> It is pretty common string contain escape sequence. We don't
> want to allocate extra memory copy if it is not part of a macro
> expansion.

Well yes ...
Is it only with macros that the string structure is so shared?
And have we a way to test if the string is coming from a macro?

 
> >
> > Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> > Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
> > ---
> >  char.c | 12 ++++++++----
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> >
> > diff --git a/char.c b/char.c
> > index 08ca2230..2e21bb77 100644
> > --- a/char.c
> > +++ b/char.c
> > @@ -123,11 +123,15 @@ struct token *get_string_constant(struct token *token, struct expression *expr)
> >                 len = MAX_STRING;
> >         }
> >
> > -       if (len >= string->length)      /* can't cannibalize */
> > +       /* The input string can be shared with other expression and so
> > +        * its storage can't be reused if any kind of expansion have been done on it.
> > +        */
> > +       if ((len != string->length) || memcmp(buffer, string->data, len)) {
> 
> I don' think this check take into account the preprocessor macro has
> been used or not. In other words, any general "hello world\n" which
> contain the escape character will produce a different buffer, there for,
> a new copy of the string. Which is not necessary. That is a pretty
> common case.

No, indeed, it does not.
It just allocate a new buffer every time there is any modification/expansion
so that the original one is not touched (in case it is used elsewhere).

> 
> I am working on patch to address it in the preprocessor macro.
> The idea is that just mark the string as immutable if it is part of the
> macro expansion. I will see how it goes.
> 
> Chris
> --

A simpler and safer way would be to directly do the string expansion just after
a string token is recognized, or even better in the lexer itself.
So the string buffer, macro or not, will always directly contain the right values.
But maybe there was good reasons to not do it this way.

Luc
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christopher Li Feb. 4, 2015, 8:01 a.m. UTC | #3
On Tue, Feb 3, 2015 at 10:22 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>> Are you sure about this behavior? You mean you see "b" has the string
>> size as 2. I haven't understand how this can happen.
>
>
> But if the macro is used several times:
> ===
> #define BACKSLASH "\\"
> const char a[] = BACKSLASH;
> const char b[] = BACKSLASH;
> const char c[] = "<" BACKSLASH ">";
> ===
>
> the, we get:
> ===
> symbol a:
>         char const [addressable] [toplevel] a[0]
>         bit_size = 16
>         val = "\0"
> symbol b:
>         char const [addressable] [toplevel] b[0]
>         bit_size = 16

The value buffer is corrupted. But the bit_size is still 16, which
is correct. I just think that in your example it shouldn't corrupt
the size. Your test case seems confirm that.

> Is it only with macros that the string structure is so shared?

That is right. I haven't see it can happen any other way.
The tokenizer always construct new token and string structure
from the C source file.

It is the preprocessor using macro expand which copy and duplicate
the token list. The token has a pointer point to the string which
is shared across different invocation of macro.

> And have we a way to test if the string is coming from a macro?

Not right now. But we can add it.

>
> A simpler and safer way would be to directly do the string expansion just after
> a string token is recognized, or even better in the lexer itself.
> So the string buffer, macro or not, will always directly contain the right values.
> But maybe there was good reasons to not do it this way.

I have an counter example that will not work. Let say

#define b(a, d) a##d
wchar_t s[] = b(L, "\xabcdabc");

When the lexer process the escape char, you did not know the string
is wide char or not. That can be changed after the macro expansion.

Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luc Van Oostenryck Feb. 4, 2015, 11:38 p.m. UTC | #4
On Wed, Feb 04, 2015 at 12:01:39AM -0800, Christopher Li wrote:
> On Tue, Feb 3, 2015 at 10:22 PM, Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
> >> Are you sure about this behavior? You mean you see "b" has the string
> >> size as 2. I haven't understand how this can happen.
> >
> >
> > But if the macro is used several times:
> > ===
> > #define BACKSLASH "\\"
> > const char a[] = BACKSLASH;
> > const char b[] = BACKSLASH;
> > const char c[] = "<" BACKSLASH ">";
> > ===
> >
> > the, we get:
> > ===
> > symbol a:
> >         char const [addressable] [toplevel] a[0]
> >         bit_size = 16
> >         val = "\0"
> > symbol b:
> >         char const [addressable] [toplevel] b[0]
> >         bit_size = 16
> 
> The value buffer is corrupted. But the bit_size is still 16, which
> is correct. I just think that in your example it shouldn't corrupt
> the size. Your test case seems confirm that.
> 
> > Is it only with macros that the string structure is so shared?
> 
> That is right. I haven't see it can happen any other way.
> The tokenizer always construct new token and string structure
> from the C source file.
> 
> It is the preprocessor using macro expand which copy and duplicate
> the token list. The token has a pointer point to the string which
> is shared across different invocation of macro.

Fine.
I was affraid that there was other possibilities, like, for exemple,
if the identical string litterals are put in an hash table, like it is done
for identifiers.

> > And have we a way to test if the string is coming from a macro?
> 
> Not right now. But we can add it.
> 
> >
> > A simpler and safer way would be to directly do the string expansion just after
> > a string token is recognized, or even better in the lexer itself.
> > So the string buffer, macro or not, will always directly contain the right values.
> > But maybe there was good reasons to not do it this way.
> 
> I have an counter example that will not work. Let say
> 
> #define b(a, d) a##d
> wchar_t s[] = b(L, "\xabcdabc");
> 
> When the lexer process the escape char, you did not know the string
> is wide char or not. That can be changed after the macro expansion.
> 
> Chris

Yes, I see.

BTW, I've checked and there is a lot of problems with wide strings.
I'll send some test case later.


Regards,
Luc
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/char.c b/char.c
index 08ca2230..2e21bb77 100644
--- a/char.c
+++ b/char.c
@@ -123,11 +123,15 @@  struct token *get_string_constant(struct token *token, struct expression *expr)
 		len = MAX_STRING;
 	}
 
-	if (len >= string->length)	/* can't cannibalize */
+	/* The input string can be shared with other expression and so
+	 * its storage can't be reused if any kind of expansion have been done on it.
+	 */
+	if ((len != string->length) || memcmp(buffer, string->data, len)) {
 		string = __alloc_string(len+1);
-	string->length = len+1;
-	memcpy(string->data, buffer, len);
-	string->data[len] = '\0';
+		string->length = len+1;
+		memcpy(string->data, buffer, len);
+		string->data[len] = '\0';
+	}
 	expr->string = string;
 	expr->wide = is_wide;
 	return token;