Avoid reuse of string buffer when concatenating adjacent string literals

Message ID 20150131012339.GA3460@macpro.local (mailing list archive)
State Superseded, archived

Commit Message

Luc Van Oostenryck Jan. 31, 2015, 1:23 a.m. UTC
In get_string_constant(), the code tries to reuse the storage of the string,
but only if the expanded string is not bigger than its unexpanded form.
But this fails when the string constant is a sequence of adjacent string literals
(each possibly shared, used elsewhere, used on its own or in another order).
A minimal example would be something like this:

#define P "\001"
const char a[] = P "a";
const char b[] = P "b";

The expansion for 'a' produces a string that is smaller than
the unexpanded "\001" (2 characters instead of 4).
Because the storage is reused, all further occurrences of "\001"
(probably only those from the same 'origin', here the macro P) are then replaced by "\001a".

The fix is thus to not reuse the storage for the string if it consists of
several adjacent literals.

Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
---
 char.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
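
The failure described in the commit message can be modelled outside sparse. The sketch below is not sparse's code: struct lit, decode(), concat() and demo() are hypothetical names, the decoder only understands octal escapes, and the guard flag stands in for the "parts > 1" check that the patch adds.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a shared string token; not sparse's real type. */
struct lit {
	size_t length;   /* bytes stored in data, excluding the trailing NUL */
	char data[16];
};

/* Decode octal escapes such as \001 from s into out; returns bytes written.
 * Other escapes are not handled in this sketch. */
static size_t decode(const struct lit *s, char *out)
{
	size_t i = 0, n = 0;

	while (i < s->length) {
		char c = s->data[i++];
		if (c == '\\') {
			int v = 0, digits = 0;
			while (i < s->length && digits < 3 &&
			       s->data[i] >= '0' && s->data[i] <= '7') {
				v = v * 8 + (s->data[i++] - '0');
				digits++;
			}
			c = (char)v;
		}
		out[n++] = c;
	}
	return n;
}

/*
 * Concatenate the decoded parts into out. When the result is shorter than
 * the first part's stored text, write it back into that part's storage,
 * unless guard is set and there is more than one part.
 */
static void concat(struct lit **parts, int nparts, int guard, char *out)
{
	size_t len = 0;
	int i;

	assert(nparts > 0);
	for (i = 0; i < nparts; i++)
		len += decode(parts[i], out + len);
	out[len] = '\0';

	if (len < parts[0]->length && !(guard && nparts > 1)) {
		memcpy(parts[0]->data, out, len + 1);
		parts[0]->length = len;
	}
}

/* Builds a[] = P "a", then b[] = P "b", with P's storage shared by both.
 * On return, out holds the text computed for b. */
void demo(int guard, char *out)
{
	struct lit p = { 4, "\\001" };   /* raw source text of macro P */
	struct lit a = { 1, "a" };
	struct lit b = { 1, "b" };
	struct lit *first[]  = { &p, &a };
	struct lit *second[] = { &p, &b };

	concat(first, 2, guard, out);
	concat(second, 2, guard, out);
}
```

In this model, demo(0, out) leaves "\001ab" in out because P's storage was overwritten while building a, whereas demo(1, out) leaves the intended "\001b".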

Comments

Rasmus Villemoes Feb. 3, 2015, 10:38 p.m. UTC | #1
On Sat, Jan 31 2015, Luc Van Oostenryck <luc.vanoostenryck@gmail.com> wrote:

> In get_string_constant(), the code tried to reuse the storage for the string
> but only if the expansion of the string was not bigger than its unexpanded form.
> But this fail when the string constant is a sequence of adjacent string litterals
> (each being possibly shared, used elsewhere, isolated or in another order).
> The minimal exemple would be something like this:
>
> #define P "\001"
> const char a[] = P "a";
> const char b[] = P "b";
>
> The expansion for 'a' will produce a string which is smaller than
> the unexpanded "\001" (2 instead of 4).
> By trying to reuse the storage, all further occurrence of "\001"
> (probably only from the same 'origin', here the macro P) will then be replaced by "\001a".
>
> The fix is thus to not try to reuse the storage for the string if it consit of
> several adjacent litterals.
>

Thanks, but there's still something wrong. Using your show-data feature
on this:

===
#define BACKSLASH "\\"
#define LETTER_t "t"

static const char s1[] = BACKSLASH;
/* static const char s2[] = BACKSLASH; */
static const char s3[] = BACKSLASH LETTER_t;
static const char s4[] = "a" BACKSLASH LETTER_t "b";
===

I get

symbol s1:
        char static const [toplevel] s1[0]
        bit_size = 16
        val = "\\"
symbol s3:
        char static const [toplevel] s3[0]
        bit_size = 24
        val = "\0t"
symbol s4:
        char static const [toplevel] s4[0]
        bit_size = 40
        val = "a\0tb"

Now if I do the same with s2 not commented out, I get


symbol s1:
        char static const [toplevel] s1[0]
        bit_size = 16
        val = "\0"
symbol s2:
        char static const [toplevel] s2[0]
        bit_size = 16
        val = "\0"
symbol s3:
        char static const [toplevel] s3[0]
        bit_size = 24
        val = "\0t"
symbol s4:
        char static const [toplevel] s4[0]
        bit_size = 40
        val = "a\0tb"

So the expansion of BACKSLASH changes depending on how often it is
expanded...

The LETTER_t thing above is because I thought I had somehow provoked a
double expansion, making BACKSLASH LETTER_t (or some variant) expand to
a single-character string containing just a tab. But I can't seem to
reproduce that particular behaviour, so maybe I'm imagining
stuff. Anyway, the above is certainly real.

Thanks,
Rasmus
--
To unsubscribe from this list: send the line "unsubscribe linux-sparse" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Luc Van Oostenryck Feb. 4, 2015, 12:32 a.m. UTC | #2
On Tue, Feb 03, 2015 at 11:38:02PM +0100, Rasmus Villemoes wrote:
> On Sat, Jan 31 2015, Luc Van Oostenryck <luc.vanoostenryck@gmail.com> wrote:
> 
> > In get_string_constant(), the code tried to reuse the storage for the string
> > but only if the expansion of the string was not bigger than its unexpanded form.
> > But this fail when the string constant is a sequence of adjacent string litterals
> > (each being possibly shared, used elsewhere, isolated or in another order).
> > The minimal exemple would be something like this:
> >
> > #define P "\001"
> > const char a[] = P "a";
> > const char b[] = P "b";
> >
> > The expansion for 'a' will produce a string which is smaller than
> > the unexpanded "\001" (2 instead of 4).
> > By trying to reuse the storage, all further occurrence of "\001"
> > (probably only from the same 'origin', here the macro P) will then be replaced by "\001a".
> >
> > The fix is thus to not try to reuse the storage for the string if it consit of
> > several adjacent litterals.
> >
> 
> Thanks, but there's still something wrong. Using your show-data feature
> on this:
> 
> ===
> #define BACKSLASH "\\"
> #define LETTER_t "t"
> 
> static const char s1[] = BACKSLASH;
> /* static const char s2[] = BACKSLASH; */
> static const char s3[] = BACKSLASH LETTER_t;
> static const char s4[] = "a" BACKSLASH LETTER_t "b";
> ===
> 
> I get
> 
> symbol s1:
>         char static const [toplevel] s1[0]
>         bit_size = 16
>         val = "\\"
> symbol s3:
>         char static const [toplevel] s3[0]
>         bit_size = 24
>         val = "\0t"
> symbol s4:
>         char static const [toplevel] s4[0]
>         bit_size = 40
>         val = "a\0tb"
> 
> Now if I do the same with s2 not commented out, I get
> 
> 
> symbol s1:
>         char static const [toplevel] s1[0]
>         bit_size = 16
>         val = "\0"
> symbol s2:
>         char static const [toplevel] s2[0]
>         bit_size = 16
>         val = "\0"
> symbol s3:
>         char static const [toplevel] s3[0]
>         bit_size = 24
>         val = "\0t"
> symbol s4:
>         char static const [toplevel] s4[0]
>         bit_size = 40
>         val = "a\0tb"
> 
> So the expansion of BACKSLASH changes depending on how often it is
> expanded...
> 
> The LETTER_t thing above is because I thought I had somehow provoked a
> double expansion, making BACKSLASH LETTER_t (or some variant) expand to
> a single-character string containing just a tab. But I can't seem to
> reproduce that particular behaviour, so maybe I'm imagining
> stuff. Anyway, the above is certainly real.
> 
> Thanks,
> Rasmus
> --
Yes, I see.

Thinking about it now, it's obvious that the string buffer can't be reused at all
if any kind of expansion is done on it; the concatenation of adjacent strings
just makes things worse but is not the cause.

I'll post an updated patch later.


Regards,
Luc
Christopher Li Feb. 4, 2015, 3:26 a.m. UTC | #3
On Tue, Feb 3, 2015 at 4:32 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> Now thinking about it, it's obvious that the string buffer can't be reused at all
> if there is any kind of expansion done on it, the adjacent strings concatenation
> make just the thing worse but are not the cause of it.
>
Right. That is what I thought after reading your patch too.
String concatenation is not a good indicator of macro expansion.
The fix should be based on macro expansion.

Though I haven't constructed a test case like Rasmus did.

Chris
Rasmus Villemoes Feb. 4, 2015, 8:39 a.m. UTC | #4
On Wed, Feb 04 2015, Luc Van Oostenryck <luc.vanoostenryck@gmail.com> wrote:

> On Tue, Feb 03, 2015 at 11:38:02PM +0100, Rasmus Villemoes wrote:
>> 
>> Thanks, but there's still something wrong. Using your show-data feature
>> on this:
>> 
>> ===
>> #define BACKSLASH "\\"
>> #define LETTER_t "t"
>> 
>> static const char s1[] = BACKSLASH;
>> /* static const char s2[] = BACKSLASH; */
>> static const char s3[] = BACKSLASH LETTER_t;
>> static const char s4[] = "a" BACKSLASH LETTER_t "b";
>> ===
>> 
>> I get
>> 
>> symbol s1:
>>         char static const [toplevel] s1[0]
>>         bit_size = 16
>>         val = "\\"
>> symbol s3:
>>         char static const [toplevel] s3[0]
>>         bit_size = 24
>>         val = "\0t"
>> symbol s4:
>>         char static const [toplevel] s4[0]
>>         bit_size = 40
>>         val = "a\0tb"
>> 
>> Now if I do the same with s2 not commented out, I get
>> 
>> 
>> symbol s1:
>>         char static const [toplevel] s1[0]
>>         bit_size = 16
>>         val = "\0"
>> symbol s2:
>>         char static const [toplevel] s2[0]
>>         bit_size = 16
>>         val = "\0"
>> symbol s3:
>>         char static const [toplevel] s3[0]
>>         bit_size = 24
>>         val = "\0t"
>> symbol s4:
>>         char static const [toplevel] s4[0]
>>         bit_size = 40
>>         val = "a\0tb"
>> 
>> So the expansion of BACKSLASH changes depending on how often it is
>> expanded...
>> 
>> The LETTER_t thing above is because I thought I had somehow provoked a
>> double expansion, making BACKSLASH LETTER_t (or some variant) expand to
>> a single-character string containing just a tab. But I can't seem to
>> reproduce that particular behaviour, so maybe I'm imagining
>> stuff. Anyway, the above is certainly real.
>> 
>> Thanks,
>> Rasmus
>> --
> Yes, I see.
>
> Now thinking about it, it's obvious that the string buffer can't be reused at all
> if there is any kind of expansion done on it, the adjacent strings concatenation
> make just the thing worse but are not the cause of it.
>

That was also my conclusion from looking at the code, but I was unable
to do anything about it. And I wasn't hallucinating, I was just
overcomplicating things:

#define NOT_TAB "\\t"

static const char s1[] = NOT_TAB;
static const char s2[] = NOT_TAB;

indeed fails.

Thanks,
Rasmus
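
One plausible reading of why the NOT_TAB case fails, consistent with the double expansion suspected earlier in the thread, is that the first use stores the decoded text back into the macro's shared storage, so the second use decodes it again. The sketch below models only that idea: struct lit, expand_in_place() and demo_not_tab() are hypothetical names, and only the \\ and \t escapes are handled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the storage shared by every use of a macro. */
struct lit {
	size_t length;   /* bytes in data, excluding the trailing NUL */
	char data[8];
};

/* Decode a literal (only \\ and \t are understood here) into out, then
 * store the decoded form back in place, as the reuse logic would. */
static void expand_in_place(struct lit *s, char *out)
{
	size_t i = 0, n = 0;

	while (i < s->length) {
		char c = s->data[i++];
		if (c == '\\' && i < s->length) {
			char e = s->data[i++];
			c = (e == 't') ? '\t' : e;
		}
		out[n++] = c;
	}
	out[n] = '\0';

	assert(n <= s->length);          /* decoding never grows the text */
	memcpy(s->data, out, n + 1);
	s->length = n;
}

/* Expands NOT_TAB twice from the same storage, as s1 and s2 would. */
void demo_not_tab(char *s1, char *s2)
{
	struct lit not_tab = { 3, "\\\\t" };  /* source text: \\t */

	expand_in_place(&not_tab, s1);
	expand_in_place(&not_tab, s2);
}
```

In this model s1 receives a backslash followed by 't', while s2 receives a single tab character. Only one literal is involved, so a check on the number of adjacent parts would not catch it.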
Rasmus Villemoes Feb. 4, 2015, 8:58 a.m. UTC | #5
On Wed, Feb 04 2015, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:

> And I wasn't hallucinating, I was just overcomplicating things:
>
> #define NOT_TAB "\\t"
>
> static const char s1[] = NOT_TAB;
> static const char s2[] = NOT_TAB;
>
> indeed fails.

While we're collecting examples, let me also mention that __FILE__
doesn't work for files with a backslash in their name. Sane people of
course don't put backslashes in file names, but they are a rather normal
occurrence in path names on a certain operating system.

Rasmus

Christopher Li Feb. 4, 2015, 4:20 p.m. UTC | #6
On Wed, Feb 4, 2015 at 12:58 AM, Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
> On Wed, Feb 04 2015, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:
>
>> And I wasn't hallucinating, I was just overcomplicating things:
>>
>> #define NOT_TAB "\\t"
>>
>> static const char s1[] = NOT_TAB;
>> static const char s2[] = NOT_TAB;
>>
>> indeed fails.
>
> While we're collecting examples, let me also mention that __FILE__
> doesn't work for files with backslash in their name. Sane people of
> course don't put backslashes in file names, but they are a rather normal
> occurence in path names on a certain operating system.

Can you submit a patch adding the test cases you found?
I will include them in the test suite.

Thanks

Chris
Rasmus Villemoes Feb. 6, 2015, 9:52 p.m. UTC | #7
On Wed, Feb 04 2015, Christopher Li <sparse@chrisli.org> wrote:

> On Wed, Feb 4, 2015 at 12:58 AM, Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:
>> On Wed, Feb 04 2015, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:
>>
>>> And I wasn't hallucinating, I was just overcomplicating things:
>>>
>>> #define NOT_TAB "\\t"
>>>
>>> static const char s1[] = NOT_TAB;
>>> static const char s2[] = NOT_TAB;
>>>
>>> indeed fails.
>>
>> While we're collecting examples, let me also mention that __FILE__
>> doesn't work for files with backslash in their name. Sane people of
>> course don't put backslashes in file names, but they are a rather normal
>> occurence in path names on a certain operating system.
>
> Can you submit a patch for adding the test case you found?
> I will include those into the the test suit.

I'd like to, but I'm not sure how to write the test in terms of sparse's
test framework. How do I check that the string is as expected?

Rasmus
Christopher Li Feb. 7, 2015, 1:30 a.m. UTC | #8
On Fri, Feb 6, 2015 at 1:52 PM, Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
>> Can you submit a patch for adding the test case you found?
>> I will include those into the the test suit.
>
> I'd like to, but I'm not sure how to write the test in terms of sparse's
> test frame work. How do I check that the string is as expected?

You can use ./test-parsing to evaluate the test file.
$ cat /tmp/v.c
#define BACKSLASH "\\"
static const char a[] = BACKSLASH "a";
static const char b[] = BACKSLASH "b";

# on current master branch it shows:
$ ./test-parsing /tmp/v.c
.align 1
char static const [toplevel] a[0]
 =
    movi.64        v2,&"\7b"
    ld.24        v3,[v2]
,
.align 1
char static const [toplevel] b[0]
 =
    movi.64        v5,&"\7b"
    ld.24        v6,[v5]


# on the review-immutable-string branch it shows:
$ ./test-parsing /tmp/v.c

.align 1
char static const [toplevel] a[0]
 =
    movi.64        v2,&"\\a"
    ld.24        v3,[v2]
,
.align 1
char static const [toplevel] b[0]
 =
    movi.64        v5,&"\\b"
    ld.24        v6,[v5]


Notice that "\\a" and "\\b" were not there before the
bug was fixed.

The test suite is just a set of C files in the "validation" directory.
Each file has a test name and, optionally, how to invoke the test on the C file
and what output to expect from the run.
You should be able to find some examples in the test suite.
Or you can just submit a patch adding those C files
to the "validation" directory and let someone help you complete
the test cases.

You run the test suite with:

make check
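
A test file for the cases collected in this thread might look like the sketch below. It is hedged throughout: the test name is made up, the check-name and check-command tags are written from memory of the test-suite documentation and should be checked against it, and the expected output (which that documentation describes delimiting with check-output-start and check-output-end lines) is left out rather than guessed.

```c
/* Hypothetical validation file; the test name and command are assumptions. */
#define BACKSLASH "\\"
#define NOT_TAB "\\t"

/* The same macro-provided literal is used several times on purpose. */
static const char a[] = BACKSLASH "a";
static const char b[] = BACKSLASH "b";
static const char s1[] = NOT_TAB;
static const char s2[] = NOT_TAB;

/*
 * check-name: string literals from macros used more than once
 * check-command: test-parsing $file
 */
```

With a correct expansion, a and b are each 3 bytes long, and s1 and s2 both hold a backslash followed by 't'.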


Chris
Damien Lespiau Feb. 9, 2015, 9:48 p.m. UTC | #9
On 7 February 2015 at 01:30, Christopher Li <sparse@chrisli.org> wrote:
> On Fri, Feb 6, 2015 at 1:52 PM, Rasmus Villemoes
> <linux@rasmusvillemoes.dk> wrote:
>>> Can you submit a patch for adding the test case you found?
>>> I will include those into the the test suit.
>>
>> I'd like to, but I'm not sure how to write the test in terms of sparse's
>> test frame work. How do I check that the string is as expected?
>
> You can use ./test-parsing to evaluate the test file.

There's even a bit of documentation!

http://git.kernel.org/cgit/devel/sparse/sparse.git/tree/Documentation/test-suite

Patch

diff --git a/char.c b/char.c
index 08ca2230..ce1a0700 100644
--- a/char.c
+++ b/char.c
@@ -93,6 +93,7 @@  struct token *get_string_constant(struct token *token, struct expression *expr)
 	static char buffer[MAX_STRING];
 	int len = 0;
 	int bits;
+	int parts = 0;
 
 	while (!done) {
 		switch (token_type(next)) {
@@ -117,13 +118,14 @@  struct token *get_string_constant(struct token *token, struct expression *expr)
 			len++;
 		}
 		token = token->next;
+		parts++;
 	}
 	if (len > MAX_STRING) {
 		warning(token->pos, "trying to concatenate %d-character string (%d bytes max)", len, MAX_STRING);
 		len = MAX_STRING;
 	}
 
-	if (len >= string->length)	/* can't cannibalize */
+	if (len >= string->length || parts > 1)	/* can't reuse the string buffer */
 		string = __alloc_string(len+1);
 	string->length = len+1;
 	memcpy(string->data, buffer, len);