[linux-cifs-client] cifs: Fix insufficient memory allocation for nativeFileSystem field

Message ID 49DB5202.4030601@suse.de
State New, archived

Commit Message

Suresh Jayaraman April 7, 2009, 1:15 p.m. UTC
Jeff Layton wrote:
> On Mon, 06 Apr 2009 22:33:09 +0530
> Suresh Jayaraman <sjayaraman@suse.de> wrote:
> 
>> Steve French wrote:
>>> I don't think that we should be using these size assumptions
>>> (multiples of UCS stringlen).    A new UCS helper function should be
>>> created that calculates how much memory would be needed for a
>>> converted string - and we need to use this before we do the malloc and
>>> string conversion.  In effect a strlen and strnlen function that takes
>>> a target code page argument.  For strings that will never be more than
>>> a hundred bytes this may not be needed, and we can use the length
>>> assumption, but since mallocs in kernel can be so expensive I would
>>> rather calculate the actual string length needed for the target.
>> Ah, ok. I thought of writing a little function based on
>> cifs_strncpy_to_host() and adding a comment like below:
>>
>> /* UniStrnlen() returns length in 16 bit Unicode  characters
>>  * (UCS-2) with base length of 2 bytes per character. An UTF-8
>>  * character can be up to 8 bytes maximum, so we need to
>>  * allocate (len/2) * 4 bytes (or) (4 * len) bytes for the
>>  * UTF-8 string */
>>
> 
> I think you'll have to basically do the conversion twice. Walk the
> string once and convert each character determine its length and then
> discard it. Get the total and allocate that many bytes (plus the null

Thanks for explaining. It seems adding a new UCS helper that computes
length in bytes like the below would be good enough and make use of it
to compute length for memory allocation.

> termination), and do the conversion again into the buffer.

Do we still need this conversion again?



> 
> I'm not truly convinced this is really necessary though. You have to
> figure that kmalloc is a power-of-two allocator. If you kmalloc 17
> bytes, you get 32 anyway. You'll probably end up using roughly the same
> amount of memory that you would have had you just estimated the size.
> 

Yes, it seems for most cases we would end up with roughly the same
amount of memory.

Both approaches sound plausible to me. If we have a consensus on
one of these two, I can convert the code in the audited places and resend
the patch.

Thanks,

Comments

Steve French April 7, 2009, 2:59 p.m. UTC | #1
On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote:
> Jeff Layton wrote:
>> On Mon, 06 Apr 2009 22:33:09 +0530
>> Suresh Jayaraman <sjayaraman@suse.de> wrote:
>>
>>> Steve French wrote:
>>>> I don't think that we should be using these size assumptions
>>>> (multiples of UCS stringlen).    A new UCS helper function should be
>>>> created that calculates how much memory would be needed for a
>>>> converted string - and we need to use this before we do the malloc and
>>>> string conversion.  In effect a strlen and strnlen function that takes
>>>> a target code page argument.  For strings that will never be more than
>>>> a hundred bytes this may not be needed, and we can use the length
>>>> assumption, but since mallocs in kernel can be so expensive I would
>>>> rather calculate the actual string length needed for the target.
>>> Ah, ok. I thought of writing a little function based on
>>> cifs_strncpy_to_host() and adding a comment like below:
>>>
>>> /* UniStrnlen() returns length in 16 bit Unicode  characters
>>>  * (UCS-2) with base length of 2 bytes per character. An UTF-8
>>>  * character can be up to 8 bytes maximum, so we need to
>>>  * allocate (len/2) * 4 bytes (or) (4 * len) bytes for the
>>>  * UTF-8 string */
>>>
>>
>> I think you'll have to basically do the conversion twice. Walk the
>> string once and convert each character determine its length and then
>> discard it. Get the total and allocate that many bytes (plus the null
>
> Thanks for explaining. It seems adding a new UCS helper that computes
> length in bytes like the below would be good enough and make use of it
> to compute length for memory allocation.
>
>> termination), and do the conversion again into the buffer.
>
> Do we still need this conversion again?
>
>
> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
> index 14eb9a2..0396bdc 100644
> --- a/fs/cifs/cifs_unicode.h
> +++ b/fs/cifs/cifs_unicode.h
> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
>  }
>
>  /*
> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
> + */
> +static inline size_t
> +UniStrnlenBytes(const unsigned char *str, int maxlen)
> +{
> +       size_t nbytes = 0;
> +       wchar_t *uni;
> +
> +       while (*str++) {
> +               /* convert each char, find its length and add to nbytes */
> +               if (char2uni(str, maxlen, uni) > 0)
> +                       nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE);
> +       }
> +       return nbytes;
> +}
> +
> +/*
>
> We would still be needing the version (UniStrnlen) that returns length
> in characters also.
>
>>
>> I'm not truly convinced this is really necessary though. You have to
>> figure that kmalloc is a power-of-two allocator. If you kmalloc 17
>> bytes, you get 32 anyway. You'll probably end up using roughly the same
>> amount of memory that you would have had you just estimated the size.

Shaggy made the comment that the string length calculation probably
won't matter (exact size vs. estimate) for most cases in cifs since
small allocations off the slab are fairly fast and it doesn't hurt to
overallocate by this amount.    Although for the typical cases a
Unicode string usually will shrink when converted to UTF-8 obviously
we have to allow for the maximum size conversion.


Except for long-lived strings, for temporary Unicode string
conversions that start with a Unicode string 256 wchars long
or shorter, there is probably no point in calculating the string length,
since the slab allocation for the worst-case target is fast enough.
Obviously for path lengths, though, it can make a huge difference
(\\server\share\directory1\directory2\directory3\ etc.) and we ought
to calculate the exact length.
Jeff Layton April 7, 2009, 3:22 p.m. UTC | #2
On Tue, 7 Apr 2009 09:59:46 -0500
Steve French <smfrench@gmail.com> wrote:

> On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote:
> > Jeff Layton wrote:
> >> On Mon, 06 Apr 2009 22:33:09 +0530
> >> Suresh Jayaraman <sjayaraman@suse.de> wrote:
> >>
> >>> Steve French wrote:
> >>>> I don't think that we should be using these size assumptions
> >>>> (multiples of UCS stringlen).    A new UCS helper function should be
> >>>> created that calculates how much memory would be needed for a
> >>>> converted string - and we need to use this before we do the malloc and
> >>>> string conversion.  In effect a strlen and strnlen function that takes
> >>>> a target code page argument.  For strings that will never be more than
> >>>> a hundred bytes this may not be needed, and we can use the length
> >>>> assumption, but since mallocs in kernel can be so expensive I would
> >>>> rather calculate the actual string length needed for the target.
> >>> Ah, ok. I thought of writing a little function based on
> >>> cifs_strncpy_to_host() and adding a comment like below:
> >>>
> >>> /* UniStrnlen() returns length in 16 bit Unicode  characters
> >>>  * (UCS-2) with base length of 2 bytes per character. An UTF-8
> >>>  * character can be up to 8 bytes maximum, so we need to
> >>>  * allocate (len/2) * 4 bytes (or) (4 * len) bytes for the
> >>>  * UTF-8 string */
> >>>
> >>
> >> I think you'll have to basically do the conversion twice. Walk the
> >> string once and convert each character determine its length and then
> >> discard it. Get the total and allocate that many bytes (plus the null
> >
> > Thanks for explaining. It seems adding a new UCS helper that computes
> > length in bytes like the below would be good enough and make use of it
> > to compute length for memory allocation.
> >
> >> termination), and do the conversion again into the buffer.
> >
> > Do we still need this conversion again?
> >
> >
> > diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
> > index 14eb9a2..0396bdc 100644
> > --- a/fs/cifs/cifs_unicode.h
> > +++ b/fs/cifs/cifs_unicode.h
> > @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
> >  }
> >
> >  /*
> > + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
> > + */
> > +static inline size_t
> > +UniStrnlenBytes(const unsigned char *str, int maxlen)
> > +{
> > +       size_t nbytes = 0;
> > +       wchar_t *uni;
> > +
> > +       while (*str++) {
> > +               /* convert each char, find its length and add to nbytes */
> > +               if (char2uni(str, maxlen, uni) > 0)
> > +                       nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE);
> > +       }
> > +       return nbytes;
> > +}
> > +
> > +/*
> >
> > We would still be needing the version (UniStrnlen) that returns length
> > in characters also.
> >
> >>
> >> I'm not truly convinced this is really necessary though. You have to
> >> figure that kmalloc is a power-of-two allocator. If you kmalloc 17
> >> bytes, you get 32 anyway. You'll probably end up using roughly the same
> >> amount of memory that you would have had you just estimated the size.
> 
> Shaggy made the comment that the string length calculation probably
> won't matter (exact size vs. estimate) for most cases in cifs since
> small allocations off the slab are fairly fast and it doesn't hurt to
> overallocate by this amount.    Although for the typical cases a
> Unicode string usually will shrink when converted to UTF-8 obviously
> we have to allow for the maximum size conversion.
> 
> 
> Except for long lived strings, for temporary Unicode strings
> conversions that start with a Unicode string length of 256 wchars long
> or shorter, probably is no point in calculating the string length
> since the slab allocation for the worst case target is fast enough.
> Obviously for path lengths though it can make a huge difference
> (\\server\share\directory1\directory2\directory3\ etc.) and we ought
> to calculate the exact length.
> 

The only reason not to calculate the entire string length before the
allocation is to avoid the overhead of walking the string twice. The
only place this is likely to make a significant difference is the
readdir codepath (since that involves a lot of string conversion).
That's also the place where we have to store significant numbers of
pathnames. Therefore, I submit that there's little benefit in skipping
the length calculation anywhere. The places where we can do it are too
small and infrequent to matter.

For simplicity's sake I suggest that we just have everything just
calculate the exact string length and do the allocation that way.
Trying to get clever is more likely to lead us back into problems where
the buffer is too short.
Steve French April 7, 2009, 3:33 p.m. UTC | #3
On Tue, Apr 7, 2009 at 10:22 AM, Jeff Layton <jlayton@redhat.com> wrote:
> On Tue, 7 Apr 2009 09:59:46 -0500
> Steve French <smfrench@gmail.com> wrote:
>
>> On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote:
>> > Jeff Layton wrote:
>> >> On Mon, 06 Apr 2009 22:33:09 +0530
>> >> Suresh Jayaraman <sjayaraman@suse.de> wrote:
>> >>
>> >>> Steve French wrote:
>> >>>> I don't think that we should be using these size assumptions
>> >>>> (multiples of UCS stringlen).    A new UCS helper function should be
>> >>>> created that calculates how much memory would be needed for a
>> >>>> converted string - and we need to use this before we do the malloc and
>> >>>> string conversion.  In effect a strlen and strnlen function that takes
>> >>>> a target code page argument.  For strings that will never be more than
>> >>>> a hundred bytes this may not be needed, and we can use the length
>> >>>> assumption, but since mallocs in kernel can be so expensive I would
>> >>>> rather calculate the actual string length needed for the target.
>> >>> Ah, ok. I thought of writing a little function based on
>> >>> cifs_strncpy_to_host() and adding a comment like below:
>> >>>
>> >>> /* UniStrnlen() returns length in 16 bit Unicode  characters
>> >>>  * (UCS-2) with base length of 2 bytes per character. An UTF-8
>> >>>  * character can be up to 8 bytes maximum, so we need to
>> >>>  * allocate (len/2) * 4 bytes (or) (4 * len) bytes for the
>> >>>  * UTF-8 string */
>> >>>
>> >>
>> >> I think you'll have to basically do the conversion twice. Walk the
>> >> string once and convert each character determine its length and then
>> >> discard it. Get the total and allocate that many bytes (plus the null
>> >
>> > Thanks for explaining. It seems adding a new UCS helper that computes
>> > length in bytes like the below would be good enough and make use of it
>> > to compute length for memory allocation.
>> >
>> >> termination), and do the conversion again into the buffer.
>> >
>> > Do we still need this conversion again?
>> >
>> >
>> > diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
>> > index 14eb9a2..0396bdc 100644
>> > --- a/fs/cifs/cifs_unicode.h
>> > +++ b/fs/cifs/cifs_unicode.h
>> > @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
>> >  }
>> >
>> >  /*
>> > + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
>> > + */
>> > +static inline size_t
>> > +UniStrnlenBytes(const unsigned char *str, int maxlen)
>> > +{
>> > +       size_t nbytes = 0;
>> > +       wchar_t *uni;
>> > +
>> > +       while (*str++) {
>> > +               /* convert each char, find its length and add to nbytes */
>> > +               if (char2uni(str, maxlen, uni) > 0)
>> > +                       nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE);
>> > +       }
>> > +       return nbytes;
>> > +}
>> > +
>> > +/*
>> >
>> > We would still be needing the version (UniStrnlen) that returns length
>> > in characters also.
>> >
>> >>
>> >> I'm not truly convinced this is really necessary though. You have to
>> >> figure that kmalloc is a power-of-two allocator. If you kmalloc 17
>> >> bytes, you get 32 anyway. You'll probably end up using roughly the same
>> >> amount of memory that you would have had you just estimated the size.
>>
>> Shaggy made the comment that the string length calculation probably
>> won't matter (exact size vs. estimate) for most cases in cifs since
>> small allocations off the slab are fairly fast and it doesn't hurt to
>> overallocate by this amount.    Although for the typical cases a
>> Unicode string usually will shrink when converted to UTF-8 obviously
>> we have to allow for the maximum size conversion.
>>
>>
>> Except for long lived strings, for temporary Unicode strings
>> conversions that start with a Unicode string length of 256 wchars long
>> or shorter, probably is no point in calculating the string length
>> since the slab allocation for the worst case target is fast enough.
>> Obviously for path lengths though it can make a huge difference
>> (\\server\share\directory1\directory2\directory3\ etc.) and we ought
>> to calculate the exact length.
>>
>
> The only reason not to calculate the entire string length before the
> allocation is to avoid the overhead of walking the string twice. The
> only place this is likely to make a significant difference is the
> readdir codepath (since that involves a lot of string conversion).
> That's also the place where we have to store significant numbers of
> pathnames. Therefore, I submit that there's little benefit in skipping
> the length calculation anywhere. The places where we can do it are too
> small and infrequent to matter.
>
> For simplicity's sake I suggest that we just have everything just
> calculate the exact string length and do the allocation that way.
> Trying to get clever is more likely to lead us back into problems where
> the buffer is too short.

You may be right that it also turns out to be simpler that way (not just
that it conserves memory).  We could make it simpler still by
wrapping the common sequence of length calculation, kmalloc and
string conversion in the same worker function.
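
[Editor's note: the worker function Steve describes — one helper that measures, allocates, then converts — could look roughly like the userspace sketch below. All names here are hypothetical; an in-kernel version would use kmalloc and the nls-table conversion callbacks rather than malloc and this hand-rolled UCS-2-to-UTF-8 encoder.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-character converter: writes the UTF-8 encoding of one
 * UCS-2 code unit into out (if out is non-NULL) and returns its length in
 * bytes.  UCS-2 covers only the BMP, so at most 3 bytes are ever needed. */
static int ucs2_char_to_utf8(unsigned short uc, unsigned char *out)
{
	if (uc < 0x80) {
		if (out)
			out[0] = (unsigned char)uc;
		return 1;
	} else if (uc < 0x800) {
		if (out) {
			out[0] = 0xc0 | (uc >> 6);
			out[1] = 0x80 | (uc & 0x3f);
		}
		return 2;
	}
	if (out) {
		out[0] = 0xe0 | (uc >> 12);
		out[1] = 0x80 | ((uc >> 6) & 0x3f);
		out[2] = 0x80 | (uc & 0x3f);
	}
	return 3;
}

/* The worker: wrap length calculation, allocation, and conversion in one
 * function so callers cannot get the buffer size wrong. */
static char *ucs2_to_utf8_dup(const unsigned short *ucs, size_t maxlen)
{
	size_t i, nbytes = 0;
	unsigned char *p;
	char *buf;

	/* first pass: compute the exact output length */
	for (i = 0; i < maxlen && ucs[i]; i++)
		nbytes += ucs2_char_to_utf8(ucs[i], NULL);

	buf = malloc(nbytes + 1);	/* +1 for the NUL terminator */
	if (!buf)
		return NULL;

	/* second pass: the actual conversion into the buffer */
	p = (unsigned char *)buf;
	for (i = 0; i < maxlen && ucs[i]; i++)
		p += ucs2_char_to_utf8(ucs[i], p);
	*p = '\0';
	return buf;
}
```

Walking the string twice costs a little CPU, but as discussed above the buffer is then exactly sized and the overflow class of bug goes away.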
Jeff Layton April 9, 2009, 11:55 a.m. UTC | #4
On Tue, 07 Apr 2009 18:45:46 +0530
Suresh Jayaraman <sjayaraman@suse.de> wrote:

> Do we still need this conversion again?
> 

I know this isn't a "real" patch submission yet, but some comments
below...

> 
> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
> index 14eb9a2..0396bdc 100644
> --- a/fs/cifs/cifs_unicode.h
> +++ b/fs/cifs/cifs_unicode.h
> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
>  }
> 
>  /*
> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
> + */
> +static inline size_t
> +UniStrnlenBytes(const unsigned char *str, int maxlen)
> +{
> +       size_t nbytes = 0;
> +       wchar_t *uni;
		^^^^^
I think you need to allocate actual storage for the character here.

> +
> +       while (*str++) {
> +               /* convert each char, find its length and add to nbytes */
> +               if (char2uni(str, maxlen, uni) > 0)
> +                       nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE);

"uni" is a ptr to a wchar_t, but you're treating it as a string.
There's no guarantee that it'll be null-terminated. I might be
mistaken, but doesn't char2uni return the length of the converted
character in bytes? Tallying up the return from those is probably
the thing to do.

> +       }
> +       return nbytes;
> +}
> +
> +/*
Suresh Jayaraman April 9, 2009, 12:16 p.m. UTC | #5
Jeff Layton wrote:
> On Tue, 07 Apr 2009 18:45:46 +0530
> Suresh Jayaraman <sjayaraman@suse.de> wrote:
> 
>> Do we still need this conversion again?
>>
> 
> I know this isn't a "real" patch submission yet, but some comments
> below...

Indeed, it's quick and hackish, intended just to show where we are headed.
I appreciate the review very much, nevertheless.

>> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
>> index 14eb9a2..0396bdc 100644
>> --- a/fs/cifs/cifs_unicode.h
>> +++ b/fs/cifs/cifs_unicode.h
>> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
>>  }
>>
>>  /*
>> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
>> + */
>> +static inline size_t
>> +UniStrnlenBytes(const unsigned char *str, int maxlen)
>> +{
>> +       size_t nbytes = 0;
>> +       wchar_t *uni;
> 		^^^^^
> I think you need to allocate actual storage for the character here.

yes.

>> +
>> +       while (*str++) {
>> +               /* convert each char, find its length and add to nbytes */
>> +               if (char2uni(str, maxlen, uni) > 0)
>> +                       nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE);
> 
> "uni" is a ptr to a wchar_t, but you're treating it as a string.
> There's no guarantee that it'll be null-terminated. I might be
> mistaken, but doesn't char2uni return the length of the converted
> character in bytes? Tallying up the return from those is probably
> the thing to do.

Sorry, stupid bugs; it's not a patch that qualifies for submission/review.
char2uni returns the length in bytes. I did look at some other fs/* code
which sort of does this, but didn't get it right.

I have made a different patch based on the recent comments and it's on
its way.

Thanks for the review and comments.
Suresh Jayaraman April 9, 2009, 2:19 p.m. UTC | #6
Jeff Layton wrote:
> On Tue, 07 Apr 2009 18:45:46 +0530
> Suresh Jayaraman <sjayaraman@suse.de> wrote:
> 
>> Do we still need this conversion again?
>>
> 
> I know this isn't a "real" patch submission yet, but some comments
> below...
> 
>> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
>> index 14eb9a2..0396bdc 100644
>> --- a/fs/cifs/cifs_unicode.h
>> +++ b/fs/cifs/cifs_unicode.h
>> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
>>  }
>>
>>  /*
>> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
>> + */
>> +static inline size_t
>> +UniStrnlenBytes(const unsigned char *str, int maxlen)
>> +{
>> +       size_t nbytes = 0;
>> +       wchar_t *uni;
> 		^^^^^
> I think you need to allocate actual storage for the character here.
> 

Oh wait, no storage is required, I think. We reuse the pointer to wchar_t,
and it seems char2uni() returns 1 or -EINVAL.
And I was thinking strlen() on a wide char returns the length in bytes?

fs/nls/nls_base.c

static int char2uni(const unsigned char *rawstring, int boundlen,
wchar_t *uni)
{
        *uni = charset2uni[*rawstring];
        if (*uni == 0x0000)
                return -EINVAL;
        return 1;
}

but strnlen() may not work, though...

or am I looking at the wrong place?


>> +
>> +       while (*str++) {
>> +               /* convert each char, find its length and add to nbytes */
>> +               if (char2uni(str, maxlen, uni) > 0)
>> +                       nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE);
> 
> "uni" is a ptr to a wchar_t, but you're treating it as a string.
> There's no guarantee that it'll be null-terminated. I might be
> mistaken, but doesn't char2uni return the length of the converted
> character in bytes? Tallying up the return from those is probably
> the thing to do.
> 
>> +       }
>> +       return nbytes;
>> +}
>> +
>> +/*
> 
>
Suresh Jayaraman April 9, 2009, 2:29 p.m. UTC | #7
Steve French wrote:
> On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote:
>> Jeff Layton wrote:
>>> On Mon, 06 Apr 2009 22:33:09 +0530
>>> Suresh Jayaraman <sjayaraman@suse.de> wrote:
>>>
>>>> Steve French wrote:
>>>>> I don't think that we should be using these size assumptions
>>>>> (multiples of UCS stringlen).    A new UCS helper function should be
>>>>> created that calculates how much memory would be needed for a
>>>>> converted string - and we need to use this before we do the malloc and
>>>>> string conversion.  In effect a strlen and strnlen function that takes
>>>>> a target code page argument.  For strings that will never be more than
>>>>> a hundred bytes this may not be needed, and we can use the length
>>>>> assumption, but since mallocs in kernel can be so expensive I would
>>>>> rather calculate the actual string length needed for the target.
>>>> Ah, ok. I thought of writing a little function based on
>>>> cifs_strncpy_to_host() and adding a comment like below:
>>>>
>>>> /* UniStrnlen() returns length in 16 bit Unicode  characters
>>>>  * (UCS-2) with base length of 2 bytes per character. An UTF-8
>>>>  * character can be up to 8 bytes maximum, so we need to
>>>>  * allocate (len/2) * 4 bytes (or) (4 * len) bytes for the
>>>>  * UTF-8 string */
>>>>
>>> I think you'll have to basically do the conversion twice. Walk the
>>> string once and convert each character determine its length and then
>>> discard it. Get the total and allocate that many bytes (plus the null
>> Thanks for explaining. It seems adding a new UCS helper that computes
>> length in bytes like the below would be good enough and make use of it
>> to compute length for memory allocation.
>>
>>> termination), and do the conversion again into the buffer.
>> Do we still need this conversion again?
>>
>>
>> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
>> index 14eb9a2..0396bdc 100644
>> --- a/fs/cifs/cifs_unicode.h
>> +++ b/fs/cifs/cifs_unicode.h
>> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
>>  }
>>
>>  /*
>> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
>> + */
>> +static inline size_t
>> +UniStrnlenBytes(const unsigned char *str, int maxlen)
>> +{
>> +       size_t nbytes = 0;
>> +       wchar_t *uni;
>> +
>> +       while (*str++) {
>> +               /* convert each char, find its length and add to nbytes */
>> +               if (char2uni(str, maxlen, uni) > 0)
>> +                       nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE);
>> +       }
>> +       return nbytes;
>> +}
>> +
>> +/*
>>
>> We would still be needing the version (UniStrnlen) that returns length
>> in characters also.
>>
>>> I'm not truly convinced this is really necessary though. You have to
>>> figure that kmalloc is a power-of-two allocator. If you kmalloc 17
>>> bytes, you get 32 anyway. You'll probably end up using roughly the same
>>> amount of memory that you would have had you just estimated the size.
> 
> Shaggy made the comment that the string length calculation probably
> won't matter (exact size vs. estimate) for most cases in cifs since
> small allocations off the slab are fairly fast and it doesn't hurt to
> overallocate by this amount.    Although for the typical cases a
> Unicode string usually will shrink when converted to UTF-8 obviously
> we have to allow for the maximum size conversion.
> 
> 

OTOH, felix-suse@fefe.de pointed me to utf-8 man page:

* UTF-8 encoded UCS characters may be up to six bytes long, however the
Unicode standard specifies no characters above 0x10ffff, so Unicode
characters can only be up to four bytes long in UTF-8.

Going by this, length * 2 (original code) might still be sufficient?
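
[Editor's note: the bound being debated here can be checked directly. UCS-2 holds only Basic Multilingual Plane code points (up to U+FFFF), and those encode to at most 3 UTF-8 bytes; 4-byte sequences begin at U+10000, which plain UCS-2 cannot represent. So for an n-unit UCS-2 string the tight worst case is 3*n output bytes (plus NUL) — 2*n is not always enough. A minimal illustration, with a hypothetical helper name:]

```c
#include <assert.h>

/* Worst-case UTF-8 size in bytes for one UCS-2 code unit.  Everything a
 * UCS-2 wchar can hold is in the BMP (<= U+FFFF), which needs at most
 * 3 bytes in UTF-8. */
static int utf8_max_bytes(unsigned short uc)
{
	if (uc < 0x80)
		return 1;	/* ASCII */
	if (uc < 0x800)
		return 2;	/* two-byte sequence */
	return 3;		/* rest of the BMP */
}
```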
Jeff Layton April 9, 2009, 2:40 p.m. UTC | #8
On Thu, 09 Apr 2009 19:59:13 +0530
Suresh Jayaraman <sjayaraman@suse.de> wrote:

> Steve French wrote:
> > On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote:
> >> Jeff Layton wrote:
> >>> On Mon, 06 Apr 2009 22:33:09 +0530
> >>> Suresh Jayaraman <sjayaraman@suse.de> wrote:
> >>>
> >>>> Steve French wrote:
> >>>>> I don't think that we should be using these size assumptions
> >>>>> (multiples of UCS stringlen).    A new UCS helper function should be
> >>>>> created that calculates how much memory would be needed for a
> >>>>> converted string - and we need to use this before we do the malloc and
> >>>>> string conversion.  In effect a strlen and strnlen function that takes
> >>>>> a target code page argument.  For strings that will never be more than
> >>>>> a hundred bytes this may not be needed, and we can use the length
> >>>>> assumption, but since mallocs in kernel can be so expensive I would
> >>>>> rather calculate the actual string length needed for the target.
> >>>> Ah, ok. I thought of writing a little function based on
> >>>> cifs_strncpy_to_host() and adding a comment like below:
> >>>>
> >>>> /* UniStrnlen() returns length in 16 bit Unicode  characters
> >>>>  * (UCS-2) with base length of 2 bytes per character. An UTF-8
> >>>>  * character can be up to 8 bytes maximum, so we need to
> >>>>  * allocate (len/2) * 4 bytes (or) (4 * len) bytes for the
> >>>>  * UTF-8 string */
> >>>>
> >>> I think you'll have to basically do the conversion twice. Walk the
> >>> string once and convert each character determine its length and then
> >>> discard it. Get the total and allocate that many bytes (plus the null
> >> Thanks for explaining. It seems adding a new UCS helper that computes
> >> length in bytes like the below would be good enough and make use of it
> >> to compute length for memory allocation.
> >>
> >>> termination), and do the conversion again into the buffer.
> >> Do we still need this conversion again?
> >>
> >>
> >> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
> >> index 14eb9a2..0396bdc 100644
> >> --- a/fs/cifs/cifs_unicode.h
> >> +++ b/fs/cifs/cifs_unicode.h
> >> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
> >>  }
> >>
> >>  /*
> >> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
> >> + */
> >> +static inline size_t
> >> +UniStrnlenBytes(const unsigned char *str, int maxlen)
> >> +{
> >> +       size_t nbytes = 0;
> >> +       wchar_t *uni;
> >> +
> >> +       while (*str++) {
> >> +               /* convert each char, find its length and add to nbytes */
> >> +               if (char2uni(str, maxlen, uni) > 0)
> >> +                       nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE);
> >> +       }
> >> +       return nbytes;
> >> +}
> >> +
> >> +/*
> >>
> >> We would still be needing the version (UniStrnlen) that returns length
> >> in characters also.
> >>
> >>> I'm not truly convinced this is really necessary though. You have to
> >>> figure that kmalloc is a power-of-two allocator. If you kmalloc 17
> >>> bytes, you get 32 anyway. You'll probably end up using roughly the same
> >>> amount of memory that you would have had you just estimated the size.
> > 
> > Shaggy made the comment that the string length calculation probably
> > won't matter (exact size vs. estimate) for most cases in cifs since
> > small allocations off the slab are fairly fast and it doesn't hurt to
> > overallocate by this amount.    Although for the typical cases a
> > Unicode string usually will shrink when converted to UTF-8 obviously
> > we have to allow for the maximum size conversion.
> > 
> > 
> 
> OTOH, felix-suse@fefe.de pointed me to utf-8 man page:
> 
> * UTF-8 encoded UCS characters may be up to six bytes long, however the
> Unicode standard specifies no characters above 0x10ffff, so Unicode
> characters can only be up to four bytes long in UTF-8.

Don't they mean 3 bytes there?

> 
> Going by this, length * 2 (original code) might still be sufficient?
> 

I think the safest thing is still to just calculate the exact lengths of
the buffer before allocating. It's hard to imagine that it'll have
significant performance impact.

If we later find that it does then we can look at optimizing those cases
for speed instead of size, but at least at that point we're working
with code that has the buffers sufficiently sized.
Jeff Layton April 9, 2009, 2:46 p.m. UTC | #9
On Thu, 09 Apr 2009 19:49:53 +0530
Suresh Jayaraman <sjayaraman@suse.de> wrote:

> Jeff Layton wrote:
> > On Tue, 07 Apr 2009 18:45:46 +0530
> > Suresh Jayaraman <sjayaraman@suse.de> wrote:
> > 
> >> Do we still need this conversion again?
> >>
> > 
> > I know this isn't a "real" patch submission yet, but some comments
> > below...
> > 
> >> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
> >> index 14eb9a2..0396bdc 100644
> >> --- a/fs/cifs/cifs_unicode.h
> >> +++ b/fs/cifs/cifs_unicode.h
> >> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
> >>  }
> >>
> >>  /*
> >> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
> >> + */
> >> +static inline size_t
> >> +UniStrnlenBytes(const unsigned char *str, int maxlen)
> >> +{
> >> +       size_t nbytes = 0;
> >> +       wchar_t *uni;
> > 		^^^^^
> > I think you need to allocate actual storage for the character here.
> > 
> 
> Oh, wait no storage is required I think. We reuse the pointer to wchar_t
> and it seems char2uni() returns 1 or -EINVAL
> and I was thinking strlen() on a wide char returns length in bytes?
> 
> fs/nls/nls_base.c
> 
> static int char2uni(const unsigned char *rawstring, int boundlen,
> wchar_t *uni)
> {
>         *uni = charset2uni[*rawstring];

^^^ No, you do need storage. You're passing a pointer to a wchar_t to
char2uni(). char2uni() then assigns the target of that pointer a value.
With your current code, that pointer is just junk from the stack when
you pass it to char2uni. It could panic or cause random memory
corruption in its current form.

What you really want to do is change that declaration to:

	wchar_t	uni;

...and then do:

	char2uni(str, maxlen, &uni)
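
[Editor's note: the storage fix Jeff describes could be sketched as below. This is a standalone illustration, not the kernel patch: char2uni_stub only mimics the shape of the nls-table callback (an identity mapping that returns the number of input bytes consumed, or a negative errno), and what exactly to tally for the UTF-8 target length is the separate question discussed elsewhere in this thread.]

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short wchar_t16;	/* stand-in for the kernel's 16-bit wchar_t */

/* Stub with the same shape as the nls char2uni op: converts one byte of
 * rawstring into *uni and returns the number of input bytes consumed
 * (1 for a single-byte charset), or a negative errno on failure. */
static int char2uni_stub(const unsigned char *rawstring, int boundlen,
			 wchar_t16 *uni)
{
	if (boundlen < 1)
		return -22;		/* -EINVAL */
	*uni = *rawstring;		/* identity mapping, for illustration */
	return 1;
}

/* Corrected shape of the helper: the converted character gets real storage
 * on the stack (wchar_t uni; pass &uni), and we tally the converter's
 * return values instead of calling strnlen() on a lone wchar_t. */
static size_t UniStrnlenBytes(const unsigned char *str, int maxlen)
{
	size_t nbytes = 0;
	wchar_t16 uni;			/* actual storage, not a junk pointer */
	int len;

	while (maxlen > 0 && *str) {
		len = char2uni_stub(str, maxlen, &uni);
		if (len <= 0)
			break;		/* stop on an unconvertible byte */
		nbytes += len;
		str += len;
		maxlen -= len;
	}
	return nbytes;
}
```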
simo April 9, 2009, 3:09 p.m. UTC | #10
On Thu, 2009-04-09 at 10:40 -0400, Jeff Layton wrote:
> On Thu, 09 Apr 2009 19:59:13 +0530
> Suresh Jayaraman <sjayaraman@suse.de> wrote:
> 
> > Steve French wrote:
> > > On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote:
> > >> Jeff Layton wrote:
> > >>> On Mon, 06 Apr 2009 22:33:09 +0530
> > >>> Suresh Jayaraman <sjayaraman@suse.de> wrote:
> > >>>
> > >>>> Steve French wrote:
> > >>>>> I don't think that we should be using these size assumptions
> > >>>>> (multiples of UCS stringlen).    A new UCS helper function should be
> > >>>>> created that calculates how much memory would be needed for a
> > >>>>> converted string - and we need to use this before we do the malloc and
> > >>>>> string conversion.  In effect a strlen and strnlen function that takes
> > >>>>> a target code page argument.  For strings that will never be more than
> > >>>>> a hundred bytes this may not be needed, and we can use the length
> > >>>>> assumption, but since mallocs in kernel can be so expensive I would
> > >>>>> rather calculate the actual string length needed for the target.
> > >>>> Ah, ok. I thought of writing a little function based on
> > >>>> cifs_strncpy_to_host() and adding a comment like below:
> > >>>>
> > >>>> /* UniStrnlen() returns length in 16 bit Unicode characters
> > >>>>  * (UCS-2) with base length of 2 bytes per character. An UTF-8
> > >>>>  * character can be up to 8 bytes maximum, so we need to
> > >>>>  * allocate (len/2) * 4 bytes (or) (4 * len) bytes for the
> > >>>>  * UTF-8 string */
> > >>>>
> > >>> I think you'll have to basically do the conversion twice. Walk the
> > >>> string once and convert each character determine its length and then
> > >>> discard it. Get the total and allocate that many bytes (plus the null
> > >> Thanks for explaining. It seems adding a new UCS helper that computes
> > >> length in bytes like the below would be good enough and make use of it
> > >> to compute length for memory allocation.
> > >>
> > >>> termination), and do the conversion again into the buffer.
> > >> Do we still need this conversion again?
> > >>
> > >>
> > >> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
> > >> index 14eb9a2..0396bdc 100644
> > >> --- a/fs/cifs/cifs_unicode.h
> > >> +++ b/fs/cifs/cifs_unicode.h
> > >> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
> > >>  }
> > >>
> > >>  /*
> > >> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
> > >> + */
> > >> +static inline size_t
> > >> +UniStrnlenBytes(const unsigned char *str, int maxlen)
> > >> +{
> > >> +       size_t nbytes = 0;
> > >> +       wchar_t *uni;
> > >> +
> > >> +       while (*str++) {
> > >> +               /* convert each char, find its length and add to nbytes */
> > >> +               if (char2uni(str, maxlen, uni) > 0)
> > >> +                       nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE);
> > >> +       }
> > >> +       return nbytes;
> > >> +}
> > >> +
> > >> +/*
> > >>
> > >> We would still be needing the version (UniStrnlen) that returns length
> > >> in characters also.
> > >>
> > >>> I'm not truly convinced this is really necessary though. You have to
> > >>> figure that kmalloc is a power-of-two allocator. If you kmalloc 17
> > >>> bytes, you get 32 anyway. You'll probably end up using roughly the same
> > >>> amount of memory that you would have had you just estimated the size.
> > > 
> > > Shaggy made the comment that the string length calculation probably
> > > won't matter (exact size vs. estimate) for most cases in cifs since
> > > small allocations off the slab are fairly fast and it doesn't hurt to
> > > overallocate by this amount.    Although for the typical cases a
> > > Unicode string usually will shrink when converted to UTF-8 obviously
> > > we have to allow for the maximum size conversion.
> > > 
> > > 
> > 
> > OTOH, felix-suse@fefe.de pointed me to utf-8 man page:
> > 
> > * UTF-8 encoded UCS characters may be up to six bytes long, however the
> > Unicode standard specifies no characters above 0x10ffff, so Unicode
> > characters can only be up to four bytes long in UTF-8.
> 
> Don't they mean 3 bytes there?


Nope, Suresh is right, RFC 3629 restricted the Unicode range so that
effectively the valid characters are all represented within 4 bytes.

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


So in theory 4 bytes are enough, I guess you then have to make sure the
conversion routines used respect this limit and you have to think how to
address cases where invalid sequences come in.
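For illustration, the RFC 3629 table above maps directly onto a small helper (hypothetical, not part of the patch) that returns the UTF-8 width of a single code point:

```c
#include <assert.h>
#include <stddef.h>

/* Number of UTF-8 bytes needed for one Unicode code point, per the
 * RFC 3629 table above.  Returns 0 for code points beyond U+10FFFF,
 * which RFC 3629 declares invalid.  (A stricter version would also
 * reject the surrogate range U+D800..U+DFFF.) */
static size_t utf8_bytes_for_codepoint(unsigned long cp)
{
	if (cp <= 0x7F)
		return 1;
	if (cp <= 0x7FF)
		return 2;
	if (cp <= 0xFFFF)
		return 3;
	if (cp <= 0x10FFFF)
		return 4;
	return 0;	/* invalid per RFC 3629 */
}
```

This is why 4 bytes per character is the worst case: any single UCS-2 (16-bit) value fits in at most 3 UTF-8 bytes, and only surrogate pairs (two 16-bit units) reach the 4-byte encodings.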

Aside:
I think Windows servers may return you "invalid" UTF-16 sequences. For
example, in their NFS server, MS uses part of the code space to map
characters that are invalid in Windows filenames but valid over NFS;
if your cifs client accesses shares that are also exported via NFS on
Windows, I believe the server may send you these characters. IIRC, at
least in some versions, we have also seen that their file system layer
did not validate UTF-16 sequences (because initially it was UCS-2,
where all values are "valid").

So in talking with windows servers you must think how to deal with these
deviations from the standards (yeah, fun). Of course one way to deal
with it is to just throw an error, and deem the sequence invalid.


> > Going by this, length * 2 (original code) might still be sufficient?
> > 
> 
> I think the safest thing is still to just calculate the exact lengths of
> the buffer before allocating. It's hard to imagine that it'll have
> significant performance impact.

UTF-16 -> UTF-8 conversion can be expensive in some cases; it really
depends on how critical the code path that needs the conversion is.
It may make sense to allow the conversion routines to do the memory
allocation on their own.
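That measure-then-allocate-then-convert pattern might look like the following userspace sketch. The function name is made up, the conversion pass is a toy that only handles ASCII input, and kernel code would of course use kmalloc rather than malloc:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration: the conversion routine sizes and
 * allocates the output buffer itself, so callers never guess.
 * Pass 1 measures the exact UTF-8 length; pass 2 converts.
 * (Toy version: the convert pass only copies ASCII values.) */
static char *ucs2_to_utf8_dup(const unsigned short *src, size_t nchars)
{
	size_t i, nbytes = 0;
	char *out, *p;

	/* pass 1: measure the UTF-8 width of each UCS-2 value */
	for (i = 0; i < nchars && src[i]; i++)
		nbytes += (src[i] <= 0x7F) ? 1 :
			  (src[i] <= 0x7FF) ? 2 : 3;

	out = malloc(nbytes + 1);	/* kernel code: kmalloc(..., GFP_KERNEL) */
	if (!out)
		return NULL;

	/* pass 2: convert into the exactly-sized buffer */
	for (p = out, i = 0; i < nchars && src[i]; i++)
		*p++ = (char)src[i];	/* ASCII-only in this sketch */
	*p = '\0';
	return out;
}
```

The caller frees the returned buffer; the point is that the two passes live in one routine, so the sizing logic and the conversion logic can never drift apart.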

> If we later find that it does then we can look at optimizing those cases
> for speed instead of size, but at least at that point we're working
> with code that has the buffers sufficiently sized.

Makes sense.

Simo.

Patch
diff mbox

diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
index 14eb9a2..0396bdc 100644
--- a/fs/cifs/cifs_unicode.h
+++ b/fs/cifs/cifs_unicode.h
@@ -159,6 +159,23 @@  UniStrnlen(const wchar_t *ucs1, int maxlen)
 }

 /*
+ * UniStrnlenBytes: Return the length in bytes of a UTF-8 string
+ */
+static inline size_t
+UniStrnlenBytes(const unsigned char *str, int maxlen)
+{
+       size_t nbytes = 0;
+       wchar_t *uni;
+
+       while (*str++) {
+               /* convert each char, find its length and add to nbytes */
+               if (char2uni(str, maxlen, uni) > 0)
+                       nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE);
+       }
+       return nbytes;
+}
+
+/*

We would still be needing the version (UniStrnlen) that returns length
in characters also.