Message ID | 49DB5202.4030601@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote: > Jeff Layton wrote: >> On Mon, 06 Apr 2009 22:33:09 +0530 >> Suresh Jayaraman <sjayaraman@suse.de> wrote: >> >>> Steve French wrote: >>>> I don't think that we should be using these size assumptions >>>> (multiples of UCS stringlen).   A new UCS helper function should be >>>> created that calculates how much memory would be needed for a >>>> converted string - and we need to use this before we do the malloc and >>>> string conversion.  In effect a strlen and strnlen function that takes >>>> a target code page argument.  For strings that will never be more than >>>> a hundred bytes this may not be needed, and we can use the length >>>> assumption, but since mallocs in kernel can be so expensive I would >>>> rather calculate the actual string length needed for the target. >>> Ah, ok. I thought of writing a little function based on >>> cifs_strncpy_to_host() and adding a comment like below: >>> >>> /* UniStrnlen() returns length in 16 bit Unicode  characters >>>  * (UCS-2) with base length of 2 bytes per character. An UTF-8 >>>  * character can be up to 8 bytes maximum, so we need to >>>  * allocate (len/2) * 4 bytes (or) (4 * len) bytes for the >>>  * UTF-8 string */ >>> >> >> I think you'll have to basically do the conversion twice. Walk the >> string once and convert each character determine its length and then >> discard it. Get the total and allocate that many bytes (plus the null > > Thanks for explaining. It seems adding a new UCS helper that computes > length in bytes like the below would be good enough and make use of it > to compute length for memory allocation. > >> termination), and do the conversion again into the buffer. > > Do we still need this conversion again? > > > diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h > index 14eb9a2..0396bdc 100644 > --- a/fs/cifs/cifs_unicode.h > +++ b/fs/cifs/cifs_unicode.h > @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen) >  } > >  /* > + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string > + */ > +static inline size_t > +UniStrnlenBytes(const unsigned char *str, int maxlen) > +{ > +    size_t nbytes = 0; > +    wchar_t *uni; > + > +    while (*str++) { > +        /* convert each char, find its length and add to nbytes */ > +        if (char2uni(str, maxlen, uni) > 0) > +            nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE); > +    } > +    return nbytes; > +} > + > +/* > > We would still be needing the version (UniStrnlen) that returns length > in characters also. > >> >> I'm not truly convinced this is really necessary though. You have to >> figure that kmalloc is a power-of-two allocator. If you kmalloc 17 >> bytes, you get 32 anyway. You'll probably end up using roughly the same >> amount of memory that you would have had you just estimated the size. Shaggy made the comment that the string length calculation probably won't matter (exact size vs. estimate) for most cases in cifs since small allocations off the slab are fairly fast and it doesn't hurt to overallocate by this amount. Although for the typical cases a Unicode string usually will shrink when converted to UTF-8 obviously we have to allow for the maximum size conversion. Except for long lived strings, for temporary Unicode strings conversions that start with a Unicode string length of 256 wchars long or shorter, probably is no point in calculating the string length since the slab allocation for the worst case target is fast enough. Obviously for path lengths though it can make a huge difference (\\server\share\directory1\directory2\directory3\ etc.) and we ought to calculate the exact length.
On Tue, 7 Apr 2009 09:59:46 -0500 Steve French <smfrench@gmail.com> wrote: > On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote: > > Jeff Layton wrote: > >> On Mon, 06 Apr 2009 22:33:09 +0530 > >> Suresh Jayaraman <sjayaraman@suse.de> wrote: > >> > >>> Steve French wrote: > >>>> I don't think that we should be using these size assumptions > >>>> (multiples of UCS stringlen).   A new UCS helper function should be > >>>> created that calculates how much memory would be needed for a > >>>> converted string - and we need to use this before we do the malloc and > >>>> string conversion.  In effect a strlen and strnlen function that takes > >>>> a target code page argument.  For strings that will never be more than > >>>> a hundred bytes this may not be needed, and we can use the length > >>>> assumption, but since mallocs in kernel can be so expensive I would > >>>> rather calculate the actual string length needed for the target. > >>> Ah, ok. I thought of writing a little function based on > >>> cifs_strncpy_to_host() and adding a comment like below: > >>> > >>> /* UniStrnlen() returns length in 16 bit Unicode  characters > >>>  * (UCS-2) with base length of 2 bytes per character. An UTF-8 > >>>  * character can be up to 8 bytes maximum, so we need to > >>>  * allocate (len/2) * 4 bytes (or) (4 * len) bytes for the > >>>  * UTF-8 string */ > >>> > >> > >> I think you'll have to basically do the conversion twice. Walk the > >> string once and convert each character determine its length and then > >> discard it. Get the total and allocate that many bytes (plus the null > > > > Thanks for explaining. It seems adding a new UCS helper that computes > > length in bytes like the below would be good enough and make use of it > > to compute length for memory allocation. > > > >> termination), and do the conversion again into the buffer. > > > > Do we still need this conversion again? > > > > > > diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h > > index 14eb9a2..0396bdc 100644 > > --- a/fs/cifs/cifs_unicode.h > > +++ b/fs/cifs/cifs_unicode.h > > @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen) > >  } > > > >  /* > > + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string > > + */ > > +static inline size_t > > +UniStrnlenBytes(const unsigned char *str, int maxlen) > > +{ > > +    size_t nbytes = 0; > > +    wchar_t *uni; > > + > > +    while (*str++) { > > +        /* convert each char, find its length and add to nbytes */ > > +        if (char2uni(str, maxlen, uni) > 0) > > +            nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE); > > +    } > > +    return nbytes; > > +} > > + > > +/* > > > > We would still be needing the version (UniStrnlen) that returns length > > in characters also. > > > >> > >> I'm not truly convinced this is really necessary though. You have to > >> figure that kmalloc is a power-of-two allocator. If you kmalloc 17 > >> bytes, you get 32 anyway. You'll probably end up using roughly the same > >> amount of memory that you would have had you just estimated the size. > > Shaggy made the comment that the string length calculation probably > won't matter (exact size vs. estimate) for most cases in cifs since > small allocations off the slab are fairly fast and it doesn't hurt to > overallocate by this amount. Although for the typical cases a > Unicode string usually will shrink when converted to UTF-8 obviously > we have to allow for the maximum size conversion. > > > Except for long lived strings, for temporary Unicode strings > conversions that start with a Unicode string length of 256 wchars long > or shorter, probably is no point in calculating the string length > since the slab allocation for the worst case target is fast enough. > Obviously for path lengths though it can make a huge difference > (\\server\share\directory1\directory2\directory3\ etc.) and we ought > to calculate the exact length. > The only reason not to calculate the entire string length before the allocation is to avoid the overhead of walking the string twice. The only place this is likely to make a significant difference is the readdir codepath (since that involves a lot of string conversion). That's also the place where we have to store significant numbers of pathnames. Therefore, I submit that there's little benefit in skipping the length calculation anywhere. The places where we can do it are too small and infrequent to matter. For simplicity's sake I suggest that we just have everything just calculate the exact string length and do the allocation that way. Trying to get clever is more likely to lead us back into problems where the buffer is too short.
On Tue, Apr 7, 2009 at 10:22 AM, Jeff Layton <jlayton@redhat.com> wrote: > On Tue, 7 Apr 2009 09:59:46 -0500 > Steve French <smfrench@gmail.com> wrote: > >> On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote: >> > Jeff Layton wrote: >> >> On Mon, 06 Apr 2009 22:33:09 +0530 >> >> Suresh Jayaraman <sjayaraman@suse.de> wrote: >> >> >> >>> Steve French wrote: >> >>>> I don't think that we should be using these size assumptions >> >>>> (multiples of UCS stringlen).   A new UCS helper function should be >> >>>> created that calculates how much memory would be needed for a >> >>>> converted string - and we need to use this before we do the malloc and >> >>>> string conversion.  In effect a strlen and strnlen function that takes >> >>>> a target code page argument.  For strings that will never be more than >> >>>> a hundred bytes this may not be needed, and we can use the length >> >>>> assumption, but since mallocs in kernel can be so expensive I would >> >>>> rather calculate the actual string length needed for the target. >> >>> Ah, ok. I thought of writing a little function based on >> >>> cifs_strncpy_to_host() and adding a comment like below: >> >>> >> >>> /* UniStrnlen() returns length in 16 bit Unicode  characters >> >>>  * (UCS-2) with base length of 2 bytes per character. An UTF-8 >> >>>  * character can be up to 8 bytes maximum, so we need to >> >>>  * allocate (len/2) * 4 bytes (or) (4 * len) bytes for the >> >>>  * UTF-8 string */ >> >>> >> >> >> >> I think you'll have to basically do the conversion twice. Walk the >> >> string once and convert each character determine its length and then >> >> discard it. Get the total and allocate that many bytes (plus the null >> > >> > Thanks for explaining. It seems adding a new UCS helper that computes >> > length in bytes like the below would be good enough and make use of it >> > to compute length for memory allocation. >> > >> >> termination), and do the conversion again into the buffer. >> > >> > Do we still need this conversion again? >> > >> > >> > diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h >> > index 14eb9a2..0396bdc 100644 >> > --- a/fs/cifs/cifs_unicode.h >> > +++ b/fs/cifs/cifs_unicode.h >> > @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen) >> >  } >> > >> >  /* >> > + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string >> > + */ >> > +static inline size_t >> > +UniStrnlenBytes(const unsigned char *str, int maxlen) >> > +{ >> > +    size_t nbytes = 0; >> > +    wchar_t *uni; >> > + >> > +    while (*str++) { >> > +        /* convert each char, find its length and add to nbytes */ >> > +        if (char2uni(str, maxlen, uni) > 0) >> > +            nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE); >> > +    } >> > +    return nbytes; >> > +} >> > + >> > +/* >> > >> > We would still be needing the version (UniStrnlen) that returns length >> > in characters also. >> > >> >> >> >> I'm not truly convinced this is really necessary though. You have to >> >> figure that kmalloc is a power-of-two allocator. If you kmalloc 17 >> >> bytes, you get 32 anyway. You'll probably end up using roughly the same >> >> amount of memory that you would have had you just estimated the size. >> >> Shaggy made the comment that the string length calculation probably >> won't matter (exact size vs. estimate) for most cases in cifs since >> small allocations off the slab are fairly fast and it doesn't hurt to >> overallocate by this amount.   Although for the typical cases a >> Unicode string usually will shrink when converted to UTF-8 obviously >> we have to allow for the maximum size conversion. >> >> >> Except for long lived strings, for temporary Unicode strings >> conversions that start with a Unicode string length of 256 wchars long >> or shorter, probably is no point in calculating the string length >> since the slab allocation for the worst case target is fast enough. >> Obviously for path lengths though it can make a huge difference >> (\\server\share\directory1\directory2\directory3\ etc.) and we ought >> to calculate the exact length. >> > > The only reason not to calculate the entire string length before the > allocation is to avoid the overhead of walking the string twice. The > only place this is likely to make a significant difference is the > readdir codepath (since that involves a lot of string conversion). > That's also the place where we have to store significant numbers of > pathnames. Therefore, I submit that there's little benefit in skipping > the length calculation anywhere. The places where we can do it are too > small and infrequent to matter. > > For simplicity's sake I suggest that we just have everything just > calculate the exact string length and do the allocation that way. > Trying to get clever is more likely to lead us back into problems where > the buffer is too short. You may be right that it also turns out to be simpler that way (not just that it conserves memory). We could make it simpler still by wrapping the common sequence of length calculation, kmalloc and string conversion in the same worker function.
On Tue, 07 Apr 2009 18:45:46 +0530 Suresh Jayaraman <sjayaraman@suse.de> wrote: > Do we still need this conversion again? > I know this isn't a "real" patch submission yet, but some comments below... > > diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h > index 14eb9a2..0396bdc 100644 > --- a/fs/cifs/cifs_unicode.h > +++ b/fs/cifs/cifs_unicode.h > @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen) > } > > /* > + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string > + */ > +static inline size_t > +UniStrnlenBytes(const unsigned char *str, int maxlen) > +{ > + size_t nbytes = 0; > + wchar_t *uni; ^^^^^ I think you need to allocate actual storage for the character here. > + > + while (*str++) { > + /* convert each char, find its length and add to nbytes */ > + if (char2uni(str, maxlen, uni) > 0) > + nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE); "uni" is a ptr to a wchar_t, but you're treating it as a string. There's no guarantee that it'll be null-terminated. I might be mistaken, but doesn't char2uni return the length of the converted character in bytes? Tallying up the return from those is probably the thing to do. > + } > + return nbytes; > +} > + > +/*
Jeff Layton wrote: > On Tue, 07 Apr 2009 18:45:46 +0530 > Suresh Jayaraman <sjayaraman@suse.de> wrote: > >> Do we still need this conversion again? >> > > I know this isn't a "real" patch submission yet, but some comments > below... Indeed, it's quick, hackish and intended just to know where we are headed.. I appreciate the review much, nevertheless. >> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h >> index 14eb9a2..0396bdc 100644 >> --- a/fs/cifs/cifs_unicode.h >> +++ b/fs/cifs/cifs_unicode.h >> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen) >> } >> >> /* >> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string >> + */ >> +static inline size_t >> +UniStrnlenBytes(const unsigned char *str, int maxlen) >> +{ >> + size_t nbytes = 0; >> + wchar_t *uni; > ^^^^^ > I think you need to allocate actual storage for the character here. yes. >> + >> + while (*str++) { >> + /* convert each char, find its length and add to nbytes */ >> + if (char2uni(str, maxlen, uni) > 0) >> + nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE); > > "uni" is a ptr to a wchar_t, but you're treating it as a string. > There's no guarantee that it'll be null-terminated. I might be > mistaken, but doesn't char2uni return the length of the converted > character in bytes? Tallying up the return from those is probably > the thing to do. sorry stupid bugs, it's not a patch that qualifies for submission/review.. char2uni returns length in bytes. I did look at others some fs/* code which sort of does this, didn't get it right. I have made a different patch based on the recent comments and it's on its way.. Thanks for the review and comments.
Jeff Layton wrote: > On Tue, 07 Apr 2009 18:45:46 +0530 > Suresh Jayaraman <sjayaraman@suse.de> wrote: > >> Do we still need this conversion again? >> > > I know this isn't a "real" patch submission yet, but some comments > below... > >> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h >> index 14eb9a2..0396bdc 100644 >> --- a/fs/cifs/cifs_unicode.h >> +++ b/fs/cifs/cifs_unicode.h >> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen) >> } >> >> /* >> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string >> + */ >> +static inline size_t >> +UniStrnlenBytes(const unsigned char *str, int maxlen) >> +{ >> + size_t nbytes = 0; >> + wchar_t *uni; > ^^^^^ > I think you need to allocate actual storage for the character here. > Oh, wait no storage is required I think. We reuse the pointer to wchar_t and it seems char2uni() returns 1 or -EINVAL and I was thinking strlen() on a wide char returns length in bytes? fs/nls/nls_base.c static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) { *uni = charset2uni[*rawstring]; if (*uni == 0x0000) return -EINVAL; return 1; } but, strnlen() may not work, though.. or am i looking at the wrong place.. >> + >> + while (*str++) { >> + /* convert each char, find its length and add to nbytes */ >> + if (char2uni(str, maxlen, uni) > 0) >> + nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE); > > "uni" is a ptr to a wchar_t, but you're treating it as a string. > There's no guarantee that it'll be null-terminated. I might be > mistaken, but doesn't char2uni return the length of the converted > character in bytes? Tallying up the return from those is probably > the thing to do. > >> + } >> + return nbytes; >> +} >> + >> +/* > >
Steve French wrote: > On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote: >> Jeff Layton wrote: >>> On Mon, 06 Apr 2009 22:33:09 +0530 >>> Suresh Jayaraman <sjayaraman@suse.de> wrote: >>> >>>> Steve French wrote: >>>>> I don't think that we should be using these size assumptions >>>>> (multiples of UCS stringlen). � �A new UCS helper function should be >>>>> created that calculates how much memory would be needed for a >>>>> converted string - and we need to use this before we do the malloc and >>>>> string conversion. �In effect a strlen and strnlen function that takes >>>>> a target code page argument. �For strings that will never be more than >>>>> a hundred bytes this may not be needed, and we can use the length >>>>> assumption, but since mallocs in kernel can be so expensive I would >>>>> rather calculate the actual string length needed for the target. >>>> Ah, ok. I thought of writing a little function based on >>>> cifs_strncpy_to_host() and adding a comment like below: >>>> >>>> /* UniStrnlen() returns length in 16 bit Unicode �characters >>>> �* (UCS-2) with base length of 2 bytes per character. An UTF-8 >>>> �* character can be up to 8 bytes maximum, so we need to >>>> �* allocate (len/2) * 4 bytes (or) (4 * len) bytes for the >>>> �* UTF-8 string */ >>>> >>> I think you'll have to basically do the conversion twice. Walk the >>> string once and convert each character determine its length and then >>> discard it. Get the total and allocate that many bytes (plus the null >> Thanks for explaining. It seems adding a new UCS helper that computes >> length in bytes like the below would be good enough and make use of it >> to compute length for memory allocation. >> >>> termination), and do the conversion again into the buffer. >> Do we still need this conversion again? >> >> >> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h >> index 14eb9a2..0396bdc 100644 >> --- a/fs/cifs/cifs_unicode.h >> +++ b/fs/cifs/cifs_unicode.h >> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen) >> �} >> >> �/* >> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string >> + */ >> +static inline size_t >> +UniStrnlenBytes(const unsigned char *str, int maxlen) >> +{ >> + � � � size_t nbytes = 0; >> + � � � wchar_t *uni; >> + >> + � � � while (*str++) { >> + � � � � � � � /* convert each char, find its length and add to nbytes */ >> + � � � � � � � if (char2uni(str, maxlen, uni) > 0) >> + � � � � � � � � � � � nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE); >> + � � � } >> + � � � return nbytes; >> +} >> + >> +/* >> >> We would still be needing the version (UniStrnlen) that returns length >> in characters also.* UTF-8 encoded UCS characters may be up to six bytes long, however the Unicode standard specifies no characters above 0x10ffff, so Unicode characters can only be up to four bytes long in UTF-8. >> >>> I'm not truly convinced this is really necessary though. You have to >>> figure that kmalloc is a power-of-two allocator. If you kmalloc 17 >>> bytes, you get 32 anyway. You'll probably end up using roughly the same >>> amount of memory that you would have had you just estimated the size. > > Shaggy made the comment that the string length calculation probably > won't matter (exact size vs. estimate) for most cases in cifs since > small allocations off the slab are fairly fast and it doesn't hurt to > overallocate by this amount. Although for the typical cases a > Unicode string usually will shrink when converted to UTF-8 obviously > we have to allow for the maximum size conversion. > > OTOH, felix-suse@fefe.de pointed me to utf-8 man page: * UTF-8 encoded UCS characters may be up to six bytes long, however the Unicode standard specifies no characters above 0x10ffff, so Unicode characters can only be up to four bytes long in UTF-8. Going by this, length * 2 (original code) might still be sufficient?
On Thu, 09 Apr 2009 19:59:13 +0530 Suresh Jayaraman <sjayaraman@suse.de> wrote: > Steve French wrote: > > On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote: > >> Jeff Layton wrote: > >>> On Mon, 06 Apr 2009 22:33:09 +0530 > >>> Suresh Jayaraman <sjayaraman@suse.de> wrote: > >>> > >>>> Steve French wrote: > >>>>> I don't think that we should be using these size assumptions > >>>>> (multiples of UCS stringlen). � �A new UCS helper function should be > >>>>> created that calculates how much memory would be needed for a > >>>>> converted string - and we need to use this before we do the malloc and > >>>>> string conversion. �In effect a strlen and strnlen function that takes > >>>>> a target code page argument. �For strings that will never be more than > >>>>> a hundred bytes this may not be needed, and we can use the length > >>>>> assumption, but since mallocs in kernel can be so expensive I would > >>>>> rather calculate the actual string length needed for the target. > >>>> Ah, ok. I thought of writing a little function based on > >>>> cifs_strncpy_to_host() and adding a comment like below: > >>>> > >>>> /* UniStrnlen() returns length in 16 bit Unicode �characters > >>>> �* (UCS-2) with base length of 2 bytes per character. An UTF-8 > >>>> �* character can be up to 8 bytes maximum, so we need to > >>>> �* allocate (len/2) * 4 bytes (or) (4 * len) bytes for the > >>>> �* UTF-8 string */ > >>>> > >>> I think you'll have to basically do the conversion twice. Walk the > >>> string once and convert each character determine its length and then > >>> discard it. Get the total and allocate that many bytes (plus the null > >> Thanks for explaining. It seems adding a new UCS helper that computes > >> length in bytes like the below would be good enough and make use of it > >> to compute length for memory allocation. > >> > >>> termination), and do the conversion again into the buffer. > >> Do we still need this conversion again? > >> > >> > >> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h > >> index 14eb9a2..0396bdc 100644 > >> --- a/fs/cifs/cifs_unicode.h > >> +++ b/fs/cifs/cifs_unicode.h > >> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen) > >> �} > >> > >> �/* > >> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string > >> + */ > >> +static inline size_t > >> +UniStrnlenBytes(const unsigned char *str, int maxlen) > >> +{ > >> + � � � size_t nbytes = 0; > >> + � � � wchar_t *uni; > >> + > >> + � � � while (*str++) { > >> + � � � � � � � /* convert each char, find its length and add to nbytes */ > >> + � � � � � � � if (char2uni(str, maxlen, uni) > 0) > >> + � � � � � � � � � � � nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE); > >> + � � � } > >> + � � � return nbytes; > >> +} > >> + > >> +/* > >> > >> We would still be needing the version (UniStrnlen) that returns length > >> in characters also.* UTF-8 encoded UCS characters may be up to six bytes long, however the > Unicode standard specifies no characters above 0x10ffff, so Unicode > characters can only be up to four bytes long in UTF-8. > >> > >>> I'm not truly convinced this is really necessary though. You have to > >>> figure that kmalloc is a power-of-two allocator. If you kmalloc 17 > >>> bytes, you get 32 anyway. You'll probably end up using roughly the same > >>> amount of memory that you would have had you just estimated the size. > > > > Shaggy made the comment that the string length calculation probably > > won't matter (exact size vs. estimate) for most cases in cifs since > > small allocations off the slab are fairly fast and it doesn't hurt to > > overallocate by this amount. Although for the typical cases a > > Unicode string usually will shrink when converted to UTF-8 obviously > > we have to allow for the maximum size conversion. > > > > > > OTOH, felix-suse@fefe.de pointed me to utf-8 man page: > > * UTF-8 encoded UCS characters may be up to six bytes long, however the > Unicode standard specifies no characters above 0x10ffff, so Unicode > characters can only be up to four bytes long in UTF-8. Don't they mean 3 bytes there? > > Going by this, length * 2 (original code) might still be sufficient? > I think the safest thing is still to just calculate the exact lengths of the buffer before allocating. It's hard to imagine that it'll have significant performance impact. If we later find that it does then we can look at optimizing those cases for speed instead of size, but at least at that point we're working with code that has the buffers sufficiently sized.
On Thu, 09 Apr 2009 19:49:53 +0530 Suresh Jayaraman <sjayaraman@suse.de> wrote: > Jeff Layton wrote: > > On Tue, 07 Apr 2009 18:45:46 +0530 > > Suresh Jayaraman <sjayaraman@suse.de> wrote: > > > >> Do we still need this conversion again? > >> > > > > I know this isn't a "real" patch submission yet, but some comments > > below... > > > >> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h > >> index 14eb9a2..0396bdc 100644 > >> --- a/fs/cifs/cifs_unicode.h > >> +++ b/fs/cifs/cifs_unicode.h > >> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen) > >> } > >> > >> /* > >> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string > >> + */ > >> +static inline size_t > >> +UniStrnlenBytes(const unsigned char *str, int maxlen) > >> +{ > >> + size_t nbytes = 0; > >> + wchar_t *uni; > > ^^^^^ > > I think you need to allocate actual storage for the character here. > > > > Oh, wait no storage is required I think. We reuse the pointer to wchar_t > and it seems char2uni() returns 1 or -EINVAL > and I was thinking strlen() on a wide char returns length in bytes? > > fs/nls/nls_base.c > > static int char2uni(const unsigned char *rawstring, int boundlen, > wchar_t *uni) > { > *uni = charset2uni[*rawstring]; ^^^ No, you do storage. You're passing a pointer to a wchar_t to char2uni(). char2uni() then assigns the target of that pointer a value. With your current code, that pointer is just junk from the stack when you pass it to char2uni. It could panic or cause random memory corruption in its current form. What you really want to do is change that declaration to: wchar_t uni; ...and then do: char2uni(str, maxlen, &uni)
On Thu, 2009-04-09 at 10:40 -0400, Jeff Layton wrote: > On Thu, 09 Apr 2009 19:59:13 +0530 > Suresh Jayaraman <sjayaraman@suse.de> wrote: > > > Steve French wrote: > > > On Tue, Apr 7, 2009 at 8:15 AM, Suresh Jayaraman <sjayaraman@suse.de> wrote: > > >> Jeff Layton wrote: > > >>> On Mon, 06 Apr 2009 22:33:09 +0530 > > >>> Suresh Jayaraman <sjayaraman@suse.de> wrote: > > >>> > > >>>> Steve French wrote: > > >>>>> I don't think that we should be using these size assumptions > > >>>>> (multiples of UCS stringlen). � �A new UCS helper function should be > > >>>>> created that calculates how much memory would be needed for a > > >>>>> converted string - and we need to use this before we do the malloc and > > >>>>> string conversion. �In effect a strlen and strnlen function that takes > > >>>>> a target code page argument. �For strings that will never be more than > > >>>>> a hundred bytes this may not be needed, and we can use the length > > >>>>> assumption, but since mallocs in kernel can be so expensive I would > > >>>>> rather calculate the actual string length needed for the target. > > >>>> Ah, ok. I thought of writing a little function based on > > >>>> cifs_strncpy_to_host() and adding a comment like below: > > >>>> > > >>>> /* UniStrnlen() returns length in 16 bit Unicode �characters > > >>>> �* (UCS-2) with base length of 2 bytes per character. An UTF-8 > > >>>> �* character can be up to 8 bytes maximum, so we need to > > >>>> �* allocate (len/2) * 4 bytes (or) (4 * len) bytes for the > > >>>> �* UTF-8 string */ > > >>>> > > >>> I think you'll have to basically do the conversion twice. Walk the > > >>> string once and convert each character determine its length and then > > >>> discard it. Get the total and allocate that many bytes (plus the null > > >> Thanks for explaining. It seems adding a new UCS helper that computes > > >> length in bytes like the below would be good enough and make use of it > > >> to compute length for memory allocation. > > >> > > >>> termination), and do the conversion again into the buffer. > > >> Do we still need this conversion again? > > >> > > >> > > >> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h > > >> index 14eb9a2..0396bdc 100644 > > >> --- a/fs/cifs/cifs_unicode.h > > >> +++ b/fs/cifs/cifs_unicode.h > > >> @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen) > > >> �} > > >> > > >> �/* > > >> + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string > > >> + */ > > >> +static inline size_t > > >> +UniStrnlenBytes(const unsigned char *str, int maxlen) > > >> +{ > > >> + � � � size_t nbytes = 0; > > >> + � � � wchar_t *uni; > > >> + > > >> + � � � while (*str++) { > > >> + � � � � � � � /* convert each char, find its length and add to nbytes */ > > >> + � � � � � � � if (char2uni(str, maxlen, uni) > 0) > > >> + � � � � � � � � � � � nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE); > > >> + � � � } > > >> + � � � return nbytes; > > >> +} > > >> + > > >> +/* > > >> > > >> We would still be needing the version (UniStrnlen) that returns length > > >> in characters also.* UTF-8 encoded UCS characters may be up to six bytes long, however the > > Unicode standard specifies no characters above 0x10ffff, so Unicode > > characters can only be up to four bytes long in UTF-8. > > >> > > >>> I'm not truly convinced this is really necessary though. You have to > > >>> figure that kmalloc is a power-of-two allocator. If you kmalloc 17 > > >>> bytes, you get 32 anyway. You'll probably end up using roughly the same > > >>> amount of memory that you would have had you just estimated the size. > > > > > > Shaggy made the comment that the string length calculation probably > > > won't matter (exact size vs. estimate) for most cases in cifs since > > > small allocations off the slab are fairly fast and it doesn't hurt to > > > overallocate by this amount. Although for the typical cases a > > > Unicode string usually will shrink when converted to UTF-8 obviously > > > we have to allow for the maximum size conversion. > > > > > > > > > > OTOH, felix-suse@fefe.de pointed me to utf-8 man page: > > > > * UTF-8 encoded UCS characters may be up to six bytes long, however the > > Unicode standard specifies no characters above 0x10ffff, so Unicode > > characters can only be up to four bytes long in UTF-8. > > Don't they mean 3 bytes there? Nope, Suresh is right, RFC 3629 restricted the Unicode range so that effectively the valid characters are all represented within 4 bytes. Char. number range | UTF-8 octet sequence (hexadecimal) | (binary) --------------------+--------------------------------------------- 0000 0000-0000 007F | 0xxxxxxx 0000 0080-0000 07FF | 110xxxxx 10xxxxxx 0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx 0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx So in theory 4 bytes are enough, I guess you then have to make sure the conversion routines used respect this limit and you have to think how to address cases where invalid sequences come in. Aside: I think Windows servers will return you "invalid" (for example, in their NFS server, MS uses some of the space to map characters that are invalid for windows in filenames, but valid over NFS, if your cifs client is used to access shares that are also exported via NFS on Windows, I believe the server may send you these characters.) as UTF-16 sequences, IIRC at least in some versions I think we have also seen that their file system layer did not validate UTF-16 sequences (because initially it was UCS2 where all values are "valid"). So in talking with windows servers you must think how to deal with these deviations from the standards (yeah, fun). Of course one way to deal with it is to just throw an error, and deem the sequence invalid. > > Going by this, length * 2 (original code) might still be sufficient? > > > > I think the safest thing is still to just calculate the exact lengths of > the buffer before allocating. It's hard to imagine that it'll have > significant performance impact. UTF-16 -> UTF-8 can be expensive in some cases, it really depends how critical is the code path that needs the conversion. It may make sense to allow the conversion routines to do the memory allocation on their own. > If we later find that it does then we can look at optimizing those cases > for speed instead of size, but at least at that point we're working > with code that has the buffers sufficiently sized. Make sense. Simo.
diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h index 14eb9a2..0396bdc 100644 --- a/fs/cifs/cifs_unicode.h +++ b/fs/cifs/cifs_unicode.h @@ -159,6 +159,23 @@ UniStrnlen(const wchar_t *ucs1, int maxlen) } /* + * UniStrnlenBytes: Return the length in bytes of a UTF-8 string + */ +static inline size_t +UniStrnlenBytes(const unsigned char *str, int maxlen) +{ + size_t nbytes = 0; + wchar_t *uni; + + while (*str++) { + /* convert each char, find its length and add to nbytes */ + if (char2uni(str, maxlen, uni) > 0) + nbytes += strnlen(uni, NLS_MAX_CHARSET_SIZE); + } + return nbytes; +} + +/* We would still be needing the version (UniStrnlen) that returns length in characters also.