Message ID | 49EDBB06.3060409@suse.de (mailing list archive) |
---|---|
State | New, archived |
On Tuesday, 21 April 2009, Suresh Jayaraman wrote:
> Though the consensus is that we need a generalised helper to handle
> unicode string buffers so that other filesystems could consume, we would
> need a cifs helper like this in the interim, given the number of
> discussions/reviews and bug reports. cifs could easily replace this with
> generic helpers once such helper is in place.
>
> Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
> ---
>  fs/cifs/cifs_unicode.h |   27 +++++++++++++++++++++++++++
>  1 files changed, 27 insertions(+), 0 deletions(-)
>
> diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
> index 14eb9a2..8ab332b 100644
> --- a/fs/cifs/cifs_unicode.h
> +++ b/fs/cifs/cifs_unicode.h
> @@ -159,6 +159,33 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
>  }
>
>  /*
> + * UniStrnlenBytes: Return the length of a NLS string in bytes. Also, populates
> + * 'nchars' with the length of string in 16 bit Unicode chars.
> + */
> +static inline size_t
> +UniStrnlenBytes(const wchar_t *str, int maxlen, int *nchars,
> +		const struct nls_table *codepage)
> +{
> +	int nc;
> +	size_t nbytes = 0;
> +	char buf[NLS_MAX_CHARSET_SIZE]; /* enough for one char at a time */
> +
> +	*nchars = 0;
> +	while (*str++ && maxlen) {
> +		nc = codepage->uni2char(*str, buf, NLS_MAX_CHARSET_SIZE);
> +		if (nc > 0)
> +			nbytes += nc;
> +		else
> +			nbytes += 1; /* for '?' */
> +		(*nchars)++;
> +		if (*nchars >= maxlen)
> +			break;
> +	}
> +
> +	return nbytes;
> +}
> +
> +/*
>   * UniStrncat: Concatenate length limited string
>   */
>  static inline wchar_t *

During cleanup of the unicode-related code we should also have a look
at the functions currently located in misc.c:
- cifs_convertUCSpath()
- cifsConvertToUCS()

At the moment cifs_convertUCSpath() contains the following check:
...
	/* make sure we do not overrun callers allocated temp buffer */
	if (j >= (2 * NAME_MAX))
		break;
...
Probably both functions should be moved away from misc.c.

Cheers, Günter
On Wed, 22 Apr 2009 02:25:05 +0200
Günter Kukkukk <linux@kukkukk.com> wrote:

> during cleanup of the unicode related stuff we should also have a look
> at functions currently located in misc.c
> - cifs_convertUCSpath()
> - cifsConvertToUCS()
>
> At the moment cifs_convertUCSpath() contains the following check:
> ...
> 	/* make sure we do not overrun callers allocated temp buffer */
> 	if (j >= (2 * NAME_MAX))
> 		break;
> ...
> Probably both functions should be moved away from misc.c.

Agreed on all counts. For now, we're focused on the cases where we're
converting the data from the server to the local charset. After that's
resolved we need to fix the other direction as well.

It might make sense to consolidate all of this in a separate file
someplace as well.
On Wednesday, 22 April 2009, Jeff Layton wrote:
> Agreed on all counts. For now, we're focused on the cases where we're
> converting the data from the server to the local charset. After that's
> resolved we need to fix the other direction as well.
>
> It might make sense to consolidate all of this in a separate file
> someplace as well.

Hi Jeff,

the function cifs_convertUCSpath() _is_ related to
"...we're focused on the cases where we're converting the data from
the server to the local charset. ....".

And it's heavily used.

Cheers, Günter
On Thursday, 23 April 2009, Günter Kukkukk wrote:
> the function cifs_convertUCSpath() _is_ related to
> "...we're focused on the cases where we're converting the data from
> the server to the local charset. ....".
>
> And - it's heavily used.

Just some further notes.
With "it's heavily used" I didn't mean the number of callers using this
function (only 1, in readdir.c). I meant the number of times
cifs_convertUCSpath() is called in daily usage (readdir results).

The current focus was mostly on cifs_strfromUCS_le(), but the _same_ applies
to cifs_convertUCSpath()!

See the following code snippet:

readdir.c --> static int cifs_get_name_from_search_buf()
....

	if (unicode) {
		/* BB fixme - test with long names */
		/* Note converted filename can be longer than in unicode */
		if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR)
			pqst->len = cifs_convertUCSpath((char *)pqst->name,
					(__le16 *)filename, len/2, nlt);
		else
			pqst->len = cifs_strfromUCS_le((char *)pqst->name,
					(__le16 *)filename, len/2, nlt);

....

Cheers, Günter
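The comment "converted filename can be longer than in unicode" is the heart of the concern: neither call in the snippet tells the converter how large the destination is. The following is only a user-space sketch of a conversion that takes the destination length as an argument. The names convert_bounded() and to_local() are hypothetical and not part of the cifs code; to_local() merely stands in for an NLS uni2char callback and always emits UTF-8. Byte order (__le16) is ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an NLS uni2char callback: writes the UTF-8
 * form of a 16-bit value into out (at least 3 bytes) and returns the
 * number of bytes written. Surrogates are not treated specially. */
static int to_local(uint16_t c, char *out)
{
	if (c < 0x80) {
		out[0] = (char)c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (char)(0xC0 | (c >> 6));
		out[1] = (char)(0x80 | (c & 0x3F));
		return 2;
	}
	out[0] = (char)(0xE0 | (c >> 12));
	out[1] = (char)(0x80 | ((c >> 6) & 0x3F));
	out[2] = (char)(0x80 | (c & 0x3F));
	return 3;
}

/* Convert up to srclen 16-bit units into dst, never writing more than
 * dstlen bytes including the terminating NUL. Returns the number of
 * bytes stored, not counting the NUL. Stops early rather than
 * splitting a multi-byte character. */
static size_t convert_bounded(char *dst, size_t dstlen,
			      const uint16_t *src, size_t srclen)
{
	char tmp[3];
	size_t i, j, out = 0;
	int n;

	assert(dst != NULL && src != NULL && dstlen > 0);

	for (i = 0; i < srclen && src[i] != 0; i++) {
		n = to_local(src[i], tmp);
		if (out + (size_t)n + 1 > dstlen)
			break;	/* no room for this character plus the NUL */
		for (j = 0; j < (size_t)n; j++)
			dst[out++] = tmp[j];
	}
	dst[out] = '\0';
	return out;
}
```

With this shape the caller, not the converter, decides how much may be written, so a check such as "j >= (2 * NAME_MAX)" inside the converter would no longer need to guess the caller's buffer size.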
On Thu, Apr 23, 2009 at 12:56 AM, Jeff Layton <jlayton@redhat.com> wrote:
> On Thu, 23 Apr 2009 02:49:21 +0200
> Günter Kukkukk <linux@kukkukk.com> wrote:
>
>> The current focus was mostly on cifs_strfromUCS_le() - but the _same_ applies
>> to cifs_convertUCSpath()!
>
> I see what you mean. Good catch. That function also has broken buffer
> length checking logic too.
>
> This patch is only compile-tested, but it should fix those problems. In
> the long run, we probably need to make all of these functions take an
> argument with the length of the destination buffer.
>
> Let's plan that overhaul after Suresh's latest set goes in though.
>
> --
> Jeff Layton <jlayton@redhat.com>
>
> _______________________________________________
> linux-cifs-client mailing list
> linux-cifs-client@lists.samba.org
> https://lists.samba.org/mailman/listinfo/linux-cifs-client

A general question: functions such as cifs_strtoUCS call uni2char,
which assumes the UTF-8 translation format. If one of the characters being
encoded happens to be 6 bytes long, will an SMB/CIFS server be able to
handle that? That is, if it is expecting a UCS-2LE encoding, and thus a
two-byte encoded value, how would it handle a 6-byte encoded value?
On Fri, Apr 24, 2009 at 11:57 AM, Shirish Pargaonkar
<shirishpargaonkar@gmail.com> wrote:
> A general question, the functions such as cifs_strtoUCS call uni2char
> which assumes UTF-8 translation format.
> If one of the characters being encoded happens to be 6 bytes long,
> will a SMB/CIFS server be able
> to handle that i.e. if it is expecting a UCS-2LE encoding, thus a two
> byte encoded value, (how) would it handle
> 6 byte encoded value!

Sorry, I meant to say
'char2uni which assumes UTF-8 translation format'
and not
'uni2char which assumes UTF-8 translation format'
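The char2uni direction asked about here turns a multi-byte local sequence into one 16-bit value, so the number of local bytes and the number of bytes per character on the wire are separate quantities. The following is a user-space sketch only, assuming a UTF-8 local charset. The helper utf8_to_ucs2() is hypothetical and is not the kernel's char2uni implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence of 1 to 3 bytes into a
 * single 16-bit value. Returns the number of bytes consumed, or -1 if
 * the sequence is malformed, truncated, overlong, or would not fit in
 * 16 bits. On -1 the value of *out is unspecified. Surrogate values
 * are not rejected in this sketch. */
static int utf8_to_ucs2(const unsigned char *s, size_t len, uint16_t *out)
{
	assert(s != NULL && out != NULL);

	if (len >= 1 && s[0] < 0x80) {
		*out = s[0];
		return 1;
	}
	if (len >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
		*out = (uint16_t)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
		return *out >= 0x80 ? 2 : -1;	/* reject overlong forms */
	}
	if (len >= 3 && (s[0] & 0xF0) == 0xE0 &&
	    (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
		*out = (uint16_t)(((s[0] & 0x0F) << 12) |
				  ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
		return *out >= 0x800 ? 3 : -1;	/* reject overlong forms */
	}
	return -1;	/* 4-byte and longer lead bytes do not fit in UCS-2 */
}
```

For example, the three bytes E2 82 AC decode to the single value 0x20AC, which occupies two bytes in UCS-2LE regardless of how many bytes the local form used.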
On Fri, 24 Apr 2009 11:59:54 -0500
Shirish Pargaonkar <shirishpargaonkar@gmail.com> wrote:

> > A general question, the functions such as cifs_strtoUCS call uni2char
> > which assumes UTF-8 translation format.
> > If one of the characaters being encoded happens to be 6 bytes long,
> > will a SMB/CIFS server be able
> > to handle that i.e. if it is expecting a UCS-2LE encoding, thus a two
> > byte encoded value, (how) would it handle
> > 6 byte encoded value!
>
> Sorry, I meant to say
> 'char2uni which assumes UTF-8 translation format'
> and not
> 'uni2char which assumes UTF-8 translation format'

My understanding is that the unicode spec allows for a character to
translate to a wide char of up to 6 bytes. According to Suresh's
earlier email though, the unicode standard specifies no characters
above 0x10ffff. So Unicode characters can only be up to four bytes long
in UTF-8 (and maybe even only 3 bytes unless I'm missing something).

The question of course is, what if the client is using some other
non-UTF8 multibyte charset? Could we end up with chars that are 5 or 6
bytes in that case?
On Friday, 24 April 2009, Jeff Layton wrote:
> My understanding is that the unicode spec allows for a character to
> translate to a wide char of up to 6 bytes. According to Suresh's
> earlier email though, the unicode standard specifies no characters
> above 0x10ffff. So Unicode characters can only be up to four bytes long
> in UTF-8 (and maybe even only 3 bytes unless I'm missing something).
>
> The question of course is, what if the client is using some other
> non-UTF8 multibyte charset? Could we end up with chars that are 5 or 6
> bytes in that case?

I've spent days now on the "unicode" (and ISO-10646) stuff and will write
a conclusion later this week.
The current unicode upper limit 0x10ffff results in 4 bytes of UTF-8.

"It is important to note that both the Unicode consortium and ISO pledge
to never extend the encoding-space past this range." (0x10ffff)

Jeff: "...what if the client is using some other non-UTF8 multibyte charset?"

Some unixes use UTF-32 and UCS4 to represent one character, but even
those would only consume 4 bytes per char, always wasting 11 bits of
the 32-bit range.

The initial UCS-2 (2 byte) encoding used by Microsoft for their NTFS
filesystem would only result in 3 bytes of UTF-8 (more later this week).

Any valid 3-byte UTF-8 sequence should be easily converted back to "UCS-2"
using a proper nls 'char2uni'.

More later ...
Cheers, Günter
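The byte counts quoted in this message follow directly from the UTF-8 range boundaries. A small sketch, with the hypothetical helper names utf8_len() and ucs2_to_utf8_bound(), makes the arithmetic explicit: any 16-bit value needs at most 3 bytes, and 0x10ffff needs 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes UTF-8 needs for code point cp, or 0 if cp lies above
 * the Unicode limit 0x10FFFF. Surrogate values (0xD800..0xDFFF) are not
 * valid characters on their own but are counted as 3 bytes here. */
static int utf8_len(uint32_t cp)
{
	if (cp < 0x80)
		return 1;
	if (cp < 0x800)
		return 2;
	if (cp < 0x10000)
		return 3;	/* everything a single 16-bit unit can hold */
	if (cp <= 0x10FFFF)
		return 4;
	return 0;
}

/* Worst-case buffer size, in bytes, for converting nunits UCS-2 values
 * to UTF-8, including a terminating NUL. */
static size_t ucs2_to_utf8_bound(size_t nunits)
{
	assert(nunits <= (SIZE_MAX - 1) / 3);	/* guard against overflow */
	return nunits * 3 + 1;
}
```

Under these assumptions a name of 255 UCS-2 units needs at most 766 bytes as UTF-8, which is one way to size a destination buffer when the local charset is known to be UTF-8. Other multibyte charsets would need their own bound.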
2009/4/24 Günter Kukkukk <linux@kukkukk.com>:
> i've spent days now on the "unicode" (and ISO-10646) stuff - will write
> a conclusion later this week.
> The current unicode upper limit 0x10ffff results to 4 bytes utf-8.
>
> "It is important to note that both the Unicode consortium and ISO pledge
> to never extend the encoding-space past this range." (0x10ffff)

Gunter,

The range 0 - 0x10ffff is the range of the Unicode/UCS character set.
But when any of these Unicode/UCS characters is encoded using UTF-8,
the encoded value can span up to 6 bytes. Would that be correct?
On Fri, Apr 24, 2009 at 10:28 PM, Shirish Pargaonkar
<shirishpargaonkar@gmail.com> wrote:
> The range 0 - 0x10ffff is the range of Unicode/UCS character set.
> But when any of these Unicode/UCS characters is encoded using UTF-8
> the encoded value can span upto 6 bytes. Would that be correct?

I should not say 'any of these' but should say 'some of these'.
So a Unicode/UCS character itself would not take more than 4 bytes, but
some of their encoded values may take up to six bytes, and the encoded
value is what is sent over the wire to the server.
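Some background that may help with the 6-byte figure: the original UTF-8 definition (RFC 2279) allowed sequences of up to 6 bytes for values up to 0x7FFFFFFF, while RFC 3629 restricts UTF-8 to the range ending at 0x10FFFF and therefore to at most 4 bytes. Separately, a 16-bit wire encoding carries 16-bit units rather than UTF-8 bytes. The sketch below, using the hypothetical helper put_utf16le(), shows how a code point maps to UTF-16LE bytes: 2 bytes up to 0xFFFF, and a 4-byte surrogate pair above that. Whether a given server accepts surrogate pairs rather than plain UCS-2 is not something this thread establishes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write code point cp as UTF-16LE into out (at
 * least 4 bytes). Returns the number of bytes written (2 or 4), or 0 if
 * cp is a surrogate value or lies above 0x10FFFF. */
static size_t put_utf16le(uint32_t cp, unsigned char *out)
{
	uint32_t v;
	uint16_t hi, lo;

	assert(out != NULL);

	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;
	if (cp < 0x10000) {
		out[0] = (unsigned char)(cp & 0xFF);	/* low byte first */
		out[1] = (unsigned char)(cp >> 8);
		return 2;
	}
	v = cp - 0x10000;			/* 20 bits remain */
	hi = (uint16_t)(0xD800 | (v >> 10));	/* high surrogate */
	lo = (uint16_t)(0xDC00 | (v & 0x3FF));	/* low surrogate */
	out[0] = (unsigned char)(hi & 0xFF);
	out[1] = (unsigned char)(hi >> 8);
	out[2] = (unsigned char)(lo & 0xFF);
	out[3] = (unsigned char)(lo >> 8);
	return 4;
}
```

In this model no character in the Unicode range needs more than 4 bytes in either UTF-8 or UTF-16.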
diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
index 14eb9a2..8ab332b 100644
--- a/fs/cifs/cifs_unicode.h
+++ b/fs/cifs/cifs_unicode.h
@@ -159,6 +159,33 @@ UniStrnlen(const wchar_t *ucs1, int maxlen)
 }
 
 /*
+ * UniStrnlenBytes: Return the length of a NLS string in bytes. Also, populates
+ * 'nchars' with the length of string in 16 bit Unicode chars.
+ */
+static inline size_t
+UniStrnlenBytes(const wchar_t *str, int maxlen, int *nchars,
+		const struct nls_table *codepage)
+{
+	int nc;
+	size_t nbytes = 0;
+	char buf[NLS_MAX_CHARSET_SIZE]; /* enough for one char at a time */
+
+	*nchars = 0;
+	while (*str++ && maxlen) {
+		nc = codepage->uni2char(*str, buf, NLS_MAX_CHARSET_SIZE);
+		if (nc > 0)
+			nbytes += nc;
+		else
+			nbytes += 1; /* for '?' */
+		(*nchars)++;
+		if (*nchars >= maxlen)
+			break;
+	}
+
+	return nbytes;
+}
+
+/*
  * UniStrncat: Concatenate length limited string
  */
 static inline wchar_t *
Though the consensus is that we need a generalised helper to handle
unicode string buffers so that other filesystems could consume, we would
need a cifs helper like this in the interim, given the number of
discussions/reviews and bug reports. cifs could easily replace this with
generic helpers once such helper is in place.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 fs/cifs/cifs_unicode.h |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)
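One detail of the posted loop seems worth checking before it is reused. As written, "*str++" tests one element and advances, so the "*str" handed to uni2char appears to be the following element. That would skip the first character and convert the terminator. The following user-space sketch of the same length calculation converts the element it has just tested. All names here (strnlen_bytes(), uni2char_fn, ascii_uni2char(), MAX_CHARSET_SIZE) are hypothetical stand-ins rather than kernel interfaces, and uint16_t replaces the kernel's wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback with the same shape as the NLS uni2char hook:
 * returns the number of bytes written to out, or a value <= 0 if the
 * character cannot be represented. */
typedef int (*uni2char_fn)(uint16_t uni, unsigned char *out, int boundlen);

#define MAX_CHARSET_SIZE 6	/* assumed to mirror NLS_MAX_CHARSET_SIZE */

/* Toy converter for demonstration purposes: ASCII only. */
static int ascii_uni2char(uint16_t uni, unsigned char *out, int boundlen)
{
	if (boundlen < 1 || uni >= 0x80)
		return -1;
	out[0] = (unsigned char)uni;
	return 1;
}

/* Return the length in bytes of str once converted, looking at no more
 * than maxlen 16-bit units; *nchars receives the number of units seen. */
static size_t strnlen_bytes(const uint16_t *str, int maxlen, int *nchars,
			    uni2char_fn uni2char)
{
	unsigned char buf[MAX_CHARSET_SIZE];	/* one character at a time */
	size_t nbytes = 0;
	int nc;

	assert(str != NULL && nchars != NULL && uni2char != NULL);

	*nchars = 0;
	while (*nchars < maxlen && str[*nchars] != 0) {
		/* convert the same element that was just tested */
		nc = uni2char(str[*nchars], buf, MAX_CHARSET_SIZE);
		nbytes += (nc > 0) ? (size_t)nc : 1;	/* 1 byte for '?' */
		(*nchars)++;
	}
	return nbytes;
}
```

Indexing with *nchars also folds the maxlen check into the loop condition, so no separate break is needed.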