[linux-cifs-client,1/3] cifs: Introduce helper to compute length of nls string in bytes

Message ID 49EDBB06.3060409@suse.de (mailing list archive)
State New, archived

Commit Message

Suresh Jayaraman April 21, 2009, 12:24 p.m. UTC
Though the consensus is that we need a generalised helper for handling
unicode string buffers that other filesystems could also use, we need
a cifs helper like this in the interim, given the number of
discussions/reviews and bug reports. cifs could easily replace it with
a generic helper once one is in place.


Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
---
 fs/cifs/cifs_unicode.h |   27 +++++++++++++++++++++++++++
 1 files changed, 27 insertions(+), 0 deletions(-)

Comments

Günter Kukkukk April 22, 2009, 12:25 a.m. UTC | #1
during cleanup of the unicode related stuff we should also have a look
at functions currently located in misc.c
  - cifs_convertUCSpath()
  - cifsConvertToUCS()

At the moment cifs_convertUCSpath() contains the following check:
...
		/* make sure we do not overrun callers allocated temp buffer */
		if (j >= (2 * NAME_MAX))
			break;
...
Probably both functions should be moved away from misc.c.
Cheers, Günter
Jeff Layton April 22, 2009, 6:17 a.m. UTC | #2
On Wed, 22 Apr 2009 02:25:05 +0200
Günter Kukkukk <linux@kukkukk.com> wrote:
> 
> during cleanup of the unicode related stuff we should also have a look
> at functions currently located in misc.c
>   - cifs_convertUCSpath()
>   - cifsConvertToUCS()
> 
> At the moment cifs_convertUCSpath() contains the following check:
> ...
> 		/* make sure we do not overrun callers allocated temp buffer */
> 		if (j >= (2 * NAME_MAX))
> 			break;
> ...
> Probably both functions should be moved away from misc.c.

Agreed on all counts. For now, we're focused on the cases where we're
converting the data from the server to the local charset. After that's
resolved we need to fix the other direction as well.

It might make sense to consolidate all of this in a separate file
someplace as well.
Günter Kukkukk April 23, 2009, 12:06 a.m. UTC | #3
On Wednesday, 22 April 2009, Jeff Layton wrote:
> Agreed on all counts. For now, we're focused on the cases where we're
> converting the data from the server to the local charset. After that's
> resolved we need to fix the other direction as well.
> 
> It might make sense to consolidate all of this in a separate file
> someplace as well.
> 

Hi Jeff,

the function cifs_convertUCSpath() _is_ related to 
"...we're focused on the cases where we're converting the data from
the server to the local charset. ....".

And - it's heavily used.
Cheers, Günter
Günter Kukkukk April 23, 2009, 12:49 a.m. UTC | #4
On Thursday, 23 April 2009, Günter Kukkukk wrote:
> Hi Jeff,
> 
> the function cifs_convertUCSpath() _is_ related to 
> "...we're focused on the cases where we're converting the data from
> the server to the local charset. ....".
> 
> And - it's heavily used.
> Cheers, Günter

Just some further notes.
With "it's heavily used" I didn't mean the number of callers of this
function (there is only one, in readdir.c). I meant the number of times
cifs_convertUCSpath() is called in daily usage (for readdir results).

The current focus was mostly on cifs_strfromUCS_le(), but the _same_ applies
to cifs_convertUCSpath()!

See the following code snippet: 

readdir.c --> static int cifs_get_name_from_search_buf()
....

	if (unicode) {
		/* BB fixme - test with long names */
		/* Note converted filename can be longer than in unicode */
		if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR)
			pqst->len = cifs_convertUCSpath((char *)pqst->name,
					(__le16 *)filename, len/2, nlt);
		else
			pqst->len = cifs_strfromUCS_le((char *)pqst->name,
					(__le16 *)filename, len/2, nlt);

....

Cheers, Günter
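
To illustrate the concern in the snippet above, that the converted name can
be longer than its unicode form, here is a minimal userspace sketch of a
conversion that is told how large the destination buffer is.
ucs2_to_utf8_bounded() is a hypothetical name and not a cifs function. It
assumes host-order 16-bit units, hard-codes UTF-8 instead of going through an
nls_table, and does not combine surrogate pairs.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only, not part of cifs).
 * Convert up to 'srclen' 16-bit units from 'src' to UTF-8 in 'dst',
 * never writing more than 'dstlen' bytes including the trailing NUL.
 * Assumes host byte order; each unit is treated as a standalone UCS-2
 * value. Returns the number of bytes written, excluding the NUL.
 */
static size_t ucs2_to_utf8_bounded(char *dst, size_t dstlen,
				   const uint16_t *src, size_t srclen)
{
	size_t i, j = 0;

	if (dstlen == 0)
		return 0;

	for (i = 0; i < srclen && src[i] != 0; i++) {
		uint16_t c = src[i];
		size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : 3;

		/* check before writing, and keep one byte for the NUL */
		if (j + need >= dstlen)
			break;

		if (need == 1) {
			dst[j++] = (char)c;
		} else if (need == 2) {
			dst[j++] = (char)(0xC0 | (c >> 6));
			dst[j++] = (char)(0x80 | (c & 0x3F));
		} else {
			dst[j++] = (char)(0xE0 | (c >> 12));
			dst[j++] = (char)(0x80 | ((c >> 6) & 0x3F));
			dst[j++] = (char)(0x80 | (c & 0x3F));
		}
	}
	dst[j] = '\0';
	return j;
}
```

Because the space check happens before each character is written, a name
that does not fit is truncated on a character boundary instead of overrunning
the buffer. For the three units 'a', U+00E9, U+20AC and a 4-byte destination,
the sketch stops after the first two characters and returns 3.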
Shirish Pargaonkar April 24, 2009, 4:57 p.m. UTC | #5
On Thu, Apr 23, 2009 at 12:56 AM, Jeff Layton <jlayton@redhat.com> wrote:
> On Thu, 23 Apr 2009 02:49:21 +0200
> Günter Kukkukk <linux@kukkukk.com> wrote:
>
>> just some further notes.
>> With "it's heavily used" i didn't mean the number of callers using this
>> function (only 1 in readdir.c) - i meant "the number of times" cifs_convertUCSpath()
>> is called in daily usage.... (readdir results)
>>
>> The current focus was mostly on cifs_strfromUCS_le() - but the _same_ applies
>> to cifs_convertUCSpath()!
>>
>> See the following code snippet:
>>
>> readdir.c --> static int cifs_get_name_from_search_buf()
>> ....
>>
>>       if (unicode) {
>>               /* BB fixme - test with long names */
>>               /* Note converted filename can be longer than in unicode */
>>               if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR)
>>                       pqst->len = cifs_convertUCSpath((char *)pqst->name,
>>                                       (__le16 *)filename, len/2, nlt);
>>               else
>>                       pqst->len = cifs_strfromUCS_le((char *)pqst->name,
>>                                       (__le16 *)filename, len/2, nlt);
>>
>> ....
>
> I see what you mean. Good catch. That function also has broken buffer
> length checking logic too.
>
> This patch is only compile-tested, but it should fix those problems. In
> the long run, we probably need to make all of these functions take an
> argument with the length of the destination buffer.
>
> Let's plan that overhaul after Suresh's latest set goes in though.
>
> --
> Jeff Layton <jlayton@redhat.com>
>
> _______________________________________________
> linux-cifs-client mailing list
> linux-cifs-client@lists.samba.org
> https://lists.samba.org/mailman/listinfo/linux-cifs-client
>
>

A general question: functions such as cifs_strtoUCS call uni2char,
which assumes the UTF-8 translation format.
If one of the characters being encoded happens to be 6 bytes long,
will an SMB/CIFS server be able to handle that? That is, if it is
expecting UCS-2LE encoding, and thus a two-byte encoded value, how
would it handle a 6-byte encoded value?
Shirish Pargaonkar April 24, 2009, 4:59 p.m. UTC | #6
On Fri, Apr 24, 2009 at 11:57 AM, Shirish Pargaonkar
<shirishpargaonkar@gmail.com> wrote:
> A general question, the functions such as cifs_strtoUCS call uni2char
> which assumes UTF-8 translation format.
> If one of the characters being encoded happens to be 6 bytes long,
> will a SMB/CIFS server be able
> to handle that i.e. if it is expecting a UCS-2LE encoding, thus a two
> byte encoded value, (how) would it handle
> 6 byte encoded value!
>

Sorry, I meant to say
 'char2uni which assumes UTF-8 translation format'
and not
 'uni2char which assumes UTF-8 translation format'
Jeff Layton April 24, 2009, 9:27 p.m. UTC | #7
On Fri, 24 Apr 2009 11:59:54 -0500
Shirish Pargaonkar <shirishpargaonkar@gmail.com> wrote:

> Sorry, I meant to say
>  'char2uni which assumes UTF-8 translation format'
> and not
>  'uni2char which assumes UTF-8 translation format'

My understanding is that the unicode spec allows for a character to
translate to a wide char of up to 6 bytes. According to Suresh's
earlier email though, the unicode standard specifies no characters
above 0x10ffff. So Unicode characters can only be up to four bytes long
in UTF-8 (and maybe even only 3 bytes unless I'm missing something).

The question of course is, what if the client is using some other
non-UTF8 multibyte charset? Could we end up with chars that are 5 or 6
bytes in that case?
Günter Kukkukk April 25, 2009, 3:12 a.m. UTC | #8
On Friday, 24 April 2009, Jeff Layton wrote:
> My understanding is that the unicode spec allows for a character to
> translate to a wide char of up to 6 bytes. According to Suresh's
> earlier email though, the unicode standard specifies no characters
> above 0x10ffff. So Unicode characters can only be up to four bytes long
> in UTF-8 (and maybe even only 3 bytes unless I'm missing something).
> 
> The question of course is, what if the client is using some other
> non-UTF8 multibyte charset? Could we end up with chars that are 5 or 6
> bytes in that case?
> 

I've spent days now on the "unicode" (and ISO 10646) material and will write
up a conclusion later this week.
The current unicode upper limit of 0x10ffff results in at most 4 bytes of UTF-8.

   "It is important to note that both the Unicode consortium and ISO pledge
    to never extend the encoding-space past this range." (0x10ffff)

Jeff: "...what if the client is using some other non-UTF8 multibyte charset?"

Some unixes use UTF-32 or UCS-4 to represent one character, but even
those consume only 4 bytes per char, always wasting 11 bits of
the 32-bit range.

The initial UCS-2 (2-byte) encoding used by Microsoft for their NTFS
filesystem results in at most 3 bytes of UTF-8 (more later this week).

Any valid 3-byte UTF-8 sequence should be easily converted back to "UCS-2"
using a proper nls 'char2uni'.

More later ...
Cheers, Günter
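
The byte counts above follow from the UTF-8 ranges defined in RFC 3629. As a
small sketch (utf8_len() is a hypothetical name used only for illustration),
the length needed for one code point can be computed like this. The older
RFC 2279 definition also had 5- and 6-byte forms, but those only covered
values above 0x10ffff, which RFC 3629 no longer permits.

```c
#include <stdint.h>

/*
 * Hypothetical helper (illustration only).
 * Number of bytes UTF-8 (RFC 3629) needs for one code point.
 * Returns 0 for values above 0x10FFFF. Surrogate values
 * (0xD800..0xDFFF) are not rejected here, to keep the sketch short.
 */
static int utf8_len(uint32_t cp)
{
	if (cp < 0x80)
		return 1;	/* 7 bits:  0xxxxxxx */
	if (cp < 0x800)
		return 2;	/* 11 bits: 110xxxxx 10xxxxxx */
	if (cp < 0x10000)
		return 3;	/* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
	if (cp <= 0x10FFFF)
		return 4;	/* 21 bits: 11110xxx plus three 10xxxxxx */
	return 0;
}
```

Every 16-bit value, including the largest one, 0xFFFF, falls in the 3-byte
range or below, and 0x10FFFF itself needs 4 bytes.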
Shirish Pargaonkar April 25, 2009, 3:28 a.m. UTC | #9
2009/4/24 Günter Kukkukk <linux@kukkukk.com>:
> i've spent days now on the "unicode" (and ISO-10646) stuff - will write
> a conclusion later this week.
> The current unicode upper limit 0x10ffff results to 4 bytes utf-8.
>
>   "It is important to note that both the Unicode consortium and ISO pledge
>    to never extend the encoding-space past this range." (0x10ffff)
>

Gunter,

The range 0 - 0x10ffff is the range of the Unicode/UCS character set.
But when any of these Unicode/UCS characters is encoded using UTF-8,
the encoded value can span up to 6 bytes. Would that be correct?

Shirish Pargaonkar April 25, 2009, 3:46 a.m. UTC | #10
On Fri, Apr 24, 2009 at 10:28 PM, Shirish Pargaonkar
<shirishpargaonkar@gmail.com> wrote:
> Gunter,
>
>
> The range 0 - 0x10ffff is the range of Unicode/UCS character set.
> But when any of these Unicode/UCS characters is encoded using UTF-8
> the encoded value can span upto 6 bytes. Would that be correct?
>

I should not say 'any of these' but rather 'some of these'.
So a Unicode/UCS character itself would not take more than 4 bytes,
but the encoded value of some of them may take up to six bytes, and the
encoded value is what is sent over the wire to the server.

Patch

diff --git a/fs/cifs/cifs_unicode.h b/fs/cifs/cifs_unicode.h
index 14eb9a2..8ab332b 100644
--- a/fs/cifs/cifs_unicode.h
+++ b/fs/cifs/cifs_unicode.h
@@ -159,6 +159,33 @@  UniStrnlen(const wchar_t *ucs1, int maxlen)
 }
 
 /*
+ * UniStrnlenBytes: Return the length of a NLS string in bytes. Also, populates
+ * 'nchars' with the length of string in 16 bit Unicode chars.
+ */
+static inline size_t
+UniStrnlenBytes(const wchar_t *str, int maxlen, int *nchars,
+		const struct nls_table *codepage)
+{
+	int nc;
+	size_t nbytes = 0;
+	char buf[NLS_MAX_CHARSET_SIZE]; /* enough for one char at a time */
+
+	*nchars = 0;
+	while (*str && maxlen) {
+		nc = codepage->uni2char(*str++, buf, NLS_MAX_CHARSET_SIZE);
+		if (nc > 0)
+			nbytes += nc;
+		else
+			nbytes += 1; /* for '?' */
+		(*nchars)++;
+		if (*nchars >= maxlen)
+			break;
+	}
+
+	return nbytes;
+}
+
+/*
  * UniStrncat:  Concatenate length limited string
  */
 static inline wchar_t *
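
For experimenting outside the kernel, the counting loop can be modelled in
userspace roughly as follows. This is only a sketch under stated assumptions:
strnlen_bytes(), toy_uni2char(), the uni2char_fn typedef and MAX_CHARSET_SIZE
are stand-ins invented for illustration, and the toy converter hard-codes
UTF-8 for single 16-bit values where the kernel would use the mount's
nls_table. The model reads each unit before advancing the pointer.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CHARSET_SIZE 6	/* stand-in for NLS_MAX_CHARSET_SIZE */

/* Stand-in for the uni2char member of struct nls_table. */
typedef int (*uni2char_fn)(uint16_t uni, unsigned char *out, int boundlen);

/* Toy converter: UTF-8 for one 16-bit value; returns bytes written or -1. */
static int toy_uni2char(uint16_t uni, unsigned char *out, int boundlen)
{
	int need = uni < 0x80 ? 1 : uni < 0x800 ? 2 : 3;

	if (boundlen < need)
		return -1;
	if (need == 1) {
		out[0] = (unsigned char)uni;
	} else if (need == 2) {
		out[0] = (unsigned char)(0xC0 | (uni >> 6));
		out[1] = (unsigned char)(0x80 | (uni & 0x3F));
	} else {
		out[0] = (unsigned char)(0xE0 | (uni >> 12));
		out[1] = (unsigned char)(0x80 | ((uni >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (uni & 0x3F));
	}
	return need;
}

/*
 * Model of the length helper: bytes needed for the converted string,
 * looking at no more than 'maxlen' 16-bit units. '*nchars' receives
 * the number of units examined.
 */
static size_t strnlen_bytes(const uint16_t *str, int maxlen, int *nchars,
			    uni2char_fn conv)
{
	unsigned char buf[MAX_CHARSET_SIZE];	/* one char at a time */
	size_t nbytes = 0;

	*nchars = 0;
	while (*nchars < maxlen && *str) {
		int nc = conv(*str++, buf, (int)sizeof(buf));

		/* count 1 byte for a '?' substitute on failure */
		nbytes += nc > 0 ? (size_t)nc : 1;
		(*nchars)++;
	}
	return nbytes;
}
```

For the units 'a', U+00E9, U+20AC this model reports 6 bytes and 3 chars;
with maxlen set to 2 it reports 3 bytes and 2 chars.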