diff mbox

ALPHA: advice for a patch to CIFS

Message ID 7CE799CC0E4DE04B88D5FDF226E18AC2CDF8E31455@LONPMAILBOX01.citrite.net (mailing list archive)
State New, archived
Headers show

Commit Message

Frediano Ziglio July 27, 2012, 7:13 p.m. UTC
Hi,
  I'm currently trying to support utf-16 with characters not in plane 0.

I'm currently end up with this patch. Currently is not against latest
kernel but the problem still reside in last git kernel.

wchar_t is currently 16bit so converting a utf8 encoded characters not
in plane 0 (>= 0x10000) to wchar_t (that is calling char2uni) lead to a
-EINVAL return. This patch detect utf8 in cifs_strtoUCS and add special
code calling directly utf8_to_utf32.

Does it sound a good patch or just a bad hack. Perhaps would be better
to change char2uni converting to unicode_t (32bit) instead of wchar_t
but probably many code have to be checked in order to make sure it does
not lead to wrong conversions, overflows or other bad stuff.

Is it worth working in this hacking way? I'd like to upstream this
patch.



Signed-off-by: "Frediano Ziglio" <frediano.ziglio@citrix.com>

Regards,
  Frediano

Comments

Scott Lovenberg July 27, 2012, 7:50 p.m. UTC | #1
On Fri, Jul 27, 2012 at 3:42 PM, Scott Lovenberg
<scott.lovenberg@gmail.com> wrote:
>
>
> On Fri, Jul 27, 2012 at 3:13 PM, Frediano Ziglio
> <frediano.ziglio@citrix.com> wrote:
>>
>> Hi,
>>   I'm currently trying to support utf-16 with characters not in plane 0.
>>
>> I'm currently end up with this patch. Currently is not against latest
>> kernel but the problem still reside in last git kernel.
>>
>> wchar_t is currently 16bit so converting a utf8 encoded characters not
>> in plane 0 (>= 0x10000) to wchar_t (that is calling char2uni) lead to a
>> -EINVAL return. This patch detect utf8 in cifs_strtoUCS and add special
>> code calling directly utf8_to_utf32.
>>
>> Does it sound a good patch or just a bad hack. Perhaps would be better
>> to change char2uni converting to unicode_t (32bit) instead of wchar_t
>> but probably many code have to be checked in order to make sure it does
>> not lead to wrong conversions, overflows or other bad stuff.
>>
>> Is it worth working in this hacking way? I'd like to upstream this
>> patch.
>>
>>
>> diff -r c2325d754e8d fs/cifs/cifs_unicode.c
>> --- a/fs/cifs/cifs_unicode.c    Fri Jul 27 15:12:23 2012 +0100
>> +++ b/fs/cifs/cifs_unicode.c    Fri Jul 27 19:09:04 2012 +0100
>> @@ -192,22 +192,40 @@ cifs_strtoUCS(__le16 *to, const char *fr
>>  {
>>         int charlen;
>>         int i;
>> -       wchar_t *wchar_to = (wchar_t *)to; /* needed to quiet sparse */
>> +       int is_utf8 = !strcmp(codepage->charset, "utf8");
>> +       wchar_t wchar_to; /* needed to quiet sparse */
>> +       unicode_t uni;
>>
>>         for (i = 0; len && *from; i++, from += charlen, len -= charlen) {
>>
>>                 /* works for 2.4.0 kernel or later */
>> -               charlen = codepage->char2uni(from, len, &wchar_to[i]);
>> +               if (is_utf8) {
>> +                       charlen = utf8_to_utf32(from, len, &uni);
>> +               } else {
>> +                       charlen = codepage->char2uni(from, len,
>> &wchar_to);
>> +                       uni = wchar_to;
>> +               }
>> +
>>                 if (charlen < 1) {
>>                         cERROR(1,
>>                                ("strtoUCS: char2uni of %d returned %d",
>>                                 (int)*from, charlen));
>>                         /* A question mark */
>> -                       to[i] = cpu_to_le16(0x003f);
>> +                       wchar_to = 0x003f;
>>                         charlen = 1;
>> -               } else
>> -                       to[i] = cpu_to_le16(wchar_to[i]);
>> -
>> +               } else if (uni < 0x10000) {
>> +                       wchar_to = uni;
>> +               } else if (uni < 0x110000) {
>> +                       uni -= 0x10000;
>> +                       to[i++] = cpu_to_le16(0xD800 | (uni >> 10));
>> +                       wchar_to = 0xDC00 | (uni & 0x3FF);
>> +               } else {
>> +                       cERROR(1,
>> +                              ("strtoUCS: char2uni of %d returned %d",
>> +                               (int)*from, charlen));
>> +                       wchar_to = 0x003f;
>> +               }
>> +               to[i] = cpu_to_le16(wchar_to);
>>         }
>>
>>         to[i] = 0;
>>
>> Signed-off-by: "Frediano Ziglio" <frediano.ziglio@citrix.com>
>>
>> Regards,
>>   Frediano
>>

[sorry for the noise, resending in plain text for the mail daemon - it
rejected the last mail because of HTML formatting]

 Just my $0.02, but there are a lot of magic numbers in this patch.  Maybe I
 don't have the domain knowledge needed to understand the patch, but I do
 have trouble following the conditionals and bitwise operations.

 --
 Peace and Blessings,
 -Scott.
--
To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve French July 27, 2012, 8:11 p.m. UTC | #2
On Fri, Jul 27, 2012 at 2:50 PM, Scott Lovenberg
<scott.lovenberg@gmail.com> wrote:
> On Fri, Jul 27, 2012 at 3:42 PM, Scott Lovenberg
> <scott.lovenberg@gmail.com> wrote:
>>
>>
>> On Fri, Jul 27, 2012 at 3:13 PM, Frediano Ziglio
>> <frediano.ziglio@citrix.com> wrote:
>>>
>>> Hi,
>>>   I'm currently trying to support utf-16 with characters not in plane 0.
>>>
>>> I'm currently end up with this patch. Currently is not against latest
>>> kernel but the problem still reside in last git kernel.
>>>
>>> wchar_t is currently 16bit so converting a utf8 encoded characters not
>>> in plane 0 (>= 0x10000) to wchar_t (that is calling char2uni) lead to a
>>> -EINVAL return. This patch detect utf8 in cifs_strtoUCS and add special
>>> code calling directly utf8_to_utf32.
>>>
>>> Does it sound a good patch or just a bad hack. Perhaps would be better
>>> to change char2uni converting to unicode_t (32bit) instead of wchar_t
>>> but probably many code have to be checked in order to make sure it does
>>> not lead to wrong conversions, overflows or other bad stuff.
>>>
>>> Is it worth working in this hacking way? I'd like to upstream this
>>> patch.


Terminology is confusing.   Refreshing my memory by looking at

http://en.wikipedia.org/wiki/Universal_Character_Set

are we talking about UTF-16 vs. UCS-2 (ie cases where a pair of
16 bit unicode characters are interpreted as one)?

IIRC there are a few languages where this helps, at least
since Windows XP when apparently it became more common.
and we have to support this in kernel.


> Just my $0.02, but there are a lot of magic numbers in this patch

Agreed.  The check vs. 3F and against the maximum unicode
value should be against #defined values which are easier to read.
Jeff Layton July 28, 2012, 12:06 p.m. UTC | #3
On Fri, 27 Jul 2012 20:13:42 +0100
Frediano Ziglio <frediano.ziglio@citrix.com> wrote:

> Hi,
>   I'm currently trying to support utf-16 with characters not in plane 0.
> 
> I'm currently end up with this patch. Currently is not against latest
> kernel but the problem still reside in last git kernel.
> 
> wchar_t is currently 16bit so converting a utf8 encoded characters not
> in plane 0 (>= 0x10000) to wchar_t (that is calling char2uni) lead to a
> -EINVAL return. This patch detect utf8 in cifs_strtoUCS and add special
> code calling directly utf8_to_utf32.
> 
> Does it sound a good patch or just a bad hack. Perhaps would be better
> to change char2uni converting to unicode_t (32bit) instead of wchar_t
> but probably many code have to be checked in order to make sure it does
> not lead to wrong conversions, overflows or other bad stuff.
> 
> Is it worth working in this hacking way? I'd like to upstream this
> patch.
> 
> 
> diff -r c2325d754e8d fs/cifs/cifs_unicode.c
> --- a/fs/cifs/cifs_unicode.c	Fri Jul 27 15:12:23 2012 +0100
> +++ b/fs/cifs/cifs_unicode.c	Fri Jul 27 19:09:04 2012 +0100
> @@ -192,22 +192,40 @@ cifs_strtoUCS(__le16 *to, const char *fr

That function doesn't exist anymore. You should base this on a more
recent upstream tree.

>  {
>  	int charlen;
>  	int i;
> -	wchar_t *wchar_to = (wchar_t *)to; /* needed to quiet sparse */
> +	int is_utf8 = !strcmp(codepage->charset, "utf8");

Gross...there must be a better way to do that?


> +	wchar_t wchar_to; /* needed to quiet sparse */
> +	unicode_t uni;
>  
>  	for (i = 0; len && *from; i++, from += charlen, len -= charlen) {
>  
>  		/* works for 2.4.0 kernel or later */
> -		charlen = codepage->char2uni(from, len, &wchar_to[i]);
> +		if (is_utf8) {
> +			charlen = utf8_to_utf32(from, len, &uni);
> +		} else {
> +			charlen = codepage->char2uni(from, len, &wchar_to);
> +			uni = wchar_to;
> +		}
> +
>  		if (charlen < 1) {
>  			cERROR(1,
>  			       ("strtoUCS: char2uni of %d returned %d",
>  				(int)*from, charlen));
>  			/* A question mark */
> -			to[i] = cpu_to_le16(0x003f);
> +			wchar_to = 0x003f;
>  			charlen = 1;
> -		} else
> -			to[i] = cpu_to_le16(wchar_to[i]);
> -
> +		} else if (uni < 0x10000) {

	"uni" will be unintialized here if is_utf8 is false.

> +			wchar_to = uni;
> +		} else if (uni < 0x110000) {
> +			uni -= 0x10000;
> +			to[i++] = cpu_to_le16(0xD800 | (uni >> 10));
> +			wchar_to = 0xDC00 | (uni & 0x3FF);
> +		} else {
> +			cERROR(1,
> +			       ("strtoUCS: char2uni of %d returned %d",
> +				(int)*from, charlen));
> +			wchar_to = 0x003f;
> +		}
> +		to[i] = cpu_to_le16(wchar_to);
>  	}
>  
>  	to[i] = 0;
> 
> Signed-off-by: "Frediano Ziglio" <frediano.ziglio@citrix.com>
> 
> Regards,
>   Frediano
> 

The basic idea looks ok, but I agree that this could use some more
commenting and/or some #define'd constants. Does the conversion of on
the wire characters to the local charset need similar work?
diff mbox

Patch

diff -r c2325d754e8d fs/cifs/cifs_unicode.c
--- a/fs/cifs/cifs_unicode.c	Fri Jul 27 15:12:23 2012 +0100
+++ b/fs/cifs/cifs_unicode.c	Fri Jul 27 19:09:04 2012 +0100
@@ -192,22 +192,40 @@  cifs_strtoUCS(__le16 *to, const char *fr
 {
 	int charlen;
 	int i;
-	wchar_t *wchar_to = (wchar_t *)to; /* needed to quiet sparse */
+	int is_utf8 = !strcmp(codepage->charset, "utf8");
+	wchar_t wchar_to; /* needed to quiet sparse */
+	unicode_t uni;
 
 	for (i = 0; len && *from; i++, from += charlen, len -= charlen) {
 
 		/* works for 2.4.0 kernel or later */
-		charlen = codepage->char2uni(from, len, &wchar_to[i]);
+		if (is_utf8) {
+			charlen = utf8_to_utf32(from, len, &uni);
+		} else {
+			charlen = codepage->char2uni(from, len, &wchar_to);
+			uni = wchar_to;
+		}
+
 		if (charlen < 1) {
 			cERROR(1,
 			       ("strtoUCS: char2uni of %d returned %d",
 				(int)*from, charlen));
 			/* A question mark */
-			to[i] = cpu_to_le16(0x003f);
+			wchar_to = 0x003f;
 			charlen = 1;
-		} else
-			to[i] = cpu_to_le16(wchar_to[i]);
-
+		} else if (uni < 0x10000) {
+			wchar_to = uni;
+		} else if (uni < 0x110000) {
+			uni -= 0x10000;
+			to[i++] = cpu_to_le16(0xD800 | (uni >> 10));
+			wchar_to = 0xDC00 | (uni & 0x3FF);
+		} else {
+			cERROR(1,
+			       ("strtoUCS: char2uni of %d returned %d",
+				(int)*from, charlen));
+			wchar_to = 0x003f;
+		}
+		to[i] = cpu_to_le16(wchar_to);
 	}
 
 	to[i] = 0;