Dealing with unicodes in strings


Graham Cox
 

I’m converting some NSData to an NSString using UTF-8 encoding, which is what I believe it should be.

But the strings sometimes end up with embedded codes that are not converted, like @"this is a string\u002D with codes in it"

What is the proper way to deal with this? I tried the various ‘canonical’ and ‘compatibility’ mapping methods, but they do nothing with this; since I don’t really know what they do, that’s no surprise. Is there a method that will just deal with this?

—Graham


Glenn L. Austin
 

Is it possible that the code "\u002D" is in the string as the six characters? It is just the minus sign, but could the source have encoded certain characters so they wouldn't be accidentally interpreted?

-- 
Glenn L. Austin, Computer Wizard and Race Car Driver         <><
<http://www.austinsoft.com>


Quincey Morris
 

On May 31, 2019, at 07:51, Glenn L. Austin <glenn@...> wrote:

Is it possible that the code "\u002D" is in the string as the six characters? It is just the minus sign, but could the source have encoded certain characters so they wouldn't be accidentally interpreted?

That’s what I was thinking too.

This would easily be resolved if we could see the bytes of the NSData in hex.
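
A minimal sketch of how the bytes could be inspected, assuming the page data is already in an NSData. The helper name HexDump is made up for illustration; it is not a Foundation API.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the bytes of `data` as space-separated
// lowercase hex pairs, e.g. "61 5c 75".
static NSString *HexDump(NSData *data) {
    const uint8_t *bytes = data.bytes;
    NSMutableString *hex = [NSMutableString stringWithCapacity:data.length * 3];
    for (NSUInteger i = 0; i < data.length; i++) {
        if (i > 0) {
            [hex appendString:@" "];
        }
        [hex appendFormat:@"%02x", bytes[i]];
    }
    return hex;
}
```

If the data really contains the six characters, the dump shows 5c 75 30 30 32 44 (backslash, u, 0, 0, 2, D) where a decoded hyphen-minus would be the single byte 2d. For example, HexDump applied to the UTF-8 bytes of @"a\\u002Db" gives "61 5c 75 30 30 32 44 62".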


Graham Cox
 

Yes, I think it’s putting these 6 characters into the string.

The original data is an HTML page, and these strings come from some embedded JavaScript on the page - I’m scraping the page to extract specific bits of information, and it generally works OK, except for this minor formatting issue. Though the page declares it is using UTF-8 encoding, I’m wondering if that applies even to embedded JavaScript strings - perhaps they need to be treated as C strings?

I can write some code to deal with it, but it just seems like something NSString can already do.

—Graham



Roland King
 

I don’t really see how NSString would be able to deal with that; it has no way of knowing there are embedded escape sequences in what is otherwise UTF-8. This looks like a fairly standard web encoding of ‘special characters’, which JavaScript is especially fond of. I would just detect \u and treat the next four hex digits as a UTF-16 code unit. You might find an NSString extension or another class that deals with web encodings and does it seamlessly, but frankly, if that’s all there is, I’d just deal with it by hand.


Jens Alfke
 

Those are JavaScript escape sequences. If you’re reading raw JS string literals out of the page, you need to decode all the escapes, which are like the C ones plus \uxxxx.

This isn’t anything to do with NSString; NSJSONSerialization could probably decode it since JSON string syntax is based on JS.

—Jens
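
A rough sketch of the NSJSONSerialization idea, assuming you have already extracted the text between the quotes of the JavaScript string literal. The function name DecodeEscapesViaJSON is hypothetical. JSON only defines a subset of the JavaScript escapes (\x41, \', \v and \u{...} are not JSON), and an unescaped double quote in the body makes the wrapped text invalid, so this returns nil for such input rather than guessing.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decodes the body of a JavaScript string literal
// (the text between the quotes) by parsing it as a JSON string.
// Returns nil if the wrapped text is not valid JSON.
static NSString *DecodeEscapesViaJSON(NSString *literalBody) {
    // Wrap the body in quotes inside an array so the parser sees a
    // complete JSON document with a container at the top level.
    NSString *json = [NSString stringWithFormat:@"[\"%@\"]", literalBody];
    NSData *data = [json dataUsingEncoding:NSUTF8StringEncoding];
    if (data == nil) {
        return nil;
    }

    NSError *error = nil;
    id parsed = [NSJSONSerialization JSONObjectWithData:data options:0 error:&error];
    if (![parsed isKindOfClass:[NSArray class]]) {
        return nil;
    }

    NSArray *array = parsed;
    id first = array.firstObject;
    if (array.count != 1 || ![first isKindOfClass:[NSString class]]) {
        return nil;
    }
    return first;
}
```

With the string from the original question, DecodeEscapesViaJSON(@"this is a string\\u002D with codes in it") should come back with a plain hyphen-minus in place of the escape. Whether this is good enough depends on which escapes the scraped page actually uses.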


Graham Cox
 

OK, makes sense. A simple NSString category using NSScanner internally makes it easy enough.

Thanks for the help,

—Graham
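
For the archives, a sketch of what such a category might look like. This is not Graham's actual code; the category and method names are invented for illustration. It only handles \uXXXX with exactly four hex digits, copies anything malformed through unchanged, and ignores the other JavaScript escapes (\n, \\, \x41 and so on), so an escaped backslash followed by u would be misread.

```objc
#import <Foundation/Foundation.h>

// Hypothetical category; the names are illustrative only.
@interface NSString (SketchUnicodeEscapes)
- (NSString *)sketch_stringByDecodingUnicodeEscapes;
@end

@implementation NSString (SketchUnicodeEscapes)

- (NSString *)sketch_stringByDecodingUnicodeEscapes {
    NSScanner *scanner = [NSScanner scannerWithString:self];
    scanner.charactersToBeSkipped = nil;   // do not swallow whitespace

    NSCharacterSet *nonHex = [[NSCharacterSet characterSetWithCharactersInString:
                               @"0123456789abcdefABCDEF"] invertedSet];
    NSMutableString *result = [NSMutableString stringWithCapacity:self.length];

    while (!scanner.isAtEnd) {
        // Copy everything up to the next "\u" (or to the end of the string).
        NSString *chunk = nil;
        if ([scanner scanUpToString:@"\\u" intoString:&chunk]) {
            [result appendString:chunk];
        }
        if (scanner.isAtEnd) {
            break;
        }

        NSUInteger loc = scanner.scanLocation;   // index of the backslash
        if (loc + 6 <= self.length) {
            NSString *digits = [self substringWithRange:NSMakeRange(loc + 2, 4)];
            unsigned int value = 0;
            if ([digits rangeOfCharacterFromSet:nonHex].location == NSNotFound &&
                [[NSScanner scannerWithString:digits] scanHexInt:&value]) {
                // Four hex digits form one UTF-16 code unit.
                unichar unit = (unichar)value;
                [result appendString:[NSString stringWithCharacters:&unit length:1]];
                scanner.scanLocation = loc + 6;
                continue;
            }
        }

        // Not a well-formed escape: keep the "\u" as it is and move on.
        [result appendString:@"\\u"];
        scanner.scanLocation = loc + 2;
    }
    return result;
}

@end
```

Usage would be along the lines of [scrapedString sketch_stringByDecodingUnicodeEscapes]. Note that scanLocation and scanHexInt: are the older NSScanner API; newer SDKs may warn and suggest the currentIndex-based replacements.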
