Post by Lilian Pigallio
But another question: how to detect UTF-8 strings?
IsTextUnicode seems to work only with UTF-16 strings (it always returns
false with UTF-8 strings).
Lilian.
Given an arbitrary array of bytes, there's no way to tell for certain whether
it contains meaningful text encoded as UTF-8. You may notice that the
description of IsTextUnicode() says that it is not ***guaranteed*** to give
you the right answer. It can only make a good guess based on statistical
information, or can tell you that all 16-bit characters look like UNICODE
versions of ANSI characters.
With UTF-8, it's even more difficult. The only thing you know for sure is
that a zero-terminated UTF-8 string ends at the first byte with a value of 0.
(Also, the byte before the 0 can't be a lead byte, i.e. its value must be
< 0xC0: every byte of a multibyte UTF-8 character has its high bit set, so a
multibyte character can't contain a 0 byte, and a lead byte must be followed
by at least one continuation byte in the range 0x80 to 0xBF.)
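The structural rules for multibyte characters can at least be checked mechanically. Here is a minimal sketch in C, assuming you know the buffer length; the function name looks_like_utf8 is made up for this post and is not part of the Win32 API. It only tests whether the bytes are well-formed UTF-8. A "true" result is still a guess, not proof: pure ASCII data, and even ASCII text stored as UTF-16 (with its embedded 0 bytes), passes the same test.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if buf[0..len) is well-formed UTF-8.
   This is a plausibility check only; it cannot prove the data is text. */
bool looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;       /* number of continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */

        if (b < 0x80) {                    /* single-byte (ASCII) character */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {   /* lead byte of a 2-byte sequence */
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {   /* lead byte of a 3-byte sequence */
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {   /* lead byte of a 4-byte sequence */
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;                  /* stray continuation or invalid byte */
        }

        if (extra > len - i - 1)
            return false;                  /* sequence runs past the buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return false;              /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return false;                  /* overlong encoding */
        if (cp > 0x10FFFF)
            return false;                  /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;                  /* surrogates are not valid in UTF-8 */

        i += 1 + extra;
    }
    return true;
}
```

In practice the useful direction is the negative one: text in a legacy 8-bit code page that contains accented characters usually fails this check, so a "false" result rules UTF-8 out fairly reliably.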
Do you know anything else about the string? Do you know its length, at
least?
A simple example to show the problem: A byte array with values 0x41, 0x00,
0x42, 0x00, 0x00, 0x00 could be interpreted as a UTF-8 string "A"
(terminating with the first 0 byte), or as a UNICODE (UTF-16, little-endian)
string "AB" (terminating with the first double-zero). Which do you pick?
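To make the ambiguity concrete, here is a small sketch in C that measures the same buffer both ways. The helper names len_as_bytes and len_as_utf16le are invented for this illustration, and the 16-bit reading assumes little-endian byte order, as on Windows. Note that 0x41 and 0x42 are the codes for 'A' and 'B'.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes before the first 0x00 byte,
   i.e. the length if the buffer is read as an 8-bit (UTF-8) string. */
size_t len_as_bytes(const unsigned char *buf, size_t size)
{
    size_t n = 0;
    while (n < size && buf[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: number of 16-bit units before the first 0x0000 unit,
   i.e. the length if the buffer is read as a UTF-16 string.
   A unit is zero only when both of its bytes are zero, so byte order
   does not affect the count, only the characters you would decode. */
size_t len_as_utf16le(const unsigned char *buf, size_t size)
{
    size_t n = 0;
    while (2 * n + 1 < size && (buf[2 * n] | buf[2 * n + 1]) != 0)
        n++;
    return n;
}
```

For the six bytes above, len_as_bytes gives 1 and len_as_utf16le gives 2. Both answers are valid for their own interpretation, and nothing in the bytes themselves says which one was intended.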
Carl