How to easily detect UTF-8 encoding in a string?

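I have a string filled with data from another program, and this data may or may not be UTF-8 encoded. If it is not, I can encode it to UTF-8, but what is the best way to detect UTF-8 in C++? I saw this variant https://stackoverflow.com/questions/… but there are comments saying that this solution does not detect UTF-8 100% reliably. So if I encode a string that already contains UTF-8 data to UTF-8 again, I will write wrong text to the database.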

So can I use this UTF-8 detection:

    bool is_utf8(const char * string)
    {
        if(!string)
            return 0;

        const unsigned char * bytes = (const unsigned char *)string;
        while(*bytes)
        {
            if( (// ASCII
                 // use bytes[0] <= 0x7F to allow ASCII control characters
                    bytes[0] == 0x09 ||
                    bytes[0] == 0x0A ||
                    bytes[0] == 0x0D ||
                    (0x20 <= bytes[0] && bytes[0] <= 0x7E)
                )
            ) {
                bytes += 1;
                continue;
            }

            if( (// non-overlong 2-byte
                    (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                    (0x80 <= bytes[1] && bytes[1] <= 0xBF)
                )
            ) {
                bytes += 2;
                continue;
            }

            if( (// excluding overlongs
                    bytes[0] == 0xE0 &&
                    (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF)
                ) ||
                (// straight 3-byte
                    ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                        bytes[0] == 0xEE ||
                        bytes[0] == 0xEF) &&
                    (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF)
                ) ||
                (// excluding surrogates
                    bytes[0] == 0xED &&
                    (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF)
                )
            ) {
                bytes += 3;
                continue;
            }

            if( (// planes 1-3
                    bytes[0] == 0xF0 &&
                    (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                    (0x80 <= bytes[3] && bytes[3] <= 0xBF)
                ) ||
                (// planes 4-15
                    (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                    (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                    (0x80 <= bytes[3] && bytes[3] <= 0xBF)
                ) ||
                (// plane 16
                    bytes[0] == 0xF4 &&
                    (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                    (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                    (0x80 <= bytes[3] && bytes[3] <= 0xBF)
                )
            ) {
                bytes += 4;
                continue;
            }

            return 0;
        }

        return 1;
    }

If the check reports that the string is not UTF-8, this code is used to encode it to UTF-8:

    string text;
    if(!is_utf8(EscReason.c_str()))
    {
        // ANSI code page -> UTF-16
        int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
            text.length(), 0, 0);
        std::wstring utf16_str(size, '\0');
        MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
            text.length(), &utf16_str[0], size);

        // UTF-16 -> UTF-8
        int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
            utf16_str.length(), 0, 0, 0, 0);
        std::string utf8_str(utf8_size, '\0');
        WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
            utf16_str.length(), &utf8_str[0], utf8_size, 0, 0);

        text = utf8_str;
    }

Or is the code above incorrect? I am also doing this on Windows 7. What about Ubuntu? Does this variant work there?

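You may be misunderstanding UTF-8 and the alternatives. A byte has only 256 possible values, which is not many given the number of characters that exist. As a result, many byte sequences are valid UTF-8 strings and, at the same time, valid strings in other encodings.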

In fact, every ASCII string is, by design, a valid UTF-8 string with essentially the same meaning. Your code would return true for is_utf8("Hello").

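Even many other non-UTF-8, non-ASCII strings share a byte sequence with valid UTF-8 strings. And there is no way to convert a non-UTF-8 string to UTF-8 without knowing which non-UTF-8 encoding it uses. Even Latin-1 and Latin-2 are already quite different from each other. CP_ACP is even worse than Latin-1: CP_ACP is not even the same everywhere.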
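
To make the ambiguity concrete, here is a minimal sketch. The helper name `latin1_to_utf8` is hypothetical and not part of the question's code. It treats every input byte as one Latin-1 character and re-encodes it as UTF-8, which is what a "convert if unsure" step effectively does when the source encoding is guessed to be Latin-1.

```cpp
#include <string>

// Interpret each byte as one Latin-1 (ISO-8859-1) character and re-encode it
// as UTF-8. Latin-1 maps byte value N directly to code point U+00NN, so any
// byte sequence at all is "valid" Latin-1 and this function never fails.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII: unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte 110xxxxx
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation 10xxxxxx
        }
    }
    return out;
}
```

The single Latin-1 byte `"\xE9"` (é) becomes `"\xC3\xA9"`, the correct UTF-8 for é. Feeding those two bytes through the function again yields `"\xC3\x83\xC2\xA9"`, which displays as "Ã©". The bytes `C3 A9` are valid in both encodings, so nothing in the data itself says which reading was intended.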

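Your text must go into the database as UTF-8. So if it is not already UTF-8, it must be converted, and you must know the exact source encoding. There is no magic way around this.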

On Linux, iconv is the usual way to convert between two encodings.
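
As a rough sketch of what that might look like from C++, the wrapper below calls the POSIX `iconv_open`, `iconv` and `iconv_close` functions. The function name `convert_encoding` and the 256-byte buffer size are illustrative choices, and the code assumes the glibc signature, where the input buffer argument is `char **`. The caller still has to supply the real source encoding name.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert `input` from encoding `from` to encoding `to` (names as accepted by
// iconv_open, e.g. "ISO-8859-1", "UTF-8"). Throws on unsupported encodings or
// on input that is invalid in the source encoding.
std::string convert_encoding(const std::string& input, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // note the argument order: target first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported conversion");

    std::string output;
    char buffer[256];
    char* in_ptr = const_cast<char*>(input.data());  // iconv does not modify the input
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = buffer;
        std::size_t out_left = sizeof buffer;
        std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        int err = errno;  // save before any other call can overwrite it
        output.append(buffer, sizeof buffer - out_left);
        if (rc == (std::size_t)-1 && err != E2BIG) {  // E2BIG: buffer full, loop again
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }
    iconv_close(cd);
    return output;
}
```

For example, `convert_encoding("\xE9", "ISO-8859-1", "UTF-8")` should yield the two bytes `C3 A9`. Stateful target encodings would also need a final flush call with a null input pointer, which this sketch omits because UTF-8 is stateless.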

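Comparing whole byte values is not the correct way to detect UTF-8. You have to analyze the actual bit pattern of each byte. UTF-8 uses a very distinctive bit pattern that no other encoding uses. Try something more like this instead: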

    bool is_utf8(const char * string)
    {
        if (!string)
            return true;

        const unsigned char * bytes = (const unsigned char *)string;
        int num;

        while (*bytes != 0x00)
        {
            if ((*bytes & 0x80) == 0x00)
            {
                // U+0000 to U+007F
                num = 1;
            }
            else if ((*bytes & 0xE0) == 0xC0)
            {
                // U+0080 to U+07FF
                num = 2;
            }
            else if ((*bytes & 0xF0) == 0xE0)
            {
                // U+0800 to U+FFFF
                num = 3;
            }
            else if ((*bytes & 0xF8) == 0xF0)
            {
                // U+10000 to U+10FFFF
                num = 4;
            }
            else
                return false;

            bytes += 1;
            for (int i = 1; i < num; ++i)
            {
                if ((*bytes & 0xC0) != 0x80)
                    return false;
                bytes += 1;
            }
        }

        return true;
    }

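Now, this does not take into account illegal UTF-8 sequences, such as overlong encodings, UTF-16 surrogates, and code points above U+10FFFF. If you want to make sure the UTF-8 is both valid and correct, you need something more like this: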

    bool is_valid_utf8(const char * string)
    {
        if (!string)
            return true;

        const unsigned char * bytes = (const unsigned char *)string;
        unsigned int cp;
        int num;

        while (*bytes != 0x00)
        {
            if ((*bytes & 0x80) == 0x00)
            {
                // U+0000 to U+007F
                cp = (*bytes & 0x7F);
                num = 1;
            }
            else if ((*bytes & 0xE0) == 0xC0)
            {
                // U+0080 to U+07FF
                cp = (*bytes & 0x1F);
                num = 2;
            }
            else if ((*bytes & 0xF0) == 0xE0)
            {
                // U+0800 to U+FFFF
                cp = (*bytes & 0x0F);
                num = 3;
            }
            else if ((*bytes & 0xF8) == 0xF0)
            {
                // U+10000 to U+10FFFF
                cp = (*bytes & 0x07);
                num = 4;
            }
            else
                return false;

            bytes += 1;
            for (int i = 1; i < num; ++i)
            {
                if ((*bytes & 0xC0) != 0x80)
                    return false;
                cp = (cp << 6) | (*bytes & 0x3F);
                bytes += 1;
            }

            if ((cp > 0x10FFFF) ||
                ((cp >= 0xD800) && (cp <= 0xDFFF)) ||
                ((cp <= 0x007F) && (num != 1)) ||
                ((cp >= 0x0080) && (cp <= 0x07FF) && (num != 2)) ||
                ((cp >= 0x0800) && (cp <= 0xFFFF) && (num != 3)) ||
                ((cp >= 0x10000) && (cp <= 0x1FFFFF) && (num != 4)))
                return false;
        }

        return true;
    }
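
Since the question mentions data arriving in a std::string, a length-aware variant may also be useful. The pointer-based version stops at the first NUL byte, whereas the sketch below checks exactly `s.size()` bytes. The name `is_valid_utf8_bytes` and the `min_cp` lookup table are illustrative assumptions. The checks themselves are the same ones described above: overlong encodings, UTF-16 surrogates, and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Length-aware validation sketch: an embedded NUL byte (U+0000) is treated
// as ordinary data rather than as a terminator.
bool is_valid_utf8_bytes(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        unsigned int cp;
        std::size_t len;
        if      ((lead & 0x80) == 0x00) { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else return false;  // stray continuation byte or invalid lead byte

        if (len > n - i) return false;  // sequence truncated at end of data
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a 10xxxxxx byte
            cp = (cp << 6) | (c & 0x3F);
        }

        // Smallest code point that legitimately needs `len` bytes.
        static const unsigned int min_cp[] = {0, 0x00, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

With this sketch, `"\xC0\x80"` (an overlong NUL), `"\xED\xA0\x80"` (the surrogate U+D800) and `"\xF4\x90\x80\x80"` (U+110000) are all rejected, while `"\xC3\xA9"` (é) is accepted. A passing result still only means the bytes are well-formed UTF-8, not that UTF-8 was the encoding the sender intended.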