How do I read UTF-16 strings from a file on Linux using POSIX methods?

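I have a file containing UTF-16 strings that I would like to read into a Linux program. The strings were written raw from Windows' internal WCHAR format. (Does Windows always use UTF-16? For example, in Japanese versions?)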

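I believe I can read them with raw reads and then convert them using wcstombs_l. However, I cannot figure out which locale to use. Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.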

Is there a better way?

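Update: The accepted answer and the others below pointed me to libiconv. Here is the function I am using to do the conversion. I currently have it inside a class that turns each conversion into one line of code.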

    #include <errno.h>
    #include <iconv.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    // Function for converting wchar_t* to char*. (Really: UTF-16LE --> UTF-8)
    // It will allocate the space needed for dest. The caller is
    // responsible for freeing the memory.
    // NOTE: wcslen() counts wchar_t units, so the length computed below is
    // only right where wchar_t is 16 bits wide.
    static int iwcstombs_alloc(char **dest, const wchar_t *src)
    {
        iconv_t cd;
        const char from[] = "UTF-16LE";
        const char to[] = "UTF-8";

        cd = iconv_open(to, from);
        if (cd == (iconv_t)-1)
        {
            printf("iconv_open(\"%s\", \"%s\") failed: %s\n",
                   to, from, strerror(errno));
            return -1;
        }

        // How much space do we need?
        // Guess that we need the same amount of space as used by src.
        // TODO: There should be a while loop around this whole process
        //       that detects insufficient memory space and reallocates
        //       more space.
        int len = sizeof(wchar_t) * (wcslen(src) + 1);
        //printf("len = %d\n", len);

        // Allocate space
        int destLen = len * sizeof(char);
        *dest = (char *)malloc(destLen);
        if (*dest == NULL)
        {
            iconv_close(cd);
            return -1;
        }

        // Convert
        size_t inBufBytesLeft = len;
        char *inBuf = (char *)src;
        size_t outBufBytesLeft = destLen;
        char *outBuf = (char *)*dest;

        // iconv() returns a size_t, and (size_t)-1 signals an error.
        size_t rc = iconv(cd,
                          &inBuf, &inBufBytesLeft,
                          &outBuf, &outBufBytesLeft);
        if (rc == (size_t)-1)
        {
            printf("iconv() failed: %s\n", strerror(errno));
            iconv_close(cd);
            free(*dest);
            *dest = NULL;
            return -1;
        }

        iconv_close(cd);
        return 0;
    } // iwcstombs_alloc()

(Does Windows always use UTF-16? For example, in Japanese versions?)

Yes, NT's WCHAR is always UTF-16LE.

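(The "system codepage", which for Japanese installs is indeed cp932/Shift-JIS, still exists in NT for the benefit of the many applications that aren't Unicode-native, FAT32 paths, and so on.)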

However, wchar_t is not guaranteed to be 16 bits, and on Linux it won't be: UTF-32 (UCS-4) is used there. So wcstombs_l is unlikely to be happy.

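The Right Thing would be to use a library like iconv to read the data into whichever format you are using internally, presumably wchar_t. You could try to hack it yourself by poking bytes in, but you would probably get things like the surrogates wrong.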
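As a rough illustration of reading UTF-16LE into wchar_t with iconv, here is a minimal sketch. The helper name utf16le_to_wcs is made up for this example, and it assumes the iconv implementation accepts the "WCHAR_T" encoding name (glibc and GNU libiconv do, but that is worth checking on other platforms).

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert nbytes of UTF-16LE data into a newly
 * allocated, NUL-terminated wchar_t string. Returns NULL on any error.
 * Assumes the iconv implementation knows the "WCHAR_T" encoding name. */
wchar_t *utf16le_to_wcs(const char *src, size_t nbytes)
{
    assert(src != NULL);

    iconv_t cd = iconv_open("WCHAR_T", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    /* Each 2-byte code unit yields at most one wchar_t (a surrogate pair
     * collapses to one where wchar_t is 32 bits), plus one terminator. */
    size_t cap = nbytes / 2 + 1;
    wchar_t *dst = calloc(cap, sizeof *dst);   /* zero-filled: terminator included */
    if (dst == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in = (char *)src;
    char *out = (char *)dst;
    size_t inleft = nbytes;
    size_t outleft = (cap - 1) * sizeof *dst;

    size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1) {                    /* invalid or truncated input */
        free(dst);
        return NULL;
    }
    return dst;
}
```

The caller would read the raw bytes from the file, pass them in with their byte count, and free() the result. Letting iconv handle the surrogate pairs is the whole point of not poking bytes in by hand.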

Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.

Indeed, Linux can't use UTF-16 as a locale default encoding, thanks to all the \0s.
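To make the \0 problem concrete, here is a small sketch (the function names are invented for illustration) showing what a byte-oriented C API sees when handed UTF-16LE text: every ASCII character carries a zero high byte, so anything that treats the buffer as a C string stops after the first character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the zero bytes in a raw buffer of n bytes. */
size_t count_nul_bytes(const unsigned char *buf, size_t n)
{
    assert(buf != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == 0)
            count++;
    }
    return count;
}

/* Length that a char-based API would report: bytes up to the first NUL.
 * The buffer must contain at least one zero byte. */
size_t c_string_view_length(const unsigned char *buf)
{
    assert(buf != NULL);
    return strlen((const char *)buf);
}
```

For the UTF-16LE bytes of "hi" (68 00 69 00), the first function finds two zero bytes in four, and the second reports a length of 1, which is why file names, environment variables, and other char-based interfaces could not carry such a locale encoding.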

The simplest way is to convert the file from UTF-16 to UTF-8, the native UNIX encoding, and then read it:

    iconv -f utf16 -t utf8 file_in.txt -o file_out.txt

You can also use iconv(3) (see man 3 iconv) to convert strings from C. Most other languages have bindings to iconv as well.
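For the C route, one possible shape is sketched below. The helper name utf16le_to_utf8 and the buffer-doubling strategy are choices made for this example, not anything prescribed by iconv(3). It retries when iconv() reports E2BIG, since UTF-8 output can be larger than the UTF-16 input.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper: convert nbytes of UTF-16LE into a malloc'd,
 * NUL-terminated UTF-8 string. Returns NULL on error; caller frees. */
char *utf16le_to_utf8(const char *src, size_t nbytes)
{
    assert(src != NULL);

    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = nbytes + 1;            /* initial guess, grown on E2BIG */
    char *buf = malloc(cap);
    char *in = (char *)src;
    size_t inleft = nbytes;
    size_t used = 0;

    while (buf != NULL) {
        char *out = buf + used;
        size_t outleft = cap - used - 1;   /* keep one byte for the NUL */

        size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
        used = (size_t)(out - buf);

        if (rc != (size_t)-1)
            break;                      /* everything converted */

        if (errno != E2BIG) {           /* EILSEQ or EINVAL: bad input */
            free(buf);
            buf = NULL;
            break;
        }

        /* Output buffer full: double it and continue where iconv stopped. */
        cap *= 2;
        char *bigger = realloc(buf, cap);
        if (bigger == NULL)
            free(buf);
        buf = bigger;
    }

    iconv_close(cd);
    if (buf != NULL)
        buf[used] = '\0';
    return buf;
}
```

Because iconv() advances the input pointer as it goes, the retry picks up at the first unconverted character rather than starting over.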

Then you can use any UTF-8 locale, such as en_US.UTF-8, which is usually the default locale on most Linux distributions.

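You can read the file as binary and do your own quick conversion: http://unicode.org/faq/utf_bom.html#utf16-3 But it is probably safer to use a library (like libiconv) that handles invalid sequences properly.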
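If you do want to try the do-it-yourself route, the core of it might look like the sketch below. The function names are hypothetical, and the error handling is deliberately minimal: it rejects lone or mismatched surrogates and truncated input, which is exactly the part that is easy to get wrong by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from UTF-16LE bytes.
 * Returns the number of bytes consumed (2 or 4), or 0 if the input is
 * truncated or contains an unpaired surrogate. */
size_t utf16le_decode(const unsigned char *p, size_t n, uint32_t *cp)
{
    assert(p != NULL && cp != NULL);
    if (n < 2)
        return 0;

    uint32_t hi = (uint32_t)p[0] | ((uint32_t)p[1] << 8);
    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary BMP code unit */
        *cp = hi;
        return 2;
    }
    if (hi >= 0xDC00 || n < 4)          /* lone low surrogate, or truncated pair */
        return 0;

    uint32_t lo = (uint32_t)p[2] | ((uint32_t)p[3] << 8);
    if (lo < 0xDC00 || lo > 0xDFFF)     /* high surrogate without a low one */
        return 0;

    *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    return 4;
}

/* Encode a code point (up to U+10FFFF) as UTF-8 into out, which must
 * have room for 4 bytes. Returns the number of bytes written. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL && cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

A caller would loop over the file's bytes, calling utf16le_decode until it returns 0 or the input runs out, and feed each code point to utf8_encode. A leading byte order mark (FF FE), if present, would need to be skipped first.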

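I strongly recommend using a Unicode encoding as your program's internal representation. Use either UTF-16 or UTF-8. If you use UTF-16 internally, then obviously no translation is required. If you use UTF-8, you can use a locale with .UTF-8 in its name, such as en_US.UTF-8.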