Articles tagged UTF-8

How do I read and write UTF-8 text files in C?

I want to read UTF-8 text from a text file and then print part of it to another file. I am using Linux and the gcc compiler. This is the code I am using:

    #include <stdio.h>
    #include <stdlib.h>

    int main(){
        FILE *fin;
        FILE *fout;
        int character;
        fin=fopen("in.txt", "r");
        fout=fopen("out.txt","w");
        while((character=fgetc(fin))!=EOF){
            putchar(character); // It displays the right character (UTF8) in the terminal
            fprintf(fout,"%c ",character); // It displays weird characters in the file
        }
        fclose(fin);
        fclose(fout);
        printf("\nFile has been created…\n");
        return 0;
    }

At the moment it works for English characters.
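
A likely culprit in the code above is the format string "%c ", which writes a space after every byte and so splits each multi-byte UTF-8 sequence. The following is a minimal Python sketch of that effect, not the asker's program: `copy_stream` and `copy_with_spaces` are hypothetical names, and the sample text is made up for illustration.

```python
import io


def copy_stream(fin, fout, chunk_size=4096):
    """Copy a binary stream unchanged, so multi-byte sequences stay intact."""
    while True:
        chunk = fin.read(chunk_size)
        if not chunk:
            break
        fout.write(chunk)


def copy_with_spaces(data):
    """Mimic fprintf(fout, "%c ", c): emit a space after every single byte."""
    return b"".join(bytes([b]) + b" " for b in data)


# Illustrative sample only; any text with non-ASCII characters would do.
sample = "日本語 text".encode("utf-8")
out = io.BytesIO()
copy_stream(io.BytesIO(sample), out)
```

With the unchanged copy, `out.getvalue()` equals the input bytes. The output of `copy_with_spaces(sample)` no longer decodes as UTF-8, because a space now sits between a lead byte and its continuation bytes.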

Why does UTF-8 text sort in a different order on OS X and Linux?

I have a text file containing lines of UTF-8 encoded text:

    mac-os-x$ cat unsorted.txt
    ウ
    foo
    チ
    'foo'
    津

In case it helps to reproduce the problem, here are a checksum and a dump of the exact bytes in the file, along with a way to generate the file yourself (on Linux, use base64 -d instead of -D):

    mac-os-x$ shasum unsorted.txt
    a6d0b708d3e0cafb0c6e1af7450e9243da8cb078 unsorted.txt
    mac-os-x$ perl -ne 'print join(" ", map { sprintf "%02x", ord } split //), "\n"' unsorted.txt
    e3 82 a6 0a
    66 6f 6f 0a
    e3 83 81 0a
    27 66 6f 6f 27 0a
    e6 b4 a5 0a
    […]
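
The sort utility collates according to the current locale, and collation data can differ between systems. Comparing raw bytes, which is what running sort under LC_ALL=C does, is one way to get the same order everywhere. Here is a small Python sketch of that byte-wise comparison, using the five lines from the dump above; `sort_by_utf8_bytes` is a hypothetical helper name.

```python
def sort_by_utf8_bytes(lines):
    """Sort strings by their UTF-8 byte sequences, ignoring any locale."""
    return sorted(lines, key=lambda s: s.encode("utf-8"))


# The five lines of unsorted.txt, in file order.
lines = ["ウ", "foo", "チ", "'foo'", "津"]
ordered = sort_by_utf8_bytes(lines)
```

Here `ordered` puts `'foo'` (leading byte 0x27) first, then `foo` (0x66), then the two katakana lines (0xe3 ...), then `津` (0xe6 ...). This order is reproducible but is not a linguistically meaningful collation.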

How do I detect the display width of a Unicode string in a terminal?

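I am developing a terminal-based program that supports Unicode. In some cases I need to determine how many terminal columns a string will occupy before I print it. Unfortunately, some characters are two columns wide (Chinese, etc.), but I found this answer, which suggests that a good way to detect full-width characters is to call u_getIntPropertyValue() from the ICU library. Now I am trying to parse the characters of my UTF-8 string and pass them to this function. The problem I am running into is that u_getIntPropertyValue() expects a UTF-32 code point. What is the best way to get this from a UTF-8 string? I am currently trying to do it with boost::locale (which is used elsewhere in my program), but I cannot get a clean conversion. The UTF-32 string I get from boost::locale is prefixed with a zero-width character that indicates the byte order. Obviously I can skip the first four bytes of the string, but is there a cleaner way to do this? Here is my current ugly solution:

    inline size_t utf8PrintableSize(const std::string &str, std::locale loc) {
        namespace ba = boost::locale::boundary;
        ba::ssegment_index map(ba::character, str.begin(), str.end(), loc);
        size_t widthCount = 0;
        for (ba::ssegment_index::iterator it = map.begin(); it != map.end(); ++it) {
            ++widthCount;
            std::string utf32Char = boost::locale::conv::from_utf(it->str(), std::string("utf-32"));
            UChar32 utf32Codepoint = […]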
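
The idea of decoding UTF-8 into code points and looking up each one's East Asian Width property can be sketched in Python. This is not the ICU or Boost code from the question: Python's `unicodedata` module stands in for ICU, and `display_width` is a hypothetical name. It is a simplification that counts wide and full-width characters as two columns, combining marks as zero, and everything else (including control characters) as one.

```python
import unicodedata


def display_width(utf8_bytes):
    """Estimate how many terminal columns a UTF-8 byte string occupies."""
    text = utf8_bytes.decode("utf-8")  # yields code points, no BOM involved
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks attach to the previous character
        # "W" (wide) and "F" (full-width) characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width
```

For example, `display_width("中文".encode("utf-8"))` counts two wide characters, and an `e` followed by a combining acute accent counts as a single column.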

Strange PHP UTF-8 behavior

I have the following PHP test code:

    header('Content-type: text/html; charset=utf-8');
    $text = 'Développeur Web';
    var_dump($text);
    $text = preg_replace('#[^\\pL\d]+#u', '-', $text);
    var_dump($text);
    $text = trim($text, '-');
    var_dump($text);
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    var_dump($text);
    $text = strtolower($text);
    var_dump($text);
    $text = preg_replace('#[^-\w]+#', '', $text);
    var_dump($text);

It works as expected on my local machine:

    string(16) "Développeur Web"
    string(16) "Développeur-Web"
    string(16) "Développeur-Web"
    string(16) "D'eveloppeur-Web"
    string(16) "d'eveloppeur-web"
    string(15) "developpeur-web"

But on my live server it behaves strangely:

    string 'Développeur Web' (length=16)
    […]
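
The excerpt is cut off before the server output, so the actual cause is not visible here. One thing to keep in mind is that the result of iconv's //TRANSLIT depends on the iconv implementation and the active locale. The following Python sketch shows a locale-independent route to the same slug, swapping in Unicode decomposition for transliteration; `slugify` is a hypothetical name and this is not the PHP code from the question.

```python
import re
import unicodedata


def slugify(text):
    """Build an ASCII slug without relying on locale-dependent transliteration."""
    # Decompose accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, then anything still outside ASCII.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    ascii_only = stripped.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of non-alphanumerics into single hyphens.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", ascii_only).strip("-")
    return slug.lower()
```

Called on the string from the question, `slugify("Développeur Web")` gives the same final value as the local-machine run. Letters with no decomposition (for example `ø`) are simply dropped by this sketch, which a real transliteration table would handle differently.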

Why can't iconv convert from UTF-8 to ISO-8859-1?

My system is SUSE Linux Enterprise Server 11. I am trying to convert data from UTF-8 to ISO-8859-1 using iconv:

    $>file test.utf8
    test.utf8: UTF-8 Unicode text, with very long lines
    $>
    $>file -i test.utf8
    test.utf8: text/plain charset=utf-8
    $>
    $>iconv -f UTF-8 -t ISO-8859-1 test.utf8 > test.iso
    iconv: test.utf8:20:105: cannot convert

Could you help me with this? Thanks.
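
A "cannot convert" error of this kind usually means the input contains a character that has no representation in the target character set. A small Python sketch for listing such characters follows; `find_unconvertible` is a hypothetical helper, and its column numbers count characters, which may not match the offset iconv reports.

```python
def find_unconvertible(text, target="iso-8859-1"):
    """Return (line, column, character) for each character the target
    encoding cannot represent. Lines and columns are 1-based."""
    problems = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            try:
                ch.encode(target)
            except UnicodeEncodeError:
                problems.append((line_no, col, ch))
    return problems
```

For instance, `é` exists in ISO-8859-1 but the euro sign does not, so a file containing `€` would be reported while one containing only `é` would not.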

UTF-8 file name reported as not found in a Linux terminal

I am having a problem in the Linux (Ubuntu) terminal with some files that have accents in their names. For example:

    $ ls dir/
    criação.png

So the terminal lists the file, which means it exists. Now let's check whether the file exists with this simple command:

    $ [ -f criação.png ] && echo "File Exist" || echo "Not Exist"
    Not Exist

As you can see, the result is "Not Exist". Now, I have the same folder and file on OS X, and when I run the same command there it returns this:

    $ [ -f criação.png ] && echo "File Exist" || echo "Not Exist"
    File Exist

Here is some information about my locale:

    $ locale
    LANG=en_US.UTF-8
    LANGUAGE=
    LC_CTYPE=en_US.UTF-8
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    […]
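
The excerpt does not show the cause, but one common source of this symptom is Unicode normalization. File names created on OS X are often stored in decomposed form (NFD), while text typed in a Linux terminal is usually precomposed (NFC). The two look identical on screen but are different byte sequences. A minimal Python sketch of the difference, with `same_name` as a hypothetical helper:

```python
import unicodedata


def same_name(a, b):
    """Compare two file names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The same visible name in precomposed and decomposed form.
nfc = unicodedata.normalize("NFC", "criação.png")
nfd = unicodedata.normalize("NFD", "criação.png")
```

Here `nfc` and `nfd` differ as strings and as UTF-8 bytes, yet `same_name(nfc, nfd)` treats them as equal. Whether this is what happened to the file in the question is an assumption that a byte dump of the directory listing would confirm or rule out.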

How do I detect in Bash whether a file has a UTF-8 BOM?

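I am trying to write a script that automatically removes UTF-8 BOMs from files. I am having trouble detecting whether a file has one in the first place. Here is my code:

    function has-bom {
        # Test if the file starts with 0xEF, 0xBB, and 0xBF
        head -c 3 "$1" | grep -P '\xef\xbb\xbf'
        return $?
    }

For some reason, head seems to ignore the BOM at the start of the file. For example, running this

    printf '\xef\xbb\xbf' > file
    head -c 3 file

will not print anything. I tried looking for an option in head --help that would let me work around this, but had no luck. Is there anything I can do to make this work?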
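
Note that head does pass the three bytes through. They encode U+FEFF, which a terminal renders as nothing visible, so an empty-looking result is expected. Comparing raw bytes avoids the question of how a tool interprets them. A minimal Python sketch of that comparison (`starts_with_bom`, `file_has_bom` and `strip_bom` are hypothetical names, not part of the asker's script):

```python
import codecs


def starts_with_bom(data):
    """True if the byte string begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def file_has_bom(path):
    """Read only the first three bytes of the file and test them."""
    with open(path, "rb") as f:
        return starts_with_bom(f.read(3))


def strip_bom(data):
    """Return the data without a leading UTF-8 BOM, if one is present."""
    return data[len(codecs.BOM_UTF8):] if starts_with_bom(data) else data
```

Opening the file in binary mode matters here, since a text-mode read would decode the bytes before the comparison.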

How do I detect invalid UTF-8, Unicode, or binary data in a text file?

I need to detect corrupted text files that contain invalid (non-ASCII) UTF-8, Unicode, or binary characters.

    �>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ […]

I have tried:

    iconv -f utf-8 -t utf-8 -c file.csv

This converts the file from UTF-8 to UTF-8, with -c used to skip invalid UTF-8 characters. But in the end those illegal characters were still printed. Is there another solution, either in bash on Linux or in another language?
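
One way to check for invalid UTF-8 is to attempt a strict decode and report where it fails. A minimal Python sketch, with `first_invalid_utf8` as a hypothetical helper name:

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence,
    or None if the whole byte string decodes cleanly."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        return err.start
    return None
```

For a file, the bytes would come from `open(path, "rb").read()`. Be aware that text such as `ï¿½` is usually itself valid UTF-8 (a replacement character that was decoded as Latin-1 and re-encoded), so a validity check alone may not flag it. That would also explain why iconv -c leaves such characters in place.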

What is the Windows equivalent of the en_US.UTF-8 locale?

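If I want the following to work on Windows, what is the correct locale, and how do I detect whether it is actually present: Will this code work universally, or only on my system?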

Java Windows UTF-8 (Unicode) printing

I have this problem: on Windows, when you try to print through Java, you can only use the AUTOSENSE property. But the string I want to print is Greek => UTF-8. When I change AUTOSENSE to TEXT_PLAIN_UTF8 I get a sun.print.PrintJobFlavorException: invalid flavor exception…. Any suggestions? Or is there another way to print Unicode? Thanks!

    String datastr = "UNICODE STRING";
    byte[] databa = null;
    try {
        databa = datastr.getBytes("UTF8");
    } catch (UnsupportedEncodingException e1) {
        e1.printStackTrace();
    }
    DocFlavor docFlavor = DocFlavor.BYTE_ARRAY.TEXT_PLAIN_UTF_16;
    PrintRequestAttributeSet aset = new HashPrintRequestAttributeSet();
    PrintService service = PrintServiceLookup.lookupDefaultPrintService();
    if (databa != null) {
        DocPrintJob pjob = service.createPrintJob();
        Doc doc = new SimpleDoc(databa, […]