
Extracting text from MS Word files in Python

For working with MS Word files in Python, there are the Python win32 extensions, which can be used on Windows. How do I do the same on Linux? Is there a library for this?

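You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping the text out of a Word .doc file. It works pretty well for simple documents (obviously, it loses the formatting). It is available through apt, and probably as an RPM, or you can compile it yourself.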
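A minimal sketch of such a subprocess call, assuming antiword is installed and on the PATH. The function name extract_doc_text is made up for illustration, and decoding the output as UTF-8 is an assumption, since antiword's output encoding depends on how it is configured.

```python
import shutil
import subprocess


def extract_doc_text(path, tool="antiword"):
    """Return the text that an external converter prints for a .doc file."""
    executable = shutil.which(tool)
    if executable is None:
        # Fail early with a clear message instead of a cryptic OSError.
        raise FileNotFoundError(f"{tool} is not installed or not on PATH")
    # Pass arguments as a list so the file name is never parsed by a shell.
    result = subprocess.run([executable, path], capture_output=True, check=True)
    # Assumption: the tool writes UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_doc_text("report.doc") would then return the document text as a string, or raise subprocess.CalledProcessError if antiword rejects the file.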

Use the native Python docx module. Here is how to extract all the text from a document:

    import docx

    document = docx.Document(filename)
    # paragraph.text is already a str, so no manual encoding is needed
    doc_text = '\n\n'.join(paragraph.text for paragraph in document.paragraphs)
    print(doc_text)

See the Python DocX site.

Also check out Textract, which pulls out tables and so on.

Parsing XML with regular expressions invokes Cthulhu. Don't do it!

benjamin's answer is a pretty good one. I have just consolidated it…

    import re
    import zipfile

    docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
    # ZipFile.read() returns bytes, so decode before applying the regex
    content = docx.read('word/document.xml').decode('utf-8')
    cleaned = re.sub('<(.|\n)*?>', '', content)
    print(cleaned)

OpenOffice.org can be scripted with Python: see here.

Since OOo can load most MS Word files flawlessly, I'd say that's your best bet.

I know this is an old question, but I was recently trying to find a way to extract text from MS Word files, and the best solution I have found so far is wvLib:

http://wvware.sourceforge.net/

After installing the library, using it in Python is pretty easy:

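    import subprocess

    # wvText writes the extracted text to output_txt_file
    subprocess.run(['wvText', word_file, output_txt_file], check=True)
    with open(output_txt_file) as f:
        out = f.read()

That's it. All we are doing is running wvText, which extracts the text from a Word document into a file, and then reading that file back. After that, the whole text of the Word document is in the out variable, ready to use.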

Hope this helps anyone who runs into a similar problem in the future.

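Take a look at how the doc format works and at creating Word documents using PHP on Linux. The former is especially useful. Abiword is my recommended tool. There are limitations, though: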

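However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.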

(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)

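Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously, to use this in a Qt program you would have to spawn a process for it, but the command line I've hacked together is: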

 unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

So that's:

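unzip -p file.docx: -p == "unzip to stdout"

grep '<w:t': grab only the lines containing '<w:t' (<w:t> is the Word 2007 XML element for text, as far as I can tell)

sed 's/<[^<]*>//g': remove everything inside tags

grep -v '^[[:space:]]*$': remove blank lines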

There is possibly a more efficient way of doing this, but it seems to work for me on the few documents I've tested it with.

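As far as I know, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)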

If your intention is to use purely Python modules without calling a subprocess, you can use the zipfile Python module.

    import zipfile

    content = ""
    # Load DocX into zipfile
    docx = zipfile.ZipFile('/home/whateverdocument.docx')
    # Unpack zipfile
    unpacked = docx.infolist()
    # Find the /word/document.xml file in the package and assign it to variable
    for item in unpacked:
        if item.orig_filename == 'word/document.xml':
            # read() returns bytes; decode so the cleanup step works on text
            content = docx.read(item.orig_filename).decode('utf-8')

Your content string needs to be cleaned up; one way of doing this is:

    # Clean the content string from xml tags for better search
    fullyclean = []
    halfclean = content.split('<')
    for item in halfclean:
        if '>' in item:
            bad_good = item.split('>')
            if bad_good[-1] != '':
                fullyclean.append(bad_good[-1])
    # Assemble a new string with all pure content
    content = " ".join(fullyclean)

But there is surely a more elegant way to clean up the string, probably using the re module. Hope this helps.
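As one possible alternative to hand-rolled tag stripping, here is a minimal sketch that lets the standard library's XML parser do the work. The helper names docx_to_text and make_sample_docx are made up for illustration, and the sample archive is a stripped-down stand-in for a real .docx file, which contains many more parts.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the text of word/document.xml, one line per paragraph."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        runs = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx():
    """Build a tiny in-memory archive that mimics the layout of a .docx."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        '<w:p><w:r><w:t>Hello</w:t></w:r>'
        '<w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


print(docx_to_text(make_sample_docx()))
```

Because the parser understands elements rather than angle brackets, text that is split across several runs is joined back together and paragraph boundaries are kept. Documents with text boxes nest paragraphs inside paragraphs, so this sketch may repeat such text.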

Unoconv might also be a good alternative: http://linux.die.net/man/1/unoconv

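I'm not sure you're going to have much luck without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving.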

@Swati: that's in HTML, which is fine and dandy, but most Word documents aren't so nice!

To read Word 2007 and later files, including .docx files, you can use the python-docx package:

    from docx import Document

    document = Document('existing-document-file.docx')
    document.save('new-file-name.docx')

To read .doc files from Word 2003 and earlier, make a subprocess call to antiword. You need to install antiword first:

 sudo apt-get install antiword

Then just call it from your Python script:

    import os

    input_word_file = "input_file.doc"
    output_text_file = "output_file.txt"
    os.system('antiword %s > %s' % (input_word_file, output_text_file))

Just an option for reading "doc" files without using COM: miette. It should work on any platform.

If you have LibreOffice installed, you can simply call it from the command line to convert the file to text, then load the text into Python.
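A rough sketch of that workflow, assuming the LibreOffice launcher is available on the PATH as soffice (the name and location vary between systems). The function names are made up for illustration, and reading the result as UTF-8 is an assumption about the converter's output.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, binary="soffice"):
    """Build the headless LibreOffice command that converts a file to text."""
    return [
        binary,
        "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def doc_to_text(doc_path, out_dir, binary="soffice"):
    """Convert doc_path to a .txt file in out_dir and return its contents."""
    subprocess.run(build_convert_command(doc_path, out_dir, binary), check=True)
    # LibreOffice keeps the base name and swaps the extension for .txt.
    txt_path = Path(out_dir) / (Path(doc_path).stem + ".txt")
    # Assumption: the converted file can be read as UTF-8.
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Keeping the command construction in its own function makes it easy to inspect or log the exact command line before running it.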

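Is this an old question? I believe that such a thing does not exist. There are only answered and unanswered questions. This one is pretty much unanswered, or half answered if you wish. Methods for reading *.docx (MS Word 2007 and later) documents without using COM interop are all covered. But methods for extracting text from *.doc (MS Word 97-2000) using Python only are lacking. Is this complicated? To do: not really. To understand: well, that's another thing.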

When I didn't find any finished code, I read some format specifications and dug out some algorithms proposed in other languages.

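An MS Word (*.doc) file is an OLE2 compound file. Not to bother you with a lot of unnecessary details, think of it as a file system stored in a file. It actually uses a FAT structure, so the definition holds. (Hm, maybe you could loop-mount it in Linux?) In this way, you can store more files within a file, like pictures etc. The same is done in *.docx by using a ZIP archive instead. There are packages available on PyPI that can read OLE files (olefile, compoundfiles, …). I used the compoundfiles package to open the *.doc file. However, in MS Word 97-2000 the internal subfiles are not XML or HTML but binary files. And as if that were not enough, each contains information about another one, so you have to read at least two of them and unravel the stored information accordingly. To understand it fully, read the PDF document from which I took the algorithm.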
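To illustrate the two container types, here is a small sketch that tells them apart by their leading bytes, using only the standard library. The function name sniff_container is made up for illustration. Note that it identifies only the container: other formats also use OLE2 and ZIP, so a match does not prove the file is a Word document.

```python
import io
import zipfile

# The 8-byte signature at the start of every OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_container(data):
    """Classify raw file bytes as 'ole2', 'zip' or 'unknown'."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"   # container used by *.doc
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip"    # container used by *.docx
    return "unknown"


# Build a tiny ZIP archive in memory to exercise the second branch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<doc/>")

print(sniff_container(buffer.getvalue()))
print(sniff_container(OLE2_MAGIC + bytes(504)))
```

A check like this can be used to decide whether to hand a file to a ZIP-based reader or to an OLE reader such as compoundfiles.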

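The code below was composed very hastily and tested on a small number of files. As far as I can see, it works as intended. Sometimes some gibberish appears, almost always at the end of the text. There can also be some odd characters in between.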

Those of you who just wish to search for text will be happy. Still, I urge anyone who can help to improve this code to do so.

doc2text module:

    """
    This is a Python implementation of the C# algorithm proposed in:
    http://b2xtranslator.sourceforge.net/howtos/How_to_retrieve_text_from_a_binary_doc_file.pdf

    Python implementation author is Dalen Bernaca.
    Code needs refining and probably bug fixing!
    As I am not a C# expert I would like some code rechecks by one.
    Parts of which I am uncertain are:
        * Did the author of the original algorithm use uint32 and int32 correctly when unpacking?
          I copied each occurrence as in the original algorithm.
        * Is the FIB length for MS Word 97 1472 bytes as in MS Word 2000,
          and would it make any difference if it is not?
        * Did I interpret each C# command correctly?
          I think I did!
    """

    from compoundfiles import CompoundFileReader, CompoundFileError
    from struct import unpack

    __all__ = ["doc2text"]

    def doc2text(path):
        text = ""
        cr = CompoundFileReader(path)
        # Load WordDocument stream:
        try:
            f = cr.open("WordDocument")
            doc = f.read()
            f.close()
        except Exception:
            cr.close()
            raise CompoundFileError("The file is corrupted or it is not a Word document at all.")
        # Extract the file information block and piece table stream information from it.
        # The "<" prefix forces 4-byte little-endian integers on every platform:
        fib = doc[:1472]
        fcClx = unpack("<L", fib[0x01a2:0x01a6])[0]
        lcbClx = unpack("<L", fib[0x01a6:0x01a6 + 4])[0]
        tableFlag = (unpack("<L", fib[0x000a:0x000a + 4])[0] & 0x0200) == 0x0200
        tableName = ("0Table", "1Table")[tableFlag]
        # Load piece table stream:
        try:
            f = cr.open(tableName)
            table = f.read()
            f.close()
        except Exception:
            cr.close()
            raise CompoundFileError("The file is corrupt. '%s' piece table stream is missing." % tableName)
        cr.close()
        # Find piece table inside a table stream:
        clx = table[fcClx:fcClx + lcbClx]
        pos = 0
        pieceTable = b""
        lcbPieceTable = 0
        while pos < len(clx):
            # Indexing bytes yields integers, so compare against 2 and 1
            if clx[pos] == 2:
                # This is the piece table, we store it:
                lcbPieceTable = unpack("<l", clx[pos + 1:pos + 5])[0]
                pieceTable = clx[pos + 5:pos + 5 + lcbPieceTable]
                break
            elif clx[pos] == 1:
                # This is the beginning of some other substructure, we skip it:
                pos = pos + 1 + 1 + clx[pos + 1]
            else:
                break
        if not pieceTable:
            raise CompoundFileError("The file is corrupt. Cannot locate a piece table.")
        # Read info about each piece from pieceTable and extract it from the WordDocument stream:
        pieceCount = (lcbPieceTable - 4) // 12
        for x in range(pieceCount):
            cpStart = unpack("<l", pieceTable[x * 4:x * 4 + 4])[0]
            cpEnd = unpack("<l", pieceTable[(x + 1) * 4:(x + 1) * 4 + 4])[0]
            ofsetDescriptor = ((pieceCount + 1) * 4) + (x * 8)
            pieceDescriptor = pieceTable[ofsetDescriptor:ofsetDescriptor + 8]
            fcValue = unpack("<L", pieceDescriptor[2:6])[0]
            isANSII = (fcValue & 0x40000000) == 0x40000000
            fc = fcValue & 0xbfffffff
            cb = cpEnd - cpStart
            enc = ("utf-16-le", "cp1252")[isANSII]
            cb = (cb * 2, cb)[isANSII]
            text += doc[fc:fc + cb].decode(enc, "ignore")
        return "\n".join(text.splitlines())
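The bit handling in the last loop can be tried in isolation. The sketch below mirrors the masks used in doc2text on a single 8-byte piece descriptor. The function name decode_piece_descriptor and the sample bytes are made up for illustration and are not taken from a real document.

```python
import struct


def decode_piece_descriptor(descriptor):
    """Split an 8-byte piece descriptor into (offset, is_ansi)."""
    # Bytes 2..5 hold a little-endian 32-bit value: offset plus flag bits.
    (fc_value,) = struct.unpack("<L", descriptor[2:6])
    # Bit 30 marks a piece stored as 8-bit (cp1252) text rather than UTF-16.
    is_ansi = bool(fc_value & 0x40000000)
    # Clear bit 30, the same mask that doc2text applies.
    fc = fc_value & 0xBFFFFFFF
    return fc, is_ansi


# Hypothetical descriptors: 2 bytes, the 4-byte value, 2 more bytes.
ansi_piece = struct.pack("<HLH", 0, 0x40000800, 0)
unicode_piece = struct.pack("<HLH", 0, 0x00001000, 0)

print(decode_piece_descriptor(ansi_piece))
print(decode_piece_descriptor(unicode_piece))
```

The explicit "<" in the format string matters: without it, "L" uses the platform's native size, which is 8 bytes on most 64-bit Linux systems, and unpacking a 4-byte slice fails.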