如何在 Linux 上使用 Python 读取 word 文件信息

如题所述

第一步:获取doc文件的xml组成文件

import zipfiledef get_word_xml(docx_filename):
with open(docx_filename) as f:
zip = zipfile.ZipFile(f)
xml_content = zip.read('word/document.xml')
return xml_content

第二步:解析xml为树形数据结构
from lxml import etreedef get_xml_tree(xml_string):
return etree.fromstring(xml_string)

第三步:读取word内容:
def _itertext(self, my_etree):
"""Iterator to go through xml tree's text nodes"""
for node in my_etree.iter(tag=etree.Element):
if self._check_element_is(node, 't'):
yield (node, node.text)def _check_element_is(self, element, type_char):
word_schema = '99999'
return element.tag == '{%s}%s' % (word_schema,type_char)
温馨提示:答案为网友推荐,仅供参考
相似回答