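Reading data from Word (.docx) and writing it to Excel (.xlsx) with Python
Each line of the Word document on the left is one paragraph of unstructured data. The goal is to represent it in structured form, in the Excel layout on the right.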
Required imports:
    import docx
    from docx import Document
    from openpyxl import Workbook
    from tools import *  # local helper module (not shown): provides key_filter, is_bookname, re_filter
Create the workbook object used to write the .xlsx file:
    workbook = Workbook()
    booksheet = workbook.active
Read the .docx document and store its contents in the .xlsx file:
    dir = '/Users/b/'
    file = '南京亲近母语2017年书目.docx'
    f = docx.Document(dir + file)
    level = ''
    # Iterate over the paragraphs in the document
    for para in f.paragraphs:
        publisher = ''  # never filled in; the column is left empty
        resource = '南京亲近母语2017年书目'
        text = para.text
        if len(text) == 0:
            continue
        text = key_filter(text)  # filter the raw text
        textlist = text.split(' ')
        # A line with a single field is a heading (e.g. a grade level)
        if len(textlist) == 1:
            level = textlist[0]
            print('level1', level)
            continue
        print('level2', level)
        # Drop the empty strings that consecutive spaces leave behind
        textlist = [t for t in textlist if t.strip()]
        if len(textlist) < 2:
            continue
        if is_bookname(textlist[0].strip()):
            # Presumably strips the leading item number from the title
            bookname = re_filter(textlist[0].strip(), r'[1-9]\d*.')
            print(bookname)
        else:
            continue
        # Columns: title, author, publisher, source, level
        row = [bookname.strip(), textlist[1].strip(), publisher.strip(),
               resource.strip(), level.strip()]
        booksheet.append(row)
    # Save once, after all paragraphs have been processed
    workbook.save(file.split('.')[0] + '.xlsx')
The above is the complete script. The individual pieces are explained separately below.
Read the Word document:
    f = docx.Document(dir + file)
    for para in f.paragraphs:
        text = para.text
        print(text)
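Once each paragraph's text is available, the script decides whether it is a heading or a book entry and remembers the most recent heading as the level. The sketch below shows that step on plain strings, without python-docx. It is an illustration under assumptions, not the author's code: `rows_from_lines` is a hypothetical name, `str.split()` replaces the `split(' ')` plus cleanup of the full script, and the `startswith('《')` test is a simplistic stand-in for `is_bookname`.

```python
def rows_from_lines(lines, source):
    """Turn paragraph texts into [title, author, publisher, source, level] rows."""
    rows = []
    level = ''  # most recent heading; applies to every book line that follows it
    for text in lines:
        fields = text.split()  # split on any whitespace, dropping empty fields
        if not fields:
            continue  # blank paragraph
        if len(fields) == 1:
            level = fields[0]  # a single field is treated as a heading
            continue
        title, author = fields[0], fields[1]
        if not title.startswith('《'):
            continue  # simplistic stand-in for the is_bookname check
        rows.append([title, author, '', source, level])
    return rows


sample = [
    '一年级课程书目',
    '',
    '《大卫上学去》 [美]大卫·香农',
]
for row in rows_from_lines(sample, '南京亲近母语2017年书目'):
    print(row)
```

Keeping this step as a function over strings makes it possible to check it on a few sample lines before running it on the real document.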
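Create a new Excel file and write data to it. Each row is appended to the sheet as a list:
    from openpyxl import Workbook
    workbook = Workbook()
    booksheet = workbook.active
    row = ['《大卫上学去》', '[美]大卫·香农', '', '南京亲近母语2017年书目', '一年级课程书目(图画书书目']
    booksheet.append(row)
    workbook.save(file.split('.')[0] + '.xlsx')  # file is the .docx name defined in the full script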
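The full script also calls `key_filter`, `is_bookname` and `re_filter`, which come from the `tools` module and are not shown in this post. The following is only a guess at what they might look like, so that the script can be tried end to end. The function names match the calls in the script, but every body here is an assumption.

```python
import re

# Hypothetical stand-ins for the helpers imported from `tools`.
# The original module is not shown, so the behavior below is assumed.


def key_filter(text):
    """Normalize whitespace so that split(' ') yields clean fields (assumed)."""
    # Treat tabs and full-width spaces (U+3000) as ordinary spaces.
    text = text.replace('\t', ' ').replace('\u3000', ' ')
    return text.strip()


def is_bookname(text):
    """Guess whether a field is a title: optional numbering, then 《...》 (assumed)."""
    return re.match(r'^(\d+\D)?\s*《.+》', text) is not None


def re_filter(text, pattern):
    """Remove every match of `pattern` from `text` (assumed)."""
    return re.sub(pattern, '', text)
```

With these definitions, `re_filter('12.《大卫上学去》', r'[1-9]\d*.')` returns `'《大卫上学去》'`. Note that the unescaped `.` in that pattern matches any character after the number, so a title numbered without a separator would lose its first character.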