I’m working on a Python script to extract content from a Word document (.docx) and insert it into a SQL Server database. The challenge is that I need to preserve text styles like bold and italic, as well as handle line breaks and footnotes from the Word document.

Currently, I'm using the python-docx library to process the document. Line breaks are transferred successfully as <br> tags, but text styles (bold/italic) and footnotes are missing from the output.

Here's what I’ve attempted so far:

1. For text styles:

I tried looping through paragraph.runs to detect run.bold and run.italic. However, the styled text doesn’t appear in my database output.

2. For footnotes:

I tried extracting footnotes using a custom function with doc.footnotes or checking for the style Footnote Text. While the function doesn’t raise errors, footnotes don’t appear in the final output.

Here’s the snippet of my code for processing styles and footnotes:

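text_with_style = []
for run in paragraph.runs:
    # Keep each run's own whitespace; stripping it merges words at run boundaries.
    styled_text = run.text
    # run.bold / run.italic are True, False, or None (None = inherited from the style).
    if run.bold:
        styled_text = f"<b>{styled_text}</b>"
    if run.italic:
        styled_text = f"<i>{styled_text}</i>"
    text_with_style.append(styled_text)

# Runs are contiguous slices of the paragraph, so join them without a separator.
formatted_text = "".join(text_with_style).replace("\n", "<br>")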
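A minimal sketch of the run-level conversion, written so it can be tried without a Word file. The helper name `runs_to_html` and the `SimpleNamespace` stand-ins are hypothetical. They only mimic the `text`, `bold`, and `italic` attributes that python-docx runs expose. Two points it tries to illustrate: run text is kept unstripped and joined without a separator so word spacing survives, and `bold`/`italic` are tri-state (`True`, `False`, or `None` when the value is inherited), so formatting that comes only from a paragraph or character style is not detected by a simple truth test.

```python
from html import escape
from types import SimpleNamespace


def runs_to_html(runs):
    """Convert run-like objects (with .text, .bold, .italic) to an HTML string."""
    parts = []
    for run in runs:
        if not run.text:
            continue
        # Escape &, <, > so document text cannot be mistaken for markup,
        # then turn in-run line breaks into <br>.
        html = escape(run.text, quote=False).replace("\n", "<br>")
        # None means "inherited"; only explicit True is treated as styled here.
        if run.bold:
            html = f"<b>{html}</b>"
        if run.italic:
            html = f"<i>{html}</i>"
        parts.append(html)
    # Runs are contiguous pieces of one paragraph: join with no separator.
    return "".join(parts)


# Hypothetical stand-ins for paragraph.runs, for demonstration only.
demo_runs = [
    SimpleNamespace(text="Plain ", bold=None, italic=None),
    SimpleNamespace(text="bold", bold=True, italic=None),
    SimpleNamespace(text=" and\n", bold=False, italic=None),
    SimpleNamespace(text="both", bold=True, italic=True),
]
print(runs_to_html(demo_runs))
# prints: Plain <b>bold</b> and<br><i><b>both</b></i>
```

With a real document the call would be `runs_to_html(paragraph.runs)`. It is also worth checking that the value written to the database is this formatted string rather than `paragraph.text`, which never carries markup.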

For footnotes:

def extract_footnotes(doc):
    footnotes_text = []
    if hasattr(doc, 'footnotes'):
        for footnote in doc.footnotes:
            footnotes_text.append(footnote.text.strip())
    return footnotes_text
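As far as I know, the python-docx `Document` object does not expose a `footnotes` attribute. If that holds for the installed version, the `hasattr` check is simply false and the list stays empty without any error. One possible workaround, sketched here with the standard library only, is to read the footnotes part directly: a .docx file is a ZIP archive and footnote text lives in `word/footnotes.xml`. This variant of `extract_footnotes` takes a path or file object instead of a `Document`, and the in-memory archive at the bottom is a made-up stand-in containing only that one part, not a complete Word file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def extract_footnotes(docx_file):
    """Return {footnote id: text} read from word/footnotes.xml of a .docx."""
    with zipfile.ZipFile(docx_file) as archive:
        if "word/footnotes.xml" not in archive.namelist():
            return {}  # the document has no footnotes part
        root = ET.fromstring(archive.read("word/footnotes.xml"))

    notes = {}
    for footnote in root.iter(f"{W}footnote"):
        # Separator entries carry a w:type attribute; skip everything
        # that is not an ordinary footnote.
        if footnote.get(f"{W}type") not in (None, "normal"):
            continue
        paragraphs = [
            "".join(t.text or "" for t in p.iter(f"{W}t"))
            for p in footnote.iter(f"{W}p")
        ]
        notes[int(footnote.get(f"{W}id"))] = "\n".join(paragraphs).strip()
    return notes


# Made-up minimal archive for demonstration: only the footnotes part.
sample_xml = (
    f'<w:footnotes xmlns:w="{W_NS}">'
    '<w:footnote w:type="separator" w:id="-1"><w:p><w:r><w:separator/></w:r></w:p></w:footnote>'
    '<w:footnote w:id="1"><w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>note</w:t></w:r></w:p></w:footnote>'
    "</w:footnotes>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/footnotes.xml", sample_xml)

print(extract_footnotes(buffer))
# prints: {1: 'First note'}
```

The ids returned here correspond to the `w:id` on `w:footnoteReference` elements in the document body, which is what would let each footnote be attached to the paragraph that cites it. This sketch returns plain text only and does not carry bold or italic inside footnotes.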

What am I missing? How can I reliably preserve bold/italic styles and extract footnotes so they’re included in the output that gets inserted into SQL Server? Any advice or working examples would be greatly appreciated.
