1) The phenomenon: why pasting HTML into Word keeps “most of the styling”
This used to be most common when people copied articles from websites into Word. Today, a very common source is:
- You ask ChatGPT/Claude/another LLM to generate Markdown.
- On the web, you see a rendered page (headings, lists, tables, bold text, blockquotes, etc.).
- You copy from that rendered page and paste into Word.
It creates a misleadingly “good” experience: you did almost no formatting, yet the Word document already looks like a proper document.
More concretely, if you select content in the browser—either a normal article or an LLM’s rendered Markdown—and paste into Word, you often get:
- fonts, sizes, bold/italic, color, and alignment mostly preserved;
- lists (ordered/unordered) preserved;
- code blocks / blockquotes sometimes preserved in appearance (indent, background, monospace);
- headings that look bigger/bolder and feel like headings.
This is not an illusion. The reason is usually: your clipboard contains HTML.
More specifically:
- A webpage is rendered from HTML + CSS: elements like h1/h2/h3, p, li, strong, em, table, etc., plus CSS rules that define font, size, spacing, and so on.
- When you copy, the browser typically puts the selection on the clipboard as HTML (alongside a plain-text version), not as plain text alone.
- When you paste, Word attempts to parse that HTML and translate the CSS “visual look” into Word’s own formatting (font size, bold, paragraph indentation, list symbols, etc.).
That’s why LLM-generated Markdown on the web behaves the same way:
Markdown is plain text, but what you copy is the rendered HTML. Once you copy the rendered output (not the raw Markdown), Word receives HTML with styling cues—not plain text.
So it feels like: “it looks one way on the web, and it looks roughly the same in Word.”
But the key point is: Word is trying to recreate the appearance. It may not convert the web page’s semantic structure into Word’s structured styles system.
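To make "your clipboard contains HTML" concrete, here is a minimal sketch that lists the tags in an HTML fragment and notes which ones carry inline CSS. The fragment is a made-up example, not a captured clipboard payload; real payloads usually include extra wrapper markup and far more inline styles.

```python
from html.parser import HTMLParser

# Hypothetical fragment, similar in spirit to what ends up on the clipboard
# when you copy rendered Markdown (real payloads are noisier).
CLIPBOARD_HTML = """
<h2 style="font-size: 24px; font-weight: 700">Methods</h2>
<p>We used a <strong>mixed</strong> design.</p>
<ul><li>Step one</li><li>Step two</li></ul>
"""


class TagCollector(HTMLParser):
    """Records each opening tag and whether it has an inline style attribute."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        has_style = any(name == "style" for name, _ in attrs)
        self.tags.append((tag, has_style))


def summarize(html):
    """Return a list of (tag name, has inline style) in document order."""
    collector = TagCollector()
    collector.feed(html)
    return collector.tags


for tag, has_style in summarize(CLIPBOARD_HTML):
    print(f"<{tag}> inline style: {has_style}")
```

The tag names (h2, strong, li) are the semantic part; the style attribute is the visual part. Word mostly leans on the visual part when it rebuilds the paragraph.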
---
2) The cost: the look is preserved, but structural signals are missing
Pasting from the web is inherently a visual reconstruction, not a conversion of document structure.
You get something that looks right, but Word’s structure features (TOC, captions, numbering, cross-references) depend on different signals.
To be clear, the most important “structure signals” in Word are:
- Heading levels: Heading 1/2/3… (for TOC and numbering)
- Caption system: Word captions (for lists of figures/tables, auto numbering, cross-references)
- Reference block styling: consistent reference entry paragraphs (for hanging indents and uniform spacing)
When content comes from the web, the most common gaps are the following.
2.1 Missing heading levels: looks like a heading, but it’s still “body text”
The visual effect of h1/h2/h3 often turns into:
- a normal paragraph with direct formatting (bigger font + bold),
- or an unstable “mixed” style that won’t match your template.
This leads to two pain points:
- The TOC won’t generate correctly: Word’s TOC is usually based on Heading styles, not “text that looks big and bold”.
- Headings become unmaintainable: as the document grows, manual bolding/resizing/indent tweaking accumulates and breaks once you try to standardize formatting or switch templates.
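One way to spot such "fake headings" is to look inside the .docx itself. The sketch below scans the main document XML for paragraphs whose text runs are all bold but which carry no Heading style. It rests on assumptions: style IDs that start with "Heading" (as in English-language Word; localized or custom templates may differ), and bold set directly on the runs rather than inherited from a style. The function name and sample XML are illustrative only.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def is_bold(run):
    """True if the run has direct bold formatting that is not switched off."""
    b = run.find("w:rPr/w:b", NS)
    return b is not None and b.get(f"{{{W}}}val", "true") not in ("0", "false")


def find_fake_headings(document_xml):
    """Return the text of paragraphs that are fully bold but have no Heading style."""
    root = ET.fromstring(document_xml)
    suspects = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val", "") if style is not None else ""
        if style_id.lower().startswith("heading"):
            continue  # a real heading: nothing to flag
        # Only consider runs that actually contain text.
        runs = [r for r in para.findall("w:r", NS) if r.find("w:t", NS) is not None]
        if runs and all(is_bold(r) for r in runs):
            suspects.append("".join(r.find("w:t", NS).text or "" for r in runs))
    return suspects


# Tiny hand-written sample standing in for word/document.xml.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Real heading</w:t></w:r></w:p>
<w:p><w:r><w:rPr><w:b/><w:sz w:val="32"/></w:rPr><w:t>Looks like a heading</w:t></w:r></w:p>
<w:p><w:r><w:t>Ordinary body text.</w:t></w:r></w:p>
</w:body></w:document>"""

print(find_fake_headings(SAMPLE))
```

For a real file, the XML can be read with `zipfile.ZipFile(path).read("word/document.xml")` and passed to the same function. Expect false positives (for example, a deliberately bold sentence), so treat the result as a review list rather than a verdict.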
2.2 Missing captions: figure/table titles are just normal paragraphs
In Markdown you might write:
Figure 2-1 Research workflow
Table 3-2 Experimental parameters
After pasting, these lines are still plain paragraphs. Word does not treat them as captions, so:
- you cannot generate a list of figures/tables reliably;
- numbering won’t auto-increment;
- cross-references won’t work as intended.
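Before such lines can become real captions, they have to be found. A minimal sketch, assuming captions follow the "Figure 2-1 Title" / "Table 3-2 Title" pattern shown above; the regular expression and function name are illustrative, and other numbering schemes would need their own pattern.

```python
import re

# Label, then a number such as "2-1" or "3.2", then the caption title.
CAPTION_RE = re.compile(r"^(Figure|Table)\s+(\d+(?:[-.]\d+)*)\s+(.+)$")


def parse_caption(line):
    """Split a caption-like line into kind, number and title; None if no match."""
    match = CAPTION_RE.match(line.strip())
    if match is None:
        return None
    kind, number, title = match.groups()
    return {"kind": kind.lower(), "number": number, "title": title}


print(parse_caption("Figure 2-1 Research workflow"))
print(parse_caption("Table 3-2 Experimental parameters"))
print(parse_caption("Figure out the cause first"))
```

A body sentence that happens to start with "Figure 2-1 shows ..." would also match, so the output is best used to shortlist candidates that you then convert with Word's caption feature.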
2.3 Reference block issues: hanging indents get “polluted” by spaces/Tabs
On the web, references are often just lists or normal paragraphs. After pasting:
- hanging indents may be faked by spaces/Tabs;
- spacing becomes inconsistent;
- you can’t reliably “fix it once with a style”.
Worse: once you apply a template (first-line indent / hanging indent), those leftover spaces/Tabs stack on top of the real indent, producing visibly large blank areas.
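Removing the fake indentation is mechanical. A minimal sketch that strips leading padding from a paragraph's text, assuming the padding is made of ASCII spaces, tabs, non-breaking spaces, or full-width (ideographic) spaces. It works on plain strings only; inside a .docx file, tabs are stored as separate elements rather than characters, so a real cleanup would have to handle those as well.

```python
import re

# Leading spaces, tabs, non-breaking spaces (U+00A0), ideographic spaces (U+3000).
LEADING_PAD = re.compile(r"^[ \t\u00a0\u3000]+")


def strip_fake_indent(paragraph_text):
    """Remove whitespace typed at the start of a paragraph to fake an indent."""
    return LEADING_PAD.sub("", paragraph_text)


print(repr(strip_fake_indent("\t\tSmith, J. (2020). A study.")))
print(repr(strip_fake_indent("    \u3000Body text")))
```

Whitespace inside the paragraph is left alone; only the run at the very start is removed, so the indent can then come from paragraph settings.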
---
3) A more practical view: you don’t lack “formatting”, you lack “types”
In FreeFormat’s terms, you can treat each paragraph in a Word document as having a “type”:
- chapter_title / section_title / subsection_title (heading levels)
- paragraph (body text)
- figure_caption / table_caption (captions)
- reference_entry (reference list entries)
When you paste from the web, the visual appearance may be fine, but these types are usually unlabeled.
Without types, Word can’t generate TOC/lists reliably, and automation becomes fragile.
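The idea of "types" can be sketched as a small data model. The type names below come from the list above; the mapping to Word's built-in style names is an assumption made for illustration, not a description of how FreeFormat works internally.

```python
from enum import Enum


class ParagraphType(Enum):
    CHAPTER_TITLE = "chapter_title"
    SECTION_TITLE = "section_title"
    SUBSECTION_TITLE = "subsection_title"
    PARAGRAPH = "paragraph"
    FIGURE_CAPTION = "figure_caption"
    TABLE_CAPTION = "table_caption"
    REFERENCE_ENTRY = "reference_entry"


# Assumed mapping from type to a built-in Word style name (illustrative only).
WORD_STYLE_FOR = {
    ParagraphType.CHAPTER_TITLE: "Heading 1",
    ParagraphType.SECTION_TITLE: "Heading 2",
    ParagraphType.SUBSECTION_TITLE: "Heading 3",
    ParagraphType.PARAGRAPH: "Normal",
    ParagraphType.FIGURE_CAPTION: "Caption",
    ParagraphType.TABLE_CAPTION: "Caption",
    ParagraphType.REFERENCE_ENTRY: "Bibliography",
}


def style_plan(labeled):
    """Turn (type label, text) pairs into (Word style name, text) pairs."""
    return [(WORD_STYLE_FOR[ParagraphType(label)], text) for label, text in labeled]


plan = style_plan([
    ("chapter_title", "1 Introduction"),
    ("paragraph", "Body text."),
    ("figure_caption", "Figure 2-1 Research workflow"),
])
print(plan)
```

Once every paragraph has a label, applying a template becomes a lookup instead of a paragraph-by-paragraph judgment call, which is why labeling comes before formatting.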
---
4) Two approaches: pure Word fixes vs FreeFormat labeling
Option A: Fix in Word manually (works for short documents)
You can do three things in Word:
- Select each heading paragraph → apply Heading 1/2/3
- For figures/tables → use Word’s “Insert Caption”
- For references → use paragraph settings for hanging indent (don’t stack spaces/Tabs)
Pros: no extra tool. Cons: painful for long docs, easy to miss things.
Option B: Label first with FreeFormat, then format (best for most cases)
Treat your pasted Word doc as “content done, structure debt unpaid”:
- Choose a template close to your target (e.g., a university thesis template).
- Upload your .docx and run a check.
- Let the tool report: “these paragraphs look like headings/captions/references, but are not typed correctly”.
- Run format to apply a consistent styles system.
- Back in Word: update the TOC (and lists if needed) and do a final check.
The key value: you avoid manually identifying types paragraph-by-paragraph, and Word’s structure becomes usable again.
---
5) Recommended workflow (paste → label → format → update TOC)
- Paste your content (keep the look you like; don’t over-fix yet)
- Quickly remove “structure pollution”
  - remove leading spaces/Tabs used for indentation (use paragraph indent later)
  - don’t center titles by typing spaces
- Run a FreeFormat check (get an issues list)
- Format with the same template (build a stable styles system)
- Back in Word: update the TOC/lists and do final self-check
Open Studio:
- English: /en/studio
- Chinese: /zh/studio
---
6) Minimal self-check (you don’t need rules, just acceptance tests)
- The TOC updates automatically (not hand-typed dots/pages)
- Each heading level looks consistent (same level = same style)
- Figure/table captions are part of a caption system (stable numbering)
- Body indent comes from paragraph settings (not spaces/Tabs)
- Reference entries are consistent (indent & spacing)
---
7) Common pitfalls when pasting from the web
- Using “bold + bigger font” as headings: looks like headings, but structurally it’s body text.
- Using Tabs/spaces for indentation: once you apply a template, indents stack and break.
- Captions are just paragraphs: lists of figures/tables and cross-references are hard later.
- Forgetting to update the TOC: you fixed structure but didn’t refresh TOC fields.
If you restore structural signals (heading levels, captions, reference block consistency), formatting becomes much easier.