# SciMD: Scientific Markdown for the AI Era

> An open document format designed for humans to write, machines to understand, and science to advance — replacing PDF, JATS XML, and LaTeX for the AI era.

SciMD (.smd) is a plain-text document format that extends Markdown with structured conventions for scientific writing — optimized for human authoring, LLM comprehension, and RAG retrieval. It eliminates the information loss that occurs when scientific papers are converted from PDF, JATS XML, or LaTeX into text that AI systems can actually use.

## The Problem: Science Trapped in Print-Era Formats

The four dominant formats for scientific literature were designed for human readers on paper, not for machines:

- **PDF** — Equations render as images, figures lose their descriptions, and multi-column layouts break sentences at page boundaries. PDF-extracted Markdown scores 5.0/10 in our LLM comprehension benchmark, with a High hallucination risk.
- **JATS XML** — Complete data, but 72% of its bytes are markup. A single paper exceeds 32,000 tokens, overflowing standard 8K context windows, and costs 6.5× the compute per sample of an equivalent SciMD document.
- **HTML / LaTeXML** — MathML turns formulas into hundreds of nested tags, and the monolithic structure means semantic chunking requires XPath expertise.
- **LaTeX source** — Layout-command noise (`\vspace{-0.5em}`, `\hspace*{\fill}`), opaque figure references, and no semantic section metadata.

## The Solution: Author-Time Structure

SciMD resolves these problems at the source — when the author writes the document, not when a converter tries to reverse-engineer it later.

| Problem | SciMD solution |
| --- | --- |
| Equations as images | Native LaTeX: `$E = mc^2$`, `$$...$$` blocks |
| Figures without context | Mandatory `::description` + `::interpretation` per figure |
| Charts as opaque images | `::chart` block with tabular data + author interpretation |
| Arbitrary text chunks | `::section{#id}` with `type`, `summary`, `depends_on` |
| Expensive XML processing | Plain text: 0% markup, ~4K tokens per paper (parsed) |
| Hallucination from gaps | Every visual element carries the author's explanation |
| Poor RAG retrieval | Each section has `id`, `type`, `summary`, and dependency links |
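The dependency links in the last row are what make retrieval dependency-aware: when a section is pulled into context, its prerequisites can come along with it. A toy sketch of that expansion step (the chunk contents here are invented for illustration; real chunks come from the parser):

```python
# Toy retrieval expansion: when a chunk is retrieved, also pull in the
# sections it declares via depends_on, so the LLM sees prerequisite context.
chunks = {
    "#introduction": {"summary": "Motivation for sparse attention.", "depends_on": []},
    "#methods": {"summary": "Tiled sparse attention kernel.", "depends_on": ["#introduction"]},
}

def expand(chunk_id: str, seen=None) -> list[str]:
    """Return chunk_id plus its transitive dependencies, dependencies first."""
    seen = seen if seen is not None else set()
    if chunk_id in seen:
        return []
    seen.add(chunk_id)
    out: list[str] = []
    for dep in chunks[chunk_id]["depends_on"]:
        out += expand(dep, seen)
    return out + [chunk_id]

print(expand("#methods"))  # dependencies first: ['#introduction', '#methods']
```

Because every section names its prerequisites explicitly, this expansion needs no heuristics — it is a plain graph walk over author-declared links.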

## How It Works

A .smd file is valid UTF-8 plain text composed of three layers: a document header, semantic sections, and rich elements inside those sections.

### 1. YAML Frontmatter

Every SciMD file opens with a structured YAML header that any system can query directly (filter by keywords, sort by date, resolve `doi`, expand references) with zero preprocessing:

```yaml
---smd
title: "Optimization of Transformer Attention via Sparse Kernels"
authors:
  - name: "Dr. Elena Rossi"
    orcid: "0000-0002-1825-0097"
    affiliation: "Neural Computing Lab, ETH Zurich"
    corresponding: true
version: "0.1.0"
date: "2026-03-20"
keywords: ["Transformers", "Attention", "Sparsity", "LLM Optimization"]
abstract: |
  The quadratic complexity of standard self-attention remains a bottleneck
  for long-sequence LLMs. We propose a sparse kernel achieving 40% speedup
  at 32k tokens while preserving 98% perplexity on standard benchmarks.
---
```
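Because the header is plain YAML between two fixed delimiters, extracting it takes only a few lines of standard tooling. A minimal sketch using PyYAML — the `load_frontmatter` helper is hypothetical, not part of `pyscimd`:

```python
import yaml  # PyYAML; the pyscimd parser does this for you


def load_frontmatter(text: str) -> dict:
    """Extract and parse the YAML header between the ---smd / --- fences."""
    header = text.split("---smd", 1)[1].split("---", 1)[0]
    return yaml.safe_load(header)


smd = """---smd
title: "Optimization of Transformer Attention via Sparse Kernels"
keywords: ["Transformers", "Attention", "Sparsity", "LLM Optimization"]
date: "2026-03-20"
---
# Body text
"""

meta = load_frontmatter(smd)
print(meta["title"])                   # the title as a plain string, no XML
print("Sparsity" in meta["keywords"])  # True — keyword filtering, zero preprocessing
```

A search index or RAG pipeline can run this over a whole corpus and filter on `keywords` or sort on `date` without a format-specific parser.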

### 2. Semantic Sections

Content is organized into typed, labeled sections with explicit dependency declarations. This structure is authored once and reused by every downstream system — RAG pipelines, training corpora, search indexes:

```
::section{#methods}
::meta
type: methods
summary: "Tiled sparse attention kernel using block-level sparsity on H100 GPUs."
depends_on: ["#introduction"]
::

# Methodology

Our approach divides the attention matrix into $64 \times 64$ blocks.
Only blocks within a Hamming distance of the diagonal are computed,
reducing the effective complexity from $O(n^2)$ to $O(n \log n)$.

::endsection
```
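Because sections are delimited by plain-text markers, a chunker needs nothing heavier than a regular expression. A rough sketch of what the convention enables — not the `pyscimd` implementation, which also parses the `::meta` block:

```python
import re

# Naive section chunker: captures each section's id and raw body.
SECTION_RE = re.compile(
    r"::section\{#(?P<id>[\w-]+)\}\n(?P<body>.*?)::endsection",
    re.DOTALL,
)

doc = """::section{#methods}
::meta
type: methods
summary: "Tiled sparse attention kernel."
::

# Methodology

Our approach divides the attention matrix into blocks.
::endsection
"""

sections = {m.group("id"): m.group("body") for m in SECTION_RE.finditer(doc)}
print(list(sections))                          # ['methods']
print("# Methodology" in sections["methods"])  # True
```

Compare this to XML, where the same chunking step requires a full DOM parse and XPath queries.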

### 3. Rich Elements

Charts include tabular data and mandatory author interpretation — not images. Figures require both a description (what is shown) and an interpretation (what it means). This separation is the primary mechanism for eliminating hallucination risk.
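As a purely illustrative sketch — the exact block grammar, including the `::endfigure` closer assumed here, is defined normatively by the spec — a figure carrying both required fields might read:

```
::figure{#fig-speedup}
::description
Line chart of wall-clock attention time versus sequence length (1k to 32k
tokens) for the dense and sparse kernels on a single H100 GPU.
::
::interpretation
The sparse kernel's advantage grows with sequence length, reaching the
reported 40% speedup at 32k tokens.
::
::endfigure
```

An LLM reading this never has to guess what the chart shows or why it matters: both are stated by the author at writing time.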

## Benchmark Results

An independent benchmark, evaluated by Gemini 2.5 Pro on a real scientific paper, compares SciMD against JATS XML and PDF-extracted Markdown:

| Format | LLM Score | Hallucination Risk | Token Count |
| --- | --- | --- | --- |
| PDF-as-Markdown | 5.0 / 10 | 🔴 High | ~11,870 |
| JATS XML | 8.2 / 10 | 🟡 Low-Medium | ~32,389 |
| SciMD (raw) | 8.6 / 10 | 🟡 Low-Medium | ~11,699 |
| SciMD (parsed, RAG) | 9.6 / 10 | 🟢 Very Low | ~4,040 |

Parsed SciMD is roughly 2.9× more token-efficient than raw SciMD or PDF-extracted Markdown and about 8× more efficient than JATS XML, while preserving 100% of the semantic content.

## Getting Started

```bash
pip install pyscimd
```

```python
from scimd_parser import SciMDParser

doc = SciMDParser.parse("paper.smd")

# Semantically chunked sections — ready for embedding
for chunk in doc.to_rag_chunks():
    print(chunk["section_id"])   # e.g. "methods"
    print(chunk["type"])         # e.g. "methods"
    print(chunk["summary"])      # one-sentence description
    print(chunk["depends_on"])   # ["#introduction"]
    print(chunk["content"])      # clean Markdown, no noise
```

## Roadmap

- **v0.1.0** — Core specification
- **v0.2.0** — Reference parser + validator (`pyscimd` on PyPI)
- **v0.3.0** — VS Code extension with live preview
- **v0.4.0** — Pandoc filter for PDF/HTML/DOCX export
- **v0.5.0** — LLM training pipeline toolkit
- **v1.0.0** — Stable specification after community review