SciMD
An open document format designed for humans to write, machines to understand, and science to advance — replacing PDF, JATS XML, and LaTeX for the AI era.
SciMD: Scientific Markdown for the AI Era
SciMD (.smd) is a plain-text document format that extends Markdown with structured conventions for scientific writing — optimized for human authoring, LLM comprehension, and RAG retrieval. It eliminates the information loss that occurs when scientific papers are converted from PDF, JATS XML, or LaTeX into text that AI systems can actually use.
The Problem: Science Trapped in Print-Era Formats
The four dominant formats for scientific literature were designed for human readers on paper, not for machines:
- PDF: Equations render as images, figures lose their descriptions, and multi-column layouts break sentences at page boundaries. PDF-extracted Markdown scores 5.0/10 in the LLM comprehension benchmark below, with a High hallucination risk.
- JATS XML: The data is complete, but 72% of its bytes are markup. A single paper exceeds 32,000 tokens, overflowing a standard 8K context window, and costs 6.5× more compute per sample than an equivalent SciMD document.
- HTML / LaTeXML: MathML turns formulas into hundreds of nested tags, and the blob structure means semantic chunking requires XPath expertise.
- LaTeX source: Layout command noise (\vspace{-0.5em}, \hspace*{\fill}), opaque figure references, and no semantic section metadata.
The Solution: Author-Time Structure
SciMD resolves these problems at the source — when the author writes the document, not when a converter tries to reverse-engineer it later.
| Problem | SciMD solution |
|---|---|
| Equations as images | Native LaTeX: $E = mc^2$, $$...$$ blocks |
| Figures without context | Mandatory ::description + ::interpretation per figure |
| Charts as opaque images | ::chart block with tabular data + author interpretation |
| Arbitrary text chunks | ::section{#id} with type, summary, depends_on |
| Expensive XML processing | Plain text — 0% markup, ~4K tokens per paper (parsed) |
| Hallucination from gaps | Every visual element carries the author’s explanation |
| Poor RAG retrieval | Each section has id, type, summary, and dependency links |
How It Works
A .smd file is valid UTF-8 plain text composed of three layers: a document header, semantic sections, and rich elements inside those sections.
1. YAML Frontmatter
Every SciMD file opens with a structured YAML header. Any system can query it directly, with no preprocessing, to filter by keywords, sort by date, resolve a DOI, or expand references:
---smd
title: "Optimization of Transformer Attention via Sparse Kernels"
authors:
  - name: "Dr. Elena Rossi"
    orcid: "0000-0002-1825-0097"
    affiliation: "Neural Computing Lab, ETH Zurich"
    corresponding: true
version: "0.1.0"
date: "2026-03-20"
keywords: ["Transformers", "Attention", "Sparsity", "LLM Optimization"]
abstract: |
  The quadratic complexity of standard self-attention remains a bottleneck
  for long-sequence LLMs. We propose a sparse kernel achieving 40% speedup
  at 32k tokens while preserving 98% perplexity on standard benchmarks.
---
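As a rough illustration of why the header is easy to query, the sketch below separates the frontmatter from the body and reads a few top-level fields using only the Python standard library. It is not the reference parser: `split_frontmatter` and `top_level_scalars` are hypothetical helper names, the sample is a shortened copy of the header above, and a real pipeline would hand the header text to a full YAML parser to get the nested `authors` list and the block-scalar `abstract`.

```python
import re

# Shortened copy of the example header, followed by one line of body text.
SAMPLE = '''---smd
title: "Optimization of Transformer Attention via Sparse Kernels"
version: "0.1.0"
date: "2026-03-20"
keywords: ["Transformers", "Attention", "Sparsity", "LLM Optimization"]
---
Body text starts here.
'''


def split_frontmatter(text):
    """Return (header_text, body_text) for a document that opens with ---smd."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---smd":
        raise ValueError("missing ---smd opening line")
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    raise ValueError("frontmatter is not closed with ---")


def top_level_scalars(header):
    """Collect unindented key: "value" pairs; lists and nested keys are skipped."""
    fields = {}
    for line in header.splitlines():
        match = re.match(r'^([A-Za-z_]+):\s*"(.*)"\s*$', line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


header, body = split_frontmatter(SAMPLE)
meta = top_level_scalars(header)
print(meta["title"])
print(meta["date"])
```

Because the header is delimited by two marker lines, a consumer that only needs the title or date can stop reading at the closing `---` and never touch the body.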
2. Semantic Sections
Content is organized into typed, labeled sections with explicit dependency declarations. This structure is authored once and reused by every downstream system — RAG pipelines, training corpora, search indexes:
::section{#methods}
::meta
type: methods
summary: "Tiled sparse attention kernel using block-level sparsity on H100 GPUs."
depends_on: ["#introduction"]
::
# Methodology
Our approach divides the attention matrix into $64 \times 64$ blocks.
Only blocks within a Hamming distance of the diagonal are computed,
reducing the effective complexity from $O(n^2)$ to $O(n \log n)$.
::endsection
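To make the section layout concrete, here is a minimal sketch that pulls the id, metadata, and content out of a block like the one above with regular expressions. The grammar is inferred from this single example (one `::meta` block closed by a bare `::`, flat `key: value` lines), so treat `parse_sections` and `parse_meta` as hypothetical helpers rather than the behaviour of the reference parser.

```python
import re

# Shortened copy of the example section (raw string keeps the LaTeX backslash).
SAMPLE = r'''::section{#methods}
::meta
type: methods
summary: "Tiled sparse attention kernel using block-level sparsity on H100 GPUs."
depends_on: ["#introduction"]
::
# Methodology
Our approach divides the attention matrix into $64 \times 64$ blocks.
::endsection
'''

SECTION_RE = re.compile(
    r"^::section\{#(?P<id>[\w-]+)\}\n(?P<inner>.*?)^::endsection$",
    re.DOTALL | re.MULTILINE,
)
META_RE = re.compile(r"^::meta\n(?P<meta>.*?)^::$\n?", re.DOTALL | re.MULTILINE)


def parse_meta(block):
    """Parse flat key: value lines; bracketed values become lists of strings."""
    meta = {}
    for line in block.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if value.startswith("["):
            meta[key.strip()] = re.findall(r'"([^"]*)"', value)
        else:
            meta[key.strip()] = value.strip('"')
    return meta


def parse_sections(text):
    """Return one dict per ::section block with its id, metadata, and content."""
    sections = []
    for match in SECTION_RE.finditer(text):
        inner = match.group("inner")
        meta_match = META_RE.search(inner)
        meta = parse_meta(meta_match.group("meta")) if meta_match else {}
        content = inner[meta_match.end():] if meta_match else inner
        sections.append({"id": match.group("id"), **meta, "content": content.strip()})
    return sections


sections = parse_sections(SAMPLE)
print(sections[0]["id"], sections[0]["depends_on"])
```

Each resulting dict already carries what a retriever needs: the summary can be embedded alongside the content, and `depends_on` tells it which other sections to fetch for context.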
3. Rich Elements
Charts carry their tabular data and a mandatory author interpretation rather than an image. Figures require both a description (what is shown) and an interpretation (what it means). Separating the two is SciMD's primary mechanism for reducing hallucination risk.
Benchmark Results
An independent benchmark, scored by Gemini 2.5 Pro on a real scientific paper, compared SciMD against JATS XML and PDF-extracted Markdown:
| Format | LLM Score | Hallucination Risk | Token Count |
|---|---|---|---|
| PDF-as-Markdown | 5.0 / 10 | 🔴 High | ~11,870 |
| JATS XML | 8.2 / 10 | 🟡 Low-Medium | ~32,389 |
| SciMD (raw) | 8.6 / 10 | 🟡 Low-Medium | ~11,699 |
| SciMD parsed — RAG | 9.6 / 10 | 🟢 Very Low | ~4,040 |
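By the token counts above, parsed SciMD needs about 2.9× fewer tokens than either raw Markdown variant (~11,700 to ~11,870 vs. ~4,040) and about 8× fewer than JATS XML (~32,389 vs. ~4,040), while preserving 100% of semantic content.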
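The token counts in the table can be checked against the 8K context window mentioned earlier. The sketch below assumes "8K" means 8,192 tokens and simply reports which formats fit in a single window and how many windows the others would need; the counts are copied from the table, and nothing here re-runs the benchmark.

```python
import math

# Approximate token counts copied from the benchmark table.
TOKEN_COUNTS = {
    "PDF-as-Markdown": 11_870,
    "JATS XML": 32_389,
    "SciMD (raw)": 11_699,
    "SciMD parsed (RAG)": 4_040,
}

CONTEXT_WINDOW = 8_192  # assumption: "8K" read as 8,192 tokens


def windows_needed(tokens, window=CONTEXT_WINDOW):
    """Number of context windows required to hold the whole document."""
    return math.ceil(tokens / window)


fitting = [name for name, n in TOKEN_COUNTS.items() if windows_needed(n) == 1]
print(fitting)  # ['SciMD parsed (RAG)']

for name, n in TOKEN_COUNTS.items():
    print(f"{name}: {windows_needed(n)} window(s)")
```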
Getting Started
pip install pyscimd
from scimd_parser import SciMDParser

doc = SciMDParser.parse("paper.smd")

# Semantically chunked sections, ready for embedding
for chunk in doc.to_rag_chunks():
    print(chunk["section_id"])  # e.g. "methods"
    print(chunk["type"])        # e.g. "methods"
    print(chunk["summary"])     # one-sentence description
    print(chunk["depends_on"])  # ["#introduction"]
    print(chunk["content"])     # clean Markdown, no noise
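One way a retrieval pipeline might use `depends_on` is to pull in a section's prerequisites before handing it to the model. The sketch below works on plain dictionaries shaped like the chunks printed above, so it does not call the parser at all; the three sample chunks, their summaries, and the `with_dependencies` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical chunks with the same keys as the dictionaries printed above.
chunks = [
    {"section_id": "introduction", "type": "introduction",
     "summary": "Why quadratic attention limits long-sequence models.",
     "depends_on": [], "content": "..."},
    {"section_id": "methods", "type": "methods",
     "summary": "Tiled sparse attention kernel using block-level sparsity.",
     "depends_on": ["#introduction"], "content": "..."},
    {"section_id": "results", "type": "results",
     "summary": "Speedup and perplexity measurements.",
     "depends_on": ["#methods"], "content": "..."},
]


def with_dependencies(section_id, index):
    """Return the requested chunk preceded by every chunk it depends on."""
    ordered, seen = [], set()

    def visit(sid):
        # Skip sections already collected (also guards against cycles)
        # and references to sections that are not in the index.
        if sid in seen or sid not in index:
            return
        seen.add(sid)
        for ref in index[sid]["depends_on"]:
            visit(ref.lstrip("#"))  # "#introduction" -> "introduction"
        ordered.append(index[sid])

    visit(section_id)
    return ordered


index = {chunk["section_id"]: chunk for chunk in chunks}
context = with_dependencies("results", index)
print([chunk["section_id"] for chunk in context])
# ['introduction', 'methods', 'results']
```

Dependencies come first in the returned list, so concatenating the `content` fields in order gives the model the background before the section that was actually retrieved.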
Roadmap
- v0.1.0 — Core specification
- v0.2.0 — Reference parser + validator (pyscimd on PyPI)
- v0.3.0 — VS Code extension with live preview
- v0.4.0 — Pandoc filter for PDF/HTML/DOCX export
- v0.5.0 — LLM training pipeline toolkit
- v1.0.0 — Stable specification after community review