# SciMD: Scientific Markdown for the AI Era

> An open document format designed for humans to write, machines to understand, and science to advance — replacing PDF, JATS XML, and LaTeX for the AI era.

SciMD (.smd) is a plain-text document format that extends Markdown with structured conventions for scientific writing — optimized for human authoring, LLM comprehension, and RAG retrieval. It eliminates the information loss that occurs when scientific papers are converted from PDF, JATS XML, or LaTeX into text that AI systems can actually use.

## The Problem: Science Trapped in Print-Era Formats

The four dominant formats for scientific literature were designed for human readers on paper, not for machines:

- **PDF** — Equations render as images, figures lose their descriptions, and multi-column layouts break sentences at page boundaries. PDF-extracted Markdown scores 5.0/10 in our LLM comprehension benchmark, with a High hallucination risk.
- **JATS XML** — Complete data, but 72% of its bytes are markup. A single paper exceeds 32,000 tokens, overflowing standard 8K context windows, and costs 6.5× the compute per sample of an equivalent SciMD document.
- **HTML / LaTeXML** — MathML turns formulas into hundreds of nested tags, and the monolithic structure means semantic chunking requires XPath expertise.
- **LaTeX source** — Layout-command noise (`\vspace{-0.5em}`, `\hspace*{\fill}`), opaque figure references, and no semantic section metadata.

## The Solution: Author-Time Structure

SciMD resolves these problems at the source — when the author writes the document, not when a converter tries to reverse-engineer it later.

| Problem | SciMD solution |
| --- | --- |
| Equations as images | Native LaTeX: `$E = mc^2$`, `$$...$$` blocks |
| Figures without context | Mandatory `::description` + `::interpretation` per figure |
| Charts as opaque images | `::chart` block with tabular data + author interpretation |
| Arbitrary text chunks | `::section{#id}` with `type`, `summary`, `depends_on` |
| Expensive XML processing | Plain text: 0% markup, ~4K tokens per paper (parsed) |
| Hallucination from gaps | Every visual element carries the author's explanation |
| Poor RAG retrieval | Each section has `id`, `type`, `summary`, and dependency links |
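The dependency links in the last row are what make retrieval dependency-aware: when a section is pulled into context, its prerequisites can come along with it. A toy sketch of that expansion step (the chunk contents here are invented for illustration; real chunks come from the parser):

```python
# Toy retrieval expansion: when a chunk is retrieved, also pull in the
# sections it declares via depends_on, so the LLM sees prerequisite context.
chunks = {
    "#introduction": {"summary": "Motivation for sparse attention.", "depends_on": []},
    "#methods": {"summary": "Tiled sparse attention kernel.", "depends_on": ["#introduction"]},
}

def expand(chunk_id: str, seen=None) -> list[str]:
    """Return chunk_id plus its transitive dependencies, dependencies first."""
    seen = seen if seen is not None else set()
    if chunk_id in seen:
        return []
    seen.add(chunk_id)
    out: list[str] = []
    for dep in chunks[chunk_id]["depends_on"]:
        out += expand(dep, seen)
    return out + [chunk_id]

print(expand("#methods"))  # dependencies first: ['#introduction', '#methods']
```

Because every section names its prerequisites explicitly, this expansion needs no heuristics — it is a plain graph walk over author-declared links.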

## How It Works

A .smd file is valid UTF-8 plain text composed of three layers: a document header, semantic sections, and rich elements inside those sections.

### 1. YAML Frontmatter

Every SciMD file opens with a structured YAML header that any system can query directly (filter by keywords, sort by date, resolve `doi`, expand references) with zero preprocessing:

```yaml
---smd
title: "Optimization of Transformer Attention via Sparse Kernels"
authors:
  - name: "Dr. Elena Rossi"
    orcid: "0000-0002-1825-0097"
    affiliation: "Neural Computing Lab, ETH Zurich"
    corresponding: true
version: "0.1.0"
date: "2026-03-20"
keywords: ["Transformers", "Attention", "Sparsity", "LLM Optimization"]
abstract: |
  The quadratic complexity of standard self-attention remains a bottleneck
  for long-sequence LLMs. We propose a sparse kernel achieving 40% speedup
  at 32k tokens while preserving 98% perplexity on standard benchmarks.
---
```
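Because the header is plain YAML between two fixed delimiters, extracting it takes only a few lines of standard tooling. A minimal sketch using PyYAML — the `load_frontmatter` helper is hypothetical, not part of `pyscimd`:

```python
import yaml  # PyYAML; the pyscimd parser does this for you


def load_frontmatter(text: str) -> dict:
    """Extract and parse the YAML header between the ---smd / --- fences."""
    header = text.split("---smd", 1)[1].split("---", 1)[0]
    return yaml.safe_load(header)


smd = """---smd
title: "Optimization of Transformer Attention via Sparse Kernels"
keywords: ["Transformers", "Attention", "Sparsity", "LLM Optimization"]
date: "2026-03-20"
---
# Body text
"""

meta = load_frontmatter(smd)
print(meta["title"])                   # the title as a plain string, no XML
print("Sparsity" in meta["keywords"])  # True — keyword filtering, zero preprocessing
```

A search index or RAG pipeline can run this over a whole corpus and filter on `keywords` or sort on `date` without a format-specific parser.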

### 2. Semantic Sections

Content is organized into typed, labeled sections with explicit dependency declarations. This structure is authored once and reused by every downstream system — RAG pipelines, training corpora, search indexes:

```
::section{#methods}
::meta
type: methods
summary: "Tiled sparse attention kernel using block-level sparsity on H100 GPUs."
depends_on: ["#introduction"]
::

# Methodology

Our approach divides the attention matrix into $64 \times 64$ blocks.
Only blocks within a Hamming distance of the diagonal are computed,
reducing the effective complexity from $O(n^2)$ to $O(n \log n)$.

::endsection
```
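Because sections are delimited by plain-text markers, a chunker needs nothing heavier than a regular expression. A rough sketch of what the convention enables — not the `pyscimd` implementation, which also parses the `::meta` block:

```python
import re

# Naive section chunker: captures each section's id and raw body.
SECTION_RE = re.compile(
    r"::section\{#(?P<id>[\w-]+)\}\n(?P<body>.*?)::endsection",
    re.DOTALL,
)

doc = """::section{#methods}
::meta
type: methods
summary: "Tiled sparse attention kernel."
::

# Methodology

Our approach divides the attention matrix into blocks.
::endsection
"""

sections = {m.group("id"): m.group("body") for m in SECTION_RE.finditer(doc)}
print(list(sections))                          # ['methods']
print("# Methodology" in sections["methods"])  # True
```

Compare this to XML, where the same chunking step requires a full DOM parse and XPath queries.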

### 3. Rich Elements

Charts include tabular data and mandatory author interpretation — not images. Figures require both a description (what is shown) and an interpretation (what it means). This separation is the primary mechanism for eliminating hallucination risk.
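As a purely illustrative sketch — the exact block grammar, including the `::endfigure` closer assumed here, is defined normatively by the spec — a figure carrying both required fields might read:

```
::figure{#fig-speedup}
::description
Line chart of wall-clock attention time versus sequence length (1k to 32k
tokens) for the dense and sparse kernels on a single H100 GPU.
::
::interpretation
The sparse kernel's advantage grows with sequence length, reaching the
reported 40% speedup at 32k tokens.
::
::endfigure
```

An LLM reading this never has to guess what the chart shows or why it matters: both are stated by the author at writing time.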

## Benchmark Results

An independent benchmark, evaluated by Gemini 2.5 Pro on a real scientific paper, compares SciMD against JATS XML and PDF-extracted Markdown:

| Format | LLM Score | Hallucination Risk | Token Count |
| --- | --- | --- | --- |
| PDF-as-Markdown | 5.0 / 10 | 🔴 High | ~11,870 |
| JATS XML | 8.2 / 10 | 🟡 Low-Medium | ~32,389 |
| SciMD (raw) | 8.6 / 10 | 🟡 Low-Medium | ~11,699 |
| SciMD (parsed, RAG) | 9.6 / 10 | 🟢 Very Low | ~4,040 |

Parsed SciMD is roughly 2.9× more token-efficient than raw SciMD or PDF-extracted Markdown and about 8× more efficient than JATS XML, while preserving 100% of the semantic content.

## Getting Started

```bash
pip install pyscimd
```

```python
from scimd_parser import SciMDParser

doc = SciMDParser.parse("paper.smd")

# Semantically chunked sections — ready for embedding
for chunk in doc.to_rag_chunks():
    print(chunk["section_id"])   # e.g. "methods"
    print(chunk["type"])         # e.g. "methods"
    print(chunk["summary"])      # one-sentence description
    print(chunk["depends_on"])   # ["#introduction"]
    print(chunk["content"])      # clean Markdown, no noise
```

## Roadmap

- **v0.1.0** — Core specification
- **v0.2.0** — Reference parser + validator (`pyscimd` on PyPI)
- **v0.3.0** — VS Code extension with live preview
- **v0.4.0** — Pandoc filter for PDF/HTML/DOCX export
- **v0.5.0** — LLM training pipeline toolkit
- **v1.0.0** — Stable specification after community review