Dataset Transparency

From 28GB of real source data to more than 100GB of clean training data

We start with real, verified public-domain and CC0 material, then use filtering and equation-guided synthetic generation to expand clean coverage. The result is a dataset stack built for training and retrieval, not just raw scraping.

Verified source: 28GB. A real, quality-verified dataset baseline.

Expanded output: 100GB+. Clean synthetic plus curated training data.

Rights profile: CC0 / public domain. No copyright ambiguity in delivery.

Licensing

Licenses we filter for

Our pipeline is strict by design: we gather and publish only CC0 and public-domain data. That means your team can train and deploy without stepping into rights uncertainty.
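As an illustration, a minimal allow-list filter in this spirit might look like the sketch below. This is not our production pipeline; the record schema and the "license" field name are assumptions.

    # Minimal sketch of a strict license allow-list filter.
    # The record schema and "license" field are hypothetical.
    ALLOWED_LICENSES = {"CC0-1.0", "public-domain"}

    def is_clean(record: dict) -> bool:
        """Keep a record only if its license is explicitly allow-listed."""
        # Unknown or missing license means reject; never assume clean.
        return record.get("license") in ALLOWED_LICENSES

    def filter_corpus(records: list) -> list:
        """Strict by design: anything not provably CC0/PD is dropped."""
        return [r for r in records if is_clean(r)]

The important property is the default: records with an unknown or missing license are rejected, never assumed clean.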

RAG system

Retrieval without copyright risk

Our RAG layer filters retrieval outputs to deliver only public-domain and CC0 material. Models can draw on internet-scale context while keeping the response path aligned with clean licensing.
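Conceptually, the retrieval path applies the same allow list at query time. A simplified sketch, assuming a generic search index and a "license" metadata field (both hypothetical stand-ins for the real components):

    ALLOWED = {"CC0-1.0", "public-domain"}

    def retrieve_clean(query: str, index, k: int = 5) -> list:
        """Retrieve top matches, then keep only clean-licensed documents."""
        # Over-fetch so that filtering still leaves k usable results.
        candidates = index.search(query, limit=4 * k)
        clean = [doc for doc in candidates if doc.get("license") in ALLOWED]
        return clean[:k]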

Bundles

Keyword bundles with low-friction access

Data is organized into focused bundles by keyword. Most bundles cost one credit, so it is practical to combine several in a single workflow. Bundles are continuously updated as source coverage improves.


What data?

Language strength plus STEM understanding

Text quality

We extract high-purity text from books, articles, and other long-form sources so models learn strong language behavior and produce clearer responses.

STEM depth

We pair language data with STEM content because the next generation of models must not only speak well but also understand the world they describe.

Synthetic expansion

Equation-driven augmentation expands useful coverage while preserving quality constraints and traceability back to verified sources.
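For a flavor of what equation-driven augmentation can mean in practice, here is a minimal, hypothetical sketch: sampling values for a known physics formula and emitting worked examples that carry provenance and license metadata. The formula choice and field names are illustrative only, not our actual generation code.

    import random

    def kinetic_energy_examples(n: int, source_id: str) -> list:
        """Emit n worked problems for E = 0.5 * m * v**2 with provenance."""
        examples = []
        for _ in range(n):
            m = round(random.uniform(0.5, 100.0), 2)  # mass in kg
            v = round(random.uniform(1.0, 50.0), 2)   # speed in m/s
            energy = 0.5 * m * v ** 2
            examples.append({
                "question": f"A {m} kg object moves at {v} m/s. "
                            "What is its kinetic energy?",
                "answer": f"E = 0.5 * {m} * {v}**2 = {energy:.2f} J",
                "source": source_id,   # traceability back to the seed source
                "license": "CC0-1.0",  # synthetic output keeps the clean profile
            })
        return examples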

Result

A clean, legally safe, and constantly improving data foundation for model training, evaluation, and retrieval.