Dataset Transparency
We start with real, verified public-domain and CC0 material, then use filtering and equation-guided synthetic generation to expand clean coverage. The result is a dataset stack built for training and retrieval, not just raw scraping.
Verified source
28GB
High-quality real-data baseline.
Expanded output
100GB+
Clean synthetic + curated training data.
Rights profile
CC0 / PD
No copyright ambiguity in delivery.
Licensing
Our pipeline is strict by design: we gather and publish only CC0 and public-domain data. That means your team can train and deploy without stepping into rights uncertainty.
RAG system
Our RAG layer filters retrieved results so that only public-domain and CC0 material reaches the model. Models can draw on internet-scale context while the response path stays restricted to cleanly licensed sources.
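A license-aware retrieval filter of this kind can be sketched as follows. This is an illustrative minimal example, not the product's actual API; the `Doc` class, the `license` field, and `filter_by_license` are hypothetical names assumed for the sketch.

```python
from dataclasses import dataclass

# Only these license tags are allowed through the response path (illustrative set).
ALLOWED_LICENSES = {"CC0", "public-domain"}

@dataclass
class Doc:
    text: str
    license: str  # license tag attached to each document at ingestion time

def filter_by_license(docs):
    """Keep only documents whose license is in the allowed set."""
    return [d for d in docs if d.license in ALLOWED_LICENSES]

# Example: a mixed retrieval result; only the CC0/PD items survive.
docs = [
    Doc("A public-domain paragraph.", "public-domain"),
    Doc("A CC-BY paragraph.", "CC-BY"),
    Doc("A CC0 paragraph.", "CC0"),
]
clean = filter_by_license(docs)
```

The key design point is that filtering happens on metadata attached at ingestion, so the model never has to reason about licensing itself.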
Bundles
Data is organized into focused bundles by keyword. Most bundles cost one credit, so it is practical to combine several in a single workflow. Bundles are continuously updated as source coverage improves.
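Combining bundles under a credit budget is simple arithmetic. The sketch below uses hypothetical bundle names and prices purely for illustration; actual bundle contents and pricing come from the catalog.

```python
# Hypothetical bundle catalog: keyword -> price in credits.
bundles = {"physics": 1, "chemistry": 1, "rare-manuscripts": 2}

def total_cost(selected):
    """Sum the credit cost of a set of bundles for one workflow."""
    return sum(bundles[name] for name in selected)

# Most bundles cost one credit, so pairing two is a two-credit workflow.
cost = total_cost(["physics", "chemistry"])
```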
What data?
Text quality
We extract high-purity text from books, articles, and other long-form sources so models learn strong language behavior and clearer responses.
STEM depth
We pair language data with STEM content because capable models must not only write well but also understand the world they describe.
Synthetic expansion
Equation-driven augmentation expands useful coverage while preserving quality constraints and source-traceability principles.
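The idea of equation-driven augmentation with source traceability can be sketched as below. This is a minimal toy example, assuming a simple addition template; the template id and field names are hypothetical, and real pipelines would cover far richer equation families.

```python
import random

def make_example(a, b):
    """Instantiate one synthetic example from an addition template,
    tagging it with the template id so it stays traceable."""
    return {
        "question": f"What is {a} + {b}?",
        "answer": str(a + b),  # answer derived from the equation, so it is correct by construction
        "source": "template:addition-v1",  # traceability tag (illustrative)
    }

# Seeded RNG keeps the generated set reproducible.
rng = random.Random(0)
examples = [make_example(rng.randint(1, 9), rng.randint(1, 9)) for _ in range(3)]
```

Because every answer is computed from the generating equation rather than sampled from a model, quality constraints hold by construction, and the `source` tag preserves the link back to the template.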
Result
A clean, legally safe, and constantly improving data foundation for model training, evaluation, and retrieval.