Process & Methodology
How It's Made.
Every dataset on Neurvance is born from one of two battle-tested pipelines: real-world data harvested from 36 verified CC0 sources, or synthetically generated training data crafted by local language models. Both pipelines run through an exacting 11-stage quality assurance gauntlet before a single record reaches you. The result? Clean, deduplicated, PII-scrubbed, license-verified data — ready to drop into your training loop. We target CC0 for everything we produce, so you own your models, no asterisks.
Our Commitment to CC0
Everything on Neurvance targets the Creative Commons Zero (CC0) public domain dedication — no restrictions, no attribution required, no asterisks. Synthetic data generated by our AI pipeline carries no copyright by design and is CC0 by default. Real-world data is pulled exclusively from sources that publish under CC0 or equivalent public domain terms.
How We Check Licenses
Source Allowlisting
DataGather only queries sources that are pre-screened and confirmed to publish exclusively under CC0 or public domain terms. Anything not on that list is never contacted. If a source's terms change, it is removed.
Record-Level Verification
When datasets include a license field, each record's declared license is normalised and automatically checked. Records carrying an unrecognised or non-permissive license are rejected at ingest — they never make it into a published dataset.
Public Audit Log
Every upload — what was ingested, from which category, and when — is written to a public log. You can verify exactly what entered the platform and cross-reference against the original sources yourself.
Two Sources of Data
Each pipeline produces different kinds of data and is optimised for different goals.
Real-World CC0 Data
DataGather automatically discovers and downloads publicly available datasets from 36 trusted sources — government data portals, scientific repositories, patent offices, open-access archives, biodiversity databases, and more. Every source is verified against a strict CC0 / public domain allowlist before a single byte is downloaded. No grey-area licenses. Ever.
- 01. Discover datasets from all enabled sources (up to 50 per source)
- 02. Verify license is CC0 or public domain — reject everything else
- 03. Download files to temporary workspace
- 04. Run full AiTrainingData QA pipeline (11 stages)
- 05. Upload surviving files to storage with category metadata
- 06. Mark complete, delete local temp files
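The six steps above can be sketched as a single orchestration loop. This is an illustrative sketch only: every callable here (`discover`, `verify_license`, `download`, `run_qa`, `upload`) is a hypothetical placeholder, not the real DataGather API.

```python
# Hypothetical sketch of the DataGather flow; names are illustrative.
ALLOWED = {"cc0-1.0", "public domain"}  # assumed normalised license strings

def gather(source, discover, verify_license, download, run_qa, upload, limit=50):
    """Run one source through the six DataGather steps."""
    published = []
    for meta in discover(source)[:limit]:                # 1. discover (<= 50 per source)
        if verify_license(meta).lower() not in ALLOWED:  # 2. reject non-CC0 up front
            continue
        path = download(meta)                            # 3. download to temp workspace
        clean = run_qa(path)                             # 4. full 11-stage QA pipeline
        if clean is not None:                            #    QA may reject the file
            published.append(upload(clean, meta["category"]))  # 5. upload with category
    return published                                     # 6. caller marks complete, deletes temp
```

Note that license verification happens before any download: a source that fails step 2 costs nothing beyond the metadata query.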
Synthetic Training Data
AIDATA generates high-quality synthetic training data using local language models (Ollama / HuggingFace). Generators for 9 content categories run concurrently, each pulling seed prompts from curated prompt pools, running inference, and validating every record before it ever reaches a file.
- 01. Generators cycle round-robin from seed prompt pools
- 02. Local LLM runs inference on each prompt
- 03. Format validation + quality checks applied
- 04. Near-dedup screening rejects redundant records
- 05. Optional LLM quality judge scores borderline records
- 06. Accepted records written to JSONL / Parquet output
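The generation loop can be sketched in a few lines. This is a simplified, single-threaded stand-in for the concurrent generators described above; `infer`, `validate`, and `is_near_dup` are hypothetical hooks standing in for the local LLM, the format checks, and the near-dedup screen.

```python
import itertools
import json

def generate(prompt_pools, infer, validate, is_near_dup, n_records):
    """Round-robin over category prompt pools; keep only validated, novel records."""
    accepted = []
    cycler = itertools.cycle(prompt_pools.items())      # 1. round-robin across pools
    while len(accepted) < n_records:
        category, pool = next(cycler)
        prompt = pool[len(accepted) % len(pool)]        # pull a seed prompt
        text = infer(prompt)                            # 2. local LLM inference
        if not validate(text):                          # 3. format + quality checks
            continue
        if is_near_dup(text, accepted):                 # 4. near-dedup screening
            continue
        accepted.append({"text": text, "category": category})
    return [json.dumps(r) for r in accepted]            # 6. JSONL-ready lines
```

The optional LLM quality judge (step 5) would slot in as one more predicate between the near-dedup screen and acceptance.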
The Quality Pipeline
All data — whether gathered or generated — passes through an 11-stage pipeline before release. The pipeline is automated, repeatable, and produces a quality report for every dataset.
Ingest & Encoding Cleanup
Files are loaded into a structured DataFrame. Encoding issues and mojibake are repaired automatically. HTML tags, control characters, and null bytes are stripped. Excess whitespace is normalised. Rows where the text column is empty after cleaning are dropped before any further processing.
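The cleanup step looks roughly like this. A minimal stdlib-only sketch (a production pipeline would likely use a dedicated library such as ftfy for mojibake repair, which this sketch omits):

```python
import re
import unicodedata

def clean_text(text):
    """Strip HTML tags, control chars, and null bytes; normalise whitespace.

    Returns None for rows that are empty after cleaning, so callers can drop them.
    """
    text = text.replace("\x00", "")                     # null bytes
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = "".join(                                     # control characters
        c for c in text
        if unicodedata.category(c)[0] != "C" or c in "\n\t"
    )
    text = re.sub(r"\s+", " ", text).strip()            # excess whitespace
    return text or None                                 # empty -> drop the row
```

A row like `"<p></p>"` cleans down to nothing and is dropped before any later stage sees it.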
Exact Deduplication
Every record is hashed (SHA-256) and exact duplicates are removed. Research shows that identical records in training data cause models to memorise and reproduce verbatim text — Lee et al. (2022) found a single 61-word sentence repeated over 60,000 times in the C4 dataset. Our pipeline catches every one.
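Exact deduplication is the simplest stage conceptually: hash each record's text with SHA-256 and keep only the first occurrence of each digest. A minimal sketch:

```python
import hashlib

def dedup_exact(records):
    """Keep the first occurrence of each record, keyed by SHA-256 of its text."""
    seen, kept = set(), []
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept
```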
Near-Duplicate Detection
MinHash with Locality-Sensitive Hashing (LSH) identifies records that are highly similar but not identical. According to Lee et al. (2022), near-dedup causes models to emit memorised text 10× less frequently and reach the same accuracy in fewer training steps. That's raw efficiency, baked in.
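The core MinHash idea can be shown compactly. This is a simplified, stdlib-only sketch: word shingles plus salted hashes standing in for random permutations, and no LSH banding step (which is what makes the real pipeline scale past pairwise comparison). The fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib

def shingles(text, k=3):
    """Break text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(shingle_set, num_perm=64):
    """Signature: for each salted hash, the minimum value over all shingles."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_perm)
    )

def est_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Identical texts produce identical signatures (estimate 1.0); records scoring above a similarity threshold are the ones the near-dedup stage rejects.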
PII Detection & Redaction
Pattern matching scans every record for common forms of personally identifiable information — email addresses, phone numbers, social security numbers, IP addresses, and credit card numbers. Detected values are either redacted with a placeholder or replaced with a fake but format-valid substitute, depending on configuration. No records are removed; only the sensitive values within them are replaced.
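The redaction mode described above amounts to pattern substitution. A minimal sketch with three illustrative patterns (the real pipeline covers more PII types and formats than these):

```python
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text):
    """Replace detected PII with placeholders; the record itself is kept."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

The record survives with its structure intact; only the sensitive span changes, which is why this stage never reduces the record count.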
Toxicity Filtering
Records are screened against a keyword blocklist covering slurs, violence incitement, exploitation, and extremism. If a local LLM is configured, it runs a secondary pass for records that pass the keyword check. Records flagged by either method are removed.
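The first-pass keyword screen is a set-membership check. A sketch with placeholder terms (the real blocklist is curated and far larger):

```python
BLOCKLIST = {"examplebadword1", "examplebadword2"}  # placeholder terms only

def passes_keyword_screen(text):
    """First pass: reject the record if any blocked term appears as a word."""
    words = set(text.lower().split())
    return words.isdisjoint(BLOCKLIST)
```

Records that pass this cheap check may still be removed by the optional LLM second pass.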
Language Filter
Records are checked against character-level heuristics to confirm they are Latin-script English. Records that fail the check — non-Latin scripts, heavily encoded text, or content that is mostly non-ASCII — are removed. This is a heuristic filter, not a full language identification model.
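One such character-level heuristic is a non-ASCII ratio check. The 10% threshold below is an illustrative assumption, not the pipeline's actual setting:

```python
def looks_latin_english(text, max_non_ascii=0.1):
    """Heuristic: reject text where too many characters fall outside ASCII."""
    if not text:
        return False
    non_ascii = sum(1 for c in text if ord(c) > 127)
    return non_ascii / len(text) <= max_non_ascii
```

English prose with the occasional accented name passes; a record that is mostly Cyrillic, CJK, or base64 noise does not.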
Content Quality Scoring
Each record is scored 0–1 across several checks: word count bounds, gibberish detection, vocabulary diversity, repeated lines and n-gram patterns, unfilled template placeholders, AI boilerplate phrases, emoji density, and apparent truncation. Text is also cleaned — markdown formatting and filler phrases are stripped before scoring. Records below the minimum score are removed.
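A composite score like this is just an average over pass/fail checks. A sketch covering four of the checks named above; the bounds, threshold, and boilerplate phrase are illustrative assumptions:

```python
import re

def quality_score(text, min_words=5, max_words=5000):
    """Average of a few illustrative checks; the real pipeline applies more."""
    words = text.split()
    checks = [
        min_words <= len(words) <= max_words,            # word count bounds
        len(set(words)) / max(1, len(words)) > 0.3,      # vocabulary diversity
        "{{" not in text and "}}" not in text,           # unfilled template placeholders
        not re.search(r"\bas an ai language model\b",    # AI boilerplate phrase
                      text.lower()),
    ]
    return sum(checks) / len(checks)
```

A record scoring below the configured minimum is removed; one failed check out of four here yields 0.75.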
Bias Audit
A report-only step — no records are removed. The pipeline counts occurrences of gendered pronoun groups across the dataset and flags any skew above a threshold. If the dataset has labelled demographic columns, their value distributions are also reported. Results are written to the quality report.
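The pronoun count behind that report can be sketched as a simple tally (the group lists here are abbreviated for illustration):

```python
from collections import Counter

PRONOUNS = {"he": "masc", "him": "masc", "his": "masc",
            "she": "fem", "her": "fem", "hers": "fem"}

def pronoun_skew(texts):
    """Report-only: tally gendered pronoun groups across the dataset."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            group = PRONOUNS.get(word.strip(".,!?;:"))
            if group:
                counts[group] += 1
    return dict(counts)
```

The resulting counts go into the quality report; a skew above the configured threshold raises a flag but never drops a record.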
License Verification
When a dataset contains a license or rights column, each record's declared license is normalised and checked against an allowed list: CC0, public domain, PDDL, CC-BY, MIT, and Apache 2.0. Records with unrecognised or non-permissive licenses are removed. If no license column is present the step is skipped — license compliance for gathered data is enforced at the source level by DataGather.
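The normalise-then-check logic looks like this. The allowlist matches the one named above; the normalisation table is an illustrative assumption, since real-world license strings vary widely:

```python
ALLOWED = {"cc0", "public domain", "pddl", "cc-by", "mit", "apache-2.0"}

NORMALISE = {  # illustrative aliases; real metadata is messier
    "cc0-1.0": "cc0", "creative commons zero": "cc0",
    "publicdomain": "public domain",
    "apache 2.0": "apache-2.0", "cc by 4.0": "cc-by",
}

def license_ok(declared):
    """Normalise a declared license string, then check it against the allowlist."""
    key = declared.strip().lower()
    return NORMALISE.get(key, key) in ALLOWED
```

Anything unrecognised fails closed: a license string the table cannot map is treated as non-permissive and the record is removed.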
Train / Test Split
Records are split into train and test sets (default 90 / 10) using a fixed random seed for reproducibility. A hash-based overlap check then confirms that no record appears in both splits. Any overlap found is reported as a warning. Output is written as both JSONL and Parquet.
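A seeded split plus overlap check can be sketched as follows (the seed value 42 is an illustrative assumption; the point is that it is fixed):

```python
import hashlib
import random

def split_records(records, test_frac=0.10, seed=42):
    """90/10 split with a fixed seed, then a hash-based overlap check."""
    rng = random.Random(seed)            # fixed seed -> reproducible shuffle
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    train, test = shuffled[:cut], shuffled[cut:]

    def h(rec):
        return hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()

    overlap = {h(r) for r in train} & {h(r) for r in test}
    return train, test, overlap          # non-empty overlap -> warning in report
```

Because earlier dedup stages already removed identical records, the overlap set should be empty; anything found is surfaced as a warning rather than silently dropped.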
Quality Report
A quality report is generated and uploaded alongside the dataset. It records total records in and out, a per-step breakdown of how many records were removed or modified at each stage, timing, any warnings raised, and step-level details such as dedup counts, rejection reasons, and bias observations.
DataGather Sources
36 verified sources are queried automatically — spanning government data portals, scientific repositories, patent offices, legal archives, open-access research databases, biodiversity records, cultural heritage collections, and more. Every source is pre-screened to ensure it publishes exclusively under CC0 or equivalent public domain terms. This is the breadth you cannot build in an afternoon.
Why This Matters
Bad training data doesn't just slow you down — it silently poisons your models. These are the peer-reviewed numbers that shaped every decision we made.
10× reduction in memorised text emitted by models trained on deduplicated data, per Lee et al. (2022, ACL). Every dataset here ships with full dedup already done.
Validation examples from standard NLP benchmarks (including GLUE) have been found verbatim in common pre-training corpora, per Dodge et al. (2021, EMNLP). Our pipeline checks for train-test overlap after every split and flags it in the quality report.
Every dataset we publish is targeted for CC0 — the most permissive public domain dedication. No restrictions. No attribution required.
Output Formats
All processed datasets are available in two formats.
JSONL
JSON Lines format — one record per line. Human-readable, easy to stream, and compatible with every major ML framework. Ideal for inspection, quick loading, and custom preprocessing.
{"text": "...", "category": "code"}
{"text": "...", "category": "code"}
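Reading JSONL needs nothing beyond the standard library; each line parses independently, which is what makes the format streamable:

```python
import json

def read_jsonl(path):
    """Stream a JSONL file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```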
Parquet
Columnar binary format with built-in compression. Substantially smaller than JSONL for large datasets. Native support in Pandas, PyArrow, HuggingFace Datasets, Spark, and DuckDB.
import pandas as pd
df = pd.read_parquet("train.parquet")
Directory Structure
Published datasets are organised per subcategory under the site taxonomy. Each folder contains a train and test split in both formats.
release/
└── Text / Language Data/
├── Books/
│ ├── train.parquet
│ ├── train.jsonl
│ ├── test.parquet
│ └── test.jsonl
├── Chat Conversations/
├── Code - Programming/
├── Documentation/
├── News Articles/
├── Scientific Papers/
├── Websites/
└── Wikipedia Pages/
We get the datasets.
You make the intelligence.
36 sources. 11 QA stages. CC0 targeted. Real-world data cleaned, synthetic data ready — $10/month for 50 API calls, every dataset on the platform. Stop doing data work. Start building smarter models.
A note on guarantees — While we apply multiple layers of automated checking, no system can guarantee with absolute certainty that every dataset is CC0 — source metadata can be incomplete or change after the fact. We make every reasonable effort to verify licensing before anything reaches you, but for production use we always recommend tracing datasets back to their original sources and verifying license terms independently.