Process & Methodology

How It's Made.

Every dataset on Neurvance is born from one of two battle-tested pipelines: real-world data harvested from 36 verified CC0 sources, or synthetically generated training data crafted by local language models. Both pipelines run through an exacting 11-stage quality assurance gauntlet before a single record reaches you. The result? Clean, deduplicated, PII-scrubbed, license-verified data — ready to drop into your training loop. We target CC0 for everything we produce, so you own your models, no asterisks.

Our Commitment to CC0

Everything on Neurvance targets the Creative Commons Zero (CC0) public domain dedication — no restrictions, no attribution required, no asterisks. Synthetic data generated by our AI pipeline carries no copyright by design and is CC0 by default. Real-world data is pulled exclusively from sources that publish under CC0 or equivalent public domain terms.

CC0 License ↗

How We Check Licenses

Source Allowlisting

DataGather only queries sources that are pre-screened and confirmed to publish exclusively under CC0 or public domain terms. Anything not on that list is never contacted. If a source's terms change, it is removed.

Record-Level Verification

When datasets include a license field, each record's declared license is normalised and automatically checked. Records carrying an unrecognised or non-permissive license are rejected at ingest — they never make it into a published dataset.

Public Audit Log

Every upload — what was ingested, from which category, and when — is written to a public log. You can verify exactly what entered the platform and cross-reference against the original sources yourself.

Two Sources of Data

Each pipeline produces different kinds of data and is optimised for different goals.

DataGather

Real-World CC0 Data

DataGather automatically discovers and downloads publicly available datasets from 36 trusted sources — government data portals, scientific repositories, patent offices, open-access archives, biodiversity databases, and more. Every source is verified against a strict CC0 / public domain allowlist before a single byte is downloaded. No grey-area licenses. Ever.

What it produces
Structured datasets Scientific records Government statistics Museum metadata Geospatial data Open research data
Pipeline Steps
01. Discover datasets from all enabled sources (up to 50 per source)
02. Verify license is CC0 or public domain — reject everything else
03. Download files to temporary workspace
04. Run full AiTrainingData QA pipeline (11 stages)
05. Upload surviving files to storage with category metadata
06. Mark complete, delete local temp files
AIDATA

Synthetic Training Data

AIDATA generates high-quality synthetic training data using local language models (Ollama / HuggingFace). Generators for 9 content categories run concurrently, each pulling seed prompts from curated prompt pools, running inference, and validating every record before it ever reaches a file.

9 Content Categories
Code Conversation Creative Instruction Reasoning Scenario Shopping Summarization Tool Use
Generation Flow
01. Generators cycle round-robin from seed prompt pools
02. Local LLM runs inference on each prompt
03. Format validation + quality checks applied
04. Near-dedup screening rejects redundant records
05. Optional LLM quality judge scores borderline records
06. Accepted records written to JSONL / Parquet output
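The round-robin loop above can be sketched in a few lines of Python. This is an illustrative toy: `generate` stands in for local LLM inference (Ollama / HuggingFace), and `passes_validation` stands in for the real format and quality checks; the seed prompts and category names here are placeholders.

```python
from itertools import cycle

# Hypothetical seed prompt pools; the real pools are curated per category.
seed_pools = {
    "code": ["Write a function that ...", "Explain this snippet ..."],
    "reasoning": ["Solve step by step: ...", "Why does ...?"],
}

def generate(prompt: str) -> str:
    """Placeholder for local-LLM inference."""
    return f"response to: {prompt}"

def passes_validation(record: str) -> bool:
    """Stand-in for format validation + quality checks."""
    return len(record.split()) >= 3

def round_robin(seed_pools, n_records):
    # Each category gets an endlessly cycling iterator over its seed prompts.
    pools = {cat: cycle(prompts) for cat, prompts in seed_pools.items()}
    categories = cycle(pools)
    accepted = []
    while len(accepted) < n_records:
        cat = next(categories)                 # rotate through categories
        record = generate(next(pools[cat]))    # next seed prompt for that pool
        if passes_validation(record):
            accepted.append({"text": record, "category": cat})
    return accepted

records = round_robin(seed_pools, 4)
```

In the real pipeline each accepted record would then go through near-dedup screening before being written out.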

The Quality Pipeline

All data — whether gathered or generated — passes through an 11-stage pipeline before release. The pipeline is automated, repeatable, and produces a quality report for every dataset.

01

Ingest & Encoding Cleanup

Files are loaded into a structured DataFrame. Encoding issues and mojibake are repaired automatically. HTML tags, control characters, and null bytes are stripped. Excess whitespace is normalised. Rows where the text column is empty after cleaning are dropped before any further processing.
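A minimal sketch of that cleanup step, using stdlib regex and Unicode categories (the production pipeline does more, e.g. mojibake repair):

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Strip HTML tags, null bytes and control characters; normalise whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)        # HTML tags
    text = text.replace("\x00", "")            # null bytes
    text = "".join(ch for ch in text
                   if unicodedata.category(ch) != "Cc" or ch in "\n\t")
    text = re.sub(r"\s+", " ", text).strip()   # collapse excess whitespace
    return text

rows = ["<p>Hello\x00  world</p>", "   ", "plain text"]
cleaned = [clean_text(r) for r in rows]
kept = [c for c in cleaned if c]  # drop rows empty after cleaning
```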

02

Exact Deduplication

Every record is hashed (SHA-256) and exact duplicates are removed. Research shows that identical records in training data cause models to memorise and reproduce verbatim text — Lee et al. (2022) found a single 61-word sentence repeated over 60,000 times in the C4 dataset. Our pipeline catches every one.
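The mechanism is straightforward: hash each record's text and keep only the first occurrence of each digest. A minimal sketch with pandas (column name `text` assumed):

```python
import hashlib

import pandas as pd

def drop_exact_duplicates(df: pd.DataFrame, col: str = "text") -> pd.DataFrame:
    """SHA-256 each record's text and keep the first occurrence of each hash."""
    hashes = df[col].map(lambda t: hashlib.sha256(t.encode("utf-8")).hexdigest())
    return df.loc[~hashes.duplicated()].reset_index(drop=True)

df = pd.DataFrame({"text": ["the same sentence", "the same sentence", "a new one"]})
deduped = drop_exact_duplicates(df)
```

Hashing rather than comparing raw strings keeps memory bounded even when records are long.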

03

Near-Duplicate Detection

MinHash with Locality-Sensitive Hashing (LSH) identifies records that are highly similar but not identical. According to Lee et al. (2022), near-dedup causes models to emit memorised text 10× less frequently and reach the same accuracy in fewer training steps. That's raw efficiency, baked in.
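The intuition behind MinHash is that two documents' minimum hash values agree with probability equal to their Jaccard similarity. A toy, pure-stdlib illustration follows; production systems add LSH bucketing (e.g. via the `datasketch` library) so candidates are found without pairwise comparison:

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Word n-gram shingles of the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(shingle_set: set, num_perm: int = 64) -> list:
    """One seeded hash per 'permutation'; take the minimum over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

near_dup_a = "the quick brown fox jumps over the lazy dog"
near_dup_b = "the quick brown fox jumps over the lazy cat"
unrelated = "completely different sentence about data pipeline quality checks"
sig_a, sig_b, sig_c = (minhash_signature(shingles(t))
                       for t in (near_dup_a, near_dup_b, unrelated))
```

The two near-duplicate sentences share most of their shingles, so their signatures largely agree; the unrelated sentence shares none.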

04

PII Detection & Redaction

Pattern matching scans every record for common forms of personally identifiable information — email addresses, phone numbers, social security numbers, IP addresses, and credit card numbers. Detected values are either redacted with a placeholder or replaced with a fake but format-valid substitute, depending on configuration. No records are removed; only the sensitive values within them are replaced.
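A minimal sketch of the redaction mode, covering two of the PII types listed above (the patterns here are simplified illustrations, not the production regexes):

```python
import re

# Hypothetical, simplified patterns for two PII types.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII value with a typed placeholder; keep the record."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

sample = "Contact jane.doe@example.com or use SSN 123-45-6789."
```

Note that the record itself survives; only the sensitive spans inside it are replaced.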

05

Toxicity Filtering

Records are screened against a keyword blocklist covering slurs, violence incitement, exploitation, and extremism. If a local LLM is configured, it runs a secondary pass for records that pass the keyword check. Records flagged by either method are removed.
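The keyword pass can be sketched as a simple set-disjointness check. The blocklist terms below are harmless placeholders, not the real list, and the production screen also handles casing, punctuation, and multi-word phrases:

```python
# Placeholder terms standing in for the real blocklist.
BLOCKLIST = {"badword1", "badword2"}

def passes_keyword_screen(text: str) -> bool:
    """True when no blocklisted token appears in the record."""
    return set(text.lower().split()).isdisjoint(BLOCKLIST)

records = ["a perfectly normal sentence", "contains badword1 somewhere"]
kept = [r for r in records if passes_keyword_screen(r)]
```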

06

Language Filter

Records are checked against character-level heuristics to confirm they are Latin-script English. Records that fail the check — non-Latin scripts, heavily encoded text, or content that is mostly non-ASCII — are removed. This is a heuristic filter, not a full language identification model.
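One plausible shape for such a heuristic is an ASCII-ratio check plus a requirement for alphabetic content (the threshold below is an assumption for illustration):

```python
def looks_like_latin_english(text: str, min_ascii_ratio: float = 0.9) -> bool:
    """Character-level heuristic: mostly ASCII, with some ASCII letters present."""
    if not text:
        return False
    ascii_ratio = sum(ch.isascii() for ch in text) / len(text)
    has_ascii_letters = any(ch.isalpha() and ch.isascii() for ch in text)
    return ascii_ratio >= min_ascii_ratio and has_ascii_letters
```

A record of pure digits or symbols fails even though it is all-ASCII, because it has no alphabetic content to classify.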

07

Content Quality Scoring

Each record is scored 0–1 across several checks: word count bounds, gibberish detection, vocabulary diversity, repeated lines and n-gram patterns, unfilled template placeholders, AI boilerplate phrases, emoji density, and apparent truncation. Text is also cleaned — markdown formatting and filler phrases are stripped before scoring. Records below the minimum score are removed.
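A toy version of the scoring idea, averaging four of the checks named above into a 0-1 score (the real checks and thresholds are more elaborate):

```python
def quality_score(text: str) -> float:
    """Toy 0-1 quality score: the fraction of simple checks the record passes."""
    words = text.split()
    checks = [
        5 <= len(words) <= 2000,                      # word-count bounds
        len(set(words)) / max(len(words), 1) > 0.3,   # vocabulary diversity
        "{" not in text,                              # unfilled template placeholder
        not text.rstrip().endswith("..."),            # apparent truncation
    ]
    return sum(checks) / len(checks)
```

A highly repetitive record fails the diversity check and scores below the maximum, so it can be dropped by a minimum-score cutoff.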

08

Bias Audit

A report-only step — no records are removed. The pipeline counts occurrences of gendered pronoun groups across the dataset and flags any skew above a threshold. If the dataset has labelled demographic columns, their value distributions are also reported. Results are written to the quality report.
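The pronoun tally can be sketched as a token count per group, with a skew flag above a threshold (the groups and the 0.7 threshold below are illustrative assumptions):

```python
import re
from collections import Counter

# Illustrative pronoun groups; the real audit covers more groups.
PRONOUN_GROUPS = {
    "masculine": {"he", "him", "his"},
    "feminine": {"she", "her", "hers"},
}

def pronoun_counts(texts) -> Counter:
    """Report-only tally of gendered pronoun occurrences across a corpus."""
    counts = Counter({group: 0 for group in PRONOUN_GROUPS})
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            for group, pronouns in PRONOUN_GROUPS.items():
                if token in pronouns:
                    counts[group] += 1
    return counts

corpus = ["He said his plan worked.", "She agreed with him."]
counts = pronoun_counts(corpus)
skew_flagged = max(counts.values()) / max(sum(counts.values()), 1) > 0.7
```

Because the step is report-only, `skew_flagged` goes into the quality report rather than removing any records.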

09

License Verification

When a dataset contains a license or rights column, each record's declared license is normalised and checked against an allowed list: CC0, public domain, PDDL, CC-BY, MIT, and Apache 2.0. Records with unrecognised or non-permissive licenses are removed. If no license column is present the step is skipped — license compliance for gathered data is enforced at the source level by DataGather.
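A minimal sketch of the normalise-then-allowlist check (the alias table here is a small illustrative subset, not the production mapping):

```python
ALLOWED = {"cc0", "public domain", "pddl", "cc-by", "mit", "apache-2.0"}

# Hypothetical alias table; the real normalisation map is more extensive.
ALIASES = {
    "cc0-1.0": "cc0",
    "creative commons zero": "cc0",
    "publicdomain": "public domain",
    "cc-by-4.0": "cc-by",
    "apache 2.0": "apache-2.0",
}

def normalise(raw: str) -> str:
    s = raw.strip().lower()
    return ALIASES.get(s, s)

def is_permissive(raw: str) -> bool:
    return normalise(raw) in ALLOWED

records = [
    {"text": "a", "license": "CC0-1.0"},
    {"text": "b", "license": "CC-BY-NC-4.0"},  # non-commercial clause: removed
]
kept = [r for r in records if is_permissive(r["license"])]
```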

10

Train / Test Split

Records are split into train and test sets (default 90 / 10) using a fixed random seed for reproducibility. A hash-based overlap check then confirms that no record appears in both splits. Any overlap found is reported as a warning. Output is written as both JSONL and Parquet.
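The split-and-verify step can be sketched with a seeded shuffle followed by a hash-set intersection (seed value and function names here are illustrative):

```python
import hashlib
import random

def split_with_overlap_check(records, test_fraction=0.1, seed=42):
    """Seeded shuffle, train/test split, then a hash-based check for leakage."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)       # fixed seed => reproducible split
    cut = int(len(shuffled) * (1 - test_fraction))
    train, test = shuffled[:cut], shuffled[cut:]
    digest = lambda r: hashlib.sha256(r.encode("utf-8")).hexdigest()
    overlap = {digest(r) for r in train} & {digest(r) for r in test}
    return train, test, overlap

records = [f"record {i}" for i in range(100)]
train, test, overlap = split_with_overlap_check(records)
```

A non-empty `overlap` set would be surfaced as a warning in the quality report rather than silently ignored.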

11

Quality Report

A quality report is generated and uploaded alongside the dataset. It records total records in and out, a per-step breakdown of how many records were removed or modified at each stage, timing, any warnings raised, and step-level details such as dedup counts, rejection reasons, and bias observations.

DataGather Sources

36 verified sources are queried automatically — spanning government data portals, scientific repositories, patent offices, legal archives, open-access research databases, biodiversity records, cultural heritage collections, and more. Every source is pre-screened to ensure it publishes exclusively under CC0 or equivalent public domain terms. This is the breadth you cannot build in an afternoon.

Government Portals
Scientific Repositories
Patent Offices
Legal Archives
Economic & Financial Data
Climate & Weather Records
Biomedical Research
Astronomical Data
Geospatial & Earth Science
Cultural Heritage
Museum Collections
Open-Access Journals
Scholarly Metadata
Biodiversity Records
Protein & Molecular Data
Public Domain Literature
Legislative Records
Clinical Trials Data
Regulatory Filings
Open Image Datasets
Researcher Profiles
Collaborative Knowledge Bases
Robotics & Automation Data
Open Game Assets
Securities & Market Data
Census & Demographics
Labour Statistics
Drug & Food Safety
ML Dataset Registries
Research Data Platforms
Open Encyclopaedias
Satellite & Remote Sensing
Environmental Monitoring
Digital Libraries
Public Health Records
Cross-domain Archives

Why This Matters

Bad training data doesn't just slow you down — it silently poisons your models. These are the peer-reviewed numbers that shaped every decision we made.

10×

Reduction in memorised text emitted by models trained on deduplicated data — confirmed by Lee et al. (2022, ACL). Every dataset here ships with full dedup already done.

Up to 50%

Of validation examples in standard NLP benchmarks (incl. GLUE) were found verbatim in common pre-training corpora, per Dodge et al. (2021, EMNLP). Our pipeline catches train-test overlap after every split and flags it in the quality report.

CC0

Every dataset we publish targets CC0 — the most permissive public domain dedication. No restrictions. No attribution required.

Output Formats

All processed datasets are available in two formats.

JSONL

JSON Lines format — one record per line. Human-readable, easy to stream, and compatible with every major ML framework. Ideal for inspection, quick loading, and custom preprocessing.

{"text": "...", "category": "code"}
{"text": "...", "category": "code"}

Parquet

Columnar binary format with built-in compression. Substantially smaller than JSONL for large datasets. Native support in Pandas, PyArrow, HuggingFace Datasets, Spark, and DuckDB.

import pandas as pd
df = pd.read_parquet("train.parquet")

Directory Structure

Published datasets are organised per subcategory under the site taxonomy. Each folder contains a train and test split in both formats.

release/
└── Text / Language Data/
    ├── Books/
    │   ├── train.parquet
    │   ├── train.jsonl
    │   ├── test.parquet
    │   └── test.jsonl
    ├── Chat Conversations/
    ├── Code - Programming/
    ├── Documentation/
    ├── News Articles/
    ├── Scientific Papers/
    ├── Websites/
    └── Wikipedia Pages/

We get the datasets.
You make the intelligence.

36 sources. 11 QA stages. CC0 targeted. Real-world data cleaned, synthetic data ready — $10/month for 50 API calls, every dataset on the platform. Stop doing data work. Start building smarter models.

A note on guarantees — While we apply multiple layers of automated checking, no system can guarantee with absolute certainty that every dataset is CC0 — source metadata can be incomplete or change after the fact. We make every reasonable effort to verify licensing before anything reaches you, but for production use we always recommend tracing datasets back to their original sources and verifying license terms independently.