ML & Training Data Collection

Build large-scale datasets from the web for model training and fine-tuning

Create datasets from the web for machine learning, including NLP, information extraction, classification, and entity recognition pipelines.


Common sources

  • Public articles, blogs, documentation, forums
  • Public product catalogs and listings
  • Tables and structured lists that can serve as sources of labels

What to extract

  • Clean text with provenance (URL, timestamp)
  • Structured fields suitable for supervised labels (title/category/price/attributes)
  • Tables as normalized rows
  • Image URLs / media metadata (if needed for CV pipelines)
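The text-plus-provenance pattern above can be sketched as a small extraction step. This is a minimal illustration using only Python's standard-library HTML parser; the field names (`source_url`, `fetched_at`, `title`, `text`) and the `extract_record` helper are assumptions for the example, not a fixed schema.

```python
import datetime
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and visible paragraph text from an HTML page."""

    def __init__(self):
        super().__init__()
        self._tag = None          # tag currently being read, if any
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "title":
            self.title += data
        elif self._tag == "p":
            self.paragraphs[-1] += data


def extract_record(url: str, html: str) -> dict:
    """Turn one fetched page into a dataset row with provenance attached."""
    parser = TextExtractor()
    parser.feed(html)
    return {
        "source_url": url,                                            # provenance: where it came from
        "fetched_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),  # provenance: when
        "title": parser.title.strip(),                                # candidate supervised label
        "text": "\n".join(p.strip() for p in parser.paragraphs if p.strip()),    # cleaned text
    }
```

In practice the cleaned `text` field would feed the training corpus, while `title` (or category/price fields parsed the same way) can act as a weak label for supervised tasks.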

Implementation notes

  • Keep dataset rows deterministic: the same URL should always yield the same schema fields across runs.
  • Store raw + cleaned versions (raw HTML optional).
  • Use stable identifiers: source_url, content_hash, extraction_version.
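The stable-identifier note above can be made concrete with a content hash over normalized text, so the same content always maps to the same ID and re-crawls can be deduplicated. A minimal sketch; the `EXTRACTION_VERSION` value and the `row_ids` helper are illustrative assumptions.

```python
import hashlib
import unicodedata

# Bump this whenever the extraction logic changes, so rows produced by
# different pipeline versions can be told apart. Placeholder value.
EXTRACTION_VERSION = "2024-01"


def content_hash(text: str) -> str:
    """Hash whitespace- and Unicode-normalized text, so trivially different
    renderings of the same content get the same identifier."""
    normalized = unicodedata.normalize("NFC", " ".join(text.split()))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def row_ids(source_url: str, text: str) -> dict:
    """Stable identifiers to attach to every dataset row."""
    return {
        "source_url": source_url,
        "content_hash": content_hash(text),
        "extraction_version": EXTRACTION_VERSION,
    }
```

Normalizing before hashing is the design choice that matters here: without it, an extra space or a different Unicode composition of the same characters would produce a different hash and defeat deduplication.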

FAQs