ML & Training Data Collection
Build large-scale datasets from the web for model training and fine-tuning
These datasets feed machine-learning pipelines such as NLP, information extraction, classification, and entity recognition.
Common sources
- Public articles, blogs, documentation, forums
- Public product catalogs and listings
- Tables and structured lists that can serve as sources of labels
What to extract
- Clean text with provenance (URL, timestamp)
- Structured fields suitable for supervised labels (title/category/price/attributes)
- Tables as normalized rows
- Image URLs / media metadata (if needed for CV pipelines)
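The extraction targets above can be sketched with the standard library alone. The `Record` shape, the `extract` helper, and the sample HTML below are illustrative assumptions, not part of any particular scraping framework; a production pipeline would typically use a robust HTML library instead of `html.parser`.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from html.parser import HTMLParser

# Hypothetical record shape: clean text plus provenance, labels, and table rows.
@dataclass
class Record:
    source_url: str
    fetched_at: str                      # provenance timestamp (UTC, ISO 8601)
    title: str = ""                      # candidate supervised label
    text: str = ""                       # cleaned visible text
    rows: list = field(default_factory=list)  # tables as normalized rows

class TextExtractor(HTMLParser):
    """Collects the <title>, visible text, and table rows as lists of cells.

    Sketch only: void tags like <br> are not handled, so feed it
    well-formed HTML.
    """
    def __init__(self):
        super().__init__()
        self._stack = []
        self.title = ""
        self.chunks = []
        self.rows = []
        self._row = None

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        if tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self._stack:
            self.title += text
        elif self._row is not None and self._stack and self._stack[-1] in ("td", "th"):
            self._row.append(text)
        elif self._stack and self._stack[-1] not in ("script", "style"):
            self.chunks.append(text)

def extract(url: str, html: str) -> Record:
    parser = TextExtractor()
    parser.feed(html)
    return Record(
        source_url=url,
        fetched_at=datetime.now(timezone.utc).isoformat(),
        title=parser.title,
        text=" ".join(parser.chunks),
        rows=parser.rows,
    )

# Hypothetical fetched page; a real pipeline would get this over HTTP.
html = """<html><head><title>Widget A</title></head>
<body><p>A sturdy widget.</p>
<table><tr><th>price</th><th>color</th></tr>
<tr><td>9.99</td><td>red</td></tr></table></body></html>"""

rec = extract("https://example.com/widget-a", html)
```

Each `Record` carries its URL and fetch timestamp alongside the cleaned content, so every dataset row can be traced back to its source.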
Implementation notes
- Keep dataset rows deterministic: same URL → same schema fields.
- Store raw + cleaned versions (raw HTML optional).
- Use stable identifiers: source_url, content_hash, extraction_version.
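The identifier scheme above can be sketched as follows. The `make_row` helper and the `EXTRACTION_VERSION` constant are assumed names for illustration; the point is that hashing the normalized cleaned text makes unchanged pages dedupe to the same row, while the version field distinguishes rows produced by different extraction logic.

```python
import hashlib

EXTRACTION_VERSION = "v1"  # bump whenever the extraction logic changes

def content_hash(text: str) -> str:
    """SHA-256 of the whitespace-normalized text, used as a dedup key."""
    normalized = " ".join(text.split())  # collapse runs of whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def make_row(source_url: str, text: str) -> dict:
    """Hypothetical dataset row carrying the three stable identifiers."""
    return {
        "source_url": source_url,
        "content_hash": content_hash(text),
        "extraction_version": EXTRACTION_VERSION,
        "text": text,
    }

# Whitespace-only differences hash identically; a real edit does not.
row_a = make_row("https://example.com/a", "Widget A.  A sturdy widget.")
row_b = make_row("https://example.com/a", "Widget A. A sturdy widget.")
```

With these three fields, a re-crawl of the same URL either reproduces an existing row exactly or yields a new content_hash, which keeps the "same URL → same schema fields" rule auditable.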