AI Data Pipelines: What You Need Before You Train Anything
The Data Problem Is Bigger Than the Model Problem
Everyone wants to talk about model architecture, training techniques, and benchmark performance. These conversations are interesting. They're also mostly irrelevant to the average AI project.
The average AI project fails not because of the model — it fails because of the data. Bad data, insufficient data, mislabeled data, data with leakage between splits, data that doesn't represent the production distribution. These problems are unglamorous and common.
Before you train anything, you need to solve the data problem. Here's what that actually involves.
Data Inventory: What You Have vs What You Need
Start with an honest inventory.
What you have: Audit every data source available to you. Databases, logs, documents, user interactions, external datasets, historical records. For each source, record: volume (how many examples), format (structured, unstructured, image, audio), quality (how clean is it), labeling status (is it labeled for your task?), and access (can you actually use it legally and technically?).
What you need: Define your task precisely. What is the input? What is the output? What quality level do you need? Work backwards from that to a minimum dataset size estimate. For classification tasks, a rule of thumb is 1,000+ examples per class for a simple task, 10,000+ for a complex one. For generation tasks, it's harder to estimate — start with what you have and measure.
The gap between what you have and what you need is the data collection problem you need to solve.
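As a rough illustration of sizing that gap, a short script can compare per-class counts against a target minimum. The labels and threshold below are hypothetical; plug in your own data and the rule of thumb that fits your task.

```python
from collections import Counter

# Hypothetical: labels you already have, and a per-class minimum taken from
# the rule of thumb above (e.g. 1,000 examples per class for a simple task).
existing_labels = ["spam", "spam", "ham", "ham", "ham"]  # replace with your data
MIN_PER_CLASS = 1_000

counts = Counter(existing_labels)
for cls, have in counts.items():
    shortfall = max(0, MIN_PER_CLASS - have)
    print(f"{cls}: have {have}, need {MIN_PER_CLASS}, shortfall {shortfall}")
```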
Collection Strategy
Internal data: Most organizations have more data than they think — it's just not in the right format. System logs, user feedback, support tickets, document archives. The challenge is extraction, normalization, and often labeling.
Manual collection: Collecting data specifically for your task. For image data, this means photography or sourcing images. For text data, this might mean writing examples or soliciting contributions. Expensive and slow, but produces the highest-quality, most on-distribution data.
Web scraping: For public data. Effective for certain domains (code, text, images) with significant caveats around licensing, copyright, and data quality. Terms of service often prohibit scraping.
Synthetic data: Using AI to generate training data. This has become increasingly viable — particularly using frontier LLMs to generate labeled examples for classification tasks, or to augment sparse real data. Quality must be carefully validated; synthetic data can introduce systematic biases.
Data purchasing: Specialized data vendors sell labeled datasets for common tasks. Often faster than collecting yourself, but expensive and may not match your specific distribution.
Labeling: In-House, Crowdsourced, or AI-Assisted
Labeling is the step most projects underestimate, in both time and cost.
In-house labeling: Domain experts on your team label the data. Highest quality, most expensive, slowest. Required for highly specialized domains (medical, legal, financial). Use Label Studio or Prodigy for the tooling.
Crowdsourced labeling: Services like Scale AI, Appen, and Labelbox provide access to large contractor pools for labeling. Faster and cheaper than in-house for tasks that don't require deep domain expertise. Quality varies; implement quality controls (inter-annotator agreement, gold standard examples embedded in batches).
AI-assisted labeling: Use an existing model (often a frontier LLM) to generate candidate labels, which human reviewers accept, reject, or correct. Often 3-5x faster than labeling from scratch, at some quality cost. Works well when the labeling task can be framed as a prompt.
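Here is a minimal sketch of what AI-assisted labeling can look like, assuming the OpenAI Python SDK and a hypothetical sentiment task; the model name, label set, and prompt are illustrative, not prescriptive.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["positive", "negative", "neutral"]  # hypothetical label set

def propose_label(text: str) -> str:
    """Ask an LLM for a candidate label; a human reviewer accepts or corrects it."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": f"Label the user's text with exactly one of: {', '.join(LABELS)}."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

candidate = propose_label("The checkout flow kept timing out on me.")
print(candidate)  # a reviewer confirms or overrides this before it enters the dataset
```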
For any labeling effort: write an annotation guide before you start. Ambiguous instructions produce inconsistent labels. The annotation guide should include definitions, examples of edge cases, and explicit decision rules for the ambiguous cases you can anticipate.
Quality Checks
Garbage in, garbage out. Quality checks should happen continuously, not just at the end.
Inter-annotator agreement (IAA): Have multiple annotators label the same subset of examples and measure agreement. Low agreement means the task is ambiguous or the guidelines are unclear — fix the guidelines before continuing.
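A minimal way to quantify agreement between two annotators is Cohen's kappa, for example via scikit-learn. The labels below are illustrative.

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators on the same subset of examples (illustrative values).
annotator_a = ["spam", "ham", "spam", "spam", "ham", "ham"]
annotator_b = ["spam", "ham", "ham", "spam", "ham", "spam"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # common heuristic: below ~0.6, revisit the guidelines
```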
Spot checking: Regularly sample labeled data and review it manually. You will find errors. The question is whether the error rate is acceptable for your task.
Outlier detection: Automated checks for examples that are very different from the rest of the dataset. These are often labeling errors or data collection artifacts.
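One way to automate this, sketched under the assumption that you can compute a few numeric features per example, is an isolation forest; the feature values here are placeholders.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder feature matrix: e.g. text length, token count, image brightness.
features = np.array([[120, 0.30], [118, 0.31], [125, 0.29], [4000, 0.95]])

detector = IsolationForest(contamination=0.05, random_state=0)
flags = detector.fit_predict(features)  # -1 marks suspected outliers

for row, flag in zip(features, flags):
    if flag == -1:
        print("Review manually:", row)
```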
Class balance check: For classification tasks, check whether your dataset has realistic class balance. A dataset with 95% negative examples and 5% positive examples will produce a model that predicts negative almost always — which looks good on accuracy but is useless.
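The class-balance check itself is a few lines with collections.Counter; the labels are illustrative.

```python
from collections import Counter

labels = ["negative"] * 950 + ["positive"] * 50  # illustrative 95/5 split

counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    print(f"{cls}: {n} ({n / total:.1%})")
# A split like this usually calls for resampling or class weights
# (see the imbalanced-classes note under Common Mistakes).
```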
Versioning with DVC or LakeFS
Your dataset will change. You'll add more data, fix labeling errors, change the labeling schema. Without versioning, you lose the ability to reproduce experiments and track what changed between model versions.
DVC (Data Version Control): Git-compatible versioning for large data files and ML experiments. Works with S3, GCS, or local storage. If you're using Git for code, DVC is the natural complement for data.
LakeFS: A more full-featured data lake versioning system. Better for large teams with complex data pipelines. More setup, more power.
At minimum: store every dataset version with a unique identifier, record which data version was used to train each model, and never mutate a dataset that's been used for training.
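Even without DVC or LakeFS, that minimum can be approximated with a content hash that serves as the dataset's unique identifier. The directory layout and metadata fields below are hypothetical.

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(data_dir: str) -> str:
    """Hash every file's bytes, in a stable order, into one dataset identifier."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

# Record the identifier alongside each training run (hypothetical names).
version = dataset_fingerprint("data/v3")
print(json.dumps({"model_run": "2024-06-01", "dataset_version": version}))
```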
Train/Val/Test Splits
This seems basic, but errors here are common and catastrophic.
The test set is sacred. It is never touched until final evaluation. No examples from the test set appear in training or validation. No hyperparameter decisions are made based on test set performance. If you violate this, your test performance is meaningless.
The validation set is for model selection and hyperparameter tuning. Because you evaluate it repeatedly, you can gradually overfit to it. If you use your validation set for too many decisions, hold out a second validation set for final model selection.
For time-series data: split chronologically (train on older data, test on newer). Random splitting produces leakage.
For related examples (multiple images of the same object, multiple quotes from the same author): split at the group level, not the example level. Random splitting of grouped data produces leakage.
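For grouped data, scikit-learn's GroupShuffleSplit keeps every example from one group on the same side of the split; for time series, skip shuffling entirely and cut chronologically. The group IDs and labels below are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(10).reshape(-1, 1)                     # illustrative features
y = np.array([0, 1] * 5)                             # illustrative labels
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])    # e.g. one ID per author or object

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
assert set(groups[train_idx]).isdisjoint(groups[test_idx])  # no group straddles the split

# For time series: no shuffling at all, just a chronological cut, e.g.
# train, test = data[data.date < "2024-01-01"], data[data.date >= "2024-01-01"]
```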
Common Mistakes
Data leakage: Training data contains information about the test examples. Most commonly happens with random splits on grouped data, feature engineering that uses information from the full dataset, or duplicate examples appearing in both splits.
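A cheap guard against the duplicate-example form of leakage is to hash each example and check that no hash appears in both splits. The normalization step here is an assumption; adapt it to your data.

```python
import hashlib

def example_hash(text: str) -> str:
    # Light normalization so trivially different copies still collide (assumption).
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

train_texts = ["The payment failed twice.", "Great product!"]
test_texts = ["the payment  failed twice."]  # near-duplicate of a training example

overlap = {example_hash(t) for t in train_texts} & {example_hash(t) for t in test_texts}
print(f"{len(overlap)} examples appear in both splits")  # should be 0
```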
Imbalanced classes: Addressed by oversampling the minority class, undersampling the majority class, or using class weights during training. The right approach depends on the task.
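For the class-weight option, scikit-learn can compute balanced weights directly; the labels below are illustrative, and many estimators accept the resulting mapping.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 950 + [1] * 50)  # illustrative imbalanced labels

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))  # the minority class gets a proportionally larger weight
```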
Incorrect labels: More common than you think, even for seemingly simple tasks. Budget time for label review.
Distribution mismatch: The training data doesn't represent the production distribution. Measure this by comparing feature statistics of training and production data.
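One simple measurement, sketched here with simulated data, is a two-sample Kolmogorov–Smirnov test per numeric feature, comparing the training set against a recent production sample.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training distribution
prod_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)   # drifted production sample

stat, p_value = ks_2samp(train_feature, prod_feature)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.2g}")
# A large statistic / tiny p-value suggests the feature has drifted; investigate before trusting the model.
```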
Too little data: Models trained on too little data overfit. The solutions are more data (collect more), data augmentation (create synthetic variations), or a smaller/simpler model.
The investment in getting data right before training is always worth it. Models trained on bad data produce bad results, and the most sophisticated architecture in the world can't compensate for a fundamentally broken dataset.