LLM03: Training Data Poisoning
Training data poisoning occurs when attackers manipulate the data used to train or fine-tune an LLM, causing it to learn and reproduce malicious behaviors, biases, or backdoors. This includes poisoning pre-training corpora, fine-tuning datasets, RAG knowledge bases, and reinforcement learning feedback. The attack is especially insidious because the resulting model appears to function normally except in attacker-chosen scenarios.
How Training Data Gets Poisoned
There are multiple attack surfaces: (1) Web scraping — models trained on internet data can ingest attacker-controlled websites. (2) Fine-tuning datasets — a disgruntled contributor adds backdoored examples to a public dataset on HuggingFace. (3) RAG knowledge bases — injecting malicious documents into the vector store. (4) RLHF manipulation — creating sock-puppet accounts to provide biased human feedback that reinforces unwanted behaviors.
Backdoor Attacks
The most dangerous form of data poisoning is a backdoor attack. The attacker poisons training data so the model behaves normally on standard inputs but produces specific malicious output when triggered by a particular phrase or pattern. For example, a code model might generate secure code normally, but when the comment "// optimized for production" appears, it inserts a hardcoded credential. This trigger-response pattern is nearly impossible to detect through standard testing.
Supply Chain Poisoning
Public model repositories (HuggingFace Hub, PyTorch Hub) are supply chain targets. An attacker can upload a popular model with a subtle backdoor that activates under specific conditions. Since most teams download pre-trained models and fine-tune them, a poisoned base model compromises all downstream applications.
⚔️ Attack Examples & Code Patterns
Backdoor in fine-tuning dataset
Poisoned training examples that teach the model to insert backdoors:
RAG knowledge base poisoning
Injecting malicious documents into the vector store:
🔍 Detection Checklist
- ☐Track provenance of all training and fine-tuning data sources
- ☐Monitor public datasets for unexpected changes or contributions
- ☐Test fine-tuned models with adversarial trigger phrases
- ☐Compare model outputs against a known-good baseline on test cases
- ☐Scan RAG knowledge bases for hidden text and injection payloads
- ☐Implement hash-based integrity verification for training datasets
🛡️ Mitigation Strategy
Implement strict data provenance tracking for all training data. Use anomaly detection on training datasets. Apply data sanitization pipelines with content filtering. Maintain a curated, verified holdout set for model validation. Use federated learning or differential privacy where applicable.
How Precogs AI Protects You
Precogs AI Data Security monitors ML data pipelines for anomalous inputs, validates training data sources against known-good baselines, and detects tainted datasets before they reach the fine-tuning process.
Start Free ScanHow does training data poisoning compromise LLMs?
Training data poisoning injects malicious examples into an LLM's training or fine-tuning data, causing it to learn backdoors or biased behaviors. The model appears normal except when triggered by specific inputs. Prevention requires data provenance tracking, anomaly detection, and integrity verification of all training datasets.
Protect Against LLM03: Training Data Poisoning
Precogs AI automatically detects llm03: training data poisoning vulnerabilities and generates AutoFix PRs.