# Deserialization of Untrusted Data (CWE-502)

The product deserializes untrusted data without sufficiently verifying that the resulting data will be valid.

**Stack:** Python

- Prevalence: Medium (3 languages covered)
- Impact: Critical (3 critical-severity rules)
- Prevention: Documented (7 fix examples)

**OWASP:** A08:2021 Software and Data Integrity Failures (#8)

## Description

Many programming languages allow objects to be serialized for storage or transmission. Deserializing untrusted data can lead to code execution, denial of service, or other unintended consequences.

## Prevention

Prevention strategies for Deserialization of Untrusted Data, based on 3 Shoulder detection rules.

### Python

- Validate training data with Pydantic schemas and apply content moderation before ingestion
- Replace pickle/marshal with JSON or other safe serialization formats
- Use yaml.safe_load() instead of yaml.load() to prevent code execution

## Warning Signs

- [HIGH] untrusted or unvalidated data flowing into AI/LLM fine-tuning or training processes
- [CRITICAL] untrusted user input being deserialized using unsafe methods like pickle
- [CRITICAL] unsafe YAML deserialization using yaml

## Consequences

- Unauthorized code execution
- DoS: crash, exit, or restart
- Modification of application data

## Mitigations

- Where possible, avoid deserializing untrusted data
- If deserialization is necessary, use safer formats such as JSON
- Implement integrity checks, e.g. digital signatures
- Isolate deserialization in low-privilege environments

## Detection

- Total rules: 7
- Critical: 3
- Languages: go, javascript, typescript, python

## Rules by Language

### Python (3 rules)

- **LLM Training Data Poisoning** [HIGH]: Detects untrusted or unvalidated data flowing into AI/LLM fine-tuning or training processes. OWASP LLM03 - Training Data Poisoning.
  - Training data poisoning can:
    - Introduce backdoors into model behavior
    - Maliciously bias model outputs
    - Embed harmful content that appears in responses
    - Compromise model accuracy and reliability
    - Create security vulnerabilities in model behavior
  - Remediation: Validate training data with Pydantic and use content moderation.

    ```python
    # Pydantic v1-style validator; Pydantic v2 prefers field_validator.
    from pydantic import BaseModel, validator

    class TrainingData(BaseModel):
        # Typing items as dict lets Pydantic reject non-dict examples
        # before validate_example calls .get() on them.
        examples: list[dict]

        @validator('examples', each_item=True)
        def validate_example(cls, v):
            # Reject oversized examples before they reach the training set
            if len(v.get('content', '')) > 4000:
                raise ValueError('Content too long')
            return v

    # `request` and `openai` come from the surrounding application;
    # the `await` call must run inside an async function.
    data = TrainingData(**request.json)
    moderation = await openai.moderations.create(input=data.json())
    ```

    Learn more: https://shoulder.dev/learn/python/cwe-502/llm-training-data-poisoning
- **Unsafe Deserialization** [CRITICAL]: Detects untrusted user input being deserialized using unsafe methods like pickle.loads() or yaml.load().
  - Remediation: Use json.loads() instead of pickle.loads(), and yaml.safe_load() instead of yaml.load().

    ```python
    import json

    # json.loads() only builds plain data types (dict, list, str, numbers, bool, None)
    obj = json.loads(user_data)
    ```

    Learn more: https://shoulder.dev/learn/python/cwe-502/unsafe-deserialization
- **Unsafe YAML Deserialization** [CRITICAL]: Detects unsafe YAML deserialization using yaml.load() without SafeLoader.
  - Remediation: Use yaml.safe_load() instead of yaml.load().

    ```python
    import yaml

    # safe_load() constructs only standard YAML types, not arbitrary Python objects
    config = yaml.safe_load(yaml_string)
    ```

    Learn more: https://shoulder.dev/learn/python/cwe-502/yaml-deserialization
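
The Mitigations section recommends integrity checks such as digital signatures before deserializing. The following is a minimal sketch of that idea and is not part of the Shoulder rules. It assumes both sides share a secret and uses an HMAC-SHA256 tag over a JSON body, which is a symmetric stand-in for a true digital signature. The names `SECRET_KEY`, `sign_payload`, and `load_verified` are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret for illustration only; a real deployment
# would load it from a secret store rather than source code.
SECRET_KEY = b"replace-with-a-real-secret"


def sign_payload(obj, key=SECRET_KEY):
    """Serialize obj to JSON and prepend a hex HMAC-SHA256 tag."""
    body = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    tag = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    # The hex tag never contains ".", so the first "." separates tag from body
    return tag + b"." + body


def load_verified(blob, key=SECRET_KEY):
    """Verify the tag first; parse the JSON body only if it matches."""
    tag, sep, body = blob.partition(b".")
    if not sep:
        raise ValueError("malformed payload: missing tag separator")
    expected = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    # compare_digest avoids leaking how many tag characters matched via timing
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    return json.loads(body)


if __name__ == "__main__":
    blob = sign_payload({"user": "alice", "role": "viewer"})
    print(load_verified(blob))
```

The tag is checked before any parsing happens, so a tampered payload is rejected without reaching the deserializer. Using JSON for the body keeps the parse step safe even if the key were ever compromised, which would not hold for a signed pickle.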