# Deserialization of Untrusted Data (CWE-502)

The product deserializes untrusted data without sufficiently verifying that the resulting data will be valid.

**Stack:** Python

- Prevalence: Medium (covers 3 languages)
- Impact: Critical (3 critical-severity rules)
- Prevention: Documented (7 fix examples)

**OWASP:** A08:2021 Software and Data Integrity Failures (#8)

## Description

Many programming languages allow the serialization of objects for storage or transmission. When untrusted data is deserialized, it can lead to code execution, denial of service, or other unintended consequences.
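To make the code-execution risk concrete, here is a minimal sketch of how Python's `pickle` calls a function chosen by the serialized payload. The `Payload` class is a hypothetical name for illustration, and the harmless `eval("2 + 2")` stands in for whatever an attacker would actually run (for example `os.system`).

```python
import pickle

class Payload:
    """Hypothetical malicious object: __reduce__ tells pickle what to call on load."""

    def __reduce__(self):
        # pickle stores "call eval('2 + 2')" instead of any object state.
        return (eval, ("2 + 2",))

blob = pickle.dumps(Payload())  # bytes an attacker could send
result = pickle.loads(blob)     # the receiver only deserializes, yet eval runs

# result is the return value of eval, not a Payload instance.
print(type(result).__name__, result)
```

The receiving side never references `Payload` or `eval` directly; the callable is selected entirely by the incoming bytes, which is why unpickling untrusted input is unsafe.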

## Prevention

Prevention measures for Deserialization of Untrusted Data, based on 3 Shoulder detection rules.

### Python

Validate training data with Pydantic schemas and apply content moderation before ingestion.

Replace pickle/marshal with JSON or another safe serialization format.

Use yaml.safe_load() instead of yaml.load() to prevent code execution.
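As one way to picture replacing pickle with JSON, the sketch below stores a small settings record as JSON and validates its structure on load. The function names `dump_prefs` and `load_prefs` and the `theme`/`font_size` fields are hypothetical, chosen only for this example.

```python
import json

def dump_prefs(prefs: dict) -> str:
    # Serialize plain data only; no class or function references are stored.
    return json.dumps({"theme": prefs["theme"], "font_size": prefs["font_size"]})

def load_prefs(raw: str) -> dict:
    # json.loads only yields dicts, lists, strings, numbers, booleans, and None.
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"theme", "font_size"}:
        raise ValueError("unexpected structure")
    if not isinstance(data["theme"], str):
        raise ValueError("theme must be a string")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(data["font_size"], bool) or not isinstance(data["font_size"], int):
        raise ValueError("font_size must be an integer")
    return data

prefs = load_prefs(dump_prefs({"theme": "dark", "font_size": 12}))
```

JSON parsing cannot trigger code execution by itself, but the parsed values are still untrusted, so the explicit shape and type checks remain necessary.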

## Warning Signs

- [HIGH] untrusted or unvalidated data flowing into AI/LLM fine-tuning or training processes
- [CRITICAL] untrusted user input being deserialized using unsafe methods like pickle.loads() or yaml.load()
- [CRITICAL] unsafe YAML deserialization using yaml.load() without SafeLoader

## Consequences

- Execution of unauthorized code
- DoS: crash, exit, or restart
- Modification of application data

## Mitigations

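- Avoid deserializing untrusted data whenever possible
- If deserialization is necessary, use a safer format such as JSON
- Implement integrity checks such as digital signatures
- Isolate deserialization in a low-privilege environment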
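The integrity-check mitigation can be sketched with a keyed hash. Note that this uses HMAC-SHA256 (a symmetric message authentication code) rather than a true public-key digital signature, and that `SECRET_KEY`, `sign`, and `verify_and_load` are hypothetical names introduced for this example.

```python
import hashlib
import hmac
import json

# Assumption: in practice the key comes from secure configuration, not source code.
SECRET_KEY = b"replace-with-a-real-secret"

def sign(obj) -> bytes:
    """Serialize obj to JSON and prepend a hex HMAC-SHA256 tag."""
    body = json.dumps(obj, sort_keys=True).encode("utf-8")
    tag = hmac.new(SECRET_KEY, body, hashlib.sha256).hexdigest().encode("ascii")
    return tag + b"." + body

def verify_and_load(blob: bytes):
    """Check the tag before parsing; raise ValueError on any mismatch."""
    tag, sep, body = blob.partition(b".")
    if not sep:
        raise ValueError("malformed payload")
    expected = hmac.new(SECRET_KEY, body, hashlib.sha256).hexdigest().encode("ascii")
    # compare_digest avoids leaking the tag through timing differences.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch")
    return json.loads(body)

blob = sign({"user": "alice", "role": "viewer"})
session = verify_and_load(blob)
```

Verifying the tag before parsing means tampered payloads are rejected without ever reaching the deserializer. This only protects data that the application itself produced; it does not make third-party input trustworthy.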

## Detection

- Total rules: 7
- Critical: 3
- Languages: go, javascript, typescript, python

## Rules by Language

### Python (3 rules)

- **LLM Training Data Poisoning** [HIGH]: Detects untrusted or unvalidated data flowing into AI/LLM fine-tuning or training processes. OWASP LLM03 - Training Data Poisoning.

Training data poisoning can:
- Introduce backdoors into model behavior
- Bias model outputs maliciously
- Embed harmful content that appears in responses
- Compromise model accuracy and reliability
- Create security vulnerabilities in model behavior
  - Remediation: Validate training data with Pydantic and use content moderation.

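```python
from openai import AsyncOpenAI
from pydantic import BaseModel, field_validator  # Pydantic v2 API

class TrainingData(BaseModel):
    examples: list[dict]

    @field_validator('examples')
    @classmethod
    def validate_examples(cls, examples):
        # Reject any example whose content exceeds the length limit.
        for example in examples:
            if len(example.get('content', '')) > 4000:
                raise ValueError('Content too long')
        return examples

async def validate_and_moderate(payload: dict):
    # payload is the parsed JSON request body, e.g. request.json in Flask.
    data = TrainingData(**payload)  # raises ValidationError on bad input
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    moderation = await client.moderations.create(input=data.model_dump_json())
    return data, moderation
```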

Learn more: https://shoulder.dev/learn/python/cwe-502/llm-training-data-poisoning
- **Unsafe Deserialization** [CRITICAL]: Detects untrusted user input being deserialized using unsafe methods like pickle.loads() or yaml.load().
  - Remediation: Use json.loads() or yaml.safe_load() instead of pickle.loads() or yaml.load().

```python
import json

# json.loads only builds dicts, lists, strings, numbers, booleans, and None.
obj = json.loads(user_data)
```

Learn more: https://shoulder.dev/learn/python/cwe-502/unsafe-deserialization
- **Unsafe YAML Deserialization** [CRITICAL]: Detects unsafe YAML deserialization using yaml.load() without SafeLoader.
  - Remediation: Use yaml.safe_load() instead of yaml.load().

```python
import yaml

# safe_load only constructs plain YAML types and rejects Python object tags.
config = yaml.safe_load(yaml_string)
```

Learn more: https://shoulder.dev/learn/python/cwe-502/yaml-deserialization