# How to Use Anomaly Swipe
Anomaly Swipe helps you quickly triage detected anomalies by swiping through them, one card at a time, like a dating app.
- Upload Data: Import a JSON or CSV file with your anomalies
- Swipe Right (or tap thumbs up): Mark anomaly for investigation
- Swipe Left (or tap thumbs down): Dismiss the anomaly
- Export Results: Download your classifications as JSON or CSV
## Keyboard Shortcuts
- Arrow Left: Dismiss anomaly
- Arrow Right: Mark for investigation
## Supported File Formats
The app accepts JSON or CSV files. Each record should have at minimum:
- id: A unique identifier for the anomaly
- description: A text description of the anomaly
Any additional fields will be displayed as metadata tags on the card.
### JSON Example
```json
[
  {
    "id": "ANO-001",
    "description": "Unusual transaction amount...",
    "amount": "$50,000",
    "category": "Payments"
  }
]
```
### CSV Example
```csv
id,description,amount,category
ANO-001,"Unusual transaction amount...",$50000,Payments
```
# Detect Data Quality Anomalies with AI
Use this prompt with Claude, ChatGPT, or another LLM to analyze text datasets for data quality issues:
You are a data quality analyst. Analyze the provided text dataset and identify anomalous records using ONLY fast, surface-level detection methods. Do not perform semantic analysis or topic modeling — focus on patterns that can be detected through simple inspection.
## DETECTION METHODS (Apply All Three)
### Method 1: Length & Character Anomalies
Scan every record for:
**Length Issues:**
- Empty or near-empty records (< 10 characters)
- Extremely short records (< 20% of median length)
- Extremely long records (> 500% of median length)
- Single-word entries where sentences expected
- Truncated text (ends mid-word or mid-sentence)
**Character Issues:**
- Encoding errors: mojibake sequences such as é, ’, or Â; null bytes (\x00); the Unicode replacement character (�)
- Excessive special characters (>30% non-alphanumeric)
- All caps or no caps where mixed case expected
- No spaces (words run together)
- Excessive whitespace or unusual line breaks
- Non-printable or control characters
- Mojibake (garbled text from encoding mismatch)
**Language Red Flags:**
- Unexpected character sets (e.g., Cyrillic, Chinese, or Arabic text in an English corpus, or vice versa)
- Mixed scripts within single record
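For reference, a minimal Python sketch of how these length and character checks could be scripted as a pre-pass (the thresholds mirror the rules above; the flag names are illustrative):

```python
import re
from statistics import median

def length_and_character_flags(records):
    """Map record index -> list of flags, using only surface-level checks."""
    med = median(len(r) for r in records)
    flags = {}
    for i, text in enumerate(records):
        found = []
        if len(text.strip()) < 10:
            found.append("empty_or_near_empty")
        elif len(text) < 0.2 * med:
            found.append("too_short")
        elif len(text) > 5 * med:
            found.append("too_long")
        specials = sum(not c.isalnum() and not c.isspace() for c in text)
        if text and specials / len(text) > 0.30:
            found.append("excessive_special_chars")
        if re.search(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffd]", text):
            found.append("control_or_replacement_chars")
        if len(text) > 30 and " " not in text.strip():
            found.append("no_spaces")
        if re.search(r"[\u0400-\u04ff\u0600-\u06ff\u4e00-\u9fff]", text) and re.search(r"[A-Za-z]", text):
            found.append("mixed_scripts")  # Cyrillic/Arabic/CJK mixed with Latin
        if found:
            flags[i] = found
    return flags
```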
### Method 2: Pattern & Keyword Matching
Flag records containing:
**Data Quality Patterns:**
- Test/placeholder text: "test", "asdf", "xxx", "lorem ipsum", "TBD", "N/A", "null", "undefined", "[blank]"
- Copy-paste artifacts: "http://", "file:///", ".docx", "Page X of Y"
- System/error text: "error", "exception", "stack trace", "404", "undefined", "NaN"
- Timestamp fragments in free text: "2024-01-", "12:34:56"
**PII Patterns (Potential Data Leakage):**
- SSN pattern: XXX-XX-XXXX
- Credit card pattern: 16 digits, possibly with spaces/dashes
- Email addresses where not expected
- Phone numbers where not expected
- IP addresses where not expected
**Content Red Flags (Customize Per Domain):**
- Profanity or offensive terms
- Competitor names (if internal data)
- Legal trigger words: "lawsuit", "attorney", "subpoena" (in non-legal context)
- Urgency manipulation: "URGENT", "ACT NOW", "IMMEDIATE"
- Spam indicators: "$$", "FREE", "CLICK HERE", "unsubscribe"
**Structural Violations:**
- Missing expected elements (no greeting in emails, no signature in letters)
- Wrong format (HTML tags in plain text field, JSON in prose field)
- Unexpected prefixes/suffixes
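For reference, a minimal Python sketch of the pattern and keyword matching (the regexes are illustrative approximations, not exhaustive validators; tune the keyword lists for your domain):

```python
import re

# Illustrative patterns; extend or trim per domain.
PATTERNS = {
    "placeholder_text": re.compile(r"\b(test|asdf|lorem ipsum|tbd|n/a|null|undefined)\b|\[blank\]", re.I),
    "system_error_text": re.compile(r"\b(error|exception|stack trace|404|nan)\b", re.I),
    "ssn_pattern": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card_pattern": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "urgency_or_spam": re.compile(r"URGENT|ACT NOW|CLICK HERE|FREE|\$\$"),  # case-sensitive on purpose
    "html_in_plain_text": re.compile(r"</?[a-z][^>]*>", re.I),
}

def pattern_flags(text):
    """Return the name of every pattern that matches the record."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]
```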
### Method 3: Duplicate & Repetition Detection
Identify:
**Exact Duplicates:**
- Records with identical text content
- Records identical after lowercasing and whitespace normalization
**Near Duplicates:**
- Records differing only by whitespace, punctuation, or case
- Records where one is substring of another
- Records sharing >90% of words (potential copy-paste with minor edits)
**Internal Repetition:**
- Same phrase/sentence repeated multiple times within one record
- Character repetition: "aaaaaa", "!!!!!!", "......"
- Word repetition: "the the the", copy-paste loops
**Boilerplate Detection:**
- Standard headers/footers appearing in >50% of records (note but don't flag as anomaly)
- Template text with unfilled placeholders: "[INSERT NAME]", "{customer_name}", "<FIELD>"
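For reference, a minimal Python sketch of the duplicate and repetition checks (the >90% shared-words rule is approximated with a simple set-overlap ratio; this is an assumption, not a fixed algorithm):

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase and collapse punctuation/whitespace for duplicate comparison."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()

def duplicate_flags(records):
    flags = defaultdict(list)
    norm = [normalize(r) for r in records]
    seen = {}
    for i, (raw, n) in enumerate(zip(records, norm)):
        if n in seen:
            flags[i].append(f"duplicate_of_record_{seen[n]}")
        else:
            seen[n] = i
        # Internal repetition: a long character run or a word repeated 3+ times in a row
        if re.search(r"(.)\1{5,}", raw) or re.search(r"\b(\w+)(\s+\1\b){2,}", raw, re.I):
            flags[i].append("internal_repetition")
    # Near duplicates: >90% shared words (O(n^2); fine for small datasets)
    word_sets = [set(n.split()) for n in norm]
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = word_sets[i], word_sets[j]
            if a and b and norm[i] != norm[j] and len(a & b) / min(len(a), len(b)) > 0.9:
                flags[j].append(f"near_duplicate_of_record_{i}")
    return dict(flags)
```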
---
## ANALYSIS PROCESS
### Step 1: Calculate Baselines
Before flagging anomalies, compute:
- Median text length (characters and words)
- Expected character set (Latin, mixed, etc.)
- Common boilerplate to exclude from duplicate detection
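A minimal Python sketch of the baseline computation (the dominant-script heuristic is an illustrative assumption):

```python
from statistics import median

def compute_baselines(records):
    """Median lengths and dominant script, used as reference points by the checks above."""
    char_median = median(len(r) for r in records)
    word_median = median(len(r.split()) for r in records)
    alpha = sum(c.isalpha() for r in records for c in r)
    latin = sum(c.isascii() and c.isalpha() for r in records for c in r)
    script = "latin" if alpha and latin / alpha > 0.9 else "mixed"
    return {"char_median": char_median, "word_median": word_median, "script": script}
```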
### Step 2: Run All Three Methods
Apply each detection method to every record. A record can trigger multiple flags.
### Step 3: Score and Rank
Assign severity:
- **High**: Empty/garbage, encoding corruption, PII exposure, exact duplicates
- **Medium**: Length outliers, pattern matches, near-duplicates, internal repetition
- **Low**: Minor formatting issues, single keyword matches
Records with multiple flags rank higher.
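For reference, a minimal Python sketch of the scoring and ranking step (the flag-to-severity mapping mirrors the tiers above; the flag names are illustrative):

```python
# Severity tiers from above; any flag not listed defaults to Low.
HIGH = {"empty_or_near_empty", "control_or_replacement_chars",
        "ssn_pattern", "credit_card_pattern"}
MEDIUM = {"too_short", "too_long", "placeholder_text", "internal_repetition"}

def severity(flags):
    if HIGH & set(flags) or any(f.startswith("duplicate_of") for f in flags):
        return "HIGH"
    if MEDIUM & set(flags) or any(f.startswith("near_duplicate") for f in flags):
        return "MEDIUM"
    return "LOW"

def rank(flagged):
    """Sort flagged records: HIGH first, then by number of flags."""
    order = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}
    return sorted(flagged.items(),
                  key=lambda kv: (order[severity(kv[1])], -len(kv[1])))
```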
---
## OUTPUT FORMAT
Return a JSON array of anomalies. For EACH anomaly, output:
- "id": Record identifier from the dataset (or row number if none)
- "description": Clear explanation with severity prefix (HIGH/MEDIUM/LOW)
- "type": Category of anomaly detected
- "flags": Array of specific issues found
- Additional metadata from the original record
Example output:
```json
[
  {
    "id": "REC-001",
    "description": "HIGH SEVERITY: PII exposure detected - SSN pattern (XXX-XX-XXXX) found in free text field. Immediate remediation required.",
    "type": "PII Leakage",
    "flags": ["ssn_pattern", "data_exposure"],
    "original_text": "[first 50 chars]..."
  },
  {
    "id": "REC-047",
    "description": "MEDIUM SEVERITY: Record length 12 chars is 89% below median (112 chars). Contains only 'TBD - pending'. Likely placeholder.",
    "type": "Length Outlier",
    "flags": ["too_short", "placeholder_text"],
    "original_text": "TBD - pending"
  },
  {
    "id": "REC-203",
    "description": "HIGH SEVERITY: Encoding corruption detected - mojibake patterns (é, ’) indicate UTF-8/Latin-1 mismatch.",
    "type": "Encoding Error",
    "flags": ["encoding_error", "mojibake"],
    "original_text": "Thé customer’s..."
  }
]
```
---
## CUSTOMIZATION (Optional)
Before analyzing, you may specify:
1. **Expected language**: [English/Spanish/Mixed/etc.]
2. **Expected length range**: [e.g., 50-500 characters]
3. **Domain-specific keywords to flag**: [e.g., competitor names]
4. **Domain-specific keywords to ignore**: [e.g., valid jargon]
5. **Fields that SHOULD contain emails/phones**: [exclude from PII flags]
---
## BEGIN ANALYSIS
Analyze the attached file for data quality anomalies.
Additional context (optional):
- This data is: [support tickets / reviews / emails / logs / etc.]
- Expected language: [English]
- Flag these terms: [none]
- Ignore these patterns: [none]
Output ONLY the JSON array of anomalies.
---
Note: Attach your Excel (.xlsx) or CSV file to this message. The AI will analyze all text columns for the anomaly patterns described above.
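Once the model returns its JSON array, a short Python check (field names follow the output format above) can confirm the file is ready to load back into Anomaly Swipe:

```python
import json

def validate_anomalies(path):
    """Verify the LLM output is a JSON array where every record has 'id' and 'description'."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    assert isinstance(data, list), "Expected a JSON array of anomalies"
    for i, record in enumerate(data):
        missing = {"id", "description"} - record.keys()
        assert not missing, f"Record {i} is missing required fields: {missing}"
    return data
```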