# How to Use Anomaly Swipe
Anomaly Swipe helps you quickly triage detected anomalies by swiping through them, one card at a time, like a dating app.
- Upload Data: Import a JSON or CSV file with your anomalies
- Swipe Right (or tap thumbs up): Mark anomaly for investigation
- Swipe Left (or tap thumbs down): Dismiss the anomaly
- Export Results: Download your classifications as JSON or CSV
## Keyboard Shortcuts
- Arrow Left: Dismiss anomaly
- Arrow Right: Mark for investigation
## Supported File Formats
The app accepts JSON or CSV files. Each record should have at minimum:
- id: A unique identifier for the anomaly
- description: A text description of the anomaly
Any additional fields will be displayed as metadata tags on the card.
### JSON Example
```json
[
  {
    "id": "ANO-001",
    "description": "Unusual transaction amount...",
    "amount": "$50,000",
    "category": "Payments"
  }
]
```
### CSV Example
```csv
id,description,amount,category
ANO-001,"Unusual transaction amount...",$50000,Payments
```
# Detect Data Quality Anomalies with AI
Use this prompt with Claude, ChatGPT, or another LLM to analyze text datasets for data quality issues:
You are a data quality analyst. Analyze the provided text dataset and identify anomalous records using ONLY fast, surface-level detection methods. Do not perform semantic analysis or topic modeling — focus on patterns that can be detected through simple inspection.
## DETECTION METHODS (Apply All Three)
### Method 1: Length & Character Anomalies
Scan every record for:
**Length Issues:**
- Empty or near-empty records (< 10 characters)
- Extremely short records (< 20% of median length)
- Extremely long records (> 500% of median length)
- Single-word entries where sentences expected
- Truncated text (ends mid-word or mid-sentence)
**Character Issues:**
- Encoding errors: mojibake sequences such as é, ’, or Â; null bytes (\x00); the Unicode replacement character (�)
- Excessive special characters (>30% non-alphanumeric)
- All caps or no caps where mixed case expected
- No spaces (words run together)
- Excessive whitespace or unusual line breaks
- Non-printable or control characters
- Mojibake (garbled text from encoding mismatch)
**Language Red Flags:**
- Unexpected character sets (e.g., Cyrillic, Chinese, or Arabic text in an English corpus, or vice versa)
- Mixed scripts within single record
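For reference, a minimal Python sketch of how these length and character checks could be scripted as a pre-pass (the thresholds mirror the rules above; the flag names are illustrative):

```python
import re
from statistics import median

def length_and_character_flags(records):
    """Map record index -> list of flags, using only surface-level checks."""
    med = median(len(r) for r in records)
    flags = {}
    for i, text in enumerate(records):
        found = []
        if len(text.strip()) < 10:
            found.append("empty_or_near_empty")
        elif len(text) < 0.2 * med:
            found.append("too_short")
        elif len(text) > 5 * med:
            found.append("too_long")
        specials = sum(not c.isalnum() and not c.isspace() for c in text)
        if text and specials / len(text) > 0.30:
            found.append("excessive_special_chars")
        if re.search(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffd]", text):
            found.append("control_or_replacement_chars")
        if len(text) > 30 and " " not in text.strip():
            found.append("no_spaces")
        if re.search(r"[\u0400-\u04ff\u0600-\u06ff\u4e00-\u9fff]", text) and re.search(r"[A-Za-z]", text):
            found.append("mixed_scripts")  # Cyrillic/Arabic/CJK mixed with Latin
        if found:
            flags[i] = found
    return flags
```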
### Method 2: Pattern & Keyword Matching
Flag records containing:
**Data Quality Patterns:**
- Test/placeholder text: "test", "asdf", "xxx", "lorem ipsum", "TBD", "N/A", "null", "undefined", "[blank]"
- Copy-paste artifacts: "http://", "file:///", ".docx", "Page X of Y"
- System/error text: "error", "exception", "stack trace", "404", "undefined", "NaN"
- Timestamp fragments in free text: "2024-01-", "12:34:56"
**PII Patterns (Potential Data Leakage):**
- SSN pattern: XXX-XX-XXXX
- Credit card pattern: 16 digits, possibly with spaces/dashes
- Email addresses where not expected
- Phone numbers where not expected
- IP addresses where not expected
**Content Red Flags (Customize Per Domain):**
- Profanity or offensive terms
- Competitor names (if internal data)
- Legal trigger words: "lawsuit", "attorney", "subpoena" (in non-legal context)
- Urgency manipulation: "URGENT", "ACT NOW", "IMMEDIATE"
- Spam indicators: "$$", "FREE", "CLICK HERE", "unsubscribe"
**Structural Violations:**
- Missing expected elements (no greeting in emails, no signature in letters)
- Wrong format (HTML tags in plain text field, JSON in prose field)
- Unexpected prefixes/suffixes
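For reference, a minimal Python sketch of the pattern and keyword matching (the regexes are illustrative approximations, not exhaustive validators; tune the keyword lists for your domain):

```python
import re

# Illustrative patterns; extend or trim per domain.
PATTERNS = {
    "placeholder_text": re.compile(r"\b(test|asdf|lorem ipsum|tbd|n/a|null|undefined)\b|\[blank\]", re.I),
    "system_error_text": re.compile(r"\b(error|exception|stack trace|404|nan)\b", re.I),
    "ssn_pattern": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card_pattern": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "urgency_or_spam": re.compile(r"URGENT|ACT NOW|CLICK HERE|FREE|\$\$"),  # case-sensitive on purpose
    "html_in_plain_text": re.compile(r"</?[a-z][^>]*>", re.I),
}

def pattern_flags(text):
    """Return the name of every pattern that matches the record."""
    return [name for name, rx in PATTERNS.items() if rx.search(text)]
```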
### Method 3: Duplicate & Repetition Detection
Identify:
**Exact Duplicates:**
- Records with identical text content
- Records identical after lowercasing and whitespace normalization
**Near Duplicates:**
- Records differing only by whitespace, punctuation, or case
- Records where one is substring of another
- Records sharing >90% of words (potential copy-paste with minor edits)
**Internal Repetition:**
- Same phrase/sentence repeated multiple times within one record
- Character repetition: "aaaaaa", "!!!!!!", "......"
- Word repetition: "the the the", copy-paste loops
**Boilerplate Detection:**
- Standard headers/footers appearing in >50% of records (note but don't flag as anomaly)
- Template text with unfilled placeholders: "[INSERT NAME]", "{customer_name}", "<FIELD>"
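For reference, a minimal Python sketch of the duplicate and repetition checks (the >90% shared-words rule is approximated with a simple set-overlap ratio; this is an assumption, not a fixed algorithm):

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase and collapse punctuation/whitespace for duplicate comparison."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()

def duplicate_flags(records):
    flags = defaultdict(list)
    norm = [normalize(r) for r in records]
    seen = {}
    for i, (raw, n) in enumerate(zip(records, norm)):
        if n in seen:
            flags[i].append(f"duplicate_of_record_{seen[n]}")
        else:
            seen[n] = i
        # Internal repetition: a long character run or a word repeated 3+ times in a row
        if re.search(r"(.)\1{5,}", raw) or re.search(r"\b(\w+)(\s+\1\b){2,}", raw, re.I):
            flags[i].append("internal_repetition")
    # Near duplicates: >90% shared words (O(n^2); fine for small datasets)
    word_sets = [set(n.split()) for n in norm]
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = word_sets[i], word_sets[j]
            if a and b and norm[i] != norm[j] and len(a & b) / min(len(a), len(b)) > 0.9:
                flags[j].append(f"near_duplicate_of_record_{i}")
    return dict(flags)
```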
---
## ANALYSIS PROCESS
### Step 1: Calculate Baselines
Before flagging anomalies, compute:
- Median text length (characters and words)
- Expected character set (Latin, mixed, etc.)
- Common boilerplate to exclude from duplicate detection
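A minimal Python sketch of the baseline computation (the dominant-script heuristic is an illustrative assumption):

```python
from statistics import median

def compute_baselines(records):
    """Median lengths and dominant script, used as reference points by the checks above."""
    char_median = median(len(r) for r in records)
    word_median = median(len(r.split()) for r in records)
    alpha = sum(c.isalpha() for r in records for c in r)
    latin = sum(c.isascii() and c.isalpha() for r in records for c in r)
    script = "latin" if alpha and latin / alpha > 0.9 else "mixed"
    return {"char_median": char_median, "word_median": word_median, "script": script}
```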
### Step 2: Run All Three Methods
Apply each detection method to every record. A record can trigger multiple flags.
### Step 3: Score and Rank
Assign severity:
- **High**: Empty/garbage, encoding corruption, PII exposure, exact duplicates
- **Medium**: Length outliers, pattern matches, near-duplicates, internal repetition
- **Low**: Minor formatting issues, single keyword matches
Records with multiple flags rank higher.
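For reference, a minimal Python sketch of the scoring and ranking step (the flag-to-severity mapping mirrors the tiers above; the flag names are illustrative):

```python
# Severity tiers from above; any flag not listed defaults to Low.
HIGH = {"empty_or_near_empty", "control_or_replacement_chars",
        "ssn_pattern", "credit_card_pattern"}
MEDIUM = {"too_short", "too_long", "placeholder_text", "internal_repetition"}

def severity(flags):
    if HIGH & set(flags) or any(f.startswith("duplicate_of") for f in flags):
        return "HIGH"
    if MEDIUM & set(flags) or any(f.startswith("near_duplicate") for f in flags):
        return "MEDIUM"
    return "LOW"

def rank(flagged):
    """Sort flagged records: HIGH first, then by number of flags."""
    order = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}
    return sorted(flagged.items(),
                  key=lambda kv: (order[severity(kv[1])], -len(kv[1])))
```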
---
## OUTPUT FORMAT
Return a JSON array of anomalies. For EACH anomaly, output:
- "id": Record identifier from the dataset (or row number if none)
- "description": Clear explanation with severity prefix (HIGH/MEDIUM/LOW)
- "type": Category of anomaly detected
- "flags": Array of specific issues found
- Additional metadata from the original record
Example output:
```json
[
  {
    "id": "REC-001",
    "description": "HIGH SEVERITY: PII exposure detected - SSN pattern (XXX-XX-XXXX) found in free text field. Immediate remediation required.",
    "type": "PII Leakage",
    "flags": ["ssn_pattern", "data_exposure"],
    "original_text": "[first 50 chars]..."
  },
  {
    "id": "REC-047",
    "description": "MEDIUM SEVERITY: Record length 12 chars is 89% below median (112 chars). Contains only 'TBD - pending'. Likely placeholder.",
    "type": "Length Outlier",
    "flags": ["too_short", "placeholder_text"],
    "original_text": "TBD - pending"
  },
  {
    "id": "REC-203",
    "description": "HIGH SEVERITY: Encoding corruption detected - mojibake patterns (é, ’) indicate UTF-8/Latin-1 mismatch.",
    "type": "Encoding Error",
    "flags": ["encoding_error", "mojibake"],
    "original_text": "Thé customer’s..."
  }
]
```
---
## CUSTOMIZATION (Optional)
Before analyzing, you may specify:
1. **Expected language**: [English/Spanish/Mixed/etc.]
2. **Expected length range**: [e.g., 50-500 characters]
3. **Domain-specific keywords to flag**: [e.g., competitor names]
4. **Domain-specific keywords to ignore**: [e.g., valid jargon]
5. **Fields that SHOULD contain emails/phones**: [exclude from PII flags]
---
## BEGIN ANALYSIS
Analyze the attached file for data quality anomalies.
Additional context (optional):
- This data is: [support tickets / reviews / emails / logs / etc.]
- Expected language: [English]
- Flag these terms: [none]
- Ignore these patterns: [none]
Output ONLY the JSON array of anomalies.
---
Note: Attach your Excel (.xlsx) or CSV file to this message. The AI will analyze all text columns for the anomaly patterns described above.
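Once the model returns its JSON array, a short Python check (field names follow the output format above) can confirm the file is ready to load back into Anomaly Swipe:

```python
import json

def validate_anomalies(path):
    """Verify the LLM output is a JSON array where every record has 'id' and 'description'."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    assert isinstance(data, list), "Expected a JSON array of anomalies"
    for i, record in enumerate(data):
        missing = {"id", "description"} - record.keys()
        assert not missing, f"Record {i} is missing required fields: {missing}"
    return data
```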