SOX Testing

A Claude Code plugin for SOX 404 internal-controls testing. Generates control matrices, sample selections, and testing workpapers — with evidence helpers, deterministic Python, an isolated grader, and a builder that turns finished workpapers into portable replay skills.

How It Works

The SOX Testing plugin packages seven user-facing skills, an auto-loaded methodology reference, and five leaf subagents into a single Claude Code plugin focused on SOX 404 internal-controls testing. The main /sox-testing skill walks through control identification, sample sizing, sample selection, and workpaper generation for a control area and period — delegating deterministic procedures to /sox-python, xlsx screenshot evidence to /sox-annotate-xlsx, recorded walkthroughs to /sox-from-video, and live evidence collection through Chrome to /sox-from-web.

Every workpaper is annotated for review: red-bordered narrative regions on top, original screenshots below with red-rectangle Excel shapes overlaying the specific attributes tested, plus bold red gutter labels to the right of every image listing each attribute and its observed value (e.g. Approver: Requester's Manager, Password expires after 120 days: unchecked). When multiple boxes share a y-coordinate, the lower label is pushed down to clear the upper one and a thin red connector line keeps the box→label mapping unambiguous. Pixels are never burned in, so reviewers can move or delete shapes. Deterministic procedures capture full source code, output, runtime, and exit code. Once a control area is tested, /sox-replay-build packages the finished workpaper into a portable .skill that replays the same test next period without rewriting the deterministic core.

Under the hood, three leaf agents examine every evidence image. sox-evidence-boxer locates each requested attribute and extracts its on-screen observed value (checked / unchecked, displayed text, dropdown selection, currency amount). In parallel, sox-evidence-context scans for ambient signals that modify the meaning of the value (grayed-out / disabled rows, status badges, error banners, unsaved indicators, lock icons). sox-evidence-reviewer then double-checks both box position and extracted value against a Pillow-rendered overlay. The orchestrator merges all three signals to auto-judge pass/fail per attribute from text alone; image bytes never enter its context. The Summary Reasoning cell spells out observed value, expected value, both confidences, and any confirming or disqualifying context, so an external auditor reading the workpaper sees the full evidence chain without re-viewing the screenshot.

Privacy by design: All testing runs locally through Claude Code. Your control data, evidence screenshots, transcripts, and generated workpapers stay on your machine. The plugin works without any MCP server; Cowork Canvas is supported when present and degrades gracefully when not.

Key Features

End-to-end SOX 404 flow — control matrix, sample sizing, sample selection, workpaper generation, deficiency framework, and a finished-workpaper rubric grade in a single skill chain
Deterministic, reproducible procedures — sample draws, three-way matches, recomputations, and threshold checks run via /sox-python so the source code, output, runtime, and exit code all land on a per-procedure tab
Evidence annotation that survives review — red-rectangle Excel shapes anchored over the original screenshots, plus bold red gutter labels listing each attribute and its observed value to the right of every image with collision-resolved positioning and connector lines; reviewers can move, resize, or delete shapes without touching the underlying image
Auto-judge pass/fail from text alone — the boxer extracts each attribute's observed value (checked / unchecked, displayed text, dropdown selection, currency amount), the context agent flags ambient signals that modify its meaning (disabled rows, status badges, error banners, unsaved indicators, lock icons), and the reviewer double-checks both. The orchestrator merges all three to decide pass / fail / needs_human per attribute — image bytes never enter its context
Live evidence collection — /sox-from-web drives Chrome through the Claude for Chrome extension to capture screenshots on demand, with chain-of-custody metadata (source URL, viewport, timestamp) and a read-only deny-list that refuses clicks on mutating verbs (Approve / Submit / Save / Delete / Pay)
Walkthrough video support — mp4 + transcript turn into per-sample-per-test detail tabs with annotated frames; gaps in transcript coverage are surfaced explicitly
Bring-your-own template — /sox-from-template auto-detects the shape of a firm xlsx (per-sample tabs, master matrix, per-test tabs, single tab) and writes back into the user's bespoke layout
Portable next-period replays — /sox-replay-build packages a finished workpaper into a standalone .skill with deterministic scripts preserved verbatim and SHA-verified
Isolated grader for trustworthy verdicts — sox-workpaper-grader runs in a fresh context window with no exposure to the orchestrator's reasoning, scoring against a rubric of required and recommended criteria
Methodology in one place — the auto-loaded audit-support reference holds sample-size buckets, evidence sufficiency standards, deficiency classification, and the full control-type taxonomy; other skills consult it rather than duplicating

The Skills

Main Flow

/sox-testing — Plan and Execute a Control Area

Plans a control area's testing for a period: builds the control matrix, sizes and draws samples, scaffolds the workpaper, and coordinates evidence annotation and deterministic procedures. Calls /sox-python for any deterministic step, /sox-annotate-xlsx when evidence is xlsx with embedded screenshots, /sox-from-video when evidence is a recorded auditor walkthrough, and /sox-from-web when evidence is collected live through Chrome. Dispatches the sox-workpaper-grader agent at the end for an independent verdict.

Evidence Helpers

/sox-python — Deterministic, Reproducible Procedures

For any procedure expressible as code — random sample draw, three-way match, reconciliation tie-out, threshold check, exception roll-up. Generates a Python script, runs it, captures stdout/stderr, and appends a per-sample-per-test detail tab with the full source code, output, runtime, and exit code.
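The capture step described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual implementation: the ProcedureRecord fields and the run_procedure helper are hypothetical names chosen for this example, and the sketch stops short of writing the detail tab.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ProcedureRecord:
    """What a detail tab would need to show (field names are illustrative)."""
    source_code: str
    stdout: str
    stderr: str
    runtime_seconds: float
    exit_code: int


def run_procedure(script_path: Path, timeout: float = 300.0) -> ProcedureRecord:
    """Run a generated script in a child interpreter and record the evidence."""
    source = script_path.read_text(encoding="utf-8")
    started = time.perf_counter()
    completed = subprocess.run(
        [sys.executable, str(script_path)],
        capture_output=True,  # keep stdout and stderr separate
        text=True,
        timeout=timeout,
    )
    elapsed = time.perf_counter() - started
    return ProcedureRecord(
        source_code=source,
        stdout=completed.stdout,
        stderr=completed.stderr,
        runtime_seconds=elapsed,
        exit_code=completed.returncode,
    )
```

Running the script in a separate interpreter keeps its exit code meaningful: a failing tie-out can exit non-zero and the record still captures everything printed before it stopped.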

/sox-annotate-xlsx — Annotated Evidence + Auto-Judged Pass/Fail

Takes an xlsx whose sheets contain embedded evidence screenshots. For each requested attribute, identifies the pixel bounding box and extracts the on-screen observed value (typed by a closed attribute_kind enum — checkbox / toggle / text / currency / dropdown / date), then writes back red-outline rectangle Excel shapes plus bold red gutter labels (Approver: Requester's Manager) anchored over the original images. A parallel context scan flags ambient signals that modify the value's meaning (grayed-out rows, status badges, error banners, lock icons), and a reviewer agent double-checks both. The orchestrator merges all three signals to auto-judge pass / fail / needs_human per attribute and writes the comparison into the Summary Reasoning cell. Pixels are never burned in.
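One way to picture the merge step is the sketch below. The field names, the 0.8 confidence floor, the choice of which signals count as disqualifying, and the rule that a disqualifying signal turns a matching value into a fail are all assumptions made for illustration; the plugin's real decision rules may differ.

```python
from dataclasses import dataclass, field

# Assumed set of context signals that undermine an otherwise matching value.
DISQUALIFYING = {"disabled-indicator", "error-banner", "unsaved-indicator"}
MIN_CONFIDENCE = 0.8  # assumed threshold, not taken from the plugin


@dataclass
class AttributeSignals:
    attribute: str
    expected: str
    observed: str
    boxer_confidence: float
    reviewer_confidence: float
    position_verdict: str = "ok"
    value_verdict: str = "ok"
    context_signals: list[str] = field(default_factory=list)


def judge(s: AttributeSignals) -> tuple[str, str]:
    """Return (verdict, reasoning) using text signals only, never image bytes."""
    reasoning = (
        f"{s.attribute}: observed '{s.observed}', expected '{s.expected}'; "
        f"boxer confidence {s.boxer_confidence:.2f}, "
        f"reviewer confidence {s.reviewer_confidence:.2f}; "
        f"context: {', '.join(s.context_signals) or 'none'}."
    )
    # Unresolved reviewer disagreement or weak confidence goes to a person.
    if s.position_verdict != "ok" or s.value_verdict != "ok":
        return "needs_human", reasoning
    if min(s.boxer_confidence, s.reviewer_confidence) < MIN_CONFIDENCE:
        return "needs_human", reasoning
    if s.observed.strip().lower() != s.expected.strip().lower():
        return "fail", reasoning
    # "Configured but disabled": the value matches but the context negates it.
    if DISQUALIFYING & set(s.context_signals):
        return "fail", reasoning
    return "pass", reasoning
```

A more conservative policy could route the disqualifying-context case to needs_human instead of fail; the sketch only shows where that decision would sit.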

/sox-from-video — Walkthrough Recordings as Evidence

Takes an mp4 walkthrough plus a transcript. Parses the transcript for moments where the auditor reviews each in-scope attribute, extracts the corresponding video frames, identifies bounding boxes and observed values via vision, runs the same context + review passes, and writes per-sample-per-test detail tabs with annotated frames. Self-contained — vendors its own opencv-based frame extractor.

/sox-from-web — Live Evidence via the Claude for Chrome Extension

For when the evidence lives behind a login the tester already has. Drives Chrome through the Claude for Chrome extension to collect screenshots on demand, capturing chain-of-custody metadata (source URL, viewport, timestamp) and feeding the manifest into the same boxer / context / reviewer / writer pipeline /sox-annotate-xlsx uses. Read-only is enforced: every non-navigation click is filtered through a deny-list of mutating verbs (Approve / Submit / Save / Delete / Pay), and the run pauses on SSO / MFA pages for the tester to authenticate. Cowork session + Chrome extension required.
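The read-only filter can be illustrated with a small check like the one below. It is a sketch under assumptions: the is_click_allowed helper is hypothetical, and matching whole words in the element's visible label is one plausible way to apply the verb list, not necessarily how the plugin does it.

```python
import re

# Verb list taken from the description above.
MUTATING_VERBS = ("approve", "submit", "save", "delete", "pay")
_MUTATING = re.compile(r"\b(" + "|".join(MUTATING_VERBS) + r")\b", re.IGNORECASE)


def is_click_allowed(element_label: str) -> bool:
    """Refuse a click when the element's label contains a mutating verb."""
    return _MUTATING.search(element_label) is None
```

Whole-word matching lets a link such as "View payment history" through while still refusing "Pay now". It would miss inflected forms such as "Submitting", so a stricter deployment might prefer prefix matching at the cost of more false refusals.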

Adaptation & Reuse

/sox-from-template — Adapt to Your Firm's Layout

Pre-processor that takes a firm-supplied xlsx template (typically with one sample completed as a guide), auto-detects its shape (per-sample tabs, master matrix, per-test tabs, or single tab), profiles every placeholder cell to a semantic field, and scaffolds copies of the exemplar for the rest of the population. Emits a template-profile.json that the evidence helpers consume via --template-profile — bypassing the canonical layout when the firm template is the right fit.

/sox-replay-build — Portable Next-Period Replays

Reads a completed workpaper and builds a portable .skill ZIP that replays the same test next period. Copies the deterministic scripts verbatim (SHA-verified for tamper-evidence), captures the locked attribute lists for evidence tests, and bakes methodology and the test plan into a generated SKILL.md. The builder writes no new Python — the deterministic core is preserved exactly as it ran the first time.
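Tamper-evidence of this kind is commonly done with a digest manifest, sketched below. The text above does not say which SHA variant is used, so SHA-256 is an assumption here, as are the build_manifest and verify_manifest helper names and the flat directory of .py files.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's exact bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(script_dir: Path) -> dict[str, str]:
    """Record a digest for every script at packaging time."""
    return {p.name: sha256_of(p) for p in sorted(script_dir.glob("*.py"))}


def verify_manifest(script_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the names of scripts that are missing or no longer match."""
    return [
        name
        for name, digest in manifest.items()
        if not (script_dir / name).is_file()
        or sha256_of(script_dir / name) != digest
    ]
```

At replay time an empty list from verify_manifest would indicate the scripts are byte-for-byte what ran in the prior period; any name in the list is a reason to stop before executing.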

Reference

audit-support — SOX 404 Methodology (Auto-Loaded)

Auto-loaded reference skill (user-invocable: false). Holds sample-size buckets by risk level, the four sample-selection methods, evidence sufficiency standards, deficiency classification with indicators, deficiency aggregation rules, and the full control-type taxonomy — ITGC, manual, automated, IT-dependent manual, and entity-level. Other skills consult this one rather than duplicating methodology.

Agents

Five leaf subagents that the orchestrator skills dispatch to keep large or sensitive inputs (image bytes, transcript text, finished-workpaper introspection) out of the orchestrator's context window. Each agent has a tightly scoped tool list (the evidence trio gets only Read and Write, so image bytes can't escape), and image fan-out runs in parallel so a 30-image workpaper takes one round trip, not 30.

sox-evidence-boxer — Boxes + Observed Values

Dispatched by /sox-annotate-xlsx, /sox-from-video, and /sox-from-web once per image. Views a single screenshot or video frame, identifies pixel bounding boxes for each requested attribute, and extracts the on-screen observed value (checkbox state, displayed text, dropdown selection, currency amount) typed by a closed attribute_kind enum. Returns a boxes JSON plus an observed-values map so the orchestrator can auto-judge pass/fail from text alone. The orchestrator never sees the image bytes. Run in parallel: a single message with multiple Agent calls.
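To make the "closed enum" idea concrete, here is a sketch of how a consumer might validate the boxer's output. The six kinds come from the /sox-annotate-xlsx description; the JSON field names (attribute, kind, box, observed_value, confidence) and the parse_boxer_payload helper are hypothetical, since the actual schema is not documented here.

```python
import json
from dataclasses import dataclass
from enum import Enum


class AttributeKind(str, Enum):
    CHECKBOX = "checkbox"
    TOGGLE = "toggle"
    TEXT = "text"
    CURRENCY = "currency"
    DROPDOWN = "dropdown"
    DATE = "date"


@dataclass
class BoxedAttribute:
    attribute: str
    kind: AttributeKind
    box: tuple[int, ...]  # assumed order: left, top, width, height in pixels
    observed_value: str
    confidence: float


def parse_boxer_payload(raw: str) -> list[BoxedAttribute]:
    """Parse boxer JSON; an unknown kind raises ValueError (closed enum)."""
    return [
        BoxedAttribute(
            attribute=item["attribute"],
            kind=AttributeKind(item["kind"]),
            box=tuple(item["box"]),
            observed_value=item["observed_value"],
            confidence=float(item["confidence"]),
        )
        for item in json.loads(raw)
    ]
```

Rejecting unknown kinds at parse time means the orchestrator never has to guess how to compare a value type it was not designed for.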

sox-evidence-context — Ambient Signal Scanner

Dispatched in parallel with the boxer (default-on; --no-context skips) by every evidence-collecting skill. Independently locates each requested field by its label and surfaces ambient signals that modify how the observed value should be interpreted — disabled-indicator (grayed-out rows, locked icons), status-badge (active / inactive / draft / expired), error-banner, unsaved-indicator, permission-indicator, nearby-state. Plus a top-level screen_state that halts the pipeline if the screen is a modal, an error page, a permission-denied view, or still loading. Catches the "configured but disabled" case the boxer alone can't see.

sox-evidence-reviewer — Position + Value Double-Check

Dispatched once per image after the boxer + context pair (default-on; --no-review skips). Views a Pillow-rendered overlay (the original image with each proposed box drawn on it and labeled by field) plus the original image, and emits two independent verdicts per box — position (ok / shift / wrong-field / miss-extra) and value (ok / wrong-value / unreadable). Drift gets routed back to the boxer for revision (capped at 2 retries) on the appropriate axis only, so a value re-read doesn't move a correct box and a box move doesn't regress a correct value.
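The axis-specific retry loop might look something like the sketch below. The review, revise_position, and revise_value callables stand in for agent dispatches and are hypothetical; the cap of two retries and the verdict vocabulary come from the description above, while the "needs_human" fallback status is an assumption.

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]
MAX_RETRIES = 2  # cap stated in the description above


def review_loop(
    box: Box,
    value: str,
    review: Callable[[Box, str], Tuple[str, str]],
    revise_position: Callable[[Box, str], Box],
    revise_value: Callable[[Box, str], str],
) -> Tuple[Box, str, str]:
    """Re-review after each revision, touching only the axis that drifted."""
    for attempt in range(MAX_RETRIES + 1):
        position_verdict, value_verdict = review(box, value)
        if position_verdict == "ok" and value_verdict == "ok":
            return box, value, "accepted"
        if attempt == MAX_RETRIES:
            break  # retries exhausted; do not revise again
        if position_verdict != "ok":
            box = revise_position(box, position_verdict)
        if value_verdict != "ok":
            value = revise_value(box, value_verdict)
    return box, value, "needs_human"
```

Because each axis is revised only when its own verdict is not ok, a value re-read leaves a correct box where it is, and a box move does not trigger a fresh value extraction.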

sox-walkthrough-parser — Transcript to Timestamps

Dispatched by /sox-from-video once per recording. Reads a transcript (.vtt, .srt, .txt, .docx), identifies the moments where the auditor reviewed each in-scope attribute, and emits a timestamps.json. Transcript text never returns to the orchestrator — only a small summary with timestamp count, samples covered, tests covered, and any gaps.
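The shape of the parser's output can be illustrated with a deliberately simplified stand-in. The real agent reads the transcript with a language model; the sketch below swaps in plain keyword matching over .vtt cues, assumes hh:mm:ss.mmm cue timings, and uses a hypothetical result layout (timestamps plus gaps) rather than the actual timestamps.json schema.

```python
import re

# Matches a cue timing line such as "00:01:10.500 --> 00:01:15.000".
CUE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+\d{2}:\d{2}:\d{2}\.\d{3}"
)


def find_attribute_moments(vtt_text: str, attributes: list[str]) -> dict:
    """Map each attribute to cue start times (seconds) where it is mentioned."""
    moments: dict[str, list[float]] = {a: [] for a in attributes}
    current = None  # start time of the cue being read
    for line in vtt_text.splitlines():
        match = CUE.match(line.strip())
        if match:
            h, m, s, ms = map(int, match.groups())
            current = h * 3600 + m * 60 + s + ms / 1000
            continue
        if current is None:
            continue  # header lines before the first cue
        for attribute in attributes:
            mentioned = attribute.lower() in line.lower()
            if mentioned and current not in moments[attribute]:
                moments[attribute].append(current)
    gaps = [a for a, times in moments.items() if not times]
    return {"timestamps": moments, "gaps": gaps}
```

Reporting gaps alongside the timestamps mirrors the behavior described above: an attribute that the walkthrough never discusses is surfaced explicitly rather than silently skipped.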

sox-workpaper-grader — Independent Rubric Verdict

Dispatched by /sox-testing after the workpaper is fully assembled. Reads the workpaper rubric, introspects the xlsx via openpyxl, scores each criterion (Required + Recommended), and writes a verdict JSON with overall: pass | fail plus any blocking failures. Runs in a fresh context window with no exposure to the orchestrator's reasoning — that isolation is what makes the verdict meaningful.
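A verdict of the kind described could be assembled as in the sketch below. The rule that only Required criteria block a pass is an assumption, as are the grade helper and every key other than overall; the criterion names in the usage note are made up for illustration.

```python
def grade(results: dict[str, bool], required: set[str], recommended: set[str]) -> dict:
    """Build a verdict dict; a criterion missing from results counts as failed."""
    blocking = sorted(c for c in required if not results.get(c, False))
    missed = sorted(c for c in recommended if not results.get(c, False))
    return {
        "overall": "fail" if blocking else "pass",
        "blocking_failures": blocking,
        "recommended_missed": missed,
    }
```

For example, grade({"has_conclusion": True, "has_seed": False}, {"has_conclusion"}, {"has_seed"}) would pass overall while still listing has_seed as a missed recommendation.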

Workpaper Layout

By default, every per-sample tab the plugin produces follows the same two-region layout:

Top: a red-bordered narrative region with Observation, Procedures Performed, and Conclusion sections describing the test in prose. Every Summary row also carries a 1–3 sentence reasoning string per test — spelling out observed value, expected value, both confidences, and any confirming or disqualifying context, so an external reviewer reads the full evidence chain inline.
Below: the original screenshots, with red-rectangle Excel shapes overlaying the specific attributes tested, plus bold red gutter labels to the right of every image listing each attribute and its observed value (Approver: Requester's Manager, Password expires after 120 days: unchecked). When multiple boxes share a y-coordinate, the lower label is pushed down to clear the upper one with a small gap, and a thin red connector line keeps the box→label mapping unambiguous. Shapes, labels, and connectors are all anchored — pixels are never burned in.
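The label collision handling described above can be sketched as a single top-to-bottom pass. The label height, gap size, and the choice to flag only displaced labels for a connector are assumptions for illustration; place_labels is a hypothetical helper, not the plugin's function.

```python
def place_labels(box_tops: list[int], label_height: int = 18, gap: int = 4):
    """Return (label_tops, needs_connector), both in the input order.

    Each label starts level with its box; if it would overlap the label
    above it, it is pushed down just far enough to leave a small gap.
    """
    order = sorted(range(len(box_tops)), key=lambda i: box_tops[i])
    label_tops = [0] * len(box_tops)
    next_free = float("-inf")  # first y-coordinate not occupied by a label
    for i in order:
        y = max(box_tops[i], next_free)
        label_tops[i] = y
        next_free = y + label_height + gap
    # A displaced label no longer sits beside its box, so it gets a connector.
    needs_connector = [label_tops[i] != box_tops[i] for i in range(len(box_tops))]
    return label_tops, needs_connector
```

With two boxes at y=100 and one at y=300, the second label moves to y=122 and is flagged for a connector, while the other two stay level with their boxes.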

When /sox-from-template is run first, the canonical layout is replaced by the firm's template layout — narrative text, results, reasoning, and image annotations all land in the cells the template profile identifies, with the firm's branding, column structure, fonts, and merges preserved.

Getting Started

Install the plugin

Download the plugin below and install it in Claude Code — the .claude-plugin/plugin.json manifest sits at the root of the zip. The plugin registers seven user-facing slash-command skills, an auto-loaded methodology reference, and five subagents the orchestrator dispatches automatically.

Run /sox-testing <area> <period>

Open any directory you want to use as a SOX testing workspace and run /sox-testing procure-to-pay 2024-Q4 (or your own control area and period). The skill walks through control identification, sample sizing, and sample selection — delegating to /sox-python so the random seed and selection logic are captured in a tamper-evident detail tab.
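A seeded draw of the kind a detail tab would capture might look like the sketch below. The draw_sample helper and its parameters are hypothetical; the point is only that a fixed seed plus a canonical ordering of the population makes the selection repeatable.

```python
import random


def draw_sample(population_ids: list[str], sample_size: int, seed: int) -> list[str]:
    """Draw a reproducible random sample without replacement."""
    if sample_size > len(population_ids):
        raise ValueError("sample size exceeds population")
    rng = random.Random(seed)  # private generator; global state is untouched
    ordered = sorted(population_ids)  # input order must not change the draw
    return sorted(rng.sample(ordered, sample_size))
```

Calling draw_sample twice with the same population, size, and seed returns the same items. Recording the Python version alongside the seed is a sensible precaution, since the selection depends on the standard library's sampling routine.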

(Optional) Adapt to your template

If your firm has a bespoke xlsx layout, run /sox-from-template <template.xlsx> first. The skill profiles the template, scaffolds copies of the exemplar tab for the remaining samples, and emits a profile JSON the evidence helpers consume to write into your layout instead of the canonical one.

Annotate evidence

For screenshot evidence, run /sox-annotate-xlsx <evidence.xlsx> "Approver, Amount Threshold, Self-Approval Checkbox". For walkthrough recordings, run /sox-from-video <walkthrough.mp4> <transcript>. For live collection through Chrome, run /sox-from-web <collection-plan> (Cowork session with the Claude for Chrome extension required). All three fan out across images in parallel, dispatching the boxer, context, and reviewer agents for each image, and write annotated detail tabs back to the workpaper, with the orchestrator auto-judging pass/fail per attribute from the structured signals.

Build a replay skill for next period

Once the workpaper is signed off, run /sox-replay-build <workpaper.xlsx> to package the deterministic scripts, locked attribute lists, methodology, and test plan into a portable .skill. Drop the .skill into next period's workspace and replay the same test without rewriting code.

Download Plugin

Open source under the MIT License. Free to use, modify, and distribute.