SOX Testing
A Claude Code plugin for SOX 404 internal-controls testing. Generates control matrices, sample selections, and testing workpapers — with evidence helpers, deterministic Python, an isolated grader, and a builder that turns finished workpapers into portable replay skills.
How It Works
The SOX Testing plugin packages seven user-facing skills, an auto-loaded
methodology reference, and five leaf subagents into a single Claude Code
plugin focused on SOX 404 internal-controls testing. The main
/sox-testing skill walks through control identification,
sample sizing, sample selection, and workpaper generation for a control
area and period — delegating deterministic procedures to
/sox-python, xlsx screenshot evidence to
/sox-annotate-xlsx, recorded walkthroughs to
/sox-from-video, and live evidence collection through Chrome to
/sox-from-web.
Every workpaper is annotated for review: red-bordered narrative regions
on top, original screenshots below with red-rectangle Excel shapes
overlaying the specific attributes tested plus bold red gutter
labels to the right of every image listing each attribute and
its observed value (e.g. Approver: Requester's Manager,
Password expires after 120 days: unchecked). When multiple
boxes share a y-coordinate, the lower label is pushed down to clear the
upper one and a thin red connector line keeps the box→label mapping
unambiguous. Pixels are never burned in — reviewers can move or
delete shapes. Deterministic procedures capture full source code, output,
runtime, and exit code. Once a control area is tested,
/sox-replay-build packages the finished workpaper into a
portable .skill that replays the same test next period
without rewriting the deterministic core.
Under the hood, three leaf agents process every evidence
image: sox-evidence-boxer locates each requested attribute
and extracts its on-screen observed value (checked /
unchecked, displayed text, dropdown selection, currency amount),
sox-evidence-context scans for ambient signals
that modify the meaning of the value (grayed-out / disabled rows, status
badges, error banners, unsaved indicators, lock icons), and
sox-evidence-reviewer double-checks both axes against a
Pillow-rendered overlay. The orchestrator merges all three signals to
auto-judge pass/fail per attribute from text alone —
image bytes never enter its context. The Summary Reasoning
cell spells out observed value, expected value, both confidences, and any
confirming or disqualifying context, so an external auditor reading the
workpaper sees the full evidence chain without re-viewing the screenshot.
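The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual logic: the AttributeSignals fields, the set of disqualifying signals, and the rule that a matching value with disqualifying context goes to needs_human are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: which ambient signals disqualify an otherwise matching value.
DISQUALIFYING = {"disabled-indicator", "error-banner", "unsaved-indicator"}


@dataclass
class AttributeSignals:
    """Hypothetical text-only record the orchestrator receives per attribute."""
    name: str
    expected: str
    observed: str              # value extracted by the boxer
    reviewer_value_ok: bool    # reviewer's verdict on the value axis
    context: List[str] = field(default_factory=list)  # ambient signal kinds


def judge(sig: AttributeSignals) -> str:
    """Return 'pass', 'fail', or 'needs_human' without touching image bytes."""
    if not sig.reviewer_value_ok:
        return "needs_human"   # the extracted value itself is in doubt
    if sig.observed.strip().lower() != sig.expected.strip().lower():
        return "fail"
    if DISQUALIFYING & set(sig.context):
        return "needs_human"   # e.g. configured but disabled
    return "pass"


verdict = judge(
    AttributeSignals("MFA required", "checked", "checked", True, ["disabled-indicator"])
)
```

Here verdict is needs_human: the value matches, but the disabled-row signal keeps it from being an automatic pass.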
Key Features
- End-to-end SOX 404 flow: control matrix, sample sizing, sample selection, workpaper generation, deficiency framework, and a finished-workpaper rubric grade in a single skill chain
- Deterministic, reproducible procedures: sample draws, three-way matches, recomputations, and threshold checks run via /sox-python so the source code, output, runtime, and exit code all land on a per-procedure tab
- Evidence annotation that survives review: red-rectangle Excel shapes anchored over the original screenshots, plus bold red gutter labels listing each attribute and its observed value to the right of every image with collision-resolved positioning and connector lines; reviewers can move, resize, or delete shapes without touching the underlying image
- Auto-judge pass/fail from text alone: the boxer extracts each attribute's observed value (checked / unchecked, displayed text, dropdown selection, currency amount), the context agent flags ambient signals that modify its meaning (disabled rows, status badges, error banners, unsaved indicators, lock icons), and the reviewer double-checks both. The orchestrator merges all three to decide pass/fail/needs_human per attribute; image bytes never enter its context
- Live evidence collection: /sox-from-web drives Chrome through the Claude for Chrome extension to capture screenshots on demand, with chain-of-custody metadata (source URL, viewport, timestamp) and a read-only deny-list that refuses clicks on mutating verbs (Approve / Submit / Save / Delete / Pay)
- Walkthrough video support: mp4 + transcript turn into per-sample-per-test detail tabs with annotated frames; gaps in transcript coverage are surfaced explicitly
- Bring-your-own template: /sox-from-template auto-detects the shape of a firm xlsx (per-sample tabs, master matrix, per-test tabs, single tab) and writes back into the user's bespoke layout
- Portable next-period replays: /sox-replay-build packages a finished workpaper into a standalone .skill with deterministic scripts preserved verbatim and SHA-verified
- Isolated grader for trustworthy verdicts: sox-workpaper-grader runs in a fresh context window with no exposure to the orchestrator's reasoning, scoring against a rubric of required and recommended criteria
- Methodology in one place: the auto-loaded audit-support reference holds sample-size buckets, evidence sufficiency standards, deficiency classification, and the full control-type taxonomy; other skills consult it rather than duplicating
The Skills
Main Flow
/sox-testing plans a control area's testing for a period: builds the control
matrix, sizes and draws samples, scaffolds the workpaper, and
coordinates evidence annotation and deterministic procedures. Calls
/sox-python for any deterministic step,
/sox-annotate-xlsx when evidence is xlsx with embedded
screenshots, and /sox-from-video when evidence is a
recorded auditor walkthrough. Dispatches the
sox-workpaper-grader agent at the end for an
independent verdict.
Evidence Helpers
/sox-python handles any procedure expressible as code: random sample draw, three-way match, reconciliation tie-out, threshold check, exception roll-up. It generates a Python script, runs it, captures stdout/stderr, and appends a per-sample-per-test detail tab with the full source code, output, runtime, and exit code.
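As a rough sketch of what capturing a deterministic procedure could look like, the snippet below runs a seeded sample draw in a child interpreter and records the four items the detail tab needs. The run_procedure helper, the record keys, the seed, and the population are hypothetical and chosen only for illustration.

```python
import subprocess
import sys
import time

# Illustrative procedure: a seeded sample draw. Seed and population are made up.
SCRIPT = """\
import random
population = list(range(1, 201))   # e.g. 200 purchase-order IDs
rng = random.Random(20244)          # fixed seed so the draw can be re-run
print(sorted(rng.sample(population, 25)))
"""


def run_procedure(source: str) -> dict:
    """Run a script in a child interpreter and capture what a detail tab records."""
    start = time.perf_counter()
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=60,
    )
    return {
        "source": source,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "runtime_s": round(time.perf_counter() - start, 3),
        "exit_code": proc.returncode,
    }


record = run_procedure(SCRIPT)
```

Because the seed is fixed inside the captured source, re-running the same source reproduces the same 25 selections, which is the property a reviewer would rely on.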
/sox-annotate-xlsx takes an xlsx whose sheets contain embedded evidence screenshots.
For each requested attribute, identifies the pixel bounding box and
extracts the on-screen observed value (typed by a closed
attribute_kind enum — checkbox / toggle / text /
currency / dropdown / date), then writes back red-outline rectangle
Excel shapes plus bold red gutter labels (Approver:
Requester's Manager) anchored over the original images. A
parallel context scan flags ambient signals that modify the value's
meaning (grayed-out rows, status badges, error banners, lock icons),
and a reviewer agent double-checks both. The orchestrator merges
all three signals to auto-judge pass / fail / needs_human per
attribute and writes the comparison into the Summary Reasoning cell.
Pixels are never burned in.
/sox-from-video takes an mp4 walkthrough plus a transcript. It parses the transcript for moments where the auditor reviews each in-scope attribute, extracts the corresponding video frames, identifies bounding boxes and observed values via vision, runs the same context + review passes, and writes per-sample-per-test detail tabs with annotated frames. It is self-contained and vendors its own opencv-based frame extractor.
/sox-from-web is for when the evidence lives behind a login the tester already has.
Drives Chrome through the
Claude for Chrome
extension to collect screenshots on demand, capturing chain-of-custody
metadata (source URL, viewport, timestamp) and feeding the manifest
into the same boxer / context / reviewer / writer pipeline
/sox-annotate-xlsx uses. Read-only is enforced: every
non-navigation click is filtered through a deny-list of mutating
verbs (Approve / Submit / Save / Delete / Pay), and the run pauses
on SSO / MFA pages for the tester to authenticate. Cowork session +
Chrome extension required.
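A deny-list filter of the kind described above might look like the sketch below. The verb list comes from the text; the click_allowed name and the whole-word, case-insensitive matching rule are assumptions, and the real filter may be stricter.

```python
import re

# Verbs taken from the description above; the matching rule is an assumption.
MUTATING_VERBS = ("approve", "submit", "save", "delete", "pay")
_DENY = re.compile(r"\b(" + "|".join(MUTATING_VERBS) + r")\b", re.IGNORECASE)


def click_allowed(label: str) -> bool:
    """Return False when a button or link label contains a mutating verb."""
    return _DENY.search(label) is None


allowed = [lbl for lbl in ("View details", "Approve", "Payment history", "Save & Close")
           if click_allowed(lbl)]
```

With whole-word matching, "Payment history" is still clickable while "Approve" and "Save & Close" are refused, so allowed ends up as ["View details", "Payment history"].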
Adaptation & Reuse
/sox-from-template is a pre-processor that takes a firm-supplied xlsx template (typically
with one sample completed as a guide), auto-detects its shape
(per-sample tabs, master matrix, per-test tabs, or single tab),
profiles every placeholder cell to a semantic field, and scaffolds
copies of the exemplar for the rest of the population. Emits a
template-profile.json that the evidence helpers
consume via --template-profile — bypassing the
canonical layout when the firm template is the right fit.
/sox-replay-build reads a completed workpaper and builds a portable .skill
ZIP that replays the same test next period. Copies the deterministic
scripts verbatim (SHA-verified for tamper-evidence), captures the
locked attribute lists for evidence tests, and bakes methodology
and the test plan into a generated SKILL.md. The builder
writes no new Python — the deterministic core is preserved
exactly as it ran the first time.
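The tamper-evidence check can be illustrated with a small digest manifest. This sketch assumes SHA-256 and a flat directory of .py files; the text only says "SHA-verified", and the function names and manifest shape here are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's exact bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(script_dir: Path) -> dict:
    """Record a digest for every script at packaging time."""
    return {p.name: sha256_of(p) for p in sorted(script_dir.glob("*.py"))}


def verify(script_dir: Path, manifest: dict) -> list:
    """Return the names of scripts that are missing or no longer match."""
    return [
        name
        for name, digest in manifest.items()
        if not (script_dir / name).is_file() or sha256_of(script_dir / name) != digest
    ]


# Demo in a throwaway directory: package, verify, tamper, verify again.
with tempfile.TemporaryDirectory() as tmp:
    scripts = Path(tmp)
    (scripts / "sample_draw.py").write_text("print('draw')\n")
    manifest = build_manifest(scripts)
    clean = verify(scripts, manifest)
    (scripts / "sample_draw.py").write_text("print('tampered')\n")
    tampered = verify(scripts, manifest)
```

After the demo, clean is an empty list and tampered names the modified script, which is the signal a replay would need before trusting the deterministic core.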
Reference
audit-support is an auto-loaded reference skill (user-invocable: false).
Holds sample-size buckets by risk level, the four sample-selection
methods, evidence sufficiency standards, deficiency classification
with indicators, deficiency aggregation rules, and the full
control-type taxonomy — ITGC, manual, automated, IT-dependent
manual, and entity-level. Other skills consult this one rather
than duplicating methodology.
Agents
Five leaf subagents the orchestrator skills dispatch to keep large or
sensitive inputs — image bytes, transcript text, finished-workpaper
introspection — out of the orchestrator's context window. Each
agent has a tightly scoped tool list (the evidence trio gets only Read and
Write, so image bytes can't escape), and image fan-out runs
in parallel so a 30-image workpaper takes one round trip, not 30.
sox-evidence-boxer is dispatched by /sox-annotate-xlsx, /sox-from-video,
and /sox-from-web once per image. Views a single
screenshot or video frame, identifies pixel bounding boxes for
each requested attribute, and extracts the on-screen observed
value (checkbox state, displayed text, dropdown selection,
currency amount) typed by a closed attribute_kind
enum. Returns a boxes JSON plus an observed-values map so the
orchestrator can auto-judge pass/fail from text alone. The
orchestrator never sees the image bytes. Run in parallel: a
single message with multiple Agent calls.
sox-evidence-context is dispatched in parallel with the boxer (default-on; --no-context
skips) by every evidence-collecting skill. Independently locates
each requested field by its label and surfaces ambient signals
that modify how the observed value should be interpreted —
disabled-indicator (grayed-out rows, locked icons),
status-badge (active / inactive / draft / expired),
error-banner, unsaved-indicator,
permission-indicator, nearby-state. It also returns
a top-level screen_state that halts the pipeline if
the screen is a modal, an error page, a permission-denied view,
or still loading. Catches the "configured but disabled" case the
boxer alone can't see.
sox-evidence-reviewer is dispatched once per image after the boxer + context pair (default-on;
--no-review skips). Views a Pillow-rendered overlay
(the original image with each proposed box drawn on it and
labeled by field) plus the original image, and emits two
independent verdicts per box — position (ok / shift /
wrong-field / miss-extra) and value (ok / wrong-value /
unreadable). Drift gets routed back to the boxer for revision
(capped at 2 retries) on the appropriate axis only, so a value
re-read doesn't move a correct box and a box move doesn't regress
a correct value.
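The axis-scoped retry described above can be sketched as a small loop. Only the cap of 2 retries and the two axes come from the text; the refine function, the dict shapes, the "settled" flag, and the stub reviewer and reviser are hypothetical.

```python
from typing import Callable, Dict

MAX_RETRIES = 2  # cap stated in the description above


def refine(box: Dict, review: Callable[[Dict], Dict],
           revise: Callable[[Dict, str], Dict]) -> Dict:
    """Send a box back for revision only on the axis the reviewer flagged."""
    for attempt in range(MAX_RETRIES + 1):
        verdict = review(box)
        flagged = [axis for axis in ("position", "value") if verdict[axis] != "ok"]
        if not flagged:
            return {**box, "settled": True}
        if attempt == MAX_RETRIES:
            break  # retries exhausted; leave it for a human
        for axis in flagged:
            box = revise(box, axis)
    return {**box, "settled": False}


# Stub reviewer and reviser standing in for the real agents.
def fake_review(box: Dict) -> Dict:
    value_ok = box["value"] == "unchecked"
    return {"position": "ok", "value": "ok" if value_ok else "wrong-value"}


def fake_revise(box: Dict, axis: str) -> Dict:
    if axis == "value":
        return {**box, "value": "unchecked"}   # re-read the value only
    return {**box, "rect": (0, 0, 1, 1)}       # move the box only


result = refine({"rect": (40, 120, 200, 24), "value": "checked"}, fake_review, fake_revise)
```

In this run the value is corrected on the first retry and the rectangle is left exactly where it was, which is the point of routing revisions per axis.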
The transcript-parsing agent is dispatched by /sox-from-video once per recording.
Reads a transcript (.vtt, .srt,
.txt, .docx), identifies the moments
where the auditor reviewed each in-scope attribute, and emits a
timestamps.json. Transcript text never returns to the
orchestrator — only a small summary with timestamp count,
samples covered, tests covered, and any gaps.
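To make the transcript step concrete, here is a minimal sketch for the .vtt case that maps attribute mentions to cue start times and reports gaps. Plain substring matching stands in for whatever the agent actually does, and the find_mentions name, the return shape, and the sample transcript are all hypothetical.

```python
import re

# Matches the start time of a WebVTT cue line, with or without an hours field.
CUE_START = re.compile(r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->")


def find_mentions(vtt_text: str, attributes: list) -> dict:
    """Map each attribute to the start times (in seconds) of cues mentioning it."""
    hits = {attr: [] for attr in attributes}
    start = None
    for raw in vtt_text.splitlines():
        line = raw.strip()
        if not line:
            start = None            # a blank line ends the current cue
            continue
        match = CUE_START.match(line)
        if match:
            hours, minutes, seconds, millis = (int(g or 0) for g in match.groups())
            start = hours * 3600 + minutes * 60 + seconds + millis / 1000
            continue
        if start is None:
            continue                # header or cue identifier, not cue text
        for attr in attributes:
            if attr.lower() in line.lower() and start not in hits[attr]:
                hits[attr].append(start)
    return hits


SAMPLE = """WEBVTT

00:00:05.000 --> 00:00:09.500
Here I'm checking the approver on the requisition.

00:01:12.250 --> 00:01:15.000
The amount threshold is set to ten thousand.
"""

timestamps = find_mentions(SAMPLE, ["Approver", "Amount Threshold", "Self-Approval Checkbox"])
gaps = [attr for attr, times in timestamps.items() if not times]
```

For this sample, gaps contains only "Self-Approval Checkbox", the kind of coverage gap the summary is meant to surface.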
sox-workpaper-grader is dispatched by /sox-testing after the workpaper is
fully assembled. Reads the workpaper rubric, introspects the xlsx
via openpyxl, scores each criterion (Required + Recommended), and
writes a verdict JSON with overall: pass | fail plus
any blocking failures. Runs in a fresh context window with no
exposure to the orchestrator's reasoning — that isolation
is what makes the verdict meaningful.
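The shape of the verdict can be sketched as below. The overall field and its pass | fail values come from the text; the criterion IDs, the other key names, and the rule that only failed Required criteria block are assumptions.

```python
import json


def build_verdict(results: list) -> dict:
    """Fold per-criterion results into a verdict.

    Each result is {"id": str, "tier": "required" | "recommended", "passed": bool}.
    Assumption: only failed required criteria block the overall verdict.
    """
    blocking = [r["id"] for r in results if r["tier"] == "required" and not r["passed"]]
    advisories = [r["id"] for r in results if r["tier"] == "recommended" and not r["passed"]]
    return {
        "overall": "fail" if blocking else "pass",
        "blocking_failures": blocking,
        "advisories": advisories,
    }


# Hypothetical criterion IDs, for illustration only.
verdict = build_verdict([
    {"id": "sample-seed-recorded", "tier": "required", "passed": True},
    {"id": "conclusion-present", "tier": "required", "passed": False},
    {"id": "reasoning-cites-context", "tier": "recommended", "passed": False},
])
verdict_json = json.dumps(verdict, indent=2)
```

Keeping recommended misses in a separate list lets a reviewer see them without turning them into a failing grade.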
Workpaper Layout
By default, every per-sample tab the plugin produces follows the same two-region layout:
- Top: a red-bordered narrative region with Observation, Procedures Performed, and Conclusion sections describing the test in prose. Every Summary row also carries a 1–3 sentence reasoning string per test — spelling out observed value, expected value, both confidences, and any confirming or disqualifying context, so an external reviewer reads the full evidence chain inline.
- Below: the original screenshots, with red-rectangle Excel shapes overlaying the specific attributes tested, plus bold red gutter labels to the right of every image listing each attribute and its observed value (Approver: Requester's Manager, Password expires after 120 days: unchecked). When multiple boxes share a y-coordinate, the lower label is pushed down to clear the upper one with a small gap, and a thin red connector line keeps the box→label mapping unambiguous. Shapes, labels, and connectors are all anchored; pixels are never burned in.
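The label collision rule in the second bullet can be sketched as a single top-to-bottom pass. The function names, the pixel units, and the default label height and gap are assumptions for illustration, not the plugin's actual values.

```python
def layout_labels(box_tops, label_height=18, gap=4):
    """Place one gutter label per box, pushing later labels down to avoid overlap.

    box_tops holds the y-coordinate of each box's top edge; the result is in the
    same order as the input.
    """
    order = sorted(range(len(box_tops)), key=lambda i: box_tops[i])
    label_ys = [0] * len(box_tops)
    next_free = None  # lowest y the next label may start at
    for i in order:
        y = box_tops[i] if next_free is None else max(box_tops[i], next_free)
        label_ys[i] = y
        next_free = y + label_height + gap
    return label_ys


def moved_labels(box_tops, label_ys):
    """Flag labels that no longer sit level with their box (candidates for a connector)."""
    return [box_y != label_y for box_y, label_y in zip(box_tops, label_ys)]


ys = layout_labels([100, 100, 300])
moved = moved_labels([100, 100, 300], ys)
```

With two boxes at y=100, ys comes out as [100, 122, 300]: the second label drops by one label height plus the gap, and moved marks it as the one that would need a connector line.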
When /sox-from-template is run first, the canonical layout
is replaced by the firm's template layout — narrative text, results,
reasoning, and image annotations all land in the cells the template
profile identifies, with the firm's branding, column structure, fonts,
and merges preserved.
Getting Started
Download the plugin below and install it in Claude Code — the
.claude-plugin/plugin.json manifest sits at the root
of the zip. The plugin registers seven user-facing slash-command
skills, an auto-loaded methodology reference, and five subagents
the orchestrator dispatches automatically.
Open any directory you want to use as a SOX testing workspace and
run /sox-testing procure-to-pay 2024-Q4 (or your own
control area and period). The skill walks through control
identification, sample sizing, and sample selection —
delegating to /sox-python so the random seed and
selection logic are captured in a tamper-evident detail tab.
If your firm has a bespoke xlsx layout, run
/sox-from-template <template.xlsx> first. The
skill profiles the template, scaffolds copies of the exemplar tab
for the remaining samples, and emits a profile JSON the evidence
helpers consume to write into your layout instead of the canonical
one.
For screenshot evidence, run
/sox-annotate-xlsx <evidence.xlsx> "Approver, Amount Threshold, Self-Approval Checkbox".
For walkthrough recordings, run
/sox-from-video <walkthrough.mp4> <transcript>.
For live collection through Chrome, run
/sox-from-web <collection-plan> (Cowork session
with the Claude for Chrome extension required). All three fan out
to the boxer + context + reviewer agents in parallel and write
annotated detail tabs back to the workpaper, with the orchestrator
auto-judging pass/fail per attribute from the structured signals.
Once the workpaper is signed off, run
/sox-replay-build <workpaper.xlsx> to package the
deterministic scripts, locked attribute lists, methodology, and
test plan into a portable .skill. Drop the
.skill into next period's workspace and replay the
same test without rewriting code.