Vendor Shortlist Guide: How to Evaluate Enterprise Email/SMS/Retention Agencies
Why “best agency” is the wrong question (and the right one)
Enterprise search terms are noisy: “best email agency,” “top SMS partner,” “retention experts.” Pretty decks abound. What you actually need is not the “best” agency in the abstract but the one that can work within your stack, pass your security review, migrate without breaking deliverability, and move CFO-grade dials (Revenue per Recipient, second-purchase rate, payback) in a time window you can live with. In other words: the best agency for your operating model and risk profile.
The right partner paints after the plumbing works. They show how they protect inbox placement, how they run randomized holdouts and uplift tests, and how they govern change so campaigns don’t trample flows. The wrong partner sells subject lines and screenshots. Your job is to tell the difference quickly.
The Enterprise Evaluation Framework (5 pillars, 25 checks)
Rate every vendor on these five pillars. If a pillar fails, the pitch fails.
Business Outcomes (can they move CFO dials?)
- Holdout-adjusted RPR (flows vs. campaigns) in prior work
- Second-purchase and payback improvements (cohort view)
- Discount reliance trending down over 90 days
- Documented “what changed / what we learned / what we’ll test next” reports
- References who can speak to incrementality, not just revenue claims
Stack Fit (do they build with your tools?)
- ESP/SMS experience: Klaviyo/Braze + Attentive/Postscript
- Commerce: Shopify/Headless; subscription/loyalty systems
- CDP/warehouse: clean data flows; reverse ETL into orchestration
- Migration history and zero-downtime playbooks
- Multilingual, regional, and brand governance support
Security & Compliance (will IT say yes?)
- SOC 2 posture, DPAs, sub-processor list, data flow diagrams
- Access control, key rotation, incident response
- 10DLC registration, quiet-hours enforcement, consent workflows
- GDPR/CCPA support (rights requests, data minimization)
- PII handling & least-privilege in your systems
Operations & Governance (can they run without chaos?)
- SLAs for QA, change control, and incident response
- Persistent holdouts, testing SOPs, and rollout calendars
- Weekly 10-minute readouts; monthly retros; owner roles (RACI)
- Capacity planning & production calendar discipline
- Escalation paths and freeze policies during risk
Team & Model (who does the work—and how?)
- Named senior team, not “pooled resources” only
- Retainer, pod, or outcome-based model with clear scope
- Migrations, deliverability, analytics staffed (not outsourced quietly)
- Coverage by timezone; multilingual staff if required
- Training/enablement for your team (templates, SOPs)
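The “if a pillar fails, the pitch fails” rule is a hard gate, not an average. A minimal Python sketch of that gate, assuming each of the five checks per pillar is scored pass/fail and that a pillar fails below a hypothetical threshold of four passing checks (`MIN_CHECKS_PASSED`, `VendorAssessment`, and the threshold are illustrative, not part of the framework):

```python
from dataclasses import dataclass, field

# Hypothetical threshold: a pillar fails if fewer than this many of its
# five checks pass. Tune it to your own risk appetite.
MIN_CHECKS_PASSED = 4

PILLARS = (
    "Business Outcomes",
    "Stack Fit",
    "Security & Compliance",
    "Operations & Governance",
    "Team & Model",
)


@dataclass
class VendorAssessment:
    name: str
    # Pillar name -> one boolean per check (five checks per pillar).
    checks: dict[str, list[bool]] = field(default_factory=dict)

    def failed_pillars(self) -> list[str]:
        """Pillars that are unscored or fall below the pass threshold."""
        return [
            pillar
            for pillar in PILLARS
            if sum(self.checks.get(pillar, [])) < MIN_CHECKS_PASSED
        ]

    def passes(self) -> bool:
        # One failed pillar is enough to fail the whole pitch.
        return not self.failed_pillars()


# Usage: strong everywhere except security, which only passes 3 of 5 checks.
vendor = VendorAssessment("Agency A", {p: [True] * 5 for p in PILLARS})
vendor.checks["Security & Compliance"] = [True, True, False, False, True]
print(vendor.failed_pillars())  # ['Security & Compliance']
```

Treating an unscored pillar as failed is deliberate: a vendor who skips the security questionnaire should not slip through on the strength of a good deck.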
Capability Matrix: what enterprise-ready actually looks like
| Capability | Enterprise-Ready | Basic Agency | Why it matters |
|---|---|---|---|
| Deliverability | Dedicated domain, DMARC, engagement bands, complaint rate by domain, incident playbook | “We monitor open rate” | Placement is a license; you can’t easily buy back reputation |
| Incrementality | Randomized holdouts, uplift tests for incentives, RPR and payback reporting | “We use last-click UTMs” | Finance needs proof, not stories |
| Migrations | Parallel sends, warm-up, seed tests, data mapping, cutover go/no-go | “We’ll copy templates over” | Replatform risk sits in placement & data fidelity |
| Global/Multilingual | Dynamic templates, language packs, RTL support, regional consent | “We’ll duplicate flows” | Duplication explodes maintenance; governance prevents drift |
| Ops & QA | SOPs, checklists, send freezes, change logs, weekly readouts | “We’re agile” | “Agile” without guardrails means fire drills at scale |
Stack Fit: Klaviyo vs. Braze, Attentive/Postscript, CDPs, and commerce
Enterprise programs fail when vendors force their favorite stack onto your architecture. Demand receipts in your exact tools. A vendor who truly knows your ESP/SMS platform can build inside its constraints, not around them.
ESP: Klaviyo vs. Braze (quick operator’s view)
- Klaviyo excels when Shopify is your source of truth, lifecycle is the core, and teams want one orchestration brain for email, SMS, and push. Templates are fast and the data model is pragmatic; multilingual support needs a deliberate template pattern.
- Braze shines in complex data flows, app messaging, and enterprise experimentation at scale. You’ll likely need a warehouse/CDP and a more formal QA pipeline.
SMS: Attentive vs. Postscript
- Attentive: growth tooling, compliance guardrails, strong journey builder and analytics; fits enterprise scale.
- Postscript: lean, direct, with deep Shopify focus; excellent for brands prioritizing speed.
CDP/Warehouse & reverse ETL
- Warehouse (Snowflake/BigQuery) + dbt → reverse ETL (Hightouch/Census) → ESP/SMS. Keep event naming and identity consistent; avoid CDP overreach if you don’t have a data team.
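“Keep event naming and identity consistent” usually means one normalization step before reverse ETL pushes data to the ESP or SMS tool. A minimal Python sketch, assuming a hypothetical “Title Case Object Verb” naming convention and a hypothetical alias map built during a tracking audit; lowercasing whole email addresses is a common convention, not a standard:

```python
import re

# Hypothetical convention: event names look like "Order Placed".
EVENT_NAME_PATTERN = re.compile(r"[A-Z][a-z]+( [A-Z][a-z]+)+")

# Hypothetical legacy names found during an audit, mapped to canonical ones.
EVENT_ALIASES = {
    "order_completed": "Order Placed",
    "checkout_complete": "Order Placed",
    "placed order": "Order Placed",
}


def normalize_email(raw: str) -> str:
    """Trim and lowercase so one person maps to one identity key."""
    return raw.strip().lower()


def canonical_event_name(raw: str) -> str:
    """Map a raw event name to its canonical form, or raise if it breaks the convention."""
    cleaned = raw.strip()
    name = EVENT_ALIASES.get(cleaned.lower(), cleaned)
    if not EVENT_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Event name {raw!r} does not follow the naming convention")
    return name


print(canonical_event_name("order_completed"))  # Order Placed
```

Failing loudly on unknown names is the point: a silently synced `viewedProduct` next to `Viewed Product` splits every downstream segment in two.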
Commerce
- Shopify native: proceed. Headless or international multi-store: ensure vendor has real multilingual patterns (language dictionaries, RTL) and regional deliverability experience.
Security & Compliance: SOC 2, GDPR/CCPA, 10DLC, DPAs
IT approval is not a formality. Ask for a security brief that a non-engineer can read—data flows, sub-processors, breach response, access control. Expect least-privilege, key rotation, and DPA readiness.
- SOC 2 posture: if not certified, ask for security controls mapping and timelines.
- GDPR/CCPA: roles (processor vs. controller), DPA templates, rights-request process, retention policies.
- 10DLC: brand/campaign registrations, opt-out keywords, quiet hours, handling carrier blocks.
- Access control: SSO where possible; names and roles of staff with production access; offboarding plan.
- Incident response: how incidents are detected, escalated, and communicated; timelines.
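Quiet hours and opt-out keywords are the two 10DLC controls you can verify in a demo. A minimal Python sketch, assuming an 08:00 to 21:00 recipient-local send window and a common opt-out keyword set; both are assumptions to confirm with counsel and your SMS platform, since some jurisdictions and carriers are stricter:

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

# Assumed send window in the recipient's local time (start inclusive, end exclusive).
SEND_WINDOW_START = time(8, 0)
SEND_WINDOW_END = time(21, 0)

# Common opt-out keywords; confirm the current list with your platform and carriers.
OPT_OUT_KEYWORDS = {"STOP", "STOPALL", "UNSUBSCRIBE", "CANCEL", "END", "QUIT"}


def is_opt_out(message: str) -> bool:
    """True if an inbound reply is a bare opt-out keyword (case-insensitive).

    Stricter setups also honor keywords inside longer replies.
    """
    return message.strip().upper() in OPT_OUT_KEYWORDS


def within_send_window(now_utc: datetime, recipient_tz: tzinfo) -> bool:
    """True if the recipient's local time is inside the allowed window.

    `now_utc` must be timezone-aware. In production, pass a DST-aware zone
    such as zoneinfo.ZoneInfo("America/New_York") as `recipient_tz`.
    """
    local = now_utc.astimezone(recipient_tz)
    return SEND_WINDOW_START <= local.time() < SEND_WINDOW_END


now = datetime(2024, 3, 1, 14, 0, tzinfo=timezone.utc)
print(within_send_window(now, timezone(timedelta(hours=-5))))  # True (09:00 local)
print(within_send_window(now, timezone(timedelta(hours=9))))   # False (23:00 local)
```

Checking the window per recipient, not per campaign, is what makes quiet hours “enforced” rather than “scheduled”: a single 14:00 UTC send is fine in New York and a violation in Tokyo.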
Migrations & Cutover: zero-downtime replatforms
Replatforming risk lives in two places: data fidelity and deliverability. Treat cutover like a change-control exercise, not a sprint.
- Warm-up: dedicated domain, DMARC alignment, engagement-band sends for 2–3 weeks.
- Parallel: dual-sending critical flows; seed panel and complaint monitoring.
- Mapping: events, identities, preferences, and suppressions; build a reconciliation checklist.
- Go/no-go: placement steady; data checks clean; flows validated; rollback plan in writing.
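The reconciliation checklist and go/no-go call can be made mechanical. A minimal Python sketch, assuming suppression lists exported from both platforms as sets of addresses; `CutoverStatus`, `go_no_go`, and the four criteria fields are hypothetical names that mirror the bullets above, with the suppression diff standing in as one example of a data check:

```python
from dataclasses import dataclass


def reconcile_suppressions(old_esp: set[str], new_esp: set[str]) -> set[str]:
    """Addresses suppressed on the old platform but missing on the new one.

    Any result is a cutover blocker: mailing a previously suppressed
    address risks complaints and consent violations.
    """
    normalize = lambda emails: {e.strip().lower() for e in emails}
    return normalize(old_esp) - normalize(new_esp)


@dataclass
class CutoverStatus:
    placement_steady: bool      # seed-panel placement stable during parallel sends
    missing_suppressions: int   # size of reconcile_suppressions() output
    flows_validated: bool       # critical flows QA'd end to end on the new platform
    rollback_plan_signed: bool  # written rollback plan approved by both sides


def go_no_go(status: CutoverStatus) -> tuple[bool, list[str]]:
    """Return (go, blockers); every criterion must pass for a go."""
    blockers = []
    if not status.placement_steady:
        blockers.append("placement not steady")
    if status.missing_suppressions:
        blockers.append(f"suppressions missing on new platform: {status.missing_suppressions}")
    if not status.flows_validated:
        blockers.append("flows not validated")
    if not status.rollback_plan_signed:
        blockers.append("no written rollback plan")
    return (not blockers, blockers)


missing = reconcile_suppressions({"a@x.com", "B@y.com"}, {"a@x.com"})
status = CutoverStatus(True, len(missing), True, True)
print(go_no_go(status))  # (False, ['suppressions missing on new platform: 1'])
```

Returning the blocker list, not just a boolean, gives the go/no-go meeting a written record of why cutover slipped.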
Deliverability at Scale: placement is a license
Deliverability is not a vibe; it is discipline: dedicated domain + DMARC, engagement bands, sunset, complaint monitoring by domain, and a change-freeze policy during incidents. Ask for the vendor’s deliverability SOP and an anonymized incident post-mortem.
- The dashboard should show complaint rate by mailbox provider (Gmail/Yahoo/Outlook) and placement proxies.
- HTML templates must render with real text, alt text, and AAA contrast—image-only emails are a placement risk.
- Sunset policy enforced across promos; big weeks are never an excuse to mail the unengaged.
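Complaint rate by domain and a sunset policy are both simple computations once the data is exported. A minimal Python sketch, assuming a hypothetical domain-to-provider map, an alert threshold equal to the ≤0.08% Gmail target used elsewhere in this guide, and a hypothetical 180-day sunset window:

```python
from collections import Counter
from datetime import date, timedelta

COMPLAINT_THRESHOLD = 0.0008  # 0.08%, the Gmail target used in this guide
SUNSET_DAYS = 180             # hypothetical sunset window

# Hypothetical, partial domain -> provider map; real programs need a fuller one.
PROVIDER_BY_DOMAIN = {
    "gmail.com": "Gmail", "googlemail.com": "Gmail",
    "yahoo.com": "Yahoo",
    "outlook.com": "Outlook", "hotmail.com": "Outlook", "live.com": "Outlook",
}


def provider(email: str) -> str:
    return PROVIDER_BY_DOMAIN.get(email.rsplit("@", 1)[-1].lower(), "Other")


def complaint_rates(delivered: list[str], complaints: list[str]) -> dict[str, float]:
    """Complaints / delivered, computed separately for each provider."""
    sent = Counter(provider(e) for e in delivered)
    spam = Counter(provider(e) for e in complaints)
    return {p: spam[p] / n for p, n in sent.items()}


def over_threshold(rates: dict[str, float]) -> list[str]:
    return sorted(p for p, r in rates.items() if r > COMPLAINT_THRESHOLD)


def apply_sunset(last_engaged: dict[str, date], today: date) -> list[str]:
    """Keep only profiles engaged within the window, big week or not."""
    cutoff = today - timedelta(days=SUNSET_DAYS)
    return sorted(e for e, d in last_engaged.items() if d >= cutoff)
```

Splitting by provider matters because a blended rate can look healthy while one provider, often Gmail, is already over its limit.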
Attribution That Finance Accepts: holdouts & uplift
Last-click is not evil; it’s just incomplete. Enterprise programs run both: directional MTA plus randomized holdouts and uplift for the decisions that spend money (incentives, cadence, structural flow changes).
- Message-level holdouts: 10–20% suppressed on save/recommendation touches (never removed during big weeks)
- Flow-level: small audience control when you change system design
- Uplift: perk/no-perk inside risk bands; only persuadables get incentives
- Metrics: RPR, conversion, AOV, second-purchase, payback—not just attributed revenue
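Persistent holdouts and uplift reads are the two mechanics worth checking in a vendor’s code, not just their slides. A minimal Python sketch, assuming a 15% holdout share (inside the 10–20% range above), hash-based assignment, and per-band (conversions, recipients) counts for perk and no-perk arms; all function names are hypothetical:

```python
import hashlib

HOLDOUT_SHARE = 0.15  # assumed; the guide suggests 10–20%


def in_holdout(profile_id: str, test_name: str, share: float = HOLDOUT_SHARE) -> bool:
    """Deterministic assignment: the same profile + test always lands in the same arm.

    Hashing instead of random draws keeps the holdout persistent across sends.
    """
    digest = hashlib.sha256(f"{test_name}:{profile_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return bucket < share


def rpr(revenue: float, recipients: int) -> float:
    """Revenue per Recipient over every eligible recipient in the arm."""
    return revenue / recipients if recipients else 0.0


def holdout_adjusted_rpr_lift(treated_rev: float, treated_n: int,
                              holdout_rev: float, holdout_n: int) -> float:
    """Incremental RPR: what the message added over sending nothing."""
    return rpr(treated_rev, treated_n) - rpr(holdout_rev, holdout_n)


def bands_to_incentivize(results: dict[str, dict[str, tuple[int, int]]]) -> list[str]:
    """Uplift per risk band = conversion(perk) - conversion(no perk).

    results[band]["perk"] and results[band]["no_perk"] are (conversions, recipients).
    Only bands with positive uplift (the persuadables) get the incentive. A real
    program would also require the lift to clear incentive cost and noise.
    """
    chosen = []
    for band, arms in results.items():
        c1, n1 = arms["perk"]
        c0, n0 = arms["no_perk"]
        if n1 and n0 and c1 / n1 - c0 / n0 > 0:
            chosen.append(band)
    return sorted(chosen)
```

Holdout revenue still counts: holdout customers buy anyway, and subtracting their RPR is what turns attributed revenue into incremental revenue.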
Global & Multilingual Ops: one system, many languages
Multilingual programs collapse when teams duplicate flows per language without governance. Insist on dynamic templates with language dictionaries, RTL support, preferences in the footer and profile, and a TMS or partials that translators can update without touching logic.
- Language capture: form toggle + preferences; store ISO code on profile
- Template pattern: include a language pack partial; render copy by `lang`
- RTL: `dir="rtl"` containers; icon mirroring; font coverage
- Regional consent: localized legal, double opt-in where required; time-zone aware quiet hours
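The language-pack pattern reduces to one template, a per-language dictionary, and a fallback. A minimal, ESP-agnostic Python sketch (it is not Klaviyo or Braze template syntax); the pack contents, `t`, `render`, and the RTL set are hypothetical:

```python
# Hypothetical language packs keyed by ISO 639-1 code; translators edit these
# (e.g., via a TMS export) without touching template logic.
LANGUAGE_PACKS = {
    "en": {"greeting": "Thanks for your order", "cta": "Track your package"},
    "de": {"greeting": "Danke für Ihre Bestellung"},  # "cta" not translated yet
    "ar": {"greeting": "شكرًا لطلبك", "cta": "تتبع شحنتك"},
}
DEFAULT_LANG = "en"
RTL_LANGS = {"ar", "he", "fa", "ur"}  # extend as you add languages


def t(lang: str, key: str) -> str:
    """Copy in the profile's language, falling back to the default pack."""
    return LANGUAGE_PACKS.get(lang, {}).get(key, LANGUAGE_PACKS[DEFAULT_LANG][key])


def render(lang: str) -> str:
    """One template for every language; only copy and direction vary."""
    lang = lang if lang in LANGUAGE_PACKS else DEFAULT_LANG
    direction = "rtl" if lang in RTL_LANGS else "ltr"
    return (
        f'<div lang="{lang}" dir="{direction}">'
        f"<h1>{t(lang, 'greeting')}</h1>"
        f"<a>{t(lang, 'cta')}</a></div>"
    )


print(render("de"))
```

A missing translation degrades to default-language copy instead of a blank button, and adding a language means adding a dictionary, not duplicating a flow.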
Pricing & Staffing Models: retainers, pods, outcomes
Prices vary; clarity shouldn’t. Ask for the staffing chart by role and FTE equivalents, the scope boundary (what’s included/not), and the change-control policy when reality shifts.
Common models
- Retainer: fixed fee for defined outputs; predictable; needs change-order discipline.
- Pod: cross-functional team (strategist, producer, builder, analyst) dedicated to your account; faster but pricier.
- Outcome-based: milestone or KPI-linked tranches; align incentives but require clean baselines.
Hidden costs to surface: migrations, reverse ETL, seed panels, SMS brand/campaign fees, incremental experimentation traffic.
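Comparing proposals is mostly arithmetic once hidden costs are on the page. A minimal Python sketch, assuming monthly pricing, one-off costs amortized over a hypothetical 12-month term, and entirely hypothetical figures (not market rates):

```python
from dataclasses import dataclass, field


def amortize(one_off: float, months: int) -> float:
    """Spread a one-off cost (e.g., a migration) across the contract term."""
    return one_off / months


@dataclass
class Proposal:
    vendor: str
    monthly_fee: float
    fte_equivalents: float  # from the vendor's staffing chart by role
    # Extras excluded from the headline fee, expressed as monthly amounts.
    hidden_monthly: dict[str, float] = field(default_factory=dict)

    def total_monthly(self) -> float:
        return self.monthly_fee + sum(self.hidden_monthly.values())

    def cost_per_fte(self) -> float:
        return self.total_monthly() / self.fte_equivalents


# Hypothetical figures for illustration only.
a = Proposal("Agency A", 20_000, 2.0, {"migration": amortize(24_000, 12), "seed panel": 500})
b = Proposal("Agency B", 18_000, 1.5, {"reverse ETL": 1_500, "SMS brand/campaign fees": 300})
for p in (a, b):
    print(p.vendor, p.total_monthly(), p.cost_per_fte())
```

In this made-up comparison the lower headline fee (B) is the more expensive proposal per FTE, which is exactly what normalization is meant to surface.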
SLAs, SOWs & Governance: the operating model that prevents fire drills
Governance is what lets you sleep at night. You want written SLAs for intake → build → QA → send → post-send, a change-freeze policy for risk windows, and a 10-minute weekly readout that forces clarity.
- QA SLA: minimum review windows, device rendering checks, link/UTM validation, segmentation audits
- Incident response: who pauses sends, who communicates, who fixes templates or lists
- RACI: named owners for deliverability, analytics, creative, operations, approvals
- Reporting rhythm: weekly dials (RPR, second-purchase, placement/complaints, subscriber retention, discount reliance); monthly retro with “what changed/learned/test next”
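Link/UTM validation is the easiest part of the QA SLA to automate. A minimal Python sketch using the standard library’s `urllib.parse`, assuming a hypothetical policy that every tracked link is HTTPS and carries `utm_source`, `utm_medium`, and `utm_campaign`, with the campaign value matching the send:

```python
from urllib.parse import parse_qs, urlparse

# Assumed UTM policy: every tracked link carries these parameters.
REQUIRED_UTMS = ("utm_source", "utm_medium", "utm_campaign")


def link_problems(url: str, expected_campaign: str) -> list[str]:
    """QA problems for one link; an empty list means it passes."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("not https")
    if not parsed.netloc:
        problems.append("missing host")
    params = parse_qs(parsed.query)  # blank values are dropped, so "utm_source=" counts as missing
    for key in REQUIRED_UTMS:
        if key not in params:
            problems.append(f"missing {key}")
    campaign = params.get("utm_campaign", [None])[0]
    if campaign is not None and campaign != expected_campaign:
        problems.append(f"utm_campaign is {campaign!r}, expected {expected_campaign!r}")
    return problems


print(link_problems("http://shop.example.com/p?utm_source=email", "post_purchase"))
```

Run it over every link in a rendered template before the send window, and block the send if any list comes back non-empty.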
RFP Scoring Rubric (download-ready) & vendor questions
| Category | Weight | What “excellent” looks like |
|---|---|---|
| Outcomes & Incrementality | 30% | Holdout-adjusted case studies; uplift tests; CFO dials moving in 90 days |
| Security & Compliance | 20% | SOC 2 posture; DPAs; 10DLC; data flow & access control docs |
| Stack Fit & Migrations | 20% | Klaviyo/Braze + Attentive/Postscript receipts; cutover playbooks |
| Operations & Governance | 15% | SLAs, QA checklists, change control, weekly readouts |
| Pricing & Team | 15% | Named senior team; clear scope; capacity planning |
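The rubric’s weights turn into a single comparable number per vendor. A minimal Python sketch, assuming the panel scores each category on a hypothetical 1–5 scale; the weights are the ones in the table:

```python
# Weights from the rubric table; scores are assumed to be on a 1-5 scale.
WEIGHTS = {
    "Outcomes & Incrementality": 0.30,
    "Security & Compliance": 0.20,
    "Stack Fit & Migrations": 0.20,
    "Operations & Governance": 0.15,
    "Pricing & Team": 0.15,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must total 100%


def weighted_score(scores: dict[str, float]) -> float:
    """Weighted average on the same 1-5 scale; every category must be scored."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"unscored categories: {sorted(missing)}")
    for category, score in scores.items():
        if category not in WEIGHTS or not 1 <= score <= 5:
            raise ValueError(f"bad entry: {category!r}={score}")
    return round(sum(WEIGHTS[c] * scores[c] for c in WEIGHTS), 2)


vendor_a = {c: 4 for c in WEIGHTS} | {"Outcomes & Incrementality": 5}
vendor_b = {c: 5 for c in WEIGHTS} | {"Outcomes & Incrementality": 2}
print(weighted_score(vendor_a), weighted_score(vendor_b))  # 4.3 4.1
```

With 30% on outcomes, a polished vendor that cannot show incrementality loses to a solid one that can.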
Vendor questions that surface truth
- Show a holdout-adjusted readout where RPR improved and complaint rate stayed ≤0.08% at Gmail. What changed?
- Walk us through a zero-downtime migration: warm-up curve, parallel sends, go/no-go criteria, and incident handling.
- Open a live template with language packs/partials and explain how translators update copy without breaking logic.
- Share your deliverability SOP and an anonymized incident post-mortem.
- Who has production access in our accounts? How do you offboard staff?
- What’s your change-freeze policy during promotions or placement incidents?
- Show your 10-minute weekly report. Who presents it? What decisions changed because of it?
The 45-Day Pilot: scope, success criteria, kill switch
Avoid three-month “strategy phases” that don’t touch revenue. Run a 45-day pilot with live messages and measured lift.
Scope
- Rebuild post-purchase + second-purchase flows (proof-first modules)
- One deliverability task (domain warm-up or complaint remediation)
- One SMS nudge with quiet hours and a snooze option
- Persistent holdouts on save/recommendation touches
Success criteria (define before kickoff)
- +X% RPR (holdout-adjusted) on pilot messages
- +Y pts in 30-day second-purchase rate for exposed cohort
- Complaint rate ≤0.08% at Gmail; unsubscribe rate ≤0.3% on targeted sends; SMS opt-out rate steady or down
- Discount reliance flat or down
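Writing the success criteria as code before kickoff removes debate at day 45. A minimal Python sketch, assuming X and Y are agreed numbers passed in as targets (they stay placeholders here) and that “steady or down” means a change of zero or less; the class and field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class PilotTargets:
    rpr_lift_pct: float                  # the "+X%" agreed before kickoff
    second_purchase_pts: float           # the "+Y pts" agreed before kickoff
    max_gmail_complaint: float = 0.0008  # 0.08%
    max_unsub: float = 0.003             # 0.3% on targeted sends


@dataclass
class PilotResults:
    rpr_lift_pct: float              # holdout-adjusted, pilot messages only
    second_purchase_pts: float       # 30-day rate, exposed cohort vs. baseline
    gmail_complaint: float
    unsub: float
    sms_optout_change: float         # negative means fewer opt-outs than baseline
    discount_reliance_change: float  # negative means less reliance


def evaluate(t: PilotTargets, r: PilotResults) -> dict[str, bool]:
    """One pass/fail per criterion; the pilot succeeds only if all are True."""
    return {
        "rpr": r.rpr_lift_pct >= t.rpr_lift_pct,
        "second_purchase": r.second_purchase_pts >= t.second_purchase_pts,
        "complaints": r.gmail_complaint <= t.max_gmail_complaint,
        "unsubs": r.unsub <= t.max_unsub,
        "sms_optout": r.sms_optout_change <= 0,
        "discount_reliance": r.discount_reliance_change <= 0,
    }
```

Returning each criterion separately keeps the readout honest: a pilot that hit RPR by overmailing shows up as a failed complaints line, not a footnote.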
Kill switch
Two predefined strike conditions (e.g., a complaint spike, or a second missed SLA) auto-pause the pilot. This protects both sides and forces operational honesty.
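An auto-pause only works if it is wired to data rather than to someone’s judgment mid-incident. A minimal Python sketch of one reading of the strike conditions: any send whose complaint rate crosses a hypothetical spike threshold pauses immediately, and a second missed SLA pauses too. `KillSwitch`, the threshold, and the method names are illustrative; agree the real numbers in the SOW:

```python
from dataclasses import dataclass, field

COMPLAINT_SPIKE = 0.003  # hypothetical spike threshold (0.3%); set it in the SOW
MAX_SLA_MISSES = 2       # the second missed SLA triggers a pause


@dataclass
class KillSwitch:
    sla_misses: int = 0
    paused: bool = False
    reasons: list[str] = field(default_factory=list)

    def check_send(self, complaint_rate: float) -> None:
        """Pause immediately if a single send spikes complaints."""
        if complaint_rate > COMPLAINT_SPIKE:
            self._pause(f"complaint spike: {complaint_rate:.2%}")

    def record_sla_miss(self, sla: str) -> None:
        """Count a missed SLA; pause once the limit is reached."""
        self.sla_misses += 1
        if self.sla_misses >= MAX_SLA_MISSES:
            self._pause(f"SLA missed {self.sla_misses} times (latest: {sla})")

    def _pause(self, reason: str) -> None:
        self.paused = True
        self.reasons.append(reason)
```

Logging the reason alongside the pause gives both sides the same facts for the post-mortem.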
Receipts vs. Red Flags: how to tell builders from broadcasters
Receipts
- Holdout data; uplift tests; CFO dials in every case study
- DMARC/engagement-band setup documented; incident post-mortems shared
- Live templates with language packs, not screenshots
- Security brief with data flows; names/roles for access
- Weekly 10-minute reports with “what changed/learned/test next”
Red flags
- Open-rate worship; no incrementality
- “We’ll warm your domain in a week”
- “We duplicate flows for every language” with no governance
- Anonymous, pooled team; contractor black box
- No QA checklist; “agile” used to excuse chaos
Due-Diligence Checklists
Security & Legal
- DPA signed; sub-processor list; data flow diagrams
- SOC 2 report or controls mapping; incident response plan
- 10DLC registrations and brand/campaign IDs
- Access-control policy; SSO; offboarding SOP
Deliverability
- Dedicated domain; DMARC; tracking CNAMEs
- Engagement banding; sunset policy; send-freeze rules
- Placement monitoring; complaint dashboards by domain
Operations
- SLAs for QA and incident response; change logs
- RACI with named owners; 10-minute weekly readout
- Capacity plan and production calendar
Measurement
- Message/flow holdouts; uplift tests for incentives
- Dashboards showing RPR, second-purchase, payback, discount reliance
Global
- Language capture; template language packs; RTL support
- Regional consent + quiet hours; regional deliverability experience
FAQ
What matters most in the first 90 days?
Holdout-adjusted movement in RPR and second-purchase rate while complaint and unsubscribe rates stay healthy. Anything else is theater.
How do we compare costs across vendors?
Normalize to outputs and roles (FTE equivalents). Ask for a staffing plan, capacity model, and change-order policy, then add the hidden costs (migrations, reverse ETL, seed panels).
What’s a reasonable migration timeline?
About six weeks for a mid-complexity migration: two for warm-up, two for parallel sends, two for cutover and stabilization. Compressing that timeline puts placement at risk.
Do we need a CDP?
Not to start. A warehouse + dbt + reverse ETL often beats a heavy CDP until you have a data team and clear use-cases.
How do we enforce governance?
Put SLAs, RACI, and freeze policies into the SOW. Make the weekly 10-minute readout non-negotiable. The habit keeps you safe.