12 · Use Case · Live Analysis

From n8n spaghetti to DDD microservices on AWS

Two workflows pulled live from the n8n Cloud API with the production tokens in .env.deploy: BO-26 Cascade Detector (the brain that decides what to send) and Send Approved Tasks (the hand that delivers Gmail). Together they generate 100% of outbound revenue. This page walks through them in plain language, names the pain, and proposes the AWS target: each bounded context with its own optimized database, no shared state.

Source

n8n Cloud REST API

Workflows fetched live via GET /api/v1/workflows/{id} using the API key in .env.deploy. JSON dumps saved to n8n-backup/workflows/.

Why these two

Top of the funnel

BO-26 creates 100% of the task_queue_sales rows. Send Approved Tasks is the only path to a customer inbox. Every other workflow is a feeder or reporter on top of these.

Method

Walk · Pain · AWS map

For each: plain-language flow, what hurts in production today, and the AWS service per concern with one DB per bounded context (DDD).

Workflow 1 · BO-26 Cascade Detector

The brain that decides what to send

Every hour it scans HubSpot deals, classifies them across a 7-stage cascade (Cold → Warm → Hot → Negotiation → Won → Stale → Lost), compares against the previous snapshot in Airtable, and writes only the diffs as new task_queue_sales rows ready for human approval.

⏰

Hourly cron

scheduleTrigger every 60 min

🔎

Fetch deals

HubSpot CRM v3 paginated

🧮

Classify

Code node · 7-stage cascade rules

🗂️

Diff vs snapshot

Airtable deal_snapshots

📥

Emit tasks

Insert into task_queue_sales
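The "diff vs snapshot, emit only the changes" step can be sketched as a pure function. This is a minimal illustration, not the code in the n8n node: the stage names come from the cascade above, while the type names, the `diffAgainstSnapshot` helper, and the choice to emit a task for a deal with no previous snapshot are all assumptions.

```typescript
// Stage names come from the 7-stage cascade; every other name is illustrative.
type Stage = "Cold" | "Warm" | "Hot" | "Negotiation" | "Won" | "Stale" | "Lost";

interface DealStage {
  dealId: string;
  stage: Stage;
}

interface NewTask {
  dealId: string;
  fromStage: Stage | null; // null when the deal has no previous snapshot
  toStage: Stage;
}

// Compare the current classification with the previous snapshot and return
// one task per deal whose stage changed. Assumption: a deal that was never
// snapshotted also produces a task.
function diffAgainstSnapshot(
  previous: ReadonlyMap<string, Stage>,
  current: readonly DealStage[],
): NewTask[] {
  const tasks: NewTask[] = [];
  for (const deal of current) {
    const before = previous.get(deal.dealId) ?? null;
    if (before !== deal.stage) {
      tasks.push({ dealId: deal.dealId, fromStage: before, toStage: deal.stage });
    }
  }
  return tasks;
}

// Usage: d1 moved, d2 is unchanged, d3 is new.
const previousSnapshot = new Map<string, Stage>([
  ["d1", "Cold"],
  ["d2", "Hot"],
]);
const emittedTasks = diffAgainstSnapshot(previousSnapshot, [
  { dealId: "d1", stage: "Warm" },
  { dealId: "d2", stage: "Hot" },
  { dealId: "d3", stage: "Cold" },
]);
```

Keeping the diff free of I/O is what would make a single deal replayable locally: feed it one snapshot entry and one classification and inspect the result.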

Pain · what hurts in prod

Airtable used as a transactional DB

Snapshot table has ~40k rows and grows. Every hour the workflow does N reads + N writes against Airtable's 5 req/s limit. Splits, waits, and 429-retry loops eat 4-7 minutes per run. No index, no transaction, no rollback if the run crashes halfway.

Pain · what hurts in prod

Logic frozen in n8n Code nodes

The 7-stage classifier is ~180 lines of JS embedded in one node. No tests, no version control beyond the workflow JSON, no way to replay a single deal locally. A cascade rule change = edit-in-browser, hope, ship.

Workflow 2 · Send Approved Tasks

The hand that hits the inbox

Every 5 minutes it pulls approved rows from task_queue_sales, claims each one through an external lock service, routes by sales rep to one of 5 Gmail Service Account credentials, sends, logs the engagement back to HubSpot, and updates the cadence.

⏰

5-min cron

scheduleTrigger · limit 25

🔒

Claim

POST autotask.1mr.llc/claim

🚦

Guards

suppression · cooldown · quota

📧

Gmail Send

Switch → 5 rep SA credentials

📝

Log + release

HubSpot engagement + claim release
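The three guards (suppression, cooldown, quota) can be read as an ordered chain that stops at the first failure. The sketch below assumes that ordering and invents the `SendContext` shape and the 72-hour cooldown purely for illustration; only the 50/day limit appears in this page.

```typescript
// Hypothetical inputs for one send decision.
interface SendContext {
  recipient: string;
  suppressed: ReadonlySet<string>;
  lastSentAt: Date | null; // last send to this recipient, if any
  now: Date;
  cooldownHours: number;
  sentToday: number; // sends already made by this rep today
  dailyLimit: number;
}

type GuardResult =
  | { ok: true }
  | { ok: false; reason: "suppression" | "cooldown" | "quota" };

// Run the guards in order and report the first one that blocks the send.
function runGuards(ctx: SendContext): GuardResult {
  if (ctx.suppressed.has(ctx.recipient)) {
    return { ok: false, reason: "suppression" };
  }
  if (ctx.lastSentAt !== null) {
    const hoursSince = (ctx.now.getTime() - ctx.lastSentAt.getTime()) / 3600000;
    if (hoursSince < ctx.cooldownHours) {
      return { ok: false, reason: "cooldown" };
    }
  }
  if (ctx.sentToday >= ctx.dailyLimit) {
    return { ok: false, reason: "quota" };
  }
  return { ok: true };
}

// Usage with made-up values.
const baseContext: SendContext = {
  recipient: "buyer@example.com",
  suppressed: new Set<string>(),
  lastSentAt: null,
  now: new Date("2024-01-02T12:00:00Z"),
  cooldownHours: 72,
  sentToday: 0,
  dailyLimit: 50,
};
const allowedResult = runGuards(baseContext);
const quotaBlocked = runGuards({ ...baseContext, sentToday: 50 });
```

Note that a read-then-compare quota check like this one is only safe if the read and the increment happen atomically elsewhere; on its own it has the same race described in the pain cards below.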

Pain · what hurts in prod

Hardcoded bearer in 16+ flows

Token ySkkUS1f1ZcW… sits in 3 HTTP nodes of this workflow and in 16 sibling workflows. No vault, no rotation. The proxy autotask.1mr.llc is a single Node process on a single VPS — if it dies, every outbound send halts and no one is paged.

Pain · what hurts in prod

5-credential switch + race on quota

A Switch node routes to Jessica / Ivan / Mario / Milos / Daniel Gmail SAs. Daily quota is read from Airtable on each iteration with no atomic decrement — two concurrent claims for the same rep can both pass the 50/day check and overshoot. Discovered after Jessica hit 58 sends one Tuesday.

AWS Target · DDD microservices

One bounded context · one optimized database

Both workflows split into two services with disjoint data ownership. No shared schemas, no chatty calls between them — they communicate only via an EventBridge bus when a domain event is meaningful to the other (e.g. TaskApproved, EmailSent).
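As a rough sketch of what "communicate only via an EventBridge bus" could look like, the helper below maps a domain event to the fields of an EventBridge PutEvents entry (EventBusName, Source, DetailType, Detail). The event names TaskApproved and EmailSent come from the text; the payload fields, the bus name, and the `toBusEntry` helper are assumptions, and the actual SDK call is left out so the block stays dependency-free.

```typescript
// Mirrors the PutEvents request entry fields used here.
interface BusEntry {
  EventBusName: string;
  Source: string;
  DetailType: string;
  Detail: string; // EventBridge takes the detail as a JSON string
}

// Hypothetical payloads for the two events named in the text.
type DomainEvent =
  | { type: "TaskApproved"; taskId: string; repId: string }
  | { type: "EmailSent"; taskId: string; repId: string; sentAt: string };

// The event type becomes DetailType; the remaining fields become the detail.
function toBusEntry(event: DomainEvent, source: string, busName: string): BusEntry {
  const { type, ...detail } = event;
  return {
    EventBusName: busName,
    Source: source,
    DetailType: type,
    Detail: JSON.stringify(detail),
  };
}

// Usage: the sender announces a delivered email on a made-up bus name.
const sentEntry = toBusEntry(
  { type: "EmailSent", taskId: "t-1", repId: "jessica", sentAt: "2024-01-02T12:00:00Z" },
  "outbound-sender",
  "sales-domain-bus",
);
```

Routing on DetailType would let the HubSpot logger subscribe to EmailSent without the sender knowing the logger exists.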

Service 1

cascade-detector

Trigger: EventBridge Scheduler · rate(1 hour)

Compute: Step Functions Express · Map state over deal pages · Lambda per page (or an ECS task if a page needs >15 min; Express executions cap at 5 min, so that path would need a Standard workflow)

Snapshots: DynamoDB deal_snapshots · PK deal_id · SK snapshot_ts · TTL 90d · single-digit-ms diff lookups, no rate limit

Tasks emitted: RDS Aurora Postgres Serverless v2 task_queue_sales · relational, audit-friendly, the human approval UI talks SQL

Observability: every classification → Kinesis Firehose → S3 → Athena. Replay any deal by re-running one Lambda against the snapshot.
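The snapshot key design above (PK deal_id, SK snapshot_ts, TTL 90d) could translate into an item like the one below. DynamoDB TTL reads a Number attribute holding epoch seconds; the attribute name `expires_at`, the `stage` field, and the ISO-8601 sort key format are assumptions for this sketch.

```typescript
const NINETY_DAYS_SECONDS = 90 * 24 * 60 * 60;

interface SnapshotItem {
  deal_id: string; // partition key
  snapshot_ts: string; // sort key; ISO-8601 strings sort chronologically
  stage: string;
  expires_at: number; // TTL attribute in epoch seconds (hypothetical name)
}

// Build one snapshot row that expires 90 days after it was taken.
function buildSnapshotItem(dealId: string, stage: string, takenAt: Date): SnapshotItem {
  const epochSeconds = Math.floor(takenAt.getTime() / 1000);
  return {
    deal_id: dealId,
    snapshot_ts: takenAt.toISOString(),
    stage,
    expires_at: epochSeconds + NINETY_DAYS_SECONDS,
  };
}

// Usage with a made-up deal id.
const takenAt = new Date("2024-01-01T00:00:00Z");
const snapshotItem = buildSnapshotItem("deal-42", "Hot", takenAt);
```

With this layout, the previous snapshot for a deal is a single Query on deal_id in descending sort-key order with a limit of 1.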

Service 2

outbound-sender

Trigger: EventBridge Scheduler · rate(5 minutes) → enqueue approved tasks into SQS FIFO with MessageGroupId = from_inbox (per-rep ordering, no duplicates)

Workers: ECS Fargate · NestJS · auto-scale on queue depth

Lock: DynamoDB task_claims · conditional write on task_id + TTL 5 min → kills the external claim proxy

Quota: DynamoDB counter per (rep_id, date) with atomic ADD + condition < daily_limit → fixes the race

Cadences + suppressions + send_log: RDS Postgres (same instance as service 1, but logically separate: its own schema and owner)

Credentials: Secrets Manager · 5 rep SA JSONs + HubSpot token, rotated by a Lambda on schedule

Send: Gmail API now (per-rep SA) → SES later if we move off Workspace. HubSpot engagement logged via EventBridge fan-out.
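The lock and quota lines above both rest on DynamoDB condition expressions. The sketch below builds the request parameters in the document-client style (plain JavaScript values) without calling the SDK. The table name `task_claims` is from the text; `rep_daily_quota`, the attribute names, and the helper names are assumptions. Because TTL deletion is not immediate, the claim condition also accepts an existing row whose lease has already expired.

```typescript
const LEASE_SECONDS = 5 * 60;

// Put that succeeds only if nobody holds a live claim on the task.
function buildClaimPut(taskId: string, workerId: string, nowEpochSeconds: number) {
  return {
    TableName: "task_claims",
    Item: {
      task_id: taskId,
      claimed_by: workerId,
      expires_at: nowEpochSeconds + LEASE_SECONDS, // also the TTL attribute
    },
    ConditionExpression: "attribute_not_exists(task_id) OR expires_at < :now",
    ExpressionAttributeValues: { ":now": nowEpochSeconds },
  };
}

// Update that increments the per-rep counter only while it is under the limit.
// ADD treats a missing numeric attribute as 0, so the first send creates the row.
function buildQuotaIncrement(repId: string, sendDate: string, dailyLimit: number) {
  return {
    TableName: "rep_daily_quota", // hypothetical table name
    Key: { rep_id: repId, send_date: sendDate },
    UpdateExpression: "ADD sent_count :one",
    ConditionExpression: "attribute_not_exists(sent_count) OR sent_count < :limit",
    ExpressionAttributeValues: { ":one": 1, ":limit": dailyLimit },
  };
}

// Local stand-in that models the same check-and-increment rule for tests.
// In production the atomicity comes from DynamoDB, not from this function.
function tryConsumeQuota(
  counters: Map<string, number>,
  repId: string,
  sendDate: string,
  dailyLimit: number,
): boolean {
  const key = `${repId}#${sendDate}`;
  const sent = counters.get(key) ?? 0;
  if (sent >= dailyLimit) return false; // DynamoDB would reject with ConditionalCheckFailedException
  counters.set(key, sent + 1);
  return true;
}

// Usage: 58 attempts against a limit of 50 (the date is made up).
const quotaCounters = new Map<string, number>();
let acceptedSends = 0;
for (let attempt = 0; attempt < 58; attempt++) {
  if (tryConsumeQuota(quotaCounters, "jessica", "2024-01-02", 50)) acceptedSends++;
}
```

The point of the conditional update is that the check and the increment are one operation on the server, so two workers cannot both pass the limit check on the same counter value.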

Why DDD here

Detector and sender today share task_queue_sales as a god-table and break each other constantly: a schema change for cascade reasons forces a redeploy of the sender. Splitting ownership — detector writes, sender reads only via an event or a narrow read API — means the two services can iterate independently and each picks the database its workload actually wants (KV for snapshots/locks/counters, SQL for relational tasks and cadences).

Migration Roadmap

Four phases · ~10 weeks · no big bang

F1 · 2 weeks

outbound-sender canary at 10%

Stand up the new service alongside n8n. Read the same task_queue_sales, claim through the new Dynamo lock for 10% of approved rows. Compare send rate, bounce, log fidelity vs n8n for a week.
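One way to pick the 10% canary slice is a deterministic hash of the task id, so a given row always lands on the same side no matter which process evaluates it. This sketch assumes that approach; the FNV-1a hash, the `isCanary` helper, and the id format are illustrative, and n8n would need to apply the inverse rule so each row has exactly one sender.

```typescript
// FNV-1a 32-bit hash: small, dependency-free, stable across processes.
// It hashes UTF-16 code units, which matches bytes for ASCII ids.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned
}

// True when the task falls into the canary slice of the given size (0-100).
function isCanary(taskId: string, percent: number): boolean {
  return fnv1a32(taskId) % 100 < percent;
}

// Usage: count how many of 1000 made-up ids would go to the new service.
let canaryCount = 0;
for (let i = 0; i < 1000; i++) {
  if (isCanary(`task-${i}`, 10)) canaryCount++;
}
```

The share should land near the requested percentage, though that is worth measuring on real task ids before relying on it for the comparison week.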

F2 · 3 weeks

Cut over · kill claim proxy

Route 100% through outbound-sender. Migrate cadences + suppressions from Airtable into RDS. Delete autotask.1mr.llc/claim and remove the bearer from this workflow and its 16 sibling n8n workflows.

F3 · 3 weeks

cascade-detector on Step Functions

Port the 7-stage classifier into a TS package with unit tests. Run Step Functions in parallel with n8n BO-26 and diff the emitted tasks for 5 runs. Cut over once diff = 0.
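The "diff the emitted tasks" check for the parallel run can be sketched as a set comparison. This assumes a task is identified by deal id plus target stage; the `EmittedTask` shape and the `diffRuns` helper are illustrative names, not part of either system.

```typescript
interface EmittedTask {
  dealId: string;
  toStage: string;
}

interface RunDiff {
  onlyInN8n: string[];
  onlyInStepFunctions: string[];
}

// Assumption: deal id plus target stage identifies a task within one run.
function taskKey(task: EmittedTask): string {
  return `${task.dealId}:${task.toStage}`;
}

// Report tasks that only one of the two systems emitted for the same run.
function diffRuns(n8nTasks: readonly EmittedTask[], sfnTasks: readonly EmittedTask[]): RunDiff {
  const n8nKeys = new Set(n8nTasks.map(taskKey));
  const sfnKeys = new Set(sfnTasks.map(taskKey));
  return {
    onlyInN8n: Array.from(n8nKeys).filter((key) => !sfnKeys.has(key)).sort(),
    onlyInStepFunctions: Array.from(sfnKeys).filter((key) => !n8nKeys.has(key)).sort(),
  };
}

// Usage: the two systems disagree on deal d2.
const runDiff = diffRuns(
  [{ dealId: "d1", toStage: "Warm" }, { dealId: "d2", toStage: "Hot" }],
  [{ dealId: "d1", toStage: "Warm" }, { dealId: "d2", toStage: "Stale" }],
);
const readyToCutOver = runDiff.onlyInN8n.length === 0 && runDiff.onlyInStepFunctions.length === 0;
```

Listing the mismatched keys, rather than just a count, points directly at the deals to replay through the classifier.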

F4 · ongoing

UI on Hub v2

Move the approval UI off Airtable Interfaces into Hub v2 (NestJS + Next.js already deployed to ECS hv-hub-v2-development). Airtable becomes read-only and is kept for reporting only.

AWS Service Map

Every n8n concern → an AWS primitive

n8n concern today	AWS primitive	Why
scheduleTrigger	EventBridge Scheduler	Managed cron with at-least-once delivery, retries, and a DLQ.
Code node orchestration	Step Functions Express	Visual workflow, parallel Map, cheap for runs under 5 min.
httpRequest + parse	Lambda (or ECS for >15 min)	Native HTTP, packaged TS, version-controlled.
Airtable snapshots	DynamoDB	KV with TTL, no 5 req/s cap, single-digit-ms reads.
task_queue_sales	RDS Aurora Postgres	Relational, transactions, joins for the UI.
autotask claim proxy	DynamoDB conditional write + TTL	Distributed lock without a server.
Daily quota counter	DynamoDB atomic ADD	Race-free increment guarded by a condition expression.
5-credential Gmail switch	Secrets Manager + per-rep secret	Rotation, audit, IAM-scoped.
splitInBatches + wait	SQS FIFO (MessageGroupId)	Per-rep ordering, dedupe, backpressure.
HubSpot engagement log	EventBridge bus	Fan-out, decouple sender from logger.
n8n execution history	CloudWatch + X-Ray	Structured logs, trace per task_id end-to-end.
Replay / audit	Kinesis Firehose → S3 → Athena	SQL over every classification decision.
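For the splitInBatches + wait row, a per-task SQS FIFO message might be built as below. The field names match the SQS SendMessage request (QueueUrl, MessageBody, MessageGroupId, MessageDeduplicationId); grouping by from_inbox comes from the sender design above, while using the task id as the deduplication id, the `ApprovedTask` shape, and the queue URL are assumptions.

```typescript
interface ApprovedTask {
  taskId: string;
  fromInbox: string; // the rep inbox the email is sent from
}

// Mirrors the SendMessage request fields used for a FIFO queue.
interface FifoMessage {
  QueueUrl: string;
  MessageBody: string;
  MessageGroupId: string;
  MessageDeduplicationId: string;
}

// One message per approved task, ordered per rep inbox.
function toFifoMessage(task: ApprovedTask, queueUrl: string): FifoMessage {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify({ taskId: task.taskId }),
    MessageGroupId: task.fromInbox, // messages in one group are delivered in order
    MessageDeduplicationId: task.taskId, // repeats within the dedup window are dropped
  };
}

// Usage with a made-up queue URL (FIFO queue names must end in .fifo).
const fifoMessage = toFifoMessage(
  { taskId: "t-1", fromInbox: "rep-inbox@example.com" },
  "https://sqs.us-east-1.amazonaws.com/123456789012/outbound-tasks.fifo",
);
```

SQS deduplicates over a 5-minute window, which is the same length as the scheduler interval, so a task re-enqueued on the next tick may not be caught by the queue alone; the DynamoDB claim would still be the final guard against a double send.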

What we are NOT migrating in phase 1

The ~8 webhook-driven Operational Tasks routers and the miscellaneous integration glue (HubSpot↔Airtable sync, WhatsApp) stay in n8n until the two revenue-critical workflows are off it. See Roadmap phases F5+.