A lightweight, embedded Python orchestration engine that replaces heavy BPMN servers with Code-First workflow definitions and enterprise-grade transactional guarantees.
Workflows as Python classes, not XML
Pessimistic locking guarantees
Live Mermaid.js diagrams
Distributed transaction safety
The Q Workflow Engine (QWE) is a lightweight, embedded Python orchestration engine designed to replace heavy BPMN servers (Camunda, ProcessMaker, LUCIDCharts) in Python ecosystems.
NOTE: QWE is not a BPMN engine. It is a state machine orchestrator.
It combines Code-First workflow definitions with enterprise-grade transactional guarantees, eliminating the operational overhead of standalone workflow servers while preserving production reliability.
Workflows defined as Python classes with Git versioning, code review, and unit testing capabilities that XML configurations cannot match.
Atomic state changes with pessimistic locking ensure consistent entity states and audit trails.
Orchestration-based distributed transactions with compensation for failure recovery.
Workflows are defined as Python classes inheriting from WorkflowMachine, not XML configurations. This enables version control through Git, code review via pull requests, unit testing with standard frameworks, and complex conditional logic through native Python methods.
class OrderWorkflow(WorkflowMachine):
    states = ["draft", "submitted", "approved", "shipped"]
    transitions = [...]
    permissions = {...}
QWE operates as a library within the application process, not a separate service. This eliminates network latency, simplifies deployment topology, and ensures workflow state transitions participate in the same database transactions as application data.
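As a sketch of the embedded model, a transition can run inside the caller's own database session. The attempt_transition() keyword arguments and session handling shown here are assumptions for illustration, not the published API.

from sqlalchemy.orm import sessionmaker  # assumed persistence layer (SQLAlchemy)

def submit_order(engine, session_factory: sessionmaker, order_id: str, actor_id: str) -> None:
    # Hypothetical usage sketch: the workflow transition and the application's
    # own writes commit (or roll back) together in one database transaction.
    with session_factory() as session:
        engine.attempt_transition(   # signature assumed for illustration
            entity_type="Order",
            entity_id=order_id,
            trigger="submit",
            actor_id=actor_id,
            session=session,  # same transaction as application data
        )
        session.commit()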
QWE explicitly targets replacement scenarios where organizations need workflow capabilities without BPMN's infrastructure and cognitive overhead. The specification identifies Camunda, ProcessMaker, and LUCIDCharts as representative targets, with QWE offering a superior developer experience and operational characteristics.
Every state transition follows a strict atomic sequence with pessimistic locking via SELECT FOR UPDATE, ensuring no partial states or race conditions.
Saga pattern implementation with orchestration-based coordination, compensation on failure, and retry with exponential backoff.
JSON-based rules stored in database enable product managers to tune thresholds without code deployments, while workflow structure remains version-controlled.
The WorkflowVisualizer generates runtime-aware diagrams that reflect actual execution status—not static documentation that becomes obsolete immediately after deployment.
The visualizer combines WorkflowMachine definitions (complete transition graph) with WorkflowAudit history (actual execution path) to classify each state temporally.
stateDiagram-v2
classDef past fill:#e3f2fd,stroke:#1976d2
classDef current fill:#fff9c4,stroke:#f44336,stroke-width:3px
classDef future fill:#ffffff,stroke:#9e9e9e,stroke-dasharray:5
classDef error fill:#ffcdd2,stroke:#c62828
Draft:::past --> Submitted:::past : submit
Submitted:::past --> PendingApproval:::current : request_approval
PendingApproval:::current --> Approved:::future : approve
PendingApproval:::current --> Rejected:::future : reject
PendingApproval:::current --> Failed:::error : [rule_violation]
When state_metadata includes lane keys, the visualizer generates swim-lane diagrams using Mermaid's block-beta format, grouping states by organizational responsibility.
state_metadata = {
    "draft": {"lane": "customer", "type": "user"},
    "submitted": {"lane": "operations", "type": "user"},
    "pending_payment": {"lane": "system", "type": "service"},
    "paid": {"lane": "warehouse", "type": "user"},
    "shipped": {"lane": "warehouse", "type": "service"},
    "cancelled": {"lane": "system", "type": "service"},
}
Terminal states (shipped, cancelled) are rendered with double borders to indicate process endpoints.
If state_metadata is absent, the visualizer gracefully degrades to the flat stateDiagram-v2 format.
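A possible invocation, assuming a WorkflowVisualizer constructed from the machine class and the entity's audit history with a render() method that returns Mermaid text; the names are illustrative, not the confirmed API.

# Hypothetical API: constructor and method names are assumptions for illustration.
visualizer = WorkflowVisualizer(
    machine=OrderWorkflow,        # complete transition graph
    audit_trail=audit_entries,    # WorkflowAudit rows for this entity
)
mermaid_source = visualizer.render()  # past/current/future/error classes applied
print(mermaid_source)                 # hand the text to Mermaid.js for rendering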
QWE implements strict separation of concerns across four layers, enabling independent evolution, testing, and scaling of each architectural component.
Handles HTTP requests, authentication, and rate limiting using Redis token-bucket algorithm.
Pure Python logic managing state machines and business rules with no direct infrastructure dependencies.
Manages ACID transactions with pessimistic locking and immutable audit logging.
Handles timers and sagas via pluggable Dispatcher protocol with Celery/Redis reference implementation.
class Dispatcher(Protocol):
    def dispatch_saga(self, saga_id: UUID) -> None: ...
    def schedule_timer(self, timer_id: UUID, eta: datetime) -> str: ...
    def revoke_task(self, task_id: str) -> None: ...
dispatch_saga: Initiates saga execution after database commit
schedule_timer: Schedules future execution, returns task_id for cancellation
revoke_task: Cancels previously scheduled task (best-effort)
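A sketch of a Celery-backed implementation of this protocol; the task names run_saga and fire_timer and their module are assumptions about the reference implementation.

from datetime import datetime
from uuid import UUID

from myapp.tasks import run_saga, fire_timer  # hypothetical Celery tasks

class CeleryDispatcher:
    """Illustrative Dispatcher backed by Celery/Redis."""

    def dispatch_saga(self, saga_id: UUID) -> None:
        # Enqueue saga execution; called only after the database transaction commits.
        run_saga.delay(str(saga_id))

    def schedule_timer(self, timer_id: UUID, eta: datetime) -> str:
        # Schedule future execution and return the broker task id for later revocation.
        result = fire_timer.apply_async(args=[str(timer_id)], eta=eta)
        return result.id

    def revoke_task(self, task_id: str) -> None:
        # Best-effort cancellation of a previously scheduled timer task.
        fire_timer.app.control.revoke(task_id)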
The WorkflowMachine base class enables complete workflow specification through Python class attributes and methods, providing a declarative yet programmable interface for defining entity lifecycles.
class OrderWorkflow(WorkflowMachine):
    states = ["draft", "submitted",
              "pending_payment", "paid",
              "shipped", "cancelled"]
    terminal_states = ["shipped", "cancelled"]
    transitions = [
        {"trigger": "submit",
         "source": "draft",
         "dest": "submitted"},
        {"trigger": "pay",
         "source": "submitted",
         "dest": "pending_payment"},
        {"trigger": "cancel",
         "source": ["draft", "submitted"],
         "dest": "cancelled"},
    ]
    initial_state = "draft"
    permissions = {
        "submit": {"roles": ["employee"]},
        "pay": {"roles": ["customer"]},
        "cancel": {"roles": ["admin"]},
    }
    task_config = {
        "submitted": {
            "candidate_groups": ["order-reviewers"],
            "due_date_offset": 86400,  # 24h
            "escalation_trigger": "escalate",
        },
    }
Role-based access control per trigger with pluggable AuthorizationProvider integration.
Automatic task creation with candidate groups, deadlines, and escalation triggers.
on_enter_<state> hooks for running custom side effects when a state is entered.
sagas = {
    "pay": [
        {
            "handler": "payment.charge_card",
            "compensation": "payment.refund_card",
            "max_retries": 3,
        },
        {
            "handler": "notification.send_receipt",
            "compensation": None,
            "max_retries": 1,
        },
    ],
}
Per-transition saga definitions with compensation handlers and retry policies.
timers = {
    "pending_payment": {
        "trigger": "cancel",
        "delay_seconds": 86400,  # 24h
        "behavior": "CANCEL",  # CANCEL / PROCEED / SKIP
    },
}
Per-state timer definitions with configurable state mismatch behavior.
Every state change follows a strict atomic sequence of twelve steps, ensuring fail-fast behavior for validation checks while maintaining transactional consistency for all mutations.
All mutations occur within a single database transaction. Saga execution is split into two phases:
Phase 1: the saga execution plan is persisted atomically within that same transaction.
Phase 2: the saga steps execute asynchronously after the transaction commits.
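The locking step at the head of that transaction might look like this with SQLAlchemy; this is an illustrative sketch, not the engine's internal code.

from sqlalchemy import select
from sqlalchemy.orm import Session

def lock_entity_for_transition(session: Session, entity_cls, entity_id: str):
    # Pessimistic lock: SELECT ... FOR UPDATE inside the transition's transaction,
    # so concurrent transitions on the same entity serialize instead of racing.
    stmt = (
        select(entity_cls)
        .where(entity_cls.id == entity_id)
        .with_for_update()  # blocks until any competing row lock is released
    )
    return session.execute(stmt).scalar_one()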
Reverse-order execution of compensation handlers for completed steps only, with exponential backoff retry.
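A minimal sketch of that compensation pass; handler lookup is simplified and the session argument from the handler contract below is omitted for brevity.

import time

def compensate(completed_steps: list[dict], handlers: dict, entity_id: str) -> None:
    # Undo completed steps in reverse order, retrying each compensation
    # handler with exponential backoff (1s, 2s, 4s, ...).
    for step in reversed(completed_steps):
        comp_name = step.get("compensation")
        if comp_name is None:
            continue  # step declared no compensation (e.g. sending a receipt)
        for attempt in range(step.get("max_retries", 3)):
            try:
                handlers[comp_name](entity_id, step.get("data", {}))
                break
            except Exception:
                time.sleep(2 ** attempt)
    # The saga is then marked COMPENSATED (persistence not shown).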
def handler(entity_id: str, data: dict, session: Session) -> dict:
    """Pure function: all input via data, all output via return dict"""
Upon saga creation, the engine must serialize the list of steps from WorkflowMachine.sagas and store them in WorkflowSaga.execution_plan.
Rationale: this "freezes" the saga definition. If a deployment changes the saga structure while a saga is in flight, the runner must execute the steps that were defined at start time, not those in the newly deployed code, to ensure data integrity and correct compensation paths.
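A sketch of what that freeze could look like at creation time; the surrounding fields are assumptions built around the execution_plan attribute named above.

import json
from uuid import uuid4

def create_saga_record(workflow_cls, trigger: str, entity_id: str) -> dict:
    # Copy the step list out of the class definition and persist it as data,
    # so a later deployment cannot silently change an in-flight execution plan.
    frozen_plan = json.dumps(workflow_cls.sagas[trigger])  # snapshot, not a live reference
    return {
        "id": str(uuid4()),
        "entity_id": entity_id,
        "trigger": trigger,
        "execution_plan": frozen_plan,  # maps to WorkflowSaga.execution_plan
        "status": "PENDING",
    }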
On state exit, pending timers for previous state are automatically revoked via dispatcher.
The domain object whose lifecycle is governed by the workflow engine. Must implement this mixin interface for full integration.
| Field | Type | Description |
|---|---|---|
| id | String | Unique identifier (UUID recommended) |
| state | String | Current workflow status |
| version | Integer | Entity version for ETag/staleness detection |
| completed_at | DateTime | Timestamp when terminal state reached |
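One possible SQLAlchemy mapping of these fields; the ORM choice and column details are assumptions, and any persistence layer exposing the same attributes would do.

from datetime import datetime
from typing import Optional

from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Order(Base):
    # Example workflow entity exposing the fields the engine expects.
    __tablename__ = "orders"

    id: Mapped[str] = mapped_column(primary_key=True)              # UUID recommended
    state: Mapped[str] = mapped_column(default="draft")            # current workflow status
    version: Mapped[int] = mapped_column(default=0)                # ETag / staleness detection
    completed_at: Mapped[Optional[datetime]] = mapped_column(default=None)  # set on terminal state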
Worker Responsibility: While processing a long-running step, the worker must update heartbeat_timestamp every N seconds.
Zombie Detection: The monitoring system flags sagas where status=RUNNING AND heartbeat_timestamp < NOW() - threshold.
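A sketch of the reaper query, assuming a workflow_saga table with the status and heartbeat_timestamp columns described above; the table and column names are illustrative.

from datetime import timedelta

from sqlalchemy import text
from sqlalchemy.orm import Session

ZOMBIE_THRESHOLD = timedelta(minutes=10)  # illustrative threshold

def find_zombie_sagas(session: Session) -> list[str]:
    # Flag sagas that claim to be RUNNING but whose worker stopped heartbeating.
    rows = session.execute(
        text(
            "SELECT id FROM workflow_saga "
            "WHERE status = 'RUNNING' "
            "AND heartbeat_timestamp < NOW() - :threshold"
        ),
        {"threshold": ZOMBIE_THRESHOLD},
    )
    return [str(row.id) for row in rows]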
{
  "conditions": {
    "all": [
      {
        "field": "amount",
        "operator": "greater_than",
        "value": 500
      },
      {
        "not": {
          "field": "status",
          "operator": "equal_to",
          "value": "vip"
        }
      }
    ]
  },
  "actions": [
    {
      "action": "require_approval",
      "params": {"level": "manager"}
    }
  ]
}
Supported logical operators: all, any, not.
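To make the rule semantics concrete, here is a minimal, self-contained evaluator for the all/any/not combinators and the two operators used above; the real rule engine presumably supports a wider operator set.

OPERATORS = {
    "greater_than": lambda a, b: a > b,
    "equal_to": lambda a, b: a == b,
}

def evaluate(condition: dict, payload: dict) -> bool:
    # Recursively evaluate an all/any/not condition tree against entity data.
    if "all" in condition:
        return all(evaluate(c, payload) for c in condition["all"])
    if "any" in condition:
        return any(evaluate(c, payload) for c in condition["any"])
    if "not" in condition:
        return not evaluate(condition["not"], payload)
    # Leaf clause: {"field": ..., "operator": ..., "value": ...}
    return OPERATORS[condition["operator"]](payload[condition["field"]], condition["value"])

# The rule above fires for a $750 order from a non-VIP customer.
rule = {"all": [
    {"field": "amount", "operator": "greater_than", "value": 500},
    {"not": {"field": "status", "operator": "equal_to", "value": "vip"}},
]}
assert evaluate(rule, {"amount": 750, "status": "regular"}) is True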
To prevent Remote Code Execution (RCE), arbitrary string imports are banned. Handlers must be explicitly registered using decorators.
@workflow_handler("payment.charge_card")
def charge_card(entity_id: str, data: dict,
                session: Session) -> dict:
    # Process payment via Stripe
    ...

@compensation_handler("payment.charge_card")
def refund_card(entity_id: str, data: dict,
                session: Session) -> dict:
    # Reverse the charge
    ...
Only handlers from ALLOWED_HANDLER_PREFIXES can be registered, preventing arbitrary code execution through crafted saga definitions.
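A sketch of how decorator-based registration with a prefix allowlist might be enforced; ALLOWED_HANDLER_PREFIXES comes from the text above, and the remaining names are illustrative.

from typing import Callable

ALLOWED_HANDLER_PREFIXES = ("payment.", "notification.")  # illustrative allowlist
_HANDLER_REGISTRY: dict[str, Callable] = {}

def workflow_handler(name: str) -> Callable:
    # Register a saga handler under an explicit, allowlisted name.
    if not name.startswith(ALLOWED_HANDLER_PREFIXES):
        raise ValueError(f"Handler name '{name}' is outside the allowed prefixes")

    def decorator(func: Callable) -> Callable:
        _HANDLER_REGISTRY[name] = func
        return func

    return decorator

def resolve_handler(name: str) -> Callable:
    # Look up a handler by name; unknown names fail instead of being imported.
    return _HANDLER_REGISTRY[name]  # raises KeyError rather than importing anything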
WORKFLOW_MIGRATION_MODE=greenfield creates all tables from scratch for new installations.
WORKFLOW_MIGRATION_MODE=upgrade adds missing columns to existing tables without data loss.
SQLite :memory: database with MockDispatcher for core logic verification (sketched below).
PostgreSQL and Redis for full-stack validation of locking and saga execution.
Mermaid output assertions to verify diagram class definitions and styling.
Locust with production-like infrastructure for concurrency validation.
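The SQLite :memory: plus MockDispatcher tier could look like this; the WorkflowEngine constructor, MockDispatcher, and the helper functions are assumptions about QWE's test utilities.

import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

@pytest.fixture
def workflow_engine():
    # In-memory database plus a no-op dispatcher keeps unit tests fast and hermetic.
    db = create_engine("sqlite:///:memory:")
    return WorkflowEngine(                    # hypothetical constructor
        session_factory=sessionmaker(bind=db),
        dispatcher=MockDispatcher(),          # records calls instead of enqueueing them
    )

def test_submit_moves_draft_order_to_submitted(workflow_engine):
    order_id = create_draft_order(workflow_engine)  # hypothetical helper
    workflow_engine.attempt_transition(
        entity_type="Order", entity_id=order_id,
        trigger="submit", actor_id="test-employee",
    )
    assert get_order_state(workflow_engine, order_id) == "submitted"  # hypothetical helper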
For new installations, WORKFLOW_MIGRATION_MODE=greenfield creates all required tables from scratch with optimal schema design.
For existing installations, WORKFLOW_MIGRATION_MODE=upgrade performs non-destructive schema evolution by adding missing columns.
All admin_* methods require WORKFLOW_ADMIN role authorization and emit SECURITY_AUDIT events. In production, these should be gated behind a separate admin service with MFA, not exposed on the primary API.
# 1. Verify external reality (Stripe Dashboard)
# 2. Force status update
engine.admin_force_saga_status(
    saga_id="uuid...",
    step_name="charge_card",
    status=WorkflowStatus.COMPLETED,
)
# 3. Resume workflow
engine.dispatcher.dispatch_saga(saga_id="uuid...")
For worker crashes after external effects but before status update.
# EXTREME CAUTION: Bypasses all validation
engine.admin_force_state(
    entity_type="Order",
    entity_id="123",
    new_state="draft",
    actor_id="admin-emergency",
    reason="Rollback due to Bug #1234",
)
For workflows stuck in logical dead-ends due to bugs.
import time

engine.admin_set_mode(WorkflowMode.DRAINING)
while engine.admin_active_transaction_count() > 0:
    time.sleep(5)
    print(f"Waiting for {engine.admin_active_transaction_count()} transactions...")
engine.admin_set_mode(WorkflowMode.MAINTENANCE)
# Run migrations
engine.admin_set_mode(WorkflowMode.NORMAL)
Let in-flight transitions complete while blocking new ones.
engine.admin_bulk_state_update(
    entity_type="Order",
    filter_query={"state": "broken_state"},
    new_state="draft",
    reason="Mass rollback for Bug #1234",
    dry_run=False,
)
Safer than individual force_state calls for mass corrections.
engine.admin_purge_timers(
    entity_id="123",
    entity_type="Order",
)
Clean up dead timers after manual SQL operations.
stats = engine.admin_get_lock_stats(entity_id="123")
engine.admin_kill_connection(pid=9821)
Diagnose and resolve lock timeout issues.
engine.admin_reconstruct_audit(
    entity_type="Order",
    entity_id="123",
    from_state="draft",
    to_state="submitted",
    trigger="submit",
)
Rebuild missing audit entries from historical data.
time() - workflow_saga_start_timestamp{status="pending"} > 3600
Action: Check Celery workers, run admin_force_saga_status
histogram_quantile(0.99, workflow_lock_wait_seconds_bucket) > 5
Action: Investigate hot entities, optimize handler logic
rate(workflow_transition_total{status="compensated"}[10m]) > 0.05
Action: Downstream service likely down, check dependencies
workflow_timer_lag_seconds > 300
Action: Celery timers queue backed up, scale workers
rate(workflow_handler_errors_total[5m]) / rate(workflow_transition_total[5m]) > 0.01
Action: Review handler logs, check external service health
workflow_active_sagas{status="pending"} > 10
Action: Run zombie reaper job, investigate worker crashes
| Profile | Transitions/min | Recommended Infrastructure |
|---|---|---|
| Small | < 100 | Single PostgreSQL, 2 Celery workers, standalone Redis |
| Medium | 100 – 1,000 | PostgreSQL with read replica, 4-8 workers, Redis cluster |
| Large | 1,000 – 10,000 | PostgreSQL with pgBouncer, 16-32 workers, Redis cluster, consider sharding |
| Very Large | > 10,000 | Sharded PostgreSQL (Citus), dedicated timer queue, horizontal API scaling |
Required saga-step throughput (steps/second): (peak_transitions_per_min / 60) × avg_saga_steps × 1.5
The 1.5× multiplier adds headroom.
Redis memory estimate: (active_sagas × 2 KB) + (rate_limit_buckets × 100 B) + 256 MB
Add a buffer for Celery task metadata.
PostgreSQL connection estimate: (api_instances × 10) + (celery_workers × 2) + 10 admin connections
Use pgBouncer above roughly 100 connections.
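These estimates can be scripted directly from the formulas; a small helper using the interpretation above (unit labels are illustrative).

def capacity_estimates(peak_transitions_per_min: int,
                       avg_saga_steps: float,
                       active_sagas: int,
                       rate_limit_buckets: int,
                       api_instances: int,
                       celery_workers: int) -> dict:
    # Apply the three sizing formulas above.
    saga_steps_per_second = (peak_transitions_per_min / 60) * avg_saga_steps * 1.5
    redis_memory_bytes = active_sagas * 2_048 + rate_limit_buckets * 100 + 256 * 1024 ** 2
    postgres_connections = api_instances * 10 + celery_workers * 2 + 10  # +10 admin
    return {
        "saga_steps_per_second": saga_steps_per_second,
        "redis_memory_mb": round(redis_memory_bytes / 1024 ** 2, 1),
        "postgres_connections": postgres_connections,
    }

# Example: a "Medium" profile at 600 transitions/min with 2 saga steps on average.
print(capacity_estimates(600, 2, active_sagas=500, rate_limit_buckets=10_000,
                         api_instances=3, celery_workers=8))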
Add read replicas or optimize hot entities
Scale worker count
Deploy pgBouncer or increase max_connections
Scale Redis cluster or increase instance size
Add API instances, review slow transitions
| Component | HA Strategy | Failover Time |
|---|---|---|
| PostgreSQL | Streaming replication + Patroni | < 30 seconds |
| Redis | Redis Sentinel or Cluster mode | < 10 seconds |
| Celery Workers | Multiple workers with auto-restart (systemd) | Immediate |
| API Layer | Load balancer with health checks | < 5 seconds |
Headers: Authorization (required), Idempotency-Key (optional)
Returns: Mermaid.js syntax as text/plain
Returns current state and available triggers
Performs a "Dry Run" of the engine's validation stack (Permissions + Conditions + Rules) for the current state.
{
  "current_state": "review",
  "triggers": {
    "approve": {"enabled": true},
    "reject": {"enabled": false, "reason": "User is not in group 'managers'"}
  }
}
Query by group, assignee, status, entity_type
Database, Redis, Celery connectivity status
Transition completed successfully
Actor lacks required role for trigger
Invalid transition, terminal state, or already claimed
Business rule or condition check failed
Another transaction holds lock; retry with guidance
Rate limit exceeded; retry with guidance
System in draining/maintenance mode
Resource not found
Outbound notifications for state changes, saga completions, and task events.
Automatic circuit breaking for failing saga handlers with fallback mechanisms.
Built-in archival and time-based expiration for WorkflowAudit and completed sagas.
Native tenant isolation with separate workflow definitions and data partitioning.
Support for multiple workflow versions running concurrently with automatic migration.
Single-state FSM design cannot model true concurrent activity branches.
Flat workflow structure without nested process composition.
Trigger-driven API calls rather than event-driven message correlation.
Form rendering externalized to client applications.
Reassignment tracked as state changes without delegation chain.
For systems requiring both QWE's transactional strengths and BPMN's process orchestration, a hybrid approach combines both technologies in their zones of strength:
BPMN service tasks call QWE's attempt_transition() to safely change entity states with locking, rules, auditing, and saga compensation, while BPMN provides process-level orchestration and dashboards.
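One way that wiring could look from the BPMN side, written as a generic service-task worker function; the attempt_transition() signature and the returned fields are assumptions for illustration.

def handle_approve_order_task(task_variables: dict, engine) -> dict:
    # Illustrative BPMN service-task worker: the BPMN engine drives the process,
    # while QWE performs the guarded state change (locking, rules, audit, saga).
    result = engine.attempt_transition(  # signature assumed
        entity_type="Order",
        entity_id=task_variables["order_id"],
        trigger="approve",
        actor_id=task_variables["approver_id"],
    )
    # Return process variables so the BPMN model can branch (e.g. on rejection).
    return {"transition_status": result.get("status", "unknown")}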
QWE is not a BPMN engine. It is a state machine orchestrator with transactional guarantees—deliberately narrower scope enabling superior operational characteristics for entity lifecycle workflows.