Deploy GraphQL Federation with Confidence

What You’ll Learn

This guide will teach you how to safely deploy subgraphs in a federated GraphQL architecture. We’ll start with the basics and build up to advanced deployment strategies, focusing on preventing production issues and maintaining system reliability.

Federation Basics - Understanding the core components and how they work together
Safety First - Schema validation and preventing breaking changes
Deployment Strategies - From simple deployments to advanced canary releases
Production Ready - Monitoring, rollbacks, and operational best practices

Understanding Federation Basics

Before diving into deployment strategies, let’s understand what we’re working with. GraphQL Federation allows you to split your API into independent services (subgraphs) while presenting a single, unified API to clients.

The Key Players

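Subgraph - An independent GraphQL service that owns one part of the overall schema
Supergraph - The unified schema composed from all subgraphs
Router - The gateway that receives client requests and routes them to the right subgraphs
Schema Registry - The central store where subgraph schemas are published, checked, and composed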

Why This Matters for Deployment

Unlike deploying a single service, federated GraphQL requires coordination. When you change a subgraph:

Schema compatibility - Your changes must work with other subgraphs
Timing matters - Deploy code first, then publish the schema
Router updates - The router needs the new schema to route correctly
Rollback complexity - Issues can affect the entire graph

The Foundation: Schema Safety

Most production issues in federated GraphQL come from schema problems. Before learning deployment strategies, you need to master schema safety.

The Critical Rule: Validate Before Deploy

Never deploy schema changes without validation. The wgc subgraph check command is your first line of defense.

# Always run this before deploying
wgc subgraph check my-products-subgraph \
  --schema ./schema.graphql \
  --namespace production

What Gets Checked

Breaking Changes

Will this break existing clients?

Removing fields, changing types, or modifying required arguments can break client applications.

Composition Validity

Can schemas merge successfully?

Your schema must combine properly with all other subgraphs to create a valid supergraph.

Operations Analysis

Smart safety checks

Analyzes real client usage data to determine if a “breaking” change is actually safe in practice.

Schema Linting

Enforces best practices

Validates your schema follows GraphQL federation principles and your organization’s conventions.

Failed validation = Don’t deploy

If wgc subgraph check fails, fix the issues before proceeding. A failed check means your changes could break the entire graph or existing client applications. In rare cases, you can override the check in Cosmo Studio and proceed with the deployment, for example when removing fields or types that no client uses.
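
One way to enforce this rule in a script is to make the deploy step conditional on the check's exit status. The sketch below assumes that `wgc subgraph check` exits with a non-zero status when the check fails (the usual CLI convention, worth confirming for your wgc version). The helper name `deploy_if_check_passes` is hypothetical.

```shell
# deploy_if_check_passes SUBGRAPH SCHEMA NAMESPACE DEPLOY_COMMAND [ARGS...]
# Hypothetical helper: runs the deploy command only when the schema check succeeds.
# Assumes `wgc subgraph check` returns a non-zero exit status on failure.
deploy_if_check_passes() {
  local subgraph="$1" schema="$2" namespace="$3"
  shift 3
  if ! wgc subgraph check "$subgraph" --schema "$schema" --namespace "$namespace"; then
    echo "Schema check failed for $subgraph in $namespace; skipping deployment." >&2
    return 1
  fi
  # Check passed: run whatever deploy command the caller supplied.
  "$@"
}
```

For example, `deploy_if_check_passes my-products-subgraph ./schema.graphql production ./deploy.sh` would run `./deploy.sh` (a placeholder for your own deploy script) only after a passing check.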

Environment Isolation

Use separate namespaces for complete isolation between environments:

dev namespace
├── Products Subgraph
├── Users Subgraph  
├── Orders Subgraph
└── Federated Graph

For trying new features and schema changes without risk

stage namespace  
├── Products Subgraph
├── Users Subgraph
├── Orders Subgraph
└── Federated Graph

Final validation before production - mirrors your live setup

prod namespace
├── Products Subgraph
├── Users Subgraph
├── Orders Subgraph
└── Federated Graph

Your live system serving real customers - requires maximum safety

Your First Safe Deployment

Let’s walk through a basic deployment that follows safety best practices.

Step-by-Step Process

Validate Your Schema

Before touching any infrastructure

# Check against your target environment
wgc subgraph check my-products-subgraph \
  --schema ./schema.graphql \
  --namespace development

Only proceed if this passes without errors.

Deploy Your Application Code

Deploy the subgraph service first

# Deploy to your infrastructure (example: Kubernetes)
kubectl apply -f k8s/development/
kubectl rollout status deployment/my-products-subgraph-dev

Critical: Verify your service is healthy before the next step.

Publish Your Schema

Only after the service is running

# Verify service health first.
curl -f http://my-products-subgraph-dev.internal/health

# Then publish the schema
wgc subgraph publish my-products-subgraph \
  --schema ./schema.graphql \
  --namespace development

Verify Everything Works

Test the complete integration

Query your router to ensure the new schema is active and working correctly.
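
A minimal smoke test for this step might look like the sketch below. It posts a GraphQL query to the router over HTTP and treats a response containing a top-level "errors" key as a failure. The helper name `query_router` is hypothetical, the substring match on "errors" is deliberately crude, and the query must not contain double quotes or newlines because the JSON body is built by plain string interpolation.

```shell
# query_router ROUTER_URL QUERY
# Hypothetical helper: sends QUERY as a JSON POST body and prints the response.
# Returns non-zero if the request fails or the response mentions "errors".
query_router() {
  local url="$1" query="$2" response
  response="$(curl -fsS -X POST "$url" \
    -H 'Content-Type: application/json' \
    --data "{\"query\":\"$query\"}")" || return 1
  printf '%s\n' "$response"
  # Crude check: any "errors" key in the body counts as a failed query.
  case "$response" in
    *'"errors"'*) return 1 ;;
  esac
  return 0
}
```

For example, `query_router http://localhost:3002/graphql '{ __typename }'` checks that the router answers at all; the URL is an assumption about where your router listens. Replace the query with one that selects a field added by the new schema to confirm the change is live.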

Why This Order Matters

Deploy Code → Publish Schema (Never the reverse)

If you publish the schema first, the router will try to send queries to resolvers that don’t exist yet. This causes immediate errors for your users.

Router Configuration Strategies

The router needs to know about your schema changes. You have two main approaches:

Dynamic Configuration

Router fetches automatically

The router polls Cosmo’s CDN for the latest schema. Simple setup with automatic updates.

Best for: Most deployments, especially when you want zero-downtime schema updates.

Static Configuration

Pre-built configuration

Build the router config in CI and deploy it with your router. Full control but requires router redeployment.

Best for: Air-gapped environments or when you need strict control over when schema changes apply.

Dynamic Configuration (Recommended)

This is the simpler approach for most teams.

How it works:

You deploy your subgraph and publish the schema
Router automatically fetches the new schema from Cosmo CDN
Router gracefully reloads with the new configuration
Zero downtime for schema updates

Trade-offs:

✅ Simple setup and configuration
✅ Zero downtime schema updates
✅ No need to redeploy router for schema changes
✅ Automatic rollback to the last known good configuration
✅ Encourages atomic deployments (schema publishing is coupled with the subgraph deployment)
❌ Requires internet connectivity to the Cosmo CDN
❌ The schema file must be available to the subgraph deployment (for example, embedded in its image) if you publish after deployment

Static Configuration (Advanced)

When you need full control:

# CI Pipeline step
- name: Build Router Config
  run: |
    # Fetch the latest composed router configuration from Cosmo
    wgc router fetch my-graph \
      --namespace production \
      --out router-config.json
    
    # Build router image with the embedded config
    docker build \
      --build-arg CONFIG_FILE=router-config.json \
      -t my-router:${{ github.sha }} .

Your Dockerfile should copy the config:

FROM ghcr.io/wundergraph/cosmo/router:latest

ARG CONFIG_FILE
COPY ${CONFIG_FILE} /app/config.json

ENV EXECUTION_CONFIG_FILE_PATH=/app/config.json

Trade-offs:

✅ No dependency on CDN at runtime
✅ Perfect for air-gapped environments
❌ Must redeploy the router for every schema change
❌ More complex rollback procedures

All subsequent examples in this guide follow the dynamic configuration approach for simplicity. If you’re using static configuration, you’ll need to modify the examples to include router config building and deployment steps.

Automated CI/CD Integration

Manual deployments don’t scale. Let’s automate the safety checks and deployment process.

Basic CI/CD Pipeline

name: Safe Subgraph Deployment

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      # Step 1: Validate schema safety
      - name: Schema Safety Check
        run: |
          npm install -g wgc@latest
          wgc subgraph check my-products-subgraph \
            --schema ./schema.graphql \
            --namespace production
        env:
          COSMO_API_KEY: ${{ secrets.COSMO_API_KEY }}
      
      # Step 2: Deploy application code
      - name: Deploy Service
        run: |
          kubectl apply -f k8s/production/
          kubectl rollout status deployment/my-products-subgraph --timeout=300s
      
      # Step 3: Health check
      - name: Verify Service Health
        run: |
          curl -f http://my-products-subgraph.internal/health
        timeout-minutes: 2
      
      # Step 4: Publish schema
      - name: Publish Schema
        run: |
          wgc subgraph publish my-products-subgraph \
            --schema ./schema.graphql \
            --namespace production
        env:
          COSMO_API_KEY: ${{ secrets.COSMO_API_KEY }}
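
The health check step in the pipeline above makes a single request, which can fail if the service is still starting. A retry loop is one way to tolerate that; the sketch below is an assumption about how you might write it, not part of the Cosmo tooling, and `wait_until_healthy` is a hypothetical name.

```shell
# wait_until_healthy URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls a health endpoint until curl reports success.
wait_until_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP error statuses as failures, -sS: quiet but still print errors
    if curl -fsS --max-time 5 "$url" >/dev/null; then
      return 0
    fi
    echo "Health check $i/$attempts failed for $url" >&2
    # Sleep between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

In the pipeline, `wait_until_healthy http://my-products-subgraph.internal/health 20 5` could replace the single curl call; the attempt count and delay are illustrative values.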

Environment Promotion Strategy

Promote through environments in order:

Remember: Always deploy your service code first, verify that it’s healthy, then publish the schema.

# Development first
wgc subgraph check my-subgraph --schema ./schema.graphql --namespace dev
# Deploy service to dev environment, then:
wgc subgraph publish my-subgraph --schema ./schema.graphql --namespace dev

# Then staging  
wgc subgraph check my-subgraph --schema ./schema.graphql --namespace stage
# Deploy service to staging environment, then:
wgc subgraph publish my-subgraph --schema ./schema.graphql --namespace stage

# Finally production
wgc subgraph check my-subgraph --schema ./schema.graphql --namespace prod
# Deploy service to production environment, then:
wgc subgraph publish my-subgraph --schema ./schema.graphql --namespace prod
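
The same sequence can be expressed as a loop that stops at the first failure, so a problem in dev or stage never reaches prod. This is a sketch under assumptions: `promote_through` is a hypothetical helper, and the deploy command is whatever single command deploys your service to a given namespace.

```shell
# promote_through SUBGRAPH SCHEMA DEPLOY_COMMAND NAMESPACE...
# Hypothetical helper: for each namespace in order, check the schema,
# deploy the service, then publish the schema. Stops at the first failure.
promote_through() {
  local subgraph="$1" schema="$2" deploy="$3" ns
  shift 3
  for ns in "$@"; do
    echo "==> Promoting $subgraph to $ns"
    wgc subgraph check "$subgraph" --schema "$schema" --namespace "$ns" || return 1
    # DEPLOY_COMMAND receives the namespace as its only argument.
    "$deploy" "$ns" || return 1
    wgc subgraph publish "$subgraph" --schema "$schema" --namespace "$ns" || return 1
  done
}
```

For example, `promote_through my-subgraph ./schema.graphql ./deploy.sh dev stage prod`, where `./deploy.sh` is a placeholder for your own deployment script. In practice you would usually add verification or a manual approval between environments.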

Advanced: Canary Deployments

Once you’ve mastered basic deployments, canary releases let you deploy with even greater safety by gradually shifting traffic to the new version.

Breaking Changes: This strategy is not suitable for releasing breaking changes. In general, you should avoid breaking your production graph (e.g., by removing/renaming a field or changing a type). Fields should be marked as @deprecated instead. Always use the wgc subgraph check command to validate your schema changes before deploying to production. The output of the check command will help you understand the impact of your changes and decide if you can release them safely.

Understanding Canary in Federation

A canary deployment in federated GraphQL is more complex than with typical services because:

Schema dependencies: Your new subgraph version may require other subgraphs to form a valid composition. This is a classic chicken-and-egg problem, but it is safely solved with Cosmo because the schema registry ensures that only the latest valid composition is made available to the router. However, you need to be aware of the dependencies between subgraphs and ensure that all subgraphs are deployed to resolve the composition.

Safe Canary Strategy

The safest approach uses separate environments for canary subgraph deployments:

Production Environment (90% traffic)
├── Products Subgraph v1.0
├── Users Subgraph v2.1
└── Orders Subgraph v1.5

Canary Environment (10% traffic)  
├── Products Subgraph v1.1  ← New version being tested
├── Users Subgraph v2.1
└── Orders Subgraph v1.5

Benefits:

Complete isolation between subgraph versions
Easy rollback by routing traffic away from canary environment or rolling the entire subgraph back to the previous version
Test new subgraph versions without affecting the entire federation

Implementing with Argo Rollouts

Here’s a production-ready canary setup:

# This resource defines the reusable job for publishing the subgraph schema.
# It is defined once and can be referenced by multiple Rollouts.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: publish-subgraph-schema
spec:
  # These arguments are supplied by the Rollout during the analysis run.
  args:
    - name: image
    - name: namespace
    - name: subgraph-name
  metrics:
    - name: publish-schema
      provider:
        # The job provider runs a Kubernetes Job; the measurement succeeds
        # when the Job completes and fails when the Job fails.
        job:
          spec:
            backoffLimit: 1
            template:
              spec:
                containers:
                  - name: publisher
                    # The image uses the specific tag of the version that was just promoted.
                    image: "{{args.image}}"
                    command: ["/bin/sh", "-c"]
                    args:
                      - |
                        # Best Practice: The 'wgc' CLI should be pre-installed in the Docker image.
                        # This command assumes 'wgc' is already in the PATH.
                        wgc subgraph publish {{args.subgraph-name}} \
                          --namespace {{args.namespace}} \
                          --schema /path/to/my/schema.graphql
                    env:
                      - name: COSMO_API_KEY
                        valueFrom:
                          secretKeyRef:
                            name: cosmo-secrets
                            key: COSMO_API_KEY
                restartPolicy: Never

---
# This is the Rollout resource that orchestrates the deployment.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-products-subgraph
spec:
  # REQUIRED: This selector links the Rollout to the pods it manages.
  selector:
    matchLabels:
      app: my-products-subgraph
  replicas: 5
  strategy:
    canary:
      steps:
        - pause: {duration: 30s}  # Wait for router to poll new config (default: 10s)
        - setWeight: 10    # Increase to 10%
        - pause: {duration: 30s}
        - setWeight: 25    # Increase to 25%
        - pause: {duration: 1m}
        - setWeight: 50    # Increase to 50%
        - pause: {duration: 1m}
        - setWeight: 100   # Full rollout
        # Post-promotion analysis: runs once the new version serves 100% of traffic.
        # Argo Rollouts documents `postPromotionAnalysis` for the blueGreen strategy;
        # with the canary strategy, a final inline analysis step plays the same role.
        - analysis:
            templates:
              # Reference the AnalysisTemplate defined above.
              - templateName: publish-subgraph-schema
            args:
              # Pass the information to the AnalysisTemplate.
              - name: image
                # Keep in sync with the container image in the pod template below.
                value: my-products-subgraph:1.2.0
              - name: namespace
                value: production
              - name: subgraph-name
                value: my-products-subgraph
  template:
    metadata:
      # REQUIRED: Pod labels must match the selector.
      labels:
        app: my-products-subgraph
    spec:
      containers:
        - name: app
          image: my-products-subgraph:1.2.0
          ports:
            - containerPort: 8080

Automated Rollback

If the post-promotion analysis fails (for example, because the schema publish job fails), Argo Rollouts automatically rolls the deployment back to the previous stable version. This ensures that:

Your subgraph code reverts to the last known good version
The supergraph schema remains consistent and valid
Client applications continue to work without interruption

This automatic rollback mechanism protects your federation from broken deployments while maintaining the integrity of your overall graph.

Manual Validation Control: You can configure the rollout to pause at any step (a pause step without a duration waits until it is promoted manually) and run more thorough health checks before proceeding. Run kubectl argo rollouts promote my-products-subgraph to resume the rollout and continue toward the post-promotion analysis. This gives you full control over when traffic shifts to the new version.

Timing Consideration: When using dynamic configuration, allow at least 20 seconds for the initial canary evaluation. The router polls for schema updates every 10 seconds by default (configurable via poll_interval), so account for this delay after a new schema is published. The example above already does so with its initial 30-second pause.
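
If you change the router's poll interval, the initial pause should change with it. The sketch below reads the 20-second guidance as "two poll intervals", which is an interpretation rather than a documented formula; `min_initial_pause` is a hypothetical helper.

```shell
# min_initial_pause [POLL_INTERVAL_SECONDS]
# Hypothetical rule of thumb: wait at least two router poll intervals
# before evaluating the canary. Defaults to the 10-second poll interval.
min_initial_pause() {
  local poll_interval="${1:-10}"
  echo "$((poll_interval * 2))"
}
```

With the default interval this yields 20 seconds; with a 30-second poll interval, `min_initial_pause 30` yields 60, so the first pause step in the Rollout should be at least one minute.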

Monitoring and Observability

You can’t manage what you can’t measure. Proper monitoring is essential for safe deployments.

Essential Metrics

Error Rates

Track failures across the graph

Monitor both GraphQL errors and HTTP errors at the router and subgraph levels.

Latency

Measure performance impact

Track P50, P95, and P99 latencies to detect performance regressions.

Schema Usage

Understand client behavior

See which fields are used to make safe deprecation decisions.

Composition Health

Monitor graph integrity

Track schema composition success and supergraph generation.
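
To make the latency percentiles above concrete, the following sketch computes a nearest-rank percentile from raw samples. It is for illustration only: in production these values normally come from histogram metrics in your monitoring stack, and `percentile` is a hypothetical helper that expects an integer percentile and one numeric latency per line on standard input.

```shell
# percentile P
# Hypothetical helper: reads one numeric sample per line on stdin and prints
# the nearest-rank Pth percentile (P is an integer between 1 and 100).
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Nearest rank = ceil(p * NR / 100), computed with integer arithmetic.
      rank = int((p * NR + 99) / 100)
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

For example, `printf '%s\n' 30 10 20 40 50 60 70 80 90 100 | percentile 50` prints 50: the fifth of ten sorted samples. P95 and P99 on the same input both print 100, which shows why tail percentiles need a large sample to be meaningful.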

Setting Up Observability

Router observability: The Cosmo Router exports comprehensive metrics automatically through OpenTelemetry and Prometheus endpoints. For detailed setup and best practices, see:

Router Metrics & Monitoring - Complete metrics reference and configuration
OpenTelemetry Setup - How to configure OTEL collectors
Custom Attributes - Adding custom telemetry data

# Router metrics are available at:
curl http://router:8088/metrics
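
The endpoint returns metrics in the Prometheus text format, which is easy to inspect from a script during a deployment. The sketch below sums every sample of one metric across its label sets. It assumes sample lines carry no trailing timestamp, and `sum_metric` is a hypothetical helper; the metric name you pass depends on what your router version exports.

```shell
# sum_metric NAME
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints the sum of all samples whose metric name equals NAME.
# Assumes sample lines have no trailing timestamp.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)         # drop the label set, if any
      if (metric == name) total += $NF
    }
    END { print total + 0 }'
}
```

For example, `curl -s http://router:8088/metrics | sum_metric <metric_name>` prints the total for the chosen metric. Comparing the value before and after a rollout step gives a rough signal without a full dashboard.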

Subgraph instrumentation: Each subgraph should be instrumented with OpenTelemetry to provide end-to-end observability across your federation. Subgraphs can:

Export metrics and traces to your observability stack
Push telemetry data directly to Cosmo for centralized monitoring
Provide detailed resolver-level performance insights

The specific instrumentation approach depends on your subgraph’s technology stack (Node.js, Python, Go, etc.). Refer to the OpenTelemetry documentation for language-specific setup guides.

Best Practices Summary

Schema First

Always validate before deploy

Use wgc subgraph check on every change. Never skip validation.

Code Then Schema

Deploy in the right order

Deploy service code first, verify health, then publish schema.

Environment Isolation

Use separate namespaces

Keep dev, staging, and production completely isolated.

Automate Safety

Build checks into CI/CD

Make safety checks automatic, not manual processes.

Monitor Everything

Comprehensive observability

Track errors, latency, and schema usage across the entire graph.

Plan for Problems

Prepare for issues

Have rollback procedures and incident response plans ready.

What’s Next?

You now have the foundation for safely deploying federated GraphQL. As you gain experience:

Experiment with advanced features like feature flags
Optimize your monitoring and alerting based on real usage patterns
Refine your canary deployment strategy for your specific needs
Share your learnings with other teams adopting federation

Remember: Safety first, speed second. A robust deployment process might seem slower initially, but it prevents the much larger costs of production incidents and helps you move faster in the long run.

Your federated GraphQL architecture is now ready to scale safely with your business needs.
