What You’ll Learn

This guide will teach you how to safely deploy subgraphs in a federated GraphQL architecture. We’ll start with the basics and build up to advanced deployment strategies, focusing on preventing production issues and maintaining system reliability.

Federation Basics

Understanding the core components and how they work together

Safety First

Schema validation and preventing breaking changes

Deployment Strategies

From simple deployments to advanced canary releases

Production Ready

Monitoring, rollbacks, and operational best practices


Understanding Federation Basics

Before diving into deployment strategies, let’s understand what we’re working with. GraphQL Federation allows you to split your API into independent services (subgraphs) while presenting a single, unified API to clients.

The Key Players

Why This Matters for Deployment

Unlike deploying a single service, federated GraphQL requires coordination. When you change a subgraph:

  1. Schema compatibility - Your changes must work with other subgraphs
  2. Timing matters - Deploy code first, then publish the schema
  3. Router updates - The router needs the new schema to route correctly
  4. Rollback complexity - Issues can affect the entire graph

The Foundation: Schema Safety

Most production issues in federated GraphQL come from schema problems. Before learning deployment strategies, you need to master schema safety.

The Critical Rule: Validate Before Deploy

Never deploy schema changes without validation. The wgc subgraph check command is your first line of defense.

# Always run this before deploying
wgc subgraph check my-products-subgraph \
  --schema ./schema.graphql \
  --namespace production

What Gets Checked

Breaking Changes

Will this break existing clients?

Removing fields, changing types, or modifying required arguments can break client applications.

Composition Validity

Can schemas merge successfully?

Your schema must combine properly with all other subgraphs to create a valid supergraph.

Operations Analysis

Smart safety checks

Analyzes real client usage data to determine if a “breaking” change is actually safe in practice.

Schema Linting

Enforces best practices

Validates your schema follows GraphQL federation principles and your organization’s conventions.

Failed validation = Don’t deploy

If wgc subgraph check fails, fix the issues before proceeding. A failed check means your changes could break the entire graph or existing client applications. In rare cases, you can override the check in Cosmo Studio to proceed with the deployment, for example when removing fields or types that are confirmed to be unused.
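To make this rule hard to skip, you can wrap the check in a small gate that stops the pipeline on failure. The following is a minimal sketch, not an official script: check_or_abort is a hypothetical helper name, and it assumes only that wgc subgraph check exits with a non-zero status when a check fails.

```shell
#!/bin/sh
# check_or_abort is a hypothetical helper name, not part of wgc.
# Usage: check_or_abort <subgraph> <schema-file> <namespace>
check_or_abort() {
  if wgc subgraph check "$1" --schema "$2" --namespace "$3"; then
    echo "Schema check passed for $1 in namespace $3"
  else
    echo "Schema check failed for $1: fix the reported issues before deploying" >&2
    return 1
  fi
}

# Example: continue with your own deployment step only when the check passes.
# check_or_abort my-products-subgraph ./schema.graphql production && ./deploy.sh
```

Because the function returns a non-zero status on failure, chaining it with && keeps every later step from running.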

Environment Isolation

Use separate namespaces for complete isolation between environments:

dev namespace
├── Products Subgraph
├── Users Subgraph  
├── Orders Subgraph
└── Federated Graph

The dev namespace is for trying new features and schema changes without risk.


Your First Safe Deployment

Let’s walk through a basic deployment that follows safety best practices.

Step-by-Step Process

Step 1: Validate Your Schema

Before touching any infrastructure

# Check against your target environment
wgc subgraph check my-products-subgraph \
  --schema ./schema.graphql \
  --namespace development

Only proceed if this passes without errors.

Step 2: Deploy Your Application Code

Deploy the subgraph service first

# Deploy to your infrastructure (example: Kubernetes)
kubectl apply -f k8s/development/
kubectl rollout status deployment/my-products-subgraph-dev

Critical: Verify your service is healthy before the next step.

Step 3: Publish Your Schema

Only after the service is running

# Verify service health first.
curl -f http://my-products-subgraph-dev.internal/health

# Then publish the schema
wgc subgraph publish my-products-subgraph \
  --schema ./schema.graphql \
  --namespace development

Step 4: Verify Everything Works

Test the complete integration

Query your router to ensure the new schema is active and working correctly.
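One way to run this verification from a script is a small smoke test against the router. This is a sketch under assumptions: router_smoke_test is a hypothetical helper, the URL and query in the example are placeholders for your own router endpoint and a field from your new schema, and the error detection is a deliberately crude substring match.

```shell
#!/bin/sh
# router_smoke_test is a hypothetical helper; the URL and query are placeholders.
# Usage: router_smoke_test <router-graphql-url> <graphql-query>
# Succeeds only if the HTTP call works and the response contains no "errors" key.
router_smoke_test() {
  # Note: the query is embedded as-is, so it must not contain double quotes.
  response=$(curl -sf -X POST "$1" \
    -H 'Content-Type: application/json' \
    -d "{\"query\":\"$2\"}") || return 1
  case "$response" in
    *'"errors"'*) echo "Router returned errors: $response" >&2; return 1 ;;
    *'"data"'*)   echo "Router answered the query successfully" ;;
    *)            echo "Unexpected response: $response" >&2; return 1 ;;
  esac
}

# Example (placeholder URL and query):
# router_smoke_test http://localhost:3002/graphql '{ __typename }'
```

Querying a field that only exists in the new schema is a stronger signal than __typename, since it fails until the router has loaded the updated configuration.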

Why This Order Matters

Deploy Code → Publish Schema (Never the reverse)

If you publish the schema first, the router will try to send queries to resolvers that don’t exist yet. This causes immediate errors for your users.
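The ordering rule can be captured in a single function where each step runs only if the previous one succeeded. Treat this as a sketch rather than a finished script: safe_deploy is a hypothetical wrapper, all arguments are placeholders, and it assumes the Kubernetes Deployment has the same name as the subgraph.

```shell
#!/bin/sh
# safe_deploy is a hypothetical wrapper; names, paths, and URLs are placeholders.
# Assumption: the Kubernetes Deployment is named after the subgraph.
# Usage: safe_deploy <subgraph> <schema-file> <namespace> <manifest-dir> <health-url>
safe_deploy() {
  # 1. Validate the schema before touching any infrastructure.
  wgc subgraph check "$1" --schema "$2" --namespace "$3" || return 1
  # 2. Deploy the service code and wait for the rollout to finish.
  kubectl apply -f "$4" || return 1
  kubectl rollout status "deployment/$1" --timeout=300s || return 1
  # 3. Confirm the new version is healthy.
  curl -f "$5" || return 1
  # 4. Only now publish the schema, so the router never sees it too early.
  wgc subgraph publish "$1" --schema "$2" --namespace "$3"
}

# Example:
# safe_deploy my-products-subgraph ./schema.graphql development \
#   k8s/development/ http://my-products-subgraph-dev.internal/health
```

If any step fails, the function returns early and the schema is never published, which is exactly the failure mode you want.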


Router Configuration Strategies

The router needs to know about your schema changes. You have two main approaches:

Dynamic Configuration

Router fetches automatically

The router polls Cosmo’s CDN for the latest schema. Simple setup with automatic updates.


Best for: Most deployments, especially when you want zero-downtime schema updates.

Static Configuration

Pre-built configuration

Build the router config in CI and deploy it with your router. Full control but requires router redeployment.


Best for: Air-gapped environments or when you need strict control over when schema changes apply.

Dynamic configuration is the simpler approach for most teams.

How it works:

  1. You deploy your subgraph and publish the schema
  2. Router automatically fetches the new schema from Cosmo CDN
  3. Router gracefully reloads with the new configuration
  4. Zero downtime for schema updates
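Because the router picks up a new schema by polling, a verification step that runs right after publishing may need to retry for a short while. Below is a minimal retry sketch: retry_until is a hypothetical helper, and my_schema_probe in the example stands for whatever command you use to confirm that the new schema is being served.

```shell
#!/bin/sh
# retry_until is a hypothetical helper: it re-runs a command until it succeeds.
# Usage: retry_until <max-attempts> <delay-seconds> <command> [args...]
retry_until() {
  attempts_left=$1
  delay=$2
  shift 2
  while [ "$attempts_left" -gt 0 ]; do
    if "$@"; then
      return 0
    fi
    attempts_left=$((attempts_left - 1))
    # Sleep only if another attempt will follow.
    [ "$attempts_left" -gt 0 ] && sleep "$delay"
  done
  echo "Gave up waiting for: $*" >&2
  return 1
}

# Example: try up to 6 times, 10 seconds apart (my_schema_probe is a placeholder).
# retry_until 6 10 my_schema_probe
```

Choosing a delay close to the router's polling interval avoids hammering the router while still detecting the update promptly.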

Trade-offs:

  • ✅ Simple setup and configuration
  • ✅ Zero downtime schema updates
  • ✅ No need to redeploy router for schema changes
  • ✅ Automatic rollback to the last known-good configuration
  • ✅ Encourages atomic deployments (coupling schema publishing with subgraph deployment)
  • ❌ Requires network connectivity to the Cosmo CDN
  • ❌ The schema file must be available alongside the deployed subgraph, for example embedded in its image (if using post-deployment publishing)

Static Configuration (Advanced)

When you need full control:

# CI Pipeline step
- name: Build Router Config
  run: |
    # Fetch the latest valid router config for the federated graph
    wgc router fetch my-graph \
      --namespace production \
      --out router-config.json
    
    # Build router image with the embedded config
    docker build \
      --build-arg CONFIG_FILE=router-config.json \
      -t my-router:${{ github.sha }} .

Your Dockerfile should copy the config:

FROM ghcr.io/wundergraph/cosmo/router:latest

ARG CONFIG_FILE
COPY ${CONFIG_FILE} /app/config.json

ENV EXECUTION_CONFIG_FILE_PATH=/app/config.json

Trade-offs:

  • ✅ No dependency on CDN at runtime
  • ✅ Perfect for air-gapped environments
  • ❌ Must redeploy the router for every schema change
  • ❌ More complex rollback procedures

All subsequent examples in this guide follow the dynamic configuration approach for simplicity. If you’re using static configuration, you’ll need to modify the examples to include router config building and deployment steps.

Automated CI/CD Integration

Manual deployments don’t scale. Let’s automate the safety checks and deployment process.

Basic CI/CD Pipeline

name: Safe Subgraph Deployment

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      # Step 1: Validate schema safety
      - name: Schema Safety Check
        run: |
          npm install -g wgc@latest
          wgc subgraph check my-products-subgraph \
            --schema ./schema.graphql \
            --namespace production
        env:
          COSMO_API_KEY: ${{ secrets.COSMO_API_KEY }}
      
      # Step 2: Deploy application code
      - name: Deploy Service
        run: |
          kubectl apply -f k8s/production/
          kubectl rollout status deployment/my-products-subgraph --timeout=300s
      
      # Step 3: Health check
      - name: Verify Service Health
        run: |
          curl -f http://my-products-subgraph.internal/health
        timeout-minutes: 2
      
      # Step 4: Publish schema
      - name: Publish Schema
        run: |
          wgc subgraph publish my-products-subgraph \
            --schema ./schema.graphql \
            --namespace production
        env:
          COSMO_API_KEY: ${{ secrets.COSMO_API_KEY }}

Environment Promotion Strategy

Promote through environments in order:

Remember: Always deploy your service code first, verify that it’s healthy, then publish the schema.

# Development first
wgc subgraph check my-subgraph --schema ./schema.graphql --namespace dev
# Deploy service to dev environment, then:
wgc subgraph publish my-subgraph --schema ./schema.graphql --namespace dev

# Then staging  
wgc subgraph check my-subgraph --schema ./schema.graphql --namespace stage
# Deploy service to staging environment, then:
wgc subgraph publish my-subgraph --schema ./schema.graphql --namespace stage

# Finally production
wgc subgraph check my-subgraph --schema ./schema.graphql --namespace prod
# Deploy service to production environment, then:
wgc subgraph publish my-subgraph --schema ./schema.graphql --namespace prod
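The same sequence can be written as a loop that stops at the first environment where anything fails. This is a sketch under assumptions: promote_through is a hypothetical helper, and deploy_service is a placeholder you would implement yourself to deploy the service to the given environment and verify its health.

```shell
#!/bin/sh
# promote_through is a hypothetical helper. deploy_service is a placeholder for
# your own step that deploys the service and verifies that it is healthy.
# Usage: promote_through <subgraph> <schema-file> <namespace>...
promote_through() {
  subgraph=$1
  schema=$2
  shift 2
  for ns in "$@"; do
    echo "Promoting $subgraph to $ns"
    wgc subgraph check "$subgraph" --schema "$schema" --namespace "$ns" || return 1
    deploy_service "$subgraph" "$ns" || return 1
    wgc subgraph publish "$subgraph" --schema "$schema" --namespace "$ns" || return 1
  done
}

# Example:
# promote_through my-subgraph ./schema.graphql dev stage prod
```

A failure in staging leaves production untouched, because the loop returns before reaching the next namespace.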

Advanced: Canary Deployments

Once you’ve mastered basic deployments, canary releases let you deploy with even greater safety by gradually shifting traffic to the new version.

Breaking Changes: This strategy is not suitable for releasing breaking changes. In general, you should avoid breaking your production graph (e.g., by removing/renaming a field or changing a type). Fields should be marked as @deprecated instead. Always use the wgc subgraph check command to validate your schema changes before deploying to production. The output of the check command will help you understand the impact of your changes and decide if you can release them safely.

Understanding Canary in Federation

A canary deployment in federated GraphQL is more complex than with typical services because:

  • Schema dependencies: Your new subgraph version may depend on changes in other subgraphs to form a valid composition. Cosmo handles this chicken-and-egg problem safely, because the schema registry only makes the latest valid composition available to the router. You still need to track the dependencies between subgraphs and deploy every subgraph involved before the new composition can succeed.

Safe Canary Strategy

The safest approach uses separate environments for canary subgraph deployments:

Production Environment (90% traffic)
├── Products Subgraph v1.0
├── Users Subgraph v2.1
└── Orders Subgraph v1.5

Canary Environment (10% traffic)  
├── Products Subgraph v1.1  ← New version being tested
├── Users Subgraph v2.1
└── Orders Subgraph v1.5

Benefits:

  • Complete isolation between subgraph versions
  • Easy rollback by routing traffic away from the canary environment or by rolling the subgraph back to the previous version
  • Test new subgraph versions without affecting the entire federation

Implementing with Argo Rollouts

Here’s the reusable AnalysisTemplate that publishes the subgraph schema from within a rollout:

# This resource defines the reusable job for publishing the subgraph schema.
# It is defined once and can be referenced by multiple Rollouts.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: publish-subgraph-schema
spec:
  # These arguments are supplied by the Rollout when the analysis runs.
  args:
    - name: image
    - name: namespace
    - name: subgraph-name
  metrics:
    - name: publish-schema
      provider:
        job:
          spec:
            backoffLimit: 1
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: publisher
                    # The image uses the specific tag of the version that was just promoted.
                    image: "{{args.image}}"
                    command: ["/bin/sh", "-c"]
                    args:
                      - |
                        # Best Practice: The 'wgc' CLI should be pre-installed in the Docker image.
                        # This command assumes 'wgc' is already in the PATH.
                        wgc subgraph publish {{args.subgraph-name}} \
                          --namespace {{args.namespace}} \
                          --schema /path/to/my/schema.graphql
                    env:
                      - name: COSMO_API_KEY
                        valueFrom:
                          secretKeyRef:
                            name: cosmo-secrets
                            key: COSMO_API_KEY

Automated Rollback

If the postPromotionAnalysis step fails (for example, the health check or publish step fails), Argo Rollouts will automatically roll back the deployment to the previous stable version. This ensures that:

  • Your subgraph code reverts to the last known good version
  • The supergraph schema remains consistent and valid
  • Client applications continue to work without interruption

This automatic rollback mechanism protects your federation from broken deployments while maintaining the integrity of your overall graph.

Manual Validation Control: You can configure the rollout to pause at any step and perform complex health checks before proceeding. Use kubectl argo rollouts promote my-products-subgraph to continue with the post-promotion analysis. This gives you full control over when traffic shifts to the new version.

Timing Consideration: When using dynamic configuration, allow at least 20 seconds for the initial canary evaluation. The router polls for schema updates every 10 seconds by default (configurable via poll_interval), so you need to account for this delay when the new schema is published. The example respects this already.
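If you change poll_interval, the evaluation delay should change with it. The sketch below reads the 20-second figure as two polling intervals; that reading is an assumption made for illustration, not a documented formula, and min_canary_wait is a hypothetical helper.

```shell
#!/bin/sh
# min_canary_wait is a hypothetical helper. The factor of two (one interval for
# the router to pick up the publish, one as a safety margin) is an assumption.
# Usage: min_canary_wait <poll-interval-seconds>
min_canary_wait() {
  echo $(( $1 * 2 ))
}

# Example: with the default 10-second interval this prints 20.
# min_canary_wait 10
```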


Monitoring and Observability

You can’t manage what you can’t measure. Proper monitoring is essential for safe deployments.

Essential Metrics

Error Rates

Track failures across the graph

Monitor both GraphQL errors and HTTP errors at the router and subgraph levels.

Latency

Measure performance impact

Track P50, P95, and P99 latencies to detect performance regressions.

Schema Usage

Understand client behavior

See which fields are used to make safe deprecation decisions.

Composition Health

Monitor graph integrity

Track schema composition success and supergraph generation.
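To make the latency percentiles above concrete, here is a small sketch that computes them from raw samples using the nearest-rank method. It is only an illustration: percentile is a hypothetical helper, it expects an integer percentile and one latency value per line on standard input, and in practice your metrics backend would compute these figures for you.

```shell
#!/bin/sh
# percentile is a hypothetical helper using the nearest-rank method.
# Usage: percentile <p> < file-with-one-latency-per-line
percentile() {
  sort -n | awk -v p="$1" '
    { values[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print values[rank]
    }'
}

# Example with ten latency samples in milliseconds:
# printf '%s\n' 120 80 100 400 90 110 95 105 85 1000 | percentile 50
```

With only ten samples, P95 and P99 both land on the slowest request, which is why tail percentiles are only meaningful over a reasonably large window of traffic.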

Setting Up Observability

Router observability: The Cosmo Router exports comprehensive metrics automatically through OpenTelemetry and Prometheus endpoints.

# Router metrics are available at:
curl http://router:8088/metrics
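For a quick look at a single metric without a full monitoring stack, you can total its series straight from the scraped text. This is a rough sketch: sum_metric is a hypothetical helper, the metric name in the example is a placeholder rather than a confirmed router metric, and the parsing assumes the Prometheus text format without trailing timestamps.

```shell
#!/bin/sh
# sum_metric is a hypothetical helper; the metric name you pass is up to you.
# It adds up the values of every series of one metric in Prometheus text format.
# Assumes samples carry no trailing timestamp, so the last field is the value.
# Usage: curl -s http://router:8088/metrics | sum_metric <metric-name>
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                                  # skip HELP and TYPE comments
    index($0, name "{") == 1 || $1 == name { sum += $NF }
    END { print sum + 0 }'
}

# Example (placeholder metric name):
# curl -s http://router:8088/metrics | sum_metric my_requests_total
```

Matching on the exact name followed by a brace keeps similarly prefixed metrics out of the total.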

Subgraph instrumentation: Each subgraph should be instrumented with OpenTelemetry to provide end-to-end observability across your federation. Subgraphs can:

  • Export metrics and traces to your observability stack
  • Push telemetry data directly to Cosmo for centralized monitoring
  • Provide detailed resolver-level performance insights

The specific instrumentation approach depends on your subgraph’s technology stack (Node.js, Python, Go, etc.). Refer to the OpenTelemetry documentation for language-specific setup guides.


Best Practices Summary

Schema First

Always validate before deploy

Use wgc subgraph check on every change. Never skip validation.

Code Then Schema

Deploy in the right order

Deploy service code first, verify health, then publish schema.

Environment Isolation

Use separate namespaces

Keep dev, staging, and production completely isolated.

Automate Safety

Build checks into CI/CD

Make safety checks automatic, not manual processes.

Monitor Everything

Comprehensive observability

Track errors, latency, and schema usage across the entire graph.

Plan for Problems

Prepare for issues

Have rollback procedures and incident response plans ready.


What’s Next?

You now have the foundation for safely deploying federated GraphQL. As you gain experience:

  1. Experiment with advanced features like feature flags
  2. Optimize your monitoring and alerting based on real usage patterns
  3. Refine your canary deployment strategy for your specific needs
  4. Share your learnings with other teams adopting federation

Remember: Safety first, speed second. A robust deployment process might seem slower initially, but it prevents the much larger costs of production incidents and helps you move faster in the long run.

Your federated GraphQL architecture is now ready to scale safely with your business needs.