Deploy GraphQL Federation with Confidence
Complete guide to deploying federated GraphQL - from schema validation and basic deployments to advanced canary strategies and production monitoring.
What You’ll Learn
This guide will teach you how to safely deploy subgraphs in a federated GraphQL architecture. We’ll start with the basics and build up to advanced deployment strategies, focusing on preventing production issues and maintaining system reliability.
- Federation Basics: Understanding the core components and how they work together
- Safety First: Schema validation and preventing breaking changes
- Deployment Strategies: From simple deployments to advanced canary releases
- Production Ready: Monitoring, rollbacks, and operational best practices
Understanding Federation Basics
Before diving into deployment strategies, let’s understand what we’re working with. GraphQL Federation allows you to split your API into independent services (subgraphs) while presenting a single, unified API to clients.
The Key Players
Why This Matters for Deployment
Unlike deploying a single service, federated GraphQL requires coordination. When you change a subgraph:
- Schema compatibility - Your changes must work with other subgraphs
- Timing matters - Deploy code first, then publish the schema
- Router updates - The router needs the new schema to route correctly
- Rollback complexity - Issues can affect the entire graph
The Foundation: Schema Safety
Most production issues in federated GraphQL come from schema problems. Before learning deployment strategies, you need to master schema safety.
The Critical Rule: Validate Before Deploy
Never deploy schema changes without validation. The wgc subgraph check command is your first line of defense.
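A check of this kind can run locally or in CI before anything is deployed. The sketch below wraps the command in a small shell function. The subgraph name `products`, the schema path, and the `production` namespace are placeholders, and it assumes `wgc` is installed and already authenticated (for example through a `COSMO_API_KEY` environment variable).

```shell
# Sketch: gate a deployment on `wgc subgraph check`.
# Assumes wgc is installed and authenticated; all names are placeholders.

# check_subgraph <subgraph> <schema-file> <namespace>
check_subgraph() {
  if wgc subgraph check "$1" --schema "$2" --namespace "$3"; then
    echo "check passed: $1 ($3)"
  else
    echo "check failed: $1 ($3), do not deploy" >&2
    return 1
  fi
}

# Example: check_subgraph products ./schema.graphql production
```

Because the function returns a non-zero status on failure, a CI job that calls it stops before any deployment step runs.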
What Gets Checked
- Breaking Changes: Will this break existing clients? Removing fields, changing types, or modifying required arguments can break client applications.
- Composition Validity: Can schemas merge successfully? Your schema must combine properly with all other subgraphs to create a valid supergraph.
- Operations Analysis: Smart safety checks. Analyzes real client usage data to determine if a “breaking” change is actually safe in practice.
- Schema Linting: Enforces best practices. Validates that your schema follows GraphQL federation principles and your organization’s conventions.
Failed validation = Don’t deploy
If wgc subgraph check fails, fix the issues before proceeding. A failed check means your changes could break the entire graph or existing client applications. In rare cases, you can override the check in Cosmo Studio to proceed with the deployment. This can be useful when removing unused fields or types.
Environment Isolation
Use separate namespaces for complete isolation between environments:
- Development: For trying new features and schema changes without risk
- Staging: Final validation before production - mirrors your live setup
- Production: Your live system serving real customers - requires maximum safety
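One way to set up the environments is to create a namespace for each. This is a minimal sketch: the namespace names are illustrative, and it assumes `wgc` is installed and authenticated.

```shell
# Sketch: create one Cosmo namespace per environment.
# Namespace names are placeholders; pick the ones your team uses.

# create_namespaces <namespace>...
create_namespaces() {
  for ns in "$@"; do
    wgc namespace create "$ns" || return 1
  done
}

# Example: create_namespaces development staging production
```

Subgraph checks and publishes can then target a single environment by passing the matching namespace.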
Your First Safe Deployment
Let’s walk through a basic deployment that follows safety best practices.
Step-by-Step Process
- Validate Your Schema: Before touching any infrastructure. Only proceed if this passes without errors.
- Deploy Your Application Code: Deploy the subgraph service first. Critical: Verify your service is healthy before the next step.
- Publish Your Schema: Only after the service is running.
- Verify Everything Works: Test the complete integration. Query your router to ensure the new schema is active and working correctly.
Why This Order Matters
Deploy Code → Publish Schema (Never the reverse)
If you publish the schema first, the router will try to send queries to resolvers that don’t exist yet. This causes immediate errors for your users.
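The order above (check, deploy code, confirm health, publish, verify) can be sketched as a single shell function. Several parts are assumptions for illustration: `deploy_service` stands in for your own deployment command, and `HEALTH_URL` and `ROUTER_URL` are hypothetical endpoints for the subgraph health check and the router's GraphQL endpoint.

```shell
# Sketch of the deploy order: check -> deploy code -> health -> publish -> verify.
# deploy_service, HEALTH_URL and ROUTER_URL are placeholders, not part of wgc.

deploy_service() {
  # Replace with your real deployment (kubectl, helm, and so on).
  echo "deploying $1"
}

# wait_healthy <url> <attempts>: poll until the URL answers successfully.
wait_healthy() {
  attempts="$2"
  while [ "$attempts" -gt 0 ]; do
    if curl --fail --silent --output /dev/null "$1"; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "${HEALTH_INTERVAL:-2}"
  done
  return 1
}

# safe_deploy <subgraph> <schema-file> <namespace>
safe_deploy() {
  wgc subgraph check "$1" --schema "$2" --namespace "$3" || return 1
  deploy_service "$1" || return 1
  wait_healthy "$HEALTH_URL" 30 || return 1
  # Publish only after the new code is running and healthy.
  wgc subgraph publish "$1" --schema "$2" --namespace "$3" || return 1
  # Smoke test through the router; { __typename } is valid against any schema.
  curl --fail --silent -H 'Content-Type: application/json' \
    --data '{"query":"{ __typename }"}' "$ROUTER_URL" | grep -q '"data"'
}

# Example: safe_deploy products ./schema.graphql production
```

The smoke test only proves that the router answers. A query that exercises the newly added fields is a stronger final check.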
Router Configuration Strategies
The router needs to know about your schema changes. You have two main approaches:
- Dynamic Configuration: Router fetches automatically. The router polls Cosmo’s CDN for the latest schema. Simple setup with automatic updates. Best for: Most deployments, especially when you want zero-downtime schema updates.
- Static Configuration: Pre-built configuration. Build the router config in CI and deploy it with your router. Full control but requires router redeployment. Best for: Air-gapped environments or when you need strict control over when schema changes apply.
Dynamic Configuration (Recommended)
This is the simpler approach for most teams.
How it works:
- You deploy your subgraph and publish the schema
- Router automatically fetches the new schema from Cosmo CDN
- Router gracefully reloads with the new configuration
- Zero downtime for schema updates
Trade-offs:
- ✅ Simple setup and configuration
- ✅ Zero downtime schema updates
- ✅ No need to redeploy router for schema changes
- ✅ Automatic rollback to the last known-good configuration
- ✅ Encourages atomic deployments (coupling schema publication with subgraph deployment)
- ❌ Requires internet connectivity to Cosmo CDN
- ❌ The schema file must ship with the subgraph (for example, embedded in its image) if you publish after deployment
Static Configuration (Advanced)
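When you need full control, build the router configuration in CI and deploy it together with the router.
Your Dockerfile should copy the generated config into the router image.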
Trade-offs:
- ✅ No dependency on CDN at runtime
- ✅ Perfect for air-gapped environments
- ❌ Must redeploy the router for every schema change
- ❌ More complex rollback procedures
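A CI step for the static approach might look like the sketch below. It is hedged on several points: the flags for `wgc router fetch` are written as I understand them and are worth confirming with `wgc router fetch --help`, the graph name and image tag are placeholders, and the Dockerfile in the working directory is assumed to copy `router.json` into the image.

```shell
# Sketch: fetch the latest valid router config and bake it into an image.
# Flag names are assumptions to verify against `wgc router fetch --help`.

# build_router_image <graph> <namespace> <image-tag>
build_router_image() {
  wgc router fetch "$1" --namespace "$2" --out ./router.json || return 1
  # The Dockerfile here is expected to COPY router.json into the image.
  docker build --tag "$3" .
}

# Example: build_router_image mygraph production registry.example.com/router:2024-01-01
```

Tagging each image with a unique version makes rollback a matter of redeploying the previous tag.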
All subsequent examples in this guide follow the dynamic configuration approach for simplicity. If you’re using static configuration, you’ll need to modify the examples to include router config building and deployment steps.
Automated CI/CD Integration
Manual deployments don’t scale. Let’s automate the safety checks and deployment process.
Basic CI/CD Pipeline
Environment Promotion Strategy
Promote changes through environments in order: development, then staging, then production.
Remember: Always deploy your service code first, verify that it’s healthy, then publish the schema.
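Ordered promotion can be sketched as a loop that stops at the first failing environment. Here `deploy_to` is a hypothetical function of your own that checks, deploys, and publishes for one namespace. It is not part of any CLI.

```shell
# Sketch: promote through environments in order, stopping on the first failure.

# promote_in_order <command> <environment>...
# Runs `<command> <environment>` for each environment in the order given.
promote_in_order() {
  step="$1"
  shift
  for target in "$@"; do
    echo "promoting to $target"
    "$step" "$target" || { echo "stopped at $target" >&2; return 1; }
  done
}

# Example: promote_in_order deploy_to development staging production
```

Because the loop returns as soon as one environment fails, a problem caught in staging never reaches production.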
Advanced: Canary Deployments
Once you’ve mastered basic deployments, canary releases let you deploy with even greater safety by gradually shifting traffic to the new version.
Breaking Changes: This strategy is not suitable for releasing breaking changes. In general, you should avoid breaking your production graph (e.g., by removing/renaming a field or changing a type). Fields should be marked as @deprecated instead. Always use the wgc subgraph check command to validate your schema changes before deploying to production. The output of the check command will help you understand the impact of your changes and decide if you can release them safely.
Understanding Canary in Federation
A canary deployment in federated GraphQL is more complex than with typical services because:
- Schema dependencies: Your new subgraph version may require other subgraphs to form a valid composition. This is a classic chicken-and-egg problem, but it is safely solved with Cosmo because the schema registry ensures that only the latest valid composition is made available to the router. However, you need to be aware of the dependencies between subgraphs and ensure that all subgraphs are deployed to resolve the composition.
Safe Canary Strategy
The safest approach uses separate environments for canary subgraph deployments.
Benefits:
- Complete isolation between subgraph versions
- Easy rollback by routing traffic away from canary environment or rolling the entire subgraph back to the previous version
- Test new subgraph versions without affecting the entire federation
Implementing with Argo Rollouts
Here’s a production-ready canary setup:
Automated Rollback
If the postPromotionAnalysis step fails (for example, the health check or publish step fails), Argo Rollouts will automatically roll back the deployment to the previous stable version. This ensures that:
- Your subgraph code reverts to the last known good version
- The supergraph schema remains consistent and valid
- Client applications continue to work without interruption
This automatic rollback mechanism protects your federation from broken deployments while maintaining the integrity of your overall graph.
Manual Validation Control: You can configure the rollout to pause at any step and perform complex health checks before proceeding. Use kubectl argo rollouts promote my-products-subgraph to continue with the post-promotion analysis. This gives you full control over when traffic shifts to the new version.
Timing Consideration: When using dynamic configuration, allow at least 20 seconds for the initial canary evaluation. The router polls for schema updates every 10 seconds by default (configurable via poll_interval), so you need to account for this delay when the new schema is published. The example respects this already.
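A manual gate that respects this delay can be sketched as follows. It assumes the Argo Rollouts kubectl plugin is installed. The health URL is a hypothetical endpoint, and `CANARY_WAIT_SECONDS` is an illustrative variable that defaults to the 20 seconds mentioned above.

```shell
# Sketch: wait for the router to pick up the new schema, then promote or abort.
# Requires the Argo Rollouts kubectl plugin; the health URL is a placeholder.

# gate_canary <rollout-name> <health-url>
gate_canary() {
  # Give the router time to poll and load the newly published schema.
  sleep "${CANARY_WAIT_SECONDS:-20}"
  if curl --fail --silent --output /dev/null "$2"; then
    kubectl argo rollouts promote "$1"
  else
    kubectl argo rollouts abort "$1"
    return 1
  fi
}

# Example: gate_canary my-products-subgraph https://canary.example.com/health
```

If you raise the router's poll interval, raise the wait accordingly so the evaluation never runs against the old schema.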
Monitoring and Observability
You can’t manage what you can’t measure. Proper monitoring is essential for safe deployments.
Essential Metrics
- Error Rates: Track failures across the graph. Monitor both GraphQL errors and HTTP errors at the router and subgraph levels.
- Latency: Measure performance impact. Track P50, P95, and P99 latencies to detect performance regressions.
- Schema Usage: Understand client behavior. See which fields are used to make safe deprecation decisions.
- Composition Health: Monitor graph integrity. Track schema composition success and supergraph generation.
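As an illustration of turning an error-rate metric into a deployment gate, the sketch below compares a failure ratio against a threshold using integer arithmetic. The request counts are assumed to come from your own metrics queries. The function and its arguments are hypothetical and not tied to any specific metric name.

```shell
# Sketch: decide whether an error rate is above an acceptable threshold.

# error_rate_exceeds <total-requests> <failed-requests> <max-percent>
# Succeeds (exit 0) when failed/total is strictly above max-percent.
error_rate_exceeds() {
  [ "$1" -gt 0 ] || return 1   # no traffic yet: nothing to judge
  [ $(( $2 * 100 )) -gt $(( $1 * $3 )) ]
}

# Example: roll back when more than 2% of requests failed.
# if error_rate_exceeds "$total" "$failed" 2; then echo "roll back"; fi
```

Cross-multiplying avoids floating-point math, which plain shell arithmetic does not support.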
Setting Up Observability
Router observability: The Cosmo Router exports comprehensive metrics automatically through OpenTelemetry and Prometheus endpoints. For detailed setup and best practices, see:
- Router Metrics & Monitoring - Complete metrics reference and configuration
- OpenTelemetry Setup - How to configure OTEL collectors
- Custom Attributes - Adding custom telemetry data
Subgraph instrumentation: Each subgraph should be instrumented with OpenTelemetry to provide end-to-end observability across your federation. Subgraphs can:
- Export metrics and traces to your observability stack
- Push telemetry data directly to Cosmo for centralized monitoring
- Provide detailed resolver-level performance insights
The specific instrumentation approach depends on your subgraph’s technology stack (Node.js, Python, Go, etc.). Refer to the OpenTelemetry documentation for language-specific setup guides.
Best Practices Summary
- Schema First: Always validate before deploy. Use wgc subgraph check on every change. Never skip validation.
- Code Then Schema: Deploy in the right order. Deploy service code first, verify health, then publish schema.
- Environment Isolation: Use separate namespaces. Keep dev, staging, and production completely isolated.
- Automate Safety: Build checks into CI/CD. Make safety checks automatic, not manual processes.
- Monitor Everything: Comprehensive observability. Track errors, latency, and schema usage across the entire graph.
- Plan for Problems: Prepare for issues. Have rollback procedures and incident response plans ready.
What’s Next?
You now have the foundation for safely deploying federated GraphQL. As you gain experience:
- Experiment with advanced features like feature flags
- Optimize your monitoring and alerting based on real usage patterns
- Refine your canary deployment strategy for your specific needs
- Share your learnings with other teams adopting federation
Remember: Safety first, speed second. A robust deployment process might seem slower initially, but it prevents the much larger costs of production incidents and helps you move faster in the long run.
Your federated GraphQL architecture is now ready to scale safely with your business needs.