|
US-F1-001 — As a DevOps engineer, I want to configure authentication and permissions for Azure DevOps, Jenkins, and Slack integrations, so that the automation workflow can securely interact with all required external systems.
|
| T-US-F1-001-01 |
Infrastructure |
Create Terraform configuration for AWS Secrets Manager integration |
Implement Terraform modules to provision AWS Secrets Manager secrets for storing PAT tokens and credentials for Azure DevOps, Jenkins, and Slack. Create variables.tf, main.tf, and outputs.tf files with proper IAM policies for secret access. Include secret rotation configuration and KMS encryption. Test plan: Unit test: terraform plan validates without errors, all required variables are defined. Integration test: terraform apply creates secrets with correct IAM policies, secrets are accessible by intended services. Test secret rotation configuration works correctly. AI prompt: You are a DevOps engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create Terraform configuration files for AWS Secrets Manager to store authentication credentials for Azure DevOps, Jenkins, and Slack integrations. Include proper IAM policies for secret access, KMS encryption, and secret rotation. The secrets will be used by Datadog Workflows to authenticate with external systems for automated resource scaling. |
| T-US-F1-001-02 |
Code |
Implement credential validation service for external system connections |
Create a validation service that tests connectivity and permissions for Azure DevOps (repository access, PR creation), Jenkins (job triggering, status monitoring), and Slack (message posting). Implement retry logic, error handling, and detailed permission checking. Store validation results and provide clear error messages for missing permissions. Test plan: Unit test: validation functions return correct status for valid/invalid credentials, error messages are descriptive. Integration test: service correctly identifies missing permissions, retry logic works for transient failures. Test graceful handling of expired tokens and network timeouts. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a credential validation service that tests authentication and permissions for Azure DevOps, Jenkins, and Slack APIs. Include specific permission checks (repo access, PR creation, job triggering), retry logic for transient failures, and detailed error reporting. The service will be called during workflow configuration to ensure all external systems are properly accessible. |
| T-US-F1-001-03 |
Infrastructure |
Configure Datadog Workflow authentication with external systems |
Set up Datadog Workflow connections to Azure DevOps, Jenkins, and Slack using credentials from AWS Secrets Manager. Configure workflow variables, connection testing, and error handling. Implement secure credential retrieval and caching mechanisms within Datadog Workflows. Test plan: Integration test: Datadog Workflow successfully authenticates with all three external systems, credentials are retrieved from AWS Secrets Manager correctly. Test connection failure scenarios and error handling. Verify credential caching works and doesn't expose sensitive data in logs. AI prompt: You are a DevOps engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Configure Datadog Workflow connections to integrate with Azure DevOps, Jenkins, and Slack using credentials stored in AWS Secrets Manager. Set up secure credential retrieval, connection testing, and error handling within the workflow. The connections will be used for automated PR creation, deployment triggering, and notifications. |
| T-US-F1-001-04 |
Code |
Implement connection monitoring and health checks |
Create monitoring system to periodically verify connectivity and permissions for all external integrations. Implement health check endpoints, alerting for connection failures, and automatic credential refresh when possible. Include dashboard for connection status visibility. Test plan: Unit test: health check functions correctly identify connection issues, alerting triggers at appropriate thresholds. Integration test: monitoring detects actual connection failures, automatic credential refresh works for supported systems. Test dashboard displays accurate connection status. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create a monitoring system for external system connections (Azure DevOps, Jenkins, Slack) with periodic health checks, alerting for failures, and automatic credential refresh capabilities. Include a status dashboard for visibility. The system should proactively detect authentication issues before they impact the automation workflow. |
| T-US-F1-001-05 |
Documentation |
Document authentication and permissions configuration |
Document the complete authentication setup process including AWS Secrets Manager configuration, required permissions for each external system (Azure DevOps repository access and PR creation, Jenkins job triggering and monitoring, Slack channel posting), credential rotation procedures, troubleshooting common authentication issues, and security best practices. Include setup instructions and permission verification steps. AI prompt: You are a technical writer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create comprehensive documentation for authentication and permissions configuration covering AWS Secrets Manager setup, required permissions for Azure DevOps/Jenkins/Slack integrations, credential rotation procedures, and troubleshooting guides. Include step-by-step setup instructions and security best practices for DevOps engineers configuring the system. |
|
US-F1-002 — As a DevOps engineer, I want to establish and maintain mapping between Datadog services and their corresponding Azure DevOps repositories, so that the automation can correctly identify which repository to target for each service's resource updates.
|
| T-US-F1-002-01 |
Code |
Create service-to-repository mapping configuration system |
Implement a configuration management system for mapping Datadog services to Azure DevOps repositories. Create JSON/YAML configuration files with service names, repository URLs, branch strategies, and file paths for Terraform configurations. Include validation for mapping entries and configuration reload capabilities. Test plan: Unit test: configuration parser validates mapping entries, handles malformed JSON/YAML gracefully. Integration test: mapping system correctly identifies repositories for known services, configuration reload works without service restart. Test validation catches invalid repository URLs and missing required fields. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create a configuration management system that maps Datadog service names to their corresponding Azure DevOps repositories. Include configuration file format (JSON/YAML), validation logic, and hot-reload capabilities. The mapping will be used by automation workflows to determine which repository to update when scaling resources for specific services. |
| T-US-F1-002-02 |
Code |
Implement repository lookup and validation service |
Create a service that resolves Datadog service names to repository information and validates repository accessibility. Include caching for performance, fallback mechanisms for unmapped services, and integration with PagerDuty for escalation. Implement repository metadata retrieval (branch info, file structure validation). Test plan: Unit test: lookup service returns correct repository info for mapped services, handles unmapped services appropriately. Integration test: repository accessibility validation works, caching improves performance, PagerDuty escalation triggers for unmapped services. Test fallback mechanisms and error handling. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a repository lookup service that resolves Datadog service names to Azure DevOps repository information, validates repository access, and escalates unmapped services to PagerDuty. Include caching for performance and repository metadata validation. The service will be called when processing Datadog alerts to determine the target repository for resource updates. |
| T-US-F1-002-03 |
Documentation |
Document service-to-repository mapping configuration |
Document the mapping configuration system including configuration file format and schema, process for adding new service mappings, repository validation requirements, troubleshooting unmapped services, and escalation procedures. Include examples of mapping configurations and best practices for maintaining the mapping database. AI prompt: You are a technical writer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Document the service-to-repository mapping system including configuration file format, process for adding/updating mappings, validation requirements, and troubleshooting procedures. Include practical examples and best practices for DevOps engineers managing service mappings and handling unmapped services. |
|
US-F1-003 — As a DevOps engineer, I want to automatically calculate appropriate resource increases and generate pull requests with updated configurations, so that resource scaling decisions are consistent and include proper safety margins without manual calculation.
|
| T-US-F1-003-01 |
Code |
Implement resource calculation engine for CPU and memory scaling |
Create algorithms to calculate appropriate resource increases based on current usage, thresholds, and historical patterns. Implement safety margin calculations, scaling best practices (e.g., incremental increases, maximum limits), and support for both CPU and memory resources. Include configuration for scaling policies and maximum safe limits. Test plan: Unit test: calculation engine produces reasonable resource increases with safety margins, respects maximum limits, handles edge cases like extremely high utilization. Integration test: calculations work with real Datadog metrics, scaling policies are applied correctly. Test boundary conditions and error handling for invalid inputs. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a resource calculation engine that determines appropriate CPU and memory increases based on Datadog alert data, current usage patterns, and scaling best practices. Include safety margins, maximum limits, and configurable scaling policies. The engine will process Datadog monitor alerts and calculate new resource requirements for EKS pods. |
| T-US-F1-003-02 |
Code |
Create Terraform configuration file generator |
Implement service to generate updated Terraform configuration files with new resource limits. Parse existing Terraform files, update CPU/memory values, maintain proper formatting and comments. Support different Terraform resource types (deployments, pods, containers) and handle complex nested configurations. Test plan: Unit test: generator correctly parses and updates Terraform files, maintains formatting and comments, handles various resource types. Integration test: generated files are valid Terraform syntax, updates are applied to correct resource blocks. Test error handling for malformed Terraform files and unsupported resource types. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create a Terraform configuration file generator that parses existing Terraform files and updates CPU/memory resource limits while preserving formatting and structure. Support various Kubernetes resource types and handle complex nested configurations. The generator will create updated Terraform files for pull request creation. |
| T-US-F1-003-03 |
Code |
Implement Azure DevOps pull request creation service |
Create service to generate and submit pull requests to Azure DevOps repositories with updated Terraform configurations. Include PR template generation, descriptive commit messages with scaling rationale, branch management, and reviewer assignment. Implement conflict detection and resolution strategies. Test plan: Unit test: PR creation service generates proper commit messages and PR descriptions, handles branch creation correctly. Integration test: PRs are successfully created in Azure DevOps with correct file changes, reviewers are assigned appropriately. Test conflict detection and error handling for repository access issues. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a pull request creation service for Azure DevOps that creates branches, commits updated Terraform files, and submits PRs with descriptive messages explaining the resource scaling rationale. Include reviewer assignment and conflict handling. The service will be called after resource calculations to propose infrastructure changes. |
| T-US-F1-003-04 |
Code |
Implement safety checks and manual review flagging |
Create validation system to identify resource changes that require manual review (excessive increases, critical services, budget thresholds). Implement flagging mechanisms, approval workflows, and integration with existing change management processes. Include configurable thresholds and escalation rules. Test plan: Unit test: safety checks correctly identify scenarios requiring manual review, thresholds are configurable and respected. Integration test: flagged changes trigger appropriate approval workflows, escalation rules work correctly. Test edge cases like extremely high resource requests and critical service identification. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement safety checks that identify resource scaling changes requiring manual review based on configurable thresholds, service criticality, and scaling magnitude. Include flagging mechanisms and integration with approval workflows. The system should prevent automatic application of potentially risky resource changes. |
| T-US-F1-003-05 |
Documentation |
Document resource calculation and PR generation process |
Document the resource calculation algorithms including scaling policies and safety margins, Terraform file generation process, pull request creation workflow, safety check thresholds and manual review criteria, and troubleshooting common issues with resource calculations. Include examples of generated PRs and configuration options for scaling policies. AI prompt: You are a technical writer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Document the resource calculation and pull request generation process including scaling algorithms, safety margins, Terraform file updates, PR creation workflow, and manual review criteria. Include practical examples and configuration options for DevOps engineers managing automated resource scaling. |
|
US-F1-004 — As a DevOps engineer, I want to trigger Jenkins jobs for merged PRs and monitor deployment success or failure, so that resource changes are automatically deployed and their success is tracked without manual intervention.
|
| T-US-F1-004-01 |
Code |
Implement Jenkins job triggering service |
Create service to automatically trigger Jenkins deployment jobs when pull requests are merged. Implement webhook handling for Azure DevOps merge events, Jenkins API integration for job triggering, parameter passing for deployment context, and job queue management. Include retry logic for failed job starts. Test plan: Unit test: service correctly identifies merged PRs related to resource scaling, Jenkins API calls succeed with proper parameters. Integration test: Jenkins jobs are triggered automatically on PR merge, job parameters are passed correctly. Test retry logic for Jenkins API failures and webhook reliability. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a Jenkins job triggering service that responds to Azure DevOps PR merge events and automatically starts corresponding deployment jobs. Include webhook handling, Jenkins API integration, parameter passing, and retry logic. The service will initiate deployments for approved resource scaling changes. |
| T-US-F1-004-02 |
Code |
Create deployment monitoring and status tracking system |
Implement system to monitor Jenkins job progress, track deployment status, and detect success/failure conditions. Include real-time status updates, timeout handling, log aggregation, and integration with deployment verification. Support multiple concurrent deployments and status correlation with original alerts. Test plan: Unit test: monitoring system correctly tracks job status changes, timeout detection works properly. Integration test: system monitors actual Jenkins deployments, status updates are accurate and timely. Test handling of multiple concurrent deployments and correlation with original scaling requests. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create a deployment monitoring system that tracks Jenkins job progress, detects success/failure conditions, and handles timeouts. Include real-time status updates and support for multiple concurrent deployments. The system will monitor resource scaling deployments and determine next steps based on deployment outcomes. |
| T-US-F1-004-03 |
Documentation |
Document Jenkins integration and deployment monitoring |
Document the Jenkins integration setup including webhook configuration for Azure DevOps, Jenkins job triggering process, deployment monitoring capabilities, timeout and retry configurations, troubleshooting deployment failures, and log analysis procedures. Include setup instructions for Jenkins job templates and monitoring dashboards. AI prompt: You are a technical writer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Document the Jenkins integration including webhook setup, job triggering process, deployment monitoring, timeout handling, and troubleshooting procedures. Include setup instructions and configuration examples for DevOps engineers managing automated deployments triggered by resource scaling changes. |
|
US-F1-005 — As a on-call responder, I want to receive appropriate notifications for successful resolutions or escalated incidents requiring manual intervention, so that I'm informed of automation outcomes and can focus attention on cases that truly need human intervention.
|
| T-US-F1-005-01 |
Code |
Implement Slack notification service for workflow outcomes |
Create notification service to send Slack messages for successful resolutions and manual verification requests. Implement message templating, channel routing based on service/team, rich formatting with deployment details, and notification deduplication. Include integration with Slack threading for related notifications. Test plan: Unit test: notification service generates properly formatted Slack messages, channel routing works correctly. Integration test: messages are delivered to appropriate Slack channels, rich formatting displays correctly. Test message deduplication and threading functionality. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a Slack notification service that sends formatted messages for successful resource scaling resolutions and requests for manual verification. Include message templating, channel routing, rich formatting with deployment context, and deduplication. The service will keep teams informed of automation outcomes and manual intervention needs. |
| T-US-F1-005-02 |
Code |
Create PagerDuty incident management integration |
Implement PagerDuty integration for escalating deployment failures and continued resource issues. Create incident templates with context about attempted resolutions, implement severity classification, and include relevant deployment logs and metrics. Support incident updates and resolution tracking. Test plan: Unit test: PagerDuty integration creates incidents with proper severity and context, incident templates are populated correctly. Integration test: incidents are created in PagerDuty with complete context, incident updates work properly. Test severity classification and escalation rules. AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create PagerDuty integration for escalating deployment failures and persistent resource issues after scaling attempts. Include incident creation with context about attempted resolutions, severity classification, and relevant logs/metrics. The integration will ensure critical issues receive appropriate attention when automation cannot resolve them. |
| T-US-F1-005-03 |
Documentation |
Document notification and escalation procedures |
Document the notification system including Slack message formats and channel routing, PagerDuty incident creation and escalation criteria, manual verification procedures for deployments without E2E tests, troubleshooting notification delivery issues, and configuration options for notification preferences. Include examples of notification messages and escalation scenarios. AI prompt: You are a technical writer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Document the notification and escalation system including Slack notifications, PagerDuty incident creation, manual verification procedures, and configuration options. Include message examples and escalation criteria for on-call responders and DevOps teams managing automated resource scaling outcomes. |