Intake Questionnaire

Q1What is the project? Describe it in a few sentences, or point me to a repo/doc.Enable Datadog Workflow automations to handle Datadog monitors related to CPU & memory spikes for AWS EKS pods. The automation will raise Pull Requests in Azure DevOps for impacted services, determine reasonable value increases based on usage patterns, connect to Jenkins for releases, run E2E tests, and either confirm stability via Slack or escalate to PagerDuty.
Q2Is this a greenfield project or are you building on an existing codebase?Greenfield
Q3What problem does this project solve? Who are the end users?CPU and memory utilization alerts for EKS pods are noisy and repetitive — this project automates the response. End users are DevOps engineers (on-call responders who currently handle these alerts manually).
Q4What does "done" look like? What's the end-state you're targeting?The workflow automation raises Pull Requests in Azure DevOps for impacted services to increase pod specs, determines reasonable value increases based on usage, connects to Jenkins to release new app versions, notifies the developer releases Slack channel, runs Jenkins E2E pipeline tests, monitors the situation for a few minutes, and either sends Slack confirmation or notifies PagerDuty for human involvement.
Q5Are there any hard deadlines or milestones?No hard deadlines
Q6How many engineers are working on this?5
Q7What are the roles on the team? (e.g., 2 backend, 1 frontend, 1 fullstack)Roles not specified — assuming generalist/fullstack team
Q8How long are your sprints? (e.g., 1 week, 2 weeks)2 weeks
Q9Do you have a known velocity from previous sprints? If yes, what is it?4 pts/dev/sprint (from Jira: 21 pts team avg, 5 team member(s))
Q10How many sprints are you targeting to complete this project?1–2 sprints
Q11What is the tech stack? (languages, frameworks, databases, infra)Infrastructure-as-Code: Terraform (all Datadog Workflow resources, secrets, IAM roles); Secrets Management: AWS Secrets Manager (API keys for Datadog, Azure DevOps PAT, Jenkins API token, Slack bot token); Source Control: Azure DevOps Git (where pod spec files live); CI/CD: Jenkins (existing pipelines); Notifications: Slack; Escalation: PagerDuty; Monitoring: Datadog
Q12Are there any existing APIs, services, or third-party integrations involved?Datadog Workflows, Azure DevOps REST API (Pull Requests), Jenkins Remote Access API, PagerDuty Events API v2
Q13Are there any architectural constraints or decisions already made? (e.g., must use microservices, must deploy to AWS)No architectural constraints specified
Q14Is there any existing documentation, PRDs, or design docs I should reference?No existing documentation to reference
Q15Does the project have an existing codebase, or is this a new build?New build (greenfield)
Q16Where is the code hosted? (GitHub, Azure DevOps, GitLab, Bitbucket, local only)Azure DevOps Git
Q17Can you share the repo URL(s)? (the agent can connect and scan the repo for context)No repo URL provided
Q18How is the repo structured? (monorepo, multi-repo, microservices, monolith)Monorepo
Q19Is there an existing CI/CD pipeline or deployment setup?CI/CD: Jenkins (existing pipelines — not creating new ones)
Q20Is there any known technical debt? (legacy code, outdated dependencies, areas needing refactoring)No known technical debt identified
Q21Are there any areas of the project you're uncertain or worried about? (technical risk, unclear requirements, dependencies on other teams)No specific risks identified
Q22Are there any known blockers or dependencies on external teams/systems?No known blockers or external dependencies
Q23Is there anything that's explicitly out of scope?Creating Jenkins or Jenkins pipelines (existing infrastructure); Creating the Slack channel (already exists); Creating the PagerDuty rota (already exists); Handling non-CPU/memory alerts (only resource utilisation monitors); Multi-cluster support (single EKS cluster for v1)
Q24How do you want stories estimated? (Fibonacci story points, T-shirt sizes, or no estimates)Fibonacci (1, 2, 3, 5, 8)
Q25Do you have a Definition of Done the team follows?unit tests + PR review + deployed to staging
Q26Do you want the output pushed to Jira, exported as Markdown, or both?Markdown
Q27Which sprint are you planning for?Sprint 107
Q28How many bank holidays fall within your planning window?1 bank holiday(s): May Day
Q29What percentage of capacity is typically lost to unplanned absences? (sick days, emergencies)10%
Q30Are any engineers currently onboarding or ramping up? If so, how many and for how many sprints?No engineers onboarding

Project Analysis

Datadog EKS Automation
Enable Datadog Workflow automations to handle CPU & memory spikes for AWS EKS pods by automatically raising Pull Requests in Azure DevOps, determining resource increases, connecting to Jenkins for releases, and either confirming stability via Slack or escalating to PagerDuty. This automates the repetitive manual response currently handled by DevOps engineers for noisy resource utilization alerts.
Greenfield2-week sprints × 2 sprints = ~4 weeks total
Target state: The workflow automation raises Pull Requests in Azure DevOps for impacted services to increase pod specs, determines reasonable value increases based on usage, connects to Jenkins to release new app versions, notifies the developer releases Slack channel, runs Jenkins E2E pipeline tests, monitors the situation for a few minutes, and either sends Slack confirmation or notifies PagerDuty for human involvement.

Goals

  • Automate response to CPU and memory utilization alerts for EKS pods
  • Reduce manual intervention for repetitive resource scaling tasks
  • Implement end-to-end workflow from alert to resolution or escalation

End Users

  • DevOps engineers
  • On-call responders

Tech Stack

  • Terraform
  • AWS Secrets Manager
  • Azure DevOps Git
  • Jenkins
  • Datadog Workflows

Integrations

  • Datadog Workflows API
  • Azure DevOps REST API
  • Jenkins Remote Access API
  • PagerDuty Events API v2
  • Slack API

Constraints

  • Workflow automation connection to Jenkins authentication and job triggering permissions
  • Mapping each Datadog service to its Azure DevOps repository
  • Workflow automation connection to Azure DevOps with PAT permissions
  • Workflow automation connection to Slack with bot token permissions
  • Resource increase calculations must include safety margins
  • Must handle cases where E2E test suites don't exist for a service

Out of Scope

  • Creating Jenkins or Jenkins pipelines
  • Creating the Slack channel
  • Creating the PagerDuty rota
  • Handling non-CPU/memory alerts
  • Multi-cluster support

Assumptions

  • No hard deadlines
  • Generalist/fullstack team roles
  • No architectural constraints specified
  • No existing documentation to reference
  • Monorepo structure
  • No known technical debt identified
  • No specific risks identified
  • No known blockers or external dependencies
  • Markdown output format
  • 10% capacity lost to unplanned absences
  • No engineers onboarding

Capacity

  • Team: 5 engineer(s), 2-week sprints × 2
  • Gross velocity: 21 pts/sprint
  • Deductions: bank holidays: 1d, unplanned: 10%, discovery: 5%
  • Net velocity: 16 pts/sprint

Features

F1 Datadog EKS Automation
HIGH
Enable Datadog Workflow automations to handle CPU & memory spikes for AWS EKS pods by automatically raising Pull Requests in Azure DevOps, determining resource increases, connecting to Jenkins for releases, and either confirming stability via Slack or escalating to PagerDuty. This automates the repetitive manual response currently handled by DevOps engineers for noisy resource utilization alerts.

User Stories

F1: Datadog EKS Automation
US-F1-001 Configure External System Integrations
HIGH
As a DevOps engineer, I want to configure authentication and permissions for Azure DevOps, Jenkins, and Slack integrations, so that the automation workflow can securely interact with all required external systems.
5 pts Infrastructure
Points rationale: Involves configuring three different external systems with different authentication mechanisms, handling credential management via AWS Secrets Manager, and validating permissions across multiple APIs.
  • Given valid PAT tokens and credentials are available When configuring Azure DevOps, Jenkins, and Slack connections Then all three systems authenticate successfully and permissions are verified
  • Given invalid or expired credentials are provided When attempting to configure system connections Then clear error messages are returned and configuration fails gracefully
  • Given partial permissions are granted on any system When testing the connection configuration Then specific missing permissions are identified and reported
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✓ Knowledge Sharing
US-F1-002 Create Service Repository Mapping System
HIGH
As a DevOps engineer, I want to establish and maintain mapping between Datadog services and their corresponding Azure DevOps repositories, so that the automation can correctly identify which repository to target for each service's resource updates.
3 pts Backend
Points rationale: Requires designing a mapping data structure, implementing lookup logic, and handling edge cases for unmapped services. Relatively straightforward but critical for workflow routing.
  • Given a Datadog service with a known repository mapping When the automation receives an alert for that service Then the correct Azure DevOps repository is identified and accessible
  • Given a Datadog service without a repository mapping When an alert is triggered for that service Then the workflow logs the unmapped service and escalates to PagerDuty
  • Given repository mappings need to be updated When a DevOps engineer modifies the mapping configuration Then changes are validated and applied without requiring workflow restart
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F1-003 Implement Resource Calculation and PR Generation
HIGH
As a DevOps engineer, I want to automatically calculate appropriate resource increases and generate pull requests with updated configurations, so that resource scaling decisions are consistent and include proper safety margins without manual calculation.
5 pts Backend
Points rationale: Involves complex business logic for resource calculation algorithms, Terraform file parsing and modification, and Azure DevOps API integration for PR creation. Multiple technical components with interdependencies.
  • Given a CPU spike alert with current and threshold values When calculating new resource requirements Then the new CPU limit includes a safety margin and follows scaling best practices
  • Given calculated resource changes and target repository When generating the pull request Then PR contains updated Terraform configurations with proper formatting and descriptive commit messages
  • Given extremely high resource utilization that would exceed reasonable limits When calculating resource increases Then the calculation caps at maximum safe values and flags for manual review
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✓ Knowledge Sharing
US-F1-004 Orchestrate Jenkins Deployment and Monitoring
HIGH
As a DevOps engineer, I want to trigger Jenkins jobs for merged PRs and monitor deployment success or failure, so that resource changes are automatically deployed and their success is tracked without manual intervention.
3 pts Backend
Points rationale: Straightforward Jenkins API integration for job triggering and status monitoring, but requires handling various job states and timeout scenarios.
  • Given a merged pull request with resource changes When triggering the corresponding Jenkins deployment job Then the job starts successfully and deployment progress is monitored
  • Given a Jenkins job that completes successfully When monitoring the deployment Then the workflow proceeds to stability verification and Slack notification
  • Given a Jenkins job that fails or times out When monitoring the deployment Then the failure is logged and the incident is escalated to PagerDuty with relevant details
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F1-005 Implement Notification and Escalation Logic
HIGH
As a on-call responder, I want to receive appropriate notifications for successful resolutions or escalated incidents requiring manual intervention, so that I'm informed of automation outcomes and can focus attention on cases that truly need human intervention.
2 pts Backend
Points rationale: Involves integrating with Slack and PagerDuty APIs for notifications, with straightforward conditional logic for success vs failure scenarios.
  • Given a successful deployment with stable resource metrics When the automation completes the full workflow Then a summary notification is sent to Slack with resolution details
  • Given a deployment failure or continued resource issues after scaling When the automation detects the failure condition Then an incident is created in PagerDuty with context about attempted resolution
  • Given a service without E2E tests or other validation mechanisms When deployment completes but stability cannot be verified Then a Slack notification requests manual verification with deployment details
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing

Tasks

IDLabelTitleDescription
US-F1-001 — As a DevOps engineer, I want to configure authentication and permissions for Azure DevOps, Jenkins, and Slack integrations, so that the automation workflow can securely interact with all required external systems.
T-US-F1-001-01 Infrastructure Create Terraform configuration for AWS Secrets Manager integration Implement Terraform modules to provision AWS Secrets Manager secrets for storing PAT tokens and credentials for Azure DevOps, Jenkins, and Slack. Create variables.tf, main.tf, and outputs.tf files with proper IAM policies for secret access. Include secret rotation configuration and KMS encryption.
Test plan: Unit test: terraform plan validates without errors, all required variables are defined. Integration test: terraform apply creates secrets with correct IAM policies, secrets are accessible by intended services. Test secret rotation configuration works correctly.
AI prompt: You are a DevOps engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create Terraform configuration files for AWS Secrets Manager to store authentication credentials for Azure DevOps, Jenkins, and Slack integrations. Include proper IAM policies for secret access, KMS encryption, and secret rotation. The secrets will be used by Datadog Workflows to authenticate with external systems for automated resource scaling.
T-US-F1-001-02 Code Implement credential validation service for external system connections Create a validation service that tests connectivity and permissions for Azure DevOps (repository access, PR creation), Jenkins (job triggering, status monitoring), and Slack (message posting). Implement retry logic, error handling, and detailed permission checking. Store validation results and provide clear error messages for missing permissions.
Test plan: Unit test: validation functions return correct status for valid/invalid credentials, error messages are descriptive. Integration test: service correctly identifies missing permissions, retry logic works for transient failures. Test graceful handling of expired tokens and network timeouts.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a credential validation service that tests authentication and permissions for Azure DevOps, Jenkins, and Slack APIs. Include specific permission checks (repo access, PR creation, job triggering), retry logic for transient failures, and detailed error reporting. The service will be called during workflow configuration to ensure all external systems are properly accessible.
T-US-F1-001-03 Infrastructure Configure Datadog Workflow authentication with external systems Set up Datadog Workflow connections to Azure DevOps, Jenkins, and Slack using credentials from AWS Secrets Manager. Configure workflow variables, connection testing, and error handling. Implement secure credential retrieval and caching mechanisms within Datadog Workflows.
Test plan: Integration test: Datadog Workflow successfully authenticates with all three external systems, credentials are retrieved from AWS Secrets Manager correctly. Test connection failure scenarios and error handling. Verify credential caching works and doesn't expose sensitive data in logs.
AI prompt: You are a DevOps engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Configure Datadog Workflow connections to integrate with Azure DevOps, Jenkins, and Slack using credentials stored in AWS Secrets Manager. Set up secure credential retrieval, connection testing, and error handling within the workflow. The connections will be used for automated PR creation, deployment triggering, and notifications.
T-US-F1-001-04 Code Implement connection monitoring and health checks Create monitoring system to periodically verify connectivity and permissions for all external integrations. Implement health check endpoints, alerting for connection failures, and automatic credential refresh when possible. Include dashboard for connection status visibility.
Test plan: Unit test: health check functions correctly identify connection issues, alerting triggers at appropriate thresholds. Integration test: monitoring detects actual connection failures, automatic credential refresh works for supported systems. Test dashboard displays accurate connection status.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create a monitoring system for external system connections (Azure DevOps, Jenkins, Slack) with periodic health checks, alerting for failures, and automatic credential refresh capabilities. Include a status dashboard for visibility. The system should proactively detect authentication issues before they impact the automation workflow.
T-US-F1-001-05 Documentation Document authentication and permissions configuration Document the complete authentication setup process including AWS Secrets Manager configuration, required permissions for each external system (Azure DevOps repository access and PR creation, Jenkins job triggering and monitoring, Slack channel posting), credential rotation procedures, troubleshooting common authentication issues, and security best practices. Include setup instructions and permission verification steps.
AI prompt: You are a technical writer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create comprehensive documentation for authentication and permissions configuration covering AWS Secrets Manager setup, required permissions for Azure DevOps/Jenkins/Slack integrations, credential rotation procedures, and troubleshooting guides. Include step-by-step setup instructions and security best practices for DevOps engineers configuring the system.
US-F1-002 — As a DevOps engineer, I want to establish and maintain mapping between Datadog services and their corresponding Azure DevOps repositories, so that the automation can correctly identify which repository to target for each service's resource updates.
T-US-F1-002-01 Code Create service-to-repository mapping configuration system Implement a configuration management system for mapping Datadog services to Azure DevOps repositories. Create JSON/YAML configuration files with service names, repository URLs, branch strategies, and file paths for Terraform configurations. Include validation for mapping entries and configuration reload capabilities.
Test plan: Unit test: configuration parser validates mapping entries, handles malformed JSON/YAML gracefully. Integration test: mapping system correctly identifies repositories for known services, configuration reload works without service restart. Test validation catches invalid repository URLs and missing required fields.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create a configuration management system that maps Datadog service names to their corresponding Azure DevOps repositories. Include configuration file format (JSON/YAML), validation logic, and hot-reload capabilities. The mapping will be used by automation workflows to determine which repository to update when scaling resources for specific services.
T-US-F1-002-02 Code Implement repository lookup and validation service Create a service that resolves Datadog service names to repository information and validates repository accessibility. Include caching for performance, fallback mechanisms for unmapped services, and integration with PagerDuty for escalation. Implement repository metadata retrieval (branch info, file structure validation).
Test plan: Unit test: lookup service returns correct repository info for mapped services, handles unmapped services appropriately. Integration test: repository accessibility validation works, caching improves performance, PagerDuty escalation triggers for unmapped services. Test fallback mechanisms and error handling.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a repository lookup service that resolves Datadog service names to Azure DevOps repository information, validates repository access, and escalates unmapped services to PagerDuty. Include caching for performance and repository metadata validation. The service will be called when processing Datadog alerts to determine the target repository for resource updates.
T-US-F1-002-03 Documentation Document service-to-repository mapping configuration Document the mapping configuration system including configuration file format and schema, process for adding new service mappings, repository validation requirements, troubleshooting unmapped services, and escalation procedures. Include examples of mapping configurations and best practices for maintaining the mapping database.
AI prompt: You are a technical writer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Document the service-to-repository mapping system including configuration file format, process for adding/updating mappings, validation requirements, and troubleshooting procedures. Include practical examples and best practices for DevOps engineers managing service mappings and handling unmapped services.
US-F1-003 — As a DevOps engineer, I want to automatically calculate appropriate resource increases and generate pull requests with updated configurations, so that resource scaling decisions are consistent and include proper safety margins without manual calculation.
T-US-F1-003-01 Code Implement resource calculation engine for CPU and memory scaling Create algorithms to calculate appropriate resource increases based on current usage, thresholds, and historical patterns. Implement safety margin calculations, scaling best practices (e.g., incremental increases, maximum limits), and support for both CPU and memory resources. Include configuration for scaling policies and maximum safe limits.
Test plan: Unit test: calculation engine produces reasonable resource increases with safety margins, respects maximum limits, handles edge cases like extremely high utilization. Integration test: calculations work with real Datadog metrics, scaling policies are applied correctly. Test boundary conditions and error handling for invalid inputs.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a resource calculation engine that determines appropriate CPU and memory increases based on Datadog alert data, current usage patterns, and scaling best practices. Include safety margins, maximum limits, and configurable scaling policies. The engine will process Datadog monitor alerts and calculate new resource requirements for EKS pods.
T-US-F1-003-02 Code Create Terraform configuration file generator Implement service to generate updated Terraform configuration files with new resource limits. Parse existing Terraform files, update CPU/memory values, maintain proper formatting and comments. Support different Terraform resource types (deployments, pods, containers) and handle complex nested configurations.
Test plan: Unit test: generator correctly parses and updates Terraform files, maintains formatting and comments, handles various resource types. Integration test: generated files are valid Terraform syntax, updates are applied to correct resource blocks. Test error handling for malformed Terraform files and unsupported resource types.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create a Terraform configuration file generator that parses existing Terraform files and updates CPU/memory resource limits while preserving formatting and structure. Support various Kubernetes resource types and handle complex nested configurations. The generator will create updated Terraform files for pull request creation.
T-US-F1-003-03 Code Implement Azure DevOps pull request creation service Create service to generate and submit pull requests to Azure DevOps repositories with updated Terraform configurations. Include PR template generation, descriptive commit messages with scaling rationale, branch management, and reviewer assignment. Implement conflict detection and resolution strategies.
Test plan: Unit test: PR creation service generates proper commit messages and PR descriptions, handles branch creation correctly. Integration test: PRs are successfully created in Azure DevOps with correct file changes, reviewers are assigned appropriately. Test conflict detection and error handling for repository access issues.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a pull request creation service for Azure DevOps that creates branches, commits updated Terraform files, and submits PRs with descriptive messages explaining the resource scaling rationale. Include reviewer assignment and conflict handling. The service will be called after resource calculations to propose infrastructure changes.
T-US-F1-003-04 Code Implement safety checks and manual review flagging Create validation system to identify resource changes that require manual review (excessive increases, critical services, budget thresholds). Implement flagging mechanisms, approval workflows, and integration with existing change management processes. Include configurable thresholds and escalation rules.
Test plan: Unit test: safety checks correctly identify scenarios requiring manual review, thresholds are configurable and respected. Integration test: flagged changes trigger appropriate approval workflows, escalation rules work correctly. Test edge cases like extremely high resource requests and critical service identification.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement safety checks that identify resource scaling changes requiring manual review based on configurable thresholds, service criticality, and scaling magnitude. Include flagging mechanisms and integration with approval workflows. The system should prevent automatic application of potentially risky resource changes.
T-US-F1-003-05 Documentation Document resource calculation and PR generation process Document the resource calculation algorithms including scaling policies and safety margins, Terraform file generation process, pull request creation workflow, safety check thresholds and manual review criteria, and troubleshooting common issues with resource calculations. Include examples of generated PRs and configuration options for scaling policies.
AI prompt: You are a technical writer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Document the resource calculation and pull request generation process including scaling algorithms, safety margins, Terraform file updates, PR creation workflow, and manual review criteria. Include practical examples and configuration options for DevOps engineers managing automated resource scaling.
US-F1-004 — As a DevOps engineer, I want to trigger Jenkins jobs for merged PRs and monitor deployment success or failure, so that resource changes are automatically deployed and their success is tracked without manual intervention.
T-US-F1-004-01 Code Implement Jenkins job triggering service Create service to automatically trigger Jenkins deployment jobs when pull requests are merged. Implement webhook handling for Azure DevOps merge events, Jenkins API integration for job triggering, parameter passing for deployment context, and job queue management. Include retry logic for failed job starts.
Test plan: Unit test: service correctly identifies merged PRs related to resource scaling, Jenkins API calls succeed with proper parameters. Integration test: Jenkins jobs are triggered automatically on PR merge, job parameters are passed correctly. Test retry logic for Jenkins API failures and webhook reliability.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a Jenkins job triggering service that responds to Azure DevOps PR merge events and automatically starts corresponding deployment jobs. Include webhook handling, Jenkins API integration, parameter passing, and retry logic. The service will initiate deployments for approved resource scaling changes.
T-US-F1-004-02 Code Create deployment monitoring and status tracking system Implement system to monitor Jenkins job progress, track deployment status, and detect success/failure conditions. Include real-time status updates, timeout handling, log aggregation, and integration with deployment verification. Support multiple concurrent deployments and status correlation with original alerts.
Test plan: Unit test: monitoring system correctly tracks job status changes, timeout detection works properly. Integration test: system monitors actual Jenkins deployments, status updates are accurate and timely. Test handling of multiple concurrent deployments and correlation with original scaling requests.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create a deployment monitoring system that tracks Jenkins job progress, detects success/failure conditions, and handles timeouts. Include real-time status updates and support for multiple concurrent deployments. The system will monitor resource scaling deployments and determine next steps based on deployment outcomes.
T-US-F1-004-03 Documentation Document Jenkins integration and deployment monitoring Document the Jenkins integration setup including webhook configuration for Azure DevOps, Jenkins job triggering process, deployment monitoring capabilities, timeout and retry configurations, troubleshooting deployment failures, and log analysis procedures. Include setup instructions for Jenkins job templates and monitoring dashboards.
AI prompt: You are a technical writer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Document the Jenkins integration including webhook setup, job triggering process, deployment monitoring, timeout handling, and troubleshooting procedures. Include setup instructions and configuration examples for DevOps engineers managing automated deployments triggered by resource scaling changes.
US-F1-005 — As a on-call responder, I want to receive appropriate notifications for successful resolutions or escalated incidents requiring manual intervention, so that I'm informed of automation outcomes and can focus attention on cases that truly need human intervention.
T-US-F1-005-01 Code Implement Slack notification service for workflow outcomes Create notification service to send Slack messages for successful resolutions and manual verification requests. Implement message templating, channel routing based on service/team, rich formatting with deployment details, and notification deduplication. Include integration with Slack threading for related notifications.
Test plan: Unit test: notification service generates properly formatted Slack messages, channel routing works correctly. Integration test: messages are delivered to appropriate Slack channels, rich formatting displays correctly. Test message deduplication and threading functionality.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Implement a Slack notification service that sends formatted messages for successful resource scaling resolutions and requests for manual verification. Include message templating, channel routing, rich formatting with deployment context, and deduplication. The service will keep teams informed of automation outcomes and manual intervention needs.
T-US-F1-005-02 Code Create PagerDuty incident management integration Implement PagerDuty integration for escalating deployment failures and continued resource issues. Create incident templates with context about attempted resolutions, implement severity classification, and include relevant deployment logs and metrics. Support incident updates and resolution tracking.
Test plan: Unit test: PagerDuty integration creates incidents with proper severity and context, incident templates are populated correctly. Integration test: incidents are created in PagerDuty with complete context, incident updates work properly. Test severity classification and escalation rules.
AI prompt: You are a backend engineer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Create PagerDuty integration for escalating deployment failures and persistent resource issues after scaling attempts. Include incident creation with context about attempted resolutions, severity classification, and relevant logs/metrics. The integration will ensure critical issues receive appropriate attention when automation cannot resolve them.
T-US-F1-005-03 Documentation Document notification and escalation procedures Document the notification system including Slack message formats and channel routing, PagerDuty incident creation and escalation criteria, manual verification procedures for deployments without E2E tests, troubleshooting notification delivery issues, and configuration options for notification preferences. Include examples of notification messages and escalation scenarios.
AI prompt: You are a technical writer working on Datadog EKS Automation (Terraform, AWS Secrets Manager, Azure DevOps Git, Jenkins, Datadog Workflows). Document the notification and escalation system including Slack notifications, PagerDuty incident creation, manual verification procedures, and configuration options. Include message examples and escalation criteria for on-call responders and DevOps teams managing automated resource scaling outcomes.

Sprint Plan

2 sprints • 18 total story points • 21 pts/sprint velocity

Sprint 107 16 / 16 pts
Establish foundational infrastructure and authentication for Datadog EKS automation, including service mapping and integration setup. Focus on de-risking technical unknowns and preparing core backend components.
US-F1-001US-F1-002US-F1-003US-F1-004
Sprint 2 2 / 2 pts
Sprint 2 stories
US-F1-005