Intake Questionnaire

Q1What is the project? Describe it in a few sentences, or point me to a repo/doc.Enable Datadog Bits AI SRE Agent to investigate P1, P2 and P3 Monitor notifications and provide support via Slack
Q2Is this a greenfield project or are you building on an existing codebase?Greenfield
Q3What problem does this project solve? Who are the end users?Provide on-call engineers (DevOps & Software engineers) the ability to use AI to debug common alerts effectively
Q4What does "done" look like? What's the end-state you're targeting?Engineers can interact with the agent via Slack, the agent has access to the Azure DevOps code and Confluence RunBooks are reviewed and for added context
Q5Are there any hard deadlines or milestones?No hard deadlines
Q6How many engineers are working on this?5
Q7What are the roles on the team? (e.g., 2 backend, 1 frontend, 1 fullstack)Roles not specified — assuming generalist/fullstack team
Q8How long are your sprints? (e.g., 1 week, 2 weeks)2 weeks
Q9Do you have a known velocity from previous sprints? If yes, what is it?4 pts/dev/sprint (from Jira: 21 pts team avg, 5 team member(s))
Q10How many sprints are you targeting to complete this project?1–2 sprints
Q11What is the tech stack? (languages, frameworks, databases, infra)Terraform (Follow-up detail: Not decided yet)
Q12Are there any existing APIs, services, or third-party integrations involved?Datadog Bits AI SRE Agent, Slack, Azure DevOps, Confluence RunBooks
Q13Are there any architectural constraints or decisions already made? (e.g., must use microservices, must deploy to AWS)No architectural constraints specified
Q14Is there any existing documentation, PRDs, or design docs I should reference?No existing documentation to reference
Q15Does the project have an existing codebase, or is this a new build?New build (greenfield)
Q16Where is the code hosted? (GitHub, Azure DevOps, GitLab, Bitbucket, local only)Azure DevOps
Q17Can you share the repo URL(s)? (the agent can connect and scan the repo for context)No repo URL provided
Q18How is the repo structured? (monorepo, multi-repo, microservices, monolith)Monorepo
Q19Is there an existing CI/CD pipeline or deployment setup?No existing CI/CD pipeline
Q20Is there any known technical debt? (legacy code, outdated dependencies, areas needing refactoring)No known technical debt identified
Q21Are there any areas of the project you're uncertain or worried about? (technical risk, unclear requirements, dependencies on other teams)No specific risks identified
Q22Are there any known blockers or dependencies on external teams/systems?No known blockers or external dependencies
Q23Is there anything that's explicitly out of scope?Technical work in Clean up service names in Datadog APM (Already done..Pending roll out)
Q24How do you want stories estimated? (Fibonacci story points, T-shirt sizes, or no estimates)Fibonacci (1, 2, 3, 5, 8, 13)
Q25Do you have a Definition of Done the team follows?unit tests + PR review + tested in Development + deployed to production
Q26Do you want the output pushed to Jira, exported as Markdown, or both?Markdown
Q27Which sprint are you planning for?Sprint 107
Q28How many bank holidays fall within your planning window?1 bank holiday(s): May Day
Q29What percentage of capacity is typically lost to unplanned absences? (sick days, emergencies)10%
Q30Are any engineers currently onboarding or ramping up? If so, how many and for how many sprints?No engineers onboarding

Project Analysis

Datadog Bits AI SRE
Enable Datadog Bits AI SRE Agent to investigate P1, P2 and P3 Monitor notifications and provide support via Slack. Engineers can interact with the agent via Slack, the agent has access to the Azure DevOps code and Confluence RunBooks are reviewed and for added context.
Greenfield2-week sprints × 2 sprints = ~4 weeks total
Target state: Engineers can interact with the agent via Slack, the agent has access to the Azure DevOps code and Confluence RunBooks are reviewed and for added context

Goals

  • Provide on-call engineers the ability to use AI to debug common alerts effectively
  • Integrate Datadog Bits SRE to all P1, P2 and P3 monitors
  • Automate Datadog Bits SRE to P1 Monitors Only
  • Enable Slack workspace integration for agent interaction
  • Integrate with Confluence RunBooks for investigation context
  • Integrate Azure DevOps APM Services using service definition

End Users

  • DevOps engineers
  • Software engineers
  • On-call engineers

Tech Stack

  • Terraform

Integrations

  • Datadog Bits AI SRE Agent
  • Slack
  • Azure DevOps
  • Confluence RunBooks
  • Datadog APM
  • Kubernetes
  • Rancher
  • Jenkins Pipeline

Constraints

  • Run Books should be updated
  • No hard deadlines

Out of Scope

  • Technical work in Clean up service names in Datadog APM (Already done..Pending roll out)

Assumptions

  • Assuming generalist/fullstack team roles
  • No architectural constraints specified
  • No existing documentation to reference
  • No repo URL provided
  • Monorepo structure assumed
  • No existing CI/CD pipeline
  • No known technical debt identified
  • No specific risks identified
  • No known blockers or external dependencies
  • 10% capacity loss to unplanned absences
  • No engineers onboarding

Capacity

  • Team: 5 engineer(s), 2-week sprints × 2
  • Gross velocity: 21 pts/sprint
  • Deductions: bank holidays: 1d, unplanned: 10%, discovery: 5%
  • Net velocity: 16 pts/sprint

Features

F1 Core AI Agent Infrastructure Setup
CRITICAL
Establish foundational Terraform infrastructure for Datadog Bits AI SRE Agent including core services, authentication, and deployment pipeline. Set up basic agent framework and configuration management.
F2 Datadog Monitor Integration and Alert Processing
CRITICAL
Integrate AI agent with Datadog P1, P2, and P3 monitors to receive and process alert notifications. Implement automated response capabilities for P1 monitors specifically.
F3 Slack Workspace Integration
HIGH
Enable bidirectional communication between engineers and AI agent through Slack workspace integration. Implement interactive commands and notification delivery mechanisms.
F4 Azure DevOps Code Access Integration
HIGH
Connect AI agent to Azure DevOps APM services using service definitions to provide code context during incident investigation. Enable repository access and code analysis capabilities.
F5 Confluence RunBooks Integration and Updates
MEDIUM
Integrate with Confluence RunBooks for investigation context and implement RunBook updates as required by project constraints. Ensure RunBooks are current and accessible to the AI agent.

User Stories

F1: Core AI Agent Infrastructure Setup
US-F1-001 Setup Core Terraform Infrastructure
CRITICAL
As a DevOps engineer, I want to provision foundational cloud resources for AI agent deployment, so that the AI agent has a secure and scalable infrastructure foundation.
5 pts Infrastructure
Points rationale: Involves setting up multiple cloud resources, networking, security groups, and initial configuration management. Moderate complexity with some unknowns around optimal resource sizing.
  • Given Terraform configuration files are created When I run terraform apply Then core infrastructure resources are provisioned successfully
  • Given Invalid Terraform configuration is provided When I run terraform validate Then validation errors are displayed with clear error messages
  • Given Infrastructure already exists When I run terraform apply again Then no duplicate resources are created and state is consistent
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F1-002 Implement Agent Authentication Framework
CRITICAL
As a DevOps engineer, I want to configure secure authentication and authorization for the AI agent, so that the agent can securely access external services and APIs.
3 pts Backend
Points rationale: Standard authentication implementation with OAuth/API keys. Well-understood patterns but requires integration with multiple services.
  • Given Authentication credentials are configured When the agent attempts to authenticate with external services Then authentication succeeds and access tokens are obtained
  • Given Invalid or expired credentials are used When the agent attempts authentication Then authentication fails gracefully with appropriate error handling
  • Given Token refresh is needed When access tokens are near expiration Then tokens are automatically refreshed without service interruption
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F1-003 Create Agent Deployment Pipeline
CRITICAL
As a DevOps engineer, I want to automate AI agent deployment and configuration management, so that deployments are consistent, repeatable, and can be rolled back if needed.
5 pts Infrastructure
Points rationale: Involves CI/CD pipeline setup, deployment automation, rollback mechanisms, and integration with Terraform. Multiple moving parts with moderate complexity.
  • Given Code changes are committed to main branch When the deployment pipeline is triggered Then the agent is deployed successfully with zero downtime
  • Given Deployment fails during pipeline execution When an error occurs in any pipeline stage Then the pipeline stops and previous version remains active
  • Given A rollback is initiated When the rollback command is executed Then the previous working version is restored within 5 minutes
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✓ Knowledge Sharing
F2: Datadog Monitor Integration and Alert Processing
US-F2-001 Connect to Datadog Monitor API
CRITICAL
As a DevOps engineer, I want to establish connection to receive P1, P2, and P3 monitor alerts, so that the AI agent can receive and process Datadog alert notifications.
3 pts Backend
Points rationale: Standard API integration with webhook setup. Well-documented Datadog API but requires proper error handling and filtering logic.
  • Given Datadog API credentials are configured When a P1, P2, or P3 monitor triggers an alert Then the agent receives the alert notification within 30 seconds
  • Given Datadog API is unavailable When the agent attempts to connect Then connection failures are logged and retry logic is activated
  • Given Invalid monitor priority levels are received When an alert with unsupported priority is sent Then the alert is logged but not processed further
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F2-002 Process and Analyze Alert Data
CRITICAL
As a On-call engineer, I want to have alerts automatically analyzed and categorized by the AI agent, so that I can quickly understand alert severity and potential root causes.
5 pts Backend
Points rationale: Involves AI/ML processing, alert correlation logic, and data parsing. Moderate complexity with some unknowns around alert grouping algorithms.
  • Given A monitor alert is received When the agent processes the alert data Then alert metadata is extracted and categorized correctly
  • Given Alert data is malformed or incomplete When the agent attempts to process it Then the agent handles the error gracefully and requests manual review
  • Given Multiple related alerts are received simultaneously When the agent processes them Then alerts are grouped and deduplicated appropriately
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F2-003 Implement P1 Alert Automation
CRITICAL
As a On-call engineer, I want to have P1 alerts automatically investigated and initial response actions taken, so that critical incidents get immediate attention without waiting for manual intervention.
8 pts Backend
Points rationale: Complex automation logic with multiple decision trees, error handling, and escalation paths. High complexity due to critical nature and multiple integration points.
  • Given A P1 alert is received When the agent processes it Then automated investigation steps are executed within 2 minutes
  • Given Automated investigation fails or times out When the failure is detected Then the incident is escalated to on-call engineers immediately
  • Given P2 or P3 alerts are received When the agent processes them Then no automated actions are taken, only analysis is performed
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✓ Knowledge Sharing
F3: Slack Workspace Integration
US-F3-001 Setup Slack Bot Integration
HIGH
As a DevOps engineer, I want to configure Slack bot for AI agent communication, so that engineers can interact with the AI agent through their existing Slack workspace.
3 pts Backend
Points rationale: Standard Slack bot setup with well-documented APIs. Straightforward implementation with known patterns and libraries.
  • Given Slack bot is configured with proper permissions When an engineer mentions the bot in a channel Then the bot responds with available commands and status
  • Given Slack API is unavailable When the bot attempts to send a message Then the failure is logged and message is queued for retry
  • Given Bot is added to a private channel When channel permissions are insufficient Then the bot requests appropriate permissions or notifies of limitations
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F3-002 Implement Interactive Slack Commands
HIGH
As a On-call engineer, I want to interact with the AI agent using slash commands and buttons, so that I can quickly query incident status and request specific investigations.
5 pts Backend
Points rationale: Multiple command implementations, interactive components, and command parsing logic. Moderate complexity with various user interaction patterns to handle.
  • Given I use a valid slash command When I execute the command in Slack Then the agent responds with relevant information or actions
  • Given I use an invalid or malformed command When I execute the command Then the agent provides helpful error messages and command syntax
  • Given I click an interactive button in a bot message When the button action is processed Then the appropriate action is executed and feedback is provided
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F3-003 Enable Alert Notifications in Slack
HIGH
As a On-call engineer, I want to receive formatted alert notifications directly in Slack channels, so that I stay informed about incidents without switching between multiple tools.
3 pts Backend
Points rationale: Message formatting and delivery logic with some threading complexity. Relatively straightforward with well-understood Slack messaging patterns.
  • Given An alert is processed by the agent When the notification is sent to Slack Then the message includes alert details, severity, and suggested actions
  • Given A Slack channel is unavailable or archived When a notification is attempted Then the notification is sent to a fallback channel with error details
  • Given Multiple alerts occur in rapid succession When notifications are sent Then messages are threaded or grouped to avoid channel spam
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
F4: Azure DevOps Code Access Integration
US-F4-001 Connect to Azure DevOps API
HIGH
As a DevOps engineer, I want to establish secure connection to Azure DevOps repositories and services, so that the AI agent can access code context for incident investigation.
3 pts Backend
Points rationale: Standard API integration with Azure DevOps. Well-documented APIs but requires proper authentication and rate limiting handling.
  • Given Azure DevOps credentials are configured When the agent attempts to connect Then connection is established and repository access is verified
  • Given Invalid credentials or permissions are used When the agent attempts to access repositories Then authentication errors are handled gracefully with clear error messages
  • Given API rate limits are exceeded When multiple requests are made rapidly Then requests are throttled and queued appropriately
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F4-002 Implement Service Definition Mapping
HIGH
As a Software engineer, I want to have the agent automatically identify relevant code repositories for alerts, so that incident investigation includes relevant code context and recent changes.
5 pts Backend
Points rationale: Involves mapping logic, service discovery, and repository correlation. Moderate complexity with some unknowns around service definition formats and mapping accuracy.
  • Given An alert contains service information When the agent processes the alert Then the correct Azure DevOps repository is identified and linked
  • Given Service definition mapping is incomplete or missing When the agent attempts to find the repository Then the agent logs the missing mapping and continues with available information
  • Given Multiple repositories are associated with a service When the agent searches for code context Then all relevant repositories are identified and prioritized by relevance
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F4-003 Enable Code Analysis and Context
HIGH
As a On-call engineer, I want to receive relevant code snippets and recent changes during incident investigation, so that I can quickly identify potential code-related causes of incidents.
8 pts Backend
Points rationale: Complex code analysis logic, git history processing, and relevance scoring algorithms. High complexity due to multiple analysis techniques and performance considerations.
  • Given A repository is identified for an alert When the agent analyzes the code Then relevant code sections and recent commits are highlighted
  • Given Repository access is denied or unavailable When the agent attempts code analysis Then the limitation is reported and investigation continues with other sources
  • Given Large repositories with extensive history exist When code analysis is performed Then analysis is completed within reasonable time limits with focused results
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✓ Knowledge Sharing
F5: Confluence RunBooks Integration and Updates
US-F5-001 Connect to Confluence API
MEDIUM
As a DevOps engineer, I want to establish connection to Confluence for RunBook access, so that the AI agent can retrieve and reference existing RunBooks during investigations.
2 pts Backend
Points rationale: Straightforward API integration with Confluence. Well-documented REST API with standard authentication patterns.
  • Given Confluence API credentials are configured When the agent attempts to connect Then connection is established and RunBook spaces are accessible
  • Given Confluence is unavailable or credentials are invalid When the agent attempts to access RunBooks Then the failure is logged and investigation continues without RunBook context
  • Given RunBook permissions are restricted When the agent attempts to access specific pages Then permission errors are handled gracefully with appropriate fallbacks
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F5-002 Implement RunBook Search and Retrieval
MEDIUM
As a On-call engineer, I want to have relevant RunBooks automatically identified and referenced during incidents, so that I can quickly access established procedures and troubleshooting steps.
5 pts Backend
Points rationale: Involves search algorithms, content parsing, and relevance scoring. Moderate complexity with text processing and matching logic requirements.
  • Given An alert is being investigated When the agent searches for relevant RunBooks Then matching RunBooks are identified and key sections are extracted
  • Given No relevant RunBooks are found When the search is performed Then the agent reports the absence and suggests creating new documentation
  • Given RunBooks contain outdated or conflicting information When they are retrieved Then the agent flags potential issues and suggests review
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✗ Knowledge Sharing
US-F5-003 Enable RunBook Updates and Maintenance
MEDIUM
As a Software engineer, I want to update RunBooks based on incident learnings and agent recommendations, so that RunBooks stay current and accurate for future incident response.
5 pts Backend
Points rationale: Content update logic, approval workflows, and version management. Moderate complexity with workflow integration and permission handling requirements.
  • Given An incident is resolved with new learnings When the agent suggests RunBook updates Then update recommendations are provided with specific content suggestions
  • Given RunBook updates are approved by engineers When the update is applied Then the Confluence page is updated and version history is maintained
  • Given Update permissions are insufficient When the agent attempts to update a RunBook Then the update request is sent to appropriate stakeholders for manual processing
Definition of Done:
  • ✓ Acceptance Criteria Met
  • ✓ Documentation
  • ✓ Proper Testing
  • ✓ Code Merged to Main
  • ✓ Released via SDLC
  • ✓ Stakeholder Sign-off
  • ✓ Knowledge Sharing

Tasks

IDLabelTitleDescription
US-F1-001 — As a DevOps engineer, I want to provision foundational cloud resources for AI agent deployment, so that the AI agent has a secure and scalable infrastructure foundation.
T-US-F1-001-01 Infrastructure Create Terraform provider configuration and backend setup Set up main.tf with required providers (AWS/Azure/GCP), configure remote state backend (S3/Azure Storage), and establish basic project structure with variables.tf and outputs.tf files.
Test plan: Unit test: terraform validate passes without errors. Integration test: terraform init successfully configures backend and downloads providers. Test state locking and remote state access.
AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Create the foundational Terraform configuration files including provider setup, remote state backend, and project structure. Follow standard Terraform project layout with separate files for variables, outputs, and main configuration. Ensure the configuration supports the core infrastructure provisioning requirements from the acceptance criteria.
T-US-F1-001-02 Infrastructure Implement core cloud infrastructure resources Define and provision essential cloud resources including VPC/networking, security groups, IAM roles, compute instances, and storage resources needed for AI agent deployment in resource files.
Test plan: Unit test: terraform plan shows expected resource creation without errors. Integration test: terraform apply provisions all resources successfully. Test resource dependencies and proper tagging.
AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Implement the core cloud infrastructure resources required for AI agent deployment including networking, security, compute, and storage components. Ensure resources are properly configured with security best practices and appropriate sizing for the AI agent workload. The infrastructure must support secure and scalable AI agent deployment.
T-US-F1-001-03 Infrastructure Add Terraform validation and error handling Implement input validation for variables, add conditional logic for resource creation, and configure proper error messages and validation rules in variables.tf and main.tf.
Test plan: Unit test: terraform validate with invalid configurations shows clear error messages. Test variable validation rules with edge cases. Integration test: invalid terraform apply attempts fail gracefully with helpful error output.
AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Add comprehensive validation and error handling to the Terraform configuration including variable validation rules, conditional resource creation, and clear error messaging. Focus on preventing common configuration mistakes and providing helpful feedback when validation fails. This addresses the acceptance criteria for handling invalid Terraform configurations.
T-US-F1-001-04 Infrastructure Implement idempotent infrastructure management Configure Terraform state management, resource lifecycle rules, and import capabilities to ensure repeated terraform apply operations are idempotent and don't create duplicate resources.
Test plan: Integration test: run terraform apply multiple times and verify no duplicate resources are created. Test terraform import for existing resources. Verify state consistency after repeated operations.
AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Implement idempotent infrastructure management ensuring that repeated terraform apply operations don't create duplicate resources and maintain consistent state. Configure proper lifecycle rules and state management to handle existing infrastructure gracefully. This addresses the acceptance criteria for consistent state management when infrastructure already exists.
T-US-F1-001-05 Documentation Document infrastructure setup and configuration Create comprehensive documentation covering Terraform configuration structure, required variables, deployment steps, troubleshooting guide, and infrastructure architecture decisions. Include setup instructions and configuration examples.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Create comprehensive documentation for the Terraform infrastructure setup including configuration structure, required variables, deployment procedures, and troubleshooting guidance. Document the infrastructure architecture decisions and provide clear setup instructions for new team members. Ensure documentation covers all aspects of the foundational cloud resources provisioning process.
US-F1-002 — As a DevOps engineer, I want to configure secure authentication and authorization for the AI agent, so that the agent can securely access external services and APIs.
T-US-F1-002-01 Infrastructure Create authentication configuration module Implement Terraform module for managing authentication credentials, API keys, and service accounts including secure storage using cloud key management services and proper IAM policies.
Test plan: Unit test: authentication module validates required parameters. Integration test: credentials are properly stored and accessible by AI agent. Test IAM policy attachments and key rotation capabilities.
AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Create a Terraform module for secure authentication and authorization configuration including credential management, API key storage, and IAM policies. Use cloud-native key management services and follow security best practices for credential storage and access. This module will enable the AI agent to securely authenticate with external services.
T-US-F1-002-02 Infrastructure Implement token management and refresh logic Configure automated token refresh mechanisms, expiration handling, and credential rotation policies in the infrastructure configuration to support continuous authentication without service interruption.
Test plan: Integration test: token refresh occurs automatically before expiration. Test credential rotation without service disruption. Verify error handling for failed token refresh attempts.
AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Implement automated token management and refresh mechanisms in the infrastructure configuration including expiration handling and credential rotation policies. Ensure the system can automatically refresh access tokens without service interruption as specified in the acceptance criteria. Configure appropriate monitoring and alerting for authentication failures.
T-US-F1-002-03 Documentation Document authentication and authorization setup Create documentation covering authentication configuration, credential management procedures, token refresh mechanisms, security best practices, and troubleshooting guide for authentication issues.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the authentication and authorization setup including credential configuration procedures, token management mechanisms, and security best practices. Provide troubleshooting guidance for common authentication issues and clear instructions for setting up secure access to external services. Cover both initial setup and ongoing maintenance procedures.
US-F1-003 — As a DevOps engineer, I want to automate AI agent deployment and configuration management, so that deployments are consistent, repeatable, and can be rolled back if needed.
T-US-F1-003-01 Infrastructure Create CI/CD pipeline configuration Implement deployment pipeline configuration files (GitHub Actions/Azure DevOps/Jenkins) with stages for validation, testing, deployment, and rollback of the AI agent infrastructure and application code.
Test plan: Unit test: pipeline configuration validates successfully. Integration test: pipeline triggers on main branch commits and deploys successfully. Test pipeline failure handling and rollback triggers.
AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Create CI/CD pipeline configuration for automated AI agent deployment including validation, testing, deployment, and rollback stages. Ensure the pipeline supports zero-downtime deployments and can be triggered from main branch commits. Configure proper error handling and rollback mechanisms as specified in the acceptance criteria.
T-US-F1-003-02 Infrastructure Implement zero-downtime deployment strategy Configure blue-green or rolling deployment strategy in Terraform and deployment scripts to ensure AI agent updates occur without service interruption, including health checks and traffic switching.
Test plan: Integration test: deployment completes without service downtime. Test health check validation during deployment. Verify traffic switching occurs only after successful health checks.
AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Implement zero-downtime deployment strategy using blue-green or rolling deployment approach including health checks and traffic switching mechanisms. Configure the infrastructure to support seamless AI agent updates without service interruption as required by the acceptance criteria. Ensure proper validation before traffic switching.
T-US-F1-003-03 Infrastructure Create automated rollback mechanism Implement rollback automation in deployment scripts and Terraform configuration to automatically restore previous working version within 5 minutes when deployment failures are detected.
Test plan: Integration test: rollback completes within 5 minutes when triggered. Test automatic rollback on deployment failure detection. Verify previous version restoration and service availability.
AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Create automated rollback mechanisms that can restore the previous working version of the AI agent within 5 minutes when deployment failures are detected. Implement failure detection logic and automated rollback triggers in the deployment pipeline. Ensure the rollback process meets the 5-minute recovery time requirement from the acceptance criteria.
T-US-F1-003-04 Infrastructure Add deployment monitoring and alerting Configure monitoring and alerting for deployment pipeline stages, failure detection, and rollback events using cloud monitoring services and notification systems.
Test plan: Integration test: alerts trigger on deployment failures. Test monitoring metrics collection during deployments. Verify notification delivery for pipeline events.
AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Implement comprehensive monitoring and alerting for the deployment pipeline including failure detection, rollback events, and deployment success metrics. Configure notifications to alert the team of deployment status and any issues requiring attention. Ensure monitoring covers all pipeline stages and provides visibility into deployment health.
T-US-F1-003-05 Documentation Document deployment and configuration management Create documentation covering CI/CD pipeline setup, deployment procedures, rollback processes, monitoring configuration, and troubleshooting guide for deployment issues.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the automated deployment and configuration management system including CI/CD pipeline setup, deployment procedures, rollback processes, and monitoring configuration. Provide troubleshooting guidance for common deployment issues and clear instructions for managing the deployment pipeline. Cover both automated processes and manual intervention procedures.
US-F2-001 — As a DevOps engineer, I want to establish connection to receive P1, P2, and P3 monitor alerts, so that the AI agent can receive and process Datadog alert notifications.
T-US-F2-001-01 Code Create Datadog API client configuration Implement Datadog API client setup with authentication, connection management, and retry logic for receiving monitor alerts. Create configuration files for API credentials and connection parameters.
Test plan: Unit test: API client initializes with valid credentials. Test connection retry logic with simulated failures. Integration test: successful connection to Datadog API and credential validation.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create a Datadog API client configuration module for receiving monitor alerts including authentication setup, connection management, and retry logic. Implement proper error handling for API unavailability and credential validation. The client must support receiving P1, P2, and P3 monitor alerts within 30 seconds as specified in the acceptance criteria.
T-US-F2-001-02 Code Implement alert reception and filtering logic Create alert reception handler that filters incoming Datadog alerts by priority levels (P1, P2, P3), validates alert format, and queues alerts for processing with proper error handling for unsupported priorities.
Test plan: Unit test: alert filtering correctly identifies P1, P2, P3 priorities. Test rejection of invalid priority levels. Integration test: end-to-end alert reception from Datadog to processing queue.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement alert reception and filtering logic that processes incoming Datadog monitor alerts, validates priority levels (P1, P2, P3), and handles unsupported priorities gracefully. Create proper queuing mechanism for valid alerts and logging for invalid ones. This addresses the acceptance criteria for receiving and filtering monitor alerts by priority.
T-US-F2-001-03 Documentation Document Datadog monitor integration setup Create documentation covering Datadog API configuration, alert reception setup, priority level handling, troubleshooting connection issues, and monitoring integration best practices.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Datadog monitor integration setup including API configuration, alert reception mechanisms, priority level filtering, and troubleshooting guidance. Provide clear instructions for configuring the connection and handling common integration issues. Cover the alert reception process and priority level requirements.
US-F2-002 — As a On-call engineer, I want to have alerts automatically analyzed and categorized by the AI agent, so that I can quickly understand alert severity and potential root causes.
T-US-F2-002-01 Code Create alert metadata extraction engine Implement alert parsing and metadata extraction logic to process Datadog alert payloads, extract relevant information (severity, service, metrics, timestamps), and structure data for analysis.
Test plan: Unit test: metadata extraction handles various alert formats correctly. Test parsing of malformed alert data with graceful error handling. Integration test: extracted metadata matches expected structure and completeness.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create an alert metadata extraction engine that parses Datadog alert payloads and extracts relevant information including severity, service details, metrics, and timestamps. Implement robust parsing logic that handles various alert formats and gracefully manages malformed data as specified in the acceptance criteria.
T-US-F2-002-02 Code Implement alert categorization and analysis logic Create AI-powered alert categorization system that analyzes extracted metadata, determines alert severity, identifies potential root causes, and categorizes alerts by type and impact.
Test plan: Unit test: categorization logic correctly classifies different alert types. Test analysis accuracy with sample alert data. Integration test: end-to-end alert processing from extraction to categorization.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement AI-powered alert categorization and analysis logic that processes extracted alert metadata to determine severity, identify potential root causes, and categorize alerts appropriately. The system should help on-call engineers quickly understand alert severity and potential causes as required by the acceptance criteria.
T-US-F2-002-03 Code Create alert grouping and deduplication system Implement logic to group related alerts, deduplicate similar alerts received simultaneously, and manage alert correlation to reduce noise and improve incident response efficiency.
Test plan: Unit test: alert grouping correctly identifies related alerts. Test deduplication logic with simultaneous similar alerts. Integration test: grouped alerts maintain proper relationships and reduce notification spam.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create alert grouping and deduplication system that identifies related alerts, removes duplicates, and correlates simultaneous alerts to reduce noise. Implement intelligent grouping logic that helps on-call engineers focus on unique issues rather than being overwhelmed by related alerts. This addresses the acceptance criteria for handling multiple related alerts appropriately.
T-US-F2-002-04 Code Add error handling for malformed alert data Implement comprehensive error handling for incomplete or malformed alert data, including validation, error logging, manual review flagging, and graceful degradation of analysis capabilities.
Test plan: Unit test: error handling correctly identifies and processes malformed data. Test various types of incomplete alert payloads. Integration test: system continues operating when encountering bad data and properly flags for manual review.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement comprehensive error handling for malformed or incomplete alert data including validation logic, error logging, and manual review flagging. Ensure the system gracefully handles bad data and continues processing other alerts while properly escalating problematic cases. This addresses the acceptance criteria for handling malformed alert data gracefully.
T-US-F2-002-05 Documentation Document alert analysis and categorization system Create documentation covering alert processing workflow, metadata extraction logic, categorization criteria, grouping algorithms, error handling procedures, and troubleshooting guide for alert analysis issues.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the alert analysis and categorization system including processing workflow, metadata extraction logic, categorization criteria, and grouping algorithms. Provide troubleshooting guidance for alert processing issues and explain how the system helps on-call engineers understand alert severity and root causes. Cover error handling procedures and manual review processes.
US-F2-003 — As a On-call engineer, I want to have P1 alerts automatically investigated and initial response actions taken, so that critical incidents get immediate attention without waiting for manual intervention.
T-US-F2-003-01 Code Create P1 alert detection and routing logic Implement priority-based alert routing that identifies P1 alerts and triggers automated investigation workflows while routing P2/P3 alerts to analysis-only processing paths.
Test plan: Unit test: P1 alert detection correctly identifies critical alerts. Test routing logic separates P1 from P2/P3 alerts. Integration test: P1 alerts trigger automated workflows while P2/P3 alerts only get analyzed.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create priority-based alert routing logic that identifies P1 alerts and triggers automated investigation workflows while ensuring P2 and P3 alerts only receive analysis without automated actions. Implement proper alert classification and routing mechanisms as specified in the acceptance criteria for different priority levels.
T-US-F2-003-02 Code Implement automated P1 investigation workflow Create automated investigation engine for P1 alerts that executes predefined investigation steps, gathers system information, runs diagnostic commands, and collects relevant data within 2 minutes.
Test plan: Unit test: investigation steps execute in correct sequence. Test timeout handling for long-running investigations. Integration test: complete P1 investigation workflow completes within 2 minutes with proper data collection.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement automated P1 investigation workflow that executes investigation steps within 2 minutes including system information gathering, diagnostic commands, and data collection. Create a robust investigation engine that can handle various P1 alert types and provide immediate attention to critical incidents as required by the acceptance criteria.
T-US-F2-003-03 Code Create escalation mechanism for investigation failures Implement escalation logic that detects automated investigation failures or timeouts and immediately escalates P1 incidents to on-call engineers through multiple notification channels.
Test plan: Unit test: escalation triggers correctly on investigation failures and timeouts. Test multiple notification channel delivery. Integration test: end-to-end escalation process from failure detection to engineer notification.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create escalation mechanism that detects when automated P1 investigation fails or times out and immediately escalates to on-call engineers. Implement multiple notification channels and ensure reliable escalation delivery as specified in the acceptance criteria for handling investigation failures.
T-US-F2-003-04 Code Add investigation result tracking and reporting Implement tracking system for investigation results, success/failure metrics, timing data, and reporting capabilities to monitor automated investigation effectiveness and identify improvement areas.
Test plan: Unit test: investigation tracking correctly records results and metrics. Test reporting data accuracy and completeness. Integration test: tracking system captures full investigation lifecycle and generates useful reports.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement investigation result tracking and reporting system that monitors automated investigation effectiveness, tracks success/failure rates, and provides insights for improvement. Create comprehensive tracking that helps optimize the automated investigation process and demonstrates value to the engineering team.
T-US-F2-003-05 Code Create investigation timeout and resource management Implement timeout controls, resource limits, and cleanup mechanisms for automated investigations to prevent resource exhaustion and ensure investigations complete within acceptable time limits.
Test plan: Unit test: timeout controls properly terminate long-running investigations. Test resource limit enforcement and cleanup procedures. Integration test: resource management prevents system overload during multiple concurrent investigations.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create investigation timeout and resource management system that prevents automated investigations from consuming excessive resources or running indefinitely. Implement proper cleanup mechanisms and resource limits to ensure system stability during investigation execution. This ensures investigations complete within the 2-minute requirement from the acceptance criteria.
T-US-F2-003-06 Code Add investigation workflow configuration Create configurable investigation workflow system that allows customization of investigation steps, timeouts, escalation rules, and investigation procedures for different types of P1 alerts.
Test plan: Unit test: workflow configuration correctly customizes investigation steps. Test different alert type configurations. Integration test: customized workflows execute properly for various P1 alert scenarios.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create configurable investigation workflow system that allows customization of automated investigation procedures for different P1 alert types. Implement flexible configuration management that enables tuning of investigation steps, timeouts, and escalation rules based on alert characteristics and organizational needs.
T-US-F2-003-07 Code Implement investigation result analysis and recommendations Create analysis engine that processes investigation results, identifies patterns, generates recommendations for incident resolution, and provides actionable insights to on-call engineers.
Test plan: Unit test: analysis engine correctly processes investigation data and generates recommendations. Test recommendation quality and relevance. Integration test: complete analysis workflow from investigation results to actionable recommendations.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement investigation result analysis engine that processes automated investigation data, identifies patterns, and generates actionable recommendations for incident resolution. Create intelligent analysis that helps on-call engineers quickly understand investigation findings and take appropriate action based on the automated investigation results.
T-US-F2-003-08 Documentation Document P1 alert automation and investigation system Create documentation covering P1 alert automation workflow, investigation procedures, escalation mechanisms, configuration options, troubleshooting guide, and best practices for automated incident response.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the P1 alert automation and investigation system including workflow procedures, escalation mechanisms, configuration options, and troubleshooting guidance. Provide clear explanations of how automated investigation works, when escalation occurs, and how to optimize the system for different incident types. Cover both automated processes and manual intervention procedures.
US-F3-001 — As a DevOps engineer, I want to configure Slack bot for AI agent communication, so that engineers can interact with the AI agent through their existing Slack workspace.
T-US-F3-001-01 Code Create Slack bot application and authentication Set up Slack bot application in workspace, configure OAuth tokens, bot permissions, and authentication credentials for AI agent communication through Slack API.
Test plan: Unit test: bot authentication validates successfully with Slack API. Test permission verification and token refresh. Integration test: bot can connect to workspace and respond to mentions.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create Slack bot application setup including OAuth authentication, bot permissions configuration, and credential management for AI agent communication. Implement proper authentication flow and permission validation to enable bot interaction within the Slack workspace as specified in the acceptance criteria.
T-US-F3-001-02 Code Implement bot response and command handling system Create bot response handler that processes mentions, displays available commands and status information, and manages basic bot interactions with proper error handling and help functionality.
Test plan: Unit test: bot responds correctly to mentions with command list and status. Test help command functionality and error responses. Integration test: end-to-end bot interaction from mention to response delivery.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement Slack bot response and command handling system that processes bot mentions, displays available commands and status information, and provides helpful responses to users. Create intuitive command interface that helps engineers interact with the AI agent effectively through Slack as required by the acceptance criteria.
T-US-F3-001-03 Documentation Document Slack bot configuration and setup Create documentation covering Slack bot application setup, authentication configuration, permission requirements, workspace integration steps, and troubleshooting guide for bot connectivity issues.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Slack bot configuration and setup process including application creation, authentication setup, permission requirements, and workspace integration steps. Provide troubleshooting guidance for common bot connectivity issues and clear instructions for configuring the bot for AI agent communication.
US-F3-002 — As a On-call engineer, I want to interact with the AI agent using slash commands and buttons, so that I can quickly query incident status and request specific investigations.
T-US-F3-002-01 Code Create slash command registration and routing Implement Slack slash command registration, command parsing, routing logic, and validation to handle various AI agent interaction commands with proper parameter handling.
Test plan: Unit test: slash command parsing correctly identifies commands and parameters. Test command validation and routing logic. Integration test: slash commands execute properly and return expected responses.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create Slack slash command system including command registration, parsing, routing, and validation logic. Implement comprehensive command handling that allows on-call engineers to interact with the AI agent through intuitive slash commands as specified in the acceptance criteria for querying incident status and requesting investigations.
T-US-F3-002-02 Code Implement interactive button and action handling Create interactive Slack components (buttons, select menus) and action handlers that process user interactions, execute corresponding AI agent actions, and provide immediate feedback.
Test plan: Unit test: interactive button actions execute correctly and provide feedback. Test various button types and action handlers. Integration test: end-to-end interactive workflow from button click to action completion.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement Slack interactive components including buttons and action handlers that process user interactions and execute AI agent actions. Create responsive interactive interface that provides immediate feedback and enables efficient incident management through Slack as required by the acceptance criteria for interactive button functionality.
T-US-F3-002-03 Code Add command validation and error handling Implement comprehensive command validation, error handling for invalid commands, helpful error messages, and command syntax assistance to improve user experience.
Test plan: Unit test: command validation correctly identifies invalid commands and parameters. Test error message clarity and helpfulness. Integration test: invalid commands return appropriate error responses with syntax guidance.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement command validation and error handling system that provides helpful error messages and command syntax assistance for invalid or malformed commands. Create user-friendly error responses that guide engineers toward correct command usage as specified in the acceptance criteria for handling invalid commands.
T-US-F3-002-04 Code Create incident status query and investigation commands Implement specific slash commands for querying incident status, requesting investigations, and accessing AI agent capabilities with proper data formatting and response handling.
Test plan: Unit test: incident status commands return properly formatted data. Test investigation request commands and response handling. Integration test: commands successfully interact with AI agent backend and return relevant information.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create specific slash commands for incident status queries and investigation requests that integrate with the AI agent backend. Implement proper data formatting and response handling to provide on-call engineers with quick access to incident information and investigation capabilities as required by the acceptance criteria.
T-US-F3-002-05 Documentation Document Slack interaction and command system Create documentation covering available slash commands, interactive button usage, command syntax, error handling, and best practices for interacting with the AI agent through Slack.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Slack interaction system including available slash commands, interactive button usage, command syntax, and error handling procedures. Provide clear usage examples and best practices for on-call engineers to effectively interact with the AI agent through Slack commands and interactive components.
US-F3-003 — As a On-call engineer, I want to receive formatted alert notifications directly in Slack channels, so that I stay informed about incidents without switching between multiple tools.
T-US-F3-003-01 Code Create formatted alert notification system Implement Slack message formatting for alert notifications including alert details, severity indicators, suggested actions, and rich formatting to improve readability and actionability.
Test plan: Unit test: alert notifications format correctly with all required details. Test various alert types and severity levels. Integration test: formatted notifications deliver successfully to Slack channels.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create formatted alert notification system for Slack that includes alert details, severity indicators, and suggested actions with rich formatting. Implement clear and actionable notification format that helps on-call engineers quickly understand and respond to incidents as specified in the acceptance criteria.
T-US-F3-003-02 Code Implement channel management and fallback logic Create channel availability checking, fallback channel routing, and error handling for unavailable or archived channels to ensure notifications are always delivered.
Test plan: Unit test: channel availability checking correctly identifies unavailable channels. Test fallback routing to alternative channels. Integration test: notifications successfully deliver even when primary channels are unavailable.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement channel management system with availability checking and fallback routing to ensure alert notifications are always delivered even when primary channels are unavailable or archived. Create robust notification delivery system as specified in the acceptance criteria for handling channel availability issues.
T-US-F3-003-03 Documentation Document Slack alert notification system Create documentation covering alert notification formatting, channel configuration, fallback mechanisms, message threading, and troubleshooting guide for notification delivery issues.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Slack alert notification system including message formatting, channel configuration, fallback mechanisms, and threading behavior. Provide troubleshooting guidance for notification delivery issues and clear instructions for configuring alert notifications to keep on-call engineers informed through Slack.
US-F4-001 — As a DevOps engineer, I want to establish secure connection to Azure DevOps repositories and services, so that the AI agent can access code context for incident investigation.
T-US-F4-001-01 Code Create Azure DevOps API client and authentication Implement Azure DevOps API client with authentication setup, personal access token management, and connection verification for accessing repositories and services.
Test plan: Unit test: API client initializes with valid credentials and establishes connection. Test authentication error handling and token validation. Integration test: successful connection to Azure DevOps and repository access verification.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create Azure DevOps API client with secure authentication setup including personal access token management and connection verification. Implement proper error handling for authentication failures and credential validation as specified in the acceptance criteria for establishing secure connection to Azure DevOps.
T-US-F4-001-02 Code Implement API rate limiting and request throttling Create rate limiting logic, request throttling, and queue management to handle Azure DevOps API rate limits and ensure reliable access without exceeding quotas.
Test plan: Unit test: rate limiting correctly throttles requests within API limits. Test queue management for multiple concurrent requests. Integration test: system operates reliably under API rate limit constraints.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement API rate limiting and request throttling system for Azure DevOps API access including queue management and request scheduling. Create robust rate limiting that prevents API quota exceeded errors while maintaining system responsiveness as specified in the acceptance criteria for handling API rate limits.
T-US-F4-001-03 Documentation Document Azure DevOps integration setup Create documentation covering Azure DevOps API setup, authentication configuration, rate limiting behavior, troubleshooting connection issues, and best practices for repository access.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Azure DevOps integration setup including API configuration, authentication procedures, rate limiting behavior, and troubleshooting guidance. Provide clear instructions for establishing secure connections and managing repository access for incident investigation purposes.
US-F4-002 — As a Software engineer, I want to have the agent automatically identify relevant code repositories for alerts, so that incident investigation includes relevant code context and recent changes.
T-US-F4-002-01 Code Create service-to-repository mapping system Implement mapping logic that correlates alert service information with Azure DevOps repositories, including configuration management for service definitions and repository associations.
Test plan: Unit test: service mapping correctly identifies repositories for known services. Test handling of unmapped services and multiple repository associations. Integration test: alert processing successfully identifies relevant repositories.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create service-to-repository mapping system that correlates alert service information with Azure DevOps repositories. Implement configurable mapping logic that can identify relevant repositories for incident investigation as specified in the acceptance criteria for automatically identifying relevant code repositories.
T-US-F4-002-02 Code Implement repository discovery and validation Create repository discovery logic that validates repository access, checks permissions, and handles missing or inaccessible repositories with appropriate fallback behavior.
Test plan: Unit test: repository discovery correctly validates access and permissions. Test handling of inaccessible repositories and permission errors. Integration test: discovery process completes successfully with proper error handling.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement repository discovery and validation system that checks repository access, validates permissions, and handles missing mappings gracefully. Create robust discovery logic that continues investigation with available information when repositories are inaccessible as specified in the acceptance criteria.
T-US-F4-002-03 Code Add repository prioritization and relevance scoring Create relevance scoring algorithm that prioritizes multiple repositories associated with a service based on recent activity, code changes, and relationship to the alert context.
Test plan: Unit test: relevance scoring correctly prioritizes repositories based on activity and context. Test scoring algorithm with various repository scenarios. Integration test: prioritized repository list provides most relevant repositories first.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create repository prioritization system that scores and ranks multiple repositories by relevance to the incident context including recent activity, code changes, and alert relationship. Implement intelligent prioritization that helps focus investigation on the most relevant repositories as specified in the acceptance criteria.
T-US-F4-002-04 Code Create missing mapping detection and reporting Implement detection logic for incomplete service-to-repository mappings, logging of missing mappings, and reporting mechanisms to help maintain mapping completeness.
Test plan: Unit test: missing mapping detection correctly identifies unmapped services. Test logging and reporting functionality for mapping gaps. Integration test: missing mappings are properly reported while investigation continues.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create missing mapping detection and reporting system that identifies incomplete service-to-repository mappings and logs missing associations for future configuration updates. Implement proper reporting that helps maintain mapping completeness while allowing investigation to continue as specified in the acceptance criteria.
T-US-F4-002-05 Documentation Document service-repository mapping and discovery Create documentation covering service-to-repository mapping configuration, discovery algorithms, prioritization logic, missing mapping handling, and best practices for maintaining accurate mappings.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the service-to-repository mapping system including configuration procedures, discovery algorithms, prioritization logic, and missing mapping handling. Provide guidance for maintaining accurate mappings and troubleshooting repository identification issues during incident investigation.
US-F4-003 — As a On-call engineer, I want to receive relevant code snippets and recent changes during incident investigation, so that I can quickly identify potential code-related causes of incidents.
T-US-F4-003-01 Code Create code analysis and snippet extraction engine Implement code analysis engine that examines identified repositories, extracts relevant code sections, and identifies recent commits related to the incident context.
Test plan: Unit test: code analysis correctly identifies relevant code sections and recent commits. Test snippet extraction with various code structures. Integration test: analysis completes within reasonable time limits with focused results.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create code analysis engine that examines repositories, extracts relevant code sections, and identifies recent commits related to incident context. Implement efficient analysis that provides focused results within reasonable time limits as specified in the acceptance criteria for receiving relevant code snippets and recent changes.
T-US-F4-003-02 Code Implement recent change analysis and highlighting Create recent change analysis that examines commit history, identifies potentially relevant changes, and highlights modifications that might be related to the incident.
Test plan: Unit test: recent change analysis correctly identifies relevant commits and modifications. Test change highlighting with various commit patterns. Integration test: change analysis provides useful insights for incident investigation.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement recent change analysis system that examines commit history and identifies potentially relevant modifications related to incidents. Create intelligent change highlighting that helps on-call engineers quickly identify code-related causes as specified in the acceptance criteria for highlighting recent commits.
T-US-F4-003-03 Code Add repository access error handling and fallbacks Implement comprehensive error handling for repository access issues, permission denials, and unavailable repositories with appropriate fallback behavior and user notification.
Test plan: Unit test: error handling correctly manages repository access failures and permission issues. Test fallback behavior when repositories are unavailable. Integration test: investigation continues successfully despite repository access problems.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement repository access error handling including permission denials, unavailable repositories, and access failures with appropriate fallback behavior. Create robust error handling that allows investigation to continue with other sources when repository access is limited as specified in the acceptance criteria.
T-US-F4-003-04 Code Create analysis performance optimization Implement performance optimization for code analysis including caching, parallel processing, and analysis scope limiting to ensure analysis completes within reasonable time limits.
Test plan: Unit test: performance optimizations reduce analysis time without sacrificing quality. Test caching effectiveness and parallel processing. Integration test: large repository analysis completes within acceptable time limits.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create performance optimization system for code analysis including caching, parallel processing, and scope limiting to ensure analysis of large repositories completes within reasonable time limits. Implement efficient analysis that provides focused results without overwhelming the investigation process as specified in the acceptance criteria.
T-US-F4-003-05 Code Implement code context formatting and presentation Create code context formatting system that presents relevant code snippets, recent changes, and analysis results in a clear, actionable format for incident investigation.
Test plan: Unit test: code formatting presents information clearly and actionably. Test various code snippet formats and change presentations. Integration test: formatted code context enhances incident investigation effectiveness.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement code context formatting and presentation system that displays relevant code snippets, recent changes, and analysis results in clear, actionable format. Create effective presentation that helps on-call engineers quickly understand code-related aspects of incidents during investigation.
T-US-F4-003-06 Code Add code analysis result caching and storage Implement caching system for code analysis results, repository metadata, and recent change information to improve performance and reduce API calls to Azure DevOps.
Test plan: Unit test: caching system correctly stores and retrieves analysis results. Test cache invalidation and update mechanisms. Integration test: cached results improve analysis performance and reduce API usage.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create caching system for code analysis results, repository metadata, and recent change information to optimize performance and reduce Azure DevOps API usage. Implement intelligent caching that improves analysis speed while maintaining result accuracy and freshness.
T-US-F4-003-07 Code Create analysis scope configuration and filtering Implement configurable analysis scope that allows filtering by file types, directories, time ranges, and other criteria to focus code analysis on most relevant areas.
Test plan: Unit test: analysis scope configuration correctly filters code analysis by specified criteria. Test various filtering options and scope settings. Integration test: scoped analysis provides more focused and relevant results.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create configurable analysis scope system that filters code analysis by file types, directories, time ranges, and other criteria. Implement flexible filtering that focuses analysis on most relevant areas while maintaining comprehensive coverage of potential incident-related code changes.
T-US-F4-003-08 Documentation Document Azure DevOps code access and analysis Create documentation covering code analysis workflow, repository access procedures, analysis scope configuration, performance optimization, error handling, and troubleshooting guide for code integration issues.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Azure DevOps code access and analysis system including workflow procedures, repository access setup, analysis scope configuration, and performance optimization. Provide troubleshooting guidance for code integration issues and explain how code context enhances incident investigation capabilities.
US-F5-001 — As a DevOps engineer, I want to establish connection to Confluence for RunBook access, so that the AI agent can retrieve and reference existing RunBooks during investigations.
T-US-F5-001-01 Code Create Confluence API client and authentication Implement Confluence API client with authentication setup, API token management, and connection verification for accessing RunBook spaces and pages.
Test plan: Unit test: API client initializes with valid credentials and establishes connection. Test authentication error handling and token validation. Integration test: successful connection to Confluence and RunBook space access verification.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create Confluence API client with secure authentication setup including API token management and connection verification for RunBook access. Implement proper error handling for authentication failures and credential validation as specified in the acceptance criteria for establishing connection to Confluence.
T-US-F5-001-02 Documentation Document Confluence RunBook integration setup Create documentation covering Confluence API setup, authentication configuration, RunBook space access, permission requirements, and troubleshooting guide for connection issues.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Confluence RunBook integration setup including API configuration, authentication procedures, RunBook space access, and permission requirements. Provide troubleshooting guidance for connection issues and clear instructions for establishing RunBook access for incident investigation.
US-F5-002 — As a On-call engineer, I want to have relevant RunBooks automatically identified and referenced during incidents, so that I can quickly access established procedures and troubleshooting steps.
T-US-F5-002-01 Code Create RunBook search and matching algorithm Implement intelligent search algorithm that matches incident alerts with relevant RunBooks based on keywords, service names, error patterns, and content analysis.
Test plan: Unit test: search algorithm correctly identifies relevant RunBooks for various alert types. Test keyword matching and content analysis accuracy. Integration test: RunBook matching provides useful results for incident investigation.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create RunBook search and matching algorithm that identifies relevant RunBooks for incident alerts using keywords, service names, error patterns, and content analysis. Implement intelligent matching that helps on-call engineers quickly access established procedures as specified in the acceptance criteria.
T-US-F5-002-02 Code Implement RunBook content extraction and summarization Create content extraction system that retrieves relevant RunBook sections, extracts key troubleshooting steps, and summarizes important procedures for incident response.
Test plan: Unit test: content extraction correctly identifies and summarizes key RunBook sections. Test extraction with various RunBook formats and structures. Integration test: extracted content provides actionable troubleshooting information.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement RunBook content extraction and summarization system that retrieves relevant sections and extracts key troubleshooting steps from matched RunBooks. Create effective summarization that provides actionable procedures for incident response as specified in the acceptance criteria.
T-US-F5-002-03 Code Add RunBook availability and quality detection Implement detection logic for missing RunBooks, outdated content, conflicting information, and quality issues with appropriate flagging and reporting mechanisms.
Test plan: Unit test: quality detection correctly identifies outdated and conflicting RunBook content. Test missing RunBook detection and reporting. Integration test: quality issues are properly flagged while useful content is still provided.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create RunBook availability and quality detection system that identifies missing RunBooks, outdated content, and conflicting information with appropriate flagging and reporting. Implement quality assessment that helps maintain RunBook accuracy as specified in the acceptance criteria for handling outdated or conflicting information.
T-US-F5-002-04 Code Create RunBook suggestion and recommendation system Implement recommendation system that suggests creating new RunBooks when none are found and provides guidance on RunBook content based on incident patterns and resolution history.
Test plan: Unit test: recommendation system correctly suggests new RunBook creation when appropriate. Test content suggestions based on incident patterns. Integration test: recommendations provide useful guidance for RunBook creation and improvement.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create RunBook suggestion and recommendation system that identifies gaps in RunBook coverage and suggests creating new documentation based on incident patterns. Implement intelligent recommendations that help improve RunBook completeness as specified in the acceptance criteria for suggesting new documentation creation.
T-US-F5-002-05 Documentation Document Confluence RunBook access and identification Create documentation covering RunBook search algorithms, content extraction procedures, quality detection mechanisms, recommendation system, and best practices for maintaining RunBook relevance.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Confluence RunBook access and identification system including search algorithms, content extraction procedures, quality detection, and recommendation mechanisms. Provide guidance for maintaining RunBook relevance and troubleshooting RunBook integration issues during incident investigation.
US-F5-003 — As a Software engineer, I want to update RunBooks based on incident learnings and agent recommendations, so that RunBooks stay current and accurate for future incident response.
T-US-F5-003-01 Code Create incident learning analysis and extraction Implement analysis system that processes resolved incidents, extracts key learnings, identifies patterns, and generates recommendations for RunBook updates based on incident resolution data.
Test plan: Unit test: learning analysis correctly extracts insights from resolved incidents. Test pattern identification and recommendation generation. Integration test: analysis provides useful RunBook update suggestions based on incident data.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create incident learning analysis system that processes resolved incidents, extracts key learnings, and generates RunBook update recommendations based on resolution patterns and new insights. Implement intelligent analysis that helps keep RunBooks current as specified in the acceptance criteria.
T-US-F5-003-02 Code Implement RunBook update recommendation engine Create recommendation engine that generates specific content suggestions for RunBook updates, identifies outdated sections, and proposes new procedures based on incident learnings.
Test plan: Unit test: recommendation engine generates relevant and specific update suggestions. Test identification of outdated content and new procedure proposals. Integration test: recommendations improve RunBook accuracy and completeness.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement RunBook update recommendation engine that generates specific content suggestions, identifies outdated sections, and proposes new procedures based on incident learnings. Create intelligent recommendations that help maintain RunBook accuracy as specified in the acceptance criteria for providing specific content suggestions.
T-US-F5-003-03 Code Create automated RunBook update system Implement automated update system that applies approved RunBook changes to Confluence pages, maintains version history, and handles update permissions and approval workflows.
Test plan: Unit test: automated update system correctly applies approved changes to Confluence pages. Test version history maintenance and permission handling. Integration test: end-to-end update process from approval to published changes.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create automated RunBook update system that applies approved changes to Confluence pages while maintaining version history and handling permissions appropriately. Implement reliable update mechanism that keeps RunBooks current as specified in the acceptance criteria for updating Confluence pages.
T-US-F5-003-04 Code Add update approval and stakeholder notification Implement approval workflow for RunBook updates, stakeholder notification system, and permission-based update routing to ensure appropriate review and authorization of changes.
Test plan: Unit test: approval workflow correctly routes updates to appropriate stakeholders. Test notification delivery and permission-based routing. Integration test: approval process ensures proper review before RunBook updates.
AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create update approval workflow and stakeholder notification system that routes RunBook update requests to appropriate reviewers and handles permission-based authorization. Implement proper approval process that ensures RunBook changes are reviewed as specified in the acceptance criteria for handling insufficient update permissions.
T-US-F5-003-05 Documentation Document RunBook update and learning integration Create documentation covering incident learning analysis, update recommendation system, automated update procedures, approval workflows, and best practices for maintaining current RunBooks.
AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the RunBook update and learning integration system including incident analysis procedures, recommendation generation, automated update workflows, and approval processes. Provide guidance for maintaining current RunBooks and troubleshooting update integration issues based on incident learnings.

Sprint Plan

5 sprints • 68 total story points • 21 pts/sprint velocity

Sprint 107 16 / 16 pts
Establish foundational AI agent infrastructure and core authentication systems to de-risk deployment unknowns.
US-F1-001US-F1-002US-F1-003US-F2-001
Sprint 108 16 / 16 pts
Implement core alert processing capabilities and establish Slack integration for user communication.
US-F4-001US-F2-002US-F3-001US-F3-002
Sprint 109 16 / 16 pts
Enable automated P1 alert investigation and establish Azure DevOps code repository access.
US-F3-003US-F2-003US-F4-002
Sprint 110 15 / 15 pts
Deliver code-aware incident investigation with relevant code snippets and repository analysis.
US-F5-001US-F4-003US-F5-002
Sprint 111 5 / 5 pts
Complete RunBook integration with automated updates based on incident learnings and agent recommendations.
US-F5-003