|
US-F1-001 — As a DevOps engineer, I want to provision foundational cloud resources for AI agent deployment, so that the AI agent has a secure and scalable infrastructure foundation.
|
| T-US-F1-001-01 |
Infrastructure |
Create Terraform provider configuration and backend setup |
Set up main.tf with required providers (AWS/Azure/GCP), configure remote state backend (S3/Azure Storage), and establish basic project structure with variables.tf and outputs.tf files. Test plan: Unit test: terraform validate passes without errors. Integration test: terraform init successfully configures backend and downloads providers. Test state locking and remote state access. AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Create the foundational Terraform configuration files including provider setup, remote state backend, and project structure. Follow standard Terraform project layout with separate files for variables, outputs, and main configuration. Ensure the configuration supports the core infrastructure provisioning requirements from the acceptance criteria. |
| T-US-F1-001-02 |
Infrastructure |
Implement core cloud infrastructure resources |
Define and provision essential cloud resources including VPC/networking, security groups, IAM roles, compute instances, and storage resources needed for AI agent deployment in resource files. Test plan: Unit test: terraform plan shows expected resource creation without errors. Integration test: terraform apply provisions all resources successfully. Test resource dependencies and proper tagging. AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Implement the core cloud infrastructure resources required for AI agent deployment including networking, security, compute, and storage components. Ensure resources are properly configured with security best practices and appropriate sizing for the AI agent workload. The infrastructure must support secure and scalable AI agent deployment. |
| T-US-F1-001-03 |
Infrastructure |
Add Terraform validation and error handling |
Implement input validation for variables, add conditional logic for resource creation, and configure proper error messages and validation rules in variables.tf and main.tf. Test plan: Unit test: terraform validate with invalid configurations shows clear error messages. Test variable validation rules with edge cases. Integration test: invalid terraform apply attempts fail gracefully with helpful error output. AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Add comprehensive validation and error handling to the Terraform configuration including variable validation rules, conditional resource creation, and clear error messaging. Focus on preventing common configuration mistakes and providing helpful feedback when validation fails. This addresses the acceptance criteria for handling invalid Terraform configurations. |
| T-US-F1-001-04 |
Infrastructure |
Implement idempotent infrastructure management |
Configure Terraform state management, resource lifecycle rules, and import capabilities to ensure repeated terraform apply operations are idempotent and don't create duplicate resources. Test plan: Integration test: run terraform apply multiple times and verify no duplicate resources are created. Test terraform import for existing resources. Verify state consistency after repeated operations. AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Implement idempotent infrastructure management ensuring that repeated terraform apply operations don't create duplicate resources and maintain consistent state. Configure proper lifecycle rules and state management to handle existing infrastructure gracefully. This addresses the acceptance criteria for consistent state management when infrastructure already exists. |
| T-US-F1-001-05 |
Documentation |
Document infrastructure setup and configuration |
Create comprehensive documentation covering Terraform configuration structure, required variables, deployment steps, troubleshooting guide, and infrastructure architecture decisions. Include setup instructions and configuration examples. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Create comprehensive documentation for the Terraform infrastructure setup including configuration structure, required variables, deployment procedures, and troubleshooting guidance. Document the infrastructure architecture decisions and provide clear setup instructions for new team members. Ensure documentation covers all aspects of the foundational cloud resources provisioning process. |
|
US-F1-002 — As a DevOps engineer, I want to configure secure authentication and authorization for the AI agent, so that the agent can securely access external services and APIs.
|
| T-US-F1-002-01 |
Infrastructure |
Create authentication configuration module |
Implement Terraform module for managing authentication credentials, API keys, and service accounts including secure storage using cloud key management services and proper IAM policies. Test plan: Unit test: authentication module validates required parameters. Integration test: credentials are properly stored and accessible by AI agent. Test IAM policy attachments and key rotation capabilities. AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Create a Terraform module for secure authentication and authorization configuration including credential management, API key storage, and IAM policies. Use cloud-native key management services and follow security best practices for credential storage and access. This module will enable the AI agent to securely authenticate with external services. |
| T-US-F1-002-02 |
Infrastructure |
Implement token management and refresh logic |
Configure automated token refresh mechanisms, expiration handling, and credential rotation policies in the infrastructure configuration to support continuous authentication without service interruption. Test plan: Integration test: token refresh occurs automatically before expiration. Test credential rotation without service disruption. Verify error handling for failed token refresh attempts. AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Implement automated token management and refresh mechanisms in the infrastructure configuration including expiration handling and credential rotation policies. Ensure the system can automatically refresh access tokens without service interruption as specified in the acceptance criteria. Configure appropriate monitoring and alerting for authentication failures. |
| T-US-F1-002-03 |
Documentation |
Document authentication and authorization setup |
Create documentation covering authentication configuration, credential management procedures, token refresh mechanisms, security best practices, and troubleshooting guide for authentication issues. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the authentication and authorization setup including credential configuration procedures, token management mechanisms, and security best practices. Provide troubleshooting guidance for common authentication issues and clear instructions for setting up secure access to external services. Cover both initial setup and ongoing maintenance procedures. |
|
US-F1-003 — As a DevOps engineer, I want to automate AI agent deployment and configuration management, so that deployments are consistent, repeatable, and can be rolled back if needed.
|
| T-US-F1-003-01 |
Infrastructure |
Create CI/CD pipeline configuration |
Implement deployment pipeline configuration files (GitHub Actions/Azure DevOps/Jenkins) with stages for validation, testing, deployment, and rollback of the AI agent infrastructure and application code. Test plan: Unit test: pipeline configuration validates successfully. Integration test: pipeline triggers on main branch commits and deploys successfully. Test pipeline failure handling and rollback triggers. AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Create CI/CD pipeline configuration for automated AI agent deployment including validation, testing, deployment, and rollback stages. Ensure the pipeline supports zero-downtime deployments and can be triggered from main branch commits. Configure proper error handling and rollback mechanisms as specified in the acceptance criteria. |
| T-US-F1-003-02 |
Infrastructure |
Implement zero-downtime deployment strategy |
Configure blue-green or rolling deployment strategy in Terraform and deployment scripts to ensure AI agent updates occur without service interruption, including health checks and traffic switching. Test plan: Integration test: deployment completes without service downtime. Test health check validation during deployment. Verify traffic switching occurs only after successful health checks. AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Implement zero-downtime deployment strategy using blue-green or rolling deployment approach including health checks and traffic switching mechanisms. Configure the infrastructure to support seamless AI agent updates without service interruption as required by the acceptance criteria. Ensure proper validation before traffic switching. |
| T-US-F1-003-03 |
Infrastructure |
Create automated rollback mechanism |
Implement rollback automation in deployment scripts and Terraform configuration to automatically restore previous working version within 5 minutes when deployment failures are detected. Test plan: Integration test: rollback completes within 5 minutes when triggered. Test automatic rollback on deployment failure detection. Verify previous version restoration and service availability. AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Create automated rollback mechanisms that can restore the previous working version of the AI agent within 5 minutes when deployment failures are detected. Implement failure detection logic and automated rollback triggers in the deployment pipeline. Ensure the rollback process meets the 5-minute recovery time requirement from the acceptance criteria. |
| T-US-F1-003-04 |
Infrastructure |
Add deployment monitoring and alerting |
Configure monitoring and alerting for deployment pipeline stages, failure detection, and rollback events using cloud monitoring services and notification systems. Test plan: Integration test: alerts trigger on deployment failures. Test monitoring metrics collection during deployments. Verify notification delivery for pipeline events. AI prompt: You are a DevOps engineer working on Datadog Bits AI SRE (Terraform). Implement comprehensive monitoring and alerting for the deployment pipeline including failure detection, rollback events, and deployment success metrics. Configure notifications to alert the team of deployment status and any issues requiring attention. Ensure monitoring covers all pipeline stages and provides visibility into deployment health. |
| T-US-F1-003-05 |
Documentation |
Document deployment and configuration management |
Create documentation covering CI/CD pipeline setup, deployment procedures, rollback processes, monitoring configuration, and troubleshooting guide for deployment issues. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the automated deployment and configuration management system including CI/CD pipeline setup, deployment procedures, rollback processes, and monitoring configuration. Provide troubleshooting guidance for common deployment issues and clear instructions for managing the deployment pipeline. Cover both automated processes and manual intervention procedures. |
|
US-F2-001 — As a DevOps engineer, I want to establish connection to receive P1, P2, and P3 monitor alerts, so that the AI agent can receive and process Datadog alert notifications.
|
| T-US-F2-001-01 |
Code |
Create Datadog API client configuration |
Implement Datadog API client setup with authentication, connection management, and retry logic for receiving monitor alerts. Create configuration files for API credentials and connection parameters. Test plan: Unit test: API client initializes with valid credentials. Test connection retry logic with simulated failures. Integration test: successful connection to Datadog API and credential validation. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create a Datadog API client configuration module for receiving monitor alerts including authentication setup, connection management, and retry logic. Implement proper error handling for API unavailability and credential validation. The client must support receiving P1, P2, and P3 monitor alerts within 30 seconds as specified in the acceptance criteria. |
| T-US-F2-001-02 |
Code |
Implement alert reception and filtering logic |
Create alert reception handler that filters incoming Datadog alerts by priority levels (P1, P2, P3), validates alert format, and queues alerts for processing with proper error handling for unsupported priorities. Test plan: Unit test: alert filtering correctly identifies P1, P2, P3 priorities. Test rejection of invalid priority levels. Integration test: end-to-end alert reception from Datadog to processing queue. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement alert reception and filtering logic that processes incoming Datadog monitor alerts, validates priority levels (P1, P2, P3), and handles unsupported priorities gracefully. Create proper queuing mechanism for valid alerts and logging for invalid ones. This addresses the acceptance criteria for receiving and filtering monitor alerts by priority. |
| T-US-F2-001-03 |
Documentation |
Document Datadog monitor integration setup |
Create documentation covering Datadog API configuration, alert reception setup, priority level handling, troubleshooting connection issues, and monitoring integration best practices. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Datadog monitor integration setup including API configuration, alert reception mechanisms, priority level filtering, and troubleshooting guidance. Provide clear instructions for configuring the connection and handling common integration issues. Cover the alert reception process and priority level requirements. |
|
US-F2-002 — As a On-call engineer, I want to have alerts automatically analyzed and categorized by the AI agent, so that I can quickly understand alert severity and potential root causes.
|
| T-US-F2-002-01 |
Code |
Create alert metadata extraction engine |
Implement alert parsing and metadata extraction logic to process Datadog alert payloads, extract relevant information (severity, service, metrics, timestamps), and structure data for analysis. Test plan: Unit test: metadata extraction handles various alert formats correctly. Test parsing of malformed alert data with graceful error handling. Integration test: extracted metadata matches expected structure and completeness. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create an alert metadata extraction engine that parses Datadog alert payloads and extracts relevant information including severity, service details, metrics, and timestamps. Implement robust parsing logic that handles various alert formats and gracefully manages malformed data as specified in the acceptance criteria. |
| T-US-F2-002-02 |
Code |
Implement alert categorization and analysis logic |
Create AI-powered alert categorization system that analyzes extracted metadata, determines alert severity, identifies potential root causes, and categorizes alerts by type and impact. Test plan: Unit test: categorization logic correctly classifies different alert types. Test analysis accuracy with sample alert data. Integration test: end-to-end alert processing from extraction to categorization. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement AI-powered alert categorization and analysis logic that processes extracted alert metadata to determine severity, identify potential root causes, and categorize alerts appropriately. The system should help on-call engineers quickly understand alert severity and potential causes as required by the acceptance criteria. |
| T-US-F2-002-03 |
Code |
Create alert grouping and deduplication system |
Implement logic to group related alerts, deduplicate similar alerts received simultaneously, and manage alert correlation to reduce noise and improve incident response efficiency. Test plan: Unit test: alert grouping correctly identifies related alerts. Test deduplication logic with simultaneous similar alerts. Integration test: grouped alerts maintain proper relationships and reduce notification spam. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create alert grouping and deduplication system that identifies related alerts, removes duplicates, and correlates simultaneous alerts to reduce noise. Implement intelligent grouping logic that helps on-call engineers focus on unique issues rather than being overwhelmed by related alerts. This addresses the acceptance criteria for handling multiple related alerts appropriately. |
| T-US-F2-002-04 |
Code |
Add error handling for malformed alert data |
Implement comprehensive error handling for incomplete or malformed alert data, including validation, error logging, manual review flagging, and graceful degradation of analysis capabilities. Test plan: Unit test: error handling correctly identifies and processes malformed data. Test various types of incomplete alert payloads. Integration test: system continues operating when encountering bad data and properly flags for manual review. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement comprehensive error handling for malformed or incomplete alert data including validation logic, error logging, and manual review flagging. Ensure the system gracefully handles bad data and continues processing other alerts while properly escalating problematic cases. This addresses the acceptance criteria for handling malformed alert data gracefully. |
| T-US-F2-002-05 |
Documentation |
Document alert analysis and categorization system |
Create documentation covering alert processing workflow, metadata extraction logic, categorization criteria, grouping algorithms, error handling procedures, and troubleshooting guide for alert analysis issues. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the alert analysis and categorization system including processing workflow, metadata extraction logic, categorization criteria, and grouping algorithms. Provide troubleshooting guidance for alert processing issues and explain how the system helps on-call engineers understand alert severity and root causes. Cover error handling procedures and manual review processes. |
|
US-F2-003 — As a On-call engineer, I want to have P1 alerts automatically investigated and initial response actions taken, so that critical incidents get immediate attention without waiting for manual intervention.
|
| T-US-F2-003-01 |
Code |
Create P1 alert detection and routing logic |
Implement priority-based alert routing that identifies P1 alerts and triggers automated investigation workflows while routing P2/P3 alerts to analysis-only processing paths. Test plan: Unit test: P1 alert detection correctly identifies critical alerts. Test routing logic separates P1 from P2/P3 alerts. Integration test: P1 alerts trigger automated workflows while P2/P3 alerts only get analyzed. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create priority-based alert routing logic that identifies P1 alerts and triggers automated investigation workflows while ensuring P2 and P3 alerts only receive analysis without automated actions. Implement proper alert classification and routing mechanisms as specified in the acceptance criteria for different priority levels. |
| T-US-F2-003-02 |
Code |
Implement automated P1 investigation workflow |
Create automated investigation engine for P1 alerts that executes predefined investigation steps, gathers system information, runs diagnostic commands, and collects relevant data within 2 minutes. Test plan: Unit test: investigation steps execute in correct sequence. Test timeout handling for long-running investigations. Integration test: complete P1 investigation workflow completes within 2 minutes with proper data collection. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement automated P1 investigation workflow that executes investigation steps within 2 minutes including system information gathering, diagnostic commands, and data collection. Create a robust investigation engine that can handle various P1 alert types and provide immediate attention to critical incidents as required by the acceptance criteria. |
| T-US-F2-003-03 |
Code |
Create escalation mechanism for investigation failures |
Implement escalation logic that detects automated investigation failures or timeouts and immediately escalates P1 incidents to on-call engineers through multiple notification channels. Test plan: Unit test: escalation triggers correctly on investigation failures and timeouts. Test multiple notification channel delivery. Integration test: end-to-end escalation process from failure detection to engineer notification. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create escalation mechanism that detects when automated P1 investigation fails or times out and immediately escalates to on-call engineers. Implement multiple notification channels and ensure reliable escalation delivery as specified in the acceptance criteria for handling investigation failures. |
| T-US-F2-003-04 |
Code |
Add investigation result tracking and reporting |
Implement tracking system for investigation results, success/failure metrics, timing data, and reporting capabilities to monitor automated investigation effectiveness and identify improvement areas. Test plan: Unit test: investigation tracking correctly records results and metrics. Test reporting data accuracy and completeness. Integration test: tracking system captures full investigation lifecycle and generates useful reports. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement investigation result tracking and reporting system that monitors automated investigation effectiveness, tracks success/failure rates, and provides insights for improvement. Create comprehensive tracking that helps optimize the automated investigation process and demonstrates value to the engineering team. |
| T-US-F2-003-05 |
Code |
Create investigation timeout and resource management |
Implement timeout controls, resource limits, and cleanup mechanisms for automated investigations to prevent resource exhaustion and ensure investigations complete within acceptable time limits. Test plan: Unit test: timeout controls properly terminate long-running investigations. Test resource limit enforcement and cleanup procedures. Integration test: resource management prevents system overload during multiple concurrent investigations. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create investigation timeout and resource management system that prevents automated investigations from consuming excessive resources or running indefinitely. Implement proper cleanup mechanisms and resource limits to ensure system stability during investigation execution. This ensures investigations complete within the 2-minute requirement from the acceptance criteria. |
| T-US-F2-003-06 |
Code |
Add investigation workflow configuration |
Create configurable investigation workflow system that allows customization of investigation steps, timeouts, escalation rules, and investigation procedures for different types of P1 alerts. Test plan: Unit test: workflow configuration correctly customizes investigation steps. Test different alert type configurations. Integration test: customized workflows execute properly for various P1 alert scenarios. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create configurable investigation workflow system that allows customization of automated investigation procedures for different P1 alert types. Implement flexible configuration management that enables tuning of investigation steps, timeouts, and escalation rules based on alert characteristics and organizational needs. |
| T-US-F2-003-07 |
Code |
Implement investigation result analysis and recommendations |
Create analysis engine that processes investigation results, identifies patterns, generates recommendations for incident resolution, and provides actionable insights to on-call engineers. Test plan: Unit test: analysis engine correctly processes investigation data and generates recommendations. Test recommendation quality and relevance. Integration test: complete analysis workflow from investigation results to actionable recommendations. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement investigation result analysis engine that processes automated investigation data, identifies patterns, and generates actionable recommendations for incident resolution. Create intelligent analysis that helps on-call engineers quickly understand investigation findings and take appropriate action based on the automated investigation results. |
| T-US-F2-003-08 |
Documentation |
Document P1 alert automation and investigation system |
Create documentation covering P1 alert automation workflow, investigation procedures, escalation mechanisms, configuration options, troubleshooting guide, and best practices for automated incident response. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the P1 alert automation and investigation system including workflow procedures, escalation mechanisms, configuration options, and troubleshooting guidance. Provide clear explanations of how automated investigation works, when escalation occurs, and how to optimize the system for different incident types. Cover both automated processes and manual intervention procedures. |
|
US-F3-001 — As a DevOps engineer, I want to configure Slack bot for AI agent communication, so that engineers can interact with the AI agent through their existing Slack workspace.
|
| T-US-F3-001-01 |
Code |
Create Slack bot application and authentication |
Set up Slack bot application in workspace, configure OAuth tokens, bot permissions, and authentication credentials for AI agent communication through Slack API. Test plan: Unit test: bot authentication validates successfully with Slack API. Test permission verification and token refresh. Integration test: bot can connect to workspace and respond to mentions. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create Slack bot application setup including OAuth authentication, bot permissions configuration, and credential management for AI agent communication. Implement proper authentication flow and permission validation to enable bot interaction within the Slack workspace as specified in the acceptance criteria. |
| T-US-F3-001-02 |
Code |
Implement bot response and command handling system |
Create bot response handler that processes mentions, displays available commands and status information, and manages basic bot interactions with proper error handling and help functionality. Test plan: Unit test: bot responds correctly to mentions with command list and status. Test help command functionality and error responses. Integration test: end-to-end bot interaction from mention to response delivery. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement Slack bot response and command handling system that processes bot mentions, displays available commands and status information, and provides helpful responses to users. Create intuitive command interface that helps engineers interact with the AI agent effectively through Slack as required by the acceptance criteria. |
| T-US-F3-001-03 |
Documentation |
Document Slack bot configuration and setup |
Create documentation covering Slack bot application setup, authentication configuration, permission requirements, workspace integration steps, and troubleshooting guide for bot connectivity issues. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Slack bot configuration and setup process including application creation, authentication setup, permission requirements, and workspace integration steps. Provide troubleshooting guidance for common bot connectivity issues and clear instructions for configuring the bot for AI agent communication. |
|
US-F3-002 — As a On-call engineer, I want to interact with the AI agent using slash commands and buttons, so that I can quickly query incident status and request specific investigations.
|
| T-US-F3-002-01 |
Code |
Create slash command registration and routing |
Implement Slack slash command registration, command parsing, routing logic, and validation to handle various AI agent interaction commands with proper parameter handling. Test plan: Unit test: slash command parsing correctly identifies commands and parameters. Test command validation and routing logic. Integration test: slash commands execute properly and return expected responses. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create Slack slash command system including command registration, parsing, routing, and validation logic. Implement comprehensive command handling that allows on-call engineers to interact with the AI agent through intuitive slash commands as specified in the acceptance criteria for querying incident status and requesting investigations. |
| T-US-F3-002-02 |
Code |
Implement interactive button and action handling |
Create interactive Slack components (buttons, select menus) and action handlers that process user interactions, execute corresponding AI agent actions, and provide immediate feedback. Test plan: Unit test: interactive button actions execute correctly and provide feedback. Test various button types and action handlers. Integration test: end-to-end interactive workflow from button click to action completion. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement Slack interactive components including buttons and action handlers that process user interactions and execute AI agent actions. Create responsive interactive interface that provides immediate feedback and enables efficient incident management through Slack as required by the acceptance criteria for interactive button functionality. |
| T-US-F3-002-03 |
Code |
Add command validation and error handling |
Implement comprehensive command validation, error handling for invalid commands, helpful error messages, and command syntax assistance to improve user experience. Test plan: Unit test: command validation correctly identifies invalid commands and parameters. Test error message clarity and helpfulness. Integration test: invalid commands return appropriate error responses with syntax guidance. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement command validation and error handling system that provides helpful error messages and command syntax assistance for invalid or malformed commands. Create user-friendly error responses that guide engineers toward correct command usage as specified in the acceptance criteria for handling invalid commands. |
| T-US-F3-002-04 |
Code |
Create incident status query and investigation commands |
Implement specific slash commands for querying incident status, requesting investigations, and accessing AI agent capabilities with proper data formatting and response handling. Test plan: Unit test: incident status commands return properly formatted data. Test investigation request commands and response handling. Integration test: commands successfully interact with AI agent backend and return relevant information. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create specific slash commands for incident status queries and investigation requests that integrate with the AI agent backend. Implement proper data formatting and response handling to provide on-call engineers with quick access to incident information and investigation capabilities as required by the acceptance criteria. |
| T-US-F3-002-05 |
Documentation |
Document Slack interaction and command system |
Create documentation covering available slash commands, interactive button usage, command syntax, error handling, and best practices for interacting with the AI agent through Slack. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Slack interaction system including available slash commands, interactive button usage, command syntax, and error handling procedures. Provide clear usage examples and best practices for on-call engineers to effectively interact with the AI agent through Slack commands and interactive components. |
|
US-F3-003 — As a On-call engineer, I want to receive formatted alert notifications directly in Slack channels, so that I stay informed about incidents without switching between multiple tools.
|
| T-US-F3-003-01 |
Code |
Create formatted alert notification system |
Implement Slack message formatting for alert notifications including alert details, severity indicators, suggested actions, and rich formatting to improve readability and actionability. Test plan: Unit test: alert notifications format correctly with all required details. Test various alert types and severity levels. Integration test: formatted notifications deliver successfully to Slack channels. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create formatted alert notification system for Slack that includes alert details, severity indicators, and suggested actions with rich formatting. Implement clear and actionable notification format that helps on-call engineers quickly understand and respond to incidents as specified in the acceptance criteria. |
| T-US-F3-003-02 |
Code |
Implement channel management and fallback logic |
Create channel availability checking, fallback channel routing, and error handling for unavailable or archived channels to ensure notifications are always delivered. Test plan: Unit test: channel availability checking correctly identifies unavailable channels. Test fallback routing to alternative channels. Integration test: notifications successfully deliver even when primary channels are unavailable. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement channel management system with availability checking and fallback routing to ensure alert notifications are always delivered even when primary channels are unavailable or archived. Create robust notification delivery system as specified in the acceptance criteria for handling channel availability issues. |
| T-US-F3-003-03 |
Documentation |
Document Slack alert notification system |
Create documentation covering alert notification formatting, channel configuration, fallback mechanisms, message threading, and troubleshooting guide for notification delivery issues. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Slack alert notification system including message formatting, channel configuration, fallback mechanisms, and threading behavior. Provide troubleshooting guidance for notification delivery issues and clear instructions for configuring alert notifications to keep on-call engineers informed through Slack. |
|
US-F4-001 — As a DevOps engineer, I want to establish secure connection to Azure DevOps repositories and services, so that the AI agent can access code context for incident investigation.
|
| T-US-F4-001-01 |
Code |
Create Azure DevOps API client and authentication |
Implement Azure DevOps API client with authentication setup, personal access token management, and connection verification for accessing repositories and services. Test plan: Unit test: API client initializes with valid credentials and establishes connection. Test authentication error handling and token validation. Integration test: successful connection to Azure DevOps and repository access verification. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create Azure DevOps API client with secure authentication setup including personal access token management and connection verification. Implement proper error handling for authentication failures and credential validation as specified in the acceptance criteria for establishing secure connection to Azure DevOps. |
| T-US-F4-001-02 |
Code |
Implement API rate limiting and request throttling |
Create rate limiting logic, request throttling, and queue management to handle Azure DevOps API rate limits and ensure reliable access without exceeding quotas. Test plan: Unit test: rate limiting correctly throttles requests within API limits. Test queue management for multiple concurrent requests. Integration test: system operates reliably under API rate limit constraints. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement API rate limiting and request throttling system for Azure DevOps API access including queue management and request scheduling. Create robust rate limiting that prevents API quota exceeded errors while maintaining system responsiveness as specified in the acceptance criteria for handling API rate limits. |
| T-US-F4-001-03 |
Documentation |
Document Azure DevOps integration setup |
Create documentation covering Azure DevOps API setup, authentication configuration, rate limiting behavior, troubleshooting connection issues, and best practices for repository access. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Azure DevOps integration setup including API configuration, authentication procedures, rate limiting behavior, and troubleshooting guidance. Provide clear instructions for establishing secure connections and managing repository access for incident investigation purposes. |
|
US-F4-002 — As a Software engineer, I want to have the agent automatically identify relevant code repositories for alerts, so that incident investigation includes relevant code context and recent changes.
|
| T-US-F4-002-01 |
Code |
Create service-to-repository mapping system |
Implement mapping logic that correlates alert service information with Azure DevOps repositories, including configuration management for service definitions and repository associations. Test plan: Unit test: service mapping correctly identifies repositories for known services. Test handling of unmapped services and multiple repository associations. Integration test: alert processing successfully identifies relevant repositories. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create service-to-repository mapping system that correlates alert service information with Azure DevOps repositories. Implement configurable mapping logic that can identify relevant repositories for incident investigation as specified in the acceptance criteria for automatically identifying relevant code repositories. |
| T-US-F4-002-02 |
Code |
Implement repository discovery and validation |
Create repository discovery logic that validates repository access, checks permissions, and handles missing or inaccessible repositories with appropriate fallback behavior. Test plan: Unit test: repository discovery correctly validates access and permissions. Test handling of inaccessible repositories and permission errors. Integration test: discovery process completes successfully with proper error handling. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement repository discovery and validation system that checks repository access, validates permissions, and handles missing mappings gracefully. Create robust discovery logic that continues investigation with available information when repositories are inaccessible as specified in the acceptance criteria. |
| T-US-F4-002-03 |
Code |
Add repository prioritization and relevance scoring |
Create relevance scoring algorithm that prioritizes multiple repositories associated with a service based on recent activity, code changes, and relationship to the alert context. Test plan: Unit test: relevance scoring correctly prioritizes repositories based on activity and context. Test scoring algorithm with various repository scenarios. Integration test: prioritized repository list provides most relevant repositories first. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create repository prioritization system that scores and ranks multiple repositories by relevance to the incident context including recent activity, code changes, and alert relationship. Implement intelligent prioritization that helps focus investigation on the most relevant repositories as specified in the acceptance criteria. |
| T-US-F4-002-04 |
Code |
Create missing mapping detection and reporting |
Implement detection logic for incomplete service-to-repository mappings, logging of missing mappings, and reporting mechanisms to help maintain mapping completeness. Test plan: Unit test: missing mapping detection correctly identifies unmapped services. Test logging and reporting functionality for mapping gaps. Integration test: missing mappings are properly reported while investigation continues. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create missing mapping detection and reporting system that identifies incomplete service-to-repository mappings and logs missing associations for future configuration updates. Implement proper reporting that helps maintain mapping completeness while allowing investigation to continue as specified in the acceptance criteria. |
| T-US-F4-002-05 |
Documentation |
Document service-repository mapping and discovery |
Create documentation covering service-to-repository mapping configuration, discovery algorithms, prioritization logic, missing mapping handling, and best practices for maintaining accurate mappings. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the service-to-repository mapping system including configuration procedures, discovery algorithms, prioritization logic, and missing mapping handling. Provide guidance for maintaining accurate mappings and troubleshooting repository identification issues during incident investigation. |
|
US-F4-003 — As a On-call engineer, I want to receive relevant code snippets and recent changes during incident investigation, so that I can quickly identify potential code-related causes of incidents.
|
| T-US-F4-003-01 |
Code |
Create code analysis and snippet extraction engine |
Implement code analysis engine that examines identified repositories, extracts relevant code sections, and identifies recent commits related to the incident context. Test plan: Unit test: code analysis correctly identifies relevant code sections and recent commits. Test snippet extraction with various code structures. Integration test: analysis completes within reasonable time limits with focused results. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create code analysis engine that examines repositories, extracts relevant code sections, and identifies recent commits related to incident context. Implement efficient analysis that provides focused results within reasonable time limits as specified in the acceptance criteria for receiving relevant code snippets and recent changes. |
| T-US-F4-003-02 |
Code |
Implement recent change analysis and highlighting |
Create recent change analysis that examines commit history, identifies potentially relevant changes, and highlights modifications that might be related to the incident. Test plan: Unit test: recent change analysis correctly identifies relevant commits and modifications. Test change highlighting with various commit patterns. Integration test: change analysis provides useful insights for incident investigation. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement recent change analysis system that examines commit history and identifies potentially relevant modifications related to incidents. Create intelligent change highlighting that helps on-call engineers quickly identify code-related causes as specified in the acceptance criteria for highlighting recent commits. |
| T-US-F4-003-03 |
Code |
Add repository access error handling and fallbacks |
Implement comprehensive error handling for repository access issues, permission denials, and unavailable repositories with appropriate fallback behavior and user notification. Test plan: Unit test: error handling correctly manages repository access failures and permission issues. Test fallback behavior when repositories are unavailable. Integration test: investigation continues successfully despite repository access problems. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement repository access error handling including permission denials, unavailable repositories, and access failures with appropriate fallback behavior. Create robust error handling that allows investigation to continue with other sources when repository access is limited as specified in the acceptance criteria. |
| T-US-F4-003-04 |
Code |
Create analysis performance optimization |
Implement performance optimization for code analysis including caching, parallel processing, and analysis scope limiting to ensure analysis completes within reasonable time limits. Test plan: Unit test: performance optimizations reduce analysis time without sacrificing quality. Test caching effectiveness and parallel processing. Integration test: large repository analysis completes within acceptable time limits. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create performance optimization system for code analysis including caching, parallel processing, and scope limiting to ensure analysis of large repositories completes within reasonable time limits. Implement efficient analysis that provides focused results without overwhelming the investigation process as specified in the acceptance criteria. |
| T-US-F4-003-05 |
Code |
Implement code context formatting and presentation |
Create code context formatting system that presents relevant code snippets, recent changes, and analysis results in a clear, actionable format for incident investigation. Test plan: Unit test: code formatting presents information clearly and actionably. Test various code snippet formats and change presentations. Integration test: formatted code context enhances incident investigation effectiveness. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement code context formatting and presentation system that displays relevant code snippets, recent changes, and analysis results in clear, actionable format. Create effective presentation that helps on-call engineers quickly understand code-related aspects of incidents during investigation. |
| T-US-F4-003-06 |
Code |
Add code analysis result caching and storage |
Implement caching system for code analysis results, repository metadata, and recent change information to improve performance and reduce API calls to Azure DevOps. Test plan: Unit test: caching system correctly stores and retrieves analysis results. Test cache invalidation and update mechanisms. Integration test: cached results improve analysis performance and reduce API usage. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create caching system for code analysis results, repository metadata, and recent change information to optimize performance and reduce Azure DevOps API usage. Implement intelligent caching that improves analysis speed while maintaining result accuracy and freshness. |
| T-US-F4-003-07 |
Code |
Create analysis scope configuration and filtering |
Implement configurable analysis scope that allows filtering by file types, directories, time ranges, and other criteria to focus code analysis on most relevant areas. Test plan: Unit test: analysis scope configuration correctly filters code analysis by specified criteria. Test various filtering options and scope settings. Integration test: scoped analysis provides more focused and relevant results. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create configurable analysis scope system that filters code analysis by file types, directories, time ranges, and other criteria. Implement flexible filtering that focuses analysis on most relevant areas while maintaining comprehensive coverage of potential incident-related code changes. |
| T-US-F4-003-08 |
Documentation |
Document Azure DevOps code access and analysis |
Create documentation covering code analysis workflow, repository access procedures, analysis scope configuration, performance optimization, error handling, and troubleshooting guide for code integration issues. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Azure DevOps code access and analysis system including workflow procedures, repository access setup, analysis scope configuration, and performance optimization. Provide troubleshooting guidance for code integration issues and explain how code context enhances incident investigation capabilities. |
|
US-F5-001 — As a DevOps engineer, I want to establish connection to Confluence for RunBook access, so that the AI agent can retrieve and reference existing RunBooks during investigations.
|
| T-US-F5-001-01 |
Code |
Create Confluence API client and authentication |
Implement Confluence API client with authentication setup, API token management, and connection verification for accessing RunBook spaces and pages. Test plan: Unit test: API client initializes with valid credentials and establishes connection. Test authentication error handling and token validation. Integration test: successful connection to Confluence and RunBook space access verification. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create Confluence API client with secure authentication setup including API token management and connection verification for RunBook access. Implement proper error handling for authentication failures and credential validation as specified in the acceptance criteria for establishing connection to Confluence. |
| T-US-F5-001-02 |
Documentation |
Document Confluence RunBook integration setup |
Create documentation covering Confluence API setup, authentication configuration, RunBook space access, permission requirements, and troubleshooting guide for connection issues. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Confluence RunBook integration setup including API configuration, authentication procedures, RunBook space access, and permission requirements. Provide troubleshooting guidance for connection issues and clear instructions for establishing RunBook access for incident investigation. |
|
US-F5-002 — As a On-call engineer, I want to have relevant RunBooks automatically identified and referenced during incidents, so that I can quickly access established procedures and troubleshooting steps.
|
| T-US-F5-002-01 |
Code |
Create RunBook search and matching algorithm |
Implement intelligent search algorithm that matches incident alerts with relevant RunBooks based on keywords, service names, error patterns, and content analysis. Test plan: Unit test: search algorithm correctly identifies relevant RunBooks for various alert types. Test keyword matching and content analysis accuracy. Integration test: RunBook matching provides useful results for incident investigation. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create RunBook search and matching algorithm that identifies relevant RunBooks for incident alerts using keywords, service names, error patterns, and content analysis. Implement intelligent matching that helps on-call engineers quickly access established procedures as specified in the acceptance criteria. |
| T-US-F5-002-02 |
Code |
Implement RunBook content extraction and summarization |
Create content extraction system that retrieves relevant RunBook sections, extracts key troubleshooting steps, and summarizes important procedures for incident response. Test plan: Unit test: content extraction correctly identifies and summarizes key RunBook sections. Test extraction with various RunBook formats and structures. Integration test: extracted content provides actionable troubleshooting information. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement RunBook content extraction and summarization system that retrieves relevant sections and extracts key troubleshooting steps from matched RunBooks. Create effective summarization that provides actionable procedures for incident response as specified in the acceptance criteria. |
| T-US-F5-002-03 |
Code |
Add RunBook availability and quality detection |
Implement detection logic for missing RunBooks, outdated content, conflicting information, and quality issues with appropriate flagging and reporting mechanisms. Test plan: Unit test: quality detection correctly identifies outdated and conflicting RunBook content. Test missing RunBook detection and reporting. Integration test: quality issues are properly flagged while useful content is still provided. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create RunBook availability and quality detection system that identifies missing RunBooks, outdated content, and conflicting information with appropriate flagging and reporting. Implement quality assessment that helps maintain RunBook accuracy as specified in the acceptance criteria for handling outdated or conflicting information. |
| T-US-F5-002-04 |
Code |
Create RunBook suggestion and recommendation system |
Implement recommendation system that suggests creating new RunBooks when none are found and provides guidance on RunBook content based on incident patterns and resolution history. Test plan: Unit test: recommendation system correctly suggests new RunBook creation when appropriate. Test content suggestions based on incident patterns. Integration test: recommendations provide useful guidance for RunBook creation and improvement. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create RunBook suggestion and recommendation system that identifies gaps in RunBook coverage and suggests creating new documentation based on incident patterns. Implement intelligent recommendations that help improve RunBook completeness as specified in the acceptance criteria for suggesting new documentation creation. |
| T-US-F5-002-05 |
Documentation |
Document Confluence RunBook access and identification |
Create documentation covering RunBook search algorithms, content extraction procedures, quality detection mechanisms, recommendation system, and best practices for maintaining RunBook relevance. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the Confluence RunBook access and identification system including search algorithms, content extraction procedures, quality detection, and recommendation mechanisms. Provide guidance for maintaining RunBook relevance and troubleshooting RunBook integration issues during incident investigation. |
|
US-F5-003 — As a Software engineer, I want to update RunBooks based on incident learnings and agent recommendations, so that RunBooks stay current and accurate for future incident response.
|
| T-US-F5-003-01 |
Code |
Create incident learning analysis and extraction |
Implement analysis system that processes resolved incidents, extracts key learnings, identifies patterns, and generates recommendations for RunBook updates based on incident resolution data. Test plan: Unit test: learning analysis correctly extracts insights from resolved incidents. Test pattern identification and recommendation generation. Integration test: analysis provides useful RunBook update suggestions based on incident data. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create incident learning analysis system that processes resolved incidents, extracts key learnings, and generates RunBook update recommendations based on resolution patterns and new insights. Implement intelligent analysis that helps keep RunBooks current as specified in the acceptance criteria. |
| T-US-F5-003-02 |
Code |
Implement RunBook update recommendation engine |
Create recommendation engine that generates specific content suggestions for RunBook updates, identifies outdated sections, and proposes new procedures based on incident learnings. Test plan: Unit test: recommendation engine generates relevant and specific update suggestions. Test identification of outdated content and new procedure proposals. Integration test: recommendations improve RunBook accuracy and completeness. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Implement RunBook update recommendation engine that generates specific content suggestions, identifies outdated sections, and proposes new procedures based on incident learnings. Create intelligent recommendations that help maintain RunBook accuracy as specified in the acceptance criteria for providing specific content suggestions. |
| T-US-F5-003-03 |
Code |
Create automated RunBook update system |
Implement automated update system that applies approved RunBook changes to Confluence pages, maintains version history, and handles update permissions and approval workflows. Test plan: Unit test: automated update system correctly applies approved changes to Confluence pages. Test version history maintenance and permission handling. Integration test: end-to-end update process from approval to published changes. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create automated RunBook update system that applies approved changes to Confluence pages while maintaining version history and handling permissions appropriately. Implement reliable update mechanism that keeps RunBooks current as specified in the acceptance criteria for updating Confluence pages. |
| T-US-F5-003-04 |
Code |
Add update approval and stakeholder notification |
Implement approval workflow for RunBook updates, stakeholder notification system, and permission-based update routing to ensure appropriate review and authorization of changes. Test plan: Unit test: approval workflow correctly routes updates to appropriate stakeholders. Test notification delivery and permission-based routing. Integration test: approval process ensures proper review before RunBook updates. AI prompt: You are a backend engineer working on Datadog Bits AI SRE (Terraform). Create update approval workflow and stakeholder notification system that routes RunBook update requests to appropriate reviewers and handles permission-based authorization. Implement proper approval process that ensures RunBook changes are reviewed as specified in the acceptance criteria for handling insufficient update permissions. |
| T-US-F5-003-05 |
Documentation |
Document RunBook update and learning integration |
Create documentation covering incident learning analysis, update recommendation system, automated update procedures, approval workflows, and best practices for maintaining current RunBooks. AI prompt: You are a technical writer working on Datadog Bits AI SRE documentation. Document the RunBook update and learning integration system including incident analysis procedures, recommendation generation, automated update workflows, and approval processes. Provide guidance for maintaining current RunBooks and troubleshooting update integration issues based on incident learnings. |