How to Use AI to Help Sre Teams Create on Call Handoff

Last updated: March 16, 2026

On-call rotations are a cornerstone of site reliability engineering, but the handoff process between shifts often becomes a time-consuming manual task. SRE teams frequently struggle to capture the right details, maintain consistency, and ensure the next engineer has everything needed to handle incoming incidents. AI tools offer practical ways to automate and improve on-call handoff documentation without adding cognitive burden to already-taxed teams.

Prerequisites

Before you begin, make sure you have the following ready:

A computer running macOS, Linux, or Windows
Terminal or command-line access
Administrator or sudo privileges (for system-level changes)
A stable internet connection for downloading tools

Step 1: The Problem with Manual Handoff Documents

When engineers manually write handoff notes, several issues emerge. Notes become inconsistent between shifts—some engineers detail every system interaction while others provide minimal context. Important details get omitted because the outgoing engineer assumes “everyone knows” certain systems or processes. Time pressure leads to rushed documentation that lacks the specificity needed for effective incident response.

A well-structured handoff document should answer key questions: What issues are currently open? Which systems require active monitoring? What recent changes might have introduced risk? What workarounds exist for known issues? Manually answering these questions every shift drains time that engineers could spend on proactive reliability work.

Step 2: AI-Assisted Handoff Workflows

AI tools can assist at multiple stages of the handoff process. The most effective approach combines automated data gathering with intelligent summarization, then allows engineers to review and augment the output.

Collecting Relevant Context

Start by gathering data from your monitoring systems, incident trackers, and recent changes. A script can pull this information together before the handoff meeting:

import subprocess
from datetime import datetime, timedelta

def gather_handoff_context():
    """Gather relevant context for on-call handoff."""
    context = {}

    # Get open incidents from your ticketing system
    context['open_incidents'] = subprocess.run(
        ['jira', 'search', '--jql', 'status=Open AND labels=production'],
        capture_output=True, text=True
    ).stdout

    # Fetch recent deployments from the last 24 hours
    context['recent_deployments'] = subprocess.run(
        ['gh', 'run', 'list', '--limit', '50'],
        capture_output=True, text=True
    ).stdout

    # Get current system metrics snapshots
    context['active_alerts'] = subprocess.run(
        ['prometheus', 'query', 'up{job="production"} == 0'],
        capture_output=True, text=True
    ).stdout

    return context

This script aggregates raw data from multiple sources. The outgoing engineer reviews the output and identifies items requiring attention. This approach ensures nothing falls through the cracks while reducing the manual effort of searching through multiple systems.

Generating Summaries with AI

Once you have raw context, AI can synthesize this information into readable summaries. Rather than copying and pasting from various dashboards, feed the gathered data into an AI model with clear instructions about what the summary should contain:

Context: Recent deployments, open incidents, active alerts
Task: Write a concise handoff summary for the incoming on-call engineer
Include: Known issues, systems requiring attention, recent changes that might cause problems
Style: Technical but accessible, bullet points preferred

The AI generates a first draft that the outgoing engineer reviews and refines. This human-in-the-loop approach maintains accuracy while dramatically reducing documentation time.

Step 3: Practical Templates and Examples

Effective AI-assisted handoffs work best with structured templates. Define what information matters for your team, then use AI to populate those sections intelligently.

Incident Status Section

Instead of manually listing every open incident, provide the AI with incident IDs and ask for status summaries:

### Step 4: Current Incidents

### INC-1234: Elevated latency in payment service
- Status: Investigating
- Timeline: Identified 2 hours ago, root cause not yet determined
- Customer impact: 15% of transactions experiencing 500ms delay
- Notes from on-call: We believe this relates to the database connection pool
  changes from yesterday's deployment. The rollback was delayed because...

### INC-1235: SSL certificate warning on api.example.com
- Status: Monitoring
- Timeline: Detected 6 hours ago, temporary fix applied
- Workaround: Using fallback certificate while renewal processes

The AI helps maintain consistent formatting and ensures each incident includes the information engineers actually need: timeline, impact, and current status.

System Health Overview

For the systems overview section, generate a quick health check from monitoring data:

### Step 5: Systems Status

| System | Status | Notes |
|--------|--------|-------|
| api-gateway | Healthy | Response times normal |
| user-service | Degraded | Memory usage elevated, investigating |
| payment-processor | Healthy | |
| analytics-pipeline | Healthy | Slight delay in processing (< 1 min) |

AI can transform raw metric outputs into this table format, saving manual formatting time while maintaining readability.

Action Items for Incoming Engineer

Clearly list tasks that need attention during the next shift:

### Step 6: Action Items

1. Monitor user-service memory usage - may need to scale if it continues trending up
2. Follow up with security team on INC-1235 certificate renewal
3. Review INC-1234 if it escalates - database team available until 3pm PST
4. Complete the canary deployment for feature flag "new-checkout-flow"
   if no errors after 2 hours

Step 7: Automate the Workflow

For teams ready to fully automate, integrate AI handoff generation into existing tooling. A CI/CD pipeline can run the context-gathering script, feed results to an AI model, and post the draft handoff document to your team’s communication channel before each shift ends.

This automation works best when paired with clear team conventions. Define which data sources to include, establish the required sections, and create a review process. The AI handles the heavy lifting of aggregation and formatting while engineers provide the critical domain expertise that cannot be automated.

Step 8: Tips for Effective AI-Assisted Handoffs

Start small. Use AI to assist with one section of your handoff document initially. Measure the time savings and refine your approach before expanding to other sections.

Always include human review. AI generates drafts, not final documents. The outgoing engineer understands context the AI cannot capture—recent conversations with other teams, tacit knowledge about system behavior, and nuances that don’t appear in metrics.

Maintain consistency. Use the same template every shift. This predictability helps incoming engineers find information quickly and ensures no critical sections get omitted.

Store historical handoffs. Having a searchable archive of past handoffs helps AI models improve their output over time and allows engineers to reference previous incidents.

Step 9: Handoff Document Template with AI Placeholders

# On-Call Handoff Report
**Date:** [AUTO: current date]
**Outgoing Engineer:** [NAME]
**Incoming Engineer:** [NAME]
**Time:** [AUTO: current time]

## Executive Summary
[AI SECTION: 2-3 sentence summary of current state]

### Step 10: Critical Issues Requiring Attention
[AI: Extract from incident tracking system]

### INC-[ID]: [AI: Generated title]
- **Status:** [AI: Current status]
- **Impact:** [MANUAL: Human assessment of business impact]
- **Timeline:** [AI: Structured timeline with timestamps]
- **Next Steps:** [MANUAL: Specific actions needed]

### Step 11: System Health Dashboard
[AI: Table generated from monitoring metrics]

| System | Status | Trend | Alert Threshold |
|--------|--------|-------|-----------------|
| [AI: auto-populated] | [AI: auto-populated] | [AI: trend analysis] | [MANUAL: configured value] |

### Step 12: Recent Changes in Last 24 Hours
[AI: Pull from deployment logs]

| Service | Change | Deployment Time | Status |
|---------|--------|-----------------|--------|
| [AI: service name] | [AI: brief change description] | [AI: timestamp] | [AI: success/pending] |

### Step 13: Known Workarounds and Limitations
[AI: Extract from incident notes]

1. **[MANUAL: Issue Title]** - [AI: Summary of workaround]
   - Implemented: [AI: timestamp]
   - Estimated resolution: [MANUAL: timeline]

### Step 14: Monitor Alerts to Watch For
[AI: High-alert items from monitoring system]

- Alert name: [AI: populated from alert rules]
- Current level: [AI: current value]
- Action if triggered: [MANUAL: response procedure]

### Step 15: Escalation Contacts
[MANUAL: Team-specific contacts]

### Step 16: Action Items for This Shift
[AI: Priority-ordered list from tickets]

1. [Task description] - Assigned to: [incoming engineer] - Deadline: [AI: calculated from urgency]
2. [Continue for each priority item]
---
*Generated with AI assistance at [timestamp] | Review time: [estimate minutes]*

Step 17: Automation Scripts for Data Collection

For teams ready to fully automate, create scripts that feed AI with structured data:

#!/usr/bin/env python3
"""Collect handoff data and generate documentation via AI."""

import subprocess
import json
import os
from datetime import datetime, timedelta
from typing import Dict, Any

class HandoffDataCollector:
 def __init__(self):
 self.collection_time = datetime.now()
 self.data = {}

 def get_open_incidents(self) -> Dict[str, Any]:
 """Fetch incidents from ticketing system."""
 result = subprocess.run(
 ['jira', 'search', '--jql',
 'status in (Open, "In Progress") AND labels=production',
 '--format=json'],
 capture_output=True, text=True
 )
 incidents = json.loads(result.stdout)
 return {
 'total': len(incidents),
 'high_priority': [i for i in incidents if i['priority'] == 'High'],
 'items': incidents
 }

 def get_recent_deployments(self) -> Dict[str, Any]:
 """Fetch recent deployments from CI/CD system."""
 twenty_four_hours_ago = (
 self.collection_time - timedelta(hours=24)
 ).isoformat()

 result = subprocess.run(
 ['gh', 'run', 'list',
 '--created', f'>{twenty_four_hours_ago}',
 '--json', 'name,conclusion,createdAt'],
 capture_output=True, text=True
 )
 deployments = json.loads(result.stdout)
 return {
 'total': len(deployments),
 'succeeded': sum(1 for d in deployments if d['conclusion'] == 'success'),
 'failed': sum(1 for d in deployments if d['conclusion'] == 'failure'),
 'items': deployments
 }

 def get_current_alerts(self) -> Dict[str, Any]:
 """Fetch active alerts from monitoring system."""
 result = subprocess.run(
 ['prometheus_query',
 'ALERTS{alertstate="firing"}',
 '--format=json'],
 capture_output=True, text=True
 )
 alerts = json.loads(result.stdout)
 return {
 'total': len(alerts),
 'critical': sum(1 for an in alerts if a['severity'] == 'critical'),
 'warning': sum(1 for an in alerts if a['severity'] == 'warning'),
 'items': alerts
 }

 def get_database_status(self) -> Dict[str, Any]:
 """Check database replication and backup status."""
 result = subprocess.run(
 ['aws', 'rds', 'describe-db-instances',
 '--query', 'DBInstances[*].[DBInstanceIdentifier,DBInstanceStatus]',
 '--output', 'json'],
 capture_output=True, text=True
 )
 instances = json.loads(result.stdout)
 return {
 'total': len(instances),
 'healthy': sum(1 for i in instances if i[1] == 'available'),
 'instances': instances
 }

 def collect_all(self) -> Dict[str, Any]:
 """Collect all handoff-relevant data."""
 print("Collecting handoff data...")
 self.data = {
 'timestamp': self.collection_time.isoformat(),
 'incidents': self.get_open_incidents(),
 'deployments': self.get_recent_deployments(),
 'alerts': self.get_current_alerts(),
 'databases': self.get_database_status(),
 }
 return self.data

 def save_to_file(self, filepath: str):
 """Save collected data for AI processing."""
 with open(filepath, 'w') as f:
 json.dump(self.data, f, indent=2, default=str)
 print(f"Data saved to {filepath}")

if __name__ == "__main__":
 collector = HandoffDataCollector()
 collector.collect_all()
 collector.save_to_file("handoff-data.json")

 # Feed to AI for summary generation
 print("\nData ready for AI processing. Example usage:")
 print("claude --file handoff-data.json " +
 "'Generate concise handoff summary from this data'")

Step 18: Integration with Slack for Automated Handoffs

For teams using Slack, automate handoff document posting:

"""Post handoff documents to Slack channel."""

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError
import json

class HandoffSlackNotifier:
 def __init__(self, token: str, channel: str):
 self.client = WebClient(token=token)
 self.channel = channel

 def post_handoff_summary(self, handoff_data: Dict):
 """Post formatted handoff to Slack."""
 blocks = [
 {
 "type": "header",
 "text": {
 "type": "plain_text",
 "text": "On-Call Handoff Report"
 }
 },
 {
 "type": "section",
 "text": {
 "type": "mrkdwn",
 "text": (
 f"*Incidents:* {handoff_data['incidents']['total']} " +
 f"({handoff_data['incidents']['high_priority']} high)\n" +
 f"*Deployments:* {handoff_data['deployments']['succeeded']} " +
 f"successful, {handoff_data['deployments']['failed']} failed\n" +
 f"*Alerts:* {handoff_data['alerts']['critical']} critical, " +
 f"{handoff_data['alerts']['warning']} warning"
 )
 }
 },
 {
 "type": "divider"
 }
 ]

 # Add action items
 if handoff_data['incidents']['high_priority']:
 blocks.append({
 "type": "section",
 "text": {
 "type": "mrkdwn",
 "text": "*Requires Immediate Attention:*\n" +
 "\n".join([
 f"• {inc['key']}: {inc['summary']}"
 for inc in handoff_data['incidents']['high_priority']
 ])
 }
 })

 try:
 response = self.client.chat_postMessage(
 channel=self.channel,
 blocks=blocks
 )
 print(f"Handoff posted to Slack: {response['ts']}")
 except SlackApiError as e:
 print(f"Error posting to Slack: {e}")

Step 19: Measuring Handoff Quality

Track whether handoffs effectively prevent information loss:

"""Measure handoff documentation quality and incident context transfer."""

class HandoffQualityMetrics:
 def __init__(self):
 self.metrics = {
 'avg_incident_clarity': 0, # 1-10 scale
 'context_loss_incidents': 0, # incidents with insufficient context
 'action_item_completion': 0, # % of items completed as described
 'engineer_satisfaction': 0, # incoming engineer's confidence score
 'documentation_completeness': 0, # % of fields filled
 }

 def calculate_metrics(self, incident_db, handoff_history):
 """Calculate handoff quality metrics."""
 total_incidents = len(incident_db)
 clarity_scores = []
 lost_context_count = 0

 for incident in incident_db:
 # Check if handoff documented context clearly
 if incident.get('context_loss_reported'):
 lost_context_count += 1
 clarity_scores.append(incident.get('clarity_score', 5))

 self.metrics['avg_incident_clarity'] = (
 sum(clarity_scores) / len(clarity_scores) if clarity_scores else 0
 )
 self.metrics['context_loss_incidents'] = lost_context_count
 self.metrics['action_item_completion'] = self._calculate_completion_rate(
 handoff_history
 )

 return self.metrics

 def _calculate_completion_rate(self, handoff_history):
 """Calculate how many action items were completed as described."""
 completed = sum(
 1 for handoff in handoff_history
 if all(action.get('completed') for action in handoff.get('actions', []))
 )
 total = len([
 action
 for handoff in handoff_history
 for action in handoff.get('actions', [])
 ])
 return (completed / total * 100) if total > 0 else 0

Troubleshooting

Configuration changes not taking effect

Restart the relevant service or application after making changes. Some settings require a full system reboot. Verify the configuration file path is correct and the syntax is valid.

Permission denied errors

Run the command with sudo for system-level operations, or check that your user account has the necessary permissions. On macOS, you may need to grant terminal access in System Settings > Privacy & Security.

Connection or network-related failures

Check your internet connection and firewall settings. If using a VPN, try disconnecting temporarily to isolate the issue. Verify that the target server or service is accessible from your network.

Frequently Asked Questions

How long does it take to use ai to help sre teams create on call handoff?

For a straightforward setup, expect 30 minutes to 2 hours depending on your familiarity with the tools involved. Complex configurations with custom requirements may take longer. Having your credentials and environment ready before starting saves significant time.

What are the most common mistakes to avoid?

The most frequent issues are skipping prerequisite steps, using outdated package versions, and not reading error messages carefully. Follow the steps in order, verify each one works before moving on, and check the official documentation if something behaves unexpectedly.

Do I need prior experience to follow this guide?

Basic familiarity with the relevant tools and command line is helpful but not strictly required. Each step is explained with context. If you get stuck, the official documentation for each tool covers fundamentals that may fill in knowledge gaps.

Can I adapt this for a different tech stack?

Yes, the underlying concepts transfer to other stacks, though the specific implementation details will differ. Look for equivalent libraries and patterns in your target stack. The architecture and workflow design remain similar even when the syntax changes.

Where can I get help if I run into issues?

Start with the official documentation for each tool mentioned. Stack Overflow and GitHub Issues are good next steps for specific error messages. Community forums and Discord servers for the relevant tools often have active members who can help with setup problems.

Built by theluckystrike — More at zovo.one