A new study has found that wearable heart monitors can help doctors detect more cases of atrial fibrillation (AFib), a common heart rhythm problem that increases the risk of stroke. But while these monitors found 52% more AFib cases than usual care, the study didn’t show that they actually reduced the number of people hospitalized […]
The post Can wearable heart monitors help prevent stroke? appeared first on Knowridge Science Report.
Topics:
can wearable heart monitors help prevent stroke?
stroke
heart health
science
Like other developing countries, Indonesia is facing a familiar dilemma: how to feed a growing population while protecting its extraordinary biodiversity.
The food group says it detected the possible presence of a bacterial toxin “likely to cause digestive problems.” No cases of illness have been reported to date, it adds.
A HI-TECH knife that takes the effort out of chopping in the kitchen has been revealed - and I tried it out. Everyone has experienced the hassle of cutting through tough veg and potatoes. But a company called Seattle Ultrasonics wants to make that a problem of the past with its new C-200 knife. The...
Introduction
If you’ve ever been on-call during an outage, you know the drill: a flood of alerts, five dashboards open, logs streaming from different places, a dozen threads in Slack, and still no clear picture. Context-switching kills velocity, and “where do I even start?” becomes the default question.
Kinabalu AI Site Reliability Engineering (AI SRE for short) is our attempt to transform this experience. It consolidates the right context in one place, analyzes it with assistive AI agents, and helps us move from alert to action quickly.
Target audience:
On-call engineers and incident commanders.
Service owners validating health, dependencies, and changes.
SRE/platform teams standardizing triage and root cause analysis (RCA) quality.
Note: Kinabalu AI SRE is in its experimental stage; features, coverage, and interfaces may change as we iterate.
Background
Incidents today suffer from several issues, including alert overload, fragmented context across tools, slow RCA, operational redundancy from tool-hopping, and scattered runbooks that are hard to find and apply under pressure.
AI SRE aims to solve these issues by serving a unified view that streamlines diagnostics and correlates signals to recommend the best next actions. This approach speeds up response and reduces time-to-resolution (TTR), lowers the cognitive load on on-call engineers by keeping all relevant context in one place, and strengthens collaboration through evidence-backed updates and clear ownership.
A typical user journey
Kinabalu’s AI SRE is a 24/7 automator reachable via Slack and a Web UI. It takes input in the form of an automated alert or a direct question and responds with an evidence-backed, actionable insight.
In a hypothetical user journey with AI SRE, the process might begin with a trigger. For instance, if a monitoring alert is triggered by a fivefold increase in a Datadog report and increasing latency for a service, AI SRE initiates an incident thread and gathers the initial context.
The following components of AI SRE are then executed in sequence:
Component 1: Auto-triage with context from incident records, tagging on severity, priority, owner/oncall, as well as issue types.
Component 2: AI SRE (static diagnostics) establishes correlations across the following signals:
Metrics and dashboards: analyzes recent deltas and compares against time-of-day/week baselines.
Dependencies: checks upstream/downstream services to separate causes from symptoms.
Changes: retrieves recent deployments, config updates, and feature-flag flips.
Logs: clusters error signatures and tracks frequency shifts.
It then delivers an incident summary with actionable insights, an RCA draft, and concrete recommendations (queries to run, rollback/feature-flag options, runbook links).
Component 3: Dynamic conversation.
Users ask follow-up questions in Slack, such as “List owners for impacted services” or “Compare p95 across top markets”. AI SRE replies with evidence-backed answers and provides links for further drill-down.
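The metric and log checks described for Component 2 can be illustrated with a minimal sketch. It is not the Kinabalu implementation: the function names, the ratio-based comparison, and the regex-based signature normalization are all assumptions chosen for illustration.

```python
import re
from collections import Counter
from statistics import mean


def delta_vs_baseline(recent, baseline):
    """Ratio of the recent mean to the baseline mean (1.0 means no change).

    `baseline` is assumed to hold samples from the same time of day/week.
    """
    base = mean(baseline)
    if base == 0:
        return float("inf") if mean(recent) > 0 else 1.0
    return mean(recent) / base


def error_signature(line):
    """Collapse variable parts of a log line so similar errors share a signature."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)  # hex ids first
    line = re.sub(r"\d+", "<n>", line)               # then any remaining numbers
    return line.strip()


def cluster_signatures(lines):
    """Count how often each error signature occurs."""
    return Counter(error_signature(line) for line in lines)


def frequency_shifts(before, after):
    """Return (signature, count_before, count_after), largest increase first."""
    b, a = cluster_signatures(before), cluster_signatures(after)
    signatures = set(b) | set(a)
    return sorted(
        ((s, b[s], a[s]) for s in signatures),
        key=lambda row: row[2] - row[1],
        reverse=True,
    )
```

Under these assumptions, a fivefold jump in a metric shows up as a ratio of 5.0, and a new or rapidly growing error signature appears at the top of the shift list.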
Architecture
Under the hood, the backend combines a central signal aggregator, Model Context Protocol (MCP) servers for instant search, and a large language model (LLM)-powered intelligence layer that analyzes signals to auto-triage incidents and produce actionable insights.
Figure 1. AI SRE architecture.
Signal aggregator: Context engineering
We follow a Retrieval Augmented Generation (RAG) approach and are building a knowledge graph that stitches together incident signals across the stack. The aggregator ingests information from the following sources:
Datadog (metrics, monitors)
Kibana/Elasticsearch (logs)
Grafana (dashboards)
Hystrix (circuit state)
GitLab/Jira (changes/issues)
CI/CD and deployment metadata
Service/product catalog (ownership, dependencies)
With this context, AI SRE agents can provide a clear view of what changed, when it changed, and who owns it, making incident understanding and debugging faster and more reliable in near real time.
Figure 2. Examples of signal aggregation for building context.
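To make the "what changed, when it changed, and who owns it" idea concrete, here is a minimal sketch of a signal store with ownership and dependency lookups. The `Signal` and `SignalGraph` names, their fields, and the upstream-only traversal are hypothetical; the actual Kinabalu knowledge graph is not described in enough detail here to reproduce.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Signal:
    source: str    # e.g. "gitlab" or "datadog" (illustrative labels)
    kind: str      # "change", "alert", or "log" (assumed taxonomy)
    service: str
    at: datetime
    summary: str


class SignalGraph:
    """Toy in-memory stand-in for a knowledge graph of signals and services."""

    def __init__(self):
        self.signals = []
        self.owners = {}    # service -> owning team
        self.upstream = {}  # service -> services it depends on

    def add_service(self, name, owner, depends_on=()):
        self.owners[name] = owner
        self.upstream[name] = set(depends_on)

    def ingest(self, signal):
        self.signals.append(signal)

    def related_services(self, service):
        """The service plus its transitive upstream dependencies."""
        seen, stack = set(), [service]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.upstream.get(current, ()))
        return seen

    def recent_changes(self, service, now, window):
        """Changes in scope within the window, newest first, with owners attached."""
        scope = self.related_services(service)
        hits = [
            s for s in self.signals
            if s.kind == "change" and s.service in scope and now - window <= s.at <= now
        ]
        hits.sort(key=lambda s: s.at, reverse=True)
        return [(s.at, s.service, self.owners.get(s.service, "unknown"), s.summary)
                for s in hits]
```

Calling `recent_changes("checkout", now, timedelta(hours=1))` on a populated graph would return each recent change to the service or its dependencies, together with the owning team.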
Unified intelligence: An agentic approach
Agents “normalize” the alerts and signals, meaning they standardize and interpret them for better understanding. They can semantically search through historical changes that can explain current symptoms, correlate co-occurring signals, and surface likely causes.
AI SRE uses the SuperAgent and A2A multi-agent frameworks to analyze incidents using two workflows, which can coexist.
For static diagnosis, a separate flow collects all data and logs for services via the MCP toolkit and sends them to A2A multi-agents for a deep-dive investigation.
For dynamic analysis, SuperAgent uses the MCP toolkit to investigate and pull real-time data.
Static diagnosis
The static diagnostics workflow starts with a trigger from Slack or the Web UI and ends with a comprehensive service health report. It coordinates six domain-specific sub-agents encompassing the areas of incident management, deployment, application, database, infrastructure, and external APIs. Each sub-agent pulls the relevant signals and runs targeted checks, producing detailed findings. The supervisor then synthesizes these into an investigation-ready brief. The brief contains a concise summary of suspects and blast radius, timeline, and recommended next steps. The briefs are grounded in logs and metrics, so engineers can quickly understand the impact and move toward resolution.
Figure 3. Examples of static diagnosis by AI SRE.
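The supervisor and sub-agent pattern described above can be sketched in plain Python. This is a simplified illustration, not the A2A framework: the `Finding` type, the `suspicion` score, the 0.5 threshold, and the shape of the brief are all assumptions, and each sub-agent is modeled as an ordinary function rather than an LLM-backed agent.

```python
from dataclasses import dataclass, field

# The six domains named in the text.
DOMAINS = (
    "incident management",
    "deployment",
    "application",
    "database",
    "infrastructure",
    "external APIs",
)


@dataclass
class Finding:
    domain: str
    summary: str
    suspicion: float  # hypothetical 0..1 score assigned by the sub-agent
    evidence: list = field(default_factory=list)


def supervise(service, sub_agents, threshold=0.5):
    """Run each registered sub-agent and synthesize a brief from its findings.

    `sub_agents` maps a domain name to a callable taking the service name
    and returning a list of Finding objects.
    """
    findings = []
    for domain in DOMAINS:
        agent = sub_agents.get(domain)
        if agent is not None:
            findings.extend(agent(service))

    ranked = sorted(findings, key=lambda f: f.suspicion, reverse=True)
    return {
        "service": service,
        "suspects": [f.summary for f in ranked if f.suspicion >= threshold],
        "evidence": [item for f in ranked for item in f.evidence],
        "domains_checked": [d for d in DOMAINS if d in sub_agents],
    }
```

Keeping the sub-agents behind a single callable signature means a domain check can be added or replaced without changing how the supervisor synthesizes the brief.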
Dynamic chat
Users can inquire via Slack or the Web UI to receive an immediate, evidence-supported action plan. Examples of such questions include:
“How many recent deployments touched the food service?”
“How many Terraform changes in the past 5 minutes?”
Powered by our SuperAgent and MCP tool layer, dynamic chat queries live systems such as metrics, logs, deploy history, and configs. It then returns cited data, comparisons, and next-best actions. On-call engineers can diagnose issues and pull logs on the fly, before escalating actions (e.g., open a ticket, compare regions, list owners, suggest rollbacks). It’s human-in-the-loop (HITL) by design.
Figure 4. Example of examining related deployments within the same time frame.
Figure 5. Example of analyzing Splunk or Datadog alerts to identify the root cause of an issue.
MCP toolkit
The Kinabalu MCP Toolkit serves as a universal integration layer that empowers AI SRE by unifying 25 operational tools into a single, consistent interface. This comprehensive toolkit spans six key domains:
Incident and communications: Manages historical incidents, Slack thread context, and ticketing.
Internal platforms: Includes changelogs, experiments, rollout history, and automated analyses.
Knowledge and AI: Facilitates enterprise document search/chat and unstructured data analysis.
Service and configuration: Offers topology and configuration introspection.
Observability: Provides insights through metrics, logs, and profiling.
Deployment: Tracks recent releases and commit history.
The Kinabalu MCP Toolkit is designed to provide AI SRE with a 360-degree view of incidents, significantly accelerating root-cause discovery and response.
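The idea of exposing many operational tools through one consistent interface can be shown with a small registry sketch. It deliberately does not use an MCP SDK: `ToolRegistry`, the tool names, and the stub handlers below are hypothetical and only illustrate grouping tools by domain and calling them uniformly.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    domain: str        # e.g. "deployment" or "observability"
    description: str
    handler: Callable[..., Any]


class ToolRegistry:
    """One lookup-and-call surface for tools from different domains."""

    def __init__(self):
        self._tools = {}

    def register(self, name, domain, description):
        def decorator(fn):
            if name in self._tools:
                raise ValueError(f"tool already registered: {name}")
            self._tools[name] = Tool(name, domain, description, fn)
            return fn
        return decorator

    def list_tools(self, domain=None):
        return sorted(
            t.name for t in self._tools.values()
            if domain is None or t.domain == domain
        )

    def call(self, name, **kwargs):
        return self._tools[name].handler(**kwargs)


registry = ToolRegistry()


@registry.register("recent_releases", "deployment", "List recent releases for a service")
def recent_releases(service, limit=5):
    # Stub: a real tool would query deployment metadata.
    return [f"{service}-release-{i}" for i in range(1, limit + 1)]


@registry.register("service_owner", "service and configuration", "Look up the owning team")
def service_owner(service):
    # Stub: a real tool would query the service catalog.
    return f"team-{service}"
```

An agent can then discover tools with `registry.list_tools(domain="deployment")` and invoke any of them through the same `registry.call(name, **kwargs)` entry point.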
Conclusion
Our journey highlights the importance of structured context, robust diagnostic layers, and hybrid AI models for dependable incident automation. Through our ongoing development of Kinabalu AI SRE, we’re moving toward an ecosystem where alerts are normalized, evidence is automatically synthesized, and engineers can focus on higher-level decision-making rather than firefighting.
Stay tuned for part 2, where we will cover the challenges, design decisions, and lessons that shaped Kinabalu AI SRE.
Join us
Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.
Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!
Stars and planets are linked together in their formation, evolution, and even in their demises. But many of the details behind this are yet to be revealed. New research outlines an observing strategy that could uncover more critical details.
Topics:
technology
stars and planets
dust
planets
stars
science
A research team led by Prof. Wang Zhenyou at the Aerospace Information Research Institute of the Chinese Academy of Sciences (AIRCAS) has developed a microscopic time-gated Raman spectrometer capable of non-destructive, micrometer-scale chemical analysis of fragile archaeological ivory—even when strong fluorescence would normally obscure the signal. The study was published in ACS Applied Materials & Interfaces.