May 21, 2025

Silence Breaker: Serverless Monitoring for BigQuery Tables

Written By: Lana Frenzel | Personal LinkedIn

The Problem: Broken Dashboards, No Warnings

A few weeks ago, one of our dashboards quietly stopped updating. The numbers looked fine — at a glance. But the underlying tables hadn’t refreshed in days, and nobody noticed until the reports were already wrong.

That was the exact moment our team realized: it’s time to build a proper alerting system. One that monitors table freshness across BigQuery and actually lets us know when something stops updating.

Naturally, I couldn’t just use an off-the-shelf solution — I had to go full artisan. Simple, reliable, low-maintenance… and unnecessarily built from scratch by yours truly.

What This System Does

Every day, it checks which BigQuery tables haven’t been updated in the past 48 hours. Anything stale goes straight to the logs and triggers a single, no-drama Slack alert.

It also tracks when tables start updating again — giving clear visibility into both outages and recoveries.

Why This System is Good

Most alerting systems are either too noisy or too quiet. They spam you with alerts for the same issue or fail silently because they don’t track status changes. I designed this system to strike the right balance:

Logs both problems and fixes, with timestamps.
Avoids duplicate alerts using a simple history table.
Runs entirely on BigQuery, Cloud Run, and Cloud Scheduler — no infrastructure to maintain.
Separates logic cleanly — detection, logging, and notification are modular and traceable.

Architecture Overview

The entire system is event-driven and serverless. Here’s how it flows:

Daily Flow, Broken Down

Stale Table DetectionA scheduled query runs early in the morning. It pulls metadata from BigQuery’s __TABLES__ views, checking each table's last_modified_time. Anything older than two days is flagged as stale.

Code Example:

2. Status Logging

Minutes later, another scheduled query compares today’s stale list with the historical log:

New staleness → logged as STALE
Recovered tables → logged as FIXED

This gives a complete, timestamped view of how each table’s status has changed over time.

3. Slack Alerting

After that, a Cloud Run service kicks in. It checks for new STALE logs created in the last 24 hours. If it finds any, it sends a single Slack message summarizing the issue.

Before sending, it checks an alert history table to avoid duplicate messages.

Under the Hood: How It’s Built

BigQuery Metadata: Table metadata is queried from BigQuery’s __TABLES__ views using a UNION ALL across monitored datasets.
Current Stale Tables: A table refreshed daily to store tables that haven’t updated in 48+ hours.
Status Log: An append-only status log tracks STALE and FIXED transitions — deduplicated to avoid noise.
Alert History: A table records sent alerts with timestamps, so they aren’t resent within a short window.
Slack Alerts: Alerts are sent via axios.post() from a Node.js app hosted on Cloud Run.

The entire alerting service is serverless, secure, and triggered daily by Cloud Scheduler.

No VMs. No manual refresh. Just SQL + minimal Node.js.

What I’d Do Next

If I had more time or needed tighter SLAs, here’s how I’d level it up:

Add severity tiers: Some tables matter more than others. I could prioritize critical datasets and handle them differently.
Build a lightweight dashboard for staleness trends: Visualize stale and recovered tables over time.
Expand it to monitor row counts, schema changes, or freshness SLAs: Add more checks for other potential issues.
Integrate with email for mission-critical datasets: Some datasets may require higher-priority notifications.

But even without all that, this setup already solves the core problem — tables going stale in silence — with minimal surface area and high reliability. If you’re working with BigQuery and want a monitoring layer that’s simple, quiet, and effective, this pattern works. And it scales.

Summary of Key Features:

Simplicity: Minimal infrastructure, relying entirely on serverless components.
Visibility: Full visibility into data freshness, with clear logs of status changes.
Reliability: Prevents the pitfalls of over-alerting or under-alerting.
Extensibility: Easy to extend to more datasets or additional checks.

‹ Adapting CV to Job description

Automating OAuth Tokens & Metadata Extraction ›