Web Audio API Deep Dive: How Browser Audio Really Works


In my previous post on Tone.js, I knew it used the Web Audio API internally, but I hadn't really dug into how it actually works. I wanted to peel back the layer of abstraction that Tone.js provides, so I went ahead and explored the Web Audio API directly.

The Limitations of HTML5 Audio

The simplest way to handle sound in the browser is the <audio> element. It handles playback, stop, pause, and seeking just fine. But there are things it simply cannot do:

  • No access to frequency or waveform data
  • No real-time effects (reverb, filters)
  • No 3D spatial audio
  • No precise timing control

The relationship between <audio> and the Web Audio API is similar to that of <img> and <canvas>. For simple playback, <audio> is enough, but if you need to analyze or manipulate sound, you need the Web Audio API.

AudioContext: Where Everything Begins

The entry point to the Web Audio API is AudioContext. Every audio node is created and connected within a single AudioContext.

const audioCtx = new AudioContext();

Key Properties

Property | Description
sampleRate | Audio sample rate (typically 44100 Hz or 48000 Hz)
currentTime | Monotonically increasing time (in seconds) since context creation; cannot be paused
state | One of "suspended", "running", "closed"
destination | Final output (speakers/headphones)

It's best to create a single AudioContext and reuse it. Browsers limit the number of simultaneously active AudioContexts.
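One common way to enforce this is a lazy singleton: create the context on the first request and hand out the same instance afterwards. A minimal sketch, where makeLazySingleton is an illustrative helper (not part of the Web Audio API):

```typescript
// Lazy singleton: the factory runs exactly once, on first request.
function makeLazySingleton<T>(create: () => T): () => T {
  let instance: T | undefined;
  return () => {
    if (instance === undefined) {
      instance = create();
    }
    return instance;
  };
}

// In a browser you would write:
//   const getAudioCtx = makeLazySingleton(() => new AudioContext());
// Every call to getAudioCtx() then returns the same context.
```

Deferring creation also plays nicely with the autoplay policy below, since you can call the getter inside the first user gesture.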

Autoplay Policy

AudioContext starts in a "suspended" state by default. Sound cannot play without a user gesture (e.g., a click).

const audioCtx = new AudioContext(); // state: "suspended"
 
document.getElementById("play")?.addEventListener("click", () => {
  if (audioCtx.state === "suspended") {
    audioCtx.resume();
  }
  // Now sound can be played
});

OfflineAudioContext

Used when you need pre-rendering rather than real-time playback. Output goes to an AudioBuffer instead of the speakers. This is useful, for example, when exporting reverb-processed audio to a file.

// Render a 44.1kHz stereo 10-second buffer
const offlineCtx = new OfflineAudioContext(2, 44100 * 10, 44100);
 
// After setting up nodes...
const renderedBuffer = await offlineCtx.startRendering();

The Audio Graph: Connecting Nodes

The Web Audio API is a modular routing system. AudioNodes form a directed acyclic graph (DAG). The basic flow looks like this:

Source → Processing → Destination

const audioCtx = new AudioContext();
const oscillator = audioCtx.createOscillator();
const gainNode = audioCtx.createGain();
 
oscillator.connect(gainNode);
gainNode.connect(audioCtx.destination);
 
oscillator.start();

A key point here is that audio processing runs on a separate audio rendering thread, isolated from the main thread. DOM manipulation doesn't interfere with audio, and audio processing doesn't block the UI.

AudioNode Types

Source Nodes (Sound Generation)

OscillatorNode — Generates periodic waveforms. The basic building block of a synthesizer.

const osc = audioCtx.createOscillator();
osc.type = "sawtooth"; // sine, square, sawtooth, triangle
osc.frequency.value = 440; // A4 note
osc.connect(audioCtx.destination);
osc.start();
osc.stop(audioCtx.currentTime + 1); // Stop after 1 second

One gotcha I ran into: after calling stop(), you cannot start the same node again. You must create a new one. It's a "fire and forget" pattern. This works because node creation cost is extremely low.

AudioBufferSourceNode — Plays audio files loaded into memory.

const response = await fetch("/samples/drum.wav");
const arrayBuffer = await response.arrayBuffer();
const audioBuffer = await audioCtx.decodeAudioData(arrayBuffer);
 
const source = audioCtx.createBufferSource();
source.buffer = audioBuffer;
source.loop = true;
source.playbackRate.value = 1.5; // 1.5x speed
source.connect(audioCtx.destination);
source.start();

The AudioBuffer itself can be reused, but like OscillatorNode, an AudioBufferSourceNode must be discarded after a single use.

MediaElementAudioSourceNode — Connects an HTML <audio> element to the Web Audio graph. This lets you apply effects to an existing <audio> tag.

const audioEl = document.querySelector("audio") as HTMLAudioElement;
const source = audioCtx.createMediaElementSource(audioEl);
source.connect(audioCtx.destination);

Processing Nodes (Sound Manipulation)

Node | Role | Key Parameters
GainNode | Volume control | gain (default 1.0)
BiquadFilterNode | Filter (lowpass, highpass, and 6 others) | frequency, Q, type
ConvolverNode | Convolution reverb | buffer (impulse response)
DelayNode | Time delay | delayTime
DynamicsCompressorNode | Dynamic compression | threshold, ratio, attack
WaveShaperNode | Nonlinear distortion | curve, oversample
StereoPannerNode | Left/right panning | pan (-1.0 to 1.0)
PannerNode | 3D spatial audio | positionX/Y/Z, panningModel
AnalyserNode | Frequency/waveform analysis (for visualization) | fftSize, smoothingTimeConstant

Here's an example of building an effect chain:

const source = audioCtx.createBufferSource();
const filter = audioCtx.createBiquadFilter();
const gain = audioCtx.createGain();
const compressor = audioCtx.createDynamicsCompressor();
 
filter.type = "lowpass";
filter.frequency.value = 1000;
gain.gain.value = 0.8;
 
source.connect(filter);
filter.connect(gain);
gain.connect(compressor);
compressor.connect(audioCtx.destination);
 
source.buffer = audioBuffer;
source.start();

This is exactly what Tone.js's .chain() method abstracts away.
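As a rough idea of what such a helper does, here is a hand-rolled chain function. Connectable is a hypothetical structural interface; real AudioNodes satisfy it because they all expose connect:

```typescript
// Any object with a connect method qualifies; every AudioNode does.
interface Connectable {
  connect(destination: Connectable): unknown;
}

// Connect each node to the next: chain(a, b, c) wires a -> b -> c.
function chain(...nodes: Connectable[]): void {
  for (let i = 0; i < nodes.length - 1; i++) {
    nodes[i].connect(nodes[i + 1]);
  }
}

// Usage in a browser:
//   chain(source, filter, gain, compressor, audioCtx.destination);
```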

AudioParam: Time-Based Automation

AudioParam is one of the most powerful features of the Web Audio API. It lets you change audio parameters over time with sample-accurate precision.

Why It's More Precise Than setTimeout

setTimeout runs on the main thread's event loop, so its callbacks can be delayed by tens of milliseconds when the thread is busy. AudioParam automation is evaluated on the audio rendering thread, giving sample-accurate timing that the event loop cannot match.
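The standard way to combine the two clocks is a lookahead scheduler: a coarse setTimeout loop wakes up periodically and schedules, on the audio clock, every event that falls within a short window. The pure scheduling arithmetic can be sketched like this (nextEventTimes is an illustrative helper, not a Web Audio API):

```typescript
// Return the event times that fall inside [from, from + lookahead),
// for events spaced `interval` seconds apart starting at `origin`.
function nextEventTimes(
  origin: number,    // audio-clock time of the first event (seconds)
  interval: number,  // seconds between events (e.g. 60 / bpm)
  from: number,      // current audioCtx.currentTime
  lookahead: number  // scheduling window, e.g. 0.1s
): number[] {
  const times: number[] = [];
  // Index of the first event at or after `from`
  let i = Math.max(0, Math.ceil((from - origin) / interval));
  for (let t = origin + i * interval; t < from + lookahead; i++, t = origin + i * interval) {
    times.push(t);
  }
  return times;
}

// Each returned time would be handed to osc.start(t) or
// param.setValueAtTime(v, t); the audio thread handles the precision.
```

Even if the setTimeout loop fires late, events inside the window were already placed on the audio clock, so playback stays tight.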

Automation Methods

const gain = audioCtx.createGain();
 
// Set value immediately at a specific time
gain.gain.setValueAtTime(0, audioCtx.currentTime);
 
// Gradually change linearly (fade in)
gain.gain.linearRampToValueAtTime(1, audioCtx.currentTime + 2);
 
// Change exponentially (more natural decay)
gain.gain.exponentialRampToValueAtTime(0.01, audioCtx.currentTime + 4);
// Note: target value must not be 0. Use a value close to 0 (0.01) instead.

Method | Description
setValueAtTime(value, time) | Set value immediately at a specific time
linearRampToValueAtTime(value, endTime) | Linear interpolation
exponentialRampToValueAtTime(value, endTime) | Exponential interpolation (target ≠ 0)
setTargetAtTime(target, startTime, timeConstant) | Approach target with exponential decay
setValueCurveAtTime(values, startTime, duration) | Follow a custom curve
cancelScheduledValues(cancelTime) | Cancel events after the specified time

An important thing I learned: when scheduled events exist, directly assigning to .value can conflict with the existing automation and produce unexpected results. It is safer to cancel pending events before setting a new value, or to stick with one automation style instead of mixing direct assignment and scheduled ramps.
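When you do need to override a parameter that may have pending automation, one defensive pattern is to cancel the schedule first and then re-anchor the value on the timeline. A sketch, where ParamLike is a hypothetical structural subset of AudioParam (a real gain parameter satisfies it):

```typescript
// Structural subset of AudioParam; a GainNode's gain param matches it.
interface ParamLike {
  cancelScheduledValues(cancelTime: number): void;
  setValueAtTime(value: number, startTime: number): void;
}

// Drop all pending automation events, then pin the new value at `now`.
function setParamNow(param: ParamLike, value: number, now: number): void {
  param.cancelScheduledValues(now);
  param.setValueAtTime(value, now);
}

// Usage in a browser:
//   setParamNow(gainNode.gain, 0.5, audioCtx.currentTime);
```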

Practical Pattern: ADSR Envelope

You can implement the Attack-Decay-Sustain-Release envelope — a fundamental of synthesizers — directly with AudioParam.

function triggerNote(
  osc: OscillatorNode,
  gain: GainNode,
  now: number
) {
  // Attack: ramp quickly from 0 to 1
  gain.gain.setValueAtTime(0, now);
  gain.gain.linearRampToValueAtTime(1, now + 0.02);
 
  // Decay + Sustain: decay from 1 to 0.3
  gain.gain.linearRampToValueAtTime(0.3, now + 0.1);
 
  // Release: from 0.3 to 0
  gain.gain.linearRampToValueAtTime(0, now + 0.5);
 
  osc.start(now);
  osc.stop(now + 0.5);
}

This is the pattern that Tone.js's Synth automates internally.

Audio Loading and Memory

fetch + decodeAudioData is the standard pattern.

async function loadAudio(url: string): Promise<AudioBuffer> {
  const response = await fetch(url);
  const arrayBuffer = await response.arrayBuffer();
  return audioCtx.decodeAudioData(arrayBuffer);
}
 
const kickBuffer = await loadAudio("/samples/kick.wav");

Decoded audio is stored in RAM as uncompressed 32-bit PCM. To give you a sense of the size:

44.1kHz stereo, 1 minute = 44100 × 2 channels × 4 bytes × 60 seconds ≈ 21MB

Even if the original MP3 is only 1MB, it can balloon to ~20MB after decoding. When loading multiple large files, memory fills up fast, so it's important to cache decoded AudioBuffers to avoid redundant decoding.
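One way to get that caching is to memoize the loader by URL, storing the Promise itself so that concurrent requests for the same file share a single fetch and decode. A sketch (memoizeByUrl is an illustrative helper; the loadAudio function above would be the real loader):

```typescript
// Cache the Promise, not just the result: concurrent calls for the
// same URL share one fetch + decode instead of racing each other.
function memoizeByUrl<T>(load: (url: string) => Promise<T>): (url: string) => Promise<T> {
  const cache = new Map<string, Promise<T>>();
  return (url: string) => {
    let pending = cache.get(url);
    if (pending === undefined) {
      pending = load(url);
      cache.set(url, pending);
    }
    return pending;
  };
}

// In a real app: const loadAudioCached = memoizeByUrl(loadAudio);
```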

Visualization with AnalyserNode

AnalyserNode passes audio through without modification while extracting frequency/waveform data.

Frequency Bar Chart

const analyser = audioCtx.createAnalyser();
analyser.fftSize = 256;
const bufferLength = analyser.frequencyBinCount; // fftSize / 2 = 128
const dataArray = new Uint8Array(bufferLength);
 
source.connect(analyser);
analyser.connect(audioCtx.destination);
 
function draw(ctx: CanvasRenderingContext2D, width: number, height: number) {
  requestAnimationFrame(() => draw(ctx, width, height));
 
  analyser.getByteFrequencyData(dataArray); // Range: 0~255
 
  ctx.fillStyle = "#000";
  ctx.fillRect(0, 0, width, height);
 
  const barWidth = width / bufferLength;
 
  for (let i = 0; i < bufferLength; i++) {
    const barHeight = (dataArray[i] / 255) * height;
    ctx.fillStyle = `hsl(${(i / bufferLength) * 360}, 80%, 50%)`;
    ctx.fillRect(i * barWidth, height - barHeight, barWidth - 1, barHeight);
  }
}

Waveform (Oscilloscope)

const analyser = audioCtx.createAnalyser();
analyser.fftSize = 2048;
const bufferLength = analyser.frequencyBinCount;
const dataArray = new Uint8Array(bufferLength);
 
function drawWaveform(ctx: CanvasRenderingContext2D, width: number, height: number) {
  requestAnimationFrame(() => drawWaveform(ctx, width, height));
 
  analyser.getByteTimeDomainData(dataArray); // Range: 0~255 (128 = silence)
 
  ctx.fillStyle = "#000";
  ctx.fillRect(0, 0, width, height);
  ctx.lineWidth = 2;
  ctx.strokeStyle = "#0f0";
  ctx.beginPath();
 
  const sliceWidth = width / bufferLength;
 
  for (let i = 0; i < bufferLength; i++) {
    const v = dataArray[i] / 128.0;
    const y = (v * height) / 2;
    if (i === 0) ctx.moveTo(0, y);
    else ctx.lineTo(i * sliceWidth, y);
  }
 
  ctx.stroke();
}

Resolution varies with fftSize. For frequency bars, a small value like 256 works well, while waveforms benefit from 2048 or higher.
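To relate bins to actual frequencies: each bin spans sampleRate / fftSize Hz, so bin i starts at i times that width. A quick sketch of the arithmetic (binFrequency is an illustrative helper):

```typescript
// Frequency (Hz) at the start of bin `i` for a given FFT configuration.
// The bins span 0 Hz up to the Nyquist frequency (sampleRate / 2).
function binFrequency(i: number, sampleRate: number, fftSize: number): number {
  return (i * sampleRate) / fftSize;
}

// With fftSize = 256 at 44100 Hz, each of the 128 bins is about 172 Hz
// wide, which is why small fftSize values suit chunky frequency bars.
```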

AudioWorklet: Custom Audio Processing

Previously, ScriptProcessorNode was used for custom audio processing, but it ran on the main thread, causing UI blocking and audio glitches. AudioWorklet replaces it.

AudioWorklet runs custom JS code directly on the audio rendering thread. That avoids main-thread round-trips in the processing path and helps reduce glitch risk, although buffering and render quanta still apply.

Implementation

Write a separate worklet processor file:

// white-noise-processor.ts
class WhiteNoiseProcessor extends AudioWorkletProcessor {
  process(
    _inputs: Float32Array[][],
    outputs: Float32Array[][],
    _parameters: Record<string, Float32Array>
  ): boolean {
    const output = outputs[0];
 
    for (const channel of output) {
      for (let i = 0; i < channel.length; i++) {
        channel[i] = Math.random() * 2 - 1;
      }
    }
 
    return true; // true = keep running
  }
}
 
registerProcessor("white-noise-processor", WhiteNoiseProcessor);

Register and use the worklet from the main thread:

await audioCtx.audioWorklet.addModule("/white-noise-processor.js");
 
const noiseNode = new AudioWorkletNode(audioCtx, "white-noise-processor");
noiseNode.connect(audioCtx.destination);

Parameter Definitions

class GainProcessor extends AudioWorkletProcessor {
  static get parameterDescriptors() {
    return [
      {
        name: "customGain",
        defaultValue: 1.0,
        minValue: 0,
        maxValue: 2,
        automationRate: "a-rate", // Sample-level precision
      },
    ];
  }
 
  process(
    inputs: Float32Array[][],
    outputs: Float32Array[][],
    parameters: Record<string, Float32Array>
  ): boolean {
    const input = inputs[0];
    const output = outputs[0];
    const gainValues = parameters.customGain;
 
    for (let ch = 0; ch < output.length; ch++) {
      for (let i = 0; i < output[ch].length; i++) {
        // a-rate: gainValues.length === 128 (different value per sample)
        // k-rate: gainValues.length === 1 (same value for entire block)
        const gain = gainValues.length > 1 ? gainValues[i] : gainValues[0];
        output[ch][i] = input[ch][i] * gain;
      }
    }
 
    return true;
  }
}
 
registerProcessor("gain-processor", GainProcessor);

You can automate the parameters from the main thread:

const gainParam = gainNode.parameters.get("customGain");
gainParam?.linearRampToValueAtTime(0, audioCtx.currentTime + 2);

Inside worklets, garbage collection can cause audio glitches, so it's important to minimize object allocation and reuse buffers.

Spatial Audio

StereoPannerNode (Simple Left/Right Panning)

const panner = audioCtx.createStereoPanner();
panner.pan.value = -1; // -1 = left, 0 = center, 1 = right
 
source.connect(panner);
panner.connect(audioCtx.destination);
 
// Pan from left to right over 2 seconds
panner.pan.linearRampToValueAtTime(1, audioCtx.currentTime + 2);

PannerNode (3D Spatial Audio)

Used in games and VR/AR experiences.

const panner = audioCtx.createPanner();
panner.panningModel = "HRTF"; // More realistic (higher CPU). "equalpower" is lighter
panner.distanceModel = "inverse";
panner.refDistance = 1;
panner.maxDistance = 100;
panner.rolloffFactor = 1;
 
// Set 3D position of the sound source
panner.positionX.value = 5;
panner.positionY.value = 0;
panner.positionZ.value = -3;
 
// Listener position (one per AudioContext)
const listener = audioCtx.listener;
listener.positionX.value = 0;
listener.positionY.value = 0;
listener.positionZ.value = 0;
 
source.connect(panner);
panner.connect(audioCtx.destination);

The HRTF (head-related transfer function) model simulates how sound arrives differently at each ear depending on direction, creating a realistic 3D effect over headphones. However, it's CPU-intensive, so be cautious when applying it to many sound sources.
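To make the distance parameters concrete, here is a sketch of how the "inverse" distance model attenuates, following the formula the spec describes (inverseDistanceGain is an illustrative helper; the clamp keeps the gain at 1 for sources closer than refDistance):

```typescript
// Gain for the "inverse" distance model: 1.0 at refDistance,
// falling off as the source moves away, scaled by rolloffFactor.
function inverseDistanceGain(
  distance: number,
  refDistance: number,
  rolloffFactor: number
): number {
  const d = Math.max(distance, refDistance); // no boost closer than refDistance
  return refDistance / (refDistance + rolloffFactor * (d - refDistance));
}

// inverseDistanceGain(1, 1, 1) = 1 (at the reference distance)
// inverseDistanceGain(3, 1, 1) = 1/3 (twice the reference distance away)
```

Raising rolloffFactor makes sources fade faster with distance; distanceModel also accepts "linear" and "exponential", which use different curves.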

Performance Considerations

1) Source nodes are single-use

OscillatorNode and AudioBufferSourceNode cannot be reused after stop(). For repeated playback, you need to create a new node each time. Node creation cost is extremely low, so this isn't a performance issue.

2) Know which nodes are CPU-intensive

ConvolverNode (convolution reverb) and PannerNode in HRTF mode consume the most CPU. On mobile, it's best to limit the use of these nodes.

3) Manage AudioBuffer memory

Decoded audio sits in RAM as uncompressed PCM. A single 1-minute stereo file is ~20MB. Release references to unused buffers so they become eligible for GC.

4) AudioWorklet requires HTTPS

Due to the Secure Context requirement, you need to use localhost or set up HTTPS for local development.

Raw Web Audio API vs Tone.js

Area | Raw Web Audio API | Tone.js
Timing | Seconds (currentTime) | Musical notation ("4n", "8n")
Transport control | None | Transport (start/stop/loop)
Instruments | OscillatorNode + manual envelope | Synth, FMSynth, etc. + built-in ADSR
Effects | Manual individual node connections | Pre-built effects + .chain()
Polyphony | Manual voice management | PolySynth auto-allocation
Autoplay | Manual resume() | Tone.start()

When Raw API is the better fit:

  • Simple audio tasks (notification sounds, sound effects)
  • When bundle size matters
  • When you need maximum control without a framework
  • For learning purposes

When Tone.js is the better fit:

  • Musical applications (sequencers, drum machines)
  • Complex scheduling and Transport control
  • Diverse instruments and effect chains

Since Tone.js uses native Web Audio nodes under the hood, its own performance overhead is minimal.

Conclusion

Working with the Web Audio API directly gave me a real appreciation for how much Tone.js abstracts away. At the same time, understanding the raw API made Tone.js's behavior much more predictable.

Here are the key takeaways:

  • AudioParam automation is the real strength of the Web Audio API. It provides sample-accurate timing control that setTimeout simply cannot match
  • Source nodes are single-use — once you call stop, you need to create a new one
  • AudioWorklet replaced ScriptProcessorNode and runs custom DSP directly on the audio thread
  • Always be mindful of decoded AudioBuffer memory size (10-20x larger than the compressed MP3)

For simple sound effects or notification sounds, the raw API is more than enough. If you need musical features, Tone.js will save you significant development time.

Thanks for reading.
