TikTok Caption

Vertical 9:16 caption track driven by word-level Whisper timestamps. The active word pops; the surrounding context ghosts in dimmer ink.

Usage

The short-form voiceover look popularized by the auto-captions on TikTok, Reels and YouTube Shorts. Each word carries start and end timestamps (in seconds, relative to the start of the clip). The composition shows the current phrase of up to three words, highlights the word under the playhead in the accent color, and renders the rest of the phrase in the inactive color so the viewer can read ahead.

The frame is 1080 × 1920 (vertical) by default. The words array is always required; for audio, the composition expects either:

An audioUrl pointing at the corresponding voiceover (the composition embeds it and keeps it in sync), or
No audioUrl, just the words array, if you're rendering on top of another audio source.

Style is configurable via the universal clipStyle knobs:

backgroundColor — set to "transparent" (the default in the Studio) to layer over another clip, or to a solid color for a standalone short.
textColor — the inactive / ghost color (default: white).
accentColor — the active-word highlight (default: a punchy yellow, #facc15).
fontFamily — any installed display font; the composition ships an Anton-style impact look at default scale.

fontScale (0.5 – 2) and captionVAlign / captionHAlign let you nudge the layout to fit your subject.

Generate from audio

The fastest way to get the words array right is to feed an MP3 through OpenAI Whisper and let it return timestamps. This project ships that pipeline at /shorts — drop an MP3, get a rendered 9:16 video with the caption already tracking the voice.

Under the hood the page POSTs the file to /api/shorts/transcribe, which proxies to Whisper's transcriptions endpoint with response_format=verbose_json and timestamp_granularities[]=word, then reshapes the result into the exact { start, end, text } shape this composition consumes. If you're rolling your own pipeline, the only requirement is one entry per word with seconds-based timestamps.
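
If you're rolling your own pipeline, the reshaping step can be sketched roughly as follows. This is an illustration, not the code behind /api/shorts/transcribe: the helper name toCaptionWords is hypothetical, and the WhisperWord type assumes the { word, start, end } entries that Whisper's verbose_json response returns when word-level granularity is requested.

```typescript
// Assumed shape of one entry in the `words` array of Whisper's
// verbose_json response (word-level timestamp granularity).
type WhisperWord = { word: string; start: number; end: number };

// Shape consumed by the TikTokCaption composition.
type CaptionWord = { start: number; end: number; text: string };

// Hypothetical helper: converts Whisper words into caption words.
// It trims whitespace, drops empty or inverted entries, and sorts by
// start time so the composition can scan the array in order.
function toCaptionWords(whisperWords: WhisperWord[]): CaptionWord[] {
  return whisperWords
    .map((w) => ({ start: w.start, end: w.end, text: w.word.trim() }))
    .filter((w) => w.text.length > 0 && w.end >= w.start)
    .sort((a, b) => a.start - b.start);
}
```

Sorting matters because the composition's active-word lookup walks the array from the start and stops at the first word that begins after the playhead, so out-of-order entries would highlight the wrong word.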

A minimal hand-written example:

import { TikTokCaption } from "@workspace/compositions/compositions/TikTokCaption/TikTokCaption"

<TikTokCaption
  words={[
    { start: 0.00, end: 0.40, text: "this" },
    { start: 0.40, end: 0.70, text: "is" },
    { start: 0.70, end: 1.10, text: "how" },
    { start: 1.10, end: 1.50, text: "captions" },
    { start: 1.50, end: 1.90, text: "should" },
    { start: 1.90, end: 2.30, text: "look" },
  ]}
  audioUrl="/audio/my-voiceover.mp3"
  captionVAlign="center"
  fontScale={1}
/>

Props

Name	Type	Default
words	CaptionWord[] (required)	—
audioUrl	string (URL)	—
captionVAlign	"top" | "center" | "bottom"	"center"
captionHAlign	"left" | "center" | "right"	"center"
fontScale	number	1
clipStyle	ClipStyle	—

Composition

ID: TikTokCaption
Resolution: 1080×1920
FPS: 30
Duration: 5.0s
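
At 30 FPS the 5.0 s default corresponds to 150 frames. For real voiceovers you will probably want the duration to follow the transcript instead. A minimal sketch, assuming you set the composition's frame count yourself; the helper name durationInFramesFor and the half-second tail are illustrative choices, not part of this project:

```typescript
type CaptionWord = { start: number; end: number; text: string };

// Hypothetical helper: derives a frame count from the last word's end
// timestamp plus a short tail so the final word does not cut off abruptly.
function durationInFramesFor(
  words: CaptionWord[],
  fps: number,
  tailSeconds = 0.5,
): number {
  // Use the maximum end time rather than the last entry, in case the
  // array is not sorted.
  const lastEnd = words.reduce((max, w) => Math.max(max, w.end), 0);
  // Round up to a whole frame and never return fewer than one frame.
  return Math.max(1, Math.ceil((lastEnd + tailSeconds) * fps));
}
```

The result can be passed wherever your Remotion setup expects a duration in frames.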

Source

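Copy or download the React source and drop it into your own Remotion project. The only package dependency beyond React is remotion. The listing also imports three local modules (../../clip-style, ../../use-font-ready and ./config), so copy those as well or replace them with your own equivalents.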

"use client";
import { AbsoluteFill, Audio, useCurrentFrame, useVideoConfig } from "remotion";
import { type ClipStyle, resolveClipStyle } from "../../clip-style";
import { useFontReady } from "../../use-font-ready";
import type { HAlign, VAlign } from "./config";

export type CaptionWord = {
  start: number;
  end: number;
  text: string;
};

export type TikTokCaptionProps = {
  words: CaptionWord[];
  audioUrl?: string;
  captionVAlign?: VAlign;
  captionHAlign?: HAlign;
  // Multiplier on the base font size. 1 = medium, 0.7 small, 1.6 huge.
  fontScale?: number;
  clipStyle?: ClipStyle;
};

const BASE_FONT_SIZE = 132;
// TikTok-style captions show 2–3 words at a time. We split sooner on
// pauses to keep phrases readable in short-form clips.
const PHRASE_MAX_GAP_SECONDS = 0.3;
const PHRASE_MAX_WORDS = 3;

const VERT_TO_JUSTIFY: Record<VAlign, string> = {
  top: "flex-start",
  center: "center",
  bottom: "flex-end",
};

const HORIZ_TO_ALIGN: Record<HAlign, string> = {
  left: "flex-start",
  center: "center",
  right: "flex-end",
};

const HORIZ_TO_TEXT_ALIGN: Record<HAlign, "left" | "center" | "right"> = {
  left: "left",
  center: "center",
  right: "right",
};

function groupIntoPhrases(words: CaptionWord[]): CaptionWord[][] {
  const phrases: CaptionWord[][] = [];
  let current: CaptionWord[] = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    const shouldBreak =
      current.length >= PHRASE_MAX_WORDS ||
      (prev && w.start - prev.end > PHRASE_MAX_GAP_SECONDS);
    if (shouldBreak && current.length > 0) {
      phrases.push(current);
      current = [];
    }
    current.push(w);
  }
  if (current.length > 0) phrases.push(current);
  return phrases;
}

export const TikTokCaption: React.FC<TikTokCaptionProps> = ({
  words,
  audioUrl,
  captionVAlign = "center",
  captionHAlign = "center",
  fontScale = 1,
  clipStyle,
}) => {
  // Real frame — word timestamps from Whisper are wall-clock seconds, so
  // they must be compared against real time, not the 60fps design frame.
  const frame = useCurrentFrame();
  const { fps, width, height } = useVideoConfig();

  // Inactive words use `color`, active word uses `accent`, font is
  // `fontFamily` — all editable from the universal Style section.
  const s = resolveClipStyle(clipStyle, {
    background: "transparent",
    color: "#ffffff",
    fontFamily: "'Anton', Impact, sans-serif",
    accent: "#facc15",
  });

  useFontReady(s.fontFamily);

  const timeSeconds = frame / fps;

  let activeIndex = -1;
  for (let i = 0; i < words.length; i++) {
    const w = words[i];
    if (!w) continue;
    if (timeSeconds >= w.start && timeSeconds < w.end) {
      activeIndex = i;
      break;
    }
    if (timeSeconds < w.start) {
      activeIndex = i - 1;
      break;
    }
    if (i === words.length - 1 && timeSeconds >= w.end) {
      activeIndex = i;
    }
  }

  const phrases = groupIntoPhrases(words);
  const activePhrase =
    activeIndex >= 0
      ? phrases.find((p) => p.some((w) => w === words[activeIndex]))
      : undefined;

  const shortSide = Math.min(width, height);
  const baseSize = (BASE_FONT_SIZE * shortSide) / 1080;
  const fontSize = baseSize * fontScale;
  const strokeWidth = Math.max(2, fontSize * 0.06);

  const isTransparent = s.background === "transparent";

  return (
    <AbsoluteFill
      style={{
        background: isTransparent ? "transparent" : s.background,
        fontFamily: s.fontFamily,
        fontWeight: 800,
        display: "flex",
        flexDirection: "column",
        alignItems: HORIZ_TO_ALIGN[captionHAlign],
        justifyContent: VERT_TO_JUSTIFY[captionVAlign],
        padding: `${height * 0.08}px ${width * 0.06}px`,
      }}
    >
      {audioUrl ? <Audio src={audioUrl} /> : null}

      {activePhrase ? (
        <div
          style={{
            display: "flex",
            flexWrap: "wrap",
            gap: `${fontSize * 0.12}px ${fontSize * 0.28}px`,
            justifyContent: HORIZ_TO_ALIGN[captionHAlign],
            textAlign: HORIZ_TO_TEXT_ALIGN[captionHAlign],
            maxWidth: width * 0.88,
            lineHeight: 1.05,
          }}
        >
          {activePhrase.map((w, i) => {
            const isActive = w === words[activeIndex];
            return (
              <span
                key={`${w.start}-${i}`}
                style={{
                  display: "inline-block",
                  fontSize,
                  fontWeight: 800,
                  letterSpacing: "-0.01em",
                  color: isActive ? s.accent : s.color,
                  WebkitTextStroke: `${strokeWidth}px #000`,
                  paintOrder: "stroke fill",
                  textShadow: isTransparent
                    ? `0 ${fontSize * 0.025}px ${fontSize * 0.06}px rgba(0,0,0,0.55)`
                    : `0 ${fontSize * 0.02}px ${fontSize * 0.04}px rgba(0,0,0,0.5)`,
                }}
              >
                {w.text}
              </span>
            );
          })}
        </div>
      ) : null}
    </AbsoluteFill>
  );
};

Save as TikTokCaption/TikTokCaption.tsx
