Press, speak, paste.

Dictation that never leaves your device.

On-device AI dictation for Apple Silicon Macs (M1 or newer, macOS 14+) — Windows on the roadmap. No cloud round-trip, no app account, no waiting. Hold a key, speak, and the cleaned-up text lands on your clipboard ready to paste.

  • Apple Silicon (M1 or newer) · macOS 14+
  • Windows · On the roadmap
  • Free for personal use
  • No account. No cloud. No tracking.
Hold ⌘⌥. to start · ON-DEVICE

Hotkey ⌘⌥. — push to talk

On-device.

Whisper or Parakeet for transcription, Apple Intelligence or Gemma 4 for cleanup. Everything runs on your device.

No account needed.

Nothing to sign up for to use the app. Commercial licenses do require a billing email at Stripe.

No dictation telemetry.

Audio, transcripts and crash reports never leave your machine. See our Privacy page for the full network picture.

Works on a plane.

After the one-time model download on first run, Vox runs without internet. Verifiable with any network monitor.

Estimated time saved

Speaking is roughly 3× faster than typing.

Stanford measured 153 WPM speaking vs 52 WPM typing. For one person typing ~3,000 words a day at work, the gap is about 40 minutes a day. Slide your hourly rate to see what those minutes could be worth. Figures are illustrative; your savings will vary.

$75/hr

Default $75, slightly above the BLS mean wage for US software developers ($66.78/hr). Move it to whatever your time is actually worth.

Estimated savings · per year

147 hr

$11,000

3.7 full work-weeks of typing you don't have to do — about $917/mo at $75/hr.

Math: 3,000 words ÷ 50 WPM = 60 min typing. ÷ 150 WPM = 20 min speaking. → 40 min back per day · 220 workdays/year · $75/hr.
Where these numbers come from →
  • Speaking ≈ 3× faster than typing — Ruan et al., Stanford HCI (2016): 153 WPM speech vs 52 WPM typing in English; 123 vs 43 WPM in Mandarin.
  • Conversational speaking sits at ~150 WPM — National Center for Voice and Speech via VirtualSpeech.
  • 50 WPM typing — adult average is 40 WPM; office workers typically target 60 WPM. 50 is a defensible midpoint (Wonderlic; TypingSpeedHub 2025).
  • 3,000 words/day — knowledge workers send ~40 emails/day at ~75 words each, plus Slack and AI chats; ~3,000 words is a defensible midpoint (cloudHQ; Boomerang via EmailAnalytics).
  • $75/hr default — the BLS mean wage for US software developers (May 2024) is $66.78/hr ($138,890/yr). $75 nudges that up slightly to reflect the startup premium most readers will recognize.
  • We assume 220 workdays per year: 260 weekdays minus ~10 holidays and roughly 30 days of vacation, sick leave, and other time off. The math doesn't count time spent reading or thinking, only the keystroke-vs-utterance gap on text you compose.
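
For readers who want to plug in their own numbers, the arithmetic above can be sketched in a few lines of Python. The function name and parameter names are illustrative, not part of Vox; the defaults are the figures quoted on this page.

```python
def yearly_savings(words_per_day=3000, typing_wpm=50, speaking_wpm=150,
                   workdays=220, hourly_rate=75.0):
    """Return (hours saved per year, dollars saved per year).

    Only counts the typing-vs-speaking gap on text you compose;
    reading and thinking time are not modeled.
    """
    typing_minutes = words_per_day / typing_wpm      # 3,000 / 50  = 60 min
    speaking_minutes = words_per_day / speaking_wpm  # 3,000 / 150 = 20 min
    minutes_saved_per_day = typing_minutes - speaking_minutes
    hours_per_year = minutes_saved_per_day * workdays / 60
    return hours_per_year, hours_per_year * hourly_rate


hours, dollars = yearly_savings()
print(f"{hours:.0f} hr, ${dollars:,.0f}/yr, ${dollars / 12:,.0f}/mo")
# prints: 147 hr, $11,000/yr, $917/mo
```

Passing a different `hourly_rate` or `words_per_day` reproduces what the slider does; the result is only as good as the assumptions listed above.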

Free for personal use

Free for you. Paid for your company.

Vox is fair-source: free for personal use — your own writing, side projects, hobby work — under the perpetual Personal Use license in the EULA. If you (or anyone on your team) use Vox as part of your job at a company with more than one person, each user needs a commercial license. Pricing starts at $12 USD/seat/mo. See plans & pricing →

How it works

Three keys. No setup.

No account, no model selection, no permissions tour. Vox ships with sensible defaults so the first dictation works.

01

Hold the hotkey

Default ⌘⌥. on Mac (Ctrl+Alt+. planned for Windows). Vox shows a small listening pill near your cursor.

02

Speak normally

Filler words and self-corrections are fine — Vox cleans them up. Don't worry about punctuation.

03

Release. Paste.

Cleaned-up text lands on your clipboard. Press ⌘V where you want it. No silent keystroke synthesis.

Voice modes

One hotkey, the right voice.

Vox picks a mode based on the app you're typing into. Each mode is a tuned cleanup engine — same dictation, different output style.

General

Anywhere

Balanced cleanup. Drops fillers, fixes self-corrections, enumerates lists.

You say

uh so like the meeting is um at three pm tomorrow

Vox writes

The meeting is at 3 PM tomorrow.

Email

Mail · Gmail · Outlook

Formal, fully punctuated email body. Never invents a salutation or sign-off.

You say

hey just wanted to follow up on the proposal um can we sync next week

Vox writes

Just wanted to follow up on the proposal. Can we sync next week?

Chat

Slack · Discord · iMessage

Casual and short. Fragments OK, contractions preserved, ruthless trim.

You say

yeah I think that works for me um lets just do it on tuesday then

Vox writes

yeah works for me — let's do tuesday

Code Comment

Xcode · GitHub · VS Code

Present-tense third-person. Preserves identifiers verbatim. No markdown synthesis.

You say

so this method invalidates the cache when the user updates their profile

Vox writes

Invalidates the cache when the user updates their profile.

Notes

Apple Notes · Notion · Obsidian

Full sentences. Bullets on enumeration, paragraph breaks at topic shifts.

You say

things to do today buy groceries pick up dry cleaning email Sara also need to book the flight

Vox writes

Things to do today:
  • Buy groceries
  • Pick up dry cleaning
  • Email Sara
  • Book the flight

Roll your own.

Custom modes with your own system prompt, post-processing rules, and per-app auto-trigger. Useful for stand-up updates, PR reviews, or your own tone.
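
As a rough mental model of per-app mode selection, the sketch below maps the frontmost app to a mode and falls back to General. It is not Vox's actual implementation: the table, the `pick_mode` function, and the rule that custom triggers win over built-in ones are all assumptions made for illustration. Only the app and mode names come from this page.

```python
# Hypothetical lookup table built from the app lists on this page.
BUILT_IN_MODES = {
    "Mail": "Email", "Gmail": "Email", "Outlook": "Email",
    "Slack": "Chat", "Discord": "Chat", "iMessage": "Chat",
    "Xcode": "Code Comment", "GitHub": "Code Comment", "VS Code": "Code Comment",
    "Apple Notes": "Notes", "Notion": "Notes", "Obsidian": "Notes",
}


def pick_mode(app_name, custom_triggers=None):
    """Return the mode name for the app being typed into.

    custom_triggers is a hypothetical {app name: custom mode name} dict;
    this sketch assumes custom entries take precedence over built-in ones.
    """
    custom_triggers = custom_triggers or {}
    if app_name in custom_triggers:
        return custom_triggers[app_name]
    # Apps with no specific mode fall back to the General mode ("Anywhere").
    return BUILT_IN_MODES.get(app_name, "General")


print(pick_mode("Slack"))                         # Chat
print(pick_mode("Terminal"))                      # General
print(pick_mode("Slack", {"Slack": "Stand-up"}))  # Stand-up
```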

Vox vs cloud dictation

The architecture is the differentiator.

Most dictation tools send your audio to a server. Vox doesn't. The differences cascade from there.

  • Where audio is processed · Vox: On your device · Cloud: Typically on their servers
  • Internet required at runtime · Vox: No (after first-run model download) · Cloud: Typically yes
  • App account required · Vox: No · Cloud: Typically yes
  • Audio retained after transcription · Vox: Never written to disk · Cloud: Depends on the vendor
  • Telemetry during dictation · Vox: None · Cloud: Varies by vendor
  • Works on a plane · Vox: Yes · Cloud: Typically no
  • Network inspection · Vox: Independently verifiable with Little Snitch or GlassWire · Cloud: Server-side, not user-verifiable

We don't name competitors here on purpose — the architecture is the comparison, not the brand.
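
If you'd rather check the network claim from a terminal than from a GUI monitor, one option on macOS is to capture the output of `lsof -i -n -P` while dictating and filter it for the app's process. The helper below is a minimal sketch of that filtering step; the process name "Vox" is an assumption, so confirm the real name in Activity Monitor first.

```python
def open_connections(lsof_output, process_name):
    """Return the NAME column of each lsof row whose COMMAND starts with process_name.

    Expects the standard nine-column `lsof -i` layout:
    COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    """
    connections = []
    for line in lsof_output.splitlines()[1:]:  # first line is the header
        fields = line.split(None, 8)           # NAME may contain spaces
        if len(fields) == 9 and fields[0].startswith(process_name):
            connections.append(fields[8])
    return connections


sample = """COMMAND   PID USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
Safari    412 sam    23u  IPv4 0x1234567890abcdef      0t0  TCP 192.168.1.5:51234->17.57.144.10:443 (ESTABLISHED)
"""
print(open_connections(sample, "Vox"))     # [] means no open sockets for that process
print(open_connections(sample, "Safari"))
```

To run it against live data, pass in the text returned by `subprocess.run(["lsof", "-i", "-n", "-P"], capture_output=True, text=True).stdout`. An empty list during dictation is consistent with the offline claim; a snapshot like this is weaker evidence than a monitor that logs connections over time.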

FAQ

Questions, answered.

  • Does Vox work offline?
    Yes. Audio is transcribed by a local model — Whisper or the faster NVIDIA Parakeet model — running on your Mac's Neural Engine, Metal GPU, or a Windows NVIDIA GPU (with CPU fallback). Cleanup runs on Apple Intelligence (macOS 26+) or a local Gemma 4 model. No part of the pipeline requires the internet at runtime.
  • What gets sent to a server?
    Nothing during dictation. Vox does not collect audio, transcripts, telemetry, crash reports, or analytics. The only network calls Vox makes are the one-time model downloads on first run and an optional update check.
  • What hardware does Vox need?
    Today: any Apple Silicon Mac (M1 or newer — late 2020 onwards) running macOS 14 or newer. Intel Macs are not supported. A Windows build is on the roadmap — when released, it will target Windows 11 with an optional NVIDIA GPU for faster Parakeet transcription (CPU fallback supported). Cleanup with Gemma 4 will run on either platform; Apple Intelligence cleanup stays macOS-only.
  • Do I need to set up a hotkey?
    No. Vox ships with ⌘⌥. bound by default on Mac (Ctrl+Alt+. is planned for Windows), and the hotkey works without macOS Accessibility permission. You can rebind it in Settings.
  • Will it auto-paste?
    No. Vox copies the cleaned-up text to your clipboard. You press ⌘V where you want it. This is deliberate: silent keystroke synthesis is fragile, requires Accessibility permission, and breaks confirmations in apps that intercept paste.
  • Is Vox open source?
    Not today. Vox is fair-source rather than open source. The dictation pipeline ships with the desktop app, and you can verify what does and doesn't leave your machine with any network monitor. We're considering open-sourcing the modes engine; let us know if it would matter to you.
  • Can I use Vox at work?
    Vox is free for personal use — your own writing, side projects, hobby work. If you use Vox as part of your job at a company with more than one person, you need a commercial license. See Vox for Teams.
  • What about other languages?
    Vox supports Whisper (multilingual) and Parakeet (English, with multilingual variants on the roadmap). The cleanup engine is currently tuned for English — other languages will produce competent transcripts with light cleanup until per-language tuning ships.