Most AI apps today work by sending your data to a server. You speak, your audio gets uploaded to the cloud, a model on someone else's hardware processes it, and the result comes back to you. It is fast, it is convenient, and it means a third party hears everything you say.
When I set out to build TAWK, a voice-to-text app for macOS, I made one non-negotiable decision from day one: everything would run locally. No cloud. No API calls. No audio leaving the machine. That decision shaped every technical choice I made, created a set of challenges I did not anticipate, and ultimately produced a product I am genuinely proud of.
This is the story of building an offline AI app from scratch, the specific technical problems I had to solve, and what I learned about shipping a local-first AI product on macOS.
Why Offline Matters More Than You Think
I work as Managing Director at Mindvalley, where I spend my days on strategy calls, writing internal documents, and communicating across teams. I dictate a lot. Meeting notes, Slack messages, emails, personal journal entries. When I looked at the voice-to-text tools available on macOS, every good option sent audio to a server.
Think about what you actually say to a voice-to-text tool throughout a day. You dictate emails to your partner. You narrate journal entries. You capture half-formed business ideas that you would never put in a shared document. You compose messages to your therapist, your lawyer, your doctor. The idea that all of this audio is being transmitted to a third-party server and processed on hardware you do not control should make anyone uncomfortable.
Apple's built-in dictation has improved, but it still relies on server-side processing for the best accuracy. Third-party options were either cloud-based subscriptions or clunky command-line tools that required you to be a developer just to use them. Nothing existed that was simultaneously private, accurate, simple, and native to macOS.
Privacy is not a feature you can bolt on later. It is an architectural decision that must be made at the foundation, and it changes everything downstream.
Choosing Whisper Over Cloud APIs
When OpenAI released Whisper as an open-source model, it fundamentally changed what was possible for local speech recognition. Before Whisper, running speech-to-text locally meant using models that were either slow, inaccurate, or both. Whisper changed that equation.
Whisper comes in several sizes: tiny, base, small, medium, and large. Each step up improves accuracy but increases the model size and processing time. I tested all of them extensively and landed on the small model as the right trade-off for a desktop app.
Here is why. The small model is about 460MB, which is large for a traditional app but manageable for something that includes a neural network. On an Apple Silicon Mac, it transcribes a typical dictation clip in under a second. The accuracy is remarkably good for everyday speech. It handles accents, technical jargon, and natural conversational patterns without issue. And critically, it supports over 90 languages out of the box.
I compared this extensively against cloud APIs. Google's Speech-to-Text and OpenAI's own Whisper API are slightly more accurate on edge cases, but the difference is marginal for typical dictation use. What you gain with local processing is absolute privacy, zero network latency, the ability to work without internet, and no per-request API costs.
The Business Case for Local Processing
Going offline was not just a privacy decision. It was a business model decision. If TAWK used a cloud API, every transcription would cost me money. That means I either eat the margin or pass the cost to users through a subscription. Most voice-to-text SaaS products charge $8 to $15 per month. Over a year, that is $100 to $180 for something that does one thing.
With local processing, my cost per user after the sale is essentially zero. No servers. No API bills. No infrastructure to maintain. That let me price TAWK at a one-time $19 fee. Users pay once and own it forever. It is a simpler, more honest transaction. And from a business perspective, it means every sale is pure margin with no ongoing liability.
The Technical Architecture
TAWK is built with Python, which is not the conventional choice for a macOS app. Swift would be the "correct" language. But I know Python deeply, Whisper has first-class Python bindings, and the machine learning ecosystem is built on Python. Choosing familiarity over convention let me move significantly faster.
The core stack is:
- Python 3 as the runtime
- OpenAI Whisper (small model) for speech-to-text
- rumps for the macOS menu bar interface
- PyAudio for microphone input
- CGEventPost for simulating keyboard input to paste transcribed text
- PyInstaller for bundling everything into a native .app
The flow is straightforward: the user presses a global keyboard shortcut, TAWK records from the microphone until the user releases, the audio is fed into the Whisper model, and the transcribed text is typed out at the current cursor position. The entire process takes about a second for a typical sentence.
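The flow above can be sketched as a small pipeline. This is a simplified illustration, not TAWK's actual code: the recorder, transcriber, and typing function are injected as plain callables so the control flow is visible on its own (in the real app these would wrap PyAudio, Whisper's transcribe call, and CGEventPost respectively).

```python
def dictate(record_audio, transcribe, type_text):
    """One press-to-talk cycle: record, transcribe, type out.

    record_audio -- blocks until the hotkey is released, returns raw audio
    transcribe   -- feeds the audio to the speech model, returns text
    type_text    -- emits the text as keystrokes at the current cursor
    """
    audio = record_audio()
    if not audio:
        return ""            # nothing captured, so type nothing
    text = transcribe(audio).strip()
    if text:
        type_text(text)
    return text
```

Keeping the three stages behind plain function boundaries also makes the pipeline trivially testable with fakes, which matters for an app whose real dependencies (microphone, model, keyboard events) are all hard to exercise in automated tests.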
How rumps Powers the Menu Bar
The rumps library is a lightweight Python framework for building macOS status bar applications. It gives you a menu bar icon, dropdown menus, and callback handling without needing to write any Objective-C or Swift. For a utility app like TAWK that lives in the menu bar and has minimal UI, it is perfect.
One critical lesson I learned: macOS menu bar apps must have LSUIElement: true set in the Info.plist file. Without it, the app appears in the Dock and the Cmd+Tab application switcher, which is wrong for a background utility. I also discovered that calling setActivationPolicy_(Accessory) after displaying a modal window causes focus and rendering issues on macOS 15. The solution was to rely on LSUIElement from the start and never touch the activation policy at runtime.
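For reference, the relevant Info.plist entries look like the fragment below. This is an illustrative excerpt, not a complete plist; the usage-description string is a placeholder, though some microphone-access key of this kind is required on modern macOS.

```xml
<!-- Info.plist fragment: hide the app from the Dock and Cmd+Tab, -->
<!-- and declare why the microphone is needed. -->
<key>LSUIElement</key>
<true/>
<key>NSMicrophoneUsageDescription</key>
<string>Audio is recorded and transcribed locally on your Mac.</string>
```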
The Hardest Part: PyInstaller Bundling
Getting Whisper to run in a Python script on my own machine was straightforward. Packaging that into a distributable macOS application was where things got painful.
PyInstaller is the standard tool for turning Python scripts into standalone executables. It analyzes your imports, bundles the Python interpreter and all dependencies, and produces a self-contained application. In theory, it is simple. In practice, it broke in several ways that cost me days.
Missing model assets. Whisper depends on a file called mel_filters.npz at runtime. PyInstaller does not know about it because it is loaded dynamically, not imported. The app would build fine, launch fine, and then crash the first time you tried to transcribe. I had to manually add it to the datas list in the PyInstaller spec file. This pattern repeated for several other assets.
Massive binary size. PyTorch, which Whisper depends on, pulls in CUDA libraries by default. On macOS, CUDA is useless because Apple does not support it. But PyInstaller does not know that and bundles everything. The initial build was over 1.2GB. I spent considerable time surgically excluding unnecessary libraries while making sure I did not accidentally remove something that was needed.
Architecture-specific builds. Apple Silicon and Intel Macs require different binaries. PyInstaller builds for the architecture it is running on. I had to set up separate build pipelines and test on both architectures to make sure nothing was broken.
Code Signing and Notarization: The Real Boss Fight
If PyInstaller bundling was hard, Apple's code signing and notarization process was harder. And unlike bundling, where the errors are at least somewhat logical, signing and notarization errors are often cryptic and the documentation is scattered.
Here is what macOS requires before it will run your app without blocking it:
- Code signing: Every binary in the app bundle must be signed with a valid Apple Developer ID certificate.
- Notarization: The signed app must be uploaded to Apple's notarization service, which scans it for malware and validates the signatures.
- Stapling: The notarization ticket must be stapled to the app so it can be verified offline.
The critical detail that cost me the most time: you must sign every single Mach-O binary inside the bundle individually. Not just .so and .dylib files, but also standalone executables like protoc and the Python framework itself. The --deep flag for codesign seems like it should handle this, but Apple explicitly discourages it for notarization. You need to enumerate every binary and sign them from the inside out.
Notarization introduced its own challenges. For apps that include PyTorch, the upload and processing step can take a very long time and sometimes gets stuck in an "In Progress" state. This is documented behavior from Apple. The only remedy is patience.
The TCC and CDHash Problem
macOS uses a system called Transparency, Consent, and Control (TCC) to manage app permissions. When a user grants TAWK access to the microphone, macOS stores that permission against the app's CDHash, which is a cryptographic hash of the binary. Here is the problem: ad-hoc signed apps get a new CDHash on every rebuild. That means every time I shipped an update, users would have to re-grant microphone permissions. The fix was getting a proper Apple Developer ID certificate, which provides a stable signing identity across builds.
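The core of the problem is that the permission is keyed to a digest of the code, not to the app's name. The real CDHash is a digest of the code directory that codesign builds, not a plain file hash, but a simple content hash illustrates why every ad-hoc rebuild produces a new identity:

```python
import hashlib

def cdhash_like(path):
    """Illustrative only: NOT the real CDHash computation. It just shows
    the property that matters for TCC -- any byte-level change to the
    binary (which every rebuild produces) yields a different digest,
    so permissions granted against the old digest no longer apply."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```

With a Developer ID certificate, macOS can fall back to the stable signing identity (team ID plus bundle ID) rather than the per-build hash, which is why the fix works.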
The Keyboard Event Gotcha Nobody Warns You About
TAWK types out transcribed text by simulating keyboard events using CGEventPost. This is the macOS API for programmatically generating input events. It works well, with one nasty exception.
CGEventPost inherits the current modifier key state. If the user was holding Shift when they triggered the keyboard shortcut to start recording, the simulated keystrokes would also have Shift applied. The result was random capitalization, special characters, or completely garbled output depending on what keys were held.
The fix was a single line of code: CGEventSetFlags(event, 0) before posting each keystroke. It explicitly clears all modifier flags. A trivial solution, but it took hours to diagnose because the behavior was inconsistent and hard to reproduce reliably.
Logging: The Invisible Lifeline
Here is something I wish someone had told me before I started: when you ship a macOS .app bundle, stdout and stderr are invisible. There is no terminal. There is no console output. If something goes wrong for a user and they email you saying "it does not work," you have nothing to go on.
I added file-based logging from day one. Every significant action, every error, every state transition is written to a log file. When a user reports an issue, I ask them to send the log file. It has saved me dozens of hours of guesswork. If you are building any desktop application that you distribute to users, this is not optional. It is the single most important debugging tool you have.
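A setup along these lines is all it takes with Python's standard library. The logger name and log path here are assumptions, not TAWK's actual values; on macOS a conventional location is ~/Library/Logs/&lt;AppName&gt;/.

```python
import logging
import logging.handlers
import os

def setup_logging(log_path):
    """File-based logging for a bundled .app, where stdout/stderr vanish.

    Uses a rotating handler so the log can never grow without bound on a
    user's machine.
    """
    os.makedirs(os.path.dirname(log_path), exist_ok=True)
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=1_000_000, backupCount=3, encoding="utf-8"
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger = logging.getLogger("tawk")   # name illustrative
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger
```

The rotation settings are a judgment call: a megabyte of text log is weeks of dictation activity, and three backups keep enough history to diagnose most reports without users ever noticing the disk usage.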
What I Learned About Building Offline AI Products
Building TAWK taught me lessons that apply to anyone considering an offline-first AI product:
The model is the easy part. Whisper works beautifully. Getting it into a signed, notarized, distributable macOS app that handles permissions, errors, and edge cases gracefully is where 80% of the engineering effort went.
Binary size matters. Users notice when they are downloading a 500MB app. I spent significant effort trimming unnecessary dependencies. Every megabyte removed is friction reduced.
Platform-specific knowledge is the moat. Anyone can run Whisper in a Python script. The hard part is knowing that LSUIElement must be true, that CGEventSetFlags needs to clear modifiers, that codesign must go inside-out, that TCC uses CDHash. This knowledge only comes from building and shipping. It cannot be learned from documentation alone, because much of it is undocumented.
Offline-first is a competitive advantage. In a world where every app wants to send your data to a server, building something that works entirely locally is a meaningful differentiator. Privacy-conscious users will find you and they will be loyal.
TAWK is available at gettawk.com for $19. It runs on any Mac, processes everything locally, and supports 90+ languages. I use it every day, including to dictate parts of this post. If you have been looking for voice-to-text that respects your privacy, give it a try.
Frequently Asked Questions
Can you build an offline AI app?
Absolutely. Models like OpenAI's Whisper can run entirely on a user's local machine without any internet connection. TAWK is a working example of this. The key challenges are bundling the model weights with the application, managing the larger binary size, and handling platform-specific requirements like macOS code signing and notarization. The model itself is the straightforward part. The distribution and packaging is where the real complexity lives.
How does local voice recognition work?
Local voice recognition works by running a speech-to-text model directly on the user's hardware. The app records audio from the microphone, feeds it into the model (such as Whisper), and the model outputs transcribed text, all without sending data over the internet. On modern Apple Silicon Macs, Whisper's small model can transcribe speech in under a second, making local processing feel nearly instantaneous. The trade-off is a larger application size since the model weights must be bundled with the app.
What is Whisper AI?
Whisper is an open-source automatic speech recognition model released by OpenAI. It was trained on 680,000 hours of multilingual audio data and supports over 90 languages. It comes in multiple sizes (tiny, base, small, medium, large), each offering different trade-offs between accuracy and speed. Because it is open-source, developers can bundle it directly into applications for fully offline, privacy-preserving voice recognition without paying per-request API costs.
How hard is it to ship a macOS AI app?
Significantly harder than building the AI functionality itself. Beyond getting the model to run, you need to handle PyInstaller bundling (which does not auto-detect dynamically loaded assets), sign every Mach-O binary in the bundle individually, pass Apple's notarization process, manage TCC permissions for microphone and accessibility access, handle architecture differences between Apple Silicon and Intel Macs, and implement file-based logging since stdout is invisible in app bundles. The AI component might take a weekend. The shipping infrastructure can take weeks.