OpenAI, the company behind the image generation and meme-spawning program DALL-E and the powerful text autocomplete engine GPT-3, has launched a new, open-source neural network intended to convert audio into written text (via TechCrunch). It’s called Whisper, and the company says it “approaches human-level robustness and accuracy on English speech recognition” and that it can also automatically recognize, transcribe, and translate other languages, such as Spanish, Italian, and Japanese.
As someone who constantly records and transcribes interviews, I was immediately excited about this news — I realized I could write my own app to securely transcribe audio directly on my computer. While cloud-based services like Otter.ai and Trint work for most things and are relatively secure, there are a few interviews where I, or my sources, would feel more comfortable keeping the audio file off the internet entirely.
Using it turned out to be even easier than I expected; I’d already set up Python and several developer tools on my computer, so installing Whisper was as easy as running a single Terminal command. Within 15 minutes, I was able to use Whisper to transcribe a test audio clip I’d recorded. For someone who is relatively tech-savvy but hasn’t yet set up Python, FFmpeg, Xcode, and Homebrew, it would probably take an hour or two. However, someone is already working on making the process much easier and more user-friendly, which we’ll get to in a moment.
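For the curious, the setup really is just a couple of Terminal commands. This is a rough sketch of the process on a Mac that already has Homebrew and Python installed; your exact commands may differ depending on how your environment is configured:

```shell
# Install FFmpeg, which Whisper needs to decode audio files
brew install ffmpeg

# Install Whisper itself from PyPI
pip install -U openai-whisper

# Transcribe a recording; "base" is one of the smaller, faster models —
# larger ones (e.g. "small", "medium") are slower but more accurate
whisper interview.mp3 --model base --language en
```

Running that last command writes the transcript to text and subtitle files alongside the audio, which is all a basic workflow needs.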
While OpenAI definitely saw this use case as a possibility, it’s pretty clear that the company is targeting researchers and developers with this release. In the blog post announcing Whisper, the team said the code “can serve as a foundation for building useful applications and for further research on robust speech processing” and that it hopes “Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications.” This approach is still notable, though: the company has historically limited access to its most popular machine learning projects, such as DALL-E and GPT-3, citing a desire to “learn about real-world use and continue to iterate on our safety systems.”
There’s also the fact that installing Whisper isn’t exactly a user-friendly process for most people. However, journalist Peter Sterne is teaming up with GitHub developer advocate Christina Warren to try to fix that, announcing that they’re creating a “free, secure, and easy-to-use transcription app for journalists” based on Whisper’s machine learning model. I spoke to Sterne, and he said he decided the program, called Stage Whisper, should exist after running some of his interviews through Whisper and determining that it was “the best transcription I’d ever used, with the exception of human transcribers.”
I compared a Whisper-generated transcription to what Otter.ai and Trint produced for the same file, and I’d say it was comparable. All of them have enough errors that I’d never just copy quotes from them and paste them into an article without double-checking against the audio (which, of course, is best practice no matter what service you’re using). But Whisper’s version would definitely do the job for me; I can search through it to find the sections I need, then manually check them. In theory, Stage Whisper should perform about the same, since it will use the same model, just with a GUI wrapped around it.
Sterne admitted that technology from Apple and Google could make Stage Whisper obsolete in a few years — the Pixel’s voice recorder app has been able to do offline transcriptions for years, a version of that feature is starting to roll out to some other Android devices, and Apple has offline dictation built into iOS (though there’s currently no good way to use it to transcribe audio files). “But we can’t wait that long,” Sterne said. “Journalists like us need good transcription apps today.” He hopes to have a bare-bones version of the Whisper-based app ready within two weeks.
To be clear, Whisper probably won’t completely displace cloud-based services like Otter.ai and Trint, no matter how easy it becomes to use. For one, OpenAI’s model lacks one of the biggest features of traditional transcription services: the ability to label who said what. Sterne said Stage Whisper probably wouldn’t support that feature either: “We’re not developing our own machine learning model.”
The cloud is just someone else’s computer – which probably means it’s a lot faster
And while you get the advantages of local processing, you also get the disadvantages. Most notably, your laptop is almost certainly significantly less powerful than the computers a professional transcription service uses. For example, I fed the audio of a 24-minute interview into Whisper running on my M1 MacBook Pro; it took about 52 minutes to transcribe the entire file. (Yes, I made sure it was using the Apple Silicon version of Python rather than the Intel one.) Otter spat out a transcript in less than eight minutes.
OpenAI’s technology does have one major advantage, though: the price. Cloud-based subscription services will almost certainly cost you money if you use them professionally (Otter has a free tier, but upcoming changes will make it less useful for people who transcribe things often), and the transcription features built into platforms like Microsoft Word or the Pixel require paying for separate software or hardware. Stage Whisper — and Whisper itself — is free and can run on whatever computer you already have.
Again, OpenAI has bigger ambitions for Whisper than serving as the foundation for a secure transcription app — and I’m very excited about what researchers will do with it, and what they’ll learn by studying the machine learning model, which was trained on “680,000 hours of multilingual and multitask supervised data collected from the web.” But the fact that it’s practically useful today makes it all the more exciting.