I've been playing around with speech-to-text tools lately, and Deepgram has really impressed me. Surprisingly, it's both cheaper and faster than OpenAI Whisper, which is kind of wild since Whisper is already cheaper and more reliable than options we've had for years, like Google Speech-to-Text. OpenAI might still be slightly ahead in accuracy (in my experience), but that's a trade-off I'm willing to make, especially considering how much easier Deepgram is to work with.
Their documentation is amazing, which is almost a miracle. According to scientists, you're more likely to be struck by lightning on a Tuesday evening while the Bears are headed to the Super Bowl after a pig gets its wings than you are to find good documentation in software. Go figure.
It's the efficiency, speed, and suite of transcription-focused features that make Deepgram, to me, the biggest no-brainer when it comes to speech-to-text.
My Tech Stack of Choice: Next.js and Fastify
For this project, I'm using Next.js for the frontend and Fastify for the backend.
- Next.js (Frontend): Next.js is a hugely popular React framework known for its server-side rendering, which makes your app SEO-friendly right out of the box (not that SEO matters for our use case). The real win is that it handles API calls and authentication smoothly thanks to its built-in server components, which simplifies things like refresh token logic a lot.
- Fastify (Backend): While Next.js is a solid tool, it's not the best at handling the real-time communication we need for live transcription. That's where Fastify comes in. It's a Node.js framework designed for speed and efficiency, perfect for managing our WebSocket connections, and it's more memory-efficient and faster than the usual suspects like Express.js.
One thing I'll always enjoy about the JS ecosystem: things are constantly getting better.
How It All Works
- User Interface (Next.js): The user enters a phone number in our React-based interface.
- Token Generation: The Next.js app requests a Twilio access token from its backend, which acts as a sort of authentication key for the call. From there, the call is initiated from the frontend.
- Twilio's Role: Twilio takes over, connecting the call and sending the audio data in real time to the Fastify backend via their "streams" feature (there's a TwiML sketch just after this list).
- Deepgram Receives Raw Audio Data: The backend forks off the audio stream to Deepgram, which then performs its magic and transcribes the conversation as it happens.
- WebSocket Connection: The backend relays Deepgram's transcription updates to the frontend through a WebSocket. This is the real-time communication channel that keeps everything flowing smoothly.
- Displaying the Magic: The frontend formats and updates the user interface to show the live transcript.
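To make Twilio's role concrete: the TwiML application tied to our access token decides what happens when the frontend dials. Here's a minimal sketch of that voice endpoint's response, built with the twilio Node helper library (the WebSocket URL and caller ID are placeholders, and the endpoint itself is a hypothetical addition, not code from this project):

import twilio from 'twilio';

// Hypothetical TwiML builder for the TwiML app's Voice URL.
// <Start><Stream> forks the call audio to our Fastify WebSocket,
// then <Dial> places the outbound call.
export function buildVoiceResponse(to: string): string {
  const response = new twilio.twiml.VoiceResponse();
  response.start().stream({ url: 'wss://your-server.example.com/twilio' });
  const callerId = process.env.TWILIO_CALLER_ID ?? ''; // your Twilio number
  response.dial({ callerId }).number(to);
  return response.toString();
}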
Code Walkthrough
Let's walk through the code for our real-time transcription app and I'll explain each part as we go.
First up, we need to set up an endpoint to generate a Twilio access token. This token is like a golden ticket that lets our frontend initiate and manage calls. We use Twilio's Voice Grant to give permission for making and receiving calls.
Here’s a snippet to get that token sorted:
import { NextRequest } from 'next/server';
import AccessToken, { VoiceGrant } from 'twilio/lib/jwt/AccessToken';
import { checkEnvVars } from './utils';

export const dynamic = 'force-dynamic'; // Opt out of caching

const { sid, sec, accSid, twiml } = checkEnvVars();

export async function GET(request: NextRequest) {
  try {
    const voiceGrant = new VoiceGrant({
      outgoingApplicationSid: twiml, // Specify the TwiML application SID
    });
    const token = new AccessToken(accSid, sid, sec, {
      identity: 'user', // User identity for the token
    });
    token.addGrant(voiceGrant); // Add the VoiceGrant to the token
    return Response.json({ token: token.toJwt() }); // Return the token as JSON
  } catch (error) {
    console.error('Error getting token:', error);
    // Without this, the route would resolve with no response on failure
    return Response.json({ error: 'Failed to generate token' }, { status: 500 });
  }
}
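Quick aside: the checkEnvVars helper imported from ./utils isn't shown in this post. A minimal sketch might look like this (the environment variable names here are my assumption; use whatever yours are called):

// Hypothetical ./utils — validates the Twilio env vars the route needs
export function checkEnvVars() {
  const {
    TWILIO_API_KEY_SID: sid,
    TWILIO_API_KEY_SECRET: sec,
    TWILIO_ACCOUNT_SID: accSid,
    TWIML_APP_SID: twiml,
  } = process.env;
  if (!sid || !sec || !accSid || !twiml) {
    throw new Error('Missing required Twilio environment variables');
  }
  return { sid, sec, accSid, twiml };
}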
So what's happening in the route itself? We're creating a Twilio AccessToken with a VoiceGrant. This grant allows the user to make calls using Twilio's Programmable Voice API. Think of it as giving your app a hall pass to use Twilio's voice features.
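On the client, grabbing that token is a small fetch helper. A sketch, assuming the route above is mounted at /api/token (adjust the path to wherever you put it):

// Hypothetical client helper; the /api/token path is an assumption
export async function fetchToken(): Promise<string> {
  const res = await fetch('/api/token', { cache: 'no-store' }); // never reuse a stale token
  if (!res.ok) throw new Error(`Token request failed: ${res.status}`);
  const { token } = await res.json();
  return token;
}

We'll lean on this helper again when we wire up the useTwilio hook later.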
Next, let’s move to our Dialer component in the frontend. This is where users enter a phone number and hit the call button. We’re using a form to capture the phone number and a hook to manage the call state.
import { useForm } from 'react-hook-form';
import { useTwilio } from '@/lib/twilio/useTwilio';
import { Input } from './ui/input';
import { Phone } from 'lucide-react';

export default function Dialer() {
  const { status, startCall, hangUp, transcriptions, timer } = useTwilio();
  const form = useForm({
    defaultValues: { phone: '' },
  });

  function onSubmit(data) {
    startCall(data.phone); // Start call with entered phone number
  }

  return (
    <form onSubmit={form.handleSubmit(onSubmit)}>
      <div>
        <Phone />
        <Input
          type='tel'
          placeholder='Enter phone number'
          {...form.register('phone')}
        />
      </div>
      <button type='submit'>Submit</button>
      {/* type='button' keeps Hang Up from also submitting the form */}
      {(status === 'Ringing' || status === 'Connected') && (
        <button type='button' onClick={hangUp}>
          Hang Up
        </button>
      )}
      <p>Status: {status}</p>
      <p>{timer}</p>
      <p>{transcriptions}</p>
    </form>
  );
}
Here, we're capturing the phone number using useForm and calling startCall when the form is submitted. The Twilio hook manages the call state, handling the start and end of the call, and showing the current status.
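If you want to reject junk input before dialing, react-hook-form's register options make that cheap. Here's a sketch you could swap in for the Input above (the E.164-style pattern is a rough assumption; tighten it for your locale):

<Input
  type='tel'
  placeholder='Enter phone number'
  {...form.register('phone', {
    required: 'Phone number is required',
    pattern: {
      value: /^\+?[1-9]\d{1,14}$/, // rough E.164 shape: optional +, up to 15 digits
      message: 'Enter a valid phone number',
    },
  })}
/>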
Now, onto the backend. We’re setting up WebSocket connections to handle real-time audio streaming. Twilio sends audio data to our Fastify server (which we configure in the Twilio console), which then forwards it to Deepgram for transcription.
import Fastify from 'fastify';
import fastifyWebsocket from '@fastify/websocket';
import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk';

const fastify = Fastify({ logger: true });
fastify.register(fastifyWebsocket);

const deepgramApiKey = process.env.DEEPGRAM_API_KEY;
const deepgram = createClient(deepgramApiKey);

fastify.register(async function (fastify) {
  fastify.get('/twilio', { websocket: true }, (ws) => {
    // One live Deepgram session per call. Twilio streams mulaw audio at 8kHz,
    // so tell Deepgram how to decode the raw bytes.
    const connection = deepgram.listen.live({
      model: 'nova-2',
      smart_format: true,
      encoding: 'mulaw',
      sample_rate: 8000,
    });

    ws.on('message', (data) => {
      const message = JSON.parse(data.toString());
      if (message.event === 'media') {
        const audio = Buffer.from(message.media.payload, 'base64');
        connection.send(audio); // Send audio to Deepgram for transcription
      } else if (message.event === 'start') {
        console.log('Call started');
        // Initialize transcription session
      }
      // Additional event handling...
    });

    connection.on(LiveTranscriptionEvents.Transcript, (transcript) => {
      ws.send(JSON.stringify({ type: 'transcription', data: transcript })); // Send transcript to client
    });

    ws.on('close', () => {
      console.log('WebSocket connection closed');
      // Clean up resources if necessary
    });

    ws.on('error', (error) => {
      console.error('WebSocket error:', error);
    });
  });
});

// Start the server; the port is arbitrary, but Twilio must be able to reach it
fastify.listen({ port: 8080 }, (err) => {
  if (err) throw err;
});
During the connection, different events like media (audio data) and start (call initiation) are processed. When audio data comes in, it's sent to Deepgram for transcription. The transcriptions are then sent back to the client through WebSocket messages.
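For reference, here's roughly what those two payloads look like, trimmed to the fields we actually touch (the Twilio frame follows the Media Streams protocol and the transcript follows Deepgram's live results, but treat these as a sketch rather than the full schemas):

// Trimmed shape of an incoming Twilio Media Streams "media" frame
interface TwilioMediaFrame {
  event: 'media';
  streamSid: string;
  media: {
    track: 'inbound' | 'outbound';
    payload: string; // base64-encoded mulaw audio at 8kHz
  };
}

// Trimmed shape of a Deepgram live transcript event
interface DeepgramTranscript {
  is_final: boolean; // false for interim results that may still change
  channel: {
    alternatives: { transcript: string; confidence: number }[];
  };
}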
Finally, let’s look at the Twilio Voice SDK in the frontend. This part is all about managing call initiation and handling.
import { useState, useRef, useEffect } from 'react';
import { Device } from '@twilio/voice-sdk';

export function useTwilio() {
  const [status, setStatus] = useState('Idle');
  const deviceRef = useRef(null);

  useEffect(() => {
    async function initDevice() {
      const token = await fetchToken(); // The fetch helper from the token section earlier
      const device = new Device(token, { closeProtection: true });
      deviceRef.current = device;
      // In Voice SDK 2.x, connect/disconnect are Call events (handled in startCall below);
      // the Device itself emits 'registered' and 'error'
      device.on('registered', () => setStatus('Ready')); // Device is ready to make calls
      device.on('error', () => setStatus('Error')); // Handle errors
      device.register(); // Register the device with Twilio
    }
    initDevice();
  }, []);

  const startCall = async (to) => {
    if (deviceRef.current) {
      const call = await deviceRef.current.connect({ params: { To: to } });
      call.on('accept', () => setStatus('Connected')); // Call is accepted
      call.on('disconnect', () => setStatus('Ready')); // Call is disconnected
      // Additional error handling...
    }
  };

  const hangUp = () => {
    if (deviceRef.current) {
      deviceRef.current.disconnectAll(); // Disconnect all calls
      setStatus('Ready');
    }
  };

  // transcriptions and timer state are omitted here for brevity
  return { status, startCall, hangUp };
}
The Device object from the Twilio Voice SDK manages voice calls. The Device itself emits events like registered and error, while each Call emits accept and disconnect. The startCall function initiates a call, and hangUp ends it, managing the call lifecycle and updating the status accordingly.
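One piece this post doesn't show is the frontend listener that feeds transcriptions into state. Here's a minimal sketch, assuming the backend exposes a relay socket the browser can reach (the ws://localhost:8080/transcripts endpoint is hypothetical and separate from the Twilio stream route; the message shape mirrors what the Fastify server sends):

// Minimal sketch of the client-side transcript listener
const socket = new WebSocket('ws://localhost:8080/transcripts'); // hypothetical relay endpoint

socket.onmessage = (event) => {
  const message = JSON.parse(event.data);
  if (message.type !== 'transcription') return;
  const text = message.data.channel?.alternatives?.[0]?.transcript ?? '';
  if (text && message.data.is_final) {
    console.log('Transcript:', text); // in the real hook, append to transcriptions state instead
  }
};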
And that’s pretty much it! With these pieces in place, your app will be able to handle real-time transcription of phone calls using Next.js, Fastify, Deepgram, and Twilio.
Wrapping Up
By combining Deepgram's fast, accurate transcription with Twilio's reliable calling infrastructure and the strengths of Next.js and Fastify, we can build a solid real-time transcription app that's easy to use and maintain.
Feel free to reach out if you have more questions.