Unlocking the Power of Speech AI: A Step-by-Step Guide to Integrating NVIDIA RIVA NIMs with LLM/RAG Applications
Date: July 2024
Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.
NVIDIA NIMs give developers an easy, secure way to deploy AI models and build applications on top of model inference. RIVA's state-of-the-art models provide real-time interaction with high accuracy and low latency for RAG-based enterprise LLM applications.
RIVA models provide Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Neural Machine Translation (NMT) capabilities that enhance the user prompt experience and unlock valuable enterprise data: medical consultations and emergency response calls in Healthcare, investor calls and customer service calls in Financial Services, and voice-based shopping assistants and sales calls in Retail/CPG. Transcribing these sources improves the accuracy of LLM responses in RAG applications. Please check out the GTC 24 session for more information.

RIVA Speech AI models are also customizable via the NeMo framework. Please check out the GitHub repo for more information.
In Part 1 of this blog post, we will explore how to deploy one of the RIVA ASR models, Parakeet CTC 0.6b, as a NIM. We will be using Brev.dev, which provides an easy, multi-cloud AI/ML development experience with quick access to NVIDIA GPUs.
First, log in to Brev.dev, create an instance with an NVIDIA GPU (in this case, one A100 with 40 GiB of memory), and provision it.

Click on the Access tab and follow the instructions to access your instance in your Terminal (or VS Code). After you connect to your VM, verify the NVIDIA GPU with the nvidia-smi command, then install the NGC CLI and configure it for your NGC org.

% brev ssh riva-nim-deploy
⡿ waiting for SSH connection to be available Agent pid 43671
Warning: Permanently added 'xxx.xxx.xxx.xxx' (ED25519) to the list of known hosts.
Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 6.2.0-37-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Fri Jul 5 16:07:00 UTC 2024
...
$ sudo usermod -aG docker $USER
$ docker run --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
9c704ecd0c69: Pull complete
Digest: sha256:2e863c44b718727c860746568e1d54afd13b2fa71b160f5cd9058fc436217b30
Status: Downloaded newer image for ubuntu:latest
Mon Jul 8 21:15:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 33C P0 53W / 400W | 4903MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
$ wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.41.3/files/ngccli_linux.zip -O ~/ngccli_linux.zip && \
unzip ~/ngccli_linux.zip -d ~/ngc && \
chmod u+x ~/ngc/ngc-cli/ngc && \
echo "export PATH=\"\$PATH:~/ngc/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
$ ngc config set
CLI_VERSION: Latest - 3.44.0 available (current: 3.41.3). Please update by using the command 'ngc version upgrade'
Enter API key [no-apikey]. Choices: [<VALID_APIKEY>, 'no-apikey']: 1234 # your API Key here
Enter CLI output format type [ascii]. Choices: ['ascii', 'csv', 'json']: ascii
Enter org [no-org]. Choices: ['xx']: xx # your org here
Enter team [no-team]. Choices: ['no-team']: no-team
Enter ace [no-ace]. Choices: ['no-ace']: no-ace
Validating configuration...
Successfully validated configuration.
Saving configuration...
Successfully saved NGC configuration to /home/ubuntu/.ngc/config
Next, log in to nvcr.io with your NVIDIA AI Enterprise key, as shown below (see https://www.nvidia.com/en-us/data-center/products/ai-enterprise/ and https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#ngc-authentication for more). Once the login succeeds, you can pull the speech_nim image into your local Docker environment.
$ docker login nvcr.io
Username: $oauthtoken
Password:
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
# Export your NGC Key below
$ export NGC_CLI_API_KEY="xx"
$ docker pull nvcr.io/nvidia/nim/speech_nim:24.03
24.03: Pulling from nvidia/nim/speech_nim
43f89b94cd7d: Already exists
750c5f66fa16: Pulling fs layer
4f4fb700ef54: Pulling fs layer
f26a3cd5b0b0: Pulling fs layer
80e4540f5dd3: Waiting
91c5e1ef3269: Waiting
e7863eae44bb: Waiting
791aaccf91fb: Waiting
407fbcfbe3c1: Waiting
89cc47268090: Waiting
52a9e143f3bc: Waiting
711e10b4e0e8: Waiting
2518a6ae2d91: Waiting
...
Next, download the parakeet-ctc-riva-0-6b en-US model as below:
$ export GPU_TYPE=a100x1
$ export MODEL_DIRECTORY=~/nim_model
$ mkdir -p ${MODEL_DIRECTORY}
$ ngc registry model download-version nvidia/nim/parakeet-ctc-riva-0-6b:en-us_${GPU_TYPE}_fp16_24.03 --dest ${MODEL_DIRECTORY}
Getting files to download...
...
--------------------------------------------------------------------------------
Download status: COMPLETED
Downloaded local path model: /home/ubuntu/nim_model/parakeet-ctc-riva-0-6b_ven-us_a100x1_fp16_24.03
Total files downloaded: 19
Total transferred: 1.67 GB
Started at: 2024-07-05 16:23:23
Completed at: 2024-07-05 16:23:50
Duration taken: 26s
--------------------------------------------------------------------------------
Then start the container in detached mode. You can verify in the Docker logs that the models were registered successfully.
$ docker run -d --rm --name riva-speech \
--gpus "device=0" \
-e CUDA_VISIBLE_DEVICES=0 \
--shm-size=1G \
-v ${MODEL_DIRECTORY}/parakeet-ctc-riva-0-6b_ven-us_${GPU_TYPE}_fp16_24.03:/config/models/parakeet-ctc-riva-0-6b-en-us \
-e MODEL_REPOS="--model-repository /config/models/parakeet-ctc-riva-0-6b-en-us" \
-p 50051:50051 \
nvcr.io/nvidia/nim/speech_nim:24.03 start-riva
8c53be5e17ed8ae254c2230b6ddacefdcef6fab888b1251d8d9ddd1dafe7b846
$ docker logs riva-speech
==========================
=== Riva Speech Skills ===
==========================
NVIDIA Release 24.02 (build 85835493)
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.3 driver version 545.23.08 with kernel driver version 535.129.03.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
...
I0705 13:55:55.435817 183 model_registry.cc:143] Successfully registered: parakeet-0.6b-en-US-asr-streaming-throughput-asr-bls-ensemble for ASR Triton URI: localhost:8001
I0705 13:55:55.439783 183 model_registry.cc:143] Successfully registered: riva-punctuation-en-US for NLP Triton URI: localhost:8001
W0705 13:55:55.441037 183 grpc_riva_asr.cc:99] parakeet-0.6b-en-US-asr-streaming-throughput-asr-bls-ensemble has no configured wfst normalizer model
I0705 13:55:55.442978 183 model_registry.cc:143] Successfully registered: riva-punctuation-en-US for NLP Triton URI: localhost:8001
I0705 13:55:55.450754 183 riva_server.cc:171] Riva Conversational AI Server listening on 0.0.0.0:50051
I0705 16:34:40.073287 179 model_registry.cc:143] Successfully registered: parakeet-0.6b-en-US-asr-streaming-throughput-asr-bls-ensemble for ASR Triton URI: localhost:8001
...
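The "Successfully registered" lines above can also be checked programmatically instead of by eye. A minimal sketch, assuming Docker is available on the host and the container is named riva-speech as in the run command (the helper names are my own):

```python
import subprocess

def registered_models(log_text):
    """Extract model names from Riva 'Successfully registered:' log lines."""
    models = []
    for line in log_text.splitlines():
        if "Successfully registered:" in line:
            # A line looks like:
            # 'I0705 ... Successfully registered: <model> for ASR Triton URI: ...'
            name = line.split("Successfully registered:", 1)[1].split()[0]
            models.append(name)
    return models

def container_registered_models(container="riva-speech"):
    """Scan `docker logs` output; the server may write to stdout or stderr."""
    proc = subprocess.run(["docker", "logs", container],
                          capture_output=True, text=True)
    return registered_models(proc.stdout + proc.stderr)
```

Calling container_registered_models() after startup should return both the parakeet ASR ensemble and the riva-punctuation-en-US model shown in the logs above.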
You can also verify the gRPC health-check endpoint using the command below:
$ docker run --rm --net host fullstorydev/grpcurl -plaintext -H "content-type: application/grpc" 0.0.0.0:50051 grpc.health.v1.Health/Check
Unable to find image 'fullstorydev/grpcurl:latest' locally
latest: Pulling from fullstorydev/grpcurl
2b1ed7bf7455: Pull complete
b379458dfd20: Pull complete
0853d1017c2a: Pull complete
Digest: sha256:8bc96d11c8c08388b30cffafd177a1083a84e18c5bed314de5520c81171236a9
Status: Downloaded newer image for fullstorydev/grpcurl:latest
{
"status": "SERVING"
}
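In a script, you may want to wait until the endpoint reports SERVING before sending traffic, since model loading takes a while after the container starts. A minimal polling sketch that reuses the grpcurl container invocation from above (the retry count and delay are illustrative, and the function names are my own):

```python
import json
import subprocess
import time

# Same health check as above, run via the grpcurl container.
GRPCURL_CMD = [
    "docker", "run", "--rm", "--net", "host", "fullstorydev/grpcurl",
    "-plaintext", "0.0.0.0:50051", "grpc.health.v1.Health/Check",
]

def parse_health(output):
    """Return True when gRPC health-check JSON reports SERVING."""
    try:
        return json.loads(output).get("status") == "SERVING"
    except json.JSONDecodeError:
        return False

def wait_for_serving(retries=30, delay=2.0):
    """Poll the health endpoint until it reports SERVING or retries run out."""
    for _ in range(retries):
        proc = subprocess.run(GRPCURL_CMD, capture_output=True, text=True)
        if proc.returncode == 0 and parse_health(proc.stdout):
            return True
        time.sleep(delay)
    return False
```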
The model deployment is complete.
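With the NIM serving, a client can now transcribe audio over gRPC on port 50051. A minimal offline-recognition sketch, assuming the nvidia-riva-client Python package (pip install nvidia-riva-client) and a 16-bit mono WAV file; sample.wav and the URI are placeholders:

```python
import wave

def read_wav_params(path):
    """Return (sample_rate, pcm_bytes) from a 16-bit mono WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate(), wf.readframes(wf.getnframes())

def main():
    # Assumes: pip install nvidia-riva-client
    import riva.client

    auth = riva.client.Auth(uri="localhost:50051")  # port published above
    asr = riva.client.ASRService(auth)
    rate, audio = read_wav_params("sample.wav")  # placeholder file name
    config = riva.client.RecognitionConfig(
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        language_code="en-US",
        sample_rate_hertz=rate,
        max_alternatives=1,
    )
    response = asr.offline_recognize(audio, config)
    for result in response.results:
        print(result.alternatives[0].transcript)

if __name__ == "__main__":
    main()
```

For long-form or real-time use cases, the same service also exposes a streaming recognition API in the riva.client package.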
Update: Please check out an end-to-end example of how this NIM is used in a Digital Human AI blueprint: https://github.com/NVIDIA-AI-Blueprints/digital-human
Resources:
https://docs.nvidia.com/nim/index.html#riva
https://developer.nvidia.com/docs/nim/nim-asr/latest/parakeet-ctc-riva.html
https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-overview.html