Unlocking the Power of Speech AI: A Step-by-Step Guide to Integrating NVIDIA RIVA NIMs with LLM/RAG Applications
Date: July 2024
Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.
NVIDIA NIMs give developers an easy, secure way to deploy AI models and build applications on top of model inference. RIVA's state-of-the-art models provide real-time interaction with high accuracy and low latency for RAG-based enterprise LLM applications.
RIVA models provide Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Neural Machine Translation (NMT) capabilities that enhance the user prompt experience and unlock valuable enterprise data: medical consultations and emergency response calls in Healthcare, investor calls and customer service calls in Financial Services, and voice-based shopping assistants and sales calls in Retail/CPG. Transcribing these sources improves the accuracy of LLM responses in RAG applications. Please check out the GTC 24 session for more information.

RIVA Speech AI models are also customizable via the NeMo framework. Please check out the GitHub repo for more information.
In Part 1 of this blog post, we will explore how to deploy one of the RIVA ASR models, Parakeet CTC 0.6b, as a NIM. We will be using Brev.dev, which provides an easy, multi-cloud AI/ML development experience with quick access to NVIDIA GPUs.
First, log in to Brev.dev, create an instance with an NVIDIA GPU (in this case, one A100 with 40 GiB of memory), and provision it.

Click on the Access tab and follow the instructions to access your instance in your Terminal (or VS Code). After you connect to your VM, verify the NVIDIA GPU with the nvidia-smi command, then install the NGC CLI and configure it for your NGC org.

% brev ssh riva-nim-deploy
⡿ waiting for SSH connection to be available Agent pid 43671
Warning: Permanently added 'xxx.xxx.xxx.xxx' (ED25519) to the list of known hosts.
Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 6.2.0-37-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Fri Jul 5 16:07:00 UTC 2024
...
$ sudo usermod -aG docker $USER
$ docker run --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
9c704ecd0c69: Pull complete
Digest: sha256:2e863c44b718727c860746568e1d54afd13b2fa71b160f5cd9058fc436217b30
Status: Downloaded newer image for ubuntu:latest
Mon Jul 8 21:15:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 33C P0 53W / 400W | 4903MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
$ wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.41.3/files/ngccli_linux.zip -O ~/ngccli_linux.zip && \
unzip ~/ngccli_linux.zip -d ~/ngc && \
chmod u+x ~/ngc/ngc-cli/ngc && \
echo "export PATH=\"\$PATH:~/ngc/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
$ ngc config set
CLI_VERSION: Latest - 3.44.0 available (current: 3.41.3). Please update by using the command 'ngc version upgrade'
Enter API key [no-apikey]. Choices: [<VALID_APIKEY>, 'no-apikey']: 1234 # your API Key here
Enter CLI output format type [ascii]. Choices: ['ascii', 'csv', 'json']: ascii
Enter org [no-org]. Choices: ['xx']: xx # your org here
Enter team [no-team]. Choices: ['no-team']: no-team
Enter ace [no-ace]. Choices: ['no-ace']: no-ace
Validating configuration...
Successfully validated configuration.
Saving configuration...
Successfully saved NGC configuration to /home/ubuntu/.ngc/config
Next, log in to nvcr.io with your NVIDIA AI Enterprise key, as shown below (see https://www.nvidia.com/en-us/data-center/products/ai-enterprise/ and https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#ngc-authentication for more). Once the login succeeds, you can pull the speech_nim image into your local Docker environment.
$ docker login nvcr.io
Username: $oauthtoken
Password:
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
# Export your NGC Key below
$ export NGC_CLI_API_KEY="xx"
$ docker pull nvcr.io/nvidia/nim/speech_nim:24.03
24.03: Pulling from nvidia/nim/speech_nim
43f89b94cd7d: Already exists
750c5f66fa16: Pulling fs layer
4f4fb700ef54: Pulling fs layer
f26a3cd5b0b0: Pulling fs layer
80e4540f5dd3: Waiting
91c5e1ef3269: Waiting
e7863eae44bb: Waiting
791aaccf91fb: Waiting
407fbcfbe3c1: Waiting
89cc47268090: Waiting
52a9e143f3bc: Waiting
711e10b4e0e8: Waiting
2518a6ae2d91: Waiting
...
Next, download the parakeet-ctc-riva-0-6b en-US model as below:
$ export GPU_TYPE=a100x1
$ export MODEL_DIRECTORY=~/nim_model
$ mkdir -p ${MODEL_DIRECTORY}
$ ngc registry model download-version nvidia/nim/parakeet-ctc-riva-0-6b:en-us_${GPU_TYPE}_fp16_24.03 --dest ${MODEL_DIRECTORY}
Getting files to download...
...
--------------------------------------------------------------------------------
Download status: COMPLETED
Downloaded local path model: /home/ubuntu/nim_model/parakeet-ctc-riva-0-6b_ven-us_a100x1_fp16_24.03
Total files downloaded: 19
Total transferred: 1.67 GB
Started at: 2024-07-05 16:23:23
Completed at: 2024-07-05 16:23:50
Duration taken: 26s
--------------------------------------------------------------------------------
Then start the container in detached mode. You can verify in the Docker logs that the models were registered successfully.
$ docker run -d --rm --name riva-speech \
--gpus "device=0" \
-e CUDA_VISIBLE_DEVICES=0 \
--shm-size=1G \
-v ${MODEL_DIRECTORY}/parakeet-ctc-riva-0-6b_ven-us_${GPU_TYPE}_fp16_24.03:/config/models/parakeet-ctc-riva-0-6b-en-us \
-e MODEL_REPOS="--model-repository /config/models/parakeet-ctc-riva-0-6b-en-us" \
-p 50051:50051 \
nvcr.io/nvidia/nim/speech_nim:24.03 start-riva
8c53be5e17ed8ae254c2230b6ddacefdcef6fab888b1251d8d9ddd1dafe7b846
$ docker logs riva-speech
==========================
=== Riva Speech Skills ===
==========================
NVIDIA Release 24.02 (build 85835493)
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.3 driver version 545.23.08 with kernel driver version 535.129.03.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
...
I0705 13:55:55.435817 183 model_registry.cc:143] Successfully registered: parakeet-0.6b-en-US-asr-streaming-throughput-asr-bls-ensemble for ASR Triton URI: localhost:8001
I0705 13:55:55.439783 183 model_registry.cc:143] Successfully registered: riva-punctuation-en-US for NLP Triton URI: localhost:8001
W0705 13:55:55.441037 183 grpc_riva_asr.cc:99] parakeet-0.6b-en-US-asr-streaming-throughput-asr-bls-ensemble has no configured wfst normalizer model
I0705 13:55:55.442978 183 model_registry.cc:143] Successfully registered: riva-punctuation-en-US for NLP Triton URI: localhost:8001
I0705 13:55:55.450754 183 riva_server.cc:171] Riva Conversational AI Server listening on 0.0.0.0:50051
I0705 16:34:40.073287 179 model_registry.cc:143] Successfully registered: parakeet-0.6b-en-US-asr-streaming-throughput-asr-bls-ensemble for ASR Triton URI: localhost:8001
...
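The "Successfully registered" lines above can also be checked programmatically instead of by eye. A minimal sketch, assuming Docker is available on the host and the container is named riva-speech as in the run command (the helper names are my own):

```python
import subprocess

def registered_models(log_text):
    """Extract model names from Riva 'Successfully registered:' log lines."""
    models = []
    for line in log_text.splitlines():
        if "Successfully registered:" in line:
            # A line looks like:
            # 'I0705 ... Successfully registered: <model> for ASR Triton URI: ...'
            name = line.split("Successfully registered:", 1)[1].split()[0]
            models.append(name)
    return models

def container_registered_models(container="riva-speech"):
    """Scan `docker logs` output; the server may write to stdout or stderr."""
    proc = subprocess.run(["docker", "logs", container],
                          capture_output=True, text=True)
    return registered_models(proc.stdout + proc.stderr)
```

Calling container_registered_models() after startup should return both the parakeet ASR ensemble and the riva-punctuation-en-US model shown in the logs above.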
You can also verify the gRPC health-check endpoint using the command below:
$ docker run --rm --net host fullstorydev/grpcurl -plaintext -H "content-type: application/grpc" 0.0.0.0:50051 grpc.health.v1.Health/Check
Unable to find image 'fullstorydev/grpcurl:latest' locally
latest: Pulling from fullstorydev/grpcurl
2b1ed7bf7455: Pull complete
b379458dfd20: Pull complete
0853d1017c2a: Pull complete
Digest: sha256:8bc96d11c8c08388b30cffafd177a1083a84e18c5bed314de5520c81171236a9
Status: Downloaded newer image for fullstorydev/grpcurl:latest
{
"status": "SERVING"
}
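In a script, you may want to wait until the endpoint reports SERVING before sending traffic, since model loading takes a while after the container starts. A minimal polling sketch that reuses the grpcurl container invocation from above (the retry count and delay are illustrative, and the function names are my own):

```python
import json
import subprocess
import time

# Same health check as above, run via the grpcurl container.
GRPCURL_CMD = [
    "docker", "run", "--rm", "--net", "host", "fullstorydev/grpcurl",
    "-plaintext", "0.0.0.0:50051", "grpc.health.v1.Health/Check",
]

def parse_health(output):
    """Return True when gRPC health-check JSON reports SERVING."""
    try:
        return json.loads(output).get("status") == "SERVING"
    except json.JSONDecodeError:
        return False

def wait_for_serving(retries=30, delay=2.0):
    """Poll the health endpoint until it reports SERVING or retries run out."""
    for _ in range(retries):
        proc = subprocess.run(GRPCURL_CMD, capture_output=True, text=True)
        if proc.returncode == 0 and parse_health(proc.stdout):
            return True
        time.sleep(delay)
    return False
```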
The model deployment is complete.
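With the NIM serving, a client can now transcribe audio over gRPC on port 50051. A minimal offline-recognition sketch, assuming the nvidia-riva-client Python package (pip install nvidia-riva-client) and a 16-bit mono WAV file; sample.wav and the URI are placeholders:

```python
import wave

def read_wav_params(path):
    """Return (sample_rate, pcm_bytes) from a 16-bit mono WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate(), wf.readframes(wf.getnframes())

def main():
    # Assumes: pip install nvidia-riva-client
    import riva.client

    auth = riva.client.Auth(uri="localhost:50051")  # port published above
    asr = riva.client.ASRService(auth)
    rate, audio = read_wav_params("sample.wav")  # placeholder file name
    config = riva.client.RecognitionConfig(
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        language_code="en-US",
        sample_rate_hertz=rate,
        max_alternatives=1,
    )
    response = asr.offline_recognize(audio, config)
    for result in response.results:
        print(result.alternatives[0].transcript)

if __name__ == "__main__":
    main()
```

For long-form or real-time use cases, the same service also exposes a streaming recognition API in the riva.client package.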
Update: Please check out an end-to-end example of how this NIM is used in a Digital Human AI blueprint: https://github.com/NVIDIA-AI-Blueprints/digital-human
Resources:
https://docs.nvidia.com/nim/index.html#riva
https://developer.nvidia.com/docs/nim/nim-asr/latest/parakeet-ctc-riva.html
https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-overview.html