paddlespeech ASR speech-to-text; using FunASR; sherpa real-time, offline, and RTSP-stream speech transcription

Date: 2024-10-27 19:55:21

1. paddlespeech ASR speech-to-text

Reference:
https://github.com/PaddlePaddle/PaddleSpeech

After installation you may hit numpy-related errors at runtime, most likely because the Python and numpy versions are too new; what finally worked for me was Python 3.10 with numpy 1.22.0.

pip install paddlepaddle -i /pypi/simple
pip install paddlespeech
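
If you hit the numpy errors mentioned above, pinning the version that worked for me is the quickest fix (adjust if your setup differs):

pip install numpy==1.22.0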

1) Code

Default model download location: C:\Users\<username>\.paddlespeech\models

from paddlespeech.cli.asr.infer import ASRExecutor

asr = ASRExecutor()
result = asr(audio_file="zh.wav")  # path is a placeholder; the first run downloads the model automatically
print(result)
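
The same transcription is also available from the paddlespeech CLI without writing any Python; a minimal sketch, assuming a 16 kHz mono wav (zh.wav is a placeholder):

paddlespeech asr --lang zh --input zh.wav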


### Punctuation restoration

!paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭

## or
from paddlespeech.cli.text.infer import TextExecutor

text_punc = TextExecutor()
result = text_punc(text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭")
print(result)


2) Real-time speech transcription

References:
/chenkui164/p/
https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/streaming_asr_server/
https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/streaming_asr_server/web

paddlespeech_server stats --task asr  ## lists the supported models; to switch models, edit the corresponding yaml file


## First, start the ASR server
# Launch the streaming speech recognition service
cd PaddleSpeech/demos/streaming_asr_server
paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application_faster.yaml
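
The server can also be exercised from the command line with the bundled client; a sketch, assuming the default port 8090 from the demo yaml (check your config file for the actual port; the wav name is a placeholder):

paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input input_16k.wav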

Once the server is running, open the web page under \demos\streaming_asr_server\web\ in the demo to test it.

2. Alibaba FunASR

https://github.com/alibaba-damo-academy/FunASR/blob/main/runtime/docs/SDK_advanced_guide_online_zh.md
In my tests it is not particularly fast and English recognition is not very accurate; its strength is built-in punctuation and sentence segmentation.

Run the service directly with Docker:
## 1. Pull the image
sudo docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2
mkdir -p ./funasr-runtime-resources/models


## 2. Run the container, then start the service inside it
sudo docker run -p 10095:10095 -it --privileged=true -v ./funasr-runtime-resources/models:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2


## Start the service; the models download automatically; this launches the funasr-wss-server-2pass program
cd FunASR/funasr/runtime
nohup bash run_server_2pass.sh \
  --download-model-dir /workspace/models \
  --vad-dir damo/speech_fsmn_vad_zh-cn-16k-common-onnx \
  --model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-onnx \
  --online-model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online-onnx \
  --punc-dir damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727-onnx \
  --itn-dir thuduj12/fst_itn_zh > log.txt 2>&1 &

# To disable ssl, add: --certfile 0
# To deploy the timestamp or hotword model, set --model-dir to the corresponding model:
# damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-onnx (timestamps)
# or damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404-onnx (hotwords)
## 3. Run the client (Python, C++, HTML web, Java, and C# versions are available); code download: wget https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/sample/funasr_samples.tar.gz

## Run the Python script
python3 funasr_wss_client.py --host "127.0.0.1" --port 10095 --mode 2pass

## Step 2 above can be folded into a single docker run (nohup must be dropped here, otherwise it will not start):
docker run -p 10095:10095 -d --privileged=true -v D:\funasr-runtime-resources\models:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2 /bin/bash -c "cd /workspace/FunASR/funasr/runtime && bash run_server_2pass.sh --download-model-dir /workspace/models --vad-dir damo/speech_fsmn_vad_zh-cn-16k-common-onnx --model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-onnx --online-model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online-onnx --punc-dir damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727-onnx --itn-dir thuduj12/fst_itn_zh "

If the funasr-runtime-sdk-online-cpu image version is 0.1.6, 0.1.7, or 0.1.8, the container needs a while after launch before the service is up (running the client too early may raise ConnectionResetError):

docker run -p 10095:10095 -d --privileged=true -v D:\funasr-runtime-resources\models:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2 /bin/bash -c "cd /workspace/FunASR/funasr/runtime && bash run_server_2pass.sh --download-model-dir /workspace/models --vad-dir damo/speech_fsmn_vad_zh-cn-16k-common-onnx --model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-onnx --online-model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online-onnx --punc-dir damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727-onnx --itn-dir thuduj12/fst_itn_zh "

Then run the client to use it
(code download: wget https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/sample/funasr_samples.tar.gz)

Code: https://github.com/alibaba-damo-academy/FunASR/blob/main/runtime/python/websocket/funasr_wss_client.py

python funasr_wss_client.py --host "127.0.0.1" --port 10095 --mode 2pass
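
The same client can also transcribe a wav file instead of the microphone; a sketch, assuming the client's --audio_in option (run python3 funasr_wss_client.py --help to confirm the exact flags; test.wav is a placeholder):

python funasr_wss_client.py --host "127.0.0.1" --port 10095 --mode 2pass --audio_in test.wav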


3. sherpa real-time speech transcription

1) ncnn version
References: https://github.com/k2-fsa/sherpa-ncnn
https://www.bilibili.com/video/BV1K44y197Fg

Version: sherpa-ncnn 2.1.7

Install:

pip install sherpa-ncnn sounddevice -i /pypi/simple

Downloads:
a. Clone the project: git clone https://github.com/k2-fsa/sherpa-ncnn
b. Download the model:
https://huggingface.co/marcoyang/sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23
Download its 7 files: tokens.txt plus the encoder/decoder/joiner *-pnnx.ncnn.param and *-pnnx.ncnn.bin files.

a-1. Real-time microphone transcription

https://github.com/k2-fsa/sherpa-ncnn/blob/master/python-api-examples/
https://k2-fsa.github.io/sherpa/ncnn/python/#start-recording

#!/usr/bin/env python3

# Real-time speech recognition from a microphone with sherpa-ncnn Python API
#
# Please refer to
# https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
# to download pre-trained models

import sys

try:
    import sounddevice as sd
except ImportError as e:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_ncnn


def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )
    return recognizer


def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
    sample_rate = recognizer.sample_rate
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            recognizer.accept_waveform(sample_rate, samples)
            result = recognizer.text
            if last_result != result:
                last_result = result
                print("\r{}".format(result), end="", flush=True)


if __name__ == "__main__":
    devices = sd.query_devices()
    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")


Tweak the result printing to avoid re-printing everything already recognized: print only the newly appended text each time.

# Inside the while loop; i = 0 and last_result = "" are initialized before the loop:
            if last_result != result:
                if i == 0:
                    print("{}".format(result), end='')
                    last_result = result
                    i = i + 1
                else:
                    last_result_len = len(last_result)
                    new_word = result[last_result_len:]
                    # print(last_result, result, new_word)
                    print("{}".format(new_word), end='', flush=True)
                    last_result = result


a-2. Real-time microphone transcription with endpoint detection
Reference: https://github.com/k2-fsa/sherpa-ncnn/blob/master/python-api-examples/

#!/usr/bin/env python3

# Real-time speech recognition from a microphone with sherpa-ncnn Python API
# with endpoint detection.
#
# Please refer to
# https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
# to download pre-trained models

import sys

try:
    import sounddevice as sd
except ImportError as e:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_ncnn


# def create_recognizer():
#     # Please replace the model files if needed.
#     # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
#     # for download links.
#     recognizer = sherpa_ncnn.Recognizer(
#         tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
#         encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
#         encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
#         decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
#         decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
#         joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
#         joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
#         num_threads=4,
#         decoding_method="modified_beam_search",
#         enable_endpoint_detection=True,
#         rule1_min_trailing_silence=2.4,
#         rule2_min_trailing_silence=1.2,
#         rule3_min_utterance_length=300,
#     )
#     return recognizer

def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    # base_file = "sherpa-ncnn-conv-emformer-transducer-2022-12-06"
    # base_file = "sherpa-ncnn-lstm-transducer-small-2023-02-13"
    base_file = "sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13"
    # base_file = "sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16"
    # base_file = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17"
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./{}/tokens.txt".format(base_file),
        encoder_param="./{}/encoder_jit_trace-pnnx.ncnn.param".format(base_file),
        encoder_bin="./{}/encoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        decoder_param="./{}/decoder_jit_trace-pnnx.ncnn.param".format(base_file),
        decoder_bin="./{}/decoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        joiner_param="./{}/joiner_jit_trace-pnnx.ncnn.param".format(base_file),
        joiner_bin="./{}/joiner_jit_trace-pnnx.ncnn.bin".format(base_file),
        num_threads=4,
        decoding_method="modified_beam_search",
        enable_endpoint_detection=True,
        rule1_min_trailing_silence=2.4,
        rule2_min_trailing_silence=1.2,
        rule3_min_utterance_length=300,
        hotwords_file="",
        hotwords_score=1.5,
    )
    return recognizer


def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
    sample_rate = recognizer.sample_rate
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    segment_id = 0


    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            recognizer.accept_waveform(sample_rate, samples)

            is_endpoint = recognizer.is_endpoint

            result = recognizer.text
            if result and (last_result != result):
                last_result = result
                print("\r{}:{}".format(segment_id, result), end="", flush=True)

            if is_endpoint:
                if result:
                    print("\r{}:{}".format(segment_id, result), flush=True)
                    segment_id += 1
                recognizer.reset()


if __name__ == "__main__":
    devices = sd.query_devices()
    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")



2) onnx version (recommended)
References: https://k2-fsa.github.io/sherpa/onnx/python/
https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/

Install:
pip install sherpa-onnx

a. Latest version (recommended)

Model download: https://huggingface.co/k2-fsa/sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tree/main

Run:

python .\speech-recognition-from-microphone-onnx.py --tokens=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt --encoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.onnx --decoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.onnx --joiner=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.onnx

## Code
#!/usr/bin/env python3

# Real-time speech recognition from a microphone with sherpa-onnx Python API
#
# Please refer to
# https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
# to download pre-trained models

import argparse
import sys
from pathlib import Path

from typing import List

try:
    import sounddevice as sd
except ImportError:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_onnx


def assert_file_exists(filename: str):
    assert Path(filename).is_file(), (
        f"{filename} does not exist!\n"
        "Please refer to "
        "https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html to download it"
    )


def get_args():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument(
        "--tokens",
        type=str,
        required=True,
        help="Path to tokens.txt",
    )

    parser.add_argument(
        "--encoder",
        type=str,
        required=True,
        help="Path to the encoder model",
    )

    parser.add_argument(
        "--decoder",
        type=str,
        required=True,
        help="Path to the decoder model",
    )

    parser.add_argument(
        "--joiner",
        type=str,
        help="Path to the joiner model",
    )

    parser.add_argument(
        "--decoding-method",
        type=str,
        default="greedy_search",
        help="Valid values are greedy_search and modified_beam_search",
    )

    parser.add_argument(
        "--max-active-paths",
        type=int,
        default=4,
        help="""Used only when --decoding-method is modified_beam_search.
        It specifies number of active paths to keep during decoding.
        """,
    )

    parser.add_argument(
        "--provider",
        type=str,
        default="cpu",
        help="Valid values: cpu, cuda, coreml",
    )

    parser.add_argument(
        "--hotwords-file",
        type=str,
        default="",
        help="""
        The file containing hotwords, one words/phrases per line, and for each
        phrase the bpe/cjkchar are separated by a space. For example:

        ▁HE LL O ▁WORLD
        你 好 世 界
        """,
    )

    parser.add_argument(
        "--hotwords-score",
        type=float,
        default=1.5,
        help="""
        The hotword score of each token for biasing word/phrase. Used only if
        --hotwords-file is given.
        """,
    )

    parser.add_argument(
        "--blank-penalty",
        type=float,
        default=0.0,
        help="""
        The penalty applied on blank symbol during decoding.
        Note: It is a positive value that would be applied to logits like
        this `logits[:, 0] -= blank_penalty` (suppose logits.shape is
        [batch_size, vocab] and blank id is 0).
        """,
    )

    return parser.parse_args()


def create_recognizer(args):
    assert_file_exists(args.encoder)
    assert_file_exists(args.decoder)
    assert_file_exists(args.joiner)
    assert_file_exists(args.tokens)
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
        tokens=args.tokens,
        encoder=args.encoder,
        decoder=args.decoder,
        joiner=args.joiner,
        num_threads=1,
        sample_rate=16000,
        feature_dim=80,
        decoding_method=args.decoding_method,
        max_active_paths=args.max_active_paths,
        provider=args.provider,
        hotwords_file=args.hotwords_file,
        hotwords_score=args.hotwords_score,
        blank_penalty=args.blank_penalty,
    )
    return recognizer


def main():
    args = get_args()

    devices = sd.query_devices()
    if len(devices) == 0:
        print("No microphone devices found")
        sys.exit(0)

    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    recognizer = create_recognizer(args)
    print("Started! Please speak")

    # The model is using 16 kHz, we use 48 kHz here to demonstrate that
    # sherpa-onnx will do resampling inside.
    sample_rate = 48000
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    stream = recognizer.create_stream()
    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            stream.accept_waveform(sample_rate, samples)
            while recognizer.is_ready(stream):
                recognizer.decode_stream(stream)
            result = recognizer.get_result(stream)
            if last_result != result:
                last_result = result
                print("\r{}".format(result), end="", flush=True)


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")
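Because the script above wires hotword biasing into argparse, it can be combined with modified_beam_search; a hypothetical invocation (hotwords.txt is a placeholder, one space-separated phrase per line as described in the --hotwords-file help text; model flags are the same as above):

python .\speech-recognition-from-microphone-onnx.py --tokens=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt --encoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.onnx --decoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.onnx --joiner=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.onnx --decoding-method=modified_beam_search --hotwords-file=hotwords.txt --hotwords-score=2.0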


b. Download the model:
https://huggingface.co/csukuangfj/sherpa-onnx-streaming-conformer-zh-2023-05-23/tree/main

Code:
Run: python ./speech-recognition-from-microphone.py --tokens=./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt --encoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx --decoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx

#!/usr/bin/env python3

# Real-time speech recognition from a microphone with sherpa-onnx Python API
#
# Please refer to
# https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
# to download pre-trained models

import argparse
import sys
from pathlib import Path

try:
    import sounddevice as sd
except ImportError:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_onnx


def assert_file_exists(filename: str):
    assert Path(filename).is_file(), (
        f"{filename} does not exist!\n"
        "Please refer to "
        "https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html to download it"
    )


def get_args():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument(
        "--tokens",
        type=str,
        help="Path to tokens.txt",
    )

    parser.add_argument(
        "--encoder",
        type=str,
        help="Path to the encoder model",
    )

    parser.add_argument(
        "--decoder",
        type=str,
        help="Path to the decoder model",
    )

    parser.add_argument(
        "--joiner",
        type=str,
        help="Path to the joiner model",
    )

    parser.add_argument(
        "--decoding-method",
        type=str,
        default="greedy_search",
        help="Valid values are greedy_search and modified_beam_search",
    )

    return parser.parse_args()


def create_recognizer():
    args = get_args()
    assert_file_exists(args.encoder)
    assert_file_exists(args.decoder)
    assert_file_exists(args.joiner)
    assert_file_exists(args.tokens)
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_onnx.OnlineRecognizer(
        tokens=args.tokens,
        encoder=args.encoder,
        decoder=args.decoder,
        joiner=args.joiner,
        num_threads=1,
        sample_rate=16000,
        feature_dim=80,
        decoding_method=args.decoding_method,
    )
    return recognizer


def main():
    recognizer = create_recognizer()
    print("Started! Please speak")

    # The model is using 16 kHz, we use 48 kHz here to demonstrate that
    # sherpa-onnx will do resampling inside.
    sample_rate = 48000
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    stream = recognizer.create_stream()
    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            stream.accept_waveform(sample_rate, samples)
            while recognizer.is_ready(stream):
                recognizer.decode_stream(stream)
            result = recognizer.get_result(stream)
            if last_result != result:
                last_result = result
                print("\r{}".format(result), end="", flush=True)


if __name__ == "__main__":
    devices = sd.query_devices()
    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")
3) Offline wav audio file transcription

Note: if your local audio is not at a 256 kbps bitrate, convert it first (16 kHz sampling x 16-bit samples x mono = 256 kbps, which is what these models expect). Bitrate is the number of bits per second in an audio or video file, commonly used to express a data rate or compression level.

For audio files, the bitrate is the amount of audio data per second, in kbps (kilobits per second). Generally, a higher bitrate means better audio quality but a larger file.

For example, 256 kbps means 256 kilobits of audio data per second. This figure is typically used to specify an audio file's compression level or output quality.

Also, for installing sox on Windows see: https://blog.csdn.net/yyy430/article/details/88408273

sox input.wav -r 16k -c 1 output.wav   # input/output file names are placeholders
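
If sox is not handy, ffmpeg does the same conversion with the flags already used elsewhere in this post (file names are placeholders):

ffmpeg -i input.mp3 -acodec pcm_s16le -ar 16000 -ac 1 output.wav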


## Official code

#!/usr/bin/env python3

"""
This file demonstrates how to use sherpa-ncnn Python API to recognize
a single file.

Please refer to
https://k2-fsa.github.io/sherpa/ncnn/index.html
to install sherpa-ncnn and to download the pre-trained models
used in this file.
"""

import time
import wave

import numpy as np
import sherpa_ncnn


def main():
    # Please refer to https://k2-fsa.github.io/sherpa/ncnn/index.html
    # to download the model files
    # recognizer = sherpa_ncnn.Recognizer(
    #     tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
    #     encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
    #     encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
    #     decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
    #     decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
    #     joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
    #     joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
    #     num_threads=4,
    # )
    base_file = "sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13"
    # base_file = "sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16"
    # base_file = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17"
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./{}/tokens.txt".format(base_file),
        encoder_param="./{}/encoder_jit_trace-pnnx.ncnn.param".format(base_file),
        encoder_bin="./{}/encoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        decoder_param="./{}/decoder_jit_trace-pnnx.ncnn.param".format(base_file),
        decoder_bin="./{}/decoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        joiner_param="./{}/joiner_jit_trace-pnnx.ncnn.param".format(base_file),
        joiner_bin="./{}/joiner_jit_trace-pnnx.ncnn.bin".format(base_file),
        num_threads=4,
    )

    filename = r'D:\sound\test.wav'  # path is a placeholder; use a 16-bit wav file
    with wave.open(filename) as f:
        # Note: If wave_file_sample_rate is different from
        # recognizer.sample_rate, we will do resampling inside sherpa-ncnn
        wave_file_sample_rate = f.getframerate()
        num_channels = f.getnchannels()
        assert f.getsampwidth() == 2, f.getsampwidth()  # it is in bytes
        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        samples_int16 = samples_int16.reshape(-1, num_channels)[:, 0]
        samples_float32 = samples_int16.astype(np.float32)

        samples_float32 = samples_float32 / 32768

    # simulate streaming
    chunk_size = int(0.1 * wave_file_sample_rate)  # 0.1 seconds
    start = 0
    while start < samples_float32.shape[0]:
        end = start + chunk_size
        end = min(end, samples_float32.shape[0])
        recognizer.accept_waveform(wave_file_sample_rate, samples_float32[start:end])
        start = end
        text = recognizer.text
        if text:
            print(text)

        # simulate streaming by sleeping
        time.sleep(0.1)

    tail_paddings = np.zeros(int(wave_file_sample_rate * 0.5), dtype=np.float32)
    recognizer.accept_waveform(wave_file_sample_rate, tail_paddings)
    recognizer.input_finished()
    text = recognizer.text
    if text:
        print(text)


if __name__ == "__main__":
    main()

4) Alternatively, use ffmpeg to read mp4/wav files offline or an rtsp stream over the network; this is the version I cleaned up myself and recommend.
import subprocess

import numpy as np
import sounddevice as sd  # kept from the original; not used below
from sklearn.preprocessing import MinMaxScaler  # only needed for the commented-out normalization

import sherpa_ncnn

def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    # base_file = "sherpa-ncnn-conv-emformer-transducer-2022-12-06"
    # base_file = "sherpa-ncnn-lstm-transducer-small-2023-02-13"
    base_file = r"D:\llm\sherpa*******mples\sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13"
    # base_file = "sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16"
    # base_file = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17"
    recognizer = sherpa_ncnn.Recognizer(
        tokens="{}\\tokens.txt".format(base_file),
        encoder_param="{}\\encoder_jit_trace-pnnx.ncnn.param".format(base_file),
        encoder_bin="{}\\encoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        decoder_param="{}\\decoder_jit_trace-pnnx.ncnn.param".format(base_file),
        decoder_bin="{}\\decoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        joiner_param="{}\\joiner_jit_trace-pnnx.ncnn.param".format(base_file),
        joiner_bin="{}\\joiner_jit_trace-pnnx.ncnn.bin".format(base_file),
        num_threads=4,
    )
    return recognizer


print("Started! Please speak")
recognizer = create_recognizer()
# sample_rate = recognizer.sample_rate
# samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms

# URL of the audio source (a local wav/mp4 file or a remote rtsp stream all work)
# url = "your_rtsp_url"
# url = r'D:\sound\'
url = r'D:\sound\222.mp4'

# FFmpeg command: decode the input to 16 kHz mono signed 16-bit PCM on stdout
ffmpeg_cmd = [
    "ffmpeg",
    "-i", url,
    "-f", "s16le",
    "-acodec", "pcm_s16le",
    "-ar", "16000",
    "-ac", "1",
    "-",
]

# Start the FFmpeg process
process = subprocess.Popen(
    ffmpeg_cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    bufsize=1600,
)

# Sample rate, channel count, and number of samples read per iteration
sample_rate = 16000
channels = 1
frames_per_read = 1600

last_result = ""
i = 0
# Read and process the audio data
while True:
    # Read raw PCM from the FFmpeg process (each 16-bit sample is 2 bytes)
    data = process.stdout.read(frames_per_read * channels * 2)
    if not data:
        break

    # Convert the bytes to a numpy array
    samples = np.frombuffer(data, dtype=np.int16)
    samples = samples.astype(np.float32)
    # samples = MinMaxScaler(feature_range=(-1, 1)).fit_transform(samples.reshape(-1, 1))
    samples /= 32768.0  # normalize to [-1, 1]

    # Feed the audio into the recognizer
    recognizer.accept_waveform(sample_rate, samples)
    result = recognizer.text
    # print("result:", result, "last_result:", last_result)

    if last_result != result:
        if i == 0:
            print("{}".format(result), end='')
            last_result = result
            i = i + 1
        else:
            last_result_len = len(last_result)
            new_word = result[last_result_len:]
            # print(last_result, result, new_word)
            print("{}".format(new_word), end='', flush=True)
            last_result = result


# Shut down the FFmpeg process
process.stdout.close()
process.wait()
5) Reading the local microphone in real time with ffmpeg

The command ffmpeg -list_devices true -f dshow -i dummy lists the dshow devices (including microphones) available on this machine.

import subprocess

import numpy as np
import sounddevice as sd  # kept from the original; not used below
from sklearn.preprocessing import MinMaxScaler  # only needed for the commented-out normalization

import sherpa_ncnn

def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    # base_file = "sherpa-ncnn-conv-emformer-transducer-2022-12-06"
    # base_file = "sherpa-ncnn-lstm-transducer-small-2023-02-13"
    base_file = r"D:\llm\sherpa-ncnn-master\sherpa-ncnn-master\python-api-examples\sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13"
    # base_file = "sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16"
    # base_file = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17"
    recognizer = sherpa_ncnn.Recognizer(
        tokens="{}\\tokens.txt".format(base_file),
        encoder_param="{}\\encoder_jit_trace-pnnx.ncnn.param".format(base_file),
        encoder_bin="{}\\encoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        decoder_param="{}\\decoder_jit_trace-pnnx.ncnn.param".format(base_file),
        decoder_bin="{}\\decoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        joiner_param="{}\\joiner_jit_trace-pnnx.ncnn.param".format(base_file),
        joiner_bin="{}\\joiner_jit_trace-pnnx.ncnn.bin".format(base_file),
        num_threads=4,
    )
    return recognizer


print("Started! Please speak")
recognizer = create_recognizer()
# sample_rate = recognizer.sample_rate
# samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms

# URL of a remote RTSP audio stream
# url = "your_rtsp_url"
# url = r'D:\sound\'
# url = r'D:\sound\222.mp4'
url = "rtsp://admin:jc123456@192.168.63.88/Streaming/Channels/2?tcp"

# FFmpeg command for a file or rtsp source instead of the microphone:
# ffmpeg_cmd = [
#     "ffmpeg",
#     "-i", url,
#     "-f", "s16le",
#     "-acodec", "pcm_s16le",
#     "-ar", "16000",
#     "-ac", "1",
#     "-",
# ]

# Capture the microphone through dshow and emit 16 kHz mono s16le PCM on stdout
ffmpeg_cmd = [
    "ffmpeg",
    "-f", "dshow",  # use dshow as the audio input backend on Windows
    "-i", "audio=麦克风阵列 (适用于数字麦克风的英特尔® 智音技术)",  # the device name reported by -list_devices
    "-f", "s16le",
    "-acodec", "pcm_s16le",
    "-ar", "16000",
    "-ac", "1",
    "-"
]

# Start the FFmpeg process
process = subprocess.Popen(
    ffmpeg_cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    bufsize=1600,
)

# Sample rate, channel count, and number of samples read per iteration
sample_rate = 16000
channels = 1
frames_per_read = 1600

last_result = ""
i = 0
# Read and process the audio data
while True:
    # Read raw PCM from the FFmpeg process (each 16-bit sample is 2 bytes)
    data = process.stdout.read(frames_per_read * channels * 2)
    if not data:
        break

    # Convert the bytes to a numpy array
    samples = np.frombuffer(data, dtype=np.int16)
    samples = samples.astype(np.float32)
    # samples = MinMaxScaler(feature_range=(-1, 1)).fit_transform(samples.reshape(-1, 1))
    samples /= 32768.0  # normalize to [-1, 1]

    # Feed the audio into the recognizer
    recognizer.accept_waveform(sample_rate, samples)
    result = recognizer.text
    # print("result:", result, "last_result:", last_result)

    if last_result != result:
        if i == 0:
            print("{}".format(result), end='')
            last_result = result
            i = i + 1
        else:
            last_result_len = len(last_result)
            new_word = result[last_result_len:]
            # print(last_result, result, new_word)
            print("{}".format(new_word), end='', flush=True)
            last_result = result


# Shut down the FFmpeg process
process.stdout.close()
process.wait()