paddlespeech ASR speech-to-text; using FunASR; sherpa real-time, offline, and RTSP-stream speech transcription

Date: 2024-10-27 19:55:21

1. paddlespeech ASR speech-to-text

Reference:
https://github.com/PaddlePaddle/PaddleSpeech

After installation you may hit numpy-related errors at runtime, most likely because the Python and numpy versions are too new; what finally worked for me was Python 3.10 with numpy 1.22.0.

pip install paddlepaddle -i /pypi/simple
pip install paddlespeech
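
If you hit the numpy errors mentioned above, pinning the version that worked for me is the quickest fix (adjust if your setup differs):

pip install numpy==1.22.0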

1) Code

Default model download location: C:\Users\<username>\.paddlespeech\models

from paddlespeech.cli.asr.infer import ASRExecutor

asr = ASRExecutor()
result = asr(audio_file="zh.wav")  # path is a placeholder; the first run downloads the model automatically
print(result)
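
The same transcription is also available from the paddlespeech CLI without writing any Python; a minimal sketch, assuming a 16 kHz mono wav (zh.wav is a placeholder):

paddlespeech asr --lang zh --input zh.wav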


### Punctuation restoration

!paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭

## or
from paddlespeech.cli.text.infer import TextExecutor

text_punc = TextExecutor()
result = text_punc(text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭")
print(result)


2) Real-time speech transcription

References:
/chenkui164/p/
https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/streaming_asr_server/
https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/streaming_asr_server/web

paddlespeech_server stats --task asr  ## lists the supported models; to switch models, edit the corresponding yaml file


## First, start the ASR server
# Launch the streaming speech recognition service
cd PaddleSpeech/demos/streaming_asr_server
paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application_faster.yaml
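
The server can also be exercised from the command line with the bundled client; a sketch, assuming the default port 8090 from the demo yaml (check your config file for the actual port; the wav name is a placeholder):

paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input input_16k.wav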

Once the server is running, open the web page under \demos\streaming_asr_server\web\ in the demo to test it.

2. Alibaba FunASR

https://github.com/alibaba-damo-academy/FunASR/blob/main/runtime/docs/SDK_advanced_guide_online_zh.md
In my tests it is not particularly fast and English recognition is not very accurate; its strength is built-in punctuation and sentence segmentation.

Run the service directly with Docker:
## 1. Pull the image
sudo docker pull registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2
mkdir -p ./funasr-runtime-resources/models


## 2. Run the container, then start the service inside it
sudo docker run -p 10095:10095 -it --privileged=true -v ./funasr-runtime-resources/models:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2


## Start the service; the models download automatically; this launches the funasr-wss-server-2pass program
cd FunASR/funasr/runtime
nohup bash run_server_2pass.sh \
  --download-model-dir /workspace/models \
  --vad-dir damo/speech_fsmn_vad_zh-cn-16k-common-onnx \
  --model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-onnx \
  --online-model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online-onnx \
  --punc-dir damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727-onnx \
  --itn-dir thuduj12/fst_itn_zh > log.txt 2>&1 &

# To disable ssl, add: --certfile 0
# To deploy the timestamp or hotword model, set --model-dir to the corresponding model:
# damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-onnx (timestamps)
# or damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404-onnx (hotwords)
## 3. Run the client (Python, C++, HTML web, Java, and C# versions are available); code download: wget https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/sample/funasr_samples.tar.gz

## Run the Python script
python3 funasr_wss_client.py --host "127.0.0.1" --port 10095 --mode 2pass

## Step 2 above can be folded into a single docker run (nohup must be dropped here, otherwise it will not start):
docker run -p 10095:10095 -d --privileged=true -v D:\funasr-runtime-resources\models:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2 /bin/bash -c "cd /workspace/FunASR/funasr/runtime && bash run_server_2pass.sh --download-model-dir /workspace/models --vad-dir damo/speech_fsmn_vad_zh-cn-16k-common-onnx --model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-onnx --online-model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online-onnx --punc-dir damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727-onnx --itn-dir thuduj12/fst_itn_zh "

If the funasr-runtime-sdk-online-cpu image version is 0.1.6, 0.1.7, or 0.1.8, the container needs a while after launch before the service is up (running the client too early may raise ConnectionResetError):

docker run -p 10095:10095 -d --privileged=true -v D:\funasr-runtime-resources\models:/workspace/models registry.cn-hangzhou.aliyuncs.com/funasr_repo/funasr:funasr-runtime-sdk-online-cpu-0.1.2 /bin/bash -c "cd /workspace/FunASR/funasr/runtime && bash run_server_2pass.sh --download-model-dir /workspace/models --vad-dir damo/speech_fsmn_vad_zh-cn-16k-common-onnx --model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-onnx --online-model-dir damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online-onnx --punc-dir damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727-onnx --itn-dir thuduj12/fst_itn_zh "

Then run the client to use it
(code download: wget https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/sample/funasr_samples.tar.gz)

Code: https://github.com/alibaba-damo-academy/FunASR/blob/main/runtime/python/websocket/funasr_wss_client.py

python funasr_wss_client.py --host "127.0.0.1" --port 10095 --mode 2pass
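
The same client can also transcribe a wav file instead of the microphone; a sketch, assuming the client's --audio_in option (run python3 funasr_wss_client.py --help to confirm the exact flags; test.wav is a placeholder):

python funasr_wss_client.py --host "127.0.0.1" --port 10095 --mode 2pass --audio_in test.wav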


3. sherpa real-time speech transcription

1) ncnn version
References: https://github.com/k2-fsa/sherpa-ncnn
https://www.bilibili.com/video/BV1K44y197Fg

Version: sherpa-ncnn 2.1.7

Install:

pip install sherpa-ncnn sounddevice -i /pypi/simple

Downloads:
a. Clone the project: git clone https://github.com/k2-fsa/sherpa-ncnn
b. Download the model:
https://huggingface.co/marcoyang/sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23
Download its 7 files: tokens.txt plus the encoder/decoder/joiner *-pnnx.ncnn.param and *-pnnx.ncnn.bin files.

a-1. Real-time microphone transcription

https://github.com/k2-fsa/sherpa-ncnn/blob/master/python-api-examples/
https://k2-fsa.github.io/sherpa/ncnn/python/#start-recording

#!/usr/bin/env python3

# Real-time speech recognition from a microphone with sherpa-ncnn Python API
#
# Please refer to
# https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
# to download pre-trained models

import sys

try:
    import sounddevice as sd
except ImportError as e:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_ncnn


def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
        encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
        encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
        decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
        decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
        joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
        joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
        num_threads=4,
    )
    return recognizer


def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
    sample_rate = recognizer.sample_rate
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            recognizer.accept_waveform(sample_rate, samples)
            result = recognizer.text
            if last_result != result:
                last_result = result
                print("\r{}".format(result), end="", flush=True)


if __name__ == "__main__":
    devices = sd.query_devices()
    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")


Tweak the result printing to avoid re-printing everything already recognized: print only the newly appended text each time.

# Inside the while loop; i = 0 and last_result = "" are initialized before the loop:
            if last_result != result:
                if i == 0:
                    print("{}".format(result), end='')
                    last_result = result
                    i = i + 1
                else:
                    last_result_len = len(last_result)
                    new_word = result[last_result_len:]
                    # print(last_result, result, new_word)
                    print("{}".format(new_word), end='', flush=True)
                    last_result = result


a-2. Real-time microphone transcription with endpoint detection
Reference: https://github.com/k2-fsa/sherpa-ncnn/blob/master/python-api-examples/

#!/usr/bin/env python3

# Real-time speech recognition from a microphone with sherpa-ncnn Python API
# with endpoint detection.
#
# Please refer to
# https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
# to download pre-trained models

import sys

try:
    import sounddevice as sd
except ImportError as e:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_ncnn


# def create_recognizer():
#     # Please replace the model files if needed.
#     # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
#     # for download links.
#     recognizer = sherpa_ncnn.Recognizer(
#         tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
#         encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
#         encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
#         decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
#         decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
#         joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
#         joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
#         num_threads=4,
#         decoding_method="modified_beam_search",
#         enable_endpoint_detection=True,
#         rule1_min_trailing_silence=2.4,
#         rule2_min_trailing_silence=1.2,
#         rule3_min_utterance_length=300,
#     )
#     return recognizer

def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    # base_file = "sherpa-ncnn-conv-emformer-transducer-2022-12-06"
    # base_file = "sherpa-ncnn-lstm-transducer-small-2023-02-13"
    base_file = "sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13"
    # base_file = "sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16"
    # base_file = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17"
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./{}/tokens.txt".format(base_file),
        encoder_param="./{}/encoder_jit_trace-pnnx.ncnn.param".format(base_file),
        encoder_bin="./{}/encoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        decoder_param="./{}/decoder_jit_trace-pnnx.ncnn.param".format(base_file),
        decoder_bin="./{}/decoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        joiner_param="./{}/joiner_jit_trace-pnnx.ncnn.param".format(base_file),
        joiner_bin="./{}/joiner_jit_trace-pnnx.ncnn.bin".format(base_file),
        num_threads=4,
        decoding_method="modified_beam_search",
        enable_endpoint_detection=True,
        rule1_min_trailing_silence=2.4,
        rule2_min_trailing_silence=1.2,
        rule3_min_utterance_length=300,
        hotwords_file="",
        hotwords_score=1.5,
    )
    return recognizer


def main():
    print("Started! Please speak")
    recognizer = create_recognizer()
    sample_rate = recognizer.sample_rate
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    segment_id = 0


    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            recognizer.accept_waveform(sample_rate, samples)

            is_endpoint = recognizer.is_endpoint

            result = recognizer.text
            if result and (last_result != result):
                last_result = result
                print("\r{}:{}".format(segment_id, result), end="", flush=True)

            if is_endpoint:
                if result:
                    print("\r{}:{}".format(segment_id, result), flush=True)
                    segment_id += 1
                recognizer.reset()


if __name__ == "__main__":
    devices = sd.query_devices()
    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")



2) onnx version (recommended)
References: https://k2-fsa.github.io/sherpa/onnx/python/
https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/

Install:
pip install sherpa-onnx

a. Latest version (recommended)

Model download: https://huggingface.co/k2-fsa/sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tree/main

Run:

python .\speech-recognition-from-microphone-onnx.py --tokens=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt --encoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.onnx --decoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.onnx --joiner=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.onnx

## Code
#!/usr/bin/env python3

# Real-time speech recognition from a microphone with sherpa-onnx Python API
#
# Please refer to
# https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
# to download pre-trained models

import argparse
import sys
from pathlib import Path

from typing import List

try:
    import sounddevice as sd
except ImportError:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_onnx


def assert_file_exists(filename: str):
    assert Path(filename).is_file(), (
        f"{filename} does not exist!\n"
        "Please refer to "
        "https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html to download it"
    )


def get_args():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument(
        "--tokens",
        type=str,
        required=True,
        help="Path to tokens.txt",
    )

    parser.add_argument(
        "--encoder",
        type=str,
        required=True,
        help="Path to the encoder model",
    )

    parser.add_argument(
        "--decoder",
        type=str,
        required=True,
        help="Path to the decoder model",
    )

    parser.add_argument(
        "--joiner",
        type=str,
        help="Path to the joiner model",
    )

    parser.add_argument(
        "--decoding-method",
        type=str,
        default="greedy_search",
        help="Valid values are greedy_search and modified_beam_search",
    )

    parser.add_argument(
        "--max-active-paths",
        type=int,
        default=4,
        help="""Used only when --decoding-method is modified_beam_search.
        It specifies number of active paths to keep during decoding.
        """,
    )

    parser.add_argument(
        "--provider",
        type=str,
        default="cpu",
        help="Valid values: cpu, cuda, coreml",
    )

    parser.add_argument(
        "--hotwords-file",
        type=str,
        default="",
        help="""
        The file containing hotwords, one words/phrases per line, and for each
        phrase the bpe/cjkchar are separated by a space. For example:

        ▁HE LL O ▁WORLD
        你 好 世 界
        """,
    )

    parser.add_argument(
        "--hotwords-score",
        type=float,
        default=1.5,
        help="""
        The hotword score of each token for biasing word/phrase. Used only if
        --hotwords-file is given.
        """,
    )

    parser.add_argument(
        "--blank-penalty",
        type=float,
        default=0.0,
        help="""
        The penalty applied on blank symbol during decoding.
        Note: It is a positive value that would be applied to logits like
        this `logits[:, 0] -= blank_penalty` (suppose logits.shape is
        [batch_size, vocab] and blank id is 0).
        """,
    )

    return parser.parse_args()


def create_recognizer(args):
    assert_file_exists(args.encoder)
    assert_file_exists(args.decoder)
    assert_file_exists(args.joiner)
    assert_file_exists(args.tokens)
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
        tokens=args.tokens,
        encoder=args.encoder,
        decoder=args.decoder,
        joiner=args.joiner,
        num_threads=1,
        sample_rate=16000,
        feature_dim=80,
        decoding_method=args.decoding_method,
        max_active_paths=args.max_active_paths,
        provider=args.provider,
        hotwords_file=args.hotwords_file,
        hotwords_score=args.hotwords_score,
        blank_penalty=args.blank_penalty,
    )
    return recognizer


def main():
    args = get_args()

    devices = sd.query_devices()
    if len(devices) == 0:
        print("No microphone devices found")
        sys.exit(0)

    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    recognizer = create_recognizer(args)
    print("Started! Please speak")

    # The model is using 16 kHz, we use 48 kHz here to demonstrate that
    # sherpa-onnx will do resampling inside.
    sample_rate = 48000
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    stream = recognizer.create_stream()
    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            stream.accept_waveform(sample_rate, samples)
            while recognizer.is_ready(stream):
                recognizer.decode_stream(stream)
            result = recognizer.get_result(stream)
            if last_result != result:
                last_result = result
                print("\r{}".format(result), end="", flush=True)


if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")
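Because the script above wires hotword biasing into argparse, it can be combined with modified_beam_search; a hypothetical invocation (hotwords.txt is a placeholder, one space-separated phrase per line as described in the --hotwords-file help text; model flags are the same as above):

python .\speech-recognition-from-microphone-onnx.py --tokens=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt --encoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.onnx --decoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.onnx --joiner=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.onnx --decoding-method=modified_beam_search --hotwords-file=hotwords.txt --hotwords-score=2.0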


b. Download the model:
https://huggingface.co/csukuangfj/sherpa-onnx-streaming-conformer-zh-2023-05-23/tree/main

Code:
Run: python ./speech-recognition-from-microphone.py --tokens=./sherpa-onnx-streaming-conformer-zh-2023-05-23/tokens.txt --encoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/encoder-epoch-99-avg-1.onnx --decoder=./sherpa-onnx-streaming-conformer-zh-2023-05-23/decoder-epoch-99-avg-1.onnx --joiner=./sherpa-onnx-streaming-conformer-zh-2023-05-23/joiner-epoch-99-avg-1.onnx

#!/usr/bin/env python3

# Real-time speech recognition from a microphone with sherpa-onnx Python API
#
# Please refer to
# https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
# to download pre-trained models

import argparse
import sys
from pathlib import Path

try:
    import sounddevice as sd
except ImportError:
    print("Please install sounddevice first. You can use")
    print()
    print("  pip install sounddevice")
    print()
    print("to install it")
    sys.exit(-1)

import sherpa_onnx


def assert_file_exists(filename: str):
    assert Path(filename).is_file(), (
        f"{filename} does not exist!\n"
        "Please refer to "
        "https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html to download it"
    )


def get_args():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )

    parser.add_argument(
        "--tokens",
        type=str,
        help="Path to tokens.txt",
    )

    parser.add_argument(
        "--encoder",
        type=str,
        help="Path to the encoder model",
    )

    parser.add_argument(
        "--decoder",
        type=str,
        help="Path to the decoder model",
    )

    parser.add_argument(
        "--joiner",
        type=str,
        help="Path to the joiner model",
    )

    parser.add_argument(
        "--decoding-method",
        type=str,
        default="greedy_search",
        help="Valid values are greedy_search and modified_beam_search",
    )

    return parser.parse_args()


def create_recognizer():
    args = get_args()
    assert_file_exists(args.encoder)
    assert_file_exists(args.decoder)
    assert_file_exists(args.joiner)
    assert_file_exists(args.tokens)
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index.html
    # for download links.
    recognizer = sherpa_onnx.OnlineRecognizer(
        tokens=args.tokens,
        encoder=args.encoder,
        decoder=args.decoder,
        joiner=args.joiner,
        num_threads=1,
        sample_rate=16000,
        feature_dim=80,
        decoding_method=args.decoding_method,
    )
    return recognizer


def main():
    recognizer = create_recognizer()
    print("Started! Please speak")

    # The model is using 16 kHz, we use 48 kHz here to demonstrate that
    # sherpa-onnx will do resampling inside.
    sample_rate = 48000
    samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms
    last_result = ""
    stream = recognizer.create_stream()
    with sd.InputStream(channels=1, dtype="float32", samplerate=sample_rate) as s:
        while True:
            samples, _ = s.read(samples_per_read)  # a blocking read
            samples = samples.reshape(-1)
            stream.accept_waveform(sample_rate, samples)
            while recognizer.is_ready(stream):
                recognizer.decode_stream(stream)
            result = recognizer.get_result(stream)
            if last_result != result:
                last_result = result
                print("\r{}".format(result), end="", flush=True)


if __name__ == "__main__":
    devices = sd.query_devices()
    print(devices)
    default_input_device_idx = sd.default.device[0]
    print(f'Use default device: {devices[default_input_device_idx]["name"]}')

    try:
        main()
    except KeyboardInterrupt:
        print("\nCaught Ctrl + C. Exiting")
3) Offline wav audio file transcription

Note: if your local audio is not at a 256 kbps bitrate, convert it first (16 kHz sampling x 16-bit samples x mono = 256 kbps, which is what these models expect). Bitrate is the number of bits per second in an audio or video file, commonly used to express a data rate or compression level.

For audio files, the bitrate is the amount of audio data per second, in kbps (kilobits per second). Generally, a higher bitrate means better audio quality but a larger file.

For example, 256 kbps means 256 kilobits of audio data per second. This figure is typically used to specify an audio file's compression level or output quality.

Also, for installing sox on Windows see: https://blog.csdn.net/yyy430/article/details/88408273

sox input.wav -r 16k -c 1 output.wav   # input/output file names are placeholders
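
If sox is not handy, ffmpeg does the same conversion with the flags already used elsewhere in this post (file names are placeholders):

ffmpeg -i input.mp3 -acodec pcm_s16le -ar 16000 -ac 1 output.wav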


## Official code

#!/usr/bin/env python3

"""
This file demonstrates how to use sherpa-ncnn Python API to recognize
a single file.

Please refer to
https://k2-fsa.github.io/sherpa/ncnn/index.html
to install sherpa-ncnn and to download the pre-trained models
used in this file.
"""

import time
import wave

import numpy as np
import sherpa_ncnn


def main():
    # Please refer to https://k2-fsa.github.io/sherpa/ncnn/index.html
    # to download the model files
    # recognizer = sherpa_ncnn.Recognizer(
    #     tokens="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt",
    #     encoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param",
    #     encoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin",
    #     decoder_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param",
    #     decoder_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin",
    #     joiner_param="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param",
    #     joiner_bin="./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin",
    #     num_threads=4,
    # )
    base_file = "sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13"
    # base_file = "sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16"
    # base_file = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17"
    recognizer = sherpa_ncnn.Recognizer(
        tokens="./{}/tokens.txt".format(base_file),
        encoder_param="./{}/encoder_jit_trace-pnnx.ncnn.param".format(base_file),
        encoder_bin="./{}/encoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        decoder_param="./{}/decoder_jit_trace-pnnx.ncnn.param".format(base_file),
        decoder_bin="./{}/decoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        joiner_param="./{}/joiner_jit_trace-pnnx.ncnn.param".format(base_file),
        joiner_bin="./{}/joiner_jit_trace-pnnx.ncnn.bin".format(base_file),
        num_threads=4,
    )

    filename = r'D:\sound\test.wav'  # path is a placeholder; use a 16-bit wav file
    with wave.open(filename) as f:
        # Note: If wave_file_sample_rate is different from
        # recognizer.sample_rate, we will do resampling inside sherpa-ncnn
        wave_file_sample_rate = f.getframerate()
        num_channels = f.getnchannels()
        assert f.getsampwidth() == 2, f.getsampwidth()  # it is in bytes
        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        samples_int16 = samples_int16.reshape(-1, num_channels)[:, 0]
        samples_float32 = samples_int16.astype(np.float32)

        samples_float32 = samples_float32 / 32768

    # simulate streaming
    chunk_size = int(0.1 * wave_file_sample_rate)  # 0.1 seconds
    start = 0
    while start < samples_float32.shape[0]:
        end = start + chunk_size
        end = min(end, samples_float32.shape[0])
        recognizer.accept_waveform(wave_file_sample_rate, samples_float32[start:end])
        start = end
        text = recognizer.text
        if text:
            print(text)

        # simulate streaming by sleeping
        time.sleep(0.1)

    tail_paddings = np.zeros(int(wave_file_sample_rate * 0.5), dtype=np.float32)
    recognizer.accept_waveform(wave_file_sample_rate, tail_paddings)
    recognizer.input_finished()
    text = recognizer.text
    if text:
        print(text)


if __name__ == "__main__":
    main()

4) Alternatively, use ffmpeg to read mp4/wav files offline or an rtsp stream over the network; this is the version I cleaned up myself and recommend.
import subprocess

import numpy as np
import sounddevice as sd  # kept from the original; not used below
from sklearn.preprocessing import MinMaxScaler  # only needed for the commented-out normalization

import sherpa_ncnn

def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    # base_file = "sherpa-ncnn-conv-emformer-transducer-2022-12-06"
    # base_file = "sherpa-ncnn-lstm-transducer-small-2023-02-13"
    base_file = r"D:\llm\sherpa*******mples\sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13"
    # base_file = "sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16"
    # base_file = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17"
    recognizer = sherpa_ncnn.Recognizer(
        tokens="{}\\tokens.txt".format(base_file),
        encoder_param="{}\\encoder_jit_trace-pnnx.ncnn.param".format(base_file),
        encoder_bin="{}\\encoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        decoder_param="{}\\decoder_jit_trace-pnnx.ncnn.param".format(base_file),
        decoder_bin="{}\\decoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        joiner_param="{}\\joiner_jit_trace-pnnx.ncnn.param".format(base_file),
        joiner_bin="{}\\joiner_jit_trace-pnnx.ncnn.bin".format(base_file),
        num_threads=4,
    )
    return recognizer


print("Started! Please speak")
recognizer = create_recognizer()
# sample_rate = recognizer.sample_rate
# samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms

# URL of the audio source (a local wav/mp4 file or a remote rtsp stream all work)
# url = "your_rtsp_url"
# url = r'D:\sound\'
url = r'D:\sound\222.mp4'

# FFmpeg command: decode the input to 16 kHz mono signed 16-bit PCM on stdout
ffmpeg_cmd = [
    "ffmpeg",
    "-i", url,
    "-f", "s16le",
    "-acodec", "pcm_s16le",
    "-ar", "16000",
    "-ac", "1",
    "-",
]

# Start the FFmpeg process
process = subprocess.Popen(
    ffmpeg_cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    bufsize=1600,
)

# Sample rate, channel count, and number of samples read per iteration
sample_rate = 16000
channels = 1
frames_per_read = 1600

last_result = ""
i = 0
# Read and process the audio data
while True:
    # Read raw PCM from the FFmpeg process (each 16-bit sample is 2 bytes)
    data = process.stdout.read(frames_per_read * channels * 2)
    if not data:
        break

    # Convert the bytes to a numpy array
    samples = np.frombuffer(data, dtype=np.int16)
    samples = samples.astype(np.float32)
    # samples = MinMaxScaler(feature_range=(-1, 1)).fit_transform(samples.reshape(-1, 1))
    samples /= 32768.0  # normalize to [-1, 1]

    # Feed the audio into the recognizer
    recognizer.accept_waveform(sample_rate, samples)
    result = recognizer.text
    # print("result:", result, "last_result:", last_result)

    if last_result != result:
        if i == 0:
            print("{}".format(result), end='')
            last_result = result
            i = i + 1
        else:
            last_result_len = len(last_result)
            new_word = result[last_result_len:]
            # print(last_result, result, new_word)
            print("{}".format(new_word), end='', flush=True)
            last_result = result


# Shut down the FFmpeg process
process.stdout.close()
process.wait()
5) Reading the local microphone in real time with ffmpeg

The command ffmpeg -list_devices true -f dshow -i dummy lists the dshow devices (including microphones) available on this machine.

import subprocess

import numpy as np
import sounddevice as sd  # kept from the original; not used below
from sklearn.preprocessing import MinMaxScaler  # only needed for the commented-out normalization

import sherpa_ncnn

def create_recognizer():
    # Please replace the model files if needed.
    # See https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html
    # for download links.
    # base_file = "sherpa-ncnn-conv-emformer-transducer-2022-12-06"
    # base_file = "sherpa-ncnn-lstm-transducer-small-2023-02-13"
    base_file = r"D:\llm\sherpa-ncnn-master\sherpa-ncnn-master\python-api-examples\sherpa-ncnn-streaming-zipformer-bilingual-zh-en-2023-02-13"
    # base_file = "sherpa-ncnn-streaming-zipformer-small-bilingual-zh-en-2023-02-16"
    # base_file = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17"
    recognizer = sherpa_ncnn.Recognizer(
        tokens="{}\\tokens.txt".format(base_file),
        encoder_param="{}\\encoder_jit_trace-pnnx.ncnn.param".format(base_file),
        encoder_bin="{}\\encoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        decoder_param="{}\\decoder_jit_trace-pnnx.ncnn.param".format(base_file),
        decoder_bin="{}\\decoder_jit_trace-pnnx.ncnn.bin".format(base_file),
        joiner_param="{}\\joiner_jit_trace-pnnx.ncnn.param".format(base_file),
        joiner_bin="{}\\joiner_jit_trace-pnnx.ncnn.bin".format(base_file),
        num_threads=4,
    )
    return recognizer


print("Started! Please speak")
recognizer = create_recognizer()
# sample_rate = recognizer.sample_rate
# samples_per_read = int(0.1 * sample_rate)  # 0.1 second = 100 ms

# URL of a remote RTSP audio stream
# url = "your_rtsp_url"
# url = r'D:\sound\'
# url = r'D:\sound\222.mp4'
url = "rtsp://admin:jc123456@192.168.63.88/Streaming/Channels/2?tcp"

# FFmpeg command for a file or rtsp source instead of the microphone:
# ffmpeg_cmd = [
#     "ffmpeg",
#     "-i", url,
#     "-f", "s16le",
#     "-acodec", "pcm_s16le",
#     "-ar", "16000",
#     "-ac", "1",
#     "-",
# ]

# Capture the microphone through dshow and emit 16 kHz mono s16le PCM on stdout
ffmpeg_cmd = [
    "ffmpeg",
    "-f", "dshow",  # use dshow as the audio input backend on Windows
    "-i", "audio=麦克风阵列 (适用于数字麦克风的英特尔® 智音技术)",  # the device name reported by -list_devices
    "-f", "s16le",
    "-acodec", "pcm_s16le",
    "-ar", "16000",
    "-ac", "1",
    "-"
]

# Start the FFmpeg process
process = subprocess.Popen(
    ffmpeg_cmd,
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    bufsize=1600,
)

# Sample rate, channel count, and number of samples read per iteration
sample_rate = 16000
channels = 1
frames_per_read = 1600

last_result = ""
i = 0
# Read and process the audio data
while True:
    # Read raw PCM from the FFmpeg process (each 16-bit sample is 2 bytes)
    data = process.stdout.read(frames_per_read * channels * 2)
    if not data:
        break

    # Convert the bytes to a numpy array
    samples = np.frombuffer(data, dtype=np.int16)
    samples = samples.astype(np.float32)
    # samples = MinMaxScaler(feature_range=(-1, 1)).fit_transform(samples.reshape(-1, 1))
    samples /= 32768.0  # normalize to [-1, 1]

    # Feed the audio into the recognizer
    recognizer.accept_waveform(sample_rate, samples)
    result = recognizer.text
    # print("result:", result, "last_result:", last_result)

    if last_result != result:
        if i == 0:
            print("{}".format(result), end='')
            last_result = result
            i = i + 1
        else:
            last_result_len = len(last_result)
            new_word = result[last_result_len:]
            # print(last_result, result, new_word)
            print("{}".format(new_word), end='', flush=True)
            last_result = result


# Shut down the FFmpeg process
process.stdout.close()
process.wait()