How to detect the language of a string?

Date: 2022-08-26 00:08:49

What's the best way to detect the language of a string?


9 solutions

#1


If your code runs in a context that has internet access, you can try the Google API for language detection: http://code.google.com/apis/ajaxlanguage/documentation/

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
  if (!result.error) {
    var language = 'unknown';
    // Map the returned language code back to its name.
    for (var l in google.language.Languages) {
      if (google.language.Languages[l] == result.language) {
        language = l;
        break;
      }
    }
    var container = document.getElementById("detection");
    container.innerHTML = text + " is: " + language;
  }
});

And, since you are using C#, take a look at this article on how to call the API from C#.

UPDATE: That C# link is gone; here's a cached copy of the core of it:

string s = TextBoxTranslateEnglishToHebrew.Text;
string key = "YOUR GOOGLE AJAX API KEY";
GoogleLangaugeDetector detector =
   new GoogleLangaugeDetector(s, VERSION.ONE_POINT_ZERO, key);

GoogleTranslator gTranslator = new GoogleTranslator(s, VERSION.ONE_POINT_ZERO,
   detector.LanguageDetected.Equals("iw") ? LANGUAGE.HEBREW : LANGUAGE.ENGLISH,
   detector.LanguageDetected.Equals("iw") ? LANGUAGE.ENGLISH : LANGUAGE.HEBREW,
   key);

TextBoxTranslation.Text = gTranslator.Translation;

Basically, you need to create a URI that looks like this and send it to Google:

http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=hello%20world&langpair=en%7ciw&key=your_google_api_key_goes_here

This tells the API that you want to translate "hello world" from English to Hebrew. Google's JSON response would look like:

{"responseData": {"translatedText":"שלום העולם"}, "responseDetails": null, "responseStatus": 200}

I chose to make a base class that represents a typical Google JSON response:


[Serializable]
public class JSONResponse
{
   public string responseDetails = null;
   public string responseStatus = null;
}

Then, a Translation object that inherits from this class:


[Serializable]
public class Translation: JSONResponse
{
   public TranslationResponseData responseData = 
    new TranslationResponseData();
}

This Translation class has a TranslationResponseData object that looks like this:


[Serializable]
public class TranslationResponseData
{
   public string translatedText;
}

Finally, we can make the GoogleTranslator class:


using System;
using System.Collections.Generic;
using System.Text;

using System.Web;
using System.Net;
using System.IO;
using System.Runtime.Serialization.Json;

namespace GoogleTranslationAPI
{

   public class GoogleTranslator
   {
      private string _q = "";
      private string _v = "";
      private string _key = "";
      private string _langPair = "";
      private string _requestUrl = "";
      private string _translation = "";

      public GoogleTranslator(string queryTerm, VERSION version, LANGUAGE languageFrom,
         LANGUAGE languageTo, string key)
      {
         // UrlEncode (not UrlPathEncode) so that '&', '=' and '+' in the query text are escaped.
         _q = HttpUtility.UrlEncode(queryTerm);
         _v = HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(version));
         _langPair =
            HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(languageFrom) +
            "|" + EnumStringUtil.GetStringValue(languageTo));
         _key = HttpUtility.UrlEncode(key);

         string encodedRequestUrlFragment =
            string.Format("?v={0}&q={1}&langpair={2}&key={3}",
            _v, _q, _langPair, _key);

         _requestUrl = EnumStringUtil.GetStringValue(BASEURL.TRANSLATE) + encodedRequestUrlFragment;

         GetTranslation();
      }

      public string Translation
      {
         get { return _translation; }
         private set { _translation = value; }
      }

      private void GetTranslation()
      {
         try
         {
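            WebRequest request = WebRequest.Create(_requestUrl);
            string json;
            // Dispose the response and reader, and read the whole body rather than only its first line.
            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
               json = reader.ReadToEnd();
            }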
            using (MemoryStream ms = new MemoryStream(Encoding.Unicode.GetBytes(json)))
            {
               DataContractJsonSerializer ser =
                  new DataContractJsonSerializer(typeof(Translation));
               Translation translation = ser.ReadObject(ms) as Translation;

               _translation = translation.responseData.translatedText;
            }
         }
         catch (Exception) { /* Errors are swallowed: Translation stays empty on failure. */ }
      }
   }
}

#2


Quick answer: NTextCat (NuGet, Online Demo)

Long answer:

Currently the best way seems to be to use classifiers trained to classify a piece of text into one (or more) languages from a predefined set.

There is a Perl tool called TextCat. It has language models for the 74 most popular languages. There are a huge number of ports of this tool to different programming languages.

There were no ports to .NET, so I wrote one: NTextCat on GitHub.

It is a pure .NET Framework DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.

Any feedback is much appreciated! New ideas and feature requests are welcome too :)

An alternative is to use one of the numerous online services (e.g. the one from Google mentioned above, detectlanguage.com, langid.net, etc.).

#3


A statistical approach using digraphs or trigraphs is a very good indicator. For example, here are the most common digraphs in English in order: http://www.letterfrequency.org/#digraph-frequency (one can find better or more complete lists). This method may have a better success rate than word analysis for short snippets of text because there are more digraphs in text than there are complete words.
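
A minimal sketch of the digraph idea in C#, under some loud assumptions: the class and method names are made up for illustration, the reference profiles are built from short sample sentences that you supply yourself (real profiles would come from a much larger corpus), and cosine similarity is just one reasonable way to compare two frequency profiles.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NGramLanguageGuesser
{
    // Counts overlapping character n-grams (n = 2 gives digraphs, n = 3 trigraphs).
    public static Dictionary<string, int> CountNGrams(string text, int n)
    {
        var counts = new Dictionary<string, int>();
        string normalized = text.ToLowerInvariant();
        for (int i = 0; i + n <= normalized.Length; i++)
        {
            string gram = normalized.Substring(i, n);
            counts.TryGetValue(gram, out int current); // current is 0 when the gram is new
            counts[gram] = current + 1;
        }
        return counts;
    }

    // Cosine similarity between two n-gram count vectors (0 = nothing shared, 1 = same distribution).
    public static double Cosine(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        double dot = 0;
        foreach (var pair in a)
        {
            if (b.TryGetValue(pair.Key, out int other))
                dot += (double)pair.Value * other;
        }
        double normA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => (double)v * v));
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    // Returns the key of the reference sample whose n-gram profile is closest to the text.
    // "samples" maps a language code to sample text in that language (hypothetical input).
    public static string Guess(string text, IDictionary<string, string> samples, int n = 2)
    {
        var profile = CountNGrams(text, n);
        return samples
            .OrderByDescending(s => Cosine(profile, CountNGrams(s.Value, n)))
            .First().Key;
    }
}
```

You would call it as NGramLanguageGuesser.Guess(input, samples), where samples holds one block of reference text per language. In practice the reference profiles would be computed once and cached rather than rebuilt on every call.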


#4


If you mean the natural (i.e. human) language, this is in general a Hard Problem. What language is "server": English or Turkish? What language is "chat": English or French? What language is "uno": Italian or Spanish (or Latin!)?

Without paying attention to context and doing some hard natural language processing (<----- this is the phrase to google for), you haven't got a chance.

You might enjoy a look at Frengly: it's a nice UI on top of the Google Translate service, which attempts to guess the language of the input text...

#5


Make a statistical analysis of the string: split the string into words, get a dictionary for every language you want to test for, and then find the language whose dictionary matches the most words.
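
The steps above can be sketched roughly as follows. This is only an illustration: the class name is hypothetical, and the per-language word lists are assumed to be supplied by you as lowercase HashSet<string> instances (real dictionaries would be far larger than the handful of words you would type by hand).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DictionaryLanguageGuesser
{
    // Splits the text into lowercase words, treating anything that is not a letter as a separator.
    public static IEnumerable<string> Words(string text)
    {
        return new string(text.ToLowerInvariant()
                .Select(c => char.IsLetter(c) ? c : ' ').ToArray())
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Returns the language whose word list contains the most words of the text,
    // or null when no word is found in any dictionary.
    public static string Guess(string text, IDictionary<string, HashSet<string>> dictionaries)
    {
        var words = Words(text).ToList();
        string best = null;
        int bestHits = 0;
        foreach (var entry in dictionaries)
        {
            int hits = words.Count(w => entry.Value.Contains(w));
            if (hits > bestHits)
            {
                best = entry.Key;
                bestHits = hits;
            }
        }
        return best;
    }
}
```

Ties go to whichever dictionary is enumerated first, so for short inputs it may be worth returning the hit counts as well and treating a narrow margin as "unknown".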

In C#, every string in memory is Unicode (stored as UTF-16), whatever encoding it was read from. Text files do not store their encoding either (sometimes there is only an indication of 8-bit or 16-bit).

If you only need to distinguish between two languages, you might find some simple tricks. For example, if you want to tell English from Dutch, a string that contains a "y" is most likely English (unreliable, but fast).

#6


CLD (Compact Language Detector) library from Google's Chromium browser


You could wrap the CLD library, which is written in C++:

http://code.google.com/p/chromium-compact-language-detector/

#7


You may use the C# package for language identification from Microsoft Research:


This package implements several algorithms for language identification, and includes two sets of pre-compiled language profiles. One set covers 52 languages and was trained on Wikipedia (i.e. a well-written corpus); the other covers 26 languages and was constructed from Twitter (i.e. a highly colloquial corpus). The language identifiers are packaged up as a C# library, and can be easily embedded into other C# projects.

Download the package from the above link.


#8


We can use Regex.IsMatch(text, "[\\uxxxx-\\uxxxx]+") to detect a specific language, where xxxx is the four-digit hexadecimal Unicode code point of a character. Note that this really checks the script, so it cannot tell apart languages that share one (Arabic and Persian, for instance).
To detect Arabic:

bool isArabic = Regex.IsMatch(yourtext, @"[\u0600-\u06FF]+");
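
The same idea extends to several scripts at once. The sketch below is an assumption-laden illustration rather than part of the original answer: the class name and the choice of three scripts are mine, and it uses the named Unicode blocks that .NET regular expressions support (\p{IsArabic}, \p{IsHebrew}, \p{IsCyrillic}) instead of spelling out the code point ranges.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScriptDetector
{
    // One single-character pattern per script, using .NET named Unicode blocks.
    private static readonly Dictionary<string, Regex> Scripts = new Dictionary<string, Regex>
    {
        { "Arabic",   new Regex(@"\p{IsArabic}") },
        { "Hebrew",   new Regex(@"\p{IsHebrew}") },
        { "Cyrillic", new Regex(@"\p{IsCyrillic}") },
    };

    // Returns the script with the most matching characters, or null when none of them occur.
    public static string Detect(string text)
    {
        string best = null;
        int bestCount = 0;
        foreach (var script in Scripts)
        {
            int count = script.Value.Matches(text).Count;
            if (count > bestCount)
            {
                best = script.Key;
                bestCount = count;
            }
        }
        return best;
    }
}
```

Counting characters instead of testing for a single match makes mixed input (for example a Hebrew sentence containing one Cyrillic name) resolve to the dominant script.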

#9


One alternative is to use the 'Translator Text API', which is

... part of the Azure Cognitive Services API collection of machine learning and AI algorithms in the cloud, and is readily consumable in your development projects


Here's a quickstart guide on how to detect language from text using this API
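
For orientation, a call to the detect operation might look roughly like the sketch below. Treat the details as assumptions to check against the quickstart: the endpoint URL, the api-version=3.0 parameter and the two Ocp-Apim-Subscription-* headers reflect my understanding of version 3.0 of the service, the key and region arguments are placeholders for your own resource, and the class and method names are invented for this example.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class TranslatorDetect
{
    // Assumed v3.0 detect endpoint; verify against the current quickstart.
    private const string Endpoint =
        "https://api.cognitive.microsofttranslator.com/detect?api-version=3.0";

    // Builds the JSON array body: one object with a "Text" property per input string.
    public static string BuildBody(string text)
    {
        return JsonSerializer.Serialize(new[] { new { Text = text } });
    }

    // Sends the text for detection and returns the language code of the first result.
    public static async Task<string> DetectAsync(HttpClient client, string key, string region, string text)
    {
        using (var request = new HttpRequestMessage(HttpMethod.Post, Endpoint))
        {
            request.Headers.Add("Ocp-Apim-Subscription-Key", key);       // placeholder: your resource key
            request.Headers.Add("Ocp-Apim-Subscription-Region", region); // placeholder: your resource region
            request.Content = new StringContent(BuildBody(text), Encoding.UTF8, "application/json");

            using (var response = await client.SendAsync(request))
            {
                response.EnsureSuccessStatusCode();
                string json = await response.Content.ReadAsStringAsync();
                using (var doc = JsonDocument.Parse(json))
                {
                    // The response is assumed to be an array with a "language" property per input.
                    return doc.RootElement[0].GetProperty("language").GetString();
                }
            }
        }
    }
}
```

Unlike the swallowed exception in the older Google sample, this version lets HTTP failures surface through EnsureSuccessStatusCode, so the caller can decide how to handle them.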

