DownloadString() returns a string with special characters

Date: 2022-10-05 20:14:00

I have an issue with some content that we are downloading from the web for a screen scraping tool that I am building.

In the code below, the string returned by WebClient.DownloadString contains some odd characters in the downloaded source of a few (not all) web sites.

I have recently added HTTP headers, as shown below. Previously, the same code was called without the headers, with the same result. I have not tried variations on the 'Accept-Charset' header; I don't know much about text encoding beyond the basics.

The characters or character sequences I am referring to are:

"ï»¿"

and

"Â"

These characters are not seen when you use "view source" in a web browser. What could be causing this and how can I rectify the problem?

string urlData = String.Empty;
WebClient wc = new WebClient();

// Add headers to impersonate a web browser. Some web sites 
// will not respond correctly without these headers
wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12");
wc.Headers.Add("Accept", "*/*");
wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

urlData = wc.DownloadString(uri);

6 answers

#1


89

ï»¿ is the windows-1252 representation of the octets EF BB BF. That's the UTF-8 byte order mark (BOM), which implies that your remote web page is encoded in UTF-8 but you're reading it as if it were windows-1252. According to the docs, WebClient.DownloadString uses WebClient.Encoding as its encoding when it converts the remote resource into a string. Set it to System.Text.Encoding.UTF8 and things should theoretically work.
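The mismatch described above can be reproduced without any network access. The following is a minimal sketch, not part of the original answer: the class and method names are made up for illustration, and ISO-8859-1 is used as a stand-in for windows-1252 because the two agree on the bytes involved and ISO-8859-1 needs no extra encoding provider. The guess that the stray "Â" comes from a UTF-8 no-break space (C2 A0) is an assumption about the page, not something the question confirms.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Decodes UTF-8 bytes with the wrong (single-byte) encoding to show
    // how the stray characters arise.
    public static string DecodeAsLatin1(byte[] utf8Bytes)
    {
        // Assumption: ISO-8859-1 stands in for windows-1252. Both map the
        // bytes EF, BB, BF, C2 and A0 to the same characters.
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    public static void Main()
    {
        byte[] bom = { 0xEF, 0xBB, 0xBF };              // UTF-8 byte order mark
        byte[] nbsp = Encoding.UTF8.GetBytes("\u00A0"); // no-break space: C2 A0

        // The string holds the three characters U+00EF, U+00BB, U+00BF.
        Console.WriteLine(DecodeAsLatin1(bom));

        // The string holds U+00C2 (the stray A-circumflex) and U+00A0.
        Console.WriteLine(DecodeAsLatin1(nbsp));

        // Decoded with the matching encoding, the text round-trips cleanly.
        byte[] page = Encoding.UTF8.GetBytes("caf\u00E9");
        Console.WriteLine(Encoding.UTF8.GetString(page) == "caf\u00E9");
    }
}
```

If this is what is happening, setting wc.Encoding = Encoding.UTF8 before calling DownloadString, as suggested above, should make the "Â" disappear as well.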

#2


40  

The way WebClient.DownloadString is implemented is very dumb. It should get the character encoding from the Content-Type header in the response, but instead it expects the developer to specify the expected encoding beforehand. I don't know what the developers of this class were thinking.

I have created an auxiliary class that retrieves the encoding name from the Content-Type header of the response:

using System;
using System.Collections.Specialized;
using System.Linq;
using System.Text;

public static class WebUtils
{
    public static Encoding GetEncodingFrom(
        NameValueCollection responseHeaders,
        Encoding defaultEncoding = null)
    {
        if(responseHeaders == null)
            throw new ArgumentNullException("responseHeaders");

        //Note that key lookup is case-insensitive
        var contentType = responseHeaders["Content-Type"];
        if(contentType == null)
            return defaultEncoding;

        var contentTypeParts = contentType.Split(';');
        if(contentTypeParts.Length <= 1)
            return defaultEncoding;

        var charsetPart =
            contentTypeParts.Skip(1).FirstOrDefault(
                p => p.TrimStart().StartsWith("charset", StringComparison.InvariantCultureIgnoreCase));
        if(charsetPart == null)
            return defaultEncoding;

        var charsetPartParts = charsetPart.Split('=');
        if(charsetPartParts.Length != 2)
            return defaultEncoding;

        // Some servers quote the value, e.g. charset="utf-8"
        var charsetName = charsetPartParts[1].Trim().Trim('"', '\'');
        if(charsetName == "")
            return defaultEncoding;

        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch(ArgumentException ex) 
        {
            throw new UnknownEncodingException(
                charsetName,   
                "The server returned data in an unknown encoding: " + charsetName, 
                ex);
        }
    }
}

(UnknownEncodingException is a custom exception class; feel free to replace it with InvalidOperationException or whatever else you prefer.)

Then the following extension method for the WebClient class will do the trick:

// Requires using directives for System, System.Net and System.Text.
public static class WebClientExtensions
{
    public static string DownloadStringAwareOfEncoding(this WebClient webClient, Uri uri)
    {
        var rawData = webClient.DownloadData(uri);
        var encoding = WebUtils.GetEncodingFrom(webClient.ResponseHeaders, Encoding.UTF8);
        return encoding.GetString(rawData);
    }
}

So in your example you would do:

urlData = wc.DownloadStringAwareOfEncoding(uri);

...and that's it.
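As a possible alternative to splitting the header by hand, the framework's System.Net.Mime.ContentType class can parse a Content-Type value and expose its charset parameter. The sketch below is an assumption-laden variant, not the answer's code: the class and method names are hypothetical, and unlike the helper above it falls back to the supplied default instead of throwing when the encoding name is unknown.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class CharsetSketch
{
    // Returns the encoding named by a Content-Type header value, or the
    // fallback when the header is missing, malformed, or names an
    // encoding this runtime does not know.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentType))
            return fallback;

        try
        {
            // ContentType parses "type/subtype; charset=..." for us.
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset)
                ? fallback
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException)   // header value could not be parsed
        {
            return fallback;
        }
        catch (ArgumentException) // unknown encoding name
        {
            return fallback;
        }
    }
}
```

With WebClient, the header value would come from webClient.ResponseHeaders["Content-Type"] after DownloadData has returned, and the resulting encoding would be passed to GetString as in the extension method above.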

#3


10  

var client = new WebClient { Encoding = System.Text.Encoding.UTF8 };

var json = client.DownloadString(url);

#4


1  

In my case the data returned was gzipped and had to be decompressed first, so I found this answer helpful:

https://*.com/a/34418228/74585
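The linked answer is not reproduced here. One common way to have compressed responses decompressed before they are decoded is to derive from WebClient and enable AutomaticDecompression on the underlying HttpWebRequest. This is a minimal sketch under that assumption; the class name DecompressingWebClient is made up for illustration.

```csharp
using System;
using System.Net;

public class DecompressingWebClient : WebClient
{
    // WebClient calls this for every request it creates, which gives us a
    // place to switch on transparent gzip/deflate handling.
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests support automatic decompression.
        var httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }

        return request;
    }
}
```

It is used like a regular WebClient, for example: var wc = new DecompressingWebClient { Encoding = System.Text.Encoding.UTF8 }; followed by urlData = wc.DownloadString(uri);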

#5


0  

In my case, I deleted every header related to language, charset, etc., except the User-Agent and Cookie headers. It worked.

// Try commenting out these headers
//wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
//wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

#6


0  

None of these worked for me for some sites such as "www.yahoo.com". The only way I could resolve my problem was to change DownloadString to OpenRead and set the User-Agent header, as in the sample code. However, a few sites like "www.varzesh3.com" didn't work with any of the methods!

WebClient client = new WebClient();
client.Headers.Add(HttpRequestHeader.UserAgent, "");
using (var stream = client.OpenRead("http://www.yahoo.com"))
using (var sr = new StreamReader(stream))
{
    // StreamReader decodes as UTF-8 by default and honours a byte order mark
    string s = sr.ReadToEnd();
}
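The answer does not say why OpenRead helps. One plausible reason is that StreamReader ignores WebClient.Encoding: by default it decodes as UTF-8 and consumes a leading byte order mark instead of returning it as text. The sketch below illustrates that behaviour on an in-memory stream; the class name and the sample body are hypothetical, and this default would not help with a page in some other encoding that has no byte order mark.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReaderBomDemo
{
    // Reads a response body the same way the OpenRead sample does.
    public static string ReadAll(byte[] body)
    {
        using (var stream = new MemoryStream(body))
        using (var reader = new StreamReader(stream)) // UTF-8, BOM detection on
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Hypothetical response body: UTF-8 BOM followed by UTF-8 text.
        byte[] text = Encoding.UTF8.GetBytes("na\u00EFve");
        byte[] body = new byte[3 + text.Length];
        body[0] = 0xEF; body[1] = 0xBB; body[2] = 0xBF;
        Array.Copy(text, 0, body, 3, text.Length);

        string s = ReadAll(body);

        // The BOM is consumed by the reader, so only the five letters remain.
        Console.WriteLine(s.Length);
    }
}
```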
