I am doing a project in which I need to log in to a site and scrape the webpage contents. I tried the following code:
protected void Page_Load(object sender, EventArgs e)
{
    // Download the raw bytes of the page.
    WebClient webClient = new WebClient();
    string strUrl = "http://www.mail.yahoo.com?username=sakthivel123&password=operator&login=1";
    byte[] reqHTML = webClient.DownloadData(strUrl);

    // Decode the bytes as UTF-8 and show the HTML in the label.
    UTF8Encoding objUTF8 = new UTF8Encoding();
    Label1.Text = objUTF8.GetString(reqHTML);
}
This scrapes the login page of the mail site, but I need to scrape my inbox details. Please instruct me on how to proceed. Thanks in advance.
3 Answers
#1
Please see this question and the related questions. We have to study the HTML source of a webpage before we can scrape it properly. So log in manually, get the source of the inbox page, and then study it so you can scrape it.
Why don't you use Yahoo's webmail API? That would be a better solution.
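If you do go the screen-scraping route, a minimal sketch of that "study the source first" step might look like the code below. It assumes you have logged in manually in a browser, saved the inbox page locally as inbox.html (a hypothetical filename), and added the HtmlAgilityPack library, none of which are part of this answer; the XPath expression is only a placeholder you would replace after inspecting the real markup.

using System;
using HtmlAgilityPack;

class InboxSourceStudy
{
    static void Main()
    {
        // Load the inbox page that was saved manually after logging in via a browser.
        var doc = new HtmlDocument();
        doc.Load("inbox.html"); // hypothetical local copy of the inbox page

        // Placeholder XPath: replace with a selector that matches the real inbox markup
        // (e.g. the table rows or list items that hold each message).
        var rows = doc.DocumentNode.SelectNodes("//tr[contains(@class, 'message')]");
        if (rows == null)
        {
            Console.WriteLine("Selector matched nothing; inspect the saved HTML and adjust the XPath.");
            return;
        }

        foreach (var row in rows)
        {
            // InnerText strips the tags; Trim keeps the output readable.
            Console.WriteLine(row.InnerText.Trim());
        }
    }
}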
#2
See this question - Writing a C# program that scans ecommerce website and extracts products pictures + prices + description from them
P.S.: It's called "scrape", and the act of performing a screen scrape would be called (you guessed it!) "screen scraping". The word "scrap", when used as a verb, means to discard, such as "the project has been scrapped!" ;-)
#3
I'd suggest you first use a tool called Fiddler to analyze the communication between the target site and your browser. You can look at all the HTTP headers, cookies, content, etc.
Once your WebClient object is able to replicate the actions of a browser, including logging in, setting the appropriate cookies, and so on, you can automate the procedure.
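As a rough sketch of what "replicating the browser" can look like in code, the example below uses HttpWebRequest with a shared CookieContainer to POST a login form and then request a page with the resulting session cookies. The URLs and form field names are placeholders you would take from your own Fiddler capture, and the assumption that the site accepts a plain form POST is just that, an assumption; a real Yahoo login involves additional tokens and is better handled through their API.

using System;
using System.IO;
using System.Net;
using System.Text;

class LoginScraperSketch
{
    // Shared cookie container so the session cookie from the login response
    // is sent back on the follow-up request.
    static readonly CookieContainer Cookies = new CookieContainer();

    static string Post(string url, string formData)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = Cookies;

        byte[] body = Encoding.UTF8.GetBytes(formData);
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static string Get(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.CookieContainer = Cookies;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Placeholder URLs and field names; take the real ones from the
        // login request you captured in Fiddler.
        Post("https://example.com/login",
             "username=" + Uri.EscapeDataString("user") +
             "&password=" + Uri.EscapeDataString("secret"));

        string inboxHtml = Get("https://example.com/inbox");
        Console.WriteLine(inboxHtml.Length + " bytes of inbox HTML downloaded.");
    }
}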
And finally, once you have the desired HTML, use regular expressions to extract the information you want from it.
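As an illustration of that last step, here is a small example of pulling pieces out of HTML with Regex. The sample markup and the pattern are made up for this sketch; in practice the pattern has to be tailored to the actual inbox HTML, and an HTML parser is usually more robust than regular expressions for anything non-trivial.

using System;
using System.Text.RegularExpressions;

class RegexExtractionSketch
{
    static void Main()
    {
        // Made-up HTML standing in for the downloaded inbox page.
        string html =
            "<tr class=\"msg\"><td class=\"from\">alice@example.com</td>" +
            "<td class=\"subject\">Hello</td></tr>" +
            "<tr class=\"msg\"><td class=\"from\">bob@example.com</td>" +
            "<td class=\"subject\">Meeting notes</td></tr>";

        // Placeholder pattern: captures the sender and subject cells of each row.
        var pattern = new Regex(
            "<td class=\"from\">(?<from>.*?)</td>\\s*<td class=\"subject\">(?<subject>.*?)</td>",
            RegexOptions.Singleline);

        foreach (Match match in pattern.Matches(html))
        {
            Console.WriteLine("{0} - {1}",
                match.Groups["from"].Value,
                match.Groups["subject"].Value);
        }
    }
}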