I am doing a project in which I need to log in to a site and scrape the webpage contents. I tried the following code:
protected void Page_Load(object sender, EventArgs e)
{
    // Download the raw bytes of the page.
    WebClient webClient = new WebClient();
    string strUrl = "http://www.mail.yahoo.com?username=sakthivel123&password=operator&login=1";
    byte[] reqHTML = webClient.DownloadData(strUrl);

    // Decode the bytes as UTF-8 and show the HTML in the label.
    UTF8Encoding objUTF8 = new UTF8Encoding();
    Label1.Text = objUTF8.GetString(reqHTML);
}
This scrapes the login page of the mail site, but I need to scrape my inbox details. Please instruct me on how to proceed. Thanks in advance.
3 Answers
#1
Please see this question and the related questions. We have to study the HTML source of a webpage before we can scrape it properly. So log in manually, get the source of the inbox page, and then study it so you can scrape it.
Why don't you use Yahoo's webmail API? That would be a better solution.
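If you do go the screen-scraping route, a minimal sketch of that "study the source first" step might look like the code below. It assumes you have logged in manually in a browser, saved the inbox page locally as inbox.html (a hypothetical filename), and added the HtmlAgilityPack library, none of which are part of this answer; the XPath expression is only a placeholder you would replace after inspecting the real markup.

using System;
using HtmlAgilityPack;

class InboxSourceStudy
{
    static void Main()
    {
        // Load the inbox page that was saved manually after logging in via a browser.
        var doc = new HtmlDocument();
        doc.Load("inbox.html"); // hypothetical local copy of the inbox page

        // Placeholder XPath: replace with a selector that matches the real inbox markup
        // (e.g. the table rows or list items that hold each message).
        var rows = doc.DocumentNode.SelectNodes("//tr[contains(@class, 'message')]");
        if (rows == null)
        {
            Console.WriteLine("Selector matched nothing; inspect the saved HTML and adjust the XPath.");
            return;
        }

        foreach (var row in rows)
        {
            // InnerText strips the tags; Trim keeps the output readable.
            Console.WriteLine(row.InnerText.Trim());
        }
    }
}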
#2
See this question - Writing a C# program that scans ecommerce website and extracts products pictures + prices + description from them
P.S.: It's called "scrape", and the act of performing a screen scrape would be called (you guessed it!) "screen scraping". The word "scrap", when used as a verb, means to discard, such as "the project has been scrapped!" ;-)
#3
I'd suggest you first use a tool called Fiddler to analyze the communication between the target site and your browser. You can look at all the HTTP headers, cookies, content, etc.
Once your WebClient object is able to replicate the actions of a browser, including logging in, setting the appropriate cookies, and so on, you can automate the procedure.
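As a rough sketch of what "replicating the browser" can look like in code, the example below uses HttpWebRequest with a shared CookieContainer to POST a login form and then request a page with the resulting session cookies. The URLs and form field names are placeholders you would take from your own Fiddler capture, and the assumption that the site accepts a plain form POST is just that, an assumption; a real Yahoo login involves additional tokens and is better handled through their API.

using System;
using System.IO;
using System.Net;
using System.Text;

class LoginScraperSketch
{
    // Shared cookie container so the session cookie from the login response
    // is sent back on the follow-up request.
    static readonly CookieContainer Cookies = new CookieContainer();

    static string Post(string url, string formData)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = Cookies;

        byte[] body = Encoding.UTF8.GetBytes(formData);
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static string Get(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.CookieContainer = Cookies;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Placeholder URLs and field names; take the real ones from the
        // login request you captured in Fiddler.
        Post("https://example.com/login",
             "username=" + Uri.EscapeDataString("user") +
             "&password=" + Uri.EscapeDataString("secret"));

        string inboxHtml = Get("https://example.com/inbox");
        Console.WriteLine(inboxHtml.Length + " bytes of inbox HTML downloaded.");
    }
}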
And finally, once you have the desired HTML, use regular expressions to extract the information you want from it.
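As an illustration of that last step, here is a small example of pulling pieces out of HTML with Regex. The sample markup and the pattern are made up for this sketch; in practice the pattern has to be tailored to the actual inbox HTML, and an HTML parser is usually more robust than regular expressions for anything non-trivial.

using System;
using System.Text.RegularExpressions;

class RegexExtractionSketch
{
    static void Main()
    {
        // Made-up HTML standing in for the downloaded inbox page.
        string html =
            "<tr class=\"msg\"><td class=\"from\">alice@example.com</td>" +
            "<td class=\"subject\">Hello</td></tr>" +
            "<tr class=\"msg\"><td class=\"from\">bob@example.com</td>" +
            "<td class=\"subject\">Meeting notes</td></tr>";

        // Placeholder pattern: captures the sender and subject cells of each row.
        var pattern = new Regex(
            "<td class=\"from\">(?<from>.*?)</td>\\s*<td class=\"subject\">(?<subject>.*?)</td>",
            RegexOptions.Singleline);

        foreach (Match match in pattern.Matches(html))
        {
            Console.WriteLine("{0} - {1}",
                match.Groups["from"].Value,
                match.Groups["subject"].Value);
        }
    }
}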