I need a short code snippet to get a directory listing from an HTTP server.
Thanks
7 solutions
#1
30
A few important considerations before the code:
- The HTTP server has to be configured to allow directory listing for the directories you want;
- Because directory listings are normal HTML pages, there is no standard that defines the format of a directory listing;
- Because of consideration 2, you have to write parsing code specific to each server.
My choice is to use regular expressions. This allows for rapid parsing and customization. You can keep a specific regular expression pattern per site, which gives you a very modular approach. If you plan to add support for new sites without changing the source code, store the mapping from URL to regular expression pattern in an external source.
Example that prints the directory listing from http://www.ibiblio.org/pub/:
namespace Example
{
    using System;
    using System.IO;
    using System.Net;
    using System.Text.RegularExpressions;

    public class MyExample
    {
        // Returns the regex that extracts entry names from the listing page of a known server.
        public static string GetDirectoryListingRegexForUrl(string url)
        {
            if (url.Equals("http://www.ibiblio.org/pub/"))
            {
                // Lazy quantifiers (.*?) keep each match inside a single <a> element.
                return "<a href=\".*?\">(?<name>.*?)</a>";
            }
            throw new NotSupportedException();
        }

        public static void Main(String[] args)
        {
            string url = "http://www.ibiblio.org/pub/";
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                Regex regex = new Regex(GetDirectoryListingRegexForUrl(url));
                // Matches() only yields successful matches, so no extra checks are needed.
                foreach (Match match in regex.Matches(html))
                {
                    Console.WriteLine(match.Groups["name"].Value);
                }
            }
            Console.ReadLine();
        }
    }
}
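The suggestion above about keeping the URL-to-pattern mapping in an external source could be sketched roughly as follows. The class name ListingPatternStore, its methods, and the tab-separated file format are all assumptions made for illustration; they are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: maps a listing URL to the regex pattern used to parse it.
public static class ListingPatternStore
{
    // Parses lines of the form "<url><TAB><pattern>". Blank lines and lines
    // starting with '#' are skipped. This file format is an assumption.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var patterns = new Dictionary<string, string>();
        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line) || line.StartsWith("#", StringComparison.Ordinal))
            {
                continue;
            }
            // Split only on the first tab so the pattern itself may contain tabs.
            string[] parts = line.Split(new[] { '\t' }, 2);
            if (parts.Length == 2)
            {
                patterns[parts[0].Trim()] = parts[1].Trim();
            }
        }
        return patterns;
    }

    // Reads the mapping from a text file on disk.
    public static Dictionary<string, string> Load(string path)
    {
        return Parse(File.ReadLines(path));
    }

    // Mirrors GetDirectoryListingRegexForUrl: unknown URLs are rejected.
    public static string GetPattern(Dictionary<string, string> patterns, string url)
    {
        string pattern;
        if (patterns.TryGetValue(url, out pattern))
        {
            return pattern;
        }
        throw new NotSupportedException("No listing pattern configured for " + url);
    }
}
```

With something like this, GetDirectoryListingRegexForUrl could call ListingPatternStore.GetPattern on a mapping loaded once at startup, so supporting a new site would only mean adding a line to the file.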
#2
8
Basic understanding:
Directory listings are just HTML pages generated by a web server. Each web server generates these HTML pages in its own way because there is no standard way for a web server to list these directories.
The best way to get a directory listing is to make an HTTP request to the URL you want the listing for, then parse the returned HTML and extract all of the links from it.
To parse the HTML links please try to use the HTML Agility Pack.
Directory Browsing:
The web server you'd like to list directories from must have directory browsing turned on to produce this HTML representation of the files in its directories. So you can only get the directory listing if the HTTP server is configured to provide it.
A quick example of the HTML Agility Pack:
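// Requires the HtmlAgilityPack package (using HtmlAgilityPack;).
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(strURL); // HtmlDocument.Load reads local files or streams, so HtmlWeb fetches the URL
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        // do something with att.Value;
    }
}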
Cleaner alternative:
If it is possible in your situation, a cleaner method is to use a protocol designed for directory listings, such as the File Transfer Protocol (FTP), SFTP (an FTP-like protocol over SSH) or FTPS (FTP over SSL).
What if directory browsing is not turned on:
If the web server does not have directory browsing turned on, then there is no easy way to get the directory listing.
The best you could do in this case is to start at a given URL, follow all HTML links on the same page, and try to build a virtual listing of directories yourself based on the relative paths of the resources on these HTML pages. This will not give you a complete listing of the files that are actually on the web server, though.
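As a rough sketch of the virtual-listing idea for a single page: extract every href, resolve it against the page URL, and record the resulting path together with all of its parent directories. The class name VirtualListing, the method CollectPaths, and the simplified href regex are assumptions for illustration; a real crawler would repeat this for each HTML page it discovers and would probably use an HTML parser instead of a regex.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper that builds a partial, "virtual" listing from one page.
public static class VirtualListing
{
    // Simplified assumption: href values are always double-quoted.
    private static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*\"(?<href>[^\"]*)\"", RegexOptions.IgnoreCase);

    public static SortedSet<string> CollectPaths(string html, Uri pageUri)
    {
        var paths = new SortedSet<string>(StringComparer.Ordinal);
        foreach (Match match in HrefPattern.Matches(html))
        {
            Uri resolved;
            // Resolve relative links such as "docs/readme.txt" against the page URL.
            if (!Uri.TryCreate(pageUri, match.Groups["href"].Value, out resolved))
            {
                continue;
            }
            // Skip mailto:, javascript: and similar links, and links to other servers.
            bool isHttp = resolved.Scheme == Uri.UriSchemeHttp || resolved.Scheme == Uri.UriSchemeHttps;
            if (!isHttp || resolved.Host != pageUri.Host)
            {
                continue;
            }
            string path = resolved.AbsolutePath;
            paths.Add(path);
            // Add every parent directory of the resource (the root "/" is left out).
            int slash = path.LastIndexOf('/');
            while (slash > 0)
            {
                paths.Add(path.Substring(0, slash + 1));
                slash = path.LastIndexOf('/', slash - 1);
            }
        }
        return paths;
    }
}
```

The result only contains paths that some page links to, which is why this approach cannot give a complete listing.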
#3
4
I just modified the code above and found that this works best:
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public static class GetallFilesFromHttp
{
    public static string GetDirectoryListingRegexForUrl(string url)
    {
        if (url.Equals("http://ServerDirPath/"))
        {
            // Matches every double-quoted string in the page, quotes included.
            return "\\\"([^\"]*)\\\"";
        }
        throw new NotSupportedException();
    }

    public static void ListDirectory()
    {
        string url = "http://ServerDirPath/";
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Regex regex = new Regex(GetDirectoryListingRegexForUrl(url));
            // Matches() only yields successful matches, so no extra checks are needed.
            foreach (Match match in regex.Matches(html))
            {
                Console.WriteLine(match.ToString());
            }
        }
        Console.ReadLine();
    }
}
#4
2
Thanks for the great post. For me, the pattern below worked better:
<A HREF="\S+">(?<name>\S+)</A>
I also tested it at http://regexhero.net/tester.
To use it in your C# code, you have to add a backslash (\) before every backslash and double quote in the pattern. For instance, in the GetDirectoryListingRegexForUrl method you should use something like this:
return "<A HREF=\"\\S+\">(?<name>\\S+)</A>";
Cheers!
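The escaping rule described above can also be sidestepped with a C# verbatim string literal (@"..."), where backslashes are literal and only double quotes need doubling. The sketch below shows both spellings of an uppercase anchor pattern like the one in this answer. The class name AnchorPatterns, the helper FirstName, and the use of RegexOptions.IgnoreCase are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical holder for the two equivalent spellings of the pattern.
public static class AnchorPatterns
{
    // Regular string literal: every backslash and double quote is escaped with '\'.
    public const string Escaped = "<A HREF=\"\\S+\">(?<name>\\S+)</A>";

    // Verbatim literal: backslashes are literal, double quotes are doubled.
    public const string Verbatim = @"<A HREF=""\S+"">(?<name>\S+)</A>";

    // Returns the link text of the first matching anchor, or null if none matches.
    public static string FirstName(string html)
    {
        Match match = Regex.Match(html, Verbatim, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups["name"].Value : null;
    }
}
```

Both constants hold the same pattern text, so either one can be returned from GetDirectoryListingRegexForUrl. Note that \S+ does not match whitespace, so names containing spaces would need a different pattern.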
#5
1
The following code works well for me when I do not have access to the FTP server:
// Requires: using System.Collections.Generic; using System.IO; using System.Net; using System.Text.RegularExpressions;
public static string[] GetFiles(string url)
{
    List<string> files = new List<string>(500);
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        string html = reader.ReadToEnd();
        // Capture the href value directly; lazy quantifiers keep each match within one <a> element.
        Regex regex = new Regex("<a href=\"(?<href>.*?)\">(?<name>.*?)</a>");
        foreach (Match match in regex.Matches(html))
        {
            files.Add(match.Groups["href"].Value);
        }
    }
    return files.ToArray();
}
However, when I do have access to the FTP server, the following code works much faster:
// Requires: using System; using System.IO; using System.Net;
public static string[] getFtpFolderItems(string ftpURL)
{
    FtpWebRequest request = (FtpWebRequest)WebRequest.Create(ftpURL);
    request.Method = WebRequestMethods.Ftp.ListDirectory;
    // You could add credentials, if needed:
    // request.Credentials = new NetworkCredential("anonymous", "password");
    // Dispose the response and reader so the FTP connection is released.
    using (FtpWebResponse response = (FtpWebResponse)request.GetResponse())
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    }
}
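The HTTP variant above relies on HttpWebRequest, which recent .NET versions mark as obsolete in favor of HttpClient. A minimal sketch of the same idea with HttpClient might look like this. The class name HttpListing and its method names are hypothetical, and the pattern still assumes the server emits lowercase-style `<a href="...">` anchors with double-quoted values.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical HttpClient-based counterpart of GetFiles.
public static class HttpListing
{
    // A single shared HttpClient instance is reused across requests.
    private static readonly HttpClient Client = new HttpClient();

    // Extracts the href value of every <a href="..."> anchor in the page.
    public static List<string> ParseLinks(string html)
    {
        var files = new List<string>();
        MatchCollection matches =
            Regex.Matches(html, "<a href=\"(?<href>[^\"]*)\">", RegexOptions.IgnoreCase);
        foreach (Match match in matches)
        {
            files.Add(match.Groups["href"].Value);
        }
        return files;
    }

    // Downloads the listing page and returns the links found in it.
    public static async Task<List<string>> GetFilesAsync(string url)
    {
        string html = await Client.GetStringAsync(url);
        return ParseLinks(html);
    }
}
```

Keeping the parsing in ParseLinks separate from the download makes it possible to check the pattern against a saved listing page without any network access.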
#6
0
You can't, unless the particular directory you want has directory listing enabled and no default file (usually index.htm, index.html or default.html but always configurable). Only then will you be presented with a directory listing, which will usually be marked up with HTML and require parsing.