Parsing the domain from a URL in PHP

Date: 2022-08-23 10:37:12

I need to build a function which parses the domain from a URL.


So, with


http://google.com/dhasjkdas/sadsdds/sdda/sdads.html


or

http://www.google.com/dhasjkdas/sadsdds/sdda/sdads.html


it should return google.com


with

http://google.co.uk/dhasjkdas/sadsdds/sdda/sdads.html


it should return google.co.uk.


18 Answers

#1


223  

Check out parse_url():


$url = 'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html';
$parse = parse_url($url);
echo $parse['host']; // prints 'google.com'

parse_url() doesn't handle badly mangled URLs very well, but it's fine if you generally expect decent URLs.

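As a quick illustration of the kind of input parse_url() handles poorly: when the scheme is missing, the host component is not recognized at all (the example paths below are hypothetical):

```php
// parse_url() extracts the host reliably when the URL includes a scheme.
$url = 'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html';
echo parse_url($url, PHP_URL_HOST); // prints 'google.com'

// Without a scheme, parse_url() treats the whole string as a path,
// so the 'host' component is missing entirely:
var_dump(parse_url('google.com/some/path', PHP_URL_HOST)); // NULL
echo parse_url('google.com/some/path', PHP_URL_PATH);      // google.com/some/path
```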

#2


85  

$domain = str_ireplace('www.', '', parse_url($url, PHP_URL_HOST));

This returns google.com for both http://google.com/... and http://www.google.com/...

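One caveat: str_ireplace() removes every occurrence of "www." in the host, not only a leading one. A hedged alternative (the strip_www() helper below is a hypothetical name, not part of the answer) anchors the replacement at the start of the host:

```php
function strip_www($host) {
    // remove "www." only when it is the first label of the host
    return preg_replace('/^www\./i', '', $host);
}

echo strip_www(parse_url('http://www.google.com/a/b.html', PHP_URL_HOST)); // google.com
// str_ireplace() would also eat the inner "www." here; the anchored version keeps it:
echo strip_www('static.www.example.com'); // static.www.example.com
```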

#3


19  

From http://us3.php.net/manual/en/function.parse-url.php#93983


For some odd reason, parse_url() returns the host (e.g. example.com) as the path when no scheme is provided in the input URL. So I've written a quick function to get the real host:


function getHost($Address) { 
   $parseUrl = parse_url(trim($Address)); 
   return trim($parseUrl['host'] ? $parseUrl['host'] : array_shift(explode('/', $parseUrl['path'], 2))); 
} 

getHost("example.com"); // Gives example.com 
getHost("http://example.com"); // Gives example.com 
getHost("www.example.com"); // Gives www.example.com 
getHost("http://example.com/xyz"); // Gives example.com 

#4


9  

The code that was meant to work 100% didn't seem to cut it for me. I patched the example a little, but still found code that wasn't helping, and problems with it, so I changed it into a couple of functions (to avoid asking Mozilla for the list all the time, after removing the cache system). This has been tested against a set of 1000 URLs and seemed to work.


function domain($url)
{
    global $subtlds;
    $url = strtolower($url);

    $host = parse_url('http://'.$url, PHP_URL_HOST);

    preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    foreach ($subtlds as $sub) {
        if (preg_match('/\.'.preg_quote($sub).'$/', $host)) {
            preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
        }
    }

    return isset($matches[0]) ? $matches[0] : '';
}

function get_tlds() {
    $address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
    $content = file($address);
    $subtlds = array();
    foreach ($content as $line) {
        $line = trim($line);
        if ($line == '') continue;
        if (substr($line, 0, 2) == '//') continue; // skip comment lines
        $line = preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
        if ($line == '') continue;
        if ($line[0] == '.') $line = substr($line, 1);
        if (!strstr($line, '.')) continue;
        $subtlds[] = $line;
    }

    $subtlds = array_merge(array(
            'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk', 
            'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
            'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au'
        ), $subtlds);

    $subtlds = array_unique($subtlds);

    return $subtlds;    
}

Then use it like


$subtlds = get_tlds();
echo domain('www.example.com');    // outputs: example.com
echo domain('www.example.uk.com'); // outputs: example.uk.com
echo domain('www.example.fr');     // outputs: example.fr

I know I should have turned this into a class, but didn't have time.


#5


7  

function get_domain($url = SITE_URL)
{
    preg_match("/[a-z0-9\-]{1,63}\.[a-z\.]{2,6}$/", parse_url($url, PHP_URL_HOST), $_domain_tld);
    return isset($_domain_tld[0]) ? $_domain_tld[0] : '';
}

get_domain('http://www.cdl.gr'); //cdl.gr
get_domain('http://cdl.gr'); //cdl.gr
get_domain('http://www2.cdl.gr'); //cdl.gr
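One thing to verify before relying on this: the [a-z\.]{2,6}$ part assumes short suffixes such as .gr or .co.uk. For longer modern TLDs the pattern simply fails to match (the host below is a hypothetical example):

```php
// "photography" is longer than 6 characters, so the pattern cannot match
// any suffix of this host, and get_domain() would return nothing useful.
$matched = preg_match("/[a-z0-9\-]{1,63}\.[a-z\.]{2,6}$/", 'example.photography', $m);
var_dump($matched); // int(0) -- no match
var_dump($m);       // array(0) {} -- empty
```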

#6


4  

If you want to extract the host from a string like http://google.com/dhasjkdas/sadsdds/sdda/sdads.html, parse_url() is an acceptable solution for you.


But if you want to extract the domain or its parts, you need a package that uses the Public Suffix List. Yes, you can use string functions around parse_url(), but they will sometimes produce incorrect results.


I recommend TLDExtract for domain parsing; here is sample code that shows the difference:


$extract = new LayerShifter\TLDExtract\Extract();

# For 'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html'

$url = 'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html';

parse_url($url, PHP_URL_HOST); // will return google.com

$result = $extract->parse($url);
$result->getFullHost(); // will return 'google.com'
$result->getRegistrableDomain(); // will return 'google.com'
$result->getSuffix(); // will return 'com'

# For 'http://search.google.com/dhasjkdas/sadsdds/sdda/sdads.html'

$url = 'http://search.google.com/dhasjkdas/sadsdds/sdda/sdads.html';

parse_url($url, PHP_URL_HOST); // will return 'search.google.com'

$result = $extract->parse($url);
$result->getFullHost(); // will return 'search.google.com'
$result->getRegistrableDomain(); // will return 'google.com'

#7


2  

Here is the code I made that reliably finds only the domain name, since it takes the Mozilla sub-TLD list into account. The only thing you have to check is how you cache that file, so you don't query Mozilla every time.


For some strange reason, domains like co.uk are not in the list, so you have to hack around it and add them manually. It's not the cleanest solution, but I hope it helps someone.


//=====================================================
static function domain($url)
{
    $slds = "";
    $url = strtolower($url);

    $address = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';
    if(!$subtlds = @kohana::cache('subtlds', null, 60)) 
    {
        $content = file($address);
        foreach($content as $num => $line)
        {
            $line = trim($line);
            if($line == '') continue;
            if (substr($line, 0, 2) == '//') continue; // skip comment lines
            $line = @preg_replace("/[^a-zA-Z0-9\.]/", '', $line);
            if($line == '') continue;  //$line = '.'.$line;
            if(@$line[0] == '.') $line = substr($line, 1);
            if(!strstr($line, '.')) continue;
            $subtlds[] = $line;
            //echo "{$num}: '{$line}'"; echo "<br>";
        }
        $subtlds = array_merge(Array(
            'co.uk', 'me.uk', 'net.uk', 'org.uk', 'sch.uk', 'ac.uk', 
            'gov.uk', 'nhs.uk', 'police.uk', 'mod.uk', 'asn.au', 'com.au',
            'net.au', 'id.au', 'org.au', 'edu.au', 'gov.au', 'csiro.au',
            ),$subtlds);

        $subtlds = array_unique($subtlds);
        //echo var_dump($subtlds);
        @kohana::cache('subtlds', $subtlds);
    }


    preg_match('/^(http:[\/]{2,})?([^\/]+)/i', $url, $matches);
    //preg_match("/^(http:\/\/|https:\/\/|)[a-zA-Z-]([^\/]+)/i", $url, $matches);
    $host = @$matches[2];
    //echo var_dump($matches);

    preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    foreach($subtlds as $sub) 
    {
        if (preg_match('/\.'.preg_quote($sub).'$/', $host))
            preg_match("/[^\.\/]+\.[^\.\/]+\.[^\.\/]+$/", $host, $matches);
    }

    return @$matches[0];
}

#8


2  

You can pass PHP_URL_HOST to parse_url() as the second parameter:


$url = 'http://google.com/dhasjkdas/sadsdds/sdda/sdads.html';
$host = parse_url($url, PHP_URL_HOST);
print $host; // prints 'google.com'

#9


2  

$domain = parse_url($url, PHP_URL_HOST);
echo implode('.', array_slice(explode('.', $domain), -2, 2));
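This one-liner keeps the last two labels of the host, which is fine for .com but silently truncates multi-label suffixes. A sketch of both cases, wrapped in a hypothetical helper for illustration:

```php
function last_two_labels($url) {
    $host = parse_url($url, PHP_URL_HOST);
    // keep only the final two dot-separated labels
    return implode('.', array_slice(explode('.', $host), -2, 2));
}

echo last_two_labels('http://www.google.com/a.html');   // google.com
echo last_two_labels('http://www.google.co.uk/a.html'); // co.uk -- wrong, should be google.co.uk
```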

#10


2  

I've found that @philfreo's solution (referenced from php.net) works pretty well, but in some cases it triggers PHP "Notice" and "Strict Standards" messages. Here is a fixed version of that code.


function getHost($url) { 
   $parseUrl = parse_url(trim($url)); 
   if(isset($parseUrl['host']))
   {
       $host = $parseUrl['host'];
   }
   else
   {
        $path = explode('/', $parseUrl['path']);
        $host = $path[0];
   }
   return trim($host); 
} 

echo getHost("http://example.com/anything.html");           // example.com
echo getHost("http://www.example.net/directory/post.php");  // www.example.net
echo getHost("https://example.co.uk");                      // example.co.uk
echo getHost("www.example.net");                            // example.net
echo getHost("subdomain.example.net/anything");             // subdomain.example.net
echo getHost("example.net");                                // example.net

#11


1  

parse_url() didn't work for me. It only returned the path. Switching to basics using PHP 5.3+:


$url  = str_replace('http://', '', strtolower( $s->website));
if (strpos($url, '/'))  $url = strstr($url, '/', true);
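The snippet above only strips http://, so https:// URLs slip through. A hedged variant (host_from_string() is a hypothetical name) normalizes the input and lets parse_url() do the work:

```php
function host_from_string($s) {
    $s = trim(strtolower($s));
    // parse_url() needs a scheme to populate the 'host' component
    if (!preg_match('#^https?://#', $s)) {
        $s = 'http://' . $s;
    }
    return parse_url($s, PHP_URL_HOST);
}

echo host_from_string('https://www.example.com/a/b'); // www.example.com
echo host_from_string('example.com/a/b');             // example.com
```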

#12


1  

I have edited it for you:


function getHost($Address) { 
    $parseUrl = parse_url(trim($Address));
    if (isset($parseUrl['host'])) {
        $host = $parseUrl['host'];
    } else {
        // no scheme: the host ends up as the first path segment
        $pathParts = explode('/', $parseUrl['path'], 2);
        $host = trim(array_shift($pathParts));
    }

    $parts = explode('.', $host);
    $num_parts = count($parts);

    $h = '';
    $start = ($parts[0] == "www") ? 1 : 0; // skip a leading "www" label
    for ($i = $start; $i < $num_parts; $i++) { 
        $h .= $parts[$i] . '.';
    }
    return substr($h, 0, -1);
}

Any URL of the form www.domain.ltd or domain.ltd will result in: domain.ltd (note that only a leading "www" label is stripped; deeper subdomains such as sub1.subn.domain.ltd are kept).


#13


0  

Check out parse_url()


#14


0  

Here is my crawler based on the above answers.

  1. Class implementation (I like OOP :)
  2. It uses cURL, so HTTP auth can be used if required
  3. It only crawls links that belong to the start URL's domain
  4. It prints the HTTP response code (useful for checking problems on a site)

CRAWL CLASS CODE


class crawler
{
    protected $_url;
    protected $_depth;
    protected $_host;

    public function __construct($url, $depth = 5)
    {
        $this->_url = $url;
        $this->_depth = $depth;
        $parse = parse_url($url);
        $this->_host = $parse['host'];
    }

    public function run()
    {
        $this->crawl_page($this->_url, $this->_depth);
    }

    public function crawl_page($url, $depth = 5)
    {
        static $seen = array();
        if (isset($seen[$url]) || $depth === 0) {
            return;
        }
        $seen[$url] = true;
        list($content, $httpcode) = $this->getContent($url);

        $dom = new DOMDocument('1.0');
        @$dom->loadHTML($content);
        $this->processAnchors($dom, $url, $depth);

        ob_end_flush();
        echo "CODE::$httpcode, URL::$url <br>";
        ob_start();
        flush();
        // echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
    }

    public function processAnchors($dom, $url, $depth)
    {
        $anchors = $dom->getElementsByTagName('a');
        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            // Crawl only link that belongs to the start domain
            if (strpos($href, $this->_host) !== false)
                $this->crawl_page($href, $depth - 1);
        }
    }

    public function getContent($url)
    {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

        /* Get the HTML or whatever is linked in $url. */
        $response = curl_exec($handle);

        /* Check for 404 (file not found). */
        $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
        if ($httpCode == 404) {
            /* Handle 404 here. */
        }

        curl_close($handle);
        return array($response, $httpCode);
    }
}

// USAGE
$startURL = 'http://YOUR_START_URL';
$depth = 2;
$crawler = new crawler($startURL, $depth);
$crawler->run();

#15


0  

I'm adding this answer late since this is the answer that pops up most on Google...


You can use PHP to...


$url = "www.google.co.uk";
$host = parse_url($url, PHP_URL_HOST);
// $host == "www.google.co.uk"

to grab the host, but not the private domain to which the host refers. (For example, www.google.co.uk is the host, but google.co.uk is the private domain.)


To grab the private domain, you need to know the list of public suffixes under which a private domain can be registered. This list happens to be curated by Mozilla at https://publicsuffix.org/


The code below works once an array of public suffixes has been created. Simply call


$domain = get_private_domain("www.google.co.uk");

with the remaining code...


// find some way to parse the above list of public suffix
// then add them to a PHP array
$suffix = [... all valid public suffix ...];

function get_public_suffix($host) {
  $parts = explode(".", $host); // split() was removed in PHP 7
  while (count($parts) > 0) {
    if (is_public_suffix(join(".", $parts)))
      return join(".", $parts);

    array_shift($parts);
  }

  return false;
}

function is_public_suffix($host) {
  global $suffix;
  return isset($suffix[$host]);
}

function get_private_domain($host) {
  $public = get_public_suffix($host);
  $public_parts = explode(".", $public);
  $all_parts = explode(".", $host);

  $private = [];

  for ($x = 0; $x < count($public_parts); ++$x) 
    $private[] = array_pop($all_parts);

  if (count($all_parts) > 0)
    $private[] = array_pop($all_parts);

  return join(".", array_reverse($private));
}

#16


-1  

This will generally work very well if the input URL is not total junk. It removes the subdomain.


$host = parse_url( $Row->url, PHP_URL_HOST );
$parts = explode( '.', $host );
$parts = array_reverse( $parts );
$domain = $parts[1].'.'.$parts[0];

Example


Input: http://www2.website.com:8080/some/file/structure?some=parameters


Output: website.com

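The failure mode is worth knowing: rebuilding the domain from the last two labels breaks on multi-label suffixes such as .co.uk (the input below is a hypothetical example):

```php
// The port and path are stripped by parse_url(), but the suffix logic still fails:
$host  = parse_url('http://www2.website.co.uk:8080/some/file', PHP_URL_HOST);
$parts = array_reverse(explode('.', $host));
echo $parts[1] . '.' . $parts[0]; // co.uk -- the registrable domain website.co.uk is lost
```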

#17


-1  

Combining the answers of worldofjr and Alix Axel into one small function that will handle most use-cases:


function get_url_hostname($url) {

    $parse = parse_url($url);
    return str_ireplace('www.', '', $parse['host']);

}

get_url_hostname('http://www.google.com/example/path/file.html'); // google.com

#18


-6  

Just use the following ...


<?php
   echo $_SERVER['SERVER_NAME'];
?>
