Fetching an HTML page and storing it in MySQL - how to

Time: 2022-03-07 09:17:10
  • What's the best way to store a formatted HTML page, CSS included, in a MySQL database? Is it even possible?

  • What should the column type be? How do I retrieve the stored HTML and display it correctly using PHP?

  • What if the page I want to fetch has pictures and videos? Should I store the page as a blob?

  • What's the best way to fetch a page using PHP: cURL, fopen, or something else?

That's a lot of questions, but I really need your help to get on the right track.

Thanks a lot.

5 solutions

#1


8  

Quite simple, try this code I wrote for you.

It covers the basics of grabbing a page's source and saving it in a DB.

I left out proper error handling and everything else to keep it simple for now...

I didn't write a function to show the result, but you can print $source to view it.

Hope this helps.

<?php

function GetPage($URL)
{
    #Get the source content of the URL (file_get_contents returns false on failure)
    $source = file_get_contents($URL);
    if ($source === false) {
        throw new RuntimeException("Could not fetch $URL");
    }

    #Extract the base URL (scheme and host) from the current one
    $scheme = parse_url($URL, PHP_URL_SCHEME); //Ex: http
    $host = parse_url($URL, PHP_URL_HOST); //Ex: www.google.com
    $raw_url = $scheme . '://' . $host; //Ex: http://www.google.com

    #Replace root-relative links with absolute ones
    $relative = array();
    $absolute = array();

    #Patterns to search for; (?!\/) skips protocol-relative URLs such as src="//cdn.example.com/x.js"
    $relative[0] = '/src="\/(?!\/)/';
    $relative[1] = '/href="\/(?!\/)/';

    #Strings to replace them with
    $absolute[0] = 'src="' . $raw_url . '/';
    $absolute[1] = 'href="' . $raw_url . '/';

    $source = preg_replace($relative, $absolute, $source); //Ex: src="/image/google.png" to src="http://www.google.com/image/google.png"

    return $source;
}

function SaveToDB($source)
{
    #Make mysqli throw exceptions on errors (already the default since PHP 8.1)
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    #Connect to the DB and select the database (the old mysql_* functions were removed in PHP 7)
    $db = new mysqli('localhost', 'root', '', 'test');

    #Ask for UTF-8 encoding (utf8mb4 also covers 4-byte characters such as emoji)
    $db->set_charset('utf8mb4');

    #Prepare the query; the ? placeholder makes manual escaping unnecessary
    #Save it in a text column, that's it... (TEXT holds up to 64 KB, MEDIUMTEXT up to 16 MB)
    $stmt = $db->prepare('INSERT INTO website (source) VALUES (?)');

    #Bind the page source as a string and run the query
    $stmt->bind_param('s', $source);
    $stmt->execute();

    #Close the statement and the connection
    $stmt->close();
    $db->close();
}

$source = GetPage('http://www.google.com');

SaveToDB($source);

?>
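The answer stops at saving, so here is a rough sketch of the retrieval half of the question. It is only an illustration under assumptions: it uses PDO, and it assumes the website table also has an auto-increment id column, which the answer never mentions. The function names savePage and loadPage are made up for this example.

```php
<?php
// Sketch only: the `id` column and both function names are assumptions,
// not part of the answer above.

// Store one page and return the id of the new row.
function savePage(PDO $pdo, string $source): int
{
    $stmt = $pdo->prepare('INSERT INTO website (source) VALUES (:source)');
    $stmt->execute([':source' => $source]);
    return (int) $pdo->lastInsertId();
}

// Load a stored page; returns null when no row has that id.
function loadPage(PDO $pdo, int $id): ?string
{
    $stmt = $pdo->prepare('SELECT source FROM website WHERE id = :id');
    $stmt->execute([':id' => $id]);
    $source = $stmt->fetchColumn();
    return $source === false ? null : $source;
}
```

With a MySQL connection such as new PDO('mysql:host=localhost;dbname=test;charset=utf8mb4', 'root', ''), echo loadPage($pdo, $id) sends the stored HTML to the browser as-is, so it renders as a page, while echo htmlspecialchars(loadPage($pdo, $id)) shows it as source text instead.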

#2


1  

Pull down the whole page using fopen and parse out any URLs (such as images and CSS). You'll want to run a loop to grab each of the URLs for the files that make up the page. Store these as well, and replace the URLs that pointed to the other site's files with your new links. (This avoids any issues if the files change or are removed in the future.)
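The "parse out any URLs" step could look roughly like this. It is a minimal sketch, assuming PHP's DOM extension is available; the function name extractAssetUrls is hypothetical, and it only looks at img, script and link tags (so it also picks up link tags that are not stylesheets, and it misses URLs inside CSS).

```php
<?php
// Hypothetical helper: collect the URLs that img, script and link tags point to.
function extractAssetUrls(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    $targets = ['img' => 'src', 'script' => 'src', 'link' => 'href'];
    foreach ($targets as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attr);
            if ($value !== '') {
                $urls[] = $value;
            }
        }
    }

    // Drop duplicates and reindex so the result is a plain list.
    return array_values(array_unique($urls));
}
```

Each returned URL can then be fetched and stored in the loop the answer describes, and the matching attribute rewritten to point at the stored copy.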

I'd recommend using a blob datatype just because it would allow you to store all the files in one table, but you could also use one table with a text datatype for the pages and another with a blob datatype for images and other files.

Edit: If you are storing as a blob datatype, look into base64_encode(). It will increase the storage footprint on the server, but you'll avoid any issues with quotes and special characters.
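To put a number on that trade-off, here is a small sketch of the base64 round trip. The 300 random bytes stand in for an image file and are purely illustrative. Base64 emits 4 output characters for every 3 input bytes, so the encoded form is about a third larger.

```php
<?php
// Stand-in for a binary file such as an image (illustrative data only).
$binary = random_bytes(300);

// Encode before storing; the result contains only A-Z, a-z, 0-9, '+', '/' and '='.
$encoded = base64_encode($binary);

// Decode after reading it back; strict mode returns false on invalid input.
$decoded = base64_decode($encoded, true);

echo strlen($binary), ' bytes raw, ', strlen($encoded), " bytes encoded\n";
```

If the insert goes through a prepared statement with bound parameters, quoting is no longer a concern and the raw bytes can be stored without paying this overhead.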

#3


1  

Don't use a relational database to store files. Use a filesystem or a NoSQL solution.

You might want to look into the various open source spiders that are available (htdig and httrack come to mind).

#4


1  

I'd store the URLs in a database and set up a cron job to wget the pages regularly, storing them in their own keyed local directories. Using wget will allow you to cache the page, and optionally cache its images, scripts, etc. as well. You can also have your wget command rewrite the embedded URLs so that you don't have to cache everything.

See the man page for wget; you may also consider searching for "wget backup website" or similar.

(By "keyed directories" I mean that your database table would have 2 fields, a 'key' and a 'url'. The [unique] 'key' would then be the path where you archive the website using wget.)
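As a rough illustration of the key/url idea, the cron script could build one wget command per table row like this. The function name buildWgetCommand, the /var/archive base directory and the choice of flags are assumptions for the sketch; --page-requisites fetches the images, scripts and stylesheets a page needs, and --convert-links rewrites the embedded URLs.

```php
<?php
// Hypothetical helper: build the wget command for one (key, url) row.
function buildWgetCommand(string $key, string $url, string $baseDir): string
{
    // The key becomes a directory name, so only allow a conservative character set.
    if (!preg_match('/^[A-Za-z0-9_-]+\z/', $key)) {
        throw new InvalidArgumentException("Unsafe key: $key");
    }
    $target = rtrim($baseDir, '/') . '/' . $key;

    // escapeshellarg keeps the path and the URL from being interpreted by the shell.
    return 'wget --quiet --page-requisites --convert-links'
        . ' --directory-prefix=' . escapeshellarg($target)
        . ' ' . escapeshellarg($url);
}

echo buildWgetCommand('example', 'http://www.example.com/', '/var/archive'), "\n";
```

The resulting string can be passed to exec() from the cron script, once for each row in the table.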

#5


-2  

You can store the data in a text column in MySQL, but you have to escape it first because the page may contain many quotes and special characters.
See this question: it isn't an exact match for yours, but it will help when you store the data in the database.
As for the images and videos: if you are storing the page content, it will contain only the paths to those images and videos, so storing it in the database won't be a problem.
