I'm developing a website that depends on page-visit data. For instance, it has sections that show users which parts of the website (which items) have been visited the most. To implement this feature, two strategies come to mind:
- Create a page hit counter, sort the pages by the number of visits and pick the highest ones.
- Create a Google Analytics account and use its info.
If I choose the first strategy, I would need a very fast and accurate hit counter that can distinguish unique IPs (or users). I believe MySQL wouldn't be a good choice, since a lot of page visits means a lot of DB locks and performance problems. I think a fast logging class would be a better fit.
The second option looks attractive once the problems of the first one emerge, but I don't know whether Google Analytics offers a way (such as an API) to access the information I want. And if it does, is it fast enough?
Which approach (or even an alternative approach) do you suggest I take? Which one is faster? Performance is my top priority. Thanks.
UPDATE: Thank you. It's interesting to see different answers. These answers reminded me of an important factor. My website updates the "most visited" items every 8 minutes, so I don't need the data in real time, but I need it to be accurate enough every 8 minutes or so. What I had in mind was this:
- Log every page visit to a simple text log file
- Send a cookie to the user to separate unique users
- Every 8 minutes, load the log file, collect the info and update the MySQL tables.
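The aggregation step in that plan could look roughly like the sketch below. It is written in Python purely for illustration, and it assumes a hypothetical log format of one tab-separated line per visit (`timestamp`, `visitor_id` taken from the cookie, `page`); neither the format nor the function name `top_pages` comes from the question.

```python
from collections import defaultdict


def top_pages(log_lines, limit=10):
    """Rank pages by the number of distinct visitor IDs seen in the log.

    Assumed (hypothetical) line format: timestamp<TAB>visitor_id<TAB>page
    """
    visitors_by_page = defaultdict(set)
    for line in log_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines instead of failing the whole run
        _timestamp, visitor_id, page = parts
        # A set keeps each visitor once per page, so repeat visits don't inflate the count.
        visitors_by_page[page].add(visitor_id)
    counts = {page: len(ids) for page, ids in visitors_by_page.items()}
    # Highest count first; ties are broken by page name so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:limit]
```

A job scheduled every 8 minutes could pass the open log file to `top_pages` and write the returned `(page, unique_visitors)` pairs to the MySQL table that backs the "most visited" section.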
That said, I wouldn't want to reinvent the wheel. If a 3rd party service can meet my requirements, I would be happy to use it.
3 Answers
#1
0
Given that you are planning to use the page hit data to determine what to display on your site, I'd suggest logging the page hit info yourself. You don't want to be reliant upon some 3rd party service that you'd have to interrogate in order to create your page. This is especially true if you are loading that data in real time, as you'd have to interrogate that service for every incoming request to your site.
I'd be inclined to save the data yourself in a database. If you're really concerned about the performance of the inserts, then you could investigate intercepting requests (I'm not sure how you go about this in PHP, but I'm assuming it's possible) and then passing the request data off to a separate thread to store the request info. With a separate thread handling the logging, you won't hold up your response to the end user.
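As a rough illustration of handing the logging off to a separate thread, here is a minimal Python sketch (not PHP, and not a statement about how PHP would do it). The class name `HitLogger` and its methods are hypothetical; the request handler only puts a line on an in-memory queue, and a background worker does the slow write.

```python
import queue
import threading


class HitLogger:
    """Hypothetical helper: queue hit records and write them on a worker thread."""

    def __init__(self, path):
        self._path = path
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, line):
        """Called from the request handler; only enqueues, so it returns at once."""
        self._queue.put(line)

    def close(self):
        """Write out everything still queued, then stop the worker."""
        self._queue.put(None)  # sentinel that tells the worker to exit
        self._worker.join()

    def _run(self):
        with open(self._path, "a", encoding="utf-8") as log_file:
            while True:
                line = self._queue.get()
                if line is None:
                    break
                log_file.write(line + "\n")
```

The same shape works if the worker inserts into a database instead of appending to a file; the point is only that the response does not wait for the write.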
Also, given that you are planning to use the collected data to "... show the users which parts of the website (which items) have been visited the most", you'll need to think about how you access this data to build your dynamic page. Maybe it'd be good to store a consolidated count for each resource. For example, rather than having 30000 rows showing that index.php was requested, maybe have one row showing that index.php was requested 30000 times. This would certainly be quicker to reference than querying what could become quite a large table.
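The consolidated-count idea can be sketched as follows. This uses Python with SQLite as a stand-in so the example is self-contained; the question is about MySQL, and the table name `page_hits` and the helper functions are assumptions made for illustration only.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and make sure the one-row-per-page table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_hits ("
        "page TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
    )
    return conn


def record_hits(conn, page, count=1):
    """Add `count` hits to the single row for `page`, creating it if needed."""
    conn.execute("INSERT OR IGNORE INTO page_hits (page, hits) VALUES (?, 0)", (page,))
    conn.execute("UPDATE page_hits SET hits = hits + ? WHERE page = ?", (count, page))
    conn.commit()


def most_visited(conn, limit=10):
    """Return (page, hits) pairs, most visited first."""
    cursor = conn.execute(
        "SELECT page, hits FROM page_hits ORDER BY hits DESC, page LIMIT ?", (limit,)
    )
    return cursor.fetchall()
```

Reading the "most visited" list is then a small ordered query over one row per page, rather than a `GROUP BY` over every recorded request.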
#2
0
Google Analytics has latency, and it samples some of the data returned through the API, so that's out.
You could try the API from Clicky. Bear in mind that:
Free accounts are limited to the last 30 days of history, and 100 results per request.
There are many examples of hit counters out there, but it sounds like you didn't find one that met your needs.
#3
0
I'm assuming you don't need real-time data. If that's the case, I'd probably just read the data out of the web server log files.
Your web server can distinguish IP addresses. There's no fully reliable way to distinguish users. I live in a university town; half the dormitory students have the same university IP address. I think Google Analytics relies on cookies to identify users, but shared computers make that somewhat less than 100% reliable. (But that might not be a big deal.)
"Visited the most" is also a little fuzzy. The easy way out is to count every hit on a particular page as a visit. But a "visit" of 300 milliseconds is of questionable worth. (Probably realized they clicked the wrong link, and hit the "back" button before the page rendered.)
Unless there are requirements I don't know about, I'd probably start by using awk to extract the timestamp, IP address, and page name into a CSV file, then load the CSV file into a database.
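The extraction step could be sketched like this. It uses Python rather than awk, and it assumes the server writes the Common Log Format (for example `127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.php HTTP/1.0" 200 2326`); a different log format would need a different pattern. The names `LOG_PATTERN` and `access_log_to_csv` are made up for this example.

```python
import csv
import io
import re

# Matches the leading fields of a Common Log Format line (an assumed format).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+)[^"]*" \d{3} \S+'
)


def access_log_to_csv(log_lines, out):
    """Write timestamp, ip, page rows to `out`; return the number of rows written."""
    writer = csv.writer(out)
    writer.writerow(["timestamp", "ip", "page"])
    rows = 0
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # ignore lines that are not in the assumed format
        writer.writerow([match["timestamp"], match["ip"], match["page"]])
        rows += 1
    return rows


if __name__ == "__main__":
    sample = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.php HTTP/1.0" 200 2326'
    buffer = io.StringIO()
    access_log_to_csv([sample], buffer)
    print(buffer.getvalue())
```

The resulting CSV can then be bulk-loaded into the database on whatever schedule the site needs.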