After reading several of the hits from a quick Google search, it seems there is not a whole lot of consistency when it comes to determining average URL length.
I know IE has a maximum URL length of 2083 characters (from here) - so I have a good maximum to work with.
My concern is that I am writing a URL-shortener in PHP (similar to some other questions on SO), and want to make sure I am not likely to exceed the storage capability of the server hosting it.
If all URLs are the IE maximum, then 2^32 of them won't fit comfortably anywhere - it'd take 2KB x 4 billion ~= 8TB of storage: an unrealistic expectation.
Without adding-in a trimming function (ie, purging "old" shortened URLs), what is the safest way to calculate storage usage of the app?
Is ~34 characters a safe guess? If so, then a fully-populated database (using an int type for the primary key) would chew up 292GB of space (146GB for the URLs themselves, doubled to allow for any metadata that may need to be stored).
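The arithmetic behind these figures can be checked with a rough sketch like the one below. It assumes one byte per character, decimal (SI) units, and no database overhead; the function names are made up for illustration.

```python
# Back-of-the-envelope storage estimate for a fully populated shortener table.
# Assumptions: 1 byte per character, no row/index overhead, SI units.
ROWS = 2 ** 32  # every value of an unsigned 32-bit primary key


def url_storage_bytes(avg_url_len: int, rows: int = ROWS) -> int:
    """Raw bytes needed for the URL text alone."""
    return avg_url_len * rows


def human(nbytes: float) -> str:
    """Format a byte count using decimal units, capped at TB."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if nbytes < 1000 or unit == "TB":
            return f"{nbytes:.1f} {unit}"
        nbytes /= 1000


print(human(url_storage_bytes(34)))    # 146.0 GB (the ~34-character guess)
print(human(url_storage_bytes(2083)))  # 8.9 TB (every URL at the IE maximum)
```

Doubling the 34-character result for metadata gives the 292GB figure; the "8TB" above comes from rounding 2083 down to 2K and 2^32 down to 4 billion.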
What is the best-guess for an application such as this?
4 Solutions
#1
2
Well, you don't need to know the average URL length. It is a guess, but I'd figure that a URL shortener is mainly used to shorten long URLs. Why bother shortening one that is short already? :)
That said, there's another issue. A database will have some overhead too, so you can't just calculate an average URL length and say that is the average number of bytes per row.
I've written a URL shortener myself and it already contains about 45 items. So I'd suggest you write yours, and by the time it actually contains 2^32 URLs, buying an 8TB hard disk will probably not pose a problem anymore. ;-)
#2
21
This is probably unknowable without indexing the entire Internet, but according to an analysis by Kelvin Tan on a dataset of 6,627,999 unique URLs from 78,764 unique domains, the answer is 76.97:
Mean: 76.97
Standard Deviation: 37.41
95th% confidence interval: 157
99.5th% confidence interval: 218
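To turn these figures into a storage estimate, something like the following sketch could be used. It assumes one byte per character; `overhead_per_row` is a hypothetical allowance for keys and metadata, not a measured value.

```python
# Figures quoted above from Kelvin Tan's analysis (lengths in characters).
MEAN_LEN = 76.97
LEN_95 = 157    # the reported "95th%" figure
LEN_995 = 218   # the reported "99.5th%" figure


def estimated_bytes(rows: int, avg_len: float = MEAN_LEN,
                    overhead_per_row: int = 0) -> int:
    """Expected bytes for `rows` URLs at the given average length.

    overhead_per_row is a hypothetical per-row allowance (keys, metadata).
    """
    return round(rows * (avg_len + overhead_per_row))


for rows in (10**6, 10**9, 2**32):
    print(f"{rows:>13,} rows: {estimated_bytes(rows) / 1e9:.2f} GB")
```

At this mean, a full 2^32 rows comes to roughly 330GB of URL text, and a column sized around the 218-character figure would, per the quoted analysis, accommodate the vast majority of URLs in that dataset.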
#3
4
I'm not sure what is typical, but of 11,000 URLs in our request database, the average length is 62 characters. We may be an exception because every month we receive hundreds of requests from our customers for items from Japan. Our database includes hundreds of URLs with several hundred characters. The longest is a Google Translate link at 1689 characters.
top 10 len(producturl): 1689 792 707 693 647 606 574 569 562 560
sample url 647 characters:
For estimating purposes, you should extrapolate from some dataset, after using the standard deviation to throw out the outliers that could distort your mean.
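One way to do that trimming is sketched below. The two-sigma cutoff and the sample lengths are assumptions chosen for illustration, not values from our database.

```python
from statistics import mean, pstdev


def trimmed_mean(lengths, max_sigma=2.0):
    """Mean of the values within max_sigma standard deviations of the raw mean."""
    mu = mean(lengths)
    sigma = pstdev(lengths)
    if sigma == 0:
        return mu  # all values identical, nothing to trim
    kept = [n for n in lengths if abs(n - mu) <= max_sigma * sigma]
    return mean(kept)


# Hypothetical sample: mostly ordinary URLs plus one very long outlier.
sample = [40, 55, 62, 58, 71, 49, 66, 60, 53, 1689]

print(mean(sample))          # raw mean, pulled upward by the outlier
print(trimmed_mean(sample))  # mean after discarding the outlier
```

A single pass like this only removes the most extreme values; with heavily skewed data a median or a percentile may be a more robust basis for the estimate.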
#4
3
From RFC 2068 section 3.2.1:
The HTTP protocol does not place any a priori limit on the length of a URI. Servers MUST be able to handle the URI of any resource they serve, and SHOULD be able to handle URIs of unbounded length if they provide GET-based forms that could generate such URIs. A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer than the server can handle (see section 10.4.15).
Note: Servers should be cautious about depending on URI lengths above 255 bytes, because some older client or proxy implementations may not properly support these lengths.
Although IE (and probably most other browsers) supports much longer URI lengths, I don't believe most forms or client-side apps rely on anything above 255 bytes working. Your server logs should provide some statistics about what kind of URLs you are seeing.
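As a rough sketch of pulling such statistics out of a log, the snippet below assumes the server writes Common or Combined Log Format with the request line in double quotes; the regular expression, function names, and example lines are all illustrative assumptions. Note that it measures the request target (path plus query string), not the full URL including scheme and host.

```python
import re
from statistics import mean

# Matches the quoted request line of a Common/Combined Log Format entry,
# e.g. "GET /some/path?x=1 HTTP/1.1" (the log format is an assumption).
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')


def url_lengths(log_lines):
    """Return the length of the request target on each matching log line."""
    lengths = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            lengths.append(len(match.group(1)))
    return lengths


def summarize(lengths):
    """Basic statistics, including how many targets exceed 255 bytes."""
    return {
        "count": len(lengths),
        "mean": mean(lengths) if lengths else 0,
        "max": max(lengths, default=0),
        "over_255": sum(1 for n in lengths if n > 255),
    }


# Hypothetical log lines for illustration only.
example = [
    '203.0.113.5 - - [10/Oct/2010:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2010:13:56:01 +0000] "GET /search?q='
    + "a" * 300 + ' HTTP/1.1" 200 512',
]
print(summarize(url_lengths(example)))
```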