We've got a healthy debate going on in the office this week. We're creating a database to store proxy information. For the most part we have the schema worked out, except for how we should store IPs. One camp wants to use 4 smallints, one for each octet, and the other wants to use 1 big int populated via INET_ATON.
These tables are going to be huge, so performance is key. I am in the middle here, as I normally use MS SQL and 4 small ints in my world. I don't have enough experience storing IPs at this kind of volume.
We'll be using Perl and Python scripts to access the database to further normalize the data into several other tables for top talkers, interesting traffic, etc.
I am sure there are some here in the community who have done something similar to what we are doing, and I am interested in hearing about their experiences and which route is best for IP addresses: 1 big int or 4 small ints.
EDIT - One of our concerns is space. This database is going to be huge, on the order of 500,000,000 records a day, so we are trying to weigh the space issue along with the performance issue.
EDIT 2 - Some of the conversation has turned to the volume of data we are going to store... that's not my question. The question is which is the preferable way to store an IP address, and why. Like I've said in my comments, we work for a large Fortune 50 company. Our log files contain usage data from our users. This data in turn will be used within a security context to drive some metrics and several security tools.
5 Answers
#1
24
I would suggest looking at the types of queries you will be running to decide which format to adopt.
Only if you need to pull out or compare individual octets would you have to consider splitting them into separate fields.
Otherwise, store it as a 4-byte integer. That also has the bonus of allowing you to use the MySQL built-in INET_ATON() and INET_NTOA() functions.
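Since the question mentions Python scripts, here is a minimal sketch of doing the same conversion on the client side. The helper names (ip_to_int, int_to_ip, octet) are made up for illustration, and the standard ipaddress module is stricter than INET_ATON() (it rejects short forms such as "127.1"), so treat it as a rough equivalent rather than a drop-in replacement. The octet() helper shows that a single octet can still be recovered from the packed integer with a shift and a mask.

```python
import ipaddress


def ip_to_int(dotted: str) -> int:
    """Rough client-side equivalent of MySQL's INET_ATON() for dotted-quad IPv4."""
    return int(ipaddress.IPv4Address(dotted))


def int_to_ip(value: int) -> str:
    """Rough client-side equivalent of MySQL's INET_NTOA()."""
    return str(ipaddress.IPv4Address(value))


def octet(value: int, index: int) -> int:
    """Return octet 0..3 (0 is the leftmost) from the packed 32-bit integer."""
    return (value >> (8 * (3 - index))) & 0xFF


packed = ip_to_int("192.168.1.10")
print(packed)             # 3232235786
print(int_to_ip(packed))  # 192.168.1.10
print(octet(packed, 0))   # 192
```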
Performance vs. Space
Storage:
If you are only going to support IPv4 addresses, then your datatype in MySQL can be an UNSIGNED INT, which uses only 4 bytes of storage.
To store the individual octets you would only need UNSIGNED TINYINT columns, not SMALLINTs; each TINYINT uses 1 byte of storage.
Both methods would use similar storage, with perhaps slightly more overhead for the separate fields.
More info:
- Numeric Type Overview
- Integer Types (Exact Value) - INTEGER, INT, SMALLINT, TINYINT, MEDIUMINT, BIGINT
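To put rough numbers on the storage comparison, here is a back-of-the-envelope sketch using the 500,000,000 records a day quoted in the question and the documented MySQL integer sizes. It counts only the address column(s); row, index, and engine overhead are deliberately ignored, so the real footprint will be larger.

```python
ROWS_PER_DAY = 500_000_000  # figure quoted in the question

# Bytes needed for the address column(s) only; row and index overhead ignored.
BYTES_PER_ADDRESS = {
    "1 x INT UNSIGNED": 4,
    "4 x TINYINT UNSIGNED": 4 * 1,
    "4 x SMALLINT": 4 * 2,
    "1 x BIGINT": 8,
}


def daily_gib(bytes_per_row: int, rows: int = ROWS_PER_DAY) -> float:
    """Raw column storage per day, in GiB."""
    return bytes_per_row * rows / 2**30


for layout, size in BYTES_PER_ADDRESS.items():
    print(f"{layout:<22} {daily_gib(size):6.2f} GiB/day")
```

On these assumptions the single UNSIGNED INT and the four TINYINTs come out the same, while four SMALLINTs or one BIGINT double the raw column size.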
Performance:
Using a single field will yield much better performance: it's a single comparison instead of four. You mentioned that you will only run queries against the whole IP address, so there should be no need to keep the octets separate. With MySQL's INET_* functions, the conversion between the text and integer representations is done once for the comparison.
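As an illustration of "convert once, then compare integers", here is a small sketch. It uses an in-memory SQLite database purely as a stand-in so the example runs anywhere (that is an assumption, not the MySQL setup from the question), and the table and column names (proxy_log, src_ip, bytes_sent) are hypothetical. The conversion happens in Python before the query, playing the role INET_ATON() would play in MySQL.

```python
import ipaddress
import sqlite3

# In-memory SQLite stands in for the real MySQL server (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proxy_log (src_ip INTEGER NOT NULL, bytes_sent INTEGER)")
conn.execute("CREATE INDEX idx_src_ip ON proxy_log (src_ip)")


def to_int(dotted: str) -> int:
    """Pack a dotted-quad IPv4 string into a single integer key."""
    return int(ipaddress.IPv4Address(dotted))


rows = [("10.0.0.1", 100), ("10.0.0.2", 250), ("10.0.0.1", 50)]
conn.executemany(
    "INSERT INTO proxy_log VALUES (?, ?)",
    [(to_int(ip), sent) for ip, sent in rows],
)


def total_bytes(dotted: str) -> int:
    """Sum bytes_sent for one address using a single integer comparison."""
    key = to_int(dotted)  # text-to-integer conversion happens once, up front
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(bytes_sent), 0) FROM proxy_log WHERE src_ip = ?",
        (key,),
    ).fetchone()
    return total


print(total_bytes("10.0.0.1"))  # 150
```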
#2
13
A BIGINT is 8 bytes in MySQL.
To store IPv4 addresses, an UNSIGNED INT is enough, which I think is what you should use.
I can't imagine a scenario where 4 octets would perform better than a single INT, and the latter is much more convenient.
Also note that if you are going to issue queries like this:
SELECT *
FROM ips
WHERE ? BETWEEN start_ip AND end_ip
where start_ip and end_ip are columns in your table, the performance will be poor.
These queries are used to find out whether a given IP is within a subnet range (usually to ban it).
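To make start_ip and end_ip concrete, here is a hedged sketch of how a subnet written in CIDR notation maps onto that pair of integers, and what the BETWEEN test checks. The function names (subnet_bounds, in_range) are invented for illustration; this only shows the data model, not a fix for the query performance.

```python
import ipaddress


def subnet_bounds(cidr: str) -> tuple:
    """Return (start_ip, end_ip) as integers for an IPv4 subnet in CIDR form."""
    net = ipaddress.ip_network(cidr, strict=True)
    return int(net.network_address), int(net.broadcast_address)


def in_range(ip: str, start_ip: int, end_ip: int) -> bool:
    """Same test as `? BETWEEN start_ip AND end_ip`, done on integers."""
    return start_ip <= int(ipaddress.IPv4Address(ip)) <= end_ip


start_ip, end_ip = subnet_bounds("192.168.1.0/24")
print(start_ip, end_ip)                            # 3232235776 3232236031
print(in_range("192.168.1.77", start_ip, end_ip))  # True
print(in_range("192.168.2.1", start_ip, end_ip))   # False
```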
To make these queries efficient, you should store the whole range as a LineString object with a SPATIAL index on it, and query like this:
SELECT *
FROM ips
WHERE MBRContains(?, ip_range)
See this entry in my blog for more detail on how to do it:
- Banning IPs
#3
3
Use PostgreSQL, there's a native data type for that.
More seriously, I would fall into the "one 32-bit integer" camp. An IP address only makes sense when all four octets are considered together, so there's no reason to store the octets in separate columns in the database. Would you store a phone number using three (or more) different fields?
#4
1
Having separate fields doesn't sound particularly sensible to me, much like splitting a zip code or a phone number into sections.
It might be useful if you wanted specific info on the sections, but I see no real reason not to use a 32-bit int.
#5
-1
Efficient conversion of IP to int and int to IP, in Perl (could be useful to you):
use strict;
use warnings;
# Pack a dotted-quad IPv4 string into a 32-bit integer.
sub ip2dec {
    my @octs = split /\./, shift;
    return ($octs[0] << 24) + ($octs[1] << 16) + ($octs[2] << 8) + $octs[3];
}
# Unpack a 32-bit integer into a dotted-quad string:
# shift each octet down to the low byte, then mask off everything above it.
sub dec2ip {
    my $number = shift;
    return join '.', map { ($number >> $_) & 0xFF } (24, 16, 8, 0);
}