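To meet a requirement of our TensorFlow project, I tested the serialization/deserialization performance of protobuf. The test procedure and results are as follows: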
I. Test environment
Python 2.7 + proto3
II. Test method
1. Define a custom proto message (modified from the example that ships with protobuf)
syntax = "proto3";

message Person {
  string name = 1;
  int32 id = 2;  // Unique ID number for this person.
  string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    string number = 1;
    PhoneType type = 2;
  }

  repeated PhoneNumber phones = 4;
}

// Our address book file is just one of these.
message AddressBook {
  repeated Person people = 1;
}
2. Compile the proto file
protoc --python_out=. addressbook.proto
This generates addressbook_pb2.py.
3. In the test script, change the loop count to vary the size of the serialized content.
for i in range(1024 * 1024):
    PromptForAddress(address_book.people.add())
4. Serialization
begin = datetime.datetime.now()
serialized = address_book.SerializeToString()
end = datetime.datetime.now()
print end-begin
print len(serialized)
f.write(serialized)
5. Deserialization
book = f.read()
parsebegin = datetime.datetime.now()
address_book.ParseFromString(book)
parseend = datetime.datetime.now()
print parseend-parsebegin
print len(book)
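The two snippets above time a single call with datetime.datetime.now(), which follows the wall clock and gives only one sample. A minimal sketch of an alternative, assuming a monotonic timer and a best-of-N policy are acceptable for this test: time_best is a hypothetical helper name, and the string join is only a stand-in workload so the sketch runs without the generated module.

```python
from timeit import default_timer  # available in Python 2.7 and Python 3


def time_best(fn, repeat=3):
    """Call fn() `repeat` times; return (fastest_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = default_timer()
        result = fn()
        best = min(best, default_timer() - start)
    return best, result


# Stand-in workload (not protobuf): join 10000 short strings.
payload = ["xxxxx xxxxx"] * 10000
seconds, joined = time_best(lambda: ",".join(payload))
print("%.6f s for %d bytes" % (seconds, len(joined)))
```

With the real message, the same helper could be called as time_best(address_book.SerializeToString) and time_best(lambda: address_book.ParseFromString(book)), which keeps the file read and write outside the timed region.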
The complete .py file is as follows:
#! /usr/bin/env python
# See README.txt for information and build instructions.
import addressbook_pb2
import sys
import datetime


# Fills in a Person message with fixed test data.
def PromptForAddress(person):
    person.id = 160824
    person.name = "xxxxx xxxxx"
    person.email = "xxxxxxxx@xxxxx.com"
    phone_number = person.phones.add()
    phone_number.number = "12345678"
    phone_number.type = addressbook_pb2.Person.MOBILE
    phone_number = person.phones.add()
    phone_number.number = "23456789"
    phone_number.type = addressbook_pb2.Person.HOME
    phone_number = person.phones.add()
    phone_number.number = "34567890"
    phone_number.type = addressbook_pb2.Person.WORK


# Main procedure: reads the address book from a file (timing the parse),
# adds the test entries, then writes it back out to the same file
# (timing the serialization).
if len(sys.argv) != 2:
    print "Usage:", sys.argv[0], "ADDRESS_BOOK_FILE"
    sys.exit(-1)

address_book = addressbook_pb2.AddressBook()

# Read the existing address book and time the parse.
try:
    with open(sys.argv[1], "rb") as f:
        book = f.read()
    parsebegin = datetime.datetime.now()
    address_book.ParseFromString(book)
    parseend = datetime.datetime.now()
    print parseend - parsebegin
    print len(book)
    # address_book.ParseFromString(f.read())
except IOError:
    print sys.argv[1] + ": File not found. Creating a new file."

# Add the test entries; the loop count controls the serialized size.
for i in range(1024 * 1024):
    PromptForAddress(address_book.people.add())

# Time the serialization, then write the address book back to disk.
with open(sys.argv[1], "wb") as f:
    begin = datetime.datetime.now()
    serialized = address_book.SerializeToString()
    end = datetime.datetime.now()
    print end - begin
    print len(serialized)
    f.write(serialized)  # needed so the next run has a file to parse

# Disabled copy of the read/parse step.
'''
address_book = addressbook_pb2.AddressBook()
# Read the existing address book.
try:
    with open(sys.argv[1], "rb") as f:
        book = f.read()
    parsebegin = datetime.datetime.now()
    address_book.ParseFromString(book)
    parseend = datetime.datetime.now()
    print parseend - parsebegin
    print len(book)
'''
6. Change the loop count and record the serialization/deserialization performance for different protobuf sizes.
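Step 6 can also be automated instead of editing the loop count by hand. The sketch below is one possible shape, not the script used for the results in this post: sweep and FakeBook are hypothetical names, FakeBook is only a stand-in for addressbook_pb2.AddressBook so the block runs on its own, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
from timeit import default_timer


def sweep(build, counts):
    """For each count, build a message and time one serialize/parse pair.

    `build(n)` is a hypothetical factory returning a message that holds n
    records and exposes SerializeToString() and ParseFromString(data).
    Returns a list of (size_mb, serialize_seconds, parse_seconds).
    """
    rows = []
    for n in counts:
        message = build(n)

        start = default_timer()
        data = message.SerializeToString()
        ser = default_timer() - start

        start = default_timer()
        message.ParseFromString(data)
        par = default_timer() - start

        rows.append((len(data) / (1024.0 * 1024.0), ser, par))
    return rows


class FakeBook(object):
    """Stand-in for the generated AddressBook class (assumption)."""

    def __init__(self, n):
        self.people = ["xxxxx xxxxx"] * n

    def SerializeToString(self):
        return "\n".join(self.people).encode("utf-8")

    def ParseFromString(self, data):
        self.people = data.decode("utf-8").split("\n")


rows = sweep(FakeBook, [1024, 16 * 1024])
for size_mb, ser, par in rows:
    print("%.2f | %.6f | %.6f" % (size_mb, ser, par))
```

For the real test, build(n) would create an addressbook_pb2.AddressBook() and call PromptForAddress(book.people.add()) n times before returning it.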
III. Test results
Size (MB) | Serialization (s) | Deserialization (s)
1.03 | 0.799453 | 0.950107
53.00 | 36.759911 | 43.303041
61.64 | 41.674104 | 52.206466
81.00 | 63.077295 | 79.234909
106.00 | 72.048027 | 88.280556
102.83 | 81.08806 | 102.28786
162.00 | 128.883403 | 164.042591
205.66 | 163.994605 | 199.729636
243.00 | 197.582673 | 246.699898
Note: the size in the table is the size of the string obtained after serialization, i.e. len(serialized) in the program.
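One way to check how close these numbers are to linear is to convert each row to a throughput in MB/s: if time grows linearly with size, the rate should stay roughly constant. The sketch below uses only the values from the table above; throughput is a hypothetical helper name.

```python
# (size in MB, serialization seconds, deserialization seconds),
# copied from the results table.
results = [
    (1.03, 0.799453, 0.950107),
    (53.00, 36.759911, 43.303041),
    (61.64, 41.674104, 52.206466),
    (81.00, 63.077295, 79.234909),
    (106.00, 72.048027, 88.280556),
    (102.83, 81.08806, 102.28786),
    (162.00, 128.883403, 164.042591),
    (205.66, 163.994605, 199.729636),
    (243.00, 197.582673, 246.699898),
]


def throughput(rows):
    """Return (size_mb, serialize_mb_per_s, parse_mb_per_s) per row."""
    return [(size, size / ser, size / par) for size, ser, par in rows]


for size, ser_rate, par_rate in throughput(results):
    print("%7.2f MB  serialize %.2f MB/s  parse %.2f MB/s"
          % (size, ser_rate, par_rate))
```

On this data the serialization rate stays between roughly 1.2 and 1.5 MB/s and the parse rate between roughly 1.0 and 1.2 MB/s across all sizes, which is consistent with approximately linear growth.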
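IV. Analysis and open questions
The results show roughly linear growth: the larger the serialized size, the longer it takes. At 243 MB, serialization takes about 198 s (a little over 3 minutes) and deserialization about 247 s (a little over 4 minutes). The results raise the following questions:
1. Is the test method correct? I believe it is workable, but the measured times are larger than I expected.
2. This test was run in Python. When I ran a test in C++, the results were much better than Python's (for the C++ part, see the article "FlatBuffers与protobuf性能比较", a performance comparison of FlatBuffers and protobuf).
I only compared a small payload (1 KB), looping serialization and deserialization 100 times each. The results are below. (Both tests use the same proto file. The C++ test uses ParseFromArray/SerializeToArray; the Python test uses ParseFromString/SerializeToString.)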
Language | Serialization (ms) | Deserialization (ms)
Python | 63.879 | 82.89
C++ | 1.336 | 1.352
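3. According to the material I have read, serialization/deserialization performance also depends on the structure of the proto (for example, multiple levels of nesting). I therefore suggest running another test together with TensorFlow once we have studied it: while training a model, time the serialization/deserialization steps separately.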
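On question 2, one thing that may be worth checking is which backend the Python protobuf runtime is using, since the package can run either as pure Python or on top of a native extension. The sketch below is a small check under the assumption that the internal module google.protobuf.internal.api_implementation is present in the installed version; it is an internal module and may change between releases.

```python
try:
    # Internal module of the protobuf Python package (may change).
    from google.protobuf.internal import api_implementation
    backend = api_implementation.Type()
except ImportError:
    backend = None  # protobuf is not installed in this environment

print("protobuf backend: %s" % backend)
```

If this reports a pure-Python backend, that may account for part of the gap to the C++ numbers, though this post did not measure it.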
The questions above still need discussion; criticism and corrections are welcome.
V. References and further reading
2. (pbc lua added) Performance test of C++, Lua, and Python with/without extension (100,000 calls to SerializeToString & ParseFromString)