Testing the file-handling performance of HBase and Hadoop
1: Single-threaded file write to HBase
String parentPath = "F:/pic/2003-zhujiajian";
File[] files = getAllFilePath(parentPath);
HBaseConfiguration config = new HBaseConfiguration();
HTable table = new HTable(config, new Text("offer"));
long start = System.currentTimeMillis();
for (File file : files) {
    if (file.isFile()) {
        byte[] data = getData(file);
        createRecore(table, file.getName(), "image_big", data);
    }
}
long end = System.currentTimeMillis();
System.out.println("time cost=" + (end-start));
108037206 bytes in 303 files, written from a local Windows machine to remote HBase: 23328 or 21001 milliseconds
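The helpers getAllFilePath and getData are not listed in this post. A minimal sketch of what they could look like, using only the standard library, follows. The method bodies are an assumption for illustration, and createRecore (the HBase write itself) is left out.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

final class FileHelpers {
    // Hypothetical helper: list the direct children of a directory.
    // Returns an empty array if the path is not a readable directory.
    static File[] getAllFilePath(String parentPath) {
        File[] children = new File(parentPath).listFiles();
        return children != null ? children : new File[0];
    }

    // Hypothetical helper: read a whole file into memory as a byte array.
    static byte[] getData(File file) throws IOException {
        return Files.readAllBytes(file.toPath());
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        String parentPath = args.length > 0 ? args[0] : ".";
        long totalBytes = 0;
        int fileCount = 0;
        for (File file : getAllFilePath(parentPath)) {
            if (file.isFile()) {
                totalBytes += getData(file).length;
                fileCount++;
            }
        }
        System.out.println(fileCount + " files, " + totalBytes + " bytes");
    }
}
```

Reading each file fully into memory matches what the loop above does before calling createRecore, so the largest image has to fit in the heap.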
2: Single-threaded file write to Hadoop (HDFS)
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path src = new Path("F:/pic/2003-zhujiajian");
Path dst = new Path("/user/zxf/image");
long start = System.currentTimeMillis();
fs.copyFromLocalFile(src, dst);
long end = System.currentTimeMillis();
System.out.println("time cost=" + (end-start));
108037206 bytes in 303 files, written from a local Windows machine to remote HDFS: 26531 or 32407 milliseconds
3: Single-threaded file read from HBase
The time taken was unbelievably slow.
108037206 bytes in 303 files, read from HBase to local disk: 479350 milliseconds
4: Single-threaded file read from Hadoop (HDFS)
108037206 bytes in 303 files, read from HDFS to local disk: 14188 milliseconds
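To make the four totals easier to compare, they can be converted to throughput. The sketch below is only arithmetic on the numbers reported above. The class and method names are made up for this example, and for the two writes it uses the first of the two reported times.

```java
final class Throughput {
    // Convert a byte count and elapsed milliseconds into MB/s (1 MB = 1,000,000 bytes).
    static double mbPerSecond(long bytes, long millis) {
        return (bytes / 1_000_000.0) / (millis / 1000.0);
    }

    public static void main(String[] args) {
        long bytes = 108037206L; // total size of the 303 files
        String[] labels = {"hbase write", "hdfs write", "hbase read", "hdfs read"};
        // Elapsed times from sections 1 to 4 (first reported value for the writes).
        long[] millis = {23328L, 26531L, 479350L, 14188L};
        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-12s %7d ms  %6.2f MB/s%n",
                    labels[i], millis[i], mbPerSecond(bytes, millis[i]));
        }
    }
}
```

This works out to roughly 4.6 MB/s and 4.1 MB/s for the HBase and HDFS writes, about 0.23 MB/s for the HBase read and about 7.6 MB/s for the HDFS read.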
5: Deeper testing
Comparing a few individual files:
fileSize (bytes)   hdfs time (ms)   hbase time (ms)
12341140           1313             14688
708474             63               4359
82535              15               3907
55296              16               125
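6: Thoughts
During testing a "region offline" error occurred, and restarting the services did not clear it. I then reformatted the namenode and deleted the data on the datanodes. After restarting, one datanode still had not come up; when I ssh'ed in, I found its java process had died.
That wasted more than an hour. Thinking it over: an HTable is split into sub-tables spread across the HRegionServers. With one datanode down, a request for its data cannot connect, so a region offline error is reported.
Why is HBase read performance so poor? Reading a single 11 MB file takes about 14000 milliseconds, while reading the entire file directory from HDFS takes only 14188 milliseconds.
This post, http://blog.rapleaf.com/dev/?p=26, says: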
Finally, another thing you shouldn’t do with HBase (or an RDBMS, for that matter), is store large amounts of binary data. When I say large amounts, I mean tens to hundreds of megabytes. Certainly both RDBMSs and HBase have the capabilities to store large amounts of binary data. However, again, we have an impedance mismatch. RDBMSs are built to be fast metadata stores; HBase is designed to have lots of rows and cells, but functions best when the rows are (relatively) small. HBase splits the virtual table space into regions that can be spread out across many servers. The default size of individual files in a region is 256MB. The closer to the region limit you make each row, the more overhead you are paying to host those rows. If you have to store a lot of big files, then you’re best off storing in the local filesystem, or if you have LOTS of data, HDFS. You can still keep the metadata in an RDBMS or HBase - but do us all a favor and just keep the path in the metadata.
So it seems HBase is not well suited to storing binary files; for an application that stores images, HDFS is the better fit.
Even so, I tried marking the image_big column as IN_MEMORY:
alter table offer change image_big IN_MEMORY;
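a: I retested several times, including after restarting HBase and HDFS; the HBase read speed was still about the same as before.
b: After deleting the existing data and writing it again, a new read test showed that small-file reads had become much faster.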
fileSize (bytes)   run 1 (ms)   run 2 (ms)   run 3 (ms)
12341140           11750        11109        11718
708474             625          610          672
82535              78           78           78
55296              47           62           47
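This means the read cache gives a large performance gain. When the amount of data is not very large, the bottleneck is the read speed. Reads of data under 100 KB are reasonably efficient, and the time taken is roughly proportional to the length of the data being read.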
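The proportionality can be checked directly from the table in b by dividing each file size by its read time. The sketch below does that, assuming the three time columns are three repeated reads of the same file and averaging them; the class and method names are made up for this example.

```java
final class ReadRate {
    // Average of the measured times for one file, in milliseconds.
    static double meanMillis(long[] runs) {
        long sum = 0;
        for (long r : runs) {
            sum += r;
        }
        return (double) sum / runs.length;
    }

    // Bytes read per millisecond of elapsed time, using the mean of the runs.
    static double bytesPerMilli(long fileSize, long[] runs) {
        return fileSize / meanMillis(runs);
    }

    public static void main(String[] args) {
        // File sizes and read times copied from the table in b.
        long[] sizes = {12341140L, 708474L, 82535L, 55296L};
        long[][] runs = {
            {11750, 11109, 11718},
            {625, 610, 672},
            {78, 78, 78},
            {47, 62, 47},
        };
        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%10d bytes  %9.1f ms  %7.1f bytes/ms%n",
                    sizes[i], meanMillis(runs[i]), bytesPerMilli(sizes[i], runs[i]));
        }
    }
}
```

All four files come out between roughly 1060 and 1115 bytes per millisecond, about 1 MB/s, which is consistent with read time scaling linearly with size.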
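But why did the earlier speed not change? Can data read from disk not be cached?
I then tried reading a few files from the first batch written; they were still slow. After repeating the operation in b, the speed improved.
The reason may be that the system sets a cache flag when a row's column data is created, so the cache is bound to the column. When HBase starts, it loads the column data carrying the cache flag into memory.
So none of the data created before I ran alter table offer change image_big IN_MEMORY carried the cache flag. This cache is unlike other caches, which load nothing at startup and cache data only after it is accessed. As a result, the more data is cached, the slower startup becomes, which I noticed here as well. It is good for the user experience, though, because the first access is not especially slow.
c: So why is reading data from HBase slower than from HDFS, and so much slower for large files? Too many network round trips?
The debug log suggests this is indeed the case: the larger the file, the more times the regionServer responds to the client. The details still need a careful look at the source code.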