Java 读取HDFS文件系统

最近有个需求，计算用户画像。

系统大概有800W的用户量，算每个用户的一些数据。

数据量比较大，算用hive还是毫无压力的，但是写的oracle，在给出数据给前端，就比较难受了。

然后换了种解决方法：

　　1.hive计算，写的HDFS

　　2.API读出来，写到hbase（hdfs和hbase的版本不匹配，没办法用sqoop 直接导）

然后问题就来了。

需要写个API，读HDFS上的文件。

主要类：ReadHDFS

public class ReadHDFS {

    public static void main(String[]args){

        long startLong = System.currentTimeMillis();

        HDFSReadLog.writeLog("start read file");

        String path;

        if (args.length > 1) {

//            path = args[0];

            Constant.init(args[0],args[1]);

        }

        HDFSReadLog.writeLog(Constant.PATH);

        try {

            getFile(Constant.URI + Constant.PATH);

        } catch (IOException e) {

            e.printStackTrace();

        }

        long endLong = System.currentTimeMillis();

        HDFSReadLog.writeLog("cost " + (endLong -startLong)/1000 + " seconds");

        HDFSReadLog.writeLog("cost " + (endLong -startLong)/1000/60 + " minute");

    }

    private static void getFile(String filePath) throws IOException {

        FileSystem fs = FileSystem.get(URI.create(filePath), HDFSConf.getConf());

        Path path = new Path(filePath);

        if (fs.exists(path) && fs.isDirectory(path)) {

            FileStatus[] stats = fs.listStatus(path);

            FSDataInputStream is;

            FileStatus stat;

            byte[] buffer;

            int index;

            StringBuilder lastStr = new StringBuilder();

            for(FileStatus file : stats){

                try{

                    HDFSReadLog.writeLog("start read : " + file.getPath());

                    is = fs.open(file.getPath());

                    stat = fs.getFileStatus(path);

                    int sum  = is.available();

                    if(sum == 0){

                        HDFSReadLog.writeLog("have no data : " + file.getPath() );

                        continue;

                    }

                    HDFSReadLog.writeLog("there have  : " + sum + " bytes" );

                    buffer = new byte[sum];
　　　　　　　　　　　　// 注意一点，如果文件太大了，可能会内存不够用。在本机测得时候，读一个100多M的文件，导致内存不够。

                    is.readFully(0,buffer);

                    String result = Bytes.toString(buffer);

                    // 写到 hbase

                    WriteHBase.writeHbase(result);

                    is.close();

                    HDFSReadLog.writeLog("read : " + file.getPath() + " end");

                }catch (IOException e){

                    e.printStackTrace();

                    HDFSReadLog.writeLog("read " + file.getPath() +" error");

                    HDFSReadLog.writeLog(e.getMessage());

                }

            }

            HDFSReadLog.writeLog("Read End");

            fs.close();

        }else {

            HDFSReadLog.writeLog(path + " is not exists");

        }

    }

}

配置类：HDFSConfie(赶紧没什么用，url和path配好了，不需要配置就可以读)

public class HDFSConf {

    public static Configuration conf = null;

    public static Configuration getConf(){

        if (conf == null){

            conf = new Configuration();

            String path  = Constant.getSysEnv("HADOOP_HOME")+"/etc/hadoop/";

            HDFSReadLog.writeLog("Get hadoop home : " + Constant.getSysEnv("HADOOP_HOME"));

            // hdfs conf

            conf.addResource(path+"core-site.xml");

            conf.addResource(path+"hdfs-site.xml");

            conf.addResource(path+"mapred-site.xml");

            conf.addResource(path+"yarn-site.xml");

        }

        return conf;

    }

}

一些常量：

　url ： hdfs:ip:prot

　path : HDFS的路径

注：考虑到读的表，可能不止有一个文件，做了循环。

看下篇，往hbase写数据

秒客网

Java 读取HDFS文件系统

相关文章