
Time: 2021-12-16 21:32:54

I am trying to run a Weka classifier on MapReduce, but loading an entire ARFF file, even one of only 200 MB, causes a heap space error. I therefore want to split the ARFF file into chunks, but every chunk has to keep the header information, i.e. the ARFF attribute declarations, so that the classifier can run in each mapper. Here is the code I am using to try to split the data, but I have not managed to make it work efficiently:


        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus file : listStatus(job)) {
            Path path = file.getPath();
            FileSystem fs = path.getFileSystem(job.getConfiguration());

            // number of bytes in this file
            long length = file.getLen();
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);

            // make sure this is actually a valid file
            if (length != 0) {
                // set the number of splits to make. NOTE: the value can be changed to anything
                int count = job.getConfiguration().getInt("Run-num.splits", 1);
                for (int t = 0; t < count; t++) {
                    // split the file and add each chunk to the list
                    // NOTE: as written, every split starts at offset 0 and spans the full length,
                    // so each mapper receives the whole file rather than a chunk of it
                    splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
                }
            } else {
                // Create empty array for zero length files
                splits.add(new FileSplit(path, 0, length, new String[0]));
            }
        }
        return splits;
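
For reference, the requirement described above (every chunk carrying the ARFF header) can be sketched in plain Java, independent of Hadoop. This is only a minimal illustration, not the asker's code and not a Weka or Hadoop API: the class name `ArffChunker`, the method `chunk`, and the `rowsPerChunk` parameter are all hypothetical. It assumes the header is everything up to and including the `@data` line, and it reads the input as a stream so that only the header and one chunk are held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: name and signature are assumptions made for illustration.
public class ArffChunker {

    /**
     * Reads ARFF text line by line and hands each chunk to the sink.
     * Every chunk starts with a copy of the header (all lines up to and
     * including "@data") followed by at most rowsPerChunk data rows.
     */
    public static void chunk(BufferedReader in, int rowsPerChunk,
                             Consumer<List<String>> sink) throws IOException {
        if (rowsPerChunk <= 0) {
            throw new IllegalArgumentException("rowsPerChunk must be positive");
        }
        List<String> header = new ArrayList<>();
        List<String> current = null;   // chunk being filled, or null if none is open
        boolean inData = false;        // becomes true once the @data line has been read
        int rows = 0;                  // data rows in the current chunk
        String line;
        while ((line = in.readLine()) != null) {
            String trimmed = line.trim();
            if (!inData) {
                header.add(line);
                inData = trimmed.equalsIgnoreCase("@data");
                continue;
            }
            // Skip blank lines and ARFF comment lines inside the data section.
            if (trimmed.isEmpty() || trimmed.startsWith("%")) {
                continue;
            }
            if (current == null) {
                current = new ArrayList<>(header); // each chunk gets its own header copy
            }
            current.add(line);
            rows++;
            if (rows == rowsPerChunk) {
                sink.accept(current);
                current = null;
                rows = 0;
            }
        }
        if (current != null) {
            sink.accept(current); // final, possibly shorter, chunk
        }
    }
}
```

The sink could write each chunk to its own file, which could then be given to a mapper as a self-contained ARFF input. Whether that fits a particular job setup is an assumption to check against the actual InputFormat in use.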

1 Answer


Have you tried this first?


In mapred-site.xml, add this property:



// memory allocation for MR jobs
