Implementing a Custom Writable Class in Hadoop

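Hadoop ships with a set of Writable implementations that cover most needs. In some cases, however, we need to build a new implementation of our own. A custom Writable gives us full control over the binary representation and the sort order.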

To demonstrate how to create a custom Writable type, we will write an implementation that represents a pair of strings:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class TextPair implements WritableComparable<TextPair> {

    private Text first;
    private Text second;

    // A no-argument constructor is needed so that an empty instance can be
    // created first and then populated by readFields().
    public TextPair() {
        set(new Text(), new Text());
    }

    public TextPair(String first, String second) {
        set(new Text(first), new Text(second));
    }

    public TextPair(Text first, Text second) {
        set(first, second);
    }

    public void set(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    public Text getFirst() {
        return first;
    }

    public Text getSecond() {
        return second;
    }

    // Serialize both fields in order by delegating to Text.
    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    // Deserialize both fields in the same order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof TextPair) {
            TextPair tp = (TextPair) o;
            return first.equals(tp.first) && second.equals(tp.second);
        }
        return false;
    }

    @Override
    public String toString() {
        return first + "\t" + second;
    }

    // Order by the first field, then by the second.
    @Override
    public int compareTo(TextPair tp) {
        int cmp = first.compareTo(tp.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(tp.second);
    }
}
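
To see what write() and readFields() have to guarantee, the following sketch round-trips a pair of strings through a byte array. It is only an illustration under stated assumptions: it uses java.io alone so that it runs without Hadoop on the classpath, the names PairRoundTripSketch and StringPair are hypothetical stand-ins for TextPair, and writeUTF()/readUTF() stand in for Text's own serialization, which uses a different byte layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical, Hadoop-free stand-in for TextPair: it follows the same
// write/readFields contract, but stores plain Strings instead of Text.
final class PairRoundTripSketch {

    static final class StringPair {
        String first = "";
        String second = "";

        StringPair() { }

        StringPair(String first, String second) {
            this.first = first;
            this.second = second;
        }

        // Fields are written in a fixed order.
        void write(DataOutput out) throws IOException {
            out.writeUTF(first);
            out.writeUTF(second);
        }

        // Fields must be read back in exactly the order they were written.
        void readFields(DataInput in) throws IOException {
            first = in.readUTF();
            second = in.readUTF();
        }
    }

    // Serialize a pair into a byte array.
    static byte[] serialize(StringPair pair) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            pair.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Create an empty pair, then populate it from the serialized bytes.
    static StringPair deserialize(byte[] data) {
        StringPair pair = new StringPair();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            pair.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pair;
    }

    public static void main(String[] args) {
        StringPair copy = deserialize(serialize(new StringPair("hadoop", "writable")));
        System.out.println(copy.first + "\t" + copy.second);
    }
}
```

The empty constructor followed by readFields() mirrors why TextPair needs a no-argument constructor: an instance is created first and filled in afterwards.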

Implementing a RawComparator for Speed

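A further optimization is possible. When TextPair is used as a key in MapReduce, keys have to be compared, but by that point they have already been serialized. Comparing them would normally mean deserializing each one into an object and then calling compareTo(), which is inefficient. Can the serialized bytes be compared directly instead? They can.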

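The RawComparator interface lets implementers compare records read from a stream without deserializing them into objects, which avoids the overhead of object creation. The two records passed to compare() are located in the byte arrays b1 and b2, starting at offsets s1 and s2, with lengths l1 and l2. The interface is declared as follows: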

public interface RawComparator<T> extends Comparator<T> {
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}

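Take IntWritable as an example. In its RawComparator implementation, compare() uses readInt() to read the two integers directly from the byte arrays and then returns a negative, zero, or positive result, as the Comparator contract requires.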

Note that compare() never creates an IntWritable object along the way, so no unnecessary objects are allocated. The relevant code is:

  /** A Comparator optimized for IntWritable. */
  public static class Comparator extends WritableComparator {

    public Comparator() {
      super(IntWritable.class);
    }

    public int compare(byte[] b1, int s1, int l1,
                       byte[] b2, int s2, int l2) {
      int thisValue = readInt(b1, s1);
      int thatValue = readInt(b2, s2);
      return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
    }
  }
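
The idea of reading an integer straight out of its serialized bytes can be tried without Hadoop. The sketch below is a minimal illustration, not Hadoop's code: RawIntCompareSketch and its methods are hypothetical names, and it assumes the 4-byte big-endian layout that java.io.DataOutput.writeInt() produces.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical, Hadoop-free illustration of comparing serialized ints in place.
final class RawIntCompareSketch {

    // Decode a big-endian 4-byte int starting at offset `start`.
    // Assumption: the bytes were produced by DataOutput.writeInt().
    static int readInt(byte[] bytes, int start) {
        return ((bytes[start] & 0xff) << 24)
             | ((bytes[start + 1] & 0xff) << 16)
             | ((bytes[start + 2] & 0xff) << 8)
             |  (bytes[start + 3] & 0xff);
    }

    // Compare two serialized ints without allocating any wrapper objects.
    static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    // Write the given ints back to back into one buffer.
    static byte[] serialize(int... values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int v : values) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] buf = serialize(7, -3, 7);   // three records, 4 bytes each
        System.out.println(Integer.signum(compareRaw(buf, 0, buf, 4)));  // 7 vs -3
        System.out.println(Integer.signum(compareRaw(buf, 4, buf, 0)));  // -3 vs 7
        System.out.println(Integer.signum(compareRaw(buf, 0, buf, 8)));  // 7 vs 7
    }
}
```

Because both records live in the same buffer at different offsets, the comparison only needs the start positions, which is the role s1 and s2 play in compare().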

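WritableComparator is a general-purpose implementation of RawComparator for WritableComparable classes. It provides two main functions:

1. It provides a default implementation of the raw compare() method, which deserializes the objects to be compared from the stream and then calls the object-level compare() method, which in turn delegates to compareTo().

2. It acts as a factory for RawComparator instances. For example, the following code obtains the RawComparator for IntWritable: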

RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);

For TextPair, we only need to split the serialized bytes into the two member fields and then compare those fields in turn:

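import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

class Comparator extends WritableComparator {

    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

    public Comparator() {
        super(TextPair.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            // Length of the first Text field: the size of its variable-length
            // length prefix plus the number of payload bytes the prefix encodes.
            int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
            int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);

            // Compare the first fields; fall through to the second on a tie.
            int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
            if (cmp != 0) {
                return cmp;
            }

            // The second field occupies the remaining bytes of each record.
            return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
                                           b2, s2 + firstL2, l2 - firstL2);
        } catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
    }
}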
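
The offset arithmetic is easier to follow on a simplified format. The sketch below is an assumption-laden stand-in rather than Hadoop's implementation: RawPairCompareSketch is a hypothetical name, and each string is encoded as a single length byte followed by its UTF-8 bytes, whereas Text uses a variable-length integer prefix. A one-byte prefix limits each string to 127 bytes, but the structure of the comparison is the same: find where the first field ends, compare the first fields, and look at the second field only on a tie.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, Hadoop-free sketch of comparing two serialized string pairs.
final class RawPairCompareSketch {

    // Encode a pair as [len1][bytes1][len2][bytes2].
    // Simplification: one length byte per field instead of Hadoop's VInt.
    static byte[] encode(String first, String second) {
        byte[] f = first.getBytes(StandardCharsets.UTF_8);
        byte[] s = second.getBytes(StandardCharsets.UTF_8);
        if (f.length > 127 || s.length > 127) {
            throw new IllegalArgumentException("field longer than 127 bytes");
        }
        byte[] out = new byte[2 + f.length + s.length];
        out[0] = (byte) f.length;
        System.arraycopy(f, 0, out, 1, f.length);
        out[1 + f.length] = (byte) s.length;
        System.arraycopy(s, 0, out, 2 + f.length, s.length);
        return out;
    }

    // Lexicographic comparison of two byte ranges, treating bytes as unsigned.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;
    }

    // Compare two encoded pairs without building any String objects.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Total size of the first field: 1 prefix byte + payload length.
        int firstL1 = 1 + b1[s1];
        int firstL2 = 1 + b2[s2];
        int cmp = compareBytes(b1, s1 + 1, firstL1 - 1, b2, s2 + 1, firstL2 - 1);
        if (cmp != 0) {
            return cmp;
        }
        // The second field fills the rest of the record; skip its prefix byte.
        return compareBytes(b1, s1 + firstL1 + 1, l1 - firstL1 - 1,
                            b2, s2 + firstL2 + 1, l2 - firstL2 - 1);
    }

    public static void main(String[] args) {
        byte[] x = encode("a", "z");
        byte[] y = encode("b", "a");
        System.out.println(Integer.signum(compare(x, 0, x.length, y, 0, y.length)));
    }
}
```

With a variable-length prefix, the constant 1 above becomes the decoded prefix size, which is what WritableUtils.decodeVIntSize() supplies in the TextPair comparator.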

Custom Comparators

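Sometimes, in addition to the default comparator, you need custom comparators that produce a different sort order. The following example compares TextPair records by their first field only: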

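    // These two methods belong in a WritableComparator subclass for TextPair
    // that declares TEXT_COMPARATOR as in the previous listing.

    // Raw comparison: only the first Text field of each record is examined.
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
            int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
            return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
        } catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Object comparison: order by the first field and ignore the second.
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        if (a instanceof TextPair && b instanceof TextPair) {
            return ((TextPair) a).getFirst().compareTo(((TextPair) b).getFirst());
        }
        return super.compare(a, b);
    }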
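
The effect of swapping comparators can be seen on a small in-memory list. This sketch is only an analogy built on java.util.Comparator, with no Hadoop involved: FirstOnlyOrderSketch, FULL_ORDER, and FIRST_ONLY are hypothetical names, and each pair is modeled as a two-element String array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration: the same pairs sorted under two different orders.
final class FirstOnlyOrderSketch {

    // Default-style order: first field, then second field.
    static final Comparator<String[]> FULL_ORDER = (a, b) -> {
        int cmp = a[0].compareTo(b[0]);
        return cmp != 0 ? cmp : a[1].compareTo(b[1]);
    };

    // Custom order: only the first field counts; pairs that share a first
    // field compare as equal.
    static final Comparator<String[]> FIRST_ONLY = (a, b) -> a[0].compareTo(b[0]);

    // Sort a fixed sample of pairs with the given order and render each
    // pair as "first<TAB>second".
    static List<String> sorted(Comparator<String[]> order) {
        List<String[]> pairs = new ArrayList<>(Arrays.asList(
                new String[] {"b", "2"},
                new String[] {"a", "9"},
                new String[] {"a", "1"}));
        pairs.sort(order);  // List.sort is stable, so ties keep their input order
        List<String> out = new ArrayList<>();
        for (String[] p : pairs) {
            out.add(p[0] + "\t" + p[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sorted(FULL_ORDER));
        System.out.println(sorted(FIRST_ONLY));
    }
}
```

Under FIRST_ONLY the two pairs starting with "a" are treated as equal, so their relative order is whatever the input order was. A comparator that looks only at the first field behaves the same way for TextPair keys.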
