hadoop源码 - HDFS的文件操作流写操作(客户端)

在前面的博文中我主要从客户端的角度讲述了HDFS文件写操作的工作流程，但是关于客户端是如何把数据块传送到数据节点，同时数据节点又是如何来接受来自客户端的数据块呢？这就是本文将要讨论的。

在客户端负责数据写入处理的核心类是DFSOutputStream，它的内部主要有数据包发送器DataStream、数据包确认处理器ResponseProcessor和数据包封装器Packet，其整体设计架构图如下：

1.数据校验器FSOutputSummer

[java] view plain copy

abstract public class FSOutputSummer extends OutputStream {
private Checksum sum; //数据校验器
private byte buf[]; //数据缓存
private byte checksum[]; //校验数据缓存
private int count; //数据缓存的当前写入位置
protected FSOutputSummer(Checksum sum, int maxChunkSize, int checksumSize) {
this.sum = sum;
this.buf = new byte[maxChunkSize];
this.checksum = new byte[checksumSize];
this.count = 0;
}
/*
* 写入数据及其检验数据
*/
protected abstract void writeChunk(byte[] b, int offset, int len, byte[] checksum) throws IOException;
/** Write one byte */
public synchronized void write(int b) throws IOException {
sum.update(b);
buf[count++] = (byte)b;
if(count == buf.length) {
flushBuffer();
}
}
/**
*
*/
public synchronized void write(byte b[], int off, int len) throws IOException {
if (off < 0 || len < 0 || off > b.length - len) {
throw new ArrayIndexOutOfBoundsException();
}
for (int n=0; n<len; n+=write1(b, off+n, len-n)) {
}
}
/**
* Write a portion of an array, flushing to the underlying
* stream at most once if necessary.
*/
private int write1(byte b[], int off, int len) throws IOException {
if(count==0 && len>=buf.length) {
// local buffer is empty and user data has one chunk
// checksum and output data
final int length = buf.length;
sum.update(b, off, length);
writeChecksumChunk(b, off, length, false);
return length;
}
// copy user data to local buffer
int bytesToCopy = buf.length-count;
bytesToCopy = (len<bytesToCopy) ? len : bytesToCopy;
sum.update(b, off, bytesToCopy);
System.arraycopy(b, off, buf, count, bytesToCopy);
count += bytesToCopy;
if (count == buf.length) {
// local buffer is full
flushBuffer();
}
return bytesToCopy;
}
/** Generate checksum for the data chunk and output data chunk & checksum
* to the underlying output stream. If keep is true then keep the
* current checksum intact, do not reset it.
*/
private void writeChecksumChunk(byte b[], int off, int len, boolean keep) throws IOException {
int tempChecksum = (int)sum.getValue();
if (!keep) {
sum.reset();
}
int2byte(tempChecksum, checksum);
writeChunk(b, off, len, checksum);
}
}

2.DFSOutputStream

上一次在HDFS的文件操作流(1)——写操作(客户端概述)一文中粗略的提到了DataStreamer线程，那么现在我们就来具体的看看客户端是如何传输数据的。先来看看底层文件写入流DFSOutputSream的核心代码：

[java] view plain copy

class DFSOutputStream extends FSOutputSummer implements Syncable {
private Socket s; //客户端与副本复制流水线中第一个存储节点的连接
boolean closed = false; //写入操作是否关闭
private String src; //文件路径
private DataOutputStream blockStream; //客户端向副本复制流水线中第一个存储节点写入数据的IO流
private DataInputStream blockReplyStream; //客户端从副本复制流水线中第一个存储节点读取数据包确认的IO流
private Block block; //当前正在写入的数据块
final private long blockSize; //数据块大小
private DataChecksum checksum; //保存数据校验和计算器的类型信息
private LinkedList<Packet> dataQueue = new LinkedList<Packet>(); //等待发送的数据包队列
private LinkedList<Packet> ackQueue = new LinkedList<Packet>(); //等待确认的数据包队列
private Packet currentPacket = null;
private int maxPackets = 80; //客户端写入时可缓存的数据包最大数量
// private int maxPackets = 1000; // each packet 64K, total 64MB
private DataStreamer streamer = new DataStreamer(); //后台工作线程，负责数据包的发送
private ResponseProcessor response = null; //后台工作线程，负责确认包接受与处理，一个数据块对应一个确认处理线程
private long currentSeqno = 0; //下一个数据包的序号
private long bytesCurBlock = 0; //当前数据块已写入数据的大小
private int packetSize = 0; //一个数据包的实际大小
private int chunksPerPacket = 0; //一个数据包最多包含多少个校验块
private DatanodeInfo[] nodes = null; //当前数据块副本的存储节点
private volatile boolean hasError = false; //数据节点接收数据包时是否出错
private volatile int errorIndex = 0; //出错的数据节点
private volatile IOException lastException = null;
private long artificialSlowdown = 0;
private long lastFlushOffset = -1; // offset when flush was invoked
private boolean persistBlocks = false; // persist blocks on namenode
private int recoveryErrorCount = 0; // number of times block recovery failed
private int maxRecoveryErrorCount = 5; //数据写入出错时，数据块恢复可尝试的最大次数
private volatile boolean appendChunk = false; //当前数据块中最后一个校验块是否未满(文件追加)
private long initialFileSize = 0; //文件追加写开始时，文件的初始大小
...
/**
* 写入数据及其检验数据
*/
protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum) throws IOException {
checkOpen();
isClosed();
int cklen = checksum.length;
int bytesPerChecksum = this.checksum.getBytesPerChecksum();
if (len > bytesPerChecksum) { //如果待写入的数据长度大于检验块长度，则出现错误状态，抛出异常
throw new IOException("writeChunk() buffer size is " + len + " is larger than supported bytesPerChecksum " + bytesPerChecksum);
}
if (checksum.length != this.checksum.getChecksumSize()) {
throw new IOException("writeChunk() checksum size is supposed to be " + this.checksum.getChecksumSize() + " but found to be " + checksum.length);
}
synchronized (dataQueue) {
//如果待写入数据包数量与待确认数据包数量之和大于一个阈值，则当前写线程开始阻塞
while (!closed && dataQueue.size() + ackQueue.size() > maxPackets) {
try {
dataQueue.wait();
} catch (InterruptedException e) {
}
}
isClosed();
if(currentPacket == null) {
currentPacket = new Packet(packetSize, chunksPerPacket, bytesCurBlock);
//if (LOG.isDebugEnabled()) {
LOG.debug("DFSClient writeChunk allocating new packet seqno=" + currentPacket.seqno + ", src=" + src + ", packetSize=" + packetSize + ", chunksPerPacket=" + chunksPerPacket + ", bytesCurBlock=" + bytesCurBlock);
//}
}
//将校验数据转存到数据包中
currentPacket.writeChecksum(checksum, 0, cklen);
//将数据转存到数据包中
currentPacket.writeData(b, offset, len);
//记录当前数据包中已写入的校验块数量
currentPacket.numChunks++;
//记录当前数据块已写入数据的长度
bytesCurBlock += len;
//如果数据包已写满，或者当前数据块已写满
if(currentPacket.numChunks == currentPacket.maxChunks || bytesCurBlock == blockSize) {
if(LOG.isDebugEnabled()) {
LOG.debug("DFSClient writeChunk packet full seqno=" + currentPacket.seqno + ", src=" + src + ", bytesCurBlock=" + bytesCurBlock + ", blockSize=" + blockSize + ", appendChunk=" + appendChunk);
}
//如果当前数据块已写满，则标记当前数据包为最后一个数据包，数据块已写数据的计数器置0
if(bytesCurBlock == blockSize) {
currentPacket.lastPacketInBlock = true;
bytesCurBlock = 0;
lastFlushOffset = -1;
}
LOG.debug("the packet is full,so put it to dataQueue");
dataQueue.addLast(currentPacket);
dataQueue.notifyAll();
//将当前数据包置空，以使得下次写入时重新创建数据包对象
currentPacket = null;
// If this was the first write after reopening a file, then the above
// write filled up any partial chunk. Tell the summer to generate full
// crc chunks from now on.
if(appendChunk) {
appendChunk = false;
resetChecksumChunk(bytesPerChecksum);
}
//根据当前数据块的剩余空间从新计算数据包的实际大小
int psize = Math.min((int)(blockSize-bytesCurBlock), writePacketSize);
computePacketChunkSize(psize, bytesPerChecksum);
}
}
//LOG.debug("DFSClient writeChunk done length " + len + " checksum length " + cklen);
}

3.数据包装器Packet

客户端向第一个存储节点发送的数据是按照数据包的形式来组织的，以此来提高网络IO的效率。

[java] view plain copy

private class Packet {
ByteBuffer buffer;
byte[] buf; // 数据缓存区
long seqno; // 数据包在数据块中的序列号
long offsetInBlock; // 数据包在数据块中的偏移位置
boolean lastPacketInBlock; // 是否是数据块的最后一个数据包
int numChunks; // 数据包当前存放了多少个校验块
int maxChunks; // 数据包最多可有多少个校验块
int dataStart; // 数据在该数据包中的开始位置
int dataPos; // 当前的数据写入位置
int checksumStart; // 校验数据在该数据包中的开始位置
int checksumPos; // 当前的校验数据写入位置
Packet(int pktSize, int chunksPerPkt, long offsetInBlock) {
this.lastPacketInBlock = false;
this.numChunks = 0;
this.offsetInBlock = offsetInBlock;
this.seqno = currentSeqno;
currentSeqno++;
buffer = null;
buf = new byte[pktSize];
//计算校验数据在数据包中的开始位置
checksumStart = DataNode.PKT_HEADER_LEN + SIZE_OF_INTEGER;
checksumPos = checksumStart;
//计算数据在数据包中的开始位置
dataStart = checksumStart + chunksPerPkt * checksum.getChecksumSize();
dataPos = dataStart;
maxChunks = chunksPerPkt;
}
//写入数据
void writeData(byte[] inarray, int off, int len) {
if ( dataPos + len > buf.length) {
throw new BufferOverflowException();
}
System.arraycopy(inarray, off, buf, dataPos, len);
dataPos += len;
}
//写入检验数据
void writeChecksum(byte[] inarray, int off, int len) {
if (checksumPos + len > dataStart) {
throw new BufferOverflowException();
}
System.arraycopy(inarray, off, buf, checksumPos, len);
checksumPos += len;
}
/**
* 获取整个数据包的数据(数据包头部+校验数据+数据)
* 数据包头部:
* 1).数据包的长度(4+校验数据长度+数据长度)(4)
* 2).数据包的数据在数据块中的偏移位置(8)
* 3).数据包在数据块中的序列号(8)
* 4).是否是数据块中的最后一个数据包(1)
*/
ByteBuffer getBuffer() {
/* Once this is called, no more data can be added to the packet.
* setting 'buf' to null ensures that.
* This is called only when the packet is ready to be sent.
*/
if (buffer != null) {
return buffer;
}
int dataLen = dataPos - dataStart; //数据长度
int checksumLen = checksumPos - checksumStart; //校验数据长度
// 如果检验数据和数据不是连续的，则将校验数据往后移动，使得检验数据和数据是连续的.
// 之所以移动检验数据而不移动数据主要是应为检验数据一般要比数据少得多，这样移动的开销也就小
if (checksumPos != dataStart) {
System.arraycopy(buf, checksumStart, buf, dataStart - checksumLen , checksumLen);
}
int pktLen = SIZE_OF_INTEGER + dataLen + checksumLen;
//开始写入数据包的头部，同时也使得头部数据和校验数据是连续的
buffer = ByteBuffer.wrap(buf, dataStart - checksumPos, DataNode.PKT_HEADER_LEN + pktLen);
buf = null;
buffer.mark();
buffer.putInt(pktLen); // 数据包的长度(4+校验数据长度+数据长度)(4)
buffer.putLong(offsetInBlock); // 数据包的数据在数据块中的偏移位置(8)
buffer.putLong(seqno); // 数据包在数据块中的序列号(8)
buffer.put((byte) ((lastPacketInBlock) ? 1 : 0)); // 是否是数据块中的最后一个数据包(1)
//end of pkt header
buffer.putInt(dataLen); // 数据长度
buffer.reset();
return buffer;
}
}

4.数据包发送器DataStreamer

从DFSOutputSream的核心函数writeChunk()我们可以看出，DFSOutputSream先把写入的数据缓存到packet中，当packet满了，或者是当前Block满了，则把packet放入队列dataQueue，等待其它的工作者把该packet发送到目标数据节点上。其实，这个工作者就是DataStreamer，它是DFSOutputSream的一个内部线程类，下面就来看看DataStreamer是如何工作的吧！

[java] view plain copy

private class DataStreamer extends Daemon {
private volatile boolean closed = false;
public void run() {
while (!closed && clientRunning) {
//如果接收到了数据节点返回的错误确认包，则关闭确认处理器
if (hasError && response != null) {
try {
response.close();
response.join();
response = null;
} catch (InterruptedException e) {
}
}
Packet one = null;
synchronized (dataQueue) {
// process IO errors if any
boolean doSleep = processDatanodeError(hasError, false);
// wait for a packet to be sent.
while ((!closed && !hasError && clientRunning && dataQueue.size() == 0) || doSleep) {
try {
dataQueue.wait(1000);
} catch (InterruptedException e) {
}
doSleep = false;
}
if (closed || hasError || dataQueue.size() == 0 || !clientRunning) {
continue;
}
try {
// get packet to be sent.
one = dataQueue.getFirst();
long offsetInBlock = one.offsetInBlock;
//向NameNode节点申请一个新的数据块，并创建与第一个存储节点的网络I/O流
if (blockStream == null) {
LOG.debug("Allocating new block");
nodes = nextBlockOutputStream(src);
this.setName("DataStreamer for file " + src + " block " + block);
response = new ResponseProcessor(nodes);
response.start();
lastTime = System.currentTimeMillis();
}
//如果将要写入的数据在当前数据块中的起始位置超过了该数据块的大小，则抛出异常
if (offsetInBlock >= blockSize) {
throw new IOException("BlockSize " + blockSize + " is smaller than data size. Offset of packet in block " + offsetInBlock + " Aborting file " + src);
}
ByteBuffer buf = one.getBuffer();
// move packet from dataQueue to ackQueue
dataQueue.removeFirst();
dataQueue.notifyAll();
synchronized (ackQueue) {
ackQueue.addLast(one);
ackQueue.notifyAll();
}
// write out data to remote datanode
blockStream.write(buf.array(), buf.position(), buf.remaining());
if (one.lastPacketInBlock) {
blockStream.writeInt(0); // indicate end-of-block
}
blockStream.flush();
if (LOG.isDebugEnabled()) {
LOG.debug("DataStreamer block " + block + " wrote packet seqno:" + one.seqno + " size:" + buf.remaining() + " offsetInBlock:" + one.offsetInBlock + " lastPacketInBlock:" + one.lastPacketInBlock);
}
} catch (Throwable e) {
LOG.warn("DataStreamer Exception: " + StringUtils.stringifyException(e));
if (e instanceof IOException) {
setLastException((IOException)e);
}
hasError = true;
}
}
if (closed || hasError || !clientRunning) {
continue;
}
//当前数据块已经写满，在收到该数据块的所有确认包之后，关闭相应的网络I/O流，确认包处理线程
if (one.lastPacketInBlock) {
synchronized (ackQueue) {
while (!hasError && ackQueue.size() != 0 && clientRunning) {
try {
ackQueue.wait(); // wait for acks to arrive from datanodes
} catch (InterruptedException e) {
}
}
}
LOG.debug("Closing old block " + block);
this.setName("DataStreamer for file " + src);
response.close(); // ignore all errors in Response
try {
response.join();
response = null;
} catch (InterruptedException e) {
}
if (closed || hasError || !clientRunning) {
continue;
}
synchronized (dataQueue) {
try {
LOG.debug("start to close wrtiter stream to DataNode["+s.getInetAddress()+"].");
blockStream.close();
LOG.debug("start to close reader stream from DataNode["+s.getInetAddress()+"].");
blockReplyStream.close();
System.out.println("send a block takes "+(System.currentTimeMillis()-lastTime)+" ms");
} catch (IOException e) {
}
nodes = null;
response = null;
blockStream = null;
blockReplyStream = null;
}
}
if (progress != null) { progress.progress(); }
// This is used by unit test to trigger race conditions.
if (artificialSlowdown != 0 && clientRunning) {
try {
Thread.sleep(artificialSlowdown);
} catch (InterruptedException e) {}
}
}
}
// shutdown thread
void close() {
closed = true;
synchronized (dataQueue) {
dataQueue.notifyAll();
}
synchronized (ackQueue) {
ackQueue.notifyAll();
}
this.interrupt();
}
}

上面的代码值得让我们注意的是，在Hadoop的官网上有关于介绍HDFS的一句话：A client request to create a file does not reach the NameNode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. 翻译这句话，我就在这里不献丑了。很多分析过源代码的朋友都认为这句话说得有问题，但是我想说的，这就话在本质上是没有问题的，因为DataStreamer总是一个数据块接着一个数据块向目标数据节点发送，也就是对于已经向某一个数据节点发送了一个Block后，DataStreamer并不是马上发送下一个Block，而是要等到packet得到确认后才发送下一个Block，假设当一个用户调用HDFS的API写入了2个Block的数据。此时DataStreamer还在等待第一个Block的所有packet的ack，那么用户的第2个Block的数据还缓存在dataQueue中，同时DataStreamer也没有向NameNode申请第二个Block。那么现在大家再来体会一下刚才那句话。是不是还有点意思呢？另外，用户不能一味的发送数据，否则缓存扛不住，所以就有一个限制了，也就是总的缓存数据不能超过maxPackets个packet，这个值视运行环境而定，目前默认是80或者是1000。ok，再来看看nextBlockOutputStream函数到底为数据块向数据节点的传送到底干了那些工作。

[java] view plain copy

/**
* 申请一个新的数据块来存储用户写入的数据，并且与第一个存储节点建立网络连接
*/
private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {
LocatedBlock lb = null;
boolean retry = false;
DatanodeInfo[] nodes;
int count = conf.getInt("dfs.client.block.write.retries", 3);
boolean success;
do {
hasError = false;
lastException = null;
errorIndex = 0;
retry = false;
nodes = null;
success = false;
long startTime = System.currentTimeMillis();
lb = locateFollowingBlock(startTime);
block = lb.getBlock();
nodes = lb.getLocations();
LOG.debug("locate a block["+block.getBlockId()+"] for file["+src+"]: "+nodes);
//与第一个存储节点建立网络连接
success = createBlockOutputStream(nodes, clientName, false);
if(!success) {
LOG.info("Abandoning block " + block);
//与第一个存储节点建立网络连接失败，所以放弃该数据块
namenode.abandonBlock(block, src, clientName);
// Connection failed. Let's wait a little bit and retry
retry = true;
try {
if (System.currentTimeMillis() - startTime > 5000) {
LOG.info("Waiting to find target node: " + nodes[0].getName());
}
Thread.sleep(6000);
} catch (InterruptedException iex) {
}
}
} while (retry && --count >= 0);
if (!success) {
throw new IOException("Unable to create new block.");
}
return nodes;
}
/*
* 与第一个存储节点建立网络连接,并写入数据块传输的头部信息:
* 1).数据传输协议版本号
* 2).数据传输类型
* 3).数据块基本信息
* a).数据块id
* b).数据块时间戳
* c).数据块的副本数
* 4).数据块恢复标记
* 5).客户端名称
* 6).源节点是否是数据节点
* 7).复制流水线中剩余的存储节点数量
* 8).剩余存储节点信息
* 9).数据校验器信息
*/
private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client, boolean recoveryFlag) {
String firstBadLink = "";
if (LOG.isDebugEnabled()) {
for (int i = 0; i < nodes.length; i++) {
LOG.debug("pipeline = " + nodes[i].getName());
}
}
// persist blocks on namenode on next flush
persistBlocks = true;
try {
LOG.debug("Connecting to " + nodes[0].getName());
InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
s = socketFactory.createSocket();
int timeoutValue = 3000 * nodes.length + socketTimeout;
NetUtils.connect(s, target, timeoutValue);
s.setSoTimeout(timeoutValue);
s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
LOG.debug("Send buf size " + s.getSendBufferSize());
long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length + datanodeWriteTimeout;
LOG.debug("create a write stream to a datanode["+target+"]");
//与第一个存储节点建立网络写入流
DataOutputStream out = new DataOutputStream(new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout), DataNode.SMALL_BUFFER_SIZE));
LOG.debug("create a read stream from a datanode["+target+"]");
//与第一个存储节点建立网络读取流
blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));
out.writeShort(DataTransferProtocol.DATA_TRANSFER_VERSION); //写入数据传输协议版本号
out.write(DataTransferProtocol.OP_WRITE_BLOCK); //写入数据传输类型
out.writeLong(block.getBlockId()); //写入数据块id
out.writeLong(block.getGenerationStamp()); //写入数据块时间戳
out.writeInt(nodes.length); //写入数据块的副本数
out.writeBoolean(recoveryFlag); //写入数据块恢复标记
Text.writeString(out, client); //写入客户端名称
out.writeBoolean(false); //写入源节点是否是数据节点
out.writeInt(nodes.length - 1); //写入复制流水线中剩余的存储节点数量
for (int i = 1; i < nodes.length; i++) { //写入剩余存储节点信息
nodes[i].write(out);
}
checksum.writeHeader(out); //写入数据校验器信息
out.flush();
//从接收端读取确认恢复包
firstBadLink = Text.readString(blockReplyStream);
if (firstBadLink.length() != 0) {
throw new IOException("Bad connect ack with firstBadLink " + firstBadLink);
}
blockStream = out;
return true; // success
} catch (IOException ie) {
LOG.info("Exception in createBlockOutputStream " + ie);
// find the datanode that matches
if(firstBadLink.length() != 0) {
for (int i = 0; i < nodes.length; i++) {
if (nodes[i].getName().equals(firstBadLink)) {
errorIndex = i;
break;
}
}
}
hasError = true;
setLastException(ie);
blockReplyStream = null;
return false; //error
}
}
/**
* 为文件申请一个新的数据块，放回数据块的存放位置
*/
private LocatedBlock locateFollowingBlock(long start) throws IOException {
int retries = conf.getInt("dfs.client.block.write.locateFollowingBlock.retries", 5);
long sleeptime = 400;
while (true) {
long localstart = System.currentTimeMillis();
while (true) {
try {
return namenode.addBlock(src, clientName);
} catch (RemoteException e) {
IOException ue = e.unwrapRemoteException(FileNotFoundException.class, AccessControlException.class, NSQuotaExceededException.class, DSQuotaExceededException.class);
if (ue != e) {
throw ue; // no need to retry these exceptions
}
if(NotReplicatedYetException.class.getName().equals(e.getClassName())) {
if (retries == 0) {
throw e;
} else {
--retries;
LOG.info(StringUtils.stringifyException(e));
if (System.currentTimeMillis() - localstart > 5000) {
LOG.info("Waiting for replication for " + (System.currentTimeMillis() - localstart) / 1000 + " seconds");
}
try {
LOG.warn("NotReplicatedYetException sleeping " + src + " retries left " + retries);
Thread.sleep(sleeptime);
sleeptime *= 2;
} catch (InterruptedException ie) {
}
}
} else {
throw e;
}
}
}
}//while
}

5.确认包处理器

对于客户端向存储节点发送的每一个数据包，客户端都要确认每一个数据包是否被所有的存储节点所真确接受了，如果有一个存储节点没有真确接受，则客户端就需要立即回复当前数据块。

[java] view plain copy

private class ResponseProcessor extends Thread {
private volatile boolean closed = false; // 响应处理器是否被关闭
private DatanodeInfo[] targets = null; // 当前数据块副本的存储位置
private boolean lastPacketInBlock = false; // 当前处理其响应的数据包是否是数据块中最后一个数据包
ResponseProcessor (DatanodeInfo[] targets) {
this.targets = targets;
}
public void run() {
this.setName("ResponseProcessor[" + block + "]");
PipelineAck ack = new PipelineAck();
while (!closed && clientRunning && !lastPacketInBlock) {
// process responses from datanodes.
try {
// read an ack from the pipeline
LOG.debug("Start to read a ack for a data packet from datanode["+s.getInetAddress()+"]");
ack.readFields(blockReplyStream, targets.length);
LOG.debug("Receive an Ack[" + ack + "]");
long seqno = ack.getSeqno();
if (seqno == PipelineAck.HEART_BEAT.getSeqno()) {
continue;
} else if (seqno == -2) {
// This signifies that some pipeline node failed to read downstream
// and therefore has no idea what sequence number the message corresponds
// to. So, we don't try to match it up with an ack.
assert ! ack.isSuccess();
} else {
Packet one = null;
synchronized (ackQueue) {
one = ackQueue.getFirst();
}
if (one.seqno != seqno) {
throw new IOException("Responseprocessor: Expecting seqno for block " + block + one.seqno + " but received " + seqno);
}
lastPacketInBlock = one.lastPacketInBlock;
}
//处理来自所有副本存放节点的响应，只要有一个出错，就抛出异常
for (int i = 0; i < targets.length && clientRunning; i++) {
short reply = ack.getReply(i);
if(reply != DataTransferProtocol.OP_STATUS_SUCCESS) {
errorIndex = i; // first bad datanode
throw new IOException("Bad response " + reply + " for block " + block + " from datanode " + targets[i].getName());
}
}
synchronized (ackQueue) {
ackQueue.removeFirst();
ackQueue.notifyAll();
}
} catch (Exception e) {
if (!closed) {
hasError = true;
if (e instanceof IOException) {
setLastException((IOException)e);
}
LOG.warn("DFSOutputStream ResponseProcessor exception for block " + block + StringUtils.stringifyException(e));
closed = true;
}
}
synchronized (dataQueue) {
dataQueue.notifyAll();
}
synchronized (ackQueue) {
ackQueue.notifyAll();
}
}
}
void close() {
closed = true;
this.interrupt();
}
}

对于客户端向数据节点传送过程中，难免会发生错误，这些错误包括，客户端向第一个数据节点写数据时发生网络错误，数据节点向数据节点写数据时发生错误，从数据节点获取packet的确认信息是发生错误等，它们都统一交给DFSOutputSream中的函数processDatanodeError来处理的。

[java] view plain copy

/*
* 处理在传输当前数据块的过程中发生的错误
*/
private boolean processDatanodeError(boolean hasError, boolean isAppend) {
if(!hasError) {
return false;
}
if (response != null) {
LOG.info("Error Recovery for block " + block + " waiting for responder to exit. ");
return true;
}
if (errorIndex >= 0) {
LOG.warn("Error Recovery for block " + block + " bad datanode[" + errorIndex + "] " + (nodes == null? "nodes == null": nodes[errorIndex].getName()));
}
if (blockStream != null) {
try {
LOG.debug("start to close wrtiter stream to DataNode["+s.getInetAddress()+"].");
blockStream.close();
LOG.debug("start to close reader stream from DataNode["+s.getInetAddress()+"]");
blockReplyStream.close();
} catch (IOException e) {
}
}
blockStream = null;
blockReplyStream = null;
// move packets from ack queue to front of the data queue
synchronized (ackQueue) {
dataQueue.addAll(0, ackQueue);
ackQueue.clear();
}
boolean success = false;
while (!success && clientRunning) {
DatanodeInfo[] newnodes = null;
if (nodes == null) {
String msg = "Could not get block locations. " + "Source file \"" + src + "\" - Aborting...";
LOG.warn(msg);
setLastException(new IOException(msg));
closed = true;
if (streamer != null) streamer.close();
return false;
}
StringBuilder pipelineMsg = new StringBuilder();
for (int j = 0; j < nodes.length; j++) {
pipelineMsg.append(nodes[j].getName());
if (j < nodes.length - 1) {
pipelineMsg.append(", ");
}
}
//从当前数据块的存储节点集合中删除出错的数据节点
if (errorIndex < 0) {
newnodes = nodes;
} else {
if (nodes.length <= 1) {
lastException = new IOException("All datanodes " + pipelineMsg + " are bad. Aborting...");
closed = true;
if (streamer != null) streamer.close();
return false;
}
LOG.warn("Error Recovery for block " + block + " in pipeline " + pipelineMsg + ": bad datanode " + nodes[errorIndex].getName());
newnodes = new DatanodeInfo[nodes.length-1];
System.arraycopy(nodes, 0, newnodes, 0, errorIndex);
System.arraycopy(nodes, errorIndex+1, newnodes, errorIndex, newnodes.length-errorIndex);
}
// Tell the primary datanode to do error recovery by stamping appropriate generation stamps.
LocatedBlock newBlock = null;
ClientDatanodeProtocol primary = null;
DatanodeInfo primaryNode = null;
try {
//从当前有效的存储节点中选取一个节点作为数据块恢复的主节点
primaryNode = Collections.min(Arrays.asList(newnodes));
primary = createClientDatanodeProtocolProxy(primaryNode, conf);
//过RPC让主存储节点恢复数据块
newBlock = primary.recoverBlock(block, isAppend, newnodes);
} catch (IOException e) {
recoveryErrorCount++;
if (recoveryErrorCount > maxRecoveryErrorCount) {
if (nodes.length > 1) {
// if the primary datanode failed, remove it from the list.
// The original bad datanode is left in the list because it is
// conservative to remove only one datanode in one iteration.
for (int j = 0; j < nodes.length; j++) {
if (nodes[j].equals(primaryNode)) {
errorIndex = j; // forget original bad node.
}
}
//主存储节点恢复数据块失败多次，从存储节点集合中删除
newnodes = new DatanodeInfo[nodes.length-1];
System.arraycopy(nodes, 0, newnodes, 0, errorIndex);
System.arraycopy(nodes, errorIndex+1, newnodes, errorIndex, newnodes.length-errorIndex);
nodes = newnodes;
LOG.warn("Error Recovery for block " + block + " failed " + " because recovery from primary datanode " + primaryNode + " failed " + recoveryErrorCount + " times. Pipeline was " + pipelineMsg + ". Marking primary datanode as bad.");
recoveryErrorCount = 0;
errorIndex = -1;
return true; // sleep when we return from here
}
String emsg = "Error Recovery for block " + block + " failed because recovery from primary datanode " + primaryNode + " failed " + recoveryErrorCount + " times. Pipeline was " + pipelineMsg + ". Aborting...";
LOG.warn(emsg);
lastException = new IOException(emsg);
//已无可用的存储节点来恢复当前数据块，所以抛出异常终止写入
closed = true;
if (streamer != null) streamer.close();
return false; // abort with IOexception
}
LOG.warn("Error Recovery for block " + block + " failed because recovery from primary datanode " + primaryNode + " failed " + recoveryErrorCount + " times. " + " Pipeline was " + pipelineMsg + ". Will retry...");
return true; // sleep when we return from here
} finally {
RPC.stopProxy(primary);
}
recoveryErrorCount = 0;
if (newBlock != null) {
block = newBlock.getBlock();
nodes = newBlock.getLocations();
}
//当前数据块恢复成功，获取该数据块新的存储位置
this.hasError = false;
lastException = null;
errorIndex = 0;
success = createBlockOutputStream(nodes, clientName, true);
}//while
response = new ResponseProcessor(nodes);
response.start();
return false; // do not sleep, continue processing
}

上面的代码可能有些复杂，我还是简单的描述一下关于processDatanodeError函数处理I/O过程中的错误：当有错误发生时。肯定是某一个数据节点发生了问题，那么首先会把这个有问题的数据节点删除掉。然后从剩余的可用的数据节点中选取一个，让它来恢复当前的这个Block，如果成功了ok，而失败，则删除再删除这个出问题的节点，则继续选节点来恢复Block，直到成功，如果最后没有可用的数据节点来恢复Block，则宣告这个Block写入失败，将关闭DFSOutputSream流，当用户再次写入时抛出异常。
从HDFS对OutputStream的实现来看，还存在一定的缺陷，它将数据流切包发送以降低风险及异常恢复的时间，这种设计本身是极好的，但对于每一次的封包操作，它在实现上都是重新new一个Packet对象来存储的，这样带来的影响就是可能会频繁的触发Minor GC，如果你Young Gen中的Eden区大小配置小于一个数据块的长度的话，很有可能触发Full GC；此外，对于每一个数据块都会开启一个数据包确认线程，如果是写大文件的话，也会频繁的创建线程而造成不小的资源开销。所以，如果能够复用Packet对象和数据包确认线程的话，将有可能提高客户端写的性能。

秒客网

hadoop源码 - HDFS的文件操作流写操作(客户端)

1.数据校验器FSOutputSummer

2.DFSOutputStream

3.数据包装器Packet

4.数据包发送器DataStreamer

5.确认包处理器

相关文章

hadoop源码 - HDFS的文件操作流 写操作(客户端)

1.数据校验器FSOutputSummer

2.DFSOutputStream

3.数据包装器Packet

4.数据包发送器DataStreamer

5.确认包处理器

相关文章

hadoop源码 - HDFS的文件操作流写操作(客户端)