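Fetcher Analysis
Fetcher: fetches data from the broker by offset. It sends FetchRequests for the specified sets of messages, processes the FetchResponses, and updates the consumed position.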
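1 Important fields
ConsumerNetworkClient client: handles network communication and sends requests.
int minBytes: after receiving a FetchRequest, the broker does not respond immediately; it waits until at least minBytes of returnable message data has accumulated, so that each FetchResponse carries multiple messages. Configured with fetch.min.bytes.
int maxBytes: the maximum number of bytes the broker should return in a single FetchResponse, however much data has accumulated. Configured with fetch.max.bytes.
int maxWaitMs: the maximum time to wait for a FetchResponse; the broker uses it to decide when to respond even if minBytes has not been reached. Configured with fetch.max.wait.ms.
int fetchSize: the maximum number of bytes fetched per partition in one fetch. Configured with max.partition.fetch.bytes.
int maxPollRecords: the maximum number of records returned per poll. Configured with max.poll.records.
Metadata metadata: the Kafka cluster metadata.
SubscriptionState subscriptions: tracks the consumption state of each TopicPartition, mainly so that offsets can be looked up quickly.
ConcurrentLinkedQueue<CompletedFetch> completedFetches: each FetchResponse is first converted into CompletedFetch objects and buffered in this queue; the messages are not parsed at this point.
Deserializer<K> keyDeserializer: the deserializer for keys.
Deserializer<V> valueDeserializer: the deserializer for values.
PartitionRecords<K, V> nextInLineRecords: holds the parsed result of a CompletedFetch. It has three fields: records is the list of messages, fetchOffset is the offset of the first message in records, and partition is the TopicPartition the messages belong to.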
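The configuration keys named above are ordinary consumer properties. The following is a minimal sketch that only uses java.util.Properties, not the Kafka client classes; the class name FetchConfigSketch and every value in it are illustrative assumptions, not recommendations.

```java
import java.util.Properties;

class FetchConfigSketch {
    // Collects the fetch-related consumer settings discussed above.
    // All values are placeholders chosen for illustration only.
    static Properties fetchProperties() {
        Properties props = new Properties();
        props.setProperty("fetch.min.bytes", "1024");              // backs minBytes
        props.setProperty("fetch.max.bytes", "52428800");          // backs maxBytes
        props.setProperty("fetch.max.wait.ms", "500");             // backs maxWaitMs
        props.setProperty("max.partition.fetch.bytes", "1048576"); // backs fetchSize
        props.setProperty("max.poll.records", "500");              // backs maxPollRecords
        return props;
    }

    public static void main(String[] args) {
        fetchProperties().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In a real application these entries would be passed to the consumer together with the connection and deserializer settings.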
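2 Fetching messages
To obtain messages, the consumer has to send a FetchRequest, so the analysis starts with FetchRequest and FetchResponse.
First, the message body formats of FetchRequest and FetchResponse.
FetchRequest body:
replica_id: identifies the follower's id (an ordinary consumer sends -1)
max_wait_time: the maximum time the request may wait
min_bytes: the minimum payload of the response
fetch_offset: the offset to start fetching from
max_bytes: the maximum number of bytes per fetch
FetchResponse body:
high_watermark: the leader's high watermark
record_set: the fetched message data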
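2.1 createFetchRequests
The createFetchRequests method builds the FetchRequests and returns a Map<Node, FetchRequest>:
# First find the fetchable partitions: those that are not paused and have a valid offset, for which nextInLineRecords holds no messages and the completedFetches queue holds no CompletedFetch.
# Find the leader replica of each partition, because only the leader handles read and write requests; if no leader is found, request a metadata update.
# Check whether there is a pending (unsent or in-flight) request to that leader node. If there is none, take the partition's current offset and wrap the offset and fetchSize into a PartitionData.
# Group by node: wrap all TopicPartitions and PartitionData bound for the same node into one FetchRequest.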
private Map<Node, FetchRequest> createFetchRequests() {
    // Get the current cluster information
    Cluster cluster = metadata.fetch();
    // Build a Node -> (TopicPartition -> PartitionData) mapping,
    // e.g. (0=(topic-0,data), 0=(topic-1,data), 1=(topic-1,data))
    Map<Node, LinkedHashMap<TopicPartition, FetchRequest.PartitionData>> fetchable = new LinkedHashMap<>();
    // Iterate over the partitions that can currently be fetched
    for (TopicPartition partition : fetchablePartitions()) {
        // Find the leader broker of the partition
        Node node = cluster.leaderFor(partition);
        // If no leader is known, request a metadata update
        if (node == null) {
            metadata.requestUpdate();
        } else if (this.client.pendingRequestCount(node) == 0) {
            // No pending request to this node: look up its (partition -> PartitionData) map
            LinkedHashMap<TopicPartition, FetchRequest.PartitionData> fetch = fetchable.get(node);
            if (fetch == null) {
                fetch = new LinkedHashMap<>();
                fetchable.put(node, fetch);
            }
            // Get the current position of the partition
            long position = this.subscriptions.position(partition);
            // Put (partition, PartitionData) into the fetch map
            fetch.put(partition, new FetchRequest.PartitionData(position, this.fetchSize));
            log.trace("Added fetch request for partition {} at offset {}", partition, position);
        } else {
            log.trace("Skipping fetch for partition {} because there is an in-flight request to {}", partition, node);
        }
    }
    // Group by node: wrap all TopicPartitions and PartitionData bound for the same node into one FetchRequest
    Map<Node, FetchRequest> requests = new HashMap<>();
    for (Map.Entry<Node, LinkedHashMap<TopicPartition, FetchRequest.PartitionData>> entry : fetchable.entrySet()) {
        Node node = entry.getKey();
        FetchRequest fetch = new FetchRequest(this.maxWaitMs, this.minBytes, this.maxBytes, entry.getValue());
        requests.put(node, fetch);
    }
    return requests;
}
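The grouping step in createFetchRequests can be studied in isolation. Below is a simplified sketch that replaces Node with an integer id and TopicPartition with a string; the class GroupByLeaderSketch and its method are hypothetical stand-ins, not Kafka classes, and the pending-request check is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class GroupByLeaderSketch {
    // leaders:   partition name -> leader node id (absent when the leader is unknown)
    // positions: partition name -> next offset to fetch
    // Returns node id -> (partition -> offset); each inner map would become one request.
    static Map<Integer, Map<String, Long>> groupByLeader(Map<String, Integer> leaders,
                                                         Map<String, Long> positions) {
        Map<Integer, Map<String, Long>> byNode = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : positions.entrySet()) {
            Integer leader = leaders.get(e.getKey());
            if (leader == null)
                continue; // the Fetcher requests a metadata update in this case
            byNode.computeIfAbsent(leader, n -> new LinkedHashMap<>())
                  .put(e.getKey(), e.getValue());
        }
        return byNode;
    }

    public static void main(String[] args) {
        Map<String, Integer> leaders = new LinkedHashMap<>();
        leaders.put("topic-0", 0);
        leaders.put("topic-1", 0);
        leaders.put("topic-2", 1);
        Map<String, Long> positions = new LinkedHashMap<>();
        positions.put("topic-0", 10L);
        positions.put("topic-1", 20L);
        positions.put("topic-2", 30L);
        positions.put("topic-3", 40L); // no known leader, so it is skipped
        System.out.println(groupByLeader(leaders, positions));
    }
}
```

Using a LinkedHashMap for the inner map keeps the partitions in insertion order, the same choice the method above makes for its fetchable map.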
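2.2 sendFetches
The sendFetches method adds each FetchRequest to the unsent list of the ConsumerNetworkClient, where it waits to be sent, and registers a handler for the FetchResponse. The handler parses the FetchResponse, groups it by TopicPartition, wraps the fetched message data and offset into CompletedFetch objects, and adds them to the completedFetches queue.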
public void sendFetches() {
    // Iterate over the <node, fetch request> mapping that was built
    for (Map.Entry<Node, FetchRequest> fetchEntry : createFetchRequests().entrySet()) {
        final FetchRequest request = fetchEntry.getValue();
        final Node fetchTarget = fetchEntry.getKey();
        // Queue the fetch request for the node
        client.send(fetchTarget, ApiKeys.FETCH, request)
                .addListener(new RequestFutureListener<ClientResponse>() {
                    // Callback invoked when the request succeeds
                    @Override
                    public void onSuccess(ClientResponse resp) {
                        FetchResponse response = new FetchResponse(resp.responseBody());
                        // Check that the FetchRequest and FetchResponse cover the same partitions; if not, return
                        if (!matchesRequestedPartitions(request, response)) {
                            log.warn("Ignoring fetch response containing partitions {} since it does not match " +
                                    "the requested partitions {}", response.responseData().keySet(),
                                    request.fetchData().keySet());
                            return;
                        }
                        // Collect all partitions contained in the FetchResponse
                        Set<TopicPartition> partitions = new HashSet<>(response.responseData().keySet());
                        FetchResponseMetricAggregator metricAggregator = new FetchResponseMetricAggregator(sensors, partitions);
                        // Wrap partition, fetch offset and data into a CompletedFetch and add it to the completedFetches queue
                        for (Map.Entry<TopicPartition, FetchResponse.PartitionData> entry : response.responseData().entrySet()) {
                            TopicPartition partition = entry.getKey();
                            long fetchOffset = request.fetchData().get(partition).offset;
                            FetchResponse.PartitionData fetchData = entry.getValue();
                            completedFetches.add(new CompletedFetch(partition, fetchOffset, fetchData, metricAggregator));
                        }
                        sensors.fetchLatency.record(resp.requestLatencyMs());
                        sensors.fetchThrottleTimeSensor.record(response.getThrottleTime());
                    }
                    // Callback invoked when the request fails
                    @Override
                    public void onFailure(RuntimeException e) {
                        log.debug("Fetch request to {} failed", fetchTarget, e);
                    }
                });
    }
}
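2.3 fetchedRecords
The message data stored in the completedFetches queue is still unparsed FetchResponse.PartitionData. The fetchedRecords method parses the message data of each CompletedFetch into a collection of records and returns it; it also updates the position of the corresponding TopicPartitionState.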
public Map<TopicPartition, List<ConsumerRecord<K, V>>> fetchedRecords() {
    // Map from partition to the list of ConsumerRecords fetched for it
    Map<TopicPartition, List<ConsumerRecord<K, V>>> drained = new HashMap<>();
    // At most maxPollRecords records may be returned
    int recordsRemaining = maxPollRecords;
    // Continue while there is still room for more records
    while (recordsRemaining > 0) {
        // nextInLineRecords is missing or has been fully consumed
        if (nextInLineRecords == null || nextInLineRecords.isDrained()) {
            // Poll a CompletedFetch from the completedFetches queue; if there is one, parse it into PartitionRecords
            CompletedFetch completedFetch = completedFetches.poll();
            if (completedFetch == null)
                break;
            nextInLineRecords = parseFetchedData(completedFetch);
        } else { // nextInLineRecords still has unconsumed records
            // Get the partition from nextInLineRecords
            TopicPartition partition = nextInLineRecords.partition;
            // Take up to recordsRemaining records
            List<ConsumerRecord<K, V>> records = drainRecords(nextInLineRecords, recordsRemaining);
            // Group the records by partition
            if (!records.isEmpty()) {
                // Records already collected for this partition
                List<ConsumerRecord<K, V>> currentRecords = drained.get(partition);
                if (currentRecords == null) {
                    drained.put(partition, records);
                } else {
                    // this case shouldn't usually happen because we only send one fetch at a time per partition,
                    // but it might conceivably happen in some rare cases (such as partition leader changes).
                    // we have to copy to a new list because the old one may be immutable
                    List<ConsumerRecord<K, V>> newRecords = new ArrayList<>(records.size() + currentRecords.size());
                    newRecords.addAll(currentRecords);
                    newRecords.addAll(records);
                    drained.put(partition, newRecords);
                }
                // Update the number of records that may still be returned
                recordsRemaining -= records.size();
            }
        }
    }
    return drained;
}
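The interplay between maxPollRecords, the queue, and the partially consumed batch is easier to see in a stripped-down model. The sketch below is an assumption-laden simplification: records are plain strings, each batch is a Deque, and the head of the queue plays the role of nextInLineRecords. DrainSketch and poll are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Queue;

class DrainSketch {
    // Takes at most maxPollRecords items from the queued batches, in order.
    // A partially consumed batch stays at the head of the queue for the next call.
    static List<String> poll(Queue<Deque<String>> completed, int maxPollRecords) {
        List<String> drained = new ArrayList<>();
        int remaining = maxPollRecords;
        while (remaining > 0) {
            Deque<String> head = completed.peek();
            if (head == null)
                break;            // nothing is buffered
            if (head.isEmpty()) {
                completed.poll(); // batch fully drained, move on to the next one
                continue;
            }
            drained.add(head.poll());
            remaining--;
        }
        return drained;
    }

    public static void main(String[] args) {
        Queue<Deque<String>> completed = new ArrayDeque<>();
        completed.add(new ArrayDeque<>(Arrays.asList("a", "b", "c")));
        completed.add(new ArrayDeque<>(Arrays.asList("d", "e")));
        System.out.println(poll(completed, 4)); // first call stops at the record budget
        System.out.println(poll(completed, 4)); // second call picks up the remainder
    }
}
```

The point of the budget is that one large fetch can be handed to the application over several calls without another network round trip.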
private List<ConsumerRecord<K, V>> drainRecords(PartitionRecords<K, V> partitionRecords, int maxRecords) {
    if (partitionRecords.isDrained())
        return Collections.emptyList();
    // Check whether the partition of the returned data is still assigned
    if (!subscriptions.isAssigned(partitionRecords.partition)) {
        // this can happen when a rebalance happened before fetched records are returned to the consumer's poll call
        log.debug("Not returning fetched records for partition {} since it is no longer assigned", partitionRecords.partition);
    } else { // the partition is still assigned
        // Get the current position of the partition
        long position = subscriptions.position(partitionRecords.partition);
        // Check whether the partition is fetchable, i.e. not paused and with a valid offset
        if (!subscriptions.isFetchable(partitionRecords.partition)) {
            // this can happen when a partition is paused before fetched records are returned to the consumer's poll call
            log.debug("Not returning fetched records for assigned partition {} since it is no longer fetchable", partitionRecords.partition);
        } else if (partitionRecords.fetchOffset == position) {
            // Take up to maxRecords records from the returned data
            List<ConsumerRecord<K, V>> partRecords = partitionRecords.drainRecords(maxRecords);
            // The next offset to fetch is the offset of the last returned record plus one
            long nextOffset = partRecords.get(partRecords.size() - 1).offset() + 1;
            log.trace("Returning fetched records at offset {} for assigned partition {} and update " +
                    "position to {}", position, partitionRecords.partition, nextOffset);
            // Update the position for the next fetch
            subscriptions.position(partitionRecords.partition, nextOffset);
            return partRecords;
        } else {
            // these records aren't next in line based on the last consumed position, ignore them
            // they must be from an obsolete request
            log.debug("Ignoring fetched records for {} at offset {} since the current position is {}",
                    partitionRecords.partition, partitionRecords.fetchOffset, position);
        }
    }
    // Discard the remaining data
    partitionRecords.drain();
    return Collections.emptyList();
}
private PartitionRecords<K, V> parseFetchedData(CompletedFetch completedFetch) {
    // The partition this completed fetch belongs to
    TopicPartition tp = completedFetch.partition;
    // The PartitionData of this completed fetch
    FetchResponse.PartitionData partition = completedFetch.partitionData;
    // The offset this fetch was requested at
    long fetchOffset = completedFetch.fetchedOffset;
    int bytes = 0;
    int recordsCount = 0;
    PartitionRecords<K, V> parsedRecords = null;
    Errors error = Errors.forCode(partition.errorCode);
    try {
        // Check whether the partition is unassigned, or assigned but not fetchable
        if (!subscriptions.isFetchable(tp)) {
            // Typically happens during a rebalance or while the partition is paused
            log.debug("Ignoring fetched records for partition {} since it is no longer fetchable", tp);
        } else if (error == Errors.NONE) { // no error
            // We are only interested in this fetch if its start offset matches the current consumed position
            Long position = subscriptions.position(tp);
            if (position == null || position != fetchOffset) {
                log.debug("Discarding stale fetch response for partition {} since its offset {} does not match " +
                        "the expected offset {}", tp, fetchOffset, position);
                return null;
            }
            // Take the raw data
            ByteBuffer buffer = partition.recordSet;
            // Convert it into MemoryRecords
            MemoryRecords records = MemoryRecords.readableRecords(buffer);
            List<ConsumerRecord<K, V>> parsed = new ArrayList<>();
            for (LogEntry logEntry : records) {
                // Skip messages that are earlier than the current position
                if (logEntry.offset() >= position) {
                    parsed.add(parseRecord(tp, logEntry));
                    // Accumulate the number of message bytes
                    bytes += logEntry.size();
                }
            }
            // Number of parsed records
            recordsCount = parsed.size();
            this.sensors.recordTopicFetchMetrics(tp.topic(), bytes, recordsCount);
            // If any records were parsed, wrap them in a PartitionRecords
            if (!parsed.isEmpty()) {
                log.trace("Adding fetched record for partition {} with offset {} to buffered record list", tp, position);
                parsedRecords = new PartitionRecords<>(fetchOffset, tp, parsed);
                // Use the last record to report the fetch lag (high watermark minus its offset)
                ConsumerRecord<K, V> record = parsed.get(parsed.size() - 1);
                this.sensors.recordsFetchLag.record(partition.highWatermark - record.offset());
            }
        } else if (error == Errors.NOT_LEADER_FOR_PARTITION) { // the broker is not the leader for this partition
            log.debug("Error in fetch for partition {}: {}", tp, error.exceptionName());
            // Request a metadata update
            this.metadata.requestUpdate();
        } else if (error == Errors.UNKNOWN_TOPIC_OR_PARTITION) { // unknown topic or partition
            log.warn("Received unknown topic or partition error in fetch for partition {}. The topic/partition " +
                    "may not exist or the user may not have Describe access to it", tp);
            // Request a metadata update
            this.metadata.requestUpdate();
        } else if (error == Errors.OFFSET_OUT_OF_RANGE) { // the offset is out of range
            if (fetchOffset != subscriptions.position(tp)) {
                log.debug("Discarding stale fetch response for partition {} since the fetched offset {}" +
                        "does not match the current offset {}", tp, fetchOffset, subscriptions.position(tp));
            } else if (subscriptions.hasDefaultOffsetResetPolicy()) {
                log.info("Fetch offset {} is out of range for partition {}, resetting offset", fetchOffset, tp);
                subscriptions.needOffsetReset(tp);
            } else {
                throw new OffsetOutOfRangeException(Collections.singletonMap(tp, fetchOffset));
            }
        } else if (error == Errors.TOPIC_AUTHORIZATION_FAILED) { // not authorized to read the topic
            log.warn("Not authorized to read from topic {}.", tp.topic());
            throw new TopicAuthorizationException(Collections.singleton(tp.topic()));
        } else if (error == Errors.UNKNOWN) {
            log.warn("Unknown error fetching data for topic-partition {}", tp);
        } else {
            throw new IllegalStateException("Unexpected error code " + error.code() + " while fetching data");
        }
    } finally {
        completedFetch.metricAggregator.record(tp, bytes, recordsCount);
    }
    // If we received some bytes or there was an error, move the partition to the end,
    // so that partitions of the same topic are more likely to stay together
    if (bytes > 0 || error != Errors.NONE)
        subscriptions.movePartitionToEnd(tp);
    return parsedRecords;
}
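Two offset checks carry most of the logic in parseFetchedData and drainRecords: a response is dropped when it was requested at a different position, and records older than the position are skipped. The sketch below models only those checks with plain long offsets; ParseFilterSketch and its methods are hypothetical names, and error handling and metrics are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParseFilterSketch {
    // Returns the offsets that may be handed to the application, or null when the
    // whole response is stale because it was requested at a different position.
    static List<Long> usableOffsets(long fetchOffset, long position, List<Long> returnedOffsets) {
        if (fetchOffset != position)
            return null; // stale response, discard it
        List<Long> usable = new ArrayList<>();
        for (long offset : returnedOffsets) {
            // The response may start before the requested offset, so skip older records.
            if (offset >= position)
                usable.add(offset);
        }
        return usable;
    }

    // After returning records, the next position is the last returned offset plus one.
    static long nextPosition(List<Long> returned) {
        return returned.get(returned.size() - 1) + 1;
    }

    public static void main(String[] args) {
        List<Long> usable = usableOffsets(7L, 7L, Arrays.asList(5L, 6L, 7L, 8L));
        System.out.println(usable + " -> next position " + nextPosition(usable));
        System.out.println(usableOffsets(5L, 7L, Arrays.asList(5L, 6L)));
    }
}
```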
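3 Updating the position
The first time a partition of a topic is consumed, the broker's internal offsets topic holds no committed offset for it. If the user specifies a starting offset manually, consumption starts from that offset.
Otherwise the TopicPartitionState.position field has to be reset, and the reset involves ListOffsetRequest and ListOffsetResponse.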
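Judging from the code in this section, a ListOffsetRequest carries a replica id plus a target timestamp per partition, and the ListOffsetResponse is turned into an OffsetAndTimestamp per partition.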
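updateFetchPositions: updates the position of each partition
# Skip the partition if it is not assigned, or if it is already fetchable (it already has a valid position).
# If a resetStrategy is set, reset according to it: EARLIEST resets to the smallest offset, LATEST to the largest.
# If no reset is pending, set the TopicPartition's position to committed (the last committed offset).
# If committed is null, reset using the default reset strategy.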
public void updateFetchPositions(Set<TopicPartition> partitions) {
    // Reset the fetch positions to the committed positions
    for (TopicPartition tp : partitions) {
        // Skip partitions that are not assigned or that already have a fetch position
        if (!subscriptions.isAssigned(tp) || subscriptions.isFetchable(tp))
            continue;
        // Does the partition need an offset reset? This depends on whether a resetStrategy is set
        if (subscriptions.isOffsetResetNeeded(tp)) {
            // Reset the offset according to the given strategy
            resetOffset(tp);
        } else if (subscriptions.committed(tp) == null) {
            // There is no committed offset, so reset using the default reset strategy
            subscriptions.needOffsetReset(tp);
            resetOffset(tp);
        } else {
            // A committed offset exists, so read it
            long committed = subscriptions.committed(tp).offset();
            log.debug("Resetting offset for partition {} to the committed offset {}", tp, committed);
            // Move the TopicPartition's position to the last committed offset
            subscriptions.seek(tp, committed);
        }
    }
}
private void resetOffset(TopicPartition partition) {
    // Get the reset strategy recorded for the partition
    OffsetResetStrategy strategy = subscriptions.resetStrategy(partition);
    final long timestamp;
    // Choose the timestamp to look up according to the reset strategy
    if (strategy == OffsetResetStrategy.EARLIEST)
        timestamp = ListOffsetRequest.EARLIEST_TIMESTAMP;
    else if (strategy == OffsetResetStrategy.LATEST)
        timestamp = ListOffsetRequest.LATEST_TIMESTAMP;
    else
        throw new NoOffsetForPartitionException(partition);
    log.debug("Resetting offset for partition {} to {} offset.", partition, strategy.name().toLowerCase(Locale.ROOT));
    // Look up the offset for the timestamp
    long offset = getOffsetsByTimes(Collections.singletonMap(partition, timestamp), Long.MAX_VALUE).get(partition).offset();
    // we might lose the assignment while fetching the offset, so check it is still active
    if (subscriptions.isAssigned(partition))
        this.subscriptions.seek(partition, offset);
}
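The branching in updateFetchPositions reduces to a small decision function. The following sketch captures only that decision, assuming the four inputs have already been read from the subscription state; PositionDecisionSketch, Action, and decide are hypothetical names introduced for illustration.

```java
class PositionDecisionSketch {
    enum Action { SKIP, RESET_BY_STRATEGY, RESET_BY_DEFAULT, SEEK_TO_COMMITTED }

    // assigned:    the partition is currently assigned to this consumer
    // fetchable:   the partition already has a valid position and is not paused
    // resetNeeded: a reset strategy has been recorded for the partition
    // committed:   the last committed offset, or null if none exists
    static Action decide(boolean assigned, boolean fetchable, boolean resetNeeded, Long committed) {
        if (!assigned || fetchable)
            return Action.SKIP;
        if (resetNeeded)
            return Action.RESET_BY_STRATEGY;
        if (committed == null)
            return Action.RESET_BY_DEFAULT;
        return Action.SEEK_TO_COMMITTED;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, false, false, 42L));
        System.out.println(decide(true, false, false, null));
        System.out.println(decide(true, true, false, 42L));
    }
}
```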
// Look up offsets by timestamp
public Map<TopicPartition, OffsetAndTimestamp> getOffsetsByTimes(Map<TopicPartition, Long> timestampsToSearch,
                                                                 long timeout) {
    // With no partitions and timestamps to search there is nothing to look up, so return an empty map
    if (timestampsToSearch.isEmpty())
        return Collections.emptyMap();
    long startMs = time.milliseconds();
    // The remaining time starts out as the full timeout
    long remaining = timeout;
    do {
        // Send the ListOffsetRequests
        RequestFuture<Map<TopicPartition, OffsetAndTimestamp>> future = sendListOffsetRequests(timestampsToSearch);
        client.poll(future, remaining);
        // The future did not complete within the remaining time
        if (!future.isDone())
            break;
        if (future.succeeded())
            return future.value();
        // Failures that cannot be retried are rethrown
        if (!future.isRetriable())
            throw future.exception();
        long elapsed = time.milliseconds() - startMs;
        remaining = timeout - elapsed;
        if (remaining <= 0)
            break;
        // Stale metadata: wait for an update; otherwise back off before retrying
        if (future.exception() instanceof InvalidMetadataException)
            client.awaitMetadataUpdate(remaining);
        else
            time.sleep(Math.min(remaining, retryBackoffMs));
        elapsed = time.milliseconds() - startMs;
        remaining = timeout - elapsed;
    } while (remaining > 0);
    throw new TimeoutException("Failed to get offsets by times in " + timeout + " ms");
}
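The loop in getOffsetsByTimes follows a common pattern: retry against a fixed time budget and back off between attempts. Here is a minimal sketch of that pattern under simplifying assumptions: a null result stands for a retriable failure, System.currentTimeMillis and Thread.sleep replace Kafka's Time abstraction, and the metadata-update branch is omitted. RetryWithDeadlineSketch is a hypothetical name.

```java
import java.util.function.Supplier;

class RetryWithDeadlineSketch {
    // Calls attempt until it returns a non-null value or timeoutMs has elapsed.
    // backoffMs plays the role of retryBackoffMs.
    static <T> T retry(Supplier<T> attempt, long timeoutMs, long backoffMs) {
        long start = System.currentTimeMillis();
        long remaining = timeoutMs;
        do {
            T value = attempt.get();
            if (value != null)
                return value;
            remaining = timeoutMs - (System.currentTimeMillis() - start);
            if (remaining <= 0)
                break;
            try {
                // Never sleep past the deadline
                Thread.sleep(Math.min(remaining, backoffMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while backing off", e);
            }
            remaining = timeoutMs - (System.currentTimeMillis() - start);
        } while (remaining > 0);
        throw new IllegalStateException("no result within " + timeoutMs + " ms");
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice, then succeeds on the third attempt
        String result = retry(() -> ++calls[0] < 3 ? null : "offset found", 1000, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Recomputing the remaining time after both the attempt and the sleep keeps the total wait close to the requested timeout.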
private RequestFuture<Map<TopicPartition, OffsetAndTimestamp>> sendListOffsetRequests(final Map<TopicPartition, Long> timestampsToSearch) {
    // Group the <TopicPartition, Long> entries by Node into a new map
    final Map<Node, Map<TopicPartition, Long>> timestampsToSearchByNode = new HashMap<>();
    for (Map.Entry<TopicPartition, Long> entry: timestampsToSearch.entrySet()) {
        // Get the TopicPartition
        TopicPartition tp = entry.getKey();
        // Get the metadata of this topic partition
        PartitionInfo info = metadata.fetch().partition(tp);
        // If the partition's metadata is unknown, add the topic to the metadata
        if (info == null) {
            metadata.add(tp.topic());
            log.debug("Partition {} is unknown for fetching offset, wait for metadata refresh", tp);
            return RequestFuture.staleMetadata();
        } else if (info.leader() == null) { // no leader is available
            log.debug("Leader for partition {} unavailable for fetching offset, wait for metadata refresh", tp);
            return RequestFuture.leaderNotAvailable();
        } else {
            // Get the leader
            Node node = info.leader();
            Map<TopicPartition, Long> topicData = timestampsToSearchByNode.get(node);
            if (topicData == null) {
                topicData = new HashMap<>();
                timestampsToSearchByNode.put(node, topicData);
            }
            topicData.put(entry.getKey(), entry.getValue());
        }
    }
    final RequestFuture<Map<TopicPartition, OffsetAndTimestamp>> listOffsetRequestsFuture = new RequestFuture<>();
    final Map<TopicPartition, OffsetAndTimestamp> fetchedTimestampOffsets = new HashMap<>();
    final AtomicInteger remainingResponses = new AtomicInteger(timestampsToSearchByNode.size());
    for (Map.Entry<Node, Map<TopicPartition, Long>> entry : timestampsToSearchByNode.entrySet()) {
        // Send one ListOffsetRequest per node, based on the map built above
        sendListOffsetRequest(entry.getKey(), entry.getValue())
                .addListener(new RequestFutureListener<Map<TopicPartition, OffsetAndTimestamp>>() {
                    @Override
                    public void onSuccess(Map<TopicPartition, OffsetAndTimestamp> value) {
                        synchronized (listOffsetRequestsFuture) {
                            // Merge this node's result; complete once every node has answered
                            fetchedTimestampOffsets.putAll(value);
                            if (remainingResponses.decrementAndGet() == 0 && !listOffsetRequestsFuture.isDone())
                                listOffsetRequestsFuture.complete(fetchedTimestampOffsets);
                        }
                    }
                    @Override
                    public void onFailure(RuntimeException e) {
                        synchronized (listOffsetRequestsFuture) {
                            // This may cause all the requests to be retried, but should be rare.
                            if (!listOffsetRequestsFuture.isDone())
                                listOffsetRequestsFuture.raise(e);
                        }
                    }
                });
    }
    return listOffsetRequestsFuture;
}
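sendListOffsetRequests combines one future per node into a single future by counting down the outstanding responses. The sketch below shows the same idea with java.util.concurrent.CompletableFuture standing in for Kafka's RequestFuture, which is a deliberate substitution; AggregateFuturesSketch and mergeAll are hypothetical names, and keys are plain strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

class AggregateFuturesSketch {
    // Returns a future that completes with the merged map once every part has
    // succeeded, or fails as soon as any part fails.
    static CompletableFuture<Map<String, Long>> mergeAll(List<CompletableFuture<Map<String, Long>>> parts) {
        CompletableFuture<Map<String, Long>> all = new CompletableFuture<>();
        Map<String, Long> merged = new HashMap<>();
        AtomicInteger remaining = new AtomicInteger(parts.size());
        if (parts.isEmpty())
            all.complete(merged);
        for (CompletableFuture<Map<String, Long>> part : parts) {
            part.whenComplete((value, error) -> {
                synchronized (all) {
                    if (error != null) {
                        all.completeExceptionally(error); // no effect if already completed
                    } else {
                        merged.putAll(value);
                        if (remaining.decrementAndGet() == 0)
                            all.complete(merged);
                    }
                }
            });
        }
        return all;
    }

    public static void main(String[] args) {
        CompletableFuture<Map<String, Long>> node0 = new CompletableFuture<>();
        CompletableFuture<Map<String, Long>> node1 = new CompletableFuture<>();
        CompletableFuture<Map<String, Long>> all = mergeAll(Arrays.asList(node0, node1));
        node0.complete(Collections.singletonMap("topic-0", 5L));
        System.out.println("done after first response: " + all.isDone());
        node1.complete(Collections.singletonMap("topic-1", 9L));
        System.out.println("merged entries: " + all.join().size());
    }
}
```

As in the method above, a single failure fails the combined future, so the caller retries the whole lookup rather than tracking partial results.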
private RequestFuture<Map<TopicPartition, OffsetAndTimestamp>> sendListOffsetRequest(final Node node,
                                                                                   final Map<TopicPartition, Long> timestampsToSearch) {
    // Create a ListOffsetRequest
    ListOffsetRequest request = new ListOffsetRequest(timestampsToSearch, ListOffsetRequest.CONSUMER_REPLICA_ID);
    log.trace("Sending ListOffsetRequest {} to broker {}", request, node);
    // Add the request to the unsent list, where it waits to be sent
    return client.send(node, ApiKeys.LIST_OFFSETS, request)
            .compose(new RequestFutureAdapter<ClientResponse, Map<TopicPartition, OffsetAndTimestamp>>() {
                @Override
                public void onSuccess(ClientResponse response, RequestFuture<Map<TopicPartition, OffsetAndTimestamp>> future) {
                    // Parse the response and hand it to the response handler
                    ListOffsetResponse lor = new ListOffsetResponse(response.responseBody());
                    log.trace("Received ListOffsetResponse {} from broker {}", lor, node);
                    handleListOffsetResponse(timestampsToSearch, lor, future);
                }
            });
}