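Fetcher Analysis

Fetcher: fetches data from the brokers by offset. It sends FetchRequests for the specified message sets, processes the FetchResponses, and updates the consumed position.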

1 Important fields

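ConsumerNetworkClient client: handles network communication and sends the requests

int minBytes: after receiving a FetchRequest the server does not respond immediately; it waits until at least minBytes of returnable message data has accumulated, so that each FetchResponse carries multiple messages. Configured through fetch.min.bytes

int maxBytes: the upper bound on the number of bytes the server returns for one FetchRequest. Configured through fetch.max.bytes

int maxWaitMs: the maximum time to wait for a FetchResponse; the server uses it to decide when to respond if minBytes has not been reached yet. Configured through fetch.max.wait.ms

int fetchSize: the maximum number of bytes fetched for one partition in a single fetch

int maxPollRecords: the maximum number of records returned by a single poll

Metadata metadata: the Kafka cluster metadata

SubscriptionState subscriptions: tracks the consumption state of every TopicPartition, mainly so that offsets can be looked up quickly

ConcurrentLinkedQueue<CompletedFetch> completedFetches: each FetchResponse is first converted into CompletedFetch objects that are buffered in this queue; the messages have not been parsed at this point

Deserializer<K> keyDeserializer: the deserializer for keys

Deserializer<V> valueDeserializer: the deserializer for values

PartitionRecords<K, V> nextInLineRecords: holds the parsed result of a CompletedFetch. It has three fields: records is the list of messages, fetchOffset is the offset of the first message in records, and partition is the TopicPartition the messages belong to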
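The fetch-related fields above are driven by consumer configuration. A minimal sketch of how those settings might be assembled, using only java.util.Properties so that it runs without the Kafka client on the classpath. The values are illustrative, and the mapping of fetchSize to max.partition.fetch.bytes and of maxPollRecords to max.poll.records is my reading of the client version discussed here, not something the text above states:

```java
import java.util.Properties;

class FetchConfigSketch {
    // Builds the fetch-related consumer settings; all values are illustrative only.
    static Properties fetchProperties() {
        Properties props = new Properties();
        props.setProperty("fetch.min.bytes", "1024");               // minBytes
        props.setProperty("fetch.max.bytes", "52428800");           // maxBytes
        props.setProperty("fetch.max.wait.ms", "500");              // maxWaitMs
        props.setProperty("max.partition.fetch.bytes", "1048576");  // fetchSize (assumed mapping)
        props.setProperty("max.poll.records", "500");               // maxPollRecords (assumed mapping)
        return props;
    }

    public static void main(String[] args) {
        fetchProperties().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Raising fetch.min.bytes trades latency for larger responses, while fetch.max.wait.ms bounds how long the broker may hold a request that has not reached that size.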

2 Fetching messages

To fetch messages, a consumer has to send a FetchRequest, so we start by analyzing FetchRequest and FetchResponse.

First, the message body formats of FetchRequest and FetchResponse:


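FetchRequest message body:

replica_id: the id of the follower replica issuing the fetch; ordinary consumers send -1

max_wait_time: the maximum time the request may wait

min_bytes: the minimum payload size of the response

fetch_offset: the offset from which messages are fetched

max_bytes: the maximum number of bytes per fetch

FetchResponse message body:

high_watermark: the leader's high watermark

record_set: the fetched message data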

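2.1 createFetchRequests

The createFetchRequests method builds the FetchRequests and returns a Map<Node, FetchRequest>:

# First find the fetchable partitions: those that are not paused and have a valid offset, for which nextInLineRecords holds no messages and the completedFetches queue holds no CompletedFetch

# Look up the leader replica of each partition, because only the leader can serve read and write requests; if no leader is found, request a metadata update

# Check whether there are pending requests to that leader node; if there are none, take the partition's current position and wrap the position and fetchSize in a PartitionData

# Group by node: all TopicPartition and PartitionData entries bound for the same node are wrapped in one FetchRequest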

private Map<Node, FetchRequest> createFetchRequests() {
    // Get the cluster information
    Cluster cluster = metadata.fetch();
    // Build the Node -> (TopicPartition -> PartitionData) mapping, e.g. node 0 -> {topic-0: data, topic-1: data}, node 1 -> {topic-1: data}
    Map<Node, LinkedHashMap<TopicPartition, FetchRequest.PartitionData>> fetchable = new LinkedHashMap<>();
    // Iterate over the partitions that can currently be fetched
    for (TopicPartition partition : fetchablePartitions()) {
        // Find the leader broker of the partition
        Node node = cluster.leaderFor(partition);
        // No leader found: request a metadata update
        if (node == null) {
            metadata.requestUpdate();
        } else if (this.client.pendingRequestCount(node) == 0) {
            // Look up the (partition -> PartitionData) map already collected for this node
            LinkedHashMap<TopicPartition, FetchRequest.PartitionData> fetch = fetchable.get(node);
            if (fetch == null) {
                fetch = new LinkedHashMap<>();
                fetchable.put(node, fetch);
            }
            // Current position of the partition
            long position = this.subscriptions.position(partition);
            // Add the (partition, PartitionData) pair to the node's fetch map
            fetch.put(partition, new FetchRequest.PartitionData(position, this.fetchSize));
            log.trace("Added fetch request for partition {} at offset {}", partition, position);
        } else {
            log.trace("Skipping fetch for partition {} because there is an in-flight request to {}", partition, node);
        }
    }

    // Group by node: wrap all TopicPartition and PartitionData entries bound for the same node in one FetchRequest
    Map<Node, FetchRequest> requests = new HashMap<>();
    for (Map.Entry<Node, LinkedHashMap<TopicPartition, FetchRequest.PartitionData>> entry : fetchable.entrySet()) {
        Node node = entry.getKey();
        FetchRequest fetch = new FetchRequest(this.maxWaitMs, this.minBytes, this.maxBytes, entry.getValue());
        requests.put(node, fetch);
    }
    return requests;
}
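The grouping step of createFetchRequests can be tried in isolation. The following is a simplified sketch rather than the Kafka code: partitions are plain strings, nodes are integer ids, and the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FetchGroupingSketch {
    /**
     * Groups partition -> position entries by the id of the partition's leader node.
     * Partitions without a known leader are skipped here; the real method
     * requests a metadata update for them instead.
     */
    static Map<Integer, Map<String, Long>> groupByLeader(Map<String, Long> positions,
                                                          Map<String, Integer> leaders) {
        Map<Integer, Map<String, Long>> byNode = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : positions.entrySet()) {
            Integer leader = leaders.get(entry.getKey());
            if (leader == null)
                continue;
            byNode.computeIfAbsent(leader, node -> new LinkedHashMap<>())
                  .put(entry.getKey(), entry.getValue());
        }
        return byNode;
    }

    public static void main(String[] args) {
        Map<String, Long> positions = new LinkedHashMap<>();
        positions.put("topic-0", 10L);
        positions.put("topic-1", 20L);
        positions.put("topic-2", 30L);
        Map<String, Integer> leaders = new LinkedHashMap<>();
        leaders.put("topic-0", 0);
        leaders.put("topic-1", 0);
        leaders.put("topic-2", 1);
        // prints {0={topic-0=10, topic-1=20}, 1={topic-2=30}}
        System.out.println(groupByLeader(positions, leaders));
    }
}
```

Each inner map corresponds to the payload of one FetchRequest, so a consumer sends at most one request per broker in each round.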

2.2 sendFetches

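The sendFetches method adds each FetchRequest to the unsent list of the ConsumerNetworkClient, where it waits to be sent, and registers a handler for the FetchResponse. The handler mainly parses the FetchResponse, groups the result by TopicPartition, and wraps the fetched message data and offset in CompletedFetch objects that are added to the completedFetches queue.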

public void sendFetches() {
    // Iterate over the node -> FetchRequest mapping that was just created
    for (Map.Entry<Node, FetchRequest> fetchEntry : createFetchRequests().entrySet()) {
        final FetchRequest request = fetchEntry.getValue();
        final Node fetchTarget = fetchEntry.getKey();
        // Queue the FetchRequest for the node
        client.send(fetchTarget, ApiKeys.FETCH, request)
                .addListener(new RequestFutureListener<ClientResponse>() {
                    // Callback invoked when the request succeeds
                    @Override
                    public void onSuccess(ClientResponse resp) {
                        FetchResponse response = new FetchResponse(resp.responseBody());
                        // Verify that the partitions in the FetchResponse match those in the FetchRequest; if not, ignore the response
                        if (!matchesRequestedPartitions(request, response)) {
                            log.warn("Ignoring fetch response containing partitions {} since it does not match " +
                                    "the requested partitions {}", response.responseData().keySet(),
                                    request.fetchData().keySet());
                            return;
                        }
                        // Collect all partitions contained in the FetchResponse
                        Set<TopicPartition> partitions = new HashSet<>(response.responseData().keySet());
                        FetchResponseMetricAggregator metricAggregator = new FetchResponseMetricAggregator(sensors, partitions);

                        // Wrap each partition, its fetch offset and its data in a CompletedFetch and add it to the completedFetches queue
                        for (Map.Entry<TopicPartition, FetchResponse.PartitionData> entry : response.responseData().entrySet()) {
                            TopicPartition partition = entry.getKey();
                            long fetchOffset = request.fetchData().get(partition).offset;
                            FetchResponse.PartitionData fetchData = entry.getValue();
                            completedFetches.add(new CompletedFetch(partition, fetchOffset, fetchData, metricAggregator));
                        }

                        sensors.fetchLatency.record(resp.requestLatencyMs());
                        sensors.fetchThrottleTimeSensor.record(response.getThrottleTime());
                    }

                    @Override
                    public void onFailure(RuntimeException e) {
                        log.debug("Fetch request to {} failed", fetchTarget, e);
                    }
                });
    }
}

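2.3 fetchedRecords

The message data held in the completedFetches queue is still unparsed FetchResponse.PartitionData. The fetchedRecords method parses the message data of each CompletedFetch and returns the resulting records; it also updates the position of the corresponding TopicPartitionState.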

public Map<TopicPartition, List<ConsumerRecord<K, V>>> fetchedRecords() {
    // Map from partition to the list of ConsumerRecords fetched for it
    Map<TopicPartition, List<ConsumerRecord<K, V>>> drained = new HashMap<>();
    // At most maxPollRecords records may be returned
    int recordsRemaining = maxPollRecords;
    // Keep going while there is still room for more records
    while (recordsRemaining > 0) {
        // nextInLineRecords is missing or has been fully consumed
        if (nextInLineRecords == null || nextInLineRecords.isDrained()) {
            // Poll a CompletedFetch from the completedFetches queue and, if there is one, parse it into PartitionRecords
            CompletedFetch completedFetch = completedFetches.poll();
            if (completedFetch == null)
                break;
            nextInLineRecords = parseFetchedData(completedFetch);
        } else { // nextInLineRecords still has unconsumed records
            // Partition the buffered records belong to
            TopicPartition partition = nextInLineRecords.partition;
            // Take up to recordsRemaining records
            List<ConsumerRecord<K, V>> records = drainRecords(nextInLineRecords, recordsRemaining);
            // Group the records by partition
            if (!records.isEmpty()) {
                // Records already collected for this partition
                List<ConsumerRecord<K, V>> currentRecords = drained.get(partition);
                if (currentRecords == null) {
                    drained.put(partition, records);
                } else {
                    // this case shouldn't usually happen because we only send one fetch at a time per partition,
                    // but it might conceivably happen in some rare cases (such as partition leader changes).
                    // we have to copy to a new list because the old one may be immutable
                    List<ConsumerRecord<K, V>> newRecords = new ArrayList<>(records.size() + currentRecords.size());
                    newRecords.addAll(currentRecords);
                    newRecords.addAll(records);
                    drained.put(partition, newRecords);
                }
                // Update the number of records that may still be returned
                recordsRemaining -= records.size();
            }
        }
    }

    return drained;
}
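The effect of maxPollRecords on this loop can be illustrated with a stripped-down sketch. It is not the Kafka implementation: batches are plain queues of offsets, there is a single partition, and the names are hypothetical. The point it shows is that records beyond the limit stay buffered for the next call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class PollLimitSketch {
    /** Takes at most maxPollRecords offsets from the queued batches and leaves the rest buffered. */
    static List<Long> poll(Deque<Deque<Long>> completedFetches, int maxPollRecords) {
        List<Long> drained = new ArrayList<>();
        int remaining = maxPollRecords;
        while (remaining > 0 && !completedFetches.isEmpty()) {
            Deque<Long> nextInLine = completedFetches.peek();
            while (remaining > 0 && !nextInLine.isEmpty()) {
                drained.add(nextInLine.poll());
                remaining--;
            }
            if (nextInLine.isEmpty())
                completedFetches.poll(); // batch fully drained, move on to the next one
        }
        return drained;
    }

    public static void main(String[] args) {
        Deque<Deque<Long>> fetches = new ArrayDeque<>();
        fetches.add(new ArrayDeque<>(Arrays.asList(0L, 1L, 2L)));
        fetches.add(new ArrayDeque<>(Arrays.asList(3L, 4L)));
        System.out.println(poll(fetches, 4)); // prints [0, 1, 2, 3]
        System.out.println(poll(fetches, 4)); // prints [4]
    }
}
```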

private List<ConsumerRecord<K, V>> drainRecords(PartitionRecords<K, V> partitionRecords, int maxRecords) {
    if (partitionRecords.isDrained())
        return Collections.emptyList();
    // Check whether the partition of the returned data is still assigned
    if (!subscriptions.isAssigned(partitionRecords.partition)) {
        // this can happen when a rebalance happened before fetched records are returned to the consumer's poll call
        log.debug("Not returning fetched records for partition {} since it is no longer assigned", partitionRecords.partition);
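    } else { // The partition is still assigned
        // Current position (offset) of the partition
        long position = subscriptions.position(partitionRecords.partition);
        // Check whether the partition can still be fetched, i.e. it is not paused and has a valid offset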
        if (!subscriptions.isFetchable(partitionRecords.partition)) {
            // this can happen when a partition is paused before fetched records are returned to the consumer's poll call
            log.debug("Not returning fetched records for assigned partition {} since it is no longer fetchable", partitionRecords.partition);
        } else if (partitionRecords.fetchOffset == position) {
            // Take up to maxRecords records from the returned data
            List<ConsumerRecord<K, V>> partRecords = partitionRecords.drainRecords(maxRecords);
            // Offset to fetch from next time (offset of the last returned record + 1)
            long nextOffset = partRecords.get(partRecords.size() - 1).offset() + 1;

            log.trace("Returning fetched records at offset {} for assigned partition {} and update " +
                    "position to {}", position, partitionRecords.partition, nextOffset);
            // Advance the position to the next offset
            subscriptions.position(partitionRecords.partition, nextOffset);
            return partRecords;
        } else {
            // these records aren't next in line based on the last consumed position, ignore them
            // they must be from an obsolete request
            log.debug("Ignoring fetched records for {} at offset {} since the current position is {}",
                    partitionRecords.partition, partitionRecords.fetchOffset, position);
        }
    }
    // Discard the remaining data
    partitionRecords.drain();
    return Collections.emptyList();
}

private PartitionRecords<K, V> parseFetchedData(CompletedFetch completedFetch) {
    // Partition this completed fetch belongs to
    TopicPartition tp = completedFetch.partition;
    // PartitionData of this completed fetch
    FetchResponse.PartitionData partition = completedFetch.partitionData;
    // Offset this fetch started from
    long fetchOffset = completedFetch.fetchedOffset;
    int bytes = 0;
    int recordsCount = 0;
    PartitionRecords<K, V> parsedRecords = null;
    Errors error = Errors.forCode(partition.errorCode);

    try {
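        // Check whether the partition is unassigned, or assigned but currently not fetchable
        if (!subscriptions.isFetchable(tp)) {
            // Usually happens during a rebalance or while the partition is paused
            log.debug("Ignoring fetched records for partition {} since it is no longer fetchable", tp);
        } else if (error == Errors.NONE) { // No error
            // We are only interested in this fetch if its start offset matches the current consumed position
            // Current position of the partition
            Long position = subscriptions.position(tp);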
            if (position == null || position != fetchOffset) {
                log.debug("Discarding stale fetch response for partition {} since its offset {} does not match " +
                        "the expected offset {}", tp, fetchOffset, position);
                return null;
            }
            // Raw record data
            ByteBuffer buffer = partition.recordSet;
            // Wrap it as MemoryRecords
            MemoryRecords records = MemoryRecords.readableRecords(buffer);
            List<ConsumerRecord<K, V>> parsed = new ArrayList<>();
            for (LogEntry logEntry : records) {
                // Skip messages that are earlier than the current position
                if (logEntry.offset() >= position) {
                    parsed.add(parseRecord(tp, logEntry));
                    // Accumulate the number of bytes read
                    bytes += logEntry.size();
                }
            }
            // Number of parsed records
            recordsCount = parsed.size();
            this.sensors.recordTopicFetchMetrics(tp.topic(), bytes, recordsCount);
            // Buffer the parsed records, if there are any
            if (!parsed.isEmpty()) {
                log.trace("Adding fetched record for partition {} with offset {} to buffered record list", tp, position);
                parsedRecords = new PartitionRecords<>(fetchOffset, tp, parsed);
                // Last parsed record, used to record the lag behind the high watermark
                ConsumerRecord<K, V> record = parsed.get(parsed.size() - 1);
                this.sensors.recordsFetchLag.record(partition.highWatermark - record.offset());
            }
        } else if (error == Errors.NOT_LEADER_FOR_PARTITION) { // The broker is not the leader for the partition
            log.debug("Error in fetch for partition {}: {}", tp, error.exceptionName());
            // Request an update of the cluster metadata
            this.metadata.requestUpdate();
        } else if (error == Errors.UNKNOWN_TOPIC_OR_PARTITION) { // Unknown topic or partition
            log.warn("Received unknown topic or partition error in fetch for partition {}. The topic/partition " +
                    "may not exist or the user may not have Describe access to it", tp);
            // Request an update of the cluster metadata
            this.metadata.requestUpdate();
        } else if (error == Errors.OFFSET_OUT_OF_RANGE) { // The fetch offset is out of range
            if (fetchOffset != subscriptions.position(tp)) {
                log.debug("Discarding stale fetch response for partition {} since the fetched offset {}" +
                        "does not match the current offset {}", tp, fetchOffset, subscriptions.position(tp));
            } else if (subscriptions.hasDefaultOffsetResetPolicy()) {
                log.info("Fetch offset {} is out of range for partition {}, resetting offset", fetchOffset, tp);
                subscriptions.needOffsetReset(tp);
            } else {
                throw new OffsetOutOfRangeException(Collections.singletonMap(tp, fetchOffset));
            }
        } else if (error == Errors.TOPIC_AUTHORIZATION_FAILED) { // Not authorized to read the topic
            log.warn("Not authorized to read from topic {}.", tp.topic());
            throw new TopicAuthorizationException(Collections.singleton(tp.topic()));
        } else if (error == Errors.UNKNOWN) {
            log.warn("Unknown error fetching data for topic-partition {}", tp);
        } else {
            throw new IllegalStateException("Unexpected error code " + error.code() + " while fetching data");
        }
    } finally {
        completedFetch.metricAggregator.record(tp, bytes, recordsCount);
    }

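    // Move the partition to the end if we received some bytes or if there was an error, so that partitions of the same topic are more likely to stay together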
    if (bytes > 0 || error != Errors.NONE)
        subscriptions.movePartitionToEnd(tp);

    return parsedRecords;
}
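Two small calculations inside parseFetchedData are easy to check by hand: dropping records that lie before the current position, and the lag value passed to the recordsFetchLag sensor. A minimal sketch, assuming offsets are plain long values and using hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StaleRecordFilterSketch {
    /** Keeps only the offsets at or after the current position, as the parsing loop does. */
    static List<Long> keepFromPosition(List<Long> fetchedOffsets, long position) {
        List<Long> kept = new ArrayList<>();
        for (long offset : fetchedOffsets) {
            if (offset >= position)
                kept.add(offset);
        }
        return kept;
    }

    /** Lag as recorded by the sensor: high watermark minus the offset of the last parsed record. */
    static long lag(long highWatermark, long lastParsedOffset) {
        return highWatermark - lastParsedOffset;
    }

    public static void main(String[] args) {
        List<Long> kept = keepFromPosition(Arrays.asList(8L, 9L, 10L, 11L), 10L);
        System.out.println(kept);                                 // prints [10, 11]
        System.out.println(lag(15L, kept.get(kept.size() - 1)));  // prints 4
    }
}
```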

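3 Updating the position

The first time a consumer reads a partition of a topic, the broker's internal offsets topic has no record of a last committed offset. If the user manually specifies a starting offset, consumption begins from that offset.

Otherwise the TopicPartitionState.position field has to be reset, a process that involves ListOffsetRequest and ListOffsetResponse.

Let us look at the message body formats of ListOffsetRequest and ListOffsetResponse:

updateFetchPositions: updates the position of each partition

# Skip partitions that are not assigned or that are already fetchable (they already have a valid position)

# If a resetStrategy has been set for the partition, reset according to that strategy: EARLIEST resets to the smallest offset, LATEST resets to the largest offset

# If no reset is pending, set the position of the TopicPartition to committed (the last committed offset)

# If there is no last committed offset, reset using the default reset strategy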

public void updateFetchPositions(Set<TopicPartition> partitions) {
    // Reset the fetch positions to the committed positions
    for (TopicPartition tp : partitions) {
        // Skip partitions that are not assigned or that are already fetchable
        if (!subscriptions.isAssigned(tp) || subscriptions.isFetchable(tp))
            continue;
        // Whether the offset needs to be reset depends on whether a resetStrategy is set
        if (subscriptions.isOffsetResetNeeded(tp)) {
            // Reset the offset according to that strategy
            resetOffset(tp);
        } else if (subscriptions.committed(tp) == null) {
            // There is no committed offset, so reset the offset using the default reset strategy
            subscriptions.needOffsetReset(tp);
            resetOffset(tp);
        } else {
            // A committed offset exists, so read it
            long committed = subscriptions.committed(tp).offset();
            log.debug("Resetting offset for partition {} to the committed offset {}", tp, committed);
            // Move the position of the TopicPartition to the last committed offset
            subscriptions.seek(tp, committed);
        }
    }
}
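The branch order in updateFetchPositions is easier to see when reduced to a pure function. The sketch below is only a model of the decision, with hypothetical names, and takes the partition state as plain arguments instead of querying SubscriptionState:

```java
class PositionDecisionSketch {
    enum Action { SKIP, RESET_WITH_STRATEGY, RESET_WITH_DEFAULT_STRATEGY, SEEK_TO_COMMITTED }

    /** Mirrors the order of the checks: assignment and fetchability first, then reset, then committed offset. */
    static Action decide(boolean assigned, boolean fetchable, boolean resetNeeded, Long committedOffset) {
        if (!assigned || fetchable)
            return Action.SKIP;
        if (resetNeeded)
            return Action.RESET_WITH_STRATEGY;
        if (committedOffset == null)
            return Action.RESET_WITH_DEFAULT_STRATEGY;
        return Action.SEEK_TO_COMMITTED;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true, false, 42L));   // prints SKIP
        System.out.println(decide(true, false, true, 42L));   // prints RESET_WITH_STRATEGY
        System.out.println(decide(true, false, false, null)); // prints RESET_WITH_DEFAULT_STRATEGY
        System.out.println(decide(true, false, false, 42L));  // prints SEEK_TO_COMMITTED
    }
}
```

Note that a pending reset wins over an existing committed offset, which is why the second call above does not seek to 42.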

private void resetOffset(TopicPartition partition) {
    // Reset strategy recorded for this partition
    OffsetResetStrategy strategy = subscriptions.resetStrategy(partition);
    final long timestamp;
    // Choose the timestamp sentinel that matches the reset strategy
    if (strategy == OffsetResetStrategy.EARLIEST)
        timestamp = ListOffsetRequest.EARLIEST_TIMESTAMP;
    else if (strategy == OffsetResetStrategy.LATEST)
        timestamp = ListOffsetRequest.LATEST_TIMESTAMP;
    else
        throw new NoOffsetForPartitionException(partition);

    log.debug("Resetting offset for partition {} to {} offset.", partition, strategy.name().toLowerCase(Locale.ROOT));
    // Look up the offset for that timestamp
    long offset = getOffsetsByTimes(Collections.singletonMap(partition, timestamp), Long.MAX_VALUE).get(partition).offset();

    // we might lose the assignment while fetching the offset, so check it is still active
    if (subscriptions.isAssigned(partition))
        this.subscriptions.seek(partition, offset);
}

// Look up offsets by timestamp
public Map<TopicPartition, OffsetAndTimestamp> getOffsetsByTimes(Map<TopicPartition, Long> timestampsToSearch,
        long timeout) {
    // Nothing to look up, so return an empty map
    if (timestampsToSearch.isEmpty())
        return Collections.emptyMap();

    long startMs = time.milliseconds();
    // The remaining time starts out as the full timeout
    long remaining = timeout;
    do {
        // Send the ListOffsetRequests
        RequestFuture<Map<TopicPartition, OffsetAndTimestamp>> future = sendListOffsetRequests(timestampsToSearch);
        client.poll(future, remaining);

        if (!future.isDone())
            break;

        if (future.succeeded())
            return future.value();

        if (!future.isRetriable())
            throw future.exception();

        long elapsed = time.milliseconds() - startMs;
        remaining = timeout - elapsed;
        if (remaining <= 0)
            break;

        if (future.exception() instanceof InvalidMetadataException)
            client.awaitMetadataUpdate(remaining);
        else
            time.sleep(Math.min(remaining, retryBackoffMs));

        elapsed = time.milliseconds() - startMs;
        remaining = timeout - elapsed;
    } while (remaining > 0);
    throw new TimeoutException("Failed to get offsets by times in " + timeout + " ms");
}
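The loop in getOffsetsByTimes follows a common pattern: retry until success while a time budget shrinks, backing off between attempts. A generic sketch of that pattern, with hypothetical names; it uses Thread.sleep and IllegalStateException where the Fetcher uses its own Time abstraction and Kafka's TimeoutException, and it does not model the metadata refresh branch:

```java
import java.util.Optional;
import java.util.function.Supplier;

class RetryWithDeadlineSketch {
    /**
     * Calls attempt until it yields a value or timeoutMs has elapsed,
     * sleeping at most backoffMs between attempts.
     */
    static <T> T retry(Supplier<Optional<T>> attempt, long timeoutMs, long backoffMs) {
        long startMs = System.currentTimeMillis();
        long remaining = timeoutMs;
        do {
            Optional<T> result = attempt.get();
            if (result.isPresent())
                return result.get();

            remaining = timeoutMs - (System.currentTimeMillis() - startMs);
            if (remaining <= 0)
                break;
            try {
                // Never sleep longer than the time that is left
                Thread.sleep(Math.min(remaining, backoffMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while backing off", e);
            }
            remaining = timeoutMs - (System.currentTimeMillis() - startMs);
        } while (remaining > 0);
        throw new IllegalStateException("No result within " + timeoutMs + " ms");
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The attempt fails twice and succeeds on the third call
        String value = retry(() -> ++calls[0] < 3 ? Optional.<String>empty() : Optional.of("offset found"), 1000, 10);
        System.out.println(value + " after " + calls[0] + " attempts");
    }
}
```

Recomputing the remaining time after the sleep, as the original method also does, keeps the loop from starting another attempt once the budget is spent.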

private RequestFuture<Map<TopicPartition, OffsetAndTimestamp>> sendListOffsetRequests(final Map<TopicPartition, Long> timestampsToSearch) {
    // Group the TopicPartition -> timestamp entries by leader node
    final Map<Node, Map<TopicPartition, Long>> timestampsToSearchByNode = new HashMap<>();
    for (Map.Entry<TopicPartition, Long> entry: timestampsToSearch.entrySet()) {
        // The TopicPartition
        TopicPartition tp  = entry.getKey();
        // Metadata of this topic partition
        PartitionInfo info = metadata.fetch().partition(tp);
        // No metadata for this partition yet: add the topic to metadata and wait for a refresh
        if (info == null) {
            metadata.add(tp.topic());
            log.debug("Partition {} is unknown for fetching offset, wait for metadata refresh", tp);
            return RequestFuture.staleMetadata();
        } else if (info.leader() == null) { // No leader is currently available
            log.debug("Leader for partition {} unavailable for fetching offset, wait for metadata refresh", tp);
            return RequestFuture.leaderNotAvailable();
        } else {
            // Leader of the partition
            Node node = info.leader();
            Map<TopicPartition, Long> topicData = timestampsToSearchByNode.get(node);
            if (topicData == null) {
                topicData = new HashMap<>();
                timestampsToSearchByNode.put(node, topicData);
            }
            topicData.put(entry.getKey(), entry.getValue());
        }
    }

    final RequestFuture<Map<TopicPartition, OffsetAndTimestamp>> listOffsetRequestsFuture = new RequestFuture<>();
    final Map<TopicPartition, OffsetAndTimestamp> fetchedTimestampOffsets = new HashMap<>();
    final AtomicInteger remainingResponses = new AtomicInteger(timestampsToSearchByNode.size());
    for (Map.Entry<Node, Map<TopicPartition, Long>> entry : timestampsToSearchByNode.entrySet()) {
        // Send one ListOffsetRequest per node, using the map built above
        sendListOffsetRequest(entry.getKey(), entry.getValue())
                .addListener(new RequestFutureListener<Map<TopicPartition, OffsetAndTimestamp>>() {
                    @Override
                    public void onSuccess(Map<TopicPartition, OffsetAndTimestamp> value) {
                        synchronized (listOffsetRequestsFuture) {
                            fetchedTimestampOffsets.putAll(value);
                            if (remainingResponses.decrementAndGet() == 0 && !listOffsetRequestsFuture.isDone())
                                listOffsetRequestsFuture.complete(fetchedTimestampOffsets);
                        }
                    }

                    @Override
                    public void onFailure(RuntimeException e) {
                        synchronized (listOffsetRequestsFuture) {
                            // This may cause all the requests to be retried, but should be rare.
                            if (!listOffsetRequestsFuture.isDone())
                                listOffsetRequestsFuture.raise(e);
                        }
                    }
                });
    }
    return listOffsetRequestsFuture;
}
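sendListOffsetRequests merges several per-node responses into a single future: a counter tracks outstanding responses, the combined future completes once the last one arrives, and the first failure fails the whole lookup. The sketch below reproduces that idea with java.util.concurrent.CompletableFuture as a stand-in for Kafka's RequestFuture, so the types and names are assumptions rather than the client's API:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

class CombineResponsesSketch {
    /** Completes with the merged map once every per-node future has succeeded; fails on the first error. */
    static CompletableFuture<Map<String, Long>> combine(List<CompletableFuture<Map<String, Long>>> perNode) {
        CompletableFuture<Map<String, Long>> combined = new CompletableFuture<>();
        if (perNode.isEmpty()) {
            combined.complete(Collections.<String, Long>emptyMap());
            return combined;
        }
        Map<String, Long> merged = new HashMap<>();
        AtomicInteger remaining = new AtomicInteger(perNode.size());
        for (CompletableFuture<Map<String, Long>> future : perNode) {
            future.whenComplete((value, error) -> {
                synchronized (combined) {
                    if (error != null) {
                        combined.completeExceptionally(error); // no effect if already completed
                    } else {
                        merged.putAll(value);
                        if (remaining.decrementAndGet() == 0)
                            combined.complete(merged);
                    }
                }
            });
        }
        return combined;
    }

    public static void main(String[] args) {
        CompletableFuture<Map<String, Long>> node0 = new CompletableFuture<>();
        CompletableFuture<Map<String, Long>> node1 = new CompletableFuture<>();
        CompletableFuture<Map<String, Long>> all = combine(java.util.Arrays.asList(node0, node1));
        node0.complete(Collections.singletonMap("topic-0", 5L));
        System.out.println(all.isDone()); // prints false
        node1.complete(Collections.singletonMap("topic-1", 9L));
        System.out.println(all.isDone()); // prints true
    }
}
```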

private RequestFuture<Map<TopicPartition, OffsetAndTimestamp>> sendListOffsetRequest(final Node node,
        final Map<TopicPartition, Long> timestampsToSearch) {
    // Create a ListOffsetRequest
    ListOffsetRequest request = new ListOffsetRequest(timestampsToSearch, ListOffsetRequest.CONSUMER_REPLICA_ID);
    log.trace("Sending ListOffsetRequest {} to broker {}", request, node);
    // Add the request to the unsent list, where it waits to be sent
    return client.send(node, ApiKeys.LIST_OFFSETS, request)
            .compose(new RequestFutureAdapter<ClientResponse, Map<TopicPartition, OffsetAndTimestamp>>() {
                @Override
                public void onSuccess(ClientResponse response, RequestFuture<Map<TopicPartition, OffsetAndTimestamp>> future) {
                    ListOffsetResponse lor = new ListOffsetResponse(response.responseBody());
                    log.trace("Received ListOffsetResponse {} from broker {}", lor, node);
                    handleListOffsetResponse(timestampsToSearch, lor, future);
                }
            });
}
