How Kafka Deletes Old Log Data
When using Kafka we usually retain the log for some period dictated by our needs, such as 1, 3, or 7 days. This can be controlled with the following parameters:
(1) Configuring the log retention time
log.retention.hours
log.retention.minutes
log.retention.ms
Only one of these three settings needs to be configured; as the code below shows, log.retention.ms takes precedence over log.retention.minutes, which in turn takes precedence over log.retention.hours. By default Kafka sets log.retention.hours=168, i.e. 7 days:
# The minimum age of a log file to be eligible for deletion due to age
log.retention.hours=168
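For example, to keep data for three days you would set exactly one of the following in server.properties (72 hours, 4320 minutes, and 259200000 ms are all equivalent):

```
log.retention.hours=72
# or, equivalently:
#log.retention.minutes=4320
#log.retention.ms=259200000
```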
The code Kafka uses to resolve the retention time in milliseconds is as follows:
private def getLogRetentionTimeMillis(): Long = {
  val millisInMinute = 60L * 1000L
  val millisInHour = 60L * millisInMinute
  if (props.containsKey("log.retention.ms")) {
    props.getIntInRange("log.retention.ms", (1, Int.MaxValue))
  }
  else if (props.containsKey("log.retention.minutes")) {
    millisInMinute * props.getIntInRange("log.retention.minutes", (1, Int.MaxValue))
  }
  else {
    millisInHour * props.getIntInRange("log.retention.hours", 24 * 7, (1, Int.MaxValue))
  }
}
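The precedence logic can be tried out on its own. The following is a simplified stand-in for the method above, not the actual KafkaConfig code: `retentionMillis` is a name made up for this sketch, and it reads a plain java.util.Properties instead of Kafka's config wrapper.

```scala
import java.util.Properties

// Simplified re-implementation of the precedence shown above:
// log.retention.ms beats log.retention.minutes beats log.retention.hours.
def retentionMillis(props: Properties): Long = {
  val millisInMinute = 60L * 1000L
  val millisInHour   = 60L * millisInMinute
  if (props.containsKey("log.retention.ms"))
    props.getProperty("log.retention.ms").toLong
  else if (props.containsKey("log.retention.minutes"))
    millisInMinute * props.getProperty("log.retention.minutes").toLong
  else
    millisInHour * props.getProperty("log.retention.hours", "168").toLong
}

val p = new Properties()
p.setProperty("log.retention.hours", "24")
p.setProperty("log.retention.minutes", "30")
// minutes wins over hours: 30 * 60 * 1000 = 1800000
println(retentionMillis(p))  // 1800000
```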
(2) Configuring the maximum log size
By default the log size is unlimited (log.retention.bytes defaults to -1); the commented-out setting below would cap each partition's log at 1073741824 bytes (1 GiB):
# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes. Functions independently of log.retention.hours.
#log.retention.bytes=1073741824
Note:
Do both parameters have to be set for deletion to kick in, or is one enough?
In fact configuring either one is sufficient, as the source code shows:
def cleanupLogs() {
  debug("Beginning log cleanup...")
  var total = 0
  val startMs = time.milliseconds
  for (log <- allLogs; if !log.config.compact) {
    debug("Garbage collecting '" + log.name + "'")
    total += cleanupExpiredSegments(log) + cleanupSegmentsToMaintainSize(log)
  }
  debug("Log cleanup completed. " + total + " files deleted in " +
        (time.milliseconds - startMs) / 1000 + " seconds")
}
private def cleanupExpiredSegments(log: Log): Int = {
  val startMs = time.milliseconds
  log.deleteOldSegments(startMs - _.lastModified > log.config.retentionMs)
}
private def cleanupSegmentsToMaintainSize(log: Log): Int = {
  if (log.config.retentionSize < 0 || log.size < log.config.retentionSize)
    return 0
  var diff = log.size - log.config.retentionSize
  def shouldDelete(segment: LogSegment) = {
    if (diff - segment.size >= 0) {
      diff -= segment.size
      true
    } else {
      false
    }
  }
  log.deleteOldSegments(shouldDelete)
}
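A quick way to see the size-based accounting at work is to run the same shouldDelete logic over plain numbers. The segment sizes and retention cap below are invented for the example:

```scala
// Hypothetical segment sizes (oldest first) and a retention cap, in bytes.
val segmentSizes  = List(100L, 100L, 100L, 100L)   // total log size: 400
val retentionSize = 250L

// Mirror of the shouldDelete closure above: keep deleting the oldest
// segments while the overrun ("diff") still covers a whole segment.
var diff = segmentSizes.sum - retentionSize        // 400 - 250 = 150
val deletable = segmentSizes.takeWhile { size =>
  if (diff - size >= 0) { diff -= size; true } else false
}
println(deletable)  // List(100) -- deleting a second segment would overshoot the cap
```

Note that the remaining log (300 bytes) is still above the cap: a segment is only removed when doing so does not drop the log below log.retention.bytes.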
The cleanupLogs function performs the actual cleanup. It calls cleanupExpiredSegments and cleanupSegmentsToMaintainSize, which implement the time-based and size-based retention policies respectively; both are invoked on every pass, which is why configuring just one deletion policy is enough. The deleteOldSegments function finds the segments that satisfy the given deletion condition:
def deleteOldSegments(predicate: LogSegment => Boolean): Int = {
  // find any segments that match the user-supplied predicate UNLESS it is the
  // final segment and it is empty (since we would just end up re-creating it)
  val lastSegment = activeSegment
  val deletable = logSegments.takeWhile(s => predicate(s) &&
    (s.baseOffset != lastSegment.baseOffset || s.size > 0))
  val numToDelete = deletable.size
  if (numToDelete > 0) {
    lock synchronized {
      // we must always have at least one segment, so if we are going to
      // delete all the segments, create a new one first
      if (segments.size == numToDelete)
        roll()
      // remove the segments for lookups
      deletable.foreach(deleteSegment(_))
    }
  }
  numToDelete
}
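One detail worth noticing is that deleteOldSegments uses takeWhile rather than filter, so only a contiguous prefix of the oldest segments can ever be deleted: scanning stops at the first segment that fails the predicate. A small sketch with made-up base offsets and ages:

```scala
// Made-up (baseOffset, ageMillis) pairs, oldest segments first.
val segments = List((0L, 9000L), (100L, 8000L), (200L, 3000L), (300L, 7000L))
val retentionMs = 5000L

// takeWhile stops at the first young segment (200, 3000), even though a
// later segment (300, 7000) would also satisfy the age predicate.
val expired = segments.takeWhile { case (_, age) => age > retentionMs }
println(expired.map(_._1))  // List(0, 100)
```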
deleteOldSegments collects the segments matching the supplied predicate into deletable, then iterates over them and calls deleteSegment on each:
private def deleteSegment(segment: LogSegment) {
  info("Scheduling log segment %d for log %s for deletion."
    .format(segment.baseOffset, name))
  lock synchronized {
    segments.remove(segment.baseOffset)
    asyncDeleteSegment(segment)
  }
}
deleteSegment in turn calls asyncDeleteSegment:
private def asyncDeleteSegment(segment: LogSegment) {
  segment.changeFileSuffixes("", Log.DeletedFileSuffix)
  def deleteSeg() {
    info("Deleting segment %d from log %s.".format(segment.baseOffset, name))
    segment.delete()
  }
  scheduler.schedule("delete-file", deleteSeg, delay = config.fileDeleteDelayMs)
}
The deletion itself is asynchronous: it is carried out by a separate thread. Before a segment is deleted, its files are first renamed with a .deleted suffix; then, after log.segment.delete.delay.ms (one minute by default), the segments are physically removed.
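This rename-then-delete-later pattern can be sketched with plain java.io.File and a JDK scheduler. The temp-file name and the 100 ms delay here are placeholders for the sketch, not Kafka's actual values:

```scala
import java.io.File
import java.util.concurrent.{Executors, TimeUnit}

val DeletedFileSuffix = ".deleted"
val scheduler = Executors.newSingleThreadScheduledExecutor()

// Create a throwaway "segment" file to stand in for a real log segment.
val segment = File.createTempFile("00000000000000000000", ".log")

// Step 1 (synchronous, under the lock in Kafka): rename the segment
// so readers stop seeing it immediately.
val renamed = new File(segment.getPath + DeletedFileSuffix)
segment.renameTo(renamed)

// Step 2 (asynchronous): physically remove the file after a delay.
scheduler.schedule(new Runnable {
  def run(): Unit = renamed.delete()
}, 100, TimeUnit.MILLISECONDS)

scheduler.shutdown()
scheduler.awaitTermination(5, TimeUnit.SECONDS)
println(renamed.exists())  // false
```

The rename makes the removal cheap and safe: any in-flight reads finish against the renamed file, while the expensive filesystem delete happens off the critical path.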
A Kafka broker checks for segments that need deleting every log.retention.check.interval.ms (five minutes by default):
scheduler.schedule("kafka-log-retention",
                   cleanupLogs,
                   delay = InitialTaskDelayMs,
                   period = retentionCheckMs,
                   TimeUnit.MILLISECONDS)
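To make this check run more or less often, set the corresponding broker property in server.properties (the value shown is the default, 300000 ms = 5 minutes):

```
# The interval at which log segments are checked to see whether they can be deleted
log.retention.check.interval.ms=300000
```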