T 摘要
T-Digest 是一种概率数据结构,允许您估计数据流的百分位数。
t-digest 是 Redis Stack 中的一种 sketch 数据结构,用于使用紧凑的 sketch 从数据流或大型数据集中估计百分位数。
它可以回答以下问题:
- 数据流中的哪部分值小于给定值?
- 数据流中有多少个值小于给定值?
- 小于数据流中值的 p % 的最大值是多少?(什么是 P 百分位数值)?
什么是 t-digest?
T-Digest 是一种数据结构,它可以估计一个百分位点,而无需存储和排序一组中的所有数据点。例如:要回答“99% 的数据库作的平均延迟是多少”这个问题,我们必须存储每个用户的平均延迟,对值进行排序,去掉最后的 1%,然后才找到其余所有的平均值。这种过程不仅在排序这些值所需的处理方面,而且在存储它们所需的空间方面都非常昂贵。这些正是 t-digest 解决的问题。
T-Digest 还可用于估计与百分位数相关的其他值,例如截尾均值。
截尾均值是草图中的平均值,不包括超出低分断值和高分断值范围的观测值。例如,0.1 截尾均值是草图的平均值,不包括最低的 10% 和最高的 10% 的值。
使用案例
硬件/软件监控
您测量了在线服务器响应延迟,并且希望查询:
-
测得的延迟的第 50 个、第 90 个和第 99 个百分位数是多少?
-
测得的延迟的哪一部分小于 25 毫秒?
-
忽略异常值的平均延迟是多少?或者第 10 个百分位和第 90 个百分位之间的平均延迟是多少?
在线游戏
数以百万计的人正在您的在线游戏平台上玩游戏,您想向每个玩家提供以下信息?
-
您的分数高于所玩游戏会话的 x%。
-
大约有 y 个游戏会话中,人们的得分比你高。
-
要获得比 90% 的比赛更好的分数,您的分数应该是 z。
网络流量监控
您可以测量每秒通过网络传输的 IP 数据包,并尝试通过以下问题来检测拒绝服务攻击:
-
最后一秒的数据包数是否超过之前观察到的值的 99%?
-
在正常网络条件下,我预计会看到多少个数据包? (答案:在 x 和 y 之间,其中 x 代表第 1 个百分位数,y 代表第 99 个百分位数)。
预测性维护
-
测量参数(噪声水平、电流消耗等)是否不规则?(不在 [第 1 个百分位数...第 99 个百分位] 范围)?
-
我应该将警报设置为哪些值?
例子
在以下示例中,您将创建一个压缩率为 100 的 t-digest 并向其添加项目。这COMPRESSION
参数用于指定准确性和内存消耗之间的权衡。默认值为 100。值越高,精度越高。注意:与其他一些概率数据结构不同,TDIGEST.ADD
如果 key 不存在,command 将不会创建新结构。
You can repeat calling TDIGEST.ADD whenever new observations are available
Estimating fractions or ranks by values
Another helpful feature in t-digest is CDF (definition of rank) which gives us the fraction of observations smaller or equal to a certain value. This command is very useful to answer questions like "What's the percentage of observations with a value lower or equal to X".
More precisely, TDIGEST.CDF
will return the estimated fraction of observations in the sketch that are smaller than X plus half the number of observations that are equal to X. We can also use the TDIGEST.RANK
command, which is very similar. Instead of returning a fraction, it returns the ----estimated---- rank of a value. The TDIGEST.RANK
command is also variadic, meaning you can use a single command to retrieve estimations for one or more values.
Here's an example. Given a set of biker's ages, you can ask a question like "What's the percentage of bike racers that are younger than 50 years?"
And lastly, TDIGEST.REVRANK key value...
is similar to TDIGEST.RANK, but returns, for each input value, an estimation of the number of (observations larger than a given value + half the observations equal to the given value).
Estimating values by fractions or ranks
TDIGEST.QUANTILE key fraction...
returns, for each input fraction, an estimation of the value (floating point) that is smaller than the given fraction of observations. TDIGEST.BYRANK key rank...
returns, for each input rank, an estimation of the value (floating point) with that rank.
TDIGEST.BYREVRANK key rank...
returns, for each input reverse rank, an estimation of the value (floating point) with that reverse rank.
Estimating trimmed mean
Use TDIGEST.TRIMMED_MEAN key lowFraction highFraction
to retrieve an estimation of the mean value between the specified fractions.
This is especially useful for calculating the average value ignoring outliers. For example - calculating the average value between the 20th percentile and the 80th percentile.
Merging sketches
Sometimes it is useful to merge sketches. For example, suppose we measure latencies for 3 servers, and we want to calculate the 90%, 95%, and 99% latencies for all the servers combined.
TDIGEST.MERGE destKey numKeys sourceKey... [COMPRESSION compression] [OVERRIDE]
merges multiple sketches into a single sketch.
If destKey
does not exist - a new sketch is created.
If destKey
is an existing sketch, its values are merged with the values of the source keys. To override the destination key contents, use OVERRIDE
.
Retrieving sketch information
Use TDIGEST.MIN
and TDIGEST.MAX
to retrieve the minimal and maximal values in the sketch, respectively.
Both return nan
when the sketch is empty.
Both commands return accurate results and are equivalent to TDIGEST.BYRANK racer_ages 0
and TDIGEST.BYREVRANK racer_ages 0
, respectively.
Use TDIGEST.INFO racer_ages
to retrieve some additional information about the sketch.
Resetting a sketch
Academic sources
References
On this page