kmeans#
The kmeans module performs k-means clustering on numeric enumerated values. It groups entries into k clusters based on the similarity of one or more numeric fields, using Euclidean distance as the distance metric. Each entry is assigned a cluster ID, the distance from its cluster centroid, and the number of entries in its assigned cluster.
The module implements Lloyd’s algorithm with k-means++ initialization for selecting initial centroids.
Syntax#
kmeans [options] <keys...>
One or more enumerated value names must be provided as keys. These form the dimensions of the feature vector used for clustering. All keys must contain numeric (floating point) values; entries where any key cannot be read as a float are dropped.
Supported Options#
-k <n>: Set the number of clusters. Default is 3.-maxtracked <n>: Set the maximum number of entries to track. Default is 1000000.-cluster <name>: Set the name of the enumerated value for the cluster ID assigned to each entry. Defaults tocluster.-distance <name>: Set the name of the enumerated value for the Euclidean distance from the entry to its cluster centroid. Defaults todistance.-count <name>: Set the name of the enumerated value for the number of entries in the associated cluster. Defaults tocount.
Produced Enumerated Values#
The kmeans module produces three enumerated values on each entry (names are configurable via flags):
Enumerated Value |
Default Name |
Description |
|---|---|---|
Cluster ID |
|
Integer identifying which cluster the entry belongs to. |
Distance |
|
Float representing the Euclidean distance from the entry to its cluster centroid. |
Count |
|
Integer representing the total number of entries in the entry’s cluster. |
Examples#
Cluster network connections by source and destination port#
Extract numeric port values from netflow data and cluster into 5 groups:
tag=netflow netflow Src Dst SrcPort DstPort | kmeans -k 5 SrcPort DstPort | table Src Dst SrcPort DstPort cluster distance count
Sort clusters by distance#
Identify outliers by sorting entries by their distance from the cluster centroid:
tag=data json x y | kmeans -k 3 x y | sort by distance desc | table x y cluster distance