kmeans

The kmeans module performs k-means clustering on numeric enumerated values. It groups entries into k clusters by the similarity of one or more numeric fields, measured as Euclidean distance. Each entry is annotated with a cluster ID, its distance from that cluster’s centroid, and the number of entries in the cluster.

The module implements Lloyd’s algorithm with k-means++ initialization for selecting initial centroids.
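As a rough illustration of what Lloyd’s algorithm with k-means++ seeding does, the following Python sketch clusters a small list of points. It is not the module’s implementation: the function names (`sq_dist`, `kmeans_pp_init`, `lloyd`), the iteration cap, the stopping rule, and the handling of empty clusters are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: first centroid uniform, the rest weighted by squared distance."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from each point to its nearest already-chosen centroid.
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen centroid; fall back to uniform choice.
            centroids.append(list(rng.choice(points)))
        else:
            centroids.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centroids


def lloyd(points, k, max_iter=100, seed=0):
    """Alternate assignment and centroid-update steps until assignments stop changing."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = lloyd(points, k=2)
```

Weighting the seed choice by squared distance tends to spread the initial centroids apart, which usually reduces the number of iterations compared with purely random seeding.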

Syntax

kmeans [options] <keys...>

One or more enumerated value names must be provided as keys. These form the dimensions of the feature vector used for clustering. All keys must contain numeric (floating point) values; entries where any key cannot be read as a float are dropped.
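The drop rule above can be pictured with a short Python sketch. The `feature_vector` helper and the dictionary-shaped entries are hypothetical and used only for illustration. The sketch uses Python’s `float()` for parsing, which may accept or reject inputs differently from the module.

```python
def feature_vector(entry, keys):
    """Return the entry's values for `keys` as floats, or None if any is missing or non-numeric."""
    vec = []
    for key in keys:
        try:
            vec.append(float(entry[key]))
        except (KeyError, TypeError, ValueError):
            return None  # one unreadable key disqualifies the whole entry
    return vec


keys = ["SrcPort", "DstPort"]
entries = [
    {"SrcPort": "443", "DstPort": "51234"},  # kept
    {"SrcPort": "n/a", "DstPort": "80"},     # dropped: SrcPort is not numeric
    {"SrcPort": "22"},                       # dropped: DstPort is missing
]

# Keep only entries for which every key produced a float.
vectors = [v for v in (feature_vector(e, keys) for e in entries) if v is not None]
```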

Supported Options

  • -k <n>: Set the number of clusters. Default is 3.

  • -maxtracked <n>: Set the maximum number of entries to track. Default is 1000000.

  • -cluster <name>: Set the name of the enumerated value for the cluster ID assigned to each entry. Defaults to cluster.

  • -distance <name>: Set the name of the enumerated value for the Euclidean distance from the entry to its cluster centroid. Defaults to distance.

  • -count <name>: Set the name of the enumerated value for the number of entries in the associated cluster. Defaults to count.

  • -centroids <name>: Write the calculated centroids to a Gravwell resource as a CSV file.

  • -r <name>: Use pre-calculated centroids from the named resource, as previously written with -centroids. This enables offline training and stable centroid reuse.

Produced Enumerated Values

The kmeans module produces three enumerated values on each entry (names are configurable via flags):

| Enumerated Value | Default Name | Description |
|---|---|---|
| Cluster ID | cluster | Integer identifying which cluster the entry belongs to. |
| Distance | distance | Float representing the Euclidean distance from the entry to its cluster centroid. |
| Count | count | Integer representing the total number of entries in the entry’s cluster. |
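To make the three values concrete, here is a minimal Python sketch that derives them from a set of feature vectors and already-known centroids. The `annotate` function is hypothetical, and the zero-based cluster IDs are an assumption of the sketch; the numbering the module actually uses is not stated here.

```python
import math
from collections import Counter


def annotate(vectors, centroids):
    """Attach cluster, distance, and count to each vector, given fixed centroids."""
    nearest = []
    for v in vectors:
        # Euclidean distance from this vector to every centroid.
        dists = [math.dist(v, c) for c in centroids]
        cid = min(range(len(centroids)), key=dists.__getitem__)
        nearest.append((cid, dists[cid]))
    # Cluster sizes are only known once every vector has been assigned.
    sizes = Counter(cid for cid, _ in nearest)
    return [
        {"cluster": cid, "distance": d, "count": sizes[cid]}
        for cid, d in nearest
    ]


rows = annotate([[0, 0], [3, 4], [10, 10]], centroids=[[0, 0], [10, 10]])
# rows[1] is {"cluster": 0, "distance": 5.0, "count": 2}
```

Because count depends on the final size of each cluster, it can only be filled in after all entries have been assigned.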

Pre-calculating centroids

The -centroids flag writes the final calculated centroids to a Gravwell resource in CSV format. This is useful for:

  • Stable clustering: apply the same centroids consistently across different queries.

  • Offline training: compute centroids on a large training set, then pass -r in later queries for near-instant assignment.

To use pre-calculated centroids, pass -r with the same resource name that was given to -centroids when the centroids were calculated.
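The save-and-reuse workflow can be sketched in Python as a CSV round trip followed by nearest-centroid assignment. The CSV layout shown (one centroid per row, one column per key, no header) is an assumption made for the example; the actual layout of the resource written by -centroids is not described here. The helper names are hypothetical.

```python
import csv
import io
import math


def centroids_to_csv(centroids):
    """Serialize centroids as CSV text (assumed layout: one centroid per row)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(centroids)
    return buf.getvalue()


def centroids_from_csv(text):
    """Parse CSV text back into a list of float vectors, skipping blank rows."""
    return [[float(cell) for cell in row] for row in csv.reader(io.StringIO(text)) if row]


def assign(vector, centroids):
    """Return (cluster index, distance) for the nearest centroid; no iteration needed."""
    dists = [math.dist(vector, c) for c in centroids]
    cid = min(range(len(centroids)), key=dists.__getitem__)
    return cid, dists[cid]


# "Training" query: centroids are computed once and saved.
saved = centroids_to_csv([[1.5, 2.0], [10.25, -3.0]])

# Later queries: load the same centroids and assign new points directly.
model = centroids_from_csv(saved)
cluster_id, distance = assign([1.5, 3.0], model)
```

With fixed centroids, assignment is a single pass over the data, and the same point always lands in the same cluster regardless of what other entries a query returns.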

Examples

Cluster network connections by source and destination port

Extract numeric port values from netflow data and cluster into 5 groups:

tag=netflow netflow Src Dst SrcPort DstPort | kmeans -k 5 SrcPort DstPort | table Src Dst SrcPort DstPort cluster distance count

Sort entries by distance

Identify outliers by sorting entries by their distance from the cluster centroid:

tag=data json x y | kmeans -k 3 x y | sort by distance desc | table x y cluster distance

Export centroids for later reuse

Compute centroids on a training set and save them to a resource:

tag=data json x y z | kmeans -k 4 -centroids cluster_model x y z

Apply pre-computed centroids to new data

Use saved centroids to cluster a new dataset consistently:

tag=data json x y z | kmeans -k 4 -r cluster_model x y z | table x y z cluster distance