Amazon Simple Storage Service Ingester (S3)

The Amazon S3 ingester is designed to ingest data directly from the Amazon S3 API. While this is a convenient method of ingestion that integrates well with AWS CloudTrail and a myriad of other cloud-based tools, be aware that this ingester is NOT designed to provide near-real-time access to data, nor can it “tail” objects, consuming entries as they are written to a bucket. Amazon S3 is a blob storage system with three basic operations (list, get, and put), which means the ingester must consume entire objects when ingesting. If new data is written to an S3 object, the write is essentially a replacement of the object and the ingester will treat it as such.

S3 buckets are also not meant to provide low-latency access to data. The S3 API is essentially an HTTP API, and most systems that write to S3 buckets batch up data and write large chunks at a time. Be forewarned: you may see delays measured in minutes between when an entry says it was created and when Gravwell can actually consume it. That said, the S3 ingester makes integrating with many cloud services extremely easy: point it at the S3 bucket, provide the appropriate access keys, and you are off.

Basic Configuration

The Gravwell S3 ingester can consume from multiple buckets at once and will track its progress using a state file so that you can start and stop the ingester at will. By default, the ingester will scan S3 buckets every minute looking for changes and ingest new objects that are discovered. The main configuration for the Gravwell S3 ingester requires a [Global] section that specifies all of the usual configuration parameters which tell the ingester how to connect to Gravwell indexers. The ingester also supports the full suite of custom preprocessors, time format overrides, and custom time format specifications.

The S3 ingester relies on a standard configuration file located at /opt/gravwell/etc/s3.conf by default, but also supports configuration overlays in the /opt/gravwell/etc/s3.conf.d/ directory. Remember to ensure that all configuration overlays have the .conf file extension and are readable by the gravwell user and/or gravwell group.

A single S3 ingester can specify multiple buckets to consume from. Multiple buckets are specified by creating multiple [Bucket "name"] stanzas in either the main s3.conf file or in overlay configuration files.
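For example, an additional bucket could be defined in an overlay file such as /opt/gravwell/etc/s3.conf.d/extra-bucket.conf. This is a minimal sketch; the file name, credentials, bucket ARN, and tag below are placeholder values:

[Bucket "extra"]
	Region="us-west-2"
	ID="AKI..."
	Secret="SuperSecretKey..."
	Bucket-ARN="arn:aws:s3:::example-extra-logs"
	Tag-Name="s3-extra"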

Note

We HIGHLY recommend creating a dedicated S3 IAM user for the Gravwell ingester. It’s never a good idea to use privileged credentials for dedicated applications like data ingestion.
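As a sketch of how such a user might be scoped, the IAM policy below grants only the ability to list the bucket and fetch objects from it; the bucket name is a placeholder and your deployment may require different scoping:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": "s3:ListBucket",
			"Resource": "arn:aws:s3:::aws-super-special-logs"
		},
		{
			"Effect": "Allow",
			"Action": "s3:GetObject",
			"Resource": "arn:aws:s3:::aws-super-special-logs/*"
		}
	]
}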

Bucket Configuration

Each bucket configuration supports the following configuration options:

| Configuration Parameter | Required | Default Value | Description |
| --- | --- | --- | --- |
| ID | YES | | AWS IAM User ID to authenticate with |
| Secret | YES | | AWS IAM User secret to authenticate with |
| Region | YES | | AWS region the S3 bucket is registered in (Example: us-east-2) |
| Bucket-ARN | YES | | AWS S3 bucket resource name (Example: arn:aws:s3:::aws-cloudtrail-logs-475058115300-aabbccddee) |
| Tag-Name | YES | | Specify the Gravwell tag name to be applied to data ingested from the bucket |
| Reader | NO | line | Specify how the data in an S3 object is to be interpreted; defaults to line-delimited, but cloudtrail is supported |
| MaxRetries | NO | 3 | Specify the maximum number of retries when requesting a bucket object |
| Ignore-Timestamps | NO | false | Do not attempt to resolve timestamps from data contents |
| Assume-Local-Timezone | NO | false | Assume the local timezone for timestamps that do not include timezone information |
| Timezone-Override | NO | | Force a specific timezone when interpreting timestamps in data |
| Timestamp-Format-Override | NO | | Force the ingester to look for a specific timestamp format |
| Max-Line-Size | NO | 4MB | Limit the maximum length of a single entry when using the line reader |
| Source-Override | NO | Ingester IP | Override the source value attached to each entry |
| File-Filters | NO | | Specify one or more glob patterns for use when matching object names (Example: AWSLogs/**/*.json.gz) |
| Preprocessor | NO | | Specify one or more preprocessors to execute on ingested data |

The Bucket-ARN configuration parameter requires a fully qualified ARN value, not the HTTP or HTTPS URL (Example: arn:aws:s3:::aws-cloudtrail-logs-stuff). The ARN for an S3 bucket is simply the arn:aws:s3::: prefix followed by the bucket name.

Object Match Globs

The S3 ingester will attempt to ingest all objects in an S3 bucket recursively unless one or more File-Filters patterns are provided to establish which objects should be consumed. File-Filters supports standard filename globbing as well as “double-star” patterns for recursive directory matches.

For example, if we specify a single File-Filters pattern of *.log then the ingester will consume all objects that have the file extension .log at the first directory level only. An object named foo.log will be consumed but an object named stuff/foo.log will not.

Multiple File-Filters can be specified to create an OR match pattern; for example, the following set of patterns would match all objects that have an extension of .log and are located in the this, that, or theother directories:

File-Filters=this/*.log
File-Filters=that/*.log
File-Filters=theother/*.log

Arbitrary directory depths can be matched using a “double-star” globbing pattern. For example, File-Filters="AWSLogs/**/*.json.gz" will match all objects with a file extension of .json.gz located in any sub-directory within the AWSLogs top-level directory. All of the following objects would be matched by this filter:

AWSLogs/475058115300/CloudTrail/us-west-2/2022/11/28/475058115300_CloudTrail_us-west-2_20221128T2320Z_gvNAnhYNeqzmI2bH.json.gz
AWSLogs/475058115300/CloudTrail/us-east-1/2022/11/28/475058115300_CloudTrail_us-east-1_20221128T2320Z_lY7sQmelLGP14BrY.json.gz
AWSLogs/summary/CloudTrail_us-east-1.json.gz

Bucket Data Formats

By default the S3 ingester will process objects using a line reader, essentially expecting line-delimited data. However, the ingester can also natively consume AWS CloudTrail event records in JSON format. If no Reader is specified for a bucket, line is assumed. The following options are available for the Reader configuration parameter:

line - treat each object as line-delimited data, emitting one entry per line

cloudtrail - treat each object as an AWS CloudTrail log file, emitting one entry per CloudTrail event record

Example Configurations

The most basic S3 configuration is a single indexer and a single bucket which consumes line-delimited data.

[Global]
Ingest-Secret = "IngestSecrets"
Cleartext-Backend-Target=172.19.0.2:4023 #example of adding a cleartext connection
Log-File=/tmp/s3.log
State-Store-Location=/tmp/s3.state

[Bucket "default"]
	Region="us-east-1"
	ID="AKI..."
	Secret="SuperSecretKey..."
	Bucket-ARN = "arn:aws:s3:::aws-super-special-logs"
	Tag-Name="s3-logs"

This example ingester will consume every object in the specified bucket, derive a timestamp, then push the data to the s3-logs Gravwell tag.

AWS CloudTrail Log Handling

To consume AWS CloudTrail logs from an S3 bucket, automatically extracting individual records and deriving the appropriate timestamp, use the cloudtrail reader.

[Global]
Ingest-Secret = "IngestSecrets"
Cleartext-Backend-Target=172.19.0.2:4023 #example of adding a cleartext connection
Log-File=/tmp/s3.log
State-Store-Location=/tmp/s3.state

[Bucket "default"]
	Region="us-east-1"
	ID="AKI..."
	Secret="SuperSecretKey..."
	Bucket-ARN = "arn:aws:s3:::aws-cloudtrail-logs-123456-7890"
	Tag-Name="aws-cloudtrail"
	Reader=cloudtrail

The cloudtrail reader does the dirty work of interpreting the CloudTrail JSON schema, extracting discrete events, processing the “eventTime” timestamp, and attaching it to each event. Using the cloudtrail reader you will get nicely segmented events:
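For example, a single ingested entry would contain one CloudTrail record similar to the abridged, illustrative sample below (all identifiers and values are placeholders, and real records contain additional fields):

{
	"eventVersion": "1.08",
	"eventTime": "2022-11-28T23:20:00Z",
	"eventSource": "s3.amazonaws.com",
	"eventName": "GetObject",
	"awsRegion": "us-west-2",
	"sourceIPAddress": "198.51.100.23",
	"userAgent": "aws-cli/2.9.1",
	"requestParameters": {
		"bucketName": "aws-super-special-logs",
		"key": "foo.log"
	}
}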