# Amazon Simple Storage Service Ingester (S3)

Warning

If you have large amounts of data in S3 buckets or want to “tail” S3 data, this is probably not the ingester you are looking for. Please read this section in its entirety.

The Amazon S3 ingester is designed to ingest data directly from the Amazon S3 API, as well as from references to files delivered via Amazon SQS. While this is a convenient method of ingestion and integrates well with tools like Amazon Cloudtrail and a myriad of other cloud-based tools, be aware that this ingester is NOT designed to provide near real-time access to data. It cannot “tail” objects (i.e. consume entries as they are written to a bucket). Amazon S3 is a blob storage system with three basic operations (list, get, put), which means the ingester must consume entire objects when ingesting. If new data is written to an existing S3 object, the write effectively replaces the object and the ingester will treat it as such. In other words, picking up “new” data that has been added to an existing object requires re-reading the entire object.

S3 buckets are also not meant to provide low-latency access to data. The S3 API is essentially an HTTP API, and most systems that write to S3 buckets batch up data and write large chunks at a time. Be forewarned: due to this common usage pattern, you may see delays measured in minutes between when an entry says it was created and when Gravwell can actually consume it.

If you have large amounts of S3 data and/or are interested in “tailing” S3 data, you should probably use the SQS ingester or another mechanism like [Amazon Kinesis](/ingesters/kinesis).

All of that being said, the S3 ingester makes integrating with many cloud services extremely easy: just point it at an S3 bucket, provide the appropriate access keys, and you are off.

## Installation

To install the Debian package, make sure the Gravwell Debian repository is configured as described in the quickstart. Then run the following command as root:

```
apt update && apt install gravwell-s3
```

To install the Redhat package, make sure the Gravwell Redhat repository is configured as described in the quickstart. Then run the following command as root:

```
yum install gravwell-s3
```

To install via the standalone shell installer, download the installer from the downloads page, then run the following command as root, replacing X.X.X with the appropriate version:

```
bash gravwell_s3_ingest_installer_X.X.X.sh
```

You may be prompted for additional configuration during the installation.

There is currently no Docker image for this ingester.
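
After installation you can confirm the ingester is running and able to reach your indexers. This is a minimal sketch only: the systemd unit name gravwell_s3.service is an assumption and may differ on your system, and the log path comes from the Log-File setting used in the examples below:

```
# check the service status (unit name is an assumption; verify on your system)
systemctl status gravwell_s3.service
# review the ingester log for connection errors
tail /tmp/s3.log
```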

## Basic Configuration

The Gravwell S3 ingester can consume from multiple buckets at once and will track its progress using a state file so that you can start and stop the ingester at will. By default, the ingester will scan S3 buckets every minute looking for changes and ingest new objects that are discovered. The main configuration for the Gravwell S3 ingester requires a [Global] section that specifies all of the usual configuration parameters which tell the ingester how to connect to Gravwell indexers. The ingester also supports the full suite of custom preprocessors, time format overrides, and custom time format specifications.

The S3 ingester relies on a standard configuration file located at /opt/gravwell/etc/s3.conf by default, but also supports configuration overlays in the /opt/gravwell/etc/s3.conf.d/ directory. Remember to ensure that all configuration overlays have the .conf file extension and are readable by the gravwell user and/or gravwell group.

A single S3 ingester can specify multiple buckets to consume from. Multiple buckets are specified by creating multiple [Bucket "name"] stanzas in either the main s3.conf file or in overlay configuration files.
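
For example, an additional bucket could be defined in a hypothetical overlay file /opt/gravwell/etc/s3.conf.d/extra-bucket.conf; the stanza uses the same parameters as a bucket defined in the main s3.conf (the ARN, region, and tag below are placeholders):

```
[Bucket "extra"]
	Region="us-west-2"
	ID="AKI..."
	Secret="SuperSecretKey..."
	Bucket-ARN="arn:aws:s3:::example-extra-logs"
	Tag-Name="s3-extra"
```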

Note

We HIGHLY recommend creating a dedicated S3 IAM user for the Gravwell ingester. It’s never a good idea to use privileged credentials for dedicated applications like data ingestion.

### Worker Pool

You can configure the number of parallel threads that read from S3 buckets by setting the Worker-Pool-Size parameter in the global configuration. By default, this value is 1. Increasing this number can improve the speed at which files in S3 are read, but at the cost of more parallel network activity.

For example, to create ten workers that operate in parallel:

```
[Global]
Ingest-Secret = "IngestSecrets"
Cleartext-Backend-Target=172.19.0.2:4023 #example of adding a cleartext connection
Log-File=/tmp/s3.log
State-Store-Location=/tmp/s3.state
Worker-Pool-Size=10

...
```

## Bucket Configuration

Each bucket configuration supports the following configuration options:

| Configuration Parameter | Required | Default Value | Description |
|---|---|---|---|
| ID | YES | | AWS IAM User ID to authenticate with |
| Secret | YES | | AWS IAM User secret to authenticate with |
| Region | YES | | AWS region the S3 bucket is registered in (Example: us-east-2) |
| Bucket-ARN | YES | | AWS S3 bucket resource name (Example: arn:aws:s3:::aws-cloudtrail-logs-475058115300-aabbccddee). Required unless using an alternate Endpoint. |
| Endpoint | NO | | Alternate provider endpoint when not using typical AWS S3 services. |
| Bucket-Name | NO | | Define a bucket name for alternate non-AWS S3 endpoints (see the alternate endpoints section). |
| Disable-TLS | NO | false | Disable SSL/TLS when connecting to an alternate Endpoint (see the alternate endpoints section). |
| S3-Force-Path-Style | NO | false | Force endpoint resolution to append bucket names to URLs rather than domains (see the alternate endpoints section). |
| Tag-Name | YES | | Specify the Gravwell tag name to be applied to data ingested from the bucket |
| Reader | NO | line | Specify how the data in an S3 object is to be interpreted; defaults to line-delimited, but cloudtrail is also supported |
| MaxRetries | NO | 3 | Specify the maximum number of retries when requesting a bucket object |
| Ignore-Timestamps | NO | false | Do not attempt to resolve timestamps from data contents |
| Assume-Local-Timezone | NO | false | If no timezone is present, assume the local timezone for timestamps |
| Timezone-Override | NO | | Force a specific timezone when interpreting timestamps in data |
| Timestamp-Format-Override | NO | | Force the ingester to look for a specific timestamp format |
| Max-Line-Size | NO | 4MB | Limit the maximum length of a single entry when using the line Reader |
| Source-Override | NO | Ingester IP | Override the source value attached to each entry |
| File-Filters | NO | | Specify one or more glob patterns for use when matching object names (Example: `AWSLogs/**/*.json.gz`) |
| Preprocessor | NO | | Specify one or more preprocessors to execute on ingested data |
| Credentials-Type | NO | static | Sets the type of authentication credentials used for accessing the SQS and S3 services. |

The Bucket-ARN configuration parameter requires a fully qualified ARN value, not an HTTP or HTTPS URL (Example: arn:aws:s3:::aws-cloudtrail-logs-stuff).

### Bucket Data Formats

By default, the S3 ingester processes objects using a line reader, expecting line-delimited data; however, it can also natively consume AWS Cloudtrail event records in JSON format. If no Reader is specified for a bucket, line is assumed. The following options are available for the Reader configuration parameter:

- line: each newline-delimited line in an object becomes a single entry (the default)
- cloudtrail: each record in an AWS Cloudtrail JSON log object becomes a single entry

## SQS References

The S3 ingester can also pull S3 objects that are referenced in SQS queue messages. This eliminates the need for active bucket polling and works with a number of Amazon services, such as Cloudtrail. To use this type of listener, declare an SQS-S3-Listener stanza using the following configuration parameters (a sketch of a complete stanza follows the table):

| Configuration Parameter | Required | Default Value | Description |
|---|---|---|---|
| Tag-Name | YES | | Specify the Gravwell tag name to be applied to data ingested from the bucket |
| ID | YES | | AWS IAM User ID to authenticate with |
| Secret | YES | | AWS IAM User secret to authenticate with |
| Region | YES | | AWS region the S3 bucket is registered in (Example: us-east-2) |
| Reader | NO | line | Specify how the data in an S3 object is to be interpreted; defaults to line-delimited, but cloudtrail is also supported |
| Max-Line-Size | NO | 4MB | Limit the maximum length of a single entry when using the line Reader |
| Source-Override | NO | Ingester IP | Override the source value attached to each entry |
| Preprocessor | NO | | Specify one or more preprocessors to execute on ingested data |
| Ignore-Timestamps | NO | false | Do not attempt to resolve timestamps from data contents |
| Assume-Local-Timezone | NO | false | If no timezone is present, assume the local timezone for timestamps |
| Timezone-Override | NO | | Force a specific timezone when interpreting timestamps in data |
| Timestamp-Format-Override | NO | | Force the ingester to look for a specific timestamp format |
| Credentials-Type | NO | static | Sets the type of authentication credentials used for accessing the SQS and S3 services. |
| File-Filters | NO | | Specify one or more glob patterns for use when matching object names (Example: `AWSLogs/**/*.json.gz`) |
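
Below is a minimal sketch of an SQS-S3-Listener stanza. Note that the Queue-URL parameter name and the queue URL value are assumptions for illustration; they are not listed in the table above, so consult the example configuration shipped with the ingester for the authoritative way to point the listener at your queue:

```
[SQS-S3-Listener "cloudtrail"]
	Region="us-east-2"
	ID="AKI..."
	Secret="SuperSecretKey..."
	Tag-Name="aws-cloudtrail"
	Reader=cloudtrail
	# Queue-URL is an assumed parameter name; it should reference the SQS queue
	# that receives the S3 object notifications
	Queue-URL="https://sqs.us-east-2.amazonaws.com/123456789012/example-queue"
```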

## Object Match Globs

The S3 ingester will attempt to ingest all objects in an S3 bucket or SQS/SNS message unless one or more File-Filters patterns are used to establish which objects should be consumed. The File-Filters patterns support the standard globbing patterns for filenames and “double-star” patterns for recursive directory matches.

For example, if we specify a single File-Filters pattern of `*.log`, the ingester will consume all objects that have the file extension .log at the top level of the bucket only. An object named foo.log will be consumed, but an object named stuff/foo.log will not.

Multiple File-Filters can be specified to create an OR match pattern; for example, the following set of patterns would match all objects that have an extension of .log and are located in any of the this, that, or theother directories:

```
File-Filters=this/*.log
File-Filters=that/*.log
File-Filters=theother/*.log
```

Arbitrary directory depths can be matched using a “double-star” globbing pattern. For example, `File-Filters="AWSLogs/**/*.json.gz"` will match all objects with a file extension of .json.gz located in any sub-directory within the AWSLogs top-level directory. All of the following objects would be matched by this filter:

```
AWSLogs/475058115300/CloudTrail/us-west-2/2022/11/28/475058115300_CloudTrail_us-west-2_20221128T2320Z_gvNAnhYNeqzmI2bH.json.gz
AWSLogs/475058115300/CloudTrail/us-east-1/2022/11/28/475058115300_CloudTrail_us-east-1_20221128T2320Z_lY7sQmelLGP14BrY.json.gz
AWSLogs/summary/CloudTrail_us-east-1.json.gz
```

## Credentials-Type Authentication Options

Both listener types (Bucket and SQS-S3-Listener) support multiple authentication methods. By default, the “static” method is used, which requires that you set the ID and Secret fields for the listener. The following additional methods are supported, and can be used by setting the Credentials-Type field:

| Credential Type | Description |
|---|---|
| static | Default credential type. Uses the ID and Secret fields set in the listener. |
| environment | Uses the AWS_ACCESS_KEY and AWS_SECRET_KEY environment variables to authenticate. |
| ec2role | Uses the host-based EC2 role authentication method. See the Amazon EC2 documentation for more information. |
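
As a brief sketch, a bucket stanza using environment credentials simply omits the ID and Secret fields and sets Credentials-Type (the ARN, region, and tag below are placeholders):

```
[Bucket "envcreds"]
	Region="us-east-1"
	Bucket-ARN="arn:aws:s3:::example-logs"
	Tag-Name="s3-logs"
	Credentials-Type=environment
```

The AWS_ACCESS_KEY and AWS_SECRET_KEY variables must then be present in the environment of the ingester process, for example via a systemd unit override.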

## Alternate Endpoints

The S3 ingester is designed to be flexible in handling S3-compatible APIs from providers other than Amazon Web Services. This means that you can use the S3 ingester to consume logs from other S3 providers like Minio, Linode, DigitalOcean, and many others. However, the S3 API has some peculiarities when it comes to defining endpoints, bucket names, and regions. This section goes over how to configure the S3 ingester to operate against non-AWS S3 providers and managed AWS endpoints like those provided by Cisco.

When configuring the S3 ingester to operate against a non-AWS S3 provider, you will need to omit the Bucket-ARN parameter and must define the Endpoint, Bucket-Name, and S3-Force-Path-Style parameters. Unfortunately, there are several ways to specify a path to an S3 bucket, involving varying configurations of domain names and HTTP URLs. This section goes over the basics of generating a configuration that will work with an S3-compatible endpoint that is not AWS.

The first configuration parameter required is the Endpoint parameter. This is typically a domain name provided by your S3 provider. The Endpoint may contain the region embedded in it, but you will still need to specify the region in the Region parameter. Some examples include us-east-1.linodeobjects.com, cisco-managed-us-east-1.s3.amazonaws.com, or 192.168.1.1:9000. The Endpoint should be the host the ingester will be connecting to. The Bucket-Name parameter is required when providing a custom Endpoint; depending on your provider, you may also need to adjust the Endpoint and S3-Force-Path-Style configuration parameters.

For example, Linode provided an S3 target of testing.us-east-1.linodeobjects.com when we created our bucket named testing; for this case we provide Endpoint="us-east-1.linodeobjects.com" and Bucket-Name="testing". For a custom Minio object storage deployment, we might set Endpoint=192.168.1.1:9000, Bucket-Name="testing", and S3-Force-Path-Style=true. Additional examples for several providers are available below.

## Example Configurations

The most basic S3 configuration is a single indexer and a single bucket consuming line-delimited data.

```
[Global]
Ingest-Secret = "IngestSecrets"
Cleartext-Backend-Target=172.19.0.2:4023 #example of adding a cleartext connection
Log-File=/tmp/s3.log
State-Store-Location=/tmp/s3.state
Worker-Pool-Size=10

[Bucket "default"]
	Region="us-east-1"
	ID="AKI..."
	Secret="SuperSecretKey..."
	Bucket-ARN = "arn:aws:s3:::aws-super-special-logs"
	Tag-Name="s3-logs"
```

This example ingester will consume every object in the specified bucket, derive a timestamp, then push the data to the s3-logs Gravwell tag.

### Amazon Cloudtrail Log Handling

To consume AWS Cloudtrail logs from an S3 bucket, automatically extracting individual records and deriving the appropriate timestamp, the cloudtrail reader is available.

```
[Global]
Ingest-Secret = "IngestSecrets"
Cleartext-Backend-Target=172.19.0.2:4023 #example of adding a cleartext connection
Log-File=/tmp/s3.log
State-Store-Location=/tmp/s3.state

[Bucket "default"]
	Region="us-east-1"
	ID="AKI..."
	Secret="SuperSecretKey..."
	Bucket-ARN = "arn:aws:s3:::aws-cloudtrail-logs-123456-7890"
	Tag-Name="aws-cloudtrail"
	Reader=cloudtrail
	File-Filters=**/*.json.gz
```

The cloudtrail reader does the dirty work of interpreting the Cloudtrail JSON schema, extracting discrete events, and processing the “eventTime” timestamp and attaching it to each event. Using the cloudtrail reader, you will get nicely segmented events.

### Linode Object Storage

In this example we have a bucket named testing and a provided Linode URL of https://testing.us-east-1.linodeobjects.com.

[Bucket "linode"]
	ID="AKI..."
	Secret="SuperSecretKey..."
	Tag-Name="linode-logs"
	Bucket-Name="testing"
	Endpoint="us-east-1.linodeobjects.com"
	Region="us-east-1"
	S3-Force-Path-Style=false
	File-Filters="**/*.log"

### Minio Self-Hosted Object Storage

Minio provides an excellent open source implementation of an S3-compatible object storage server, and even offers an easy-to-use Docker image. In a default Docker deployment, the Minio server listens on port 9000 and uses path-style object access. The region defaults to us-east-1 but can be altered in the configuration. Here is an example configuration for Minio:

[Bucket "minio"]
	ID="AKI..."
	Secret="SuperSecretKey..."
	Tag-Name="minio-logs"
	Bucket-Name="testing"
	Endpoint="172.16.0.4:9000"
	Region="us-east-1"
	S3-Force-Path-Style=true
	File-Filters="**/auth.*.log"

### Managed AWS S3 Endpoints

Some companies provide access to managed AWS regions that are technically AWS but not in the standard set of regions. For these cases we can use an alternate Endpoint setup to interact with the custom regions. For this case we will look at a configuration designed to pull back Cisco Umbrella logs that are hosted on the Cisco-managed AWS region.

[Bucket "umbrella"]
	ID="AKI..."
	Secret="SuperSecretKey..."
	Tag-Name="umbrella-logs"
	Bucket-Name="123456_7368697474657227732066756c6c"
	Endpoint="cisco-managed-us-east-1.s3.amazonaws.com"
	Region="cisco-managed-us-east-1"
	S3-Force-Path-Style=true
	File-Filters="auditlog/**/*.log"

## AWS IAM Permission Policy

Following the principle of least privilege, it is recommended to create a new AWS IAM user or role for the ingester with only the bare minimum of required permissions. Below are some example AWS permissions policies that can be used for this newly created user.

A basic permissions policy granting access to all S3 buckets on an account:

```
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "gravwellS3ListBucket",
			"Effect": "Allow",
			"Action": "s3:ListBucket",
			"Resource": "arn:aws:s3:::*"
		},
		{
			"Sid": "gravwellS3GetObject",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:GetObjectVersion"
			],
			"Resource": "arn:aws:s3:::*/*"
		}
	]
}
```

A basic policy granting access to only a single S3 bucket on the account:

```
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "gravwellS3ListBucket",
			"Effect": "Allow",
			"Action": "s3:ListBucket",
			"Resource": "arn:aws:s3:::<BUCKET_NAME>"
		},
		{
			"Sid": "gravwellS3GetObject",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:GetObjectVersion"
			],
			"Resource": "arn:aws:s3:::<BUCKET_NAME>/*"
		}
	]
}
```

A policy restricting access to only those systems with a specific originating IP address:

```
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "gravwellS3GetObject",
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:GetObjectVersion"
			],
			"Resource": "arn:aws:s3:::*/*",
			"Condition": {
				"IpAddress": {
					"aws:SourceIp": [
						"10.0.0.1",
						"192.168.0.1"
					]
				}
			}
		},
		{
			"Sid": "gravwellS3ListBucket",
			"Effect": "Allow",
			"Action": "s3:ListBucket",
			"Resource": "arn:aws:s3:::*",
			"Condition": {
				"IpAddress": {
					"aws:SourceIp": [
						"10.0.0.1",
						"192.168.0.1"
					]
				}
			}
		}
	]
}
```

A policy allowing access to the Simple Queue Service for the SQS reader:

```
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "gravwellSQS",
			"Effect": "Allow",
			"Action": [
				"sqs:DeleteMessage",
				"sqs:StartMessageMoveTask",
				"sqs:GetQueueUrl",
				"sqs:CancelMessageMoveTask",
				"sqs:ChangeMessageVisibility",
				"sqs:ListMessageMoveTasks",
				"sqs:ReceiveMessage",
				"sqs:SendMessage",
				"sqs:GetQueueAttributes",
				"sqs:ListQueueTags",
				"sqs:ListDeadLetterSourceQueues",
				"sqs:PurgeQueue",
				"sqs:DeleteQueue",
				"sqs:CreateQueue",
				"sqs:SetQueueAttributes"
			],
			"Resource": "arn:aws:sqs:*:640150647918:*"
		},
		{
			"Sid": "gravwellSQSListQueues",
			"Effect": "Allow",
			"Action": "sqs:ListQueues",
			"Resource": "*"
		}
	]
}
```