Authenticating with S3

Except when interacting with public S3 buckets, the S3A client needs the credentials needed to interact with buckets.

The client supports multiple authentication mechanisms and can be configured as to which mechanisms to use, and their order of use. Custom implementations of com.amazonaws.auth.AWSCredentialsProvider may also be used. However, with the upgrade to AWS Java SDK V2 in Hadoop 3.4.0, these classes will need to be updated to implement software.amazon.awssdk.auth.credentials.AwsCredentialsProvider. For more information see Upcoming upgrade to AWS Java SDK V2.

Authentication properties

<property>
  <name>fs.s3a.access.key</name>
  <description>AWS access key ID used by S3A file system. Omit for IAM role-based or provider-based authentication.</description>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <description>AWS secret key used by S3A file system. Omit for IAM role-based or provider-based authentication.</description>
</property>

<property>
  <name>fs.s3a.session.token</name>
  <description>Session token, when using org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
    as one of the providers.
  </description>
</property>

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>
    org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
    software.amazon.awssdk.auth.credentials.EnvironmentVariableCredentialsProvider,
    org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider
  </value>
  <description>
    Comma-separated class names of credential provider classes which implement
    software.amazon.awssdk.auth.credentials.AwsCredentialsProvider.

    When S3A delegation tokens are not enabled, this list will be used
    to directly authenticate with S3 and other AWS services.
    When S3A Delegation tokens are enabled, depending upon the delegation
    token binding it may be used
    to communicate wih the STS endpoint to request session/role
    credentials.
  </description>
</property>

<property>
  <name>fs.s3a.aws.credentials.provider.mapping</name>
  <description>
    Comma-separated key-value pairs of mapped credential providers that are
    separated by equal operator (=). The key can be used by
    fs.s3a.aws.credentials.provider config, and it will be translated into
    the specified value of credential provider class based on the key-value
    pair provided by this config.

    Example:
    com.amazonaws.auth.AnonymousAWSCredentials=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider,
    com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper=org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider,
    com.amazonaws.auth.InstanceProfileCredentialsProvider=org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider

    With the above key-value pairs, if fs.s3a.aws.credentials.provider specifies
    com.amazonaws.auth.AnonymousAWSCredentials, it will be remapped to
    org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider by S3A while
    preparing AWS credential provider list for any S3 access.
    We can use the same credentials provider list for both v1 and v2 SDK clients.
  </description>
</property>

Authenticating via the AWS Environment Variables

S3A supports configuration via the standard AWS environment variables.

The core environment variables are for the access key and associated secret:

export AWS_ACCESS_KEY_ID=my.aws.key
export AWS_SECRET_ACCESS_KEY=my.secret.key

If the environment variable AWS_SESSION_TOKEN is set, session authentication using “Temporary Security Credentials” is enabled; the Key ID and secret key must be set to the credentials for that specific session.

export AWS_SESSION_TOKEN=SECRET-SESSION-TOKEN
export AWS_ACCESS_KEY_ID=SESSION-ACCESS-KEY
export AWS_SECRET_ACCESS_KEY=SESSION-SECRET-KEY

These environment variables can be used to set the authentication credentials instead of properties in the Hadoop configuration.

Important: These environment variables are generally not propagated from client to server when YARN applications are launched. That is: having the AWS environment variables set when an application is launched will not permit the launched application to access S3 resources. The environment variables must (somehow) be set on the hosts/processes where the work is executed.

Changing Authentication Providers

The standard way to authenticate is with an access key and secret key set in the Hadoop configuration files.

By default, the S3A client follows the following authentication chain:

  1. The options fs.s3a.access.key, fs.s3a.secret.key and fs.s3a.sesson.key are looked for in the Hadoop XML configuration/Hadoop credential providers, returning a set of session credentials if all three are defined.
  2. The fs.s3a.access.key and fs.s3a.secret.key are looked for in the Hadoop XML configuration//Hadoop credential providers, returning a set of long-lived credentials if they are defined.
  3. The AWS environment variables, are then looked for: these will return session or full credentials depending on which values are set.
  4. An attempt is made to query the Amazon EC2 Instance/k8s container Metadata Service to retrieve credentials published to EC2 VMs.

S3A can be configured to obtain client authentication providers from classes which integrate with the AWS SDK by implementing the software.amazon.awssdk.auth.credentials.AwsCredentialsProvider interface. This is done by listing the implementation classes, in order of preference, in the configuration option fs.s3a.aws.credentials.provider. In previous hadoop releases, providers were required to implement the AWS V1 SDK interface com.amazonaws.auth.AWSCredentialsProvider. Consult the Upgrading S3A to AWS SDK V2 documentation to see how to migrate credential providers.

Important: AWS Credential Providers are distinct from Hadoop Credential Providers. As will be covered later, Hadoop Credential Providers allow passwords and other secrets to be stored and transferred more securely than in XML configuration files. AWS Credential Providers are classes which can be used by the Amazon AWS SDK to obtain an AWS login from a different source in the system, including environment variables, JVM properties and configuration files.

All Hadoop fs.s3a. options used to store login details can all be secured in Hadoop credential providers; this is advised as a more secure way to store valuable secrets.

There are a number of AWS Credential Providers inside the hadoop-aws JAR:

Hadoop module credential provider Authentication Mechanism
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider Session Credentials in configuration
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider Simple name/secret credentials in configuration
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider Anonymous Login
org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider Assumed Role credentials
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider EC2/k8s instance credentials

There are also many in the Amazon SDKs, with the common ones being as follows

classname description
software.amazon.awssdk.auth.credentials.EnvironmentVariableCredentialsProvider AWS Environment Variables
software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider EC2 Metadata Credentials
software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider EC2/k8s Metadata Credentials

EC2 IAM Metadata Authentication with InstanceProfileCredentialsProvider

Applications running in EC2 may associate an IAM role with the VM and query the EC2 Instance Metadata Service for credentials to access S3. Within the AWS SDK, this functionality is provided by InstanceProfileCredentialsProvider, which internally enforces a singleton instance in order to prevent throttling problem.

Using Named Profile Credentials with ProfileCredentialsProvider

You can configure Hadoop to authenticate to AWS using a named profile.

To authenticate with a named profile:

  1. Declare software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider as the provider.
  2. Set your profile via the AWS_PROFILE environment variable.
  3. Due to a bug in version 1 of the AWS Java SDK, you’ll need to remove the profile prefix from the AWS configuration section heading.

Here’s an example of what your AWS configuration files should look like:

```
$ cat ~/.aws/config
[user1]
region = us-east-1
$ cat ~/.aws/credentials
[user1]
aws_access_key_id = ...
aws_secret_access_key = ...
aws_session_token = ...
aws_security_token = ...
```

Note:

  1. The region setting is only used if fs.s3a.endpoint.region is set to the empty string.
  2. For the credentials to be available to applications running in a Hadoop cluster, the configuration files MUST be in the ~/.aws/ directory on the local filesystem in all hosts in the cluster.

Using Session Credentials with TemporaryAWSCredentialsProvider

Temporary Security Credentials can be obtained from the Amazon Security Token Service; these consist of an access key, a secret key, and a session token.

To authenticate with these:

  1. Declare org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as the provider.
  2. Set the session key in the property fs.s3a.session.token, and the access and secret key properties to those of this temporary session.

Example:

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <value>SESSION-ACCESS-KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>SESSION-SECRET-KEY</value>
</property>

<property>
  <name>fs.s3a.session.token</name>
  <value>SECRET-SESSION-TOKEN</value>
</property>

The lifetime of session credentials are fixed when the credentials are issued; once they expire the application will no longer be able to authenticate to AWS.

Anonymous Login with AnonymousAWSCredentialsProvider

Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows anonymous access to a publicly accessible S3 bucket without any credentials. It can be useful for accessing public data sets without requiring AWS credentials.

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>

Once this is done, there’s no need to supply any credentials in the Hadoop configuration or via environment variables.

This option can be used to verify that an object store does not permit unauthenticated access: that is, if an attempt to list a bucket is made using the anonymous credentials, it should fail —unless explicitly opened up for broader access.

hadoop fs -ls \
 -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
 s3a://noaa-isd-pds/
  1. Allowing anonymous access to an S3 bucket compromises security and therefore is unsuitable for most use cases.

  2. If a list of credential providers is given in fs.s3a.aws.credentials.provider, then the Anonymous Credential provider must come last. If not, credential providers listed after it will be ignored.

Simple name/secret credentials with SimpleAWSCredentialsProvider*

This is the standard credential provider, which supports the secret key in fs.s3a.access.key and token in fs.s3a.secret.key values.

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>

This is the basic authenticator used in the default authentication chain.

This means that the default S3A authentication chain can be defined as

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>
    org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
    software.amazon.awssdk.auth.credentials.EnvironmentVariableCredentialsProvider
    org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider
  </value>
</property>

Protecting the AWS Credentials

It is critical that you never share or leak your AWS credentials. Loss of credentials can leak/lose all your data, run up large bills, and significantly damage your organisation.

  1. Never share your secrets.

  2. Never commit your secrets into an SCM repository. The git secrets can help here.

  3. Never include AWS credentials in bug reports, files attached to them, or similar.

  4. If you use the AWS_ environment variables, your list of environment variables is equally sensitive.

  5. Never use root credentials. Use IAM user accounts, with each user/application having its own set of credentials.

  6. Use IAM permissions to restrict the permissions individual users and applications have. This is best done through roles, rather than configuring individual users.

  7. Avoid passing in secrets to Hadoop applications/commands on the command line. The command line of any launched program is visible to all users on a Unix system (via ps), and preserved in command histories.

  8. Explore using IAM Assumed Roles for role-based permissions management: a specific S3A connection can be made with a different assumed role and permissions from the primary user account.

  9. Consider a workflow in which users and applications are issued with short-lived session credentials, configuring S3A to use these through the TemporaryAWSCredentialsProvider.

  10. Have a secure process in place for cancelling and re-issuing credentials for users and applications. Test it regularly by using it to refresh credentials.

  11. In installations where Kerberos is enabled, S3A Delegation Tokens can be used to acquire short-lived session/role credentials and then pass them into the shared application. This can ensure that the long-lived secrets stay on the local system.

When running in EC2, the IAM EC2 instance credential provider will automatically obtain the credentials needed to access AWS services in the role the EC2 VM was deployed as. This AWS credential provider is enabled in S3A by default.

Custom AWS Credential Providers and Apache Spark

Apache Spark employs two class loaders, one that loads “distribution” (Spark + Hadoop) classes and one that loads custom user classes. If the user wants to load custom implementations of AWS credential providers, custom signers, delegation token providers or any other dynamically loaded extension class through user provided jars she will need to set the following configuration:

<property>
  <name>fs.s3a.classloader.isolation</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>CustomCredentialsProvider</value>
</property>

If the following property is not set or set to true, the following exception will be thrown:

java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class CustomCredentialsProvider not found

S3 Authorization Using S3 Access Grants

S3 Access Grants can be used to grant accesses to S3 data using IAM Principals. In order to enable S3 Access Grants, S3A utilizes the S3 Access Grants plugin on all S3 clients, which is found within the AWS Java SDK bundle (v2.23.19+).

S3A supports both cross-region access (by default) and the fallback-to-IAM configuration which allows S3A to fallback to using the IAM role (and its permission sets directly) to access S3 data in the case that S3 Access Grants is unable to authorize the S3 call.

The following is how to enable this feature:

<property>
  <name>fs.s3a.s3accessgrants.enabled</name>
  <value>true</value>
</property>
<property>
  <!--Optional: Defaults to False-->
  <name>fs.s3a.s3accessgrants.fallback.to.iam</name>
  <value>true</value>
</property>

Note: 1. S3A only enables the S3 Access Grants plugin on the S3 clients as part of this feature. Any usage issues or bug reporting should be done directly at the plugin’s GitHub repo.

For more details on using S3 Access Grants, please refer to Managing access with S3 Access Grants.