Except when interacting with public S3 buckets, the S3A client needs credentials to interact with buckets. The client supports multiple authentication mechanisms and can be configured with which mechanisms to use and in which order. Custom implementations of com.amazonaws.auth.AWSCredentialsProvider may also be used. However, with the upgrade to AWS Java SDK V2 in Hadoop 3.4.0, these classes will need to be updated to implement software.amazon.awssdk.auth.credentials.AwsCredentialsProvider. For more information see Upcoming upgrade to AWS Java SDK V2.
```
<property>
  <name>fs.s3a.access.key</name>
  <description>AWS access key ID used by S3A file system.
    Omit for IAM role-based or provider-based authentication.</description>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <description>AWS secret key used by S3A file system.
    Omit for IAM role-based or provider-based authentication.</description>
</property>

<property>
  <name>fs.s3a.session.token</name>
  <description>Session token, when using
    org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
    as one of the providers.</description>
</property>

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>
    org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
    software.amazon.awssdk.auth.credentials.EnvironmentVariableCredentialsProvider,
    org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider
  </value>
  <description>
    Comma-separated class names of credential provider classes which implement
    software.amazon.awssdk.auth.credentials.AwsCredentialsProvider.

    When S3A delegation tokens are not enabled, this list will be used
    to directly authenticate with S3 and other AWS services.
    When S3A delegation tokens are enabled, depending upon the delegation
    token binding it may be used to communicate with the STS endpoint to
    request session/role credentials.
  </description>
</property>

<property>
  <name>fs.s3a.aws.credentials.provider.mapping</name>
  <description>
    Comma-separated key-value pairs of mapped credential providers, with each
    key and value separated by the equals operator (=). The key can be used in
    the fs.s3a.aws.credentials.provider option, and will be translated into the
    credential provider class named in the corresponding value.

    Example:
    com.amazonaws.auth.AnonymousAWSCredentials=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider,
    com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper=org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider,
    com.amazonaws.auth.InstanceProfileCredentialsProvider=org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider

    With the above key-value pairs, if fs.s3a.aws.credentials.provider
    specifies com.amazonaws.auth.AnonymousAWSCredentials, it will be remapped
    to org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider by S3A while
    preparing the AWS credential provider list for any S3 access.
    This allows the same credential provider list to be used with both the
    v1 and v2 SDK clients.
  </description>
</property>
```
S3A supports configuration via the standard AWS environment variables.
The core environment variables are for the access key and associated secret:
```
export AWS_ACCESS_KEY_ID=my.aws.key
export AWS_SECRET_ACCESS_KEY=my.secret.key
```
If the environment variable AWS_SESSION_TOKEN is set, session authentication using “Temporary Security Credentials” is enabled; the Key ID and secret key must be set to the credentials for that specific session.
```
export AWS_SESSION_TOKEN=SECRET-SESSION-TOKEN
export AWS_ACCESS_KEY_ID=SESSION-ACCESS-KEY
export AWS_SECRET_ACCESS_KEY=SESSION-SECRET-KEY
```
These environment variables can be used to set the authentication credentials instead of properties in the Hadoop configuration.
Important: These environment variables are generally not propagated from client to server when YARN applications are launched. That is: having the AWS environment variables set when an application is launched will not permit the launched application to access S3 resources. The environment variables must (somehow) be set on the hosts/processes where the work is executed.
The standard way to authenticate is with an access key and secret key set in the Hadoop configuration files.
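For example, long-lived credentials can be declared directly in the configuration; the key values below are placeholders, not working credentials:

```
<property>
  <name>fs.s3a.access.key</name>
  <value>ACCESS-KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>SECRET-KEY</value>
</property>
```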
By default, the S3A client follows this authentication chain:

1. fs.s3a.access.key, fs.s3a.secret.key and fs.s3a.session.token are looked for in the Hadoop XML configuration/Hadoop credential providers, returning a set of session credentials if all three are defined.
2. fs.s3a.access.key and fs.s3a.secret.key are looked for in the Hadoop XML configuration/Hadoop credential providers, returning a set of long-lived credentials if they are defined.

S3A can be configured to obtain client authentication providers from classes which integrate with the AWS SDK by implementing the software.amazon.awssdk.auth.credentials.AwsCredentialsProvider
interface. This is done by listing the implementation classes, in order of preference, in the configuration option fs.s3a.aws.credentials.provider. In previous Hadoop releases, providers were required to implement the AWS V1 SDK interface com.amazonaws.auth.AWSCredentialsProvider. Consult the Upgrading S3A to AWS SDK V2 documentation to see how to migrate credential providers.
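As a sketch of what a V2-style provider looks like (the class name and hard-coded values here are illustrative, not part of Hadoop or the SDK), a minimal implementation might be:

```java
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.AwsCredentials;
import software.amazon.awssdk.auth.credentials.AwsCredentialsProvider;

/**
 * Sketch of a custom V2 credential provider.
 * S3A instantiates providers through a public constructor taking
 * (URI, Configuration), or through a no-argument constructor.
 */
public class CustomCredentialsProvider implements AwsCredentialsProvider {

  public CustomCredentialsProvider() {
  }

  @Override
  public AwsCredentials resolveCredentials() {
    // Fetch the secrets from wherever they are actually kept;
    // the literal values here are placeholders only.
    return AwsBasicCredentials.create("ACCESS-KEY", "SECRET-KEY");
  }
}
```

The class is then named in fs.s3a.aws.credentials.provider, and its JAR must be on the classpath of every process accessing S3.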
Important: AWS Credential Providers are distinct from Hadoop Credential Providers. As will be covered later, Hadoop Credential Providers allow passwords and other secrets to be stored and transferred more securely than in XML configuration files. AWS Credential Providers are classes which can be used by the Amazon AWS SDK to obtain an AWS login from a different source in the system, including environment variables, JVM properties and configuration files.
All Hadoop fs.s3a. options used to store login details can be secured in Hadoop credential providers; this is advised as a more secure way to store valuable secrets.
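For example, the hadoop credential command can store the secrets in a JCEKS keystore; the keystore path below is illustrative:

```
hadoop credential create fs.s3a.access.key -value ACCESS-KEY \
  -provider jceks://hdfs@namenode/user/alice/s3.jceks

hadoop credential create fs.s3a.secret.key -value SECRET-KEY \
  -provider jceks://hdfs@namenode/user/alice/s3.jceks
```

The keystore is then referenced through the hadoop.security.credential.provider.path option, keeping the secrets out of the XML configuration files.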
There are a number of AWS Credential Providers inside the hadoop-aws
JAR:
Hadoop module credential provider | Authentication Mechanism
---|---
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | Session Credentials in configuration
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider | Simple name/secret credentials in configuration
org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider | Anonymous Login
org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider | Assumed Role credentials
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider | EC2/k8s instance credentials
There are also many in the Amazon SDKs; the common ones are as follows:

classname | description
---|---
software.amazon.awssdk.auth.credentials.EnvironmentVariableCredentialsProvider | AWS Environment Variables
software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider | EC2 Metadata Credentials
software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider | EC2/k8s Metadata Credentials
InstanceProfileCredentialsProvider
Applications running in EC2 may associate an IAM role with the VM and query the EC2 Instance Metadata Service for credentials to access S3. Within the AWS SDK, this functionality is provided by InstanceProfileCredentialsProvider, which internally enforces a singleton instance in order to prevent throttling problems.
ProfileCredentialsProvider
You can configure Hadoop to authenticate to AWS using a named profile.
To authenticate with a named profile:

1. Declare software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider as the provider.
2. Set your profile via the AWS_PROFILE environment variable.
3. Omit the profile prefix from the AWS configuration section heading.

Here’s an example of what your AWS configuration files should look like:
```
$ cat ~/.aws/config
[user1]
region = us-east-1

$ cat ~/.aws/credentials
[user1]
aws_access_key_id = ...
aws_secret_access_key = ...
aws_session_token = ...
aws_security_token = ...
```
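With files like these, the provider itself can be selected in the Hadoop configuration (assuming AWS_PROFILE=user1 is exported wherever the client runs):

```
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider</value>
</property>
```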
Note:

1. The region setting is only used if fs.s3a.endpoint.region is set to the empty string.
2. The configuration files must exist in the ~/.aws/ directory on the local filesystem of all hosts in the cluster.

TemporaryAWSCredentialsProvider
Temporary Security Credentials can be obtained from the Amazon Security Token Service; these consist of an access key, a secret key, and a session token.
To authenticate with these:

1. Declare org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as the provider.
2. Set the session key in the property fs.s3a.session.token, and the access and secret key properties to those of this temporary session.

Example:
```
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <value>SESSION-ACCESS-KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>SESSION-SECRET-KEY</value>
</property>

<property>
  <name>fs.s3a.session.token</name>
  <value>SECRET-SESSION-TOKEN</value>
</property>
```
The lifetime of session credentials is fixed when the credentials are issued; once they expire the application will no longer be able to authenticate to AWS.
AnonymousAWSCredentialsProvider
Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
allows anonymous access to a publicly accessible S3 bucket without any credentials. It can be useful for accessing public data sets without requiring AWS credentials.
```
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>
```
Once this is done, there’s no need to supply any credentials in the Hadoop configuration or via environment variables.
This option can be used to verify that an object store does not permit unauthenticated access: that is, if an attempt to list a bucket is made using the anonymous credentials, it should fail, unless the bucket has been explicitly opened up for broader access.
```
hadoop fs -ls \
  -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
  s3a://noaa-isd-pds/
```
Allowing anonymous access to an S3 bucket compromises security and therefore is unsuitable for most use cases.
If a list of credential providers is given in fs.s3a.aws.credentials.provider
, then the Anonymous Credential provider must come last. If not, credential providers listed after it will be ignored.
SimpleAWSCredentialsProvider
This is the standard credential provider, which reads the access key from fs.s3a.access.key and the secret key from fs.s3a.secret.key.
```
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>
```
This is the basic authenticator used in the default authentication chain.
This means that the default S3A authentication chain can be defined as
```
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>
    org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
    software.amazon.awssdk.auth.credentials.EnvironmentVariableCredentialsProvider,
    org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider
  </value>
</property>
```
It is critical that you never share or leak your AWS credentials. Loss of credentials can expose or destroy all your data, run up large bills, and significantly damage your organisation.
Never share your secrets.
Never commit your secrets into an SCM repository. The git-secrets tool can help here.
Never include AWS credentials in bug reports, files attached to them, or similar.
If you use the AWS_
environment variables, your list of environment variables is equally sensitive.
Never use root credentials. Use IAM user accounts, with each user/application having its own set of credentials.
Use IAM permissions to restrict the permissions individual users and applications have. This is best done through roles, rather than configuring individual users.
Avoid passing in secrets to Hadoop applications/commands on the command line. The command line of any launched program is visible to all users on a Unix system (via ps
), and preserved in command histories.
Explore using IAM Assumed Roles for role-based permissions management: a specific S3A connection can be made with a different assumed role and permissions from the primary user account.
Consider a workflow in which users and applications are issued with short-lived session credentials, configuring S3A to use these through the TemporaryAWSCredentialsProvider
.
Have a secure process in place for cancelling and re-issuing credentials for users and applications. Test it regularly by using it to refresh credentials.
In installations where Kerberos is enabled, S3A Delegation Tokens can be used to acquire short-lived session/role credentials and then pass them into the shared application. This can ensure that the long-lived secrets stay on the local system.
When running in EC2, the IAM EC2 instance credential provider will automatically obtain the credentials needed to access AWS services in the role the EC2 VM was deployed as. This AWS credential provider is enabled in S3A by default.
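To rely solely on instance credentials, for example to guarantee that in-cluster deployments never pick up secrets from configuration files, the provider list can be restricted to this provider alone:

```
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider</value>
</property>
```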
Apache Spark employs two class loaders, one that loads “distribution” (Spark + Hadoop) classes and one that loads custom user classes. Users who want to load custom implementations of AWS credential providers, custom signers, delegation token providers or any other dynamically loaded extension class through user-provided JARs will need to set the following configuration:
```
<property>
  <name>fs.s3a.classloader.isolation</name>
  <value>false</value>
</property>

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>CustomCredentialsProvider</value>
</property>
```
If fs.s3a.classloader.isolation is not set, or is set to true, the following exception will be thrown:
```
java.io.IOException: From option fs.s3a.aws.credentials.provider
java.lang.ClassNotFoundException: Class CustomCredentialsProvider not found
```
S3 Access Grants can be used to grant access to S3 data using IAM Principals. In order to enable S3 Access Grants, S3A utilizes the S3 Access Grants plugin on all S3 clients, which is found within the AWS Java SDK bundle (v2.23.19+).
S3A supports both cross-region access (by default) and the fallback-to-IAM configuration, which allows S3A to fall back to using the IAM role (and its permission sets directly) to access S3 data in the case that S3 Access Grants is unable to authorize the S3 call.
The following is how to enable this feature:
```
<property>
  <name>fs.s3a.s3accessgrants.enabled</name>
  <value>true</value>
</property>

<property>
  <!--Optional: Defaults to False-->
  <name>fs.s3a.s3accessgrants.fallback.to.iam</name>
  <value>true</value>
</property>
```
Note: S3A only enables the S3 Access Grants plugin on the S3 clients as part of this feature. Any usage issues or bug reports should be raised directly at the plugin’s GitHub repo.
For more details on using S3 Access Grants, please refer to Managing access with S3 Access Grants.