The S3A connector works well with third-party S3 stores if the store implements the S3 REST API compatibly and the following deployment requirements are met:

* The clock on the store and the client are close enough that signing works.
* The client is correctly configured to connect to the store and not to use unavailable features.
* If HTTPS authentication is used, the client/JVM TLS configuration allows it to authenticate the endpoint.
The features which may be unavailable include:
* Server-side change detection (`fs.s3a.change.detection.mode=server`)
* Change detection through object versioning (`fs.s3a.change.detection.source=versionid` and `fs.s3a.versioned.store=true`)
* Bulk delete (`fs.s3a.multiobjectdelete.enable=true`)
* Encryption (`fs.s3a.encryption.algorithm`)
* Storage classes (`fs.s3a.create.storage.class`)
* Content encoding as an HTTP header (`fs.s3a.object.content.encoding`)
* Bucket existence probes on startup (`fs.s3a.bucket.probe = 0`). Zero is now the default; do not change it.
* The V2 object listing API (switch to V1 with `fs.s3a.list.version = 1`)
The (default) etag-based change detection logic expects stores to provide an `ETag` header in HEAD/GET requests, and to support it as a precondition in subsequent GET and COPY calls. If a store does not do this, disable the checks:
```xml
<property>
  <name>fs.s3a.change.detection.mode</name>
  <value>none</value>
</property>
```
The core setting for a third-party store is to change the endpoint in `fs.s3a.endpoint`. This can be a URL or a hostname/hostname prefix. For third-party stores without virtual hostname support, providing the URL is straightforward; path-style access must also be enabled in `fs.s3a.path.style.access`.
The v4 signing algorithm requires a region to be set in `fs.s3a.endpoint.region`. A non-empty value is generally sufficient, though some deployments may require a specific value.
Finally, assuming the credential source is the normal access/secret key pair, then these must be set, either in XML or (preferred) in a JCEKS file.
```xml
<property>
  <name>fs.s3a.endpoint</name>
  <value>https://storeendpoint.example.com</value>
</property>

<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>

<property>
  <name>fs.s3a.endpoint.region</name>
  <value>anything</value>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <value>13324445</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>4C6B906D-233E-4E56-BCEA-304CC73A14F8</value>
</property>
```
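As a sketch of the JCEKS approach (the keystore path here is hypothetical), the keys can be stored with the `hadoop credential` command and the configuration pointed at the resulting keystore:

```bash
# store the access/secret key in a JCEKS keystore (path is a placeholder)
hadoop credential create fs.s3a.access.key -value 13324445 \
    -provider jceks://file/etc/hadoop/conf/s3a.jceks
hadoop credential create fs.s3a.secret.key -value 4C6B906D-233E-4E56-BCEA-304CC73A14F8 \
    -provider jceks://file/etc/hadoop/conf/s3a.jceks
```

```xml
<!-- point the client at the keystore instead of embedding the keys in XML -->
<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://file/etc/hadoop/conf/s3a.jceks</value>
</property>
```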
If per-bucket settings are used here, then third-party stores and credentials may be used alongside an AWS store.
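A sketch of the per-bucket form, for a hypothetical bucket `internal-data` hosted on the third-party store; any bucket without such overrides continues to use the default (AWS) settings:

```xml
<property>
  <name>fs.s3a.bucket.internal-data.endpoint</name>
  <value>https://storeendpoint.example.com</value>
</property>

<property>
  <name>fs.s3a.bucket.internal-data.path.style.access</name>
  <value>true</value>
</property>

<property>
  <name>fs.s3a.bucket.internal-data.endpoint.region</name>
  <value>anything</value>
</property>

<property>
  <name>fs.s3a.bucket.internal-data.access.key</name>
  <value>13324445</value>
</property>

<property>
  <name>fs.s3a.bucket.internal-data.secret.key</name>
  <value>4C6B906D-233E-4E56-BCEA-304CC73A14F8</value>
</property>
```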
Not all third-party stores support bucket lifecycle rules to clean up buckets of incomplete uploads.
This can be addressed in three ways:

* Command line: `hadoop s3guard uploads -abort -force <path>`.
* With `fs.s3a.multipart.purge` and a purge age set in `fs.s3a.multipart.purge.age`.
* In rename/delete: `fs.s3a.directory.operations.purge.uploads = true`.
This can be executed on a schedule, or manually:

```
hadoop s3guard uploads -abort -force s3a://bucket/
```

Consult the S3Guard documentation for the full set of parameters.
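As an illustration of running it on a schedule, a hypothetical cron entry which aborts all pending uploads under a bucket every night; the bucket name and log path are placeholders:

```bash
# run nightly at 01:00; abort every pending upload under the bucket
0 1 * * * hadoop s3guard uploads -abort -force s3a://bucket/ >> /var/log/s3a-upload-purge.log 2>&1
```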
`fs.s3a.multipart.purge`

This lists all uploads in a bucket when a filesystem is created and deletes all of those above a certain age.

This can hurt performance on a large bucket, as the purge scans the entire tree, and it is executed whenever a filesystem instance is created, which can happen many times during Hive, Spark and distcp jobs.

For this reason, this option may be removed in future; however, it has long been available in the S3A client and so is guaranteed to work across versions.
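A sketch of the relevant settings, assuming a purge age of 24 hours (the age value is in seconds):

```xml
<property>
  <name>fs.s3a.multipart.purge</name>
  <value>true</value>
</property>

<property>
  <name>fs.s3a.multipart.purge.age</name>
  <value>86400</value>
</property>
```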
`fs.s3a.directory.operations.purge.uploads`

When `fs.s3a.directory.operations.purge.uploads` is set, then whenever a directory is renamed or deleted, an attempt is made, in parallel with the rename/delete, to list all pending uploads under that path. If there are any, they are aborted (sequentially).
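A sketch of enabling it for all buckets; it can equally be set per-bucket with the `fs.s3a.bucket.BUCKETNAME.` prefix:

```xml
<property>
  <name>fs.s3a.directory.operations.purge.uploads</name>
  <value>true</value>
</property>
```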
The most common problems when talking to third-party stores are misconfigured endpoints, regions and credentials; the examples below walk through how these surface and how to fix them.
There are some very low-level logs which can be enabled:
```
# Log all HTTP requests made; includes S3 interaction. This may
# include sensitive information such as account IDs in HTTP headers.
log4j.logger.software.amazon.awssdk.request=DEBUG

# Turn on low level HTTP protocol debugging
log4j.logger.org.apache.http.wire=DEBUG

# async client
log4j.logger.io.netty.handler.logging=DEBUG
log4j.logger.io.netty.handler.codec.http2.Http2FrameLogger=DEBUG
```
By default, there is a lot of retrying in the AWS connector (which even retries on DNS failures) and in the S3A code which invokes it.

Normally this helps prevent long-lived jobs from failing due to a transient network problem; however, it means that when trying to debug connectivity problems, the commands can hang for a long time as they keep trying to reconnect to ports which are never going to be available. The per-bucket settings below reduce retries and timeouts so that failures surface quickly.
```xml
<property>
  <name>fs.iostatistics.logging.level</name>
  <value>info</value>
</property>

<property>
  <name>fs.s3a.bucket.nonexistent-bucket-example.attempts.maximum</name>
  <value>0</value>
</property>

<property>
  <name>fs.s3a.bucket.nonexistent-bucket-example.retry.limit</name>
  <value>1</value>
</property>

<property>
  <name>fs.s3a.bucket.nonexistent-bucket-example.connection.timeout</name>
  <value>500</value>
</property>

<property>
  <name>fs.s3a.bucket.nonexistent-bucket-example.connection.establish.timeout</name>
  <value>500</value>
</property>

<property>
  <name>fs.s3a.bucket.nonexistent-bucket-example.retry.http.5xx.errors</name>
  <value>false</value>
</property>
```
Setting the option `fs.s3a.retry.http.5xx.errors` to `false` stops the S3A client from treating 500 and other HTTP 5xx status codes (other than 501 and 503) as errors to retry on. With AWS S3 these are eventually recovered from; on a third-party store they may be caused by other, store-side problems which retrying may not resolve.
Disabling the S3A client’s retrying of these errors ensures that failures happen faster; the AWS SDK itself still makes a limited attempt to retry.
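As a sketch, the same option set globally rather than for a single bucket:

```xml
<property>
  <name>fs.s3a.retry.http.5xx.errors</name>
  <value>false</value>
</property>
```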
There's an external utility, cloudstore, whose `storediag` command exists to debug the connection settings to Hadoop cloud storage.
```
hadoop jar cloudstore-1.0.jar storediag s3a://nonexistent-bucket-example/
```
The main reason it's not an ASF release is that this allows for a rapid release cycle, sometimes hours; anyone who doesn't trust third-party code can download and build it themselves.
Connecting to the AWS endpoint rather than the internal store is the most common initial problem, as it happens by default. To fix, set `fs.s3a.endpoint` to the URL of the internal store.
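For example, reusing the example endpoint from earlier:

```xml
<property>
  <name>fs.s3a.endpoint</name>
  <value>https://storeendpoint.example.com</value>
</property>
```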
```
org.apache.hadoop.fs.s3a.UnknownStoreException:
  `s3a://nonexistent-bucket-example/': Bucket does not exist
```

Either the bucket doesn't exist, or the bucket does exist but the endpoint is still set to an AWS endpoint.

```
stat: `s3a://nonexistent-bucket-example/': Bucket does not exist
```
The hadoop filesystem commands don't log stack traces on failure, as that would add too much risk of breaking scripts; unfortunately, the output is very uninformative:
```
stat: nonexistent-bucket-example: getS3Region on nonexistent-bucket-example: software.amazon.awssdk.services.s3.model.S3Exception: null (Service: S3, Status Code: 403, Request ID: X26NWV0RJ1697SXF, Extended Request ID: bqq0rRm5Bdwt1oHSfmWaDXTfSOXoYvNhQxkhjjNAOpxhRaDvWArKCFAdL2hDIzgec6nJk1BVpJE=):null
```
It is possible to turn on debugging:

```
log4j.logger.org.apache.hadoop.fs.shell=DEBUG
```

After which useful stack traces are logged.
```
org.apache.hadoop.fs.s3a.UnknownStoreException: `s3a://nonexistent-bucket-example/': Bucket does not exist
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$null$3(S3AFileSystem.java:1075)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:376)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:372)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:347)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getS3Region$4(S3AFileSystem.java:1039)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:543)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:524)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:445)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2631)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getS3Region(S3AFileSystem.java:1038)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:982)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:622)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3452)
```
`S3Exception: null (Service: S3, Status Code: 403...` or `AccessDeniedException`

Possible causes include:

* `fs.s3a.endpoint.region` is unset.
* The client doesn't have any AWS credentials (from hadoop settings, environment variables or elsewhere); in this case the binding will fail even before the existence of the bucket can be probed for.
```
hadoop fs -stat s3a://nonexistent-bucket-example
```
```
stat: nonexistent-bucket-example: getS3Region on nonexistent-bucket-example: software.amazon.awssdk.services.s3.model.S3Exception: null (Service: S3, Status Code: 403, Request ID: X26NWV0RJ1697SXF, Extended Request ID: bqq0rRm5Bdwt1oHSfmWaDXTfSOXoYvNhQxkhjjNAOpxhRaDvWArKCFAdL2hDIzgec6nJk1BVpJE=):null
```
Or with a more detailed stack trace:
```
java.nio.file.AccessDeniedException: nonexistent-bucket-example: getS3Region on nonexistent-bucket-example: software.amazon.awssdk.services.s3.model.S3Exception: null (Service: S3, Status Code: 403, Request ID: X26NWV0RJ1697SXF, Extended Request ID: bqq0rRm5Bdwt1oHSfmWaDXTfSOXoYvNhQxkhjjNAOpxhRaDvWArKCFAdL2hDIzgec6nJk1BVpJE=):null
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:235)
    at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:124)
    at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:376)
    at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:372)
    at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:347)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getS3Region$4(S3AFileSystem.java:1039)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:543)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:524)
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:445)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2631)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getS3Region(S3AFileSystem.java:1038)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:982)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:622)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3452)
```
`Received an UnknownHostException when attempting to interact with a service`

The bucket's `fs.s3a.endpoint.region` setting is valid internally, but as the endpoint is still AWS, this region is not recognised. The S3A client's creation of an endpoint URL generates an unknown host.
```xml
<property>
  <name>fs.s3a.bucket.nonexistent-bucket-example.endpoint.region</name>
  <value>internal</value>
</property>
```
```
ls: software.amazon.awssdk.core.exception.SdkClientException: Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.: nonexistent-bucket-example.s3.internal.amazonaws.com: nodename nor servname provided, or not known
```
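The fix is to set the endpoint as well as the region, so that no AWS hostname is generated; a sketch reusing the example store endpoint from earlier:

```xml
<property>
  <name>fs.s3a.bucket.nonexistent-bucket-example.endpoint</name>
  <value>https://storeendpoint.example.com</value>
</property>
```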
`fs.s3a.path.style.access` is still set to `false`

The same `UnknownHostException` surfaces when:

* The `fs.s3a.endpoint.region` setting is valid internally,
* `fs.s3a.endpoint` is set to a hostname (not a URL), and
* `fs.s3a.path.style.access` is set to `false`.
```
ls: software.amazon.awssdk.core.exception.SdkClientException: Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.: nonexistent-bucket-example.localhost: nodename nor servname provided, or not known
```
Fix: enable path-style access:
```xml
<property>
  <name>fs.s3a.bucket.nonexistent-bucket-example.path.style.access</name>
  <value>true</value>
</property>
```
It is possible to connect to Google Cloud Storage through the S3A connector. However, Google provide their own Cloud Storage connector: a well-maintained Hadoop filesystem client which uses their XML API, and except for some very unusual cases, that is the connector to use.
Interacting with a GCS bucket through the S3A connector may make sense when:

* The installation doesn't have the gcs-connector JAR.
* The different credential mechanism may be convenient.
* There's a desire to use S3A Delegation Tokens to pass secrets with a job.
* There's a desire to use an external S3A extension (delegation tokens etc.).
The S3A connector binding works through Google Cloud Storage's S3-compatible API, which is a subset of the AWS API.
To get a compatible access and secret key, follow the instructions in "Simple migration from Amazon S3 to Cloud Storage".
Here are the per-bucket settings for an example bucket "gcs-container" in Google Cloud Storage. Note that the multiobject delete option must be disabled; this makes renaming and deleting significantly slower.
```xml
<configuration>

  <property>
    <name>fs.s3a.bucket.gcs-container.access.key</name>
    <value>GOOG1EZ....</value>
  </property>

  <property>
    <name>fs.s3a.bucket.gcs-container.secret.key</name>
    <value>SECRETS</value>
  </property>

  <property>
    <name>fs.s3a.bucket.gcs-container.endpoint</name>
    <value>https://storage.googleapis.com</value>
  </property>

  <property>
    <name>fs.s3a.bucket.gcs-container.bucket.probe</name>
    <value>0</value>
  </property>

  <property>
    <name>fs.s3a.bucket.gcs-container.list.version</name>
    <value>1</value>
  </property>

  <property>
    <name>fs.s3a.bucket.gcs-container.multiobjectdelete.enable</name>
    <value>false</value>
  </property>

  <property>
    <name>fs.s3a.bucket.gcs-container.path.style.access</name>
    <value>true</value>
  </property>

  <property>
    <name>fs.s3a.bucket.gcs-container.endpoint.region</name>
    <value>dummy</value>
  </property>

</configuration>
```
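Once these settings are in place, the bucket can be used with the normal filesystem commands, for example:

```
hadoop fs -ls s3a://gcs-container/
```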
This is a very rarely used configuration; however, it can be done, possibly as a way to interact with Google Cloud Storage in a deployment which lacks the GCS connector.

It is also a way to regression test foundational S3A third-party store compatibility if you lack access to any alternative. The test settings below disable the tests for features which such a store does not support:
```xml
<configuration>

  <property>
    <name>test.fs.s3a.encryption.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>fs.s3a.scale.test.csvfile</name>
    <value></value>
  </property>

  <property>
    <name>test.fs.s3a.sts.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>test.fs.s3a.content.encoding.enabled</name>
    <value>false</value>
  </property>

</configuration>
```
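A sketch of running the hadoop-aws integration tests with these settings, assuming the bucket and credentials are declared in the usual `auth-keys.xml` file; the thread count is arbitrary:

```bash
# from within hadoop-tools/hadoop-aws
mvn verify -Dparallel-tests -DtestsThreadCount=8
```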
Note: if anyone is set up to test this regularly, please let the Hadoop developer team know if regressions do surface, as it is not a common test configuration.