S3Guard was a feature for the S3A client of the S3 object store, which used a consistent database as the store of metadata about objects in an S3 bucket.
It was written between 2016 and 2020, when Amazon S3 was eventually consistent. It compensated for the following S3 inconsistencies:

* Newly created objects excluded from directory listings.
* Newly deleted objects retained in directory listings.
* Deleted objects still visible in existence probes and opening for reading.
* S3 load balancer 404 caching when a probe was made for an object before its creation.
It did not compensate for update inconsistency, though by storing the etag values of objects in the database, it could detect and report problems.
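That etag-based detection amounts to a simple comparison between the recorded and observed values. A minimal sketch of the idea, where `check_etag` is an invented helper for illustration, not S3Guard's actual code:

```python
# Hypothetical sketch: detecting update inconsistency by comparing the
# etag recorded in the metadata store against the one S3 returns.
def check_etag(stored_etag: str, observed_etag: str, path: str) -> None:
    """Raise if the object at `path` is not the version that was recorded."""
    if stored_etag != observed_etag:
        raise RuntimeError(
            f"Change reported at {path}: expected etag {stored_etag}, "
            f"found {observed_etag}")

# An object overwritten in place yields a new etag, so the mismatch
# can be detected and reported -- but not repaired.
check_etag("abc123", "abc123", "s3a://bucket/data.csv")  # consistent: no error
try:
    check_etag("abc123", "def456", "s3a://bucket/data.csv")
except RuntimeError as e:
    print(e)
```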
Now that S3 is consistent, there is no need for S3Guard at all. Accordingly, it was removed from the source in 2022 in HADOOP-17409, Remove S3Guard.
Attempting to create an S3A connector instance with S3Guard set to anything but the null or local metastores will now fail.
## S3Guard

S3Guard:

* Permitted a consistent view of the object store.
* Could improve performance on directory listing/scanning operations, including those which take place during the partitioning period of query execution: the process where files are listed and the work divided up amongst processes.
The basic idea was that, for each operation in the Hadoop S3 client (s3a) which reads or modifies metadata, a shadow copy of that metadata was stored in a separate `MetadataStore` implementation. The store was:

1. Updated after mutating operations on the store.
2. Updated after list operations against S3 discovered changes.
3. Looked up whenever a probe was made for a file/directory existing.
4. Queried for all objects under a path whenever a directory listing was made; the results were merged with the S3 listing in non-authoritative mode, or used exclusively in authoritative mode.
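The listing behaviour in the final step can be sketched as follows; `merged_listing` and its inputs are invented names for illustration, not S3Guard's real classes:

```python
# Illustrative sketch of the directory-listing merge described above.
def merged_listing(s3_listing, metastore_listing, authoritative: bool):
    """Combine an S3 LIST result with the metadata store's view.

    In authoritative mode the metadata store is trusted exclusively;
    otherwise the two listings are merged (union of paths).
    """
    if authoritative:
        return sorted(metastore_listing)
    return sorted(set(s3_listing) | set(metastore_listing))

# Under eventual consistency, a newly created object might appear only
# in the metadata store; the merge makes it visible in the listing.
s3 = ["dir/old.txt"]
meta = ["dir/old.txt", "dir/new.txt"]
print(merged_listing(s3, meta, authoritative=False))
# ['dir/new.txt', 'dir/old.txt']
```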
For links to early design documents and related patches, see HADOOP-13345.
## How to move off S3Guard
1. Unset the `fs.s3a.metadatastore.impl` option globally/for all buckets for which it was selected.
2. Once you are confident that all applications have been restarted, delete the DynamoDB table. This avoids paying for a database you no longer need. This can be done from the AWS GUI.
The `fs.s3a.metadatastore.impl` option must be one of:

* unset
* set to the empty string `""`
* set to the "Null" metadata store, `org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore`.
To aid the migration of external components which used the Local store for a consistent view within the test process, the Local metadata store option is also recognized: `org.apache.hadoop.fs.s3a.s3guard.LocalMetadataStore`. When this option is used, the S3A connector will warn and continue.
```xml
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value></value>
</property>
```

```xml
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore</value>
</property>
```
More AWS S3 calls may be made once S3Guard is disabled, both for LIST and HEAD operations.
While this may appear to increase costs, the DynamoDB table is no longer needed, so users save on its storage and usage charges.
Some deployments of Apache Hive declared their managed tables to be "authoritative": the S3 store was no longer checked when listing directories or for updates to entries; the S3Guard table in DynamoDB was used exclusively.
Without S3Guard, listing performance may be slower. However, Hadoop 3.3.0+ has significantly improved listing performance (HADOOP-17400 Optimize S3A for maximum performance in directory listings) so this should not be apparent.
The S3A auditing feature adds information to the S3 server logs about which jobs, users and filesystem operations have been making S3 requests. This auditing information can be used to identify opportunities to reduce load.
## s3guard bucket-info
Prints and optionally checks the status of a bucket.
```
hadoop s3guard bucket-info [-fips] [-magic] [-encryption ENCRYPTION] [-markers MARKER] s3a://BUCKET
```
### Options
| argument | meaning |
|---|---|
| `-fips` | Require the FIPS endpoint to be in use |
| `-magic` | Require the S3 filesystem to support the "magic" committer |
| `-markers` | Directory marker status: `aware`, `keep`, `delete`, `authoritative` |
| `-encryption <type>` | Require a specific encryption algorithm |
The server side encryption options are not directly related to S3Guard, but it is often convenient to check them at the same time.
### Example
```
> hadoop s3guard bucket-info -magic -markers keep s3a://test-london/

Filesystem s3a://test-london
Location: eu-west-2

S3A Client
	Signing Algorithm: fs.s3a.signing-algorithm=(unset)
	Endpoint: fs.s3a.endpoint=(unset)
	Encryption: fs.s3a.encryption.algorithm=none
	Input seek policy: fs.s3a.experimental.input.fadvise=normal
	Change Detection Source: fs.s3a.change.detection.source=etag
	Change Detection Mode: fs.s3a.change.detection.mode=server

S3A Committers
	The "magic" committer is supported in the filesystem
	S3A Committer factory class: mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
	S3A Committer name: fs.s3a.committer.name=magic
	Store magic committer integration: fs.s3a.committer.magic.enabled=true

Security
	Delegation token support is disabled

Directory Markers
	The directory marker policy is "keep"
	Available Policies: delete, keep, authoritative
	Authoritative paths: fs.s3a.authoritative.path=
```
## s3guard uploads
Lists or deletes all pending (uncompleted) multipart uploads older than given age.
```
hadoop s3guard uploads (-list | -abort | -expect <num-uploads>) [-verbose] \
    [-days <days>] [-hours <hours>] [-minutes <minutes>] [-seconds <seconds>] \
    [-force] s3a://bucket/prefix
```
The command lists or deletes all multipart uploads which are older than the given age, and that match the prefix supplied, if any.
For example, to delete all uncompleted multipart uploads older than two days in the folder at `s3a://my-bucket/path/to/stuff`, use the following command:

```
hadoop s3guard uploads -abort -days 2 s3a://my-bucket/path/to/stuff
```
We recommend running with `-list` first to confirm that the uploads shown are those you wish to delete. Note that the command will prompt "Are you sure?" unless you specify the `-force` option. This safeguards against accidental deletion of data, which is especially risky without a long age parameter, as it can affect in-flight uploads.
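The age options combine into a single cutoff time against which each upload's initiation time is compared. A minimal Python sketch of that filtering logic, assuming invented upload records and an invented `older_than` helper (not the tool's real code):

```python
# Sketch: select multipart uploads older than a given age for abort/list.
from datetime import datetime, timedelta, timezone

def older_than(uploads, days=0, hours=0, minutes=0, seconds=0, now=None):
    """Return the uploads initiated before (now - the given age)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days, hours=hours,
                             minutes=minutes, seconds=seconds)
    return [u for u in uploads if u["initiated"] < cutoff]

now = datetime(2024, 1, 10, tzinfo=timezone.utc)
uploads = [
    {"key": "path/to/stuff/a", "initiated": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"key": "path/to/stuff/b", "initiated": datetime(2024, 1, 9, tzinfo=timezone.utc)},
]
# With an age of 2 days, only the five-day-old upload qualifies; the
# one-day-old (possibly still in-flight) upload is left alone.
print([u["key"] for u in older_than(uploads, days=2, now=now)])
# ['path/to/stuff/a']
```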
The `-expect` option is similar to `-list`, except that it is silent by default, and terminates with a success or failure exit code depending on whether the supplied number matches the number of uploads found matching the given options (path, age).
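That exit-code contract can be illustrated with a tiny sketch; `expect_uploads` is a hypothetical helper, not part of the tool:

```python
# Sketch of the -expect semantics: exit status 0 only when the number of
# matching uploads equals the expected count.
def expect_uploads(found_count: int, expected: int) -> int:
    """Return 0 (success) on a match, non-zero (failure) otherwise."""
    return 0 if found_count == expected else 1

# e.g. a cleanup script asserting that no pending uploads remain:
print(expect_uploads(found_count=0, expected=0))
# 0
```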