Apache Hadoop 3.5.0-SNAPSHOT is an update to the Hadoop 3.4.x release branch.
Users are encouraged to read the full set of release notes. This page provides an overview of the major changes.
HADOOP-18679 Bulk Delete API.
This release provides an API to perform bulk delete of files/objects in an object store or filesystem.
HADOOP-19083 provide hadoop binary tarball without aws v2 sdk
Hadoop has added a new variant of the binary distribution tarball, labeled with “lean” in the file name. This tarball excludes the full AWS SDK v2 bundle, which reduces the file size by approximately 50%.
Improvement
HADOOP-18886 S3A: AWS SDK V2 Migration: stabilization and S3Express
This release completes stabilization efforts on the AWS SDK v2 migration and adds support for Amazon S3 Express One Zone storage. S3 Select is no longer supported.
HADOOP-18993 S3A: Add option fs.s3a.classloader.isolation (#6301)
This introduces configuration property fs.s3a.classloader.isolation, which defaults to true. Set to false to disable S3A classloader isolation, which can be useful for installing custom credential providers in user-provided jars.
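As a rough sketch of how this might look in core-site.xml, assuming a custom credential provider shipped in a user-provided jar. The class name com.example.auth.CustomCredentialsProvider is hypothetical, and fs.s3a.aws.credentials.provider is the existing S3A option for listing credential providers rather than something introduced by this change:

```xml
<!-- Disable S3A classloader isolation so classes in user-provided jars can be resolved. -->
<property>
  <name>fs.s3a.classloader.isolation</name>
  <value>false</value>
</property>

<!-- Hypothetical provider class, shown for illustration only. -->
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.example.auth.CustomCredentialsProvider</value>
</property>
```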
HADOOP-19047 Support InMemory Tracking Of S3A Magic Commits
The S3A magic committer now supports configuration property fs.s3a.committer.magic.track.commits.in.memory.enabled. Set this to true to track commits in memory instead of on the file system, which reduces the number of remote calls.
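A minimal configuration sketch, assuming the magic committer is selected through fs.s3a.committer.name (an existing S3A committer option, not part of this change):

```xml
<!-- Assumption: the job uses the S3A magic committer. -->
<property>
  <name>fs.s3a.committer.name</name>
  <value>magic</value>
</property>

<!-- Track pending commits in memory rather than as files in the store. -->
<property>
  <name>fs.s3a.committer.magic.track.commits.in.memory.enabled</name>
  <value>true</value>
</property>
```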
HADOOP-19161 S3A: option “fs.s3a.performance.flags” to take list of performance flags
S3A now supports configuration property fs.s3a.performance.flags for controlling activation of multiple performance optimizations. Refer to the S3A performance documentation for details.
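For illustration, the property takes a comma-separated list of flags. The flag names create and mkdir below are assumptions recalled from the S3A performance documentation, so check that documentation for the set supported by your release:

```xml
<!-- Flag names are assumptions; consult the S3A performance documentation. -->
<property>
  <name>fs.s3a.performance.flags</name>
  <value>create, mkdir</value>
</property>
```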
Improvement
HADOOP-18516 [ABFS]: Support fixed SAS token config in addition to Custom SASTokenProvider Implementation
ABFS now supports authentication via a fixed Shared Access Signature token. Refer to ABFS documentation of configuration property fs.azure.sas.fixed.token for details.
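A sketch of what a fixed-token setup might look like, assuming SAS authentication is selected through the existing fs.azure.account.auth.type option. The token value is a placeholder, not a working credential:

```xml
<!-- Assumption: SAS is selected as the authentication type for the account. -->
<property>
  <name>fs.azure.account.auth.type</name>
  <value>SAS</value>
</property>

<!-- Placeholder value; supply a real SAS token, preferably via a credential provider. -->
<property>
  <name>fs.azure.sas.fixed.token</name>
  <value>PLACEHOLDER_SAS_TOKEN</value>
</property>
```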
HADOOP-19089 [ABFS] Reverting Back Support of setXAttr() and getXAttr() on root path
HADOOP-18869 previously implemented support for xattrs on the root path in the 3.4.0 release. Support for this has been removed in 3.4.1 to prevent the need for calling container APIs.
HADOOP-19178 WASB Driver Deprecation and eventual removal
This release announces deprecation of the WASB file system in favor of ABFS. Refer to ABFS documentation for additional guidance.
Bug
HADOOP-18542 Azure Token provider requires tenant and client IDs despite being optional
It is no longer necessary to specify a tenant and client ID in configuration for MSI authentication when running in an Azure instance.
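With this fix, an MSI configuration on an Azure instance might reduce to something like the following sketch. The property names and the provider class are taken from the ABFS OAuth options as best recalled, so treat them as assumptions and verify them against the ABFS documentation:

```xml
<!-- Assumption: OAuth with the managed identity (MSI) token provider. -->
<property>
  <name>fs.azure.account.auth.type</name>
  <value>OAuth</value>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value>org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider</value>
</property>
<!-- No tenant or client ID entries: these are now optional for MSI. -->
```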
A lot of dependencies have been upgraded to address recent CVEs. Many of the CVEs were not actually exploitable through Hadoop, so much of this work is just due diligence. However, applications which have all the libraries on their classpath may be vulnerable, and the upgrades should also reduce the number of false positives that security scanners report.
We have not been able to upgrade every single dependency to the latest version there is. Some of those changes are fundamentally incompatible. If you have concerns about the state of a specific library, consult the Apache JIRA issue tracker to see whether an issue has been filed, whether discussions have taken place about the library in question, and whether there is already a fix in the pipeline. Please don’t file new JIRAs about dependency-X.Y.Z having a CVE without searching for any existing issue first.
As an open-source project, contributions in this area are always welcome, especially in testing the active branches, testing applications downstream of those branches, and checking whether updated dependencies trigger regressions.
Hadoop HDFS is a distributed filesystem allowing remote callers to read and write data.
Hadoop YARN is a distributed job submission/execution engine allowing remote callers to submit arbitrary work into the cluster.
Unless a Hadoop cluster is deployed with caller authentication with Kerberos, anyone with network access to the servers has unrestricted access to the data and the ability to run whatever code they want in the system.
In production, there are generally three deployment patterns which can, with care, keep data and computing resources private.
1. Physical cluster: configure Hadoop security, usually bonded to the enterprise Kerberos/Active Directory systems. Good.
2. Cloud: transient or persistent single or multiple user/tenant cluster with private VLAN and security. Good. Consider Apache Knox for managing remote access to the cluster.
3. Cloud: transient single user/tenant cluster with private VLAN and no security at all. Requires careful network configuration as this is the sole means of securing the cluster. Consider Apache Knox for managing remote access to the cluster.
If you deploy a Hadoop cluster in-cloud without security, and without configuring a VLAN to restrict access to trusted users, you are implicitly sharing your data and computing resources with anyone with network access.
If you do deploy an insecure cluster this way then port scanners will inevitably find it and submit crypto-mining jobs. If this happens to you, please do not report this as a CVE or security issue: it is utterly predictable. Secure your cluster if you want it to remain exclusively your cluster.
Finally, if you are using Hadoop as a service deployed/managed by someone else, do determine what security their products offer and make sure it meets your requirements.
The Hadoop documentation includes the information you need to get started using Hadoop. Begin with the Single Node Setup which shows you how to set up a single-node Hadoop installation. Then move on to the Cluster Setup to learn how to set up a multi-node Hadoop installation.
Before deploying Hadoop in production, read Hadoop in Secure Mode, and follow its instructions to secure your cluster.
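As a starting point only, enabling Kerberos authentication typically begins with core-site.xml entries along these lines. This is a sketch, not a complete secure-mode setup: principals, keytabs, and the per-service settings described in Hadoop in Secure Mode are still required.

```xml
<!-- Switch from the default "simple" authentication to Kerberos. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>

<!-- Enable service-level authorization checks. -->
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```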