Logging in PySpark#
Introduction#
The pyspark.logger module facilitates structured client-side logging for PySpark users.
This module includes a PySparkLogger class that provides several methods for logging messages at different levels in a structured JSON format.
The logger can be easily configured to write logs to either the console or a specified file.
Customizing Log Format#
The default log format is JSON, which includes the timestamp, log level, logger name, and the log message along with any additional context provided.
Example log entry:
{
"ts": "2024-06-28 19:53:48,563",
"level": "ERROR",
"logger": "DataFrameQueryContextLogger",
"msg": "[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"divide\" was called from\n/.../spark/python/test_error_context.py:17\n",
"context": {
"file": "/path/to/file.py",
"line": "17",
"fragment": "divide"
"errorClass": "DIVIDE_BY_ZERO"
},
"exception": {
"class": "Py4JJavaError",
"msg": "An error occurred while calling o52.showString.\n: org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"divide\" was called from\n/path/to/file.py:17 ...",
"stacktrace": ["Traceback (most recent call last):", " File \".../spark/python/pyspark/errors/exceptions/captured.py\", line 247, in deco", " return f(*a, **kw)", " File \".../lib/python3.9/site-packages/py4j/protocol.py\", line 326, in get_return_value" ...]
}
}
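While JSON is the default, PySparkLogger works with standard Python logging handlers, so one way to customize the format is to attach a handler carrying your own logging.Formatter. The sketch below is a minimal, illustrative example (the logger name and format string are not PySpark defaults); note that the attached handler emits plain-text records in addition to any default JSON output.
import logging

from pyspark.logger import PySparkLogger

# Illustrative sketch: attach a handler with a plain-text formatter.
logger = PySparkLogger.getLogger("CustomFormatLogger")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s")
)
logger.addHandler(handler)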
Setting Up#
To start using the PySpark logging module, import the PySparkLogger class from the pyspark.logger module:
from pyspark.logger import PySparkLogger
Usage#
Creating a Logger#
You can create a logger instance by calling PySparkLogger.getLogger(). By default, it creates a logger named “PySparkLogger” with an INFO log level.
logger = PySparkLogger.getLogger()
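Assuming PySparkLogger follows standard Python logging semantics (as its addHandler() support, shown later, suggests), you can override these defaults by passing a custom name to getLogger() and adjusting the threshold with the standard setLevel() method. The logger name below is illustrative:
import logging

from pyspark.logger import PySparkLogger

# Illustrative sketch: a named logger with a DEBUG threshold.
logger = PySparkLogger.getLogger("MyAppLogger")
logger.setLevel(logging.DEBUG)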
Logging Messages#
The logger provides three main methods for logging messages: PySparkLogger.info(), PySparkLogger.warning() and PySparkLogger.error().
PySparkLogger.info: Use this method to log informational messages.
user = "test_user" action = "login" logger.info(f"User {user} performed {action}", user=user, action=action)
PySparkLogger.warning: Use this method to log warning messages.
user = "test_user" action = "access" logger.warning("User {user} attempted an unauthorized {action}", user=user, action=action)
PySparkLogger.error: Use this method to log error messages.
user = "test_user" action = "update_profile" logger.error("An error occurred for user {user} during {action}", user=user, action=action)
Logging to Console#
from pyspark.logger import PySparkLogger
# Create a logger that logs to console
logger = PySparkLogger.getLogger("ConsoleLogger")
user = "test_user"
action = "test_action"
logger.warning(f"User {user} takes an {action}", user=user, action=action)
This logs a warning message in the following JSON format:
{
"ts": "2024-06-28 19:44:19,030",
"level": "WARNING",
"logger": "ConsoleLogger",
"msg": "User test_user takes an test_action",
"context": {
"user": "test_user",
"action": "test_action"
}
}
Logging to a File#
To log messages to a file, use PySparkLogger.addHandler() to add a FileHandler from the standard Python logging module to your logger.
This approach aligns with standard Python logging practices.
from pyspark.logger import PySparkLogger
import logging
# Create a logger that logs to a file
file_logger = PySparkLogger.getLogger("FileLogger")
handler = logging.FileHandler("application.log")
file_logger.addHandler(handler)
user = "test_user"
action = "test_action"
file_logger.warning(f"User {user} takes action {action}", user=user, action=action)
The log messages will be saved in application.log in the same JSON format.
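The example entry in “Customizing Log Format” also shows an exception block. Whether that block is populated automatically depends on PySpark internals, so one portable pattern is to catch the exception yourself and pass its details as keyword arguments, which appear under the context field of the JSON entry. The logger name and context key below are illustrative:
from pyspark.logger import PySparkLogger

logger = PySparkLogger.getLogger("ExceptionLogger")

try:
    result = 1 / 0  # illustrative failure
except ZeroDivisionError as e:
    # Keyword arguments appear under "context" in the JSON entry.
    logger.error(f"Computation failed: {e}", error_type=type(e).__name__)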