Spark SQL Parameters

Introduction

// List the Spark SQL configuration parameters and their current values
spark.sql("SET -v").show(numRows = 200, truncate = false)

Full List of Spark SQL Parameters

key value meaning
spark.sql.hive.version 1.2.1 Version of Hive used internally by Spark SQL.
spark.sql.hive.metastore.barrierPrefixes A comma separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a prefix that typically would be shared (i.e. org.apache.spark.*).
spark.sql.shuffle.partitions 200 The default number of partitions to use when shuffling data for joins or aggregations.
spark.sql.hive.metastorePartitionPruning true When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information).
spark.sql.streaming.metricsEnabled false Whether Dropwizard/Codahale metrics will be reported for active streaming queries.
spark.sql.statistics.fallBackToHdfs false If the table statistics are not available from table metadata, enable falling back to HDFS. This is useful in determining if a table is small enough to use auto broadcast joins.
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc A comma separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared. For example, custom appenders that are used by log4j.
spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER Sets the action to take when a case-sensitive schema cannot be read from a Hive table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file formats such as Parquet are. Spark SQL must use a case-preserving schema when querying any table backed by files containing case-sensitive field names, or queries may not return accurate results. Valid options include INFER_AND_SAVE (the default mode: infer the case-sensitive schema from the underlying data files and write it back to the table properties), INFER_ONLY (infer the schema but don't attempt to write it to the table properties) and NEVER_INFER (fall back to using the case-insensitive metastore schema instead of inferring).
spark.sql.parquet.respectSummaryFiles false When true, we make the assumption that all part-files of Parquet are consistent with summary files and we will ignore them when merging schemas. Otherwise, if this is false, which is the default, we will merge all part-files. This should be considered an expert-only option, and shouldn't be enabled before knowing exactly what it means.
spark.sql.warehouse.dir /user/hive/warehouse The default location for managed databases and tables.
spark.sql.orderByOrdinal true When true, the ordinal numbers are treated as the position in the select list. When false, the ordinal numbers in order/sort by clause are ignored.
spark.sql.adaptive.shuffle.targetPostShuffleInputSize 67108864b The target post-shuffle input size in bytes of a task.
spark.sql.parquet.compression.codec snappy Sets the compression codec used when writing Parquet files. Acceptable values include: uncompressed, snappy, gzip, lzo.
spark.sql.crossJoin.enabled false When false, we will throw an error if a query contains a cartesian product without explicit CROSS JOIN syntax.
spark.sql.thriftserver.ui.retainedStatements 200 The number of SQL statements kept in the JDBC/ODBC web UI history.
spark.sql.hive.convertMetastoreParquet.mergeSchema false When true, also tries to merge possibly different but compatible Parquet schemas in different Parquet data files. This configuration is only effective when “spark.sql.hive.convertMetastoreParquet” is true.
spark.sql.parquet.enableVectorizedReader true Enables vectorized parquet decoding.
spark.sql.parquet.mergeSchema false When true, the Parquet data source merges schemas collected from all data files, otherwise the schema is picked from the summary file or a random data file if no summary file is available.
spark.sql.parquet.binaryAsString false Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
spark.sql.columnNameOfCorruptRecord _corrupt_record The name of internal column for storing raw/un-parsed JSON records that fail to parse.
spark.sql.files.maxPartitionBytes 134217728 The maximum number of bytes to pack into a single partition when reading files.
spark.sql.hive.filesourcePartitionFileCacheSize 262144000 When nonzero, enable caching of partition file metadata in memory. All tables share a cache that can use up to the specified number of bytes for file metadata. This conf only has an effect when Hive filesource partition management is enabled.
spark.sql.autoBroadcastJoinThreshold 10485760 Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run, and file-based data source tables where the statistics are computed directly on the files of data.
spark.sql.pivotMaxValues 10000 When doing a pivot without specifying values for the pivot column this is the maximum number of (distinct) values that will be collected without error.
spark.sql.hive.metastore.jars builtin Location of the jars that should be used to instantiate the HiveMetastoreClient. This property can be one of three options: 1. "builtin": Use Hive 1.2.1, which is bundled with the Spark assembly when -Phive is enabled. When this option is chosen, spark.sql.hive.metastore.version must be either 1.2.1 or not defined. 2. "maven": Use Hive jars of specified version downloaded from Maven repositories. 3. A classpath in the standard format for both Hive and Hadoop.
spark.sql.sources.parallelPartitionDiscovery.threshold 32 The maximum number of files allowed for listing files at driver side. If the number of detected files exceeds this value during partition discovery, it tries to list the files with another Spark distributed job. This applies to Parquet, ORC, CSV, JSON and LibSVM data sources.
spark.sql.broadcastTimeout 300 Timeout in seconds for the broadcast wait time in broadcast joins.
spark.sql.sources.bucketing.enabled true When false, we will treat bucketed tables as normal tables.
spark.sql.optimizer.metadataOnly true When true, enable the metadata-only query optimization that uses the table's metadata to produce the partition columns instead of table scans. It applies when all the columns scanned are partition columns and the query has an aggregate operator that satisfies distinct semantics.
spark.sql.parquet.filterPushdown true Enables Parquet filter push-down optimization when set to true.
spark.sql.adaptive.enabled false When true, enable adaptive query execution.
spark.sql.parquet.cacheMetadata true Turns on caching of Parquet schema metadata. Can speed up querying of static data.
spark.sql.hive.convertMetastoreParquet false When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built-in support.
spark.sql.groupByOrdinal true When true, the ordinal numbers in group by clauses are treated as the position in the select list. When false, the ordinal numbers are ignored.
spark.sql.hive.thriftServer.async true When set to true, Hive Thrift server executes SQL queries in an asynchronous way.
spark.sql.thriftserver.scheduler.pool Set a Fair Scheduler pool for a JDBC client session.
spark.sql.orc.filterPushdown false When true, enable filter pushdown for ORC files.
spark.sql.sources.default parquet The default data source to use in input/output.
spark.sql.hive.metastore.version 1.2.1 Version of the Hive metastore. Available options are 0.12.0 through 1.2.1.
spark.sql.parquet.writeLegacyFormat false Whether to follow Parquet’s format specification when converting Parquet schema to Spark SQL schema and vice versa.
spark.sql.hive.verifyPartitionPath false When true, check all the partition paths under the table’s root directory when reading data stored in HDFS.
spark.sql.streaming.numRecentProgressUpdates 100 The number of progress updates to retain for a streaming query
spark.sql.variable.substitute true This enables substitution using syntax like ${var}, ${system:var} and ${env:var}.
spark.sql.files.ignoreCorruptFiles false Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted or non-existing files, and the contents that have been read will still be returned.
spark.sql.hive.manageFilesourcePartitions true When true, enable metastore partition management for file source tables as well. This includes both datasource and converted Hive tables. When partition management is enabled, datasource tables store partitions in the Hive metastore, and use the metastore to prune partitions during query planning.
spark.sql.streaming.checkpointLocation The default location for storing checkpoint data for streaming queries.
spark.sql.parquet.int96AsTimestamp true Some Parquet-producing systems, in particular Impala, store Timestamp into INT96. Spark would also store Timestamp as INT96 because we need to avoid precision loss of the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
spark.sql.sources.partitionColumnTypeInference.enabled true When true, automatically infer the data types for partitioned columns.
spark.sql.thriftserver.ui.retainedSessions 200 The number of SQL client sessions kept in the JDBC/ODBC web UI history.
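
Many of the parameters above can also be supplied when the SparkSession is built (or via --conf on spark-submit) instead of being changed at runtime. A minimal sketch with illustrative values only, not tuning recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sql-conf-example")                                          // hypothetical application name
  .config("spark.sql.shuffle.partitions", "400")                        // partitions used for joins/aggregations
  .config("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)    // broadcast tables up to ~50 MB
  .config("spark.sql.parquet.compression.codec", "gzip")                // codec used when writing Parquet
  .getOrCreate()

// Verify the effective values
spark.sql("SET -v").show(numRows = 200, truncate = false)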