Columns
-
pyspark_util.columns.prefix_columns(df, prefix, sep='_', exclude=[])
Prefix dataframe columns.
- Parameters
df (dataframe) – dataframe to be prefixed.
prefix (str) – string to add before each column.
sep (str, default '_') – separator to join prefix and each column with.
exclude (list of str, default []) – A selection of columns to exclude from being prefixed.
- Returns
dataframe with prefixed columns.
- Return type
dataframe
- Raises
ValueError – If exclude contains columns that don’t exist in the given dataframe.
Examples
>>> data = [(1, 2, 3)]
>>> columns = ['a', 'b', 'c']
>>> df = spark.createDataFrame(data, columns)
>>> df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.prefix_columns(df, 'x').show()
+---+---+---+
|x_a|x_b|x_c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.prefix_columns(df, 'x', sep='|').show()
+---+---+---+
|x|a|x|b|x|c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.prefix_columns(df, 'x', exclude=['b', 'c']).show()
+---+---+---+
|x_a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+
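The naming rule shown in these examples can be sketched in plain Python on a list of column names. This is a minimal sketch under stated assumptions, not the library's implementation: `prefixed_names` is a hypothetical helper, and the error message text is invented for illustration.

```python
def prefixed_names(columns, prefix, sep="_", exclude=()):
    """Hypothetical helper: compute the prefixed column names.

    Mirrors the documented behaviour: names listed in `exclude` are kept
    unchanged, and names in `exclude` that are not in `columns` raise
    ValueError.
    """
    missing = [c for c in exclude if c not in columns]
    if missing:
        # Assumed message; the library's actual wording is not documented here.
        raise ValueError(f"columns not found in dataframe: {missing}")
    skip = set(exclude)
    return [c if c in skip else f"{prefix}{sep}{c}" for c in columns]


names = prefixed_names(["a", "b", "c"], "x", exclude=["b", "c"])
print(names)  # ['x_a', 'b', 'c']
```

With a real dataframe, a list like this could be applied with `df.toDF(*names)`, the standard PySpark method for replacing all column names at once. The same sketch covers suffixing if the format string is changed to `f"{c}{sep}{suffix}"`.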
-
pyspark_util.columns.suffix_columns(df, suffix, sep='_', exclude=[])
Suffix dataframe columns.
- Parameters
df (dataframe) – dataframe to be suffixed.
suffix (str) – string to add after each column.
sep (str, default '_') – separator to join each column and suffix with.
exclude (list of str, default []) – A selection of columns to exclude from being suffixed.
- Returns
dataframe with suffixed columns.
- Return type
dataframe
- Raises
ValueError – If exclude contains columns that don’t exist in the given dataframe.
Examples
>>> data = [(1, 2, 3)]
>>> columns = ['a', 'b', 'c']
>>> df = spark.createDataFrame(data, columns)
>>> df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.suffix_columns(df, 'x').show()
+---+---+---+
|a_x|b_x|c_x|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.suffix_columns(df, 'x', sep='|').show()
+---+---+---+
|a|x|b|x|c|x|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.suffix_columns(df, 'x', exclude=['b', 'c']).show()
+---+---+---+
|a_x|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+
-
pyspark_util.columns.rename_columns(df, mapper)
Rename dataframe columns.
- Parameters
df (dataframe) – dataframe to be renamed.
mapper (dict) – dictionary with old names as keys and new names as values.
- Returns
dataframe with renamed columns.
- Return type
dataframe
- Raises
ValueError – If mapper contains columns that don’t exist in the given dataframe.
Examples
>>> data = [(1, 2, 3)]
>>> columns = ['a', 'b', 'c']
>>> df = spark.createDataFrame(data, columns)
>>> df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.rename_columns(df, {'a': 'x'}).show()
+---+---+---+
|  x|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+
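The mapping rule above, including the documented ValueError for unknown keys, can be sketched on a plain list of column names. `renamed_names` is a hypothetical helper written for illustration; how the library performs the rename internally is not stated here, and the error message is an assumption.

```python
def renamed_names(columns, mapper):
    """Hypothetical helper: apply an old-name -> new-name mapping.

    Columns missing from `mapper` keep their names; keys of `mapper`
    that are not present in `columns` raise ValueError, as documented.
    """
    missing = [old for old in mapper if old not in columns]
    if missing:
        # Assumed message; the library's actual wording may differ.
        raise ValueError(f"columns not found in dataframe: {missing}")
    return [mapper.get(c, c) for c in columns]


print(renamed_names(["a", "b", "c"], {"a": "x"}))  # ['x', 'b', 'c']
```

Validating the keys up front means a misspelled column name fails loudly instead of silently leaving the dataframe unchanged.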
-
pyspark_util.columns.select_columns_regex(df, regex)
Select columns that match a given regular expression.
- Parameters
df (dataframe) – dataframe to be selected from.
regex (str) – regular expression.
- Returns
dataframe with matched columns.
- Return type
dataframe
Examples
>>> data = [(1, 2)]
>>> columns = ['abc', '123']
>>> df = spark.createDataFrame(data, columns)
>>> df.show()
+---+---+
|abc|123|
+---+---+
|  1|  2|
+---+---+

>>> psu.select_columns_regex(df, r'[a-z]+').show()
+---+
|abc|
+---+
|  1|
+---+

>>> psu.select_columns_regex(df, r'[0-9]+').show()
+---+
|123|
+---+
|  2|
+---+
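The selection step can be sketched as a filter over column names. This is a sketch only: `matching_columns` is a hypothetical helper, and it assumes `re.search` semantics (a match anywhere in the name). The documentation above does not say whether the library matches anywhere, at the start, or against the whole name, and the examples behave the same under all three.

```python
import re


def matching_columns(columns, regex):
    """Hypothetical helper: keep the column names that match `regex`.

    Assumption: a column is kept if the pattern matches anywhere in its
    name (re.search). Anchor the pattern with ^ and $ for whole-name matching.
    """
    pattern = re.compile(regex)
    return [c for c in columns if pattern.search(c)]


print(matching_columns(["abc", "123"], r"[a-z]+"))  # ['abc']
print(matching_columns(["abc", "123"], r"[0-9]+"))  # ['123']
```

With a real dataframe, the resulting names could be passed to `df.select(*names)`. Under the `re.search` assumption a mixed name such as `a1` would match both patterns, so anchoring the pattern is worth considering when names mix letters and digits.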