Columns

The examples on this page assume that spark is an active SparkSession and that psu is the pyspark_util package imported under that alias.

pyspark_util.columns.prefix_columns(df, prefix, sep='_', exclude=[])

Add a prefix to the column names of a dataframe.

Parameters

df (dataframe) – dataframe whose columns are to be prefixed.
prefix (str) – string to prepend to each column name.
sep (str, default '_') – separator placed between the prefix and each column name.
exclude (list of str, default []) – names of columns to leave unprefixed.

Returns

dataframe with prefixed columns.

Return type

dataframe

Raises

ValueError – If exclude contains columns that don’t exist in the given dataframe.

Examples

>>> data = [(1, 2, 3)]
>>> columns = ['a', 'b', 'c']
>>> df = spark.createDataFrame(data, columns)
>>> df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.prefix_columns(df, 'x').show()
+---+---+---+
|x_a|x_b|x_c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.prefix_columns(df, 'x', sep='|').show()
+---+---+---+
|x|a|x|b|x|c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.prefix_columns(df, 'x', exclude=['b', 'c']).show()
+---+---+---+
|x_a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+
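
The naming rule shown in the examples above can be sketched in plain Python, without Spark. The helper prefixed_names below is hypothetical and is not part of pyspark_util; it only illustrates how sep and exclude appear to combine, and assumes the documented ValueError is raised before any renaming happens.

```python
def prefixed_names(columns, prefix, sep="_", exclude=()):
    """Return the column names after prefixing.

    Hypothetical helper for illustration only; not part of pyspark_util.
    """
    # Assumption: unknown names in `exclude` are rejected up front,
    # mirroring the documented ValueError.
    missing = [c for c in exclude if c not in columns]
    if missing:
        raise ValueError(f"columns not found in dataframe: {missing}")
    # Excluded columns keep their name; all others get prefix + sep.
    return [c if c in exclude else f"{prefix}{sep}{c}" for c in columns]


print(prefixed_names(["a", "b", "c"], "x"))
print(prefixed_names(["a", "b", "c"], "x", sep="|"))
print(prefixed_names(["a", "b", "c"], "x", exclude=["b", "c"]))
```

The three calls correspond to the three prefix_columns examples above and yield the same column names as the table headers shown there.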

pyspark_util.columns.suffix_columns(df, suffix, sep='_', exclude=[])

Add a suffix to the column names of a dataframe.

Parameters

df (dataframe) – dataframe whose columns are to be suffixed.
suffix (str) – string to append to each column name.
sep (str, default '_') – separator placed between each column name and the suffix.
exclude (list of str, default []) – names of columns to leave unsuffixed.

Returns

dataframe with suffixed columns.

Return type

dataframe

Raises

ValueError – If exclude contains columns that don’t exist in the given dataframe.

Examples

>>> data = [(1, 2, 3)]
>>> columns = ['a', 'b', 'c']
>>> df = spark.createDataFrame(data, columns)
>>> df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.suffix_columns(df, 'x').show()
+---+---+---+
|a_x|b_x|c_x|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.suffix_columns(df, 'x', sep='|').show()
+---+---+---+
|a|x|b|x|c|x|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.suffix_columns(df, 'x', exclude=['b', 'c']).show()
+---+---+---+
|a_x|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+
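
One plausible use of exclude is to keep a join key untouched while disambiguating the remaining columns of two dataframes that share names. The sketch below works on plain lists of column names rather than Spark dataframes; suffixed_names and the column names id, name and score are hypothetical and chosen only for illustration.

```python
def suffixed_names(columns, suffix, sep="_", exclude=()):
    """Return the column names after suffixing.

    Hypothetical helper for illustration only; not part of pyspark_util.
    """
    # Assumption: unknown names in `exclude` raise ValueError, as documented.
    missing = [c for c in exclude if c not in columns]
    if missing:
        raise ValueError(f"columns not found in dataframe: {missing}")
    # Excluded columns keep their name; all others get sep + suffix.
    return [c if c in exclude else f"{c}{sep}{suffix}" for c in columns]


# Two inputs with identical column names (made-up example schema).
left = ["id", "name", "score"]
right = ["id", "name", "score"]

left_names = suffixed_names(left, "l", exclude=["id"])
right_names = suffixed_names(right, "r", exclude=["id"])

print(left_names)
print(right_names)
# Only the excluded join key is still shared between the two sides.
print(set(left_names) & set(right_names))
```

After suffixing, the join key is the only name the two sides still have in common, so a join on it would not produce ambiguous column names.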

pyspark_util.columns.rename_columns(df, mapper)

Rename dataframe columns.

Parameters

df (dataframe) – dataframe whose columns are to be renamed.
mapper (dict) – dictionary mapping old column names (keys) to new column names (values).

Returns

dataframe with renamed columns.

Return type

dataframe

Raises

ValueError – If mapper contains columns that don’t exist in the given dataframe.

Examples

>>> data = [(1, 2, 3)]
>>> columns = ['a', 'b', 'c']
>>> df = spark.createDataFrame(data, columns)
>>> df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

>>> psu.rename_columns(df, {'a': 'x'}).show()
+---+---+---+
|  x|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+
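
The mapping behaviour, including the documented error case, can be sketched on a plain list of column names. The helper renamed_names is hypothetical and not part of pyspark_util; it assumes that columns absent from mapper keep their names, as the example above suggests, and that the keys of mapper are what gets checked against the dataframe.

```python
def renamed_names(columns, mapper):
    """Return the column names after applying `mapper`.

    Hypothetical helper for illustration only; not part of pyspark_util.
    """
    # Assumption: every key of `mapper` must be an existing column,
    # otherwise ValueError is raised as documented.
    missing = [old for old in mapper if old not in columns]
    if missing:
        raise ValueError(f"columns not found in dataframe: {missing}")
    # Columns without an entry in `mapper` are left unchanged.
    return [mapper.get(c, c) for c in columns]


print(renamed_names(["a", "b", "c"], {"a": "x"}))

try:
    renamed_names(["a", "b", "c"], {"z": "x"})
except ValueError as err:
    print("rejected:", err)
```

Validating the whole mapper before renaming anything means a typo in one key fails loudly instead of being silently ignored.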

pyspark_util.columns.select_columns_regex(df, regex)

Select columns whose names match a given regular expression.

Parameters

df (dataframe) – dataframe to select columns from.
regex (str) – regular expression to match column names against.

Returns

dataframe containing only the matched columns.

Return type

dataframe

Examples

>>> data = [(1, 2)]
>>> columns = ['abc', '123']
>>> df = spark.createDataFrame(data, columns)
>>> df.show()
+---+---+
|abc|123|
+---+---+
|  1|  2|
+---+---+

>>> psu.select_columns_regex(df, r'[a-z]+').show()
+---+
|abc|
+---+
|  1|
+---+

>>> psu.select_columns_regex(df, r'[0-9]+').show()
+---+
|123|
+---+
|  2|
+---+
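
The description does not say whether the pattern must match the whole column name or only part of it, and the examples above are consistent with either reading. The sketch below, which uses only Python's re module on a plain list of names, shows where the two readings would differ. The helper matching_names, its full flag, and the extra column abc_1 are hypothetical and not part of pyspark_util.

```python
import re


def matching_names(columns, regex, full=False):
    """Return the names in `columns` that match `regex`.

    Hypothetical helper for illustration only; not part of pyspark_util.
    With full=False the pattern may match anywhere in the name (re.search);
    with full=True it must match the entire name (re.fullmatch).
    """
    pattern = re.compile(regex)
    test = pattern.fullmatch if full else pattern.search
    return [c for c in columns if test(c)]


cols = ["abc", "123", "abc_1"]

# Partial matching: any name containing a lowercase letter qualifies.
print(matching_names(cols, r"[a-z]+"))
# Whole-name matching: only names made up entirely of lowercase letters.
print(matching_names(cols, r"[a-z]+", full=True))
```

If the distinction matters, anchoring the pattern explicitly (for example r'^[a-z]+$') gives the same result under both interpretations.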