Sunday, September 8, 2024

Pandas.DataFrame's method of handling outliers

df.fillna()

value: The value used to fill missing entries. It can be a scalar, dict, Series, or DataFrame. A dict, Series, or DataFrame lets you give each column its own fill value.
method: Used when no value is passed. ffill (alias pad) propagates the last valid value forward; bfill (alias backfill) fills backward using the next valid value.
axis: The axis along which to fill. As an integer, it can be 0 or 1; as a string, “index” or “columns”.
limit: An integer. When method is given, it is the maximum number of consecutive NaN values to fill forward or backward. When only value is given, it is the maximum number of NaN entries filled along the axis.
downcast: A dict specifying which dtype to downcast to, where possible, for example float64 to int64.

import numpy as np
import pandas as pd
# Note: pandas 2.1 and later deprecate fillna(method=...) in favor of df.ffill() / df.bfill().
df = pd.DataFrame(np.random.randn(6, 3))
df.loc[1:2, 1] = np.nan  # rows 1 and 2 of column 1 (.loc slices include the end label)
df.loc[3, 2] = np.nan
print(df)
tmp_df = df.fillna(value=1, inplace=False, limit=1)  # Fill NaN with the value 1, column by column; limit=1 fills at most one NaN per column
print(tmp_df)
tmp_df = df.fillna(inplace=False, method='ffill')  # Fill each column forward with the previous valid value
print(tmp_df)
tmp_df = df.fillna(inplace=False, method='ffill', axis=1)  # Fill each row forward with the valid value to its left
print(tmp_df)
df.fillna(inplace=True, method='bfill', axis=0)  # Fill each column backward with the next valid value, modifying df in place
print(df)
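
The value and method behaviors described above can also be tried on a small, fixed frame. The following is a minimal sketch, not part of the original example: the column names "a" and "b" and the fill values are made up for illustration, and it uses the dedicated ffill() and bfill() methods that pandas provides for forward and backward filling.

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen so the result of each fill is easy to follow.
df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0],
                   "b": [np.nan, 2.0, np.nan, np.nan]})

# A dict passed as `value` gives each column its own fill value.
by_column = df.fillna(value={"a": 0.0, "b": -1.0})

# Forward fill; limit=1 fills at most one consecutive NaN in each gap.
forward = df.ffill(limit=1)

# Backward fill; trailing NaNs have no later value and stay NaN.
backward = df.bfill()

print(by_column)
print(forward)
print(backward)
```

Note that the leading NaN in column "b" survives the forward fill and the trailing NaNs survive the backward fill, because there is no valid value on that side to copy from.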

df.interpolate()

Fills missing values in a DataFrame or Series using interpolation techniques instead of hard-coded values.

method: The interpolation technique to use. The options are:
linear: The default. Ignores the index and treats the values as equally spaced, so each missing value falls on the straight line between its two nearest valid neighbors.
time: For data whose index is a date; interpolates according to the length of the time intervals.
index, values: Use the actual numeric values of the index for interpolation.
pad/ffill: Fill NaN with the previous non-missing value.
nearest, zero, slinear, quadratic, cubic: Passed to SciPy. nearest uses the closest non-NaN value; quadratic and cubic are second- and third-order interpolation, suitable for nonlinear data.
polynomial, spline: Polynomial and spline interpolation. Both require the order parameter, for example order=2 for quadratic polynomial interpolation.
barycentric, krogh, piecewise_polynomial, from_derivatives, pchip, akima: Wrappers around the SciPy interpolation methods of similar names.
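
The difference between the linear and index methods only shows up when the index is unevenly spaced. Here is a minimal sketch under that assumption; the Series and its index values are invented for illustration. It sticks to methods that do not need SciPy; options such as polynomial or spline require SciPy to be installed.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with an unevenly spaced index:
# the missing point at 1 is much closer to 0 than to 10.
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# "linear" ignores the index and treats the points as equally spaced,
# so the gap is filled with the midpoint of 0 and 10.
linear = s.interpolate(method="linear")

# "index" weights the fill by the index values, so the point at 1
# lands one tenth of the way from 0 to 10.
by_index = s.interpolate(method="index")

print(linear)
print(by_index)
```

When the index carries real positions (for example measurement depths or distances), index is usually the closer match to what was measured, while linear is appropriate for evenly sampled data.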

axis: 0 fills column by column, 1 fills row by row.
limit: The maximum number of consecutive NaNs to fill. Must be greater than 0.
limit_direction: {‘forward’, ‘backward’, ‘both’}, default ‘forward’. The direction in which consecutive NaNs are filled.
limit_area: Restricts which NaN values are interpolated, depending on where they sit in the sequence.
None: The default. All NaN values can be interpolated, without restriction.
inside: Only interpolate NaN values that are surrounded by valid (non-NaN) values, that is, runs of NaN with a valid value both before and after them. Runs of NaN at the beginning or end of the sequence are not interpolated.
outside: Only interpolate NaN values at the beginning or end of the sequence. A run of NaN in the middle, surrounded by non-NaN values, is not interpolated.
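
A short sketch may help make the inside and outside options concrete. The Series below is invented for illustration and has one NaN at each end plus one in the middle; limit_direction is set to ‘both’ so that the leading NaN is eligible to be filled as well.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: leading NaN, one interior NaN, trailing NaN.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# "inside": only the NaN between 1.0 and 3.0 is interpolated.
inside = s.interpolate(method="linear", limit_direction="both",
                       limit_area="inside")

# "outside": only the leading and trailing NaNs are filled; with the
# linear method they take the nearest valid value. The interior NaN stays.
outside = s.interpolate(method="linear", limit_direction="both",
                        limit_area="outside")

print(inside)
print(outside)
```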

downcast: Downcast dtypes if possible.

import pandas as pd
# None entries become NaN when the DataFrame is built
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})
print(df)
print("=" * 30)
# Linear interpolation, filling forward; a leading NaN (column B) is left as NaN
tmp_df = df.interpolate(method='linear', limit_direction='forward', inplace=False)
print(tmp_df)
print('=' * 30)
# Linear interpolation, filling backward, at most one consecutive NaN per gap
tmp_df = df.interpolate(method='linear', limit_direction='backward', limit=1, inplace=False)
print(tmp_df)
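
To see what the two calls above do to a single gap, column D can be checked on its own. This is a small sketch that rebuilds that column as a standalone Series; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Column D from the example above, as a standalone Series.
d = pd.Series([14, 3, np.nan, np.nan, 6], name="D")

# Linear interpolation spaces the two missing values evenly
# between the neighbors 3 and 6, giving 4 and 5.
filled = d.interpolate(method="linear")

# With limit_direction="backward" and limit=1, only the NaN next to
# the following valid value (6) is filled; the other one stays NaN.
partial = d.interpolate(method="linear", limit_direction="backward", limit=1)

print(filled)
print(partial)
```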

reference:

Pandas – DataFrame.fillna() replaces null values in DataFrame | Geek Tutorial
