
Chapter 7

Handling Missing Data

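How pandas represents missing data (also called sentinel values):

  • NumPy uses np.nan to represent missing values

Open questions: is the NaN here NumPy's NaN, and does it apply only to floating-point values? How do NA (Not Available), Python's built-in None, and pandas.NA relate to each other?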

pandas.NA — pandas 2.0.3 documentation (pydata.org)

Methods for handling NA: dropna, fillna, isnull, notnull

Difference between isna and isnull: none. isnull is an alias of isna.
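The detection methods above can be tried on a tiny Series. This is a minimal sketch; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# In a float Series, both np.nan and Python's None are stored as NaN.
s = pd.Series([1.0, np.nan, None])

# isna and isnull are two names for the same method.
mask_isna = s.isna()
mask_isnull = s.isnull()

# notna (alias: notnull) is the element-wise negation.
mask_notna = s.notna()

print(mask_isna.tolist())   # [False, True, True]
print(mask_notna.tolist())  # [True, False, False]
```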

Filtering Out Missing Data

DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False, ignore_index=False)
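  • Series: returns a Series containing only the non-null data and their index values
  • DataFrame: by default, drops every row that contains a missing value
  • Parameters: axis direction, axis {0 or 'index', 1 or 'columns'}, default 0
  • Drop rule: how {'any', 'all'}, default 'any'
    • any: drop the row/column if it contains any NA value
    • all: drop the row/column only if all of its values are NA
  • Minimum number of non-NA values: thresh (rows/columns with fewer than thresh non-NA values are dropped)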
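A small sketch of how the how, thresh, and axis arguments change the result. The DataFrame below is invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [2.0, 5.0, np.nan, np.nan],
    "c": [3.0, 6.0, np.nan, np.nan],
})

any_dropped = df.dropna()            # how="any": keep only rows with no NA at all
all_dropped = df.dropna(how="all")   # drop only rows that are entirely NA
thresh_kept = df.dropna(thresh=2)    # keep rows with at least 2 non-NA values
cols_dropped = df.dropna(axis=1)     # same "any" rule, applied to columns

print(list(any_dropped.index))   # [0]
print(list(all_dropped.index))   # [0, 1, 3]
print(list(thresh_kept.index))   # [0, 1]
```

Every column here contains at least one NaN, so cols_dropped keeps the four rows but has no columns left.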

Filling In Missing Data

DataFrame.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=None)

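Parameters:

  • Fill value, value: scalar, dict, Series, or DataFrame. May be a scalar, or a dict mapping labels (column labels for a DataFrame) to fill values

  • Fill method, method: {'backfill', 'bfill', 'ffill', None}, default None

    • bfill: fill with the next valid value

    • ffill: fill with the previous valid value

  • Maximum number of consecutive missing values to fill, limit: int, default None

  • Whether to modify the original object, inplace: bool, default False

fillna returns a new object by default; with inplace=True it modifies the existing object in place instead of returning a new one.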

In [44]: df
Out[44]: 
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761       NaN       NaN
5 -1.265934       NaN       NaN


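# In column 1, forward-fill (ffill) starting at row 2 with at most 2 fills, so only rows 2 and 3 are filled.
# Column 2 is filled the same way: rows 4 and 5 both take the value from row 3.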
In [46]: df.fillna(method="ffill", limit=2)
Out[46]: 
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761       NaN -2.370232
5 -1.265934       NaN -2.370232
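The sketch below shows a dict-valued fill alongside a limited forward fill, on made-up data. It uses DataFrame.ffill(limit=2), which fills the same way as fillna(method="ffill", limit=2); newer pandas releases deprecate the method argument of fillna, so the dedicated method is the safer choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, np.nan],
    "y": [np.nan, 2.0, np.nan, 4.0],
})

# A dict maps column labels to per-column fill values.
filled_const = df.fillna({"x": 0.0, "y": -1.0})

# Forward fill, but at most 2 consecutive NaNs per gap.
filled_ffill = df.ffill(limit=2)

print(filled_const["y"].tolist())        # [-1.0, 2.0, -1.0, 4.0]
print(filled_ffill["x"].isna().tolist()) # [False, False, False, True]

# Both calls returned new objects; df itself still contains its NaNs.
print(int(df["x"].isna().sum()))         # 3
```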

Data Transformation

Removing Duplicates

DataFrame.duplicated(subset=None, keep='first')

Returns a boolean Series indicating which rows are duplicates.

keep, default 'first'

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.

pandas.DataFrame.duplicated — pandas 2.0.3 documentation (pydata.org)

DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

Removes duplicate rows.

keep{‘first’, ‘last’, False}, default ‘first’

Determines which duplicates (if any) to keep.

  • ‘first’ : Drop duplicates except for the first occurrence.
  • ‘last’ : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.

pandas.DataFrame.drop_duplicates — pandas 2.0.3 documentation (pydata.org)
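A short sketch of how keep and subset affect duplicated and drop_duplicates. The DataFrame is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "k1": ["one", "two", "one", "two", "one"],
    "k2": [1, 1, 1, 2, 1],
})

dup_first = df.duplicated()            # the first occurrence is not flagged
dup_all = df.duplicated(keep=False)    # every member of a duplicate group is flagged
deduped = df.drop_duplicates()         # keeps the first occurrence of each row
deduped_k1 = df.drop_duplicates(subset=["k1"], keep="last")  # compare on k1 only

print(dup_first.tolist())        # [False, False, True, False, True]
print(dup_all.tolist())          # [True, False, True, False, True]
print(list(deduped.index))       # [0, 1, 3]
print(list(deduped_k1.index))    # [3, 4]
```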

Transforming Data Using a Function or Mapping

Series.map(arg, na_action=None)

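Maps the values of a Series according to an input mapping or function; useful for element-wise transformations and other data cleaning work.

arg: function, collections.abc.Mapping subclass or Series

na_action: {None, 'ignore'}, default None. Controls how NA values are handled.

  • 'ignore': propagate NaN values without passing them to the mapping.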

pandas.Series.map — pandas 2.0.3 documentation (pydata.org)
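A minimal sketch of Series.map with both a function and a dict, on made-up data. The names food and meat_to_animal are illustrative.

```python
import numpy as np
import pandas as pd

food = pd.Series(["bacon", "Pastrami", np.nan, "honey ham"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow", "honey ham": "pig"}

# na_action="ignore" leaves NaN alone instead of calling str.lower on it
# (str.lower would raise a TypeError if handed a float NaN).
lowered = food.map(str.lower, na_action="ignore")

# Mapping with a dict: values without a matching key (including NaN) become NaN.
animal = lowered.map(meat_to_animal)

print(animal.tolist())
```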

Replacing Values

DataFrame.replace(to_replace=None, value=_NoDefault.no_default, *, inplace=False, limit=None, regex=False, method=_NoDefault.no_default)

Values matching to_replace are replaced with value.

to_replace: str, regex, list, dict, Series, int, float, or None

For how the values in to_replace are matched, see the documentation.

value: scalar, dict, list, str, regex, default None

pandas.DataFrame.replace — pandas 2.0.3 documentation (pydata.org)
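A sketch of three common replace call shapes, assuming -999 and -1000 are sentinel codes for missing readings. The data is invented.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

one_value = s.replace(-999.0, np.nan)                 # single value -> single value
many_to_one = s.replace([-999.0, -1000.0], np.nan)    # several values -> one value
by_dict = s.replace({-999.0: np.nan, -1000.0: 0.0})   # per-value replacements

print(int(one_value.isna().sum()))    # 2
print(int(many_to_one.isna().sum()))  # 3
print(by_dict[4])                     # 0.0
```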

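Renaming Axis Indexes

Using the map method

Index.map(mapper, na_action=None)

mapper: function, dict, or Series

na_action: {None, 'ignore'}. If 'ignore', NA values are propagated without being passed to the mapper.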

pandas.Index.map — pandas 2.0.3 documentation (pydata.org)

DataFrame.rename(mapper=None, *, index=None, columns=None, axis=None, copy=None, inplace=False, level=None, errors='ignore')

mapper: dict-like or function

index: dict-like or function

columns: dict-like or function

axis: {0 or 'index', 1 or 'columns'}, default 0

inplace:bool, default False Whether to modify the DataFrame rather than creating a new one. If True then value of copy is ignored.

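Two calling conventions:

df.rename(mapper, axis=1)

df.rename(columns=mapper)

The two calls above are equivalent. Likewise, df.rename(mapper, axis=0) is equivalent to df.rename(index=mapper).

pandas.DataFrame.rename (pydata.org)

Comparing the two approaches (Index.map and rename):

Assigning the result of index.map back to df.index modifies the DataFrame itself, which has the same effect as df.rename(..., inplace=True).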
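A small sketch comparing the rename calling conventions with assignment through Index.map. The DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [2, 3]],
                  index=["ohio", "utah"], columns=["one", "two"])

# Positional mapper with axis=1 and the columns= keyword do the same thing.
by_axis = df.rename(str.upper, axis=1)
by_keyword = df.rename(columns=str.upper)

# index= and columns= can be combined; dicts rename only the listed labels.
renamed = df.rename(index={"ohio": "OHIO"}, columns=str.title)

# Assigning the result of Index.map back changes the object it is assigned to.
in_place = df.copy()
in_place.index = in_place.index.map(str.upper)

print(list(by_axis.columns))   # ['ONE', 'TWO']
print(list(renamed.index))     # ['OHIO', 'utah']
print(list(in_place.index))    # ['OHIO', 'UTAH']
print(list(df.index))          # ['ohio', 'utah']  (rename returned new objects)
```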

Discretization and Binning

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

Bins values into discrete intervals.

Array to bin, x: array-like. The input array to be binned. Must be 1-dimensional.

bins: int, sequence of scalars, or IntervalIndex. The criteria to bin by.

  • int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
  • sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.
  • IntervalIndex : Defines the exact bins to be used. Note that IntervalIndex for bins must be non-overlapping.

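right: bool, default True. Determines which side of each interval is closed: with right == True the bins are open on the left and closed on the right; with right == False they are closed on the left and open on the right. Indicates whether bins includes the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1,2], (2,3], (3,4]. This argument is ignored when bins is an IntervalIndex.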

precision:int, default 3 The precision at which to store and display the bins labels.

pandas.cut — pandas 2.0.3 documentation (pydata.org)
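The effect of right and labels shows up most clearly for a value that sits exactly on a bin edge. A minimal sketch with invented ages and bin edges:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)                    # (18, 25], (25, 35], ...
cats_left = pd.cut(ages, bins, right=False)  # [18, 25), [25, 35), ...
named = pd.cut(ages, bins,
               labels=["Youth", "YoungAdult", "MiddleAged", "Senior"])

# The age 25 (position 2) falls in the first bin when the right edge is closed,
# and in the second bin when the left edge is closed.
print(cats.codes[2], cats_left.codes[2])  # 0 1
print(named[0])                           # Youth
```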

pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')

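Bins the data based on sample quantiles, which yields bins of roughly equal size.

x: 1d ndarray or Series

q: int or list-like of float. Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.

labels: array or False, default None. Used as labels for the resulting bins. If False, only integer indicators of the bins are returned.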

pandas.qcut — pandas 2.0.3 documentation (pydata.org)
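A sketch of qcut on random data, assuming 1000 draws from a standard normal distribution (the seed is arbitrary). Because the bin edges are sample quantiles, the bin sizes follow the quantile fractions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
data = rng.standard_normal(1000)

# Four quantile-based bins: each holds a quarter of the sample.
quartiles = pd.qcut(data, 4)
counts = pd.Series(quartiles).value_counts()

# Custom quantile edges; labels=False returns integer bin indicators.
custom = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.0], labels=False)

print(sorted(counts.tolist()))          # [250, 250, 250, 250]
print(np.bincount(custom).tolist())     # [100, 400, 400, 100]
```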

categorical data type

A brief explanation of the pandas categorical data type belongs here.

Categorical data — pandas 2.0.3 documentation (pydata.org)

Detecting and Filtering Outliers

DataFrame.any(*, axis=0, bool_only=None, skipna=True, **kwargs)

Returns whether any element is True, potentially over an axis.

axis: {0 or 'index', 1 or 'columns', None}, default 0

pandas.DataFrame.any — pandas 2.0.3 documentation (pydata.org)
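One way any is used for outlier filtering is to build a boolean mask and reduce it across columns. The sketch below assumes that "outlier" means an absolute value above 3, which is an illustrative threshold rather than anything fixed by pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Rows where at least one column exceeds 3 in absolute value.
outlier_rows = data[(data.abs() > 3).any(axis="columns")]

# Cap values to the interval [-3, 3]; np.sign yields -1, 0, or 1 per element.
capped = data.copy()
capped[capped.abs() > 3] = np.sign(capped) * 3

print(len(outlier_rows))
print(float(capped.abs().max().max()) <= 3.0)
```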

Permutation and Random Sampling

Permutation

numpy.random.permutation(x)

Randomly permutes a sequence, or returns a permuted range.

Parameters:

x: int or array_like. If x is an integer, randomly permute np.arange(x). If x is an array, make a copy and shuffle the elements randomly.

Returns:

out: ndarray. Permuted sequence or array range.

numpy.random.permutation — NumPy v1.25 Manual

DataFrame.take(indices, axis=0, **kwargs)

Returns the elements at the given positional indices along an axis.

Parameters

  • axis: {0 or 'index', 1 or 'columns', None}, default 0

  • indices: array-like

An array of ints indicating which positions to take.

Returns

  • same type as caller

An array-like containing the elements taken from the object.

pandas.DataFrame.take — pandas 2.0.3 documentation (pydata.org)
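Permutation and take combine naturally: generate a random ordering of positions, then select rows by position. A minimal sketch on an invented DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 3).reshape((5, 3)), columns=["a", "b", "c"])

# A random ordering of the integers 0..4 (different on every run).
order = np.random.permutation(len(df))

shuffled = df.take(order)       # positional row selection
same_as_iloc = df.iloc[order]   # iloc-based indexing gives the same result

# take also works along the columns.
col_shuffled = df.take([2, 0, 1], axis="columns")

print(shuffled)
print(list(col_shuffled.columns))  # ['c', 'a', 'b']
```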

Random Sampling

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)

Parameters

  • n: int, optional

Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.

  • frac: float, optional

Fraction of axis items to return. Cannot be used with n.

  • replace: bool, default False

Allow or disallow sampling of the same row more than once.

Returns

  • Series or DataFrame

A new object of same type as caller containing n items randomly sampled from the caller object.

pandas.DataFrame.sample — pandas 2.0.3 documentation (pydata.org)
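A short sketch of n, frac, and replace. The random_state values are arbitrary and only make the runs repeatable.

```python
import pandas as pd

df = pd.DataFrame({"v": range(10)})

three = df.sample(n=3, random_state=0)        # 3 distinct rows
half = df.sample(frac=0.5, random_state=0)    # half of the rows
# Drawing more rows than exist is only possible with replacement.
with_repl = df.sample(n=20, replace=True, random_state=0)

print(len(three), len(half), len(with_repl))  # 3 5 20
```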

Computing Indicator/Dummy Variables

Dummy code

Dummy variable (statistics) - Wikipedia

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

Converts categorical variables into dummy/indicator variables.

Parameters

  • data: array-like, Series, or DataFrame

Data of which to get dummy indicators.

  • prefix: str, list of str, or dict of str, default None

Returns

  • DataFrame

Dummy-coded data. If data contains other columns than the dummy-coded one(s), these will be prepended, unaltered, to the result.

pandas.get_dummies — pandas 2.0.3 documentation (pydata.org)
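A minimal sketch of get_dummies on one column of an invented DataFrame. dtype=int is passed explicitly so the indicators are 0/1 regardless of the pandas version's default dtype.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c"], "data1": range(4)})

# One indicator column per distinct value, named <prefix>_<value>.
dummies = pd.get_dummies(df["key"], prefix="key", dtype=int)

# Attach the indicators to the remaining columns.
joined = df[["data1"]].join(dummies)

print(list(dummies.columns))      # ['key_a', 'key_b', 'key_c']
print(dummies.iloc[0].tolist())   # [0, 1, 0]
```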

DataFrame.add_prefix(prefix, axis=None)

Prefixes labels with a string.

Parameters

  • prefix: str

The string to add before each label.

  • axis: {0 or 'index', 1 or 'columns', None}, default None

Axis to add prefix on

Returns

  • Series or DataFrame

New Series or DataFrame with updated labels.

pandas.DataFrame.add_prefix — pandas 2.0.3 documentation (pydata.org)
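add_prefix is handy for labeling dummy columns after the fact. A small sketch with made-up data; the axis argument is left at its default so the prefix applies to the columns of a DataFrame and to the index of a Series.

```python
import pandas as pd

keys = pd.Series(["b", "b", "a", "c"])

# Same column names as get_dummies(..., prefix="key") would produce.
dummies = pd.get_dummies(keys, dtype=int).add_prefix("key_")

# On a Series, the prefix is added to the index labels.
s = pd.Series([1, 2], index=["x", "y"])
s_prefixed = s.add_prefix("row_")

print(list(dummies.columns))    # ['key_a', 'key_b', 'key_c']
print(list(s_prefixed.index))   # ['row_x', 'row_y']
```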

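Extension Data Types

Problems:

  • Missing-value handling is incomplete for some data types
  • Datasets with large amounts of data, string data in particular, are computationally expensive
  • Some data types cannot be supported efficiently without falling back to expensive arrays of Python objects

To address these problems, pandas developed extension types, which make it possible to create new data types that NumPy does not natively support. These new data types can be treated as first-class citizens alongside data coming from NumPy arrays.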
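The missing-value problem is easy to see with integers. In this minimal sketch, a NumPy-backed Series is forced to float64 by a missing value, while the nullable "Int64" extension type keeps integers and marks the gap with pd.NA.

```python
import numpy as np
import pandas as pd

# NumPy-backed: the missing value forces a float64 column holding NaN.
plain = pd.Series([1, 2, None])

# Extension type: values stay integers and the gap is pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")

print(plain.dtype)              # float64
print(nullable.dtype)           # Int64
print(nullable[2] is pd.NA)     # True
print(nullable.sum())           # 3  (NA is skipped)
```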

Details of the new data types

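String Manipulation

Python Built-In String Object Methods

Python built-in string methods: table (the Python Cookbook may be a useful reference)

Regular Expressions

Writing simple regular expressions, with practice

Regular expressions are commonly called regex; Python's built-in re module applies regular expressions to strings.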
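A minimal sketch of the re module, splitting a string on runs of whitespace. The sample text is invented.

```python
import re

text = "foo    bar\t baz  \tqux"

# One-off call: the pattern is compiled internally.
parts = re.split(r"\s+", text)

# Compiling once is convenient when the same pattern is reused.
regex = re.compile(r"\s+")
found = regex.findall(text)   # the whitespace runs that were matched

print(parts)        # ['foo', 'bar', 'baz', 'qux']
print(len(found))   # 3
```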

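Regular expressions: table

String Functions in pandas

String cleanup work

The methods on a Series's str attribute skip NA values and propagate them to the result.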
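A short sketch of NA propagation through the str accessor, using invented e-mail data. Calling the plain Python method on each element would fail on the NaN entry, whereas the str methods pass it through.

```python
import numpy as np
import pandas as pd

data = pd.Series({
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Wes": np.nan,
})

upper = data.str.upper()                 # NaN stays NaN
domain = data.str.split("@").str[1]      # element access via .str[...] also skips NaN
lengths = data.str.len()

print(upper["Dave"])         # DAVE@GOOGLE.COM
print(domain["Steve"])       # gmail.com
print(pd.isna(upper["Wes"])) # True
```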

Selected Series string methods: table

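Categorical Data

Background and Motivation

Categorical or dictionary-encoded representation: representing the values with integers

The categories, dictionary, or levels of the data: the array of distinct values

Category codes: the integer values that reference the categories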
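The dictionary-encoded idea can be sketched without any categorical type at all: keep the distinct values once, store small integers, and rebuild the original with take. The names below are illustrative.

```python
import pandas as pd

# The distinct values (the "categories" or "dictionary").
categories = pd.Series(["apple", "orange"])

# The integer codes that reference positions in `categories`.
codes = [0, 1, 0, 0, 0, 1, 0, 0]

# take looks up each code to restore the original strings.
restored = categories.take(codes)

print(restored.tolist())
```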

The pandas Categorical Extension Type

pandas.Categorical

Two attributes: categories and codes

Constructing a Categorical

Categorical data — pandas 2.0.3 documentation (pydata.org)
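A minimal sketch of the categories and codes attributes and of two ways to build categorical data, using invented fruit names.

```python
import pandas as pd

values = pd.Series(["apple", "orange", "apple", "apple"])

# Convert an existing column; the .cat accessor exposes both attributes.
cat_col = values.astype("category")
categories = list(cat_col.cat.categories)
codes = cat_col.cat.codes.tolist()

# Build a Categorical directly from codes plus categories.
from_codes = pd.Categorical.from_codes([0, 1, 0, 0],
                                       categories=["apple", "orange"])

print(categories)        # ['apple', 'orange']
print(codes)             # [0, 1, 0, 0]
print(list(from_codes))  # ['apple', 'orange', 'apple', 'apple']
```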

Computations with Categorical Objects

Chapter 9, Categorical Data - Joyful Pandas 1.0 documentation (datawhale.club)

Categorical Methods

The .cat accessor: set_categories, remove_unused_categories
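A short sketch of the two accessor methods named above, on a made-up Series. Filtering rows does not shrink the category set by itself, which is why remove_unused_categories exists.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="category")

# Declare a category that does not occur in the data yet.
widened = s.cat.set_categories(["a", "b", "c", "d", "e"])

# After filtering, the unused categories are still attached...
subset = s[s.isin(["a", "b"])]
# ...until they are removed explicitly.
trimmed = subset.cat.remove_unused_categories()

print(list(widened.cat.categories))  # ['a', 'b', 'c', 'd', 'e']
print(list(subset.cat.categories))   # ['a', 'b', 'c', 'd']
print(list(trimmed.cat.categories))  # ['a', 'b']
```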

Categorical methods for Series in pandas: table

One-hot encoding