Chapter 7¶

处理缺失数据¶

pandas中的呈现方式，又称哨兵值：

在 numpy 中利用 np.nan 来表示缺失值

？这里的NaN是指numpy中的NaN吗，只针对浮点值吗？ - NA(Not Available) Python内置的None值 - > pandas.NA ?

pandas.NA — pandas 2.0.3 documentation (pydata.org)

NA的处理方法 dropna fillna isnull notnull

isna 和 isnull的区别，没有区别

过滤缺失数据¶

DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False, ignore_index=False)

Series，返回一个仅含非空数据和索引值的Series
DataFrame，
参数：轴方向 axis{0 or ‘index’, 1 or ‘columns’}, default 0
删除方式 how{‘any’, ‘all’}, default ‘any’
- any该行/列中存在NA值，删除该行/列
- all该行/列中全为NA值，删除该行/列
删除的非缺失值个数阈值 thresh （非NA值没有达到这个数量的相应维度会被删除）

填充缺失数据¶

DataFrame.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=None)

参数：

填充值:value:scalar, dict, Series, or DataFrame 可以是标量，也可以是索引到元素的字典映射
填充方法method:{‘backfill’, ‘bfill’, ‘ffill’, None}, default None
bfill后面的元素填充
ffill用前面的元素填充
连续缺失值的最大填充次数limit:int, default None
是否对原有对象进行修改inplace:bool, default False

fillna会默认返回新对象，若inplace = True，则直接对现有对象进行就地修改，而非返回新对象

In [44]: df
Out[44]: 
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761       NaN       NaN
5 -1.265934       NaN       NaN


#在第1列中，从第2行开始用前面的元素进行填充(ffill)，最大填写2次，即填写2、3两行
In [46]: df.fillna(method="ffill", limit=2)
Out[46]: 
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761       NaN -2.370232
5 -1.265934       NaN -2.370232

数据转换¶

删除重复数据¶

DataFrame.duplicated(subset=None, keep='first')

返回表示重复行的bool Series

keep, 默认 ‘first’

first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.

pandas.DataFrame.duplicated — pandas 2.0.3 documentation (pydata.org)

DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

删除缺失值

keep{‘first’, ‘last’, False}, default ‘first’

Determines which duplicates (if any) to keep.

‘first’ : Drop duplicates except for the first occurrence.
‘last’ : Drop duplicates except for the last occurrence.
False : Drop all duplicates.

pandas.DataFrame.drop_duplicates — pandas 2.0.3 documentation (pydata.org)

利用函数或映射进行数据转换¶

Series.map(arg, na_action=None)

根据输入映射或函数映射Series，实现元素集转换及其他的数据清洗工作

arg****function, collections.abc.Mapping subclass or Series

na_action, default None 对NA值的处理

如果‘ignore’，则传播 NaN 值，而不将其传递给映射对应。

pandas.Series.map — pandas 2.0.3 documentation (pydata.org)

替换值¶

DataFrame.replace(to_replace=None, value=_NoDefault.no_default, *, inplace=False, limit=None, regex=False, method=_NoDefault.no_default)

用 value 替换to_replace 中给出的值。

to_replace:str, regex, list, dict, Series, int, float, or None**

to_replace的值的寻找，详见doc

value**: scalar, dict, list, str, regex, default None**

pandas.DataFrame.replace — pandas 2.0.3 documentation (pydata.org)

重命名轴索引¶

通过map方法

Index.map(mapper, na_action=None)[source]

mapper****function, dict, or Series

na_action****

pandas.Index.map — pandas 2.0.3 documentation (pydata.org)

DataFrame.rename(mapper=None, *, index=None, columns=None, axis=None, copy=None, inplace=False, level=None, errors='ignore')

mapper****dict-like or function

index****dict-like or function

columns****dict-like or function

axis**, default 0**

inplace:bool, default False Whether to modify the DataFrame rather than creating a new one. If True then value of copy is ignored.

两种写法：
df.rename(mapper,axis = 0)

df.rename(columns = mapper)
以上两者等价，index同理

pandas.DataFrame.rename (pydata.org)

以上两种写法中：

使用index.map进行修改等同于df.rename(...inplace = True)

离散化和分箱¶

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

将数值划分为离散区间

待划分的数组 x:array-like The input array to be binned. Must be 1-dimensional.

bins**:int, sequence of scalars, or IntervalIndex** The criteria to bin by.

int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.
IntervalIndex : Defines the exact bins to be used. Note that IntervalIndex for bins must be non-overlapping.

right：bool, default True 决定那一边是封闭的，当right == True 左开右闭，当right == False左闭右开 Indicates whether bins includes the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1,2], (2,3], (3,4]. This argument is ignored when bins is an IntervalIndex.

precision:int, default 3 The precision at which to store and display the bins labels.

pandas.cut — pandas 2.0.3 documentation (pydata.org)

pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')

根据样本分位数对样本进行划分，便于得到大小基本相等的分箱

x****1d ndarray or Series

q****int or list-like of float Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.

labels:

pandas.qcut — pandas 2.0.3 documentation (pydata.org)

categorical data type

这里对 pandas categorical data type做一个简短的解释

Categorical data — pandas 2.0.3 documentation (pydata.org)

检测和过滤异常值¶

DataFrame.any(*, axis=0, bool_only=None, skipna=True, **kwargs)

返回任何元素是否为 True，可能在一个轴上。

axis**, default 0**

pandas.DataFrame.any — pandas 2.0.3 documentation (pydata.org)

置换和随机采样¶

置换¶

random.permutation(x)

随机排列序列，或返回一个排列范围。

参数：

**x**int or array_like If x is an integer, randomly permute np.arange(x). If x is an array, make a copy and shuffle the elements randomly.

返回值：

**out**ndarray Permuted sequence or array range.

numpy.random.permutation — NumPy v1.25 Manual

DataFrame.take(indices, axis=0, **kwargs)

沿轴返回给定位置索引中的元素。

Parameters

axis**, default 0**
**indices**array-like

An array of ints indicating which positions to take.

Returns

same type as caller

An array-like containing the elements taken from the object.

pandas.DataFrame.take — pandas 2.0.3 documentation (pydata.org)

随机采样¶

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False

Parameters

**n**int, optional

Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.

**frac**float, optional

Fraction of axis items to return. Cannot be used with n.

**replace**bool, default False

Allow or disallow sampling of the same row more than once.

Returns

Series or DataFrame

A new object of same type as caller containing n items randomly sampled from the caller object.

pandas.DataFrame.sample — pandas 2.0.3 documentation (pydata.org)

计算指标/虚拟变量¶

Dummy code

Dummy variable (statistics) - Wikipedia

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

将分类变量转换为虚拟变量/指标变量。

Parameters

**data**array-like, Series, or DataFrame

Data of which to get dummy indicators.

**prefix**str, list of str, or dict of str, default None

Returns

DataFrame

Dummy-coded data. If data contains other columns than the dummy-coded one(s), these will be prepended, unaltered, to the result.

pandas.get_dummies — pandas 2.0.3 documentation (pydata.org)

DataFrame.add_prefix(prefix, axis=None)

为标签加上字符串前缀。

Parameters

prefix str

The string to add before each label.

axis}, default None

Axis to add prefix on

Returns

Series or DataFrame

New Series or DataFrame with updated labels.

pandas.DataFrame.add_prefix — pandas 2.0.3 documentation (pydata.org)

扩展数据类型¶

问题：

某些数据类型缺失值的处理不完备
含有大量数据的数据集，计算开销大
某些数据类型需要使用开销大的Python对象数组才能实现高效计算

为解决以上问题，Pandas发展出了扩展类型，可创建NumPy原本不支持的新数据类型。这些新数据类型可以当作NumPy数组的一级类，等同于其他NumPy原生数据

新数据类型的详解

字符串操作¶

Python内置的字符串对象方法¶

Python内置的字符串方法： 表格，或许可以参考Python Cookbook

正则表达式¶

正则表达式的简单书写和练习

正则表达式常称作regex，Python中内置re模块负责对字符串应用正则表达式

正则表达式：表格

Pandas的字符串函数¶

字符串规整工作

通过Series的str属性中的方法跳过并传播NA值

Series的部分字符串方法：表格

分类数据¶

背景和目标¶

分类表示法/字典编码表示法：用整数表示的方法

数据的分类/字典/层级：不同值的数组

分类编码：表示分类的整数值

Pandas的分类扩展类型¶

pandas.Categories

有categories和codes两个属性

Categories的构造

Categorical data — pandas 2.0.3 documentation (pydata.org)

利用Categorical对象进行计算¶

第九章分类数据 — Joyful Pandas 1.0 documentation (datawhale.club)

分类方法¶

.cat set_categories remove_unused_categories

Pandas中Series的分类方法：表格

one-hot code