Chapter 7¶
处理缺失数据¶
pandas中的呈现方式,又称哨兵值:
- 在
numpy
中利用np.nan
来表示缺失值
?这里的NaN是指numpy中的NaN吗,只针对浮点值吗? - NA(Not Available) Python内置的None值 - >
pandas.NA
?
NA的处理方法
dropna fillna isnull notnull
isna
和isnull
的区别 ,没有区别
过滤缺失数据¶
DataFrame.dropna(*, axis=0, how=_NoDefault.no_default, thresh=_NoDefault.no_default, subset=None, inplace=False, ignore_index=False)
- Series,返回一个仅含非空数据和索引值的Series
- DataFrame,
- 参数:
轴方向
axis{0 or ‘index’, 1 or ‘columns’}, default 0
- 删除方式
how{‘any’, ‘all’}, default ‘any’
any
该行/列中存在NA值,删除该行/列all
该行/列中全为NA值,删除该行/列
- 删除的非缺失值个数阈值
thresh
( 非NA值 没有达到这个数量的相应维度会被删除)
填充缺失数据¶
参数:
-
填充值:
value:scalar, dict, Series, or DataFrame
可以是标量,也可以是索引到元素的字典映射 -
填充方法
method:{‘backfill’, ‘bfill’, ‘ffill’, None}, default None
-
bfill
后面的元素填充 -
ffill
用前面的元素填充 -
连续缺失值的最大填充次数
limit:int, default None
-
是否对原有对象进行修改
inplace:bool, default False
fillna会默认返回新对象,若
inplace = True
,则直接对现有对象进行就地修改,而非返回新对象
In [44]: df
Out[44]:
0 1 2
0 0.476985 3.248944 -1.021228
1 -0.577087 0.124121 0.302614
2 0.523772 NaN 1.343810
3 -0.713544 NaN -2.370232
4 -1.860761 NaN NaN
5 -1.265934 NaN NaN
#在第1列中,从第2行开始用前面的元素进行填充(ffill),最大填写2次,即填写2、3两行
In [46]: df.fillna(method="ffill", limit=2)
Out[46]:
0 1 2
0 0.476985 3.248944 -1.021228
1 -0.577087 0.124121 0.302614
2 0.523772 0.124121 1.343810
3 -0.713544 0.124121 -2.370232
4 -1.860761 NaN -2.370232
5 -1.265934 NaN -2.370232
数据转换¶
删除重复数据¶
返回表示重复行的bool Series
keep
, 默认 ‘first’
first
: Mark duplicates asTrue
except for the first occurrence.last
: Mark duplicates asTrue
except for the last occurrence.- False : Mark all duplicates as
True
.
pandas.DataFrame.duplicated — pandas 2.0.3 documentation (pydata.org)
删除缺失值
keep{‘first’, ‘last’, False
}, default ‘first’
Determines which duplicates (if any) to keep.
- ‘first’ : Drop duplicates except for the first occurrence.
- ‘last’ : Drop duplicates except for the last occurrence.
False
: Drop all duplicates.
pandas.DataFrame.drop_duplicates — pandas 2.0.3 documentation (pydata.org)
利用函数或映射进行数据转换¶
根据输入映射或函数映射Series,实现元素集转换及其他的数据清洗工作
arg****function, collections.abc.Mapping subclass or Series
na_action, default None 对NA值的处理
- 如果‘ignore’,则传播 NaN 值,而不将其传递给映射对应。
pandas.Series.map — pandas 2.0.3 documentation (pydata.org)
替换值¶
DataFrame.replace(to_replace=None, value=_NoDefault.no_default, *, inplace=False, limit=None, regex=False, method=_NoDefault.no_default)
用 value
替换to_replace
中给出的值。
to_replace:str, regex, list, dict, Series, int, float, or None**
to_replace
的值的寻找,详见doc
value**: scalar, dict, list, str, regex, default None**
pandas.DataFrame.replace — pandas 2.0.3 documentation (pydata.org)
重命名轴索引¶
通过map
方法
mapper****function, dict, or Series
na_action****
pandas.Index.map — pandas 2.0.3 documentation (pydata.org)
DataFrame.rename(mapper=None, *, index=None, columns=None, axis=None, copy=None, inplace=False, level=None, errors='ignore')
mapper****dict-like or function
index****dict-like or function
columns****dict-like or function
axis**, default 0**
inplace:bool, default False Whether to modify the DataFrame rather than creating a new one. If True then value of copy is ignored.
两种写法:
以上两者等价,
index
同理
pandas.DataFrame.rename (pydata.org)
以上两种写法中:
使用
index.map
进行修改等同于df.rename(...inplace = True)
离散化和分箱¶
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
将数值划分为离散区间
待划分的数组 x:array-like The input array to be binned. Must be 1-dimensional.
bins**:int, sequence of scalars, or IntervalIndex** The criteria to bin by.
- int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
- sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.
- IntervalIndex : Defines the exact bins to be used. Note that IntervalIndex for bins must be non-overlapping.
right:bool, default True
决定那一边是封闭的,当right == True
左开右闭,当right == False
左闭右开
Indicates whether bins includes the rightmost edge or not. If right == True
(the default), then the bins [1, 2, 3, 4]
indicate (1,2], (2,3], (3,4]. This argument is ignored when bins is an IntervalIndex.
precision:int, default 3 The precision at which to store and display the bins labels.
pandas.cut — pandas 2.0.3 documentation (pydata.org)
根据样本分位数对样本进行划分,便于得到大小基本相等的分箱
x****1d ndarray or Series
q****int or list-like of float Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.
labels:
pandas.qcut — pandas 2.0.3 documentation (pydata.org)
categorical data type
这里对 pandas categorical data type做一个简短的解释
Categorical data — pandas 2.0.3 documentation (pydata.org)
检测和过滤异常值¶
返回任何元素是否为 True,可能在一个轴上。
axis**, default 0**
pandas.DataFrame.any — pandas 2.0.3 documentation (pydata.org)
置换和随机采样¶
置换¶
随机排列序列,或返回一个排列范围。
参数:
**x**int or array_like
If x is an integer, randomly permute np.arange(x)
. If x is an array, make a copy and shuffle the elements randomly.
返回值:
**out**ndarray Permuted sequence or array range.
numpy.random.permutation — NumPy v1.25 Manual
沿轴返回给定位置索引中的元素。
Parameters
-
axis**, default 0**
-
**indices**array-like
An array of ints indicating which positions to take.
Returns
- same type as caller
An array-like containing the elements taken from the object.
pandas.DataFrame.take — pandas 2.0.3 documentation (pydata.org)
随机采样¶
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False
Parameters
- **n**int, optional
Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
- **frac**float, optional
Fraction of axis items to return. Cannot be used with n.
- **replace**bool, default False
Allow or disallow sampling of the same row more than once.
Returns
- Series or DataFrame
A new object of same type as caller containing n items randomly sampled from the caller object.
pandas.DataFrame.sample — pandas 2.0.3 documentation (pydata.org)
计算指标/虚拟变量¶
Dummy code
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
将分类变量转换为虚拟变量/指标变量。
Parameters
- **data**array-like, Series, or DataFrame
Data of which to get dummy indicators.
- **prefix**str, list of str, or dict of str, default None
Returns
- DataFrame
Dummy-coded data. If data contains other columns than the dummy-coded one(s), these will be prepended, unaltered, to the result.
pandas.get_dummies — pandas 2.0.3 documentation (pydata.org)
为标签加上字符串前缀。
Parameters
- prefix str
The string to add before each label.
- axis}, default None
Axis to add prefix on
Returns
- Series or DataFrame
New Series or DataFrame with updated labels.
pandas.DataFrame.add_prefix — pandas 2.0.3 documentation (pydata.org)
扩展数据类型¶
问题:
- 某些数据类型缺失值的处理不完备
- 含有大量数据的数据集,计算开销大
- 某些数据类型需要使用开销大的Python对象数组才能实现高效计算
为解决以上问题,Pandas发展出了扩展类型,可创建NumPy原本不支持的新数据类型。 这些新数据类型可以当作NumPy数组的一级类,等同于其他NumPy原生数据
新数据类型的详解
字符串操作¶
Python内置的字符串对象方法¶
Python内置的字符串方法: 表格,或许可以参考Python Cookbook
正则表达式¶
正则表达式的简单书写和练习
正则表达式常称作regex
,Python中内置re
模块负责对字符串应用正则表达式
正则表达式:表格
Pandas的字符串函数¶
字符串规整工作
通过Series
的str
属性中的方法跳过并传播NA
值
Series的部分字符串方法:表格
分类数据¶
背景和目标¶
分类表示法/字典编码表示法:用整数表示的方法
数据的分类/字典/层级:不同值的数组
分类编码:表示分类的整数值
Pandas的分类扩展类型¶
pandas.Categories
有categories
和codes
两个属性
Categories
的构造
Categorical data — pandas 2.0.3 documentation (pydata.org)
利用Categorical对象进行计算¶
第九章 分类数据 — Joyful Pandas 1.0 documentation (datawhale.club)
分类方法¶
.cat
set_categories
remove_unused_categories
Pandas中Series的分类方法:表格
one-hot code