Advanced Pandas Operations

Date: 2021-08-22 21:28:44

Pandas data processing

1. Removing duplicate rows

Use the duplicated() function to detect duplicate rows. It returns a boolean Series with one element per row; an element is True if that row is not the first occurrence of its content.

  • keep parameter: specifies which of the duplicate rows to keep

  • Create a DataFrame containing duplicate rows

import numpy as np
import pandas as pd
from pandas import DataFrame
# create a df
df = DataFrame(data=np.random.randint(0,100,size=(12,7)))
df

0 1 2 3 4 5 6
0 18 75 98 68 33 35 15
1 48 71 36 34 41 17 46
2 3 28 81 21 86 53 85
3 15 35 35 90 63 83 14
4 12 36 65 79 25 53 95
5 98 63 4 58 35 64 80
6 31 61 23 33 80 53 60
7 52 47 60 58 54 35 17
8 7 92 42 61 31 40 56
9 76 45 30 42 74 83 53
10 69 2 89 99 12 51 62
11 17 86 1 76 40 34 41
# manually set several rows of df to the same content
df.iloc[1] = [6,6,6,6,6,6,6]
df.iloc[8] = [6,6,6,6,6,6,6]
df.iloc[5] = [6,6,6,6,6,6,6]
df

0 1 2 3 4 5 6
0 18 75 98 68 33 35 15
1 6 6 6 6 6 6 6
2 3 28 81 21 86 53 85
3 15 35 35 90 63 83 14
4 12 36 65 79 25 53 95
5 6 6 6 6 6 6 6
6 31 61 23 33 80 53 60
7 52 47 60 58 54 35 17
8 6 6 6 6 6 6 6
9 76 45 30 42 74 83 53
10 69 2 89 99 12 51 62
11 17 86 1 76 40 34 41
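Before dropping anything, the boolean mask from duplicated() can be inspected directly. A minimal sketch (the random values are illustrative; the seed is set only so the sketch is reproducible):

```python
import numpy as np
from pandas import DataFrame

np.random.seed(0)  # seed only for reproducibility of this sketch
df = DataFrame(data=np.random.randint(0, 100, size=(12, 7)))
df.iloc[1] = [6, 6, 6, 6, 6, 6, 6]
df.iloc[5] = [6, 6, 6, 6, 6, 6, 6]
df.iloc[8] = [6, 6, 6, 6, 6, 6, 6]

# keep='first': the first occurrence (row 1) is False, repeats (rows 5, 8) are True
mask = df.duplicated(keep='first')
print(mask[mask].index.tolist())  # [5, 8]

# ~mask keeps only first occurrences, equivalent to df.drop_duplicates(keep='first')
deduped = df.loc[~mask]
```

Passing keep=False instead marks all three identical rows as duplicates, so df.loc[~mask] would then drop rows 1, 5, and 8.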
  • Use the drop_duplicates() function to remove duplicate rows
    • drop_duplicates(keep='first'/'last'/False)
df.drop_duplicates(keep='last')  # keep the last occurrence of each duplicate row

0 1 2 3 4 5 6
0 26 9 31 11 38 46 22
2 89 24 53 28 64 89 40
3 7 80 43 91 32 95 6
4 96 92 58 55 82 73 21
6 43 1 13 54 24 34 43
7 75 32 88 85 40 29 41
8 6 6 6 6 6 6 6
9 93 3 62 4 60 51 78
10 58 11 63 42 62 30 10
11 89 93 96 49 23 40 57

2. Mapping

1) The replace() function: replacing elements

DataFrame replacement operations

  • Single-value replacement

    • Plain replacement: replaces every matching element: to_replace=15, value='e'
    • Per-column single-value replacement: to_replace={column label: value to replace}, value='new value'
  • Multi-value replacement

    • List replacement: to_replace=[], value=[]
    • Dict replacement (recommended): to_replace={old: new, old: new}
df

0 1 2 3 4 5 6
0 26 9 31 11 38 46 22
1 6 6 6 6 6 6 6
2 89 24 53 28 64 89 40
3 7 80 43 91 32 95 6
4 96 92 58 55 82 73 21
5 6 6 6 6 6 6 6
6 43 1 13 54 24 34 43
7 75 32 88 85 40 29 41
8 6 6 6 6 6 6 6
9 93 3 62 4 60 51 78
10 58 11 63 42 62 30 10
11 89 93 96 49 23 40 57

Note: on a DataFrame, the method and limit parameters cannot be used

df.replace(to_replace=6,value='six') # replace every 6 in the data with 'six'; by default the original table is unchanged

0 1 2 3 4 5 6
0 18 75 98 68 33 35 15
1 six six six six six six six
2 3 28 81 21 86 53 85
3 15 35 35 90 63 83 14
4 12 36 65 79 25 53 95
5 six six six six six six six
6 31 61 23 33 80 53 60
7 52 47 60 58 54 35 17
8 six six six six six six six
9 76 45 30 42 74 83 53
10 69 2 89 99 12 51 62
11 17 86 1 76 40 34 41
df.replace(to_replace={6:'six'})  # same effect as above

0 1 2 3 4 5 6
0 26 9 31 11 38 46 22
1 six six six six six six six
2 89 24 53 28 64 89 40
3 7 80 43 91 32 95 six
4 96 92 58 55 82 73 21
5 six six six six six six six
6 43 1 13 54 24 34 43
7 75 32 88 85 40 29 41
8 six six six six six six six
9 93 3 62 4 60 51 78
10 58 11 63 42 62 30 10
11 89 93 96 49 23 40 57
df.replace(to_replace={5:6},value='six')  # in column 5 only, replace 6 with 'six'

0 1 2 3 4 5 6
0 26 9 31 11 38 46 22
1 6 6 6 6 6 six 6
2 89 24 53 28 64 89 40
3 7 80 43 91 32 95 6
4 96 92 58 55 82 73 21
5 6 6 6 6 6 six 6
6 43 1 13 54 24 34 43
7 75 32 88 85 40 29 41
8 6 6 6 6 6 six 6
9 93 3 62 4 60 51 78
10 58 11 63 42 62 30 10
11 89 93 96 49 23 40 57
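The list and dict forms of multi-value replacement listed earlier can be sketched as follows (the small frame is illustrative):

```python
from pandas import DataFrame

df = DataFrame(data=[[6, 15, 30],
                     [15, 6, 42]])

# list replacement: to_replace and value are paired element-wise
r1 = df.replace(to_replace=[6, 15], value=['six', 'fifteen'])

# dict replacement (recommended): {old: new, old: new}
r2 = df.replace(to_replace={6: 'six', 15: 'fifteen'})

print(r1.equals(r2))  # True: both map 6 -> 'six' and 15 -> 'fifteen'
```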

2) The map() function

Creating a new column. Note that map() is not a DataFrame method; it is a Series method.

  • map() can map values into a new column

  • map() can take a lambda expression

  • map() can take a function, including a user-defined one

    e.g. map({to_replace:value})

  • Note: aggregation functions such as sum, and for loops, cannot be used inside map()

  • Adding a new column: add a column to df whose values are the English names corresponding to the Chinese names

dic = {
    'name':['张三','周杰伦','张三'],
    'salary':[20000,10000,20000]
}
df = DataFrame(data=dic)
df

name salary
0 张三 20000
1 周杰伦 10000
2 张三 20000
# mapping table
dic = {
    '张三':'tom',
    '周杰伦':'jay'
}
df['e_name'] = df['name'].map(dic)
df

name salary e_name
0 张三 20000 tom
1 周杰伦 10000 jay
2 张三 20000 tom

Think of map as a computation tool; which computation it performs is determined by its argument (a lambda or a function).

  • Using a custom function
# a 50% tax is paid on the portion of the salary above 3000
def after_sal(s):
    return s - (s - 3000) * 0.5

df['after_sal'] = df['salary'].map(after_sal)
df

name salary e_name after_sal
0 张三 20000 tom 11500.0
1 周杰伦 10000 jay 6500.0
2 张三 20000 tom 11500.0
df['after_sal'] = df['salary'].apply(after_sal)  # apply is more efficient than map
df

name salary e_name after_sal
0 张三 20000 tom 11500.0
1 周杰伦 10000 jay 6500.0
2 张三 20000 tom 11500.0
  • Using a lambda expression
df['after_sal'] = df['salary'].apply(lambda x:x-(x-3000)*0.5)  # or map
df

name salary e_name after_sal
0 张三 20000 tom 11500.0
1 周杰伦 10000 jay 6500.0
2 张三 20000 tom 11500.0

Note: not every function can be passed to map. Only a function that takes exactly one argument and has a return value can be used as a map argument.

3. Detecting and filtering outliers with aggregation operations

The df.std() function returns the standard deviation of each column of a DataFrame

  • Create a 1000-row, 3-column df with values in the range (0, 1) and compute each column's standard deviation
df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df

A B C
0 0.056365 0.080972 0.378327
1 0.371930 0.007791 0.318345
2 0.140999 0.921772 0.752930
3 0.877110 0.447756 0.760049
4 0.212178 0.143772 0.621486
5 0.255404 0.195473 0.008677
6 0.011568 0.308934 0.882607
7 0.470868 0.080049 0.285998
8 0.659013 0.794802 0.270541
9 0.315826 0.814653 0.906056
10 0.892474 0.301340 0.687254
11 0.015484 0.567598 0.043682
12 0.957620 0.967676 0.063608
13 0.102506 0.490077 0.235902
14 0.099083 0.778190 0.451824
15 0.023148 0.074169 0.589411
16 0.425894 0.772662 0.797658
17 0.939475 0.773502 0.766101
18 0.330299 0.984615 0.346554
19 0.882735 0.237546 0.847036
20 0.578589 0.730879 0.751632
21 0.504627 0.716272 0.386102
22 0.424879 0.231262 0.590047
23 0.580738 0.675268 0.726104
24 0.507248 0.136465 0.463764
25 0.421517 0.814806 0.449040
26 0.275373 0.935430 0.525679
27 0.404031 0.221492 0.730966
28 0.779142 0.063435 0.120807
29 0.618392 0.535934 0.554632
... ... ... ...
970 0.378107 0.687434 0.567923
971 0.876770 0.443219 0.236627
972 0.486757 0.416836 0.524889
973 0.886021 0.203959 0.789022
974 0.838247 0.279468 0.333581
975 0.762230 0.352878 0.550439
976 0.044568 0.680916 0.350743
977 0.031232 0.029839 0.918445
978 0.323142 0.686965 0.978349
979 0.746471 0.081773 0.729567
980 0.810169 0.793025 0.993532
981 0.480849 0.321984 0.233431
982 0.491794 0.056681 0.429988
983 0.278019 0.105290 0.435492
984 0.480974 0.098199 0.958667
985 0.465396 0.806955 0.668972
986 0.602675 0.966963 0.338542
987 0.051971 0.105833 0.132917
988 0.416362 0.861777 0.832573
989 0.951651 0.002912 0.942564
990 0.274033 0.071102 0.941272
991 0.632913 0.807060 0.540686
992 0.035006 0.526970 0.058584
993 0.368957 0.395593 0.210440
994 0.692847 0.655492 0.877564
995 0.245593 0.003551 0.913750
996 0.374804 0.311604 0.680521
997 0.355928 0.924330 0.224949
998 0.923060 0.834740 0.275359
999 0.905336 0.482290 0.722851

1000 rows × 3 columns

Apply a filter condition to df to remove rows that deviate too much: assume the filter condition is that the value in column C is greater than twice the standard deviation of column C. Negating the condition with ~ keeps the normal rows.

twice_std = df['C'].std() * 2
twice_std
0.5714973528631762
~(df['C'] > twice_std)
0       True
1 True
2 False
3 False
4 False
5 True
6 False
7 True
8 True
9 False
10 False
11 True
12 True
13 True
14 True
15 False
16 False
17 False
18 True
19 False
20 False
21 True
22 False
23 False
24 True
25 True
26 True
27 False
28 True
29 True
...
970 True
971 True
972 True
973 False
974 True
975 True
976 True
977 False
978 False
979 False
980 False
981 True
982 True
983 True
984 False
985 False
986 True
987 True
988 False
989 False
990 False
991 True
992 True
993 True
994 False
995 False
996 False
997 True
998 True
999 False
Name: C, Length: 1000, dtype: bool
df.loc[~(df['C'] > twice_std)]

A B C
0 0.056365 0.080972 0.378327
1 0.371930 0.007791 0.318345
5 0.255404 0.195473 0.008677
7 0.470868 0.080049 0.285998
8 0.659013 0.794802 0.270541
11 0.015484 0.567598 0.043682
12 0.957620 0.967676 0.063608
13 0.102506 0.490077 0.235902
14 0.099083 0.778190 0.451824
18 0.330299 0.984615 0.346554
21 0.504627 0.716272 0.386102
24 0.507248 0.136465 0.463764
25 0.421517 0.814806 0.449040
26 0.275373 0.935430 0.525679
28 0.779142 0.063435 0.120807
29 0.618392 0.535934 0.554632
31 0.616251 0.034984 0.342615
33 0.009574 0.195987 0.221378
35 0.721609 0.518311 0.561978
36 0.316993 0.678054 0.163737
37 0.494355 0.499986 0.560351
39 0.584863 0.881738 0.481162
43 0.160369 0.402388 0.208208
45 0.002698 0.576528 0.070493
47 0.764883 0.778927 0.494559
48 0.868643 0.392903 0.109240
49 0.058928 0.350504 0.497170
50 0.373490 0.783554 0.335720
55 0.638066 0.442382 0.173654
56 0.837218 0.722685 0.454352
... ... ... ...
943 0.322268 0.957496 0.108147
944 0.384463 0.490386 0.245737
945 0.382611 0.726888 0.345724
947 0.713337 0.828064 0.364005
948 0.818703 0.445825 0.281585
951 0.968651 0.897188 0.368103
952 0.136136 0.431300 0.543917
954 0.846105 0.064527 0.200963
955 0.708107 0.857570 0.475146
957 0.595819 0.060763 0.294676
958 0.268046 0.790128 0.342255
959 0.116645 0.968789 0.493773
967 0.073665 0.204168 0.286095
968 0.205796 0.596242 0.468190
970 0.378107 0.687434 0.567923
971 0.876770 0.443219 0.236627
972 0.486757 0.416836 0.524889
974 0.838247 0.279468 0.333581
975 0.762230 0.352878 0.550439
976 0.044568 0.680916 0.350743
981 0.480849 0.321984 0.233431
982 0.491794 0.056681 0.429988
983 0.278019 0.105290 0.435492
986 0.602675 0.966963 0.338542
987 0.051971 0.105833 0.132917
991 0.632913 0.807060 0.540686
992 0.035006 0.526970 0.058584
993 0.368957 0.395593 0.210440
997 0.355928 0.924330 0.224949
998 0.923060 0.834740 0.275359

559 rows × 3 columns
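The same rule can also be applied to all three columns at once, each value compared against twice its own column's standard deviation. A sketch, assuming the same kind of random df:

```python
import numpy as np
from pandas import DataFrame

df = DataFrame(data=np.random.random(size=(1000, 3)), columns=['A', 'B', 'C'])

# True for rows where any column exceeds twice that column's std
abnormal = (df > df.std() * 2).any(axis=1)
cleaned = df.loc[~abnormal]  # rows where every column is within its limit
```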

  • Detecting and filtering missing values
    • dropna
    • fillna
  • Detecting and filtering duplicates
    • drop_duplicates(keep)
  • Detecting and filtering outliers
    • derive the condition that identifies an outlier
    • drop the rows containing outliers
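The dropna / fillna items on the checklist above can be sketched on a tiny frame with some NaN holes:

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({'A': [1.0, np.nan, 3.0],
                'B': [4.0, 5.0, np.nan]})

# detect: rows containing at least one missing value
print(df.isnull().any(axis=1).tolist())  # [False, True, True]

# filter option 1: drop every row containing NaN
dropped = df.dropna(axis=0)

# filter option 2: fill NaN, here with each column's mean
filled = df.fillna(value=df.mean())
```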

4. Sorting

Sorting with the take() function

  • take() accepts a list of integer indices; the df is reordered according to the order of the indices in the list
  • e.g. df.take([1,3,4,2,5])

np.random.permutation() can be used to produce a random ordering

np.random.permutation(1000)  # a random permutation of 0-999
array([956, 614, 993, 437, 371, 215, 579, 282, 301, 646, 893,   7, 441,
       ...
       216, 464, 139, 619, 588, 969, 176, 276, 142, 639, 765, 849])
# shuffle both the rows and the columns
df.take(indices=np.random.permutation(1000),axis=0).take(indices=np.random.permutation(3),axis=1)

A C B
810 0.056462 0.836914 0.105296
2 0.140999 0.752930 0.921772
721 0.941986 0.206568 0.283233
803 0.302248 0.027969 0.946815
46 0.576391 0.604795 0.199215
224 0.091563 0.448896 0.460941
682 0.081894 0.360009 0.174743
894 0.758221 0.311932 0.054626
389 0.951142 0.174418 0.764700
441 0.283697 0.577370 0.698306
350 0.623445 0.681211 0.547610
53 0.186217 0.617344 0.339724
467 0.231915 0.540558 0.972880
962 0.543442 0.895628 0.444214
598 0.516110 0.047393 0.670478
337 0.022056 0.222698 0.010719
481 0.182805 0.301250 0.652167
277 0.127561 0.749532 0.170472
162 0.767004 0.261541 0.381312
250 0.847071 0.344957 0.539958
416 0.369274 0.495600 0.393579
425 0.228196 0.273655 0.114908
843 0.394974 0.904397 0.875514
893 0.451844 0.336345 0.787189
492 0.516625 0.499929 0.350670
453 0.218878 0.957251 0.308231
186 0.611224 0.981765 0.809362
243 0.092659 0.374212 0.658671
522 0.773774 0.436375 0.037527
961 0.172133 0.762221 0.800747
... ... ... ...
624 0.587435 0.183552 0.831386
675 0.636248 0.542904 0.918788
861 0.519202 0.322943 0.315798
989 0.951651 0.942564 0.002912
136 0.940608 0.069835 0.504026
950 0.294872 0.712361 0.821118
529 0.648302 0.860493 0.626701
833 0.783501 0.823326 0.357173
173 0.181090 0.697154 0.906783
615 0.177069 0.732558 0.275658
182 0.091686 0.262477 0.340532
913 0.069850 0.903723 0.102737
417 0.353772 0.345310 0.618327
487 0.697415 0.083422 0.921608
345 0.331507 0.295755 0.995060
978 0.323142 0.978349 0.686965
197 0.947977 0.235533 0.295503
133 0.428408 0.963203 0.485624
214 0.861541 0.840486 0.435903
640 0.453934 0.807253 0.940066
977 0.031232 0.918445 0.029839
698 0.780159 0.042282 0.127449
427 0.326411 0.101616 0.915007
898 0.768911 0.231629 0.451036
77 0.718200 0.682757 0.986735
865 0.553171 0.535761 0.088467
513 0.203601 0.908238 0.116113
711 0.655778 0.164941 0.472295
685 0.012172 0.035356 0.501114
801 0.891855 0.355426 0.682663

1000 rows × 3 columns

  • np.random.permutation(x) generates a random permutation of the x integers 0 to x-1
df.take(indices=np.random.permutation(1000),axis=0).take(indices=np.random.permutation(3),axis=1)[0:5]

B A C
839 0.817163 0.346661 0.113644
708 0.644456 0.327089 0.081710
244 0.852833 0.366820 0.028656
728 0.627186 0.850947 0.375577
238 0.784179 0.764240 0.579280

Random sampling

When the DataFrame is large enough, np.random.permutation(x) combined with the take() function implements random sampling.
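A sketch of that sampling idea: shuffle all row positions with permutation, then take only the first n.

```python
import numpy as np
from pandas import DataFrame

df = DataFrame(data=np.random.random(size=(1000, 3)), columns=['A', 'B', 'C'])

# slice n=5 positions out of a full random permutation of the 1000 rows
n = 5
sample = df.take(indices=np.random.permutation(1000)[:n], axis=0)
```

Newer pandas versions also provide df.sample(n=5), which does the same job in one call.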

5. Grouped data processing [key topic]

Data aggregation is the last step of data processing; it usually reduces each group of values to a single value.

Grouped data processing:

  • Split: first divide the data into groups
  • Apply: apply functions to the different groups to transform their data
  • Combine: merge the per-group results back together

The core of grouped data processing:

  • the groupby() function
  • the groups attribute shows how the data was grouped
  • e.g. df.groupby(by='item').groups

Grouping

df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                'price':[4,3,3,2.5,4,2],
                'color':['red','yellow','yellow','green','green','green'],
                'weight':[12,20,50,30,20,44]})
df

color item price weight
0 red Apple 4.0 12
1 yellow Banana 3.0 20
2 yellow Orange 3.0 50
3 green Banana 2.5 30
4 green Orange 4.0 20
5 green Apple 2.0 44
  • Group with groupby
df.groupby(by='item',axis=0)
<pandas.core.groupby.DataFrameGroupBy object at 0x00000235AA6F6C18>
  • View the grouping with groups
# this call groups the data but does not display the grouping
df.groupby(by='item',axis=0).groups
{'Apple': Int64Index([0, 5], dtype='int64'),
'Banana': Int64Index([1, 3], dtype='int64'),
'Orange': Int64Index([2, 4], dtype='int64')}
  • Aggregation after grouping: within each group, values that can take part in the computation are aggregated; values that cannot are left out
# create a new column in df holding the average price of each fruit
df.groupby(by='item',axis=0).mean()['price']
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
mean_price_series = df.groupby(by='item',axis=0)['price'].mean()
mean_price_series
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
# mapping table
dic = mean_price_series.to_dict()
df['mean_price'] = df['item'].map(dic)
df

color item price weight mean_price
0 red Apple 4.0 12 3.00
1 yellow Banana 3.0 20 2.75
2 yellow Orange 3.0 50 3.50
3 green Banana 2.5 30 2.75
4 green Orange 4.0 20 3.50
5 green Apple 2.0 44 3.00

Compute the average price of apples

df

color item price weight
0 red Apple 4.0 12
1 yellow Banana 3.0 20
2 yellow Orange 3.0 50
3 green Banana 2.5 30
4 green Orange 4.0 20
5 green Apple 2.0 44

View the average price of the fruit of each color

Combine: merge the per-color average prices back into df

df['color_mean_price'] = df['color'].map(df.groupby(by='color')['price'].mean().to_dict())
df

color item price weight mean_price color_mean_price
0 red Apple 4.0 12 3.00 4.000000
1 yellow Banana 3.0 20 2.75 3.000000
2 yellow Orange 3.0 50 3.50 3.000000
3 green Banana 2.5 30 2.75 2.833333
4 green Orange 4.0 20 3.50 2.833333
5 green Apple 2.0 44 3.00 2.833333

6. Advanced data aggregation

After grouping with groupby, transform and apply can also be given custom functions to perform more kinds of computation

  • df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
  • both transform and apply perform the computation; just pass the function into transform or apply
  • transform and apply can also take a lambda expression
df.groupby(by='item')['price'].mean()
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
def my_mean(s):
    total = 0          # avoid shadowing the built-in sum
    for i in s:
        total += i
    return total / len(s)

# use apply to compute the average price of each fruit
df.groupby(by='item')['price'].apply(my_mean)
item
Apple 3.00
Banana 2.75
Orange 3.50
Name: price, dtype: float64
# use transform to compute the average price of each fruit
df.groupby(by='item')['price'].transform(my_mean)
0    3.00
1 2.75
2 3.50
3 2.75
4 3.50
5 3.00
Name: price, dtype: float64
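One practical difference worth noting: transform returns a result aligned with the original rows, so it can be assigned back as a column in one step, without the to_dict()/map() detour used earlier. A sketch with the same fruit data:

```python
from pandas import DataFrame

df = DataFrame({'item': ['Apple', 'Banana', 'Orange', 'Banana', 'Orange', 'Apple'],
                'price': [4, 3, 3, 2.5, 4, 2]})

# transform broadcasts each group's mean back to every row of that group
df['mean_price'] = df.groupby(by='item')['price'].transform('mean')
```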