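Why
Looking back at how I got into data analysis: I started with SPSS and Excel in early 2015, during the second semester of my freshman year. I was doing well in statistics, and SPSS's point-and-click workflow was easy to pick up. Gradually, though, I found that the worst thing about packaged software is its lack of flexibility: you cannot do whatever you want. Programming languages are the most flexible option. I first used R, with its command-style workflow, but it never felt quite right to me, so I moved to Python for data analysis and processing.
The book I read back then was the first edition (2012) of Python for Data Analysis, in Chinese translation, and I loved it. Around 2017-18 I found on GitHub that the author had produced a second edition, moving from Python 2 to Python 3. Since the author is one of the principal founders of pandas, the book is certainly worth recommending, so I decided to copy it out in full. My way of learning has always been imitation, that is, copying books: copying a book from start to finish while following the author's train of thought has benefited me enormously. In a way, this also counts as the first English book I have "translated".
Copying books has probably been the happiest part of my studies. Copying is really a conversation with the author, and now and then I add my own thoughts, which is endlessly enjoyable.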
Abstract
NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Most computational packages providing scientific functionality use NumPy's array objects as the lingua franca for data exchange.
Here are some of the things you'll find in NumPy:
- ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
- Mathematical functions for fast operations on entire arrays of data without having to write loops.
- Tools for reading/writing array data to disk and working with memory-mapped files.
- Linear algebra, random number generation, and Fourier transform capabilities.
- A C API for connecting NumPy with libraries written in C, C++, or FORTRAN.
Because NumPy provides an easy-to-use C API, it is straightforward to pass data to external libraries written in a low-level language and also for external libraries to return data to Python as NumPy arrays. This feature has made Python a language of choice for wrapping legacy C/C++/Fortran codebases and giving them a dynamic and easy-to-use interface. (Note: the data is handed to fast low-level code and the result comes back to Python; this is Python acting as a glue language.)
While NumPy by itself does not provide modeling or scientific functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools with array-oriented semantics, like pandas, much more effectively. Since NumPy is a large topic, I will cover many advanced NumPy features like broadcasting in more depth later.
For most data analysis applications, the main areas of functionality I'll focus on are:
- Fast vectorized array operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations.
- Common array algorithms like sorting, unique, and set operations.
- Efficient descriptive statistics and aggregating/summarizing data.
- Data alignment and relational data manipulations for merging and joining together heterogeneous datasets.
- Expressing conditional logic as array expressions instead of loops with if-elif-else branches.
- Group-wise data manipulations (aggregation, transformation, function application).
While NumPy provides a computational foundation for general numerical data processing, many readers will want to use pandas as the basis for most kinds of statistics or analytics, especially on tabular data. pandas also provides some more domain-specific functionality like time series manipulation, which is not present in NumPy.
Array-oriented computing in Python traces its roots back to 1995, when Jim Hugunin created the Numeric library. Over the next 10 years, many scientific programming communities began doing array programming in Python, but the library ecosystem had become fragmented in the early 2000s. In 2005, Travis Oliphant was able to forge the NumPy project from the then Numeric and Numarray projects to bring the community together around a single array computing framework.
One of the reasons NumPy is so important for numerical computations in Python is that it is designed for efficiency on large arrays of data. There are a number of reasons for this:
- NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy's library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.
- NumPy operations perform complex computations on entire arrays without the need for Python for loops.
To give you an idea of the performance difference, consider a NumPy array of one million integers, and the equivalent Python list:
import numpy as np
my_arr = np.arange(1000000)
my_list = list(range(1000000))
Now let's multiply each sequence by 2:
# %time (an IPython magic) reports how long a single statement takes to run
%time for _ in range(10): my_arr2 = my_arr * 2
print('*'*50)
%time for _ in range(10): my_list2 = [x *2 for x in my_list]
Wall time: 35 ms
**************************************************
Wall time: 1.57 s
NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.
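The %time magic used above only works inside IPython or Jupyter. As a rough sketch, the same comparison can be run in plain Python with the standard-library timeit module. The repeat count of 10 mirrors the loop above; the timings themselves depend on the machine, so no expected numbers are given.

```python
import timeit

import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

# Total time in seconds for 10 repetitions of each approach
t_numpy = timeit.timeit(lambda: my_arr * 2, number=10)
t_list = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)

print(f"NumPy: {t_numpy:.4f} s, list comprehension: {t_list:.4f} s")
```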
The NumPy ndarray: A Multidimensional Array Object
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements. (Note: an operation between an array and a scalar is applied to every element of the array.)
To give you a flavor of how NumPy enables batch computations with similar syntax to scalar values on built-in Python objects, I first import NumPy and generate a small array of random data:
import numpy as np
# generate some random data
# randn draws samples from the standard normal distribution N(0, 1)
data = np.random.randn(2,3)
data
array([[ 1.5610358 , 1.47201866, 0.64378465],
[ 0.39354435, -1.35112498, -3.12279483]])
Then I write mathematical operations with data:
data * 10
array([[ 15.61035804, 14.72018662, 6.4378465 ],
[ 3.93544348, -13.51124975, -31.22794833]])
data + data
array([[ 3.12207161, 2.94403732, 1.2875693 ],
[ 0.7870887 , -2.70224995, -6.24558967]])
In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each "cell" in the array have been added to each other.
In this chapter and throughout the book, I use the standard NumPy convention of always using "import numpy as np". You are, of course, welcome to put "from numpy import *" in your code to avoid having to write np., but I advise against making a habit of this. The NumPy namespace is large and contains a number of functions whose names conflict with built-in Python functions (like min and max).
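To make the name-conflict risk concrete, here is a small sketch using sum, another name that exists both as a Python built-in and in NumPy (my choice of example, not one from the book). The two functions interpret a second positional argument differently, so a star import that silently replaces the built-in can change results.

```python
import builtins

import numpy as np

values = range(5)

# Built-in sum: the second argument is a start value added to the total
builtin_result = builtins.sum(values, -1)   # (0+1+2+3+4) + (-1)

# np.sum: the second positional argument is the axis to sum over
numpy_result = np.sum(values, -1)           # sum along the last axis

print(builtin_result, numpy_result)
```

After "from numpy import *", a bare call to sum(values, -1) would use the NumPy version, which is why the explicit np. prefix is the safer habit.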
An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:
data.shape
(2, 3)
data.dtype
dtype('float64')
This chapter will introduce you to the basics of using NumPy arrays, and should be sufficient for following along with the rest of the book. While it's not necessary to have a deep understanding of NumPy for many data analytical applications, becoming proficient in array-oriented programming and thinking is a key step along the way to becoming a scientific Python guru.
Whenever you see "array", "NumPy array", or "ndarray" in the text, with few exceptions they all refer to the same thing: the ndarray object.
Creating ndarrays
The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1
array([6. , 7.5, 8. , 0. , 1. ])
Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:
data2 = [[1,2,3,4], [5,6,7,8]]
arr2 = np.array(data2)
arr2
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
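# my test
cj_data2 = [[1,2,3,4], [5,6]]
# recent NumPy versions need dtype=object for sublists of unequal length;
# without it np.array raises a ValueError
cj_arr2 = np.array(cj_data2, dtype=object)
"Sublists of unequal length are each stored as one object element; no extra dimension is added"
cj_arr2
cj_arr2[1]
cj_arr2.shape
cj_arr2.ndim
'Sublists of unequal length are each stored as one object element; no extra dimension is added'
array([list([1, 2, 3, 4]), list([5, 6])], dtype=object)
[5, 6]
(2,)
1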
Since data2 was a list of lists, the NumPy array arr2 has two dimensions with shape inferred from the data. We can confirm this by inspecting the ndim and shape attributes:
arr2.ndim
arr2.shape
2
(2, 4)
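# my test
"Dimensions = depth of list nesting: the number of indices needed to reach one element is the number of dimensions"
cj_arr2 = np.array([[1,2],[2,3],[5,6]])
cj_arr2
"Selecting the first element, 1, takes two indices, so the array is 2-D"
cj_arr2[0][0]
"{}-D".format(cj_arr2.ndim)
"shape {}".format(cj_arr2.shape)
"3-D: selecting the first element takes 3 indices, i.e. 3 levels of list nesting"
cj_arr3 = np.array([[[1,2],[3,4]], [[5,6],[7,8]]])
cj_arr3
"Get the first element"
cj_arr3[0][0][0]
"{}-D".format(cj_arr3.ndim)
"shape is read from the outermost [ inward; the last value is the number of elements in each innermost list"
"shape {}".format(cj_arr3.shape)
'Dimensions = depth of list nesting: the number of indices needed to reach one element is the number of dimensions'
array([[1, 2],
[2, 3],
[5, 6]])
'Selecting the first element, 1, takes two indices, so the array is 2-D'
1
'2-D'
'shape (3, 2)'
'3-D: selecting the first element takes 3 indices, i.e. 3 levels of list nesting'
array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
'Get the first element'
1
'3-D'
'shape is read from the outermost [ inward; the last value is the number of elements in each innermost list'
'shape (2, 2, 2)'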
Unless explicitly specified (more on this later), np.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object; for example, in the previous two examples we have:
arr1.dtype
arr2.dtype
dtype('float64')
dtype('int32')
In addition to np.array, there are a number of other functions for creating new arrays. As examples, zeros and ones create arrays of 0s or 1s, respectively, with a given length or shape. empty creates an array without initializing its values to any particular value. To create a higher dimensional array with these methods, pass a tuple for the shape:
np.zeros(10)
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
np.zeros((3,6))
array([[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0.]])
temp = np.empty((2,3,2))
temp
temp.ndim
array([[[1.16444061e-311, 2.47032823e-322],
[0.00000000e+000, 0.00000000e+000],
[0.00000000e+000, 2.42336543e-057]],
[[6.20393848e-091, 4.01519933e-057],
[2.32783945e-057, 4.50722901e+174],
[3.99910963e+252, 2.34394769e-056]]])
3
It's not safe to assume that np.empty will return an array of all zeros. In some cases, it may return uninitialized "garbage" values.
arange is an array-valued version of the built-in Python range function:
np.arange(15)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
See Table 4-1 for a short list of standard array creation functions. Since NumPy is focused on numerical computing, the data type, if not specified, will in many cases be float64 (floating point).
- array Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype; copies the input data by default.
- asarray Convert input to ndarray, but do not copy if the input is already an ndarray (the result is the same object, so changes to it affect the original).
- arange Like the built-in range but returns an ndarray instead of a list.
- ones Produce an array of all 1s with the given shape and dtype.
- ones_like Takes another array and produces a ones array of the same shape and dtype.
- zeros Like ones but producing an array of 0s instead.
- zeros_like Like ones_like but producing an array of 0s instead.
- empty Create a new array by allocating new memory, but do not populate it with any values like ones and zeros do.
- empty_like Like empty, but takes the shape and dtype from another array.
- full Produce an array of the given shape and dtype with all values set to the indicated "fill value".
- full_like Takes another array and produces a filled array of the same shape and dtype. (Note: full and full_like can stand in for zeros, ones, and their _like variants.)
- eye, identity Create a square N x N identity matrix (1s on the diagonal and 0s elsewhere).
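A few of the creation functions listed above, as a minimal sketch. The variable names are only for illustration; the point of the last two lines is the copy behavior that distinguishes array from asarray.

```python
import numpy as np

base = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)

ones_same_shape = np.ones_like(base)   # 2x3 array of 1.0, dtype float64
sevens = np.full((2, 3), 7)            # 2x3 array filled with 7
identity = np.eye(3)                   # 3x3 identity matrix

# asarray returns the input itself when it is already an ndarray,
# while array makes a copy by default
same_object = np.asarray(base) is base
copied = np.array(base) is not base
```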
Data Types for ndarray
The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:
arr1 = np.array([1,2,3], dtype=np.float64)
arr2 = np.array([1,2,3], dtype=np.int32)
"Same data, but arr1 and arr2 have different dtypes"
arr1.dtype
arr2.dtype
'Same data, but arr1 and arr2 have different dtypes'
dtype('float64')
dtype('int32')
dtypes are a source of NumPy's flexibility for interacting with data coming from other systems. In most cases they provide a mapping directly onto an underlying disk or memory representation, which makes it easy to read and write binary streams of data to disk and also to connect to code written in a low-level language like C or Fortran.
The numerical dtypes are named the same way: a type name, like float or int, followed by a number indicating the number of bits per element.
A standard double-precision floating-point value (what's used under the hood in Python's float object) takes up 8 bytes or 64 bits. Thus, this type is known in NumPy as float64. See Table 4-2 for a full listing of NumPy's supported data types.
Don't worry about memorizing the NumPy dtypes, especially if you are a new user. It's often only necessary to care about the general kind of data you're dealing with, whether floating point, complex, integer, boolean, string, or general Python object. When you need more control over how data are stored in memory and on disk, especially large datasets, it is good to know that you have control over the storage type.
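- int8, uint8 Signed and unsigned 8-bit (1 byte) integer types
- int16, uint16, int32, uint32, int64, uint64 Signed and unsigned 16-, 32-, and 64-bit integer types
- float16 f2 Half-precision floating point
- float32, float64, float128 Single-, double-, and extended-precision floating point
- complex64, complex128, complex256 Complex numbers represented by two 32-, 64-, or 128-bit floats, respectively
- bool ? Boolean type storing True and False values
- object O Python object type; a value can be any Python object
- string_ S Fixed-length ASCII string type (1 byte per character); for example, to create a string dtype with length 10, use 'S10' (single-byte encoding, so Chinese text is not supported)
- unicode_ U Fixed-length Unicode type (number of bytes platform specific); same specification semantics as string_ (e.g., 'U10'); supports non-ASCII text such as Chinese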
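The number in a dtype name is the element size in bits, which can be checked through the itemsize attribute (in bytes). The sketch below also shows why the storage type matters for large arrays; the array length of 1000 is arbitrary.

```python
import numpy as np

# itemsize is the number of bytes used by one element
sizes = {name: np.dtype(name).itemsize
         for name in ["int8", "int32", "float16", "float64", "complex128"]}
print(sizes)

# The same number of elements takes 8x more memory as float64 than as int8
small = np.zeros(1000, dtype=np.int8)
large = np.zeros(1000, dtype=np.float64)
print(small.nbytes, large.nbytes)
```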
You can explicitly convert or cast an array from one dtype to another using ndarray's astype method:
arr = np.array([1,2,3,4,5])
arr.dtype
'astype converts the dtype'
float_arr = arr.astype(np.float64)
float_arr.dtype
dtype('int32')
'astype converts the dtype'
dtype('float64')
In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer dtype, the decimal part will be truncated:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr
arr.dtype
"Casting float64 to int32 truncates the decimal part"
arr.astype(np.int32)
array([ 3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
dtype('float64')
'Casting float64 to int32 truncates the decimal part'
array([ 3, -1, -2, 0, 12, 10])
If you have an array of strings representing numbers, you can use astype to convert them to numeric form:
# np.bytes_ is the current name for np.string_ (which was removed in NumPy 2.0)
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
"astype is not in-place: it returns a new array"
numeric_strings.astype(float)
numeric_strings.dtype
'astype is not in-place: it returns a new array'
array([ 1.25, -9.6 , 42. ])
dtype('S4')
It's important to be cautious when using the numpy.string_ type (np.bytes_ in current NumPy), as string data in NumPy is fixed size and may truncate input without warning. pandas has more intuitive out-of-the-box behavior on non-numeric data.
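A minimal sketch of the silent truncation, assuming a fixed-width byte-string dtype of 'S3' chosen only for illustration: assigning a longer value keeps just the first three bytes and raises no warning.

```python
import numpy as np

# Each element can hold at most 3 bytes
codes = np.array([b"abc", b"de"], dtype="S3")

# The assigned value is longer than 3 bytes; the extra bytes are dropped silently
codes[1] = b"abcdef"
print(codes)
```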
If casting were to fail for some reason (like a string that cannot be converted to float64), a ValueError will be raised. Here I was a bit lazy and wrote float instead of np.float64; NumPy aliases the Python types to its own equivalent data dtypes. (Note: shorthand type codes such as 'S4' or 'U4' can also be used.)
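The failure case can be sketched as follows. The input strings are made up for illustration; the second one cannot be parsed as a number, so astype raises ValueError.

```python
import numpy as np

bad_strings = np.array(["1.25", "not a number"])

try:
    bad_strings.astype(np.float64)
    failed = False
except ValueError as exc:
    # The cast is all-or-nothing: no partially converted array is returned
    failed = True
    print("cast failed:", exc)
```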
You can also use another array's dtype attribute:
int_array = np.arange(10)
calibers = np.array([.22, .270, .380, .44, .50], dtype=np.float64)
"astype() also accepts another array's dtype"
int_array.astype(calibers.dtype)
"astype() also accepts another array's dtype"
array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
There are shorthand type code strings you can also use to refer to a dtype:
empty_uint32 = np.empty(8, dtype='u4')
empty_uint32
array([3207303360, 548, 3439083652, 548, 3207427172,
548, 0, 0], dtype=uint32)
Calling astype always creates a new array (a copy of the data), even if the new dtype is the same as the old dtype. (Note: assign the result to a new variable to keep it.)
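A quick way to confirm the copy behavior of astype, sketched with np.shares_memory: even when the dtype does not change, the result does not share memory with the input, so editing it leaves the original untouched.

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0])

# Same dtype as before, yet astype still returns a fresh copy
same_dtype = arr.astype(arr.dtype)
same_dtype[0] = 99.0

print(arr)                                # original is unchanged
print(np.shares_memory(arr, same_dtype))  # no shared buffer
```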
Arithmetic with NumPy Arrays
Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operation between equal-size arrays applies the operation element-wise:
arr = np.array([[1,2,3],[4,5,6]])
arr
"* multiplies corresponding elements; it is not matrix multiplication"
arr * arr
"Subtraction"
arr - arr
array([[1, 2, 3],
[4, 5, 6]])
'* multiplies corresponding elements; it is not matrix multiplication'
array([[ 1, 4, 9],
[16, 25, 36]])
'Subtraction'
array([[0, 0, 0],
[0, 0, 0]])
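Arithmetic operations with scalars propagate the scalar argument to each element in the array:
"An operation between a scalar and an array is applied to every element of the array"
1 / arr
"Power"
arr ** 0.5
'An operation between a scalar and an array is applied to every element of the array'
array([[1. , 0.5 , 0.33333333],
[0.25 , 0.2 , 0.16666667]])
'Power'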
array([[1. , 1.41421356, 1.73205081],
[2. , 2.23606798, 2.44948974]])
Comparisons between arrays of the same size yield boolean arrays:
arr2 = np.array([[0,4,1], [7,2,12]])
arr2
"Boolean comparison: elements at corresponding positions are compared"
arr2 > arr
array([[ 0, 4, 1],
[ 7, 2, 12]])
'Boolean comparison: elements at corresponding positions are compared'
array([[False, True, False],
[ True, False, True]])
Operations between differently sized arrays are called broadcasting and will be discussed in more detail in Appendix A. Having a deep understanding of broadcasting is not necessary for most of this book.
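As a small taste of broadcasting, here is a sketch with a (2, 3) array and a length-3 array (values chosen for illustration): the shorter array is applied to every row, as if it had been repeated.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])             # shape (3,)

# row is broadcast across both rows of arr
result = arr + row
print(result)
```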
Basic Indexing and Slicing
NumPy array indexing is a rich topic, as there are many ways you may want to select a subset of your data or individual elements. One-dimensional arrays are simple; on the surface they act similarly to Python lists:
import numpy as np
arr = np.arange(10)
arr
"Select the element at index 5, i.e. the sixth element"
arr[5]
"Select indices [5:8], i.e. the 6th, 7th, and 8th elements"
arr[5:8]
"Assignment"
arr[5:8] = 12
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
'Select the element at index 5, i.e. the sixth element'
5
'Select indices [5:8], i.e. the 6th, 7th, and 8th elements'
array([5, 6, 7])
'Assignment'
array([ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9])
As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or broadcasted henceforth) to the entire selection. An important first distinction from Python's built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.
To give an example of this, I first create a slice of arr:
arr_slice = arr[5:8]
arr_slice
array([12, 12, 12])
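Now, when I change values in arr_slice, the mutations are reflected in the original array arr:
"Changing a slice of an array changes the original data, because the slice is a view"
arr_slice[1] = 12345
arr
'Changing a slice of an array changes the original data, because the slice is a view'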
array([ 0, 1, 2, 3, 4, 12, 12345, 12, 8,
9])
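The "bare" slice [:] will assign to all values in an array:
arr_slice[:] = 64
'Changing the slice changes the original data: the slice is a view onto it, not a copy'
arr
'Changing the slice changes the original data: the slice is a view onto it, not a copy'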
array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9])
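# my test list slice
cj_list = [1,2,3,4]
"Take a slice: the 2nd and 3rd items"
cj_slice = cj_list[1:3]
cj_slice
"Modifying the slice does not affect the original list"
cj_slice[1] = 'youge'
cj_list
"So a list slice is a new list (a shallow copy); reassigning its items leaves the original alone"
'Take a slice: the 2nd and 3rd items'
[2, 3]
'Modifying the slice does not affect the original list'
[1, 2, 3, 4]
'So a list slice is a new list (a shallow copy); reassigning its items leaves the original alone'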
If you are new to NumPy, you might be surprised by this, especially if you have used other array programming languages that copy data more eagerly. As NumPy has been designed to be able to work with very large arrays, you could imagine performance and memory problems if NumPy insisted on always copying data.
If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array, for example, arr[5:8].copy(). (Note: before an analysis it can be worth making a copy first and working on the copy.)
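The difference between a view and an explicit copy can be checked directly. This sketch uses np.shares_memory to ask whether two arrays use the same underlying buffer; the fill values 0 and -1 are arbitrary.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]          # a view: shares memory with arr
copy = arr[5:8].copy()   # an independent copy

view[:] = 0              # writes through to arr
copy[:] = -1             # does not touch arr

print(arr)
print(np.shares_memory(arr, view), np.shares_memory(arr, copy))
```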
With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:
# arr2d = np.array([[],[],[]])
arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
# cj: extra checks
"dim:{}".format(arr2d.ndim)
"shape:{}".format(arr2d.shape)
arr2d[2]
"Get the element 2 by indexing from the outside in: arr2d[0][1]"
arr2d[0][1]
"Equivalent to arr2d[0,1]"
arr2d[0,1]
'dim:2'
'shape:(3, 3)'
array([7, 8, 9])
'Get the element 2 by indexing from the outside in: arr2d[0][1]'
2
'Equivalent to arr2d[0,1]'
2
Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements. So these are equivalent:
arr2d[0][2]
"arr2d[0][2] is equivalent to arr2d[0, 2]"
arr2d[0,2]
3
'arr2d[0][2] is equivalent to arr2d[0, 2]'
3
See Figure 4-1 for an illustration of indexing on a two-dimensional array. I find it helpful to think of axis 0 as the "rows" of the array and axis 1 as the "columns". (Note: axis 0 runs down the rows, top to bottom; axis 1 runs across the columns, left to right. Figure to be added.)
In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions. So in the 2 x 2 x 3 array arr3d:
# arr3d = np.array([ [],[] ])  first 2: the outermost list holds 2 items
# arr3d = np.array([ [ [], [] ],[ [], [] ] ])  second 2: one level in, each [] holds 2 items
# arr3d = np.array([ [ [], [] ],[ [], [] ] ])  last 3: one more level in, each innermost [] holds 3 actual elements
# arr3d = np.array([ [ [1,2,3], [4,5,6] ],[ [7,8,9], [10,11,12] ] ])  the same structure with the elements filled in
arr3d = np.array([ [ [1,2,3], [4,5,6] ],[ [7,8,9], [10,11,12] ] ])
arr3d
"ndim:{}".format(arr3d.ndim)
"shape:{}".format(arr3d.shape)
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
'ndim:3'
'shape:(2, 2, 3)'
arr3d[0] is a 2 x 3 array:
# arr3d = np.array([ [ [1,2,3], [4,5,6] ],[ [7,8,9], [10,11,12] ] ])
# arr3d[0] is the first item of the outermost level: [ [1,2,3], [4,5,6] ]
arr3d[0]
array([[1, 2, 3],
[4, 5, 6]])
Both scalar values and arrays can be assigned to arr3d[0]:
old_values = arr3d[0].copy()
"copy() makes an independent copy, no longer tied to the original data"
arr3d[0] = 42
arr3d
"Restore the original values"
arr3d[0] = old_values
arr3d
'copy() makes an independent copy, no longer tied to the original data'
array([[[42, 42, 42],
[42, 42, 42]],
[[ 7, 8, 9],
[10, 11, 12]]])
'Restore the original values'
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0), forming a 1-dimensional array:
# arr3d = np.array([ [ [1,2,3], [4,5,6] ],[ [7,8,9], [10,11,12] ] ])
# arr3d[1, 0] picks out [7,8,9]
"arr3d[1,0] and arr3d[1][0] are equivalent: 7, 8, 9"
arr3d[1,0]
"1 indexes the outermost level; 0 indexes within that result"
arr3d[1][0]
"To get the first element: arr3d[0][0][0] or arr3d[0,0,0]"
arr3d[0,0,0]
"I find arr3d[0][0][0] shows the arrays-nested-in-arrays structure more intuitively"
arr3d[0][0][0]
'arr3d[1,0] and arr3d[1][0] are equivalent: 7, 8, 9'
array([7, 8, 9])
'1 indexes the outermost level; 0 indexes within that result'
array([7, 8, 9])
'To get the first element: arr3d[0][0][0] or arr3d[0,0,0]'
1
'I find arr3d[0][0][0] shows the arrays-nested-in-arrays structure more intuitively'
1
This expression is the same as though we had indexed in two steps:
x = arr3d[1]
x
array([[ 7, 8, 9],
[10, 11, 12]])
x[0]
"Combining the two steps: arr3d[1][0], equivalently written arr3d[1,0]"
arr3d[1,0]
array([7, 8, 9])
'Combining the two steps: arr3d[1][0], equivalently written arr3d[1,0]'
array([7, 8, 9])
Note that in all of these cases where subsections of the array have been selected, the returned arrays are views. (Note: modifying a view also modifies the original data; use copy() to get an independent copy.)
Indexing with slices
Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax:
arr
"arr[1:6] selects the 2nd through 6th elements, just as with a list: start included, end excluded"
arr[1:6]
array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9])
'arr[1:6] selects the 2nd through 6th elements, just as with a list: start included, end excluded'
array([ 1, 2, 3, 4, 64])
Consider the two-dimensional array from before, arr2d. Slicing this array is a bit different:
arr2d
"arr2d[:2] slices the first level, taking items 0 and 1"
arr2d[:2]
"Then index 0 within that two-row subset, i.e. [1,2,3]"
arr2d[:2][0]
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
'arr2d[:2] slices the first level, taking items 0 and 1'
array([[1, 2, 3],
[4, 5, 6]])
'Then index 0 within that two-row subset, i.e. [1,2,3]'
array([1, 2, 3])
As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range of elements along an axis. It can be helpful to read the expression arr2d[:2] as "select the first two rows of arr2d."
You can pass multiple slices just like you can pass multiple indexes:
arr2d
"Goal: select [[2,3],[5,6]]"
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
'Goal: select [[2,3],[5,6]]'
"Step 1: select rows (axis 0): arr2d[:2] gives the first two rows"
arr2d[:2]
"Step 2: from the rows chosen in step 1, [[1,2,3],[4,5,6]], select columns 2 and 3 (axis 1)"
arr2d[:2, 1:]
"Combined: arr2d[:2, 1:]"
arr2d[:2, 1:]
'Step 1: select rows (axis 0): arr2d[:2] gives the first two rows'
array([[1, 2, 3],
[4, 5, 6]])
'Step 2: from the rows chosen in step 1, [[1,2,3],[4,5,6]], select columns 2 and 3 (axis 1)'
array([[2, 3],
[5, 6]])
'Combined: arr2d[:2, 1:]'
array([[2, 3],
[5, 6]])
When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indexes and slices, you get lower dimensional slices.
For example, I can select the second row but only the first two columns like so:
arr2d[1, :2]
array([4, 5])
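Similarly, I can select the third column (here across all rows) like so:
"With a colon it is just row and column indexing; nothing more to it in two dimensions"
"To select the third column, take everything along the rows and only index 2 along the columns: arr2d[:, 2]"
arr2d[:, 2]
"By contrast, arr2d[:][2] indexes axis 0 both times, whereas in arr2d[:, 2] the second index is axis 1"
arr2d[:][2]
'With a colon it is just row and column indexing; nothing more to it in two dimensions'
'To select the third column, take everything along the rows and only index 2 along the columns: arr2d[:, 2]'
array([3, 6, 9])
'By contrast, arr2d[:][2] indexes axis 0 both times, whereas in arr2d[:, 2] the second index is axis 1'
array([7, 8, 9])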
See Figure 4-2 for an illustration. Note that a colon by itself means to take the entire axis, so you can slice only higher dimensional axes by doing:
"Select all rows and the first column, i.e. 1, 4, 7, arranged by row with one element per row"
temp = arr2d[:, :1]
temp
"shape:{}".format(temp.shape)
'Select all rows and the first column, i.e. 1, 4, 7, arranged by row with one element per row'
array([[1],
[4],
[7]])
'shape:(3, 1)'
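The rule about dimensions can be summarized in a short sketch: each integer index removes one dimension, while each slice keeps it, even when the slice selects a single row or column.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row_1d = arr2d[1, :2]     # integer + slice -> 1-D, shape (2,)
row_2d = arr2d[1:2, :2]   # slice + slice   -> 2-D, shape (1, 2)
col_1d = arr2d[:, 0]      # slice + integer -> 1-D, shape (3,)
col_2d = arr2d[:, :1]     # slice + slice   -> 2-D, shape (3, 1)

print(row_1d.shape, row_2d.shape, col_1d.shape, col_2d.shape)
```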
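Of course, assigning to a slice expression assigns to the whole selection:
"Select the first two rows, columns 2 and 3, and assign 0: this modifies a view"
arr2d[:2, 1:] = 0
"Modifying the view means the original data is changed too"
arr2d
'Select the first two rows, columns 2 and 3, and assign 0: this modifies a view'
'Modifying the view means the original data is changed too'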
array([[1, 0, 0],
[4, 0, 0],
[7, 8, 9]])
Boolean Indexing
Let's consider an example where we have some data in an array and an array of names with duplicates. I'm going to use here the randn function in numpy.random to generate some random normally distributed data:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'will', 'Joe', 'Joe'])
data = np.random.randn(7,4)
names
data
array(['Bob', 'Joe', 'Will', 'Bob', 'will', 'Joe', 'Joe'], dtype='<U4')
array([[-0.88909951, 0.68369085, -0.08513694, 1.06803488],
[-0.51138662, 0.21741227, 0.03529778, 0.84034434],
[-1.81094959, -1.46366973, -1.92178824, -0.02783088],
[ 0.75025272, 0.79452051, -0.37595293, 0.49722783],
[ 0.54415236, 0.90016251, 1.57747441, 0.43730563],
[-1.86114486, -0.90310865, -0.59506003, -0.70246864],
[ 0.24035417, -0.91880007, 0.37165625, 0.29182528]])
Suppose each name corresponds to a row in the data array and we wanted to select all the rows with corresponding name 'Bob'. Like arithmetic operations, comparisons (such as ==) with arrays are also vectorized. Thus, comparing names with the string 'Bob' yields a boolean array:
names == 'Bob'
array([ True, False, False, True, False, False, False])
This boolean array can be passed when indexing the array:
"The two True values select the matching rows (rows 0 and 3)"
data[names == 'Bob']
'The two True values select the matching rows (rows 0 and 3)'
array([[-0.88909951, 0.68369085, -0.08513694, 1.06803488],
[ 0.75025272, 0.79452051, -0.37595293, 0.49722783]])
This boolean array must be of the same length as the array axis it's indexing. You can even mix and match boolean arrays with slices or integers (or sequences of integers; more on this later). (Note: one row is selected for each True in the mask.)
Take care to get the mask length right. Older NumPy versions did not fail when the boolean array was the wrong length; recent versions raise an IndexError instead.
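How a wrong-length mask behaves depends on the NumPy version. The sketch below assumes a recent NumPy release, where a boolean mask whose length does not match the indexed axis raises IndexError; the array contents are made up for illustration.

```python
import numpy as np

data = np.arange(28).reshape(7, 4)
short_mask = np.array([True, False, True])   # length 3, but data has 7 rows

try:
    data[short_mask]
    raised = False
except IndexError as exc:
    # Assumption: recent NumPy versions reject the mismatched mask
    raised = True
    print("IndexError:", exc)
```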
In these examples, I select from the rows where names == 'Bob' and index the columns, too:
data[names == 'Bob', 2:]
data[names == 'Bob', 3:]
array([[-0.08513694, 1.06803488],
[-0.37595293, 0.49722783]])
array([[1.06803488],
[0.49722783]])
To select everything but 'Bob', you can either use != or negate the condition using ~:
names != 'Bob'
array([False, True, True, False, True, True, True])
"~ negates the boolean array"
data[~(names == 'Bob')]
'~ negates the boolean array'
array([[-0.51138662, 0.21741227, 0.03529778, 0.84034434],
[-1.81094959, -1.46366973, -1.92178824, -0.02783088],
[ 0.54415236, 0.90016251, 1.57747441, 0.43730563],
[-1.86114486, -0.90310865, -0.59506003, -0.70246864],
[ 0.24035417, -0.91880007, 0.37165625, 0.29182528]])
The ~ operator can be useful when you want to invert a general condition:
cond = names == 'Bob'
"Negation: this feels quite powerful"
data[~cond]
'Negation: this feels quite powerful'
array([[-0.51138662, 0.21741227, 0.03529778, 0.84034434],
[-1.81094959, -1.46366973, -1.92178824, -0.02783088],
[ 0.54415236, 0.90016251, 1.57747441, 0.43730563],
[-1.86114486, -0.90310865, -0.59506003, -0.70246864],
[ 0.24035417, -0.91880007, 0.37165625, 0.29182528]])
To select two of the three names by combining multiple boolean conditions, use boolean arithmetic operators like & (and) and | (or):
mask = (names == 'Bob') | (names == 'Will')
mask
data[mask]
array([ True, False, True, True, False, False, False])
array([[-0.88909951, 0.68369085, -0.08513694, 1.06803488],
[-1.81094959, -1.46366973, -1.92178824, -0.02783088],
[ 0.75025272, 0.79452051, -0.37595293, 0.49722783]])
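One pitfall worth sketching here: the Python keywords and and or do not work element-wise on boolean arrays, so the & and | operators (with parentheses around each comparison) are needed. The np.isin call at the end is an alternative I am adding for comparison; it is not part of the example above.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'will', 'Joe', 'Joe'])

# Element-wise "or" between two boolean arrays
mask = (names == 'Bob') | (names == 'Will')

# The keyword "or" asks for the truth value of a whole array, which is ambiguous
try:
    _ = (names == 'Bob') or (names == 'Will')
    keyword_failed = False
except ValueError:
    keyword_failed = True

# Membership test against a list of values gives the same mask
same_mask = np.isin(names, ['Bob', 'Will'])
```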
Selecting data from an array by boolean indexing always creates a copy of the data, even if the returned array is unchanged.
Setting values with boolean arrays works in a common-sense way. To set all of the negative values in data to 0 we need only do:
"Replace every value in the array that is less than 0 with 0"
data[data < 0] = 0
data
'Replace every value in the array that is less than 0 with 0'
array([[0. , 0.68369085, 0. , 1.06803488],
[0. , 0.21741227, 0.03529778, 0.84034434],
[0. , 0. , 0. , 0. ],
[0.75025272, 0.79452051, 0. , 0.49722783],
[0.54415236, 0.90016251, 1.57747441, 0.43730563],
[0. , 0. , 0. , 0. ],
[0.24035417, 0. , 0.37165625, 0.29182528]])
Setting whole rows or columns using one-dimensional boolean array is also easy:
data[names != 'Joe'] = 7
data
array([[7. , 7. , 7. , 7. ],
[0. , 0.21741227, 0.03529778, 0.84034434],
[7. , 7. , 7. , 7. ],
[7. , 7. , 7. , 7. ],
[7. , 7. , 7. , 7. ],
[0. , 0. , 0. , 0. ],
[0.24035417, 0. , 0.37165625, 0.29182528]])
As we will see later, these types of operations on two-dimensional data are convenient to do with pandas.
Fancy Indexing
Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
Suppose we had an 8 x 4 array:
"Create an empty 8x4 array"
arr = np.empty((8,4))
"Set every value in each row to that row's index"
for i in range(8):
    arr[i] = i
arr
'Create an empty 8x4 array'
"Set every value in each row to that row's index"
array([[0., 0., 0., 0.],
[1., 1., 1., 1.],
[2., 2., 2., 2.],
[3., 3., 3., 3.],
[4., 4., 4., 4.],
[5., 5., 5., 5.],
[6., 6., 6., 6.],
[7., 7., 7., 7.]])
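To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order:
"The values in the index list are row numbers; an out-of-range value raises an error"
arr[[4,3,0,6]]
'The values in the index list are row numbers; an out-of-range value raises an error'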
array([[4., 4., 4., 4.],
[3., 3., 3., 3.],
[0., 0., 0., 0.],
[6., 6., 6., 6.]])
Hopefully this code did what you expected! Using negative indices selects rows from the end (-1 is the last row, -2 the one before it, and so on):
arr[[-3, -5, -7]]
array([[5., 5., 5., 5.],
[3., 3., 3., 3.],
[1., 1., 1., 1.]])
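Passing multiple index arrays does something slightly different; it selects a one-dimensional array of elements corresponding to each tuple of indices. (Note: the arrays are paired up position by position as [row index, column index], so [[1,2],[3,4]] selects the elements at (1,3) and (2,4); index arrays whose lengths do not match and cannot be broadcast together raise an error.)
arr = np.arange(32).reshape((8,4))
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31]])
"The selected elements are at [row, column] indices [1,0], [5,3], [7,1], [2,2]"
"Row/column pairs: [1,0], [5,3], [7,1], [2,2]"
"In 1-based coordinates: (2,1), (6,4), (8,2), (3,3)"
"Index arrays of mismatched lengths raise an error"
arr[[1,5,7,2], [0,3,1,2]]
'The selected elements are at [row, column] indices [1,0], [5,3], [7,1], [2,2]'
'Row/column pairs: [1,0], [5,3], [7,1], [2,2]'
'In 1-based coordinates: (2,1), (6,4), (8,2), (3,3)'
'Index arrays of mismatched lengths raise an error'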
array([ 4, 23, 29, 10])
Here the elements (1, 0), (5, 3), (7, 1), and (2, 2) were selected (indices count from 0). Regardless of how many dimensions the array has (here, only 2), the result of fancy indexing with one index array per axis is always one-dimensional.
The behavior of fancy indexing in this case is a bit different from what some users might have expected (myself included), which is the rectangular region formed by selecting a subset of the matrix's rows and columns. Here is one way to get that:
"First select rows 2, 6, 8, 3 (1-based), then columns 1, 4, 2, 3 (1-based)"
arr[[1,5,7,2]]
arr[[1,5,7,2]][:, [0,3,1,2]]
'First select rows 2, 6, 8, 3 (1-based), then columns 1, 4, 2, 3 (1-based)'
array([[ 4, 5, 6, 7],
[20, 21, 22, 23],
[28, 29, 30, 31],
[ 8, 9, 10, 11]])
array([[ 4, 7, 5, 6],
[20, 23, 21, 22],
[28, 31, 29, 30],
[ 8, 11, 9, 10]])
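Another way to get the same rectangular region is np.ix_, which turns the two index lists into a form that selects every row/column combination. This is an alternative I am adding for comparison, not the method shown above; the sketch checks that both approaches agree.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

# Two-step selection: pick rows first, then reorder columns
two_step = arr[rows][:, cols]

# np.ix_ builds an open mesh so every (row, col) combination is selected
with_ix = arr[np.ix_(rows, cols)]

print(np.array_equal(two_step, with_ix))
```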
cj_arr = np.arange(9).reshape((3,3))
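cj_arr
"Fancy indexing: I do not fully follow the later parts yet, but that is fine; being able to select a subset by rows and columns is enough"
cj_arr[1:, 1:]
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
'Fancy indexing: I do not fully follow the later parts yet, but that is fine; being able to select a subset by rows and columns is enough'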
array([[4, 5],
[7, 8]])
Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array.
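The copy behavior of fancy indexing can be verified with a short sketch: writing into the selected rows leaves the source array unchanged, in contrast to what happens with a slice.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

picked = arr[[4, 3, 0, 6]]   # fancy indexing -> new array with copied data
picked[:] = -1               # only the copy is modified

print(arr[4])                             # still the original row
print(np.shares_memory(arr, picked))      # no shared buffer
```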
Transposing Array and Swapping Axes
Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Because the result is a view, modifying it modifies the original data. Arrays have the transpose method and also the special T attribute, which is the more commonly used form:
arr = np.arange(15).reshape((3,5))
arr
"The T attribute transposes the array"
arr.T
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
array([[ 0, 5, 10],
[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14]])
When doing matrix computations, you may do this very often, for example, when computing the inner matrix product using np.dot:
arr = np.random.randn(6,3)
arr
array([[-0.31347827, 1.84428594, -1.04234885],
[-0.12911129, 0.70779926, 0.02346783],
[-0.35534693, -1.26234442, 1.06345449],
[ 0.66036534, 1.51442815, -1.00924714],
[ 0.83678267, -0.01628499, 1.08876609],
[-2.04661129, 0.27618009, -1.33821823]])
Compute the inner product $a^T a$:
np.dot(arr.T, arr)
array([[ 5.56611517, 0.20025734, 2.92922976],
[ 0.20025734, 7.86591714, -5.16397673],
[ 2.92922976, -5.16397673, 6.21279673]])
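A brief sketch of the shapes involved: a (6, 3) array transposed is (3, 6), so the product with the original is (3, 3) and symmetric. The @ operator is shown as an equivalent spelling of np.dot for 2-D arrays; it does not appear in the example above.

```python
import numpy as np

arr = np.random.randn(6, 3)

gram = np.dot(arr.T, arr)    # (3, 6) times (6, 3) -> (3, 3)
gram_matmul = arr.T @ arr    # same product written with the @ operator

print(gram.shape)
```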
For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute the axes (for extra mind bending):
# 2: [[],[]]
# 2: [ [ [],[] ], [ [],[] ] ]
# 4: [ [ [1,2,3,4],[5,6,7,8] ], [ [a,b,c,d],[e,f,g,h] ] ]
arr = np.arange(16).reshape(2,2,4)
arr
"I don't fully understand this yet"
arr.transpose((1,0,2))
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
"I don't fully understand this yet"
array([[[ 0, 1, 2, 3],
[ 8, 9, 10, 11]],
[[ 4, 5, 6, 7],
[12, 13, 14, 15]]])
Here, the axes have been reordered with the second axis first, the first axis second, and the last axis unchanged. (Note: I have not fully understood this yet.)
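One way to read transpose((1, 0, 2)) is as an index rule: the element at position [i, j, k] of the result is the element at [j, i, k] of the original. The sketch below checks that rule for every position; it is my own illustration of the reordering, not an example from the book.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))
swapped = arr.transpose((1, 0, 2))   # new axis order: old axis 1, old axis 0, old axis 2

# Verify result[i, j, k] == original[j, i, k] everywhere
rule_holds = all(
    swapped[i, j, k] == arr[j, i, k]
    for i in range(2) for j in range(2) for k in range(4)
)

print(swapped[0, 1])   # same data as arr[1, 0]
```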
Simple transposing with .T is a special case of swapping axes. ndarray has the method swapaxes, which takes a pair of axis numbers and switches the indicated axes to rearrange the data:
arr
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
"I still don't see how this axis swap works"
arr.swapaxes(1,2)
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]]])
swapaxes similarly returns a view on the data without making a copy. (More on this another time.)
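The swap can be read the same way as an index rule: swapaxes(1, 2) exchanges the last two indices, so result[i, j, k] is original[i, k, j]. Because the result is a view, writing through it changes the original array, as this sketch shows (the value -1 is arbitrary).

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))
swapped = arr.swapaxes(1, 2)     # shape goes from (2, 2, 4) to (2, 4, 2)

# swapped[0, 3, 1] refers to the same element as arr[0, 1, 3]
swapped[0, 3, 1] = -1

print(swapped.shape)
print(arr[0, 1, 3])                       # changed through the view
print(np.shares_memory(arr, swapped))
```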