Notes on the pandas library

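These notes follow the pandas "Intro to data structures" guide (http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro).
Series objects:
A Series is a one-dimensional labeled array. The labels serve as the index. The basic call looks like this:
s = pd.Series(data, index=index)
The data can come from three kinds of sources: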

  • a Python dict
  • an ndarray (NumPy's N-dimensional array object, very powerful)
  • a scalar value

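Here is an example of each:

import numpy as np
import pandas as pd

d = {'a':23,'b':3.2,'c':19}
s = pd.Series(d)
s = pd.Series(np.random.randn(3),index=['a','b','c'])
# The number of index labels must match the number of values, otherwise an error is raised; alternatively, leave the index out as below
s = pd.Series(np.random.randn(23))
# An integer index is created automatically, starting from 0
s = pd.Series(3,index=['a','c','d'])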
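The three data sources can be checked side by side with fixed values. This is only a small sketch; the variable names (from_dict, from_array, from_scalar, mismatch_error) are made up for illustration and do not come from the guide.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
from_dict = pd.Series({'a': 23, 'b': 3.2, 'c': 19})

# From an ndarray: the index needs exactly as many labels as there are values.
values = np.arange(3.0)
from_array = pd.Series(values, index=['a', 'b', 'c'])

# An index of the wrong length is rejected with a ValueError.
try:
    pd.Series(values, index=['a', 'b'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

# From a scalar: the value is repeated once per index label.
from_scalar = pd.Series(3, index=['a', 'c', 'd'])

print(from_dict.dtype)            # float64, because 3.2 forces a float dtype
print(type(mismatch_error).__name__)
print(from_scalar.tolist())
```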

Features:
1. A Series supports slicing and lookup by label. A bunch of examples:

>>> s

a 1.676165

b -1.257345

c -0.282880

d -1.411001

e -0.534083

dtype: float64

>>> s[:3]

a 1.676165

b -1.257345

c -0.282880

dtype: float64

>>> np.exp(s) # a NumPy function that computes e**x for each element

a 5.345021

b 0.284408

c 0.753610

d 0.243899

e 0.586207

dtype: float64

>>> s[s > s.mean()]

a 1.676165

c -0.282880

dtype: float64

>>> s['a']

1.6761654283874481

>>> s['e'] = 21

>>> s

a 1.676165

b -1.257345

c -0.282880

d -1.411001

e 21.000000

dtype: float64

>>> 'e' in s

True

2. Testing membership in a Series:

>>> 'e' in s

True

>>> 'f' in s

False

>>> 21 in s

False

>>> 21. in s

False

>>> s.get('d',np.nan)

-1.4110014180872303

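>>> s.get(21,np.nan) # works like dict.get in Python: the default comes back when the key is missing

nan

# So membership tests and get() look up index labels (keys), not values.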
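To test for a value rather than a label, the search has to go through the underlying array. A minimal sketch with a made-up three-element Series; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, -0.3, 21.0], index=['a', 'b', 'e'])

label_hit = 'e' in s               # True: 'e' is an index label
value_as_label = 21.0 in s         # False: 21.0 is a value, not a label
value_hit = 21.0 in s.values       # True: searches the underlying ndarray
any_match = s.isin([21.0]).any()   # True: element-wise test, then reduced
missing = s.get('z', np.nan)       # nan: the default for an absent label
```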

3. Vectorized operations and label alignment with Series

The phrase is abstract, so here are concrete examples:

>>> s + s

a 3.352331

b -2.514689

c -0.565760

d -2.822003

e 42.000000

dtype: float64

>>> s * 3

a 5.028496

b -3.772034

c -0.848640

d -4.233004

e 63.000000

dtype: float64

>>> s + 3

a 4.676165

b 1.742655

c 2.717120

d 1.588999

e 24.000000

dtype: float64

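>>> s[1:] + s[:-1] # this is the label alignment mentioned above

a NaN # label 'a' exists in only one of the two operands, so there is nothing to add and the result is NaN; the same goes for 'e'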

b -2.514689

c -0.565760

d -2.822003

e NaN

dtype: float64
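The alignment rule is easier to verify with round numbers. The sketch below uses a made-up four-element Series and also shows two common ways of dealing with the unmatched labels; whether either fits depends on the data.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# s[1:] has labels b, c, d and s[:-1] has labels a, b, c.
# Addition pairs values by label, not by position.
aligned = s[1:] + s[:-1]

# Series.add with fill_value treats a label missing on one side as 0.
filled = s[1:].add(s[:-1], fill_value=0)

# Alternatively, drop the labels that had no partner.
overlap = aligned.dropna()

print(aligned)
print(filled)
print(overlap)
```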

4. The name attribute:

Think of the Name Manager in Excel. The analogy is loose, but it is the closest one I have.

>>> s = pd.Series(np.random.randn(12),name='somthing')

>>> s

0 0.373053

1 -0.523582

2 0.036234

3 0.158118

4 -1.089069

5 0.598565

6 -0.526778

7 2.136106

8 0.588604

9 -1.093352

10 -0.796709

11 -0.500362

Name: somthing, dtype: float64

>>> s.name

'somthing'

>>> s2 = s.rename('hahah') # rename returns a new Series; s2 and s are now two different objects

>>> s2.name

'hahah'

>>> s2

0 0.373053

1 -0.523582

2 0.036234

3 0.158118

4 -1.089069

5 0.598565

6 -0.526778

7 2.136106

8 0.588604

9 -1.093352

10 -0.796709

11 -0.500362

Name: hahah, dtype: float64

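That concludes the introduction to Series.

DataFrame objects
A DataFrame is a two-dimensional data structure; think of it as a table in SQL. It accepts five kinds of input:
a dict of one-dimensional ndarrays, lists, dicts, or Series
a two-dimensional ndarray
a structured or record ndarray
a Series
another DataFrame

Examples make this clearer: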

>>> d = { 'one':pd.Series([1,2,3],index=['a','b','c']),

'two' : pd.Series([1,2,3,4],index=['a','b','c','d'])} # this is a dict of Series

>>> df = pd.DataFrame(d)

>>> df

one two

a 1.0 1

b 2.0 2

c 3.0 3

d NaN 4

>>> pd.DataFrame(d,index=['b','d','z'])

one two

b 2.0 2.0

d NaN 4.0

z NaN NaN

>>> pd.DataFrame(d,index=['a','c','b'],columns=['two','three'])

two three

a 1 NaN

c 3 NaN

b 2 NaN

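# In other words, you can ask for row or column labels that do not exist in the data; they are filled with NaN instead of raising an error.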

>>> df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

>>> df.columns

Index(['one', 'two'], dtype='object')
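The dict-of-Series behaviour above can be confirmed programmatically. This is a sketch that reuses the same d as the example; the names rows, one_is_float, two_is_int and sub are only illustrative.

```python
import pandas as pd

d = {
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
}
df = pd.DataFrame(d)

# The row index is the union of both Series indexes.
rows = df.index.tolist()

# 'one' has no value for 'd', so it holds NaN there and becomes a float column,
# while 'two' is complete and keeps an integer dtype.
one_is_float = df['one'].dtype.kind == 'f'
two_is_int = df['two'].dtype.kind == 'i'

# Labels that are absent from the data yield NaN, not an error.
sub = pd.DataFrame(d, index=['b', 'd', 'z'], columns=['two', 'three'])

print(rows)
print(sub)
```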

# Next, a dict of lists:

>>> d = { 'one': [1,2,3,4],

 'two':[2,3,4,5]}

>>> pd.DataFrame(d)

 one two

0 1 2

1 2 3

2 3 4

3 4 5

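# Next, a structured or record array:

>>> data = np.zeros((2,),dtype=[('a','i4'),('b','f4'),('c','S11')])

# 'f3' is rejected because the digit is the item size in bytes, and NumPy floats only come in sizes such as 2, 4 and 8 bytes ('f2', 'f4', 'f8'). 'S11' is a byte string of up to 11 bytes; older NumPy also accepted the alias 'a11'.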

>>> data[:] = [(1,2,'hello'),(2,3.,"world")]

>>> pd.DataFrame(data)

 a b c

0 1 2.0 b'hello'

1 2 3.0 b'world'
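The byte-size reading of the type codes can be checked directly. A small sketch, assuming the same three fields as above; record_size and f3_error are illustrative names.

```python
import numpy as np
import pandas as pd

# 'i4' = 4-byte integer, 'f4' = 4-byte float, 'S11' = byte string of up to 11 bytes.
dtype = np.dtype([('a', 'i4'), ('b', 'f4'), ('c', 'S11')])
data = np.zeros((2,), dtype=dtype)
data[:] = [(1, 2.0, b'hello'), (2, 3.0, b'world')]

# Each record occupies 4 + 4 + 11 bytes (fields are packed by default).
record_size = dtype.itemsize

# A 3-byte float does not exist, so NumPy refuses the type code.
try:
    np.dtype('f3')
    f3_error = None
except TypeError as exc:
    f3_error = exc

# Field names become column names.
df = pd.DataFrame(data)
print(record_size)
print(df)
```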

# A list of dicts:

>>> data2 = [{'a':1,'b':2},{'a':2,'b':3,'c':7}]

>>> pd.DataFrame(data2)

 a b c

0 1 2 NaN

1 2 3 7.0

# A dict of tuples:

>>> pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},

 ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},

 ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},

 ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

       a              b

       a    b    c    b

A B  4.0  1.0  5.0 10.0

  C  3.0  2.0  6.0  NaN

  D  NaN  NaN  NaN  9.0

Features:

1. Alternate constructors

DataFrame.from_records()
Builds a DataFrame from a list of tuples or from an ndarray with a structured dtype.

>>> data # the structured array created in the structured-array example above

array([(1, 2., b'hello'), (2, 3., b'world')], 

 dtype=[('a', '<i4'), ('b', '<f4'), ('c', 'S11')])

>>> pd.DataFrame.from_records(data,index='c')

 a b

c 

b'hello' 1 2.0

b'world' 2 3.0

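DataFrame.from_items()
Another way to build a DataFrame, this time from a sequence of (key, value) pairs; note the two calling styles. from_items was deprecated in pandas 0.23 and removed in 1.0, so these calls only run on older versions.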
>>> pd.DataFrame.from_items([('a',[1,2,3]),('b',[4,5,6])])

 a b

0 1 4

1 2 5

2 3 6

>>> pd.DataFrame.from_items([('a',[1,2,3]),('b',[4,5,6])],

 orient='index',columns=['one','two','three'])

 one two three

a 1 2 3

b 4 5 6
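On pandas versions without from_items, the same two results can be produced with DataFrame.from_dict. A sketch under that assumption; by_column and by_row are illustrative names.

```python
import pandas as pd

items = [('a', [1, 2, 3]), ('b', [4, 5, 6])]

# Column-oriented: each key becomes a column.
by_column = pd.DataFrame.from_dict(dict(items))

# Row-oriented: each key becomes a row label; the columns are named explicitly.
by_row = pd.DataFrame.from_dict(
    dict(items), orient='index', columns=['one', 'two', 'three']
)

print(by_column)
print(by_row)
```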

Column operations

>>> df['one'] # select a column

a 1.0

b 2.0

c 3.0

d NaN

Name: one, dtype: float64

>>> df['three'] = df['one'] * df['two'] + 2 # add a computed column

>>> df['flag'] = df['one'] > 2

>>> df

 one two three flag

a 1.0 1 3.0 False

b 2.0 2 6.0 False

c 3.0 3 11.0 True

d NaN 4 NaN False

>>> del df['two'] # two ways to delete a column: del, and pop below

>>> three = df.pop('three')

>>> df

 one flag

a 1.0 False

b 2.0 False

c 3.0 True

d NaN False

>>> df['foo'] = 'bar' # add a column; the scalar is repeated for every row

>>> df

 one flag foo

a 1.0 False bar

b 2.0 False bar

c 3.0 True bar

d NaN False bar

>>> df['one_tmp'] = df['one'][:2]

>>> df

 one flag foo one_tmp

a 1.0 False bar 1.0

b 2.0 False bar 2.0

c 3.0 True bar NaN

d NaN False bar NaN

>>> df.insert(1,'bar',df['one']) # insert a column at a given position; positions count from 0

>>> df

   one  bar   flag  foo  one_tmp

a  1.0  1.0  False  bar      1.0

b  2.0  2.0  False  bar      2.0

c  3.0  3.0   True  bar      NaN

d  NaN  NaN  False  bar      NaN
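The whole sequence of column operations can be replayed as one script. The sketch below rebuilds the starting frame from literal values instead of the dict of Series used earlier, which is an assumption made to keep it self-contained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0, np.nan], 'two': [1, 2, 3, 4]},
                  index=['a', 'b', 'c', 'd'])

df['three'] = df['one'] * df['two'] + 2   # computed column, appended at the end
df['flag'] = df['one'] > 2                # NaN > 2 evaluates to False
del df['two']                             # delete in place
three = df.pop('three')                   # delete and get the column back
df['foo'] = 'bar'                         # a scalar is repeated for every row
df['one_tmp'] = df['one'][:2]             # aligned on the index; the rest is NaN
df.insert(1, 'bar', df['one'])            # position 1 means second column

print(df)
```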

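The assign method

assign derives new columns from existing ones and returns a new DataFrame, which makes it convenient in method chains.
>>> yeji = pd.read_csv(u'E:/python code/some_referenece/test_data.csv')

# Arbitrary data I had lying around, so the plot further down does not show much. Windows paths with their doubled backslashes are a real pain.

# While you are still learning, avoid Chinese characters in file paths; they cause endless trouble.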

>>> yeji.head()

 a b c d

0 160.36 150.36 200.94 221.23

1 120.45 96.35 140.45 155.63

2 56.90 30.52 63.74 67.16

3 65.40 28.96 94.84 109.56

>>> (yeji.assign(ratio = yeji['b'] / yeji['a']).head())

 a b c d ratio

0 160.36 150.36 200.94 221.23 0.937640

1 120.45 96.35 140.45 155.63 0.799917

2 56.90 30.52 63.74 67.16 0.536380

3 65.40 28.96 94.84 109.56 0.442813

>>> yeji.assign(ratio = lambda x :(x['d']/x['c'])).head()

 a b c d ratio

0 160.36 150.36 200.94 221.23 1.100975

1 120.45 96.35 140.45 155.63 1.108081

2 56.90 30.52 63.74 67.16 1.053655

3 65.40 28.96 94.84 109.56 1.155209

>>> import matplotlib.pyplot as plt # the guide never mentions this import

>>> (yeji.query('a > 100')

 .assign(ratio = lambda x : x.c / x.d,

 ratio2 = lambda x : x.b / x.a)

 .plot(kind='scatter',x='a',y='b'))

<matplotlib.axes._subplots.AxesSubplot object at 0x06375FD0>

>>> plt.show() # the plot is not pretty; I will swap in more meaningful data later

This sentence from the guide seemed important, so I copied it down:

assign always returns a copy of the data, leaving the original DataFrame untouched.
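That claim is easy to check. In the sketch below the CSV is replaced by a hand-built frame whose values are copied from the head() output above; that stand-in is an assumption, not the actual file.

```python
import pandas as pd

# Stand-in for the CSV used above; values copied from the head() output.
yeji = pd.DataFrame({
    'a': [160.36, 120.45, 56.90, 65.40],
    'b': [150.36, 96.35, 30.52, 28.96],
    'c': [200.94, 140.45, 63.74, 94.84],
    'd': [221.23, 155.63, 67.16, 109.56],
})

# A precomputed Series and a callable both work. The callable receives the
# DataFrame being assigned to, which is what makes chaining after query() possible.
with_ratio = yeji.assign(ratio=yeji['b'] / yeji['a'])
chained = (yeji.query('a > 100')
               .assign(ratio=lambda x: x.c / x.d,
                       ratio2=lambda x: x.b / x.a))

# The original frame still has only its four columns.
original_columns = yeji.columns.tolist()
print(original_columns)
print(chained)
```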


Arithmetic between DataFrames

>>> df = pd.DataFrame(np.random.rand(10,4),columns=['A','B','C','D'])

>>> df2 = pd.DataFrame(np.random.randn(7,3),columns=['A','B','C'])

>>> df + df2

 A B C D

0 1.721702 1.977382 -0.713337 NaN

1 -0.358129 1.017056 0.940957 NaN

2 -0.014927 -1.277490 0.747128 NaN

3 1.741724 0.909578 2.034446 NaN

4 0.618743 -0.391043 0.237377 NaN

5 -0.015751 0.164203 0.714825 NaN

6 -0.171107 1.446850 0.215734 NaN

7 NaN NaN NaN NaN

8 NaN NaN NaN NaN

9 NaN NaN NaN NaN
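The NaN pattern above comes from aligning on both rows and columns: a cell is filled only where both frames have a value. A small sketch with frames of ones, so the arithmetic is obvious; the shapes are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((4, 3)), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.ones((2, 2)), columns=['A', 'B'])

total = df + df2
# Only rows 0-1 of columns A and B exist in both frames; everything else is NaN.
n_valid = int(total.notna().sum().sum())

# add() with fill_value substitutes 0 where only one frame has a value.
filled = df.add(df2, fill_value=0)

print(total)
print(filled)
```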

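Operations on a DataFrame with a date index

>>> index = pd.date_range('1/2/2018',periods=8) # reconstructed from the output below; the original definition was not recorded

>>> df = pd.DataFrame(np.random.rand(8,3),index=index,columns=list('ABC'))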

>>> df

 A B C

2018-01-02 0.619302 0.397314 0.720963

2018-01-03 0.342102 0.365098 0.594494

2018-01-04 0.867247 0.873567 0.103344

2018-01-05 0.130527 0.576591 0.587693

2018-01-06 0.412718 0.995881 0.772881

2018-01-07 0.030249 0.148611 0.752574

2018-01-08 0.560166 0.530715 0.738528

2018-01-09 0.811958 0.913701 0.339096

# A DataFrame indexed by dates

>>> type(df['A'])

<class 'pandas.core.series.Series'>

>>> df - df['A']

 2018-01-02 00:00:00 2018-01-03 00:00:00 2018-01-04 00:00:00 

2018-01-02 NaN NaN NaN 

2018-01-03 NaN NaN NaN 

2018-01-04 NaN NaN NaN 

2018-01-05 NaN NaN NaN 

2018-01-06 NaN NaN NaN 

2018-01-07 NaN NaN NaN 

2018-01-08 NaN NaN NaN 

2018-01-09 NaN NaN NaN 

 2018-01-05 00:00:00 2018-01-06 00:00:00 2018-01-07 00:00:00 

2018-01-02 NaN NaN NaN 

2018-01-03 NaN NaN NaN 

2018-01-04 NaN NaN NaN 

2018-01-05 NaN NaN NaN 

2018-01-06 NaN NaN NaN 

2018-01-07 NaN NaN NaN 

2018-01-08 NaN NaN NaN 

2018-01-09 NaN NaN NaN 

 2018-01-08 00:00:00 2018-01-09 00:00:00 A B C 

2018-01-02 NaN NaN NaN NaN NaN 

2018-01-03 NaN NaN NaN NaN NaN 

2018-01-04 NaN NaN NaN NaN NaN 

2018-01-05 NaN NaN NaN NaN NaN 

2018-01-06 NaN NaN NaN NaN NaN 

2018-01-07 NaN NaN NaN NaN NaN 

2018-01-08 NaN NaN NaN NaN NaN 

2018-01-09 NaN NaN NaN NaN NaN 

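# With plain subtraction the Series index (the dates) is aligned against the DataFrame's columns (A, B, C). No labels match, so the result has the union of both label sets as columns and is NaN everywhere.

# To subtract the Series from every column, row by row, use sub with axis=0 instead: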

>>> df.sub(df['A'],axis=0)

 A B C

2018-01-02 0.0 -0.221988 0.101661

2018-01-03 0.0 0.022996 0.252393

2018-01-04 0.0 0.006321 -0.763903

2018-01-05 0.0 0.446064 0.457166

2018-01-06 0.0 0.583163 0.360162

2018-01-07 0.0 0.118362 0.722325

2018-01-08 0.0 -0.029451 0.178362

2018-01-09 0.0 0.101743 -0.472861

# This subtracts column A from every column, row by row, which is why column A itself becomes all zeros.
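The difference between the two alignment directions is clearer on a tiny frame with predictable values. A sketch with a made-up 3 x 3 frame; by_columns, by_rows and minus_first_row are illustrative names.

```python
import numpy as np
import pandas as pd

index = pd.date_range('2018-01-02', periods=3)
df = pd.DataFrame(np.arange(9.0).reshape(3, 3), index=index, columns=list('ABC'))

# Default: the Series index is matched against the columns, and nothing lines up.
by_columns = df - df['A']

# axis=0: the Series index is matched against the row index instead.
by_rows = df.sub(df['A'], axis=0)

# Subtracting a row works with the default, because its index is A, B, C.
minus_first_row = df - df.iloc[0]

print(by_rows)
print(minus_first_row)
```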

>>> df * 3 + 1

 A B C

2018-01-02 2.857905 2.191942 3.162888

2018-01-03 2.026305 2.095293 2.783483

2018-01-04 3.601740 3.620702 1.310032

2018-01-05 1.391582 2.729774 2.763080

2018-01-06 2.238155 3.987643 3.318642

2018-01-07 1.090747 1.445833 3.257723

2018-01-08 2.680497 2.592144 3.215583

2018-01-09 3.435873 3.741102 2.017289

# This shows that arithmetic leaves the date index untouched; the other operators are not demonstrated here.

>>> df[:4].T # transpose

   2018-01-02  2018-01-03  2018-01-04  2018-01-05

A    0.619302    0.342102    0.867247    0.130527

B    0.397314    0.365098    0.873567    0.576591

C    0.720963    0.594494    0.103344    0.587693

Operations on boolean data
>>> df1 = pd.DataFrame({'a':[1,0,1],'b':[0,1,0]},dtype=bool) 

# This way of constructing a DataFrame (passing dtype) is worth remembering

>>> df2 = pd.DataFrame({'a':[0,1,0],'b':[1,0,1]},dtype=bool)

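# The logical operators below work element by element. df2 is the exact complement of df1, which explains the uniform results.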

>>> df1

 a b

0 True False

1 False True

2 True False

>>> df1 & df2

 a b

0 False False

1 False False

2 False False

>>> df1 | df2

 a b

0 True True

1 True True

2 True True

>>> df1 ^ df2

 a b

0 True True

1 True True

2 True True

>>> -df1

 a b

0 False True

1 True False

2 False True
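The outputs above can be verified rather than eyeballed. A sketch using the same two frames; it uses ~ for element-wise NOT, and the result names are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 0]}, dtype=bool)
df2 = pd.DataFrame({'a': [0, 1, 0], 'b': [1, 0, 1]}, dtype=bool)

# df2 is the exact complement of df1, so:
and_all_false = not (df1 & df2).any().any()   # AND is False everywhere
or_all_true = (df1 | df2).all().all()         # OR is True everywhere
xor_all_true = (df1 ^ df2).all().all()        # XOR is True everywhere
inverted_matches = (~df1).equals(df2)         # NOT df1 is exactly df2

print(and_all_false, or_all_true, xor_all_true, inverted_matches)
```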

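Panel
A rarely used, three-dimensional container, so this is only a quick look. Panel was deprecated in pandas 0.20 and removed in 0.25, so the code below only runs on older versions.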

>>> wp = pd.Panel(np.random.randn(2,5,3),items=['item1','item2'],

 major_axis=pd.date_range('1/1/2013',periods=5),

 minor_axis=['a','b','c'])

>>> wp

<class 'pandas.core.panel.Panel'>

Dimensions: 2 (items) x 5 (major_axis) x 3 (minor_axis)

Items axis: item1 to item2

Major_axis axis: 2013-01-01 00:00:00 to 2013-01-05 00:00:00

Minor_axis axis: a to c

That is enough on Panel; three-dimensional data is a headache just to think about.
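On pandas versions without Panel, one common substitute is a DataFrame with a two-level row index. This is a sketch of that idea and not a drop-in replacement; the level names 'item' and 'major' are assumptions chosen to mirror the Panel axes.

```python
import numpy as np
import pandas as pd

values = np.random.randn(2, 5, 3)
items = ['item1', 'item2']
major = pd.date_range('1/1/2013', periods=5)
minor = ['a', 'b', 'c']

# Two row-index levels (item, date) stand in for the first two Panel axes.
rows = pd.MultiIndex.from_product([items, major], names=['item', 'major'])
wp = pd.DataFrame(values.reshape(10, 3), index=rows, columns=minor)

# Selecting one item gives back an ordinary 5 x 3 DataFrame.
item1 = wp.loc['item1']

print(wp.shape)
print(item1)
```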



NaN (not a number)
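np.zeros((a,b)) creates an a-by-b array of zeros; the shape is passed as a single tuple.
numpy.random.rand(d0, d1, …, dn) returns an array of shape (d0, d1, …, dn) with samples drawn uniformly from [0, 1).
numpy.random.randn(d0, d1, …, dn) returns an array of the same shape with samples from the standard normal distribution.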
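A quick check of the calling conventions in these reminders; a sketch, with variable names chosen only for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))          # shape passed as one tuple
uniform = np.random.rand(2, 3)    # dimensions passed as separate arguments
normal = np.random.randn(2, 3)

# np.zeros(2, 3) fails: the second positional argument is the dtype, not a dimension.
try:
    np.zeros(2, 3)
    zeros_error = None
except TypeError as exc:
    zeros_error = exc

# NaN is a float that compares unequal to everything, including itself.
nan_equals_itself = np.nan == np.nan

print(zeros.shape, uniform.shape, normal.shape)
print(type(zeros_error).__name__, nan_equals_itself)
```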
