
First, the problems I ran into and how I solved them

Error when importing pandas_datareader

import pandas_datareader

This produces:

...
ImportError: cannot import name 'is_list_like'

According to a Stack Overflow question on this error, a comment there suggests the following workaround:

import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like

This resolves the error, provided it runs before pandas_datareader is imported.
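A minimal sketch of the same workaround written defensively, assuming only that pandas is installed. It patches the attribute only when it is missing, so it does nothing on pandas versions that already expose it. The alias `pd_common` is introduced here for illustration.

```python
import pandas.core.common as pd_common
from pandas.api.types import is_list_like

# Restore the name that old pandas_datareader releases try to import,
# but only if this pandas version no longer provides it.
if not hasattr(pd_common, "is_list_like"):
    pd_common.is_list_like = is_list_like

# pandas_datareader should be imported only after this point, e.g.:
# import pandas_datareader as pdr
```

The order matters because the failing import runs when pandas_datareader is first loaded.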

Error when fetching Alibaba stock data

alibaba = pdr.get_data_yahoo('BABA')

This produces:

ImmediateDeprecationError: Yahoo Actions has been immediately
deprecated due to large breaks in the API without the introduction of
a stable replacement. Pull Requests to re-enable these data connectors
are welcome.

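The GitHub discussion mentions applying patches and editing the code, and some comments say that a branch of pandas_datareader already contains the fix.
In the end I solved it by installing the dev version of pandas_datareader, as suggested there.
The command given on GitHub is: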

pip install git+https://github.com/pydata/pandas-datareader.git

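Note: my environment is managed by Anaconda, but the dev package cannot be installed directly through conda.
I also don't have git installed, so I did the following instead: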

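  • First launch Anaconda Prompt, i.e. the terminal opened from within Anaconda; on Windows it can be started directly from the Start menu.
  • Before installing the dev package, uninstall the pandas-datareader package that is already installed
pip uninstall pandas-datareader
  • Download the zip archive, extract it, and change into the extracted directory
  • Then install the dev package with pip
pip install .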
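After the steps above, it may be worth confirming which version Python actually sees. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("pandas-datareader"))
```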

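pdr.get_data_yahoo('APPL') returns no data for Apple. I took it for an API problem and dropped Apple's data, though Apple's ticker symbol is AAPL, not APPL, so the misspelled symbol is the more likely cause.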

Calling sns.distplot() raises a warning

C:\Users\weimo\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
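A comment on GitHub says the warning is caused by a version change, yet it does not seem to affect the result. A Chinese blog post I found suggests editing the source directly. In
seaborn/distributions.py
change
hist_kws.setdefault("normed", norm_hist)
to
hist_kws.setdefault("density", norm_hist)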

In the end I decided to leave it alone for now; it is only a warning after all.
My guess is that this is a data source issue.
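If the 'normed' warning gets too noisy, one option that avoids editing library source is to filter it with the standard `warnings` module. The sketch below assumes the message text shown above. `noisy_plot_call` is a hypothetical stand-in that only emits the same warning, so the example runs without seaborn.

```python
import warnings


def noisy_plot_call():
    # Hypothetical stand-in for sns.distplot(): it emits the same UserWarning.
    warnings.warn(
        "The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.",
        UserWarning,
    )
    return "plotted"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only warnings whose message starts with this text.
    warnings.filterwarnings(
        "ignore",
        message="The 'normed' kwarg is deprecated",
        category=UserWarning,
    )
    result = noisy_plot_call()

print(result, len(caught))
```

In a notebook, the single `warnings.filterwarnings(...)` call near the imports would be enough; the `catch_warnings` block is only there to show that nothing gets through.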

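Summary: this feels a lot like learning to plot in MATLAB... The course mentions the book Matplotlib for Python Developers, which now has a second edition (2018); the first edition dates back to 2009. I plan to study it properly. Maybe translate it along the way?
Official link to the book's color-image PDF: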
https://www.packtpub.com/sites/default/files/downloads/MatplotlibforPythonDevelopersSecondEdition_ColorImages.pdf
Chapter code on GitHub:
https://github.com/PacktPublishing/Matplotlib-for-Python-Developers-Second-Edition/
Book home page:
Matplotlib for Python Developers, 2nd Edition

Main notes

import numpy as np
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like
from pandas import Series, DataFrame
import pandas_datareader as pdr

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from datetime import datetime
start = datetime(2014, 9, 20)
alibaba = pdr.get_data_yahoo('BABA', start=start)
amazon = pdr.get_data_yahoo('AMZN', start=start)
alibaba.head()
                 High        Low       Open      Close     Volume  Adj Close
Date
2014-09-19  99.699997  89.949997  92.699997  93.889999  271879400  93.889999
2014-09-22  92.949997  89.500000  92.699997  89.889999   66657800  89.889999
2014-09-23  90.480003  86.620003  88.940002  87.169998   39009800  87.169998
2014-09-24  90.570000  87.220001  88.470001  90.570000   32088000  90.570000
2014-09-25  91.500000  88.500000  91.089996  88.919998   28598000  88.919998
amazon.head()
                  High         Low        Open       Close   Volume   Adj Close
Date
2014-09-19  332.760010  325.570007  327.600006  331.320007  6886200  331.320007
2014-09-22  329.489990  321.059998  328.489990  324.500000  3109700  324.500000
2014-09-23  327.600006  321.250000  322.459991  323.630005  2352600  323.630005
2014-09-24  329.440002  319.559998  324.170013  328.209991  2642200  328.209991
2014-09-25  328.540009  321.399994  327.989990  321.929993  2928800  321.929993
#alibaba.shape
#alibaba.describe()
alibaba.to_csv("alibaba.csv")
amazon.to_csv("amazon.csv")
alibaba['Adj Close'].plot(legend=True)
<matplotlib.axes._subplots.AxesSubplot at 0x2eb3813a128>



output_5_1.png

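# Plot every column except Volume, whose scale would dwarf the price columns
for column in alibaba.columns:
    if column == 'Volume':
        continue
    alibaba[column].plot(legend=True)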

output_6_0.png

alibaba['high-low'] = alibaba['High'] - alibaba['Low']
alibaba.head()
                 High        Low       Open      Close     Volume  Adj Close  high-low
Date
2014-09-19  99.699997  89.949997  92.699997  93.889999  271879400  93.889999  9.750000
2014-09-22  92.949997  89.500000  92.699997  89.889999   66657800  89.889999  3.449997
2014-09-23  90.480003  86.620003  88.940002  87.169998   39009800  87.169998  3.860001
2014-09-24  90.570000  87.220001  88.470001  90.570000   32088000  90.570000  3.349998
2014-09-25  91.500000  88.500000  91.089996  88.919998   28598000  88.919998  3.000000
alibaba['high-low'].plot(figsize=(25,5))
<matplotlib.axes._subplots.AxesSubplot at 0x2eb388cd6d8>



output_9_1.png

alibaba['daily-return'] = alibaba['Adj Close'].pct_change()
alibaba['daily-return'].plot(figsize=(25,5),linestyle='--',marker='o')
<matplotlib.axes._subplots.AxesSubplot at 0x2eb38b609b0>



output_11_1.png

sns.distplot(alibaba['daily-return'].dropna(),bins=100,color='red')
C:\Users\weimo\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "





<matplotlib.axes._subplots.AxesSubplot at 0x2eb3a9bfac8>



output_12_2.png

start = datetime(2015, 1, 1)
company = ['GOOG', 'MSFT', 'AMZN', 'FB']
#company = 'APPL'
top_tech_df = pdr.get_data_yahoo(company,start=start)['Adj Close']
top_tech_df.head()
Symbols           AMZN         FB        GOOG       MSFT
Date
2014-12-31  310.350006  78.019997  523.521423  42.663837
2015-01-02  308.519989  78.449997  521.937744  42.948578
2015-01-05  302.190002  77.190002  511.057617  42.553627
2015-01-06  295.290009  76.150002  499.212799  41.929050
2015-01-07  298.420013  76.150002  498.357513  42.461777
top_tech_df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x180d6e03f60>



output_15_1.png

top_tech_dr = top_tech_df.pct_change()
top_tech_df[['FB', 'MSFT']].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x180d6de7550>



output_17_1.png

sns.jointplot(x='AMZN', y='GOOG', data=top_tech_dr, kind='scatter')
<seaborn.axisgrid.JointGrid at 0x180d6859f98>



output_18_1.png

sns.pairplot(top_tech_dr.dropna())
<seaborn.axisgrid.PairGrid at 0x180d6db60f0>



output_19_1.png

top_tech_dr['MSFT'].quantile(0.02)
-0.029942679770481886


import numpy as np
import pandas as pd
from pandas import Series, DataFrame
s1 = Series([1,2,3],index=['A','B','C'])
s1
A    1
B    2
C    3
dtype: int64



s2 = Series([4,5,6,7],index=['B','C','D','E'])
s2
B    4
C    5
D    6
E    7
dtype: int64



s1 + s2
A    NaN
B    6.0
C    8.0
D    NaN
E    NaN
dtype: float64


Values with the same index label are added together; labels that appear in only one Series give NaN.
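If the NaN results are not wanted, two common options are sketched below with the same `s1` and `s2`. `Series.add` with `fill_value=0` treats a missing label as 0, and `dropna()` keeps only the labels both Series share. Which one fits depends on what a missing label is supposed to mean.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["A", "B", "C"])
s2 = pd.Series([4, 5, 6, 7], index=["B", "C", "D", "E"])

# Treat labels missing from one side as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)

# Alternatively, keep only the labels present in both Series.
overlap_only = (s1 + s2).dropna()

print(filled)
print(overlap_only)
```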

DataFrame arithmetic

df_a = DataFrame(np.arange(4).reshape(2,2),index=['A','B'],columns=['北京','上海'])
df_b = DataFrame(np.arange(9).reshape(3,3),index=['A','B','C'],columns=['北京','上海','广州'])
df_a
北京 上海
A 0 1
B 2 3
df_b
北京 上海 广州
A 0 1 2
B 3 4 5
C 6 7 8
df_a + df_b
上海 北京 广州
A 2.0 0.0 NaN
B 7.0 5.0 NaN
C NaN NaN NaN
Similarly, cells whose index and column labels both match are added together; every other cell is NaN.
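The same `fill_value` idea carries over to DataFrames. A short sketch with the same `df_a` and `df_b`: a cell missing from one frame counts as 0, and a cell would stay NaN only if it were missing from both frames.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(4).reshape(2, 2),
                    index=["A", "B"], columns=["北京", "上海"])
df_b = pd.DataFrame(np.arange(9).reshape(3, 3),
                    index=["A", "B", "C"], columns=["北京", "上海", "广州"])

# Cells present in only one frame are added to 0 rather than becoming NaN.
total = df_a.add(df_b, fill_value=0)
print(total)
```

Here `df_b` covers every cell of the aligned result, so no NaN remains.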
df_c = DataFrame([[1,2,3],[4,5,np.nan],[7,8,9]],index=['A','B','C'],columns=['c1','c2','c3'])
df_c
c1 c2 c3
A 1 2 3.0
B 4 5 NaN
C 7 8 9.0
df_c.sum()
c1    12.0
c2    15.0
c3    12.0
dtype: float64



df_c.sum(axis = 1)
A     6.0
B     9.0
C    24.0
dtype: float64



type(df_c.sum())
pandas.core.series.Series


sum() on a DataFrame skips NaN values by default.
axis=1 sums across each row instead of down each column.
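Skipping NaN is only the default. As a small sketch with the same `df_c`, passing `skipna=False` makes any column or row containing NaN sum to NaN, which can be useful when a missing value should not pass silently.

```python
import numpy as np
import pandas as pd

df_c = pd.DataFrame([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]],
                    index=["A", "B", "C"], columns=["c1", "c2", "c3"])

col_default = df_c.sum()                     # NaN skipped: c3 sums 3 and 9
col_strict = df_c.sum(skipna=False)          # c3 becomes NaN
row_strict = df_c.sum(axis=1, skipna=False)  # row B becomes NaN

print(col_default)
print(col_strict)
print(row_strict)
```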
df_c.describe()
c1 c2 c3
count 3.0 3.0 2.000000
mean 4.0 5.0 6.000000
std 3.0 3.0 4.242641
min 1.0 2.0 3.000000
25% 2.5 3.5 4.500000
50% 4.0 5.0 6.000000
75% 5.5 6.5 7.500000
max 7.0 8.0 9.000000
s1.index
Index(['A', 'B', 'C'], dtype='object')



s1.sort_values()  # sort by value, ascending
A    1
B    2
C    3
dtype: int64



s1.sort_values(ascending=False)  # sort by value, descending
C    3
B    2
A    1
dtype: int64



s1.sort_index(ascending=False)  # sort by index, descending
C    3
B    2
A    1
dtype: int64



df_d = DataFrame(np.random.randn(35).reshape(7,5),columns=['A','B','C','D','E'])
df_d
A B C D E
0 -0.694245 -0.302792 0.667865 0.447782 -0.413812
1 -0.502081 -1.849090 1.885715 -1.117864 0.406936
2 0.384877 0.076701 -1.052755 -0.709675 0.272562
3 -1.194740 -0.518320 -0.139549 -0.745238 1.270952
4 -1.266443 -1.163004 -0.644873 -0.333446 0.349508
5 -0.695937 -0.589887 1.475200 0.278659 2.207159
6 -0.712247 0.171372 0.268192 0.138490 0.604858
df_d.sort_values('A',ascending=False)  # sort by column A, descending
A B C D E
2 0.384877 0.076701 -1.052755 -0.709675 0.272562
1 -0.502081 -1.849090 1.885715 -1.117864 0.406936
0 -0.694245 -0.302792 0.667865 0.447782 -0.413812
5 -0.695937 -0.589887 1.475200 0.278659 2.207159
6 -0.712247 0.171372 0.268192 0.138490 0.604858
3 -1.194740 -0.518320 -0.139549 -0.745238 1.270952
4 -1.266443 -1.163004 -0.644873 -0.333446 0.349508
df_d.sort_index(ascending=False)  # sort by index, descending
A B C D E
6 -0.712247 0.171372 0.268192 0.138490 0.604858
5 -0.695937 -0.589887 1.475200 0.278659 2.207159
4 -1.266443 -1.163004 -0.644873 -0.333446 0.349508
3 -1.194740 -0.518320 -0.139549 -0.745238 1.270952
2 0.384877 0.076701 -1.052755 -0.709675 0.272562
1 -0.502081 -1.849090 1.885715 -1.117864 0.406936
0 -0.694245 -0.302792 0.667865 0.447782 -0.413812

The so-called homework: extract three columns from an existing CSV file and sort by one of them in descending order, all in a single line (really...).

import pandas as pd
# Open the file explicitly as UTF-8 and close it automatically when done
with open("E:/Python3数据科学入门与实战/project/o25mso/homework/movie_metadata.csv","r",encoding="utf-8") as csv_file:
    read_csv_in = pd.read_csv(csv_file)
df = read_csv_in[['imdb_score','director_name','movie_title']].sort_values('imdb_score',ascending=False)
df.to_csv('new_imdb.csv')
#pd.read_csv(open("E:/Python3数据科学入门与实战/project/o25mso/homework/movie_metadata.csv"))[['imdb_score','director_name','movie_title']].sort_values('imdb_score',ascending=False)
!ls
csv_practice.ipynb
demo.ipynb
new_imdb.csv
practice-2018-08-11.ipynb

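When the path contains Chinese characters, open the file with open("filename","r",encoding="utf-8") first,
then pass the file object to pandas.read_csv(). Make sure to read it as UTF-8, otherwise problems occur.
The main benefit of this approach is that it allows Chinese characters in the path.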
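A self-contained sketch of the approach described above. It writes a tiny CSV into a temporary folder whose name contains Chinese characters, reads it back through a UTF-8 file handle, then selects and sorts the columns in one chained expression. The rows and the folder name are made up for illustration; only the column names come from the homework file.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical folder name containing non-ASCII characters.
    csv_dir = os.path.join(tmp_dir, "数据")
    os.makedirs(csv_dir)
    csv_path = os.path.join(csv_dir, "movies.csv")

    # Made-up rows, written only so the example has something to read.
    pd.DataFrame({
        "movie_title": ["Film A", "Film B", "Film C"],
        "director_name": ["Director X", "Director Y", "Director Z"],
        "imdb_score": [8.0, 9.3, 9.2],
        "budget": [1, 2, 3],
    }).to_csv(csv_path, index=False, encoding="utf-8")

    # Pass an open UTF-8 file handle to read_csv instead of the path string.
    with open(csv_path, "r", encoding="utf-8") as csv_file:
        top = pd.read_csv(csv_file)[
            ["imdb_score", "director_name", "movie_title"]
        ].sort_values("imdb_score", ascending=False)

print(top)
```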
df.head()
      imdb_score         director_name               movie_title
2765         9.5        John Blanchard          Towering Inferno
1937         9.3        Frank Darabont  The Shawshank Redemption
3466         9.2  Francis Ford Coppola             The Godfather
4409         9.1        John Stockwell      Kickboxer: Vengeance
2824         9.1                   NaN                   Dekalog