I'm trying to get information from a table on the internet, as shown below. I'm using a Jupyter notebook with Python 2.7, and I want to load this information into a DataFrame with Python's pandas module. But when I copy the table with its headings and then use the read_clipboard command, I see the error shown below the table link. Without the table headings there is no problem. How can I get the data from the internet with the table headings?
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from numpy.random import randn
df1 = pd.read_clipboard()
df1
The table I want to import as a DataFrame:
CParserError Traceback (most recent call last)
<ipython-input-4-151d7223d8dc> in <module>()
----> 1 df1 = pd.read_clipboard()
2 df1
C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\clipboard.pyc in read_clipboard(**kwargs)
49 kwargs['sep'] = '\s+'
50
---> 51 return read_table(StringIO(text), **kwargs)
52
53
C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
496 skip_blank_lines=skip_blank_lines)
497
--> 498 return _read(filepath_or_buffer, kwds)
499
500 parser_f.__name__ = name
C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
283 return parser
284
--> 285 return parser.read()
286
287 _parser_defaults = {
C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
745 raise ValueError('skip_footer not supported for iteration')
746
--> 747 ret = self._engine.read(nrows)
748
749 if self.options.get('as_recarray'):
C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
1195 def read(self, nrows=None):
1196 try:
-> 1197 data = self._reader.read(nrows)
1198 except StopIteration:
1199 if self._first_chunk:
pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7988)()
pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8244)()
pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8970)()
pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)()
pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:22649)()
CParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 2
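From the traceback, read_clipboard falls back to sep='\s+', so any whitespace inside a copied cell (the two-word header Adj Close, or dates like Feb 12, 2016) changes the field count from line to line, which would explain the "Expected 1 fields ... saw 2" error. A minimal workaround sketch, assuming the browser puts tabs between the copied cells:
import pandas as pd
# Assumes the table was copied from a browser, which typically separates
# cells with tabs; an explicit separator keeps multi-word values such as
# the "Adj Close" header together as a single field.
df1 = pd.read_clipboard(sep='\t')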
2 Answers
#1
There is a CSV available on the page with all the data, which read_csv can parse easily:
import pandas as pd
df = pd.read_csv("http://real-chart.finance.yahoo.com/table.csv?s=AAPL&d=1&e=16&f=2016&g=d&a=11&b=12&c=1980&ignore=.csv")
If you want a specific time period, you just have to change the parameters in the URL, i.e. s=AAPL&d=1&e=16&f=2016&g=d&a=11&b=12&c=1980. For example, if we change 1980 to 2015 (see the URL-building sketch after the output below):
df = pd.read_csv("http://real-chart.finance.yahoo.com/table.csv?s=AAPL&d=1&e=16&f=2016&g=d&a=11&b=12&c=2015&ignore=.csv", parse_dates=[0])
print(df)
We get:
Date Open High Low Close Volume \
0 2016-02-12 94.190002 94.500000 93.010002 93.989998 40121700
1 2016-02-11 93.790001 94.720001 92.589996 93.699997 49686200
2 2016-02-10 95.919998 96.349998 94.099998 94.269997 42245000
3 2016-02-09 94.290001 95.940002 93.930000 94.989998 44331200
4 2016-02-08 93.129997 95.699997 93.040001 95.010002 54021400
5 2016-02-05 96.519997 96.919998 93.690002 94.019997 46418100
6 2016-02-04 95.860001 97.330002 95.190002 96.599998 46471700
7 2016-02-03 95.000000 96.839996 94.080002 96.349998 45964300
8 2016-02-02 95.419998 96.040001 94.279999 94.480003 37357200
9 2016-02-01 96.470001 96.709999 95.400002 96.430000 40943500
10 2016-01-29 94.790001 97.339996 94.349998 97.339996 64416500
11 2016-01-28 93.790001 94.519997 92.389999 94.089996 55678800
12 2016-01-27 96.040001 96.629997 93.339996 93.419998 133369700
13 2016-01-26 99.930000 100.879997 98.070000 99.989998 75077000
14 2016-01-25 101.519997 101.529999 99.209999 99.440002 51794500
15 2016-01-22 98.629997 101.459999 98.370003 101.419998 65800500
16 2016-01-21 97.059998 97.879997 94.940002 96.300003 52161500
17 2016-01-20 95.099998 98.190002 93.419998 96.790001 72334400
18 2016-01-19 98.410004 98.650002 95.500000 96.660004 53087700
19 2016-01-15 96.199997 97.709999 95.360001 97.129997 79833900
20 2016-01-14 97.959999 100.480003 95.739998 99.519997 63170100
21 2016-01-13 100.320000 101.190002 97.300003 97.389999 62439600
22 2016-01-12 100.550003 100.690002 98.839996 99.959999 49154200
23 2016-01-11 98.970001 99.059998 97.339996 98.529999 49739400
24 2016-01-08 98.550003 99.110001 96.760002 96.959999 70798000
25 2016-01-07 98.680000 100.129997 96.430000 96.449997 81094400
26 2016-01-06 100.559998 102.370003 99.870003 100.699997 68457400
27 2016-01-05 105.750000 105.849998 102.410004 102.709999 55791000
28 2016-01-04 102.610001 105.370003 102.000000 105.349998 67649400
29 2015-12-31 107.010002 107.029999 104.820000 105.260002 40912300
30 2015-12-30 108.580002 108.699997 107.180000 107.320000 25213800
31 2015-12-29 106.959999 109.430000 106.860001 108.739998 30931200
32 2015-12-28 107.589996 107.690002 106.180000 106.820000 26704200
33 2015-12-24 109.000000 109.000000 107.949997 108.029999 13596700
34 2015-12-23 107.269997 108.849998 107.199997 108.610001 32657400
35 2015-12-22 107.400002 107.720001 106.449997 107.230003 32789400
36 2015-12-21 107.279999 107.370003 105.570000 107.330002 47590600
37 2015-12-18 108.910004 109.519997 105.809998 106.029999 96453300
38 2015-12-17 112.019997 112.250000 108.980003 108.980003 44772800
39 2015-12-16 111.070000 111.989998 108.800003 111.339996 56238500
40 2015-12-15 111.940002 112.800003 110.349998 110.489998 52978100
41 2015-12-14 112.180000 112.680000 109.790001 112.480003 64318700
Adj Close
0 93.989998
1 93.699997
2 94.269997
3 94.989998
4 95.010002
5 94.019997
6 96.599998
7 95.830001
8 93.970098
9 95.909571
10 96.814656
11 93.582196
12 92.915814
13 99.450356
14 98.903329
15 100.872638
16 95.780276
17 96.267629
18 96.138333
19 96.605790
20 98.982891
21 96.864389
22 99.420519
23 97.998236
24 96.436710
25 95.929460
26 100.156523
27 102.155677
28 104.781429
29 104.691918
30 106.740798
31 108.153132
32 106.243496
33 107.446965
34 108.023837
35 106.651287
36 106.750746
37 105.457759
38 108.391842
39 110.739099
40 109.893688
41 111.872953
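The query parameters appear to encode the date range; this mapping is an assumption based on the old Yahoo Finance URL scheme: a/b/c are the start month (0-indexed), day, and year, d/e/f the end month (0-indexed), day, and year, and g the interval ('d' for daily). A small sketch that builds the URL from those pieces:
import pandas as pd
# Assumed mapping of the old Yahoo Finance query parameters:
#   a, b, c = start month (0-indexed), start day, start year
#   d, e, f = end month (0-indexed), end day, end year
#   g       = interval ('d' = daily)
def yahoo_csv_url(symbol, start, end, interval='d'):
    # start and end are (year, month, day) tuples; months are 1-12 here
    # and converted to Yahoo's 0-indexed months below.
    sy, sm, sd = start
    ey, em, ed = end
    return ("http://real-chart.finance.yahoo.com/table.csv"
            "?s={s}&a={a}&b={b}&c={c}&d={d}&e={e}&f={f}&g={g}"
            "&ignore=.csv").format(s=symbol, a=sm - 1, b=sd, c=sy,
                                   d=em - 1, e=ed, f=ey, g=interval)

df = pd.read_csv(yahoo_csv_url("AAPL", (2015, 12, 12), (2016, 2, 16)),
                 parse_dates=[0])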
#2
Consider using an HTML web scraper like Python's lxml module with its etree.HTML() parser to scrape the HTML table data and then migrate it to a pandas DataFrame. While there are automated options like pandas.read_html(), this approach provides more control over nuances in the HTML content, such as the colspan cell in the Feb 4 row.
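For comparison, a minimal pandas.read_html() sketch (which list index holds the price table depends on the page structure, so tables[0] is an assumption):
import pandas as pd
# read_html returns one DataFrame per <table> on the page; the index of
# the price table here is an assumption and may vary with the page.
tables = pd.read_html("https://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices")
df = tables[0]
The lxml version below uses an XPath expression with a bracketed index, [], on the <td> position in the table: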
import requests
import pandas as pd
from lxml import etree
# READ IN AND PARSE WEB DATA
url = "https://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices"
rq = requests.get(url)
htmlpage = etree.HTML(rq.content)
# INITIALIZE LISTS
dates = []
openstock = []
highstock = []
lowstock = []
closestock = []
volume = []
adjclose = []
# ITERATE THROUGH SEVEN COLUMNS OF TABLE
for i in range(1, 8):
    htmltable = htmlpage.xpath("//tr[td/@class='yfnc_tabledata1']/td[{}]".format(i))

    # APPEND COLUMN DATA TO CORRESPONDING LIST
    for row in htmltable:
        if i == 1: dates.append(row.text)
        if i == 2: openstock.append(row.text)
        if i == 3: highstock.append(row.text)
        if i == 4: lowstock.append(row.text)
        if i == 5: closestock.append(row.text)
        if i == 6: volume.append(row.text)
        if i == 7: adjclose.append(row.text)
# CLEAN UP COLSPAN VALUE (AT FEB. 4)
dates = [d for d in dates if len(d.strip()) > 3]
del dates[7]
del openstock[7]
# MIGRATE LISTS TO DATA FRAME
df = pd.DataFrame({'Dates': dates,
                   'Open': openstock,
                   'High': highstock,
                   'Low': lowstock,
                   'Close': closestock,
                   'Volume': volume,
                   'AdjClose': adjclose})
# AdjClose Close Dates High Low Open Volume
#0 93.99 93.99 Feb 12, 2016 94.50 93.01 94.19 40,121,700
#1 93.70 93.70 Feb 11, 2016 94.72 92.59 93.79 49,686,200
#2 94.27 94.27 Feb 10, 2016 96.35 94.10 95.92 42,245,000
#3 94.99 94.99 Feb 9, 2016 95.94 93.93 94.29 44,331,200
#4 95.01 95.01 Feb 8, 2016 95.70 93.04 93.13 54,021,400
#5 94.02 94.02 Feb 5, 2016 96.92 93.69 96.52 46,418,100
#...
#61 111.73 112.34 Nov 13, 2015 115.57 112.27 115.20 45,812,400
#62 115.10 115.72 Nov 12, 2015 116.82 115.65 116.26 32,525,600
#63 115.48 116.11 Nov 11, 2015 117.42 115.21 116.37 45,218,000
#64 116.14 116.77 Nov 10, 2015 118.07 116.06 116.90 59,127,900
#65 119.92 120.57 Nov 9, 2015 121.81 120.05 120.96 33,871,400
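The scraped cells are all strings; a short follow-up sketch (assuming the column names built above) for converting them to numeric and datetime types:
# Scraped cells are strings: strip thousands separators from Volume and
# cast the price columns to floats before any numeric work.
df['Volume'] = df['Volume'].str.replace(',', '').astype(int)
for col in ['Open', 'High', 'Low', 'Close', 'AdjClose']:
    df[col] = df[col].astype(float)
df['Dates'] = pd.to_datetime(df['Dates'])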