How to import a table with headers into a DataFrame using the pandas module

Time: 2022-12-30 22:54:53

I'm trying to get information from a table on the internet, as shown below. I'm using a Jupyter notebook with Python 2.7, and I want to load this information into a DataFrame with Python's pandas module. When I copy the table together with its headings and then use the read_clipboard command, I get the error shown below the table link. Without the table headings there is no problem. How can I get the data from the internet with the table headings?

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from numpy.random import randn

df1 = pd.read_clipboard()
df1

The table which I want to import as a data frame.

CParserError                              Traceback (most recent call last)
<ipython-input-4-151d7223d8dc> in <module>()
----> 1 df1 = pd.read_clipboard()
      2 df1

C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\clipboard.pyc in read_clipboard(**kwargs)
     49         kwargs['sep'] = '\s+'
     50 
---> 51     return read_table(StringIO(text), **kwargs)
     52 
     53 

C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    496                     skip_blank_lines=skip_blank_lines)
    497 
--> 498         return _read(filepath_or_buffer, kwds)
    499 
    500     parser_f.__name__ = name

C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    283         return parser
    284 
--> 285     return parser.read()
    286 
    287 _parser_defaults = {

C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    745                 raise ValueError('skip_footer not supported for iteration')
    746 
--> 747         ret = self._engine.read(nrows)
    748 
    749         if self.options.get('as_recarray'):

C:\Anaconda3\envs\python2\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1195     def read(self, nrows=None):
   1196         try:
-> 1197             data = self._reader.read(nrows)
   1198         except StopIteration:
   1199             if self._first_chunk:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7988)()

pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:8244)()

pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8970)()

pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8838)()

pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:22649)()

CParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 2
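
The error most likely comes from the whitespace separator that read_clipboard falls back to (kwargs['sep'] = '\s+' in the traceback above): headers such as "Adj Close" and dates such as "Feb 12, 2016" contain spaces themselves, so the copied rows split into inconsistent numbers of fields. Below is a minimal sketch of a possible workaround, not part of the original question, assuming the table is copied from a browser (browsers normally separate copied table cells with tabs):

import pandas as pd

# Force a tab separator so multi-word headers ("Adj Close") and dates with
# spaces ("Feb 12, 2016") stay in single columns instead of being split on
# whitespace. This is a sketch of a possible workaround, not the asker's code.
df1 = pd.read_clipboard(sep='\t')
print(df1.head())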

2 Answers

#1

There is a CSV linked from the page with all the data, which read_csv can parse easily:

import pandas as pd

df = pd.read_csv("http://real-chart.finance.yahoo.com/table.csv?s=AAPL&d=1&e=16&f=2016&g=d&a=11&b=12&c=1980&ignore=.csv")

If you want a certain time period, you just have to change the params in the URL, i.e. s=AAPL&d=1&e=16&f=2016&g=d&a=11&b=12&c=1980. If we change 1980 to 2015 (a parameterized sketch for building this URL follows the output below):

df = pd.read_csv("http://real-chart.finance.yahoo.com/table.csv?s=AAPL&d=1&e=16&f=2016&g=d&a=11&b=12&c=2015&ignore=.csv",parse_dates=0)

print(df)

We get:

          Date        Open        High         Low       Close     Volume  \
0   2016-02-12   94.190002   94.500000   93.010002   93.989998   40121700   
1   2016-02-11   93.790001   94.720001   92.589996   93.699997   49686200   
2   2016-02-10   95.919998   96.349998   94.099998   94.269997   42245000   
3   2016-02-09   94.290001   95.940002   93.930000   94.989998   44331200   
4   2016-02-08   93.129997   95.699997   93.040001   95.010002   54021400   
5   2016-02-05   96.519997   96.919998   93.690002   94.019997   46418100   
6   2016-02-04   95.860001   97.330002   95.190002   96.599998   46471700   
7   2016-02-03   95.000000   96.839996   94.080002   96.349998   45964300   
8   2016-02-02   95.419998   96.040001   94.279999   94.480003   37357200   
9   2016-02-01   96.470001   96.709999   95.400002   96.430000   40943500   
10  2016-01-29   94.790001   97.339996   94.349998   97.339996   64416500   
11  2016-01-28   93.790001   94.519997   92.389999   94.089996   55678800   
12  2016-01-27   96.040001   96.629997   93.339996   93.419998  133369700   
13  2016-01-26   99.930000  100.879997   98.070000   99.989998   75077000   
14  2016-01-25  101.519997  101.529999   99.209999   99.440002   51794500   
15  2016-01-22   98.629997  101.459999   98.370003  101.419998   65800500   
16  2016-01-21   97.059998   97.879997   94.940002   96.300003   52161500   
17  2016-01-20   95.099998   98.190002   93.419998   96.790001   72334400   
18  2016-01-19   98.410004   98.650002   95.500000   96.660004   53087700   
19  2016-01-15   96.199997   97.709999   95.360001   97.129997   79833900   
20  2016-01-14   97.959999  100.480003   95.739998   99.519997   63170100   
21  2016-01-13  100.320000  101.190002   97.300003   97.389999   62439600   
22  2016-01-12  100.550003  100.690002   98.839996   99.959999   49154200   
23  2016-01-11   98.970001   99.059998   97.339996   98.529999   49739400   
24  2016-01-08   98.550003   99.110001   96.760002   96.959999   70798000   
25  2016-01-07   98.680000  100.129997   96.430000   96.449997   81094400   
26  2016-01-06  100.559998  102.370003   99.870003  100.699997   68457400   
27  2016-01-05  105.750000  105.849998  102.410004  102.709999   55791000   
28  2016-01-04  102.610001  105.370003  102.000000  105.349998   67649400   
29  2015-12-31  107.010002  107.029999  104.820000  105.260002   40912300   
30  2015-12-30  108.580002  108.699997  107.180000  107.320000   25213800   
31  2015-12-29  106.959999  109.430000  106.860001  108.739998   30931200   
32  2015-12-28  107.589996  107.690002  106.180000  106.820000   26704200   
33  2015-12-24  109.000000  109.000000  107.949997  108.029999   13596700   
34  2015-12-23  107.269997  108.849998  107.199997  108.610001   32657400   
35  2015-12-22  107.400002  107.720001  106.449997  107.230003   32789400   
36  2015-12-21  107.279999  107.370003  105.570000  107.330002   47590600   
37  2015-12-18  108.910004  109.519997  105.809998  106.029999   96453300   
38  2015-12-17  112.019997  112.250000  108.980003  108.980003   44772800   
39  2015-12-16  111.070000  111.989998  108.800003  111.339996   56238500   
40  2015-12-15  111.940002  112.800003  110.349998  110.489998   52978100   
41  2015-12-14  112.180000  112.680000  109.790001  112.480003   64318700   

     Adj Close  
0    93.989998  
1    93.699997  
2    94.269997  
3    94.989998  
4    95.010002  
5    94.019997  
6    96.599998  
7    95.830001  
8    93.970098  
9    95.909571  
10   96.814656  
11   93.582196  
12   92.915814  
13   99.450356  
14   98.903329  
15  100.872638  
16   95.780276  
17   96.267629  
18   96.138333  
19   96.605790  
20   98.982891  
21   96.864389  
22   99.420519  
23   97.998236  
24   96.436710  
25   95.929460  
26  100.156523  
27  102.155677  
28  104.781429  
29  104.691918  
30  106.740798  
31  108.153132  
32  106.243496  
33  107.446965  
34  108.023837  
35  106.651287  
36  106.750746  
37  105.457759  
38  108.391842  
39  110.739099  
40  109.893688  
41  111.872953  
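
For reference, here is a small helper that builds the same CSV URL from a ticker and a date range. This is only a sketch, not part of the original answer: the helper name is made up here, the old Yahoo endpoint may no longer be available, and the parameter meanings (a/d are the zero-based start/end months, b/e the start/end days, c/f the start/end years, g=d requests daily rows) are inferred from the URL in this answer rather than documented.

import pandas as pd

def yahoo_csv_url(symbol, start, end):
    # Hypothetical helper, not part of the original answer. Parameter meanings
    # are inferred from the URL above: a/d = 0-based start/end month,
    # b/e = start/end day, c/f = start/end year, g=d means daily data.
    base = "http://real-chart.finance.yahoo.com/table.csv"
    return ("{base}?s={sym}&a={sm}&b={sd}&c={sy}"
            "&d={em}&e={ed}&f={ey}&g=d&ignore=.csv").format(
                base=base, sym=symbol,
                sm=start.month - 1, sd=start.day, sy=start.year,
                em=end.month - 1, ed=end.day, ey=end.year)

url = yahoo_csv_url("AAPL", pd.Timestamp("2015-12-14"), pd.Timestamp("2016-02-16"))
df = pd.read_csv(url, parse_dates=[0])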

#2

Consider using an HTML web scraper such as Python's lxml module (its etree.HTML() parser) to scrape the HTML table data and then migrate it to a pandas DataFrame. While there are automated alternatives like pandas.read_html() (a short sketch of that route follows the output at the end of this answer), this approach gives more control over nuances in the HTML content, such as the column span on the Feb 4 row. The code below uses an XPath expression that selects each <td> by its position in the table with a bracketed index, []:

import requests
import pandas as pd
from lxml import etree

# READ IN AND PARSE WEB DATA
url = "https://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices"    
rq = requests.get(url)
htmlpage = etree.HTML(rq.content)

# INITIALIZE LISTS
dates = []  
openstock = []
highstock = []
lowstock = []
closestock = []
volume = []
adjclose = []

# ITERATE THROUGH SEVEN COLUMNS OF TABLE
for i in range(1,8):
    htmltable = htmlpage.xpath("//tr[td/@class='yfnc_tabledata1']/td[{}]".format(i))

    # APPEND COLUMN DATA TO CORRESPONDING LIST
    for row in htmltable:
        if i == 1: dates.append(row.text)
        if i == 2: openstock.append(row.text)
        if i == 3: highstock.append(row.text)
        if i == 4: lowstock.append(row.text)
        if i == 5: closestock.append(row.text)
        if i == 6: volume.append(row.text)
        if i == 7: adjclose.append(row.text)

# CLEAN UP COLSPAN VALUE (AT FEB. 4)
dates = [d for d in dates if len(d.strip()) > 3]
del dates[7]
del openstock[7]

# MIGRATE LISTS TO DATA FRAME
df = pd.DataFrame({'Dates':dates,
                   'Open':openstock,
                   'High':highstock,
                   'Low':lowstock,                   
                   'Close':closestock,
                   'Volume':volume,
                   'AdjClose':adjclose})

#   AdjClose   Close         Dates    High     Low    Open       Volume
#0     93.99   93.99  Feb 12, 2016   94.50   93.01   94.19   40,121,700
#1     93.70   93.70  Feb 11, 2016   94.72   92.59   93.79   49,686,200
#2     94.27   94.27  Feb 10, 2016   96.35   94.10   95.92   42,245,000
#3     94.99   94.99   Feb 9, 2016   95.94   93.93   94.29   44,331,200
#4     95.01   95.01   Feb 8, 2016   95.70   93.04   93.13   54,021,400
#5     94.02   94.02   Feb 5, 2016   96.92   93.69   96.52   46,418,100
#...
#61   111.73  112.34  Nov 13, 2015  115.57  112.27  115.20   45,812,400
#62   115.10  115.72  Nov 12, 2015  116.82  115.65  116.26   32,525,600
#63   115.48  116.11  Nov 11, 2015  117.42  115.21  116.37   45,218,000
#64   116.14  116.77  Nov 10, 2015  118.07  116.06  116.90   59,127,900
#65   119.92  120.57   Nov 9, 2015  121.81  120.05  120.96   33,871,400
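
For comparison, the pandas.read_html() route mentioned above can be sketched as follows. This is only a sketch, not part of the original answer: read_html needs an HTML parser installed (lxml, or BeautifulSoup with html5lib), it returns one DataFrame per <table> found on the page, so the right index has to be picked by inspection, and the Feb 4 colspan row may still need manual cleanup afterwards.

import pandas as pd

# Sketch of the automated alternative: let pandas locate and parse the tables.
tables = pd.read_html("https://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices",
                      header=0)
df = tables[0]   # assumption: the historical-prices table is the first match
print(df.head())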
