Create a Pandas DataFrame from a string

Date: 2021-07-21 22:54:37

In order to test some functionality I would like to create a DataFrame from a string. Let's say my test data looks like:

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

What is the simplest way to read that data into a Pandas DataFrame?

2 Answers

#1


271  

A simple way to do this is to wrap the string in StringIO and pass it to the pandas.read_csv function. For example:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

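# Keep the data lines flush left: any indentation inside the
# triple-quoted string becomes part of the data.
TESTDATA = StringIO("""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
""")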

df = pd.read_csv(TESTDATA, sep=";")
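
For readers on Python 3 only, the version check above can be dropped. The following is a minimal sketch under that assumption, using the test data from the question; the printed checks are only illustrative and are not part of the original answer.

```python
from io import StringIO

import pandas as pd

# Test data from the question, kept as a plain string.
TESTDATA = """col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

# Wrap the string in a file-like object so read_csv can consume it.
df = pd.read_csv(StringIO(TESTDATA), sep=";")

print(df.shape)   # number of rows and columns
print(df.dtypes)  # column types inferred by read_csv
```

Because the string itself is left untouched, it can be wrapped in a fresh StringIO whenever it needs to be read again.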

#2


2  

A traditional variable-width CSV is hard to read when the data is stored as a string variable. Consider fixed-width pipe-separated data instead. Many IDEs and editors have a plugin that formats pipe-separated text into a neat table.

The following works for me. To use it, save it in a file named pandas_util.py. An example is included in the function's docstring. Function annotations are valid on every Python 3 release, so delete the type annotations from the function definition line only if you are still on Python 2.

import re
from io import StringIO

import pandas as pd


def read_pipe_separated_str(str_input: str) -> pd.DataFrame:
    """Read a Pandas DataFrame from a pipe-separated table contained within a string.

    Example:
        | int_score | ext_score | eligible |
        |           | 701       | True     |
        | 221.3     | 0         | False    |
        |           | 576       | True     |
        | 300       | 600       | True     |

    The leading and trailing pipes are optional, but if one is present, so must be the other.

    In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can be used to neatly format a table.
    """
    substitutions = [
        (r'^ *', ''),  # Remove leading spaces
        (r' *$', ''),  # Remove trailing spaces
        (r' *\| *', '|'),  # Remove spaces between columns
    ]
    # Strip the outer pipes only when every line has both a leading and a trailing pipe
    lines = str_input.strip().split('\n')
    if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in lines):
        substitutions.extend([
            (r'^\|', ''),  # Remove redundant leading delimiter
            (r'\|$', ''),  # Remove redundant trailing delimiter
        ])
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    # io.StringIO replaces pd.compat.StringIO, which is not available in recent pandas releases
    return pd.read_csv(StringIO(str_input), sep='|')
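
To make the substitutions easier to follow, here is a condensed walk-through applied to the table from the docstring. It is a sketch, not the function above: it assumes every line has both a leading and a trailing pipe, and the variable names (`table`, `text`) are made up for illustration.

```python
import re
from io import StringIO

import pandas as pd

# Table from the docstring, stored as a plain string.
table = """
| int_score | ext_score | eligible |
|           | 701       | True     |
| 221.3     | 0         | False    |
|           | 576       | True     |
| 300       | 600       | True     |
"""

text = table.strip()
# Collapse the padding around each pipe so fields carry no stray spaces.
text = re.sub(r' *\| *', '|', text)
# Drop the outer pipe at the start and end of every line.
text = re.sub(r'^\||\|$', '', text, flags=re.MULTILINE)

print(text)  # the normalized text that is handed to read_csv

df = pd.read_csv(StringIO(text), sep='|')
print(df)
```

The empty cells in the first column become missing values once read_csv parses the normalized text.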
