To test some functionality, I would like to create a DataFrame from a string. Say my test data looks like this:
TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""
What is the simplest way to read that data into a Pandas DataFrame?
2 Answers
#1
271
A simple way to do this is to wrap the string in a StringIO object and pass that to the pandas.read_csv function. For example:
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO  # Python 2
else:
    from io import StringIO  # Python 3
import pandas as pd
TESTDATA = StringIO("""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
""")
df = pd.read_csv(TESTDATA, sep=";")
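As a hedged variation on the snippet above, the sketch below assumes Python 3 only and keeps TESTDATA as a plain string, wrapping it in a fresh StringIO at each call. A StringIO object is consumed by the first read, so keeping the raw string around makes it easy to parse the same test data more than once.

```python
from io import StringIO

import pandas as pd

# Keep the raw text as a string so it can be parsed repeatedly.
TESTDATA = """col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

# Wrap the string in a file-like object for each read.
df = pd.read_csv(StringIO(TESTDATA), sep=";")
df_again = pd.read_csv(StringIO(TESTDATA), sep=";")

print(df)
print(df.dtypes)
```

Printing df.dtypes is a quick way to confirm that pandas inferred numeric column types from the text.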
#2
2
A traditional variable-width CSV is hard to read when stored as a string variable. Consider fixed-width, pipe-separated data instead. Various IDEs and editors may have a plugin that formats pipe-separated text into a neat table.
The following works for me. To use it, save it in a file named pandas_util.py. An example is included in the function's docstring. If you're using Python 2, delete the type annotations from the function definition line, since function annotations are Python 3 syntax.
import re
from io import StringIO

import pandas as pd


def read_pipe_separated_str(str_input: str) -> pd.DataFrame:
    """Read a Pandas object from a pipe-separated table contained within a string.

    Example:
        | int_score | ext_score | eligible |
        |           | 701       | True     |
        | 221.3     | 0         | False    |
        |           | 576       | True     |
        | 300       | 600       | True     |

    The leading and trailing pipes are optional, but if one is present, so must be the other.

    In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can be used to neatly format a table.
    """
    substitutions = [
        ('^ *', ''),  # Remove leading spaces
        (' *$', ''),  # Remove trailing spaces
        (r' *\| *', '|'),  # Remove spaces between columns
    ]
    if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
        substitutions.extend([
            (r'^\|', ''),  # Remove redundant leading delimiter
            (r'\|$', ''),  # Remove redundant trailing delimiter
        ])
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    # io.StringIO is used instead of pd.compat.StringIO, which newer pandas releases no longer provide.
    return pd.read_csv(StringIO(str_input), sep='|')
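To illustrate what parsing the docstring table yields, here is a self-contained sketch. It is not the answer's code: parse_pipe_table is a hypothetical helper that does a similar normalisation with plain string methods instead of regular expressions, and it strips the outer pipes line by line rather than checking all lines first. The column spacing in the sample table is also assumed.

```python
from io import StringIO

import pandas as pd

TABLE = """
| int_score | ext_score | eligible |
|           | 701       | True     |
| 221.3     | 0         | False    |
|           | 576       | True     |
| 300       | 600       | True     |
"""


def parse_pipe_table(text: str) -> pd.DataFrame:
    """Hypothetical helper: parse a fixed-width pipe table into a DataFrame."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        # Drop the optional outer pipes (simplification: decided per line).
        if line.startswith("|") and line.endswith("|"):
            line = line[1:-1]
        # Strip the padding around every cell and rejoin with a bare delimiter.
        rows.append("|".join(cell.strip() for cell in line.split("|")))
    return pd.read_csv(StringIO("\n".join(rows)), sep="|")


df = parse_pipe_table(TABLE)
print(df)
print(df.dtypes)
```

Empty cells come back as NaN, so a column containing them is read as floating point, while the True/False column is read as booleans.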