I have the following code:
businessdata = ['Name of Location','Address','City','Zip Code','Website','Yelp',
'# Reviews', 'Yelp Rating Stars','BarRestStore','Category',
'Price Range','Alcohol','Ambience','Latitude','Longitude']
business = pd.read_table('FL_Yelp_Data_v2.csv', sep=',', header=1, names=businessdata)
print '\n\nBusiness\n'
print business[:6]
It reads my file and creates a pandas DataFrame I can work with. What I need is to count how many categories are in each row of the 'Category' column and store this number in a new column named '# Categories'. Here is a sample of the target column:
Category
French
Adult Entertainment , Lounges , Music Venues
American (New) , Steakhouses
American (New) , Beer, Wine & Spirits , Gastropubs
Chicken Wings , Sports Bars , American (New)
Japanese
Desired output:
Category                                            # Categories
French                                              1
Adult Entertainment , Lounges , Music Venues        3
American (New) , Steakhouses                        2
American (New) , Beer, Wine & Spirits , Gastropubs  4
Chicken Wings , Sports Bars , American (New)        3
Japanese                                            1
EDIT 1:
The raw input is a CSV file and the target column is "Category". I can't post screenshots yet. I don't think the values to be counted are lists.
This is my code:
business = pd.read_table('FL_Yelp_Data_v2.csv', sep=',', header=1, names=businessdata, skip_blank_lines=True)
#business = pd.read_csv('FL_Yelp_Data_v2.csv')
business['Category'].str.split(',').apply(len)
#not sure where to declare the df part in the suggestions that use it.
print business[:6]
but I keep getting the following error:
TypeError: object of type 'float' has no len()
EDIT 2:
I GIVE UP. Thanks for all your help, but I'll have to figure something else out.
5 Answers
#1
Assuming that Category is actually a list, you can use apply (per @EdChum's suggestion):
business['# Categories'] = business.Category.apply(len)
If not, you first need to parse it and convert it into a list.
df['Category'] = df.Category.map(lambda x: [i.strip() for i in x.split(",")])
Can you show a sample of EXACTLY what this column looks like (including the exact quotation marks)?
P.S. @EdChum, thank you for your suggestions; I appreciate them. I believe the list comprehension method may be faster, based on a test I ran on some sample text data with 30k+ rows:
%%timeit
df.Category.str.strip().str.split(',').apply(len)
10 loops, best of 3: 44.8 ms per loop
%%timeit
df.Category.map(lambda x: [i.strip() for i in x.split(",")])
10 loops, best of 3: 28.4 ms per loop
Even accounting for the len function call:
%%timeit
df.Category.map(lambda x: len([i.strip() for i in x.split(",")]))
10 loops, best of 3: 30.3 ms per loop
#2
This works:
business['# Categories'] = business['Category'].apply(lambda x: len(x.split(',')))
If you need to handle NA values and similar cases, you can pass a more elaborate function instead of the lambda.
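As a rough illustration of such a "more elaborate function", here is a minimal sketch. The name count_categories, the choice to count a missing value as 0, and the small sample frame are all assumptions made for the example, not part of the original data.

```python
import pandas as pd

def count_categories(value):
    """Count comma-separated entries; a missing value counts as 0 (assumed policy)."""
    if pd.isna(value):
        return 0
    # Skip empty pieces, e.g. those produced by a trailing comma.
    return len([part for part in str(value).split(',') if part.strip()])

# Hypothetical sample standing in for the real 'Category' column.
business = pd.DataFrame({'Category': ['French',
                                      'American (New) , Steakhouses',
                                      None]})
business['# Categories'] = business['Category'].apply(count_categories)
print(business)
```

Whether a missing category should count as 0 or stay missing depends on how the column is used later, so treat that default as a placeholder.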
#3
Use pd.read_csv to make the input easier:
business = pd.read_csv('FL_Yelp_Data_v2.csv')
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Once the DataFrame is created, you can write a function that splits the Category column on "," and counts the length of the resulting list. Apply it with apply, either as a named function or as a lambda.
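The steps described here could look roughly like the sketch below. The CSV text is a made-up two-column stand-in for FL_Yelp_Data_v2.csv (read from memory so the example runs on its own), and count_categories is a hypothetical helper name.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; multi-category cells are quoted
# so that their commas are not treated as column separators.
csv_text = '''Name of Location,Category
Cafe A,French
Bar B,"Adult Entertainment , Lounges , Music Venues"
Grill C,"American (New) , Steakhouses"
'''

business = pd.read_csv(io.StringIO(csv_text))

def count_categories(cell):
    # Split the cell on commas and count the resulting pieces.
    return len(cell.split(','))

business['# Categories'] = business['Category'].apply(count_categories)
print(business[['Category', '# Categories']])
```

This sketch assumes every row has a category; an empty cell would be read as NaN and would need separate handling.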
#4
You can do this...
# Iterate over (index label, value) pairs so .loc targets the right row.
for idx, cats in business['Category'].items():
    business.loc[idx, '#Categories'] = len(cats.split(","))
#5
I had a similar problem: I had to count the number of comma-separated words in each row. I resolved it in the following manner:
data['Number_of_Categories'] = data['Category'].apply(lambda x : len(str(x).split(',')))
Basically, I first convert each row's value to a string, since Python was recognizing it as a float, and then apply len to the split result. Hope this helps.
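One thing to keep in mind with the str() conversion: an empty cell is read as NaN (a float), and str(NaN) is 'nan', so such a row ends up counted as one category. The sketch below compares that with the pandas .str accessor, which leaves NaN rows alone. The sample frame and the column names via_str and via_accessor are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the NaN mimics an empty cell in the CSV.
data = pd.DataFrame({'Category': ['Japanese',
                                  'Chicken Wings , Sports Bars , American (New)',
                                  np.nan]})

# str(NaN) is 'nan', so the missing row is counted as one category.
data['via_str'] = data['Category'].apply(lambda x: len(str(x).split(',')))

# The .str accessor propagates NaN instead; fill those rows with 0 afterwards.
data['via_accessor'] = (data['Category'].str.split(',').str.len()
                        .fillna(0).astype(int))
print(data[['via_str', 'via_accessor']])
```

Which of the two results is right for a missing row is a judgment call about the data, not something the code can decide.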