I have a data frame with around 4000 records containing "Id" and "Description", where each description has been tokenized into individual words.
>>> df.iloc[:, 0:1]
Output:
Id one_word_tokenize
1952043 [Swimming, Pool, in, the, roof, top,…
1918916 [Luxury, Apartments, consisting, 11, towers, B...
1645751 [Flat, available, sale, Medavakkam, Modular, k…
1270503 [Toddler, Pool, with, Jogging, Tracks, for people…
1495638 [near, medavakkam, junction, calm, area, near,...
How do I iterate through the rows and find the matching values from Categories? The Categories.py file contains the classification of words below.
category = [('Luxury', 'IN', 'Recreation_Ammenities'),
            ('Swimming', 'IN', 'Recreation_Ammenities'),
            ('Toddler', 'IN', 'Recreation_Ammenities'),
            ('Pool', 'IN', 'Recreation_Ammenities')]
Recreation = [e1 for (e1, rel, e2) in category if e2=='Recreation_Ammenities']
I have tried specifying a single row number, but I want the check to run on every row.
example = df['one_word_tokenize'].iloc[1]  # .ix is deprecated; use .iloc
for val in example:
    for am in Categories.Recreation:
        if am == val:
            print(am, "~", "Recreation")
My desired output is:
Id one_word_tokenized_text Recreation_Ammenities
1952043 [Swimming, Pool, in, the, roof, top,… Swimming, Pool
1918916 [Luxury, Apartments B... Luxury
1645751 [Flat, available, sale, k…
1270503 [Toddler, Pool, with, Jogging, Tracks,… Toddler,Pool,Jogging
1495638 [near, medavakkam, junction,...
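The per-row check from the attempt above can be turned into this output column by applying the same membership test to every row. A minimal sketch with made-up rows, assuming the token column is named `one_word_tokenize` and `recreation` is the list built from `category`:

```python
import pandas as pd

# Hypothetical reconstruction of two rows of the OP's data, for illustration only
df = pd.DataFrame({
    "Id": [1952043, 1918916],
    "one_word_tokenize": [["Swimming", "Pool", "in", "the", "roof", "top"],
                          ["Luxury", "Apartments", "consisting", "11", "towers"]],
})
recreation = ["Luxury", "Swimming", "Toddler", "Pool"]

# For each row, keep the tokens that appear in the category word list
df["Recreation_Ammenities"] = df["one_word_tokenize"].apply(
    lambda tokens: ", ".join(t for t in tokens if t in recreation)
)
```

This yields "Swimming, Pool" for the first row and "Luxury" for the second, matching the shape of the desired output.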
Please help.
2 Answers
#1
1
It's not clear if you expect ["Swimming", "Pool"] to match the "Swimming Pool" category. If so, you have a much more expensive operation on hand, as you'll need to specify what level of n-grams needs to be evaluated in each token list.
If you're only interested in matching a single token to a category, you can use either extractall() for long-format output, or count() for wide-format output.
With extractall
import pandas as pd

# Note: "Swimming" and "Pool" from OP are combined in the first row for example purposes
# Additionally, one "Luxury" is added to the first entry, to consider repeat matches
tokens = pd.Series([["Swimming Pool", "in", "Luxury", "roof", "top", "Luxury"],
                    ["Luxury", "Apartments", "consisting", "11", "towers"],
                    ["near", "medavakkam", "junction", "calm", "area", "near"]])

category = [('Luxury', 'IN', 'Recreation_Ammenities'),
            ('Swimming Pool', 'IN', 'Recreation_Ammenities'),
            ('Toddler Pool', 'IN', 'Recreation_Ammenities'),
            ('Pool Table', 'IN', 'Recreation_Ammenities')]
recreation = [e1 for (e1, rel, e2) in category if e2 == 'Recreation_Ammenities']

# check for matches from any element in recreation, for each token set
matches = tokens.apply(lambda x: pd.Series(x).str.extractall("|".join("({})".format(cat) for cat in recreation)))

# report results
match_list = [[m for m in match.values.ravel() if isinstance(m, str)] for match in matches]
match_df = pd.DataFrame({"tokens": tokens, "matches": match_list})
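One caveat with the pattern built above: the category names are interpolated into the regex verbatim, so a name containing regex metacharacters would silently change the pattern. Escaping each name first guards against that. A small sketch, where "Pool (Indoor)" is a hypothetical category name added for illustration:

```python
import re

# "Pool (Indoor)" is hypothetical; its parentheses are regex metacharacters
recreation = ["Luxury", "Swimming Pool", "Pool (Indoor)"]

# re.escape neutralises the parentheses so they are matched literally
pattern = "|".join("({})".format(re.escape(cat)) for cat in recreation)
```

Without re.escape, the inner parentheses would create extra capture groups instead of matching literal text.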
Long match_df:
matches tokens
0 [Swimming Pool, Luxury, Luxury] [Swimming Pool, in, Luxury, roof, top, Luxury]
1 [Luxury] [Luxury, Apartments, consisting, 11, towers]
2 [] [near, medavakkam, junction, calm, area, near]
With count
matches = {cat:tokens.apply(lambda x: pd.Series(x).str.count("{}".format(cat)).sum()) for cat in recreation}
match_df = pd.DataFrame(matches)
match_df["tokens"] = tokens
Wide match_df:
Luxury Pool Table Swimming Pool Toddler Pool tokens
0 2 0 1 0 [Swimming Pool, in, Luxury, roof, top, Luxury]
1 1 0 0 0 [Luxury, Apartments, consisting, 11, towers]
2 0 0 0 0 [near, medavakkam, junction, calm, area, near]
#2
-1
Wouldn't a boolean slice using apply do the trick here?
df[df['one_word_tokenize'].apply(lambda ls: 'Recreation_Ammenities' in ls)]
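Note that this one-liner keeps rows whose token list contains the literal string 'Recreation_Ammenities', not rows containing any of the category words. To filter on the category words themselves, a set intersection inside apply works. A sketch with made-up rows, assuming the same column name as in the question:

```python
import pandas as pd

# Hypothetical rows for illustration
df = pd.DataFrame({
    "one_word_tokenize": [["Swimming", "Pool", "roof"],
                          ["near", "junction", "calm"]],
})
recreation = {"Luxury", "Swimming", "Toddler", "Pool"}

# Keep only rows whose token list shares at least one word with the category set
mask = df["one_word_tokenize"].apply(lambda ls: bool(recreation.intersection(ls)))
filtered = df[mask]
```

Using a set for `recreation` keeps each per-row lookup cheap even with a large category list.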