
Time: 2022-04-13 18:16:18

I have a data frame of around 4000 records with "ID" and "Description" columns; each description has been tokenized into single words.


Id       one_word_tokenize
1952043  [Swimming, Pool, in, the, roof, top,…
1918916  [Luxury, Apartments, consisting, 11, towers, B...
1645751  [Flat, available, sale, Medavakkam, Modular, k…
1270503  [Toddler, Pool, with, Jogging, Tracks, for people…
1495638  [near, medavakkam, junction, calm, area, near,...

How can I iterate through the rows and find the values that match entries from Categories? The Categories.py file contains the word classification below.


category = [('Luxury', 'IN', 'Recreation_Ammenities'),
        ('Swimming', 'IN','Recreation_Ammenities'),
        ('Toddler', 'IN', 'Recreation_Ammenities'),
        ('Pool', 'IN', 'Recreation_Ammenities')]
Recreation = [e1 for (e1, rel, e2) in category if e2=='Recreation_Ammenities']

I have tried this for a single row by specifying the row number, but I want to check every row.


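# .ix was removed in pandas 1.0; .iloc selects the row by position
example = df['one_word_tokenize'].iloc[1]
for val in example:
    for am in Categories.Recreation:
        if am == val:
            print(val)  # placeholder body: the original snippet stopped at the `if`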

My desired output is:


  Id     one_word_tokenized_text                       Recreation_Ammenities         
1952043  [Swimming, Pool, in, the, roof, top,…          Swimming, Pool
1918916  [Luxury, Apartments B...                       Luxury
1645751  [Flat, available, sale, k…                    
1270503  [Toddler, Pool, with, Jogging, Tracks,…        Toddler,Pool,Jogging
1495638  [near, medavakkam, junction,...    

Please help.


2 solutions



It's not clear whether you expect ["Swimming", "Pool"] to match the "Swimming Pool" category. If so, the operation is much more expensive, because you need to decide which n-gram lengths to evaluate in each token list.
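To make the n-gram point concrete, here is a minimal sketch of matching multi-word categories against single-word tokens. The helper name `ngrams`, the `phrases` set and the sample token list are made up for illustration; they are not part of the question's code.

```python
def ngrams(tokens, max_n):
    """Yield every run of 1..max_n adjacent tokens, joined by a space."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# Hypothetical category phrases and token list, for illustration only
phrases = {"Luxury", "Swimming Pool", "Toddler Pool", "Pool Table"}
tokens = ["Swimming", "Pool", "in", "the", "roof", "top"]

# Only look at n-grams as long as the longest phrase (two words here)
max_n = max(len(p.split()) for p in phrases)
found = [g for g in ngrams(tokens, max_n) if g in phrases]
print(found)
```

The cost grows with `max_n`, since each extra n-gram length adds roughly one more pass over every token list.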


If you're only interested in matching a single token to a category, you can use either str.extractall() for long-format output or str.count() for wide-format output.


With extractall

import numpy as np
import pandas as pd

# Note: "Swimming" and "Pool" from OP is combined in first row for example purposes
# Additionally, one "Luxury" is added to the first entry, to consider repeat matches
tokens = pd.Series([["Swimming Pool", "in", "Luxury", "roof", "top", "Luxury"],
                   ["Luxury", "Apartments", "consisting", "11", "towers"],
                   ["near", "medavakkam", "junction", "calm", "area", "near"]])
category = [('Luxury', 'IN', 'Recreation_Ammenities'),
        ('Swimming Pool', 'IN','Recreation_Ammenities'),
        ('Toddler Pool', 'IN', 'Recreation_Ammenities'),
        ('Pool Table', 'IN', 'Recreation_Ammenities')]
recreation = [e1 for (e1, rel, e2) in category if e2=='Recreation_Ammenities']

# check for matches from any element in recreation, for each token set
matches = tokens.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in recreation])))
# report results
match_list = [[m for m in match.values.ravel() if isinstance(m, str)] for match in matches]
match_df = pd.DataFrame({"tokens":tokens, "matches":match_list})

Long match_df:


                           matches                                      tokens  
0  [Swimming Pool, Luxury, Luxury]   [Swimming Pool, in, Luxury, roof, top, Luxury] 
1                         [Luxury]   [Luxury, Apartments, consisting, 11, towers] 
2                               []   [near, medavakkam, junction, calm, area, near] 

With count

matches = {cat:tokens.apply(lambda x: pd.Series(x).str.count("{}".format(cat)).sum()) for cat in recreation}
match_df = pd.DataFrame(matches)
match_df["tokens"] = tokens

Wide match_df:


   Luxury  Pool Table  Swimming Pool  Toddler Pool                      tokens
0       2           0              1             0   [Swimming Pool, in, Luxury, roof, top, Luxury]
1       1           0              0             0   [Luxury, Apartments, consisting, 11, towers]
2       0           0              0             0   [near, medavakkam, junction, calm, area, near] 
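One caveat, offered as a side note rather than a correction: str.count() treats each category name as a regular expression and counts occurrences inside each token, so a shorter name such as "Pool" would also be counted inside the token "Swimming Pool". If exact token matches are what you want, a sketch using the plain list method `count` could look like this (the name `wide` is just illustrative):

```python
import pandas as pd

tokens = pd.Series([["Swimming Pool", "in", "Luxury", "roof", "top", "Luxury"],
                    ["Luxury", "Apartments", "consisting", "11", "towers"],
                    ["near", "medavakkam", "junction", "calm", "area", "near"]])
recreation = ["Luxury", "Swimming Pool", "Toddler Pool", "Pool Table"]

# list.count only counts tokens that are exactly equal to the category name
wide = pd.DataFrame(
    {cat: tokens.apply(lambda toks, c=cat: toks.count(c)) for cat in recreation}
)
wide["tokens"] = tokens
```

For this sample data the counts come out the same as with str.count(); the two approaches only differ when one category name is contained in another token.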



Wouldn't a boolean slice using apply do the trick here?


df[df['one_word_tokenize'].apply(lambda ls: any(tok in Categories.Recreation for tok in ls))]
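A boolean slice only selects rows. If the goal is the extra column from the question, one possible sketch is to apply a small function to every row. The sample rows are shortened from the question, `recreation` stands in for Categories.Recreation, and the helper name `matched` is hypothetical.

```python
import pandas as pd

# Sample rows modelled on the question; the token lists are shortened
df = pd.DataFrame({
    "Id": [1952043, 1918916, 1645751],
    "one_word_tokenize": [
        ["Swimming", "Pool", "in", "the", "roof", "top"],
        ["Luxury", "Apartments", "consisting", "11", "towers"],
        ["Flat", "available", "sale", "Medavakkam"],
    ],
})

# Stand-in for Categories.Recreation; a set makes the membership test cheap
recreation = {"Luxury", "Swimming", "Toddler", "Pool"}

def matched(tokens):
    """Join the tokens that appear in the category list, keeping their order."""
    return ", ".join(tok for tok in tokens if tok in recreation)

# apply runs the function once per row, so no row number is needed
df["Recreation_Ammenities"] = df["one_word_tokenize"].apply(matched)
print(df)
```

Rows without any matching token end up with an empty string, which corresponds to the blank cells in the desired output.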


