A strange problem with .isin() and strings (Python / Pandas)

Time: 2022-10-06 21:45:36

I'm having a strange problem with the Pandas .isin() method. I'm doing a project in which I need to identify bad passwords by length, common word/password lists, etc (don't worry, this is from a public source). One of the ways is to see if someone is using part of their name as a password. I'm using .isin() to determine if that is the case, but it's giving me weird results. To show:

# Extracting first and last names into their own columns
# (raw strings keep the regex escapes intact)
users['first_name'] = users.user_name.str.extract(r'^(.+)\.', expand=False)
users['last_name'] = users.user_name.str.extract(r'\.(.+)', expand=False)

# Flagging the users with passwords that matches their names
users['uses_name'] = (users['password'].isin(users.first_name)) | (users['password'].isin(users.last_name))

# Looking at the new data
print(users[users['uses_name']][['password','user_name','first_name','last_name','uses_name']].head())

The output of this is:

   password            user_name first_name  last_name uses_name
7    murphy          noreen.hale     noreen       hale      True
11  hubbard      milford.hubbard    milford    hubbard      True
22  woodard        jenny.woodard      jenny    woodard      True
30     reid         rosanna.reid    rosanna       reid      True
58   golden  rosalinda.rodriquez  rosalinda  rodriquez      True

Mostly it's good; milford.hubbard is using 'hubbard' as the password, etc. But then we have several examples like the first one. Noreen Hale is being flagged, despite her password being "murphy", which shares only a single letter with her name.

I can't for the life of me figure out what is causing this. Does anyone know why this is happening, and how to fix it?

2 answers

#1


4  

Since you need to compare values from different columns within the same row, there isn't a simple vectorised option here. Instead, you can use what is (possibly) the fastest alternative at your disposal, a list comprehension:

# True when the password is a substring of the user name in the same row
df['uses_name'] = [
    pwd in name for name, pwd in zip(df.user_name, df.password)
]

Or, if you dislike loops, you can hide them with np.vectorize:

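import numpy as np

def f(name, pwd):
    # Substring test for a single (user_name, password) pair
    return pwd in name

# np.vectorize still loops in Python; it only hides the loop
v = np.vectorize(f)
df['uses_name'] = v(df.user_name, df.password)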

df
   password            user_name  uses_name
7    murphy          noreen.hale      False
11  hubbard      milford.hubbard       True
22  woodard        jenny.woodard       True
30     reid         rosanna.reid       True
58   golden  rosalinda.rodriquez      False

Since first_name and last_name are both extracted from user_name, I don't think you need those separate columns here; checking the password against user_name is enough.
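
If an exact match against the first or last name is wanted instead of the substring test above, comparing the columns row by row is one option. This is a minimal sketch on a made-up frame (the three rows are hypothetical, and the names are split with str.split rather than the question's regexes); it is not the answer's method, and unlike `pwd in name` it would not flag a password that is only part of a name.

```python
import pandas as pd

# Hypothetical sample rows, not the asker's data
users = pd.DataFrame({
    "user_name": ["noreen.hale", "milford.hubbard", "rosanna.reid"],
    "password": ["murphy", "hubbard", "rosanna"],
})

# Split "first.last" on the first dot into two columns
parts = users["user_name"].str.split(".", n=1, expand=True)
users["first_name"] = parts[0]
users["last_name"] = parts[1]

# .eq() compares element by element, so each password is only
# checked against the names in its own row
users["uses_name"] = (
    users["password"].eq(users["first_name"])
    | users["password"].eq(users["last_name"])
)

print(users["uses_name"].tolist())  # [False, True, True]
```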

#2


1  

Regarding the reason why this error occurs:

When you call users['password'].isin(users.first_name), you ask, for each row of users['password'], whether that value equals ANY of the values in the whole first_name column, not just the one in the same row. I therefore assume that the value murphy appears somewhere in that column (or in last_name, which is checked the same way).
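
To make the difference concrete, here is a small sketch on a made-up frame (the rows and the name `toy` are hypothetical, chosen so that one user's password happens to be another user's last name):

```python
import pandas as pd

# Hypothetical toy data: only the second user really uses their own name
toy = pd.DataFrame({
    "user_name": ["noreen.hale", "milford.hubbard", "bill.murphy"],
    "password": ["murphy", "hubbard", "xyz123"],
})
toy["last_name"] = toy["user_name"].str.extract(r"\.(.+)", expand=False)

# isin() asks: does this password equal ANY value in the whole column?
column_wide = toy["password"].isin(toy["last_name"])

# == asks: does this password equal the name in the SAME row?
row_wise = toy["password"] == toy["last_name"]

print(column_wide.tolist())  # [True, True, False]
print(row_wise.tolist())     # [False, True, False]
```

The first row is flagged by isin() only because "murphy" is the last name of a different user, which matches the behaviour seen in the question.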