存储一组四个(或更多)值的最佳数据结构是什么?

时间:2022-05-17 17:01:00

Say I have the following variables and its corresponding values which represents a record.

假设我有以下变量及其代表记录的相应值。

name = 'abc'
age = 23
weight = 60
height = 174

Please note that the value could be of different types (string, integer, float, reference-to-any-other-object, etc).

请注意,值可以是不同的类型(字符串,整数,浮点数,引用任何其他对象等)。

There will be many records (at least >100,000). Each and every record will be unique when all these four variables (actually its values) are put together. In other words, there exists no record with all 4 values are the same.

会有很多记录(至少> 100,000)。当所有这四个变量(实际上是它的值)放在一起时,每个记录都是唯一的。换句话说,没有记录,所有4个值都相同。

I am trying to find an efficient data structure in Python which will allow me to (store and) retrieve records based on any one of these variables in log(n) time complexity.

我试图在Python中找到一个有效的数据结构,这将允许我在log(n)时间复杂度中基于这些变量中的任何一个来检索(存储和)检索记录。

For example:

def retrieve(name=None,age=None,weight=None,height=None) 
    if name is not None and age is None and weight is None and height is None:
        /* get all records with the given name */
    if name is None and age is not None and weight is None and height is None:
        /* get all records with the given age */
    ....
    return records

The way the retrieve should be called is as follows:

应该调用检索的方式如下:

retrieve(name='abc') 

The above should return [{name:'abc', age:23, wight:50, height=175}, {name:'abc', age:28, wight:55, height=170}, etc]

以上应该返回[{name:'abc',年龄:23​​,wight:50,身高= 175},{name:'abc',年龄:28,wight:55,身高= 170}等]

retrieve(age=23) 

The above should return [{name:'abc', age:23, wight:50, height=175}, {name:'def', age:23, wight:65, height=180}, etc]

上面应该返回[{name:'abc',年龄:23​​,wight:50,身高= 175},{name:'def',年龄:23​​,wight:65,身高= 180}等等]

And, I may need to add one or two more variables to this record in future. For example, say, sex = 'm'. So, the retrieve function must be scalable.

而且,我可能需要在将来为此记录添加一个或两个以上的变量。比如说,sex ='m'。因此,检索功能必须是可扩展的。

So in short: Is there a data structure in Python which will allow storing a record with n number of columns (name, age, sex, weigh, height, etc) and retrieving records based on any (one) of the column in logarithmic (or ideally constant - O(1) look-up time) complexity?

简而言之:Python中是否存在一个数据结构,它允许存储包含n个列(名称,年龄,性别,体重,高度等)的记录,并根据对数中的任何一个列检索记录(或理想的常数 - O(1)查找时间)复杂度?

4 个解决方案

#1


6  

There isn't a single data structure built into Python that does everything you want, but it's fairly easy to use a combination of the ones it does have to achieve your goals and do so fairly efficiently.

Python中没有一个内置的数据结构可以完成你想要的任何事情,但是它很容易结合使用它来实现你的目标并且相当有效地完成目标。

For example, say your input was the following data in a comma-separated-value file called employees.csv with field names defined as shown by the first line:

例如,假设您的输入是名为employees.csv的逗号分隔值文件中的以下数据,其字段名称定义如第一行所示:

name,age,weight,height
Bob Barker,25,175,6ft 2in
Ted Kingston,28,163,5ft 10in
Mary Manson,27,140,5ft 6in
Sue Sommers,27,132,5ft 8in
Alice Toklas,24,124,5ft 6in

The following is working code which illustrates how to read and store this data into a list of records, and automatically create separate look-up tables for finding records associated with the values of contained in the fields each of these record.

以下是工作代码,说明如何将此数据读取并存储到记录列表中,并自动创建单独的查找表,以查找与每个记录的字段中包含的值相关联的记录。

The records are instances of a class created by namedtuple which is a very memory efficient because each one lacks a __dict__ attribute that class instances normally contain. Using them will make it possible to access the fields of each by name using dot syntax, like record.fieldname.

记录是由namedtuple创建的类的实例,这是一个非常高效的内存,因为每个类都缺少类实例通常包含的__dict__属性。使用它们可以使用点语法(如record.fieldname)按名称访问每个字段。

The look-up tables are defaultdict(list) instances, which provide dictionary-like O(1) look-up times on average, and also allow multiple values to be associated with each one. So the look-up key is the value of the field value being sought, and the data associated with it will be a list of the integer indices of the Person records stored in the employees list with that value — so they'll all be relatively small.

查找表是defaultdict(list)实例,它们平均提供类似字典的O(1)查找时间,并且还允许多个值与每个值相关联。因此,查找键是要搜索的字段值的值,与之关联的数据将是存储在员工列表中的人员记录的整数索引列表中的值 - 因此它们都是相对的小。

Note that the code for the class is completely data-driven in that it doesn't contain any hardcoded field names which instead are all taken from the first row of csv data input file when it's read in. When an instance is used, any actual retrieve() method calls must, of course, contain valid field name keyword arguments.

请注意,该类的代码完全是数据驱动的,因为它不包含任何硬编码的字段名称,而是在读入时从csv数据输入文件的第一行中获取。当使用实例时,任何实际当然,retrieve()方法调用必须包含有效的字段名称关键字参数。

Update

Modified to not create a lookup table for every unique value of every field when the data file is first read. Now the retrieve() method creates them only as needed (and saves/caches the result for future use). Also modified to work in Python 2.7+ including 3.x.

修改为在首次读取数据文件时不为每个字段的每个唯一值创建查找表。现在,retrieve()方法仅在需要时创建它们(并保存/缓存结果以供将来使用)。也修改为在Python 2.7+中工作,包括3.x.

from collections import defaultdict, namedtuple
import csv

class DataBase(object):
    def __init__(self, csv_filename, recordname):
        # Read data from csv format file into a list of namedtuples.
        with open(csv_filename, 'r') as inputfile:
            csv_reader = csv.reader(inputfile, delimiter=',')
            self.fields = next(csv_reader)  # Read header row.
            self.Record = namedtuple(recordname, self.fields)
            self.records = [self.Record(*row) for row in csv_reader]
            self.valid_fieldnames = set(self.fields)

        # Create an empty table of lookup tables for each field name that maps
        # each unique field value to a list of record-list indices of the ones
        # that contain it.
        self.lookup_tables = defaultdict(lambda: defaultdict(list))

    def retrieve(self, **kwargs):
        """ Fetch a list of records with a field name with the value supplied
            as a keyword arg (or return None if there aren't any). """
        if len(kwargs) != 1: raise ValueError(
            'Exactly one fieldname/keyword argument required for function '
            '(%s specified)' % ', '.join([repr(k) for k in kwargs.keys()]))
        field, value = list(kwargs.items())[0]  # Get only keyword arg and value.
        if field not in self.valid_fieldnames:
            raise ValueError('keyword arg "%s" isn\'t a valid field name' % field)
        if field not in self.lookup_tables:  # Must create field look up table.
            for index, record in enumerate(self.records):
                value = getattr(record, field)
                self.lookup_tables[field][value].append(index)
        matches = [self.records[index]
                    for index in self.lookup_tables[field].get(value, [])]
        return matches if matches else None

if __name__ == '__main__':
    empdb = DataBase('employees.csv', 'Person')
    print("retrieve(name='Ted Kingston'): {}".format(empdb.retrieve(name='Ted Kingston')))
    print("retrieve(age='27'): {}".format(empdb.retrieve(age='27')))
    print("retrieve(weight='150'):".format(empdb.retrieve(weight='150')))
    try:
        print("retrieve(hight='5ft 6in'):".format(empdb.retrieve(hight='5ft 6in')))
    except ValueError as e:
        print("ValueError('{}') raised as expected".format(e))
    else:
        raise type('NoExceptionError', (Exception,), {})(
            'No exception raised from "retrieve(hight=\'5ft\')" call.')

Output:

retrieve(name='Ted Kingston'): [Person(name='Ted Kingston', age='28', weight='163', height='5ft 10in')]
retrieve(age='27'): [Person(name='Mary Manson', age='27', weight='140', height='5ft 6in'),
                     Person(name='Sue Sommers', age='27', weight='132', height='5ft 8in')]
retrieve(weight='150'): None
retrieve(hight='5ft 6in'): ValueError('keyword arg "hight" is an invalid fieldname')
                           raised as expected

#2


4  

Is there a data structure in Python which will allow storing a record with n number of columns (name, age, sex, weigh, height, etc) and retrieving records based on any (one) of the column in logarithmic (or ideally constant - O(1) look-up time) complexity?

Python中是否有数据结构允许存储具有n个列(名称,年龄,性别,体重,高度等)的记录,并基于对数的任何(一个)列检索记录(或理想情况下不变 - O(1)查找时间)复杂性?

No, there is none. But you could try to implement one on the basis of one dictionary per value dimension. As long as your values are hashable of course. If you implement a custom class for your records, each dictionary will contain references to the same objects. This will save you some memory.

不,没有。但是你可以尝试在每个值维度的一个字典的基础上实现一个。只要您的价值当然可以清洗。如果为记录实现自定义类,则每个字典将包含对相同对象的引用。这样可以节省一些内存。

#3


3  

Given http://wiki.python.org/moin/TimeComplexity how about this:

鉴于http://wiki.python.org/moin/TimeComplexity如何:

  • Have a dictionary for every column you're interested in - AGE, NAME, etc.
  • 为您感兴趣的每一栏都准备一本词典 - 年龄,名字等。

  • Have the keys of that dictionaries (AGE, NAME) be possible values for given column (35 or "m").
  • 让字典(AGE,NAME)的键是给定列(35或“m”)的可能值。

  • Have a list of lists representing values for one "collection", e.g. VALUES = [ [35, "m"], ...]
  • 有一个表示一个“集合”值的列表,例如VALUES = [[35,“m”],...]

  • Have the value of column dictionaries (AGE, NAME) be lists of indices from the VALUES list.
  • 列字典(AGE,NAME)的值是VALUES列表中的索引列表。

  • Have a dictionary which maps column name to index within lists in VALUES so that you know that first column is age and second is sex (you could avoid that and use dictionaries, but they introduce large memory footrpint and with over 100K objects this may or not be a problem).
  • 有一个字典,它将列名称映射到VALUES列表中的索引,这样你就知道第一列是年龄,第二列是性别(你可以避免使用字典,但是它们引入了大量的内存footrpint和超过100K的对象,这可能与否是一个问题)。

Then the retrieve function could look like this:

然后,检索功能可能如下所示:

def retrieve(column_name, column_value):
    if column_name == "age":
        return [VALUES[index] for index in AGE[column_value]]      
    elif ...: # repeat for other "columns"

Then, this is what you get

然后,这就是你得到的

VALUES = [[35, "m"], [20, "f"]]
AGE = {35:[0], 20:[1]}
SEX = {"m":[0], "f":[1]}
KEYS = ["age", "sex"]

retrieve("age", 35)
# [[35, 'm']]

If you want a dictionary, you can do the following:

如果您想要字典,可以执行以下操作:

[dict(zip(KEYS, values)) for values in retrieve("age", 35)]
# [{'age': 35, 'sex': 'm'}]

but again, dictionaries are a little heavy on the memory side, so if you can go with lists of values it might be better.

但同样,字典在内存方面有点沉重,所以如果你可以使用值列表,那可能会更好。

Both dictionary and list retrieval are O(1) on average - worst case for dictionary is O(n) - so this should be pretty fast. Maintaining that will be a little bit of pain, but not so much. To "write", you'd just have to append to the VALUES list and then append the index in VALUES to each of the dictionaries.

字典和列表检索平均为O(1) - 字典的最坏情况是O(n) - 因此这应该非常快。保持这将是一点点痛苦,但不是那么多。要“写”,您只需要附加到VALUES列表,然后将VALUES中的索引附加到每个词典。

Of course, then best would be to benchmark your actual implementation and look for potential improvements, but hopefully this make sense and will get you going :)

当然,那么最好是对你的实际实现进行基准测试并寻找潜在的改进,但希望这是有道理的,并会让你去:)

EDIT:

Please note that as @moooeeeep said, this will only work if your values are hashable and therefore can be used as dictionary keys.

请注意,正如@moooeeeep所说,这只适用于您的值是可以清除的,因此可以用作字典键。

#4


2  

You could achieve logarithmic time complexity in a relational database using indexes (O(log(n)**k) with single column indexes). Then to retrieve data just construct appropriate SQL:

您可以使用索引(O(log(n)** k)和单列索引在关系数据库中实现对数时间复杂度)。然后检索数据只需构造适当的SQL:

names = {'name', 'age', 'weight', 'height'}

def retrieve(c, **params):
    if not (params and names.issuperset(params)):
        raise ValueError(params)
    where = ' and '.join(map('{0}=:{0}'.format, params))
    return c.execute('select * from records where ' + where, params)

Example:

import sqlite3

c = sqlite3.connect(':memory:')
c.row_factory = sqlite3.Row # to provide key access

# create table
c.execute("""create table records
             (name text, age integer, weight real, height real)""")

# insert data
records = (('abc', 23, 60, 174+i) for i in range(2))
c.executemany('insert into records VALUES (?,?,?,?)', records)

# create indexes
for name in names:
    c.execute("create index idx_{0} on records ({0})".format(name))

try:
    retrieve(c, naame='abc')
except ValueError:
    pass
else:
    assert 0

for record in retrieve(c, name='abc', weight=60):
    print(record['height'])

Output:

174.0
175.0

#1


6  

There isn't a single data structure built into Python that does everything you want, but it's fairly easy to use a combination of the ones it does have to achieve your goals and do so fairly efficiently.

Python中没有一个内置的数据结构可以完成你想要的任何事情,但是它很容易结合使用它来实现你的目标并且相当有效地完成目标。

For example, say your input was the following data in a comma-separated-value file called employees.csv with field names defined as shown by the first line:

例如,假设您的输入是名为employees.csv的逗号分隔值文件中的以下数据,其字段名称定义如第一行所示:

name,age,weight,height
Bob Barker,25,175,6ft 2in
Ted Kingston,28,163,5ft 10in
Mary Manson,27,140,5ft 6in
Sue Sommers,27,132,5ft 8in
Alice Toklas,24,124,5ft 6in

The following is working code which illustrates how to read and store this data into a list of records, and automatically create separate look-up tables for finding records associated with the values of contained in the fields each of these record.

以下是工作代码,说明如何将此数据读取并存储到记录列表中,并自动创建单独的查找表,以查找与每个记录的字段中包含的值相关联的记录。

The records are instances of a class created by namedtuple which is a very memory efficient because each one lacks a __dict__ attribute that class instances normally contain. Using them will make it possible to access the fields of each by name using dot syntax, like record.fieldname.

记录是由namedtuple创建的类的实例,这是一个非常高效的内存,因为每个类都缺少类实例通常包含的__dict__属性。使用它们可以使用点语法(如record.fieldname)按名称访问每个字段。

The look-up tables are defaultdict(list) instances, which provide dictionary-like O(1) look-up times on average, and also allow multiple values to be associated with each one. So the look-up key is the value of the field value being sought, and the data associated with it will be a list of the integer indices of the Person records stored in the employees list with that value — so they'll all be relatively small.

查找表是defaultdict(list)实例,它们平均提供类似字典的O(1)查找时间,并且还允许多个值与每个值相关联。因此,查找键是要搜索的字段值的值,与之关联的数据将是存储在员工列表中的人员记录的整数索引列表中的值 - 因此它们都是相对的小。

Note that the code for the class is completely data-driven in that it doesn't contain any hardcoded field names which instead are all taken from the first row of csv data input file when it's read in. When an instance is used, any actual retrieve() method calls must, of course, contain valid field name keyword arguments.

请注意,该类的代码完全是数据驱动的,因为它不包含任何硬编码的字段名称,而是在读入时从csv数据输入文件的第一行中获取。当使用实例时,任何实际当然,retrieve()方法调用必须包含有效的字段名称关键字参数。

Update

Modified to not create a lookup table for every unique value of every field when the data file is first read. Now the retrieve() method creates them only as needed (and saves/caches the result for future use). Also modified to work in Python 2.7+ including 3.x.

修改为在首次读取数据文件时不为每个字段的每个唯一值创建查找表。现在,retrieve()方法仅在需要时创建它们(并保存/缓存结果以供将来使用)。也修改为在Python 2.7+中工作,包括3.x.

from collections import defaultdict, namedtuple
import csv

class DataBase(object):
    def __init__(self, csv_filename, recordname):
        # Read data from csv format file into a list of namedtuples.
        with open(csv_filename, 'r') as inputfile:
            csv_reader = csv.reader(inputfile, delimiter=',')
            self.fields = next(csv_reader)  # Read header row.
            self.Record = namedtuple(recordname, self.fields)
            self.records = [self.Record(*row) for row in csv_reader]
            self.valid_fieldnames = set(self.fields)

        # Create an empty table of lookup tables for each field name that maps
        # each unique field value to a list of record-list indices of the ones
        # that contain it.
        self.lookup_tables = defaultdict(lambda: defaultdict(list))

    def retrieve(self, **kwargs):
        """ Fetch a list of records with a field name with the value supplied
            as a keyword arg (or return None if there aren't any). """
        if len(kwargs) != 1: raise ValueError(
            'Exactly one fieldname/keyword argument required for function '
            '(%s specified)' % ', '.join([repr(k) for k in kwargs.keys()]))
        field, value = list(kwargs.items())[0]  # Get only keyword arg and value.
        if field not in self.valid_fieldnames:
            raise ValueError('keyword arg "%s" isn\'t a valid field name' % field)
        if field not in self.lookup_tables:  # Must create field look up table.
            for index, record in enumerate(self.records):
                value = getattr(record, field)
                self.lookup_tables[field][value].append(index)
        matches = [self.records[index]
                    for index in self.lookup_tables[field].get(value, [])]
        return matches if matches else None

if __name__ == '__main__':
    empdb = DataBase('employees.csv', 'Person')
    print("retrieve(name='Ted Kingston'): {}".format(empdb.retrieve(name='Ted Kingston')))
    print("retrieve(age='27'): {}".format(empdb.retrieve(age='27')))
    print("retrieve(weight='150'):".format(empdb.retrieve(weight='150')))
    try:
        print("retrieve(hight='5ft 6in'):".format(empdb.retrieve(hight='5ft 6in')))
    except ValueError as e:
        print("ValueError('{}') raised as expected".format(e))
    else:
        raise type('NoExceptionError', (Exception,), {})(
            'No exception raised from "retrieve(hight=\'5ft\')" call.')

Output:

retrieve(name='Ted Kingston'): [Person(name='Ted Kingston', age='28', weight='163', height='5ft 10in')]
retrieve(age='27'): [Person(name='Mary Manson', age='27', weight='140', height='5ft 6in'),
                     Person(name='Sue Sommers', age='27', weight='132', height='5ft 8in')]
retrieve(weight='150'): None
retrieve(hight='5ft 6in'): ValueError('keyword arg "hight" is an invalid fieldname')
                           raised as expected

#2


4  

Is there a data structure in Python which will allow storing a record with n number of columns (name, age, sex, weigh, height, etc) and retrieving records based on any (one) of the column in logarithmic (or ideally constant - O(1) look-up time) complexity?

Python中是否有数据结构允许存储具有n个列(名称,年龄,性别,体重,高度等)的记录,并基于对数的任何(一个)列检索记录(或理想情况下不变 - O(1)查找时间)复杂性?

No, there is none. But you could try to implement one on the basis of one dictionary per value dimension. As long as your values are hashable of course. If you implement a custom class for your records, each dictionary will contain references to the same objects. This will save you some memory.

不,没有。但是你可以尝试在每个值维度的一个字典的基础上实现一个。只要您的价值当然可以清洗。如果为记录实现自定义类,则每个字典将包含对相同对象的引用。这样可以节省一些内存。

#3


3  

Given http://wiki.python.org/moin/TimeComplexity how about this:

鉴于http://wiki.python.org/moin/TimeComplexity如何:

  • Have a dictionary for every column you're interested in - AGE, NAME, etc.
  • 为您感兴趣的每一栏都准备一本词典 - 年龄,名字等。

  • Have the keys of that dictionaries (AGE, NAME) be possible values for given column (35 or "m").
  • 让字典(AGE,NAME)的键是给定列(35或“m”)的可能值。

  • Have a list of lists representing values for one "collection", e.g. VALUES = [ [35, "m"], ...]
  • 有一个表示一个“集合”值的列表,例如VALUES = [[35,“m”],...]

  • Have the value of column dictionaries (AGE, NAME) be lists of indices from the VALUES list.
  • 列字典(AGE,NAME)的值是VALUES列表中的索引列表。

  • Have a dictionary which maps column name to index within lists in VALUES so that you know that first column is age and second is sex (you could avoid that and use dictionaries, but they introduce large memory footrpint and with over 100K objects this may or not be a problem).
  • 有一个字典,它将列名称映射到VALUES列表中的索引,这样你就知道第一列是年龄,第二列是性别(你可以避免使用字典,但是它们引入了大量的内存footrpint和超过100K的对象,这可能与否是一个问题)。

Then the retrieve function could look like this:

然后,检索功能可能如下所示:

def retrieve(column_name, column_value):
    if column_name == "age":
        return [VALUES[index] for index in AGE[column_value]]      
    elif ...: # repeat for other "columns"

Then, this is what you get

然后,这就是你得到的

VALUES = [[35, "m"], [20, "f"]]
AGE = {35:[0], 20:[1]}
SEX = {"m":[0], "f":[1]}
KEYS = ["age", "sex"]

retrieve("age", 35)
# [[35, 'm']]

If you want a dictionary, you can do the following:

如果您想要字典,可以执行以下操作:

[dict(zip(KEYS, values)) for values in retrieve("age", 35)]
# [{'age': 35, 'sex': 'm'}]

but again, dictionaries are a little heavy on the memory side, so if you can go with lists of values it might be better.

但同样,字典在内存方面有点沉重,所以如果你可以使用值列表,那可能会更好。

Both dictionary and list retrieval are O(1) on average - worst case for dictionary is O(n) - so this should be pretty fast. Maintaining that will be a little bit of pain, but not so much. To "write", you'd just have to append to the VALUES list and then append the index in VALUES to each of the dictionaries.

字典和列表检索平均为O(1) - 字典的最坏情况是O(n) - 因此这应该非常快。保持这将是一点点痛苦,但不是那么多。要“写”,您只需要附加到VALUES列表,然后将VALUES中的索引附加到每个词典。

Of course, then best would be to benchmark your actual implementation and look for potential improvements, but hopefully this make sense and will get you going :)

当然,那么最好是对你的实际实现进行基准测试并寻找潜在的改进,但希望这是有道理的,并会让你去:)

EDIT:

Please note that as @moooeeeep said, this will only work if your values are hashable and therefore can be used as dictionary keys.

请注意,正如@moooeeeep所说,这只适用于您的值是可以清除的,因此可以用作字典键。

#4


2  

You could achieve logarithmic time complexity in a relational database using indexes (O(log(n)**k) with single column indexes). Then to retrieve data just construct appropriate SQL:

您可以使用索引(O(log(n)** k)和单列索引在关系数据库中实现对数时间复杂度)。然后检索数据只需构造适当的SQL:

names = {'name', 'age', 'weight', 'height'}

def retrieve(c, **params):
    if not (params and names.issuperset(params)):
        raise ValueError(params)
    where = ' and '.join(map('{0}=:{0}'.format, params))
    return c.execute('select * from records where ' + where, params)

Example:

import sqlite3

c = sqlite3.connect(':memory:')
c.row_factory = sqlite3.Row # to provide key access

# create table
c.execute("""create table records
             (name text, age integer, weight real, height real)""")

# insert data
records = (('abc', 23, 60, 174+i) for i in range(2))
c.executemany('insert into records VALUES (?,?,?,?)', records)

# create indexes
for name in names:
    c.execute("create index idx_{0} on records ({0})".format(name))

try:
    retrieve(c, naame='abc')
except ValueError:
    pass
else:
    assert 0

for record in retrieve(c, name='abc', weight=60):
    print(record['height'])

Output:

174.0
175.0