I have a semicolon-separated file that I need to read in parts, so I'm using readlines(k), where k is a chosen buffer size. The code below does not return exactly 'k' bytes, because the lines do not all have the same length in bytes (as is typical of CSV files). Something like this:
BUFFER_SIZE = 1024
f=open(file,'r')
chunck_tmp = f.readlines(BUFFER_SIZE)
At this point "chunck_tmp" is a list, and I would like to find the exact number of bytes inside it. The problem is that the information now has another format, including spaces, brackets and other things to account for: if I try something like "str(chunck_tmp)" and evaluate "len(str(chunck_tmp))", the result is greater than the real number of bytes inside chunck_tmp. To prove it, I ran a small test:
>>> test="abcde;abcde;abcde;abcde;abcde\n"
>>> len(test)
30
>>> t=test.split(';')
>>> type(t)
<type 'list'>
>>> len(str(t))
47
>>> print str(t)
['abcde', 'abcde', 'abcde', 'abcde', 'abcde\n']
Note that test has exactly 30 bytes, counting '\n' as a single special character (each line of a CSV file ends with a line feed in POSIX format, or with '\r\n' in Windows format).
We can check this with the expression "len(test)". But if we create a list from this string using split and then try to recover the original size, we hit the problem: the length is 47 bytes!
Why? Printing the chunk converted to a string and evaluating its length, we can see that the 17 extra bytes are exactly "[" (1 byte), " " (4 bytes), "'" (10 bytes), "\n" (1 byte) and "]" (1 byte).
Bingo !!! 1 + 4 + 10 + 1 + 1 = 17 bytes
And now my question: can someone help me find a way to calculate the real number of bytes inside a list object in Python? My real goal is to know the actual size returned by readlines when it is called with an argument, as shown above with chunck_tmp.
1 Answer
#1
I think you're getting confused between the actual objects, and their string representations:
The problem is that now the information has another format including spaces, brackets and something else …
This is wrong. The information is not in another format including spaces, brackets, and something else; it's just a list of strings.
If you call str on a list of strings, it will generate spaces, brackets, commas, and quotes around each string, and possibly convert some characters to backslash escapes, and so on. But there's no reason to call str here.
If you want the sum of the lengths of all of the strings in a list of strings, just write that:
sum(map(len, chunck_tmp))
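As a quick illustration of the difference, here is a minimal sketch. It is written for Python 3 (unlike the Python 2 session in the question), and the sample lines are made up to stand in for whatever readlines returns:

```python
# Hypothetical sample data standing in for the list returned by readlines().
chunck_tmp = ["abcde;abcde\n", "fgh;ij\n", "k\n"]

# Total number of characters actually held by the strings in the list.
content_length = sum(map(len, chunck_tmp))

# Length of the printable representation, which adds brackets, quotes,
# commas, spaces and backslash escapes that are not part of the data.
repr_length = len(str(chunck_tmp))

print(content_length)
print(repr_length)
```

Note that in Python 3, len on a str counts characters rather than bytes, so the two only coincide for single-byte text such as ASCII; reading the file in binary mode gives byte strings whose len is a byte count.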
Now, this may not be the same as the number of bytes actually read from disk. As you pointed out, there may be Windows-style newlines (\r\n) that get converted to Python-style newlines (\n). But this will only happen if the file was opened with newline translation enabled (e.g., universal newlines mode 'rU' instead of 'r' in Python 2; Python 3 text mode translates by default).
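To make the newline translation concrete, here is a small sketch, assuming Python 3, where text mode translates newlines by default. The file contents are invented for the example; the same bytes are read once in text mode and once in binary mode:

```python
import os
import tempfile

# Write two Windows-style lines to a temporary file, byte for byte.
raw = b"abcde;abcde\r\nfgh;ij\r\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Text mode with default newline handling translates '\r\n' to '\n'.
    with open(path, "r", encoding="ascii") as f:
        text_lines = f.readlines()

    # Binary mode returns the bytes exactly as stored on disk.
    with open(path, "rb") as f:
        byte_lines = f.readlines()
finally:
    os.remove(path)

text_total = sum(map(len, text_lines))
byte_total = sum(map(len, byte_lines))
print(text_total, byte_total)
```

The text-mode total comes out one smaller per line than the binary-mode total, which is exactly the gap between "bytes in the list" and "bytes on disk".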
If that's what you're trying to solve, you can fix it by looking at the newlines attribute of the file. If it was Windows-style, that will be '\r\n'. So, you can do this:
sum(map(len, chunck_tmp)) + len(chunck_tmp) * (len(f.newlines) - 1)
But again, sum(map(len, chunck_tmp)) is already the number of bytes in the list, which is what you asked for; the correction is only needed if you want the number of bytes on disk that had to be read to generate this list, which is a different thing.
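One caveat with the on-disk formula: the newlines attribute can also be None (no newline seen yet) or a tuple (mixed styles), and the last line of a file may have no terminator at all. A slightly more defensive sketch, assuming Python 3 and a single-byte encoding, might look like the following; bytes_on_disk is a hypothetical helper name, not a library function:

```python
def bytes_on_disk(lines, newlines):
    """Estimate the bytes read from disk for newline-translated lines.

    `lines` is the list returned by readlines() in text mode and `newlines`
    is the file's `newlines` attribute. Assumes a single-byte encoding.
    """
    total = sum(map(len, lines))
    if isinstance(newlines, tuple):
        # Mixed newline styles: the per-line correction is ambiguous.
        raise ValueError("mixed newline styles; read in binary mode instead")
    if newlines is not None:
        # Only lines that actually end with a newline were shortened.
        terminated = sum(1 for line in lines if line.endswith("\n"))
        total += terminated * (len(newlines) - 1)
    return total


print(bytes_on_disk(["abcde;abcde\n", "fgh;ij\n"], "\r\n"))
```

If the exact byte count matters more than having text strings, opening the file in binary mode and summing the lengths avoids the correction entirely.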
Finally, in attempting to figure out what was going on, you tried to simplify it by just calling split on a string. But there's a big difference here: readlines leaves the newlines on the end of each line, but split throws away the delimiters. Still, the answer is nearly the same as the last point:
sum(map(len, t)) + (len(t) - 1) * len(';')
(Obviously in your case, you know len(';') is 1, and multiplying by 1 does nothing, so you can leave it off.)
But, once again, sum(map(len, t)) is already the number of bytes in the list, which is what you asked for; you only need this if you want to regenerate the length of the original test.
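Applied to the string from the question, the split case can be checked with a short sketch (Python 3 syntax; the variable names pieces_length and rebuilt_length are only for illustration):

```python
test = "abcde;abcde;abcde;abcde;abcde\n"
delimiter = ";"
t = test.split(delimiter)

# Characters kept in the pieces (the trailing '\n' stays in the last piece).
pieces_length = sum(map(len, t))

# Add back one delimiter between each pair of adjacent pieces.
rebuilt_length = pieces_length + (len(t) - 1) * len(delimiter)

print(pieces_length, rebuilt_length, len(test))
```

The rebuilt length matches len(test), and neither figure involves the brackets, quotes or commas that str(t) would add.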