How to test for binary data in plain text extracted from a database in Python?

Time: 2022-01-04 03:04:00

I have to extract some product information from a MySQL database, then construct a SOAP request and use Python's suds library to send it to a remote server.

But some of the extracted information mixes binary data with text data, such as:

...
Some plain data
...
Content-Type: image/pjpeg

? JFIF  H H   C 


 C     
  P P                  E       !1AQa "q?#2脑$4BCR倯ご?3TVfrs枴贬             -        !1A"2QR亼Ba毖q?   ? 鮊€
 ( ?€
 ( ?€
 釱Whf颲[e?喸媼q屧ㄠ蚀厲蹳ZIO痙(r5?-i擯栧剗矹?尴?蝓玁帰XZ鞭#崛攳┸蹵X僦?攅Z@?らM;X藙?N蹮垀s@jQ?Z徸林炑M~?麒]H=颦C胝_р}"?Gixqz坽徸玨?O?Q+谍?w鬪??-囯礥?а|乛聚Zyt>?`[~跲桫?騽D曐縅CmN=?shOU+湫锏竩&6げ?铚扺d)mn?c?X?6RmQ   JJ?7繴*v>.捈鉵基d?堌疻熼G肗裪囅w騲癔R?qW鑪陭瞘.C窄贇CkyV瀷1? 柚%W}}?Yz僫芐D嫆1鬊懜赈篽效lq蒟H棯]y|G.;硅憖Ew??栧$?=e菚鏪Rbj?枝}爲釼Z3FE<尒n%C蓎??樋>`I 顝y∥+pP敐慗岻゛\硳诮湣]~??xΔ诩[_?叴b嫜?yz*?=ψk?猝"%Ak?撍滷秠BR?-铈b礖?蟷y[)厌麓4,怈
窧q觏?_   N獛擒F杍q凞画Q襃@镛P$讄k\鏁祘譟㎎*V>W??鵔M嫯q寓y焊閔C杔栽?+鹕瑟qbs:z氱^PJ聣?汜ZU"ス嘔+輔€8楺<夻Uゎ顓瞚?氅豴<]P銨c? +K6]┓gr杺  蝫?VJ能?陭欹殡J倢gS扚?娭酧??gw?膙y矼j折B礕殯
繅捁%撽蛵震挔撲y?3鑪澳N?Ec~涰巽j慭搆锥▓IP?)┤燎鐠懴 €H9瘾F毖l氾+岎6o?殎託炗y尬n??8??黬?4Qbń覦;縢?兟HRONd *紂蚽娖t猦?^?2
庴E$x 譴q箘瘃J檐H筶鷆[?8 ?9颢*髟揤v緜魸擭槧?%msV嬖z瘨摉擀F摫鞍s犮殩H4s咸?S蓉扷濅?    V?昋c?u婆SG撙???{臘亞攕曳<\K? D]+#瓃kgw犤?.?惨邔蹓#p(巂s?瘑蜲Q?傻鑟6ce?敟)?9嶔?測誗?yfvp謒NnbmB3齑栘v>RR=拏H'焴j壎e鎨洘?窑??MH单;5?T1倧o)锐认J?QY&7?橥%诤授b?氭\堫轁q)荖no弎閂?添頶5E敌?U瞿??雛柖Q??Ps?冇9'=)J殅朥k%鈌l疆$q}?い$袋蜕~跏綺衄qU玉矰潱v硻e鷵?薭?<爗树q熣?I;ぞ_鬿埗d.握莰俜6渺^貀No-乾R?r\芷<A稙鋆j璲吡累Y错$F梱?镫[猄k\﹋JrRp悇?救 
...
end of binary data.

I don't know how this data was inserted into MySQL, but I have to detect this type of data and replace the binary part with the string EEEEE; otherwise suds will raise an exception.

Can anyone tell me how to test for this type of data?

Thanks in advance.

1 solution

#1


Mixed text and binary?! Sounds awful...

However, if all the data is in the format you presented in your example (i.e. with a Content-Type declaration), you could do something along the lines of:

#!/usr/bin/env python
# -*- coding: utf-8  -*-

data = '''Some plain data!Content-Type: image/pjpeg ? JFIF  H H   C  C 
          P P                  E       !1AQa "q?#2脑$4BCR倯ご?3TVfrs枴贬    - 釱W
          hf颲[e?喸媼q屧ㄠ蚀厲蹳ZIO痙(r5?-i擯栧剗矹?尴?蝓玁帰XZ鞭#崛攳┸蹵X僦?攅Z@?らM
          ;X藙?N蹮垀s@jQ?Z徸林炑M~?麒]H=颦C胝_р}"?Gixqz坽徸玨?O?Q+谍?w鬪??'''
tdata = '''This has no binary in it'''

def filter_data(blob):
    # Cut the blob at the first Content-Type header and append the placeholder.
    mixed = blob.find('Content-Type:')
    if mixed != -1:  # -1 ==> not found
        return blob[:mixed] + 'EEEEE'
    return blob

print(filter_data(data))
print(filter_data(tdata))
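
A possible Python 3 variant of the same idea, sketched under the assumption that the database driver hands the column back as bytes rather than str (the question doesn't say). The names filter_bytes, MARKER and PLACEHOLDER are made up for this illustration, and UTF-8 is only a guess at the text encoding:

```python
PLACEHOLDER = "EEEEE"        # replacement string requested in the question
MARKER = b"Content-Type:"    # header that precedes the binary part in the sample

def filter_bytes(blob, encoding="utf-8"):
    """Return the text before MARKER plus PLACEHOLDER, or all of the text."""
    # partition() splits at the first MARKER; sep is empty when it is absent.
    head, sep, _tail = blob.partition(MARKER)
    # errors="replace" keeps a stray undecodable byte from raising here.
    text = head.decode(encoding, errors="replace")
    return text + PLACEHOLDER if sep else text

if __name__ == "__main__":
    mixed = b"Some plain data\nContent-Type: image/pjpeg\n\n\xff\xd8\xff\xe0JFIF"
    print(filter_bytes(mixed))
    print(filter_bytes(b"This has no binary in it"))
```

Working on bytes avoids decoding the image data at all, so the garbage never has to become a string before it is cut off.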

If the binary data is not preceded by a Content-Type declaration, I'm not sure there is a 100% reliable way to distinguish text from binary (a byte from the binary could decode to some sensible character), but you could at least improve the situation by filtering against a pool of valid characters.

For example, assuming that all valid text is alphanumeric (A-Z, a-z and 0-9) plus the space character:

#!/usr/bin/env python
# -*- coding: utf-8  -*-

data = '''Some plain data脑$4BCR倯ご?3TVfrs枴贬    - 釱W hf颲[e?喸媼q屧ㄠ蚀厲蹳
          ZIO痙(r5?-i擯栧剗矹?尴?蝓玁帰XZ鞭#崛攳┸蹵X僦?攅Z@?らM;X藙?N蹮垀s@jQ?Z徸
          林炑M~?麒]H=颦C胝_р}"?Gixqz坽徸玨?O?Q+谍?w鬪??'''
tdata = '''This has no binary in it'''
bdata = '''炑M~?麒]H=颦C胝'''

pool = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 '

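def filter_data(blob):
    # Keep only the leading run of characters that belong to the pool.
    last_good_one = None
    for i, c in enumerate(blob):
        if c in pool:
            last_good_one = i
        else:
            break
    if last_good_one is None:
        raise ValueError('Only binary data!')
    return blob[:last_good_one+1]

print(filter_data(data))
print(filter_data(tdata))
try:
    print(filter_data(bdata))
except ValueError as e:
    print(e)  # bdata starts with a character that is not in the pool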
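
When there is no marker to look for, one common heuristic (not something proposed above, and not 100% reliable either) is to inspect the raw bytes: treat a NUL byte, or a high share of non-printable bytes, as a sign of binary content. A rough sketch, assuming the value is available as bytes; looks_binary, TEXT_BYTES and the 30% threshold are arbitrary choices for illustration:

```python
import string

# Byte values treated as "text": printable ASCII including common whitespace.
# This is an assumption; legitimate non-ASCII text (e.g. UTF-8 Chinese)
# would be counted as non-text by this check.
TEXT_BYTES = frozenset(string.printable.encode("ascii"))

def looks_binary(blob, threshold=0.30):
    """Guess whether blob holds binary data rather than plain ASCII text."""
    if not blob:
        return False
    if b"\x00" in blob:  # NUL bytes practically never occur in plain text
        return True
    # Iterating over bytes yields integers, which match the set members.
    non_text = sum(1 for byte in blob if byte not in TEXT_BYTES)
    return non_text / len(blob) > threshold

if __name__ == "__main__":
    print(looks_binary(b"This has no binary in it"))
    print(looks_binary(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"))
```

A value flagged this way could then be replaced with EEEEE as a whole; finding the exact point where text ends and binary begins still needs something like the pool scan.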

HTH!
