Parsing the SEC RSS Title field with a regular expression

Time: 2022-06-21 00:34:18

All, I have an RSS feed from the SEC whose title field looks like this, e.g.:


10-Q - What ever INC (0000123456) (Filer)


So the general structure is:


form_name + whitespace + dash + whitespace + company_name + " (" + SIC_Number + ") (Filer)"


I need to extract the company_name and SIC_Number. Note that the form_name can contain a dash, and the company name can contain whitespace and dashes. This can be done (I'm using Python) by calling re.split once on the dashes and again on the brackets, but it's ugly (shown for completeness):


m = re.split(r'[()]', re.split(' - ', title)[-1])  # title: the RSS title string (avoids shadowing the built-in str)
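For reference, here is a minimal sketch of what that two-step split produces on the sample title. The names `title`, `tail`, and `parts` are assumptions introduced for illustration; they are not from the original post.

```python
import re

title = "10-Q - What ever INC (0000123456) (Filer)"

# Step 1: split on " - " and keep the last piece (company name plus brackets).
tail = re.split(' - ', title)[-1]

# Step 2: split that piece on either bracket character.
parts = re.split(r'[()]', tail)

# The company name keeps a trailing space, so it is stripped here.
company_name = parts[0].strip()
sic_number = parts[1]

print(parts)
print(company_name, sic_number)
```

Because only the last " - " piece is kept, a company name that itself contains " - " would be cut short, which is one reason a single pattern is preferable.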

What would the proper RegEx be?


1 Answer



If the company name does not contain the string " - ", the SIC number consists only of digits, and there is a space before the opening bracket, this is what you are looking for:


m = re.search(r' - ([^(]+?) \((\d+)\)',t)
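A small sketch of how the two capture groups from that pattern could be read out. The helper name `parse_title` and the constant `TITLE_RE` are hypothetical, introduced only for this example.

```python
import re

# Pattern from the answer: " - ", lazy company name, a space, then digits in brackets.
TITLE_RE = re.compile(r' - ([^(]+?) \((\d+)\)')

def parse_title(title):
    """Return (company_name, sic_number), or None if the title does not match."""
    m = TITLE_RE.search(title)
    if m is None:
        return None
    # group(1) is the company name, group(2) the number inside the first brackets.
    return m.group(1), m.group(2)

result = parse_title("10-Q - What ever INC (0000123456) (Filer)")
print(result)
```

Note that `re.search` anchors on the first " - " in the string, so a form name containing " - " would leak into the captured name, and since `[^(]` excludes "(", a company name that itself contains an opening bracket would not match this pattern as written.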


