Parsing the SEC RSS Title field with a regular expression

Date: 2022-06-01 12:46:07

All, I have an RSS feed from the SEC whose title field for each company looks like this, e.g.:

10-Q - What ever INC (0000123456) (Filer)

so the general structure is:

form_name + whitespace + dash + whitespace + company_name + " (" + SIC_Number + ") (Filer)"

I need to extract the company_name and SIC_Number. Note that the form_name can contain a dash, and the company name can contain whitespace and dashes. I'm using Python, and this can be done by calling re.split once on the dashes and again on the brackets, but it's ugly (shown for completeness):

m = re.split(r'[()]', re.split(r' - ', title)[-1])  # title holds the RSS title string; avoids shadowing the built-in str

What would the proper RegEx be?

1 Solution

#1


Assuming the company name does not contain the string " - ", the SIC number consists only of digits, and there is a space before the opening bracket, this is what you are looking for:

m = re.search(r' - ([^(]+?) \((\d+)\)',t)
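
For anyone who wants to try this out, here is a minimal sketch that wraps the pattern above in a helper and returns the two captured groups. The names parse_title, ANSWER_PATTERN and ANCHORED_PATTERN, as well as the named groups, are hypothetical and only for illustration. The second pattern is not part of the answer: it is a stricter variant that assumes every title ends with the literal "(Filer)" suffix shown in the question.

```python
import re

# The answer's pattern with named groups added for readability: the text
# after the first " - " up to " (digits)".
ANSWER_PATTERN = re.compile(r' - (?P<name>[^(]+?) \((?P<number>\d+)\)')

# Hypothetical stricter variant (an assumption, not from the answer): it
# anchors on the whole title and the literal "(Filer)" suffix, so the
# company name may itself contain "(" or " - ".
ANCHORED_PATTERN = re.compile(
    r'^(?P<form>.+?) - (?P<name>.+) \((?P<number>\d+)\) \(Filer\)$'
)


def parse_title(title, pattern=ANSWER_PATTERN):
    """Return (company_name, number), or None if the title does not match."""
    m = pattern.search(title)
    if m is None:
        return None
    return m.group('name'), m.group('number')


if __name__ == '__main__':
    # Sample title taken from the question.
    print(parse_title('10-Q - What ever INC (0000123456) (Filer)'))
    # -> ('What ever INC', '0000123456')
```

Returning None on a failed match lets the caller decide how to handle titles that do not fit the expected structure. If company names with parentheses turn up in the feed, passing ANCHORED_PATTERN as the second argument may help, provided the "(Filer)" assumption holds for your data.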
