Using Python regex to textually analyze meeting minutes: "who said what"

I'm trying to textually analyze "who said what" in FOMC meetings. I have these PDFs of the meeting minutes converted into text.

My current plan is to use regex to split the file on the names (always capitalized) and keep the delimiters.

import re

names = re.findall(r"\s\n{2,}[A-Z]{2,}\.*\s*[A-Z]{2,}\.\d*\s", text)
speech = re.split(r"\s\n{2,}[A-Z]{2,}\.*\s*[A-Z]{2,}\.\d*\s", text)

And then I write these lists into a CSV with two columns: names, speech.
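
For reference, the CSV step could look roughly like the sketch below. The `write_turns` helper and the column headers are hypothetical names chosen for illustration. It assumes `names` comes from `re.findall` and `speech` from `re.split` on the same pattern, in which case `re.split` also returns the text before the first name as an extra leading element.

```python
import csv
import io


def write_turns(out, names, speech):
    """Write (name, speech) rows to an open text file object."""
    # re.split keeps the text before the first delimiter as element 0,
    # so speech may be one element longer than names; drop that preamble.
    if len(speech) == len(names) + 1:
        speech = speech[1:]
    writer = csv.writer(out)
    writer.writerow(["name", "speech"])
    for name, said in zip(names, speech):
        # Collapse the line breaks left over from the PDF conversion.
        writer.writerow([name.strip(), " ".join(said.split())])


# Small in-memory demonstration; a real run would pass
# open("minutes.csv", "w", newline="", encoding="utf-8") instead.
buf = io.StringIO()
write_turns(buf, [" \n\nMR. KOHN. "], ["preamble", "So moved. "])
```

Opening the real output file with `newline=""` follows the `csv` module's recommendation and avoids blank rows on Windows.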

It seems like a really inefficient method. Is there a better way to do this?

Sample of minutes:

\n\nCHAIRMAN BERNANKE. Good afternoon, everybody. \n\nPARTICIPANTS. Good afternoon. \n\nCHAIRMAN BERNANKE. We need a motion to close our meeting. \n\nMR. KOHN. So moved. \n\nCHAIRMAN BERNANKE. Thank you. Our meeting today and tomorrow follows the \n\nbasic sequence we’ve been having recently, but with an important addition, which is that we \n\nhave a staff presentation on inflation dynamics. We need about two hours for that presentation, I \n\nunderstand, and we’ve thought about it and decided to put it at the end of the meeting so we \n\nwould have plenty of time to complete our policy decision. But I hope that people will pay \n\nattention to the time and make sure we have enough time tomorrow to give appropriate attention \n\n \nto the presentation. \n\nIn that spirit, why don’t we start directly? Mr. Sack. \n\n\nMR. SACK. Since the last FOMC meeting, financial conditions have generally \n\n\nbecome more supportive of economic growth.

1 Answer

#1

The regex could be much simpler:

re.split(r"([A-Z \.]+\.)", text)

And then your code could just be:

data = list(filter(None, [s.strip() for s in re.split(r"([A-Z \.]+\.)", text)]))

Then you can do:

names = data[0::2]
speech = data[1::2]

See it in action here
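
One caveat with the short pattern: it matches runs of capitals anywhere in the text (an abbreviation such as "U.S." would count as a name), and `filter(None, ...)` drops empty strings, which can shift names and speech out of step. Below is a minimal sketch of a stricter variant. It assumes every speaker label starts a line, consists of all-caps words (optionally with a title such as "MR."), and ends in a period, possibly followed by a footnote digit. `SPEAKER` and `split_turns` are illustrative names, not part of the answer above.

```python
import re

# Speaker label at the start of a line: "CHAIRMAN BERNANKE.", "MR. KOHN.",
# "PARTICIPANTS.", optionally followed by a footnote number.
SPEAKER = re.compile(r"^([A-Z]{2,}\.?(?: [A-Z]{2,})*)\.\d*\s", re.MULTILINE)


def split_turns(text):
    """Return a list of (name, speech) tuples in document order."""
    # With one capture group, split() yields
    # [preamble, name1, speech1, name2, speech2, ...].
    parts = SPEAKER.split(text)
    names = parts[1::2]
    speeches = parts[2::2]
    # Collapse PDF line breaks inside each speech; empty speeches are kept
    # so names and speeches stay aligned.
    return [(n, " ".join(s.split())) for n, s in zip(names, speeches)]


sample = (
    "\n\nCHAIRMAN BERNANKE. Good afternoon, everybody. \n\n"
    "PARTICIPANTS. Good afternoon. \n\n"
    "MR. KOHN. So moved. \n\n"
    "CHAIRMAN BERNANKE. Thank you. Our meeting today and tomorrow follows the \n\n"
    "basic sequence we have been having recently. \n\n"
)
turns = split_turns(sample)
```

A wrapped line that happens to begin with an all-caps word followed by a period would still be misread as a speaker, so the output is worth spot-checking against the original minutes.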
