Splitting text on Roman numerals using regular expressions in Python

I need to split a text on Roman numerals.
Here is my text:

This is the part (a) of question number one. i. This is sub part one of part (a) question one ii. This is sub part two of part (a) question one iii. This is sub part three of part (a) question one

Actually, this is one part of a question in a question paper. However, I want it to be broken down as follows:

This is the part (a) of question number one.
This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one

So what I want here is to split the sentence on the Roman numerals.
Here is the regular expression I have written:

import re

text = 'This is the part (a) of question number one. i. This is sub part one of part (a) question one ii. This is sub part two of part (a) question one iii. This is sub part three of part (a) question one'
for m in re.split(r' [a-z]+\. ', text):
    print(m)

This is what I get

This is the part (a) of question number one.
i. This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one

My expression worked on Roman numerals two and three but not on Roman numeral one.
So I need a general expression that works for any Roman numeral.
The important thing to note is that there is a space before each Roman numeral, and a full stop followed by a space after it.
Can someone help me solve this?

3 answers

#1

Your regular expression also matches the substring " one. ". Try changing it like this:

import re

text = 'This is the part (a) of question number one. i. This is sub part one of part (a) question one ii. This is sub part two of part (a) question one iii. This is sub part three of part (a) question one'

# Only letters that can appear in a Roman numeral are allowed before the full stop.
for m in re.split(r' [MDCLXVI]+\. ', text, flags=re.IGNORECASE):
    print(m)
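
To make the difference concrete, here is a minimal sketch that lists what each pattern actually matches in the question's text. The variable names `broad`, `narrow` and `parts` are only illustrative. Note that the character class is still loose: it would also accept any word made up solely of those letters (for example "mix.") if one appeared before a full stop.

```python
import re

text = ('This is the part (a) of question number one. '
        'i. This is sub part one of part (a) question one '
        'ii. This is sub part two of part (a) question one '
        'iii. This is sub part three of part (a) question one')

# The original pattern accepts any lowercase word followed by ". ",
# so it consumes " one. " and the space that should precede "i.".
broad = re.findall(r' [a-z]+\. ', text)

# Restricting the class to Roman-numeral letters leaves " one. " alone.
narrow = re.findall(r' [mdclxvi]+\. ', text)

parts = re.split(r' [mdclxvi]+\. ', text)

print(broad)
print(narrow)
print('\n'.join(parts))
```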

#2

That's not what I get. Check your first line again. I get

This is the part (a) of question number

and that is because your regex matches " one. ".

re.split(r'i+\. ',text)

works for me.
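
A caveat worth checking before relying on this pattern: `i+` only covers numerals written with the letter i alone (i, ii, iii). The sketch below uses a made-up `sample` string, not the question's text, to show that a numeral such as "iv." is not treated as a split point.

```python
import re

# Hypothetical sample with a numeral beyond iii, for illustration only.
sample = 'Intro text. i. first item iv. fourth item'

# 'i+' matches only runs of the letter i, so "iv." is left in place.
pieces = re.split(r'i+\. ', sample)
print(pieces)
```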

#3

If you want proper romanette numbers (Roman numerals in lower case are often referred to as 'romanette'), they are easily generated. Mark Pilgrim has a variety of Roman numeral utilities in the book Dive Into Python, some of which can be seen here.

The one that generates Roman numerals:

class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass

def toRoman(n):
    """convert integer to Roman numeral"""
    if not (0 < n < 5000):
        raise OutOfRangeError("number out of range (must be 1..4999)")
    if int(n) != n:
        raise NotIntegerError("decimals can not be converted")
    romanNumeralMap = (('M',  1000), ('CM', 900), ('D',  500), ('CD', 400), ('C',  100), ('XC', 90),
       ('L',  50), ('XL', 40), ('X',  10), ('IX', 9), ('V',  5), ('IV', 4), ('I',  1))
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

Test that:

>>> [toRoman(x) for x in range(1,21)]
['I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX']

That can be used to generate a pattern for all the roman numerals up to 20 and put that into a regex:

>>> pat = ' (?:' + '|'.join([toRoman(i).lower() for i in range(1, 21)]) + r')\. '
>>> pat
' (?:i|ii|iii|iv|v|vi|vii|viii|ix|x|xi|xii|xiii|xiv|xv|xvi|xvii|xviii|xix|xx)\\. '

Then you can split your text:

>>> print('\n'.join(re.split(pat, text)))
This is the part (a) of question number one.
This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one

Or, you can use his regex in re.split:

>>> pat=re.compile('''\
... [ ]                 # one space
... m{0,4}              # thousands - 0 to 4 M's
... (?:cm|cd|d?c{0,3})  # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
...                     #            or 500-800 (D, followed by 0 to 3 C's)
... (?:xc|xl|l?x{0,3})  # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
...                     #        or 50-80 (L, followed by 0 to 3 X's)
... (?:ix|iv|v?i{0,3})  # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
...                     #        or 5-8 (V, followed by 0 to 3 I's)
... [.][ ]                # full stop then a space''', re.X)
>>> print('\n'.join(pat.split(text)))
This is the part (a) of question number one.
This is sub part one of part (a) question one
This is sub part two of part (a) question one
This is sub part three of part (a) question one
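
One thing to be aware of with this pattern: every group in it is optional, so it can also match an empty numeral, that is, a lone " . ". The sketch below is one possible guard, not part of Pilgrim's code: it adds a lookahead that requires at least one numeral letter. The names `body`, `loose`, `strict` and the `sample` string are made up for illustration.

```python
import re

# Same numeral structure as the verbose pattern, written on one line.
body = r'm{0,4}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})'

# Without a guard, the numeral part may be empty, so " . " also matches.
loose = re.compile(r' ' + body + r'\. ')

# Assumption: a lookahead demanding one numeral letter rules that out.
strict = re.compile(r' (?=[mdclxvi])' + body + r'\. ')

# Hypothetical sample containing a stray " . " and two real markers.
sample = 'Intro . not a marker iv. fourth item xii. twelfth item'

loose_parts = loose.split(sample)
strict_parts = strict.split(sample)
print(loose_parts)
print(strict_parts)
```

With the question's text both variants give the same four pieces, so the guard only matters if the input can contain an isolated full stop.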
