I'm trying to write a parser to divide up a .txt file full of data according to authors, titles, and reviews. I've hit a block and don't know what to do next. I'm having problems with just one line of code, the regex, in the cell that says # Now separate reviews from titles
The code misses apostrophes ('), and when I try to use a caret ^ to block the last stretch, I get an empty set. I include a bit of the source text below so you can see the mess I'm trying to parse. It's tricky! A title will flow directly into the name of a journal, like Choice, so I'm trying to separate them by cutting off the words that immediately precede a \s-\s pattern.
Here's the code:
import re

# `file` is the path to the .txt file
with open(file) as f:
    content = f.readlines()
content = [x.strip() for x in content]
content = " ".join(content)

# Get all authors
pattern = r"[A-Z\-]{2,}[\,]+\s[A-Za-z\s\,\(\)\.]+\s[\-\*\•\.\■ ]{1}"
authors = re.findall(pattern, content)

# Now replace all found authors with XXX_XXX
# (sub returns the text unchanged when nothing matches, so no search guard is needed)
content2 = re.sub(pattern, 'XXX_XXX', content)

# Now get all the content for each author
content3 = content2.split('XXX_XXX')
bib = content3[1:]

# Now separate reviews from titles
pattern2 = r"[A-Z][a-z][\w\'\-\:\;\s\(\)]+\w+\s\-\s"  # <-- the line that doesn't work
bib2 = "".join(bib)
titles = re.findall(pattern2, bib2)
It's this line, pattern2, that I can't get to work. Sample of the source text below:
MA, Huan • The Overall Survey Of The Ocean’s Shores 1433
Choice - v8 - 0 ’71 - pl074 MA, Huan • Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578 GJ - vl37 - Je ’71 - p213 JAS - v31 - N ’71 - pl81 TLS - Je 16 ’72 - p681 MA, Laurence J C - Commercial Development And Urban Change In Sung China 960-1279
JAS - v31 - Ag ’72 - p928 Pac A - v45 - Summer ’72 - p285 MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39 MA, Laurence J C - Urban Development In Modern China
Choice - vl9 - Ja ’82 - p696 JAS - v42 - N 82 - pl39 MA, Nancy Chih - Cook Chinese AB - v45 - My 25 ’70 - pl786 PW - vl97 - Mr 23 ’70 - p38 MA, Nancy Chih • Don’t Lick The Chopsticks CSM - v66 - Ja 10 ’74 - pF2 LJ - v99 - Mr 15 ’74 - p757 MA, Nancy Chih - Mrs. Ma’s Japanese Cooking
VQR - v58 - Spring ’82 - p68 MA, Tsu Sheng - Microscale Manipulations In Chemistry
Choice-vl3-N ’76 -pi 164 MA, Tsu Sheng - Organic Functional Group Analysis By Gas Chromatography Choice - vl3 - F ’77 - pl624 r MA, Wei-Yi - A Bibliography Of Chinese-Language Materials On The People's Communes ARBA - vl5 - '84 - p320
Pac A - v56 - Winter ’83 - p796 MA, Wook - Seoul Ro Kanun Kil BL - v78 - 0 15 '81 - p294 y MA, Y W - Traditional Chinese Stories ANQ - vl8 - 0 ’79 - p30 BF - v4 - Ap 40 '79 - p575 Choice -vl5-Ja ’79 -pl528 HR-v32-Spring'79-pl23 JAS - v38 - Ag '79 - p773 Kliatt - vl3 • Winter '79 - p26 WIT - v53 - Summer '79 - p555 MA, Yun • Shih Ching T'ao Hsing BL - v68 - Ap 1 '72 - p651 MA BRICALL, Josep - Politica Economica De La Generalitat 1936-1939. Vol. 1 WP - v25 - O '72 - pl55 MA COY, Ramelle • Short-Time Compensation
Choice - v21 - Jl '84 - pl648 Econ Bks - vll - S ’84 - p62 c MA De - The Cowherd And The Weaving Maid
Cur R - v20 - S '81 -p325 c MA De - Crickets
Cur R - v20 - S '81 - p325 c MA De - School-Master Dongguo Cur R - v20 - S '81 - p325 c MA De - Thrice Borrowing The Plantain Fan CurR- v20-S ’81 -p325 c MA De - The Wonderful Gourds Cur R - v20 - S '81 - p325 MAACK, Berthold - Preussen JMH - v55 - Mr '83 - p71 r MAACK, Mary N - Libraries In Senegal ARBA - vl3 - '82 - pi53 CRL - v45 - Mr '84-pl52 JAL - v7 - S '81 - p244 JLH - vl9 - Spring ’84 - p315 LJ - vl07 - My 1 ’82 - p865 LQ - v52 - Ap '82-pl75 MAACK, Reinhard • Kontinentaldrift Und Geologie Des Sudatlantischen Ozeans GJ - vl36 - Mr '70 - pl38 MAAG, Russell C - Observe And Understand The Sun
S&T - v54 - S ’77 - p221 MAAG, Victor - Hiob
Rel St Rev - vlO - Ap '84 - pi 75 MAAILMA Katettu Poyta
WIT - vS8 • Winter '84 - pi 36 MAALOE, Ole - Control Of Macromolecular Synthesis
Choice - v3 - 0 '66 - p676 Sci - vl54 - D 2 '66 - pll59 MAALOUF, Amin • The Crusades Through Arab Eyes
TLS -N 16 ’84 -pi300 c MAAR, Len - Out-Of-Sight Games CBRS - v9 - F ’81 - p57 SLJ-v27 - Mr ’81 -pl48 p MAARAV
Choice - vl6 - D '79 - pl280 MAAREK, Gerard • Introduction Au Capital De Karl Marx
JEL - vl7 - Mr ’79 - p92 MAAS, Audrey Gellen • Wait Till The Sun Shines, Nellie
1 solution
You can just match the record and its parts directly, instead of doing
splits, subs, and other steps.
In the end, you're just doing the same thing.
I know it's hard. It's like a dog with a bone so big it doesn't know how to
approach it.
I've tried to break up the parts a little better.
Group 1 contains the author.
Group 2 contains the title.
Group 3 contains misc stuff (the remaining stuff up to the next author).
All I've done is to make every record of the sample match correctly.
This usually covers all the bases if the sample is diverse enough.
If you find special cases, you'll have to shoehorn them into the regex.
A thing to note: the Misc part uses a negative assertion to test each
character up until the next Author.
If you make a change to the Author part, you have to copy that change to
this assertion.
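One way to avoid copying the Author part by hand is to build the pattern from a shared fragment. The following is only a sketch in Python (the language used in the question); the names AUTHOR, AUTHOR_END, TITLE, and RECORD are made up for illustration, and the fragments are taken from the full regex in this answer.

```python
import re

# Hypothetical names; the fragments themselves come from the answer's regex.
# The author sub-pattern is defined once and reused in two places.
AUTHOR = r"[A-Z-]{2,},*\s(?:[A-Za-z\s,().]|[A-Za-z]-[A-Za-z])+"
AUTHOR_END = r"\s[-*•.■ ]"

# Optional title (group 2), followed by one of the title enders.
TITLE = r"(?:([A-Z](?:(?!\s+-|\s+Choice\s*-).)*?\w)(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$))?"

RECORD = re.compile(
    "(?s)"
    + "(" + AUTHOR + ")" + AUTHOR_END + r"\s*"       # group 1: author
    + TITLE                                          # group 2: title
    + "((?:(?!" + AUTHOR + AUTHOR_END + ").)*)"      # group 3: misc, up to next author
)

m = RECORD.search("MAAG, Victor - Hiob")
print(m.group(1), "|", m.group(2))
```

With this layout, a change to AUTHOR (for example, allowing an extra character in names) reaches both the author group and the negative assertion in the Misc part.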
Good Luck!!
(?s)([A-Z-]{2,},*\s(?:[A-Za-z\s,().]|[A-Za-z]-[A-Za-z])+)\s[-*•.■ ]\s*(?:([A-Z](?:(?!\s+-|\s+Choice\s*-).)*?\w)(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$))?((?:(?![A-Z-]{2,},*\s(?:[A-Za-z\s,().]|[A-Za-z]-[A-Za-z])+\s[-*•.■ ]).)*)
https://regex101.com/r/dg7hAB/1
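For reference, here is a minimal sketch of how the one-line pattern above could be used from Python with re.finditer. The names RECORD, sample, and records are illustrative only, and the sample is a few records copied from the question's text.

```python
import re

# The one-line regex from this answer, split across string literals for width.
RECORD = re.compile(
    r"(?s)([A-Z-]{2,},*\s(?:[A-Za-z\s,().]|[A-Za-z]-[A-Za-z])+)\s[-*•.■ ]\s*"
    r"(?:([A-Z](?:(?!\s+-|\s+Choice\s*-).)*?\w)(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$))?"
    r"((?:(?![A-Z-]{2,},*\s(?:[A-Za-z\s,().]|[A-Za-z]-[A-Za-z])+\s[-*•.■ ]).)*)"
)

# A few records from the question, already joined into one line.
sample = (
    "MA, Huan • The Overall Survey Of The Ocean’s Shores 1433 "
    "Choice - v8 - 0 ’71 - pl074 "
    "MA, Huan • Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578 "
    "MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39"
)

# One (author, title, misc) tuple per record; title is None when absent.
records = [
    (m.group(1), m.group(2), m.group(3).strip())
    for m in RECORD.finditer(sample)
]

for author, title, misc in records:
    print(author, "|", title, "|", misc)
```

Note that only Choice is treated specially as a title ender. For other journals the abbreviation stays attached to the end of group 2 (for example "Ying-Yai Sheng-Lan AHR"), so those would be among the special cases to handle separately.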
Formatted / expanded, with comments:
 (?s)                          # Dot-all modifier
 (                             # (1 start), Required Author
      [A-Z-]{2,}
      ,*                            # Optional commas
      \s
      (?:                           # Author characters, add more if needed
           [A-Za-z\s,().]
        |  [A-Za-z] - [A-Za-z]           # allow hyphen between letters only
      )+
 )                             # (1 end)
 \s
 [-*•.■ ]                      # Author ender
 \s*
 (?:
      (                             # (2 start), Optional Title
           [A-Z]
           (?:                           # Anything except title enders
                (?!
                     \s+ -
                  |  \s+ Choice \s* -
                )
                .
           )*?
           \w                            # Should end with a word char
      )                             # (2 end)
      (?:                           # Title enders
           \s+ - \s*
        |  \s+
           (?= Choice \s* - )            # Choice to be picked up by Misc Info
        |  \s* $
      )
 )?
 (                             # (3 start), Optional Misc Info
      (?:                           # Anything except start of new author
           (?!
                [A-Z-]{2,}
                ,*
                \s
                (?:
                     [A-Za-z\s,().]
                  |  [A-Za-z] - [A-Za-z]
                )+
                \s
                [-*•.■ ]
           )
           .
      )*
 )                             # (3 end)
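The expanded layout can be pasted into Python as-is by compiling it with re.VERBOSE, which ignores whitespace and # comments outside character classes. The sketch below does this for the Author part only, to keep it short; AUTHOR_VERBOSE and AUTHOR_COMPACT are illustrative names, not part of the original answer.

```python
import re

# Author part of the expanded regex, compiled in verbose (free-spacing) mode.
AUTHOR_VERBOSE = re.compile(
    r"""
    (                              # (1 start), author
        [A-Z-]{2,}
        ,*                         # optional commas
        \s
        (?:
            [A-Za-z\s,().]
          | [A-Za-z] - [A-Za-z]    # hyphen between letters only
        )+
    )                              # (1 end)
    \s
    [-*•.■ ]                       # author ender; the space inside the class is kept
    """,
    re.VERBOSE | re.DOTALL,
)

# The same fragment written on one line, for comparison.
AUTHOR_COMPACT = re.compile(
    r"([A-Z-]{2,},*\s(?:[A-Za-z\s,().]|[A-Za-z]-[A-Za-z])+)\s[-*•.■ ]", re.DOTALL
)

text = "MA, Wei-Yi - A Bibliography Of Chinese-Language Materials"
print(AUTHOR_VERBOSE.match(text).group(1))
print(AUTHOR_COMPACT.match(text).group(1))
```

Both forms should give the same author for the same input; the verbose form is mainly easier to edit when new special cases turn up.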