R sub: backreference not substituted correctly

Time: 2022-10-14 21:46:01

I am attempting to extract a string from some file names to use as a variable later.

The file names look like this:

c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls", 
"./Vote/Академический vote 3.xls", "./Vote/Алексеевский в городе Москве vote 1.xls", 
"./Vote/Алексеевский в городе Москве vote 2.xls", "./Vote/Алтуфьевский vote 1.xls", 
"./Vote/Алтуфьевский vote 2.xls", "./Vote/Алтуфьевский vote 3.xls", 
"./Vote/Арбат vote 1.xls", "./Vote/Арбат vote 2.xls", "./Vote/Аэропорт vote 1.xls", 
"./Vote/Аэропорт vote 2.xls", "./Vote/Аэропорт vote 3.xls", "./Vote/Бабушкинский vote 1.xls", 
"./Vote/Бабушкинский vote 2.xls", "./Vote/Басманный vote 1.xls", 
"./Vote/Басманный vote 2.xls", "./Vote/Басманный vote 3.xls", 
"./Vote/Беговой vote 1.xls", "./Vote/Беговой vote 2.xls", "./Vote/Бескудниковский vote 1.xls", 
"./Vote/Бескудниковский vote 2.xls", "./Vote/Бибирево vote 1.xls", 
"./Vote/Бибирево vote 2.xls", "./Vote/Бибирево vote 3.xls")
> dput(sample(vote_files, size = 25))
c("./Vote/Лианозово vote 2.xls", "./Vote/Зюзино vote 1.xls", 
"./Vote/Восточное Дегунино vote 2.xls", "./Vote/Аэропорт vote 2.xls", 
"./Vote/Академический vote 1.xls", "./Vote/Замоскворечье в городе Москве vote 1.xls", 
"./Vote/Обручевский vote 2.xls", "./Vote/Даниловский vote 3.xls", 
"./Vote/Нагатино-Садовники vote 1.xls", "./Vote/Ново-Переделкино в городе Москве vote 1.xls", 
"./Vote/Кунцево vote 2.xls", "./Vote/Текстильщики в городе Москве vote 2.xls", 
"./Vote/Южное Медведково vote 1.xls", "./Vote/Западное Дегунино vote 2.xls", 
"./Vote/Хамовники vote 1.xls", "./Vote/Крюково vote 1.xls", "./Vote/Беговой vote 1.xls", 
"./Vote/Восточный vote 1.xls", "./Vote/Богородское vote 2.xls", 
"./Vote/Некрасовка vote 2.xls", "./Vote/Косино-Ухтомский vote 1.xls", 
"./Vote/Лосиноостровский vote 3.xls", "./Vote/Хорошевский vote 2.xls", 
"./Vote/Бирюлево Западное vote 2.xls", "./Vote/Гольяново vote 3.xls"
)

I am attempting to extract the Russian text between the ./Vote/ prefix and the " vote #.xls" suffix using sub, as follows:

sub(x= string, pattern = ".*((?<=.//Vote//).*(?=vote)).*", replacement = "\\1", perl = T)

I have to use lookarounds because the string I want to extract is sometimes more than one word. However, although the capturing group appears to capture the right text when I check it in an online regex tester, the sub call just returns exactly the same string I put in.

What's the issue here? Alternatively, is there a simpler way to do this?

2 solutions

#1


3  

As mentioned in the comments under the question, your regular expression would work if the double slashes were single slashes. With "//" the pattern never matches these paths, and sub returns its input unchanged when there is no match. Although not mentioned there, 'vote' should also be replaced with ' vote' (i.e. with a space before it); otherwise the captured text keeps a trailing space.
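To make the point about the slashes concrete, here is a minimal sketch that runs both the original and the corrected pattern. The two sample strings are taken from the question's file names; the variable names `unchanged` and `fixed` are only illustrative.

```r
# Sample input, taken from the file names in the question.
string <- c("./Vote/Арбат vote 1.xls",
            "./Vote/Алексеевский в городе Москве vote 2.xls")

# Original pattern: "//" never matches the single "/" in the paths,
# so sub() finds no match and returns the input unchanged.
unchanged <- sub(".*((?<=.//Vote//).*(?=vote)).*", "\\1", string, perl = TRUE)
identical(unchanged, string)  # TRUE

# Corrected pattern: single slashes, and a space before "vote" so the
# captured name does not keep a trailing space.
fixed <- sub(".*((?<=./Vote/).*(?= vote)).*", "\\1", string, perl = TRUE)
fixed  # "Арбат" "Алексеевский в городе Москве"
```
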

As for a simpler way to do it: basename returns the file name part of each path, after which we can replace the space followed by vote, and everything after it, with the empty string:

sub(" vote.*", "", basename(x))

giving:

 [1] "Лианозово"                        "Зюзино"                          
 [3] "Восточное Дегунино"               "Аэропорт"                        
 [5] "Академический"                    "Замоскворечье в городе Москве"   
 [7] "Обручевский"                      "Даниловский"                     
 [9] "Нагатино-Садовники"               "Ново-Переделкино в городе Москве"
[11] "Кунцево"                          "Текстильщики в городе Москве"    
[13] "Южное Медведково"                 "Западное Дегунино"               
[15] "Хамовники"                        "Крюково"                         
[17] "Беговой"                          "Восточный"                       
[19] "Богородское"                      "Некрасовка"                      
[21] "Косино-Ухтомский"                 "Лосиноостровский"                
[23] "Хорошевский"                      "Бирюлево Западное"               
[25] "Гольяново"                       
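The two steps can also be run separately to see what each one does. This is only a sketch on two sample paths from the question; the names `files`, `district` and `vote_no` are illustrative, and pulling out the vote number is an optional extra that assumes every file name ends in " vote <number>.xls".

```r
# Sample input, taken from the file names in the question.
x <- c("./Vote/Зюзино vote 1.xls",
       "./Vote/Восточное Дегунино vote 2.xls")

# Step 1: basename() drops the directory part of each path.
files <- basename(x)
files  # "Зюзино vote 1.xls" "Восточное Дегунино vote 2.xls"

# Step 2: remove " vote" and everything after it.
district <- sub(" vote.*", "", files)

# Optional: extract the vote number the same way, in case it is needed later.
vote_no <- as.integer(sub(".* vote (\\d+)\\.xls$", "\\1", files))

data.frame(district, vote_no)
```
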

Update: Handle phrases with embedded spaces.

#2


1  

Just remove the things which are consistent rather than capturing the text in between.

# Strip the fixed directory prefix, then the trailing " vote <n>.xls".
vote_files2 <- sub("./Vote/", "", vote_files, fixed = TRUE)
vote_files2 <- sub(" vote \\d+\\.xls$", "", vote_files2)
vote_files2
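If a single call is preferred, the same idea of deleting the constant parts can be sketched with gsub and an alternation. This is a variation on the answer rather than part of it; the sample paths come from the question, and `districts` is an illustrative name.

```r
# Sample input, taken from the file names in the question.
vote_files <- c("./Vote/Кунцево vote 2.xls",
                "./Vote/Нагатино-Садовники vote 1.xls")

# One gsub() call: delete the leading "./Vote/" and the trailing
# " vote <n>.xls"; the dots are escaped so they match literally.
districts <- gsub("^\\./Vote/| vote \\d+\\.xls$", "", vote_files)
districts  # "Кунцево" "Нагатино-Садовники"
```
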
