I am attempting to extract a string from some file names to use as a variable later.
The file names look like this:
c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls",
"./Vote/Академический vote 3.xls", "./Vote/Алексеевский в городе Москве vote 1.xls",
"./Vote/Алексеевский в городе Москве vote 2.xls", "./Vote/Алтуфьевский vote 1.xls",
"./Vote/Алтуфьевский vote 2.xls", "./Vote/Алтуфьевский vote 3.xls",
"./Vote/Арбат vote 1.xls", "./Vote/Арбат vote 2.xls", "./Vote/Аэропорт vote 1.xls",
"./Vote/Аэропорт vote 2.xls", "./Vote/Аэропорт vote 3.xls", "./Vote/Бабушкинский vote 1.xls",
"./Vote/Бабушкинский vote 2.xls", "./Vote/Басманный vote 1.xls",
"./Vote/Басманный vote 2.xls", "./Vote/Басманный vote 3.xls",
"./Vote/Беговой vote 1.xls", "./Vote/Беговой vote 2.xls", "./Vote/Бескудниковский vote 1.xls",
"./Vote/Бескудниковский vote 2.xls", "./Vote/Бибирево vote 1.xls",
"./Vote/Бибирево vote 2.xls", "./Vote/Бибирево vote 3.xls")
> dput(sample(vote_files, size = 25))
c("./Vote/Лианозово vote 2.xls", "./Vote/Зюзино vote 1.xls",
"./Vote/Восточное Дегунино vote 2.xls", "./Vote/Аэропорт vote 2.xls",
"./Vote/Академический vote 1.xls", "./Vote/Замоскворечье в городе Москве vote 1.xls",
"./Vote/Обручевский vote 2.xls", "./Vote/Даниловский vote 3.xls",
"./Vote/Нагатино-Садовники vote 1.xls", "./Vote/Ново-Переделкино в городе Москве vote 1.xls",
"./Vote/Кунцево vote 2.xls", "./Vote/Текстильщики в городе Москве vote 2.xls",
"./Vote/Южное Медведково vote 1.xls", "./Vote/Западное Дегунино vote 2.xls",
"./Vote/Хамовники vote 1.xls", "./Vote/Крюково vote 1.xls", "./Vote/Беговой vote 1.xls",
"./Vote/Восточный vote 1.xls", "./Vote/Богородское vote 2.xls",
"./Vote/Некрасовка vote 2.xls", "./Vote/Косино-Ухтомский vote 1.xls",
"./Vote/Лосиноостровский vote 3.xls", "./Vote/Хорошевский vote 2.xls",
"./Vote/Бирюлево Западное vote 2.xls", "./Vote/Гольяново vote 3.xls"
)
I am attempting to extract the Russian text between /Vote/ and vote #.xls using sub, as follows:
sub(x= string, pattern = ".*((?<=.//Vote//).*(?=vote)).*", replacement = "\\1", perl = T)
I have to use lookarounds because the string I want to extract is sometimes more than one word. However, although the capturing group appears to capture the right text when I check it in an online regex tester, the sub call just returns the exact same string I put in.
What's the issue here? Alternatively, is there a simpler way to do this?
2 Solutions
#1
3
As mentioned in the comments under the question, your regular expression would work if the double slashes were single slashes and, although this was not mentioned there, if 'vote' were replaced with ' vote', i.e. with a space before it.
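To make that concrete, here is a minimal sketch comparing the original call with a corrected one. The two-element vector x is just a sample taken from the file names in the question. sub returns its input unchanged whenever the pattern fails to match, which is consistent with the behaviour described in the question.

```r
# Two of the file names from the question, used as sample input
x <- c("./Vote/Арбат vote 1.xls",
       "./Vote/Алексеевский в городе Москве vote 2.xls")

# Original pattern: "//" demands two literal slashes, so nothing matches
# and sub() hands back the input untouched
original <- sub(".*((?<=.//Vote//).*(?=vote)).*", "\\1", x, perl = TRUE)

# Corrected pattern: single slashes, escaped dot, and a space before "vote"
corrected <- sub(".*((?<=\\./Vote/).*(?= vote)).*", "\\1", x, perl = TRUE)

# Same result without lookarounds: a plain capture group is enough
plain <- sub(".*/Vote/(.*) vote.*", "\\1", x)
```

Here original is identical to x, while corrected and plain both contain only the district names, including the multi-word one.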
Regarding a simpler way to do it, basename will get the filename part, after which we can replace the space followed by vote, and everything after it, with the empty string:
sub(" vote.*", "", basename(x))
giving:
[1] "Лианозово" "Зюзино"
[3] "Восточное Дегунино" "Аэропорт"
[5] "Академический" "Замоскворечье в городе Москве"
[7] "Обручевский" "Даниловский"
[9] "Нагатино-Садовники" "Ново-Переделкино в городе Москве"
[11] "Кунцево" "Текстильщики в городе Москве"
[13] "Южное Медведково" "Западное Дегунино"
[15] "Хамовники" "Крюково"
[17] "Беговой" "Восточный"
[19] "Богородское" "Некрасовка"
[21] "Косино-Ухтомский" "Лосиноостровский"
[23] "Хорошевский" "Бирюлево Западное"
[25] "Гольяново"
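As a hedged sketch of how the pieces fit together, the snippet below splits the one-liner into its two steps. As an assumption beyond the original answer, it also pulls out the vote number in case that is wanted as a second variable. The names files, district and vote_no are illustrative only.

```r
# Two of the file names from the question, used as sample input
x <- c("./Vote/Арбат vote 1.xls",
       "./Vote/Восточное Дегунино vote 2.xls")

# Step 1: basename() drops the directory part, leaving e.g. "Арбат vote 1.xls"
files <- basename(x)

# Step 2: remove " vote" and everything after it
district <- sub(" vote.*", "", files)

# Assumption (not in the original answer): capture the number before ".xls"
vote_no <- as.integer(sub(".* vote (\\d+)\\.xls$", "\\1", files))
```

Because the replacement starts at the first " vote" in the file name, this relies on no district name containing that text itself, which holds for the names shown in the question.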
Update: Handle phrases with embedded spaces.
#2
1
Just remove the things which are consistent rather than capturing the text in between.
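# Strip the leading directory; the dot is escaped so it matches a literal "."
vote_files2 <- sub("^\\./Vote/", "", vote_files)
# Strip the trailing " vote <number>.xls" suffix
vote_files2 <- sub(" vote \\d*\\.xls$", "", vote_files2)
vote_files2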