R: Grep a file name for variables

I am new to R so I am struggling with what I imagine is a fairly simple question. For this question I am not looking for someone to give me just a solution. I was hoping that someone could explain the answer to me, so that I might learn to do it myself, rather than just copy what it is you have done. That being said, here is my problem and questions.

I am making a histogram with R. A user will submit a file and data from that file will be used to make a histogram. That much is already set and done. Where I am having a problem is that I need to take only part of that file name and use it to help make a title for the histogram. The file name is a bit of a monster and follows this naming convention:

X_Y.doc.Z.x_y_z

The aspects of that file name that I need are the Y and Z. I know that many people use grep but I am not sure how to use it in this instance. I have already read the ??grep page and am familiar with the basics of grep but don't really know where to start.

Eventually I will also need to grep some information from an excel file, if someone cares to advise me in that matter as well. If it helps, this is how I am accepting the files:

library(tcltk)  # tk_choose.files() comes from the tcltk package
F.n <- tk_choose.files(default = "", caption = "Select a file", multi = TRUE, filters = NULL, index = 1)

Does anyone have any suggestions?

3 solutions

#1

The answer already given using stringr is excellent. That package provides you with some very helpful string munging tools.

If you want to use only base R, you could do this with gsub. Assuming your punctuation stays the same and there are no embedded periods or underscores in X, Y or Z, something like this should work:

f <- 'X_Y.doc.Z.x_y_z'
gsub('^.+_(.+)\\.doc\\.(.+)\\..+_.+$', '\\1 \\2', f)

which returns:

"Y Z"

You could put whatever you want in the replacement to make it easier to get at each piece, or you could do this in two lines, each returning one piece. And remember, R almost never changes data in place. You need to assign the output of a function to a variable, as below. Otherwise it will just print to the console and be "lost" (this is true most of the time).

y <- gsub('^.+_(.+)\\.doc\\..+\\..+_.+$', '\\1', f)
z <- gsub('^.+_.+\\.doc\\.(.+)\\..+_.+$', '\\1', f)

Let's break it down.

^ specifies the beginning of a line. It's good to be explicit. Similarly, $ identifies the end of a line.

. represents any character, and following it with a + means one or more of any character. If you used .* instead of .+ it would mean zero or more of any character, and that isn't what we want. If I want to write a normal . I need to escape it, since it's a special character. \ is the escape character both for regular expressions and for R strings, so you need two: to write a normal period you need to write \\.
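
The effect of escaping can be checked directly at the console. This is a minimal sketch; the test string "a.b" and the variable names are made up for illustration.

```r
x <- "a.b"  # hypothetical test string

# Unescaped: "." is a wildcard, so every character gets replaced
wild <- gsub(".", "-", x)                   # "---"

# Escaped: the R string "\\." is the regex \. (a literal period)
literal <- gsub("\\.", "-", x)              # "a-b"

# fixed = TRUE switches off regex interpretation altogether
fixed_res <- gsub(".", "-", x, fixed = TRUE)  # "a-b"

# The R string "\\." holds two characters: a backslash and a period
n_chars <- nchar("\\.")                     # 2
```

If the pattern contains no regex features at all, fixed = TRUE is often the easier way to avoid double backslashes.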

Hopefully that is clear. Finally, the parentheses represent a group I want to save. Groups can be referenced later using numbers indicating the order in which you saved them. In some languages these parentheses need to be escaped as well, but not in R.
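
To tie this back to the histogram title, here is a rough sketch under a few assumptions: the path below is hypothetical (tk_choose.files() returns full paths, so basename() is used to drop the directory part), and the wording of the title is only an example.

```r
# Hypothetical full path, similar to what tk_choose.files() might return
path <- "C:/data/X_Y.doc.Z.x_y_z"

# Keep only the file name, without the directory
f <- basename(path)

# Same pattern as above, with both pieces captured as groups 1 and 2
pattern <- "^.+_(.+)\\.doc\\.(.+)\\..+_.+$"

# sub() replaces the whole match with the chosen group
y <- sub(pattern, "\\1", f)
z <- sub(pattern, "\\2", f)

# Build a title string; it could then be passed as hist(..., main = main_title)
main_title <- paste("Histogram for", y, "and", z)
```

Because sub() is vectorized, the same lines should also work when several files are selected and path is a character vector.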

#2

Grep uses Regular Expressions to search for substrings matching a pattern. For your problem of matching certain elements from a filename, you would probably want to use capturing groups to extract the different parts.

An example of a regular expression with a capturing group would be:

"Hello, (\w+)"

This matches strings of the format "Hello, Friend". Here is an explanation of the pattern:

\w will match a "word character" (a letter, a digit, or an underscore), while

+ means that one or more of them will be matched.

For the other structural parts of your file name convention, we can include _ as it is, but we have to escape . because it has a special meaning in regular expressions.

To define a group that you want to match (a capturing group), you put the part to be matched in parentheses: (\w+)
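
As a small illustration of a capturing group in base R, the greeting pattern can be tried with sub(). This is only a sketch; the variable names are invented, and the backslash is doubled because the pattern is written as an R string.

```r
greeting <- "Hello, Friend"
pattern  <- "Hello, (\\w+)"   # the group captures the word after "Hello, "

# Replace the whole match with the contents of group 1
name <- sub(pattern, "\\1", greeting)          # "Friend"

# grepl() only reports whether the pattern matches at all
matched <- grepl(pattern, greeting)            # TRUE

# When nothing matches, sub() returns its input unchanged
unchanged <- sub(pattern, "\\1", "Goodbye, Friend")
```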

Using all that, we get the following pattern:

"(\w+)_(\w+)\.doc\.(\w+)\.(\w+)_(\w+)_(\w+)"

To get the pattern to work in R, we will have to escape all \ characters as \\:

> pattern = "(\\w+)_(\\w+)\\.doc\\.(\\w+)\\.(\\w+)_(\\w+)_(\\w+)"

While grep and regex are powerful, I personally prefer the stringr package for its simpler interface. The str_match function is particularly helpful: it returns a matrix whose first column holds the full match and whose subsequent columns hold the matches for the capturing groups:

> library(stringr)
> x = "X_Y.doc.Z.x_y_z"
> str_match(x, pattern)

     [,1]              [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "X_Y.doc.Z.x_y_z" "X"  "Y"  "Z"  "x"  "y"  "z"
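
If installing stringr is not an option, roughly the same result can be obtained in base R with regexec() and regmatches(). The sketch below repeats the pattern so that it runs on its own; the names m, y and z are just for illustration.

```r
x <- "X_Y.doc.Z.x_y_z"
pattern <- "(\\w+)_(\\w+)\\.doc\\.(\\w+)\\.(\\w+)_(\\w+)_(\\w+)"

# regexec() records where the match and each group sit;
# regmatches() turns those positions into text (one list element per input string)
m <- regmatches(x, regexec(pattern, x))[[1]]

# m[1] is the full match, m[2] the first group, and so on
y <- m[3]
z <- m[4]
```

Here the result is a character vector rather than a matrix, so the groups are indexed by position instead of by column.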

If you are new to regular expressions, a tutorial written for any language should serve you well. The syntax is mostly the same everywhere and differs only in details, although not every feature is supported by every programming language. If you want to try out your expressions before putting them into your programs, I highly recommend RegexPal.

#3

In this simple case of just needing a single letter that is in a well-defined place, substr would probably be simpler:

> a <- "X_Y.doc.Z.x_y_z"
> substr(a, 3, 3)
[1] "Y"
> substr(a, 9, 9)
[1] "Z"
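
Note that substr depends on Y and Z always sitting at the same character positions. If the pieces can be longer than one letter, one alternative is to split on the separators instead. This is a sketch that assumes the pieces themselves never contain a period or an underscore; the second file name is made up.

```r
a <- "X_Y.doc.Z.x_y_z"

# Split on every "." or "_" (inside [ ] the period is a literal character)
parts <- strsplit(a, "[._]")[[1]]

# Pieces come back in order: X, Y, doc, Z, x, y, z
y <- parts[2]
z <- parts[4]

# The same indices still work when the pieces are longer (hypothetical name)
parts_long <- strsplit("sample_control.doc.batch7.x_y_z", "[._]")[[1]]
y_long <- parts_long[2]
z_long <- parts_long[4]
```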
