Splitting strings with regular expressions in a flexible way

Date: 2020-12-11 21:38:22

Context: I need to split strings that are too long and that are used as column headers in an html table. Those strings are variable names, so they don't have any spaces in them.

If I let the css max-width property do the job, the string is split at a fixed place, not making use of the dots or _'s in the string.

For example, suppose I have this string:

this.is.a.long.string.indeed.yeah.well.you.know

Using the dots as separators, I can split it in many, many different ways. But I pose these guiding principles:

  1. All substrings must be 12 characters or fewer.
  2. Separators [._] should be at the end, not at the beginning, of a substring.
  3. The number of substrings must be minimal.
  4. If several solutions exist, the one with the most similar substring lengths is preferred.

I could do this programmatically with R, but I'm turning to regex wizards to see whether this is possible using solely regular expressions.
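
For comparison, the programmatic route mentioned above could look roughly like the following. This is only a sketch, written in Python rather than R, and it rests on assumptions that are not in the question: "most similar lengths" is interpreted as the smallest sum of squared piece lengths, runs longer than 12 characters with no separator are simply hard-split every 12 characters, and the names `tokens` and `split_name` are made up for illustration.

```python
import re

MAX_LEN = 12  # principle 1: no piece may be longer than this


def tokens(name, max_len=MAX_LEN):
    """Cut after each run of '.'/'_' so separators end a token.

    Tokens longer than max_len are hard-split (assumed fallback).
    """
    out = []
    for tok in re.findall(r"[^._]*[._]+|[^._]+", name):
        out.extend(tok[i:i + max_len] for i in range(0, len(tok), max_len))
    return out


def split_name(name, max_len=MAX_LEN):
    """Group consecutive tokens into the fewest pieces of <= max_len chars.

    Ties are broken by the smallest sum of squared lengths (assumed
    measure of 'most similar lengths').
    """
    toks = tokens(name, max_len)
    n = len(toks)
    # best[i] = (piece count, sum of squared lengths, start of last piece)
    # for the first i tokens; tuples compare lexicographically.
    best = [(0, 0, 0)] + [None] * n
    for end in range(1, n + 1):
        length = 0
        for start in range(end - 1, -1, -1):
            length += len(toks[start])
            if length > max_len:
                break
            count, squares, _ = best[start]
            candidate = (count + 1, squares + length * length, start)
            if best[end] is None or candidate < best[end]:
                best[end] = candidate
    # Walk the stored start indices backwards to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][2]
        pieces.append("".join(toks[start:end]))
        end = start
    return pieces[::-1]


print(" | ".join(split_name("this.is.a.long.string.indeed.yeah.well.you.know")))
```

For the example string this gives this.is.a. | long.string. | indeed. | yeah.well. | you.know, that is, five pieces with lengths 10, 12, 7, 10 and 8.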

What I have so far:

Regex: .{1,12}(_|\b|\Z)

Results: this.is.a. | long.string. | indeed.yeah. | well.you. | know

It works well, except when there is a long sequence of letters without any separators. Please see this example on regex101.com.
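
To make the failure concrete, here is a small sketch that runs the regex above with Python's `re` module. Python is an assumption here (the regex101 example may use a different flavor), and the helper name `chunks` and the 26-letter test string are invented for illustration.

```python
import re

# The regex from the question, unchanged.
QUESTION_RE = re.compile(r".{1,12}(_|\b|\Z)")


def chunks(pattern, text):
    # finditer + group(0) yields whole matches; findall would return
    # only the contents of the capture group.
    return [m.group(0) for m in pattern.finditer(text)]


# Works as intended when separators are frequent enough.
print(chunks(QUESTION_RE, "this.is.a.long.string.indeed.yeah.well.you.know"))
# ['this.is.a.', 'long.string.', 'indeed.yeah.', 'well.you.', 'know']

# With no separators, only the last 12 letters match; the first 14
# characters are skipped because no match can end on a boundary there.
print(chunks(QUESTION_RE, "abcdefghijklmnopqrstuvwxyz"))
# ['opqrstuvwxyz']
```

The second call shows the problem: part of the string is silently dropped rather than split.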

Ideally, separators would be used whenever possible, and a fallback split would occur when there is a sequence longer than 12 characters without a separator.

1 answer

#1

You were so close. You just need to add another alternative for cases where no separator is found:

.{1,12}(_|\b|\Z)|.{1,12}

Check it out: https://regex101.com/r/XrJuYj/2/
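
As a quick check outside regex101, the pattern with the extra alternative can be tried in Python's `re` module. This is a sketch under the assumption that Python's flavor behaves like the one used in the link for this pattern; the helper name `chunks` and the 26-letter test string are hypothetical.

```python
import re

# First alternative: up to 12 characters ending at a separator/boundary.
# Second alternative: plain fallback of up to 12 characters.
FALLBACK_RE = re.compile(r".{1,12}(_|\b|\Z)|.{1,12}")


def chunks(pattern, text):
    # group(0) is the whole match, regardless of the capture group.
    return [m.group(0) for m in pattern.finditer(text)]


print(chunks(FALLBACK_RE, "this.is.a.long.string.indeed.yeah.well.you.know"))
# ['this.is.a.', 'long.string.', 'indeed.yeah.', 'well.you.', 'know']

# A long run without separators is now cut every 12 characters
# instead of being partly skipped.
print(chunks(FALLBACK_RE, "abcdefghijklmnopqrstuvwxyz"))
# ['abcdefghijkl', 'mnopqrstuvwx', 'yz']
```

Because the fallback is tried only when the first alternative fails at a given position, strings with enough separators are split exactly as before.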

Edit: to ensure the split portion contains a non-separating character, you can use the following:

(?=.{1,12}(.*))(?=.*?[^\W_].*?[\W_].*?\1).{1,12}(?<=_|\b|\Z)|.{1,12}

See it at: https://regex101.com/r/XrJuYj/3
