Regular expression does not work for domains with hyphens

Time: 2022-06-11 21:44:42

I have a script that checks a server's name and extracts the domain name from it. For example, given the server name example.ru01, I need to get example.ru. My script:

#!/bin/bash

hostname=example.com01
echo "$hostname"
# Match "name.tld" or "name.sub.tld"; expr prints the first \( \) group.
reg0="\(\(\w*\.[a-z]*\)\|\(\w*\.[a-z]*\.[a-z]*\)\)"
domain=$(expr match "$hostname" "$reg0")
echo "$domain"

This works. The output is:

example.com01
example.com

However, my infrastructure has some domains with hyphens, for example test-test.com01, and the script does not work for them. How can I solve this? I changed my regular expression like this:

\(\(\w*\.[a-z_-]*\)\|\(\w*\.[a-z_-]*\.[a-z_-]*\)\)

But it still doesn't work. Where is my error? Thanks for your attention.

2 solutions

#1


1  

Yes, test-test.com01 will not match.


However, www.test-test.com01 will:


$ hostname="www.test-test.com01"
$ reg0="\(\(\w*\.[a-z_-]*\)\|\(\w*\.[a-z_-]*\.[a-z_-]*\)\)"
$ expr match $hostname $reg0
www.test-test.com

The problem is that the pattern requires \w* followed by a dot (\.). Note that \w* does not mean "an optional w": \w matches any word character (a letter, a digit or an underscore), so \w* matches zero or more word characters. A hyphen is not a word character, so in test-test.com01 the match stops at "test" and never reaches the dot. If what you want to match is a literal "www", you should remove the backslash.

Also, underscores are not valid in a hostname. This is the regex that you should use:

reg0="\(\(w\{1,3\}\.\)\?[a-z-]\+\(\.[a-z-]*\)\?\)"

In this one, www. is matched optionally, followed by one or (optionally) two names with a dot in between.
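To check that pattern against both forms of the name, here is a minimal sketch. It assumes GNU expr (for the \? and \+ extensions) and uses the STRING : REGEX spelling, which GNU expr treats the same as expr match. The function name extract_domain is made up for illustration.

```shell
#!/bin/bash

# Pattern from the answer: optional "www.", a name, then an optional ".tld".
reg0='\(\(w\{1,3\}\.\)\?[a-z-]\+\(\.[a-z-]*\)\?\)'

# extract_domain is a hypothetical helper, not part of the original script.
# expr anchors the match at the start of the string and prints the
# text captured by the first \( \) group.
extract_domain() {
    expr "$1" : "$reg0"
}

extract_domain "test-test.com01"       # prints test-test.com
extract_domain "www.test-test.com01"   # prints www.test-test.com
```

The trailing 01 is dropped because the last bracket expression only accepts letters and hyphens, so the match ends before the digits.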

However, domain names can also include digits, for example www.1and1.com.
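One possible way to allow digits is to add 0-9 to the name part only. This is a sketch under the assumption that the last label (the TLD) stays purely alphabetic. If digits were allowed there too, the trailing 01 would be swallowed into the match. It again assumes GNU expr, and the names reg_digits and extract_domain_digits are hypothetical.

```shell
#!/bin/bash

# Assumption: digits may appear in the name, but not in the final label,
# so the numeric server suffix (01) is still left out of the match.
reg_digits='\(\(w\{1,3\}\.\)\?[a-z0-9-]\+\(\.[a-z-]*\)\?\)'

# Hypothetical helper: prints the part captured by the first \( \) group.
extract_domain_digits() {
    expr "$1" : "$reg_digits"
}

extract_domain_digits "www.1and1.com01"   # prints www.1and1.com
extract_domain_digits "1and1.com01"       # prints 1and1.com
extract_domain_digits "test-test.com01"   # prints test-test.com
```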

And, in fact, watch out: internationalized domain names can now contain non-ASCII characters, written as UTF-8 strings:

From section 3.3 of RFC 6531:


The definition of <domain> is extended to permit both the RFC 5321 definition and a UTF-8 string in a DNS label that conforms with IDNA definitions [RFC5890].

And section 2.3.2.1 of RFC 5890:

A "U-label" is an IDNA-valid string of Unicode characters, in Normalization Form C (NFC) and including at least one non-ASCII character, expressed in a standard Unicode Encoding Form (such as UTF-8).


#2


0  

You are on the right track. The small problem is that you added - to the part of the regex that matches the last part of the domain, such as .com, .net or .ru. Instead, you should add - to the first part of the regex. This should work:

reg0="\(\([a-z0-9_-]*\.[a-z]*\)\|\([a-z0-9_-]*\.[a-z0-9_-]*\.[a-z]*\)\)"

On its own, the class [a-z0-9_] can be shortened to the token \w, which works without any problem. However, \w does not seem to work inside [] here (the pattern is evaluated by expr rather than by bash itself), so I spelled it out as [a-z0-9_] in order to add -.
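As a quick check, here is a sketch that runs this answer's pattern, written as reg0 and with no backslash before the first bracket. It assumes GNU expr (the \| alternation is an extension to basic regular expressions), and the helper name strip_suffix is made up for illustration.

```shell
#!/bin/bash

# Hyphens, digits and underscores are allowed in the name parts,
# while the last label accepts letters only.
reg0='\(\([a-z0-9_-]*\.[a-z]*\)\|\([a-z0-9_-]*\.[a-z0-9_-]*\.[a-z]*\)\)'

# strip_suffix is a hypothetical helper: it prints the text captured
# by the first \( \) group, i.e. the name without the numeric suffix.
strip_suffix() {
    expr "$1" : "$reg0"
}

strip_suffix "example.com01"     # prints example.com
strip_suffix "test-test.com01"   # prints test-test.com
```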
