How do you deal with strings that have structure?

Time: 2021-12-31 14:31:53

Suppose I have an object representing a person, with getter and setter methods for the person's email address. The setter method definition might look something like this:

setEmailAddress(String emailAddress)
    {
    this.emailAddress = emailAddress;
    }

Calling person.setEmailAddress(0), then, would generate a type error, but calling person.setEmailAddress("asdf") would not - even though "asdf" is in no way a valid email address.

In my experience, so-called strings are almost never arbitrary sequences of characters, with no restriction on length or format. URIs come to mind - as do street addresses, as do phone numbers, as do first names ... you get the idea. Yet these data types are most often stored as "just strings".

Returning to my person object, suppose I modify setEmailAddress() like so:

setEmailAddress(EmailAddress emailAddress)
    // ...

where EmailAddress is a class ... whose constructor takes a string representation of an email address. Have I gained anything?

OK, so an email address is kind of a bad example. What about a URI class that takes a string representation of a URI as a constructor parameter and provides methods for managing that URI - setting the path, fetching a query parameter, etc.? There the validity of the source string becomes important.
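
For one concrete point of reference, Java's standard library already ships a class of this kind, java.net.URI. The sketch below is only an illustration: the example URL and the queryParam and parses helpers are made up here, and queryParam does no percent-decoding. Note that "asdf" parses fine as a relative URI, so passing the parser is not the same as being the URI you wanted.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriDemo {
    // Hypothetical helper: raw value of the first query parameter with the
    // given name, or null if it is absent. No percent-decoding is performed.
    static String queryParam(URI uri, String name) {
        String query = uri.getRawQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            if (key.equals(name)) {
                return eq < 0 ? "" : pair.substring(eq + 1);
            }
        }
        return null;
    }

    // Hypothetical helper: true if the string parses under java.net.URI's rules.
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        URI uri = new URI("http://sub.example.com/stuff/example.html?id=42");
        System.out.println(uri.getHost());                  // sub.example.com
        System.out.println(uri.getPath());                  // /stuff/example.html
        System.out.println(queryParam(uri, "id"));          // 42
        System.out.println(parses("http://exa mple.com/")); // false: a space is not allowed
        System.out.println(parses("asdf"));                 // true: a valid relative reference
    }
}
```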

So I ask all of you, how do you deal with strings that have structure? And how do you make your structural expectations clear in your interfaces?

Thank you.

9 solutions

#1


Welcome to the world of programming!

I don't think your question is a symptom of an error on your part. Rather, it is a basic problem that appears in many guises throughout the programming world. Strings that have some structure and meaning are passed around between different subsystems of an application, and each subsystem can only do so much parsing and validation.

The problem of verifying an email address, for example, is quite tricky. The regular expressions various people offer for accepting an email address are generally either "too tight" (they reject some valid addresses) or "too loose" (they accept invalid ones). The first Google hit for 'regex "email address"', for example, says:

The regular expression I receive the most feedback, not to mention "bug" reports on, is the one you'll find right on this site's home page: \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b This regular expression, I claim, matches any email address. Most of the feedback I get refutes that claim by showing one email address that this regex doesn't match.

The fact is that what is or isn't a valid email address is a complex problem, one that a given program might or might not want to solve. The problem of URLs is even worse, especially given the possibility of malicious URLs.
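
To make "too tight" and "too loose" concrete, here is a small sketch that runs the quoted pattern against a few inputs. It assumes the dot before the top-level domain is meant to be escaped (\.) and that the pattern is applied case-insensitively to the whole string; the class name and the sample addresses are made up for the demonstration.

```java
import java.util.regex.Pattern;

class EmailRegexDemo {
    // The quoted pattern. It lists only A-Z, so it is compiled case-insensitively.
    static final Pattern QUOTED = Pattern.compile(
            "\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b",
            Pattern.CASE_INSENSITIVE);

    // True if the entire candidate string matches the quoted pattern.
    static boolean matches(String candidate) {
        return QUOTED.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        // Ordinary address: accepted.
        System.out.println(matches("johndoe123@gmail.com"));   // true
        // Too tight: a top-level domain longer than four letters is rejected.
        System.out.println(matches("curator@example.museum")); // false
        // Too loose: two consecutive dots in the domain are accepted.
        System.out.println(matches("a@b..cc"));                // true
    }
}
```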

Ideally, you can have a library or system call that solves problems of this sort instead of doing anything yourself (Microsoft Windows, for instance, uses a dialog box to let the user select or create a file, since validating file names is another tricky problem). But you can't always count on having an appropriate system call for a given "meaningful string" either.

I would say that there is no generic solution to the problem of strings-with-structure. Rather, it is a basic problem that appears as soon as you design your application. In the process of gathering requirements for your application, you should determine what data the application will take in and how meaningful that data will be to the application. And this is where things get tricky, since you may notice that the app could grow in ways that your boss or customer has not thought of - or the app may in fact grow in ways that none of you thought of. Thus the application needs to be a little more flexible than what seems like the minimum, BUT only a little. It should not be so flexible that you get bogged down.

Now, if you decide that you need to validate, interpret, etc. a given string, putting that string into an object or a hash can be a good approach - this is one way I know to make sure your interface is clear. But the tricky thing is deciding just how much validation or interpretation you need.

Making these decisions is thus an art - there are no dogmatic answers that work here.

#2


"Strings with structure" are a symptom of the common code smell "Primitive Obsession".

The remedy is to watch closely for duplication in code that validates or manipulates parts of these structures. At the first hint of duplication - but not before - extract a class that encapsulates the structure and locate validations and queries there.

#3


This is a pretty common problem falling under the title 'validation' - there are many ways to validate textual user input, one of the most common being Regular Expressions.

You might also consider using .NET's built-in System.Net.Mail.MailAddress class for this, as it provides validation for email addresses.

#4


Strings are strings. If you need your strings to be smarter than average strings then parsing them into a structural object like you describe would be a good idea. I would use a regex to do that.

#5


Regular expressions are your friend when it comes to formatting strings. You could also store each part separately in a struct to avoid going through the trouble of using regular expressions every time you want to use them, e.g.

struct EMail
{
    String BeforeAt = "johndoe123";
    String AfterAt = "gmail.com";
}

struct URL
{
    String Protocol = "http";
    String Domain = "sub.example.com";
    String Path = "stuff/example.html";
}
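
In Java, the same "parse once, keep the parts" idea might look like the sketch below. The class name EmailParts and the split-at-the-last-'@' rule are assumptions made for illustration, not a full email address parser.

```java
// Hypothetical value class holding the two halves of an address.
final class EmailParts {
    final String beforeAt;
    final String afterAt;

    private EmailParts(String beforeAt, String afterAt) {
        this.beforeAt = beforeAt;
        this.afterAt = afterAt;
    }

    // Splits at the last '@' and rejects input where either side would be empty.
    static EmailParts parse(String s) {
        int at = s.lastIndexOf('@');
        if (at <= 0 || at == s.length() - 1) {
            throw new IllegalArgumentException("not of the form before@after: " + s);
        }
        return new EmailParts(s.substring(0, at), s.substring(at + 1));
    }

    public static void main(String[] args) {
        EmailParts e = EmailParts.parse("johndoe123@gmail.com");
        System.out.println(e.beforeAt); // johndoe123
        System.out.println(e.afterAt);  // gmail.com
    }
}
```

After the one call to parse, later code reads beforeAt and afterAt directly and never touches the raw string again.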

#6


Well, if you want to do several different kinds of things with an EmailAddress object, those other actions do not have to check whether it is a valid email address, since the EmailAddress object is guaranteed to hold a valid string. You could throw an exception in the constructor, or use a factory method, or whatever fits the "One True Methodology" you're using.
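
A minimal sketch of that guarantee, using a factory method and a private constructor. The SHAPE rule is a deliberately loose placeholder picked for the example (something@something with no whitespace), not a real email validator, and the of, domain and Person names are assumptions as well.

```java
import java.util.regex.Pattern;

final class EmailAddress {
    // Placeholder rule (assumption): exactly one '@', non-empty sides, no whitespace.
    private static final Pattern SHAPE = Pattern.compile("[^@\\s]+@[^@\\s]+");

    private final String value;

    // Private, so the factory below is the only way to obtain an instance.
    private EmailAddress(String value) {
        this.value = value;
    }

    // Factory method: throws instead of returning an invalid object.
    static EmailAddress of(String raw) {
        if (raw == null || !SHAPE.matcher(raw).matches()) {
            throw new IllegalArgumentException("not an email address: " + raw);
        }
        return new EmailAddress(raw);
    }

    // Safe without re-checking: every instance has already passed SHAPE.
    String domain() {
        return value.substring(value.indexOf('@') + 1);
    }

    @Override
    public String toString() {
        return value;
    }
}

class Person {
    private EmailAddress emailAddress;

    // No validation here: the parameter type already carries the guarantee.
    void setEmailAddress(EmailAddress emailAddress) {
        this.emailAddress = emailAddress;
    }

    EmailAddress getEmailAddress() {
        return emailAddress;
    }
}
```

With this in place, person.setEmailAddress("asdf") no longer compiles, and EmailAddress.of("asdf") fails at the point where the bad string enters the program.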

#7


Personally, I like the idea of strong typing, so if I were still working in such languages I'd go with the style of your second example. The only thing I'd change might be to use a more "cast-like" structure, like EmailAddressFromString(String), that generated a new EmailAddress object (or pitched a fit if the string wasn't right), as I'm a bit of a fan of application Hungarian notation.

This whole problem, incidentally, is covered pretty well by Joel in http://www.joelonsoftware.com/articles/Wrong.html if you're interested.

#8


I agree with the calls to strongly type the object, but for those cases where you're parsing from a string to an object, the answer is simple: error handling.

There are two general ways to handle errors: exceptions and return conditions. Generally, if you expect to receive badly formed data, you should return an error message. For cases where the input is not expected, I would throw an exception. For example, you might be passed an ill-formed email address, such as 'bob' instead of 'bob@gmail.com'; that is to be expected, so return an error. For null values, however, you might throw an exception, as you shouldn't try to form an email address out of null.
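
One way to sketch that split in Java is to report the expected failure through the return value and the unexpected one through an exception. The tryParse name, the use of Optional as the "return condition", and the one-'@' check are all assumptions chosen for the example.

```java
import java.util.Objects;
import java.util.Optional;

class EmailInput {
    // Expected failure (malformed text): signalled by an empty Optional.
    // Unexpected failure (null, i.e. a caller bug): signalled by an exception.
    static Optional<String> tryParse(String raw) {
        Objects.requireNonNull(raw, "raw must not be null");
        String trimmed = raw.trim();
        int at = trimmed.indexOf('@');
        // Placeholder check (assumption): exactly one '@' with text on both sides.
        boolean wellFormed = at > 0
                && at == trimmed.lastIndexOf('@')
                && at < trimmed.length() - 1;
        return wellFormed ? Optional.of(trimmed) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(tryParse("bob@gmail.com").isPresent()); // true
        System.out.println(tryParse("bob").isPresent());           // false
        try {
            tryParse(null);
        } catch (NullPointerException e) {
            System.out.println("null is rejected with an exception");
        }
    }
}
```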

Returning to your question, I do think you gain something by encoding a structure into an object. Specifically, you only need to validate that the string represents a valid email address in one specific place, such as the constructor. Elsewhere, your code is free to assume that an EmailAddress object is valid, and you don't have to rely upon dodgy classes with names like 'EmailHelper' or some such.

#9


I personally do not think strong-typing the email address string as EmailAddress is necessary, in this case.

To create your email address you will, sooner or later, have to do something like:

EmailAddress(String email)

or a setter

SetEmailAddress(String email)

In both cases, you'll have to validate the email string input, which puts you back into your initial validation problem.

I would, as others pointed out, use regular expressions.

Having an EmailAddress class would be useful if you plan on having to perform specific operations on your stored information later on (say get domain name only, stuff like that).
