I have written an application named address_parser.exe in C# (WinForms), targeted at PCs running Windows XP, Vista, 7 and 8, with .NET Framework 3.5 as the minimum requirement.
The application reads and parses text files (plain text only; I have no control over the input files, so XML is unfortunately not an option).
These text files contain a set of data, let's say an address, split over multiple, non-consecutive lines.
Please have a look at the following two text files as a demo:
address_type_1.txt:
Elm Grove
47
PO5 1JF
Southsea
and
address_type_2.txt:
Southsea
Albert Road
147b
PO4 0JW
Currently, I have hard-coded in my code where in the input file the street, the house number, the zip code and the city are located. So for each address file type I have created a set of rules stating which line contains which information.
In addition, I have a set of regular expressions that check the validity of each piece of information (street, house number, zip code, city).
Since these two sets of rules (which line contains which field, and which regex pattern validates it) vary between address types, I would like to store them in some sort of config file. Instead of hard-coding them, I would like one configuration file per address type that my application can read to configure how it parses that particular address file type.
I would like to get some ideas and inspiration from you. Please share your thoughts and best practices!
Thanks!
Below are some thoughts of mine, and the code snippets I am using so far...
My current hard-coded address file parsing looks like this:
public static Address Parse(string fileName)
{
    var a = new Address();
    a.OriginalFile = fileName;
    int i = 0; // zero-based index of the current line
    using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.None))
    using (var reader = new StreamReader(fs, Encoding.GetEncoding(65001))) // code page 65001 = UTF-8
    {
        Regex rgxStreet = new Regex(@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$");
        Regex rgxNumber = new Regex(@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,20}$");
        Regex rgxCity = new Regex(@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$");
        Regex rgxZIP = new Regex(@"^([0-9]){5}$");
        string rawLine;
        // ReadLine returns null at the end of the stream, so check before trimming
        while ((rawLine = reader.ReadLine()) != null)
        {
            var line = rawLine.TrimEnd(';').Trim();
            if (i == 4 && rgxStreet.IsMatch(line))
            {
                a.Street = line;
            }
            else if (i == 7 && rgxNumber.IsMatch(line))
            {
                a.Number = line;
            }
            else if (i == 12 && (rgxZIP.IsMatch(line) || String.IsNullOrEmpty(line)))
            {
                a.Zip = line;
            }
            else if (i == 15 && rgxCity.IsMatch(line))
            {
                a.City = line;
            }
            i++;
        }
    }
    return a;
}
As you can see, I am also using an individual regular expression for each of those 4 attributes to check whether what I am reading is valid.
Now, I would like to replace this hard-coded information (line X contains field Y, validated by regular expression Z) so that I can support reading and parsing files where the same information is stored in a different order, or with different valid values.
The example above targets a file containing an address in Germany (the ZIP code is 5 digits).
Another type of text file, containing an address in the UK, may look like this:
line 1: city;
line 2: zip;
line 20: street;
line 159: number;
In this example, the order of the information has changed, as has the regex needed for the zip code (I am assuming UK postal codes are 6 characters long and contain both letters and digits).
Instead of hard-coding how to parse this type of file, I would like something like a config file which tells my application how to parse a specific type of file. Something like this:
#config file for UK address files:
#line;field;regex;
1;city;@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$";
2;zip;@"^([A-Za-z0-9]){6}$";
20;street;@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,128}$";
150;number;@"^([\w\.,:\/\\\-öäüÖÄÜß_\s\(\)\[\]-[=;]]){0,20}$";
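A minimal sketch of how a file in this format might be loaded, using only the base class library. The names FieldRule and ConfigParser are hypothetical, and the sketch assumes that the @"..." wrapper around each pattern is decoration to be stripped. One catch it works around: the patterns themselves contain a ';' (inside [=;]), so an entry cannot simply be split on every ';'; only the first two separators are treated as field boundaries.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical type: one config entry of the form "line;field;regex;".
public class FieldRule
{
    public int Line;
    public string Field;
    public Regex Pattern;
}

public static class ConfigParser
{
    // Parses the lines of a config file; blank lines and lines starting with '#' are skipped.
    public static List<FieldRule> Parse(IEnumerable<string> configLines)
    {
        var rules = new List<FieldRule>();
        foreach (var raw in configLines)
        {
            var text = raw.Trim();
            if (text.Length == 0 || text.StartsWith("#", StringComparison.Ordinal)) continue;

            // Split into at most 3 parts so that a ';' inside the regex survives.
            var parts = text.Split(new[] { ';' }, 3);
            if (parts.Length != 3)
                throw new FormatException("Expected line;field;regex; but got: " + text);

            var pattern = parts[2].Trim();
            // Drop the single trailing ';' that terminates the entry.
            if (pattern.EndsWith(";", StringComparison.Ordinal))
                pattern = pattern.Substring(0, pattern.Length - 1);
            // Drop the C#-style @"..." wrapper used in the example file (an assumption).
            if (pattern.Length >= 3
                && pattern.StartsWith("@\"", StringComparison.Ordinal)
                && pattern.EndsWith("\"", StringComparison.Ordinal))
                pattern = pattern.Substring(2, pattern.Length - 3);

            rules.Add(new FieldRule
            {
                Line = int.Parse(parts[0].Trim()),
                Field = parts[1].Trim(),
                Pattern = new Regex(pattern)
            });
        }
        return rules;
    }
}
```

It could be called as ConfigParser.Parse(File.ReadAllLines(configPath)), where configPath is whatever path the config file is stored under. Storing the patterns without the @"..." wrapper would make the stripping step unnecessary.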
My question is: is this a good idea, or are there better ways to tell my application how a specific file needs to be read and parsed, and how its contents should be interpreted and validated?
Thank you!
2 solutions
#1
3
Yes, it is a good idea. Use Newtonsoft.Json to help you load the config, like this:
private class StartSettings
{
    public string CityReg;
    public int CityNum;
    public string ZipReg;
    public int ZipNum;
    public string StreetReg;
    public int StreetNum;
    public string NumberReg;
    public int NumberNum;
}

var configString = File.ReadAllText(configFilePath);
var config = JsonConvert.DeserializeObject<StartSettings>(configString);
And to read the files, just use:
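Regex rgxStreet = new Regex(config.StreetReg);
Regex rgxNumber = new Regex(config.NumberReg);
Regex rgxCity = new Regex(config.CityReg);
Regex rgxZIP = new Regex(config.ZipReg);
var a = new Address();
int i = 0; // zero-based line index, compared against the configured line numbers
// File.ReadAllLines is used here because File.ReadLines only exists from .NET 4.0 on
foreach (var line in File.ReadAllLines(fileName, Encoding.GetEncoding(65001))
                         .Select(l => l.TrimEnd(';').Trim()))
{
    if (config.CityNum == i && rgxCity.IsMatch(line))
        a.City = line;
    ...
    i++;
}
return a;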
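The fixed set of fields in StartSettings could probably be generalised to a list of rules, so that a new field needs no code change. Below is a rough sketch of that idea, not a definitive implementation. LineRule and RuleParser are hypothetical names, and the sketch works on lines already held in memory so that it runs without any JSON library. With Json.NET, such a list could presumably be loaded through the same JsonConvert.DeserializeObject call, with List<LineRule> as the type argument.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical rule: which field sits on which zero-based line, and how to validate it.
public class LineRule
{
    public int Line;
    public string Field;
    public string Pattern;
}

public static class RuleParser
{
    // Returns field name -> value for every rule whose line exists and matches its pattern.
    public static Dictionary<string, string> Apply(IList<string> lines, IEnumerable<LineRule> rules)
    {
        var result = new Dictionary<string, string>();
        foreach (var rule in rules)
        {
            // Skip rules that point past the end of the file.
            if (rule.Line < 0 || rule.Line >= lines.Count) continue;

            var value = lines[rule.Line].TrimEnd(';').Trim();
            if (Regex.IsMatch(value, rule.Pattern))
                result[rule.Field] = value;
        }
        return result;
    }
}
```

A field that is missing from the returned dictionary either had no line in the file or failed validation, so the caller can decide how strictly to treat incomplete addresses.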
#2
0
Since I doubt it is possible to determine automatically whether a value is a street or a city name, you need to specify at least some information about the "format" in which the input data is laid out.
If you are still able to choose the data format, go for XML.
Use XML and XmlSerializer like so:
[Serializable]
public class AdressData
{
    [XmlArrayItem("Adress")]
    public Adress[] Adresses { get; set; }
}

[Serializable]
public class Adress
{
    public string Street { get; set; }
    public int Number { get; set; }
    public int Zip { get; set; }
    public string City { get; set; }
    public string State { get; set; }
}
Then use it like this:
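XmlSerializer serializer = new XmlSerializer(typeof(AdressData));
AdressData data;
// File.OpenRead opens the file read-only; the using block closes the stream afterwards
using (var stream = File.OpenRead(fileName))
{
    data = (AdressData)serializer.Deserialize(stream);
}
foreach (Adress adress in data.Adresses)
{
    checkIfItExists(adress);
}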
Your XML should look like this:
<AdressData>
    <Adresses>
        <Adress>
            <Street>WhateverStr</Street>
            <Number>7</Number>
            <Zip>5675765</Zip>
            <City>Citytown</City>
            <State>Alabama</State>
        </Adress>
        <Adress>
            <!-- Order doesn't matter here -->
            <Number>7</Number>
            <Zip>5675765</Zip>
            <City>Citytown</City>
            <State>Alabama</State>
            <Street>WhateverStr</Street>
        </Adress>
    </Adresses>
</AdressData>
The order of the data in the XML doesn't matter, as long as it fits the hierarchy. The serializer does some validation, e.g. it tries to parse numeric values. All you need to do is check whether the information itself is valid.
It is capable of parsing enums as well, so you could (I wouldn't recommend it, though) create an enum containing all US state names...
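One caveat, given the sample files in the question: a house number such as 147b or a postcode such as PO4 0JW would not deserialize into an int, so string properties are probably the safer choice there. Here is a self-contained sketch under that assumption. The class names AddressBook, PostalAddress and AddressXml are hypothetical and not part of the code above, and the XML is read from a string rather than a file to keep the example runnable on its own.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical variant of the classes above, with Number and Zip as strings.
// The attributes map it onto the same element names as the XML shown earlier.
[XmlRoot("AdressData")]
public class AddressBook
{
    [XmlArray("Adresses")]
    [XmlArrayItem("Adress")]
    public PostalAddress[] Addresses { get; set; }
}

public class PostalAddress
{
    public string Street { get; set; }
    public string Number { get; set; }
    public string Zip { get; set; }
    public string City { get; set; }
}

public static class AddressXml
{
    // Deserializes an XML document held in a string.
    public static AddressBook Load(string xml)
    {
        var serializer = new XmlSerializer(typeof(AddressBook));
        using (var reader = new StringReader(xml))
        {
            return (AddressBook)serializer.Deserialize(reader);
        }
    }
}
```

With string properties the serializer no longer rejects non-numeric values, so the format check (for example a regex per field, as in the question) has to happen after deserialization.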