When learning a new programming language there are always a couple of traditional problems that are good for getting yourself moving. For example, Hello World and Fibonacci will show how to read input, print output and compute functions (the bread and butter that will solve basically everything), and while they are really simple, they are nontrivial enough to be worth the time (and there is always some fun to be had in calculating the factorial of a ridiculously large number in a language with bignums).
So now I'm trying to get to grips with some SQL system and all the textbook examples I can think of involve mind-numbingly boring tables like "Student" or "Employee". What nice alternate datasets could I use instead? I am looking for something that (in order of importance) ...
- The data can be generated by a straightforward algorithm.
  - I don't want to have to enter things by hand.
  - I want to be able to easily increase the size of my tables to stress efficiency, etc.
- Can be used to showcase as much stuff as possible. Selects, Joins, Indexing... You name it.
- Can be used to get back some interesting results.
  - I can live with "boring" data manipulation if the data is real and has a use by itself, but I'd rather have something more interesting if I am creating the dataset from scratch.
In the worst case, I at least presume there should be some sort of benchmark dataset out there that would at least fit the first two criteria and I would love to hear about that too.
7 solutions
#1
12
The benchmark database in the Microsoft world is Northwind. A similar open source (EPL) option is Eclipse's Classic Models database.
You can't autogenerate either as far as I know.
However, Northwind "imports and exports specialty foods from around the world", while Classic Models sells "scale models of classic cars". Both are pretty interesting. :)
#2
10
SQL is a query language, not a procedural language, so unless you will be playing with PL/SQL or something similar, your examples will be manipulating data.
So here is what was fun for me -- data mining! Go to:
http://usa.ipums.org/usa/
And download their micro-data (you will need to make an account, but it's free).
You'll need to write a little script to load the fixed-width file into your db, which in itself should be fun. And you will need to write a little script to auto-create the fields (since there are many) based on parsing their meta-file. That's fun, too.
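For what it's worth, the loading step could look roughly like the sketch below. It uses Python's built-in `sqlite3` module only because it needs no setup. The `person` table, the column names, and the offsets in `LAYOUT` are made up for illustration; the real layout has to come from the meta-file that ships with your extract.

```python
import sqlite3

# Hypothetical layout: (column name, start offset, width). The real offsets
# come from the meta-file that accompanies the downloaded extract.
LAYOUT = [("year", 0, 4), ("state", 4, 2), ("income", 6, 7)]


def parse_line(line):
    """Slice one fixed-width record into a tuple of integers."""
    return tuple(int(line[start:start + width]) for _, start, width in LAYOUT)


def load(conn, lines):
    """Create the table from LAYOUT and bulk-insert the parsed records."""
    columns = ", ".join(f"{name} INTEGER" for name, _, _ in LAYOUT)
    conn.execute(f"CREATE TABLE IF NOT EXISTS person ({columns})")
    placeholders = ", ".join("?" for _ in LAYOUT)
    conn.executemany(
        f"INSERT INTO person VALUES ({placeholders})",
        (parse_line(line) for line in lines if line.strip()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
sample = ["1980060025000", "2000360048000"]  # made-up records, not real data
load(conn, sample)
rows = conn.execute(
    "SELECT year, state, income FROM person ORDER BY year"
).fetchall()
print(rows)
```

Because the table definition is derived from `LAYOUT`, generating that list from the meta-file would cover the "auto-create the fields" part as well.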
Then, you can start asking questions. Suppose the questions are about house prices:
Say you want to look at the evolution of house prices for those with incomes in the top 10% of the population over the last 40 years. Then restrict it to those living in California. See if there is a correlation between income and mortgage payments as a percentage of income. Then group this by geographic area. Then see if there is a correlation between those areas with the highest mortgage burden and the percentage of units occupied by renters. Your db will have some built-in statistical functions, but you can always program your own as well -- so correl might be the equivalent of Fibonacci. Then write a little script to do the same thing in R, importing data from your db, manipulating it, and storing the result.
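As a rough idea of what a home-made `correl` could look like, here is a sketch that computes the Pearson correlation from plain SQL aggregates. The `household` table, its columns, and the rows are invented for illustration, and SQLite is only used because it ships with Python. The square root is taken in Python since not every SQLite build includes math functions.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE household (income REAL, mortgage REAL)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO household VALUES (?, ?)",
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)],
)


def correl(conn, table, x, y):
    """Pearson correlation of columns x and y, built from SQL aggregates.

    Table and column names are interpolated into the query, so they must
    come from trusted code, never from user input.
    """
    n, sx, sy, sxx, syy, sxy = conn.execute(
        f"SELECT COUNT(*), SUM({x}), SUM({y}), "
        f"SUM({x} * {x}), SUM({y} * {y}), SUM({x} * {y}) FROM {table}"
    ).fetchone()
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return numerator / denominator


r = correl(conn, "household", "income", "mortgage")
print(round(r, 6))
```

Adding a `GROUP BY` on an area column to the aggregate query would give one set of sums per geographic area, which is the grouping step described above.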
The best way to learn about DBs is to use them for some other purpose.
Once you are done playing with IPUMS, take a look at geographic data, with (depending on your database) something like PostGIS -- the only difference is that IPUMS gives you resolution in terms of tracts, whereas GIS data has latitude/longitude coordinates. Then you can plot a heat map of mortgage burdens for the U.S., and evolve this heat map over different time scales.
#3
1
Perhaps you can do something with chemistry. Input the 118 elements, or extract them from an online source. Use basic rules to combine them into molecules, which you can store in the database. Combine molecules into bigger molecules and perform more complex queries upon them.
#4
1
You will have a hard time finding database-agnostic tutorials. The main reason is that the SQL-92 standard on which most examples are based is plain old boring. There are updated standards, but most database-agnostic tutorials will dumb it down to the lowest common denominator: SQL-92.
If you want to learn about databases as a software engineer, I would definitely recommend starting with Microsoft SQL Server. There are many reasons for that; some are facts, some are opinions. The primary reason though is that it's a lot easier to get a lot further with SQL Server.
As for sample data, Northwind has been replaced by AdventureWorks. You can get the latest versions from CodePlex. This is a much more realistic database and allows demonstrating way more than basic joins, filtering and roll-ups. The great thing, too, is that it is actually maintained for each release of SQL Server and updated to showcase some of the new features of the database.
Now, for your goal #1, well, I would consider the scaling out an exercise. After you go through the basic and boring stuff, you should gradually be able to perform efficient large-scale data manipulation and, while not really generating data, at least copy/paste/modify your SQL data to take it to the size you want.
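One way to read "copy/paste/modify" is an `INSERT ... SELECT` that feeds a table back into itself, so every pass doubles the row count. The sketch below uses SQLite through Python purely for convenience; the `orders` table and its contents are invented for illustration, and the same statement shape should carry over to SQL Server with minor syntax changes (string concatenation, for one).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
# Three made-up seed rows.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 10.0), ("bob", 20.0), ("carol", 30.0)],
)


def double_rows(conn):
    """Append a slightly modified copy of every existing row."""
    conn.execute(
        "INSERT INTO orders (customer, amount) "
        "SELECT customer || '_copy', amount * 1.1 FROM orders"
    )


for _ in range(10):  # 3 rows doubled 10 times
    double_rows(conn)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)
```

Ten passes take three rows to 3072, so a few dozen passes are more than enough to make a missing index noticeable.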
Keep in mind though that benchmarking databases is not trivial. The performance and efficiency of a database depend on many aspects of your application. How it is used is just as important as how it is set up.
Good luck and do let us know if you find a viable solution outside this forum.
#5
0
Implement your genealogical tree within a single table and print it. In itself it is not a very general problem, but the approach certainly is, and it should prove reasonably challenging.
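A minimal sketch of the single-table idea, assuming each person records only one parent (a real family tree needs two, which is part of the challenge). The `person` table and the names are made up, and the recursive common table expression needs a database that supports `WITH RECURSIVE`, which SQLite does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, "
    "parent_id INTEGER REFERENCES person(id))"
)
# Made-up family: Ada is the root, Ben and Cleo are her children.
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, "Ada", None), (2, "Ben", 1), (3, "Cleo", 1), (4, "Dan", 2)],
)

rows = conn.execute(
    """
    WITH RECURSIVE tree(id, name, depth, path) AS (
        -- anchor: people with no recorded parent
        SELECT id, name, 0, name FROM person WHERE parent_id IS NULL
        UNION ALL
        -- step: attach each child to a row already in the tree
        SELECT p.id, p.name, t.depth + 1, t.path || '/' || p.name
        FROM person AS p JOIN tree AS t ON p.parent_id = t.id
    )
    SELECT name, depth FROM tree ORDER BY path
    """
).fetchall()

# Indent each name by its depth to print the tree.
lines = ["  " * depth + name for name, depth in rows]
print("\n".join(lines))
```

Sorting by the accumulated `path` keeps every child directly under its parent in the printed output.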
#6
0
Geographic data can showcase a lot of SQL capabilities while being somewhat complicated (but not too complicated). It's also readily available from many sources online - international organizations, etc.
You could create a database with countries, cities, zip codes, etc. Mark capitals of countries (remember that some countries have more than one capital city...). Include GIS data if you want to get really fancy. Also, consider how you might model different address information. Now what if the address information had to support international addresses? You can do the same with phone numbers as well. Once you get the hang of things, you could even integrate with Google Maps or something similar.
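To make the capital-city wrinkle concrete, here is one possible starting schema, sketched with SQLite through Python. The table and column names are my own assumptions, not any standard. Putting an `is_capital` flag on the city, rather than a single capital column on the country, is what lets a country have several capitals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE city (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        country_id INTEGER NOT NULL REFERENCES country(id),
        is_capital INTEGER NOT NULL DEFAULT 0
    );
    """
)
conn.executemany(
    "INSERT INTO country VALUES (?, ?)", [(1, "France"), (2, "South Africa")]
)
conn.executemany(
    "INSERT INTO city (name, country_id, is_capital) VALUES (?, ?, ?)",
    [
        ("Paris", 1, 1),
        ("Lyon", 1, 0),
        ("Pretoria", 2, 1),
        ("Cape Town", 2, 1),
        ("Bloemfontein", 2, 1),
        ("Durban", 2, 0),
    ],
)

# Countries with more than one capital city.
multi = conn.execute(
    """
    SELECT c.name, COUNT(*) AS capitals
    FROM country AS c JOIN city ON city.country_id = c.id
    WHERE city.is_capital = 1
    GROUP BY c.id, c.name
    HAVING COUNT(*) > 1
    """
).fetchall()
print(multi)
```

Zip codes, addresses and phone numbers would hang off these tables in the same way, and that is where the international cases start to complicate the design.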
You'd likely have to do the database design and import work yourself, but really that's a pretty huge part of working with databases.
#7
0
Eclipse's Classic Models database is the best open source database equivalent of the factorial and Fibonacci functions. Microsoft's Northwind is another powerful alternative that you can use.