I have a variety of time-series data stored on a more-or-less georeferenced grid, e.g. one value per 0.2 degrees of latitude and longitude. Currently the data are stored in text files, so at day-of-year 251 you might see:
251
12.76 12.55 12.55 12.34 [etc., 200 more values...]
13.02 12.95 12.70 12.40 [etc., 200 more values...]
[etc., 250 more lines]
252
[etc., etc.]
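For readers who want something concrete, here is a minimal sketch of a reader for the layout shown above. It assumes (this is not stated in the post) that a day block starts with a line holding a single integer and that every data row has more than one value; `parse_grid_text` is a hypothetical name, not part of any existing code.

```python
import io

def parse_grid_text(stream):
    """Return {day_of_year: [[float, ...], ...]} from the text layout above."""
    days = {}
    rows = None
    for line in stream:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if len(tokens) == 1 and tokens[0].isdigit():
            # Assumption: a lone integer on a line starts a new day block.
            rows = days.setdefault(int(tokens[0]), [])
        elif rows is None:
            raise ValueError("data row found before any day header")
        else:
            rows.append([float(t) for t in tokens])
    return days

# Tiny stand-in for a real file (the real rows are about 200 values wide).
sample = io.StringIO(
    "251\n"
    "12.76 12.55 12.55 12.34\n"
    "13.02 12.95 12.70 12.40\n"
    "252\n"
    "11.00 11.10 11.20 11.30\n"
)
grids = parse_grid_text(sample)
print(sorted(grids))
print(grids[251][1][0])
```

Once the data is in a dictionary keyed by day, the order of blocks in the file stops mattering to the calling code.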
I'd like to raise the level of abstraction, improve performance, and reduce fragility (for example, the current code can't insert a day between two existing ones!). We'd messed around with BLOB-y RDBMS hacks and even replicating each line of the text file format as a row in a table (one row per timestamp/latitude pair, one column per longitude increment -- yecch!).
We could go to a "real" geodatabase, but the overhead of tagging each individual value with a lat and long seems prohibitive. The size and resolution of the data haven't changed in ten years and are unlikely to do so.
I've been noodling around with putting everything in NetCDF files, but think we need to get past the file mindset entirely -- I hate that all my software has to figure out filenames from dates, deal with multiple files for multiple years, etc. The alternative, putting all ten years' (and counting) data into a single file, doesn't seem workable either.
Any bright ideas or products?
5 Answers
#1
2
I've assembled your comments here:
- I'd like to do all this "w/o writing my own file I/O code"
- I need access from "Java Ruby MATLAB" and "FORTRAN routines"
When you add these up, you definitely don't want a new file format. Stick with the one you've got.
If we can get you to relax your first requirement, i.e., if you'd be willing to write your own file I/O code, then there are some interesting options for you. I'd write C++ classes, and I'd use something like SWIG to make your new classes available to the multiple languages you need. (But I'm not sure you'd be able to use SWIG to give you access from Java, Ruby, MATLAB and FORTRAN. You might need something else. Not really sure how to do it, myself.)
You also said, "Actually, if I have to have files, I prefer text because then I can just go in and hand-edit when necessary."
My belief is that this is a misguided statement. If you'd be willing to make your own file I/O routines then there are very clever things you could do... And as an ultimate fallback, you could give yourself a tool that converts from the new file format to the same old text format you're used to... And another tool that converts back. I'll come back to this at the end of my post...
You said something that I want to address:
"leverage 40 yrs of DB optimization"
Databases are meant for relational data, not raster data. You will not leverage anyone's DB optimizations with this kind of data. You might be able to cram your data into a DB, but that's hardly the same thing.
Here's the most useful thing I can tell you, based on everything you've told us. You said this:
"I am more interested in optimizing my time than the CPU's, though exec speed is good!"
This is frankly going to require TOOLS. Stop thinking of it as a text file. Start thinking of the common tasks you do, and write small tools - in WHATEVER LANGUAGE(S) - to make those things TRIVIAL to do.
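As one example of such a small tool, here is a rough sketch of the "insert a day between two existing ones" task against the current text layout. It assumes day headers are lone integers in ascending order within a single year and that values are written with two decimals; `insert_day` is a made-up name, not an existing routine.

```python
def insert_day(lines, new_day, new_rows):
    """Yield the file's lines with a new day block spliced in at the right place."""
    block = [str(new_day)] + [" ".join(f"{v:.2f}" for v in row) for row in new_rows]
    inserted = False
    for line in lines:
        text = line.rstrip("\n")
        tokens = text.split()
        # Assumption: a line holding a single integer is a day header.
        if len(tokens) == 1 and tokens[0].isdigit():
            day = int(tokens[0])
            if day == new_day:
                raise ValueError(f"day {new_day} already present")
            if not inserted and day > new_day:
                yield from block  # splice in before the first later day
                inserted = True
        yield text
    if not inserted:
        yield from block  # new day is later than everything in the file

existing = ["250", "1.00 2.00", "252", "3.00 4.00"]
updated = list(insert_day(existing, 251, [[9.5, 9.25]]))
print("\n".join(updated))
```

Because it works line by line, the same function can be pointed at an open file and its output written to a new one without holding ten years of data in memory.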
And if your tools turn out to have lousy performance? Guess what - it's because your flat text file is a cruddy format. But that's just my opinion. :)
#2
0
I'd definitely change from text to binary, but still keep each day in a separate file. You could name the files so that insertions in between don't cause any strangeness with indices, for example by including the date, and possibly the time, in the filename. You could also reconsider the file structure, for example if you have several fields per location. Is it common to look for a small tile across a large number of timesteps? In that case you might want to store the data as tiles, each containing data from several days. You didn't mention how the data is accessed, which plays a big role in how to organise it efficiently.
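A minimal sketch of the one-binary-file-per-day idea, with the date in the filename so that a late-arriving day simply becomes one more file. The `grid_YYYY-MM-DD.bin` naming, the little-endian float32 layout with a rows/cols header, and the dates used are all assumptions for illustration.

```python
import struct
import tempfile
from datetime import date
from pathlib import Path

HEADER = struct.Struct("<II")  # rows, cols as little-endian unsigned ints

def day_path(root, day):
    # Hypothetical naming scheme: ISO dates sort correctly as plain strings.
    return Path(root) / f"grid_{day.isoformat()}.bin"

def write_day(root, day, rows):
    n_rows, n_cols = len(rows), len(rows[0])
    flat = [v for row in rows for v in row]
    payload = HEADER.pack(n_rows, n_cols) + struct.pack(f"<{len(flat)}f", *flat)
    day_path(root, day).write_bytes(payload)

def read_day(root, day):
    data = day_path(root, day).read_bytes()
    n_rows, n_cols = HEADER.unpack_from(data, 0)
    flat = struct.unpack_from(f"<{n_rows * n_cols}f", data, HEADER.size)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

def available_days(root):
    names = (p.stem[len("grid_"):] for p in Path(root).glob("grid_*.bin"))
    return sorted(date.fromisoformat(n) for n in names)

with tempfile.TemporaryDirectory() as root:
    write_day(root, date(2008, 9, 6), [[12.5, 12.25], [13.0, 12.75]])
    write_day(root, date(2008, 9, 8), [[11.0, 11.5], [10.0, 10.5]])
    # The "missing" day arrives later; nothing else has to be rewritten.
    write_day(root, date(2008, 9, 7), [[1.0, 2.0], [3.0, 4.0]])
    days = available_days(root)
    middle = read_day(root, date(2008, 9, 7))

print(days)
print(middle)
```

Note that float32 keeps roughly seven significant digits, so values such as 12.76 come back very slightly different; switching the format character to `d` stores doubles instead at twice the size.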
#3
0
Clarifications:
I'm surprised you added "database" as one of the tags, and considered it as an option. Why did you do this?
Essentially, you have a 2D, single component floating point image at every time step. Would you agree with this way of viewing your data?
You also mentioned the desire to insert a day between two existing ones - which seems to be a very odd thing to do. Why would you need to do that? Is there a new day between May 4 and May 5 that I don't know about?
Is "compression" one of the things you care about, or are you just sick of flat files?
Would a float or a double be sufficient to store your data, or do you feel you need more arbitrary precision?
Also, what programming language(s) do you want to access this data with?
#4
0
The answer on how to store the data depends entirely on what you're going to do with the data. For example, if you only ever need to retrieve by specifying the date or a date range, then storing each grid in a database as a BLOB makes some sense. But if you need to find records that have certain values, you'll need to do something different.
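To make the BLOB option concrete, here is a rough sketch using SQLite from Python's standard library: one row per day, keyed by an ISO date string, with the grid packed as doubles. The table name, column names, packing layout and dates are assumptions for illustration only.

```python
import sqlite3
import struct

def pack_grid(rows):
    n_rows, n_cols = len(rows), len(rows[0])
    flat = [v for row in rows for v in row]
    # Header (rows, cols) followed by the values as little-endian doubles.
    return struct.pack(f"<II{len(flat)}d", n_rows, n_cols, *flat)

def unpack_grid(blob):
    n_rows, n_cols = struct.unpack_from("<II", blob, 0)
    flat = struct.unpack_from(f"<{n_rows * n_cols}d", blob, 8)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

def store(conn, day, rows):
    conn.execute("INSERT INTO grids (day, data) VALUES (?, ?)", (day, pack_grid(rows)))

def fetch_range(conn, first, last):
    # ISO date strings compare correctly as text, so BETWEEN works on them.
    cur = conn.execute(
        "SELECT day, data FROM grids WHERE day BETWEEN ? AND ? ORDER BY day",
        (first, last),
    )
    return [(day, unpack_grid(blob)) for day, blob in cur]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grids (day TEXT PRIMARY KEY, data BLOB NOT NULL)")
store(conn, "2008-09-08", [[11.0, 11.1], [10.9, 10.8]])
store(conn, "2008-09-06", [[12.76, 12.55], [13.02, 12.95]])
store(conn, "2008-09-07", [[12.1, 12.0], [12.4, 12.3]])  # insertion order is irrelevant

result = fetch_range(conn, "2008-09-06", "2008-09-07")
print([day for day, _ in result])
```

This handles retrieval by date or date range well; a query such as "which days had a cell above 15" would still mean unpacking every BLOB, which is the case where a different layout is needed.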
Please describe how you need to be able to access the data.
#5
0
Matt, thanks very much, and likewise longneck and jirv.
This post was partly an experiment, testing the quality of * discourse. If you guys/gals/alien lifeforms are representative, I'm sold.
And on point, you've clarified my thinking considerably. Mind, I still might not necessarily implement your advice, but know that I will be thinking about it very seriously. >;-)
I may very well leave the file format the same, add to the extant C and/or Ruby routines to tack on the few low-level features I lack (e.g. inserting missing timesteps), and hang an HTTP front end on the whole thing so that the data can be consumed by whatever box needs it, in whatever language is currently hoopy. While it's mostly unchanging legacy software that constructs these data, we're always coming up with new consumers for them, so the multi-language/multi-computer requirement (gee, did I forget that one?) applies to the reading side, not the writing side. That also obviates a whole slew of security issues.
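For the curious, a read-only HTTP front end along these lines could look roughly like the WSGI sketch below. The URL scheme (`/days/<day_of_year>`), the JSON response shape and the in-memory `GRIDS` stand-in are all assumptions; a real version would load grids from the existing files through the C or Ruby routines.

```python
import json
from wsgiref.simple_server import make_server

# Stand-in data; in practice this would come from the existing file readers.
GRIDS = {
    251: [[12.76, 12.55], [13.02, 12.95]],
    252: [[12.10, 12.00], [12.40, 12.30]],
}

def app(environ, start_response):
    """Read-only endpoint: GET /days/<day_of_year> returns that grid as JSON."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    if environ.get("REQUEST_METHOD") != "GET":
        status, body = "405 Method Not Allowed", {"error": "read-only service"}
    elif (len(parts) == 2 and parts[0] == "days"
          and parts[1].isdigit() and int(parts[1]) in GRIDS):
        day = int(parts[1])
        status, body = "200 OK", {"day": day, "rows": GRIDS[day]}
    else:
        status, body = "404 Not Found", {"error": "no such day"}
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]

def serve(host="127.0.0.1", port=8000):
    """Run the app with the standard library's reference server (not called here)."""
    with make_server(host, port, app) as server:
        server.serve_forever()

def get(path):
    """Call the app directly, without a network socket, for a quick check."""
    captured = {}
    def start_response(status, headers):
        captured["status"] = status
    body = b"".join(app({"REQUEST_METHOD": "GET", "PATH_INFO": path}, start_response))
    return captured["status"], json.loads(body)

print(get("/days/251"))
print(get("/days/999")[0])
```

Any consumer that can issue an HTTP GET and parse JSON can then read the data, and since the service refuses everything except GET, the writing side stays with the legacy software.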
Thanks again, folks.