A Guide to Data Serialization (3) [JSON vs. YAML]

Date: 2021-11-28 19:53:07

The previous two installments introduced JSON and YAML; this one works through the article below.

Comparison between JSON and YAML for data serialization

Chapter 1 Introduction

This paper discusses and compares different serialization formats in computer science. In addition to giving a background of the serialization technique and how it is used today, it compares two important lightweight data interchange formats, namely JSON and YAML. Although they are quite similar in terms of usability, there are some distinctions between them, mostly regarding design choices and syntax. These choices do, however, affect their scope of use. The report aims to discuss the differences between the languages, along with their consequences for performance and usability in different use cases.

1.1 Problem statement

There has been an increase in discussions comparing the usability of YAML and JSON as serialization formats in recent years. Even though there are multiple thoughts and opinions on the net, there is a lack of actual general investigation on the subject.

The primary aim of this project is to determine and compare the major differences between YAML and JSON from multiple perspectives. This is done not only through a performance test; the comparison is also based on collected facts and research. From this, conclusions can be drawn regarding their usability and different scopes of use. This comparison will help define the gap between them (if one exists), and will hopefully provide some guidelines to consider for future development involving data serialization.

Chapter 2 Background

This chapter explains and defines background facts regarding the different parts of this report. It includes information about the serialization and parsing process, but focuses on the serialization languages to be compared - YAML and JSON.

2.1 Serialization

2.1.1 Definition

Serialization is a process for converting a data structure or object into a format that can be transmitted through a wire, or stored somewhere for later use [8].

There are many different methods and formats that can be used for serialization. Which method and format to choose depends on the requirements placed on the object or data and on the purpose of the serialization (sending or storing). The choice may also affect the size of the serialized data as well as serialization/deserialization performance in terms of processing time and memory usage.

2.1.2 General method

Common to all serialization methods is the procedure of reading data as a series: once started, the whole object is usually serialized or deserialized. This enables the use of simple I/O interfaces to hold and pass on the state of an object, although difficulties arise in applications that require higher performance through a non-linear storage organization, or when the object contains large amounts of data. These cases require more effort to deal with and are not covered in this paper. The most commonly used data structures when encoding data like this are scalars, maps and sequences (lists or arrays).
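As a minimal sketch of this idea, the Python snippet below serializes a structure built only from scalars, maps and sequences into one linear string and reads it back. The record and its field names are made up for illustration, and JSON is used only because it is one of the formats this paper compares.

```python
import json

# A hypothetical record built only from scalars, maps, and sequences.
record = {
    "name": "Ada",                   # scalar (string)
    "age": 36,                       # scalar (number)
    "tags": ["admin", "dev"],        # sequence
    "address": {"city": "London"},   # map nested inside a map
}

# Serialize: the whole structure is written out as one linear series of characters.
text = json.dumps(record)

# Deserialize: the series is read from start to end and rebuilt into an equal structure.
restored = json.loads(text)

print(text)
print(restored == record)
```

The restored value is a new object that compares equal to the original, which is the "semantically equivalent clone" that serialization is meant to produce.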

Serialization is supported by many popular object-oriented programming languages such as PHP, Ruby, Java, Smalltalk and Python, along with the .NET Framework. All of these languages provide serialization methods either as an implementable interface or as syntactic sugar. For example, .NET provides a serializable attribute [10], and Java uses an interface named Serializable for classes to implement [9]. In Ruby, the term used for serialization is marshaling, and the language provides a module called Marshal for this [16]. This module can often be used without any changes to the definitions of the objects to be serialized. A serialization strategy can be defined when you want to restrict the serialization process (all instance variables are serialized by default) or handle data in specific ways.
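The paragraph above mentions Python among the languages with built-in serialization. As a rough illustration of a custom serialization strategy, the sketch below uses Python's standard pickle module together with the `__getstate__` and `__setstate__` hooks to leave one instance variable out. The `Session` class and its fields are hypothetical and exist only for this example.

```python
import pickle

class Session:
    """Hypothetical class used only for this illustration."""

    def __init__(self, user, password):
        self.user = user
        self.password = password  # sensitive value that should not be stored

    def __getstate__(self):
        # Custom strategy: copy the instance dictionary and drop one field.
        state = self.__dict__.copy()
        del state["password"]
        return state

    def __setstate__(self, state):
        # Restore the saved fields and give the omitted one a default value.
        self.__dict__.update(state)
        self.password = None

original = Session("ed", "secret")
clone = pickle.loads(pickle.dumps(original))

print(clone.user)
print(clone.password)
```

Without the two hooks, pickle would serialize every instance variable, which mirrors the default behaviour described for Ruby's Marshal.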

Most of the standard serialization implementations convert the data into a binary string, which means that the data cannot easily be inspected by a human in its serialized form. Ruby's Marshal module returns a plain text string, which however is not completely readable, as it contains special byte sequences and is not formatted to be easily read by a human.
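The difference in readability can be seen with a small Python sketch, offered as an analogy rather than as the Ruby behaviour described above. It serializes the same made-up dictionary once with the standard pickle module (binary output) and once as JSON (text output).

```python
import json
import pickle

entry = {"user": "ed", "line": 23}

binary_form = pickle.dumps(entry)  # bytes in Python's own pickle format
text_form = json.dumps(entry)      # plain text

# The pickle output is a byte string with non-printable control bytes.
print(type(binary_form).__name__, binary_form[:12])

# The JSON output can be read directly by a human.
print(type(text_form).__name__, text_form)
```

Both forms restore to the same dictionary; only the text form is practical to inspect or edit by hand.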

Example

A concrete example where serialization is needed is storing information from an address book, in this case written in Java. Every instance contains a person with details about their address and phone number. One wants to store all instances on a server exactly as they were created, and there are a few possible solutions:

1. By using Java serialization, which is part of the language. This can easily be done, but problems arise if the data has to be accessible to applications written in C++, Python or another language, as the data is serialized in a way unique to Java.

2. By using an improvised way of encoding the data into single strings, for example encoding four integers as 12:3:-23:67. This solution requires some custom parsing code to be written, and works best for very simple data.

3. By serializing the data into XML. This is an attractive method because XML is human-readable and has bindings (API libraries) for many languages, although it is space-intensive and can cause performance penalties in applications.

Because each of the approaches mentioned above has drawbacks, other solutions are often desirable.
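To make option 2 concrete, here is a minimal Python sketch of the improvised encoding, using the four integers from the text. The function names `encode` and `decode` are chosen for this illustration only.

```python
def encode(values):
    """Join integers with ':' (the improvised format from option 2)."""
    return ":".join(str(value) for value in values)

def decode(text):
    """Custom parsing code: split on ':' and convert each field back to int."""
    return [int(field) for field in text.split(":")]

encoded = encode([12, 3, -23, 67])
print(encoded)           # 12:3:-23:67
print(decode(encoded))   # [12, 3, -23, 67]
```

The scheme works here because integers never contain the separator. As soon as the data includes strings that may contain ':' or nested structures, the custom parser has to grow, which is where a general format becomes attractive.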

2.1.3 Scope of use

Serialization is often used when transmitting data, as mentioned above. Examples include storing user preferences in an object and maintaining security information across pages and applications. In general, serialization can be very helpful when transferring objects between applications, across domains, or through firewalls.

2.2 Parsing

2.2.1 General

In computer science, parsing generally means analyzing written text to determine its grammatical structure with respect to a known formal grammar. In linguistic terms, to parse means to analyze and describe the grammar of a sentence. The parser splits an expression into tokens, which are then inserted into some kind of data structure. This data is then evaluated to interpret the meaning of each expression by the rules of the given grammar, followed by execution of the appropriate action.
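The three steps (splitting into tokens, building a data structure, evaluating it) can be sketched in Python for a toy arithmetic grammar. The grammar, the nested-tuple tree and all function names are assumptions made for this illustration; they are not taken from the paper.

```python
import operator
import re

TOKEN_PATTERN = re.compile(r"\s*(\d+|[-+*/()])")
OPERATORS = {"+": operator.add, "-": operator.sub,
             "*": operator.mul, "/": operator.truediv}

def tokenize(text):
    """Step 1: split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse(tokens):
    """Step 2: build a nested-tuple tree following a small grammar:
    expr   := term (('+' | '-') term)*
    term   := factor (('*' | '/') factor)*
    factor := NUMBER | '(' expr ')'
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def factor():
        token = advance()
        if token == "(":
            node = expr()
            if advance() != ")":
                raise ValueError("expected ')'")
            return node
        if token is None or not token.isdigit():
            raise ValueError(f"unexpected token {token!r}")
        return int(token)

    def term():
        node = factor()
        while peek() in ("*", "/"):
            node = (advance(), node, factor())
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            node = (advance(), node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree

def evaluate(node):
    """Step 3: interpret the tree and perform the matching action."""
    if isinstance(node, int):
        return node
    symbol, left, right = node
    return OPERATORS[symbol](evaluate(left), evaluate(right))

tokens = tokenize("1 + 2 * 3")
tree = parse(tokens)
print(tokens)
print(tree)
print(evaluate(tree))
```

A JSON or YAML parser follows the same pattern, except that the resulting structure is the deserialized data itself rather than an expression to compute.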

2.2.2 Serialization and parsing

Serialization is mainly a method for storing data easily, in the sense of converting data and later restoring it into a semantically equivalent clone. Unless the serialization method writes the data in a fixed order (one that never changes) and expects the data to be read in the same order when deserializing, parsing has to be done when the data is deserialized. During deserialization, parsing identifies the data identifiers (attribute names or the like) and their corresponding values, while often also having to discern the type of each value.
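The contrast between a fixed-order format and a parsed format can be sketched in Python. The example below uses the standard struct module for the fixed layout and JSON for the self-describing one; the field names `line` and `column` are made up for illustration.

```python
import json
import struct

line, column = 23, 58

# Fixed order, no identifiers: the reader must already know that the data
# is two little-endian 32-bit integers, line first and column second.
packed = struct.pack("<ii", line, column)
line_back, column_back = struct.unpack("<ii", packed)

# Self-describing text: identifiers travel with the values, so the order
# may change, but the reader has to parse names, values, and types.
text = json.dumps({"column": column, "line": line})
parsed = json.loads(text)

print(len(packed), (line_back, column_back))
print(text, parsed["line"], parsed["column"])
```

The packed form is smaller and needs no parsing, but swapping the two fields on the writing side would silently corrupt the data on the reading side.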

The following sections aim to introduce JSON and YAML and make no statements about the differences between them. Those are discussed later in the results and conclusions chapters.

2.3 JSON

2.3.1 General

JSON is a subset of the open ECMAScript standard [3] (of which the JavaScript programming language is an implementation). It was created as a way to parse human-readable (plain text) representations of data into valid ECMAScript objects [7]. It is completely language-independent and uses notation similar to that of common programming languages such as C, C++ and Java.

The format has become very popular wherever serialization and interchange of structured data over networks is needed [13]. It is often associated with the modern web because it is frequently used for communication between a web server and a client-side web application.

2.3.2 Origin

JSON was originally introduced as a written specification by Douglas Crockford in 2001 [4], who used the format within his company State Software. Crockford was not the first person to invent the object notation, as other individuals had discovered it independently at about the same time, but he was the first to give it a complete specification based on parts of the JavaScript standard. Following that, he launched the JSON.org website in 2002, which still exists and currently provides a listing of JSON libraries for different programming languages [3]. JSON quickly grew in popularity, partly thanks to its simplicity, which made it much more lightweight (resulting in faster load times over the Internet) than XML, a format frequently used on the web. The other reason for the growth in usage is the increased use of JavaScript on the web.

JSON documents can be parsed in JavaScript by calling the built-in eval function with the JSON string as its argument. The JavaScript interpreter then executes the argument as JavaScript code, constructing an object with the properties defined by the JSON string. This works because JavaScript is a superset of JSON. Using eval is theoretically the most efficient way to parse JSON, as it just invokes the JavaScript interpreter (without any security or constraint checks). The method is quite inelegant, however, since the interpreter does not prevent arbitrary JavaScript code from being executed. In most cases a dedicated JSON parser should be used to avoid security issues and to accept only valid JSON as input. Most modern browsers have had fast native JSON parsers since 2009, which are preferred to using eval [13].
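The same trade-off can be shown with a Python analogue, which is only an illustration and not the JavaScript mechanism described above: Python's built-in `eval` plays the role of JavaScript's eval, and `json.loads` plays the role of the dedicated parser. The `hostile` string is a made-up example of input that is executable code rather than JSON, and it is never passed to `eval` here.

```python
import json

valid = '{"user": "ed", "line": 23}'

# Not JSON, but an executable Python expression.
hostile = '__import__("os").getcwd()'

# eval happens to accept this particular JSON text, because it is also
# a valid Python dict literal.
print(eval(valid))

# The dedicated parser yields the same dictionary.
print(json.loads(valid))

# eval would run the hostile string as code; the dedicated parser
# refuses it because it is not valid JSON.
try:
    json.loads(hostile)
    rejected = False
except json.JSONDecodeError:
    rejected = True
print(rejected)
```

In Python the analogy is looser than in JavaScript, since JSON literals such as true, false and null are not valid Python expressions, which is one more reason to use the dedicated parser.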

2.3.3 Functionality

JSON is a human-readable language, designed foremost for simplicity and universality. It implements basic data types available in most modern programming languages [6]. The fact that it is also easy to read and parse contributes to its usefulness in programming. JSON is also language-independent, meaning that the specification is not tied to any specific programming language (although it was originally based on the JavaScript object notation).

The JSON standard does not support object references, which affects, for example, the ability to store cyclic structures. This functionality can be provided by an extension like dojox.json.ref from the Dojo Toolkit [19], which enables JSON objects to be marked with specific ids that can later be referenced.
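The limitation is easy to reproduce with Python's standard json module, as sketched below. The id-based workaround at the end is a hand-rolled convention invented for this example; it is not the dojox.json.ref format.

```python
import json

# Two dictionaries that refer to each other form a cycle.
parent = {"name": "parent"}
child = {"name": "child", "parent": parent}
parent["child"] = child

try:
    json.dumps(parent)
    error = None
except ValueError as exc:
    error = exc
print(error)

# Hypothetical workaround: give every node an id and store references
# as ids instead of nested objects.
flat = {
    "nodes": [
        {"id": 1, "name": "parent", "child": 2},
        {"id": 2, "name": "child", "parent": 1},
    ]
}
print(json.dumps(flat))
```

With the flattened form, resolving the ids back into object references becomes the job of the application or of an extension library.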

Complex structures can also be built as associative arrays, objects within objects. JSON objects can contain any valid data type, enabling deep data hierarchies in JSON documents.

The JSON format specification does not include support for validating values or structure, but an external specification called JSON Schema exists as a draft [20]. JSON Schema can be used to define the structure of a JSON document much like an XML Schema, for example which data types values should have and whether they are optional or required. The defined schema can then be used to validate JSON documents or to document application APIs.

Valid data types are [2]:
o Numbers (decimal notation with an optional fraction and exponent; infinity and NaN are not permitted)
o Strings (with Unicode support)
o Boolean (true/false)
o Objects (associative arrays / objects with key-value pairs)
o Arrays (ordered lists)
o Null
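As an illustration of how these types surface in a host language, the Python sketch below parses a made-up document containing one value of each kind and prints the resulting Python type. The last part relies on the `allow_nan=False` option of `json.dumps`, which makes Python enforce the rule that infinity is not permitted.

```python
import json

# One value of each JSON data type; keys and values are made up.
document = (
    '{"n": 1.5e3, "i": 42, "s": "caf\\u00e9", "b": true,'
    ' "o": {"k": "v"}, "a": [1, 2], "z": null}'
)
data = json.loads(document)

for key, value in data.items():
    print(key, type(value).__name__, repr(value))

# Infinity has no JSON representation; with allow_nan=False the encoder
# raises an error instead of emitting a non-standard token.
try:
    json.dumps(float("inf"), allow_nan=False)
    infinity_rejected = False
except ValueError:
    infinity_rejected = True
print(infinity_rejected)
```

Note that JSON itself has a single number type; the split into int and float is a choice made by the Python library.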

2.3.4 Syntax

As described above, JSON consists of objects, arrays and scalars. This section describes the general syntax to give an overview of the language and its usability. An example of an arbitrary JSON document can be found in table 2.1 (with code converted from the YAML example [6]).

o Comments are not allowed in the current standard (they were removed by the author in a later revision of the specification [4]).
o Objects (unordered collections of name/value pairs) are denoted with braces ({}).
o Identifiers must be enclosed in quotes (as a string) and are followed by a colon and a value.
o Multiple key-value pairs are separated with a comma.
o Arrays (ordered lists of values) are placed within brackets ([]), with values separated by commas.
o Under the original specification (RFC 4627), the root node of a JSON document must be an object or an array; later revisions (RFC 7159 and RFC 8259) allow any JSON value at the root.
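Several of these rules can be checked with a strict parser. The Python sketch below uses the standard json module; the helper name `is_valid_json` and the sample strings are made up for this illustration.

```python
import json

def is_valid_json(text):
    """Return True if the standard parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

samples = {
    "well-formed object":   '{"User": "ed", "line": 23}',
    "well-formed array":    '["a", "b"]',
    "unquoted identifier":  '{User: "ed"}',
    "missing comma":        '{"User": "ed" "line": 23}',
    "comment after object": '{"User": "ed"} // note',
}

results = {label: is_valid_json(text) for label, text in samples.items()}
for label, accepted in results.items():
    print(f"{label}: {accepted}")
```

Only the first two samples are accepted; the other three each break one of the rules listed above.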

2.3.5 Scope of use

JSON, considered to be a more user-friendly alternative to XML, is often used as a substitute for it. Whereas XML is said to carry a lot of unnecessary baggage, JSON documents can contain the same information while being much more lightweight and easier to read [5]. JSON is most commonly used when exchanging or storing structured data. It is especially common in Ajax web applications, where it provides a standardized data exchange format for JavaScript implementations [11].

2.3.6 Process

JSON is parsed (deserialized) by simple character-by-character reading, constructing structures and objects in a single pass. JavaScript implementations allow an external function (called a reviver) to be passed as a parameter, allowing more specific transformation of the data. Serialization is also done in a single iteration through the data structure: most implementations call a to_json (or similarly named) method, defined either by the implementation or by the user, and append the result of this call to the JSON output.
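Python's standard json module offers close counterparts to both hooks, which the sketch below uses as an illustration: the `object_hook` argument of `json.loads` plays a role similar to a reviver, and the `default` argument of `json.dumps` is where a user-defined `to_json` method can be called. The `Person` class, its `to_json` method and the helper functions are hypothetical.

```python
import json
from datetime import date

class Person:
    """Hypothetical class with a user-defined to_json method."""

    def __init__(self, name, born):
        self.name = name
        self.born = born

    def to_json(self):
        return {"name": self.name, "born": self.born.isoformat()}

def encode_custom(obj):
    # Called by json.dumps for objects it cannot serialize by itself.
    if hasattr(obj, "to_json"):
        return obj.to_json()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")

def revive(mapping):
    # Called by json.loads for every decoded JSON object.
    if "born" in mapping:
        mapping["born"] = date.fromisoformat(mapping["born"])
    return mapping

text = json.dumps([Person("Ada", date(1815, 12, 10))], default=encode_custom)
restored = json.loads(text, object_hook=revive)

print(text)
print(restored)
```

The date survives the round trip as a `date` object only because the reviver-like hook converts it back; plain JSON would leave it as a string.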

Table 2.1: Log file for an arbitrary application in JSON format

[
  {
    "User": "ed",
    "Time": "2001-11-23 15:01:42 -5",
    "Warning": "This is an error message for the log file"
  },
  {
    "User": "ed",
    "Time": "2001-11-23 15:02:31 -5",
    "Warning": "A slightly different error message."
  },
  {
    "User": "ed",
    "Date": "2001-11-23 15:03:17 -5",
    "Fatal": "Unknown variable \"bar\"",
    "Stack": [
      {
        "code": "x = MoreObject(\"345\\n\")\n",
        "line": 23,
        "file": "TopClass.py"
      },
      {
        "code": "foo = bar",
        "line": 58,
        "file": "MoreClass.py"
      }
    ]
  }
]

A JSON document with an array containing multiple log entries.

...To be continued; please be patient...