I have a huge JSON file titled something.json. The file is 20 MB. I'm reading this in with GSON. It's being read on a standard Android Nexus 5X.
我有一个名为something.json的大型JSON文件。这个文件是20mb,我正在和GSON一起读。它被标准的安卓Nexus 5X读取。
Example of the Json:
Json的例子:
[
{"country":"UA","name":"Hurzuf","_id":707860,"coord":{"lon":34.283333,"lat":44.549999}},
{"country":"UA","name":"Il’ichëvka","_id":707716,"coord":{"lon":34.383331,"lat":44.666668}},
{"country":"BG","name":"Rastnik","_id":727762,"coord":{"lon":25.283331,"lat":41.400002}}
...
]
Code:
代码:
@Override
protected ArrayList<City> doInBackground(File... files) {
ArrayList<City> cities = new ArrayList<>();
try {
InputStream is = new FileInputStream(files[0]);
JsonReader reader = new JsonReader(new InputStreamReader(is, "UTF-8"));
reader.beginArray();
while (reader.hasNext()) {
City city = new Gson().fromJson(reader, City.class);
cities.add(city);
}
reader.endArray();
reader.close();
} catch (Exception e) {
mResult.onFinish(cities, e.getMessage());
}
Collections.sort(cities, (o1, o2) -> o1.getName().compareTo(o2.getName()));
mResult.onFinish(cities, CityService.SUCCESS);
return cities;
}
Library used:
图书馆使用:
com.google.code.gson:gson:2.8.0
It needs to work from Android API 16 till the latest.
它需要从Android API 16一直工作到最近。
I need to read this in to mCities, and sort it alphabetically on city name. Right now this takes 3 minutes and it has to be done in around 10 seconds. My approach is to cut up the json file in 10 smaller chunks, read these in, concatenate and sort them.
我需要把它读给mCities,然后按字母顺序排序。现在这需要3分钟,而且必须在10秒内完成。我的方法是将json文件分割成10个较小的块,读取这些块,并将它们连接起来并进行排序。
So my question is: how to divide the file in smaller chunks and is this the correct approach to solve this problem?
我的问题是:如何将文件分成更小的块,这是解决这个问题的正确方法吗?
Link to the file: http://www.jimclermonts.nl/docs/cities.json
链接到文件:http://www.jimclermonts.nl/docs/cities.json
1 个解决方案
#1
1
I mostly never do Android coding per se, but I have some notes and probably ideas for you to go with since this is pure Java. Your reader does very excessive work while reading each element. First of all, you don't need to create Gson
every time you need it:
我基本上从来不做Android代码本身,但是我有一些笔记和想法供你们参考,因为这是纯Java。你的读者在阅读每一个元素时都要做大量的工作。首先,您不需要每次都创建Gson:
- It's immutable and thread-safe.
- 它是不可变的,线程安全的。
- It's relatively expensive to create.
- 这是相对昂贵的创造。
- Instantiating a
Gson
instance also hits the heap taking more time to execute and then garbage-collect. - 实例化Gson实例也会导致堆需要更多的时间来执行,然后垃圾收集。
Next, there is a difference between only-deserialization and JSON stream reading in Gson: the first may use a heavy type adapters composition under the hood, whilst the latter simply can parse JSON documents token by token. Having that said, you can gain a better performance while reading the JSON stream: your JSON file is really known to have a very strict structure so the high-level parser can be implemented much simpler.
接下来,在Gson中,单反序列化和JSON流读取之间有一个区别:第一个可能使用大类型适配器组合,而后者只是通过标记解析JSON文档令牌。话虽如此,您可以在读取JSON流时获得更好的性能:众所周知,您的JSON文件具有非常严格的结构,因此可以更简单地实现高级解析器。
Suppose a simple test suite with different implementations for your problem:
假设有一个简单的测试套件,针对您的问题有不同的实现:
Data objects
City.java
final class City {
@SerializedName("_id")
final int id;
@SerializedName("country")
final String country;
@SerializedName("name")
final String name;
@SerializedName("coord")
final Coordinates coordinates;
private City(final int id, final String country, final String name, final Coordinates coordinates) {
this.id = id;
this.country = country;
this.name = name;
this.coordinates = coordinates;
}
static City of(final int id, final String country, final String name, final Coordinates coordinates) {
return new City(id, country, name, coordinates);
}
@Override
public boolean equals(final Object o) {
if ( this == o ) {
return true;
}
if ( o == null || getClass() != o.getClass() ) {
return false;
}
final City that = (City) o;
return id == that.id;
}
@Override
public int hashCode() {
return id;
}
@SuppressWarnings("ConstantConditions")
public static int compareByName(final City city1, final City city2) {
return city1.name.compareTo(city2.name);
}
}
Coordinates.java
final class Coordinates {
@SerializedName("lat")
final double latitude;
@SerializedName("lon")
final double longitude;
private Coordinates(final double latitude, final double longitude) {
this.latitude = latitude;
this.longitude = longitude;
}
static Coordinates of(final double latitude, final double longitude) {
return new Coordinates(latitude, longitude);
}
@Override
public boolean equals(final Object o) {
if ( this == o ) {
return true;
}
if ( o == null || getClass() != o.getClass() ) {
return false;
}
final Coordinates that = (Coordinates) o;
return Double.compare(that.latitude, latitude) == 0
&& Double.compare(that.longitude, longitude) == 0;
}
@Override
public int hashCode() {
final long latitudeBits = Double.doubleToLongBits(latitude);
final long longitudeBits = Double.doubleToLongBits(longitude);
final int latitudeHash = (int) (latitudeBits ^ latitudeBits >>> 32);
final int longitudeHash = (int) (longitudeBits ^ longitudeBits >>> 32);
return 31 * latitudeHash + longitudeHash;
}
}
Test infrastructure
ITest.java
interface ITest {
@Nonnull
default String getName() {
return getClass().getSimpleName();
}
@Nonnull
Collection<City> test(@Nonnull JsonReader jsonReader)
throws IOException;
}
main
public static void main(final String... args)
throws IOException {
final Iterable<ITest> tests = ImmutableList.of(
FirstTest.get(),
ReadAsWholeListTest.get(),
ReadAsWholeTreeSetTest.get(),
ReadJsonStreamIntoListTest.get(),
ReadJsonStreamIntoTreeSetTest.get(),
ReadJsonStreamIntoListChunksTest.get()
);
for ( int i = 0; i < 3; i++ ) {
for ( final ITest test : tests ) {
try ( final ZipInputStream zipInputStream = new ZipInputStream(Resources.getPackageResourceInputStream(Q49273660.class, "cities.json.zip")) ) {
for ( ZipEntry zipEntry = zipInputStream.getNextEntry(); zipEntry != null; zipEntry = zipInputStream.getNextEntry() ) {
if ( zipEntry.getName().equals("cities.json") ) {
final JsonReader jsonReader = new JsonReader(new InputStreamReader(zipInputStream)); // do not close
System.out.printf("%1$35s : ", test.getName());
final Stopwatch stopwatch = Stopwatch.createStarted();
final Collection<City> cities = test.test(jsonReader);
System.out.printf("in %d ms with %d elements\n", stopwatch.elapsed(TimeUnit.MILLISECONDS), cities.size());
assertSorted(cities, City::compareByName);
}
}
}
}
System.out.println("--------------------");
}
}
private static <E> void assertSorted(final Iterable<? extends E> iterable, final Comparator<? super E> comparator) {
final Iterator<? extends E> iterator = iterable.iterator();
if ( !iterator.hasNext() ) {
return;
}
E a = iterator.next();
if ( !iterator.hasNext() ) {
return;
}
do {
final E b = iterator.next();
if ( comparator.compare(a, b) > 0 ) {
throw new AssertionError(a + " " + b);
}
a = b;
} while ( iterator.hasNext() );
}
Tests
FirstTest.java
This is the slowest one. And it's just an adaptation of your question to the tests.
这是最慢的。这只是你对测试问题的一种适应。
final class FirstTest
implements ITest {
private static final ITest instance = new FirstTest();
private FirstTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public List<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
jsonReader.beginArray();
final List<City> cities = new ArrayList<>();
while ( jsonReader.hasNext() ) {
final City city = new Gson().fromJson(jsonReader, City.class);
cities.add(city);
}
jsonReader.endArray();
cities.sort(City::compareByName);
return cities;
}
}
ReadAsWholeListTest.java
This is most likely how you might implement it. It's not the winner, but it's the simplest one, and it uses default sorting.
这很可能是您实现它的方式。它不是赢家,但它是最简单的,它使用默认排序。
final class ReadAsWholeListTest
implements ITest {
private static final ITest instance = new ReadAsWholeListTest();
private ReadAsWholeListTest() {
}
static ITest get() {
return instance;
}
private static final Gson gson = new Gson();
private static final Type citiesListType = new TypeToken<List<City>>() {
}.getType();
@Nonnull
@Override
public List<City> test(@Nonnull final JsonReader jsonReader) {
final List<City> cities = gson.fromJson(jsonReader, citiesListType);
cities.sort(City::compareByName);
return cities;
}
}
ReadAsWholeTreeSetTest.java
Another idea, if you're not bound to lists, is using an already-sorted collections like TreeSet
. Since I don't know if there's a way to specify a new TreeSet
comparator mechanism in Gson
, it must use a custom type adapter factory (but this is not required if City
is already comparable by name, however it's not flexible).
如果不绑定到列表,另一个想法是使用已经排序的集合,如TreeSet。由于我不知道是否有一种方法可以在Gson中指定一个新的TreeSet比较器机制,所以它必须使用一个自定义类型适配器工厂(但是如果City已经按名称进行比较,则不需要这样做,但是它不灵活)。
final class ReadAsWholeTreeSetTest
implements ITest {
private static final ITest instance = new ReadAsWholeTreeSetTest();
private ReadAsWholeTreeSetTest() {
}
static ITest get() {
return instance;
}
@SuppressWarnings({ "rawtypes", "unchecked" })
private static final TypeToken<TreeSet<?>> rawTreeSetType = (TypeToken) TypeToken.get(TreeSet.class);
private static final Map<Type, Comparator<?>> comparatorsRegistry = ImmutableMap.of(
City.class, (Comparator<City>) City::compareByName
);
private static final Gson gson = new GsonBuilder()
.registerTypeAdapterFactory(new TypeAdapterFactory() {
@Override
public <T> TypeAdapter<T> create(final Gson gson, final TypeToken<T> typeToken) {
if ( !TreeSet.class.isAssignableFrom(typeToken.getRawType()) ) {
return null;
}
final Type elementType = ((ParameterizedType) typeToken.getType()).getActualTypeArguments()[0];
@SuppressWarnings({ "rawtypes", "unchecked" })
final Comparator<Object> comparator = (Comparator) comparatorsRegistry.get(elementType);
if ( comparator == null ) {
return null;
}
final TypeAdapter<TreeSet<?>> originalTreeSetTypeAdapter = gson.getDelegateAdapter(this, rawTreeSetType);
final TypeAdapter<?> originalElementTypeAdapter = gson.getDelegateAdapter(this, TypeToken.get(elementType));
final TypeAdapter<TreeSet<Object>> treeSetTypeAdapter = new TypeAdapter<TreeSet<Object>>() {
@Override
public void write(final JsonWriter jsonWriter, final TreeSet<Object> treeSet)
throws IOException {
originalTreeSetTypeAdapter.write(jsonWriter, treeSet);
}
@Override
public TreeSet<Object> read(final JsonReader jsonReader)
throws IOException {
jsonReader.beginArray();
final TreeSet<Object> elements = new TreeSet<>(comparator);
while ( jsonReader.hasNext() ) {
final Object element = originalElementTypeAdapter.read(jsonReader);
elements.add(element);
}
return elements;
}
}.nullSafe();
@SuppressWarnings({ "rawtypes", "unchecked" })
final TypeAdapter<T> castTreeSetTypeAdapter = (TypeAdapter<T>) treeSetTypeAdapter;
return castTreeSetTypeAdapter;
}
})
.create();
private static final Type citiesSetType = new TypeToken<TreeSet<City>>() {
}.getType();
@Nonnull
@Override
public Set<City> test(@Nonnull final JsonReader jsonReader) {
return gson.fromJson(jsonReader, citiesSetType);
}
}
JSON stream reader tests
The following class is a special reader test that uses a simplified strategy of reading the cities JSON.
下面的类是一个特殊的读取器测试,它使用了读取城市JSON的简化策略。
AbstractJsonStreamTest.java
It's probably as simplest as possible (in terms of JSON structure analysis), and it requires the JSON document to be very strict.
它可能是尽可能简单的(就JSON结构分析而言),并且它要求JSON文档非常严格。
abstract class AbstractJsonStreamTest
implements ITest {
protected static void read(final JsonReader jsonReader, final Consumer<? super City> cityConsumer)
throws IOException {
jsonReader.beginArray();
while ( jsonReader.hasNext() ) {
jsonReader.beginObject();
require(jsonReader, "country");
final String country = jsonReader.nextString();
require(jsonReader, "name");
final String name = jsonReader.nextString();
require(jsonReader, "_id");
final int id = jsonReader.nextInt();
require(jsonReader, "coord");
jsonReader.beginObject();
require(jsonReader, "lon");
final double longitude = jsonReader.nextDouble();
require(jsonReader, "lat");
final double latitude = jsonReader.nextDouble();
jsonReader.endObject();
jsonReader.endObject();
final City city = City.of(id, country, name, Coordinates.of(latitude, longitude));
cityConsumer.accept(city);
}
jsonReader.endArray();
}
private static void require(final JsonReader jsonReader, final String expectedName)
throws IOException {
final String actualName = jsonReader.nextName();
if ( !actualName.equals(expectedName) ) {
throw new JsonParseException("Expected " + expectedName + " but was " + actualName);
}
}
}
ReadJsonStreamIntoListTest.java
This one is pretty much like ReadAsWholeListTest
but it uses a simplified deserialization mechanism.
这个很像ReadAsWholeListTest,但是它使用了一个简化的反序列化机制。
final class ReadJsonStreamIntoListTest
extends AbstractJsonStreamTest {
private static final ITest instance = new ReadJsonStreamIntoListTest();
private ReadJsonStreamIntoListTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public Collection<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
final List<City> cities = new ArrayList<>();
read(jsonReader, cities::add);
cities.sort(City::compareByName);
return cities;
}
}
ReadJsonStreamIntoTreeSetTest.java
This one, like the previous one, is also just another implementation of a more expensive implementation (ReadAsWholeTreeSetTest
), however it does not require a custom type adatpter.
与前一个一样,这个实现也是一个更昂贵的实现(ReadAsWholeTreeSetTest)的另一个实现,但是它不需要自定义类型adatpter。
final class ReadJsonStreamIntoTreeSetTest
extends AbstractJsonStreamTest {
private static final ITest instance = new ReadJsonStreamIntoTreeSetTest();
private ReadJsonStreamIntoTreeSetTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public Collection<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
final Collection<City> cities = new TreeSet<>(City::compareByName);
read(jsonReader, cities::add);
return cities;
}
}
ReadJsonStreamIntoListChunksTest.java
The following test is based on your initial idea, but it does not sort the chunks in parallel (I'm not sure but you could give it a try). I still think that the previous two are simpler and probably easier to maintain and give more performance gain.
下面的测试是基于您最初的想法,但是它并不是并行地对块进行排序(我不确定,但是您可以尝试一下)。我仍然认为前两种方法更简单,可能更易于维护,并提供更多性能收益。
final class ReadJsonStreamIntoListChunksTest
extends AbstractJsonStreamTest {
private static final ITest instance = new ReadJsonStreamIntoListChunksTest();
private ReadJsonStreamIntoListChunksTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public List<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
final Collection<List<City>> cityChunks = new ArrayList<>();
final AtomicReference<List<City>> cityChunkRef = new AtomicReference<>(new ArrayList<>());
read(jsonReader, city -> {
final List<City> cityChunk = cityChunkRef.get();
cityChunk.add(city);
if ( cityChunk.size() >= 10000 ) {
cityChunks.add(cityChunk);
cityChunkRef.set(new ArrayList<>());
}
});
if ( !cityChunkRef.get().isEmpty() ) {
cityChunks.add(cityChunkRef.get());
}
for ( final List<City> cities : cityChunks ) {
Collections.sort(cities, City::compareByName);
}
return merge(cityChunks, City::compareByName);
}
/**
* <p>Adapted from:</p>
* <ul>
* <li>Original question: https://*.com/questions/1774256/java-code-review-merge-sorted-lists-into-a-single-sorted-list</li>
* <li>Accepted answer: https://*.com/questions/1774256/java-code-review-merge-sorted-lists-into-a-single-sorted-list/1775748#1775748</li>
* </ul>
*/
@SuppressWarnings("MethodCallInLoopCondition")
private static <E> List<E> merge(final Iterable<? extends List<E>> lists, final Comparator<? super E> comparator) {
int totalSize = 0;
for ( final List<E> l : lists ) {
totalSize += l.size();
}
final List<E> result = new ArrayList<>(totalSize);
while ( result.size() < totalSize ) { // while we still have something to add
List<E> lowest = null;
for ( final List<E> l : lists ) {
if ( !l.isEmpty() ) {
if ( lowest == null || comparator.compare(l.get(0), lowest.get(0)) <= 0 ) {
lowest = l;
}
}
}
assert lowest != null;
result.add(lowest.get(0));
lowest.remove(0);
}
return result;
}
}
Test results
For my desktop JRE I could obtain the following test results:
对于我的桌面JRE,我可以得到以下测试结果:
FirstTest : in 5797 ms with 209557 elements
ReadAsWholeListTest : in 796 ms with 209557 elements
ReadAsWholeTreeSetTest : in 733 ms with 162006 elements
ReadJsonStreamIntoListTest : in 461 ms with 209557 elements
ReadJsonStreamIntoTreeSetTest : in 452 ms with 162006 elements
ReadJsonStreamIntoListChunksTest : in 607 ms with 209557 elements
--------------------
FirstTest : in 3396 ms with 209557 elements
ReadAsWholeListTest : in 493 ms with 209557 elements
ReadAsWholeTreeSetTest : in 520 ms with 162006 elements
ReadJsonStreamIntoListTest : in 385 ms with 209557 elements
ReadJsonStreamIntoTreeSetTest : in 377 ms with 162006 elements
ReadJsonStreamIntoListChunksTest : in 540 ms with 209557 elements
--------------------
FirstTest : in 3448 ms with 209557 elements
ReadAsWholeListTest : in 429 ms with 209557 elements
ReadAsWholeTreeSetTest : in 421 ms with 162006 elements
ReadJsonStreamIntoListTest : in 400 ms with 209557 elements
ReadJsonStreamIntoTreeSetTest : in 385 ms with 162006 elements
ReadJsonStreamIntoListChunksTest : in 480 ms with 209557 elements
--------------------
As you can see, creating excessive Gson
instance is definitely a wrong idea. The more optimized tests gain better performance. However, splitting a large list into sorted chunks (no parallel) to be merged later does not give much performance gain in my environment.
如您所见,创建过多的Gson实例绝对是错误的想法。更优化的测试获得更好的性能。但是,将一个大列表拆分为已排序的块(不并行)之后,在我的环境中不会有太多的性能提升。
For simplicity and probably the best choice, I would go with ReadJsonStreamInto_Collection_Test
depending on the required collection. I'm not really sure how fine would it work in a real Android environment, but you can simply do some JSON deserialization a bit better than Gson does using its internals.
为了简单,也可能是最好的选择,我将根据需要的集合使用ReadJsonStreamInto_Collection_Test。我不确定在真正的Android环境下它能有多好,但是您可以简单地做一些JSON反序列化,比Gson使用它的内部特性要好一点。
By the way:
顺便说一下:
- I'm not really sure, but did you notice 162006 unique cities? Your JSON file probably has some duplicates (at least if its
_id
is the identity). - 我不太确定,但是你注意到162006年的独特城市了吗?您的JSON文件可能有一些重复(至少如果其_id是标识的话)。
- What if you simply generate a sorted version of
cities.json
in advance on your workstation before you use it on an Android device? Additionally, you might want to filter out the duplicates if my above assumption is correct. - 如果你只是生成一个有序的城市版本。在Android设备上使用json之前,先在你的工作站上使用json ?此外,如果我的上述假设是正确的,您可能想要过滤掉重复的内容。
#1
1
I mostly never do Android coding per se, but I have some notes and probably ideas for you to go with since this is pure Java. Your reader does very excessive work while reading each element. First of all, you don't need to create Gson
every time you need it:
我基本上从来不做Android代码本身,但是我有一些笔记和想法供你们参考,因为这是纯Java。你的读者在阅读每一个元素时都要做大量的工作。首先,您不需要每次都创建Gson:
- It's immutable and thread-safe.
- 它是不可变的,线程安全的。
- It's relatively expensive to create.
- 这是相对昂贵的创造。
- Instantiating a
Gson
instance also hits the heap taking more time to execute and then garbage-collect. - 实例化Gson实例也会导致堆需要更多的时间来执行,然后垃圾收集。
Next, there is a difference between only-deserialization and JSON stream reading in Gson: the first may use a heavy type adapters composition under the hood, whilst the latter simply can parse JSON documents token by token. Having that said, you can gain a better performance while reading the JSON stream: your JSON file is really known to have a very strict structure so the high-level parser can be implemented much simpler.
接下来,在Gson中,单反序列化和JSON流读取之间有一个区别:第一个可能使用大类型适配器组合,而后者只是通过标记解析JSON文档令牌。话虽如此,您可以在读取JSON流时获得更好的性能:众所周知,您的JSON文件具有非常严格的结构,因此可以更简单地实现高级解析器。
Suppose a simple test suite with different implementations for your problem:
假设有一个简单的测试套件,针对您的问题有不同的实现:
Data objects
City.java
final class City {
@SerializedName("_id")
final int id;
@SerializedName("country")
final String country;
@SerializedName("name")
final String name;
@SerializedName("coord")
final Coordinates coordinates;
private City(final int id, final String country, final String name, final Coordinates coordinates) {
this.id = id;
this.country = country;
this.name = name;
this.coordinates = coordinates;
}
static City of(final int id, final String country, final String name, final Coordinates coordinates) {
return new City(id, country, name, coordinates);
}
@Override
public boolean equals(final Object o) {
if ( this == o ) {
return true;
}
if ( o == null || getClass() != o.getClass() ) {
return false;
}
final City that = (City) o;
return id == that.id;
}
@Override
public int hashCode() {
return id;
}
@SuppressWarnings("ConstantConditions")
public static int compareByName(final City city1, final City city2) {
return city1.name.compareTo(city2.name);
}
}
Coordinates.java
final class Coordinates {
@SerializedName("lat")
final double latitude;
@SerializedName("lon")
final double longitude;
private Coordinates(final double latitude, final double longitude) {
this.latitude = latitude;
this.longitude = longitude;
}
static Coordinates of(final double latitude, final double longitude) {
return new Coordinates(latitude, longitude);
}
@Override
public boolean equals(final Object o) {
if ( this == o ) {
return true;
}
if ( o == null || getClass() != o.getClass() ) {
return false;
}
final Coordinates that = (Coordinates) o;
return Double.compare(that.latitude, latitude) == 0
&& Double.compare(that.longitude, longitude) == 0;
}
@Override
public int hashCode() {
final long latitudeBits = Double.doubleToLongBits(latitude);
final long longitudeBits = Double.doubleToLongBits(longitude);
final int latitudeHash = (int) (latitudeBits ^ latitudeBits >>> 32);
final int longitudeHash = (int) (longitudeBits ^ longitudeBits >>> 32);
return 31 * latitudeHash + longitudeHash;
}
}
Test infrastructure
ITest.java
interface ITest {
@Nonnull
default String getName() {
return getClass().getSimpleName();
}
@Nonnull
Collection<City> test(@Nonnull JsonReader jsonReader)
throws IOException;
}
main
public static void main(final String... args)
throws IOException {
final Iterable<ITest> tests = ImmutableList.of(
FirstTest.get(),
ReadAsWholeListTest.get(),
ReadAsWholeTreeSetTest.get(),
ReadJsonStreamIntoListTest.get(),
ReadJsonStreamIntoTreeSetTest.get(),
ReadJsonStreamIntoListChunksTest.get()
);
for ( int i = 0; i < 3; i++ ) {
for ( final ITest test : tests ) {
try ( final ZipInputStream zipInputStream = new ZipInputStream(Resources.getPackageResourceInputStream(Q49273660.class, "cities.json.zip")) ) {
for ( ZipEntry zipEntry = zipInputStream.getNextEntry(); zipEntry != null; zipEntry = zipInputStream.getNextEntry() ) {
if ( zipEntry.getName().equals("cities.json") ) {
final JsonReader jsonReader = new JsonReader(new InputStreamReader(zipInputStream)); // do not close
System.out.printf("%1$35s : ", test.getName());
final Stopwatch stopwatch = Stopwatch.createStarted();
final Collection<City> cities = test.test(jsonReader);
System.out.printf("in %d ms with %d elements\n", stopwatch.elapsed(TimeUnit.MILLISECONDS), cities.size());
assertSorted(cities, City::compareByName);
}
}
}
}
System.out.println("--------------------");
}
}
private static <E> void assertSorted(final Iterable<? extends E> iterable, final Comparator<? super E> comparator) {
final Iterator<? extends E> iterator = iterable.iterator();
if ( !iterator.hasNext() ) {
return;
}
E a = iterator.next();
if ( !iterator.hasNext() ) {
return;
}
do {
final E b = iterator.next();
if ( comparator.compare(a, b) > 0 ) {
throw new AssertionError(a + " " + b);
}
a = b;
} while ( iterator.hasNext() );
}
Tests
FirstTest.java
This is the slowest one. And it's just an adaptation of your question to the tests.
这是最慢的。这只是你对测试问题的一种适应。
final class FirstTest
implements ITest {
private static final ITest instance = new FirstTest();
private FirstTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public List<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
jsonReader.beginArray();
final List<City> cities = new ArrayList<>();
while ( jsonReader.hasNext() ) {
final City city = new Gson().fromJson(jsonReader, City.class);
cities.add(city);
}
jsonReader.endArray();
cities.sort(City::compareByName);
return cities;
}
}
ReadAsWholeListTest.java
This is most likely how you might implement it. It's not the winner, but it's the simplest one, and it uses default sorting.
这很可能是您实现它的方式。它不是赢家,但它是最简单的,它使用默认排序。
final class ReadAsWholeListTest
implements ITest {
private static final ITest instance = new ReadAsWholeListTest();
private ReadAsWholeListTest() {
}
static ITest get() {
return instance;
}
private static final Gson gson = new Gson();
private static final Type citiesListType = new TypeToken<List<City>>() {
}.getType();
@Nonnull
@Override
public List<City> test(@Nonnull final JsonReader jsonReader) {
final List<City> cities = gson.fromJson(jsonReader, citiesListType);
cities.sort(City::compareByName);
return cities;
}
}
ReadAsWholeTreeSetTest.java
Another idea, if you're not bound to lists, is using an already-sorted collections like TreeSet
. Since I don't know if there's a way to specify a new TreeSet
comparator mechanism in Gson
, it must use a custom type adapter factory (but this is not required if City
is already comparable by name, however it's not flexible).
如果不绑定到列表,另一个想法是使用已经排序的集合,如TreeSet。由于我不知道是否有一种方法可以在Gson中指定一个新的TreeSet比较器机制,所以它必须使用一个自定义类型适配器工厂(但是如果City已经按名称进行比较,则不需要这样做,但是它不灵活)。
final class ReadAsWholeTreeSetTest
implements ITest {
private static final ITest instance = new ReadAsWholeTreeSetTest();
private ReadAsWholeTreeSetTest() {
}
static ITest get() {
return instance;
}
@SuppressWarnings({ "rawtypes", "unchecked" })
private static final TypeToken<TreeSet<?>> rawTreeSetType = (TypeToken) TypeToken.get(TreeSet.class);
private static final Map<Type, Comparator<?>> comparatorsRegistry = ImmutableMap.of(
City.class, (Comparator<City>) City::compareByName
);
private static final Gson gson = new GsonBuilder()
.registerTypeAdapterFactory(new TypeAdapterFactory() {
@Override
public <T> TypeAdapter<T> create(final Gson gson, final TypeToken<T> typeToken) {
if ( !TreeSet.class.isAssignableFrom(typeToken.getRawType()) ) {
return null;
}
final Type elementType = ((ParameterizedType) typeToken.getType()).getActualTypeArguments()[0];
@SuppressWarnings({ "rawtypes", "unchecked" })
final Comparator<Object> comparator = (Comparator) comparatorsRegistry.get(elementType);
if ( comparator == null ) {
return null;
}
final TypeAdapter<TreeSet<?>> originalTreeSetTypeAdapter = gson.getDelegateAdapter(this, rawTreeSetType);
final TypeAdapter<?> originalElementTypeAdapter = gson.getDelegateAdapter(this, TypeToken.get(elementType));
final TypeAdapter<TreeSet<Object>> treeSetTypeAdapter = new TypeAdapter<TreeSet<Object>>() {
@Override
public void write(final JsonWriter jsonWriter, final TreeSet<Object> treeSet)
throws IOException {
originalTreeSetTypeAdapter.write(jsonWriter, treeSet);
}
@Override
public TreeSet<Object> read(final JsonReader jsonReader)
throws IOException {
jsonReader.beginArray();
final TreeSet<Object> elements = new TreeSet<>(comparator);
while ( jsonReader.hasNext() ) {
final Object element = originalElementTypeAdapter.read(jsonReader);
elements.add(element);
}
return elements;
}
}.nullSafe();
@SuppressWarnings({ "rawtypes", "unchecked" })
final TypeAdapter<T> castTreeSetTypeAdapter = (TypeAdapter<T>) treeSetTypeAdapter;
return castTreeSetTypeAdapter;
}
})
.create();
private static final Type citiesSetType = new TypeToken<TreeSet<City>>() {
}.getType();
@Nonnull
@Override
public Set<City> test(@Nonnull final JsonReader jsonReader) {
return gson.fromJson(jsonReader, citiesSetType);
}
}
JSON stream reader tests
The following class is a special reader test that uses a simplified strategy of reading the cities JSON.
下面的类是一个特殊的读取器测试,它使用了读取城市JSON的简化策略。
AbstractJsonStreamTest.java
It's probably as simplest as possible (in terms of JSON structure analysis), and it requires the JSON document to be very strict.
它可能是尽可能简单的(就JSON结构分析而言),并且它要求JSON文档非常严格。
abstract class AbstractJsonStreamTest
implements ITest {
protected static void read(final JsonReader jsonReader, final Consumer<? super City> cityConsumer)
throws IOException {
jsonReader.beginArray();
while ( jsonReader.hasNext() ) {
jsonReader.beginObject();
require(jsonReader, "country");
final String country = jsonReader.nextString();
require(jsonReader, "name");
final String name = jsonReader.nextString();
require(jsonReader, "_id");
final int id = jsonReader.nextInt();
require(jsonReader, "coord");
jsonReader.beginObject();
require(jsonReader, "lon");
final double longitude = jsonReader.nextDouble();
require(jsonReader, "lat");
final double latitude = jsonReader.nextDouble();
jsonReader.endObject();
jsonReader.endObject();
final City city = City.of(id, country, name, Coordinates.of(latitude, longitude));
cityConsumer.accept(city);
}
jsonReader.endArray();
}
private static void require(final JsonReader jsonReader, final String expectedName)
throws IOException {
final String actualName = jsonReader.nextName();
if ( !actualName.equals(expectedName) ) {
throw new JsonParseException("Expected " + expectedName + " but was " + actualName);
}
}
}
ReadJsonStreamIntoListTest.java
This one is pretty much like ReadAsWholeListTest
but it uses a simplified deserialization mechanism.
这个很像ReadAsWholeListTest,但是它使用了一个简化的反序列化机制。
final class ReadJsonStreamIntoListTest
extends AbstractJsonStreamTest {
private static final ITest instance = new ReadJsonStreamIntoListTest();
private ReadJsonStreamIntoListTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public Collection<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
final List<City> cities = new ArrayList<>();
read(jsonReader, cities::add);
cities.sort(City::compareByName);
return cities;
}
}
ReadJsonStreamIntoTreeSetTest.java
This one, like the previous one, is also just another implementation of a more expensive implementation (ReadAsWholeTreeSetTest
), however it does not require a custom type adatpter.
与前一个一样,这个实现也是一个更昂贵的实现(ReadAsWholeTreeSetTest)的另一个实现,但是它不需要自定义类型adatpter。
final class ReadJsonStreamIntoTreeSetTest
extends AbstractJsonStreamTest {
private static final ITest instance = new ReadJsonStreamIntoTreeSetTest();
private ReadJsonStreamIntoTreeSetTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public Collection<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
final Collection<City> cities = new TreeSet<>(City::compareByName);
read(jsonReader, cities::add);
return cities;
}
}
ReadJsonStreamIntoListChunksTest.java
The following test is based on your initial idea, but it does not sort the chunks in parallel (I'm not sure but you could give it a try). I still think that the previous two are simpler and probably easier to maintain and give more performance gain.
下面的测试是基于您最初的想法,但是它并不是并行地对块进行排序(我不确定,但是您可以尝试一下)。我仍然认为前两种方法更简单,可能更易于维护,并提供更多性能收益。
final class ReadJsonStreamIntoListChunksTest
extends AbstractJsonStreamTest {
private static final ITest instance = new ReadJsonStreamIntoListChunksTest();
private ReadJsonStreamIntoListChunksTest() {
}
static ITest get() {
return instance;
}
@Nonnull
@Override
public List<City> test(@Nonnull final JsonReader jsonReader)
throws IOException {
final Collection<List<City>> cityChunks = new ArrayList<>();
final AtomicReference<List<City>> cityChunkRef = new AtomicReference<>(new ArrayList<>());
read(jsonReader, city -> {
final List<City> cityChunk = cityChunkRef.get();
cityChunk.add(city);
if ( cityChunk.size() >= 10000 ) {
cityChunks.add(cityChunk);
cityChunkRef.set(new ArrayList<>());
}
});
if ( !cityChunkRef.get().isEmpty() ) {
cityChunks.add(cityChunkRef.get());
}
for ( final List<City> cities : cityChunks ) {
Collections.sort(cities, City::compareByName);
}
return merge(cityChunks, City::compareByName);
}
/**
* <p>Adapted from:</p>
* <ul>
* <li>Original question: https://*.com/questions/1774256/java-code-review-merge-sorted-lists-into-a-single-sorted-list</li>
* <li>Accepted answer: https://*.com/questions/1774256/java-code-review-merge-sorted-lists-into-a-single-sorted-list/1775748#1775748</li>
* </ul>
*/
@SuppressWarnings("MethodCallInLoopCondition")
private static <E> List<E> merge(final Iterable<? extends List<E>> lists, final Comparator<? super E> comparator) {
int totalSize = 0;
for ( final List<E> l : lists ) {
totalSize += l.size();
}
final List<E> result = new ArrayList<>(totalSize);
while ( result.size() < totalSize ) { // while we still have something to add
List<E> lowest = null;
for ( final List<E> l : lists ) {
if ( !l.isEmpty() ) {
if ( lowest == null || comparator.compare(l.get(0), lowest.get(0)) <= 0 ) {
lowest = l;
}
}
}
assert lowest != null;
result.add(lowest.get(0));
lowest.remove(0);
}
return result;
}
}
Test results
For my desktop JRE I could obtain the following test results:
对于我的桌面JRE,我可以得到以下测试结果:
FirstTest : in 5797 ms with 209557 elements
ReadAsWholeListTest : in 796 ms with 209557 elements
ReadAsWholeTreeSetTest : in 733 ms with 162006 elements
ReadJsonStreamIntoListTest : in 461 ms with 209557 elements
ReadJsonStreamIntoTreeSetTest : in 452 ms with 162006 elements
ReadJsonStreamIntoListChunksTest : in 607 ms with 209557 elements
--------------------
FirstTest : in 3396 ms with 209557 elements
ReadAsWholeListTest : in 493 ms with 209557 elements
ReadAsWholeTreeSetTest : in 520 ms with 162006 elements
ReadJsonStreamIntoListTest : in 385 ms with 209557 elements
ReadJsonStreamIntoTreeSetTest : in 377 ms with 162006 elements
ReadJsonStreamIntoListChunksTest : in 540 ms with 209557 elements
--------------------
FirstTest : in 3448 ms with 209557 elements
ReadAsWholeListTest : in 429 ms with 209557 elements
ReadAsWholeTreeSetTest : in 421 ms with 162006 elements
ReadJsonStreamIntoListTest : in 400 ms with 209557 elements
ReadJsonStreamIntoTreeSetTest : in 385 ms with 162006 elements
ReadJsonStreamIntoListChunksTest : in 480 ms with 209557 elements
--------------------
As you can see, creating excessive Gson
instance is definitely a wrong idea. The more optimized tests gain better performance. However, splitting a large list into sorted chunks (no parallel) to be merged later does not give much performance gain in my environment.
如您所见,创建过多的Gson实例绝对是错误的想法。更优化的测试获得更好的性能。但是,将一个大列表拆分为已排序的块(不并行)之后,在我的环境中不会有太多的性能提升。
For simplicity and probably the best choice, I would go with ReadJsonStreamInto_Collection_Test
depending on the required collection. I'm not really sure how fine would it work in a real Android environment, but you can simply do some JSON deserialization a bit better than Gson does using its internals.
为了简单,也可能是最好的选择,我将根据需要的集合使用ReadJsonStreamInto_Collection_Test。我不确定在真正的Android环境下它能有多好,但是您可以简单地做一些JSON反序列化,比Gson使用它的内部特性要好一点。
By the way:
顺便说一下:
- I'm not really sure, but did you notice 162006 unique cities? Your JSON file probably has some duplicates (at least if its
_id
is the identity). - 我不太确定,但是你注意到162006年的独特城市了吗?您的JSON文件可能有一些重复(至少如果其_id是标识的话)。
- What if you simply generate a sorted version of
cities.json
in advance on your workstation before you use it on an Android device? Additionally, you might want to filter out the duplicates if my above assumption is correct. - 如果你只是生成一个有序的城市版本。在Android设备上使用json之前,先在你的工作站上使用json ?此外,如果我的上述假设是正确的,您可能想要过滤掉重复的内容。