Improving the process of mirroring a server database to a client database via JSON?

Posted: 2022-08-05 03:54:02

I have an existing enterprise (non-App Store) legacy iOS application for iPad that I need to refactor (it was written by another developer, my predecessor at my current job).

This application fetches its data via JSON from a server backed by an MSSQL database. The database schema has about 30 tables; the most capacious are Client, City, and Agency, each with about 10,000 records, and further growth is expected. After the JSON is received (one JSON request-and-response pair for each table), it is mapped to Core Data - a process which also includes gluing the corresponding Core Data entities (Client, City, Agency and others) together, i.e. setting up the relations between these entities on the Core Data layer.

In itself the project's Core Data fetch part (or read part) is heavily optimized - it uses, I guess, almost every performance and memory tweak Core Data has, which is why the UI layer of the application is very fast and responsive, so I consider its work completely satisfactory and adequate.


The problem is the preparation of the Core Data layer, i.e. the server-to-client synchronization process: it takes too much time. Consider 30 network requests resulting in 30 JSON packs (by "pack" I mean "one table - one JSON"), which are then mapped to 30 Core Data entities, which are then glued together (the appropriate Core Data relations are set between them). When I first saw how all this is done in this project (too slowly), the first idea that came into my head was:

"For the first time a complete synchronization is performed (app's first launch time) - perform a fetch of the whole database data in, say, one archived file (something like database dump) and then somehow import it as a whole to a Core Data land".

“第一次执行完整的同步(应用程序的第一次启动)——执行整个数据库数据的抓取,比如,一个存档文件(比如数据库转储),然后以某种方式将其作为一个整体导入到一个核心数据区域”。

But then I realized that, even if the transmission of such a one-file dump were possible, Core Data would still require me to glue the corresponding Core Data entities together to set the appropriate relations between them, so it is hard to imagine any performance benefit from this scheme.

Also, my colleague suggested that I consider SQLite as a complete alternative to Core Data, but unfortunately I have no experience with it, which is why I am completely unable to foresee all the consequences of such a serious design decision (even with the synchronization process being very slow, my app does work, and its UI performance is very good now). The only thing I can imagine about SQLite is that, in contrast to Core Data, it would not push me to glue additional relations together on the client side, because SQLite has its good old foreign key system, doesn't it?


And so here are the questions (respondents, please do not mix these points in your answers - I have too much confusion about all of them):

  1. Does anybody have experience with the "first-time large import of the whole database" approach described above? I would be very thankful to learn about any solutions, whether they exploit the JSON<->CoreData pair or not.

  2. Does Core Data have some global import mechanism that allows mass creation of the corresponding 30-table schema (possibly using some specific source other than the "30 packs of JSON" described above) without the need to set up the corresponding relations for 30 entities?

  3. Are there any possibilities to speed up the synchronization process if 2) is impossible? Here I mean improvements to the current JSON<->CoreData scheme my app uses.

  4. Migration to SQLite: should I consider such a migration? What would I gain from it? What would the whole replication->transmission->client-preparation process look like then?

  5. Other alternatives to CoreData and SQLite - what could they be or look like?

  6. Any other thoughts or visions you may have about the situation I've described?


UPDATE 1


Though the answer written by Mundi is good (one large JSON, "No" to using SQLite), I am still interested in any other insights into the problem I've described.


UPDATE 2


I did try to use my Russian English the best way I could to describe my situation, in the hope that my question would become clear to everyone who reads it. With this second update I will try to provide a few more pointers to make my question even clearer.

Please consider two dichotomies:

  1. What can/should I use as a data layer on the iOS client - Core Data vs SQLite?
  2. What can/should I use as a transport layer - JSON (single-JSON-at-once as suggested in the answer, maybe even zipped) or some DB-itself dumps (if that is even possible, of course - notice I am also asking this in my question).

I think it is pretty obvious that the "sector" formed by the intersection of these two dichotomies - choosing Core Data from the first and JSON from the second - is the most widespread default in the iOS development world, and it is also what the app from this question uses.

Having said that, I claim that I would be thankful to see any answers regarding the CoreData-JSON pair as well as answers considering any other "sectors" (what about opting for SQLite and some kind of dump approach for it, why not?).

Also, it is important to note that I don't want to just drop the current option for some other alternative; I just want the solution to work fast in both the synchronization and UI phases of its usage. So answers about improving the current scheme, as well as answers suggesting other schemes, are welcome!

Now, please see the following update #3, which provides more details on my current CoreData-JSON situation:


UPDATE 3


As I have said, currently my app receives 30 packs of JSON - one pack for each whole table. Let's take the capacious tables as an example: Client, Agency, City.

It is Core Data, so if a client record has a non-empty agency_id field, I need to create a new Core Data entity of class Agency (an NSManagedObject subclass) and fill it with this record's JSON data. That is why I need to already have the corresponding Core Data entity for this agency (of class Agency), and finally I need to do something like client.agency = agency; and then call [currentManagedObjectContext save:&error]. Having done it this way, I can later fetch this client and ask its .agency property to find the corresponding entity. I hope I am completely sane when I do it this way.
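
For illustration, here is a minimal sketch of that per-record pattern (entity and attribute names follow the description above; the JSON parsing details are assumed):

for (NSDictionary *clientJSON in clientsJSON) {
    Client *client = [NSEntityDescription insertNewObjectForEntityForName:@"Client" inManagedObjectContext:currentManagedObjectContext];

    // ...fill the client's attributes from clientJSON...

    NSNumber *agencyId = clientJSON[@"agency_id"];
    if (agencyId != nil) {
        // One fetch per client record - this is exactly what makes the import slow.
        NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Agency"];
        request.predicate = [NSPredicate predicateWithFormat:@"id == %@", agencyId];
        Agency *agency = [[currentManagedObjectContext executeFetchRequest:request error:NULL] lastObject];
        client.agency = agency;
    }
}

NSError *error;
[currentManagedObjectContext save:&error];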

Now imagine this pattern applied to the following situation:


I have just received the following 3 separate JSON packs: 10,000 clients, 4,000 cities and 6,000 agencies (a client has one city, a city has many clients; a client has an agency, an agency has many clients, an agency has one city, a city has many agencies).

Now I want to set up the following relations on the Core Data level: I want my client entity to be connected to the corresponding city and the corresponding agency.

The current implementation of this in the project does a very ugly thing:

  1. Since the dependency order is: City -> Agency -> Client, i.e. the City needs to be baked first, the application begins by creating entities for City and persisting them to Core Data.

  2. Then it deals with the JSON of agencies: it iterates through every JSON record - for every agency it creates a new agency entity and, by its city_id, fetches the corresponding city entity and connects it using agency.city = city. After the iteration through the whole agencies JSON array is done, the current managed object context is saved (actually -[managedObjectContext save:] is done several times, once after every 500 records processed). At this step it is obvious that fetching one of the 4,000 cities for every one of the 6,000 agencies has a huge performance impact on the whole synchronization process.

  3. Then, finally, it deals with the JSON of clients: as in stage 2, it iterates through the whole 10,000-element JSON array and one by one performs the fetch of the corresponding agency and, ZOMG, city, and this impacts the overall performance in the same manner as stage 2 does.

It is all very BAD.

The only performance optimization I can see here is that the first stage could leave behind a large dictionary with city ids as keys (I mean NSNumbers of real ids) and faulted City entities as values, which would make it possible to avoid the ugly find process in the following stage 2, and then do the same in stage 3 using an analogous caching trick. But the problem is that there are many more relations between all the 30 tables than the just-described [ Client-City, Client-Agency, Agency-City ], so the final procedure involving caching of all the entities would most probably exhaust the resources the iPad reserves for my app.
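
A minimal sketch of that caching idea, assuming the cities have already been saved and City carries a numeric city_id attribute (names are illustrative):

// After stage 1: a lookup table of city_id -> (faulted) City entity, so that
// stages 2 and 3 can resolve a city_id without performing one fetch per record.
NSMutableDictionary *citiesById = [NSMutableDictionary dictionary];
for (City *city in justImportedCities) {
    citiesById[city.city_id] = city;
    [managedObjectContext refreshObject:city mergeChanges:NO]; // re-fault to keep memory low
}

// Later, while importing agencies (stage 2):
// agency.city = citiesById[agencyJSON[@"city_id"]];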


UPDATE 4


Message for future respondents: I've tried my best to make this question well-detailed and well-formed, and I really hope you answer with verbose answers. It would be great if your answer really addressed the complexity of the problem discussed here and complemented the efforts I've made to make my question as clear and general as possible. Thanks.

UPDATE 5


Related topics: Core Data on client (iOS) to cache data from a server Strategy, Trying to make a POST request with RestKit and map the response to Core Data.


UPDATE 6


Even now that it is no longer possible to open new bounties and there is an accepted answer, I would still be glad to see any other answers containing additional information about the problem this topic addresses. Thanks in advance.

7 Answers

#1


10  

I have experience in a very similar project. The Core Data insertions take some time, so we condition the user that this will take a while, but only the first time. The best performance tweak was of course to get the batch size right between saves, but I am sure you are aware of that.


One performance suggestion: I have tried a few things and found that creating many download threads can be a hit on performance, I suppose because for each request there is some latency from the server etc.


Instead, I discovered that downloading all the JSON in one go was much faster. I do not know how much data you have, but I tested with >100,000 records and a 40 MB+ JSON string, and this works really fast, so the bottleneck is just the Core Data insertions. With an @autorelease pool this even performed acceptably on a first-generation iPad.
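
As a hedged illustration of the batch-size tweak and the autorelease pool mentioned above (the batch size, entity name, and context variable are placeholders to be adapted):

static const NSUInteger kBatchSize = 500; // tune this value by measuring

NSUInteger count = recordsJSON.count;
for (NSUInteger start = 0; start < count; start += kBatchSize) {
    @autoreleasepool {
        NSUInteger end = MIN(start + kBatchSize, count);
        for (NSUInteger i = start; i < end; i++) {
            NSDictionary *recordJSON = recordsJSON[i];
            Client *client = [NSEntityDescription insertNewObjectForEntityForName:@"Client" inManagedObjectContext:context];
            // ...fill the client's attributes from recordJSON...
        }
        // Save once per batch, then reset so the processed objects can be freed.
        [context save:NULL];
        [context reset];
    }
}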

Stay away from the SQLite API - it will take you more than a man-year (assuming high productivity) to replicate the performance optimizations you get out of the box with Core Data.

#2


6  

First off, you're doing a lot of work, and it will take some time no matter how you slice it, but there are ways to improve things.


I'd recommend doing your fetches in batches, with a batch size matching your batch size for processing new objects. For example, when creating new Agency records, do something like:


  1. Make sure the current Agency batch is sorted by city_id. (I'll explain why later).

  2. Get the City ID for each Agency in the batch. Depending on how your JSON is structured, this is probably a one-liner like this (since valueForKey works on arrays):

    NSArray *cityIDs = [myAgencyBatch valueForKey:@"city_id"];

  3. Get all the City instances for the current pass in one fetch by using the IDs you found in the previous step. Sort the results by city_id. Something like:

    NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"City"];
    NSPredicate *predicate = [NSPredicate predicateWithFormat:@"city_id in %@", cityIDs];
    [request setPredicate:predicate];
    [request setSortDescriptors:@[ [NSSortDescriptor sortDescriptorWithKey:@"city_id" ascending:YES] ]];
    NSArray *cities = [context executeFetchRequest:request error:nil];

Now, you have one array of Agency and another one of City, both sorted by city_id. Match them up to set up the relationships (check city_id in case things don't match). Save changes, and go on to the next batch.

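A sketch of that matching walk, assuming both arrays are sorted ascending by city_id (the accessor names are illustrative):

NSUInteger cityIndex = 0;
for (Agency *agency in sortedAgencies) {
    // Advance through the sorted cities until we reach this agency's city_id.
    while (cityIndex < cities.count &&
           [[cities[cityIndex] valueForKey:@"city_id"] compare:agency.city_id] == NSOrderedAscending) {
        cityIndex++;
    }
    City *candidate = (cityIndex < cities.count) ? cities[cityIndex] : nil;
    if ([[candidate valueForKey:@"city_id"] isEqual:agency.city_id]) {
        agency.city = candidate; // ids match - set the relationship
    }
    // else: no matching City - leave the relationship unset, as the answer warns
}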

This will dramatically reduce the number of fetches you need to do, which should speed things up. For more on this technique, see Implementing Find-or-Create Efficiently in Apple's docs.


Another thing that may help is to "warm up" Core Data's internal cache with the objects you need before you start fetching them. This will save time later on because getting property values won't require a trip to the data store. For this you'd do something like:


NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"City"];
// no predicate, get everything
[request setResultType:NSManagedObjectIDResultType];
NSArray *notUsed = [context executeFetchRequest:request error:nil];

...and then just forget about the results. This is superficially useless, but it will alter the internal Core Data state for faster access to City instances later on.

Now as for your other questions,


  • Using SQLite directly instead of Core Data might not be a terrible choice for your situation. The benefit would be that you'd have no need to set up the relationships, since you could use fields like city_id as foreign keys. So, fast importing. The downside, of course, is that you'll have to do your own work converting your model objects to/from SQL records, and probably rewrite quite a lot of existing code that assumes Core Data (e.g. every time you follow a relationship you now need to look up records by that foreign key). This change might fix your import performance issues, but the side effects could be significant. (A sketch of what such a direct import could look like follows this list.)

  • JSON is generally a very good format if you're transmitting data as text. If you could prepare a Core Data store on the server, and if you would use that file as-is instead of trying to merge it into an existing data store, then that would almost certainly speed things up. Your import process would run once on the server and then never again. But those are big "if"s, especially the second one. If you get to where you need to merge a new server data store with existing data, you're right back to where you are now.
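
Regarding the first bullet, a hedged sketch of a direct SQLite bulk import (raw sqlite3 C API called from Objective-C; the table layout and dbPath are illustrative):

#import <sqlite3.h>

sqlite3 *db = NULL;
sqlite3_open([dbPath UTF8String], &db);
// One transaction around the whole batch makes bulk INSERTs dramatically faster.
sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, NULL);

sqlite3_stmt *stmt = NULL;
sqlite3_prepare_v2(db, "INSERT INTO agency (id, name, city_id) VALUES (?, ?, ?)", -1, &stmt, NULL);
for (NSDictionary *agencyJSON in agenciesJSON) {
    sqlite3_bind_int64(stmt, 1, [agencyJSON[@"id"] longLongValue]);
    sqlite3_bind_text(stmt, 2, [agencyJSON[@"name"] UTF8String], -1, SQLITE_TRANSIENT);
    sqlite3_bind_int64(stmt, 3, [agencyJSON[@"city_id"] longLongValue]); // a plain foreign key - no gluing
    sqlite3_step(stmt);
    sqlite3_reset(stmt);
}
sqlite3_finalize(stmt);

sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
sqlite3_close(db);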

#3


5  

Do you have control of the server? I ask, because it sounds like you do from the following paragraph:


"For the first time a complete synchronization is performed (app's first launch time) - perform the fetch of the whole database data in, say, one archived file (something like database dump) and then somehow import it as a whole to the CoreData land".


If sending a dump is possible, why not send the Core Data file itself? Core Data (by default) is backed by a SQLite database -- why not generate that database on the server, zip it and send it across the wire?


This would mean you could eliminate all the JSON parsing, network requests etc and replace it with a simple file download and archive extraction. We did this on a project and it improved performance immeasurably.

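A rough sketch of what such a first-launch bootstrap could look like (SSZipArchive and the dump URL are assumptions for illustration, not part of the original answer):

// Download the zipped store generated on the server and install it before the
// Core Data stack is created. The store must have been built with exactly the
// same NSManagedObjectModel version the app uses.
NSURL *dumpURL = [NSURL URLWithString:@"https://example.com/dump.zip"];
NSData *zipped = [NSData dataWithContentsOfURL:dumpURL];

NSString *tmpZip = [NSTemporaryDirectory() stringByAppendingPathComponent:@"dump.zip"];
[zipped writeToFile:tmpZip atomically:YES];

NSString *documentsDir = [NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES) lastObject];
[SSZipArchive unzipFileAtPath:tmpZip toDestination:documentsDir];
// Now point the NSPersistentStoreCoordinator at the extracted .sqlite file as usual.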

#4


4  

  1. For each row in your table there must be a timestamp column. If there isn't one, you should add it.
  2. The first time, and every time you fetch the database dump, you store the last update date and time.
  3. Every subsequent time, you instruct the database to return only those records that were changed or updated since the previous download operation. There should also be a "deleted" flag so you can remove vanished records.
  4. Then you only need to update certain matching records, saving time on all fronts. (A sketch of tracking the sync date follows this list.)
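
A minimal sketch of that bookkeeping (the endpoint and its "since" parameter are made up for illustration; the idea is just to persist the last successful sync time):

NSUserDefaults *defaults = [NSUserDefaults standardUserDefaults];
NSDate *lastSync = [defaults objectForKey:@"lastSyncDate"] ?: [NSDate distantPast];

NSDateFormatter *formatter = [[NSDateFormatter alloc] init];
formatter.dateFormat = @"yyyy-MM-dd'T'HH:mm:ss";
formatter.timeZone = [NSTimeZone timeZoneWithAbbreviation:@"UTC"];

NSString *urlString = [NSString stringWithFormat:@"https://example.com/api/clients?since=%@",
                                                 [formatter stringFromDate:lastSync]];
// ...perform the request, apply the returned inserts/updates, honor the "deleted" flag...

// Record the new sync point only after everything has been saved successfully.
[defaults setObject:[NSDate date] forKey:@"lastSyncDate"];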

To speed up the first time sync you can also ship a seed database with the app, so that it could be imported immediately without any network operations.


  1. Download the JSON files by hand.
  2. Put them into your project.
  3. Somewhere in the project configuration or header files take a note of the download date and time.
  4. On the first run, locate and load said files, then proceed as if you were updating them.
  5. If in doubt, refer to the manual.

Example:


NSString *filePath = [[NSBundle mainBundle] pathForResource:@"cities" 
                                            ofType:@"json"];
NSData *citiesData = [NSData dataWithContentsOfFile:filePath];
// I assume that you're loading an array
NSArray *citiesSeed = [NSJSONSerialization JSONObjectWithData:citiesData 
                       options:NSJSONReadingMutableContainers error:nil];

#5


4  

Here you have my recommendations:


  • Use MagicalRecord. It's a Core Data wrapper that saves you a lot of boilerplate code, plus it comes with very interesting features.
  • Download all the JSON in one request, as others suggested. If you can embed the first JSON document into the app, you can save the download time and start populating the database right when you open the app for the first time. Also, with MagicalRecord it is quite easy to perform this save operation in a separate thread and then sync all contexts automatically; this can improve the responsiveness of your app (see the sketch after this list).
  • It seems that you should refactor that ugly method once you have solved the first import issue. Again, I would suggest using MagicalRecord to easily create those entities.
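
A small sketch of such a background save with MagicalRecord (using its 2.x saveWithBlock: API; MR_createEntityInContext: is named MR_createInContext: in older releases):

[MagicalRecord saveWithBlock:^(NSManagedObjectContext *localContext) {
    // Runs on a background thread; saving this context propagates the changes
    // to the default context automatically.
    for (NSDictionary *cityJSON in citiesJSON) {
        City *city = [City MR_createEntityInContext:localContext];
        // ...fill the city's attributes from cityJSON...
    }
} completion:^(BOOL contentDidChange, NSError *error) {
    // Back on the main thread - refresh the UI here if needed.
}];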

#6


3  

We've recently moved a fairly large project from Core Data to SQLite, and one of the main reasons was bulk insert performance. There were quite a few features we lost in the transition, and I would not advise you to make the switch if you can avoid it. After the transition to SQLite, we actually had performance issues in areas other than bulk inserts which Core Data was transparently handling for us, and even though we fixed those new issues, it took some amount of time getting back up and running. Although we've spent some time and effort in transitioning from Core Data to SQLite, I can't say that there are any regrets.


With that cleared up, I'd suggest you get some baseline measurements before you go about fixing the bulk insert performance.


  1. Measure how long it takes to insert those records in the current state.
  2. Skip setting up the relationships between those objects altogether, and then measure the insert performance.
  3. Create a simple SQLite database, and measure the insert performance with that. This should give you a very good baseline estimate of how long the actual SQL inserts take, and will also give you a good idea of the Core Data overhead.

A few things you can try off the bat to speed up inserts:


  1. Ensure that there are no active fetched results controllers when you are performing the bulk inserts. By active, I mean fetched results controllers that have a non-nil delegate. In my experience, Core Data's change tracking was the single most expensive operation when trying to do bulk insertions.
  2. Perform all changes in a single context, and stop merging changes from different contexts until the bulk inserts are done.

To get more insight into what's really going on under the hood, enable Core Data SQL debugging and see the SQL queries that are being executed. Ideally, you'd want to see a lot of INSERTs, and a few UPDATEs. But if you come across too many SELECTs, and/or UPDATEs, then that's a sign that you are doing too much reading, or updating of objects.

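The SQL log is switched on with a launch argument (in Xcode: Edit Scheme > Run > Arguments Passed On Launch):

-com.apple.CoreData.SQLDebug 1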

Use the Core-Data profiler instrument to get a better high-level overview of what's going on with Core Data.


#7


2  

I've decided to write my own answer summarizing the techniques and advices I found useful for my situation. Thanks to all folks who posted their answers.



I. Transport


  1. "One JSON". This is the idea that I want to give a try. Thanks @mundi.

    “一个JSON”。这就是我想尝试的想法。谢谢@mundi。

  2. The idea of archiving JSON before sending it to a client, be it a one JSON pack or a 30 separate 'one table - one pack'.

    在将JSON发送给客户端之前对其进行归档的想法,可以是一个JSON包,也可以是30个单独的“一个表—一个包”。


II. Setting up Core Data relations


I will describe the process of the JSON->CoreData import using an imaginary large import operation, as if it were performed in one method (I am not sure whether it will be so or not - maybe I will split it into logical chunks).

Let's imagine that in my imaginary app there are 15 capacious tables, where "capacious" means "cannot be held in memory at once, should be imported in batches", and 15 non-capacious tables each having <500 records, for example:

Capacious:

  • cities (15k+)
  • clients (30k+)
  • users (15k+)
  • events (5k+)
  • actions (2k+) ...

Small:

  • client_types (20-)
  • visit_types (10-)
  • positions (10-) ...

Let's imagine that I already have the JSON packs downloaded and parsed into composite NSArray/NSDictionary variables: I have citiesJSON, clientsJSON, usersJSON, ...

1. Work with small tables first


My pseudo-method starts with the import of the tiny tables first. Let's take the client_types table: I iterate through clientTypesJSON and create ClientType objects (NSManagedObject subclasses). More than that, I collect the resulting objects in a dictionary, with these objects as its values and the "ids" (foreign keys) of these objects as keys.

Here is the pseudocode:


NSMutableDictionary *clientTypesIdsAndClientTypes = [NSMutableDictionary dictionary];
for (NSDictionary *clientTypeJSON in clientTypesJSON) {
    ClientType *clientType = [NSEntityDescription insertNewObjectForEntityForName:@"ClientType" inManagedObjectContext:managedObjectContext];

    // fill the properties of clientType from clientTypeJSON

    // Write the prepared clientType to the cache, keyed by its id (an NSNumber)
    clientTypesIdsAndClientTypes[clientType.id] = clientType;
}

// Persist all clientTypes to the store (save first, so re-faulting below does not discard them).
NSArray *clientTypes = [clientTypesIdsAndClientTypes allValues];
[managedObjectContext obtainPermanentIDsForObjects:clientTypes error:...];
[managedObjectContext save:...];

// Un-fault (unload from RAM) all the records in the cache - we don't need their data in memory anymore.
for (ClientType *clientType in clientTypes) {
    [managedObjectContext refreshObject:clientType mergeChanges:NO];
}

The result is that we have a bunch of dictionaries for the small tables, each holding the corresponding set of objects keyed by their ids. We will use them later without refetching, because they are small and their values (NSManagedObjects) are now faults.

2. Use the cache dictionary of objects from small tables obtained during step 1 to set up relationships with them


Let's consider the complex table clients: we have clientsJSON, and we need to set up a clientType relation for each client record. This is easy, because we do have a cache with the clientTypes and their ids:

for (NSDictionary *clientJSON in clientsJSON) {
    Client *client = [NSEntityDescription insertNewObjectForEntityForName:@"Client" inManagedObjectContext:managedObjectContext];

    // Setting up the SQLite field
    client.client_type_id = clientJSON[@"client_type_id"];

    // Setting up the Core Data relationship between client and clientType
    client.clientType = clientTypesIdsAndClientTypes[client.client_type_id];
}

// Save and persist

3. Dealing with large tables - batches


Let's consider a large clientsJSON having 30k+ clients in it. We do not iterate through the whole clientsJSON but split it into chunks of an appropriate size (500 records), so that [managedObjectContext save:...] is called every 500 records. It is also important to wrap the work on each 500-record batch in an @autoreleasepool block - see "Reducing Memory Overhead" in the Core Data Performance guide.

Be careful - step 4 describes the operation applied to one batch of 500 records, not to the whole clientsJSON!

4. Dealing with large tables - setting up relationships with large tables


Consider the following method, which we will use in a moment:

@implementation NSManagedObject (Extensions)

// Fetch the existing objects whose ids are in objectIds and return them keyed
// by id. Note: this assumes the ids are unique and that every id already exists
// in the store - otherwise the counts would not match and
// dictionaryWithObjects:forKeys: would throw.
+ (NSDictionary *)dictionaryOfExistingObjectsByIds:(NSArray *)objectIds inManagedObjectContext:(NSManagedObjectContext *)managedObjectContext {
    NSArray *sortedObjectIds = [objectIds sortedArrayUsingSelector:@selector(compare:)];

    NSFetchRequest *fetchRequest = [[NSFetchRequest alloc] initWithEntityName:NSStringFromClass(self)];

    fetchRequest.predicate = [NSPredicate predicateWithFormat:@"(id IN %@)", sortedObjectIds];
    // Sort the fetch by id as well, so the results line up with sortedObjectIds.
    fetchRequest.sortDescriptors = @[[[NSSortDescriptor alloc] initWithKey:@"id" ascending:YES]];

    // We only need faults here - the property values are never touched.
    fetchRequest.includesPropertyValues = NO;
    fetchRequest.returnsObjectsAsFaults = YES;

    NSError *error;
    NSArray *fetchResult = [managedObjectContext executeFetchRequest:fetchRequest error:&error];

    return [NSDictionary dictionaryWithObjects:fetchResult forKeys:sortedObjectIds];
}
@end

Let's consider a clientsJSON pack containing a batch (500) of the Client records we need to save. We also need to set up a relationship between these clients and their agencies (Agency; the foreign key is agency_id).

NSMutableArray *agenciesIds = [NSMutableArray array];
NSMutableArray *clients = [NSMutableArray array];

for (NSDictionary *clientJSON in clientsJSON) {
    Client *client = [NSEntityDescription insertNewObjectForEntityForName:@"Client" inManagedObjectContext:managedObjectContext];

    // fill client fields...

    // Also collect the agencies' ids (without duplicates)
    if ([agenciesIds containsObject:client.agency_id] == NO) {
        [agenciesIds addObject:client.agency_id];
    }

    [clients addObject:client];
}

NSDictionary *agenciesIdsAndAgenciesObjects = [Agency dictionaryOfExistingObjectsByIds:agenciesIds inManagedObjectContext:managedObjectContext];

// Setting up the Core Data relationship between Client and Agency
for (Client *client in clients) {
    client.agency = agenciesIdsAndAgenciesObjects[client.agency_id];
}

// Persist all Clients to the store (save before re-faulting them below).
[managedObjectContext obtainPermanentIDsForObjects:clients error:...];
[managedObjectContext save:...];

// Un-fault all the records in the cache - we don't need them in memory anymore.
for (Client *client in clients) {
    [managedObjectContext refreshObject:client mergeChanges:NO];
}

Most of what I use here is described in the Apple guides Core Data Performance and Efficiently Importing Data. The summary of steps 1-4 is the following:

  1. Turn objects into faults once they are persisted, since their property values become unnecessary as the import operation goes further.

  2. Construct dictionaries with objects as values and their ids as keys, so these dictionaries can serve as lookup tables when constructing relationships between these objects and other objects.

  3. Use @autoreleasepool when iterating through a large number of records.

  4. Use a method similar to dictionaryOfExistingObjectsByIds, or the method that Tom references in his answer (from Efficiently Importing Data) - a method backed by a SQL IN predicate that significantly reduces the number of fetches. Read Tom's answer and the referenced Apple guide to better understand this technique.


Good reading on this topic


objc.io issue #4: Importing Large Data Sets

