使用Python从PowerPivot模型中提取原始数据

时间:2021-12-18 09:54:41

What seemed like a trivial task turned into a real nightmare when I had to read in some data from a PowerPivot model using Python. I believe I've researched this very well over the last couple of days but now I hit a brick wall and would appreciate some help from the Python/SSAS/ADO community.

当我不得不使用Python从PowerPivot模型中读取一些数据时,看似微不足道的任务变成了真正的噩梦。我相信我在过去的几天里已经对此做了很好的研究,但是现在我遇到了困难,希望能得到Python/SSAS/ADO社区的帮助。

Basically, all I want to do is programmatically access raw data stored in PowerPivot models - my idea was to connect to the underlying PowerPivot (i.e. MS Analysis Services) engine via one of the methods listed below, list the tables contained in the model, then extract the raw data from each table using a simple DAX query (something like EVALUATE (table_name)). Easy peasy, right? Well, maybe not.

基本上,所有我想做的就是以编程方式访问原始数据存储在PowerPivot模型——我的想法是连接到底层PowerPivot(例如微软分析服务)引擎通过下列方法之一,列表中包含的表模型,然后从每个表中提取原始数据使用一个简单的综合查询(类似评估(table_name))。peasy简单,对吗?好吧,也许不是。

0. Some Background Information

As you can see, I've tried several different approaches. I'll try to document everything as carefully as possible so that those uninitiated in PowerPivot functionality will have a good idea of what I'd like to do.

正如您所看到的,我尝试了几种不同的方法。我将尽可能仔细地记录所有内容,以便那些不熟悉PowerPivot功能的人能够很好地了解我想做什么。

First of all, some background on programmatic access to Analysis Services engine (it says 2005 SQL Server, but all of it ought to still be applicable): SQL Server Data Mining Programmability and Data providers used for Analysis Services connections.

首先,关于对分析服务引擎的编程访问的一些背景知识(它说的是2005年的SQL Server,但所有这些都应该是适用的):SQL Server数据挖掘可编程性和用于分析服务连接的数据提供者。

The sample Excel/PowerPivot file I'll be using in the example below can be found here: Microsoft PowerPivot for Excel 2010 and PowerPivot in Excel 2013 Samples.

下面示例中使用的示例Excel/PowerPivot文件可以在这里找到:Excel 2010的Microsoft PowerPivot文件和Excel 2013的PowerPivot文件。

Also, note that I'm using Excel 2010, so some of my code is version-specific. E.g. wb.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection should be wb.Model.DataModelConnection.ModelConnection.ADOConnection if you're using Excel 2013.

另外,请注意,我使用的是Excel 2010,因此我的一些代码是特定于版本的。例如世行。连接.OLEDBConnection(“PowerPivot数据”)。ADOConnection应该wb.Model.DataModelConnection.ModelConnection。ADOConnection如果您正在使用Excel 2013。

The connection string I'll be using throughout this question is based on the information found here: Connect to PowerPivot engine with C#. Additionally, some of the methods apparently require some sort of initialization of the PowerPivot model prior to data retrieval. See here: Automating PowerPivot Refresh operation from VBA.

我将在整个问题中使用的连接字符串基于这里找到的信息:使用c#连接到PowerPivot引擎。此外,有些方法显然需要在数据检索之前对PowerPivot模型进行某种初始化。参见这里:从VBA自动执行PowerPivot刷新操作。

Finally, here's a couple of links showing that this should be achievable (note however, that these links mainly refer to C#, not Python):

最后,这里有几个链接表明这是可以实现的(但是请注意,这些链接主要指向c#,而不是Python):

1. Using ADOMD

import clr
clr.AddReference("Microsoft.AnalysisServices.AdomdClient")
import Microsoft.AnalysisServices.AdomdClient as ADOMD
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = ADOMD.AdomdConnection(ConnString)
Connection.Open()

Here, it appears the problem is that the PowerPivot model has not been initialized:

这里的问题是,PowerPivot模型还没有初始化:

AdomdConnectionException: A connection cannot be made. Ensure that the server is running.

2. Using AMO

import clr
clr.AddReference("Microsoft.AnalysisServices")
import Microsoft.AnalysisServices as AMO
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = AMO.Server()
Connection.Connect(ConnString)

Same story, "the server is not running":

同样的故事,“服务器没有运行”:

ConnectionException: A connection cannot be made. Ensure that the server is running.

Note that AMO is technically not used for querying data, but I included it as one of the potential ways of connecting to the PowerPivot model.

注意,AMO在技术上不用于查询数据,但我将它作为连接到PowerPivot模型的一种潜在方法。

3. Using ADO.NET

import clr
clr.AddReference("System.Data")
import System.Data.OleDb as ADONET
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = ADONET.OleDbConnection()
Connection.ConnectionString = ConnString
Connection.Open()

This is similar to What's the simplest way to access mssql with python or ironpython?. Unfortunately, this also doesn't work:

这类似于使用python或ironpython访问mssql的最简单方法吗?不幸的是,这也不起作用:

OleDbException: OLE DB error: OLE DB or ODBC error: The following system error occurred:
The requested name is valid, but no data of the requested type was found.

4. Using ADO via adodbapi module

import adodbapi
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = adodbapi.connect(ConnString)

Similar to Opposite Workings of OLEDB/ODBC between Python and MS Access VBA. The error I get is:

类似于OLEDB/ODBC在Python和MS Access VBA之间的相反工作。我得到的错误是:

OperationalError: (com_error(-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB
Provider for SQL Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The
following system error occurred:  The requested name is valid, but no data of the requested
type was found...

This is basically the same problem as with ADO.NET above.

这基本上是和ADO一样的问题。净上面。

5. Using ADO via Excel/win32com module

from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
Connection = Workbook.Connections["PowerPivot Data"].OLEDBConnection.ADOConnection
Recordset = Dispatch('ADODB.Recordset')

Query = "EVALUATE(dbo_DimDate)" #sample DAX query
Recordset.Open(Query, Connection)

The idea for this approach came from this blog post that uses VBA: Export a table or DAX query from Power Pivot to CSV using VBA. Note that this approach uses an explicit Refresh command that initializes the model (i.e. "server"). Here's the error message:

这种方法的想法来自于这个使用VBA的博客文章:使用VBA将一个表或DAX查询从Power Pivot导出到CSV。注意,这种方法使用显式的Refresh命令初始化模型(例如。“服务器”)。这是错误信息:

com_error: (-2147352567, 'Exception occurred.', (0, u'ADODB.Recordset', u'Arguments are of
the wrong type, are out of acceptable range, or are in conflict with one another.',
u'C:\\Windows\\HELP\\ADO270.CHM', 1240641, -2146825287), None)

It appears, however, that the ADO connection has been established:

但是,似乎已经建立了ADO连接:

  • type(Connection) returns instance
  • 类型(连接)返回实例
  • print(Connection) returns Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member Mode=Error;Subqueries=2;Optimize Response=3;Cell Error Mode=TextValue
  • 打印(连接)返回提供者=MSOLAP.5;持久安全信息=True;初始目录=Microsoft_SQLServer_AnalysisServices;数据源=$嵌入式$;MDX兼容性=1;安全选项=2;ConnectTo=11.0;MDX丢失成员模式=错误;

It seems the problem lies in the creation of the ADODB.Recordset object.

问题似乎在于ADODB的创建。记录集对象。

6. Using ADO via Excel/win32com, direct use of ADODB.Connection

from win32com.client import Dispatch
ConnString = "Provider=MSOLAP;Data Source=$Embedded$;Locale Identifier=1033;
             Location=H:\\PowerPivotTutorialSample.xlsx;SQLQueryMode=DataKeys"

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnString)

Similar to Connection to Access from Python [duplicate] and Query access using ADO in Win32 platform (Python recipe). Unfortunately, the error Python spits out is the same as in the two examples above:

类似于在Win32平台(Python菜谱)中使用ADO访问Python [duplicate]和查询访问的连接。不幸的是,Python输出的错误与上面两个示例中的错误相同:

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'OLE DB error: OLE DB or ODBC error: The following system
error occurred:  The requested name is valid, but no data of the requested type was found.
..', None, 0, -2147467259), None)

7. Using ADO via Excel/win32com, direct use of ADODB.Connection plus model refresh

from win32com.client import Dispatch
Xlfile = "H:\\PowerPivotTutorialSample.xlsx"
XlApp = Dispatch("Excel.Application")
Workbook = XlApp.Workbooks.Open(Xlfile)
Workbook.Connections["PowerPivot Data"].Refresh()
ConnStringInternal = "Provider=MSOLAP.5;Persist Security Info=True;Initial Catalog=
                     Microsoft_SQLServer_AnalysisServices;Data Source=$Embedded$;MDX
                     Compatibility=1;Safety Options=2;ConnectTo=11.0;MDX Missing Member
                     Mode=Error;Optimize Response=3;Cell Error Mode=TextValue"

Connection = Dispatch('ADODB.Connection')
Connection.Open(ConnStringInternal)

I was hoping I could initialize an instance of Excel, then initialize the PowerPivot model, and then create a connection using the internal connection string Excel uses for embedded PowerPivot data (similar to How do you copy the powerpivot data into the excel workbook as a table? - note that the connection string is different from the one I've used elsewhere). Unfortunately, this doesn't work and my guess is that Python starts the ADODB.Connection process in a separate instance (as I get the same error message when I execute the last three rows without first initializing Excel, etc.):

我希望我可以初始化一个Excel实例,然后初始化PowerPivot模型,然后使用嵌入式PowerPivot数据的内部连接字符串Excel创建一个连接(类似于如何将PowerPivot数据作为表复制到Excel工作簿中)。-注意连接字符串与我在其他地方使用的不同)。不幸的是,这不起作用,我的猜测是Python开始了ADODB。在一个单独的实例中连接进程(当我执行最后三行而没有首先初始化Excel等时,会得到相同的错误消息):

com_error: (-2147352567, 'Exception occurred.', (0, u'Microsoft OLE DB Provider for SQL
Server 2012 Analysis Services.', u'Either the user, ****** (masked), does not have access
to the Microsoft_SQLServer_AnalysisServices database, or the database does not exist.',
None, 0, -2147467259), None)

3 个解决方案

#1


4  

Lo and behold, I finally managed to crack the problem - turns out that accessing Power Pivot data using Python is indeed possible! Below's a short recap of what I did - you can find a more detailed description here: Analysis Services (SSAS) on a shoestring. Note: the code has been optimized neither for efficiency nor elegance.

瞧,我终于解决了这个问题——原来使用Python访问Power Pivot数据确实是可能的!下面是我所做的简要概述——您可以在这里找到更详细的描述:鞋带上的分析服务(SSAS)。注意:该代码既没有优化效率也没有优化优雅性。

  • Install Microsoft Power BI Desktop (comes with free Analysis Services server, so no need for a costly SQL Server license - however, the same approach obviously also works if you have a proper license).
  • 安装Microsoft Power BI桌面(附带免费的分析服务服务器,因此不需要昂贵的SQL server许可证——但是,如果您有适当的许可证,同样的方法显然也可以工作)。
  • Fire up the AS engine by first creating the msmdsrv.ini settings file, then restore the database from the ABF file (using AMO.NET), then extract data using ADOMD.NET.
  • 首先创建msmdsrv启动AS引擎。ini设置文件,然后从ABF文件(使用AMO.NET)恢复数据库,然后使用ADOMD.NET提取数据。

Here's the Python code that illustrates the AS engine + AMO.NET parts:

下面是Python代码,演示了AS engine + AMO。网络部分:

import psutil, subprocess, random, os, zipfile, shutil, clr, sys, pandas

def initialSetup(pathPowerBI):
    sys.path.append(pathPowerBI)

    #required Analysis Services assemblies
    clr.AddReference("Microsoft.PowerBI.Amo.Core")
    clr.AddReference("Microsoft.PowerBI.Amo")     
    clr.AddReference("Microsoft.PowerBI.AdomdClient")

    global AMO, ADOMD
    import Microsoft.AnalysisServices as AMO
    import Microsoft.AnalysisServices.AdomdClient as ADOMD

def restorePowerPivot(excelName, pathTarget, port, pathPowerBI):   
    #create random folder
    os.chdir(pathTarget)
    folder = os.getcwd()+str(random.randrange(10**6, 10**7))
    os.mkdir(folder)

    #extract PowerPivot model (abf backup)
    archive = zipfile.ZipFile(excelName)
    for member in archive.namelist():
        if ".data" in member:
            filename = os.path.basename(member)
            abfname = os.path.join(folder, filename) + ".abf"
            source = archive.open(member)
            target = file(os.path.join(folder, abfname), 'wb')
            shutil.copyfileobj(source, target)
            del target
    archive.close()

    #start the cmd.exe process to get its PID
    listPIDpre = [proc for proc in psutil.process_iter()]
    process = subprocess.Popen('cmd.exe /k', stdin=subprocess.PIPE)
    listPIDpost = [proc for proc in psutil.process_iter()]
    pid = [proc for proc in listPIDpost if proc not in listPIDpre if "cmd.exe" in str(proc)][0]
    pid = str(pid).split("=")[1].split(",")[0]

    #msmdsrv.ini
    msmdsrvText = '''<ConfigurationSettings>
       <DataDir>{0}</DataDir>
       <TempDir>{0}</TempDir>
       <LogDir>{0}</LogDir>
       <BackupDir>{0}</BackupDir>
       <DeploymentMode>2</DeploymentMode>
       <RecoveryModel>1</RecoveryModel>
       <DisklessModeRequested>0</DisklessModeRequested>
       <CleanDataFolderOnStartup>1</CleanDataFolderOnStartup>
       <AutoSetDefaultInitialCatalog>1</AutoSetDefaultInitialCatalog>
       <Network>
          <Requests>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
          </Requests>
          <Responses>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
             <CompressionLevel>9</CompressionLevel>
          </Responses>
          <ListenOnlyOnLocalConnections>1</ListenOnlyOnLocalConnections>
       </Network>
       <Port>{1}</Port>
       <PrivateProcess>{2}</PrivateProcess>
       <InstanceVisible>0</InstanceVisible>
       <Language>1033</Language>
       <Debug>
          <CallStackInError>0</CallStackInError>
       </Debug>
       <Log>
          <Exception>
             <CrashReportsFolder>{0}</CrashReportsFolder>
          </Exception>
          <FlightRecorder>
             <Enabled>0</Enabled>
          </FlightRecorder>
       </Log>
       <AllowedBrowsingFolders>{0}</AllowedBrowsingFolders>
       <ResourceGovernance>
          <GovernIMBIScheduler>0</GovernIMBIScheduler>
       </ResourceGovernance>
       <Feature>
          <ManagedCodeEnabled>1</ManagedCodeEnabled>
       </Feature>
       <VertiPaq>
          <EnableDisklessTMImageSave>0</EnableDisklessTMImageSave>
          <EnableProcessingSimplifiedLocks>1</EnableProcessingSimplifiedLocks>
       </VertiPaq>
    </ConfigurationSettings>'''

    #save ini file to disk, fill it with required parameters
    msmdsrvini = open(folder+"\\msmdsrv.ini", "w")
    msmdsrvText = msmdsrvText.format(folder, port, pid) #{0},{1},{2}
    msmdsrvini.write(msmdsrvText)
    msmdsrvini.close()

    #run AS engine inside the cmd.exe process
    initString = "\"{0}\\msmdsrv.exe\" -c -s \"{1}\""
    initString = initString.format(pathPowerBI.replace("/","\\"),folder)
    process.stdin.write(initString + " \n")

    #connect to the AS instance from Python
    AMOServer = AMO.Server()
    AMOServer.Connect("localhost:{0}".format(port))

    #restore database from PowerPivot abf backup, disconnect
    AMORestoreInfo = AMO.RestoreInfo(os.path.join(folder, abfname))
    AMOServer.Restore(AMORestoreInfo)
    AMOServer.Disconnect()

    return process

And the data-extraction part:

和数据提取的部分:

def runQuery(query, port, flag):
    #ADOMD assembly
    ADOMDConn = ADOMD.AdomdConnection("Data Source=localhost:{0}".format(port))
    ADOMDConn.Open()
    ADOMDCommand = ADOMDConn.CreateCommand() 
    ADOMDCommand.CommandText = query

    #read data in via AdomdDataReader object
    DataReader = ADOMDCommand.ExecuteReader()

    #get metadata, number of columns
    SchemaTable = DataReader.GetSchemaTable()
    numCol = SchemaTable.Rows.Count #same as DataReader.FieldCount

    #get column names
    columnNames = []
    for i in range(numCol):
        columnNames.append(str(SchemaTable.Rows[i][0]))

    #fill with data
    data = []
    while DataReader.Read()==True:
        row = []
        for j in range(numCol):
            try:
                row.append(DataReader[j].ToString())
            except:
                row.append(DataReader[j])
        data.append(row)
    df = pandas.DataFrame(data)
    df.columns = columnNames 

    if flag==0:
        DataReader.Close()
        ADOMDConn.Close()

        return df     
    else:   
        #metadata table
        metadataColumnNames = []
        for j in range(SchemaTable.Columns.Count):
            metadataColumnNames.append(SchemaTable.Columns[j].ToString())
        metadata = []
        for i in range(numCol):
            row = []
            for j in range(SchemaTable.Columns.Count):
                try:
                    row.append(SchemaTable.Rows[i][j].ToString())
                except:
                    row.append(SchemaTable.Rows[i][j])
            metadata.append(row)
        metadf = pandas.DataFrame(metadata)
        metadf.columns = metadataColumnNames

        DataReader.Close()
        ADOMDConn.Close()

        return df, metadf

The raw data are then extracted via something like this:

原始数据通过如下方式提取:

pathPowerBI = "C:/Program Files/Microsoft Power BI Desktop/bin"
initialSetup(pathPowerBI)
session = restorePowerPivot("D:/Downloads/PowerPivotTutorialSample.xlsx", "D:/", 60000, pathPowerBI)
df, metadf = runQuery("EVALUATE dbo_DimProduct", 60000, 1)
endSession(session)

#2


2  

The problem with getting data out of PowerPivot is that the tabular engine in PowerPivot runs in-process inside Excel and the only way to connect to that engine is to have your code running inside Excel too. (I suspect that it may use shared memory or some other transport, but it's definitely not listening on a TCP port or a named pipe or anything like that which would allow an external process to connect)

从PowerPivot中获取数据的问题是,PowerPivot中的列表引擎在Excel中运行,而连接到该引擎的唯一方法就是让代码在Excel中运行。(我怀疑它可能使用共享内存或其他传输方式,但它肯定没有监听TCP端口或指定的管道或类似的东西,这将允许外部进程连接)

We do this in Dax Studio by running a C# VSTO Excel add-in in Excel. However that was only designed to work for testing analytic queries, not for doing bulk data extraction. We marshal the data across from the add-in to the UI using a string variable so the entire dataset must be less than 2Gb or the response gets truncated and you will see an "unrecognizable response" error (the data is serialized into an XMLA rowset which is quite verbose so may see it break when only extracting a few hundred Mb of data)

我们在Dax Studio中通过在Excel中运行c# VSTO Excel插件来实现这一点。然而,这只是为了测试分析查询而设计的,而不是为了进行大量数据提取。对面我们元帅数据插件UI使用字符串变量,所以整个数据集必须小于2 gb或响应被截断,您将看到一个“面目全非”的错误(数据序列化为一个了XMLA rowset相当冗长的所以只能看到它打破当提取几百Mb的数据)

If you wanted to build a script to automate extracting all the raw data from a model I don't think you will be able to do it with Python as I don't believe you can get the python interpreter running in-process inside Excel. I would look at using a vba macro like this one http://www.powerpivotblog.nl/export-a-table-or-dax-query-from-power-pivot-to-csv-using-vba/

如果您想构建一个脚本,以便自动从模型中提取所有原始数据,我认为您无法使用Python来实现这一点,因为我认为您无法让Python解释器在Excel中运行。我将使用一个vba宏,比如http://www.powerpivotblog.nl/export-a-table-or-dax-query-from-power- pivotto csv- use -vba/

You should find that you can query the model for a list of tables with something like "SELECT * FROM $SYSTEM.DBSCHEMA_TABLES" - you could then loop over each table and extract with a variation of the code in the above link.

您应该会发现,您可以使用“SELECT * FROM $SYSTEM”之类的语句查询模型,以获取表列表。DBSCHEMA_TABLES“——然后,您可以对每个表进行循环,并使用上面链接中的代码进行提取。

#3


1  

I got in touch with Tom Gleeson (aka Gobán Saor) who was kind enough to let me post his emails here. There are some interesting nuggets in them, so hopefully others will also find them useful.

我联系了Tom Gleeson(又名Goban Saor),他好心地让我把他的邮件发到这里。其中有一些有趣的东西,所以希望其他人也能发现它们的用处。

Email #1

电子邮件# 1

When you say Python, you mean running Python.NET as a standalone exe? If that’s the case, you’re out of luck with Excel PP models (different story for Power BI desktop though). I’ve accessed PP models (2010+) successfully from both VBA, and from Python.NET (via AMO) using similar code to that in your SO question. The difference being (in both VBA & .NET version) is that my code is running in-process within Excel using Excel’s various add-in technologies. (Likely Tableau is also running as an add-in or has embedded Excel within itself enabling similar behaviour). DAX Studio (a useful C# code base to learn the how-tos of PP access) runs both as an Excel add-in and as a standalone EXE, but only as an add-in can it access Excel based PP models.

当你说Python时,你的意思是运行Python。NET作为一个独立的exe?如果是这样的话,那你就不适合使用Excel PP模型了(不过Power BI桌面的情况有所不同)。我从VBA和Python中成功地访问了PP模型(2010+)。NET(通过AMO)使用与你的SO问题类似的代码。不同的是(在VBA和。net版本中)我的代码是使用Excel的各种插件技术在Excel内部运行的。(Tableau可能也是作为外接程序运行的,或者在其内部嵌入Excel以支持类似的行为)。DAX Studio(一个有用的c#代码库,用来学习PP access的howto -tos)既作为一个Excel插件,也作为一个独立的EXE运行,但只能作为一个插件,它可以访问基于Excel的PP模型。

Email #2

电子邮件# 2

You might find the process of using Python.NET for this somewhat challenging. You would need to embed a Python engine using C#/VB.NET Excel add-in code. I’ve used Excel-DNA (a fantastic open source project) rather than MS’s highly cumbersome "official" method for developing such .NET addins in the past, but I mainly stick to VBA where at all possible.

您可能会发现使用Python的过程。这有点挑战性。您需要使用c# /VB嵌入一个Python引擎。净Excel插件代码。我曾经使用过Excel-DNA(一个很棒的开源项目),而不是MS用来开发这样的。net addins的非常繁琐的“官方”方法,但是我主要使用VBA,只要可能的话。

Using VBA you’ll not be able to access the .NET-only AMO (so no ability to create calculated columns on the fly), but by loading the resulting dataset into an ADO recordset you should be able to output to a worksheet OR to a corporate-database/MS Access OR to a flat-file/CSV etc.

使用VBA,您将无法访问. net -only AMO(因此无法动态创建计算列),但是通过将结果数据集加载到ADO记录集中,您应该能够将结果数据集输出到工作表或企业数据库/MS访问或平面文件/CSV等。

Unlike the 1M worksheet limit, for a flat-file or database output memory (RAM) will be the limiting factor, but, assuming you’re using 64bit Excel and have enough memory to hold the compacted model and the workspace for the largest of the model’s tables in un-compacted form (i.e. a row based rather than column based format that’ll result from a DAX Query), multiplied by 2ish (one instance within PP workspace the other within VBA’s ADO workspace) you should be okay.

不同于1米表限制,文件或数据库输出内存(RAM)将限制因素,但是,假设您使用的是64位的Excel和拥有足够的内存压缩模型和空间模型的最大的表un-compacted形式(即基于一行而不是列格式,会造成DAX指数查询),乘以2伊什(一个实例在PP工作区中的其他VBA的ADO工作区)你应该好了。

Having said that, I’ve never attempted extracting a very large dataset, and using models as a dataset exchange medium is not one of PP’s "use-cases"; so, very large tables might hit some other bug/constraint!

话虽如此,我从未尝试过提取一个非常大的数据集,并且使用模型作为数据集交换媒介并不是PP的“用例”之一;因此,非常大的表可能会遇到其他bug/约束!

#1


4  

Lo and behold, I finally managed to crack the problem - turns out that accessing Power Pivot data using Python is indeed possible! Below's a short recap of what I did - you can find a more detailed description here: Analysis Services (SSAS) on a shoestring. Note: the code has been optimized neither for efficiency nor elegance.

瞧,我终于解决了这个问题——原来使用Python访问Power Pivot数据确实是可能的!下面是我所做的简要概述——您可以在这里找到更详细的描述:鞋带上的分析服务(SSAS)。注意:该代码既没有优化效率也没有优化优雅性。

  • Install Microsoft Power BI Desktop (comes with free Analysis Services server, so no need for a costly SQL Server license - however, the same approach obviously also works if you have a proper license).
  • 安装Microsoft Power BI桌面(附带免费的分析服务服务器,因此不需要昂贵的SQL server许可证——但是,如果您有适当的许可证,同样的方法显然也可以工作)。
  • Fire up the AS engine by first creating the msmdsrv.ini settings file, then restore the database from the ABF file (using AMO.NET), then extract data using ADOMD.NET.
  • 首先创建msmdsrv启动AS引擎。ini设置文件,然后从ABF文件(使用AMO.NET)恢复数据库,然后使用ADOMD.NET提取数据。

Here's the Python code that illustrates the AS engine + AMO.NET parts:

下面是Python代码,演示了AS engine + AMO。网络部分:

import psutil, subprocess, random, os, zipfile, shutil, clr, sys, pandas

def initialSetup(pathPowerBI):
    sys.path.append(pathPowerBI)

    #required Analysis Services assemblies
    clr.AddReference("Microsoft.PowerBI.Amo.Core")
    clr.AddReference("Microsoft.PowerBI.Amo")     
    clr.AddReference("Microsoft.PowerBI.AdomdClient")

    global AMO, ADOMD
    import Microsoft.AnalysisServices as AMO
    import Microsoft.AnalysisServices.AdomdClient as ADOMD

def restorePowerPivot(excelName, pathTarget, port, pathPowerBI):   
    #create random folder
    os.chdir(pathTarget)
    folder = os.getcwd()+str(random.randrange(10**6, 10**7))
    os.mkdir(folder)

    #extract PowerPivot model (abf backup)
    archive = zipfile.ZipFile(excelName)
    for member in archive.namelist():
        if ".data" in member:
            filename = os.path.basename(member)
            abfname = os.path.join(folder, filename) + ".abf"
            source = archive.open(member)
            target = file(os.path.join(folder, abfname), 'wb')
            shutil.copyfileobj(source, target)
            del target
    archive.close()

    #start the cmd.exe process to get its PID
    listPIDpre = [proc for proc in psutil.process_iter()]
    process = subprocess.Popen('cmd.exe /k', stdin=subprocess.PIPE)
    listPIDpost = [proc for proc in psutil.process_iter()]
    pid = [proc for proc in listPIDpost if proc not in listPIDpre if "cmd.exe" in str(proc)][0]
    pid = str(pid).split("=")[1].split(",")[0]

    #msmdsrv.ini
    msmdsrvText = '''<ConfigurationSettings>
       <DataDir>{0}</DataDir>
       <TempDir>{0}</TempDir>
       <LogDir>{0}</LogDir>
       <BackupDir>{0}</BackupDir>
       <DeploymentMode>2</DeploymentMode>
       <RecoveryModel>1</RecoveryModel>
       <DisklessModeRequested>0</DisklessModeRequested>
       <CleanDataFolderOnStartup>1</CleanDataFolderOnStartup>
       <AutoSetDefaultInitialCatalog>1</AutoSetDefaultInitialCatalog>
       <Network>
          <Requests>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
          </Requests>
          <Responses>
             <EnableBinaryXML>1</EnableBinaryXML>
             <EnableCompression>1</EnableCompression>
             <CompressionLevel>9</CompressionLevel>
          </Responses>
          <ListenOnlyOnLocalConnections>1</ListenOnlyOnLocalConnections>
       </Network>
       <Port>{1}</Port>
       <PrivateProcess>{2}</PrivateProcess>
       <InstanceVisible>0</InstanceVisible>
       <Language>1033</Language>
       <Debug>
          <CallStackInError>0</CallStackInError>
       </Debug>
       <Log>
          <Exception>
             <CrashReportsFolder>{0}</CrashReportsFolder>
          </Exception>
          <FlightRecorder>
             <Enabled>0</Enabled>
          </FlightRecorder>
       </Log>
       <AllowedBrowsingFolders>{0}</AllowedBrowsingFolders>
       <ResourceGovernance>
          <GovernIMBIScheduler>0</GovernIMBIScheduler>
       </ResourceGovernance>
       <Feature>
          <ManagedCodeEnabled>1</ManagedCodeEnabled>
       </Feature>
       <VertiPaq>
          <EnableDisklessTMImageSave>0</EnableDisklessTMImageSave>
          <EnableProcessingSimplifiedLocks>1</EnableProcessingSimplifiedLocks>
       </VertiPaq>
    </ConfigurationSettings>'''

    #save ini file to disk, fill it with required parameters
    msmdsrvini = open(folder+"\\msmdsrv.ini", "w")
    msmdsrvText = msmdsrvText.format(folder, port, pid) #{0},{1},{2}
    msmdsrvini.write(msmdsrvText)
    msmdsrvini.close()

    #run AS engine inside the cmd.exe process
    initString = "\"{0}\\msmdsrv.exe\" -c -s \"{1}\""
    initString = initString.format(pathPowerBI.replace("/","\\"),folder)
    process.stdin.write(initString + " \n")

    #connect to the AS instance from Python
    AMOServer = AMO.Server()
    AMOServer.Connect("localhost:{0}".format(port))

    #restore database from PowerPivot abf backup, disconnect
    AMORestoreInfo = AMO.RestoreInfo(os.path.join(folder, abfname))
    AMOServer.Restore(AMORestoreInfo)
    AMOServer.Disconnect()

    return process

And the data-extraction part:

和数据提取的部分:

def runQuery(query, port, flag):
    #ADOMD assembly
    ADOMDConn = ADOMD.AdomdConnection("Data Source=localhost:{0}".format(port))
    ADOMDConn.Open()
    ADOMDCommand = ADOMDConn.CreateCommand() 
    ADOMDCommand.CommandText = query

    #read data in via AdomdDataReader object
    DataReader = ADOMDCommand.ExecuteReader()

    #get metadata, number of columns
    SchemaTable = DataReader.GetSchemaTable()
    numCol = SchemaTable.Rows.Count #same as DataReader.FieldCount

    #get column names
    columnNames = []
    for i in range(numCol):
        columnNames.append(str(SchemaTable.Rows[i][0]))

    #fill with data
    data = []
    while DataReader.Read()==True:
        row = []
        for j in range(numCol):
            try:
                row.append(DataReader[j].ToString())
            except:
                row.append(DataReader[j])
        data.append(row)
    df = pandas.DataFrame(data)
    df.columns = columnNames 

    if flag==0:
        DataReader.Close()
        ADOMDConn.Close()

        return df     
    else:   
        #metadata table
        metadataColumnNames = []
        for j in range(SchemaTable.Columns.Count):
            metadataColumnNames.append(SchemaTable.Columns[j].ToString())
        metadata = []
        for i in range(numCol):
            row = []
            for j in range(SchemaTable.Columns.Count):
                try:
                    row.append(SchemaTable.Rows[i][j].ToString())
                except:
                    row.append(SchemaTable.Rows[i][j])
            metadata.append(row)
        metadf = pandas.DataFrame(metadata)
        metadf.columns = metadataColumnNames

        DataReader.Close()
        ADOMDConn.Close()

        return df, metadf

The raw data are then extracted via something like this:

原始数据通过如下方式提取:

pathPowerBI = "C:/Program Files/Microsoft Power BI Desktop/bin"
initialSetup(pathPowerBI)
session = restorePowerPivot("D:/Downloads/PowerPivotTutorialSample.xlsx", "D:/", 60000, pathPowerBI)
df, metadf = runQuery("EVALUATE dbo_DimProduct", 60000, 1)
endSession(session)

#2


2  

The problem with getting data out of PowerPivot is that the tabular engine in PowerPivot runs in-process inside Excel and the only way to connect to that engine is to have your code running inside Excel too. (I suspect that it may use shared memory or some other transport, but it's definitely not listening on a TCP port or a named pipe or anything like that which would allow an external process to connect)

从PowerPivot中获取数据的问题是,PowerPivot中的列表引擎在Excel中运行,而连接到该引擎的唯一方法就是让代码在Excel中运行。(我怀疑它可能使用共享内存或其他传输方式,但它肯定没有监听TCP端口或指定的管道或类似的东西,这将允许外部进程连接)

We do this in Dax Studio by running a C# VSTO Excel add-in in Excel. However that was only designed to work for testing analytic queries, not for doing bulk data extraction. We marshal the data across from the add-in to the UI using a string variable so the entire dataset must be less than 2Gb or the response gets truncated and you will see an "unrecognizable response" error (the data is serialized into an XMLA rowset which is quite verbose so may see it break when only extracting a few hundred Mb of data)

我们在Dax Studio中通过在Excel中运行c# VSTO Excel插件来实现这一点。然而,这只是为了测试分析查询而设计的,而不是为了进行大量数据提取。对面我们元帅数据插件UI使用字符串变量,所以整个数据集必须小于2 gb或响应被截断,您将看到一个“面目全非”的错误(数据序列化为一个了XMLA rowset相当冗长的所以只能看到它打破当提取几百Mb的数据)

If you wanted to build a script to automate extracting all the raw data from a model I don't think you will be able to do it with Python as I don't believe you can get the python interpreter running in-process inside Excel. I would look at using a vba macro like this one http://www.powerpivotblog.nl/export-a-table-or-dax-query-from-power-pivot-to-csv-using-vba/

如果您想构建一个脚本,以便自动从模型中提取所有原始数据,我认为您无法使用Python来实现这一点,因为我认为您无法让Python解释器在Excel中运行。我将使用一个vba宏,比如http://www.powerpivotblog.nl/export-a-table-or-dax-query-from-power- pivotto csv- use -vba/

You should find that you can query the model for a list of tables with something like "SELECT * FROM $SYSTEM.DBSCHEMA_TABLES" - you could then loop over each table and extract with a variation of the code in the above link.

您应该会发现,您可以使用“SELECT * FROM $SYSTEM”之类的语句查询模型,以获取表列表。DBSCHEMA_TABLES“——然后,您可以对每个表进行循环,并使用上面链接中的代码进行提取。

#3


1  

I got in touch with Tom Gleeson (aka Gobán Saor) who was kind enough to let me post his emails here. There are some interesting nuggets in them, so hopefully others will also find them useful.

我联系了Tom Gleeson(又名Goban Saor),他好心地让我把他的邮件发到这里。其中有一些有趣的东西,所以希望其他人也能发现它们的用处。

Email #1

电子邮件# 1

When you say Python, you mean running Python.NET as a standalone exe? If that’s the case, you’re out of luck with Excel PP models (different story for Power BI desktop though). I’ve accessed PP models (2010+) successfully from both VBA, and from Python.NET (via AMO) using similar code to that in your SO question. The difference being (in both VBA & .NET version) is that my code is running in-process within Excel using Excel’s various add-in technologies. (Likely Tableau is also running as an add-in or has embedded Excel within itself enabling similar behaviour). DAX Studio (a useful C# code base to learn the how-tos of PP access) runs both as an Excel add-in and as a standalone EXE, but only as an add-in can it access Excel based PP models.

当你说Python时,你的意思是运行Python。NET作为一个独立的exe?如果是这样的话,那你就不适合使用Excel PP模型了(不过Power BI桌面的情况有所不同)。我从VBA和Python中成功地访问了PP模型(2010+)。NET(通过AMO)使用与你的SO问题类似的代码。不同的是(在VBA和。net版本中)我的代码是使用Excel的各种插件技术在Excel内部运行的。(Tableau可能也是作为外接程序运行的,或者在其内部嵌入Excel以支持类似的行为)。DAX Studio(一个有用的c#代码库,用来学习PP access的howto -tos)既作为一个Excel插件,也作为一个独立的EXE运行,但只能作为一个插件,它可以访问基于Excel的PP模型。

Email #2

电子邮件# 2

You might find the process of using Python.NET for this somewhat challenging. You would need to embed a Python engine using C#/VB.NET Excel add-in code. I’ve used Excel-DNA (a fantastic open source project) rather than MS’s highly cumbersome "official" method for developing such .NET addins in the past, but I mainly stick to VBA where at all possible.

您可能会发现使用Python的过程。这有点挑战性。您需要使用c# /VB嵌入一个Python引擎。净Excel插件代码。我曾经使用过Excel-DNA(一个很棒的开源项目),而不是MS用来开发这样的。net addins的非常繁琐的“官方”方法,但是我主要使用VBA,只要可能的话。

Using VBA you’ll not be able to access the .NET-only AMO (so no ability to create calculated columns on the fly), but by loading the resulting dataset into an ADO recordset you should be able to output to a worksheet OR to a corporate-database/MS Access OR to a flat-file/CSV etc.

使用VBA,您将无法访问. net -only AMO(因此无法动态创建计算列),但是通过将结果数据集加载到ADO记录集中,您应该能够将结果数据集输出到工作表或企业数据库/MS访问或平面文件/CSV等。

Unlike the 1M worksheet limit, for a flat-file or database output memory (RAM) will be the limiting factor, but, assuming you’re using 64bit Excel and have enough memory to hold the compacted model and the workspace for the largest of the model’s tables in un-compacted form (i.e. a row based rather than column based format that’ll result from a DAX Query), multiplied by 2ish (one instance within PP workspace the other within VBA’s ADO workspace) you should be okay.

不同于1米表限制,文件或数据库输出内存(RAM)将限制因素,但是,假设您使用的是64位的Excel和拥有足够的内存压缩模型和空间模型的最大的表un-compacted形式(即基于一行而不是列格式,会造成DAX指数查询),乘以2伊什(一个实例在PP工作区中的其他VBA的ADO工作区)你应该好了。

Having said that, I’ve never attempted extracting a very large dataset, and using models as a dataset exchange medium is not one of PP’s "use-cases"; so, very large tables might hit some other bug/constraint!

话虽如此,我从未尝试过提取一个非常大的数据集,并且使用模型作为数据集交换媒介并不是PP的“用例”之一;因此,非常大的表可能会遇到其他bug/约束!