I'm trying to call the HtmlTidy library DLL from C#. There are a few examples floating around on the net but nothing definitive... and I'm having no end of trouble. I'm pretty certain the problem is with the P/Invoke declaration... but danged if I know where I'm going wrong.
I got the libtidy.dll from http://www.paehl.com/open_source/?HTML_Tidy_for_Windows which seems to be a current version.
Here's a console app that demonstrates the problem I'm having:
using System;
using System.Collections.Generic;
using System.Text;
using System.Runtime.InteropServices;

namespace ConsoleApplication5
{
    class Program
    {
        [StructLayout(LayoutKind.Sequential)]
        public struct TidyBuffer
        {
            public IntPtr bp;        // Pointer to bytes
            public uint size;        // # bytes currently in use
            public uint allocated;   // # bytes allocated
            public uint next;        // Offset of current input position
        };

        [DllImport("libtidy.dll")]
        public static extern int tidyBufAlloc(ref TidyBuffer tidyBuffer, uint allocSize);

        static void Main(string[] args)
        {
            Console.WriteLine(CleanHtml("<html><body><p>Hello World!</p></body></html>"));
        }

        static string CleanHtml(string inputHtml)
        {
            byte[] inputArray = Encoding.UTF8.GetBytes(inputHtml);
            byte[] inputArray2 = Encoding.UTF8.GetBytes(inputHtml);
            TidyBuffer tidyBuffer2;
            tidyBuffer2.size = 0;
            tidyBuffer2.allocated = 0;
            tidyBuffer2.next = 0;
            tidyBuffer2.bp = IntPtr.Zero;
            //
            // tidyBufAlloc overwrites inputArray2... why? how? seems like
            // tidyBufAlloc is stomping on the stack a bit too much... but
            // how? I've tried changing the calling convention to cdecl and
            // stdcall but no change.
            //
            Console.WriteLine((inputArray2 == null ? "Array2 null" : "Array2 not null"));
            tidyBufAlloc(ref tidyBuffer2, 65535);
            Console.WriteLine((inputArray2 == null ? "Array2 null" : "Array2 not null"));
            return "did nothing";
        }
    }
}
All in all I'm a bit stumped. Any help would be appreciated!
3 solutions
#1
You are working with an old definition of the TidyBuffer structure. The new structure is larger, so when you call tidyBufAlloc it overwrites the stack location of inputArray2. The new definition is:
[StructLayout(LayoutKind.Sequential)]
public struct TidyBuffer
{
    public IntPtr allocator; // Pointer to custom allocator
    public IntPtr bp;        // Pointer to bytes
    public uint size;        // # bytes currently in use
    public uint allocated;   // # bytes allocated
    public uint next;        // Offset of current input position
};
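To see why the smaller struct leads to a stack overwrite, it may help to compare the marshalled sizes of the two layouts. The following is only a rough sketch: the type names OldTidyBuffer and NewTidyBuffer and the class TidyLayout are made up for illustration, and it assumes the native side initialises every field of its own, larger TidyBuffer. It does not call libtidy.dll at all.

```csharp
using System;
using System.Runtime.InteropServices;

public static class TidyLayout
{
    // Layout used in the question (hypothetical name; older header layout).
    [StructLayout(LayoutKind.Sequential)]
    public struct OldTidyBuffer
    {
        public IntPtr bp;
        public uint size;
        public uint allocated;
        public uint next;
    }

    // Layout from the answer above, with the leading allocator pointer.
    [StructLayout(LayoutKind.Sequential)]
    public struct NewTidyBuffer
    {
        public IntPtr allocator;
        public IntPtr bp;
        public uint size;
        public uint allocated;
        public uint next;
    }

    // Unmanaged size of each layout as the marshaller sees it.
    public static int OldSize() { return Marshal.SizeOf(typeof(OldTidyBuffer)); }
    public static int NewSize() { return Marshal.SizeOf(typeof(NewTidyBuffer)); }

    // Bytes by which the newer layout exceeds the older one.
    public static int SizeDifference() { return NewSize() - OldSize(); }

    public static void Main()
    {
        Console.WriteLine("old layout: {0} bytes", OldSize());
        Console.WriteLine("new layout: {0} bytes", NewSize());
        Console.WriteLine("difference: {0} bytes", SizeDifference());
    }
}
```

The difference should come out as one pointer width (IntPtr.Size): 4 bytes in a 32-bit process, 8 bytes in a 64-bit one. If the native code writes its full structure through a pointer to the smaller managed one, those extra bytes land in whatever sits next to it on the stack, which would fit the inputArray2 reference turning null.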
#2
For what it's worth, we tried Tidy at work and switched to HtmlAgilityPack.
#3
Try changing your tidyBufAlloc declaration to:
[DllImport("libtidy.dll", CharSet = CharSet.Ansi)]
private static extern int tidyBufAlloc(ref TidyBuffer Buffer, int allocSize);
Note the CharSet.Ansi addition and the "int allocSize" (instead of uint).
Also, see this sample code for an example of using HTML Tidy in C#.
In your example, if inputHtml is large, say 50K, inputArray and inputArray2 will also be 50K each.
You are then also trying to allocate 65K in the tidyBufAlloc call.
If a pointer is not initialised correctly, it is quite possible that a random .NET heap address is being used, so part or all of a seemingly unrelated variable or buffer gets overwritten. It is probably just luck, or the fact that you have already allocated large buffers, that you are not overwriting a code block, which would likely cause an invalid memory access error.