I'm currently trying to use a SAX parser, but about three quarters of the way through the file it freezes completely. I have tried allocating more memory, among other things, but nothing has helped.
Is there any way to speed this up? A better method?
I stripped it down to the bare bones, so I now have the following code, and when run from the command line it still isn't as fast as I would like.
Running it with "java -Xms4096m -Xmx8192m -jar reader.jar", I get a "GC overhead limit exceeded" error around article 700000.
Main:
import java.util.ArrayList;

public class Read {
    public static void main(String[] args) {
        // Loads every parsed page into memory at once
        ArrayList<Page> pages = XMLManager.getPages();
    }
}
XMLManager
public class XMLManager {
    public static ArrayList<Page> getPages() {
        ArrayList<Page> pages = null;
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("..\\enwiki-20140811-pages-articles.xml");
            PageHandler pageHandler = new PageHandler();
            parser.parse(file, pageHandler);
            pages = pageHandler.getPages();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return pages;
    }
}
PageHandler
public class PageHandler extends DefaultHandler {
    private ArrayList<Page> pages = new ArrayList<>();
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler() {
        super();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        stringBuilder = new StringBuilder();
        if (qName.equals("page")) {
            page = new Page();
            idSet = false;
        } else if (qName.equals("redirect")) {
            if (page != null) {
                page.setRedirecting(true);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (page != null && !page.isRedirecting()) {
            if (qName.equals("title")) {
                page.setTitle(stringBuilder.toString());
            } else if (qName.equals("id")) {
                if (!idSet) {
                    page.setId(Integer.parseInt(stringBuilder.toString()));
                    idSet = true;
                }
            } else if (qName.equals("text")) {
                String articleText = stringBuilder.toString();
                articleText = articleText.replaceAll("(?s)<ref(.+?)</ref>", " "); //remove references
                articleText = articleText.replaceAll("(?s)\\{\\{(.+?)\\}\\}", " "); //remove links underneath headings
                articleText = articleText.replaceAll("(?s)==See also==.+", " "); //remove everything after see also
                articleText = articleText.replaceAll("\\|", " "); //Separate multiple links
                articleText = articleText.replaceAll("\\n", " "); //remove new lines
                articleText = articleText.replaceAll("[^a-zA-Z0-9- \\s]", " "); //remove all non alphanumeric except dashes and spaces
                articleText = articleText.trim().replaceAll(" +", " "); //convert all multiple spaces to 1 space
                Pattern pattern = Pattern.compile("([\\S]+\\s*){1,75}"); //get first 75 words of text
                Matcher matcher = pattern.matcher(articleText);
                matcher.find();
                try {
                    page.setSummaryText(matcher.group());
                } catch (IllegalStateException se) {
                    page.setSummaryText("None");
                }
                page.setText(articleText);
            } else if (qName.equals("page")) {
                pages.add(page);
                page = null;
            }
        } else {
            page = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        stringBuilder.append(ch, start, length);
    }

    public ArrayList<Page> getPages() {
        return pages;
    }
}
2 answers
#1
26
Your parsing code is likely working fine, but the volume of data you're loading is probably just too large to hold in memory in that ArrayList.
You need some sort of pipeline to pass the data on to its actual destination without ever storing it all in memory at once.
What I've sometimes done for this sort of situation is similar to the following.
Create an interface for processing a single element:
public interface PageProcessor {
    void process(Page page);
}
Supply an implementation of this to the PageHandler through a constructor:
public class Read {
    public static void main(String[] args) {
        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing,
                // but I don't know what that is...
                System.out.println(page);
            }
        });
    }
}

public class XMLManager {
    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            PageHandler pageHandler = new PageHandler(processor);
            parser.parse(file, pageHandler);
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Send data to this processor instead of putting it in the list:
public class PageHandler extends DefaultHandler {
    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        // Unchanged from your implementation
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Unchanged from your implementation
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (page != null && !page.isRedirecting()) {
            if (qName.equals("title")) {
                // Title, id and text branches elided: unchanged from your implementation
            } else if (qName.equals("page")) {
                // Hand the finished page on instead of adding it to a list
                processor.process(page);
                page = null;
            }
        } else {
            page = null;
        }
    }
}
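To see the callback idea run end to end, here is a minimal self-contained sketch on a tiny in-memory document. It is not the code above: the class names StreamingDemo and TitleHandler, the streamTitles helper and the sample XML are all made up for illustration, and it only extracts titles rather than building full Page objects.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class StreamingDemo {

    /** Hypothetical handler: passes each finished title to a callback instead of storing it. */
    static class TitleHandler extends DefaultHandler {
        private final Consumer<String> sink;
        private final StringBuilder text = new StringBuilder();
        private boolean inTitle = false;

        TitleHandler(Consumer<String> sink) {
            this.sink = sink;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("title")) {
                inTitle = true;
                text.setLength(0); // reuse the buffer for each title
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inTitle) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("title")) {
                inTitle = false;
                sink.accept(text.toString()); // nothing is retained after this call
            }
        }
    }

    /** Parses the XML and passes each title to the sink as soon as it is complete. */
    static void streamTitles(String xml, Consumer<String> sink) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(bytes), new TitleHandler(sink));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>Alpha</title></page>"
                   + "<page><title>Beta</title></page></mediawiki>";
        streamTitles(xml, title -> System.out.println("got: " + title));
    }
}
```

Because the handler keeps only the title currently being read, memory use stays flat no matter how many pages the input contains.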
Of course, you can make your interface handle chunks of multiple records rather than just one, and have the PageHandler collect pages locally in a smaller list, periodically send the list off for processing, and then clear it.
Or (perhaps better) you could implement the PageProcessor interface as defined here and build in logic there that buffers the data and sends it on for further handling in chunks.
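A buffering implementation along those lines might look like the sketch below. The names BatchingDemo and BatchingPageProcessor, the batch size and the downstream Consumer are assumptions made for illustration, and the nested Page class is only a stand-in for the real one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchingDemo {

    /** Minimal stand-in for the question's Page class (assumption: only a title). */
    static class Page {
        final String title;

        Page(String title) {
            this.title = title;
        }
    }

    /** Same shape as the PageProcessor interface defined above. */
    interface PageProcessor {
        void process(Page page);
    }

    /** Hypothetical processor that buffers pages and forwards them in fixed-size chunks. */
    static class BatchingPageProcessor implements PageProcessor {
        private final int batchSize;
        private final Consumer<List<Page>> downstream;
        private final List<Page> buffer = new ArrayList<>();

        BatchingPageProcessor(int batchSize, Consumer<List<Page>> downstream) {
            if (batchSize < 1) {
                throw new IllegalArgumentException("batchSize must be >= 1");
            }
            this.batchSize = batchSize;
            this.downstream = downstream;
        }

        @Override
        public void process(Page page) {
            buffer.add(page);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        /** Sends whatever is buffered; call it once more after parsing finishes. */
        void flush() {
            if (!buffer.isEmpty()) {
                downstream.accept(new ArrayList<>(buffer)); // hand over a copy
                buffer.clear();
            }
        }
    }

    public static void main(String[] args) {
        BatchingPageProcessor processor = new BatchingPageProcessor(2,
                batch -> System.out.println("handling " + batch.size() + " pages"));
        for (String title : new String[] {"A", "B", "C", "D", "E"}) {
            processor.process(new Page(title));
        }
        processor.flush(); // the last, partial batch
    }
}
```

The final flush() matters: without it, any pages left in a partial batch when the parser reaches the end of the file would never be sent on.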
#2
0
Don Roby's approach is somewhat reminiscent of the approach I followed when creating a code generator designed to solve this particular problem (an early version was conceived in 2008). Basically, each complexType has its Java POJO equivalent, and handlers for the particular type are activated when the context changes to that element. I have used this approach for SEPA, transaction banking and, for instance, Discogs (30GB). You can specify which elements you want to process at runtime, declaratively, using a properties file.
XML2J maps complexTypes to Java POJOs on the one hand, but on the other lets you specify the events you want to listen for. E.g.:
account/@process = true
account/accounts/@process = true
account/accounts/@detach = true
The essence is in the third line. The detach makes sure individual accounts are not added to the accounts list, so the list cannot grow until memory runs out.
class AccountType {
    private List<AccountType> accounts = new ArrayList<>();

    public void addAccount(AccountType tAccount) {
        accounts.add(tAccount);
    }
    // etc.
}
In your code you need to implement the process method (by default the code generator generates an empty method):
class AccountsProcessor implements MessageProcessor {
    static private Logger logger = LoggerFactory.getLogger(AccountsProcessor.class);

    // assuming Spring data persistency here
    final String path = new ClassPathResource("spring-config.xml").getPath();
    ClassPathXmlApplicationContext context = new ClassPathXmlApplicationContext(path);
    AccountsTypeRepo repo = context.getBean(AccountsTypeRepo.class);

    @Override
    public void process(XMLEvent evt, ComplexDataType data)
            throws ProcessorException {
        if (evt == XMLEvent.END) {
            if (data instanceof AccountType) {
                process((AccountType) data);
            }
        }
    }

    private void process(AccountType data) {
        if (logger.isInfoEnabled()) {
            // do some logging
        }
        repo.save(data);
    }
}
Note that XMLEvent.END marks the closing tag of an element, so by the time you process it, the element is complete. If you have to relate it (using an FK) to its parent object in the database, you could process the XMLEvent.BEGIN for the parent, create a placeholder in the database and store its key with each of its children. On the final XMLEvent.END you would then update the parent.
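The placeholder idea can be sketched without any library at all. The example below does not use the XML2J API: the Event enum, the ParentKeyDemo class and the in-memory maps standing in for database tables are all hypothetical, chosen only to show the order of operations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ParentKeyDemo {

    /** Hypothetical event type; it only mirrors the BEGIN/END idea described above. */
    enum Event { BEGIN, END }

    // In-memory stand-ins for two database tables.
    final Map<Integer, String> parents = new LinkedHashMap<>(); // parent key -> status
    final List<String> children = new ArrayList<>();            // "parentKey:childName"

    private int nextKey = 1;
    private int currentParentKey = -1;
    private int childCount = 0;

    /** BEGIN inserts a placeholder row; END updates that row once all children are stored. */
    void onParent(Event evt) {
        if (evt == Event.BEGIN) {
            currentParentKey = nextKey++;
            childCount = 0;
            parents.put(currentParentKey, "placeholder");
        } else {
            parents.put(currentParentKey, "complete, " + childCount + " children");
            currentParentKey = -1;
        }
    }

    /** Called when a child element is complete; stores it with the parent's key. */
    void onChildEnd(String childName) {
        children.add(currentParentKey + ":" + childName);
        childCount++;
    }

    public static void main(String[] args) {
        ParentKeyDemo demo = new ParentKeyDemo();
        demo.onParent(Event.BEGIN);
        demo.onChildEnd("account-1");
        demo.onChildEnd("account-2");
        demo.onParent(Event.END);
        System.out.println(demo.parents);
        System.out.println(demo.children);
    }
}
```

Each child is written as soon as it is complete and carries the parent's key, so no child has to be held in memory until the parent closes.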
Note that the code generator generates everything you need. You just have to implement that method and of course the DB glue code.
There are samples to get you started. The code generator even generates your POM files, so you can build your project immediately after generation.
The default process method is like this:
@Override
public void process(XMLEvent evt, ComplexDataType data)
        throws ProcessorException {
    /*
     * TODO Auto-generated method stub implement your own handling here.
     * Use the runtime configuration file to determine which events are to be sent to the processor.
     */
    if (evt == XMLEvent.END) {
        data.print(ConsoleWriter.out);
    }
}
Downloads:
- https://github.com/lolkedijkstra/xml2j-core
- https://github.com/lolkedijkstra/xml2j-gen
- https://sourceforge.net/projects/xml2j/
First run mvn clean install on the core (it has to be in the local Maven repo), then on the generator. And don't forget to set up the environment variable XML2J_HOME as per the directions in the user manual.