TAC KBP Chinese Entity Linking
Comprehensive Training and Evaluation Data 2011-2014
LDC2015E17
March 20, 2015
Linguistic Data Consortium
1. Overview
Text Analysis Conference (TAC) is a series of workshops organized by
the National Institute of Standards and Technology (NIST). TAC was
developed to encourage research in natural language processing (NLP)
and related applications by providing a large test collection, common
evaluation procedures, and a forum for researchers to share their
results. Through its various evaluations, the Knowledge Base
Population (KBP) track of TAC encourages the development of systems
that can match entities mentioned in natural texts with those
appearing in a knowledge base and extract novel information about
entities from a document collection and add it to a new or existing
knowledge base.
The goal of Entity Linking is to determine whether or not the entity
referred to in each query has a matching entity node in the reference
Knowledge Base (KB) (LDC2014T16). If there is a matching node for a
query, annotators create a link between the two. If there is not a
matching node for a query, the entity is marked as 'NIL' and then
clustered with other NIL entities into equivalence classes. For more
information, please refer to the Entity Linking section of NIST's 2014
TAC KBP website (2014 was the last year in which the Chinese Entity
Linking evaluation was conducted as of the time this package was
created) at http://nlp.cs.rpi.edu/kbp/2014/
This package contains all evaluation and training data developed
in support of TAC KBP Chinese Entity Linking during the four years
since the task's inception in 2011. This includes queries, KB links,
equivalence class clusters for NIL entities (those that could not
be linked to an entity in the knowledge base), and entity type
information for each of the queries.
The data included in this package were originally released by LDC
to TAC KBP coordinators and performers under the following ecorpora
catalog IDs and titles:
LDC2011E46: TAC 2011 KBP Cross-lingual Sample Entity Linking Queries V1.1
LDC2011E55: TAC 2011 KBP Cross-lingual Training Entity Linking V1.1
LDC2012E34: TAC 2011 KBP Cross-Lingual Evaluation Entity Linking Annotation
LDC2012E66: TAC 2012 KBP Chinese Entity Linking Web Training Queries and Annotations
LDC2012E103: TAC 2012 KBP Chinese Entity Linking Evaluation Annotations V1.2
LDC2013E96: TAC 2013 KBP Chinese Entity Linking Evaluation Queries and Knowledge Base Links V1.2
LDC2014E47: TAC 2014 KBP Chinese Entity Linking Discussion Forum Training Data
LDC2014E83: TAC 2014 KBP Chinese Entity Linking Evaluation Queries and Knowledge Base Links V2.0
2. Contents
./README.txt
This file
./data/2011/eval/tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml
This file contains 2176 queries. Each query entry consists of
the following fields:
- A query ID formatted as the letters "EL_CLCMN_"
(if a Chinese language query) or "EL_CLENG_" (if
an English language query) plus a five-digit
zero-padded, sequentially assigned integer
(e.g. "EL_CLCMN_00001").
- The full namestring of the query entity.
- An ID for a document in ./data/2011/eval/source_documents/
from which the namestring was extracted.
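As a rough illustration of reading the three fields listed above, the sketch below parses a queries file with Python's standard library. The element and attribute names used here (`query`, `id`, `name`, `docid`) and the `kbpentlink` root are assumptions based on the usual TAC KBP query layout; this README does not spell them out, so check them against ./dtd/2011_kbpentlink.dtd before relying on them. The sample record is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed layout; not taken from the corpus.
SAMPLE = """<kbpentlink>
  <query id="EL_CLCMN_00001">
    <name>Example Name</name>
    <docid>EXAMPLE_DOC_0001</docid>
  </query>
</kbpentlink>"""


def read_queries(xml_text):
    """Return one dict per query; element names are assumptions."""
    root = ET.fromstring(xml_text)
    queries = []
    for query in root.iter("query"):
        queries.append({
            "id": query.get("id"),
            "name": query.findtext("name"),
            "docid": query.findtext("docid"),
        })
    return queries


for q in read_queries(SAMPLE):
    print(q["id"], q["name"], q["docid"])
```

For a file on disk, `ET.parse(path).getroot()` can replace the `ET.fromstring` call.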
The queries are distributed by language and type as follows:
KB-Link              GPE   ORG   PER   Total
--------------------------------------------
CMN NW NIL:          120   291   420     831
CMN NW Non-NIL:      279   150   221     650
ENG NW NIL:           90   129    20     239
ENG NW Non-NIL:       93    72   104     269
ENG WB NIL:           16     0     5      21
ENG WB Non-NIL:       44    68    54     166
--------------------------------------------
Total:               642   710   824    2176
./data/2011/eval/tac_kbp_2011_chinese_entity_linking_evaluation_KB_links.tab
This file contains the responses for each query as identified by
human annotators at LDC. This file is tab delimited, with 4 fields total.
The column descriptions are as follows:
1. query ID - The ID for the query detailed in
tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml
to which the subsequent information pertains
2. entity ID - A unique entity node ID or NIL ID, correspondent
to entity linking annotation and NIL-coreference
(clustering) annotation respectively. If the entity
node ID begins with "E", the text refers to an
entity in the Knowledge Base (TAC KBP Reference Knowledge
Base - LDC2014T16). If the given query is not linked to
an entity in the Knowledge Base (KB), then it is
given a NIL-ID, which consists of "NIL" plus a
four-digit zero-padded sequentially assigned
integer (e.g. NIL-0001, NIL-0002). Both the entities
with an entity node ID of "E" type and "NIL" type
are assumed to be co-referenced (clustered), with
the same "E" type ID or the same "NIL" ID if they
refer to the same entity. Each "E" type ID and NIL
ID is distinct from one another.
3. entity-type - GPE, ORG, or PER type indicator for the entity
4. genre - WB/NW/DF indicating the source genre of the
document for the query (WB for web data, NW for
newswire data, or DF for discussion forum data).
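A minimal sketch of loading the four columns described above. Whether the file carries a header row is not stated here, so this sketch simply skips any row whose third field is not GPE, ORG, or PER; the sample lines and the "E0000001" node ID are invented for illustration.

```python
import csv
from typing import NamedTuple

ENTITY_TYPES = {"GPE", "ORG", "PER"}


class KBLink(NamedTuple):
    query_id: str
    entity_id: str
    entity_type: str
    genre: str

    @property
    def is_nil(self):
        # NIL IDs start with "NIL"; KB node IDs start with "E".
        return self.entity_id.startswith("NIL")


def read_kb_links(lines):
    """Parse tab-delimited KB_links rows from any iterable of strings."""
    links = []
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 4 and row[2] in ENTITY_TYPES:
            links.append(KBLink(*row[:4]))
    return links


# Invented rows, for illustration only.
SAMPLE_LINES = [
    "EL_CLCMN_00001\tE0000001\tPER\tNW\n",
    "EL_CLCMN_00002\tNIL-0001\tORG\tNW\n",
]

for link in read_kb_links(SAMPLE_LINES):
    print(link.query_id, link.entity_id, "NIL" if link.is_nil else "KB")
```

An open file object can be passed directly as `lines`.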
./data/2011/eval/source_documents/*
This directory contains all of the source documents referenced by
the document ID of each query in
tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml. See
section 5 for more information about source documents.
./data/2011/training/tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml
This file is a concatenation of the queries files originally released
in LDC2011E46 (sample) and LDC2011E55 (training). This file contains
2171 queries. Each query entry consists of the following fields:
- A query ID formatted as the letters "EL_CLCMN_"
(if a Chinese language query) or "EL_CLENG_" (if
an English language query) plus a five-digit
zero-padded, sequentially assigned integer
(e.g. "EL_CLCMN_00001").
- The full namestring of the query entity.
- An ID for a document in ./data/2011/training/source_documents/
from which the namestring was extracted.
The queries are distributed by language and type as follows:
KB-Link              GPE   ORG   PER   Total
--------------------------------------------
CMN NW NIL:          124   293   426     843
CMN NW Non-NIL:      284   149   227     660
ENG NW NIL:          143   116    63     322
ENG NW Non-NIL:      122   100   100     322
ENG WB NIL:            0     1     0       1
ENG WB Non-NIL:       14     3     6      23
--------------------------------------------
Total:               687   662   822    2171
./data/2011/training/tac_kbp_2011_chinese_entity_linking_sample_and_training_KB_links.tab
This file is a concatenation of the KB_links files originally released
in LDC2011E46 (sample) and LDC2011E55 (training). This file contains the
responses for each query as identified by human annotators at LDC. This
file is tab delimited, with 4 fields total. The column descriptions are
as follows:
1. query ID - The ID for the query detailed in
tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml
to which the subsequent information pertains
2. entity ID - A unique entity node ID or NIL ID, correspondent
to entity linking annotation and NIL-coreference
(clustering) annotation respectively. If the entity
node ID begins with "E", the text refers to an
entity in the Knowledge Base (TAC KBP Reference Knowledge
Base - LDC2014T16). If the given query is not linked to
an entity in the Knowledge Base (KB), then it is
given a NIL-ID, which consists of "NIL" plus a
four-digit zero-padded sequentially assigned
integer (e.g. NIL-0001, NIL-0002). Both the entities
with an entity node ID of "E" type and "NIL" type
are assumed to be co-referenced (clustered), with
the same "E" type ID or the same "NIL" ID if they
refer to the same entity. Each "E" type ID and NIL
ID is distinct from one another.
3. entity-type - GPE, ORG, or PER type indicator for the entity
4. genre - WB/NW/DF indicating the source genre of the
document for the query (WB for web data, NW for
newswire data, or DF for discussion forum data).
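A distribution table of the kind shown for the 2011 query sets could be re-derived from the KB_links columns, as sketched below. The language is taken from the query ID prefix ("EL_CLCMN_" or "EL_CLENG_"), which applies only to the 2011 data; the input rows are invented and the function name is hypothetical.

```python
from collections import Counter


def tabulate(rows):
    """Count queries by (language, genre, NIL status, entity type).

    Each row is (query_id, entity_id, entity_type, genre).
    """
    counts = Counter()
    for query_id, entity_id, entity_type, genre in rows:
        if query_id.startswith("EL_CLCMN_"):
            lang = "CMN"
        elif query_id.startswith("EL_CLENG_"):
            lang = "ENG"
        else:
            lang = "UNK"  # ID scheme not covered by this sketch
        status = "NIL" if entity_id.startswith("NIL") else "Non-NIL"
        counts[(lang, genre, status, entity_type)] += 1
    return counts


# Invented rows, for illustration only.
ROWS = [
    ("EL_CLCMN_00001", "E0000001", "PER", "NW"),
    ("EL_CLCMN_00002", "NIL-0001", "ORG", "NW"),
    ("EL_CLENG_00003", "NIL-0001", "ORG", "WB"),
    ("EL_CLCMN_00004", "E0000001", "PER", "NW"),
]

for key, n in sorted(tabulate(ROWS).items()):
    print(*key, n)
```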
./data/2011/training/source_documents/*
This directory contains all of the source documents referenced by
the document ID of each query in
tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml.
See section 5 for more information about source documents.
./data/2012/eval/tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml
This file contains 2122 queries. Each query entry consists of
the following fields:
- A query ID formatted as the letters "EL_CMN_" plus
a five-digit zero-padded, sequentially assigned
integer (e.g., "EL_CMN_00001").
- The full namestring of the query entity.
- An ID for a document in ./data/2012/eval/source_documents/
from which the namestring was extracted.
- The starting offset for the namestring.
- The ending offset for the namestring.
The queries are distributed by language and type as follows:
KB-Link              GPE   ORG   PER   Total
--------------------------------------------
CMN NW NIL:           99    89   167     355
CMN NW Non-NIL:      164   167   148     479
CMN WB NIL:           88    86    68     242
CMN WB Non-NIL:      131   112   110     353
ENG NW NIL:           90    79    68     237
ENG NW Non-NIL:      101   107    83     291
ENG WB NIL:            6    26    16      48
ENG WB Non-NIL:       26    52    39     117
--------------------------------------------
Total:               705   718   699    2122
./data/2012/eval/tac_kbp_2012_chinese_entity_linking_evaluation_KB_links.tab
This file contains the responses for each query as identified by
human annotators at LDC. This file is tab delimited, with 5 fields total.
The column descriptions are as follows:
1. query ID - The ID for the query detailed in
tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml
to which the subsequent information pertains
2. entity ID - A unique entity node ID or NIL ID, correspondent
to entity linking annotation and NIL-coreference
(clustering) annotation respectively. If the entity
node ID begins with "E", the text refers to an
entity in the Knowledge Base (TAC KBP Reference Knowledge
Base - LDC2014T16). If the given query is not linked to
an entity in the Knowledge Base (KB), then it is
given a NIL-ID, which consists of "NIL" plus a
three-digit zero-padded sequentially assigned
integer (e.g. NIL001, NIL002). Both the entities
with an entity node ID of "E" type and "NIL" type
are assumed to be co-referenced (clustered), with
the same "E" type ID or the same "NIL" ID if they
refer to the same entity. Each "E" type ID and NIL
ID is distinct from one another.
3. entity-type - GPE, ORG, or PER type indicator for the entity
4. genre - WB/NW/DF indicating the source genre of the
document for the query (WB for web data, NW for
newswire data, or DF for discussion forum data).
5. web-search - (Y/N) indicating whether the annotator made
use of web searches in order to make the linking
judgment.
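Because queries that refer to the same entity share the same "E" or NIL ID, the equivalence classes can be recovered by grouping on column 2. The following is a small sketch under that reading; the function name and the sample pairs are invented.

```python
from collections import defaultdict


def cluster_queries(pairs):
    """Group query IDs by entity ID; returns (kb_clusters, nil_clusters)."""
    clusters = defaultdict(list)
    for query_id, entity_id in pairs:
        clusters[entity_id].append(query_id)
    kb = {k: v for k, v in clusters.items() if not k.startswith("NIL")}
    nil = {k: v for k, v in clusters.items() if k.startswith("NIL")}
    return kb, nil


# Invented (query ID, entity ID) pairs, for illustration only.
PAIRS = [
    ("EL_CMN_00001", "E0000001"),
    ("EL_CMN_00002", "NIL001"),
    ("EL_CMN_00003", "NIL001"),
    ("EL_CMN_00004", "NIL002"),
]

kb_clusters, nil_clusters = cluster_queries(PAIRS)
print(len(kb_clusters), "KB clusters;", len(nil_clusters), "NIL clusters")
```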
./data/2012/eval/source_documents/*
This directory contains all of the source documents referenced by
the document ID of each query in
tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml.
See section 5 for more information about source documents.
./data/2012/training/tac_kbp_2012_chinese_entity_linking_training_queries.xml
This file contains 158 queries. Each query entry consists of
the following fields:
- A query ID formatted as the letters "EL_CMN_" plus
a five-digit zero-padded, sequentially assigned
integer (e.g., "EL_CMN_00001").
- The full namestring of the query entity.
- An ID for a document in ./data/2012/training/source_documents/
from which the namestring was extracted.
- The starting offset for the namestring.
- The ending offset for the namestring.
The queries are distributed by language and type as follows:
KB-Link              GPE   ORG   PER   Total
--------------------------------------------
CMN NW NIL:            2     2     2       6
CMN NW Non-NIL:        0     2     0       2
CMN WB NIL:           16    16    17      49
CMN WB Non-NIL:       24    25    24      73
ENG WB NIL:            3     4     0       7
ENG WB Non-NIL:        7     5     9      21
--------------------------------------------
Total:                52    54    52     158
./data/2012/training/tac_kbp_2012_chinese_entity_linking_training_KB_links.tab
This file contains the responses for each query as identified by
human annotators at LDC. This file is tab delimited, with 5 fields total.
The column descriptions are as follows:
1. query ID - The ID for the query detailed in
tac_kbp_2012_chinese_entity_linking_training_queries.xml
to which the subsequent information pertains
2. entity ID - A unique entity node ID or NIL ID, correspondent
to entity linking annotation and NIL-coreference
(clustering) annotation respectively. If the entity
node ID begins with "E", the text refers to an
entity in the Knowledge Base (TAC KBP Reference Knowledge
Base - LDC2014T16). If the given query is not linked to
an entity in the Knowledge Base (KB), then it is
given a NIL-ID, which consists of "NIL" plus a
three-digit zero-padded sequentially assigned
integer (e.g. NIL001, NIL002). Both the entities
with an entity node ID of "E" type and "NIL" type
are assumed to be co-referenced (clustered), with
the same "E" type ID or the same "NIL" ID if they
refer to the same entity. Each "E" type ID and NIL
ID is distinct from one another.
3. entity-type - GPE, ORG, or PER type indicator for the entity
4. genre - WB/NW/DF indicating the source genre of the
document for the query (WB for web data, NW for
newswire data, or DF for discussion forum data).
5. web-search - (Y/N) indicating whether the annotator made
use of web searches in order to make the linking
judgment.
./data/2012/training/source_documents/*
This directory contains all of the source documents referenced by
the document ID of each query in
tac_kbp_2012_chinese_entity_linking_training_queries.xml.
See section 5 for more information about source documents.
./data/2013/eval/tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml
This file contains 2155 queries. Each query entry consists of
the following fields:
- A query ID formatted as the letters "EL13_CMN_" plus
a four-digit zero-padded, sequentially assigned
integer (e.g., "EL13_CMN_0001").
- The full namestring of the query entity.
- An ID for a document in ./data/2013/eval/source_documents/
from which the namestring was extracted.
- The starting offset for the namestring.
- The ending offset for the namestring.
The queries are distributed by language and type as follows:
KB-Link              PER   ORG   GPE   Total
--------------------------------------------
CMN NW NIL:          123   197   125     445
CMN NW Non-NIL:      124   119   163     406
CMN WB NIL:          112   105    87     304
CMN WB Non-NIL:      173   150   162     485
ENG NW NIL:           52    16    68     136
ENG NW Non-NIL:       83    87    64     234
ENG WB NIL:           11    19     7      37
ENG WB Non-NIL:       28    42    38     108
--------------------------------------------
Total:               706   735   714    2155
./data/2013/eval/tac_kbp_2013_chinese_entity_linking_evaluation_KB_links.tab
This file contains the responses for each query as identified by
human annotators at LDC. This file is tab delimited, with 6 fields total.
The column descriptions are as follows:
1. query ID - The ID for the query detailed in
tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml
to which the subsequent information pertains
2. entity ID - A unique entity node ID or NIL ID, correspondent
to entity linking annotation and NIL-coreference
(clustering) annotation respectively. If the entity
node ID begins with "E", the text refers to an
entity in the Knowledge Base (TAC KBP Reference Knowledge
Base - LDC2014T16). If the given query is not linked to
an entity in the Knowledge Base (KB), then it is
given a NIL-ID, which consists of "NIL" plus a
three-digit zero-padded sequentially assigned
integer (e.g. NIL001, NIL002). Both the entities
with an entity node ID of "E" type and "NIL" type
are assumed to be co-referenced (clustered), with
the same "E" type ID or the same "NIL" ID if they
refer to the same entity. Each "E" type ID and NIL
ID is distinct from one another.
3. entity-type - GPE, ORG, or PER type indicator for the entity
4. genre - WB/NW/DF indicating the source genre of the
document for the query (WB for web data, NW for
newswire data, or DF for discussion forum data).
5. web-search - (Y/N) indicating whether the annotator made
use of web searches in order to make the linking
judgment.
6. wiki text - (Y/N) indicating whether the annotator made
use of the wiki text in the knowledge base (as
opposed to just the infobox information) in order
to make the linking judgment.
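Since the KB_links files in this package have 4, 5, or 6 fields depending on the year, one possible way to read them uniformly is to pad the missing columns. The sketch below does that and maps the Y/N flags to booleans; the function name, dictionary keys, and sample line are invented for illustration.

```python
def parse_link_row(line):
    """Parse one KB_links row with 4 to 6 tab-delimited fields."""
    fields = line.rstrip("\r\n").split("\t")
    if not 4 <= len(fields) <= 6:
        raise ValueError(f"expected 4-6 fields, got {len(fields)}")
    # Pad so that older files yield None for the columns they lack.
    fields += [None] * (6 - len(fields))
    query_id, entity_id, entity_type, genre, web, wiki = fields
    to_bool = {"Y": True, "N": False}
    return {
        "query_id": query_id,
        "entity_id": entity_id,
        "entity_type": entity_type,
        "genre": genre,
        "web_search": to_bool.get(web),  # None when absent or unrecognized
        "wiki_text": to_bool.get(wiki),
    }


# Invented row, for illustration only.
print(parse_link_row("EL13_CMN_0001\tNIL001\tGPE\tWB\tY\tN\n"))
```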
./data/2013/eval/source_documents/*
This directory contains all of the source documents referenced by
the document ID of each query in
tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml.
See section 5 for more information about source documents.
./data/2014/eval/tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml
This file contains 2739 queries. Each query entry consists of
the following fields:
- A query ID formatted as the letters "EL14_CMN_" plus
a four-digit zero-padded, sequentially assigned
integer (e.g., "EL14_CMN_0001").
- The full namestring of the query entity.
- An ID for a document in ./data/2014/eval/source_documents/
from which the namestring was extracted.
- The starting offset for the namestring.
- The ending offset for the namestring.
The queries are distributed by language and type as follows:
KB-Link              PER   ORG   GPE   Total
--------------------------------------------
CMN DF NIL:          118    40    16     174
CMN DF Non-NIL:      426    61    66     553
CMN NW NIL:          179   413   300     892
CMN NW Non-NIL:      349   139   184     672
ENG DF NIL:            1     4     5      10
ENG DF Non-NIL:        5    26    25      56
ENG NW NIL:           10    65    32     107
ENG NW Non-NIL:       87    66   119     272
ENG WB Non-NIL:        1     0     2       3
--------------------------------------------
Total:              1176   814   749    2739
./data/2014/eval/tac_kbp_2014_chinese_entity_linking_evaluation_KB_links.tab
This file contains the responses for each query as identified by
human annotators at LDC. This file is tab delimited, with 6 fields total.
The column descriptions are as follows:
1. query ID - The ID for the query detailed in
tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml
to which the subsequent information pertains
2. entity ID - A unique entity node ID or NIL ID, correspondent
to entity linking annotation and NIL-coreference
(clustering) annotation respectively. If the entity
node ID begins with "E", the text refers to an
entity in the Knowledge Base (TAC KBP Reference Knowledge
Base - LDC2014T16). If the given query is not linked to
an entity in the Knowledge Base (KB), then it is
given a NIL-ID, which consists of "NIL" plus a
three-digit zero-padded sequentially assigned
integer (e.g. NIL001, NIL002). Both the entities
with an entity node ID of "E" type and "NIL" type
are assumed to be co-referenced (clustered), with
the same "E" type ID or the same "NIL" ID if they
refer to the same entity. Each "E" type ID and NIL
ID is distinct from one another.
3. entity-type - GPE, ORG, or PER type indicator for the entity
4. genre - WB/NW/DF indicating the source genre of the
document for the query (WB for web data, NW for
newswire data, or DF for discussion forum data).
5. web-search - (Y/N) indicating whether the annotator made
use of web searches in order to make the linking
judgment.
6. wiki text - (Y/N) indicating whether the annotator made
use of the wiki text in the knowledge base (as
opposed to just the infobox information) in order
to make the linking judgment.
./data/2014/eval/source_documents/*
This directory contains all of the source documents referenced by
the document ID of each query in
tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml.
See section 5 for more information about source documents.
./data/2014/training/tac_kbp_2014_chinese_entity_linking_training_queries.xml
This file contains 514 queries. Each query entry consists of
the following fields:
- A query ID formatted as the letters "EL14_CMN_TRAINING_" plus
a four-digit zero-padded, sequentially assigned
integer (e.g., "EL14_CMN_TRAINING_0001").
- The full namestring of the query entity.
- An ID for a document in ./data/2014/training/source_documents/
from which the namestring was extracted.
- The starting offset for the namestring.
- The ending offset for the namestring.
The queries are distributed by language and type as follows:
KB-Link              PER   ORG   GPE   Total
--------------------------------------------
ENG DF NIL:            1     6     3      10
ENG DF Non-NIL:       33    37    41     111
CMN DF NIL:           28    46     6      80
CMN DF Non-NIL:      109    83   121     313
--------------------------------------------
Total:               171   172   171     514
./data/2014/training/tac_kbp_2014_chinese_entity_linking_training_KB_links.tab
This file contains the responses for each query as identified by
human annotators at LDC. This file is tab delimited, with 6 fields total.
The column descriptions are as follows:
1. query ID - The ID for the query detailed in
tac_kbp_2014_chinese_entity_linking_training_queries.xml
to which the subsequent information pertains
2. entity ID - A unique entity node ID or NIL ID, correspondent
to entity linking annotation and NIL-coreference
(clustering) annotation respectively. If the entity
node ID begins with "E", the text refers to an
entity in the Knowledge Base (TAC KBP Reference Knowledge
Base - LDC2014T16). If the given query is not linked to
an entity in the Knowledge Base (KB), then it is
given a NIL-ID, which consists of "NIL" plus a
three-digit zero-padded sequentially assigned
integer (e.g. NIL001, NIL002). Both the entities
with an entity node ID of "E" type and "NIL" type
are assumed to be co-referenced (clustered), with
the same "E" type ID or the same "NIL" ID if they
refer to the same entity. Each "E" type ID and NIL
ID is distinct from one another.
3. entity-type - GPE, ORG, or PER type indicator for the entity
4. genre - WB/NW/DF indicating the source genre of the
document for the query (all DF or discussion
forum threads in these data).
5. web-search - (Y/N) indicating whether the annotator made
use of web searches in order to make the linking
judgment.
6. wiki text - (Y/N) indicating whether the annotator made
use of the wiki text in the knowledge base (as
opposed to just the infobox information) in order
to make the linking judgment.
./data/2014/training/source_documents/*
This directory contains all of the source documents referenced by
the document ID of each query in
tac_kbp_2014_chinese_entity_linking_training_queries.xml.
See section 5 for more information about source documents.
./dtd/2011_kbpentlink.dtd
DTD for:
tac_kbp_2011_chinese_entity_linking_evaluation_queries.xml
tac_kbp_2011_chinese_entity_linking_sample_and_training_queries.xml
./dtd/2012_2013_2014_kbpentlink.dtd
DTD for:
tac_kbp_2012_chinese_entity_linking_evaluation_queries.xml
tac_kbp_2012_chinese_entity_linking_training_queries.xml
tac_kbp_2013_chinese_entity_linking_evaluation_queries.xml
tac_kbp_2014_chinese_entity_linking_evaluation_queries.xml
tac_kbp_2014_chinese_entity_linking_training_queries.xml
3. Annotation
Given a name string and using information from the query's source
document, bilingual Chinese/English-speaking annotators used a
specialized search engine to look in the Knowledge Base for a page in
which the entity referred to by the query was the central topic. If
such a page was found, a link was created between the query and the
matching KB node ID. If no matching page was found, the query was
marked as NIL and later coreferenced with other NIL entities.
Annotators were allowed to use online searching to assist in
determining the KB link/NIL status. Queries for which a human
annotator could not confidently determine the KB link status were
removed from the final data sets.
4. Text Normalization
Name string matches are case and punctuation sensitive. The only text
normalization performed was:
1. conversion of newlines to spaces, except where preceding
characters were hyphens ("-"), in which case newlines
were removed
2. conversion of multiple spaces to a single space
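The two normalization steps above might be approximated as follows. This is only a sketch: it assumes that "newline" means a bare "\n", that the hyphen itself is kept when the newline after it is removed, and that only the character immediately before the newline is checked. The function name is hypothetical.

```python
import re


def normalize_name(text):
    """Approximate the README's two normalization steps."""
    # Step 1a: a newline right after a hyphen is removed (hyphen kept).
    text = text.replace("-\n", "-")
    # Step 1b: every remaining newline becomes a space.
    text = text.replace("\n", " ")
    # Step 2: runs of spaces collapse to a single space.
    return re.sub(r" {2,}", " ", text)


print(normalize_name("New\nYork"))
print(normalize_name("Hewlett-\nPackard"))
```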
5. Source Documents
All the text data in the source files have been taken directly from
previous LDC corpus releases, and are being provided here essentially
"as-is", with little or no additional quality control. An overall scan
of character content in the source collections indicates some relatively
small quantities of various problems, especially in the web and
discussion forum data, including language mismatch (characters from
Chinese, Korean, Japanese, Arabic, Russian, etc.), and encoding errors
(some documents have apparently undergone "double encoding" into UTF-8,
and others may have been "noisy" to begin with, or may have gone through
an improper encoding conversion, yielding occurrences of the
Unicode "replacement character" (U+FFFD) throughout the corpus); the web
collection also has characters whose Unicode code points lie
outside the "Basic Multilingual Plane" (BMP), i.e. above U+FFFF.
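A quick way to see how much a given document is affected by two of the problems listed above is to count replacement characters and non-BMP code points in its decoded text. The sketch below is illustrative only; the function name is hypothetical, and it does not attempt to detect double encoding or language mismatch.

```python
def scan_characters(text):
    """Count U+FFFD occurrences and code points above U+FFFF."""
    return {
        "replacement_chars": text.count("\ufffd"),
        "non_bmp_chars": sum(1 for ch in text if ord(ch) > 0xFFFF),
    }


# Invented string with one replacement character and one non-BMP character.
print(scan_characters("ok\ufffd\U00020000"))
```

Non-BMP characters matter mainly for tools that count UTF-16 code units rather than code points, since such tools would see each of these characters as two units.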
All documents that have filenames beginning with "cmn-NG" or "eng-NG"
are Web Document data (WB) and some of these fail XML parsing (see below
for details). All files that start with "bolt-" are Discussion Forum
threads (DF) and have the XML structure described below. All other files are
Newswire data (NW) and have the newswire markup pattern detailed below.
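The filename rules in this paragraph can be expressed as a small helper. This is a sketch of those rules only; the function name and the example filenames are invented and do not come from the corpus.

```python
import os.path


def genre_from_filename(filename):
    """Map a source document filename to WB, DF, or NW."""
    base = os.path.basename(filename)
    if base.startswith(("cmn-NG", "eng-NG")):
        return "WB"  # web document data
    if base.startswith("bolt-"):
        return "DF"  # discussion forum threads
    return "NW"      # everything else is newswire


# Hypothetical filenames, for illustration only.
for name in ["cmn-NG-example.xml", "bolt-example.xml", "EXAMPLE_NEWSWIRE.xml"]:
    print(name, genre_from_filename(name))
```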
Note as well that some source documents are duplicated across a few of
the separated source_documents directories, indicating that some queries
from different data sets originated from the same source documents. As
it is acceptable for sources to be reused for Entity Linking queries, this
duplication is intentional and expected.
The subsections below go into more detail regarding the markup and
other properties of the three source data types:
5.1 Newswire Data
Newswire data use the following markup framework:
<DOC id="{doc_id_string}" type="{doc_type_label}">
<HEADLINE>
...
</HEADLINE>
<DATELINE>
...
</DATELINE>
<TEXT>
<P>
...
</P>
...
</TEXT>
</DOC>
where the HEADLINE and DATELINE tags are optional (not always
present), and the TEXT content may or may not include "<P> ... </P>"
tags (depending on whether or not the "doc_type_label" is "story").
All the newswire files are parseable as XML.
5.2 Discussion Forum Data
Discussion forum files use the following markup framework:
<doc id="{doc_id_string}">
<headline>
...
</headline>
<post ...>
...
<quote ...>
...
</quote>
...
</post>
...
</doc>
where there may be arbitrarily deep nesting of quote elements, and
other elements may be present (e.g. "<a ...>...</a>" anchor tags).
Each <doc> unit contains at least five post elements.
All the discussion forum files are parseable as XML.
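5.3 Web Document Data
"Web" files use the following markup framework:
<DOC>
<DOCID> {doc_id_string} </DOCID>
<DOCTYPE> ... </DOCTYPE>
<DATETIME> ... </DATETIME>
<BODY>
<HEADLINE>
...
</HEADLINE>
<TEXT>
<POST>
<POSTER> ... </POSTER>
<POSTDATE> ... </POSTDATE>
...
</POST>
</TEXT>
</BODY>
</DOC>
Other kinds of tags may be present ("<QUOTE ...>", "<A ...>", etc).
Some of the web source documents contain material that interferes
with XML parsing (e.g. unescaped "&", or "<QUOTE>" tags that lack
a corresponding "</QUOTE>").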
6. Using the Data
6.1 Offset calculation
The values of the beg and end XML elements in the later queries.xml files
indicate character offsets to identify text extents in the source. Offset
counting starts from the initial character (character 0) of the source
document and includes newlines and all markup characters - that is, the
offsets are based on treating the source document file as "raw text", with all
its markup included.
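A minimal sketch of applying the offsets, under three assumptions that this README does not state outright: the end offset is inclusive, the files are UTF-8, and offsets count Unicode code points (which is how Python indexes strings). The function names and the sample text are invented.

```python
def read_raw(path, encoding="utf-8"):
    """Read a source document as raw text, markup included."""
    # newline="" disables newline translation so offsets are not shifted.
    with open(path, encoding=encoding, newline="") as handle:
        return handle.read()


def extract_mention(raw_text, beg, end):
    """Return the text extent for beg/end, assuming end is inclusive."""
    return raw_text[beg:end + 1]


# Invented document text, for illustration only.
raw = "<DOC>\n<TEXT>\nExample Name spoke today.\n</TEXT>\n</DOC>\n"
beg = raw.index("Example Name")
print(extract_mention(raw, beg, beg + len("Example Name") - 1))
```

If the extracted string does not match the query's name after XML parsing of the queries file, the inclusive-end or encoding assumption is the first thing to re-check.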
6.2 Proper ingesting of XML queries
While the character offsets are calculated based on treating the source
document as "raw text", the "name" strings being referenced by the queries
sometimes contain XML metacharacters, and these had to be "re-escaped" for
proper inclusion in the queries.xml file. For example, an actual name like
"AT&T" may show up in a source document file as "AT&amp;T" (because the source
document was originally formatted as XML data). But since the source doc is
being treated here as raw text, this name string is treated in queries.xml as
having 8 characters (i.e., the character offsets, when provided, will point to
a string of length 8).
However, the "name" element itself, as presented in the queries.xml file, will
be even longer - "AT&amp;amp;T" - because the queries.xml file is intended to
be handled by an XML parser, which will return "AT&amp;T" when this "name"
element is extracted. Using the queries.xml data without XML parsing would
element is extracted. Using the queries.xml data without XML parsing would
yield a mismatch between the "name" value and the corresponding string in the
source data.
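The effect of the re-escaping can be checked with any XML parser. In the sketch below, the query fragment is invented and its element names are assumptions; the point is only that one level of escaping is removed by the parser, leaving a string that matches the raw source text character for character.

```python
import html
import xml.etree.ElementTree as ET

# Invented query fragment holding a doubly escaped name.
QUERY_XML = "<query><name>AT&amp;amp;T</name></query>"

# The XML parser removes one level of escaping.
parsed_name = ET.fromstring(QUERY_XML).findtext("name")

# This is how the name would appear in the raw (unparsed) source document.
raw_source_fragment = "AT&amp;T"

print(parsed_name == raw_source_fragment, len(parsed_name))

# Unescaping once more recovers the actual name.
print(html.unescape(parsed_name))
```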
7. Copyright Information
(c) 2015 Trustees of the University of Pennsylvania
8. Contact Information
For further information about this data release, contact the following
project staff at LDC:
Joseph Ellis, Project Manager
Jeremy Getman, Lead Annotator
Stephanie Strassel, PI
--------------------------------------------------------------------------
README created by Jeremy Getman on February 4, 2015
updated by Joe Ellis on February 16, 2015
updated by Jeremy Getman on February 17, 2015
updated by Joe Ellis on March 18, 2015