How do I semantically represent generic, extracted text?

Date: 2022-04-11 18:36:26

I am working on a project that extracts content from web pages and normalizes this content to a discrete set of types. Right now I am only working with text and images.

For images, I've found https://schema.org/ImageObject, which seems to fit just fine.

For text, however, I am not sure what to use. Apart from the primitive datatype http://schema.org/Text, I'm not finding anything on schema.org that represents generic text. I am new to linked, semantic data, and am not sure whether primitives are intended to be used as full-fledged types.

Furthermore, I would like to be able to distinguish text fragments by how they are used on the source webpage. For example, I'd like to be able to specify that one span of text was paragraph text, while another was header text. On schema.org there is https://schema.org/WebPageElement, whose subtypes include https://schema.org/WPHeader, but there is no WPParagraph, or WPTextFragment, or anything like that.

I've looked around at other vocabularies, but I'm not sure which might be a good fit. Above all, I am looking to use something that already exists and that people recognize.

1 solution

#1


Have you taken a look at the Open Annotation ontology, from the W3C? (http://www.openannotation.org/spec/core/core.html#BodyEmbed). Currently it is only a draft, but it could help you annotate pieces of text. It also lets you assert which document you extracted the text from and who owns the annotations (i.e., their provenance). I don't think it includes terms such as "header", but it has selectors for specifying the specific part of the web page or document you are annotating: http://www.openannotation.org/spec/core/specific.html#TextPositionSelector.
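
As a rough illustration of the idea, here is a minimal Python sketch that builds a JSON-LD style annotation for one extracted span of text, with the text embedded as the body and a TextPositionSelector pointing back into the source page. The function name `make_text_annotation`, the example URL, and the sample page text are made up for illustration; the `oa:` and `cnt:` terms are the ones I understand the Open Annotation drafts to use, so check them against the spec before relying on this.

```python
import json

# Namespace prefixes, as I understand them from the Open Annotation drafts.
OA_CONTEXT = {
    "oa": "http://www.w3.org/ns/oa#",
    "cnt": "http://www.w3.org/2011/content#",
    "dctypes": "http://purl.org/dc/dcmitype/",
}


def make_text_annotation(source_url, page_text, start, end):
    """Describe page_text[start:end] as an annotation on source_url.

    Hypothetical helper: offsets are character positions in the page's
    plain text, with start included and end excluded.
    """
    if not 0 <= start < end <= len(page_text):
        raise ValueError("start/end must delimit a non-empty span of page_text")
    return {
        "@context": OA_CONTEXT,
        "@type": "oa:Annotation",
        # Body: the extracted text itself, embedded in the annotation.
        "oa:hasBody": {
            "@type": ["cnt:ContentAsText", "dctypes:Text"],
            "cnt:chars": page_text[start:end],
        },
        # Target: the part of the source page the text came from.
        "oa:hasTarget": {
            "@type": "oa:SpecificResource",
            "oa:hasSource": {"@id": source_url},
            "oa:hasSelector": {
                "@type": "oa:TextPositionSelector",
                "oa:start": start,
                "oa:end": end,
            },
        },
    }


page_text = "Title\nFirst paragraph of the page."
annotation = make_text_annotation("http://example.org/page.html", page_text, 0, 5)
print(json.dumps(annotation, indent=2))
```

Whether a span was a header or a paragraph is not covered by this sketch; as noted above, you would need a term from another vocabulary for that.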

It also provides mechanisms for annotating areas of images (http://www.openannotation.org/spec/core/specific.html#SvgSelector). It can be as simple or as complex as you want.
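
For the image case, a similarly hedged sketch: it marks a rectangular region of an image by embedding a small SVG shape in an SvgSelector. The helper name `make_image_region_annotation`, the image URL, and the coordinates are invented for illustration, and embedding the SVG through `cnt:chars` is my reading of the draft rather than something I have validated.

```python
import json

# Namespace prefixes, as I understand them from the Open Annotation drafts.
OA_CONTEXT = {
    "oa": "http://www.w3.org/ns/oa#",
    "cnt": "http://www.w3.org/2011/content#",
}


def make_image_region_annotation(image_url, comment, x, y, width, height):
    """Attach a text comment to a rectangular region of an image.

    Hypothetical helper: the region is expressed as an SVG <rect> in the
    image's pixel coordinates.
    """
    svg = (
        '<rect xmlns="http://www.w3.org/2000/svg" '
        f'x="{x}" y="{y}" width="{width}" height="{height}"/>'
    )
    return {
        "@context": OA_CONTEXT,
        "@type": "oa:Annotation",
        "oa:hasBody": {"@type": "cnt:ContentAsText", "cnt:chars": comment},
        "oa:hasTarget": {
            "@type": "oa:SpecificResource",
            "oa:hasSource": {"@id": image_url},
            # Selector: the SVG shape is carried as embedded text.
            "oa:hasSelector": {
                "@type": ["oa:SvgSelector", "cnt:ContentAsText"],
                "cnt:chars": svg,
            },
        },
    }


region = make_image_region_annotation(
    "http://example.org/photo.jpg", "Company logo", 10, 20, 40, 30
)
print(json.dumps(region, indent=2))
```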
