使用php从文本中提取单词

时间:2022-11-12 10:54:35

Hello friends have a little problem. I need to extract only the words of a text "anyone".

你好朋友有一点问题。我只需要提取文本“任何人”的单词。

I tried to retrieve the words using strtok (), strstr (). some regular expressions, but only managed to extract some words.

我尝试使用strtok(),strstr()检索单词。一些正则表达式,但只设法提取一些单词。

The problem is complex due to the number of characters and symbols that can accompany the words.

由于可以伴随单词的字符和符号的数量,问题是复杂的。

The example text which must be extracted words. This is a sample text:

必须提取单词的示例文本。这是一个示例文本:

Main article: our 46,000 required, !but (1947-2011) mail@server.com March 8, 2014 Gutenberg's 34-DE 'a' 3,1415 Us: @unknown n go http://google.com or www.google.com and http://www.google.com (r) The 509th "composite" and; C-54 #dog v4.0 ¿as is done? ¿article... agriculture? x ¿cat? now! Hi!! (87 meters).

Sample text, for testing.

The result of extracting the text should be:

提取文本的结果应该是:

Main article our required but March Gutenberg's a go or and The composite and dog as is done article agriculture cat now Hi meters

Sample text for testing

The first function I wrote to facilitate the work

我写的第一个函数是为了方便工作

function PreText($text){
  $text = str_replace("\n", ".", $text);
  $text = str_replace("\r", ".", $text);

  $text = str_replace("'", "", $text);
  $text = str_replace("?", "", $text);
  $text = str_replace("¿", "", $text);
  $text = str_replace("(", "", $text);
  $text = str_replace(")", "", $text);
  $text = str_replace('"', "", $text);
  $text = str_replace(';', "", $text);
  $text = str_replace('!', "", $text);
  $text = str_replace('<', "", $text);
  $text = str_replace('>', "", $text);
  $text = str_replace('#', "", $text);

  $text = str_replace(",", "", $text);

  $text = str_replace(".c", "", $text);
  $text = str_replace(".C", "", $text);
  return $text;
}

Split function:

function SplitWords($text){
  $words = explode(" ", $text);
  $ContWords = count($words);

  for ($i = 0; $i < $ContWords; $i++){
    if (ctype_alpha($words[$i])) {
      $NewText .= $words[$i].", ";
    }
  }
  return $NewText;
}

The program:

<?
  include_once ('functions.php');

  $text = "Main article: our 46,000 ...";
  $text = PreText($text);
  $text = SplitWords($text);
  echo $text;
?>

Is that the code has a long way. We appreciate your help.

是代码还有很长的路要走。感谢您的帮助。

2 个解决方案

#1


5  

If I understand you correctly, you want to remove all non-letters from the string. I would use preg_replace

如果我理解正确,您要删除字符串中的所有非字母。我会用preg_replace

$text = "Main article: our 46,000...";
$text = preg_replace("/[^a-zA-Z' ]/","",$text);

This should remove everything that is not a letter, apostrophe or a space.

这应该删除所有不是字母,撇号或空格的东西。

#2


0  

Try this almost your requirement

试试这几乎是你的要求

<?php
$text = <<<HEREDOC
Main article: our 46,000 required, !but (1947-2011) mail@server.com March 8, 2014 Gutenberg's 34-DE 'a' 3,1415 Us: @unknown n go http://google.com or www.google.com and
        http://www.google.com (r) The 509th composite" and; C-54 #dog v4.0 ¿as is done? ¿article... agriculture? x ¿cat? now! Hi!! (87 meters). Sample text, for testing.
HEREDOC;
//replace all kind of URLs and emails from text
$url_email = "((https?|ftp)\:\/\/)?"; // SCHEME
$url_email .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$url_email .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$url_email .= "(\:[0-9]{2,5})?"; // Port
$url_email .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$url_email .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$url_email .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor

$text = preg_replace("/$url_email/","",$text);
//replace anything like Us: @unknown
$text = preg_replace("/Us:.?@\\w+/","",$text);
//replace all Non-Alpha characters
$text = preg_replace("/[^a-zA-Z' ]/","",$text);
echo $text;
?>

#1


5  

If I understand you correctly, you want to remove all non-letters from the string. I would use preg_replace

如果我理解正确,您要删除字符串中的所有非字母。我会用preg_replace

$text = "Main article: our 46,000...";
$text = preg_replace("/[^a-zA-Z' ]/","",$text);

This should remove everything that is not a letter, apostrophe or a space.

这应该删除所有不是字母,撇号或空格的东西。

#2


0  

Try this almost your requirement

试试这几乎是你的要求

<?php
$text = <<<HEREDOC
Main article: our 46,000 required, !but (1947-2011) mail@server.com March 8, 2014 Gutenberg's 34-DE 'a' 3,1415 Us: @unknown n go http://google.com or www.google.com and
        http://www.google.com (r) The 509th composite" and; C-54 #dog v4.0 ¿as is done? ¿article... agriculture? x ¿cat? now! Hi!! (87 meters). Sample text, for testing.
HEREDOC;
//replace all kind of URLs and emails from text
$url_email = "((https?|ftp)\:\/\/)?"; // SCHEME
$url_email .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$url_email .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$url_email .= "(\:[0-9]{2,5})?"; // Port
$url_email .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$url_email .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$url_email .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor

$text = preg_replace("/$url_email/","",$text);
//replace anything like Us: @unknown
$text = preg_replace("/Us:.?@\\w+/","",$text);
//replace all Non-Alpha characters
$text = preg_replace("/[^a-zA-Z' ]/","",$text);
echo $text;
?>