
Time: 2022-11-12 10:54:35

Hello friends, I have a small problem. I need to extract only the words from an arbitrary text.


I tried to retrieve the words using strtok(), strstr(), and some regular expressions, but I only managed to extract some of the words.


The problem is complex because of the many characters and symbols that can accompany the words.


Here is the sample text from which the words must be extracted:


Main article: our 46,000 required, !but (1947-2011) mail@server.com March 8, 2014 Gutenberg's 34-DE 'a' 3,1415 Us: @unknown n go http://google.com or www.google.com and http://www.google.com (r) The 509th "composite" and; C-54 #dog v4.0 ¿as is done? ¿article... agriculture? x ¿cat? now! Hi!! (87 meters).

Sample text, for testing.

The result of the extraction should be:


Main article our required but March Gutenberg's a go or and The composite and dog as is done article agriculture cat now Hi meters

Sample text for testing

This is the first function I wrote to simplify the work:


function PreText($text){
  // Turn line breaks into periods so that lines stay separated.
  $text = str_replace(array("\n", "\r"), ".", $text);

  // Remove punctuation and symbols that can surround a word.
  $text = str_replace(
    array("'", "?", "¿", "(", ")", '"', ';', '!', '<', '>', '#', ","),
    "",
    $text
  );

  // Remove the sequences ".c" and ".C".
  $text = str_replace(array(".c", ".C"), "", $text);
  return $text;
}

Split function:

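function SplitWords($text){
  $words = explode(" ", $text);
  $ContWords = count($words);
  $NewText = ""; // initialize before appending to avoid an undefined-variable notice

  for ($i = 0; $i < $ContWords; $i++){
    // Keep only the tokens made up entirely of letters.
    if (ctype_alpha($words[$i])) {
      $NewText .= $words[$i].", ";
    }
  }
  return $NewText;
}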

The program:

  include_once ('functions.php');

  $text = "Main article: our 46,000 ...";
  $text = PreText($text);
  $text = SplitWords($text);
  echo $text;

As you can see, the code still has a long way to go. I would appreciate your help.


2 solutions



If I understand you correctly, you want to remove all non-letters from the string. I would use preg_replace:


$text = "Main article: our 46,000...";
$text = preg_replace("/[^a-zA-Z' ]/","",$text);

This should remove everything that is not a letter, apostrophe or a space.
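To see what this pattern does in practice, here is a small sketch run on a fragment of the sample text from the question. The second preg_replace call, which collapses the leftover runs of spaces, is my own addition and not part of the answer; the variable name $clean is just for illustration.

```php
<?php
// Fragment of the sample text from the question.
$text = "our 46,000 required, !but (1947-2011) Gutenberg's 34-DE";

// Delete every character that is not an ASCII letter, an apostrophe, or a space.
$clean = preg_replace("/[^a-zA-Z' ]/", "", $text);

// Assumption (not in the answer): collapse the runs of spaces that remain
// where whole tokens such as "46,000" were deleted.
$clean = trim(preg_replace("/ +/", " ", $clean));

echo $clean, "\n"; // prints: our required but Gutenberg's DE
```

Note that the letters of a mixed token such as 34-DE survive as DE. The expected output in the question leaves such tokens out entirely, so this pattern alone probably will not be enough.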




Try this; it comes close to your requirement:


$text = <<<HEREDOC
Main article: our 46,000 required, !but (1947-2011) mail@server.com March 8, 2014 Gutenberg's 34-DE 'a' 3,1415 Us: @unknown n go http://google.com or www.google.com and
http://www.google.com (r) The 509th "composite" and; C-54 #dog v4.0 ¿as is done? ¿article... agriculture? x ¿cat? now! Hi!! (87 meters). Sample text, for testing.
HEREDOC;

// Remove all kinds of URLs and e-mail addresses from the text
$url_email = "((https?|ftp)\:\/\/)?"; // Scheme
$url_email .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and password
$url_email .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$url_email .= "(\:[0-9]{2,5})?"; // Port
$url_email .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$url_email .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET query
$url_email .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor

$text = preg_replace("/$url_email/","",$text);
// Remove anything like "Us: @unknown"
$text = preg_replace("/Us:.?@\\w+/","",$text);
// Remove all non-alphabetic characters (apostrophes and spaces are kept)
$text = preg_replace("/[^a-zA-Z' ]/","",$text);
echo $text;
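Another way to approach this, offered only as a rough sketch and not something posted in the thread, is to work token by token instead of character by character: split on whitespace, skip tokens that look like addresses, strip the surrounding punctuation, and keep a token only if what remains is purely letters with optional apostrophe parts. The function name extractWords and the address test are hypothetical choices of mine. The sketch also does not try to reproduce every rule implied by the expected output (for example dropping "Us:", "n", "x" or "(r)"), because the question does not state those rules.

```php
<?php
// Hypothetical helper (not from the thread): keep only tokens that are plain words.
function extractWords(string $text): array
{
    // Split on any run of whitespace, dropping empty pieces.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $words = [];
    foreach ($tokens as $token) {
        // Assumption: anything containing "@" or "://", or starting with "www.",
        // is an e-mail address, a handle, or a URL and is skipped.
        if (preg_match('/@|:\/\/|^www\./i', $token)) {
            continue;
        }
        // Strip leading and trailing characters that are neither letters nor digits.
        $core = preg_replace('/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/u', '', $token);
        // Accept letters with optional apostrophe parts, e.g. "Gutenberg's".
        // Tokens that still contain digits or hyphens (34-DE, v4.0) fail here.
        if (preg_match('/^\p{L}+(?:\'\p{L}+)*$/u', $core)) {
            $words[] = $core;
        }
    }
    return $words;
}

$sample = "Main article: our 46,000 required, !but (1947-2011) mail@server.com "
        . "Gutenberg's 34-DE 'a' http://google.com #dog v4.0 ¿cat? now! Hi!! (87 meters).";
echo implode(' ', extractWords($sample)), "\n";
// prints: Main article our required but Gutenberg's a dog cat now Hi meters
```

Because digits are not stripped from the edges of a token, mixed tokens such as 34-DE, C-54 or 509th are rejected as a whole instead of leaving stray letters behind. The u modifier makes the patterns Unicode-aware, so the input is assumed to be valid UTF-8.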


