[.NET Core Basics] Word-Segmented Search in .NET Core with Lucene.Net and jieba.NET

Date: 2024-09-02 18:03:08

The business requirement is fuzzy search over product titles.

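For example, when a user enters 【我想查询下雅思托福考试】 ("I want to look up the IELTS and TOEFL exams"), the sentence first has to be segmented into 【查询】 (look up), 【雅思】 (IELTS), 【托福】 (TOEFL), and 【考试】 (exam). The search then returns the products whose titles contain any of those words.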

The approach is as follows.

First, all product records in the database are synced automatically into a Lucene index directory, which serves as a searchable cache.

The sync runs as a Hangfire scheduled job, which I covered in an earlier post:

https://www.cnblogs.com/jhli/p/10027074.html
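As a rough sketch of what registering such a recurring job might look like, assuming Hangfire storage has already been configured at startup: the class name `MerchIndexService`, the job id, and the hourly schedule are all placeholders invented for illustration, not taken from the original project.

```csharp
using Hangfire;

// Placeholder for the class that owns UpdateMerchIndex() (name assumed).
public class MerchIndexService
{
    public void UpdateMerchIndex()
    {
        // The index rebuild shown in this post would go here.
    }
}

public static class MerchIndexJobs
{
    // Hypothetical helper: call once at startup, after Hangfire is configured.
    public static void Register()
    {
        RecurringJob.AddOrUpdate<MerchIndexService>(
            "update-merch-index",          // job id (assumed)
            svc => svc.UpdateMerchIndex(), // method Hangfire invokes on each run
            Cron.Hourly());                // schedule (assumed)
    }
}
```

Hangfire serializes the type and method name of the job, so the assembly containing the job class has to be loadable on the server that executes it.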

Once the cache is refreshed on a schedule, segmented search can run against it. The index update code is as follows:

        public void UpdateMerchIndex()
        {
            try
            {
                Console.WriteLine($"[{DateTime.Now}] UpdateMerchIndex job begin...");

                var indexDir = Path.Combine(System.IO.Directory.GetCurrentDirectory(), "temp", "lucene", "merchs");
                if (!System.IO.Directory.Exists(indexDir))
                {
                    System.IO.Directory.CreateDirectory(indexDir);
                }

                var VERSION = Lucene.Net.Util.LuceneVersion.LUCENE_48;
                var director = FSDirectory.Open(new DirectoryInfo(indexDir));
                var analyzer = new JieBaAnalyzer(TokenizerMode.Search);
                var indexWriterConfig = new IndexWriterConfig(VERSION, analyzer);

                using (var indexWriter = new IndexWriter(director, indexWriterConfig))
                {
                    // If an index already exists, clear it and rebuild from scratch.
                    if (File.Exists(Path.Combine(indexDir, "segments.gen")))
                    {
                        indexWriter.DeleteAll();
                    }

                    var query = _merchService.Where(t => t.IsDel == false);

                    var index = 1;   // 1-based page number, implied by Skip((index - 1) * size)
                    var size = 100;  // page size; the original value was lost, 100 is an assumed placeholder
                    var count = query.Count();
                    if (count > 0)
                    {
                        while (true)
                        {
                            // Read the products one page at a time, in a stable order.
                            var rs = query.OrderBy(t => t.CreateTime)
                                .Skip((index - 1) * size)
                                .Take(size).ToList();
                            if (rs.Count == 0)
                            {
                                break;
                            }

                            var addDocs = new List<Document>();
                            foreach (var item in rs)
                            {
                                var merchid = item.IdentityId.ToLowerString();

                                var doc = new Document();
                                // StringField is stored as-is (not segmented); TextField is run through the analyzer.
                                var field1 = new StringField("merchid", merchid, Field.Store.YES);
                                // Fall back to an empty string so a null name cannot break field creation.
                                var field2 = new TextField("name", (item.Name ?? string.Empty).ToLower(), Field.Store.YES);
                                doc.Add(field1);
                                doc.Add(field2);
                                addDocs.Add(doc); // queue the document for indexing
                            }

                            if (addDocs.Count > 0)
                            {
                                indexWriter.AddDocuments(addDocs);
                            }

                            index = index + 1;
                        }
                    }
                }

                Console.WriteLine($"[{DateTime.Now}] UpdateMerchIndex job end!");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"UpdateMerchIndex ex={ex}");
            }
        }

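What remains is to query the index, collect the ids of the matching documents, and then load the corresponding rows from the database.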
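The database step is not shown in this post, so here is a minimal sketch of one way to do it, assuming the ids come back from the index ordered by relevance and that order should be preserved. The `Merch` class and `MerchLookup.LoadByIds` are hypothetical names; only `IdentityId` and `Name` mirror properties used in the indexing code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity with just the two properties the index uses.
public class Merch
{
    public Guid IdentityId { get; set; }
    public string Name { get; set; }
}

public static class MerchLookup
{
    // Loads the rows whose ids were returned by the index search,
    // keeping the order of the ids (i.e. the relevance order).
    public static List<Merch> LoadByIds(IQueryable<Merch> merchs, List<Guid> ids)
    {
        if (ids == null || ids.Count == 0)
        {
            return new List<Merch>();
        }

        // One query for all ids; ids missing from the table are simply absent.
        var byId = merchs
            .Where(m => ids.Contains(m.IdentityId))
            .ToDictionary(m => m.IdentityId);

        return ids
            .Distinct()
            .Where(id => byId.ContainsKey(id))
            .Select(id => byId[id])
            .ToList();
    }
}
```

Skipping ids that are no longer in the table matters here because the index is only refreshed on a schedule and can briefly lag behind the database.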

Search code:

        protected List<Guid> SearchMerchs(string key)
        {
            if (string.IsNullOrEmpty(key))
            {
                return null;
            }
            key = key.Trim().ToLower();

            var rs = new List<Guid>();
            try
            {
                var indexDir = Path.Combine(System.IO.Directory.GetCurrentDirectory(), "temp", "lucene", "merchs");
                if (System.IO.Directory.Exists(indexDir))
                {
                    // A single reader serves both the search and the document lookup;
                    // the using blocks release the index files afterwards.
                    using (var directory = FSDirectory.Open(new DirectoryInfo(indexDir)))
                    using (var reader = DirectoryReader.Open(directory))
                    {
                        var searcher = new IndexSearcher(reader);

                        // OR together one TermQuery per segmented word.
                        var booleanQuery = new BooleanQuery();
                        var list = CutKeyWord(key);
                        foreach (var word in list)
                        {
                            var query1 = new TermQuery(new Term("name", word));
                            booleanQuery.Add(query1, Occur.SHOULD);
                        }

                        // The hit limit was lost from the original; 1000 is an assumed placeholder.
                        var collector = TopScoreDocCollector.Create(1000, true);
                        searcher.Search(booleanQuery, null, collector);
                        // The start offset was also lost; 0 (start from the best hit) is assumed.
                        var docs = collector.GetTopDocs(0, collector.TotalHits).ScoreDocs;

                        foreach (var d in docs)
                        {
                            var document = searcher.Doc(d.Doc); // fetch the stored document
                            var merchid = document.Get("merchid");
                            if (Guid.TryParse(merchid, out Guid mid))
                            {
                                rs.Add(mid);
                            }
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine($"SearchMerchs ex={ex}");
            }

            return rs;
        }

Code that segments the user's input into words with JiebaNet:

        protected List<string> CutKeyWord(string key)
        {
            var rs = new List<string>();
            var segmenter = new JiebaSegmenter();
            var list = segmenter.Cut(key);
            if (list != null && list.Count() > 0)
            {
                foreach (var item in list)
                {
                    // The length threshold was lost from the original; 1 is assumed here,
                    // which skips empty and single-character tokens.
                    if (string.IsNullOrEmpty(item) || item.Length <= 1)
                    {
                        continue;
                    }

                    rs.Add(item);
                }
            }
            return rs;
        }

NuGet packages to reference, with their versions:

Hangfire 1.7.0-beta1

Lucene.Net 4.8.0-beta00005

Lucene.Net.Analysis.Common 4.8.0-beta00005

Lucene.Net.QueryParser 4.8.0-beta00005

DLL that has to be referenced separately:

JiebaNet.Segmenter.dll 

Download link:

https://pan.baidu.com/s/1D7mQnow0FmoqedNYzugfKw

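If everything works when debugging locally but the scheduled job fails after deployment to the server, you may be hitting this problem: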

https://*.com/questions/47746582/hangfire-job-throws-system-typeloadexception

 
System.TypeLoadException

Could not load type ‘***’ from assembly ‘***, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null’.

This error message is not the real cause. Printing the full exception shows what actually went wrong.
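One way to make the underlying failure visible is to walk the `InnerException` chain when logging. The helper below is a minimal sketch; `ExceptionDump.Describe` is a name made up for this example, and the post itself does not say how the exception was printed.

```csharp
using System;
using System.Text;

public static class ExceptionDump
{
    // Returns one line per exception in the chain, outermost first.
    public static string Describe(Exception ex)
    {
        var sb = new StringBuilder();
        var depth = 0;
        for (var cur = ex; cur != null; cur = cur.InnerException)
        {
            sb.AppendLine($"[{depth}] {cur.GetType().FullName}: {cur.Message}");
            depth++;
        }
        return sb.ToString();
    }
}
```

It could be called from the job's catch block, for example `Console.WriteLine(ExceptionDump.Describe(ex));`.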

The actual cause was that the dictionary file dict.txt under the Resources folder had not been published to the server.
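A cheap guard against this is to check for the dictionary file at startup and fail with a clear message. The sketch below assumes the file is expected at `Resources/dict.txt` under a base directory you pass in; which directory jieba.NET actually reads from depends on how it is configured, so treat the path as an assumption. `JiebaResourceCheck` is a hypothetical helper name.

```csharp
using System;
using System.IO;

public static class JiebaResourceCheck
{
    // Returns true when <baseDir>/Resources/dict.txt exists.
    public static bool DictionaryExists(string baseDir)
    {
        var dictPath = Path.Combine(baseDir, "Resources", "dict.txt");
        return File.Exists(dictPath);
    }
}
```

For example, calling `JiebaResourceCheck.DictionaryExists(AppContext.BaseDirectory)` at startup and logging a warning when it returns false would surface a missing dictionary before the first job runs.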

This pitfall cost me half a day...