Monday, March 26, 2012

Full Text Search Problem

This is the way FTS works. Look up "noise words" in Books Online. The noise
file (e.g. noise.dat) lists all the words that are ignored when building
full-text indexes, which means searching for them is not possible, hence the
error (which becomes a warning in SQL Server 2005).
You have two options:
1) Handle it in the client application: prevent users from issuing searches
where only the ignored words have been used. You can use the noise file to
programmatically test each search string;
2) Remove the words from the noise list (leave empty lines): this may
increase the space used by full-text catalogs significantly, so only remove
those words that you expect the users to search for.
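
A minimal sketch of option 1, assuming the client can read the noise words into memory. The function names and the sample word list are made up for illustration; a real list would be read from the server's noise file.

```python
import re

def load_noise_words(lines):
    """Build a set of noise words from an iterable of lines (e.g. an open noise file)."""
    words = set()
    for line in lines:
        words.update(w.lower() for w in line.split())
    return words

def is_only_noise(query, noise_words):
    """Return True when every word in the query is a noise word (or the query has no words)."""
    terms = re.findall(r"\w+", query.lower())
    return all(t in noise_words for t in terms)

# Hypothetical sample list, not the contents of any real noise file.
noise = load_noise_words(["a about after all", "of the to"])
print(is_only_noise("the of", noise))                    # True: reject before querying
print(is_only_noise("University of California", noise))  # False: safe to send
```

The client would call is_only_noise() before sending the search to the server and ask the user for a more specific term when it returns True.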
Perhaps other frequent posters in this newsgroup have other suggestions.
ML
http://milambda.blogspot.com/
Just to piggy back off ML's comment.
I used to recommend stripping the noise words out of your query phrase,
however this will frequently lead to errors. For example, take a search on
"University Of California": once it is stripped of its noise word OF, the
search is conducted on "University California" and will miss results containing
"University of California" and "University to California".
IMHO the best approach is to empty your noise word list and replace it with
a single space or, as ML points out, a line feed.
Note that a FreeText search gets around this problem, but it may return too
many results and it is slower than Contains.
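
The "University of California" example can be modelled with a toy positional index. This is only a sketch under the assumption that a noise word is dropped from the index but still takes up a word position; the noise list and function names are invented here, and this is not SQL Server's actual matching code.

```python
import re

NOISE = {"of", "to", "the"}  # hypothetical noise list for illustration

def index_positions(text):
    """Return (word, position) pairs; noise words are dropped but still consume a position."""
    words = re.findall(r"\w+", text.lower())
    return [(w, i) for i, w in enumerate(words) if w not in NOISE]

def phrase_matches(query, document):
    """True if the query's indexed words occur in the document with the same relative spacing."""
    q = index_positions(query)
    d = set(index_positions(document))
    if not q:
        return False  # nothing but noise words: nothing to match on
    first_word, first_pos = q[0]
    starts = [pos for word, pos in d if word == first_word]
    return any(
        all((word, start + pos - first_pos) in d for word, pos in q)
        for start in starts
    )

for doc in ["University of California", "University to California"]:
    print(doc,
          phrase_matches("University of California", doc),  # full phrase
          phrase_matches("University California", doc))     # noise word stripped
```

In this model the full phrase matches both documents, because the gap left by the noise word is preserved, while the stripped phrase matches neither, because it asks for the two words to be adjacent.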
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
|||Yes, noise words aren't all bad, but search strings containing nothing but
noise words are.
ML
http://milambda.blogspot.com/
|||Hi ML, very true and well said.
Historically, noise words were intended to conserve disk space: back in the
80's, when search was first starting, disks were very expensive. Today they
are intended to "hide" noisy phrases from searching. For example, a search on
Microsoft SQL Server is the functional equivalent of a search on SQL Server,
so you get better search efficiency by not looking for Microsoft.
Microsoft (at one time, perhaps still the case) added Microsoft to the
noise word list on their search engines for this reason. Apparently at one
time they would also add words longer than 26 letters to their noise word
list, as you would be unable to search on them.
MSN Search was one of the first big search engines to allow you to search on
noise words. For example, when MSN Search first came out, a search on "the"
would return the White House as the number one hit.
Hilary Cotter
|||Thanks for that info - it's essential, a must-know.
Do you by any chance have a list of characters ignored by FTS that aren't
included in noise files (e.g. punctuation marks)?
ML
http://milambda.blogspot.com/
|||Basically all alphanumeric characters are indexed. Hyphens and capitalization
are respected in some languages. In some languages the indexing process knows
that a character occurs after a single letter (e.g. C#), but doesn't index what
the character is, so a search on C# will also match C$.
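
To see why C# and C$ can end up matching each other, here is a toy tokenizer that simply throws away every non-alphanumeric character. It is only an illustration: the function name is made up, and the real word breakers are language-specific and more elaborate than this.

```python
import re

def toy_tokens(text):
    """Split text into lowercase alphanumeric runs, discarding every other character."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(toy_tokens("C#"))                      # ['c']
print(toy_tokens("C$"))                      # ['c']
print(toy_tokens("C#") == toy_tokens("C$"))  # True: the index cannot tell them apart
```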
Currency symbols change how a number is stored in the index, and the same
goes for apparent date strings.
Abbreviations are handled differently: for example, f.b.i is indexed as f, b,
and i, whereas F.B.I is indexed as FBI and F.B.I.
IMHO I did an ok job in this article discussing language options in SQL FTS.
http://www.simple-talk.com/sql/learn-sql-server/sql-server-full-text-search-language-features/
If you are really interested in the internals of how this works with most
search engines you might want to look at the code in Lucene or Foundations
of Statistical Natural Language Processing. There is another book which is
really good on this and presents algorithms but I can't recall the name of
it right now.
Hilary Cotter
|||Thank you again! That article is now a permanent reference.
ML
http://milambda.blogspot.com/