Monday, March 19, 2012

Full text indexing with Japanese characters problem

Hi all,
I am quite experienced with SQL Server, but not that much with full
text indexing. After some successful attempts with English fields, I
decided to try it with Japanese characters. I don't know why, but it
seems to behave strangely.
As in this screenshot
(http://img65.imageshack.us/img65/980/jap3xt.gif), the CONTAINS
function does not seem to return only fields with an exact word match
of the given "word" (query), but also strange results which do not
even correspond to the query. Can anybody help me with that one?
Thanks!
ibiza
|||This is probably due to the way the Japanese characters are broken by the
word breakers at index time and stored in the full text index. For example,
Japanese consists of 5 different character sets. Japanese "words" are
largely syllables, so when you search on a "word", what it matches is a
variety of sub-tokens/syllables.
What you need to do is have someone who is fluent in Japanese verify that
your search application is actually finding what you are looking for. My
knowledge of Japanese is limited to a few phrases.
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.
This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
|||Hi,
thanks a lot for your reply.
Well, even if I am not "fluent" in Japanese, I can still read Japanese
"hiragana" characters, which is what my query is about.
Each symbol can be translated as a syllable, so おもしろ (the
query) would be "omoshiro", "o-mo-shi-ro".
The first three rows are translated as "omoni", "omo" and "omoi". So
I'm wondering why these results are showing, because they only partly
contain the query word (the おも ("omo") part).
As far as I know Japanese and full text indexing, the query should only
return results where the full and complete query is found (おもしろ),
as I did a CONTAINS search with a whole word.
Can you help me a little more with that information?
thanks again
ibiza
|||This could be by design. For example, in German a word like wanderlust is
broken as wandern, lust and wanderlust. So if you were to do a CONTAINS
query for wanderlust, you would get hits on any of these words. It looks like
it is doing something similar.
Hilary Cotter
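To make the compound-word explanation above concrete, here is a minimal Python sketch of a toy inverted index. It is not SQL Server's word breaker: the `BREAKS` table, `break_word`, `build_index`, `contains`, and the sample rows are all hypothetical names and data made up for illustration, and the only rule it encodes is the wanderlust example from the post. It shows how a query can return rows that share only a sub-token with the search term.

```python
from collections import defaultdict

# Hypothetical word-breaker output (an assumption for illustration only):
# a surface word maps to the tokens stored in the index. The single entry
# mirrors the German example in the post, not a real SQL Server rule set.
BREAKS = {
    "wanderlust": ["wandern", "lust", "wanderlust"],
}


def break_word(word):
    """Return the index tokens for a word (the word itself if no rule applies)."""
    word = word.lower()
    return BREAKS.get(word, [word])


def build_index(rows):
    """Map each token to the set of row ids whose text produced that token."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in text.split():
            for token in break_word(word):
                index[token].add(row_id)
    return index


def contains(index, query):
    """Toy stand-in for CONTAINS: break the query the same way as the
    indexed text and return every row sharing at least one token with it."""
    hits = set()
    for token in break_word(query):
        hits |= index.get(token, set())
    return sorted(hits)


rows = {1: "wanderlust", 2: "lust", 3: "wandern", 4: "wunderbar"}
index = build_index(rows)
print(contains(index, "wanderlust"))  # [1, 2, 3]
```

In this toy model, rows 2 and 3 come back even though they do not contain the full query word. That is the same pattern as the "omo" rows showing up for the おもしろ query, assuming the Japanese word breaker splits the query into smaller tokens.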
|||Hi there,
thanks again for your reply.
I did some other tests with English and they all seemed successful
until this one:
http://img155.imageshack.us/img155/2001/untitled1tx.gif
Why is the CONTAINS returning no rows at all? ;_;
This makes me doubt the validity of all the previous successful
queries I made... Could there be missing words, if a query like the
one with 'now' isn't returning any at all?
What's the problem with full text indexing? :S
And yes, I did repopulate all the indexes before running this query...
thanks for your help!
ibiza
