Monday, March 26, 2012

Full text search on Chinese or Chinese/English mix?

I want to use SQL Server 2005 full-text search (FTS) to search web pages I
have crawled. A page can be Chinese, English, or mixed (a Chinese article
with English phrases in it).
First question: which language's word breaker should I choose? Does the
Chinese word breaker make the English content hard to search?
Second question: should I store text in different languages in different
catalogs so that I can choose a specific word breaker for each? If so, how
do I determine what language a web page is using? Most Chinese and English
web pages use the utf-8 charset, which makes the language indistinguishable
to my program. Shouldn't SQL Server figure out which word breaker to use
automatically by examining the bytes of the utf-8 encoded text?
Third question: what encoding should I use when I insert the content of a
web page into the full-text database: utf-8, gb2312 (Chinese), or Unicode?
Does it matter?
Your input is greatly appreciated.

For this to work you have to tag your documents with the ms.locale meta tag,
store them in an image or varbinary column, and then query using the
LANGUAGE keyword. The language you assign to the column is irrelevant,
because the language tags in the documents take precedence. Here is an
example:
CREATE TABLE blob
(pk INT NOT NULL IDENTITY(1,1) CONSTRAINT PrimaryKey PRIMARY KEY,
blob VARBINARY(MAX),
blobtype VARCHAR(10))
GO
-- Note: LCID 1033 is American English.
-- A full-text catalog named catalog_name must already exist.
CREATE FULLTEXT INDEX ON blob
(blob TYPE COLUMN blobtype LANGUAGE 1033)
KEY INDEX PrimaryKey ON catalog_name
GO
-- Note that the HTML documents we are pushing in are tagged with French
-- language meta tags.
INSERT INTO blob (blob, blobtype)
VALUES (CONVERT(VARBINARY(256), '<HTML><HEAD><META name="ms.locale" CONTENT="FR"></HEAD><BODY>mang</BODY></HTML>'), '.htm')
INSERT INTO blob (blob, blobtype)
VALUES (CONVERT(VARBINARY(256), '<HTML><HEAD><META name="ms.locale" CONTENT="FR"></HEAD><BODY>manger</BODY></HTML>'), '.htm')
GO
-- Query for all stemmed forms of the French verb manger (to eat).
-- LCID 1036 is French.
SELECT * FROM blob WHERE CONTAINS(*, 'FORMSOF(INFLECTIONAL, manger)', LANGUAGE 1036)
-- two rows returned.
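The same pattern could be tried for a mixed Chinese/English page. The following is only a sketch under several assumptions that are not in the original post: that the Simplified Chinese word breaker (LCID 2052) is installed, that the HTML filter accepts "zh-cn" as an ms.locale value, and that it recognises UTF-16 text from a leading byte-order mark (the 0xFFFE prefix). It reuses the blob table from the example above.

```sql
-- Assumption: "zh-cn" is an accepted ms.locale value and LCID 2052
-- (Simplified Chinese) is available on this instance.
-- N'...' makes the literal Unicode (UTF-16); 0xFFFE is prepended as a
-- byte-order mark so the filter can tell how the bytes are encoded.
INSERT INTO blob (blob, blobtype)
VALUES (0xFFFE + CONVERT(VARBINARY(MAX),
        N'<HTML><HEAD><META name="ms.locale" CONTENT="zh-cn"></HEAD><BODY>全文检索 full text search</BODY></HTML>'),
        '.htm')
GO
-- Search for the Chinese term and for the English word, both under the
-- Chinese language setting, to see how each is handled.
SELECT pk FROM blob WHERE CONTAINS(blob, N'"全文检索"', LANGUAGE 2052)
SELECT pk FROM blob WHERE CONTAINS(blob, '"search"', LANGUAGE 2052)
```

Full-text population runs in the background, so a query issued immediately after the insert may not see the new row yet. Comparing the two result sets on a few real pages is a quick way to check whether English phrases inside Chinese articles remain searchable.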
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
In reply to the question above, posted by "Xin Chen" <xchen@.xtremework.com>
as message news:uReDrmxJFHA.3184@.TK2MSFTNGP09.phx.gbl

Maybe I didn't answer your question too well.
1) It doesn't matter which word breaker you select. For varbinary or image
columns, where the documents contain language tags the iFilter understands
(HTML documents tagged with the ms.locale meta tag, or Word and other Office
documents), the embedded language tag controls the word breaker used.
2) You don't have to separate content by language if you are using image or
varbinary columns. For columns of other data types you will.
3) utf-8 should work.
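For point 2, one way to keep languages apart in character columns is to give each language its own column, since every column in a full-text index carries its own LANGUAGE setting. A minimal sketch, assuming hypothetical names (page_text, url, body_en, body_zh, pk_page_text), an existing catalog called catalog_name, and an installed Simplified Chinese word breaker (LCID 2052):

```sql
-- List the full-text languages available on this instance (LCID and name).
SELECT lcid, name FROM sys.fulltext_languages ORDER BY name
GO
-- Hypothetical table: one text column per language, filled by the crawler.
CREATE TABLE page_text
(pk INT NOT NULL IDENTITY(1,1) CONSTRAINT pk_page_text PRIMARY KEY,
 url NVARCHAR(400) NOT NULL,
 body_en NVARCHAR(MAX) NULL,   -- English pages
 body_zh NVARCHAR(MAX) NULL)   -- Chinese or mixed Chinese/English pages
GO
-- Each column is indexed with the word breaker for its own language.
CREATE FULLTEXT INDEX ON page_text
(body_en LANGUAGE 1033, body_zh LANGUAGE 2052)
KEY INDEX pk_page_text ON catalog_name
GO
-- Query each column with the matching language.
SELECT pk, url FROM page_text WHERE CONTAINS(body_zh, N'"数据库"', LANGUAGE 2052)
SELECT pk, url FROM page_text WHERE CONTAINS(body_en, '"database"', LANGUAGE 1033)
```

With this layout the crawler still has to decide which column a page belongs in, which is the detection problem raised in the question; the varbinary approach avoids that only when the pages carry a language tag.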
Hilary Cotter
