2012年3月11日星期日

Full Text Index not working when populated from nvarchar. Bug?

Hi, I was wondering if any SQL Server gurus out there could help me...
I have a table I'm trying to apply a full text catalog to, however no
results are ever returned due to the text column being cataloged being of
varbinary(max) that's being populated from a converted nvarchar(max) value.
To re-create the problem quickly...
If I populate the column via
CONVERT(varbinary(max), 'test text')
then there is no problem, I get results as expected.
However if I populate the column via
CONVERT(varbinary(max), CAST('test text' as nvarchar(max)))
no results are ever returned.
Is this a bug with SQL Server 2005 Full Text Indexing? I'm happily creating
full text catalogs when an nvarchar is not getting converted into a varbinary.
I'm setting the Document Type column to '.html' (I've tried changing this to
'.txt' in case it was a fault with the html ifilter but the problem persists
so I believe I can rule this out).
The reason I need to convert an nvarchar to varbinary is that the table
holds multi-lingual text and I'm adding a html meta tag <META
NAME="MS.LOCALE" CONTENT="ES"> to the beginning in order for the full text
indexing word breaker to select the correct language to catalog the text
with. The aim being to provide more relevant searches in users native
languages from a single table (I've read a few articles that describe this
technique, but it's the first time I've tried to apply it).
Any pointers / suggestions would be greatly appreciated. Cheers,
Gavin.
Below is a T-SQL script you can run to demonstrate the effect I'm
experiencing...
-- Create test database
CREATE DATABASE FullTextTest
GO
USE FullTextTest
GO
-- Create test data table
CREATE TABLE TestTable
(
pk UNIQUEIDENTIFIER NOT NULL CONSTRAINT tablePK PRIMARY KEY,
varbinarycol VARBINARY(MAX),
documentExtension VARCHAR(5),
)
GO
-- The below single entry WILL BE FOUND (the text source is being entered
directly)
INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES (NEWID(),
CONVERT(VARBINARY(MAX),'<META NAME="MS.LOCALE" CONTENT="EN">test entry 1'),
'.html')
-- The bellow two entries below WILL NOT BE FOUND (the text source is taken
from an NVARCHAR(MAX) value)
INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES (NEWID(),
CONVERT(VARBINARY(MAX), CAST('<META NAME="MS.LOCALE" CONTENT="EN">test entry
2' AS NVARCHAR(MAX))), '.html')
INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES (NEWID(),
CONVERT(VARBINARY(MAX), CAST('<META NAME="MS.LOCALE" CONTENT="EN">test entry
3' AS NVARCHAR(MAX))), '.html')
GO
-- Create the full text catalog
sp_fulltext_database 'enable'
GO
CREATE FULLTEXT CATALOG TEST AS DEFAULT
GO
CREATE FULLTEXT INDEX ON TestTable (varbinarycol TYPE COLUMN
documentExtension LANGUAGE 1033)
KEY INDEX tablePK
GO
-- NOTE: You might need to give the catalog a chance to build before running
the script below.
-- Now do a search that SHOULD RETURN 3 ROWS of data, but ONLY 1 ROW IS
RETURNED
SELECT CAST(varbinarycol AS NVARCHAR(MAX)) FROM TestTable WHERE
CONTAINS(varbinarycol, 'test')
I don't believe you need to use nvarchar(max) here, varchar(max) will work.
HTML will have a code page or utf metatag in it, which the html ifilter will
understand.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"gavin" <gavin@.discussions.microsoft.com> wrote in message
news:3CDA0C57-863D-43CE-898C-B0F18DEA7A4E@.microsoft.com...
> Hi, I was wondering if any SQL Server gurus out there could help me...
> I have a table I'm trying to apply a full text catalog to, however no
> results are ever returned due to the text column being cataloged being of
> varbinary(max) that's being populated from a converted nvarchar(max)
> value.
> To re-create the problem quickly...
> If I populate the column via
> CONVERT(varbinary(max), 'test text')
> then there is no problem, I get results as expected.
> However if I populate the column via
> CONVERT(varbinary(max), CAST('test text' as nvarchar(max)))
> no results are ever returned.
> Is this a bug with SQL Server 2005 Full Text Indexing? I'm happily
> creating
> full text catalogs when an nvarchar is not getting converted into a
> varbinary.
> I'm setting the Document Type column to '.html' (I've tried changing this
> to
> '.txt' in case it was a fault with the html ifilter but the problem
> persists
> so I believe I can rule this out).
> The reason I need to convert an nvarchar to varbinary is that the table
> holds multi-lingual text and I'm adding a html meta tag <META
> NAME="MS.LOCALE" CONTENT="ES"> to the beginning in order for the full text
> indexing word breaker to select the correct language to catalog the text
> with. The aim being to provide more relevant searches in users native
> languages from a single table (I've read a few articles that describe this
> technique, but it's the first time I've tried to apply it).
> Any pointers / suggestions would be greatly appreciated. Cheers,
> Gavin.
> Below is a T-SQL script you can run to demonstrate the effect I'm
> experiencing...
> -- Create test database
> CREATE DATABASE FullTextTest
> GO
> USE FullTextTest
> GO
> -- Create test data table
> CREATE TABLE TestTable
> (
> pk UNIQUEIDENTIFIER NOT NULL CONSTRAINT tablePK PRIMARY KEY,
> varbinarycol VARBINARY(MAX),
> documentExtension VARCHAR(5),
> )
> GO
> -- The below single entry WILL BE FOUND (the text source is being entered
> directly)
> INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES
> (NEWID(),
> CONVERT(VARBINARY(MAX),'<META NAME="MS.LOCALE" CONTENT="EN">test entry
> 1'),
> '.html')
> -- The bellow two entries below WILL NOT BE FOUND (the text source is
> taken
> from an NVARCHAR(MAX) value)
> INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES
> (NEWID(),
> CONVERT(VARBINARY(MAX), CAST('<META NAME="MS.LOCALE" CONTENT="EN">test
> entry
> 2' AS NVARCHAR(MAX))), '.html')
> INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES
> (NEWID(),
> CONVERT(VARBINARY(MAX), CAST('<META NAME="MS.LOCALE" CONTENT="EN">test
> entry
> 3' AS NVARCHAR(MAX))), '.html')
> GO
> -- Create the full text catalog
> sp_fulltext_database 'enable'
> GO
> CREATE FULLTEXT CATALOG TEST AS DEFAULT
> GO
> CREATE FULLTEXT INDEX ON TestTable (varbinarycol TYPE COLUMN
> documentExtension LANGUAGE 1033)
> KEY INDEX tablePK
> GO
> -- NOTE: You might need to give the catalog a chance to build before
> running
> the script below.
> -- Now do a search that SHOULD RETURN 3 ROWS of data, but ONLY 1 ROW IS
> RETURNED
> SELECT CAST(varbinarycol AS NVARCHAR(MAX)) FROM TestTable WHERE
> CONTAINS(varbinarycol, 'test')
|||Hi Hilary,
Thanks for the input, but I was under the impression from what I've read so
far that I would need to use nvarchar to future proof the input for the table
for when I start translating my application into Japanese, Chinese, and other
non-latin based scripts? Can store all these inputs safely in a varchar
variable? I was under the impression nvarchar had been introduced to get
around the language support limitation of varchars? I'm still on a learning
curve here so please tell me if I'm wrong?
I'm actually new to multi-lingual applications, but perhaps I should provide
more detail of what I'm trying to achieve...
I want to store multiple languages in the same table. In order to get the
full text index to use the correct word breakers on a row by row basis the
extra html meta tag is being added to the text to describe the language the
ifilter should go and fetch the relevant word breaker for. From what I
understand, this is only possible with varbinary data so I'm trying to
populate it with a passed in nvarchar parameter to my stored procedure. By
default sql server expects a single language on a table and this technique
should allow me to circumnavigate this limitation. The technique is described
in this article here -
http://www.simple-talk.com/sql/learn-sql-server/sql-server-full-text-search-language-features/
(look for the heading "Via language tags in the content being indexed" about
a quarter of the way down the article).
So at the end of the day: I'm storing plain text (in any language), adding
an html meta tag to describe the language and pretending it's html to get SQL
Server to index based on language on a row by row basis.
Regards,
Gavin.
|||Hi Hilary, just woke up a bit more and made the connection - you wrote the
article I was using as my inspiration! Nice article by the way ;)
|||I've also tried adding the following extra meta tag - <META
http-equiv="Content-Type" content="text/html; charset=utf-16">
But this seemed to break the Full Text Index even further. If applied to the
example T-SQL I give above then even the non-nvarchar input is no longer
returned:
INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES (NEWID(),
CONVERT(VARBINARY(MAX),'<META http-equiv="Content-Type" content="text/html;
charset=utf-16"><META NAME="MS.LOCALE" CONTENT="EN">test entry 1'), '.html')
|||You need to complete html document, not just the metatags. here is an
example which works for me.
CREATE DATABASE CODE_SAMPLE5
GO
USE CODE_SAMPLE5
GO
CREATE TABLE CODE_SAMPLE5
(PK INT NOT NULL IDENTITY CONSTRAINT PRIMARYKEY PRIMARY KEY,
VARBINARYCOL VARBINARY(MAX),
DOCUMENTTYPECOLUMN CHAR(4))
GO
CREATE FULLTEXT CATALOG TEST AS DEFAULT
GO
CREATE FULLTEXT INDEX ON CODE_SAMPLE5 (VARBINARYCOL TYPE COLUMN
DOCUMENTTYPECOLUMN) KEY INDEX PRIMARYKEY
GO
--The first document is an html file. We are building this on the fly. Note
the title metatag, and the
--keywords metatag.
INSERT INTO CODE_SAMPLE5(VARBINARYCOL,
DOCUMENTTYPECOLUMN)VALUES(CONVERT(VARBINARY(500),' <HTML><HEAD><TITLE>this is
the title</TITLE><meta name=''keywords''
content=''keyword''></HEAD><BODY>test</BODY></HTML>'),'.htm')
SELECT* from code_sample5 where contains(*,'test')
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"gavin" <gavin@.discussions.microsoft.com> wrote in message
news:026EDE33-B2EC-4761-AADE-A79EA13AA4F4@.microsoft.com...
> I've also tried adding the following extra meta tag - <META
> http-equiv="Content-Type" content="text/html; charset=utf-16">
> But this seemed to break the Full Text Index even further. If applied to
> the
> example T-SQL I give above then even the non-nvarchar input is no longer
> returned:
> INSERT INTO TestTable (pk, varbinarycol, documentExtension) VALUES
> (NEWID(),
> CONVERT(VARBINARY(MAX),'<META http-equiv="Content-Type"
> content="text/html;
> charset=utf-16"><META NAME="MS.LOCALE" CONTENT="EN">test entry 1'),
> '.html')
|||Hi Hilary,
I've also tried playing around with a full set of html wrapping tags as you
give in your example. I still get the same problem. If you tweak you're code
a little I believe you will experience the same problem. Convert your text
into an nvarchar before converting to varbinary and you'll see.
-- Ensure input is NVARCHAR
DECLARE @.Text nvarchar(max)
SET @.Text = '<HTML><HEAD><TITLE>this is
the title</TITLE><meta name=''keywords''
content=''keyword''></HEAD><BODY>test</BODY></HTML>'
-- Insert
INSERT INTO CODE_SAMPLE5(VARBINARYCOL,
DOCUMENTTYPECOLUMN)VALUES(CONVERT(VARBINARY(500),@. Text),'.htm')
Regards,
Gavin.
|||Have you checked the FTS crawl log for possible errors?
ML
http://milambda.blogspot.com/

没有评论:

发表评论