2012年2月19日星期日

FTS query performance on SQL 2005

I am seeing some query performance issues on a full-text search on SQL 2005.
My table has about 20 million rows, containing an integer primary key and a
field of type text. This is running on a dual core Xeon 2.8 Ghz processor, 2
GB RAM, 15k RPM drives in a RAID 5 configuration.
My query looks like this:
select ItemID FROM Item WHERE CONTAINS(ItemText, 'there')
which returns about 40,000 rows, but the query takes over 2 minutes! Is that
normal performance for this beefy server for such a simple query? If I add
"TOP 20" after the SELECT, the query takes under 1 second.
I have checked the hardware and I can't seem to find constraints, either in
CPU, memory or disk. Any ideas why that query takes over 2 minutes?
Thanks.
Speed depends on the complexity of your query and the amount of rows you are
returning. In your case you have a simple query and what appears to be
causing the performance problem. I would use containstable with the
top_n_by_rank parameter to limit your results set to 100 or 200 rows.
Is this SQL 2005? There are some optimizations for SQL 2005 which will offer
better performance.
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.
This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Bahama Joe" <someone@.microsoft.com> wrote in message
news:ub9L57y%23GHA.1224@.TK2MSFTNGP05.phx.gbl...
>I am seeing some query performance issues on a full-text search on SQL
>2005. My table has about 20 million rows, containing an integer primary key
>and a field of type text. This is running on a dual core Xeon 2.8 Ghz
>processor, 2 GB RAM, 15k RPM drives in a RAID 5 configuration.
> My query looks like this:
> select ItemID FROM Item WHERE CONTAINS(ItemText, 'there')
> which returns about 40,000 rows, but the query takes over 2 minutes! Is
> that normal performance for this beefy server for such a simple query? If
> I add "TOP 20" after the SELECT, the query takes under 1 second.
> I have checked the hardware and I can't seem to find constraints, either
> in CPU, memory or disk. Any ideas why that query takes over 2 minutes?
> Thanks.
>
|||Thanks for the reply. Yes, this is SQL 2005 - what optimizations are you
referring to? I believe I have applied all optimizations that I've been able
to find in the various posts and online docs.
I know that I can limit the size of the resultset with top_n_by_rank, but in
this case, I'm trying to get back the full set of results. Another form of
my query is to do a "select count(*)", which has the same response times. I
believe this is because the full resultset is returned from the FTE back to
SQL Server, and then the count is taken on that. Is there a way to structure
the query to tell the FTE that you just want the count of results, so that
it doesn't ship the full results back to SQL Server?
Essentially, I'm trying to create a search engine, where the results will be
displayed back to the user in paginated form, so it will always display a
subset of the resultset, but I'd like to also display "Showing results 1-20
of 40,000", so I need a way to get the size of the resultset.
With regard to hardware, would you expect that I will get the most bang for
my buck by a) spreading my data across more disks (via RAID) in a single
server, b) creating a cluster of separate servers, c) adding more memory, or
d) adding more CPU's?
Thanks.
|||Thanks for the reply. Yes, this is SQL 2005 - what optimizations are you
referring to? I believe I have applied all optimizations that I've been able
to find in the various posts and online docs.
I know that I can limit the size of the resultset with top_n_by_rank, but in
this case, I'm trying to get back the full set of results. Another form of
my query is to do a "select count(*)", which has the same response times. I
believe this is because the full resultset is returned from the FTE back to
SQL Server, and then the count is taken on that. Is there a way to structure
the query to tell the FTE that you just want the count of results, so that
it doesn't ship the full results back to SQL Server?
Essentially, I'm trying to create a search engine, where the results will be
displayed back to the user in paginated form, so it will always display a
subset of the resultset, but I'd like to also display "Showing results 1-20
of 40,000", so I need a way to get the size of the resultset.
With regard to hardware, would you expect that I will get the most bang for
my buck by a) spreading my data across more disks (via RAID) in a single
server, b) creating a cluster of separate servers, c) adding more memory, or
d) adding more CPU's?
Thanks.
|||I'm struggling with the same issues as you.
Basically what i do is bank on the fact that it is rare for most people to
look beyond the first page of results, so I write the results of the search
to a table, and return it to a data reader displaying the first 25 results
and a count of all search results. Repeat searches go against the cached
table.
The optimizations are
1) use 64 bit
2) use a high resource_usage keeping in mind this can cause locking
3) reorganize your catalogs frequently
4) set ft crawl bandwidth (max) and ft notify bandwidth (max) to 0
5) max full-text crawl range to the number of cpu's on your system
6) convert your binary data to text
and you get the best bank for your buck by placing the full-text catalog on
its own disk subsystem and controller with the fastest disks available. RAID
5 offers best read performance, but for frequently updated catalogs use raid
10.
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.
This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Bahama Joe" <someone@.microsoft.com> wrote in message
news:epadrr9%23GHA.4712@.TK2MSFTNGP03.phx.gbl...
> Thanks for the reply. Yes, this is SQL 2005 - what optimizations are you
> referring to? I believe I have applied all optimizations that I've been
> able
> to find in the various posts and online docs.
> I know that I can limit the size of the resultset with top_n_by_rank, but
> in
> this case, I'm trying to get back the full set of results. Another form of
> my query is to do a "select count(*)", which has the same response times.
> I
> believe this is because the full resultset is returned from the FTE back
> to
> SQL Server, and then the count is taken on that. Is there a way to
> structure
> the query to tell the FTE that you just want the count of results, so that
> it doesn't ship the full results back to SQL Server?
> Essentially, I'm trying to create a search engine, where the results will
> be
> displayed back to the user in paginated form, so it will always display a
> subset of the resultset, but I'd like to also display "Showing results
> 1-20
> of 40,000", so I need a way to get the size of the resultset.
> With regard to hardware, would you expect that I will get the most bang
> for
> my buck by a) spreading my data across more disks (via RAID) in a single
> server, b) creating a cluster of separate servers, c) adding more memory,
> or
> d) adding more CPU's?
> Thanks.
>
>
|||Thanks for the reply. I had already applied optimizations 2, 4 and 5, based
on one of your earlier posts in another thread. I don't have access to a
64-bit server, but this is a good suggestion. My data is static data, so I
don't expect I'll need to reorganize the catalogs frequently. Also, I only
have text data stored in a single column of type 'text', no binary data.
With regard to caching, I find that subsequent queries using the same search
string (i.e. to support the user going to the next page) come back in about
3-4 ms, meaning that there's already some good caching in place within SQL
Server. It seems that relying on that should be sufficient, instead of
creating my own sepearate caching mechanism as you describe. Is that not
your experience? The most pressing issue I'm trying to resolve is why the
first query (without a top_n_by_rank) can sometimes take 2-3 minutes to
respond.
Thanks.
|||This could be caching of the catalog pages. What is your max server memory
setting?
sp_configure 'max server memory (MB)'
Also what happens if you just issue a new contains query ie like this
select * from TableName where contains(*,'"George Bush"')
supposing you have not searched for George Bush recently and he is in your
content?
It could be that the other tables are causing the performance hit and not
your full-text queries themselves.
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.
This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Bahama Joe" <someone@.microsoft.com> wrote in message
news:ON3tvnE$GHA.5092@.TK2MSFTNGP04.phx.gbl...
> Thanks for the reply. I had already applied optimizations 2, 4 and 5,
> based on one of your earlier posts in another thread. I don't have access
> to a 64-bit server, but this is a good suggestion. My data is static data,
> so I don't expect I'll need to reorganize the catalogs frequently. Also, I
> only have text data stored in a single column of type 'text', no binary
> data.
> With regard to caching, I find that subsequent queries using the same
> search string (i.e. to support the user going to the next page) come back
> in about 3-4 ms, meaning that there's already some good caching in place
> within SQL Server. It seems that relying on that should be sufficient,
> instead of creating my own sepearate caching mechanism as you describe. Is
> that not your experience? The most pressing issue I'm trying to resolve is
> why the first query (without a top_n_by_rank) can sometimes take 2-3
> minutes to respond.
> Thanks.
>
|||Hello Bahama,
Is this an upgrade or a new SQL install?
Make sure you update your statistics and possibly rebuild your indexes.
To get a count it is better to do select count(1) from containstable (table,
column,search)
To get the paged results do SELECT TOP x where x covers the page the users
wants.
Make sure you SQL box isn't using all the memory, you need to share it with
full text.
How big is your database? 20 million rows and 2Gb of cache doesn't allow
for much of you DB to be cached.
What indexes do you have on your tables?
Simon Sabin
SQL Server MVP
http://sqlblogcasts.com/blogs/simons

> Thanks for the reply. I had already applied optimizations 2, 4 and 5,
> based on one of your earlier posts in another thread. I don't have
> access to a 64-bit server, but this is a good suggestion. My data is
> static data, so I don't expect I'll need to reorganize the catalogs
> frequently. Also, I only have text data stored in a single column of
> type 'text', no binary data.
> With regard to caching, I find that subsequent queries using the same
> search string (i.e. to support the user going to the next page) come
> back in about 3-4 ms, meaning that there's already some good caching
> in place within SQL Server. It seems that relying on that should be
> sufficient, instead of creating my own sepearate caching mechanism as
> you describe. Is that not your experience? The most pressing issue I'm
> trying to resolve is why the first query (without a top_n_by_rank) can
> sometimes take 2-3 minutes to respond.
> Thanks.
>
|||Hilary,
Here is the result of that query:
name=max server memory (MB)
minimum=16
maximum=2147483647
config_value=4096
run_value=4096
Is that configured correctly for a server w/ 2GB of physical RAM?
Also, in one of your earlier posts, you mentioned setting "max full-text
crawl range" to the number of CPU's on the system. I have a dual-core Xeon
processor, meaning it has 2 CPU's (in one chip). However, due to Hyper
Threading, Windows and SQL Server see this server as having 4 CPU's, so I
have that value set to 4. Is that a good setting?
All of my tests have been on just the single table that contains my
full-text index, such as "select ItemID from Item where
contains(ItemText,'"George Bush"')", so there are no other tables involved.
Thanks for all of your insightful replies.
|||Hello Simon,
This is a new SQL install. I just created the table, loaded it with all 20
million rows from a text file via "Import data...", then created the
full-text index (which took about 2 hours). My table basically includes an
"int" primary key (clustered index) column and a "text" column with the text
to be indexed. The table also includes 2 other "text" fields, but they're
not involved in these queries. The result is a database .MDF file of approx.
12GB and a full-text index with files of approx. 2GB, with the full-text
index containing approx. 1.4 million unique keys.
When I look at task manager for memory usage, I see the following:
sqlservr.exe --> mem usage=740MB, VM size=749MB
msftesql.exe --> mem usage=7.6MB, VM size=5MB
Right after a reboot, these numbers are continually rising, but after a
number of queries, these numbers steady out to the ones listed above. Based
on this, would you recommend that I change my memory configuration?
Thanks.

没有评论:

发表评论