Indexing from A to Z

Please note**: Portions of this transcript have been edited for clarity

Hosts:

  • Lubor Kollar, Group Program Manager
  • Nam Kang, Support Professional
  • Mirek Sztajno, Program Manager
  • Milind Joshi, Technical Lead
  • Srikumar Rangarajan, Technical Lead
  • Cesar Galindo-Legaria, Development Lead
  • Stephen Dybing, MVP Lead
  • Stephen Wynkoop, https://www.sswug.org

Moderator: Stephen_D (Microsoft)
Welcome to today's Indexing from A to Z chat. I will ask the hosts to introduce themselves.

Host: Lubor (Microsoft)
Lubor Kollar, Program Manager in Query Processor that is part of the SQL Server Engine

Host: swynk (MVP)
This is Stephen Wynkoop, SQL Server MVP, founder and e-editor for SSWUG.ORG.

Host: NK (Microsoft)
HI! My name is Nam Kang and I work for the SQL Server Developer Support Team.

Host: Mirek (Microsoft)
I am Mirek, Program Manager in Storage Engine SQL Server

Host: Milind (Microsoft)
I work on the query optimizer.

Host: Sri (Microsoft)
Howdy all, This is Srikumar Rangarajan, Tech Lead SQL server storage engine.

Host: Cesar (Microsoft)
Cesar Galindo-Legaria, manager of Query Optimizer.

Moderator: Stephen_D (Microsoft)
And... I am Stephen Dybing, SQL Server MVP Lead. Glad you all could make it today!

Please note that this chat is focused on Indexing from A to Z. We're not talking all that much about Yukon yet. If you want to hear about Yukon, please join us at the Microsoft Professional Developers Conference, https://msdn.microsoft.com/events/pdc/ or the PASS Community Summit, https://www.microsoft.com/presspass/events/sqlpass/default.mspx.

**Host: Cesar (Microsoft)
Q:**how does the query optimizer differ in Yukon? has it been re-written?
A: The query optimizer has been extended to support some new features, but it is NOT a rewrite.

**Moderator: Stephen_D (Microsoft)
Q:**When will the first publicly available beta be available for Yukon?
A: It'll be released in the first half of next year. I'm sorry, I cannot be more specific than that.

**Host: Lubor (Microsoft)
Q:**how much can a non-clustered index can affect the performance of dml operations in benchmark test(if microsoft have any test)
A: Usually it is single digit percent. TPC-C has such indexes.

**Host: NK (Microsoft)
Q:**What column is the best candidate for CLUSTERED INDEX(or if u can put them in order of priority for clustered index) - column involved in Joins, Group by, or Order by, range queries, primary key, identity column
A: This really depends on your query. If you are running queries that have a lot of joins then you should consider putting the clustered index on the join condition. If the queries have a lot of queries that use the primary key then it would be a good candidate. You should consider using the ITW (Index Tuning Wizard) to see what it recommends.

**Host: Mirek (Microsoft)
Q:**Can someone explain the differences between clustered indexes on 6.5, 7 and 2000?
A: At the logical level, no changes. We did include changes in the performance design but practically speaking, there is no difference between 7.0 and 2000. However, we did change the page size and the max key length between 6.5 and 7.0.

**Host: Sri (Microsoft)
Q:**I create large MSDE Databases (2GB). I have two tables that add records continuously; I have other processes that delete records that are "too old". Both of these databases are indexed (necessary for performance) I have been concerned about FILL Can you specify the indexes you have on the tables. In general having a lot of churn you will have on the fill factor is bound to degrade. In general we have found it to be close to 60% full. What is the best FILL for tables that turn over all of their data. I have found that rebuilding the indexes can be expensive.
A: Depends on the app. If you are going to have a lot of inserts I would go with 60-70%

**Host: Sri (Microsoft)
Q:**Question2 Can I rebuild indexes safely and still keep them open for inserts. What is best practice?
A: In Yukon you can. We support online index builds which will open up the indexes being rebuilt for inserts and delete.

**Host: Cesar (Microsoft)
Q:**Could someone explain how statistics works with indexes?
A: Statistics are created automatically on the index columns. A histogram is built for the leading columns, and multi-column densities are created on all prefixes of the index columns. E.g. if the index is on (A, B, C), you have a histogram on A, and densities on A; AB; and ABC.

**Host: Lubor (Microsoft)
A:**There is white paper on MSDN explaining details https://msdn.microsoft.com/library/en-us/dnsql2k/html/statquery.asp

Host: Milind (Microsoft)
Q:
When we rebuild our indexes using DBCC DBREINDEX, does SQL Server rebuild the Indexed views also?
A: No. Indexed view is a separate entity. For more information about reindex and indexdefrag please refer to the whitepaper at https://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx.

**Host: Sri (Microsoft)
Q:**How much can a non-clustered index can affect the performance of dml operations in benchmark test (if microsoft have any test)?
A: Every insert and delete operation must maintain the non-clustered indexes if the key columns of those indexes are updated. In general non-clustered indexes help a lot of query performance if the query is covered by the NC-index. So it is a balance between the update and the select costs.

**Host: Lubor (Microsoft)
Q:**Do the histogram buckets effect the optimizer decision to use index for particular value of column. Is there any way we can populate the histogram values in sysindexes ourselves?
A: Yes the bucket sizes are used. No you cannot populate the histogram values yourself.

**Host: Cesar (Microsoft)
Q:**How optimizer choose to use particular index...in our environment we can often see that for particular value for column optimizer is not using index where it should be ...but choosing for all other columns?
A: These are cost-based decisions. The optimizer will estimate the number of disk accesses that are needed in each case and will pick the plan with least anticipated cost. Something to keep in mind is whether or not the index is covering, i.e. if it includes all the columns required by the query. If an index is not covering then the base table needs to be accessed, typically requiring random I/Os, which are more expensive that sequential I/Os.

**Host: Lubor (Microsoft)
Q:**I'm struggling to optimize an update statement which has multiple inner joins; based on the selectivity, are non-clustered indexes more appropriate? I’ve seen the same performance against the table with 0 indexes.
A: Did you try Index Tuning Wizard? Paste your statement to QA and go from there...

**Moderator: Stephen_D (Microsoft)
Q:**I wanted to install full-text indexing on my clustered SQL server. But the setup program always fails. After typing in the virtual server it disappears. It runs for a few minutes in the background and then ends with an application error.
A: I'm sorry, but we don't have any FTS experts here. You might try asking in the microsoft.public.sqlserver.fulltext newsgroup. If you're not familiar with newsgroups, see https://www.microsoft.com/technet/community/newsgroups/server/sql.mspx.

**Host: Mirek (Microsoft)
Q:**From the output of showcontig what should or can be the possible threshold of fragmentation that is good candidate of index rebuild?
A: For information about index rebuild please refer to the whitepaper at https://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx

**Host: Lubor (Microsoft)
Q:**Any statistics available on how much an index affects insert/update/delete performance? I'm thinking of adding some covering indexes but don't want change operations to suffer too much.
A: It depends on the data. Did you try Index Tuning Wizard? Paste your workload to QA and go from there...

Host: Milind (Microsoft)
Q:
Which one is better design index on unique identity key or non-unique key which is used in joins
A: It depends on your other queries. The non-uniqueness should not be an issue if that is what you are worried about.

**Host: Lubor (Microsoft)
Q:**If we have clustered index on col1...but in query where clause we use 3 columns col1,col2..is it better to create a separate new non-clustered index on col1,col2 or just on col2?
A: If your col2 predicate restricts the set (after applying the col1 predicate) significantly it is better to have additional index on col1, col2.

**Host: swynk (MVP)
Q:**Are there interfaces that make it easier to get status information on indexes to be read by a program and acted upon rather than parsing the return report from things like ShowContig?
A: Try looking at show plan in Query Analyzer - it will show you exactly what SQL Server is doing to retrieve your data.

**Host: Cesar (Microsoft)
Q:**I have queries that will run faster using nonclustered indexes. But I have clustered indexes I don't want to drop. Can you have a clustered and a nonclustered index on one column?
A: Sure, you can have both a clustered index as well as a nonclustered index on a given column. The non-clustered index will be narrower as it contains all the columns you need it will be more efficient to read.

**Host: NK (Microsoft)
Q:**I understand that for a clustered index, the data is in the index page; for a non-clustered index, where is the index stored if it isn't on the same page?
A: The leaf layer of a nonclustered index does not consist of the data pages. Instead, the leaf nodes contain index rows. Each index row contains the nonclustered key value and one or more row locators that point to the data row (or rows if the index is not unique) having the key value.

**Host: Mirek (Microsoft)
Q:**Any performance or maintenance advantage to placing indexes in their own files separate from the data?
A: Generally, we don’t expect significant performance advantages. From the maintenance standpoint, building an index elsewhere can give you more room in the filegroup that contains the data.

Host: Milind (Microsoft)
Q:
I've noticed the optimizer doesn't get the number of rows returned very close when you do a "like '%mytext%' " query. Is there any way to "educate" it?
A: The optimizer uses histograms and densities to estimate the number of rows. It is somewhat hard to do with wildcards at the beginning of the text. We are working on improving it in Yukon.

**Host: swynk (MVP)
Q:**Do MS have any benchmarking tools for SQL2000? Particularly for comparing HyperThreading with non-HyperThreading setups?
A: We're not aware of any tools specific to benchmarking SQL2000 in the two environments - but you should be able to get meaningful information from your hardware vendor in terms of comparisons.

**Host: Sri (Microsoft)
Q:**If I use dbcc indexdefrag regularly, should I still use dbcc dbreindex as well?
A: If you defrag regularly re-index is not really needed. Reindex is a faster operation as it can be run in parallel. For more information about reindex and indexdefrag please refer to the whitepaper at https://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx.

**Host: Lubor (Microsoft)
Q:**Aside from profiler/tuning wiz, what are some tips for looking for good index candidates?
A: The ITW is doing a great job, it’s hard to beat it. Make sure that you have covering indexes (no fetches/bookmark lookups in your plan). Also sort order may be helped by index. Look for GROUP BY, ORDER BY and index those.

**Host: Mirek (Microsoft)
Q:**Is parallelism(ON/OFF) effect the decision whether to use particular index or not?
A: Generally it should not be the case, however, sometimes, turning off parallelism may cause the optimizer to choose different query plan which may cause to use different index.

**Host: Sri (Microsoft)
Q:**In general, if I have multiple drives and file groups, will I get some performance gain by placing my indexes in a separate/drive/filegroup. If so, by a lot, a little..?
A: It is not very clear if you will see a definite performance boost. If the data is spread uniformly across the disk it should be quite good enough.

**Host: swynk (MVP)
Q:**Jeffksr: Moving from RMS I files to SQL and in process have combined a series of small tables to a larger one that will be search on25-30 values. Does this cause performance issues on inserts and updates?
A: Sounds like a design issue that you'll want to review. Sometimes you can see better performance, but I've often seen where the only way to improve SELECT performance is to add indexes - and of course a 25-column index isn't going to be pretty on INSERTs. So, you're better off normalizing the tables after you merge the data - that will be your best bet.

**Host: Lubor (Microsoft)
Q:**for data loads; would it be better to drop the indexes, do etl, and then re-create later or load the data in the order of the clustered index?
A: There is a break even point - it depends on the number of indexes, size of the table and size of the delta.

Host: Milind (Microsoft)
Q:
what's the best query/index approach when doing wildcard searching using LIKE as '%sometext%' or is fulltext search the answer to this kind of query?
A: Fulltext may be the answer, but note that fulltext has different semantics than SQL like. It looks for words.

**Host: Lubor (Microsoft)
Q:**Statistics Questions: Is there anything about sp3a that would account for an increase in logical reads and is there any change coming that will make it easier to transfer statistics e.g. (to test servers)?
A: Compare query plans. Different plan may be a cause of different logical reads. You can transfer statistics to test servers by updating the catalogs. Will be easier in the future...

**Host: Cesar (Microsoft)
Q:**What are "covered" indexes and do they really make a difference in performance? I've heard people talk about them, but don't really understand why they help/hurt.
A: A "covering index" is one that contains all the columns required by the query. For example, if you have a table with 26 columns A, B, C, ..., Z. But your queries only need two of those columns, say A and B, you can create a non-clustered index on columns (A, B). In this case, only the non-clustered index will be touched. Since it contains fewer columns, it takes up less space and will be more efficient to read.

**Host: Sri (Microsoft)
Q:**We are seeing heavy fragmentation in our VLDB DB's 180G. We are doing incremental loads. Our disk IO queues have gone off the charts. Is there a way to minimize this?
A: If the incremental load into existing index and the data is inserted in a random fashion you cannot avoid defragmentation. You can run defrag to alleviate your problems.

**Host: Lubor (Microsoft)
Q:**fullscan and default percentage in the statistics collection sometimes gives the same times/io for queries, any suggestions on how to select a better percentage?
A: If the query plan is the same then you will see the same performance. You should usually get similar query plan with sample and without it.

**Host: NK (Microsoft)
Q:**Is there a good rule of thumb on having too many indexes? I've seen performance decrease as the number of indexes gets out of hand of course, I'm more interested in whether there's some magic number of indexes to consider.
A: There is no magic number on the number of indexes. It really depends on what you are doing with the query. I would use the index tuning wizard with a load to see what indexes you should create or drop.

**Host: Cesar (Microsoft)
Q:**Does "count(distinct..)) " always going to do a table scan?
A: It will either perform a table or index scan.

Host: Milind (Microsoft)
Q:
When optimizer decided on whether to use index or not..does it measure cost based on Physical IO or Logical IO?
A: The optimizer does not know what pages will be in the buffer at run time. Therefore it costs with the assumption that all the required pages will have to be read from the disk.

**Host: Lubor (Microsoft)
Q:**I have queries that join tables from multiple DB's, is it possible to get the ITW to evaluate these types of queries? How about joins to linked servers?
A: Not the remote part. You should break the query into "local" portions and try to tune each one of them locally.

**Host: Sri (Microsoft)
Q:**in what situations would I need to modify the pad index value which I understand is the non-leaf level?
A: Pad index is used to leave space on non-leaf level index page to reduce page split caused by future updates. This feature was proven to be not very useful. So don't worry about when to modify the pad index value.

**Host: Sri (Microsoft)
Q:**Is there any particular reason that fillfactor and padindex does not keep value for updates to tables and only effect when creating or rebuilding the index?
A: Fill factor and pad index are used during index creation and rebuild to leave space on pages to reduce page split caused by future update. So there is no point to honor it in future update.

**Host: swynk (MVP)
Q:**DerekB: Follow-up - more specifics on Show Plan
A: Here's an MSDN article on reviewing the execution plan: https://msdn.microsoft.com/library/en-us/optimsql/odp_tun_1_5pde.asp (URL). If you can't see the URL, let me know - you may have to change your chat options.

**Host: Lubor (Microsoft)
Q:**Assume table with many columns and millions of rows. I want to ensure a reference table has all unique values from one column, let's call it UserCode. I know SELECT DISTINCT UserCode INTO tblRef FROM tblBig works, but is a table scan...
A: Unfortunately I must see all rows to find all distinct values... You can speed it up significantly by creating a single column index.

**Host: Mirek (Microsoft)
Q:**I use DBCC Reindex on a weekly basis along with sp_updatestats, is there anything else I should run to keep my indexes in good shape?
A: One more issue: Why do you think you have to run reindex on a weekly basis? For more information about reindex and indexdefrag please refer to the whitepaper at https://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx.

**Host: swynk (MVP)
Q:**Can I get a transcript of this session?
A: You bet! It'll be posted to www.sswug.org within several days - we'll announce it in the newsletter as well.

**Host: Cesar (Microsoft)
Q:**Sometimes we note that optimizer is using optimized plan with all proper index when running the procedure from Query Analyzer but when we run same procedure from application it does not use optimal plan...so if I clear the cache or recompile the procedure, it start working properly from application also.
A: One possible issue is parameters in the stored procedure. When we compile a plan we use the values provided to parameters on the first invocation to do card estimation. If those values are atypical then we may generate a plan that is good for those specific values but that may not be good for other values. The other possible issue, as somebody noted, is to check on the set options, which can affect the plan picked.

**Host: Lubor (Microsoft)
Q:**With SELECT DISTINCT UserCode INTO tblRef FROM tblBig, a single column index on UserCode makes it an index scan (instead of table scan), but that still has an entry per main table row, so is still slow. Anything faster?
A: You have to see all rows. There are some techniques to do it faster with index, but it still may take a time.

**Host: Sri (Microsoft)
Q:**Are indexes on bit columns treated any differently from on tinyint, etc?
A: From indexing point of view no. They occupy lesser space as we use only a bit to represent the data on disk. If you have multiple bit columns they are grouped into a byte. If you have only a single bit column in the index then storage wise it is the same.

**Host: Lubor (Microsoft)
Q:**Are there any white papers regarding highly skewed dataset statistics, that is statistics sampling best practices, etc.?
A: There is a white paper https://msdn.microsoft.com/library/en-us/dnsql2k/html/statquery.asp but it is not talking about the skew nor best practices.

Host: Milind (Microsoft)
Q:
is table scan the same as clustered index scan?
A: Yes (assuming you have a clustered index)

**Host: Lubor (Microsoft)
Q:**We have a database where 7 out of hundreds of tables count for 1/3 of the storage, should that be considered as skewed database?
A: No.

**Host: Lubor (Microsoft)
Q:**what's the definition for skewed database?
A: We are usually talking about skewed data, then the database htat has such data may be called skewed database.

**Host: Gang (Microsoft)
Q:**I have joins that are faster with non-clustered indexes. But the columns have clustered indexes on them. Can I have a clustered and a nonclustered index on the same column?
A: Yes, you can build a clustered and a nonclustered index on the same column.

**Host: Lubor (Microsoft)
Q:**OK, I'll ask, what's "skewed data" then?
A: For example a table with 1 million rows, and half of the values in one column is "0" and the rest 1/2 mil has unique values.

**Host: Mirek (Microsoft)
Q:**Should index utilities be run on any specific schedule or time frame?
A: See the showcontig script in BOL. You should analyze the index and only then choose defrag or re-index. For more information about reindex and indexdefrag please refer to the whitepaper at https://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx.

**Host: Lubor (Microsoft)
Q:**in skewed database, do you have to collect statistics in fullscan?
A: No, you don't need to. The algorithm is adapting to a skew.

**Host: swynk (MVP)
Q:**Statistics Questions: Is there anything about sp3a that would account for an increase in logical reads and is there any change coming that will make it easier to transfer statistics e.g. (to test servers)
A: Compare query plans. Different plan may be a cause of different logical reads. You can transfer statistics to test servers by updating the catalogs. Will be easier in the future...

**Host: Sri (Microsoft)
Q:**How often should I be running the command?
A: Regarding reindexing or running defrag, we will post a link to the white paper which should answer all you concerns/questions shortly.

For more information about reindex and indexdefrag please refer to the whitepaper at https://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx.

**Host: Mirek (Microsoft)
Q:**Is there any way to automatically be notified when indexes need to be rebuilt - any kind of alert that's thrown that we could trap and email an alert?
A: Please see the whitepaper at https://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx

**Host: Lubor (Microsoft)
Q:**Say Lubor_MS how does the algorithm is adapt to a skewness?
A: It is selecting steps in the histogram to minimize the effect of the skew. But you can always beat it... E.g. try the table with 1 mil rows, half of them single values and the other half unique. Look at the histogram then (dbcc show_statistics)

**Host: Mirek (Microsoft)
Q:**Is there any way to automatically be notified when indexes need to be rebuilt - any kind?
A: see the whitepaper https://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx

Host: Milind (Microsoft)
Q:
If you have a clustered and non-clustered index on the same column, which will the optimizer choose and why?
A: It depends on what columns the rest of your query uses. The clustered index will have all the columns i.e. it will always be covering. The non-clustered index will usually have fewer pages and will therefore require less I/O if it has all the columns you need to answer the query.

**Host: Lubor (Microsoft)
Q:**Is there any facility to find Unused indexes? Can profiler do this?
A: Yes, you can grab the profiler output (make sure the query plans are there) and then look for the particular index. No hit means no used in this workload.

**Host: Sri (Microsoft)
Q:**Our DB files are separated; we have the data and log on separated physical drives. We only have 1 file per drive. We know that are dbs are fragmented, but why would diskkeeper report that the 1 file is heavily fragmented? Is there any relation?
A: do you have auto grow set? Was the file created in one shot? It could be due to other files (non SQL) on that particular drive.

**Host: Lubor (Microsoft)
Q:**Sorry, Lubor_MS, I should have asked is there anything about sp3a that would account for an increase in logical reads with the same plans? My guess is no but...?
A: Well, if the plans are the *same* I have no answer. Otherwise I would not be so concerned about the logical I/Os. In some circumstances we count the page as one, in others each row in the page as one...

**Host: Lubor (Microsoft)
Q:**Is there a significant performance hit you take when "Auto Update Stats" is turned on? During ETL loads is it recommended to turn this off?
A: Not if you are doing "big" inserts. If you are doing it row by row it’s a bit more impact. But in general, don't worry.

**Host: Sri (Microsoft)
Q:**Yes, File was pre-created to fix size. ALL autos turned off...
A: If you preallocate there should not be any file fragmentation. Was the file fragmented to start of with?

**Host: Lubor (Microsoft)
Q:**Care to shed more light on this statement? "In some circumstances we count the page as one, in others each row in the page as one..."
A: If you scan a table, one read per page. If you perform seeks by an index, you may seek into the same page many times and each counts.

**Moderator: Stephen_D (Microsoft)
Q:**if you capture a session with profiler, can you "play it back" against a different server, with empty tables -- as long as the database has the same schema? Just don't want to run the wiz against production.
A: We don't seem to have any Profiler experts here, so you might ask this question in the newsgroup microsoft.public.sqlserver.tools.

**Host: Lubor (Microsoft)
Q:**RE Lubor on Skewness: It is selecting steps in the histogram to minimize the effect of the skew. But you can always beat it... Do you mean there is an effective means of helping SS along? Different sampling settings, etc?
A: No, just the opposite. You can come with such skew that lowers effectiveness of the histogram.

**Host: Sri (Microsoft)
Q:**No, these were new drives brought online. The strange thing is if you run a Diskeeper Defrag on the drive that is 200G it takes less then 30 seconds to complete then says that there is no fragmentation. After ETL it is back...
A: I am not sure how this could be. We can follow up on this in the newsgroup sqlserver.

Moderator: Stephen_D (Microsoft)
That would probably be microsoft.public.sqlserver.server…

Moderator: Stephen_D (Microsoft)
Thanks for joining us today! You've asked some great questions. We'll finish the final questions that we're working on, but unfortunately, it's time to go.

Host: Lubor (Microsoft)
Bye! Thanks for the great questions!

Host: Milind (Microsoft)
Goodbye!

Host: swynk (MVP)
Thanks everyone, I appreciate you stopping by - make sure you check www.sswug.org for transcripts in the next several days (we'll announce it in the newsletter too). See you online!

Host: Mirek (Microsoft)
Thanks guys; Bye!

Host: Cesar (Microsoft)
Goodbye, thanks.

Host: Gang (Microsoft)
Goodbye, thanks.

For further information on this topic or about SQL Server, please visit the following:

Newsgroups: SQL Server Newsgroups

SQL Transcripts: Read the archive of past SQL chats.

Website: Visit the Microsoft SQL Website