Pendahuluan Dengan diluncurkannya SQL Server 2016 Service Pack 1, teknologi In-Memory ColumnStore sekarang juga tersedia di Standard, Web dan bahkan Express dan LocalDB Editions. Selain manfaat hanya 1 basis kode yang harus dipertahankan, perubahan kebijakan ini juga akan menjadi ruang penyimpanan disk yang jelas karena rasio de-duplikasi dan rasio data yang tinggi dan, yang terakhir namun tidak kalah pentingnya, juga merupakan kinerja kueri ad-hoc yang serius. Booster Perbedaan utama antara rasa SQL adalah berapa banyak daya CPU dan memori dialokasikan untuk tugas seperti (re-) building dari Clustered ColumnStore Index. Sebagai contoh: dengan Edisi Standar satu inti (waktu prosesor 100 dari proses sqlservr) sedang digunakan dan query CCI terjadi dengan maksimal 2 CPU (MAXDOP2), versus memanfaatkan semua CPU yang ada di Edisi Enterprise. Membangun Clustered ColumnStore Index (CCI) dengan SQL Server 2016 Standard Edition: Membangun CCI dengan semua 4 core yang tersedia dengan SQL Server 2016 Enterprise Edition: Waktu dasar untuk memuat 7.2 GB 60 Juta baris dari satu file TPCH lineItem tidak menunjukkan banyak hal. Perbedaan antara rasa saat Massal memasukkan data langsung ke meja tumpukan atau meja dengan CCI perbedaannya menjadi jelas saat kita membandingkan waktu yang dibutuhkan untuk membangun CCI di atas meja tumpukan atau membangun kembali CCI: Meringkas, mutlak Cara tercepat untuk memiliki data yang tersedia dalam sebuah tabel dengan Clustered ColumnStore Index adalah: load ke heap membangun CCI setelahnya dengan SQL 2016 Ent. Ed. Beban langsung ke CCI Untuk tabel dengan Clustered ColumnStore Index sudah dibuat, pastikan Anda melakukan streaming langsung ke Compressed Row Groups untuk memaksimalkan throughput. Untuk melakukannya, ukuran batch Insert harus sama atau lebih besar dari 100K Rows (102400 tepatnya). Batch yang lebih kecil akan ditulis ke dalam tabel delta store yang dikompres terlebih dahulu sebelum menjadi tuple pindah ke segmen Row Group yang dikompresi akhir, yang berarti SQL Server harus menyentuh data dua kali: Ada berbagai pilihan untuk memuat data dan kita akan membahas yang paling sering digunakan. Yang, seperti perintah Massal Insert, BCP dan SSIS. Mari kita lihat apa yang dibutuhkan untuk mendapatkan performa terbaik dan cara memonitor 1) T-SQL Massal Insert Mari kita mulai dengan perintah BULK INSERT: Memeriksa Kemajuan Muatan Data Untuk memeriksa Jumlah Baris yang sudah masuk ke CCI, bahkan ketika Pilihan Table Lock sedang digunakan, query dmv baru yang disebut sys. dmdbcolumnstorerowgroupphysicalstats: DMV ini juga akan mengungkapkan kemungkinan negara Resource Group secara lebih rinci saat memuat. Ada empat kemungkinan Row Group menyatakan saat memuat data. Bila Anda melihat negara INVISBILE seperti pada gambar di bawah berarti data dikompres menjadi RowGroup. 0: INVISIBLE (RowGroup sedang dalam proses dibangun dari data di delta store) 1: OPEN160160160160160160160 (RowGroup menerima catatan baru) 2: CLOSED160160160 (RowGroup terisi namun belum dikompres oleh proses penggerak tuple) 3: COMPRESSED160 ( RowGroup diisi dan dikompres). 4 TOMBSTONE160 (RowGroup siap menjadi sampah yang dikumpulkan dan dihapus) Dengan menentukan Ukuran Batch dengan nilai 102400 atau lebih tinggi, Anda akan mencapai kinerja maksimal dan data akan dialirkan dan dikompres secara langsung ke dalam RG terakhirnya. Perilaku ini akan muncul sebagai COMPRESSED. Anda juga dapat memeriksa DMV yang diperkenalkan dengan SQL2014 untuk memeriksa RowGroup State, yaitu sys. columnstorerowgroups DMV: Hasil Uji Massal memasukkan data ke dalam tabel dengan CCI melalui perintah Massal Insert sedikit dapat diperbaiki dengan menambahkan Batchsize102400 dan Pilihan TABLOCK Ini membawa 8 peningkatan throughput. 2) BCP. exe Utilitas BCP masih digunakan cukup banyak di lingkungan produksi banyak yang layak untuk memeriksanya dengan cepat: secara default, BCP memberi sents 1000 baris pada saat itu ke SQL Server. Waktu yang dibutuhkan untuk memuat 7.2GB data via BCP: 530 detik. Or160 113K rowssec Negara RowGroup menunjukkan NVISIBLE yang berarti bahwa dengan pengaturan default, Delta Store sedang digunakan. Untuk memastikan perintah BCP mengalirkan data secara langsung ke dalam RG yang dikompres, Anda harus menambahkan opsi batchsize b dengan nilai minimal 102400. Saya menjalankan berbagai tes dengan ukuran batch yang lebih besar: hingga 1048576, namun 102400 memberi yang terbaik untuk saya. hasil. BCP DB. dbo. LINEITEMCCI di F: TPCHlineitem. tbl S. - c - T - tquotquot - b 102400 h tablock Status RowGroup sekarang menunjukkan COMPRESSED yang berarti kita melewati Delta Store dan data stream ke dalam RG terkompresi: Hasil: BCP Selesai dalam 457 detik, atau 133K baris per detik atau Selama pengujian saya melihat bahwa pengaturan SSIS 2016 default menggunakan ukuran buffer memori yang juga berpotensi membatasi ukuran bets menjadi kurang dari 100K Rows. Pada contoh di bawah ini, Anda melihat data tersebut mendarat di toko delta: negara bagian RG ditutup dan ladang deltastorehobtid dihuni, yang berarti toko delta adalah leveraged. Inilah saatnya untuk mengulurkan tangan dan memeriksakan diri kepada rekan-rekan saya yang untungnya telah memperhatikan hal ini dan solusinya sudah ada di sana (lihat: Data Flow Buffer Auto Sizing kemampuan manfaat data loading ke CCI). Untuk sepenuhnya memanfaatkan kemampuan streaming CCI Anda harus meningkatkan pengaturan BufferSize Amp MaxRows default: Ubah ini menjadi nilai 10x lebih besar: 8211 DefaultMaxBufferRows dari 10000 menjadi 1024000 dan yang paling penting: 8211 DefaultBufferSize dari 10485760 menjadi 104857600. Catatan: pengaturan AutoAdjustBufferSize yang baru harus disetel ke True saat Anda memuat baris data yang sangat luas. Ubah juga nilai untuk adaptor Tujuan: 8211 Baris per Batch: 160 dari none ke 102400 8211 Ukuran komit Insert maksimum: dari 2147483647 menjadi 102400 Paritas fitur diperkenalkan dengan SQL Server 2016 SP1 membuka rentang kemungkinan baru untuk mendapatkan keuntungan dari semoga Panduan di atas membantu Anda memaksimalkan kinerja Massal Sisipkan, BCP, dan SSIS saat memuat data ke dalam Indeks Kolom Kolom Clustered Apa yang akan menjadi cara tercepat mutlak untuk memuat data dari flatfile ke dalam tabel di dalam SQL Server 2016 Banyak yang telah berubah sejak awal saya Posting tentang topik ini bertahun-tahun yang lalu, seperti pengenalan tabel di-memori yang dioptimalkan dan indeks tabel Columnstore yang Dapat Diperoleh. Juga daftar kendaraan pengangkut data yang bisa dipilih berkembang: selain BCP, perintah T-SQL Bulk Insert, SSIS sebagai alat ETL dan PowerShell ada beberapa yang baru ditambahkan, seperti PolyBase, External R Script atau ADF. Dalam posting ini saya akan mulai dengan memeriksa berapa banyak yang lebih cepat dari ampli tahan lama yang baru yang tidak tahan lama. Tabel memori Dalam Menetapkan Baseline Untuk tes ini saya menggunakan Standar Azure DS4V2 Standard VM dengan 8 core28 GB RAM dan 2 HDD Volumes with host caching RW diaktifkan (Kedua Luns menyediakan 275 MBsec RW throughput meskipun GUI menyatakan batas 60MBsec). Saya menghasilkan sebuah file flat lineidem 60 Juta baris7.2 Gigabyte TPCH lineitem sebagai data untuk dimuat. Sebagai dasar untuk digunakan untuk perbandingan, kami akan menggunakan waktu yang dibutuhkan untuk memuat file ke dalam tabel Heap: Perintah Massal Insert reguler ini selesai dalam waktu 7 menit dengan rata-rata 143K rowssec. Mengaktifkan database uji untuk tabel yang dioptimalkan oleh Memori (pada SQL20142016 Enterprise amp developer Edition) memperkenalkan tabel memori yang dirancang untuk OLTP yang sangat cepat dengan banyak transaksi kecil dan konkurensi tinggi, yang merupakan jenis beban kerja yang sama sekali berbeda dengan penyisipan massal namun, hanya saja Keluar dari keingintahuan mencobanya. Ada 2 jenis tabel memori: tabel tahan lama dan tidak tahan lama. Yang tahan lama akan bertahan data pada disk, yang tidak tahan lama wont. Untuk mengaktifkan opsi ini kita harus melakukan beberapa pekerjaan rumah tangga dan menetapkan volume disk yang cepat untuk hosting file-file ini. Pertama, ubah database untuk mengaktifkan opsi Contains MEMORYOPTIMIZEDDATA diikuti dengan menambahkan lokasi File dan Filegroup yang akan berisi tabel Memory-Optimized: Hal ketiga yang harus dilakukan adalah menambahkan kolam memori terpisah ke instance SQL Server sehingga dapat menyimpan semua Data yang akan kita muat ke dalam tabel memori dipisahkan dari kolam memori defaultnya: Mengikat basis data ke kolam memori Langkah-langkah untuk menentukan kolam memori terpisah dan mengikat basis data ke daftar berikut: Kolam memori ekstra dikelola melalui Gubernur Sumber Daya SQL. Langkah keempat dan terakhir adalah mengikat database uji ke kolam memori baru dengan perintah sys. spxtpbinddbresourcepool.160 Agar pengikatan menjadi efektif, kita harus mengambil database secara offline dan mengembalikannya secara online. Setelah terikat, kita dapat secara dinamis mengubah jumlah memori yang diberikan ke kolamnya melalui perintah ALTER RESOURCE POOL PoolHk WITH (MAXMEMORYPERCENT 80). Sisipkan Massal ke dalam tabel Memori Dalam Tahan Sekarang kita semua siap dengan opsi In-memory yang diaktifkan, kita dapat membuat tabel memori. Setiap tabel yang dioptimalkan memori setidaknya memiliki satu indeks (baik indeks Range-atau Hash) yang benar-benar (re-) tersusun dalam memori dan tidak pernah disimpan di disk. Sebuah tabel tahan lama harus memiliki primary key yang dideklarasikan, yang kemudian dapat didukung oleh indeks yang dibutuhkan. Untuk mendukung kunci primer, saya menambahkan kolom ROWID1 rownumber ekstra ke tabel: Menentukan ukuran bets 1 (hingga 5) Juta baris ke perintah insert massal membantu menahan data ke disk sementara sisipan massal sedang berlangsung (alih-alih menyimpan Itu semua pada akhirnya) sehingga meminimalkan tekanan memori pada kolam memori PookHK yang kita buat. Beban data ke dalam tabel In-Memory yang tahan lama akan selesai dalam 5 menit 28 detik, atau 183K Rowssec. Itu adalah waktu yang baik tapi tidak terlalu cepat dari baseline kami. Melihat sys. dmoswaitstats menunjukkan bahwa waitstat no.1 adalah IMPPROVIOWAIT yang terjadi saat SQL Server menunggu beban massal IO selesai. Melihat kinerja counter Bulk Copy Rowssec dan Disk Write Bytessec menunjukkan pembilasan pada lonjakan disk sebesar 275 MBsec begitu sebuah batch masuk (lonjakan hijau). Itu adalah maksimum dari apa yang disk dapat memberikan tapi doesnt menjelaskan semuanya. Dengan keuntungan kecil, kita akan memarkir yang satu ini untuk penyelidikan masa depan. Pemantauan Kolam Memori Melalui sys. dmresourcegovernorresourcepools dmv dapatkah kita memeriksa apakah tabel memori kita memanfaatkan poolHK memory Pool yang baru dibuat: Output menunjukkan bahwa ini adalah kasus 7.2GB (beberapa tambahan untuk Rowid) terkompresi ke memori PoolHk pool: Jika Anda mencoba memuat lebih banyak data daripada memori yang tersedia di kolam, Anda akan mendapatkan pesan yang tepat seperti ini: Pernyataan telah dihentikan. Msg 701, Level 17, State 103, Line 5 Ada memori sistem yang tidak mencukupi di pool sumber 8216PookHK untuk menjalankan query ini. Untuk melihat satu tingkat lebih dalam pada alokasi ruang memori pada basis tabel In-memory Anda dapat menjalankan query berikut (diambil dari SQL Server In-Memory OLTP Internal untuk dokumen SQL Server 2016): Data yang baru saja kita muat disimpan sebagai Struktur varheap dengan indeks hash: Sejauh ini bagus Sekarang mari kita lanjutkan dan lihat bagaimana pentahapan di meja yang tidak tahan lama melakukan Sisipkan Massal ke dalam tabel In-Memory Non-Tahan Lama Untuk tabel IMND kita tidak memerlukan kunci Primer jadi kita hanya Tambahkan dan indeks Hash yang tidak berkerumun dan setel DURABILITY SCHEMAONLY. Bulk memasukkan Data loading ke dalam tabel non-tahan lama selesai dalam waktu 3 menit dengan throughput 335K rowssec (vs 7 menit) Ini adalah 2.3x lebih cepat kemudian dimasukkan ke dalam tumpukan meja. Untuk pementasan data ini pasti cepat menang SSIS Single Massal Insert ke dalam tabel Non-Durable Secara tradisional SSIS adalah cara tercepat untuk memuat file dengan cepat ke SQL Server karena SSIS akan menangani semua data pre-processing sehingga mesin SQL Server bisa Menghabiskan CPU kutu pada terus data ke disk. Akankah ini masih terjadi saat memasukkan data ke dalam tabel yang tidak tahan lama Berikut adalah ringkasan tes yang saya hadapi dengan SSIS untuk posting ini: opsi SSIS Fastparse dan160 pengaturan DefaultBufferMaxRows dan DefaultBufferSize adalah penguat kinerja utama. Juga penyedia OLE DB (SQLOLEDB.1) asli melakukan sedikit lebih baik daripada Klien Asli SQL (SQLNCLI11.1). Ketika Anda menjalankan SSIS dan SQL Server berdampingan, meningkatkan ukuran paket jaringan tidak diperlukan.160160 Hasil bersih: paket SSIS dasar yang membaca sumber file datar dan menulis data secara langsung ke tabel Non-Tahan Lama melalui tujuan OLE DB Melakukan hal yang sama seperti perintah Massal Insert ke dalam tabel IMND: 60 Juta baris dimuat dalam 2menit 59 detik atau 335K rowssec, identik dengan perintah Insert Massal. SSIS dengan Distributor Data Seimbang Tapi wait8230160 tabel in-memory dirancang untuk bekerja mengunci pengunci latch secara gratis sehingga ini berarti kita dapat memuat data juga melalui beberapa aliran. Yang mudah dicapai dengan SSIS, Distributor Data Seimbang akan membawa hanya itu (BDD Tercantum di bagian umum dari SSIS Toolbox) Menambahkan komponen BDD dan memasukkan data ke dalam tabel Non-tahan lama yang sama dengan 3 stream memberikan throughput terbaik: kita sekarang sampai dengan 526000 Rowssec Melihat garis datar ini, hanya dengan 160 dari waktu CPU yang digunakan oleh SQLServer, nampaknya kita mengalami beberapa hambatan: Saya dengan cepat mencoba untuk menjadi kreatif dengan memanfaatkan fungsi modulo dan menambahkan 2 arus data lagi dalam paket (masing-masing memproses 13 data) 160 tapi itu tidak memperbaiki Banyak (1 min52sec) jadi topik yang bagus untuk diselidiki untuk opsi post160160 masa depan Post-Memory Non-Durable membawa beberapa peningkatan kinerja yang serius untuk pemangkasan data yang memuat data 1.5x lebih cepat dengan pemain reguler Bulk T dan sampai 3.6x kali lebih cepat dengan SSIS. Pilihan ini, yang terutama dirancang untuk mempercepat OLTP, juga bisa membuat perbedaan besar untuk mengecilkan jendela batch Anda dengan cepat (Terus berlanjut) Kebanyakan orang mengenal ungkapan ini, quotthis ini akan membunuh dua burung dengan satu batu batu. Jika tidak, fase mengacu pada pendekatan yang membahas dua tujuan dalam satu tindakan. (Sayangnya, ungkapan itu sendiri agak tidak menyenangkan, karena kebanyakan dari kita tidak ingin melempar batu pada hewan yang tidak berdosa) Hari ini saya akan membahas beberapa hal mendasar mengenai dua fitur hebat di SQL Server: indeks Columnstore (hanya tersedia di SQL Server Enterprise) dan SQL Query Store. Microsoft benar-benar menerapkan indeks Columnstore di SQL 2012 Enterprise, meskipun telah menyempurnakannya dalam dua rilis terakhir dari SQL Server. Microsoft memperkenalkan Query Store di SQL Server 2016. Jadi, apa saja fitur ini dan mengapa mereka penting Nah, saya punya demo yang akan mengenalkan kedua fitur tersebut dan menunjukkan bagaimana mereka dapat membantu kita. Sebelum saya melangkah lebih jauh, saya juga membahas fitur ini (dan fitur SQL 2016 lainnya) di artikel Majalah KODE saya tentang fitur baru SQL 2016. Sebagai pengantar dasar, indeks Columnstore dapat membantu mempercepat kueri yang memindai berdasarkan data dalam jumlah besar, dan Query Query melacak eksekusi query, rencana eksekusi, dan statistik runtime yang biasanya perlu Anda kumpulkan secara manual. Percayalah ketika saya mengatakannya, ini adalah fitur hebat. Untuk demo ini, saya akan menggunakan database demo Data Warehouse Microsoft Contoso. Ngomong ngomong, Contoso DW seperti kuota AdventureWorksquot yang sangat besar, dengan tabel berisi jutaan baris. (Tabel AdventureWorks terbesar berisi sekitar 100.000 baris paling banyak). Anda bisa mendownload database Contoso DW disini: microsoften-usdownloaddetails. aspxid18279. Contoso DW bekerja sangat baik saat Anda ingin menguji kinerja pada pertanyaan melawan tabel yang lebih besar. Contoso DW berisi tabel data warehouse standar yang disebut FactOnLineSales, dengan 12,6 juta baris. Itu tentu bukan tabel gudang data terbesar di dunia, tapi juga permainan anak-anak. Misalkan saya ingin meringkas jumlah penjualan produk untuk tahun 2009, dan memberi peringkat produk. Saya mungkin menanyakan tabel fakta dan bergabung ke tabel Dimensi Produk dan menggunakan fungsi RANK, seperti: Berikut adalah hasil parsial dari 10 baris teratas, oleh Total Sales. Di laptop saya (i7, 16 GB RAM), permintaan membutuhkan waktu 3-4 detik untuk dijalankan. Itu mungkin tidak tampak seperti akhir dunia, namun beberapa pengguna mungkin mengharapkan hasil hampir instan (seperti yang mungkin Anda lihat seketika saat menggunakan Excel dari kubus OLAP). Satu-satunya indeks yang saya miliki saat ini di tabel ini adalah indeks berkerumun pada kunci penjualan. Jika saya melihat rencana eksekusi, SQL Server membuat sebuah saran untuk menambahkan indeks penutup ke tabel: Sekarang, hanya karena SQL Server menyarankan sebuah indeks, tidak berarti Anda harus secara membabi buta membuat indeks pada setiap pesan kuota indeks kuota. Namun, dalam hal ini, SQL Server mendeteksi bahwa kita memfilter berdasarkan tahun, dan menggunakan Product Key dan Sales Amount. Jadi, SQL Server menyarankan indeks penutup, dengan DateKey sebagai bidang indeks kunci. Alasan kami menyebutnya indeks quotcoveringquot adalah karena SQL Server akan melakukan kuota sepanjang fieldquot non-key yang kami gunakan dalam query, quotfor the ridequot. Dengan cara itu, SQL Server tidak perlu menggunakan tabel atau indeks berkerumun di semua mesin database hanya dengan menggunakan indeks pengaitan untuk kueri. Meliputi indeks sangat populer di data pergudangan dan pelaporan database skenario tertentu, meskipun harganya terjangkau oleh mesin database. Catatan: Meliputi indeks telah ada sejak lama, jadi saya belum membuka indeks Columnstore dan Query Store. Jadi, saya akan menambahkan indeks penutupnya: Jika saya menjalankan kueri yang sama dengan saya, saya berlari beberapa saat yang lalu (yang mengumpulkan jumlah penjualan untuk setiap produk), kueri kadang tampaknya berjalan sekitar satu detik lebih cepat, dan saya mendapatkan Rencana eksekusi yang berbeda, yang menggunakan Indeks Seek dan bukan Index Scan (dengan menggunakan tombol tanggal pada indeks penutup untuk mengambil penjualan untuk tahun 2009). Jadi, sebelum Indeks Columnstore, ini bisa menjadi salah satu cara untuk mengoptimalkan kueri ini di banyak versi SQL Server yang lebih lama. Ini berjalan sedikit lebih cepat daripada yang pertama, dan saya mendapatkan rencana eksekusi dengan Index Seek daripada Index Scan. Namun, ada beberapa masalah: Dua operator eksekusi quotIndex Seekquot dan quotHash Match (Aggregate) mengutip keduanya pada dasarnya mengoperasikan quotrow oleh rowquot. Bayangkan ini dalam sebuah tabel dengan ratusan juta baris. Terkait, pikirkan isi tabel fakta: dalam kasus ini, satu nilai kunci tanggal dan atau satu nilai kunci produk dapat diulang di ratusan ribu baris (ingat, tabel fakta juga memiliki kunci untuk geografi, promosi, salesman , Dll.) Jadi, ketika quotIndex Seekquot dan quotHash Matchquot bekerja baris demi baris, mereka melakukannya atas nilai yang mungkin diulang di banyak baris lainnya. Ini biasanya terjadi di segelintir indeks SQL Server Columnstore, yang menawarkan skenario untuk meningkatkan kinerja kueri ini dengan cara yang menakjubkan. Tapi sebelum saya melakukannya, mari kita kembali ke masa lalu. Mari kembali ke tahun 2010, saat Microsoft memperkenalkan add-in untuk Excel yang dikenal sebagai PowerPivot. Banyak orang mungkin ingat melihat demo PowerPivot for Excel, di mana pengguna bisa membaca jutaan baris dari sumber data luar ke Excel. PowerPivot akan memampatkan data, dan menyediakan mesin untuk membuat Tabel Pivot dan Diagram Pivot yang tampil dengan kecepatan luar biasa terhadap data yang dikompres. PowerPivot menggunakan teknologi in-memory yang disebut Microsoft quotVertiPaqquot. Teknologi in-memory di PowerPivot pada dasarnya akan mengambil nilai kunci bisnis duplikat kunci utama dan memampatkannya ke satu vektor tunggal. Teknologi in-memory juga akan memilah-milah nilai-nilai ini secara paralel, dalam blok beberapa ratus sekaligus. Intinya adalah Microsoft memanggang sejumlah besar penyempurnaan kinerja ke dalam fitur memori VertiPaq yang bisa kita gunakan, langsung dari kotak pepatah. Mengapa saya mengambil jalan kecil ini menyusuri jalur memori Karena di SQL Server 2012, Microsoft menerapkan salah satu fitur terpenting dalam sejarah mesin database mereka: indeks Columnstore. Indeks benar-benar sebuah indeks hanya dalam nama: ini adalah cara untuk mengambil tabel SQL Server dan membuat kolom kolom terkompresi dalam memori yang memampatkan nilai kunci asing duplikat ke nilai vektor tunggal. Microsoft juga menciptakan kolam penyangga baru untuk membaca nilai vektor terkompresi ini secara paralel, menciptakan potensi peningkatan kinerja yang sangat besar. Jadi, saya akan membuat indeks kolom di atas meja, dan saya akan melihat seberapa jauh lebih baik (dan lebih efisien) kueri berjalan, versus kueri yang berjalan melawan indeks penutup. Jadi, saya akan membuat salinan duplikat FactOnlineSales (saya akan menamakannya FactOnlineSalesDetailNCCS), dan saya akan membuat indeks kolom di tabel duplikat sehingga saya tidak akan mengganggu tabel asli dan indeks penutupan dengan cara apa pun. Selanjutnya, saya akan membuat indeks kolom di tabel baru: Perhatikan beberapa hal: Saya telah menetapkan beberapa kolom kunci asing, serta Angka Penjualan. Ingatlah bahwa indeks kolom tidak seperti indeks toko-toko tradisional. Tidak ada quotkeyquot. Kami hanya menunjukkan kolom mana yang harus dikompres SQL Server dan ditempatkan di gudang data dalam memori. Untuk menggunakan analogi PowerPivot untuk Excel saat kita membuat indeks kolom, kita akan memberitahu SQL Server untuk melakukan hal yang sama seperti yang dilakukan PowerPivot saat kita mengimpor 20 juta baris ke Excel menggunakan PowerPivot Jadi, saya akan menjalankan kembali kueri, kali ini menggunakan Tabel factOnlineSalesDetailNCCS yang digandakan yang berisi indeks kolomstore. Permintaan ini berjalan seketika dalam waktu kurang dari satu detik. Dan saya juga bisa mengatakan bahwa meskipun tabel itu memiliki ratusan juta baris, buku itu tetap akan terbentang dari kutipan bulu mata yang pepatah. Kita bisa melihat rencana eksekusi (dan dalam beberapa saat, kita akan melakukannya), tapi sekarang saatnya meliput fitur Query Store. Bayangkan sejenak, bahwa kami menjalankan kedua pertanyaan semalam: kueri yang menggunakan tabel FactOnlineSales biasa (dengan indeks penutup) dan kemudian kueri yang menggunakan tabel duplikat dengan indeks Columnstore. Saat kita masuk keesokan paginya, kami ingin melihat rencana eksekusi untuk kedua pertanyaan saat mereka berlangsung, begitu pula dengan statistik eksekusi. Dengan kata lain, kami ingin melihat statistik yang sama bahwa kami dapat melihat apakah kami menjalankan kedua kueri secara interaktif di SQL Management Studio, menyerahkan TIME dan IO Statistics, dan melihat rencana eksekusi tepat setelah menjalankan kueri. Nah, itulah yang dilakukan oleh Toko Kueri agar kita dapat mengaktifkan (enable) Query Store untuk database, yang akan memicu SQL Server untuk menyimpan eksekusi query dan merencanakan statistik sehingga kita dapat melihatnya nanti. Jadi, saya akan mengaktifkan Query Store di database Contoso dengan perintah berikut (dan saya juga akan menghapus semua caching): Kemudian saya akan menjalankan dua query (dan quotpretendquot yang saya jalankan beberapa jam yang lalu): Sekarang mari kita berpura-pura berlari berjam-jam. Lalu. Menurut apa yang saya katakan, Query Store akan menangkap statistik eksekusi. Jadi bagaimana cara melihatnya Untungnya, itu cukup mudah. Jika saya memperluas basis data Contoso DW, saya akan melihat folder Query Store. Toko Kueri memiliki fungsionalitas yang luar biasa dan saya akan mencoba meliputnya dalam entri blog berikutnya. Tapi untuk saat ini, saya ingin melihat statistik eksekusi pada dua query, dan secara khusus memeriksa operator eksekusi untuk indeks kolomstore. Jadi, saya akan mengklik kanan pada Kuasa Mengonsumsi Sumber Daya Teratas dan menjalankan opsi itu. Itu memberi saya bagan seperti di bawah ini, di mana saya bisa melihat durasi eksekusi (dalam milidetik) untuk semua pertanyaan yang telah dieksekusi. Dalam contoh ini, Query 1 adalah query terhadap tabel asli dengan indeks penutup, dan Query 2 melawan tabel dengan indeks kolomstore. Angka tersebut tidak terletak pada indeks kolom kerja mengungguli indeks tablecovering asli dengan faktor hampir 7 banding 1. Saya dapat mengubah metrik untuk melihat konsumsi memori. Dalam kasus ini, perhatikan bahwa query 2 (query indeks kolomstore) menggunakan lebih banyak memori. Ini menunjukkan dengan jelas mengapa indeks columnstore mewakili teknologi kuotasi-memoriquot SQL Server memuat seluruh indeks kolom di memori, dan menggunakan kolam penyangga yang sama sekali berbeda dengan operator eksekusi yang ditingkatkan untuk memproses indeks. OK, jadi kita punya beberapa grafik untuk melihat statistik eksekusi kita bisa melihat rencana eksekusi (dan eksekusi operator) yang terkait dengan setiap eksekusi Ya, kita bisa Jika Anda klik pada baris vertikal untuk query yang menggunakan indeks kolomstore, Anda akan melihat eksekusi Rencanakan di bawah ini Hal pertama yang kita lihat adalah bahwa SQL Server melakukan pemindaian indeks kolom, dan itu mewakili hampir 100 dari biaya kueri. Anda mungkin berkata, quotWait sebentar, kueri pertama menggunakan indeks penutup dan melakukan pencarian indeks jadi bagaimana pemindaian indeks kolom bisa lebih cepat. Pertanyaan yang sah, dan untungnya ada sebuah jawaban. Bahkan ketika query pertama melakukan pencarian indeks, ia masih mengeksekusi quotrow oleh rowquot. Jika saya meletakkan mouse di atas operator pemindai indeks kolom, saya melihat tooltip (seperti yang ada di bawah), dengan satu pengaturan penting: Mode Eksekusi adalah BATCH (berlawanan dengan ROW), yaitu apa yang kami lakukan dengan kueri pertama yang menggunakan Meliputi indeks). Mode BATCH mengatakan bahwa SQL Server sedang memproses vektor terkompresi (untuk nilai kunci asing yang diduplikasi, seperti kunci produk dan kunci tanggal) dalam jumlah hampir 1.000, secara paralel. Jadi SQL Server masih bisa mengolah indeks columnstore jauh lebih efisien. Selain itu, jika saya menempatkan mouse di atas tugas Hash Match (Aggregate), saya juga melihat bahwa SQL Server mengumpulkan indeks kolom menggunakan mode Batch (walaupun operator itu sendiri mewakili persentase kecil dari biaya kueri) Akhirnya, Anda Mungkin bertanya, quotOK, jadi SQL Server memampatkan nilai dalam data, memperlakukan nilai sebagai vektor, dan membacanya di blok hampir seribu nilai secara paralel namun kueri saya hanya menginginkan data untuk tahun 2009. Begitu juga pemindaian SQL Server atas Seluruh rangkaian dataquot Sekali lagi, sebuah pertanyaan bagus. Jawabannya adalah, quotNot reallyquot. Untungnya bagi kami, pool buffer index kolom baru melakukan fungsi lain yang disebut quotmentment eliminationquot. Pada dasarnya, SQL Server akan memeriksa nilai vektor untuk kolom kunci tanggal di indeks kolomstore, dan menghilangkan segmen yang berada di luar cakupan tahun 2009. Saya akan berhenti di sini. Dalam posting blog berikutnya, saya akan membahas indeks kolom dan Query Store secara lebih rinci. Intinya, apa yang telah kita lihat di sini hari ini adalah bahwa indeks Columnstore secara signifikan dapat mempercepat kueri yang memindai berdasarkan data dalam jumlah besar, dan Toko Kueri akan menangkap eksekusi kueri dan memungkinkan kita memeriksa statistik eksekusi dan kinerja di lain waktu. Pada akhirnya, kami ingin menghasilkan kumpulan hasil yang menunjukkan hal berikut. Perhatikan tiga hal: Kolom dasarnya pivot semua Alasan Kembali yang mungkin, setelah menunjukkan jumlah penjualan Hasil set berisi subtotal oleh tanggal akhir minggu (minggu) di semua klien (di mana Klien adalah NULL) Kumpulan hasil berisi jumlah keseluruhan Baris (dimana Client dan Date keduanya NULL) Pertama, sebelum masuk ke akhir SQL kita bisa menggunakan kemampuan pivotmatrix dinamis di SSRS. Kita hanya perlu menggabungkan dua set hasil dengan satu kolom dan kemudian kita dapat memberi umpan hasilnya pada kontrol matriks SSRS, yang akan menyebarkan alasan pengembalian di sumbu kolom laporan. Namun, tidak semua orang menggunakan SSR (walaupun kebanyakan orang seharusnya). Tapi bahkan saat itu, terkadang pengembang perlu mengonsumsi set hasil dalam sesuatu selain alat pelaporan. Jadi, untuk contoh ini, mari kita asumsikan kami ingin menghasilkan hasil yang ditetapkan untuk halaman grid web dan mungkin pengembang ingin mengeluarkan kuota baris subtotal (di mana saya memiliki nilai ResultSetNum 2 dan 3) dan menempatkannya di kolom ringkasan. Jadi intinya, kita perlu menghasilkan output di atas secara langsung dari prosedur yang tersimpan. Dan sebagai twist tambahan minggu depan mungkin ada Return Reason X dan Y dan Z. Jadi kita tidak tahu berapa banyak alasan pengembalian yang ada. Kami ingin query sederhana untuk berpaling pada kemungkinan nilai yang berbeda untuk Return Reason. Inilah dimana T-SQL PIVOT memiliki batasan yang kita butuhkan untuk memberikan nilai yang mungkin. Karena kita tidak tahu bahwa sampai run-time, kita perlu menghasilkan string query secara dinamis dengan menggunakan pola SQL dinamis. Pola SQL dinamis melibatkan pembuatan sintaks, sepotong demi sepotong, menyimpannya dalam sebuah string, dan kemudian mengeksekusi string di akhir. Dynamic SQL bisa jadi rumit, karena kita harus menanamkan sintaks di dalam sebuah string. Tapi dalam kasus ini, ini satu-satunya pilihan sejati jika kita ingin menangani sejumlah alasan pengembalian. Saya selalu menemukan bahwa cara terbaik untuk menciptakan solusi SQL yang dinamis adalah dengan mencari tahu apa query yang dihasilkan oleh kuotaalquot pada akhirnya (dalam kasus ini, mengingat alasan Kembali yang kita ketahui). Dan kemudian membalik-ulangnya dengan memilah-milahnya Itu bersama satu bagian pada satu waktu. Jadi, inilah SQL yang kita butuhkan jika kita mengetahui Alasan Kembali tersebut (A sampai D) bersifat statis dan tidak akan berubah. Pertanyaannya adalah sebagai berikut: Menggabungkan data dari SalesData dengan data dari ReturnData, di mana kita quothard-wirequot kata Sales sebagai Tipe Aksi membentuk Tabel Penjualan, dan kemudian menggunakan Return Reason dari Return Data ke kolom ActionType yang sama. Itu akan memberi kita kolom ActionType yang bersih untuk diputar. Kami menggabungkan dua pernyataan SELECT ke dalam common table expression (CTE), yang pada dasarnya adalah subquery tabel turunan yang kemudian kami gunakan dalam pernyataan berikutnya (untuk PIVOT) Pernyataan PIVOT melawan CTE, yang menetapkan jumlah dolar untuk Tipe Aksi Berada di salah satu nilai Action Type yang mungkin. Perhatikan bahwa ini adalah hasil akhir yang ditetapkan. Kami menempatkan ini ke CTE yang berbunyi dari CTE pertama. Alasan untuk ini adalah karena kita ingin melakukan beberapa pengelompokan di akhir. Pernyataan SELECT terakhir, yang terbaca dari PIVOTCTE, dan menggabungkannya dengan query berikutnya melawan PIVOTCTE yang sama, namun di mana kita juga menerapkan dua pengelompokan dalam fitur PENGATURAN SETELAH DI SQL 2008: MENGELOMPOKAN pada Tanggal Akhir Minggu (dbo. WeekEndingDate) PENGELOMPOKAN untuk semua baris () Jadi, jika kita tahu dengan pasti bahwa kita tidak akan pernah memiliki kode alasan pengembalian yang lebih banyak, maka itu akan menjadi solusinya. Namun, kita perlu memperhitungkan kode alasan lainnya. Jadi, kita perlu menghasilkan keseluruhan kueri di atas sebagai satu string besar di mana kita membuat kemungkinan alasan pengembalian sebagai satu daftar dipisahkan koma. Aku akan menunjukkan seluruh kode T-SQL untuk menghasilkan (dan mengeksekusi) kueri yang diinginkan. Dan kemudian aku akan memecahnya menjadi beberapa bagian dan menjelaskan setiap langkahnya. Jadi pertama, inilah keseluruhan kode untuk menghasilkan secara dinamis apa yang telah saya hadapi di atas. Pada dasarnya ada lima langkah yang perlu kita liput. Langkah 1 . Kita tahu bahwa di suatu tempat dalam campuran, kita perlu menghasilkan sebuah string untuk ini dalam query: SalesAmount, Reason A, Reason B, Reason C, Reason D0160016001600160 Apa yang dapat kita lakukan adalah membangun sebuah ekspresi tabel umum sementara yang menggabungkan kutipan kabel keras. Kolom Amountquot dengan daftar kode kemungkinan yang unik. Begitu kita memilikinya di CTE, kita bisa menggunakan sedikit trik bagus untuk XML PATH (3939) untuk menghancurkan baris tersebut menjadi satu string, meletakkan koma di depan setiap baris yang dibaca query, dan kemudian menggunakan STUFF untuk mengganti Contoh koma pertama dengan ruang kosong. Ini adalah trik yang bisa Anda temukan di ratusan blog SQL. So this first part builds a string called ActionString that we can use further down. Langkah 2 . we also know that we39ll want to SUM the generatedpivoted reason columns, along with the standard sales column. So we39ll need a separate string for that, which I39ll call SUMSTRING. I39ll simply use the original ActionString, and then REPLACE the outer brackets with SUM syntax, plus the original brackets. Step 3: Now the real work begins. Using that original query as a model, we want to generate the original query (starting with the UNION of the two tables), but replacing any references to pivoted columns with the strings we dynamically generated above. Also, while not absolutely required, I39ve also created a variable to simply any carriage returnline feed combinations that we want to embed into the generated query (for readability). So we39ll construct the entire query into a variable called SQLPivotQuery. Langkah 4. We continue constructing the query again, concatenating the syntax we can quothard-wirequot with the ActionSelectString (that we generated dynamically to hold all the possible return reason values) Step 5 . Finally, we39ll generate the final part of the Pivot Query, that reads from the 2 nd common table expression (PIVOTCTE, from the model above) and generates the final SELECT to read from the PIVOTCTE and combine it with a 2 nd read against PIVOTCTE to implement the grouping sets. Finally, we can quotexecutequot the string using the SQL system stored proc spexecuteSQL So hopefully you can see that the process to following for this type of effort is Determine what the final query would be, based on your current set of data and values (i. e. built a query model) Write the necessary T-SQL code to generate that query model as a string. Arguably the most important part is determining the unique set of values on which you39ll PIVOT, and then collapsing them into one string using the STUFF function and the FOR XML PATH(3939) trick So whats on my mind today Well, at least 13 items Two summers ago, I wrote a draft BDR that focused (in part) on the role of education and the value of a good liberal arts background not just for the software industry but even for other industries as well. One of the themes of this particular BDR emphasized a pivotal and enlightened viewpoint from renowned software architect Allen Holub regarding liberal arts. Ill (faithfully) paraphrase his message: he highlighted the parallels between programming and studying history, by reminding everyone that history is reading and writing (and Ill add, identifying patterns), and software development is also reading and writing (and again, identifying patterns). And so I wrote an opinion piece that focused on this and other related topics. But until today, I never got around to either publishingposting it. Every so often Id think of revising it, and Id even sit down for a few minutes and make some adjustments to it. But then life in general would get in the way and Id never finish it. So what changed A few weeks ago, fellow CoDe Magazine columnist and industry leader Ted Neward wrote a piece in his regular column, Managed Coder , that caught my attention. The title of the article is On Liberal Arts. and I highly recommend that everyone read it. Ted discusses the value of a liberal arts background, the false dichotomy between a liberal arts background and success in software development, and the need to writecommunicate well. He talks about some of his own past encounters with HR personnel management regarding his educational background. He also emphasizes the need to accept and adapt to changes in our industry, as well as the hallmarks of a successful software professional (being reliable, planning ahead, and learning to get past initial conflict with other team members). So its a great read, as are Teds other CoDe articles and blog entries. It also got me back to thinking about my views on this (and other topics) as well, and finally motivated me to finish my own editorial. So, better late than never, here are my current Bakers Dozen of Reflections: I have a saying: Water freezes at 32 degrees . If youre in a trainingmentoring role, you might think youre doing everything in the world to help someone when in fact, theyre only feeling a temperature of 34 degrees and therefore things arent solidifying for them. Sometimes it takes just a little bit more effort or another ideachemical catalyst or a new perspective which means those with prior education can draw on different sources. Water freezes at 32 degrees . Some people can maintain high levels of concentration even with a room full of noisy people. Im not one of them occasionally I need some privacy to think through a critical issue. Some people describe this as you gotta learn to walk away from it. Stated another way, its a search for the rarefied air. This past week I spent hours in half-lit, quiet room with a whiteboard, until I fully understood a problem. It was only then that I could go talk with other developers about a solution. The message here isnt to preach how you should go about your business of solving problems but rather for everyone to know their strengths and what works, and use them to your advantage as much as possible. Some phrases are like fingernails on a chalkboard for me. Use it as a teaching moment is one. (Why is it like fingernails on a chalkboard Because if youre in a mentoring role, you should usually be in teaching moment mode anyway, however subtly). Heres another I cant really explain it in words, but I understand it. This might sound a bit cold, but if a person truly cant explain something in words, maybe they dont understand. Sure, a person can have a fuzzy sense of how something works I can bluff my way through describing how a digital camera works but the truth is that I dont really understand it all that well. There is a field of study known as epistemology (the study of knowledge). One of the fundamental bases of understanding whether its a camera or a design pattern - is the ability to establish context, to identify the chain of related events, the attributes of any components along the way, etc. Yes, understanding is sometimes very hard work, but diving into a topic and breaking it apart is worth the effort. Even those who eschew certification will acknowledge that the process of studying for certification tests will help to fill gaps in knowledge. A database manager is more likely to hire a database developer who can speak extemporaneously (and effortlessly) about transaction isolation levels and triggers, as opposed to someone who sort of knows about it but struggles to describe their usage. Theres another corollary here. Ted Neward recommends that developers take up public speaking, blogging, etc. I agree 100. The process of public speaking and blogging will practically force you to start thinking about topics and breaking down definitions that you might have otherwise taken for granted. A few years ago I thought I understood the T-SQL MERGE statement pretty well. But only after writing about it, speaking about, fielding questions from others who had perspectives that never occurred to me that my level of understanding increased exponentially. I know a story of a hiring manager who once interviewed an authordeveloper for a contract position. The hiring manager was contemptuous of publications in general, and barked at the applicant, So, if youre going to work here, would you rather be writing books or writing code Yes, Ill grant that in any industry there will be a few pure academics. But what the hiring manager missed was the opportunities for strengthening and sharpening skill sets. While cleaning out an old box of books, I came across a treasure from the 1980s: Programmers at Work. which contains interviews with a very young Bill Gates, Ray Ozzie, and other well-known names. Every interview and every insight is worth the price of the book. In my view, the most interesting interview was with Butler Lampson. who gave some powerful advice. To hell with computer literacy. Its absolutely ridiculous. Study mathematics. Learn to think. Read. Write. These things are of more enduring value. Learn how to prove theorems: A lot of evidence has accumulated over the centuries that suggests this skill is transferable to many other things. Butler speaks the truth . Ill add to that point learn how to play devils advocate against yourself. The more you can reality-check your own processes and work, the better off youll be. The great computer scientistauthor Allen Holub made the connection between software development and the liberal arts specifically, the subject of history. Here was his point: what is history Reading and writing. What is software development Among other things, reading and writing . I used to give my students T-SQL essay questions as practice tests. One student joked that I acted more like a law professor. Well, just like Coach Donny Haskins said in the movie Glory Road, my way is hard. I firmly believe in a strong intellectual foundation for any profession. Just like applications can benefit from frameworks, individuals and their thought processes can benefit from human frameworks as well. Thats the fundamental basis of scholarship. There is a story that back in the 1970s, IBM expanded their recruiting efforts in the major universities by focusing on the best and brightest of liberal arts graduates. Even then they recognized that the best readers and writers might someday become strong programmersystems analysts. (Feel free to use that story to any HR-type who insists that a candidate must have a computer science degree) And speaking of history: if for no other reason, its important to remember the history of product releases if Im doing work at a client site thats still using SQL Server 2008 or even (gasp) SQL Server 2005, I have to remember what features were implemented in the versions over time. Ever have a favorite doctor whom you liked because heshe explained things in plain English, gave you the straight truth, and earned your trust to operate on you Those are mad skills . and are the result of experience and HARD WORK that take years and even decades to cultivate. There are no guarantees of job success focus on the facts, take a few calculated risks when youre sure you can see your way to the finish line, let the chips fall where they may, and never lose sight of being just like that doctor who earned your trust. Even though some days I fall short, I try to treat my client and their data as a doctor would treat patients. Even though a doctor makes more money There are many clichs I detest but heres one I dont hate: There is no such thing as a bad question. As a former instructor, one thing that drew my ire was hearing someone criticize another person for asking a supposedly, stupid question. A question indicates a person acknowledges they have some gap in knowledge theyre looking to fill. Yes, some questions are better worded than others, and some questions require additional framing before they can be answered. But the journey from forming a question to an answer is likely to generate an active mental process in others. There are all GOOD things. Many good and fruitful discussions originate with a stupid question. I work across the board in SSIS, SSAS, SSRS, MDX, PPS, SharePoint, Power BI, DAX all the tools in the Microsoft BI stack. I still write some code from time to time. But guess what I still spend so much time doing writing T-SQL code to profile data as part of the discovery process. All application developers should have good T-SQL chops. Ted Neward writes (correctly) about the need to adapt to technology changes. Ill add to that the need to adapt to clientemployer changes. Companies change business rules. Companies acquire other companies (or become the target of an acquisition). Companies make mistakes in communicating business requirements and specifications. Yes, we can sometimes play a role in helping to manage those changes and sometimes were the fly, not the windshield. These sometimes cause great pain for everyone, especially the I. T. people. This is why the term fact of life exists we have to deal with it. Just like no developer writes bug-free code every time, no I. T. person deals well with change every single time. One of the biggest struggles Ive had in my 28 years in this industry is showing patience and restraint when changes are flying from many different directions. Here is where my prior suggestion about searching for the rarified air can help. If you can manage to assimilate changes into your thought process, and without feeling overwhelmed, odds are youll be a significant asset. In the last 15 months Ive had to deal with a huge amount of professional change. Its been very difficult at times, but Ive resolved that change will be the norm and Ive tried to tweak my own habits as best I can to cope with frequent (and uncertain) change. Its hard, very hard. But as coach Jimmy Duggan said in the movie A League of Their Own: Of course its hard. If it wasnt hard, everyone would do it. The hard, is what makes it great . A powerful message. Theres been talk in the industry over the last few years about conduct at professional conferences (and conduct in the industry as a whole). Many respected writers have written very good editorials on the topic. Heres my input, for what its worth. Its a message to those individuals who have chosen to behave badly: Dude, it shouldnt be that hard to behave like an adult. A few years ago, CoDe Magazine Chief Editor Rod Paddock made some great points in an editorial about Codes of Conduct at conferences. Its definitely unfortunate to have to remind people of what they should expect out of themselves. But the problems go deeper. A few years ago I sat on a five-person panel (3 women, 2 men) at a community event on Women in Technology. The other male stated that men succeed in this industry because the Y chromosome gives men an advantage in areas of performance. The individual who made these remarks is a highly respected technology expert, and not some bozo making dongle remarks at a conference or sponsoring a programming contest where first prize is a date with a bikini model. Our world is becoming increasingly polarized (just watch the news for five minutes), sadly with emotion often winning over reason. Even in our industry, recently I heard someone in a position of responsibility bash software tool XYZ based on a ridiculous premise and then give false praise to a competing tool. So many opinions, so many arguments, but heres the key: before taking a stand, do your homework and get the facts . Sometimes both sides are partly rightor wrong. Theres only one way to determine: get the facts. As Robert Heinlein wrote, Facts are your single clue get the facts Of course, once you get the facts, the next step is to express them in a meaningful and even compelling way. Theres nothing wrong with using some emotion in an intellectual debate but it IS wrong to replace an intellectual debate with emotion and false agenda. A while back I faced resistance to SQL Server Analysis Services from someone who claimed the tool couldnt do feature XYZ. The specifics of XYZ dont matter here. I spent about two hours that evening working up a demo to cogently demonstrate the original claim was false. In that example, it worked. I cant swear it will always work, but to me thats the only way. Im old enough to remember life at a teen in the 1970s. Back then, when a person lost hisher job, (often) it was because the person just wasnt cutting the mustard. Fast-forward to today: a sad fact of life is that even talented people are now losing their jobs because of the changing economic conditions. Theres never a full-proof method for immunity, but now more than ever its critical to provide a high level of what I call the Three Vs (value, versatility, and velocity) for your employerclients. I might not always like working weekends or very late at night to do the proverbial work of two people but then I remember there are folks out there who would give anything to be working at 1 AM at night to feed their families and pay their bills. Always be yourselfyour BEST self. Some people need inspiration from time to time. Heres mine: the great sports movie, Glory Road. If youve never watched it, and even if youre not a sports fan I can almost guarantee youll be moved like never before. And Ill close with this. If you need some major motivation, Ill refer to a story from 2006. Jason McElwain, a high school student with autism, came off the bench to score twenty points in a high school basketball game in Rochester New York. Heres a great YouTube video. His mother said it all . This is the first moment Jason has ever succeeded and is proud of himself. I look at autism as the Berlin Wall. He cracked it. To anyone who wanted to attend my session at todays SQL Saturday event in DC I apologize that the session had to be cancelled. I hate to make excuses, but a combination of getting back late from Detroit (client trip), a car thats dead (blown head gasket), and some sudden health issues with my wife have made it impossible for me to attend. Back in August, I did the same session (ColumnStore Index) for PASS as a webinar. You can go to this link to access the video (itll be streamed, as all PASS videos are streamed) The link does require that you fill out your name and email address, but thats it. And then you can watch the video. Feel free to contact me if you have questions, at kgoffkevinsgoff November 15, 2013 Getting started with Windows Azure and creating SQL Databases in the cloud can be a bit daunting, especially if youve never tried out any of Microsofts cloud offerings. Fortunately, Ive created a webcast to help people get started. This is an absolute beginners guide to creating SQL Databases under Windows Azure. It assumes zero prior knowledge of Azure. You can go to the BDBI Webcasts of this website and check out my webcast (dated 11102013). Or you can just download the webcast videos right here: here is part 1 and here is part 2. You can also download the slide deck here. November 03, 2013 Topic this week: SQL Server Snapshot Isolation Levels, added in SQL Server 2005. To this day, there are still many SQL developers, many good SQL developers who either arent aware of this feature, or havent had time to look at it. Hopefully this information will help. Companion webcast will be uploaded in the next day look for it in the BDBI Webcasts section of this blog. October 26, 2013 Im going to start a weekly post of T-SQL tips, covering many different versions of SQL Server over the years Heres a challenge many developers face. Ill whittle it down to a very simple example, but one where the pattern applies to many situations. Suppose you have a stored procedure that receives a single vendor ID and updates the freight for all orders with that vendor id. create procedure dbo. UpdateVendorOrders update Purchasing. PurchaseOrderHeader set Freight Freight 1 where VendorID VendorID Now, suppose we need to run this for a set of vendor IDs. Today we might run it for three vendors, tomorrow for five vendors, the next day for 100 vendors. We want to pass in the vendor IDs. If youve worked with SQL Server, you can probably guess where Im going with this. The big question is how do we pass a variable number of Vendor IDs Or, stated more generally, how do we pass an array, or a table of keys, to a procedure Something along the lines of exec dbo. UpdateVendorOrders SomeListOfVendors Over the years, developers have come up with different methods: Going all the way back to SQL Server 2000, developers might create a comma-separated list of vendor keys, and pass the CSV list as a varchar to the procedure. The procedure would shred the CSV varchar variable into a table variable and then join the PurchaseOrderHeader table to that table variable (to update the Freight for just those vendors in the table). I wrote about this in CoDe Magazine back in early 2005 (code-magazinearticleprint. aspxquickid0503071ampprintmodetrue. Tip 3) In SQL Server 2005, you could actually create an XML string of the vendor IDs, pass the XML string to the procedure, and then use XQUERY to shred the XML as a table variable. I also wrote about this in CoDe Magazine back in 2007 (code-magazinearticleprint. aspxquickid0703041ampprintmodetrue. Tip 12)Also, some developers will populate a temp table ahead of time, and then reference the temp table inside the procedure. All of these certainly work, and developers have had to use these techniques before because for years there was NO WAY to directly pass a table to a SQL Server stored procedure. Until SQL Server 2008 when Microsoft implemented the table type. This FINALLY allowed developers to pass an actual table of rows to a stored procedure. Now, it does require a few steps. We cant just pass any old table to a procedure. It has to be a pre-defined type (a template). So lets suppose we always want to pass a set of integer keys to different procedures. One day it might be a list of vendor keys. Next day it might be a list of customer keys. So we can create a generic table type of keys, one that can be instantiated for customer keys, vendor keys, etc. CREATE TYPE IntKeysTT AS TABLE ( IntKey int NOT NULL ) So Ive created a Table Typecalled IntKeysTT . Its defined to have one column an IntKey. Nowsuppose I want to load it with Vendors who have a Credit Rating of 1..and then take that list of Vendor keys and pass it to a procedure: DECLARE VendorList IntKeysTT INSERT INTO VendorList SELECT BusinessEntityID from Purchasing. Vendor WHERE CreditRating 1 So, I now have a table type variable not just any table variable, but a table type variable (that I populated the same way I would populate a normal table variable). Its in server memory (unless it needs to spill to tempDB) and is therefore private to the connectionprocess. OK, can I pass it to the stored procedure now Well, not yet we need to modify the procedure to receive a table type. Heres the code: create procedure dbo. UpdateVendorOrdersFromTT IntKeysTT IntKeysTT READONLY update Purchasing. PurchaseOrderHeader set Freight Freight 1 FROM Purchasing. PurchaseOrderHeader JOIN IntKeysTT TempVendorList ON PurchaseOrderHeader. VendorID Te mpVendorList. IntKey Notice how the procedure receives the IntKeysTT table type as a Table Type (again, not just a regular table, but a table type). It also receives it as a READONLY parameter. You CANNOT modify the contents of this table type inside the procedure. Usually you wont want to you simply want to read from it. Well, now you can reference the table type as a parameter and then utilize it in the JOIN statement, as you would any other table variable. Jadi begitulah. A bit of work to set up the table type, but in my view, definitely worth it. Additionally, if you pass values from , youre in luck. You can pass an ADO data table (with the same tablename property as the name of the Table Type) to the procedure. For developers who have had to pass CSV lists, XML strings, etc. to a procedure in the past, this is a huge benefit. Finally I want to talk about another approach people have used over the years. SQL Server Cursors. At the risk of sounding dogmatic, I strongly advise against Cursors, unless there is just no other way. Cursors are expensive operations in the server, For instance, someone might use a cursor approach and implement the solution this way: DECLARE VendorID int DECLARE dbcursor CURSOR FASTFORWARD FOR SELECT BusinessEntityID from Purchasing. Vendor where CreditRating 1 FETCH NEXT FROM dbcursor INTO VendorID WHILE FETCHSTATUS 0 EXEC dbo. UpdateVendorOrders VendorID FETCH NEXT FROM dbcursor INTO VendorID The best thing Ill say about this is that it works. And yes, getting something to work is a milestone. But getting something to work and getting something to work acceptably are two different things. Even if this process only takes 5-10 seconds to run, in those 5-10 seconds the cursor utilizes SQL Server resources quite heavily. Thats not a good idea in a large production environment. Additionally, the more the of rows in the cursor to fetch and the more the number of executions of the procedure, the slower it will be. When I ran both processes (the cursor approach and then the table type approach) against a small sampling of vendors (5 vendors), the processing times where 260 ms and 60 ms, respectively. So the table type approach was roughly 4 times faster. But then when I ran the 2 scenarios against a much larger of vendors (84 vendors), the different was staggering 6701 ms versus 207 ms, respectively. So the table type approach was roughly 32 times faster. Again, the CURSOR approach is definitely the least attractive approach. Even in SQL Server 2005, it would have been better to create a CSV list or an XML string (providing the number of keys could be stored in a scalar variable). But now that there is a Table Type feature in SQL Server 2008, you can achieve the objective with a feature thats more closely modeled to the way developers are thinking specifically, how do we pass a table to a procedure Now we have an answer Hope you find this feature help. Feel free to post a comment. SQL Server IO Performance Everything You Need To Consider SQL Server IO performance is crucial to overall performance. Access to data on disk is much slower than in memory, so getting the most out of local disk and SAN is essential. There is a lot of advice on the web and in books about SQL Server IO performance, but I havent found a single source listing everything to consider. This is my attempt to bring all the information together in one place. So here is a list of everything I can think of that can impact IO performance. I have ordered it starting at the physical disks and moving up the wire to the server and finally the code and database schema. Failed Disk When a drive fails in a disk array it will need to be replaced. The impact on performance before replacement depends on the storage array and RAID configuration used. RAID 5 and RAID 6 use distributed parity, and this parity is used to calculate the reads when a disk fails. Read performance loses the advantage of reading from multiple disks. This is also true, although to a lesser degree, on RAID 1 (mirrored) arrays. Reads lose the advantage of reading from multiple stripes for data on the failed disk, and writes may be slightly slower due to the increase in average seek time. Write Cache When a transaction is committed, the write to the transaction log has to complete before the transaction is marked as being committed. This is essential to ensure transactional integrity. It used to be that write cache was not recommended, but a lot of the latest storage arrays have battery-backed caches that are fully certified for use with SQL Server. If you have the option to vary the distribution of memory between read and write cache, try to allocate as much as possible to the write cache. This is because SQL Server performs its own read caching via the buffer pool, so any additional read cache on the disk controller has no benefit. Thin Provisioning Thin provisioning is a technology provided by some SANs whereby the actual disk storage used is just enough for the data, while appearing to the server to be full sized, with loads of free space. Where the total disk allocated to all servers exceeds the amount of physical storage, this is known as over-provisioning. Some SAN vendors try to claim that performance is not affected, but thats not always true. I saw this issue recently on a 3PAR array. Sequential reads were significantly slower on thin provisioned LUNs. Switching to thick provisioned LUNs more than doubled the sequential read throughput. Where Are The Disks Are they where you think they are It is perfectly possible to be connected to a storage array, but for the IO requests to pass through that array to another. This is sometimes done as a cheap way to increase disk space - using existing hardware that is being underutilized is less costly than purchasing more disks. The trouble is that this introduces yet another component into the path and is detrimental to performance - and the DBA may not even be aware of it. Make sure you know how the SAN is configured. Smart Tiering This is called different things by different vendors. The storage array will consist of two or more types of disk, of varying performance and cost. There are the slower 10K disks - these are the cheapest. Then you have the 15K disks. These are faster but more expensive. And then there may be some super-fast SSDs. These are even more expensive, although the price is coming down. Smart tiering migrates data between tiers so that more commonly accessed data is on the faster storage while less commonly used data drops down to the slower storage. This is OK in principle, but you are the DBA. You should already know which data needs to be accessed quickly and which can be slower. Do you really want an algorithm making this decision for you And regular maintenance tasks can confuse the whole thing anyway. Consider a load of index rebuilds running overnight. Lets suppose the last database to be processed is an archive database - do you want this is to be hogging the SSD when the users login first thing in the morning, while the mission critical database is languishing down in the bottom tier This is an oversimplification, of course. The tiering algorithms are more sophisticated than that, but my point stands. You should decide the priorities for your SQL Server data. Dont let the SAN vendors (or storage admins) persuade you otherwise. Storage Level Replication Storage level replication is a disaster recovery feature that copies block level data from the primary SAN to another - often located in a separate data center. The SAN vendors claim no impact on performance, and this is true if correctly configured. But I have seen poorly configured replication have a serious impact on performance. One client suffered a couple of years of poor IO performance. When I joined them I questioned whether the storage replication was responsible. I was told not to be so silly - the vendor has checked and it is not the problem - it must be SQL Server itself A few months later I was contacted again - they had turned off the replication while in the process of moving to a new data center and guess what Write latency improved by an order of magnitude. Let me repeat that this was caused by poor configuration and most storage replication does not noticeably affect performance. But its another thing to consider if youre struggling with SQL Server IO performance. Host Bus Adapters Check that the SAN and HBA firmware are compatible. Sometimes when a SAN is upgraded, the HBAs on the servers are overlooked. This can result in irregular errors, or even make the storage inaccessible. Have a look at the HBA queue depth. A common default is 32, which may not be optimal. Some studies have shown that increasing this to 64 or higher can improve performance. It could also make things worse, depending on workload, SAN make and model, disk layout, etc. So test thoroughly if you can. Some storage admins discourage modifying HBA queue depth as they think everyone will want the same on their servers and the storage array will be swamped. And theyre right, too Persuade them that it is just for you. Promise not to tell anyone else. Whatever. Just get your extra queue depth if you think it will benefit performance. Too Many Servers When a company forks out a small fortune on a storage area network, they want to get value for money. So naturally, every new server that comes along gets hooked up so it can make use of all that lovely disk space. This is fine until a couple of servers start issuing a lot of IO requests and other users complain of a performance slowdown. This is something I see repeatedly at so many clients, and there is no easy solution. The company doesnt want or cant afford to purchase another SAN. If you think this is a problem for you, put a schedule together of all jobs - across all servers - and try to reschedule some so that workload is distributed more evenly. Partition Alignment and Formatting I will briefly mention partition alignment, although Windows 2008 uses a default offset of 1MB so this is less of an issue than it used to be. I am also not convinced that a lot of modern SANs benefit much from the practise. I performed a test on an EVA a few years ago and found just a 2 improvement. Nevertheless, a few percent is still worth striving for. Unfortunately you will have to tear down your volumes and recreate your partitions if this is to be fixed on an existing system. This is probably not worth the hassle unless you are striving for every last inch of performance. Formatting is something else that should be performed correctly. SQL Server stores data in 8KB pages, but these are retrieved in blocks of 8, called extents. If the disks are formatted with 64KB allocation units, this can have a significant performance benefit. Multipathing If you are not using local disk then you should have some redundancy built into your storage subsystem. If you have a SAN you have a complicated network of HBAs, fabric, switches and controllers between SQL Server and the disks. There should be at least two HBAs, switches, etc. and these should all be connected together in such a way that there are multiple paths to the disks. This redundancy is primarily for high availability, but if the multipathing has been configured as activeactive you may see performance benefits as well. Network Attached Storage Since SQL Server 2008 R2 it has been possible to create, restore or attach a database on a file share. This has a number of possible uses, and particularly for devtest environments it can make capacity management easier, and make moving databases between servers much quicker. The question to be asked, though, is quotDo you really want this in productionquot Performance will not be as good as local or SAN drives. There are additional components in the chain, so reliability may not be as good. And by using the network, your data uses the same infrastructure as all the other TCPIP traffic, which again could impact performance. But theres good news While availability is still a worry, improvements in SMB on Windows Server 2012 (and via an update to WIndows Server 2008 R2) have made it significantly faster. I saw a quote from a Microsoft employee somewhere that claimed 97 of the performance of local storage. I cant find the quote now, and I dont remember if he was measuring latency or throughput. Disk Fragmentation How often do you use the Disk Defragmenter tool on your PC to analyze and defragment your C: drive How often do you check fragmentation on the disks on your SQL Servers For most people that is nowhere near as often, Ill bet. Yet volume fragmentation is just as detrimental to SQL Server performance as it is to your PC. You can reduce the likelihood of disk fragmentation in a number of ways: Pre-size data and log files, rather than rely on auto-growth Set auto-growth increments to sensible values instead of the default 10 Avoid shrinking data and log files Never, ever use the autoshrink database option Ensure disks are dedicated to SQL Server and not shared with other applications You can check fragmentation using the same tool as on your PC. Disk Defragmenter is available on all server versions of Windows. Another way to check is via the Win32Volume class in WMI. This bit of PowerShell reports the file percent fragmentation for all volumes on a given server. If you have significant fragmentation there are a couple of ways to fix it. My preferred option is as follows, but requires some downtime. Stop the SQL services Backup the files on the disk (especially mdf, ndf and ldf files - better safe than sorry) Run the Windows Disk Defragmenter tool Start the SQL services Check the error log to ensure no errors during startup Run CHECKDB against all databases (except tempdb). Ive never seen the defrag tool cause corruption, but you cant be too careful Another option that doesnt require downtime is to use a third party tool such as Diskeeper. This can be very effective at fixing and preventing disk fragmentation, but it costs money and uses a filter driver - see my comments below. Filter Drivers A filter driver is a piece of software that sits between an IO request and the write to disk. It allows the write to be examined and rejected, modified or audited. The most common type of filter driver is installed by anti-virus software. You do not want anti-virus software checking every single write to your database files. You also dont want it checking your backups either, or writes to the error log, or default trace. If you have AV software installed, you can specify exclusions. Exclude all folders used by SQL Server, plus the drives used by data and log files, plus the folders used for backups. Even better is to turn off online AV checking, and schedule a scan at a quiet time. OLTP and BI on the Same Server It is rare to find a system that is purely OLTP. Most will have some sort of reporting element as well. Unfortunately, the two types of workload do not always coexist happily. Ive been reading a lot of articles by Joe Chang, and in one article he explains why this is the case. Essentially, OLTP query plans retrieve rows in small batches (less than a threshold of 25 rows) and these IO requests are handled synchronously by the database engine, meaning that they wait for the data to be retrieved before continuing. Large BI workloads and reporting queries, often with parallel plans, issue asynchronous IO requests and take full advantage of the HBA ability to queue requests. As a result, the OLTP requests have to queue up behind the BI requests, causing OLTP performance to degrade significantly. Auto-grow and Instant File Initialization It is good to have auto-grow enabled, just as a precaution, although you should also pre-size data and log files so that it is rarely needed. However, what happens if a data file grows and you dont have instant file initialization enabled Especially if the auto-grow is set too big. All IO against the file has to wait for the file growth to complete, and this may be reported in the infamous quotIOs taken longer than 15 seconds to completequot message in the error log. Instant initialization wont help with log growth, so make sure log auto-growth increments are not too high. For more information about instant file initialization and how to enable it, see this link Database File Initialization . And while on the subject of auto-grow, see the section on proportional fill, below. Transaction Log Performance How long do your transaction log writes take Less than 1ms More than 5ms Look at virtual file stats, performance counters, or the WRITELOG wait time to see if log write latency is an issue for you. Writes to the transaction log are sequential, and so the write head on the disk should ideally be where it was from the last log write. This means no seek time, and blazingly fast write times. And since a transaction cannot commit until the log has hardened to disk, you rely on these fast writes for a performant system. Advice for years has been for the transaction log for each database to be on its own disk. And this advice is still good for local disk, and for some storage arrays. But now that a lot of SANs have their own battery-backed write cache, this advice is not as critical as it used to be. Provided the cache is big enough to cope with peak bursts of write activity (and see my earlier comments about allocating more cache to writes than to reads) you will get very low latency. So what if you dont have the luxury of a mega-bucks SAN and loads of write cache Then the advice thats been around since the 1990s is still valid: One transaction log file per database on its own drive RAID 1, RAID 10 or RAID 01 So assuming you are happy with your log file layout, what else could be slowing down your log writes Virtual Log Files Although a transaction log is written to sequentially, the file itself can become fragmented internally. When it is first created it consists of several chunks called virtual log files. Every time it is grown, whether manually or automatically, several more virtual log files are added. A transaction log that grows multiple times can end up with thousands of virtual log files. Having too many VLFs can slow down logging and may also slow down log backups. You also need to be careful to avoid VLFs that are too big. An inactive virtual log file is not cleared until the end is reached and the next one starts to be used. For full recovery model, this doesnt happen until the next log backup. So a log backup will suddenly have a lot more work to, and may cause performance problems while it takes place. The answer for a big transaction log is to set an initial size of maximum 8000MB, and then manually grow in chunks of 8000MB up to the target size. This results in maximum VLF size of 512MB, without creating an excessively large number of VLFs. Note: this advice is for manual growth only. Do not auto grow by 8000MB All transactions in the database will stop while the extra space is initialised. Autogrow should be much smaller - but try to manually size the file so that auto grow is unlikely to be needed. Log Manager Limits The database engine sets limits on the amount of log that can be in flight at any one time. This is a per-database limit, and depends on the version of SQL Server being used. SQL Server limits the number of outstanding IOs and MB per second. The limits vary with version and whether 32 bit or 64 bit. See Diagnosing Transaction Log Performance Issues and Limits of the Log Manager for more details. This is why the write latency should be as low as possible. If it takes 20ms to write to the transaction log, and you are limited to 32 IOs in flight at a time, that means a maximum of 1600 transactions per second, well below what a lot of high volume OLTP databases require. This also emphasises the importance of keeping transaction sizes small, as one very large transaction could conceivably hold up other transactions while it commits. If you think these limits are affecting log write performance in your databases there are several ways to tackle the problem: Work on increasing log write performance If you have minimally logged operations you can switch the database to use the BULK LOGGED recovery model. Careful though - a log backup containing a minimally logged operation has to be restored in full. Point in time restore is not possible. Split a high volume database into 2 or more databases, as the log limits apply per database Non-Sequential Log Activity There are actions performed by the database engine that move the write head away from the end of the log file. If transactions are still being committed while this happens, you have a seek overhead and log performance gets worse. Operations that read from the log files include rollback of large transactions, log backups and replication (the log reader agent). There is little you can do about most of these, but avoiding large rollbacks is something that should be tackled at the design and development stage of an application. Proportional Fill Very active tables can be placed in a file group that has multiple data files. This can improve read performance if they are on different physical disks, and it can improve write performance by limiting contention in the allocation pages (especially true for tempdb). You lose some of the benefit, though, if you dont take advantage of the proportional fill algorithm. Proportional fill is the process by which the database tries to allocate new pages in proportion to the amount of free space in each data file in the file group. To get the maximum benefit make sure that each file is the same size, and is always grown by the same increment. This is for both manual and auto growth. One thing to be aware of is how the auto growth works. SQL Server does its best to fill the files at the same rate, but one will always fill up just before the others, and this file will then auto grow on its own. This then gets more new page allocations than the others and becomes a temporary hotspot until the others also auto grow and catch up. This is unlikely to cause problems for most databases, although for tempdb it may be more noticeable. Trace flag 1117 causes all data files in a file group to grow together, so is worth considering if this is an issue for you. Personally I would rather manually size the files so that auto growth isnt necessary. tempdb Configuration Lets start with a few things that everybody agrees on: tempdb files should be placed on the fastest storage available. Local SSD is ideal, and from SQL Server 2012 this is even possible on a cluster Pre-size the data and log files, as auto growth may cause performance issues while it occurs New temporary objects are created all the time, so contention in the GAM, SGAM and PFS pages may be an issue in some environments And now some differences of opinion: There is loads of advice all over the web to create one tempdb data file per core to reduce allocation contention. Paul Randall disagrees (A SQL Server DBA myth a day: (1230) tempdb should always have one data file per processor core ). He says that too many files can actually make things worse. His solution is to create fewer files and to increase only if necessary There is more advice, often repeated, to separate tempdb files from other databases and put them on their own physical spindles. Joe Chang disagrees and has a very good argument for using the common pool of disks. (Data, Log and Temp file placement ). Ill leave you to decide what to do AutoShrink The AutoShrink database option has been around ever since I started using SQL Server, causing lots of performance problems for people who have enabled it without fully realising what it does. Often a third party application will install a database with this option enabled, and the DBA may not notice it until later. So why is it bad Two reasons: It is always used in conjunction with auto grow, and the continuous cycle of grow-shrink-grow causes a huge amount of physical disk fragmentation. Ive already covered that topic earlier in this article While it performs the shrink there is a lot of additional IO, which slows down the system for everything else Disable it. Allocate enough space for the data and log files, and size them accordingly. And dont forget to fix all that fragmentation while youre at it. Insufficient Memory This is an article about SQL Server IO performance, not memory. So I dont want to cover it in any detail here - that is a subject for a different article. I just want to remind you that SQL Server loves memory - the more the better. If your entire database(s) fits into memory youll have a much faster system, bypassing all that slow IO. Lack of memory can lead to dirty pages being flushed to disk more often to make space for more pages being read. Lack of memory can also lead to increased tempdb IO, as more worktables for sort and hash operations have to spool to disk. Anyway, the point of this section is really to make one statement: Fill your servers with as much memory as you can afford, and as much as the edition of SQL Server and Windows can address. SQL Server 2014 has a new feature allowing some tables to be retained in memory, and accessed via natively compiled stored procedures. Some redesign of some of your existing code may be needed to take advantage of this, but it looks like a great performance boost for those OLTP systems that start to use it. High Use of tempdb tempdb can be a major consumer of IO and may affect overall performance if used excessively. It is worth looking at the various reasons for its use, and examining your system to ensure you have minimized these as far as possible. User-created temporary objects The most common of these are temporary tables, table variables and cursors. If there is a high rate of creation this can lead to allocation page contention, although increasing the number of tempdb data-files may partially alleviate this. Processes creating very large temporary tables or table variables are a big no-no, as these can cause a lot of IO. Internal Objects The database engine creates work-tables in tempdb for handling hash joins, sorting and spooling of intermediate result sets. When sort operations or hash joins need more memory than has been granted they spill to disk (using tempdb) and you will see Hash warnings and Sort warnings in the default trace. I originally wrote a couple of paragraphs about how and why this happens and what you can do to prevent it, but then I found this post that explains it much better - Understanding Hash, Sort and Exchange Spill Events . Version Store The third use of tempdb is for the version store. This is used for row versioning. Row versions are created when snapshot isolation or read committed snapshot option is used. They are also created during online index rebuilds for updates and deletes made during the rebuild and for handling data modifications to multiple active result sets (MARS). A poorly written application (or rogue user) performing a large update that affects many thousands of rows when a row versioning based isolation level is in use may cause rapid growth in tempdb and adversely impact IO performance for other users. Table and Index Scans A table scan is a scan of a heap. An index scan is a scan of a clustered or non-clustered index. Both may be the best option if a covering index does not exist and a lot of rows are likely to be retrieved. A clustered index scan performs better than a table scan - yet another reason for avoiding heaps But what causes a scan to be used in the first place, and how can you make a seek more likely Out of date statistics Before checking indexes and code, make sure that statistics are up to date. Enable quotauto create statisticsquot. If quotauto update statisticsquot is not enabled make sure you run a manual statistics update regularly. This is a good idea even if quotauto update statisticsquot is enabled, as the threshold of approximately 20 of changed rows before the auto update kicks in is often not enough, especially where new rows are added with an ascending key. Index Choice Sometimes an existing index is not used. Have a look at improving its selectivity, possibly by adding additional columns, or modifying the column order. Consider whether a covering index could be created. A seek is more likely to be performed if no bookmark lookups will be needed. See these posts on the quottipping pointquot by Kimberly Tripp. The Tipping Point . Inefficient TSQL The way a query is written can also result in a scan, even if a useful index exists. Some of the reasons for this are: Non-sargable expressions in the WHERE clause. quotsargquot means Simple ARGument. So move calculations away from the columns and onto the constants instead. So for example, this will not use the index on OrderDate: WHERE DATEADD ( DAY. 1. OrderDate ) gt GETDATE () Whereas this will use an index if it exists (and it is selective enough): WHERE OrderDate gt DATEADD ( DAY. - 1. GETDATE ()) Implicit conversions in a query may also result in a scan. See this post by Jonathan Kehayias Implicit Conversions that cause Index Scans . Bad Parameter Sniffing Parameter sniffing is a good thing. It allows plan re-use and improves performance. But sometimes it results in a less efficient execution plan for some parameters. Index Maintenance Every index has to be maintained. Im not talking about maintenance plans, but about the fact that when rows are inserted, deleted and updated, the non-clustered indexes also have to be changed. This means additional IO for each index on a table. So it is a mistake to have more indexes than you need. Check that all indexes are being used. Check for duplicates and redundant indexes (where the columns in one are a subset of the columns in another). Check for indexes where the first column is identical but the rest are not - sometimes these can be merged. And of course, test, test, test. Index Fragmentation Index fragmentation affects IO performance in several ways. Range scans are less efficient, and less able to make use of read-ahead reads Empty space created in the pages reduces the density of the data, meaning more read IO is necessary The fragmentation itself is caused by page splits, which means more write IO There are a number things that can be done to reduce the impact of fragmentation, or to reduce the amount of fragmentation. Rebuild or reorganize indexes regularly Specify a lower fill factor so that page splits occur less often (though not too low, see below) Change the clustered index to use an ascending key so that new rows are appended to the end, rather than inserted in a random place in the middle Forwarded Records When a row in a heap is updated and requires more space, it is copied to a new page. But non-clustered indexes are not updated to point to the new page. Instead, a pointer is added to the original page to show where the row has moved to. This is called a forwarding pointer, and there could potentially be a long chain of these pointers to traverse to find the eventual data. Naturally, this means more IO. A heap cannot be defragmented by rebuilding the index (there isnt one). The only way to do this is to create a clustered index on the heap, and then drop it afterwards. Be aware that this will cause all non-clustered indexes to be rebuilt twice - once for the new clustered index, and again when it is dropped. If there are a lot of these it is a good idea to drop the non-clustered indexes first, and recreate them afterwards. Better still is to avoid heaps where possible. I accept there may be cases where they are the more efficient choice (inserting into archive tables, for example), but always consider whether a clustered index would be a better option - it usually is. Wasted Space In an ideal world every data page on disk (and in memory) would be 100 full. This would mean the minimum of IO is needed to read and write the data. In practise, there is wasted space in nearly all pages - sometimes a very high percent - and there are a lot of reasons why this occurs. Low fill factor Ive mentioned fill factor already. If it is too high, and page splits are occurring when rows are inserted or updated, it is sensible to rebuild the index with a lower fill factor. However, if the fill factor is too low you may have a lot of wasted space in the database pages, resulting in more IO and memory use. This is one of those quotsuck it and seequot scenarios. Sometimes a compromise is needed. Page splits This is also discussed above. But as well as fragmentation, page splits can also result in wasted space if the empty space is not reused. The solution is to defragment by rebuilding or reorganizing indexes regularly. Wasteful Choice of Data Types Use the smallest data types you can. And try to avoid the fixed length datatypes, like CHAR(255), unless you regularly update to the longest length and want to avoid page splits. The reasoning is simple. If you only use 20 characters out of 200, that is 90 wasted space, and more IO as result. The higher density of data per page the better. Lazy thinking might make developers create AddressLine1, AddressLine2, etc as CHAR(255), because they dont actually know what the longest should be. In this case, either do some research, find out that the longest is 50 characters (for example) and reduce them to CHAR(50), or use a variable length data type. Schema Design Ive already mentioned choice of data types above, but there are other schema design decisions that can affect the amount of IO generated by an application database. The most common one is designing tables that are too wide. I sometimes see a table with 20, 30, 50, even 100 columns. This means fewer rows fit on a page, and for some extreme cases there is room for just one row per page - and often a lot of wasted space as well (if the row is just slightly wider than half a page, thats 50 wasted). If you really do need 50 columns for your Customer table, ask yourself how many of these are regularly accessed. An alternative is to split into 2 tables. Customer, with just a few of the commonly used columns, and CustomerDetail with the rest. Of course, the choice of which columns to move is important. You dont want to start joining the tables for every query as that defeats the object of the exercise. Page or Row Compression Compression is another way of compacting the data onto a page to reduce disk space and IO. Use of row or page compression can dramatically improve IO performance, but CPU usage does increase. As long as you are not already seeing CPU bottlenecks, compression may be an option to consider. Be aware that compression is an Enterprise edition feature only. Backup Compression Since SQL Server 2008 R2, backup compression has been available on Standard edition as well as Enterprise. This is major benefit and I recommend that it be enabled on all instances. As well as creating smaller backups, it is also quicker and means less write IO. The small increase in CPU usage is well worth it. Enable it by default so that if someone sets off an ad hoc backup it will have minimal IO impact. Synchronous MirroringAlwaysOn High safety mode in database mirroring, or synchronous commit mode in AlwaysOn, both emphasise availability over performance. A transaction on the mirroring principal server or primary replica does not commit until it receives a message back from the mirror or secondary replica that the transaction has been hardened to the transaction log. This increases transactional latency, particularly when the servers are in different physical locations. Resource Governor in 2014 Up until and including SQL Server 2012 resource governor has only been able to throttle CPU and memory usage. Finally the ability to include IO in a resource pool has been added to SQL Server 2014. This has obvious use as a way of limiting the impact of reports on the system from a particular user, department or application. Gathering The Evidence There are a lot of ways you can measure SQL Server IO performance and identify which areas need looking at. Most of what follows is available in SQL CoPilot in graphical and tabular form, both as averages since last service start and as snapshots of current activity. Wait Types Use sys. dmoswaitstats to check number of waits and wait times for IOCOMPLETION, LOGBUFFER, WRITELOG and PAGEIOLATCH. Use this script to focus on the IO wait types: SELECT waittype. waitingtaskscount. waittimems - signalwaittimems AS totalwaittimems , 1. ( waittimems - signalwaittimems ) CASE WHEN waitingtaskscount 0 THEN 1 ELSE waitingtaskscount END AS avgwaitms FROM sys. dmoswaitstats WHERE waittype IN ( IOCOMPLETION. LOGBUFFER. WRITELOG. PAGEIOLATCHSH. PAGEIOLATCHUP. PAGEIOLATCHEX. PAGEIOLATCHDT. PAGEIOLATCHKP ) This shows averages since the last service restart, or since the wait stats were last cleared. To clear the wait stats, use DBCC SQLPERF (sys. dmoswaitstats, CLEAR) You can also check sys. dmoswaitingtasks to see what is currently being waited for. Virtual File Stats Query sys. dmiovirtualfilestats to find out which data and log files get the most read and write IO, and the latency for each file calculated using the stall in ms. SELECT d. name AS databasename. mf. name AS logicalfilename. numofbytesread. numofbyteswritten. numofreads. numofwrites. 1. iostallreadms ( numofreads 1 ) avgreadstallms. 1. iostallwritems ( numofwrites 1 ) avgwritestallms FROM sys. dmiovirtualfilestats (NULL, NULL) vfs JOIN sys. masterfiles mf ON vfs. databaseid mf. databaseid AND vfs. FILEID mf. FILEID JOIN sys. databases d ON mf. databaseid d. databaseid Performance Counters There are two ways of looking at performance counters. Select from sys. dmosperformancecounters, which shows all the SQL Server counters, or use Windows Performance Monitor (perfmon) to see the other OS counters as well. Some counters to look at are: SQL Server:Buffer Manager Lazy writessec The number of times per second that dirty pages are flushed to disk by the Lazy Writer process. An indication of low memory, but listed here as it causes more IO. Checkpoint pagessec The number of dirty pages flushed to disk per second by the checkpoint process. Page readssec Number of physical pages read from disk per second Page writessec Number of physical pages written to disk per second Readahead pagessec Pages read from disk in advance of them being needed. Expect to see high values in BI workloads, but not for OLTP SQL Server:Access Methods Forwarded recordssec Should be as low as possible. See above for explanation of forwarded records. Full scanssec The number of unrestricted full scans. Use of UDFs and table variables can contribute to this, but concentrating on seeks will help to keep the value down Page splitssec The number of page splits per second - combining splits due to pages being added to the end of a clustered index as well as quotgenuinequot splits when a row is moved to a new page. Use the technique from the link in the section on index fragmentation, above, to get a more accurate breakdown Skipped ghosted recordssec For information about ghosted records see An In-depth Look at Ghost Records in SQL Server Workfiles createdsec A measure of tempdb activity Worktables createdsec A measure of tempdb activity SQL Server:Databases Log bytes flushedsec The rate at which log records are written to disk Log flush wait time The duration of the last log flush for each database Log flush waitssec The number of commits per second waiting for a log flush Logical Disk Avg Disk secsRead Avg Disk secsWrite Avg Disk Read bytessec Avg Disk Write bytessec Using the sys. dmosperformancecounters DMV, a lot of counters display a raw value, which has to be monitored over time to see values per second. Others have to be divided by a base value to get a percentage. This makes this DMV less useful unless you perform these calculations and either monitor over time or take an average since the last server restart. This script uses the tempdb creation date to get the number of seconds since the service started and calculates the averages for these counters. It also retrieves all other counters and calculates those that are derived from a base value. USE master SET NOCOUNT ON DECLARE upsecs bigint SELECT upsecs DATEDIFF ( second. createdate. GETDATE ()) FROM sys. databases WHERE name tempdb SELECT RTRIM ( objectname ) objectname. RTRIM ( instancename ) instancename. RTRIM ( countername ) countername. cntrvalue FROM sys. dmosperformancecounters WHERE cntrtype 65792 UNION ALL SELECT RTRIM ( objectname ), RTRIM ( instancename ), RTRIM ( countername ), 1. CAST ( cntrvalue AS bigint ) upsecs FROM sys. dmosperformancecounters WHERE cntrtype 272696576 UNION ALL SELECT RTRIM ( v. objectname ), RTRIM ( v. instancename ), RTRIM ( v. countername ), 100. v. cntrvalue CASE WHEN b. cntrvalue 0 THEN 1 ELSE b. cntrvalue END FROM ( SELECT objectname. instancename. countername. cntrvalue FROM sys. dmosperformancecounters WHERE cntrtype 537003264 ) v JOIN ( SELECT objectname. instancename. countername. cntrvalue FROM sys. dmosperformancecounters WHERE cntrtype 1073939712 ) b ON v. objectname b. objectname AND v. instancename b. instancename AND RTRIM ( v. countername ) base RTRIM ( b. countername ) UNION ALL SELECT RTRIM ( v. objectname ), RTRIM ( v. instancename ), RTRIM ( v. countername ), 1. v. cntrvalue CASE WHEN b. cntrvalue 0 THEN 1 ELSE b. cntrvalue END FROM ( SELECT objectname. instancename. countername. cntrvalue FROM sys. dmosperformancecounters WHERE cntrtype 1073874176 ) v JOIN ( SELECT objectname. instancename. countername. cntrvalue FROM sys. dmosperformancecounters WHERE cntrtype 1073939712 ) b ON v. objectname b. objectname AND v. instancename b. instancename AND REPLACE ( RTRIM ( v. countername ), (ms). ) Base RTRIM ( b. countername ) ORDER BY objectname. instancename. countername Dynamic Management Views and Functions As well as the DMVs in the above scripts, there are a number of others that are useful for diagnosing SQL Server IO performance problems. Here are all the ones I use. Ill add some sample scripts when I get the time: sys. dmoswaitstats sys. dmiovirtualfilestats sys. dmosperformancecounters sys. dmiopendingiorequests sys. dmdbindexoperationalstats sys. dmdbindexusagestats sys. dmdbindexphysicalstats sys. dmosbufferdescriptors It can also be useful to see what activity there is on the instance. Here are your options: The Profiler tool is quick and easy to use - you can start tracing in a matter of seconds. However, there is some overhead and it may impact performance itself - especially when a lot of columns are selected. A server side trace is a better option. A server-side trace has less of an impact than running Profiler. It has to be scripted using system stored procedures, but Profiler has the ability to generate the script for you. Extended Event Sessions Extended events were first introduced in SQL Server 2008, and have been considerably enhanced in SQL 2012. They are very lightweight, and the use of server-side traces and Profiler is now deprecated. Nevertheless, use of extended events may impact performance of high transaction systems if you are not careful. Use an asynchronous target and avoid complicated predicates to limit the overhead. There are a number of tools for gathering performance data from your servers. SQLIO is a simple tool that creates a file on disk and tests latency and throughput for randomsequential IO, at various block sizes and with a variable number of threads. These are all fully configurable. SQLIO is a great way of getting a baseline on a new server or storage, for future comparison. Third party tools are another option for viewing performance metrics. Some show you what is happening on the server right now. Others are built into more complex (and expensive) monitoring solutions. Performance metrics obtained on virtual servers are unreliable. Performance counters and wait stats may give the impression that everything is OK, when it is not. I recommend the use of the performance monitoring tools provided by the VM vendor. In the case of VMWare, this is very easy to use and is built into Virtual Center. This turned into a much bigger article than I expected - SQL Server IO performance is a big subject I started with everything I knew, and double checked my facts by searching the web and checking books. In the process I learnt a whole lot of new stuff and found a lot of useful links. It has been a useful exercise. Hopefully this has been useful for you too.
No comments:
Post a Comment