Related Projects for Data Mining Solutions
Topic Status: Some information in this topic is preview and subject to change in future releases. Preview information describes new features or changes to existing features in Microsoft SQL Server 2016 Community Technology Preview 2 (CTP2).
The minimum that is required for a data mining solution is the data mining project, which defines data sources, data source views, mining structures and mining models. However, when data mining models are used in daily decision making, it is important that data mining be integrated with other part of a predictive analytics solution, which can include these processes and components:
Preparation and selection of data and of variables. Includes data cleansing, metadata management and integration of multiple data sources, and the conversion, merging, and uploading of data into a data warehouse.
Reporting of analysis, presentation of predictions, and auditing/tracking of data mining activities.
Using multidimensional models or tabular models to explore findings.
Refinement of the data mining solution to support new data, or changes in the support infrastructure driven by current analysis.
This topic describes the other features of SQL Server 2016 that are often part of a predictive analytics solution, either to support the processes of data preparation and data mining, or to support users by providing tools for analysis and action.
Integration Services provides components and features that are required for the data preparation and training phases of a data mining project. Although you can perform many data cleansing or preparation tasks by using other tools, such as scripts, Integration Services has numerous advantages for data mining:
Represents tasks as part of a workflow, which can be repeated, automated, branched, and extended.
Provides extensive support for auditing and multiple ways of capturing errors and logging events.
In addition to capturing data lineage, you can monitor changes to the data throughout the data transformation pipeline.
You can also integrate your SSIS workflows with the features that support Change Data Capture functionality in SQL Server.
Data mining can be incorporated in the Integration Services workflow, to intelligently separate incoming data into multiple tables. For example, you could use a prediction query to split new customers into different groups for targeting in a mailing campaign.
The following lists provide links to the Integration Services components that are most widely used in support of data mining.
Control Flow Components
Data Flow Components
Although Reporting Services is typically not seen as a critical component of data mining solutions, it provides the following features that are useful for presentation of data mining solutions.
Integration of data from multiple sources in complex reports. Create queries against the model content for analysts, and reports that show predictions and trends for end users.
The ability to create a report that lets users directly query against an existing mining model.
Integration with Analysis Services, to support drillthrough and exploration of data mining dimensions and data mining cubes created from OLAP models.
parameterization and formatting features that are available in Reporting Services.
For more information about how to use Reporting Services with DMX queries as a data source, see these links:
However, it is not necessary to use DMX as the data source. The Integration Services components for data mining also support saving the results of a prediction query to a relational database. If you have an established workflow for updating models using Integration Services, persisting predictions and other data mining query results to SQL Server enable you to use Power View for reporting, as well as other tools that do not interface with DMX.
For more information about using Reporting Services as the presentation layer for data sources, see Integrating Reporting Services into Applications.
Data Quality Services (DQS) is new in SQL Server 2016. Because data problems can make data mining impossible, data miners who perform repeated analysis or who work in large organizations with complex data sources are expected to find that a well-planned data project using DQS is a more reliable solution for support of data mining than ad hoc cleansing of data using Transact-SQL or other scripts.
The following features of DQS should be considered for data preparation and data integrity in a data mining solution.
For more information, see DQS Knowledge Bases and Domains.
Full-Text Search in SQL Server lets applications and users run full-text queries against character-based data in SQL Server tables. When full-text search is enabled, you can perform searches against text data that are enhanced by language-specific rules about the multiple forms of a word or phrase. You can also configure search conditions, such as the distance between multiple terms, and use functions to constrain the results that are returned in order of likelihood.
Because full-text queries are a feature provided by the SQL Server engine, you can create parameterized queries, generate custom data sets or term vectors by using full-text search features on a text data source, and use these sources in data mining.
For more information about how full-text queries interact with the full-text index, see Query with Full-Text Search.
An advantage of using the full-text search features of SQL Server is that you can leverage the linguistic intelligence that is contained in the word breakers and stemmers shipped for all SQL Server languages. By using the supplied word breakers and stemmers, you can ensure that words are separated using the characters appropriate for each language, and that synonyms based on diacritics or orthographic variations (such as the multiple number formats in Japanese) are not overlooked.
In addition to linguistic intelligence that governs word boundaries, the stemmers for each language can reduce variants of a word to a single term, based on knowledge of the rules for conjugation and orthographic variation in that language. The rules for linguistic analysis differ for each language and are developed based on extensive research on real-life corpora.
For more information, see Configure and Manage Word Breakers and Stemmers for Search.
The version of a word that is stored after full-text indexing is a token in compressed form. Subsequent queries to the full-text index generate multiple inflectional forms of a particular word based on the rules of that language, to ensure that all probable matches are made. For example, although the token that is stored might be “run”, the query engine also looks for the terms "running", "ran", and "runner," because these are regularly derived morphological variations of the root word "run".
You can also create and build a user thesaurus to store synonyms and enable better search results, or categorization of terms. By developing a thesaurus tailored to your full-text data, you can effectively broaden the scope of full-text queries on that data. For more information, see Configure and Manage Thesaurus Files for Full-Text Search.
Requirements for using full-text search include the following:
The database administrator must create a full-text index on the table.
Only one full-text index is allowed per table.
Each column that you index must have a unique key.
Full-text indexing is supported only for columns with these data types: char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, and varbinary(max). If the column is varbinary, varbinary(max), image, or xml, you must specify the file extension of the indexable document (.doc, .pdf, .xls, and so forth), in a separate type column.
Semantic search builds upon the existing full-text search features in SQL Server, but uses additional capabilities and statistics to enable scenarios such as automatic keyword extraction and discovery of related documents. For example, you might use semantic search to build a base taxonomy for an organization, or classify a corpus of documents. Or, you could use the combination of extracted terms and document similarity scores in clustering or decision tree models.
After you have enabled semantic search successfully, and indexed your data columns, you can use the functions that are provided natively with semantic indexing to do the following:
Return single-word key phrases with their score.
Return documents that contain a specified key phrase.
Return similarity scores, and the terms that contribute to the score.
For more information, see Find Key Phrases in Documents with Semantic Search and Find Similar and Related Documents with Semantic Search.
For more information about the database objects that support semantic indexing, see Enable Semantic Search on Tables and Columns.
Requirements for using semantic search include the following:
Full-text search also be enabled.
Installation of the semantic search components also creates a special system database, which cannot be renamed, altered, or replaced.
Documents that you index using the service must be stored in SQL Server, in any of the database objects that are supported for full-text indexing, including tables and indexed views.
Not all of the full-text languages support semantic indexing. For a list of supported languages, see sys.fulltext_semantic_languages (Transact-SQL).