
Here are some more code optimization approaches related to LINQ queries.

As is well known, LINQ (Language-Integrated Query) is a simple and convenient language for querying a data source.

LINQ to SQL is a data access technology for a DBMS. It is a powerful tool for working with data, where queries are written in a declarative language that the platform then converts into SQL queries and sends to the database server for execution. In our case, the DBMS is MS SQL Server.

However, LINQ queries are not converted into SQL as optimally written as an experienced DBA could produce, with all the nuances of SQL query optimization:

  1.     optimal joins (JOIN) and filtering of results (WHERE)
  2.     many nuances in the use of joins and grouped conditions
  3.     many variations of replacing IN with EXISTS, and NOT IN and <> with NOT EXISTS
  4.     intermediate caching of results in temporary tables, CTEs, and table variables
  5.     using the OPTION clause with query hints and table hints WITH (…)
  6.     using indexed views as one way to get rid of redundant data reads during selection

The main performance bottlenecks of the SQL queries produced from LINQ queries are:

  •     the entire data-selection mechanism is consolidated into one query
  •     identical blocks of code are duplicated, which ultimately leads to multiple extra data reads
  •     groups of multi-part conditions (logical “and” / “or” – AND and OR) combined into complex conditions lead the optimizer, even when it has suitable non-clustered indexes on the necessary fields, to end up scanning the clustered index (INDEX SCAN) for the condition group
  •     deeply nested subqueries make the generated SQL statements and their query plans very hard for developers and DBAs to analyze

Optimization methods

Now let’s move directly to the optimization methods.

1) Additional indexing

It is best to consider the filters on the main selection tables first, since very often the entire query is built around one or two main tables (applications-people-operations) with a standard set of conditions (IsClosed, Canceled, Enabled, Status). It is important to create the corresponding indexes for the identified selection patterns.

This solution makes sense when filtering on these fields significantly limits the set returned to the query.

For example, we have 500,000 applications, but only 2,000 of them are active. A correctly chosen index will then save us from an INDEX SCAN over the large table and will let us select the data quickly through a non-clustered index.
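
A minimal sketch of such an index (the dbo.Applications table is hypothetical, following the status flags named above):

-- Composite index over the standard condition fields.
CREATE NONCLUSTERED INDEX IX_Applications_StatusFlags
    ON dbo.Applications (IsClosed, Canceled, Enabled);

-- For a small "active" slice, a filtered index is often even better:
-- it covers only the ~2,000 active rows out of 500,000.
CREATE NONCLUSTERED INDEX IX_Applications_Active
    ON dbo.Applications (Id)
    WHERE IsClosed = 0 AND Canceled = 0 AND Enabled = 0;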

Also, the lack of indexes can be detected by analyzing query plans or by collecting statistics from the MS SQL Server system views:

  1. sys.dm_db_missing_index_groups
  2. sys.dm_db_missing_index_group_stats
  3. sys.dm_db_missing_index_details

All of these views contain information about missing indexes, with the exception of spatial indexes.
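
For example, a diagnostic query over these views might look like this (a sketch; the ranking formula is just one common convention):

SELECT TOP (20)
    mid.statement AS [table],
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_groups AS mig
INNER JOIN sys.dm_db_missing_index_group_stats AS migs
    ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details AS mid
    ON mid.index_handle = mig.index_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;  -- rank by expected usefulness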

However, indexes and caching are often ways of fighting the consequences of poorly written LINQ and SQL queries.

As practice shows, it is often important for the business to implement features by a certain date, so heavy queries are frequently moved to the background with caching.

This is partly justified: the user does not always need the very latest data, and the responsiveness of the user interface stays at an acceptable level.

This approach lets you meet business needs, but it ultimately reduces the efficiency of the information system and merely postpones solving the problems.

It is also worth remembering that while searching for the new indexes that need to be added, the suggestions of the MS SQL optimizer may be incorrect, including under the following conditions:

  •     if indexes with a similar set of fields already exist
  •     if the fields in the table cannot be indexed due to indexing restrictions.

2) Merging attributes into one new attribute

Sometimes several fields of the same table, over which a group of conditions applies, can be replaced by one new field.

This is especially true for state fields, which are usually of bit or integer type.

Example:

IsClosed = 0 AND Canceled = 0 AND Enabled = 0 is replaced by Status = 1.

Here the integer attribute Status is introduced and populated from the existing status fields in the table. The next step is to index this new attribute.

This is a fundamental solution to the performance problem, because we query the data without unnecessary computations.
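
A minimal migration sketch for this replacement (the dbo.Applications table is hypothetical; the Status encoding follows the example above):

-- Add the combined attribute and backfill it from the three flags.
ALTER TABLE dbo.Applications ADD Status int NOT NULL DEFAULT 0;
GO

UPDATE dbo.Applications
SET Status = 1
WHERE IsClosed = 0 AND Canceled = 0 AND Enabled = 0;
GO

-- The next step from the text: index the new attribute.
CREATE NONCLUSTERED INDEX IX_Applications_Status
    ON dbo.Applications (Status);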

3) View materialization

Unfortunately, LINQ queries cannot directly use temporary tables, CTEs, and table variables.

However, there is another optimization available in this case – indexed views.

A group of conditions like the one from the example above, IsClosed = 0 AND Canceled = 0 AND Enabled = 0 (or a set of other similar conditions), is a good candidate for an indexed view, caching a small slice of data from a large set.

But there are a number of limitations when materializing a view:

  •     subqueries and EXISTS clauses must be replaced with JOINs
  •     you cannot use UNION, UNION ALL, EXCEPT, INTERSECT clauses
  •     you cannot use table hints and OPTION clauses
  •     loops cannot be used
  •     a single view cannot present data from different databases

It is important to remember that the real benefit of an indexed view is obtained only by actually indexing it.
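
As a minimal sketch (the dbo.Applications table and its columns are hypothetical, following the earlier example; the MAT_VIEW name matches the wrapper shown below): the view must be created with SCHEMABINDING, and it is the unique clustered index that actually materializes it.

CREATE VIEW dbo.MAT_VIEW
WITH SCHEMABINDING
AS
SELECT Id, Status
FROM dbo.Applications    -- two-part names are required under SCHEMABINDING
WHERE IsClosed = 0 AND Canceled = 0 AND Enabled = 0;
GO

-- Until this index exists, the view is just a stored query, not a materialized one.
CREATE UNIQUE CLUSTERED INDEX IX_MAT_VIEW ON dbo.MAT_VIEW (Id);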

But when the view is queried, these indexes may not be used; to use them explicitly, you must specify WITH (NOEXPAND).

Since table hints cannot be specified in LINQ queries, you have to create one more view – a “wrapper” of the following form:

CREATE VIEW VIEW_NAME AS SELECT * FROM MAT_VIEW WITH (NOEXPAND);

4) Using table functions

Often in LINQ queries, large blocks of subqueries, or blocks that use views with a complex structure, produce a final query with a very convoluted and suboptimal execution structure.

Key benefits of using table functions in LINQ queries:

  •     The ability, as with views, to use and reference the function as an object, while also passing a set of input parameters:

    FROM FUNCTION(@param1, @param2 …)

    As a result, flexible data selection can be achieved.

  •     In the case of a table function, there are no such strong restrictions as in the case of the indexed views described above:

  •     Table hints: through LINQ you cannot specify which indexes to use or define the data isolation level for the query, but a function has these capabilities. With a function, you can achieve a fairly stable query execution plan, in which the rules for working with indexes and the data isolation levels are fixed (see the sketch after this list).

  •     Compared with indexed views, using a function makes it possible to get:
  1.  complex data selection logic (up to the use of loops)
  2.  data fetched from many different tables
  3.  the use of UNION and EXISTS

  •     The OPTION clause is very useful when we need to control the degree of parallelism via OPTION (MAXDOP N) or the order of the query execution plan. For instance:
  1.  you can force the query plan to be re-created with OPTION (RECOMPILE)
  2.  you can force the query plan to preserve the join order specified in the query with OPTION (FORCE ORDER)

  •     Using the narrowest and most relevant data slice:

    There is no need to keep large data sets in caches (as is the case with indexed views) from which the data still has to be filtered by a parameter.

    For example, suppose there is a table that uses three fields (a, b, c) in its WHERE filter.

    For all queries there is a conditionally constant condition a = 0 AND b = 0, while the query on the c field is more variable.

    Suppose the condition a = 0 AND b = 0 really limits the required set to thousands of records, while the condition on c narrows the selection down to hundreds of records.
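
As an illustration, here is a sketch in plain T-SQL of hints that LINQ cannot emit, but hand-written SQL (such as a function body, per the above) can; the index name IX_Questions_AnswerId is hypothetical:

SELECT q.Id
FROM [dbo].[Questions] AS q WITH (INDEX(IX_Questions_AnswerId), NOLOCK)  -- table hints: force an index, relax isolation
WHERE q.[AnswerId] IS NULL
OPTION (MAXDOP 1, RECOMPILE);  -- limit parallelism; re-create the plan on each execution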

Examples

Let’s consider an example implementation based on a Questions database.

There is a SELECT query that joins several tables and uses one view (OperativeQuestions), which checks via EXISTS whether, for the given email, the questions belong to the “active questions” ([OperativeQuestions]):

(@p__linq__0 nvarchar(4000))SELECT
1 AS [C1],
[Extent1].[Id] AS [Id],
[Join2].[Object_Id] AS [Object_Id],
[Join2].[ObjectType_Id] AS [ObjectType_Id],
[Join2].[Name] AS [Name],
[Join2].[ExternalId] AS [ExternalId]
FROM [dbo].[Questions] AS [Extent1]
INNER JOIN (SELECT [Extent2].[Object_Id] AS [Object_Id],
[Extent2].[Question_Id] AS [Question_Id], [Extent3].[ExternalId] AS [ExternalId],
[Extent3].[ObjectType_Id] AS [ObjectType_Id], [Extent4].[Name] AS [Name]
FROM [dbo].[ObjectQuestions] AS [Extent2]
INNER JOIN [dbo].[Objects] AS [Extent3] ON [Extent2].[Object_Id] = [Extent3].[Id]
LEFT OUTER JOIN [dbo].[ObjectTypes] AS [Extent4] 
ON [Extent3].[ObjectType_Id] = [Extent4].[Id] ) AS [Join2] 
ON [Extent1].[Id] = [Join2].[Question_Id]
WHERE ([Extent1].[AnswerId] IS NULL) AND (0 = [Extent1].[Exp]) AND ( EXISTS (SELECT
1 AS [C1]
FROM [dbo].[OperativeQuestions] AS [Extent5]
WHERE (([Extent5].[Email] = @p__linq__0) OR (([Extent5].[Email] IS NULL) 
AND (@p__linq__0 IS NULL))) AND ([Extent5].[Id] = [Extent1].[Id])
));

The view has a rather complicated structure: it contains subquery joins and uses DISTINCT, which in the general case is a rather resource-intensive operation.

The selection from OperativeQuestions returns about ten thousand records.

The main problem with this query is that for the records of the outer query, an inner subquery is executed against the [OperativeQuestions] view, which for [Email] = @p__linq__0 should limit the output selection (via EXISTS) to hundreds of records.

It might seem that the subquery should compute the records for [Email] = @p__linq__0 once, after which these couple of hundred records would be joined by Id with Questions, and the query would be fast.

In reality, all the tables are joined sequentially: the Id of Questions and the Id from OperativeQuestions are matched, and then the filter on Email is applied.

So in fact the query works through all the tens of thousands of OperativeQuestions records, even though only the data for the given Email is needed.

OperativeQuestions view text:

 
CREATE VIEW [dbo].[OperativeQuestions]
AS
SELECT DISTINCT CQ.Id, USR.email AS Email
FROM [dbo].Questions AS CQ
INNER JOIN [dbo].ProcessUserAccesses AS BPU ON BPU.ProcessId = CQ.Process_Id
OUTER APPLY
    (SELECT 1 AS HasNoObjects
     WHERE NOT EXISTS
         (SELECT 1
          FROM [dbo].ObjectUserAccesses AS BOU
          WHERE BOU.ProcessUserAccessId = BPU.[Id] AND BOU.[To] IS NULL)
    ) AS BO
INNER JOIN [dbo].Users AS USR ON USR.Id = BPU.UserId
WHERE CQ.[Exp] = 0 AND CQ.AnswerId IS NULL AND BPU.[To] IS NULL
    AND (BO.HasNoObjects = 1
        OR EXISTS (SELECT 1
                   FROM [dbo].ObjectUserAccesses AS BOU
                   INNER JOIN [dbo].ObjectQuestions AS QBO ON QBO.[Object_Id] = BOU.ObjectId
                   WHERE BOU.ProcessUserAccessId = BPU.Id
                         AND BOU.[To] IS NULL AND QBO.Question_Id = CQ.Id));

In this particular case, we consider a solution to this problem without infrastructural changes, i.e., without introducing a separate table of ready-made results (“active questions”), which would require a mechanism for filling it with data and keeping it up to date.

Although this is a good solution, there is another option for optimizing this task.

The main goal is to cache the records for [Email] = @p__linq__0 from the OperativeQuestions view.

We introduce the table function [dbo].[OperativeQuestionsUserMail] into the database.

Passing Email as the input parameter, we get back a table of values:

CREATE FUNCTION [dbo].[OperativeQuestionsUserMail]
(
    @Email nvarchar(4000)
)
RETURNS @tbl TABLE
(
    [Id]    uniqueidentifier,
    [Email] nvarchar(4000)
)
AS
BEGIN
    -- Cache only the narrow slice of OperativeQuestions for the given Email.
    INSERT INTO @tbl ([Id], [Email])
    SELECT Id, @Email
    FROM [OperativeQuestions] AS [x]
    WHERE [x].[Email] = @Email;

    RETURN;
END
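
A quick way to sanity-check the function from SSMS (the email value is illustrative):

SELECT [Id], [Email]
FROM [dbo].[OperativeQuestionsUserMail](N'user@example.com');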

This returns a table of values with a predefined data structure.

For queries against OperativeQuestionsUserMail to be optimal and to have optimal query plans, a strict structure is required rather than RETURNS TABLE AS RETURN…

In this case, the original query (Query 1) is transformed into Query 4:

(@p__linq__0 nvarchar(4000))SELECT
1 AS [C1],
[Extent1].[Id] AS [Id],
[Join2].[Object_Id] AS [Object_Id],
[Join2].[ObjectType_Id] AS [ObjectType_Id],
[Join2].[Name] AS [Name],
[Join2].[ExternalId] AS [ExternalId]
FROM (
    SELECT Id, Email FROM [dbo].[OperativeQuestionsUserMail] (@p__linq__0)
) AS [Extent0]
INNER JOIN [dbo].[Questions] AS [Extent1] ON([Extent0].Id=[Extent1].Id)
INNER JOIN (SELECT [Extent2].[Object_Id] AS [Object_Id], [Extent2].[Question_Id] AS [Question_Id], [Extent3].[ExternalId] AS [ExternalId], [Extent3].[ObjectType_Id] AS [ObjectType_Id], [Extent4].[Name] AS [Name]
FROM [dbo].[ObjectQuestions] AS [Extent2]
INNER JOIN [dbo].[Objects] AS [Extent3] ON [Extent2].[Object_Id] = [Extent3].[Id]
LEFT OUTER JOIN [dbo].[ObjectTypes] AS [Extent4] 
ON [Extent3].[ObjectType_Id] = [Extent4].[Id] ) AS [Join2] 
ON [Extent1].[Id] = [Join2].[Question_Id]
WHERE ([Extent1].[AnswerId] IS NULL) AND (0 = [Extent1].[Exp]);

The execution time dropped from 200-800 ms to 2-20 ms, i.e., tens of times faster.

Taking more averaged values, instead of 350 ms we got 8 ms.

Among the obvious advantages, we also get:

  •     an overall reduction in read load
  •     a significantly reduced probability of blocking
  •     a reduction of the average blocking time to acceptable values

Conclusion

Optimization and fine-tuning of calls to an MS SQL database through LINQ is a solvable problem.

Attentiveness and consistency are very important in this work.

At the beginning of the process:

  •     check the data the query works with (values, selected data types)
  •     index this data properly
  •     check the correctness of the join conditions between tables

The next iteration of optimization reveals:

  •     the basis of the query and its main filter
  •     repeated similar query blocks, whose intersecting conditions are analyzed
  •     in SSMS or another GUI for SQL Server, the SQL query itself is optimized (allocating an intermediate data store, building the resulting query around this store – there may be several)
  •     at the last stage, taking the resulting SQL query as the basis, the structure of the LINQ query is rebuilt

As a result, the resulting LINQ query should become identical in structure to the optimal SQL query identified in the third step.