Advice on how to improve query efficiency when filtering on nvarchar(4000)

Hello,

I have a table with 31 million records. So far, most queries return within 2-4 seconds, but there is one important query that takes 14 seconds.

There is a field called FEDERAL_WASTE_CODES. 75% of the time it contains 3 or fewer waste codes (e.g. D001, F002, F003), and 95% of the time it contains 10 or fewer (e.g. D004, D005, D006, D007, D008, D009, D010, D011).

However, there are times it contains 40+ waste codes (see the end of this post). When I filter for a specific waste code (LIKE '%K061%') the query becomes too slow. I did not index this field.

What are your recommendations? Thank you!

This query takes ~14 seconds when run from C#:
Code:
return await dbContext.ManifestLines
    .Where(x => x.SHIPPED_DATE.Year == 2024
             && x.GENERATOR_LOCATION_STATE == "NY"
             && x.FEDERAL_WASTE_CODES != null
             && x.FEDERAL_WASTE_CODES.Contains("K061"))
    .ToListAsync();
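For context on why this is slow: EF translates that Contains into a wildcard string search, roughly equivalent to the sketch below (the exact SQL varies by EF Core version and may use LIKE or CHARINDEX, but either way every row in the date/state range must be scanned; the table name assumes ManifestLines maps to tblManifestDetails):
Code:
SELECT *
FROM dbo.tblManifestDetails
WHERE YEAR(SHIPPED_DATE) = 2024                 -- mirrors .Year == 2024
  AND GENERATOR_LOCATION_STATE = N'NY'
  AND FEDERAL_WASTE_CODES IS NOT NULL
  AND FEDERAL_WASTE_CODES LIKE N'%K061%';       -- leading % defeats any index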


D001, D004, D005, D006, D007, D008, D009, D010, D011, D012, D013, D014, D015, D016, D017, D018, D019, D020, D021, D022, D023, D024, D025, D026, D027, D028, D029, D030, D031, D032, D033, D034, D035, D036, D037, D038, D039, D040, D041, D042, D043, F007, F008, F009, F010, F011, F012, F019, F024, F025, F037, F038, F039, K001, K002, K003, K004, K005, K006, K007, K008, K009, K010, K011, K013, K014, K015, K016, K017, K018, K019, K020, K021, K022, K023, K024, K028, K029, K030, K031, K032, K033, K034, K035, K036, K037, K038, K040, K041, K042, K043, K044, K045, K046, K047, K048, K049, K050, K051, K052, K060, K061, K062, K069, K071, K073, K083, K084, K085, K086, K087, K088, K093, K094, K095, K096, K097, K098, K099, K100, K101, K102, K103, K104, K105, K106, K111, K115, K117, K118, K131, K132, K136, K141, K142, K143, K144, K145, K147, K148, K149, K150, K151, K156, K157, K158, K159, K161, P004, P006, P009, P010, P011, P012, P013, P020, P021, P022, P024, P029, P030, P031, P033, P034, P036, P037, P038, P039, P047, P048, P050, P051, P056, P059, P060, P063, P065, P068, P071, P073, P074, P077, P081, P082, P089, P092, P094, P096, P097, P098, P099, P101, P103, P104, P105, P106, P110, P112, P113, P114, P115, P119, P120, P121, P122, P123, P127, P128, P185, P188, P189, P190, P191, P192, P194, P196, P197, P198, P199, P201, P202, P203, P204, P205, U005, U009, U012, U018, U019, U022, U023, U024, U025, U027, U028, U029, U030, U031, U032, U036, U037, U039, U043, U044, U045, U047, U048, U050, U051, U052, U057, U060, U061, U063, U066, U067, U068, U069, U070, U071, U072, U075, U076, U077, U078, U079, U080, U081, U082, U083, U084, U086, U088, U096, U098, U099, U101, U102, U103, U105, U106, U107, U108, U109, U111, U112, U115, U117, U118, U120, U121, U124, U127, U128, U129, U130, U131, U133, U134, U135, U136, U137, U138, U140, U141, U142, U144, U145, U146, U151, U152, U154, U155, U157, U158, U159, U160, U161, U162, U165, U169, U170, U172, U174, U179, U180, U181, U183, U184, U185, U187, U188, U189, U190, U192, U196, U203, U204, U205, U207, U208, U209, U210, U211, U214, U215, U216, U217, U220, U225, U226, U227, U228, U235, U239, U240, U243, U246, U247, U249, U271, U278, U279, U280, U364, U367, U372, U373, U387, U389, U394, U395, U404, U409, U410, U411
 
I would normalize this data after import. The new table would have an FK pointing to the parent record, plus one waste code per row; a sketch follows.
Yes, you may end up with 100M records in this table, but now you can index it and retrieval will be so fast your head will spin :)
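A minimal sketch of that child table (the names are assumptions, chosen to match the tblManifestDetailsFWC table that appears later in this thread):
Code:
-- Hypothetical child table: one row per (manifest line, waste code) pair.
CREATE TABLE dbo.tblManifestDetailsFWC (
    ManifestLineID   INT         NOT NULL,   -- FK to the parent record
    FederalWasteCode VARCHAR(10) NOT NULL,   -- one waste code per row
    CONSTRAINT PK_tblManifestDetailsFWC
        PRIMARY KEY (ManifestLineID, FederalWasteCode)
);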
 
For SELECT queries, indexing the filter / join fields will significantly improve performance. From my own tests, anything up to 250 times faster.
 
Why are you using LIKE? And why is the field not indexed? Mind you, the leading % will negate the use of an index anyway.
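To illustrate the point about the leading wildcard (table and column names are taken from later in the thread; the index itself is hypothetical):
Code:
-- Assuming an index existed on FEDERAL_WASTE_CODES:

-- Sargable: the B-tree can seek straight to entries starting with 'K061'.
SELECT ID FROM dbo.tblManifestDetails
WHERE FEDERAL_WASTE_CODES LIKE 'K061%';

-- Not sargable: the leading wildcard forces a scan of every row,
-- because the index is ordered from the left of the string.
SELECT ID FROM dbo.tblManifestDetails
WHERE FEDERAL_WASTE_CODES LIKE '%K061%';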
 
I did not index this field.

What are your recommendations?
Index that field. String matching is also expensive in queries. You could store all those waste codes in a lookup table: convert the string values to their corresponding integer lookup IDs and query on the ID for K061, e.g. "K061" = integer ID 120.
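A hedged sketch of that lookup approach (all names are illustrative, and the ID value 120 is just the example above, not a real assignment; this variant of the normalized child table stores the integer ID instead of the code string):
Code:
-- Hypothetical lookup table mapping each code string to an integer ID.
CREATE TABLE dbo.tblWasteCodes (
    WasteCodeID INT IDENTITY(1,1) PRIMARY KEY,
    WasteCode   VARCHAR(10) NOT NULL UNIQUE
);

-- Variant of the child table that stores the integer ID, not the string.
CREATE TABLE dbo.tblManifestDetailsFWC_ByID (
    ManifestLineID INT NOT NULL,
    WasteCodeID    INT NOT NULL REFERENCES dbo.tblWasteCodes,
    PRIMARY KEY (ManifestLineID, WasteCodeID)
);

-- Filtering by the integer ID rather than by string matching:
SELECT f.ManifestLineID
FROM dbo.tblManifestDetailsFWC_ByID f
JOIN dbo.tblWasteCodes c ON c.WasteCodeID = f.WasteCodeID
WHERE c.WasteCode = 'K061';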
 
For SELECT queries, indexing the filter / join fields will significantly improve performance. From my own tests, anything up to 250 times faster.
The performance gain grows with table size rather than being a fixed multiple: an index seek is roughly O(log n), where a full scan is O(n).

But the OP said: "When I filter for a specific waste code"
And that code can be in the middle of the field value. Indexing won't help (a B-tree index matches from the left of the string).
 
The performance gain grows with table size rather than being a fixed multiple: an index seek is roughly O(log n), where a full scan is O(n).

But the OP said: "When I filter for a specific waste code"
And that code can be in the middle of the field value. Indexing won't help (a B-tree index matches from the left of the string).
Correct. The user will be filtering this at runtime. I will be building the WHERE clause with the criteria they provide.

I guess I will normalize unless other options are presented. I was hoping it wouldn't come to this. :)

Thank you.
 
For SELECT queries, indexing the filter / join fields will significantly improve performance. From my own tests, anything up to 250 times faster.
Yes, the join fields are both indexed. I get good performance when I filter on any other field.
 
Another option would be a full-text index, but I very much doubt it will perform better; it was created to solve a different problem.
There *may* also be a way to run an XPath query, but again, it's hard to see how that would perform better.
 
I have a table with 31 million records. ... When I filter for a specific waste code (LIKE '%K061%') the query becomes too slow. I did not index this field. What are your recommendations?
Make a full-text index and use CONTAINS. If you can afford the cost of the index, broadly speaking, CONTAINS can be lightning quick in cases where LIKE was slow as molasses.
Interestingly, and perhaps annoyingly, CONTAINS is a deep subject. You'd think it's a simple page of documentation you could digest in 5 minutes. No: CONTAINS can be used in many different ways, and the sum total of what you can do with it is staggeringly huge. I've only scratched the surface and used it in the simplest of ways, but it has proven its value to me many times in the right circumstances. By itself its capabilities are equal to an entire suite of Python (or whatever) text-mining libraries.
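A minimal sketch of setting that up on this table (the catalog and key-index names are assumptions; a full-text index requires an existing unique, non-nullable key index on the table):
Code:
-- One-time setup. PK_tblManifestDetails is assumed to be the table's
-- unique key index; substitute the real name.
CREATE FULLTEXT CATALOG ftWasteCodes AS DEFAULT;

CREATE FULLTEXT INDEX ON dbo.tblManifestDetails (FEDERAL_WASTE_CODES)
    KEY INDEX PK_tblManifestDetails;

-- The filter then becomes a word lookup instead of LIKE '%K061%'.
-- The word breaker splits on the commas, so each code is its own token.
SELECT ID
FROM dbo.tblManifestDetails
WHERE CONTAINS(FEDERAL_WASTE_CODES, 'K061');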
 
Make a full-text index and use CONTAINS. ...
Index an nvarchar(4000) field?
 
It took ~20 minutes to run a simple update query. Is this normal for 31 million records? I've never worked with tables over 100K records. New experience for me.
 
If the fields being updated are indexed, then the indexes also need updating, which, when it involves millions of records, can take a considerable amount of extra time.

When doing enormous updates, it can be beneficial to drop the indexes, run the update, and then recreate them, as sketched below.
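A rough illustration of that pattern (the index, table, and column names here are placeholders, not the OP's schema):
Code:
-- 1. Drop the index so the bulk update doesn't maintain it row by row.
DROP INDEX IX_SomeColumn ON dbo.tblSomeTable;

-- 2. Run the enormous update (stand-in expression shown).
UPDATE dbo.tblSomeTable
SET SomeColumn = UPPER(SomeColumn);

-- 3. Recreate the index in one pass, which is far cheaper than
--    millions of individual index updates.
CREATE NONCLUSTERED INDEX IX_SomeColumn
    ON dbo.tblSomeTable (SomeColumn);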
 
Index an nvarchar(4000) field?
Yes. It's called a full-text index, and it's meant especially for your situation.

In fact, your column is on the small side; people usually put one on an nvarchar(max) column, but you can certainly use one at 4,000.
 
I thought about suggesting a full-text index on this field, but like Tom said, I'm not convinced it would be all that helpful in this scenario because it's intended for a different kind of problem. It couldn't hurt to try, if only for your own learning.
 
Not sure I agree with that. It's intended exactly for this situation: to search text effectively and efficiently beyond what LIKE can do.
 
If the fields being updated are indexed, then the indexes also need updating, which, when it involves millions of records, can take a considerable amount of extra time.

When doing enormous updates, it can be beneficial to drop the indexes, run the update, and then recreate them.
The field wasn't being indexed and it still took 20 minutes.

I learned about ALTER DATABASE UMA SET RECOVERY BULK_LOGGED;

I am about to normalize the FEDERAL_WASTE_CODES field. I hope this takes under 3 hours.

Code:
-- 1. Switch to BULK_LOGGED recovery
ALTER DATABASE UMA SET RECOVERY BULK_LOGGED;

-- 2. Run the normalization stored procedure
EXEC dbo.NormalizeManifestDetailsFWC_Batched;

-- 3. Switch back to FULL recovery
ALTER DATABASE UMA SET RECOVERY FULL;


Code:
USE [UMA]
GO
/****** Object:  StoredProcedure [dbo].[NormalizeManifestDetailsFWC_Batched]    Script Date: 2025/09/24 9:53:04 AM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[NormalizeManifestDetailsFWC_Batched]
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @BatchSize INT = 250000;
    DECLARE @MinID INT, @MaxID INT, @UpperBound INT;

    -- Drop the unique index if it exists
    IF EXISTS (
        SELECT 1
        FROM sys.indexes
        WHERE name = 'IX_tblManifestDetailsFWC_NaturalKey'
          AND object_id = OBJECT_ID('dbo.tblManifestDetailsFWC')
    )
    BEGIN
        DROP INDEX IX_tblManifestDetailsFWC_NaturalKey ON dbo.tblManifestDetailsFWC;
        PRINT 'Dropped index IX_tblManifestDetailsFWC_NaturalKey';
    END

    -- Start fresh
    TRUNCATE TABLE dbo.tblManifestDetailsFWC;

    -- Find ID bounds
    SELECT
        @MinID = MIN(ID),
        @MaxID = MAX(ID)
    FROM dbo.tblManifestDetails
    WHERE FEDERAL_WASTE_CODES IS NOT NULL;

    SET @UpperBound = @MinID + @BatchSize - 1;

    WHILE @MinID <= @MaxID
    BEGIN
        ;WITH Batch AS (
            SELECT ID, FEDERAL_WASTE_CODES
            FROM dbo.tblManifestDetails
            WHERE ID BETWEEN @MinID AND @UpperBound
              AND FEDERAL_WASTE_CODES IS NOT NULL
        ),
        Tokens AS (
            SELECT
                d.ID AS ManifestLineID,
                UPPER(LTRIM(RTRIM(s.value))) AS FederalWasteCode
            FROM Batch d
            CROSS APPLY STRING_SPLIT(d.FEDERAL_WASTE_CODES, ',') s
        )
        INSERT INTO dbo.tblManifestDetailsFWC (ManifestLineID, FederalWasteCode)
        SELECT DISTINCT ManifestLineID, FederalWasteCode
        FROM Tokens
        WHERE FederalWasteCode <> '';

        PRINT CONCAT('Processed IDs ', @MinID, ' to ', @UpperBound);

        -- Advance to next batch
        SET @MinID = @UpperBound + 1;
        SET @UpperBound = @MinID + @BatchSize - 1;
    END

    -- Recreate the unique index
    CREATE UNIQUE NONCLUSTERED INDEX IX_tblManifestDetailsFWC_NaturalKey
        ON dbo.tblManifestDetailsFWC (ManifestLineID ASC, FederalWasteCode ASC);

    PRINT 'Recreated index IX_tblManifestDetailsFWC_NaturalKey';
END
 
Not sure I agree with that. It's intended exactly for this situation: to search text effectively and efficiently beyond what LIKE can do.
I am going to assess the results of the normalization, but I will check this out. I often encounter codes that are comma-separated in a single field.

Thank you for providing an alternative suggestion. This is why I made the original post... to get different approaches on how to solve this.
 
Good News!

Normalization took under 18 minutes. I believe the BULK_LOGGED recovery model made the difference in speed.

The normalized table has 100,158,139 records.

The LINQ C# query below took under 4 seconds, which is decent.

Thank you for everyone's advice!

Code:
return await dbContext.ManifestLines
    .Where(x => x.SHIPPED_DATE.Year == 2024
             && x.GENERATOR_LOCATION_STATE == "NY"
             && x.FederalWasteCode == "K061")
    .ToListAsync();
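For reference, a sketch of the SQL this roughly corresponds to, assuming ManifestLines maps to tblManifestDetails and EF joins through the new child table (the exact translation is EF's, so this is illustrative only):
Code:
SELECT d.*
FROM dbo.tblManifestDetails d
JOIN dbo.tblManifestDetailsFWC f ON f.ManifestLineID = d.ID
WHERE YEAR(d.SHIPPED_DATE) = 2024          -- mirrors .Year == 2024
  AND d.GENERATOR_LOCATION_STATE = N'NY'
  AND f.FederalWasteCode = 'K061';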

 
> -- Recreate the unique index
> CREATE UNIQUE NONCLUSTERED INDEX IX_tblManifestDetailsFWC_NaturalKey
>     ON dbo.tblManifestDetailsFWC (ManifestLineID ASC, FederalWasteCode ASC);

What I don't see here is an index on FederalWasteCode, so your lookup "WHERE FederalWasteCode = 'K061'" is not using this index. You can prove that by running the query in SSMS (bypassing for a moment any smarts that LINQ might add) and turning on the Execution Plan.

Mind you, an index on ManifestLineID is important for joining with the parent table, but you need an additional index along the lines sketched below.
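A hedged sketch of that additional index (the name is made up; whether to key or merely INCLUDE ManifestLineID depends on the workload):
Code:
-- Lets the K061 filter seek on the code, with ManifestLineID
-- carried along for the join back to the parent table.
CREATE NONCLUSTERED INDEX IX_tblManifestDetailsFWC_Code
    ON dbo.tblManifestDetailsFWC (FederalWasteCode, ManifestLineID);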
 
