You need to update one or more columns in a table with millions of records with data from another table, what’s a good way of going this with low risk and in an optimized way? One of my favorite ways of dealing with millions of records in a table is processing inserts, deletes, or updates in batches.
Updating data in batches of 10,000 records at a time and using a transaction is a simple and efficient way of performing updates on millions of records. The key is to batch out your data to allow you to process a smaller number of records at a time.
Example:
use DatabaseName
go
DECLARE @batchId INT
DECLARE @batchSize INT
DECLARE @results INT
SET @results = 1
SET @batchSize = 10000
SET @batchId = 0
-- when 0 rows returned, exit the loop
WHILE (@results > 0)
BEGIN
BEGIN TRAN;
UPDATE Table1 SET columnA = Table2.columnA
FROM Table2
INNER JOIN Table1 ON Table2.Id = Table1.Id
WHERE Table1.columnA is null
AND (Table2.Id > @batchId
AND Table2.Id <= @batchId + @batchSize)
SET @results = @@ROWCOUNT
-- next batch
SET @batchId = @batchId + @batchSize
COMMIT TRAN;
END
-- the sql below is just to measure the performance of this update, it is not needed to update your data.
DECLARE @startTime DATETIME
SELECT @startTime = GETDATE()
SET STATISTICS PROFILE ON
SELECT COUNT(*)
FROM Table1 first OPTION (RECOMPILE)
SET STATISTICS PROFILE OFF
SELECT 'It took ' + CONVERT(VARCHAR(8), DATEDIFF(ms, @startTime, GETDATE())) + ' ms'
GO
In the query above we declare the following variables to do the batching:
- @results – this variable will hold the number of records updated; when zero, the query will stop. This is to avoid an infinite loop.
- @batchId – this is set to zero initially, and it is used to compare the table id against it, and after each update, it is set to the id plus the batch size. This allows splitting the millions of records into batches.
- @batchSize – the number of records to update at a time.
It is important to know that the above algorithm will only work when your table’s primary keys are of type int. If your table’s primary keys are guids, the above will not work.
The section at the end of the query is only used to review the performance of your updates, it is not necessary to update the data and it is a nice way to see how the query performs overall.
I hope the information in this post is helpful, and it helps you learn a simple way to update millions of records in batches, making it easier and lower risk than attempting to do it all at once.
If instead of updating you need to efficiently delete millions of records, please read this post.
Leave a Reply