How to Find Duplicate Files in SharePoint Online? A Comprehensive Guide

Published By Raj Kumar
Approved By Anuraag Singh
Published On October 8th, 2024
Reading Time 8 Minutes

Summary: Wondering how to find duplicate files in SharePoint Online? This write-up explains how to find and delete duplicate files in SharePoint Online.

When SharePoint is used to track the progress of a project, multiple team members usually access it, and they may unintentionally upload the same files to the document library.

As a result, duplicate files clutter your SharePoint Online environment and consume valuable storage space. You might also find that SharePoint search stops working as expected. It’s essential to regularly identify and remove these duplicate files to maintain a clean and organized workspace.

In this comprehensive guide, we’ll explore various methods to find and remove duplicate files in SharePoint Online. So, let’s start from the beginning.

Different Methods to Find Identical Files in SharePoint Online

To manage large lists and libraries in SharePoint Online efficiently, you need to find and delete duplicate files. Microsoft does not offer a SharePoint duplicate analysis tool, but you can find duplicate files in SharePoint Online either by manual search or with PowerShell scripts. Manual search takes a lot of time to filter out duplicates, whereas a PowerShell script automates the search for duplicate files and folders.

Manually Searching Duplicate Files in SharePoint Online

To identify duplicate files in SharePoint manually:

Open the document library and start reviewing the files thoroughly. Check the file name, size, and modified time, and then compare them with the other files. You can also open the documents to ensure content similarity.

SharePoint also has an advanced search option that filters files by several factors, including file size, date, and author.

PowerShell Script to Find Duplicate Files in SharePoint Online

Run the script below to search for duplicate data in SharePoint.

#Load SharePoint CSOM Assemblies
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"

#Parameters
$URLofSite = ""
$CSVLoc = "C:\Temp\Duplicates.csv"
$BatchSize = 1000
#Array for Result Data
$DataCollection = @()

#Get credentials to connect
$Credt = Get-Credential

Try {
#Setup the Context
$Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($URLofSite)
$Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Credt.UserName, $Credt.Password)

#Get the Web
$Web = $Ctx.Web
$ListsContainer = $Web.Lists
$Ctx.Load($Web)
$Ctx.Load($ListsContainer)
$Ctx.ExecuteQuery()

ForEach($List in $ListsContainer)
{

If($List.BaseType -eq "DocumentLibrary" -and $List.Hidden -eq $False -and $List.ItemCount -gt 0 -and $List.Title -Notin("Site Pages","Style Library", "Preservation Hold Library"))
{

$Query = New-Object Microsoft.SharePoint.Client.CamlQuery
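#Paged query over every item in the library, $BatchSize items per request
$Query.ViewXml = "<View Scope='RecursiveAll'><Query><OrderBy><FieldRef Name='ID' Ascending='TRUE'/></OrderBy></Query><RowLimit Paged='TRUE'>$BatchSize</RowLimit></View>"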

$Count = 1

Do {
$ListItems = $List.GetItems($Query)
$Ctx.Load($ListItems)
$Ctx.ExecuteQuery()

ForEach($Item in $ListItems)
{
#Filter Files
If($Item.FileSystemObjectType -eq "File")
{
#Get the File from Item
$File = $Item.File
$Ctx.Load($File)
$Ctx.ExecuteQuery()
Write-Progress -PercentComplete ($Count / $List.ItemCount * 100) -Activity "File Processing $Count of $($List.ItemCount) in $($List.Title) of $($Web.URL)" -Status "Scan Files '$($File.Name)'"

#Get The File Hash
$Bytes = $File.OpenBinaryStream()
$Ctx.ExecuteQuery()
$MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

#Collect data
$Data = New-Object PSObject
$Data | Add-Member -MemberType NoteProperty -name "NameofFile" -value $File.Name
$Data | Add-Member -MemberType NoteProperty -Name "FileHashCode" -value $HashCode
$Data | Add-Member -MemberType NoteProperty -Name "URLoFile" -value $File.ServerRelativeUrl
$Data | Add-Member -MemberType NoteProperty -Name "SizeofFile" -value $File.Length
$DataCollection += $Data
}
$Count++
}
#Update Position of the ListItemCollectionPosition
$Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
}While($Query.ListItemCollectionPosition -ne $null)
}
}
#Export All Data to CSV
$DataCollection | Export-Csv -Path $CSVLoc -NoTypeInformation
Write-host -f Green "Files Inventory has been Exported to $CSVLoc"

#Group Based on File Hash (property names must match those added to $Data above)
$Duplicates = $DataCollection | Group-Object -Property FileHashCode | Where {$_.Count -gt 1} | Select -ExpandProperty Group
Write-host "Duplicate files according to the Hashcode:"
$Duplicates | Format-table -AutoSize

#Group Based on File Name
$DuplicatesFileName = $DataCollection | Group-Object -Property NameofFile | Where {$_.Count -gt 1} | Select -ExpandProperty Group
Write-host "Duplicates according to the File Name:"
$DuplicatesFileName| Format-table -AutoSize

#Group Based on File Size
$DuplicatesFileSize = $DataCollection | Group-Object -Property SizeofFile | Where {$_.Count -gt 1} | Select -ExpandProperty Group
Write-host "Duplicates according to the File Size:"
$DuplicatesFileSize| Format-table -AutoSize
}
Catch {
write-host -f Red "Error:" $_.Exception.Message
}
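
The core idea of the script above, hashing each file's content and grouping the results by hash, can be tried locally without a SharePoint connection. The following is a minimal sketch under stated assumptions: the helper name Get-ContentHash, the sample file names, and the sample content are all hypothetical and are not part of the original script.

```powershell
# Hypothetical helper: returns the MD5 hash of a byte array as a hex string.
function Get-ContentHash {
    param([byte[]]$Bytes)
    $md5 = [System.Security.Cryptography.MD5]::Create()
    try {
        return [System.BitConverter]::ToString($md5.ComputeHash($Bytes))
    }
    finally {
        $md5.Dispose()
    }
}

# Made-up inventory standing in for the files of a document library.
$encoding = [System.Text.Encoding]::UTF8
$inventory = @(
    [PSCustomObject]@{ Name = 'Plan.docx';      Hash = Get-ContentHash ($encoding.GetBytes('project plan')) }
    [PSCustomObject]@{ Name = 'Plan-copy.docx'; Hash = Get-ContentHash ($encoding.GetBytes('project plan')) }
    [PSCustomObject]@{ Name = 'Budget.xlsx';    Hash = Get-ContentHash ($encoding.GetBytes('budget 2024')) }
)

# Files that share a hash have identical content, whatever their names are.
$duplicateGroups = @($inventory | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 })
$duplicateGroups | ForEach-Object { $_.Group | Select-Object Name, Hash }
```

Grouping by hash is the most reliable of the three groupings in the script, because matching names or sizes alone do not prove that two files have the same content.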

PnP PowerShell Commands to Identify Duplicate Files in SharePoint Online

Find duplicate files in SharePoint Online using the below script.

#Parameters
$SiteURL = ""
$SizeofPage = 1000
$ReportResult = "C:\Temp\Duplicates.csv"

#Connect to SharePoint Online site
Connect-PnPOnline $SiteURL -Interactive

#Array to store results
$DataCollection = @()

#Get all Document libraries
$DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin("Site Pages","Style Library", "Preservation Hold Library")}

#Iterate through each document library
ForEach($Library in $DocumentLibraries)
{
#Get All documents from the library
$global:count = 0;
$Documents = Get-PnPListItem -List $Library -PageSize $SizeofPage -Fields ID, File_x0020_Type -ScriptBlock `
{ Param($items) $global:count += $items.Count; Write-Progress -PercentComplete ($global:Count / ($Library.ItemCount) * 100) -Activity `
"Getting Documents from Library '$($Library.Title)'" -Status "Getting Documents data $global:Count of $($Library.ItemCount)";} | Where {$_.FileSystemObjectType -eq "File"}

$CountItem = 0
#Iterate through each document
Foreach($Document in $Documents)
{
#Get the File from Item
$File = Get-PnPProperty -ClientObject $Document -Property File

#Get The File Hash
$Bytes = $File.OpenBinaryStream()
Invoke-PnPQuery
$MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

#Collect data
$Data = New-Object PSObject
$Data | Add-Member -MemberType NoteProperty -name "File-Name" -value $File.Name
$Data | Add-Member -MemberType NoteProperty -Name "Hash-Code" -value $HashCode
$Data | Add-Member -MemberType NoteProperty -Name "File-URL" -value $File.ServerRelativeUrl
$Data | Add-Member -MemberType NoteProperty -Name "FileSize" -value $File.Length
$DataCollection += $Data
$CountItem++
Write-Progress -PercentComplete ($CountItem / ($Library.ItemCount) * 100) -Activity "Gathering data from Documents $CountItem of $($Library.ItemCount) from $($Library.Title)" `
-Status "Reading document data '$($Document['FileLeafRef'])' at '$($Document['FileRef'])'"
}
}
#Get Duplicate Files by Grouping Hash code
$DuplicatesFiles = $DataCollection | Group-Object -Property "Hash-Code" | Where {$_.Count -gt 1} | Select -ExpandProperty Group
Write-host "Duplicate Files Based on File Hashcode:"
$DuplicatesFiles | Format-table -AutoSize

#Export the duplicate results to CSV
$DuplicatesFiles | Export-Csv -Path $ReportResult -NoTypeInformation
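
Once the report exists, it can help to see how much space each set of duplicates wastes before deciding what to remove. The following is a minimal sketch, assuming rows shaped like the script's output (columns File-Name, Hash-Code, File-URL, FileSize). The sample rows, hash values, and variable names are invented for illustration; in practice the rows could be loaded with Import-Csv -Path $ReportResult, in which case every value arrives as a string, hence the [long] cast.

```powershell
# Made-up rows in the same shape as the duplicates report.
$report = @(
    [PSCustomObject]@{ 'File-Name' = 'Plan.docx';      'Hash-Code' = 'AA-01'; 'File-URL' = '/sites/demo/Docs/Plan.docx';         FileSize = '2048' }
    [PSCustomObject]@{ 'File-Name' = 'Plan-copy.docx'; 'Hash-Code' = 'AA-01'; 'File-URL' = '/sites/demo/Archive/Plan-copy.docx'; FileSize = '2048' }
    [PSCustomObject]@{ 'File-Name' = 'Logo.png';       'Hash-Code' = 'BB-02'; 'File-URL' = '/sites/demo/Docs/Logo.png';          FileSize = '500' }
    [PSCustomObject]@{ 'File-Name' = 'Logo (1).png';   'Hash-Code' = 'BB-02'; 'File-URL' = '/sites/demo/Docs/Logo (1).png';      FileSize = '500' }
    [PSCustomObject]@{ 'File-Name' = 'Logo-old.png';   'Hash-Code' = 'BB-02'; 'File-URL' = '/sites/demo/Archive/Logo-old.png';   FileSize = '500' }
)

# One summary row per hash: keeping one copy frees (copies - 1) * size bytes.
$summary = $report | Group-Object -Property 'Hash-Code' | ForEach-Object {
    $size = [long]$_.Group[0].FileSize
    [PSCustomObject]@{
        HashCode         = $_.Name
        Copies           = $_.Count
        ReclaimableBytes = $size * ($_.Count - 1)
    }
} | Sort-Object -Property ReclaimableBytes -Descending

$summary | Format-Table -AutoSize
```

Sorting by reclaimable bytes puts the most wasteful duplicate sets first, which is usually where a cleanup should start.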

How to Delete Duplicate Files in SharePoint Online Using PowerShell?

Use these commands to delete redundant or duplicate files. Note that the script works on a local folder path, so point $sourceLoc at a local copy of your SharePoint files.

# Define the source path
$sourceLoc = "C:\Temp\New"

# Get all files with the same size
$Files = Get-ChildItem -Path $sourceLoc -File -Recurse | Sort-Object LastWriteTime -Descending | Group-Object -Property Length | Where-Object {$_.Count -gt 1}

# Group files by their hash and find duplicates
$DuplicateFiles = $Files | Select -ExpandProperty Group | Get-FileHash | Group-Object -Property Hash | Where-Object {$_.Count -gt 1}

#Delete the Duplicate files
if ($duplicateFiles.Count -eq 0) {
Write-Output "No duplicate files found."
} else {
Write-Output "Deleting duplicate files:"
$duplicateFiles | ForEach-Object {
$filesforDelete = $_.Group | Select-Object -Skip 1
$filesforDelete | ForEach-Object {
Write-Output "Deleting: $($_.Path)"
Remove-Item -Path $_.Path -Force
}
}
}
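
Because Remove-Item with -Force is irreversible, a dry run is a sensible first step. The following is a minimal sketch that repeats the same size-then-hash grouping on a throwaway folder and uses the -WhatIf switch, so nothing is actually deleted. The folder name, file names, and the $toDelete variable are assumptions made for this example.

```powershell
# Build a throwaway folder with sample files: two identical, one different.
$demoDir = Join-Path ([System.IO.Path]::GetTempPath()) ("dupe-demo-" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $demoDir | Out-Null
Set-Content -Path (Join-Path $demoDir 'a.txt') -Value 'same content'
Set-Content -Path (Join-Path $demoDir 'b.txt') -Value 'same content'
Set-Content -Path (Join-Path $demoDir 'c.txt') -Value 'different!!!'   # same size, other content

# Group by size first (cheap), then by hash (exact); skip the first file of each group.
$toDelete = @(
    Get-ChildItem -Path $demoDir -File -Recurse |
        Group-Object -Property Length | Where-Object { $_.Count -gt 1 } |
        Select-Object -ExpandProperty Group | Get-FileHash |
        Group-Object -Property Hash | Where-Object { $_.Count -gt 1 } |
        ForEach-Object { $_.Group | Select-Object -Skip 1 }
)

# -WhatIf only reports what would be removed; the files stay in place.
$toDelete | ForEach-Object { Remove-Item -LiteralPath $_.Path -WhatIf }
```

Once the reported list looks right, dropping -WhatIf performs the real deletion.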

Transfer Duplicate Files to Another Location

If you are not sure whether you should remove duplicate files in SharePoint Online, an alternative is to move them to another location.

$FolderLoc = "C:\Temp1\Logs"

# Find Duplicate Files
$DuplicateFiles = Get-ChildItem -Path $FolderLoc -File |
Sort-Object LastWriteTime -Descending |
Group-Object -Property Length |
Where-Object {$_.count -gt 1} |
Select-Object -ExpandProperty Group |
Get-FileHash |
Group-Object -Property Hash |
Where-Object { $_.Count -gt 1 }

# Temporary location for duplicate files
$TempLoc = "C:\Temp1\DuplicateFiles"
New-Item -ItemType Directory -Path $TempLoc -Force | Out-Null

# Log File to store what's moved
$Filelog = "C:\Temp1\LogFile.txt"

# Move Duplicate Files to another folder
ForEach ($Group in $DuplicateFiles) {
$Group.Group | Select-Object -Skip 1 | ForEach-Object {
$filePath = $_.Path
Add-Content -Path $Filelog -Value $filePath
Move-Item -Path $filePath -Destination $TempLoc
Write-Host "Duplicate File Moved: $filePath"
}
}
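
Since the log file records the original path of every moved file, a move can be undone later. The following is a minimal sketch of such a restore, assuming the log holds one full path per line as written by the script above. It runs against throwaway demo folders, and the variable names ($root, $holdDir, $logFile, and so on) are hypothetical.

```powershell
# Throwaway demo folders standing in for the real source and holding folders.
$root    = Join-Path ([System.IO.Path]::GetTempPath()) ("restore-demo-" + [guid]::NewGuid())
$logsDir = Join-Path $root 'Logs'
$holdDir = Join-Path $root 'DuplicateFiles'
$logFile = Join-Path $root 'LogFile.txt'
New-Item -ItemType Directory -Path $logsDir, $holdDir | Out-Null

# Simulate one duplicate that was logged and then moved to the holding folder.
$original = Join-Path $logsDir 'copy.txt'
Set-Content -Path $original -Value 'duplicate content'
Add-Content -Path $logFile -Value $original
Move-Item -LiteralPath $original -Destination $holdDir

# Restore: each log line is the original full path of a moved file.
Get-Content -Path $logFile | ForEach-Object {
    $heldCopy = Join-Path $holdDir (Split-Path -Path $_ -Leaf)
    if (Test-Path -LiteralPath $heldCopy) {
        Move-Item -LiteralPath $heldCopy -Destination $_
        Write-Host "Restored: $_"
    }
}
```

This simple lookup matches files by name only, so it assumes that no two moved files share the same file name.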

Relocate Your SharePoint Data to Another SharePoint Account

Before you find and delete duplicate files in SharePoint Online, it is a best practice to store the SharePoint content in another location to prevent data loss. You can achieve this by using the Expert’s Recommended SharePoint Migration Tool.

Back up your content first, before you delete any duplicate files. The tool supports the latest features and does not require any technical expertise to operate. Its basic steps are listed below.

  • Step 1. Download and Run the tool.
  • Step 2. Select O365 as both Source & Destination.
  • Step 3. Choose the Site option.
  • Step 4. Enter the details for both platforms.
  • Step 5. Add Users and click on Start Migration.

How to Reduce the Chances of Data Duplication in SharePoint Online?

  • Run scheduled scans to identify duplicate files and prevent data redundancy.
  • Use SharePoint’s advanced search features to filter out duplicate files efficiently.
  • Evaluate file metadata such as file size, modified date, author, and content to identify files with similar characteristics.
  • Before deleting files, make sure they are redundant and no longer required.
  • Inform your team members about the problems that data duplication causes in SharePoint.

Final Words

Data duplication in SharePoint Online brings many disadvantages, and Microsoft does not offer a SharePoint duplicate analysis tool. However, you can find duplicate files in SharePoint Online by searching manually or by running PowerShell commands. Manual search takes a lot of time, while the PowerShell commands make the task easier.