I have a script that looks through all files in some folders and searches for text inside them. I want it to search only files that are not binary in nature, and this has turned out to be quite difficult. I've tried combining several techniques, but the result still doesn't work properly. Many files, e.g. C:\csb.log (which appears to be a system file of some kind), are flagged as binary even though they are just text files. Then there are files like PDF or EPUB / MOBI which are sort of text, but sort of not; it's quite confusing. I particularly don't like having to have lines like the one below
$binaryExtensions = @('.exe', '.dll', '.bin', '.iso', '.zip', '.tar', '.rar', '.7z', '.gz', '.pdf', '.epub', '.mobi', '.azw', '.azw2', '.azw3')
as I'd like to think we can quickly detect the nature of a file without relying on its file extension.
How can we detect (hopefully simply) files that are not binary and so can be cleanly searched for text? And, if possible, how can we handle files that are partly binary and partly text, like PDF, EPUB, and MOBI, ignoring the binary parts but cleanly searching the text in the non-binary parts?
Here is my PowerShell attempt so far.
function Is-BinaryFile {
    param (
        [string]$FilePath,
        [switch]$errors
    )
    # Define the log file path based on the function name
    $functionName = $MyInvocation.MyCommand.Name
    $logFile = "$Env:TEMP\$functionName.log"
    # If $errors is specified without $FilePath, output the log file contents
    if ($errors -and -not $FilePath) {
        if (Test-Path $logFile) {
            Get-Content -Path $logFile
        } else {
            Write-Host "Log file not found: $logFile"
        }
        return
    }
    # If the FilePath is a directory, consider it to be a 'binary file' for this and return true
    if ((Test-Path $FilePath) -and (Get-Item $FilePath).PSIsContainer) {
        return $true
    }
    # Check for common binary file extensions before reading the file
    $binaryExtensions = @('.exe', '.dll', '.bin', '.iso', '.zip', '.tar', '.rar', '.7z', '.gz', '.pdf', '.epub', '.mobi', '.azw', '.azw2', '.azw3')
    $fileExtension = [System.IO.Path]::GetExtension($FilePath).ToLower()
    if ($binaryExtensions -contains $fileExtension) {
        return $true
    }
    # Retry opening the file if it's locked using FileStream in shared mode
    $maxRetries = 5
    $retryDelay = 2 # in seconds
    $attempt = 0
    $fileStream = $null
    while ($attempt -lt $maxRetries -and -not $fileStream) {
        try {
            # Open file with shared read/write mode
            $fileStream = [System.IO.FileStream]::new($FilePath, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
        } catch {
            Write-Host "File is locked, retrying in $retryDelay seconds..."
            Start-Sleep -Seconds $retryDelay
            $attempt++
        }
    }
    if (-not $fileStream) {
        $timestamp = Get-Date -Format "yyyy-MM-dd_HH-mm-ss"
        $logMessage = "[$timestamp] Failed to open '$FilePath' after $maxRetries attempts"
        Write-Host $logMessage
        $logMessage | Out-File -FilePath $logFile -Append
        return $false
    }
    # Rest of your code to check if it's a binary file...
    $reader = $null
    try {
        $reader = New-Object System.IO.BinaryReader($fileStream)
        # Check for BOM (Byte Order Mark)
        $bomBuffer = New-Object byte[] 4
        $bytesRead = $reader.Read($bomBuffer, 0, $bomBuffer.Length)
        if ($bytesRead -ge 2 -and (
                ($bomBuffer[0] -eq 0xEF -and $bomBuffer[1] -eq 0xBB -and $bomBuffer[2] -eq 0xBF) -or # UTF-8 BOM
                ($bomBuffer[0] -eq 0xFF -and $bomBuffer[1] -eq 0xFE) -or # UTF-16 LE BOM
                ($bomBuffer[0] -eq 0xFE -and $bomBuffer[1] -eq 0xFF) # UTF-16 BE BOM
            )) {
            return $false # It's a text file
        }
        # If no BOM, continue checking for non-printable characters
        $binaryBytes = 0
        $textBytes = 0
        $buffer = New-Object byte[] 1024
        while (($bytesRead = $reader.Read($buffer, 0, $buffer.Length)) -gt 0) {
            for ($i = 0; $i -lt $bytesRead; $i++) {
                if ($buffer[$i] -eq 0) {
                    $binaryBytes++
                } elseif ($buffer[$i] -lt 32 -and $buffer[$i] -ne 9 -and $buffer[$i] -ne 10 -and $buffer[$i] -ne 13) {
                    $binaryBytes++
                } else {
                    $textBytes++
                }
            }
        }
        return $binaryBytes -gt $textBytes
    } finally {
        if ($reader) { $reader.Close() }
        if ($fileStream) { $fileStream.Close() }
    }
}
asked Nov 21, 2024 at 15:58 by YorSubs
Tags: PowerShell, distinguish between files that are text or binary in nature, Stack Overflow