One of my first projects in ColdFusion was a tool that compares files between two different folders and reports the differences. The tool was intended to help developers identify differences between code in different phases of development. It wasn’t very fast, and when run against a large number of files it crapped out completely.
I recently overhauled the code and was able to make it much faster. Maybe someone else will find it useful.
In the process I learned a few things:
- <cfdirectory action="list" recurse="yes" /> is not always efficient. Using java.io.File directly can be dramatically faster, up to 200% faster when listing files and directories from a network location.
- Opening two files into memory using <cffile> and comparing the contents is not the most efficient way to tell if they are different. Duh.
- Recursively calling a function is a great way to walk through multi-level data.
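As a rough sketch of the first two points (not the tool's actual code), java.io.File can list a folder and report each file's size and last-modified time without opening any file. The two folder paths and the variable names here are hypothetical placeholders.

```cfml
<!--- Hypothetical folders; replace with real paths on your server --->
<cfset folderOne = "C:\projects\site-dev" />
<cfset folderTwo = "C:\projects\site-stage" />

<cfif DirectoryExists(folderOne)>
    <!--- listFiles() returns a Java File[] array of the folder's immediate children --->
    <cfset entries = CreateObject("java", "java.io.File").init(folderOne).listFiles() />
    <cfoutput>
    <cfloop from="1" to="#ArrayLen(entries)#" index="i">
        <cfif entries[i].isFile()>
            <!--- Look for a file with the same name in the second folder --->
            <cfset twin = CreateObject("java", "java.io.File").init(folderTwo, entries[i].getName()) />
            #HTMLEditFormat(entries[i].getName())#:
            <!--- length() and lastModified() read file-system metadata only --->
            <cfif NOT twin.exists()>
                missing from folder two
            <cfelseif entries[i].length() NEQ twin.length()>
                size differs
            <cfelseif entries[i].lastModified() NEQ twin.lastModified()>
                last-modified time differs
            <cfelse>
                looks the same
            </cfif>
            <br />
        </cfif>
    </cfloop>
    </cfoutput>
</cfif>
```

This only looks at the top level of one folder; walking sub-folders is where the recursion comes in.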
I ended up tightening things down to two functions. BuildFileDictionary walks through a given base folder using java.io.File and builds a struct of file information. The struct is keyed on the relative path of each file, thus the FileDictionary naming. This way, I can access information for a specific file using simple code like this: FileDictionary[relativeFilePath]. It also makes it easy to determine which files from one source do not exist under another by simply checking for the existence of a struct key.
The second function, CompareFileDictionaries, compares two file dictionary structs and returns three lists (files only in dictionary one, files only in dictionary two, and files in common but out-of-synch) along with an overall OutOfSynch flag. I settled on file size and last modified date as an acceptable basis to indicate file differences. There are other indicators, but these values were the most efficient to access.
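To illustrate the lookup idea, here is a minimal sketch that builds a one-entry dictionary by hand. The path and values are made up for the example; in practice the struct would come from BuildFileDictionary.

```cfml
<!--- Hand-built stand-in for a file dictionary; the values are invented --->
<cfset FileDictionary = StructNew() />
<cfset relativeFilePath = "TopFolder/SubFolder/File.txt" />
<cfset FileDictionary[relativeFilePath] = StructNew() />
<cfset FileDictionary[relativeFilePath].FileName = "File.txt" />
<cfset FileDictionary[relativeFilePath].Size = 1024 />

<cfoutput>
<!--- Direct lookup by relative path --->
Size of #relativeFilePath#: #FileDictionary[relativeFilePath].Size# bytes<br />
<!--- A missing key means the file does not exist under this source --->
<cfif NOT StructKeyExists(FileDictionary, "TopFolder/Other.txt")>
    TopFolder/Other.txt is not in this dictionary
</cfif>
</cfoutput>
```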
BuildFileDictionary
<cffunction name="BuildFileDictionary" returntype="struct" access="public" output="false"
hint="Recurses through a directory, building a struct with file information. The result struct is keyed on each file's path relative to the passed-in base path, e.g. details on TopFolder/SubFolder/File.txt are accessed through ReturnStruct['TopFolder/SubFolder/File.txt']">
<cfargument name="BasePath" type="string"
hint="Full path to the folder you want to build a dictionary for. UNC paths are acceptable.">
<cfargument name="ExcludeFileExtensions" type="string"
hint="List of file extensions to exclude from results.">
<cfargument name="ExcludeDirectories" type="string"
hint="List of directories to exclude from results. Leave off trailing / for each.">
<cfargument name="Files" type="any" default="-999" required="false"
hint="Optional Java File collection to recurse. Leave this arg off to start with the BasePath location.">
<cfargument name="FileDictionary" type="struct" default="#StructNew()#" required="false"
hint="FileDictionary to append results to. Leave this arg off to start fresh.">
<cfset var fileWalker = ArrayNew(1) />
<cfset var fileCounter = "" />
<cfset var relFilePath = "" />
<!--- Grab collection of files from java.io.File if not passed in --->
<!--- java.io.File is *much* faster than cfdirectory action='list', especially when accessing UNC paths --->
<cfif IsSimpleValue(arguments.Files) AND arguments.Files EQ -999>
<cfif DirectoryExists(arguments.BasePath)>
<cfset arguments.Files = createObject("java","java.io.File").init(arguments.BasePath).listFiles() />
<cfelse>
<cfthrow type="Application" message="BasePath does not exist" detail="The provided BasePath does not exist. Please enter a valid BasePath." />
</cfif>
</cfif>
<cfset fileWalker = arguments.Files />
<!--- Loop through current level of files and directories --->
<cfloop from="1" to="#ArrayLen(fileWalker)#" index="fileCounter">
<cfset relFilePath = Replace(fileWalker[fileCounter].getAbsolutePath(),arguments.BasePath,"") />
<!--- Recursively call this function for sub-directory --->
<cfif fileWalker[fileCounter].isDirectory()>
<cfif ListFindNoCase(arguments.ExcludeDirectories,relFilePath) EQ 0>
<cfinvoke method="BuildFileDictionary">
<cfinvokeargument name="BasePath" value="#arguments.BasePath#">
<cfinvokeargument name="ExcludeFileExtensions" value="#arguments.ExcludeFileExtensions#">
<cfinvokeargument name="ExcludeDirectories" value="#arguments.ExcludeDirectories#">
<cfinvokeargument name="Files" value="#fileWalker[fileCounter].listFiles()#">
<cfinvokeargument name="FileDictionary" value="#arguments.FileDictionary#">
</cfinvoke>
</cfif>
<!--- Grab details about this file --->
<cfelseif fileWalker[fileCounter].isFile()>
<cfif ListFindNoCase(arguments.ExcludeFileExtensions,ListLast(relFilePath,".")) EQ 0>
<cfset arguments.FileDictionary[relFilePath] = StructNew() />
<cfset arguments.FileDictionary[relFilePath].FileName = fileWalker[fileCounter].getName() />
<cfset arguments.FileDictionary[relFilePath].FilePath = fileWalker[fileCounter].getPath() />
<cfset arguments.FileDictionary[relFilePath].AbsolutePath = fileWalker[fileCounter].getAbsolutePath() />
<!--- java.io.File.hashCode() is derived from the path name, not the file contents, and it seemed to slow things down, so it is left out --->
<!--- <cfset arguments.FileDictionary[relFilePath].HashCode = fileWalker[fileCounter].hashCode() /> --->
<cfset arguments.FileDictionary[relFilePath].LastModified = fileWalker[fileCounter].lastModified() />
<cfset arguments.FileDictionary[relFilePath].Size = fileWalker[fileCounter].length() />
</cfif>
</cfif>
</cfloop>
<cfreturn arguments.FileDictionary />
</cffunction>
CompareFileDictionaries
<cffunction name="CompareFileDictionaries" access="public" output="false" returntype="struct"
hint="Compares two file dictionaries and returns a struct containing files only in one, files only in two, and common files out of synch.">
<cfargument name="fileDictionaryOne" type="struct" required="yes" hint="File dictionary one">
<cfargument name="fileDictionaryTwo" type="struct" required="yes" hint="File dictionary two">
<!--- Variable declarations --->
<cfset var comparisonResults = StructNew() />
<cfset var fileRelativePath = "" />
<!--- Build up struct properties for results --->
<cfset comparisonResults.NamesOneOnly = ArrayNew(1) />
<cfset comparisonResults.NamesTwoOnly = ArrayNew(1) />
<cfset comparisonResults.OutOfSynch = false />
<cfset comparisonResults.NamesCommonOutOfSynch = ArrayNew(1) />
<!--- Loop through the relative file paths in the first file dictionary checking if the --->
<!--- same relative file path exists in the second file dictionary. If it does exist in --->
<!--- the second dictionary, check if file attributes are out of synch between the two. --->
<cfloop list="#ListSort(StructKeyList(arguments.fileDictionaryOne),"textnocase","ASC")#" index="fileRelativePath">
<cfif StructKeyExists(arguments.fileDictionaryTwo,fileRelativePath)>
<cfif (arguments.fileDictionaryOne[fileRelativePath].LastModified NEQ arguments.fileDictionaryTwo[fileRelativePath].LastModified)
AND (arguments.fileDictionaryOne[fileRelativePath].Size NEQ arguments.fileDictionaryTwo[fileRelativePath].Size)>
<!--- Last modified date and file size both differ, add to NamesCommonOutOfSynch array --->
<cfset ArrayAppend(comparisonResults.NamesCommonOutOfSynch,fileRelativePath) />
</cfif>
<cfelse>
<!--- File does not exist in second dictionary, add to NamesOneOnly --->
<cfset ArrayAppend(comparisonResults.NamesOneOnly,fileRelativePath) />
</cfif>
</cfloop>
<!--- Loop through the relative file paths in the second dictionary checking if each exists --->
<!--- in the first dictionary. --->
<cfloop list="#ListSort(StructKeyList(arguments.fileDictionaryTwo),"textnocase","ASC")#" index="fileRelativePath">
<cfif not StructKeyExists(arguments.fileDictionaryOne,fileRelativePath)>
<cfset ArrayAppend(comparisonResults.NamesTwoOnly,fileRelativePath) />
</cfif>
</cfloop>
<cfif ArrayLen(comparisonResults.NamesCommonOutOfSynch) GT 0
OR ArrayLen(comparisonResults.NamesOneOnly) GT 0
OR ArrayLen(comparisonResults.NamesTwoOnly) GT 0>
<cfset comparisonResults.OutOfSynch = true />
</cfif>
<cfreturn comparisonResults />
</cffunction>
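Putting the two functions together might look something like the sketch below. It assumes both functions are defined in the same template (or component) as the calling code; the folder paths, the excluded extensions, and the variable names are hypothetical. ExcludeFileExtensions and ExcludeDirectories have no defaults, so both are passed explicitly, and BuildFileDictionary throws if a base path does not exist.

```cfml
<!--- Hypothetical folders to compare; adjust to real paths --->
<cfset devFiles = BuildFileDictionary(
    BasePath="C:\projects\site-dev",
    ExcludeFileExtensions="log,tmp",
    ExcludeDirectories="") />
<cfset stageFiles = BuildFileDictionary(
    BasePath="C:\projects\site-stage",
    ExcludeFileExtensions="log,tmp",
    ExcludeDirectories="") />

<cfset results = CompareFileDictionaries(devFiles, stageFiles) />

<cfoutput>
<cfif results.OutOfSynch>
    Only in dev: #ArrayLen(results.NamesOneOnly)#<br />
    Only in stage: #ArrayLen(results.NamesTwoOnly)#<br />
    In both but out of synch:<br />
    <cfloop from="1" to="#ArrayLen(results.NamesCommonOutOfSynch)#" index="i">
        #HTMLEditFormat(results.NamesCommonOutOfSynch[i])#<br />
    </cfloop>
<cfelse>
    The two folders match.
</cfif>
</cfoutput>
```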