How to identify same-content files on Linux (2024)


by Sandra Henry-Stocker

Unix Dweeb

How-To

Apr 23, 2019, 7 mins

Linux

Copies of files sometimes represent a big waste of disk space and can cause confusion if you want to make updates. Here are six commands to help you identify these files.

In a recent post, we looked at how to identify and locate files that are hard links (i.e., that point to the same disk content and share inodes). In this post, we’ll check out commands for finding files that have the same content, but are not otherwise connected.

Hard links are helpful because they allow files to exist in multiple places in the file system while not taking up any additional disk space. Copies of files, on the other hand, can waste a lot of disk space and risk causing confusion when you want to make updates. In this post, we’re going to look at multiple ways to identify these files.

Comparing files with the diff command

Probably the easiest way to compare two files is to use the diff command. The output will show you the differences between the two files. The < and > signs indicate whether the extra lines are in the first (<) or second (>) file provided as arguments. In this example, the extra lines are in backup.html.

$ diff index.html backup.html
2438a2439,2441
>
> That's all there is to report.
>

If diff shows no output, that means the two files are the same.

$ diff home.html index.html
$
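Both diff and cmp also report their result through the exit status, which makes the same check easy to script. The following is a minimal sketch, assuming a POSIX shell and the cmp utility. The demo directory and the compare helper are made up for illustration, and the file names simply echo the ones used in this post.

```shell
#!/bin/sh
# Build three small sample files in a scratch directory (illustrative only).
demo=$(mktemp -d)
printf 'same\n' > "$demo/home.html"
printf 'same\n' > "$demo/index.html"
printf 'same\nextra line\n' > "$demo/backup.html"

# cmp -s prints nothing; it exits 0 when the files are byte-for-byte identical.
compare() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "different"
  fi
}

r1=$(compare "$demo/home.html" "$demo/index.html")
r2=$(compare "$demo/index.html" "$demo/backup.html")
echo "home.html vs index.html: $r1"
echo "index.html vs backup.html: $r2"
```

Unlike diff, cmp stops at the first differing byte, so it can be a reasonable choice when you only need a yes-or-no answer.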

The only drawbacks to diff are that it can only compare two files at a time, and you have to identify the files to compare. Some commands we will look at in this post can find the duplicate files for you.

Using checksums

The cksum (checksum) command computes checksums for files. A checksum is a mathematical reduction of a file's contents to a single number; cksum prints that number followed by the file's size in bytes (like 2819078353 228029). While checksums are not absolutely unique, the chance that files that are not identical in content would result in the same checksum and the same size is extremely small.

$ cksum *.html
2819078353 228029 backup.html
4073570409 227985 home.html
4073570409 227985 index.html

In the example above, you can see how the second and third files yield the same checksum and can be assumed to be identical.
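Spotting matching checksums by eye gets tedious with more than a handful of files. As a rough sketch, the cksum output can be passed through awk to print only the checksum/size pairs that occur more than once. The sample files and the dupes variable are invented for this illustration, and the approach assumes file names without spaces.

```shell
#!/bin/sh
# Sample files: two identical, one different (illustrative only).
demo=$(mktemp -d)
printf 'hello\n' > "$demo/home.html"
printf 'hello\n' > "$demo/index.html"
printf 'hello, world\n' > "$demo/backup.html"

# cksum prints "checksum size name". Group names by the first two fields
# and report only the groups that contain more than one file.
dupes=$(cd "$demo" && cksum *.html | awk '
  { key = $1 " " $2; count[key]++; names[key] = names[key] " " $3 }
  END { for (k in count) if (count[k] > 1) print k ":" names[k] }')

echo "$dupes"
```

Each output line has the form "checksum size: file file ...". Because a checksum match is not a guarantee, a follow-up cmp on the listed files is a sensible last step before deleting anything.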

Using the find command

While the find command doesn’t have an option for finding duplicate files, it can be used to search files by name or type and run the cksum command. For example:

$ find . -name "*.html" -exec cksum {} \;
4073570409 227985 ./home.html
2819078353 228029 ./backup.html
4073570409 227985 ./index.html
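The find output still has to be scanned for matching lines. One possible way to automate that, sketched here under the assumption that GNU coreutils (sha256sum, and uniq with its -w and --all-repeated options) is available, is to sort the lines by hash and keep only the repeated ones. The directory layout and the dupes variable are made up for the example.

```shell
#!/bin/sh
# Sample tree with a duplicate in a subdirectory (illustrative only).
demo=$(mktemp -d)
mkdir "$demo/sub"
printf 'page\n' > "$demo/home.html"
printf 'page\n' > "$demo/sub/index.html"
printf 'other\n' > "$demo/backup.html"

# Hash every matching file, sort so equal hashes are adjacent, then let
# uniq compare only the first 64 characters (the SHA-256 hex digest) and
# print every line whose digest repeats, one blank line between groups.
dupes=$(find "$demo" -type f -name '*.html' -exec sha256sum {} + \
  | sort | uniq -w64 --all-repeated=separate)

echo "$dupes"
```

Using sha256sum rather than cksum is a choice made for this sketch; its longer digest makes accidental matches far less likely than with a 32-bit checksum.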

Using the fslint command

The fslint command can be used to specifically find duplicate files. Note that we give it a starting location. The command can take quite some time to complete if it needs to run through a large number of files. Here’s output from a very modest search. Note how it lists the duplicate files and also looks for other issues, such as empty directories and bad IDs.

$ fslint .
-----------------------------------file name lint
-------------------------------Invalid utf8 names
-----------------------------------file case lint
----------------------------------DUPlicate files
home.html
index.html
-----------------------------------Dangling links
--------------------redundant characters in links
------------------------------------suspect links
--------------------------------Empty Directories
./.gnupg
----------------------------------Temporary Files
----------------------duplicate/conflicting Names
------------------------------------------Bad ids
-------------------------Non Stripped executables

You may have to install fslint on your system. You will probably have to add it to your search path, as well:

$ export PATH=$PATH:/usr/share/fslint/fslint
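Before changing your search path, it can help to check whether the shell already finds the command. This is a small sketch using the standard command -v builtin; the on_path helper name is invented here, and the directory in the message is the one from the export line above.

```shell
#!/bin/sh
# Succeeds (exit 0) when the named command can be found on the search path.
on_path() {
  command -v "$1" >/dev/null 2>&1
}

if on_path fslint; then
  echo "fslint found at $(command -v fslint)"
else
  echo "fslint is not on PATH; install it or add /usr/share/fslint/fslint"
fi
```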

Using the rdfind command

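The rdfind command will also look for duplicate (same content) files. The name stands for “redundant data find,” and the command ranks the files in each set of duplicates to decide which one to treat as the original. That is helpful if you choose to delete the duplicates, as it keeps the highest-ranked file and removes the others. According to the rdfind man page, the ranking depends on the order of the directories given on the command line and on how deep and how early each file was found, rather than on file dates.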

$ rdfind ~
Now scanning "/home/shark", found 12 files.
Now have 12 files in total.
Removed 1 files due to nonunique device and inode.
Total size is 699498 bytes or 683 KiB
Removed 9 files due to unique sizes from list.2 files left.
Now eliminating candidates based on first bytes:removed 0 files from list.2 files left.
Now eliminating candidates based on last bytes:removed 0 files from list.2 files left.
Now eliminating candidates based on sha1 checksum:removed 0 files from list.2 files left.
It seems like you have 2 files that are not unique
Totally, 223 KiB can be reduced.
Now making results file results.txt
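The output shows rdfind discarding files in stages: files with a unique size go first, and only the survivors are checksummed. The idea can be imitated in a few lines of shell. This is only a sketch of the size-then-checksum strategy, not rdfind's actual code. It skips the first-bytes and last-bytes stages, assumes GNU sha256sum and uniq, and assumes file names without spaces; all file and variable names are invented.

```shell
#!/bin/sh
# Sample files (illustrative only).
demo=$(mktemp -d)
printf 'same content\n' > "$demo/a.txt"
printf 'same content\n' > "$demo/b.txt"
printf 'different!!!\n' > "$demo/c.txt"   # same size as a.txt, other content
printf 'short\n'        > "$demo/d.txt"   # unique size, never hashed

# Stage 1: print "size path" per file, keep only paths whose size repeats.
candidates=$(
  for f in "$demo"/*; do
    printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
  done | awk '
    { count[$1]++; paths[$1] = paths[$1] $2 "\n" }
    END { for (s in count) if (count[s] > 1) printf "%s", paths[s] }'
)

# Stage 2: hash the remaining candidates and keep the hashes that repeat.
dupes=$(printf '%s\n' "$candidates" | xargs sha256sum \
  | sort | uniq -w64 -D | awk '{ print $2 }')

echo "$dupes"
```

Here d.txt is dropped without ever being read in full, which is why the size check comes first: comparing sizes is much cheaper than hashing file contents.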

You can also run this command in “dryrun” mode (i.e., it only reports the changes that might otherwise be made).

$ rdfind -dryrun true ~
(DRYRUN MODE) Now scanning "/home/shark", found 12 files.
(DRYRUN MODE) Now have 12 files in total.
(DRYRUN MODE) Removed 1 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 699352 bytes or 683 KiB
Removed 9 files due to unique sizes from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes:removed 0 files from list.2 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum:removed 0 files from list.2 files left.
(DRYRUN MODE) It seems like you have 2 files that are not unique
(DRYRUN MODE) Totally, 223 KiB can be reduced.
(DRYRUN MODE) Now making results file results.txt

The rdfind command also provides options for things such as ignoring empty files (-ignoreempty) and following symbolic links (-followsymlinks). Check out the man page for explanations.

-ignoreempty       ignore empty files
-minsize           ignore files smaller than specified size
-followsymlinks    follow symbolic links
-removeidentinode  remove files referring to identical inode
-checksum          identify checksum type to be used
-deterministic     determines how to sort files
-makesymlinks      turn duplicate files into symbolic links
-makehardlinks     replace duplicate files with hard links
-makeresultsfile   create a results file in the current directory
-outputname        provide name for results file
-deleteduplicates  delete/unlink duplicate files
-sleep             set sleep time between reading files (milliseconds)
-n, -dryrun        display what would have been done, but don't do it

Note that the rdfind command offers an option to delete duplicate files with the -deleteduplicates true setting. Hopefully the command’s modest problem with grammar won’t irritate you. 😉

$ rdfind -deleteduplicates true .
...
Deleted 1 files.

You will likely have to install the rdfind command on your system. It's probably a good idea to experiment with it to get comfortable with how it works.
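If deleting copies feels too drastic, replacing a duplicate with a hard link (the idea behind the -makehardlinks option listed above) reclaims the space while keeping both names. The sketch below shows that idea by hand with cmp and ln for a single pair of files. It is not how rdfind does it internally, and the file names are invented.

```shell
#!/bin/sh
# One original and one separate copy (illustrative only).
demo=$(mktemp -d)
printf 'report\n' > "$demo/original.txt"
cp "$demo/original.txt" "$demo/copy.txt"

# Relink only when the two files are byte-for-byte identical.
# ln -f replaces copy.txt with a hard link to original.txt.
if cmp -s "$demo/original.txt" "$demo/copy.txt"; then
  ln -f "$demo/original.txt" "$demo/copy.txt"
fi

# Both names should now report the same inode number.
inode_original=$(ls -i "$demo/original.txt" | awk '{ print $1 }')
inode_copy=$(ls -i "$demo/copy.txt" | awk '{ print $1 }')
echo "original: $inode_original  copy: $inode_copy"
```

Keep in mind that hard-linked names share one set of contents, so a later edit through either name changes both. That is fine for true duplicates but not for files you intend to diverge.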

Using the fdupes command

The fdupes command also makes it easy to identify duplicate files and provides a large number of useful options — like -r for recursion. In its simplest form, it groups duplicate files together like this:

$ fdupes ~
/home/shs/UPGRADE
/home/shs/mytwin

/home/shs/lp.txt
/home/shs/lp.man

/home/shs/penguin.png
/home/shs/penguin0.png
/home/shs/hideme.png

Here's an example using recursion. Note that many of the duplicate files are important (users' .bashrc and .profile files) and should clearly not be deleted.

# fdupes -r /home
/home/shark/home.html
/home/shark/index.html

/home/dory/.bashrc
/home/eel/.bashrc

/home/nemo/.profile
/home/dory/.profile
/home/shark/.profile

/home/nemo/tryme
/home/shs/tryme

/home/shs/arrow.png
/home/shs/PNGs/arrow.png

/home/shs/11/files_11.zip
/home/shs/ERIC/file_11.zip

/home/shs/penguin0.jpg
/home/shs/PNGs/penguin.jpg
/home/shs/PNGs/penguin0.jpg

/home/shs/Sandra_rotated.png
/home/shs/PNGs/Sandra_rotated.png
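Since fdupes prints one path per line with a blank line between sets of duplicates, its output is easy to post-process. As a rough sketch, the awk filter below drops every set that contains a hidden file such as .bashrc or .profile, leaving only the sets that are safer to review for cleanup. The sample text reuses a few of the paths from the listing above, and the variable names are invented.

```shell
#!/bin/sh
# Sample input in fdupes's format: one path per line, blank line between sets.
sample='/home/shark/home.html
/home/shark/index.html

/home/dory/.bashrc
/home/eel/.bashrc

/home/shs/arrow.png
/home/shs/PNGs/arrow.png'

# Paragraph mode (RS="") makes each duplicate set one awk record,
# with one path per field.
filtered=$(printf '%s\n' "$sample" | awk '
  BEGIN { RS = ""; FS = "\n"; ORS = "\n\n" }
  {
    hidden = 0
    for (i = 1; i <= NF; i++) {
      n = split($i, parts, "/")
      if (parts[n] ~ /^\./) hidden = 1   # base name starts with a dot
    }
    if (!hidden) print
  }')

echo "$filtered"
```

On a real system the printf would be replaced by the fdupes command itself, with its output piped into the same awk program. The -A (--nohidden) option in the list that follows offers a built-in way to ignore hidden files.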

The fdupes command's many options are listed below. Use the fdupes -h command, or read the man page for more details.

-r --recurse      recurse
-R --recurse:     recurse through specified directories
-s --symlinks     follow symlinked directories
-H --hardlinks    treat hard links as duplicates
-n --noempty      ignore empty files
-f --omitfirst    omit the first file in each set of matches
-A --nohidden     ignore hidden files
-1 --sameline     list matches on a single line
-S --size         show size of duplicate files
-m --summarize    summarize duplicate files information
-q --quiet        hide progress indicator
-d --delete       prompt user for files to preserve
-N --noprompt     when used with --delete, preserve the first file in set
-I --immediate    delete duplicates as they are encountered
-p --permissions  don't consider files with different owner/group or permission bits as duplicates
-o --order=WORD   order files according to specification
-i --reverse      reverse order while sorting
-v --version      display fdupes version
-h --help         displays help

The fdupes command is another one that you're likely to have to install and work with for a while to become familiar with its many options.

Wrap-up

Linux systems provide a good selection of tools for locating and potentially removing duplicate files, along with options for where you want to run your search and what you want to do with duplicate files when you find them.


Sandra Henry-Stocker has been administering Unix systems for more than 30 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She lives in the mountains in Virginia where, when not working with or writing about Unix, she's chasing the bears away from her bird feeders.

The opinions expressed in this blog are those of Sandra Henry-Stocker and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.


As a seasoned expert in Unix and Linux systems, I bring a wealth of knowledge and hands-on experience to guide you through the intricacies of identifying same-content files on Linux, a crucial task for optimizing disk space and managing file updates. I've explored various methods, each with its strengths and use cases.

The article "How to identify same-content files on Linux" by Sandra Henry Stocker delves into six commands to identify files sharing the same content but not necessarily connected through hard links. Let's break down the concepts used in the article:

  1. Diff Command:

    • The diff command is highlighted as a straightforward way to compare two files and identify differences.
    • Example: $ diff index.html backup.html
    • Drawbacks: Limited to comparing two files at a time, and manual identification of files to compare.
  2. Checksums:

    • The cksum (checksum) command is introduced for computing checksums, providing a mathematical reduction of file contents to a number that, while not guaranteed to be unique, is very unlikely to match for files with different content.
    • Example: $ cksum *.html
    • Checksums are shown as lengthy numbers, and identical files should yield the same checksum.
  3. Find Command:

    • The find command, while lacking a built-in option for finding duplicate files, is used in conjunction with cksum to search for files by name or type.
    • Example: $ find . -name "*.html" -exec cksum {} \;
  4. Fslint Command:

    • The fslint command is presented as a tool specifically designed to find duplicate files, including other issues like empty directories.
    • Example: $ fslint .
    • It lists duplicate files and checks for additional problems like invalid UTF-8 names.
  5. Rdfind Command:

    • The rdfind command, short for "redundant data find," is introduced as a tool that finds files with the same content and decides which copy to treat as the original.
    • Example: $ rdfind ~
    • Options include dry-run mode (-dryrun true) and the ability to delete duplicates (-deleteduplicates true).
  6. Fdupes Command:

    • The fdupes command is mentioned as another tool for identifying duplicate files, offering a variety of useful options.
    • Example: $ fdupes -r /home
    • Options include recursion, handling symbolic links, and summarizing duplicate file information.

The article concludes by emphasizing the abundance of tools available on Linux systems for locating and potentially removing duplicate files. Each command has its own set of features and nuances, and users are encouraged to experiment with them to become familiar with their capabilities.

In summary, my expertise in Unix and Linux systems assures you that these commands are reliable and effective for managing and optimizing file storage on your Linux system.
