More commands for managing files

Using find to locate files

The find command is used to locate files in the Linux file system. The command requires two arguments:

the name of the directory where the file is looked for
the search condition

The basic syntax of the command is:

find directory search_condition

The search condition is normally based on the name of the file (-name value), but you can also use options that refer to dates or access settings. The find command can also have a third argument that defines what operation is performed on the found files. The default action that is used if no command argument is given is -print, which prints the path and name of the matching files. The following sample command would look for a file called dataset27.dat from the current directory. In this case, the file is found from a subdirectory dataset3.

$ find ./ -name dataset27.txt
./dataset3/dataset27.txt

You can also use wildcards in the name search conditions. Note, however, that in such case you must quote the search condition. The following command locates from your home directory ($HOME) all files that have extension .tmp.

find $HOME/ -name "*.tmp"

In the last find command examples we use -mtime search condition, which picks files based on their modification date. With the following command you can check what files have not been accessed in directory /scratch/project_2001234 during the last 28 days:

find /scratch/project_2001234 -mtime +28

Here +28 means "more than 28 days". In the same way minus-character (-) means less than. So to see what files have been modified in your current directory less than 24 hours ago, you could use the command:

find ./ -mtime -1

File command tells the file type

The file command evaluates the type of the given file. The syntax of the command is:

file file_name

The command prints the name of the file and a one-line description of the file type. The file command recognizes most common text file formats, compressed files and Linux executables. It also studies the content of the file and tries to estimate e.g. if a normal text file contains program code or some commonly used data formatting types like XML. Note however that file often fails to classify correctly application-specific files. If the file is a binary file that is not recognized by the file command, it is reported to be a data file.

In the example below, file types of all the files in the current working directory are listed.

$ file ./*
./a.out:                   ELF 64-bit MSB MIPS-IV executable, MIPS, version 1
./common.py:               a python script text executable .
/data_old.gz:              gzip compressed data, from Unix
./data.txt:                ASCII text
./instrction.html:         HTML document text
./molecule.msv:            data
./output4.jpg:             JPEG image data, JFIF standard 1.01
./outout4.png:             PNG image data, 640 x 480, 4-bit colormap
./output4.xml:             XML document text
./poster1.pdf:             PDF document, version 1.4
./report.doc:              Microsoft Office Document

Count rows and characters with `wc`

Command wc (Word Count) is a tool that can be used to count characters (-m), words (-w) or rows (-l) that a Linux text file contains. The most common use of wc command is to quickly check the row count of your file:

wc -l file_name

Another common use is checking how many rows the output of a command contains. For example, the following command would give the number of files with extension .dat in the current directory.

ls *.dat | wc -l

Comparing two files with diff

The diff command can be used to compare two files. diff goes through the files row by row and prints out lines that are not identical. diff is most useful when you need to compare two nearly identical files like two versions of the same program file. The basic syntax of the command is:

diff file1 file2

Using checksums to verify successful data storage or transfer

Checksums provide a tool to make sure that a data file is fully conserved during storage or copying. The idea behind checksums is an algorithm that calculates a number or a string based on the content of the file. A checksum string is calculated and stored before the file is moved to a storage media or copied to a new location. Later on, when the data is retrieved from the storage or the copying process is finished, a new checksum is computed based on the retrieved or copied files. If the new checksum equals to the previously computed one, we can be pretty sure that the data is fully conserved.

One of the most common checksum algorithms is md5 that is often used to verify the correctness of data files. For example, many scientific data sets available on the internet are accompanied by a list of md5 sums. The md5 sum is always a text string 32 characters long. This string has the characteristics of a good checksum: it does not tell anything about the actual content of the source file and any modification to the original file produces a completely different checksum. Other frequently used checksum algorithms include SHA (Secure Hash Algorithm) that is often used in cryptography and CRC (Cyclic Redundancy Check) that is common in data transfer.

The example below shows how to use md5sum in the CSC environment. An md5 checksum for a file is calculated with the command:

md5sum file_name

For example:

$ md5sum poster1.pdf
cc494699398122a6b6d93a5a69bd2667 poster1.pdf

You can easily store the checksum in a file by redirecting the output of the command to a new file with > character.

md5sum poster1.pdf > poster1.pdf.md5

The command above stores the checksum and file name to a new file called poster1.pdf.md5.

Checking a set of files against an md5 sum list is done by using option -c.

md5sum -c checksum_list

For example, to check the validity of file poster1.pdf with the previously created checksum file poster1.pdf.md5, use command:

$ md5sum -c poster1.pdf.md5
poster1.pdf: OK

Encrypting files with GPG

Note

If you work with sensitive data at CSC, please see our sensitive data services guide.

File encryption can be used to increase the security of your data. In normal conditions (i.e., when not working with sensitive data), encrypting files that locate at CSC is not needed. The files can by default be accessed only by the user themselves. However, in principle, the system administrators of CSC are able to read all data at the servers of CSC. In some occasions the administrators may need to check file names and sizes, but the administrator policy of CSC strictly prohibits accessing the contents of user's data files. However, encryption may be reasonable if you for example need to copy the data outside CSC or if encryption is required by the owner of the data.

At CSC, you can use the GPG program to encrypt your files. GPG is frequently used for creating encryption key pairs to protect emails and other data transfer. However, in this chapter we demonstrate only how GPG can use used to encrypt individual files.

The basic syntax for encrypting a file with gpg command is:

gpg -c file_name

The command asks the user to define a password for the file. This password is not, and should not be, in any sense related to your CSC password. After confirming the password, the command makes an encrypted copy of the given file. By default, the encryption is done with CAST5 algorithm, but several other algorithms can be used too.

To open a GPG-encrypted file, give command:

gpg gpg_encrypted_file

GPG example

Say we have file a my_file.txt that we want to encrypt. This can be done with command:

gpg -c my_file.txt

When the command is launched, the following prompt appears:

Enter passphrase:

Now you can type in any password for the file. In this case we use following password: y8kIeg%a. Once the password is typed, the programs asks you to confirm the password:

Repeat passphrase:

When the encryption is finished we have two files: the original file and and its encrypted version that has an extension .gpg.

$ ls  -l
-rw-------+ 1 kkayttaj csc 1291176 Feb 11 15:57 my_file.txt
-rw-------+ 1 kkayttaj csc  313848 Feb 11 16:05 my_file.txt.gpg

Note that in this case the encrypted file is smaller than the original one. Now we can remove the original file.

rm my_file.txt

Later on, for example after copying the file to some other location, you can extract the data with command:

gpg my_file.txt.gpg

The program now asks for the password you used in encryption (in this case: y8kIeg%a). After this, you again have two files: the encrypted file my_file.txt.gpg and the original, readable file my_file.txt. Note that if you forget the password of your encrypted file, there is no one who can open the file!

Managing access permissions of files and directories

Hundreds of users use the computing and storage environments of CSC. To keep the files private and in order, each file and folder in the Linux environment of CSC is owned by a certain user account. In Linux systems each file has three user categories: owner, group and others. For each or these user categories, there are three access settings: reading, writing and execution permissions.

By default, only the owner of the file can read and modify (i.e. write) the files and directories they have created. Other users do not have any access permissions to the files. Normally, this setting is good as it keeps your data private. However, if you wish to share some data or execute self-written programs, the access permissions need to be modified.

Note that the project specific-disk areas in Puhti and Mahti supercomputers are an exception to this rule.̣ There by default also other project members, belonging to the same UNIX group, have full rights to files created by other users.

You can check the access permissions with the command ls -l. Let's take a look at the sample file listing that was previously used in the ls -la example. In this file listing the characters from second to the tenth column include the information on the access permissions.

The first three of these permission characters display the permissions of the owner, next three ones display the access permissions of the UNIX group members and the last three characters all the other users. Below is a sample output for ls -la command:

total 26914
drwx------+  3 kkayttaj csc       10 Dec 22 09:12 .
drwxr-xr-x  20 root     root       0 Dec 22 09:12 ..
drwx------+ 42 kkayttaj csc      472 Dec 22 09:07 ..
-rwxr-x---+  1 kkayttaj csc     1648 Dec 22 09:01 .cshrc
-rw-------+  1 kkayttaj csc       93 Dec 22 09:01 .my.cnf
-rw-------+  1 kkayttaj csc       48 Dec 22 09:05 Test.txt
-rw-------+  1 kkayttaj csc   878849 Jan 19  2009 input.table
drwxr-xr-x+  2 kkayttaj csc        2 Dec 22 09:11 project1
-rw-------+  1 kkayttaj csc 26432051 Dec 22 09:08 results.out
-rw-------+  1 kkayttaj csc       25 Mar 27  2009 sample.data
-rw-------+  1 kkayttaj csc       49 Mar 27  2009 test.txt

In the case of file Test.txt the setting is: rw-------. This means that the owner of the file (kkayttaj) has permission to read (r) and write (w) to the file. Other users have no permissions for this file. In the case of file .cshrc the definition is: rwxr-x---. In this case the owner has also execution permissions (x) to the file and also the other users that belong to group csc have permission to read (r) and execute (x) the file.

Managing access permissions using the command-line

In command-line usage, access permissions can be modified with command chmod. This command needs two arguments:

a string that defines what changes are to be done and
an argument that defines the target file or directory

As the first argument, you first define the user category: u (user i.e. owner), g (group) or, o (others). Then you define with plus or minus character if you are going to add (+) or remove (-) permissions. Finally, you define, what permissions are added or removed. For example, to allow all the group members to read file Test.txt you should give the command:

chmod g+r Test.txt

You can check the effect with ls -l command:

$ ls -l Test.txt
-rw-r-----+  1 kkayttaj csc       48 Dec 22 09:05 Test.txt

You can define several user categories and permissions at the same time. For example,

chmod go+rwx Test.txt

would add all access permissions for all users to file Test.txt. To remove the permissions, you should change the + character to -.

chmod go-rwx Test.txt

Note that by default, changing the permissions of a directory does not change the permissions of the files and subdirectories in the target directory. Thus, the command

chmod g+w project1

would not allow other group members to modify the files in directory project1. You can use option -R to do the same permission modification recursively i.e. to all files and subdirectories in the target directory:

chmod -R g+w project1

You can use command groups to check which groups you belong to.