PRIMARY CATEGORY → BASH
RESOURCES

| Author | Resource | Link |
|---|---|---|
| David A. Wheeler | How to handle filenames correctly | See here |
| Stephane Chazelas | Handle filenames safely | See here |
| Stephane Chazelas | Never parse `ls`'s output | See here |
| Glob | Meaning | e.g. |
|---|---|---|
| `*` | Match any string, of any length | foo |
| `foo*` | Match any string beginning with `foo` | footer |
| `bar?` | Match `bar` followed by exactly one char | bart |
| `[kz]sh` | Match `sh` preceded by `k` or `z` | ksh |
`*` → Matches zero or more chars
`?` → Matches exactly one char
`[...]` → Matches one char from a specified set
Globbing, globs, filename expansion (all the same) happens after word splitting, which means that any file expanded through globbing corresponds to one word for Bash and does not undergo any further word or field splitting.
The above code causes the `_file` parameter to be expanded as `./filename` instead of `filename`.
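A minimal sketch of this idiom (the loop body is illustrative):

```sh
# Prefixing the glob with ./ makes every expansion start with ./
for _file in ./*; do
  printf '%s\n' "$_file"   # prints ./filename, not filename
done
```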
Globbing is the last type of expansion performed during Bash's parsing process, aka the Bash parser.
CAUTION
Note that this is incorrect 🔴
The previous example may lead to a command interpreting a file whose name begins with a dash (`-`) as an option, for the following reasons:
- A file can contain any char in its name except the null byte `\0` and the slash `/`
- Nearly every command interprets a string beginning with a dash as an option (i.e. `command -opt -- arg`)

The `--` syntax is supported by a wide variety of commands to indicate the end of option processing. It can be added to the previous code as an additional measure to ensure the command does not process a file as an option due to a leading dash.
IMPORTANT
Be aware that the `--` syntax should never replace the `./` measure, because not all commands have implemented `--` to indicate the end of options — the `echo` command, for instance.
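A small demonstration of that caveat, in a throwaway directory:

```sh
cd "$(mktemp -d)"   # demo setup
touch -- ./-n       # a file literally named -n
echo -- *           # prints: -- -n  (echo does not honor --)
rm ./-n             # the ./ prefix alone already defuses the leading dash
```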
Be aware that, in the previous example, if the directory has no files, the `*` glob pattern returns the pattern itself, `./*`; `rm` will then receive a non-existent file as a non-option argument.
To avoid this issue, check that the file exists before executing any command that takes it as an argument →
POSIX Compliance
IMPORTANT
Use the `test` or `[ ]` shell builtins instead of the non-standard Bash `[[ ]]`.

Note that `[ -e string ]` checks whether the string is an existing file in the system. If not, the above code also checks whether the string is a soft link (symbolic link). This is done because the string can be a broken soft link that could be restored later.
If only the `[ -e string ]` check is performed and the string is a soft link pointing to a non-existent file, it returns false. That is why `[ -L string ]` is also used: to detect such a broken link and handle it.
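The combined check can be sketched as follows (the loop body is illustrative):

```sh
for _file in ./*; do
  # Skip the unexpanded ./* pattern in an empty directory;
  # -L keeps broken symlinks, which -e alone would report as missing
  [ -e "$_file" ] || [ -L "$_file" ] || continue
  printf '%s\n' "$_file"   # the real command (e.g. rm -- "$_file") goes here
done
```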
Non-Standard Shell Extension → Nullglob
IMPORTANT
Check the shell extension's state first, then do the work and restore the initial value. It's always advisable to keep or restore values to their previous ones.
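A sketch of that save-and-restore pattern with `nullglob`:

```sh
# shopt -p prints a command that reproduces the option's current state
_prev=$(shopt -p nullglob)
shopt -s nullglob
for _file in ./*; do
  printf '%s\n' "$_file"   # never iterates over a literal ./*
done
eval "$_prev"              # restore nullglob to its initial value
```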
By default, globbing does not expand to hidden files. This behaviour can be changed with the following glob patterns →
Similar results can be achieved using the `dotglob` shell extension. Both alternatives have similar performance; I'd go with the hidden-files glob patterns, for readability and to avoid having to restore `dotglob`'s initial status.
Tip
In arithmetic operators or expansions it is not necessary to use the following syntax →
Instead use `$(( _var ))` or `(( _var ))` →
Overview → Omit the `$` char inside `$(( ))` or `(( ))`
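For instance:

```sh
_var=5
echo "$(( _var + 1 ))"   # 6 — no $ prefix needed inside the arithmetic context
(( _var += 1 ))          # arithmetic command, also without $
echo "$_var"             # 6
```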
This `nullglob` non-standard shell extension is far more efficient than the POSIX approach, since it does not require a file-existence check on each iteration (i.e. `[ -e "$_file" ]`).
Note that globbing should only be used in `for` loops. If used as a non-option command argument and the expansion produces a too-long filename list, the command may not handle all the arguments correctly.
`./` → Ensures a file whose name starts with `-` is processed as a non-option command argument
`--` → Denotes the end of option arguments. It should be used as an additional measure, not the only one
`./.[!.]* ./..?*` or `dotglob` → Adds hidden files to the glob expansion list
`[ -e file ]` or `nullglob` → Avoids processing non-existent files
But, with the above measures applied, the following case may arise →
The above command expects at least one file matching the pattern. If there is none, it will hang trying to read from standard input (fd 0).
In cases where globbing expands to no pathname and the `nullglob` shell extension is enabled, to prevent the above case just add `/dev/null` as the final non-option argument.
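A sketch with `cat`, which reads stdin when given no operands:

```sh
shopt -s nullglob
# With no matches the glob vanishes entirely; /dev/null keeps cat
# from hanging on stdin and contributes no output of its own
cat -- ./*.txt /dev/null
shopt -u nullglob
```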
Important
Be aware of `( )` to group inner commands in a subshell (a Bash child process). This keeps the current shell's environment unmodified.
It's important to point out that any assignment or parameter modification inside `( )` only affects the child's environment.
Handle Pathnames and Filenames Correctly
To correctly handle files with control chars (`"\n"`, `"\t"`...) or other processing-sensitive chars, there are different approaches that reach the same place ➡️
Globbing
→ Does not undergo word splitting, because it happens after it. So there's no need to worry about pathnames that contain control chars or others.
Note that globbing is useful to get a list of unhidden files in a specific directory, but it has some limitations to overcome, like empty matches (`nullglob`), symlinks, non-recursive search (`globstar`), and hidden files (`dotglob`).
To deal with the above aspects, it's more feasible to make use of the `find` binary.
Find
→ It has several features to perform advanced and recursive searches on a specific path given certain criteria.
Globbing
POSIX Compliant - Non-Hidden Files
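A sketch of the POSIX-compliant loop over non-hidden files (the `printf` stands in for the real command):

```sh
for _file in ./*; do
  [ -e "$_file" ] || [ -L "$_file" ] || continue   # skip the unmatched pattern
  printf '%s\n' "$_file"
done
```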
POSIX Compliant - Including Hidden Files
Info
All the following code snippets include hidden files in the globbing expansion. Otherwise, remove `./.[!.]* ./..?*` from them.
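The same POSIX-compliant loop, extended with the hidden-file patterns:

```sh
# ./.[!.]* matches names like .foo; ./..?* matches names like ..foo;
# together they cover hidden files while excluding the . and .. entries
for _file in ./* ./.[!.]* ./..?*; do
  [ -e "$_file" ] || [ -L "$_file" ] || continue
  printf '%s\n' "$_file"
done
```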
Non Standard Bash Extension For Loop
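A sketch assuming the `nullglob` and `dotglob` Bash extensions:

```sh
shopt -s nullglob dotglob
for _file in ./*; do
  printf '%s\n' "$_file"   # hidden files included; empty match → zero iterations
done
shopt -u nullglob dotglob
```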
Non Standard Bash Extension Oneliner (Command Interface)
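The one-liner equivalent for interactive use (the `printf` is illustrative):

```sh
shopt -s nullglob dotglob; printf '%s\n' ./*; shopt -u nullglob dotglob
```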
CAUTION
Be aware that errors may arise if the list of matches is too long and the command cannot handle that number of arguments →
Therefore, in robust scripts, globbing should only be used in `for` loops, unless the number of filename arguments the command receives is known →
Non Standard Bash Extension > v4.0 - Recursive Directory Search
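A sketch of the recursive `globstar` search, in a throwaway directory:

```sh
cd "$(mktemp -d)" && mkdir -p d1/d2 && touch d1/a d1/d2/b   # demo setup
shopt -s globstar
for _file in ./**/*; do
  printf '%s\n' "$_file"   # ./d1, ./d1/a, ./d1/d2, ./d1/d2/b
done
shopt -u globstar
```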
CAUTION
Note that the `globstar` Bash extension was added in Bash 4.0 (2009). This excludes macOS devices, whose bundled Bash version is 3.2.57 due to licensing issues.
To prevent unintended errors, add some Bash version validation to `.bash` scripts → Check the release dates of Bash features.

If Bash version validation is implemented, you should add, prior to it, a shell validation to check whether the `.bash` script is running in Bash at all →
It could be in the same function or in a different one, but it must be done before any other type of validation.
Be aware that `case` is used for the prior shell check instead of the alternatives, for these reasons:
`[[ ]]` → Not POSIX-compliant. Restricted to Korn shell, Bash, and Zsh
`[ ]` aka `test` → POSIX-compliant, but it does not allow pattern matching using globbing

HINT
Note that, in `case` statements, it's not necessary to use double quotes `""` on command substitution or parameter expansion to prevent word splitting (according to `IFS`'s value) or filename expansion (globbing).
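A sketch of such a shell-plus-version check using `case` (error messages are illustrative):

```sh
# The case word is unquoted on purpose: no split/glob happens there
case ${BASH_VERSION-} in
  '')      printf '%s\n' 'Error: not running in Bash' >&2; exit 1 ;;
  [0-3].*) printf '%s\n' 'Error: Bash >= 4.0 required' >&2; exit 1 ;;
esac
```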
By default, this recursive shell search behaves this way →
- Omits hidden files
- Prunes dot dirs (does not descend into them)
- Since Bash 4.3, does not follow symlinks. This prevents infinite loops and duplicated entries

Due to the last point, take care when using `globstar` on systems with Bash < 4.3.
Therefore, it's feasible to perform a recursive search with `globstar`, including hidden files and avoiding errors on empty matches, using the `dotglob` and `nullglob` Bash extensions, respectively →
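A sketch combining the three extensions, in a throwaway directory:

```sh
cd "$(mktemp -d)" && mkdir -p .d && touch .d/a   # demo setup
shopt -s globstar dotglob nullglob
for _file in ./**/*; do
  printf '%s\n' "$_file"   # ./.d and ./.d/a — hidden, recursive, no literal ./**/*
done
shopt -u globstar dotglob nullglob
```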
TIP
To avoid having to reset the Bash extensions' values above, all the commands can be executed in a subshell `( )` instead of grouping them with `{ }` →
Note that both assignments and parameter modifications will only apply in the subshell's environment, not in the parent shell's one.
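The same sketch, subshell-scoped:

```sh
(
  shopt -s globstar dotglob nullglob   # dies with the subshell
  for _file in ./**/*; do
    printf '%s\n' "$_file"
  done
)
# Back here, the three options retain their previous values
```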
However, `find` offers a more reliable way to do this, being able to deploy several advanced filters that globbing cannot.
Remember that the `globstar` recursive-search approach is non-standard and offers poor control over the recursion.
Not to mention that, as the number of files handled increases, it's far more feasible to make use of `find` in terms of performance.
Command Output

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| foo | 173.6 ± 21.8 | 156.6 | 329.1 | 5.39 ± 0.69 |
| find . | 32.2 ± 1.0 | 30.3 | 35.6 | 1.00 |
In the above benchmark, `find` shows better performance than the `globstar - nullglob - dotglob` recursive search on more than 50k files.
Note that the performance gap between the two increases as the number of files grows, so it seems more feasible to make use of `find` for better yield and robustness.
`find` is present in nearly every UNIX system, while `globstar` is non-POSIX-compliant and requires Bash ≥ 4.0 (and ≥ 4.3 to avoid the symlink infinite loops).
FIND
It's an external binary, not a shell builtin.
Take into account that any binary's execution leads to the creation of a shell child process through system calls like `fork` or similar (`vfork`, `clone`).
This child process's env is a clone of its parent's env. Then the binary is executed inside that child process through the `execve` syscall.
That does not happen with the `globstar` approach, since no binary is required; only shell functionalities (builtins, keywords, expansions…) are used.
But since `find` is executed only once or a few times in nearly any context, this does not imply a perceptible yield reduction.
IMPORTANT
Subshells or child processes are generated by specific tasks:
- Binary execution
- Background jobs → `command &`
- Simple or group command isolation → `( command )`
- Command substitution → `$( command )`
- Process substitution → `<( command ) >( command )`
- A shell's `-c` option argument → `{sh,bash,ksh} -c 'command'`
- Pipes → `command | command`

In `command | command`, every command executes in a different subshell. Any assignment or parameter modification inside the above subshells does not take effect in the parent shell's env.

As mentioned earlier, as long as subshell generation is done correctly and not abused, such as in loops, it will not negatively affect performance.
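The pipeline case is the one that bites most often; a small demonstration:

```sh
_count=0
printf '%s\n' a b c | while IFS= read -r _line; do
  _count=$(( _count + 1 ))   # increments a copy inside the pipeline subshell
done
echo "$_count"               # prints 0 — the parent's _count never changed
```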
Command Output

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| foo | 2.6 ± 0.2 | 2.4 | 5.6 | 1.00 |
| bar | 781.7 ± 28.0 | 737.1 | 932.2 | 296.33 ± 26.24 |

As can be seen above, performance drops very significantly when subshells are created inside any loop context.
It's important to know in which situations shell functionality can be used instead of depending on external binaries. This avoids decreasing the script's yield noticeably.
As in the following example, users tend to use external binaries to implement functionality that can be obtained through shell builtins →
Command Output

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| printf "%s\n" "${_pathname##*/}" | 0.0 ± 0.0 | 0.0 | 2.0 | 1.00 |
| basename "$_pathname" | 0.0 ± 0.1 | 0.0 | 2.8 | 5.84 ± 165.51 |
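The two approaches side by side (the `_pathname` value is illustrative):

```sh
_pathname=/usr/local/bin/script.sh
printf '%s\n' "${_pathname##*/}"   # script.sh — pure parameter expansion, no fork
basename -- "$_pathname"           # script.sh — forks and execs an external binary
```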
Default behaviour →
- Does not omit hidden files. To omit them →
- Applies a recursive search. To limit it to the current directory (non-recursive) →
- Is passed a directory, and every match begins with that directory in the command's output. Therefore, errors probably won't arise when a filename starts with a `-`, unlike with globbing
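Hedged sketches of those three points (note that `-maxdepth` is a common GNU/BSD extension, not strict POSIX):

```sh
find .                  # recursive, hidden files included
find . ! -name '.*'     # hide dot-files (still descends into dot dirs)
find . -maxdepth 1      # non-recursive (GNU/BSD extension)
find . -name '*.txt'    # matches print as ./name.txt: a leading dash
                        # cannot be mistaken for an option
```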
NOTE
Just a random note to keep in mind when searching for files with `find`:
the `-name` test comes before `-type` to avoid having to call the `stat()` syscall for each file.
`find`'s man page → "The -name test comes before the -type test in order to avoid having to call stat(2) on every file."
The `find -exec` option allows commands to be run directly on any matched file. However, if the pathnames are needed back in the shell, several alternatives arise →
Before proceeding to show them, take the following into account →
- Filenames can contain any char except the zero byte (aka null byte `\0`) and the slash `/`

Hence, since filenames can contain special chars like newlines `\n`, reading files line by line will fail → `read`
Why `read` fails if a filename contains a newline
Command Output
One line is expected as output, since there's only one file in the current directory.
But, because of `read`'s default behavior, it reads up to the newline embedded in the filename, assigns that part of the filename as the value of `read`'s declared parameter, and processes it inside the loop, instead of processing the entire filename, due to that embedded newline `\n`.
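Reproducing the failure (the file name is hypothetical):

```sh
cd "$(mktemp -d)"
touch $'bad\nname'                     # one file, with a newline inside
find . -type f | while IFS= read -r _f; do
  printf '<%s>\n' "$_f"                # two iterations: <./bad> then <name>
done
```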
Info
Note that `read` reads until a newline `\n` char. This default behaviour can be changed through the `-d` option, which changes the delimiter.
`read` processes an input string in the following way →
- Processes the input string until a `\n` character
- Word splitting applies to the processed string according to `IFS`'s value
- The resulting string is stored in `read`'s declared parameter

If `read`'s input string is divided into words due to word splitting, three cases may arise →
- `read -a` option → Each field/word resulting from the splitting is stored in an array as one element
- Multiple parameters → If more than one parameter is declared as a `read` argument, each parameter receives one word of the input string as its value. If the input string's field count is greater than the number of declared parameters, `read`'s last argument receives the remainder of the string. No splitting or delimiter consolidation is performed on that remaining part
- One parameter and `IFS` with its default value or unset → Note that, due to word splitting again, leading and trailing blanks and `\t` in the input string are trimmed, and the internal ones are consolidated. If `IFS`'s value is empty instead, the input string does not undergo word splitting, so those leading and trailing chars are not removed and the inner ones are not consolidated → The prior behaviour does not happen for non-whitespace chars in `IFS`'s value (unlike the whitespace default `$' \t\n'`): no initial or final trimming and no inner consolidation is performed on those `IFS` chars →
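Each case in miniature:

```sh
# read -a: every field becomes one array element
read -r -a _arr <<< 'one  two three'
echo "${#_arr[@]}"            # 3 — consecutive blanks act as one delimiter

# Multiple parameters: the last one takes the raw remainder
read -r _a _b <<< 'one two three'
printf '%s|%s\n' "$_a" "$_b"  # one|two three

# Empty IFS: no splitting, no trimming, no consolidation
IFS= read -r _line <<< '  padded  '
printf '<%s>\n' "$_line"      # <  padded  >
```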
Some alternatives, such as the following, will also fail due to the above problem →
Incorrect Ways of Handling Filenames
Wrong
Note that if any pathname contains a space, `\n`, or `\t`, its name will be split into more than one word. Likewise, if a pathname contains any globbing chars (`*`, `?`), the shell will try to expand it to any matched file. Furthermore, `$( )` expansion chops off any trailing newlines.
The previous situation can be improved to correctly handle filenames with embedded spaces and tabs. The globbing expansion can also be managed →
INFO
There are several ways to assign a newline to the `IFS` parameter:
- Non-POSIX-compliant, but quite widespread →
- POSIX-compliant. Command substitution and parameter expansion →
- POSIX-compliant. Command substitution →

Be aware that it cannot be done the following way, since command substitution trims trailing `\n` →
To prevent that expansion from trimming the trailing newline, a character must be appended after `\n`. Afterwards, that character is trimmed using the `${var%string}` parameter-expansion syntax.
Note that double quotes are not used on the right-hand side of scalar assignments such as the prior ones, nor in the following situations →
The above situations do not undergo word splitting and globbing. They behave like double-quote contexts, therefore it's not necessary to use double quotes.
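The assignment variants side by side:

```sh
# 1) Non-POSIX but widespread: ANSI-C quoting
IFS=$'\n'

# 2) POSIX: command substitution + parameter expansion;
#    the trailing x protects the newline, then gets stripped
IFS=$(printf '\nx'); IFS=${IFS%x}

# 3) POSIX: a literal newline between quotes
IFS='
'

# WRONG: the substitution trims the trailing newline → IFS ends up empty
IFS=$(printf '\n')
```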
However, remember: when in doubt, always quote parameter references!
Be aware that the above `for` loop will break up pathnames that contain newlines `\n`, plus no parameter creation or modification survives the `( )` subshell.
To prevent the subshell problem, the `local` or `declare` shell builtins can be used to restrict the shell-parameter modifications to a local scope →

INFO
To restore shell options to their previous values, it cannot be done like this →
In the above function, globbing is re-enabled once all the work is done; but what if globbing was already disabled prior to `set -f`?
To prevent that misleading situation, store the shell options in a parameter through command substitution: `set +o`.
With all required actions done, just restore the shell options from that parameter through `eval` →
Note that the above way is non-POSIX-compliant due to `local`; the same applies to `declare` and `typeset`.
One POSIX-compliant way to store `IFS`'s prior value →
The above code is explained here.
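Sketches of both restore techniques (helper names are illustrative):

```sh
# Save every option's state, change what you need, restore with eval
_opts=$(set +o)          # prints 'set +o allexport', 'set -f', ... lines
set -f
# ... filename processing ...
eval "$_opts"            # every option back to its prior state

# POSIX-compliant save/restore of IFS, distinguishing unset from set
_had_ifs=${IFS+y}; _old_ifs=$IFS
IFS='
'
# ... splitting work ...
if [ "$_had_ifs" = y ]; then IFS=$_old_ifs; else unset IFS; fi
```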
Take into account that if a filename contains an embedded newline, that filename will still be split into different parts due to word splitting.
This happens because `IFS`'s value is a newline `\n`. There is no workaround to handle that newline splitting: you cannot expect a filename with an embedded newline not to be split into several parts while relying on `IFS` to split on newlines.
Wrong
That unquoted command substitution undergoes word splitting and globbing.
Therefore, any file whose name contains `$' \t\n'` chars will be split into several fields/words (i.e. a file named `"John Doe.pdf"` will be parsed as two files, `John` and `Doe.pdf`).
Likewise, if a filename contains any globbing char like `*`, it will be expanded to the matched filenames in the current directory.
Any trailing newlines in the output expansion will be trimmed due to command-substitution behavior.
Moreover, if `find` returns no filenames, `cat` will hang waiting for input.
The last situation can be handled as follows →
Thus, the command will not hang, as it receives at least one argument.
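A hypothetical reconstruction of the idiom and the `/dev/null` fix for the hang:

```sh
cd "$(mktemp -d)" && printf 'hi' > a.txt   # demo setup
# WRONG: split+glob per name, and hangs on stdin if nothing matches:
#   cat -- $(find . -name '*.txt')
# The /dev/null operand at least prevents the hang:
cat -- $(find . -name '*.txt') /dev/null   # prints: hi
```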
Wrong
As in the previous cases, the above expansion undergoes split+glob, as it is not quoted.
Therefore, any filename with embedded `IFS` chars will be split into several fields/words. The same applies to globbing: any glob char such as `*` will expand to the matched files.
Furthermore, any parameter assignment or modification is not reflected in the parent shell's environment due to the `( )` subshell.
Wrong
The above way correctly handles filenames with `\t` or blanks, since `read`'s default behavior is to process an input stream until a newline `\n`.
The problem arises when a filename contains `\n`, since, as mentioned earlier, `read` splits that filename into several fields/words.
Leading and trailing whitespace characters, such as blanks and tabs, are chopped off due to `IFS`'s default value `$' \t\n'`. Remember that, in that operation, any consecutive sequence of those characters is consolidated into a single delimiter and trimmed (leading, trailing, or inner).
`read -r` is not used, therefore any backslash `\` followed by a specific character could be interpreted as an escape sequence rather than as a literal.
Any modification or parameter assignment within the `while` loop is not reflected in the parent shell's env due to the pipeline `|`, as it creates a subshell for each command.
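The idiom being criticised, reconstructed hypothetically:

```sh
cd "$(mktemp -d)" && touch 'x  '   # demo setup: trailing blanks in the name
# WRONG: splits on \n, trims blanks per IFS, interprets backslashes
find . -type f | while read _file; do
  printf '<%s>\n' "$_file"         # prints <./x> — the trailing blanks are gone
done
```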
Wrong
This improves the previously explained way to handle filenames, but it is still incorrect.
No leading/trailing/inner whitespace chars are consolidated or trimmed, as `IFS`'s value is an empty string. Backslash characters are now treated as literals, since the `read -r` option is used.
However, this way still fails with files whose names contain an embedded newline, splitting them into several words/fields.
Wrong
`xargs` reads from stdin, parsing a file as an argument until a blank or newline `\n`.
Therefore, a filename that contains a blank or `\n` is split into several arguments that are passed to `xargs`'s command. That could lead to unintended actions.
The above situation can be improved as follows →
`xargs`'s `-I` option argument modifies the default `xargs` behavior to read and process an argument until a newline (i.e. it excludes blanks as delimiters).
However, it is still incorrect if a filename contains `\n`.
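Both stages, hypothetically reconstructed:

```sh
cd "$(mktemp -d)" && touch 'a b'   # demo setup
# WRONG: blanks and newlines both delimit xargs arguments
find . -type f | xargs printf '<%s>\n'             # prints <./a> and <b>
# Better: -I makes newline the only delimiter, so blanks survive —
# though filenames embedding \n are still split
find . -type f | xargs -I {} printf '<%s>\n' {}    # prints <./a b>
```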
As mentioned in the sections above, a filename can contain any character except the null byte `\0` or the slash `/`.
Since `find` processes matched files and prints them line by line (i.e. one file per line), it's unfeasible to reprocess them through →
- `read`'s default behavior, as `read` processes an input string until a newline char `\n`
- Command substitution, as it removes trailing newlines and has to be unquoted, therefore undergoing word splitting and globbing. This can be improved by changing `IFS`'s value to `\n` and disabling globbing through `set -f`; still, the same problem arises: filenames with newlines cannot be handled correctly

Therefore, processing filenames line by line through those ways will fail.
A good approach is to separate pathnames with a null byte `\0` rather than with a newline `\n`, since filenames cannot contain `\0`.
This way has been POSIX-compliant since 2023, although implementations have supported it for a long time.
POSIX Compliant - Find -exec '{}' ;
INFO
It runs the command once for each file. It can be unwieldy if the command is large.
Be aware that the braces `{}` are enclosed in quotes to prevent the shell from treating them as syntactically meaningful (brace expansion, perhaps). The same applies to `;`, which is escaped (command termination).
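For instance (the `rm` is illustrative, run here in a throwaway directory):

```sh
cd "$(mktemp -d)" && touch a.log 'b c.log' keep.txt   # demo setup
find . -type f -name '*.log' -exec rm -- '{}' \;      # one rm per matched file
ls                                                    # only keep.txt remains
```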
POSIX Compliant - Find -exec '{}' +
INFO
It runs much faster than the above one, as long as the command accepts multiple filenames as arguments.
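The same example, batched:

```sh
cd "$(mktemp -d)" && touch a.log b.log keep.txt   # demo setup
# Filenames are batched into as few rm invocations as possible
find . -type f -name '*.log' -exec rm -- '{}' +
```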
POSIX Compliant - Find -print0 + Xargs -0
INFO
As mentioned earlier, `xargs` reads arguments from stdin delimited by blanks or newlines `\n`.
This default behaviour can be modified by the `-I` option (newlines as argument delimiters, not blanks) and by the recommended one, `-0` (null bytes as delimiters).
In the above command, the `-print0` option modifies `find`'s output to use a null byte as the file delimiter instead of a newline `\n` (the default: one file per line).
After that, `xargs -0` reads each argument from stdin (`find`'s stdout, fd 1) until a null byte rather than a blank or newline `\n`.
Finally, `xargs` passes those arguments to the specific command, according to certain system limits.
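The pattern in full (the `rm` is illustrative, run in a throwaway directory):

```sh
cd "$(mktemp -d)" && touch $'a b\n.log' keep.txt   # demo setup
# Null-byte delimiters on both sides: any filename passes through intact
find . -type f -name '*.log' -print0 | xargs -0 rm --
```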
The above commands are recommended if the filename-handling process is no longer than a few commands.
If a more complex filename-processing task is required, such as the execution of several commands, the following ones are feasible →
POSIX Compliant - Find -print0 + IFS= Read -d '' (Pipelined)
INFO
`find`'s output format is modified by the `-print0` option, as it delimits the matched filenames with null bytes `\0` rather than newlines `\n`. `read -d ''` modifies `read`'s default behavior to process the input string until a null byte instead of until a newline `\n`.
Those actions improve the following one, seen earlier →
Note that the above one fails and may create unintended results if a filename contains a newline `\n`, as `read` processes the input string until a newline. Therefore, the null-byte-related options `find -print0` and `read -d ''` prevent this behavior.
As said, since a filename cannot contain a null byte `\0` and `read -d ''` processes the input string until a null byte, a filename cannot be split into several parts; it's a unit.
Furthermore, `IFS` is set to an empty string to prevent leading and trailing whitespace trimming, and `read -r` treats backslash chars `\` as literals.
Be aware that the `while` loop occurs in a subshell `( )` due to the pipeline `|`, so parameter assignments or modifications may be lost/unset in the parent shell's environment.
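Its shape (note that `read -d` is itself a Bash/Ksh/Zsh extension, not plain POSIX `read`):

```sh
cd "$(mktemp -d)" && touch 'a  b' $'c\nd'   # demo setup
find . -type f -print0 |
while IFS= read -r -d '' _file; do
  printf '<%s>\n' "$_file"   # each name arrives intact, newlines and all
done
```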
Non-POSIX Compliant - Find -print0 + IFS= Read -d '' (Process Substitution)
IMPORTANT
Non-standard way, as it uses process substitution `<( )`, which is not POSIX-compliant. Process substitution is specific to Bash, Zsh, and Ksh93.
With the prior code, parameters retain their values in the parent shell's env, as no subshell is used for the loop.
The `while` loop's stdin (fd 0) is occupied by `find`'s output, though. This can be changed as follows →
The above command frees `read`'s stdin, since `read` processes the input from fd 4.
Specifically, this occurs:
- Process substitution → `find` is run within a subshell and its output is redirected to a pipe/tempfile under `/dev/fd`
- File descriptor → Within the `while` loop context, that `/dev/fd` file is opened in read mode and assigned to file descriptor 4. Therefore, file descriptor 4 is created and refers to the resource opened in read mode
- Read's processing → `read` processes the input from file descriptor 4 instead of from file descriptor 0 (i.e. its stdin). `read` inherits the file descriptor from the `while` loop, just as a child process inherits them from its parent process

Note that the `while` loop executes until `read` returns false (i.e. `read` reaches EOF).
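A sketch of the fd 4 variant (swap `4< <(...)` for `< <(...)` and drop `<&4` to get the stdin-based form):

```sh
cd "$(mktemp -d)" && touch f1 f2   # demo setup
_count=0
# fd 4-based: the loop runs in the current shell and stdin stays free
while IFS= read -r -d '' _file <&4; do
  _count=$(( _count + 1 ))
done 4< <(find . -type f -print0)
echo "$_count"   # prints 2 — the assignments survived: no pipeline subshell
```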
POSIX Compliant - Find -print0 + IFS= Read -d '' (FIFOs)
INFO
A named pipe, or FIFO, is used rather than process substitution, which makes it POSIX-compliant.
`find`'s output (fd 1) is redirected to the opened FIFO. That `find` process is sent to the background, so that while the `find` process is writing to the named pipe, the `while` loop is reading from it. That is, the FIFO connects `find`'s stdout (fd 1) with `read`'s stdin (fd 0).
The `&` character is necessary to parallelise the execution of both processes. As a FIFO acts synchronously, if the process writing to the FIFO were not sent to the background, the script flow would stop until a process read from that named pipe.
Therefore, `find` writes to the FIFO in the background, the FIFO is opened in read mode and file descriptor 4 is assigned to it; then `read` reads from that fd, all in parallel.
Note that file descriptor 4 refers to the FIFO opened in read mode. This fd is only created in the `while` loop's context. As mentioned earlier, the creation of that file descriptor frees the `while` loop's stdin and the stdin of any process inside the loop.

CAUTION
It is important to create a new file descriptor referring to the FIFO opened in read mode, so that the stdin of the `while` loop and of the processes inside that loop stays free. Thus, any process that reads from its stdin will not consume the FIFO's content.
Cleaner but not POSIX Compliant
INFO
The above code limits the parameter assignment and modification scope to the function through `local`.
It also removes the previously created FIFO when the function returns, due to a cleanup `trap` on `RETURN`.
However, those actions are not POSIX-compliant: neither `local` nor `trap` on `RETURN` is.
Therefore, a shell check can be performed to avoid unintended actions if the script is executed via a POSIX-compliant shell such as dash or sh.
Or, if not, the above one can be modified as follows →
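A hedged sketch of that function-scoped variant (the function name, FIFO path, and loop body are illustrative):

```sh
process_files() {
  local _fifo _file                      # non-POSIX: local
  _fifo=${TMPDIR:-/tmp}/filelist.$$
  trap 'rm -f -- "$_fifo"' RETURN        # non-POSIX: RETURN trap cleans up
  mkfifo "$_fifo" || return 1
  find . -type f -print0 > "$_fifo" &
  while IFS= read -r -d '' _file <&4; do
    printf '%s\n' "$_file"
  done 4< "$_fifo"
}
```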