author | Luke Shumaker <LukeShu@sbcglobal.net> | 2014-01-09 23:08:31 -0500 |
---|---|---|
committer | Luke Shumaker <LukeShu@sbcglobal.net> | 2014-01-09 23:08:31 -0500 |
commit | 0e3db4278516f4963fb4b50426860fe17078fb89 (patch) | |
tree | 6165a93dc2410dcfd8ebd5761af5487d2ca06112 /public | |
parent | 238c43919239c6b501f7804745176900f3a21e67 (diff) |
bash-arrays: revise
I've actually had these changes sitting on my laptop for a while.
(before newyears)
Diffstat (limited to 'public')
-rw-r--r-- | public/bash-arrays.md | 268 |
1 files changed, 208 insertions, 60 deletions
diff --git a/public/bash-arrays.md b/public/bash-arrays.md
index aee5ec0..b2c3d74 100644
--- a/public/bash-arrays.md
+++ b/public/bash-arrays.md
@@ -7,12 +7,72 @@ date: 2013-10-13
 Way too many people don't understand Bash arrays. Many of them argue
 that if you need arrays, you shouldn't be using Bash. If we reject
 the notion that one should never use Bash for scripting, then thinking
-you don't need Bash arrays is what I like to call "wrong".
+you don't need Bash arrays is what I like to call "wrong". I don't
+even mean real scripting; even these little stubs in `/usr/bin`:
 
-The simple explanation of why everybody who programs in Bash needs to
-understand arrays is this: command line arguments are exposed as an
-array. Does your script take any arguments on the command line?
-Great, you need to work with an array!
+    #!/bin/sh
+    java -jar /.../something.jar $* # WRONG!
+
+Command line arguments are exposed as an array, that little `$*` is
+accessing it, and is doing the wrong thing (for the lazy, the correct
+thing is `-- "$@"`). Arrays in Bash offer a safe way preserve field
+separation.
+
+One of the main sources of bugs (and security holes) in shell scripts
+is field separation. That's what arrays are about.
+
+What? Field separation?
+------------------------
+
+Field separation is just splitting a larger unit into a list of
+"fields". The most common case is when Bash splits a "simple command"
+(in the Bash manual's terminology) into a list of arguments.
+Understanding how this works is an important prerequisite to
+understanding arrays, and even why they are important.
+
+Dealing with lists is something that is very common in Bash scripts;
+from dealing with lists of arguments, to lists of files; they pop up a
+lot, and each time, you need to think about how the list is
+separated. In the case of `$PATH`, the list is separated by colons.
+In the case of `$CFLAGS`, the list is separated by whitespace. In the
+case of actual arrays, it's easy, there's no special character to
+worry about, just quote it, and you're good to go.
+
+Bash word splitting
+-------------------
+
+When Bash reads a "simple command", it splits the whole thing into a
+list of "words". "The first word specifies the command to be
+executed, and is passed as argument zero. The remaining words are
+passed as arguments to the invoked command." (to quote `bash(1)`)
+
+It is often hard for those unfamiliar with Bash to understand when
+something is multiple words, and when it is a single word that just
+contains a space or newline. To help gain an intuitive understanding,
+I recommend using the following command to print a bullet list of
+words, to see how Bash splits them up:
+
+<pre><code>printf ' -> %s\n' <var>words...</var><hr> -> word one
+ -> multiline
+word
+ -> third word
+</code></pre>
+
+In a simple command, in absence of quoting, Bash separates the "raw"
+input into words by splitting on spaces and tabs. In other places,
+such as when expanding a variable, it uses the same process, but
+splits on the characters in the `$IFS` variable (which has the default
+value of space/tab/newline). This process is, creatively enough,
+called "word splitting".
+
+In most discussions of Bash arrays, one of the frequent criticisms is
+all the footnotes and "gotchas" about when to quote things. That's
+because they usually don't set the context of word splitting.
+**Double quotes (`"`) inhibit Bash from doing word splitting.**
+That's it, that's all they do. Arrays are already split into words;
+without wrapping them in double quotes Bash re-word splits them,
+which is almost *never* what you want; otherwise, you wouldn't be
+working with an array.
 
 Normal array syntax
 -------------------
@@ -20,18 +80,18 @@ Normal array syntax
 <table>
   <caption>
     <h1>Setting an array</h1>
-    <p><var>tokens...</var> is expanded and split into array elements
-      the same way command line arguments are.</p>
+    <p><var>words...</var> is expanded and subject to word splitting
+      based on <code>$IFS</code>.</p>
   </caption>
   <tbody>
     <tr>
-      <td><code>array=(<var>tokens...</var>)</code></td>
+      <td><code>array=(<var>words...</var>)</code></td>
       <td>Set the contents of the entire array.</td>
     </tr><tr>
-      <td><code>array+=(<var>tokens...</var>)</code></td>
-      <td>Appends <var>tokens...</var> to the end of the array.</td>
+      <td><code>array+=(<var>words...</var>)</code></td>
+      <td>Appends <var>words...</var> to the end of the array.</td>
     </tr><tr>
-      <td><code>array[<var>n</var>]=<var>value</var></code></td>
+      <td><code>array[<var>n</var>]=<var>word</var></code></td>
       <td>Sets an individual entry in the array, the first entry is at
         <var>n</var>=0.</td>
     </tr>
@@ -45,17 +105,24 @@ difference between `@` and `*`.
 <table>
   <caption>
     <h1>Getting an entire array</h1>
-    <p>There is almost <em>no</em> valid reason to not wrap these in
-      double quotes.</p>
+    <p>Unless these are wrapped in double quotes, they are subject to
+      word splitting, which defeats the purpose of arrays.</p>
+    <p>I guess it's worth mentioning that if you don't quote them, and
+      word splitting is applied, <code>@</code> and <code>*</code>
+      end up being equivalent.</p>
+    <p>With <code>*</code>, when joining the elements into a single
+      string, the elements are separated by the first character in
+      <code>$IFS</code>, which is, by default, a space.</p>
   </caption>
   <tbody>
     <tr>
       <td><code>"${array[@]}"</code></td>
-      <td>Returns every element of the array as a separate token.</td>
+      <td>Evaluates to every element of the array, as a separate
+        words.</td>
     </tr><tr>
       <td><code>"${array[*]}"</code></td>
-      <td>Returns every element of the array in a single
-        whitespace-separated string.</td>
+      <td>Evaluates to every element of the array, as a single
+        word.</td>
     </tr>
   </tbody>
 </table>
@@ -64,21 +131,34 @@ It's really that simple—that covers most usages of arrays, and
 most of the mistakes made with them.
 
 To help you understand the difference between `@` and `*`, here is a
-sample.
+sample of each:
 
-<pre><code>#!/bin/bash
+<table>
+  <tbody>
+    <tr><th><code>@</code></th><th><code>*</code></th></tr>
+    <tr>
+      <td>Input:<pre><code>#!/bin/bash
 array=(foo bar baz)
 for item in "${array[@]}"; do
   echo " - <${item}>"
-done<hr> - <foo>
- - <bar>
- - <baz></code></pre>
-
-<pre><code>#!/bin/bash
+done</code></pre></td>
+      <td>Input:<pre><code>#!/bin/bash
 array=(foo bar baz)
 for item in "${array[*]}"; do
   echo " - <${item}>"
-done<hr> - <foo bar baz></code></pre>
+done</code></pre></td>
+    </tr>
+    <tr>
+      <td>Output:<pre><code> - <foo>
+ - <bar>
+ - <baz></code></pre></td>
+      <td>Output:<pre><code> - <foo bar baz><br><br><br></code></pre></td>
+    </tr>
+  </tbody>
+</table>
+
+In most cases, `@` is what you want, but `*` comes up often enough
+too.
 
 To get individual entries, the syntax is
 <code>${array[<var>n</var>]}</code>, where <var>n</var> starts at 0.
@@ -86,39 +166,43 @@ To get individual entries, the syntax is
 <table>
   <caption>
     <h1>Getting a single entry from an array</h1>
+    <p>Also subject to word splitting if you don't wrap it in
+      quotes.</p>
   </caption>
   <tbody>
     <tr>
       <td><code>"${array[<var>n</var>]}"</code></td>
-      <td>Returns the <var>n</var>th entry of the array, where the
-        first entry is at <var>n</var>=0.</td>
+      <td>Evaluates to the <var>n</var><sup>th</sup> entry of the
+        array, where the first entry is at <var>n</var>=0.</td>
     </tr>
   </tbody>
 </table>
 
-To get a subset of the array, there are a few options (like normal,
-switch between `@` and `*` to switch between
-getting it as separate items, and as a whitespace-separated string):
+To get a subset of the array, there are a few options:
 
 <table>
   <caption>
     <h1>Getting subsets of an array</h1>
     <p>Substitute <code>*</code> for <code>@</code> to get the subset
-      as a whitespace-separated string instead of separate tokens, as
-      described above.</p>
-    <p>Again, there is almost no valid reason to not wrap each of
-      these in double quotes.</p>
+      as a <code>$IFS</code>-separated string instead of separate
+      words, as described above.</p>
+    <p>Again, if you don't wrap these in double quotes, they are
+      subject to word splitting, which defeats the purpose of
+      arrays.</p>
   </caption>
   <tbody>
     <tr>
       <td><code>"${array[@]:<var>start</var>}"</code></td>
-      <td>Returns from <var>n</var>=<var>start</var> to the end of the array.</td>
+      <td>Evaluates to the entries from <var>n</var>=<var>start</var> to the end
+        of the array.</td>
     </tr><tr>
       <td><code>"${array[@]:<var>start</var>:<var>count</var>}"</code></td>
-      <td>Returns <var>count</var> entries, starting at <var>n</var>=<var>start</var>.</td>
+      <td>Evaluates to <var>count</var> entries, starting at
+        <var>n</var>=<var>start</var>.</td>
     </tr><tr>
       <td><code>"${array[@]::<var>count</var>}"</code></td>
-      <td>Returns <var>count</var> entries from the beginning of the array.</td>
+      <td>Evaluates to <var>count</var> entries from the beginning of
+        the array.</td>
     </tr>
   </tbody>
 </table>
@@ -128,8 +212,10 @@ Notice that `"${array[@]}"` is equivalent to `"${array[@]:0}"`.
 <table>
   <caption>
     <h1>Getting the length of an array</h1>
-    <p>The is the only situation where there is no difference
-      between <code>@</code> and <code>*</code>.</p>
+    <p>The is the only situation with arrays where quoting doesn't
+      make a difference.</p>
+    <p>True to my earlier statement, when unquoted, there is no
+      difference between <code>@</code> and <code>*</code>.</p>
   </caption>
   <tbody>
     <tr>
@@ -139,7 +225,7 @@ Notice that `"${array[@]}"` is equivalent to `"${array[@]:0}"`.
         <code>${#array[*]}</code>
       </td>
       <td>
-        Returns the length of the array
+        Evaluates to the length of the array
       </td>
     </tr>
   </tbody>
@@ -190,7 +276,7 @@ anyway.
     <tr><th colspan=2>Array length</th></tr>
     <tr><td><code>${#array[@]}</code></td><td><code>$#</code> + 1</td></tr>
     <tr><th colspan=2>Setting the array</th></tr>
-    <tr><td><code>array=("${array[0]}" <var>tokens...</var>)</code></td><td><code>set -- <var>tokens...</var></code></td></tr>
+    <tr><td><code>array=("${array[0]}" <var>words...</var>)</code></td><td><code>set -- <var>words...</var></code></td></tr>
     <tr><td><code>array=("${array[0]}" "${array[@]:2}")</code></td><td><code>shift</code></td></tr>
     <tr><td><code>array=("${array[0]}" "${array[@]:<var>n+1</var>}")</code></td><td><code>shift <var>n</var></code></td></tr>
   </tbody>
@@ -241,21 +327,19 @@ flags. The `shift` command shifts each entry <var>n</var> spots to
 the left, using <var>n</var>=1 if no value is specified; and leaving
 argument 0 alone.
 
-----
+But you mentioned "gotchas" about quoting!
+------------------------------------------
 
-2013-12-06 update: When it's okay to not quote arrays
------------------------------------------------------
+But I explained that quoting simply inhibits word splitting, which you
+pretty much never want when working with arrays. If, for some odd
+reason, you do what word splitting, then that's when you don't quote.
+Simple, easy to understand.
 
-I mentioned that there is "almost no" valid reason to not wrap
-`${array[@]}` in double-quotes. Seriously, you probably want to put
-it in quotes. In an earlier version, I wrote "no", and in an even
-earlier draft I wrote "no reason that I can think of". Well, I just
-changed it to "almost no", because I thought of a reason:
+I think possibly the only case where you do want word splitting with
+an array is when you didn't want an array, but it's what you get
+(arguments are, by necessity, an array). For example:
 
-It is okay to not quote it when you are doing field-separator
-manipulations. For example:
-
-    # Usage: path_la PATH1 PATH2...
+    # Usage: path_ls PATH1 PATH2...
     # Description:
     # Takes any number of PATH-style values; that is,
     # colon-separated lists of directories, and prints a
@@ -269,12 +353,76 @@ manipulations. For example:
         find -L "${dirs[@]}" -maxdepth 1 -type f -executable -printf '%f\n' 2>/dev/null | sort -u
     }
 
-The explanation is that no field-separator evaluation is done when you
-quote the array. This is almost always what you want—the array is
-already field-separated; you don't want that to be re-evaluated, and
-possibly changed. However, if you do have character-separated fields
-that you want to get at, you do want field-separation to be
-re-evaluated.
+Logically, there shouldn't be multiple arguments, just a single
+`$PATH` value; but, we can't enforce that, as the array can have any
+size. So, we do the robust thing, and just act on the entire array,
+not really caring about the fact that it is an array. Alas, there is
+still a field-separation bug in the program, with the output.
+
+I still don't think I need arrays in my scripts
+-----------------------------------------------
+
+Consider the common code:
+
+    ARGS=' -f -q'
+    ...
+    command $ARGS # unquoted variables are a bad code-smell anyway
+
+Here, `$ARGS` is field-separated by `$IFS`, which we are assuming has
+the default value. This is fine, as long as `$ARGS` is known to never
+need an embedded space; which you do as long as it isn't based on
+anything outside of the program. But wait until you want to do this:
+
+    ARGS=' -f -q'
+    ...
+    if [[ -f "$filename" ]]; then
+        ARGS+=" -F $filename"
+    fi
+    ...
+    command $ARGS
+
+Now you're hosed if `$filename` contains a space! More than just
+breaking, it could have unwanted side effects, such as when someone
+figures out how to make `filename='foo --dangerous-flag'`.
+
+Compare that with the array version:
+
+    ARGS=(-f -q)
+    ...
+    if [[ -f "$filename" ]]; then
+        ARGS+=(-F "$filename")
+    fi
+    ...
+    command "${ARGS[@]}"
+
+What about compatability?
+-------------------------
+
+Except for the little stubs that call another program with `"$@"` at
+the end, trying to write for multiple shells (including the ambiguous
+`/bin/sh`) is not a task for mere mortals. If you do try that, your
+best bet is probably sticking to POSIX. Arrays are not POSIX; except
+for the arguments array, which is; though getting subset arrays from
+`$@` and `$*` is not (tip: use `set --` to re-purpose the arguments array).
+
+Writing for various versions of Bash, though, is pretty do-able.
+Everything here works all the way back in bash-2.0 (1996), with the
+following exceptions:
+
+ * The `+=` operator wasn't added until Bash 3.1.
+ * Accessing subset arrays of the arguments array is inconsistent if
+   <var>pos</var>=0 in <code>${@:<var>pos</var>:<var>len</var>}</code>.
+
+    * In Bash 2.x and 3.x, it works as expected, except that argument
+      0 is silently missing. For example `${@:0:3}` gives arguments 1
+      and 2; where `${@:1:3}` gives arguments 1, 2, and 3. This means
+      that if <var>pos</var>=0, then only <var>len</var>-1 arguments
+      are given back.
+    * In Bash 4.0, argument 0 can be accessed, but if
+      <var>pos</var>=0, then it only gives back <var>len</var>-1
+      arguments. So, `${@:0:3}` gives arguments 0 and 1.
+    * In Bash 4.1 and higher, it works in the way described in the
+      main part of this document.
 
-But seriously, quote your arrays by default. If they need to be
-unquoted, you should think long and hard before doing so.
+Bash 1.x won't compile with modern GCC, so I couldn't verify how it
+behaves.
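
The revised post's two portable points, that `$*` in a wrapper stub re-splits arguments while `"$@"` preserves them, and that `set --` can re-purpose the arguments array where real arrays are unavailable, can be tried out with a minimal POSIX `sh` sketch along these lines. The helpers `show_words` and `count_words` and the sample values are hypothetical, made up for illustration; they are not part of the commit.

```shell
#!/bin/sh
# Illustrative sketch only; show_words and count_words are hypothetical
# helpers, not part of the post or the commit.

# Print each argument on its own line, like the post's printf trick.
show_words() {
    printf ' -> %s\n' "$@"
}

# Print how many arguments were received.
count_words() {
    echo "$#"
}

# Re-purpose the arguments array with `set --` (POSIX, no Bash arrays).
set -- 'word one' 'third word'

count_words $*      # unquoted: word splitting on $IFS gives 4 words
count_words "$*"    # joined into a single word: 1
count_words "$@"    # each argument kept as its own word: 2

# Build up a command's arguments safely, mirroring the ARGS example.
filename='foo --dangerous-flag'
set -- -f -q
set -- "$@" -F "$filename"
show_words "$@"     # the filename stays one word, spaces and all
```

This assumes `$IFS` still has its default value. In a Bash-only script, the `ARGS=(...)` form shown in the diff is the more direct way to write the last part; the `set --` version is only the fallback for plain `/bin/sh`.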