How to sort a data frame by multiple columns in R
Posted by: AJ Welch
To begin understanding how to properly sort data frames in R, we must first generate a data frame to manipulate.
# run.R
# Generate data frame
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a", "d", "b", "c"),
  z = 4:1
)
# Print data frame
dataframe
Note: The spacing isn’t necessary, but it improves legibility.
Executing our run.R script prints the data frame as expected, with the rows in the order they were entered:
$ Rscript run.R
           x y z
1      apple a 4
2     orange d 3
3     banana b 2
4 strawberry c 1
The order function
While it is perhaps not the easiest sorting method to type out, the one most readily available in every installation of R, because it is part of the base package, is the order function.
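The order function accepts a number of arguments, but at the simplest level its leading arguments are one or more vectors of the same length (numeric, character, logical, and so on) to sort by. It returns the positions that would put those values in order, rather than the sorted values themselves.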
For example, we can use order() to sort a vector of five randomly ordered numbers with this script:
# Create unordered vector
vector <- c(2, 5, 1, 3, 4)
# Print vector
vector
# Sort in ascending order
vector[order(vector)]
Executing the script, we see the initial output of the unordered vector, followed by the sorted vector:
$ Rscript run.R
[1] 2 5 1 3 4
[1] 1 2 3 4 5
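It can help to look at what order() returns on its own before using it as an index. The following is a minimal sketch using the same five numbers; the variable names values and positions are chosen here for illustration only.

```r
# Hypothetical variable names; the numbers match the vector used above
values <- c(2, 5, 1, 3, 4)

# order() returns positions in the original vector, not the sorted values
positions <- order(values)
positions

# Indexing the vector with those positions yields the sorted vector
values[positions]

# decreasing = TRUE reverses the direction of the sort
values[order(values, decreasing = TRUE)]
```

Here positions holds 3 1 4 5 2: the smallest value sits at position 3, the next smallest at position 1, and so on. Indexing with that vector is what actually rearranges the data.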
Sorting a data frame by vector name
With the order() function in our tool belt, we'll start sorting our data frame by passing in the names of the vectors (columns) within the data frame.
For example, using our previously generated dataframe object, we can sort by the vector z by adding the following code to our script:
# Sort by vector name [z]
dataframe[
  with(dataframe, order(z)),
]
What we're effectively doing is indexing our original dataframe object with the new row order that we'd like to have. This row order is generated using the with() function, which evaluates the expression in its second argument inside an environment built from the data passed as its first argument, so the column z can be referred to by name.
Thus, order() is evaluated against the dataframe data, ordering by the z vector within that data frame. This returns a new row order for the data frame, which is then used inside the [brackets] of dataframe[], before the comma, to select the rows, outputting our new ordered result.
$ Rscript run.R
           x y z
1      apple a 4
2     orange d 3
3     banana b 2
4 strawberry c 1
           x y z
4 strawberry c 1
3     banana b 2
2     orange d 3
1      apple a 4
Consequently, we see our original unordered output, followed by a second output with the data sorted by column z.
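Since with() only changes where the name z is looked up, the same row order can also be computed by referring to the column explicitly with the $ operator. The sketch below rebuilds the data frame so it runs on its own; via_with, via_dollar and sorted_by_z are illustrative names, not part of the original script.

```r
# Rebuild the example data frame so this snippet is self-contained
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a", "d", "b", "c"),
  z = 4:1
)

# Compute the row order two ways (variable names are illustrative)
via_with   <- with(dataframe, order(z))
via_dollar <- order(dataframe$z)

# Both approaches give the same positions
identical(via_with, via_dollar)

# Sort the rows using the $ form
sorted_by_z <- dataframe[order(dataframe$z), ]
sorted_by_z
```

Which form to use is largely a matter of taste: with() saves repeating the data frame name when several columns are involved, while $ keeps the column's origin explicit.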
Sorting by column index
Similar to the above method, it's also possible to sort based on the numeric index of a column in the data frame, rather than the specific name.
Instead of using the with() function, we can simply pass the result of order() directly to our dataframe. We indicate that we want to sort by the column of index 1 by using the dataframe[,1] syntax, which causes R to return the values of that column as a vector. (In R versions before 4.0.0, data.frame() converted strings to factors by default, so this would return a factor instead; order() still sorts it, using the order of its levels.) In other words, similar to when we passed in the z vector name above, order is sorting based on the vector values that are within the column of index 1:
dataframe[
  order(dataframe[, 1]),
]
As expected, we get our normal output followed by the output sorted by the first column:
$ Rscript run.R
           x y z
1      apple a 4
2     orange d 3
3     banana b 2
4 strawberry c 1
           x y z
1      apple a 4
3     banana b 2
2     orange d 3
4 strawberry c 1
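To see that the index form and the name form really do the same thing, the two results can be compared directly. This is a small sketch under the assumption that column 1 is x, as in the data frame above; first_column, by_index and by_name are names made up for the example.

```r
# Rebuild the example data frame so this snippet is self-contained
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a", "d", "b", "c"),
  z = 4:1
)

# dataframe[, 1] extracts the first column as a plain vector
first_column <- dataframe[, 1]
first_column

# It holds the same values as referring to the column by name
identical(first_column, dataframe$x)

# Sorting by index and sorting by name give the same data frame
by_index <- dataframe[order(dataframe[, 1]), ]
by_name  <- dataframe[order(dataframe$x), ]
identical(by_index, by_name)
```

Sorting by index is convenient when column names are not known in advance, but it silently sorts by a different column if the column order ever changes, so names tend to be the safer choice in longer scripts.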
Sorting by multiple columns
In some cases, you may want to sort by multiple columns. Thankfully, doing so is very simple with the previously described methods.
To sort by multiple columns using vector names, simply add additional arguments to the order() function call; each later argument is only used to break ties in the ones before it:
# Sort by vector name [z] then [x]
dataframe[
  with(dataframe, order(z, x)),
]
Similarly, to sort by multiple columns based on column index, add additional arguments to order() with differing indices:
# Sort by column index [1] then [3]
dataframe[
  order(dataframe[, 1], dataframe[, 3]),
]
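Because every value in the example data frame is unique, the second sort column never gets a chance to change anything. The sketch below uses a separate, made-up data frame (fruit, with repeated values in z) purely to illustrate how ties are broken; it is not part of the original script.

```r
# Hypothetical data with repeated values in z, so ties actually occur
fruit <- data.frame(
  x = c("pear", "apple", "fig", "cherry"),
  z = c(2, 1, 2, 1)
)

# z decides the order first; x only breaks ties between equal z values
by_z_then_x <- fruit[with(fruit, order(z, x)), ]
by_z_then_x

# Negating a numeric column sorts it in descending order,
# while x still breaks ties in ascending order
by_z_desc_then_x <- fruit[with(fruit, order(-z, x)), ]
by_z_desc_then_x
```

In the first result the rows with z equal to 1 come first, ordered apple then cherry, followed by fig and pear. Note that the negation trick only works for numeric columns.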