What’s the time complexity of Java Stream distinct().sorted()?

Every time when I get a coding interview, I always avoid using Java stream, because I can’t analyze the time complexity very well.

For example: in my daily work, I might write like this:

Arrays.stream(a).distinct().sorted().toArray();

to get the unique number and sort them.

but I’m curious about the time complexity will be..? is distinct().sorted will become an nested loop?

should I need to seperate them?

int[] arr = Arrays.stream(a).distinct().toArray();
Arrays.stream(arr).sorted().toArray();

so sometimes when I have an interview, I’ll use set to distinct then sort them…. but I really want to write a clean code…

If anyone can help! Thank you!

Answer

There is no definite statement possible, as you are developing against an interface and the specification does not mandate a particular sort algorithm.

For the generic Stream, we have to assume a comparison sort whose worst case of O(n log n) can not be avoided by any algorithm.

But your example uses an IntStream which, in principle, could use a counting sort or similar, having O(n). This doesn’t happen in practice with the reference implementation, as having a better worst case time complexity does not necessarily lead to a better performance in practice and the maximum number of elements is limited to the JVM’s maximum array size.

The time complexity of distinct() is O(n), as it just checks whether adding into a HashSet succeeds. Combining the O(n) with O(n log n) leads to an overall complexity of O(n log n). Maybe the interviewer got this combining of time complexities wrong.

But this is a good example demonstrating that time complexity is not the equal to performance. When you use sorted().distinct(), the distinct() operation will utilize the sorted nature of the incoming elements, which makes it unnecessary to build a HashSet behind the scenes¹. Since the reference implementation has no primitive value set, this eliminates a lot of boxing overhead. On the other hand, using distinct().sorted() could reduce the number of elements to sort, but it requires to have significantly less distinct elements than total stream elements to pay off.

Such performance differences are not covered by the time complexity, which still is the same for both approaches. But as said, for the streams of primitive types, different algorithms with a different time complexity would be possible.

But one thing, we can say for sure. When you split the operations into two stream operations, like requesting a result array from the first and passing it to Arrays.stream again, there is no chance that the underlying implementation utilizes knowledge about the previous operation in the next operation.

Note that the statements above assume a terminal operation like your example’s toArray that consumes all elements and requires the resulting encounter order to be maintained. With other, short-circuiting or unordered terminal operations, the overall time complexity could change, e.g. sorted().findFirst() could get optimized to the equivalent of min() or the sorting step could get eliminated for an unordered terminal operation, like sum(). That does not happen in the current reference implementation, but, as said, you’re programming against an interface.


¹ for primitive streams, this only works in Java 9+. As said, there is no primitive specialization for distinct(), it’s implemented like boxed().distinct().mapToInt(i -> i) and in Java 8, boxed() is implemented as mapToObj(Integer::valueOf) which loses the information about the sorted input, as explained in the last section of this answer.

Leave a Reply

Your email address will not be published. Required fields are marked *