Regardless of what type of application you are developing, optimizing its performance and ensuring it loads fast and offers smooth interactions is critical for the user experience and the success of the application. One way to do this is to inspect the activity of an application by using profiling tools to see what's happening under the hood as it runs during a time window. The Performance panel in DevTools is a great profiling tool to analyze and optimize performance of web applications. If your app is running in Chrome, it gives you a detailed visual overview of what the browser is doing as your application is being executed. Understanding this activity can help you identify patterns, bottlenecks, and performance hotspots you can act on to improve performance.
The following example walks you through using the Performance panel.
Setting up and recreating our profiling scenario
Recently, we set a goal to make the Performance panel more performant. In particular, we wanted it to load large volumes of performance data more quickly. This is the case, for example, when profiling long-running or complex processes or capturing high-granularity data. To achieve this, we first needed to understand how the application was performing, and why it performed that way, which we did by using a profiling tool.
As you might know, DevTools itself is a web application. As such, it can be profiled using the Performance panel. To profile this panel itself, you can open DevTools, and then open another DevTools instance attached to it. At Google, this setup is known as DevTools-on-DevTools.
With the setup ready, the scenario to be profiled must be recreated and recorded. To avoid confusion, the original DevTools window will be referred to as the "first DevTools instance", and the window that's inspecting the first instance will be referred to as the "second DevTools instance".
On the second DevTools instance, the Performance panel, which will be called the perf panel from here on out, observes the first DevTools instance while it recreates the scenario: loading a profile.
On the second DevTools instance, a live recording is started, while on the first instance a profile is loaded from a file on disk. A large file is loaded in order to accurately profile the performance of processing large inputs. When both instances finish loading, the performance profiling data, commonly called a trace, is visible in the perf panel of the second DevTools instance, showing the first instance's perf panel loading a profile.
The initial state: identifying opportunities for improvement
After loading finished, we observed the following in our second perf panel instance, shown in the next screenshot. Focus on the activity of the main thread, which is visible under the track labeled Main. You can see that there are five big groups of activity in the flame chart. These consist of the tasks where loading takes the most time. The total time of these tasks was approximately 10 seconds. In the following sections, the perf panel is used to focus on each of these activity groups to see what can be found.
First activity group: unnecessary work
It became apparent that the first group of activity was legacy code that still ran, but wasn't really needed. Basically, everything under the green block labeled processThreadEvents was wasted effort. That one was a quick win: removing that function call saved around 1.5 seconds. Cool!
Second activity group
In the second activity group, the solution was not as simple as with the first. The buildProfileCalls call took about 0.5 seconds, and that work was not something that could be avoided.
Out of curiosity, we enabled the Memory option in the perf panel to investigate further, and saw that the buildProfileCalls activity was also using a lot of memory. Here, you can see how the blue line graph suddenly jumps around the time buildProfileCalls is run, which suggests a potential memory leak.
To follow up on this suspicion, we used the Memory panel (another panel in DevTools, different from the Memory drawer in the perf panel) to investigate. Within the Memory panel, the "Allocation sampling" profiling type was selected, which recorded a heap snapshot of the perf panel while it loaded the CPU profile.
The following screenshot shows the heap snapshot that was gathered.
From this heap snapshot, it was observed that the Set class was consuming a lot of memory. By checking the call points, it was found that we were unnecessarily assigning properties of type Set to objects that were created in large volumes. This cost was adding up, and a lot of memory was consumed, to the point that it was common for the application to crash on large inputs.
Sets are useful for storing unique items and provide operations that rely on the uniqueness of their contents, like deduplicating datasets and providing more efficient lookups. However, those features weren't necessary here, since the data stored was guaranteed to be unique from the source. As such, sets weren't needed in the first place. To improve memory allocation, the property type was changed from a Set to a plain array. After applying this change, another heap snapshot was taken, and reduced memory allocation was observed. Despite not achieving considerable speed improvements with this change, the secondary benefit was that the application crashed less frequently.
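For illustration, the change was essentially of the following shape. This is a hypothetical sketch with made-up names, not the actual DevTools source; the point is that these objects are created in very large numbers, so the per-object cost of a Set adds up.

```js
// Before: each entry owns a Set, even though its contents are already
// guaranteed to be unique by the source of the data.
function createEntryBefore(event) {
  return {event, related: new Set()};
}

// After: a plain array holds the same already-unique data with much less
// per-object memory overhead.
function createEntryAfter(event) {
  return {event, related: []};
}
```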
Third activity group: weighing data structure trade-offs
The third section is peculiar: you can see in the flame chart that it consists of narrow but tall columns, which denote deep function calls, and in this case deep recursion. In total, this section lasted around 1.4 seconds. By looking at the bottom of this section, it was apparent that the width of these columns was determined by the duration of one function: appendEventAtLevel, which suggested that it could be a bottleneck.
Inside the implementation of the appendEventAtLevel function, one thing stood out. For every single data entry in the input (which is known in code as the "event"), an item was added to a map that tracked the vertical position of the timeline entries. This was problematic, because the number of items being stored was very large. Maps are fast for key-based lookups, but this advantage doesn't come for free. As a map grows larger, adding data to it can become expensive due to, for example, rehashing. This cost becomes noticeable when large numbers of items are added to the map successively.
```js
/**
 * Adds an event to the flame chart data at a defined vertical level.
 */
function appendEventAtLevel(event, level) {
  // ...
  const index = data.length;
  data.push(event);
  this.indexForEventMap.set(event, index);
  // ...
}
```
We experimented with another approach that didn't require us to add an item to a map for every entry in the flame chart. The improvement was significant, confirming that the bottleneck was indeed related to the overhead incurred by adding all the data to the map. The time the activity group took shrank from around 1.4 seconds to around 200 milliseconds.
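As an illustration of the kind of alternative that avoids the per-append map write, here is a hypothetical sketch (not necessarily the change that shipped): the lookup map is built lazily, in one pass, only if and when an index is actually requested. If no caller needs an index during loading, the map is never constructed at all.

```js
// Hypothetical sketch: no map write on the hot append path anymore.
function appendEventAtLevel(event, level) {
  // ...
  data.push(event);
  // ...
}

// The lookup structure is created on demand instead of incrementally.
function indexForEvent(event) {
  if (!this.indexForEventMap) {
    this.indexForEventMap = new Map(data.map((e, i) => [e, i]));
  }
  return this.indexForEventMap.get(event);
}
```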
Before:
After:
Fourth activity group: deferring non-critical work and caching data to prevent duplicate work
Zooming in on this window, you can see that there are two nearly identical blocks of function calls. By looking at the names of the functions called, you can infer that these blocks consist of code that's building trees (for example, with names like refreshTree or buildChildren). In fact, the related code is what creates the tree views in the bottom drawer of the panel. What's interesting is that these tree views aren't shown right after loading. Instead, the user needs to select a tree view (the "Bottom-up", "Call Tree", and "Event Log" tabs in the drawer) for the trees to be shown. Furthermore, as you can tell from the screenshot, the tree building process was executed twice.
There are two problems we identified with this picture:
- A non-critical task was hindering load-time performance. Users don't always need its output. As such, the task is not critical for loading the profile.
- The result of these tasks wasn't cached. That's why the trees were calculated twice, despite the data not changing.
We started by deferring the tree calculation to when the user manually opens the tree view. Only then is it worth paying the price of creating these trees. The total time of running this twice was around 3.4 seconds, so deferring it made a significant difference in the loading time. We are still looking into caching these types of tasks as well.
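As an illustration of the deferral pattern, here is a hypothetical sketch with made-up names (not the actual panel code): the expensive tree building moves from the load path to the tab-selection handler, and the result is kept around so later tab switches don't recompute it.

```js
let currentProfile = null;
let treeViewData = null;

function loadProfile(profile) {
  currentProfile = profile;
  // ... parse events, build flame chart data, draw the minimap ...
  // Note: no tree building here anymore; it's off the critical path.
}

function onTreeTabSelected(tab) {  // "Bottom-up", "Call Tree", or "Event Log"
  if (!treeViewData) {
    // Pay the cost only when the output is actually needed.
    // buildTreeViews() and renderTree() stand in for the actual tree
    // construction and rendering code.
    treeViewData = buildTreeViews(currentProfile);
  }
  renderTree(tab, treeViewData);
}
```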
Fifth activity group: avoid complex call hierarchies when possible
Looking closely at this group, it was clear that a particular call chain was being invoked repeatedly. The same pattern appeared 6 times in different places in the flame chart, and the total duration of this window was about 2.4 seconds!
The related code being called multiple times is the part that processes the data to be rendered on the "minimap" (the overview of the timeline activity at the top of the panel). It wasn't clear why it was happening multiple times, but it certainly didn't have to happen 6 times! In fact, the output of the code should remain current if no other profile is loaded. In theory, the code should only run once.
Upon investigation, it was found that the related code was called as a consequence of multiple parts in the loading pipeline directly or indirectly calling the function that calculates the minimap. This is because the complexity of the program's call graph evolved over time, and more dependencies on this code were added unknowingly. There is no quick fix for this issue; the way to solve it depends on the architecture of the codebase in question. In our case, we had to reduce the call hierarchy complexity a bit and add a check to prevent the code's execution if the input data remained unchanged. After implementing this, the timeline looked like this:
Note that the minimap rendering execution occurs twice, not once. This is because there are two minimaps being drawn for every profile: one for the overview on top of the panel, and another for the drop-down menu that selects the currently visible profile from the history (every item in this menu contains an overview of the profile it selects). Nonetheless, these two have the exact same content, so one can be reused for the other.
Since these minimaps are both images drawn on a canvas, it was a matter of using the drawImage canvas utility and then running the code only once to save some extra time. As a result of this effort, the group's duration was reduced from 2.4 seconds to 140 milliseconds.
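In code terms, the reuse is roughly the following. drawImage is the standard Canvas 2D API; the variable and function names are illustrative.

```js
// Draw the minimap once, then copy its pixels onto the second canvas instead
// of running the data processing and rendering code a second time.
function reuseMinimap(renderedMinimapCanvas, historyDropdownCanvas) {
  const ctx = historyDropdownCanvas.getContext('2d');
  ctx.drawImage(
      renderedMinimapCanvas,
      0, 0, historyDropdownCanvas.width, historyDropdownCanvas.height);
}
```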
Conclusion
After applying all these fixes (and a couple of other smaller ones here and there), the profile loading timeline changed as follows:
Before:
After:
The load time after the improvements was 2 seconds, meaning that an improvement of about 80% was achieved with relatively low effort, since most of what was done consisted of quick fixes. Of course, properly identifying what to do initially was key, and the perf panel was the right tool for this.
It's also important to highlight that these numbers are particular to the profile used as our subject of study. The profile was interesting to us because it was particularly large. Nonetheless, since the processing pipeline is the same for every profile, the significant improvement achieved applies to every profile loaded in the perf panel.
Takeaways
There are some lessons to take away from these results in terms of performance optimization of your application:
1. Make use of profiling tools to identify runtime performance patterns
Profiling tools are incredibly useful to understand what's happening in your application while it's running, especially to identify opportunities to improve performance. The Performance panel in Chrome DevTools is a great option for web applications since it's the native web profiling tool in the browser, and it's actively maintained to be up-to-date with the latest web platform features. Also, it's now significantly faster! 😉
Use samples that represent real-world workloads and see what you can find!
2. Avoid complex call hierarchies
When possible, avoid making your call graph too complicated. With complex call hierarchies, it's easy to introduce performance regressions and hard to understand why your code is running the way it is, making it hard to land improvements.
3. Identify unnecessary work
It's common for aging codebases to contain code that's no longer needed. In our case, legacy and unnecessary code was taking a significant portion of the total loading time. Removing it was the lowest-hanging fruit.
4. Use data structures appropriately
Use data structures to optimize performance, but also understand the costs and trade-offs each type of data structure brings when deciding which ones to use. This includes not only the space complexity of the data structure itself, but also the time complexity of its applicable operations.
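For instance, choosing between a plain array and a Set trades memory for lookup speed:

```js
// A plain array is compact, but membership checks scan it linearly; a Set
// carries per-entry overhead, but answers the same checks in constant time.
const ids = ['a', 'b', 'c'];
const idSet = new Set(ids);

ids.includes('b');  // O(n): fine when the collection is small or checks are rare
idSet.has('b');     // O(1): worth the extra memory when lookups are frequent
```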
5. Cache results to avoid duplicate work for complex or repetitive operations
If the operation is costly to execute, it makes sense to store its results for the next time it's needed. It also makes sense to do this if the operation is done many times—even if each individual time isn't particularly costly.
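A minimal memoization sketch of this idea (the names are illustrative):

```js
// Cache the result of a costly computation keyed by its input, so repeated
// calls with the same input return the stored result instead of recomputing.
const resultCache = new Map();

function expensiveSummaryOf(profileId) {
  // Placeholder for costly work (parsing, aggregation, and so on).
  return `summary of ${profileId}`;
}

function computeSummary(profileId) {
  if (resultCache.has(profileId)) {
    return resultCache.get(profileId);
  }
  const summary = expensiveSummaryOf(profileId);
  resultCache.set(profileId, summary);
  return summary;
}
```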
6. Defer non-critical work
If the output of a task isn't needed immediately and the task execution is extending the critical path, consider deferring it by lazily calling it when its output is actually needed.
7. Use efficient algorithms on large inputs
For large inputs, algorithms with optimal time complexity become crucial. We didn't look into this category in this example, but its importance can hardly be overstated.
8. Bonus: benchmark your pipelines
To make sure your evolving code remains fast, it's wise to monitor its behavior and compare it against a baseline. This way, you proactively identify regressions and improve overall reliability, setting you up for long-term success.
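For example, a pipeline can be timed with the standard User Timing API and the measured durations compared across runs. The loadProfile call below is an illustrative placeholder for whatever code is under test.

```js
// Wrap the pipeline with performance.mark()/measure() and log the duration so
// successive runs can be compared against a baseline.
async function benchmarkProfileLoad(profileData) {
  performance.mark('profile-load-start');
  await loadProfile(profileData);  // hypothetical pipeline under test
  performance.mark('profile-load-end');
  performance.measure('profile-load', 'profile-load-start', 'profile-load-end');

  const [measure] = performance.getEntriesByName('profile-load');
  console.log(`Profile load took ${measure.duration.toFixed(1)} ms`);
}
```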