Firefox Memory Usage in the Quantum Era

This is a continuation of my Are They Slim Yet series. For background see my previous installment.

Firefox’s upcoming release 57 has a huge focus on performance. We’ve quantum-ed all the things but we haven’t really talked about memory usage, which is something that often falls by the wayside in the pursuit of performance. Luckily, since we brought AWSY in-tree it’s been pretty easy to track memory usage and regressions even on separate development branches. The Stylo team was a big user of this and it shows: we flipped the switch to enable Stylo by default around the 7th and you can see a fairly large regression, but by the 16th it was mostly gone.

Hopefully I’ve convinced you we’ve put a lot of work into performance. Now let’s see how we’re doing memory-wise compared to other browsers.

The methodology for the test is the same as previous runs: I used the ATSY project to load 30 pages and measure memory usage of the various processes that each browser spawns during that time.

The results

Browser Memory Usage.
Memory usage of browsers across operating systems.

Edge has the highest memory usage on Windows. Chrome comes in at 1.4X the memory usage of 64-bit Firefox on Windows and about 2X Firefox on Linux. On macOS, Safari is now by far the worst offender in memory usage; Chrome and Firefox are about even, with Firefox memory usage having gone up a fair amount since the last time I measured.

Overall I’m pretty happy with where we’re at, but now that our big performance push is over I’d like to see us focus more on dropping memory usage so we can start pushing up the number of content processes. I’d also like to take a closer look into what’s going on on macOS as that’s been our biggest regression.

Browsers included are Edge 38 on Windows 10, Chrome Beta 62 on all platforms, Firefox Beta 57 on all platforms, and Safari Technology Preview 40 on macOS 10.12.6.

Note: I had to run the test for Safari manually again. They seem to have made some changes that cause all of the pages from my test to be loaded in the same content process.

MemShrink’s 6th Birthday

MemShrink, it’s still a thing

Although not as active, we still have a MemShrink group at Mozilla. We’ve transitioned from an all-out assault on memory usage to mostly just attempting to keep memory usage sane. I wasn’t around when things started, but when I joined there were at least seven people actively attending our MemShrink triage meetings; now we’re down to two. Some members have moved on, others have transitioned through, but really it comes down to the fact that we did a pretty good job of getting memory under control and with limited resources there were more important tasks to look at.

Fear not, we haven’t abandoned the project. We’re just in a bit of a lull. With big pushes for multiple content processes and the Quantum project I think we’re going to see the need to ramp up MemShrink again. In the meantime rest assured we’re still chugging along, just at a slower pace.

Big Ticket Items – 2014

Three years ago Nicholas Nethercote wrote a blog post celebrating MemShrink’s 3rd birthday and put together a list of important work we saw coming up. Let’s see how those projects went.

Better regression detection

AWSY has moved into our testing automation system and we now have automated regression detection through Perfherder. I think we can declare victory here.

Devtools

The devtools team added a memory tab. Dan Callahan and Nick Fitzgerald put together a nice writeup of the new memory tool. There’s more work that can be done, but most of the devtools team’s focus is on performance profiling these days. It sounds like it could become a priority again next year.

GC Arena Fragmentation

Jon Coppeard did some heroic work (64 patches!) and got compacting GC landed. Initial measurements showed an 8% reduction in JS memory usage, which is quite impressive. You can read more details in Jon’s blog post about compacting garbage collection in SpiderMonkey (https://hacks.mozilla.org/2015/07/compacting-garbage-collection-in-spidermonkey/).

Tarako

We actually shipped the 128MB phone! It never took off in its target market and eventually the entire FirefoxOS project was shut down, but I’m still super impressed we achieved such a feat.

Windows OOM crashes

This is an ongoing problem. We still think the push to 64-bit Windows builds will be a huge win. We have a plan to upgrade users from 32-bit to 64-bit if their system can handle it and will make 64-bit the default in Firefox 55.

In the meantime the JS engine is now smarter about requesting memory on Windows and multi-process Firefox has shipped.

We had hopes that upgrading our memory allocator would help as well, but we’ve since abandoned that effort.

Big Ticket Items – 2017

That was a nice trip down memory lane, but now we need to look forward. Let’s take a look at some of what I see as our next big ticket items.

Reduce JS memory usage and increase sharing of data across processes

The JavaScript engine is probably our biggest target coming up for reducing memory usage, particularly with multiple content processes enabled. There’s some impressive work going on to have our core JavaScript modules share a single global. Initial testing has shown some pretty big wins for this.

In general we need to think about ways to share more data across processes.

Improved devtools for memory analysis

The devtools team did a great job with their initial iteration of memory profiling, but it would be great to see a more refined UI and tie in information from our cycle collector on the C++ side.

Expanded testing

I’d like to get the ATSY project automated so that we can get consistent numbers on how we fare against other browsers. This kind of comparison has been a boon for JavaScript performance, and I can see it being a good motivator for improving memory usage as well. An updated test corpus that uses modern web features would be a big improvement. Making it easier to track the memory impact of WebExtensions would also be great.

Conclusions

We ticked off 4 out of 5 of our big ticket items. 64-bit builds on Windows by default is just around the corner, so let’s just go ahead and count that as 5 out of 5. I see plenty of future challenges for the MemShrink group, particularly once the dust settles from enabling multiple content processes and the various Quantum projects.

Let me know if I missed any big improvements; I’m sure there are plenty!

Are we slim yet is dead, all hail are we slim yet

Aside from some pangs of nostalgia, it is with great pleasure that I announce the retirement of areweslimyet.com, the areweslimyet github project, and its associated infrastructure (a sad computer in Mountain View under dvander’s desk and a possibly less sad computer running the website that’s owned by the former maintainer).

Wait, what?

Don’t worry! Are we slim yet, aka AWSY, lives on. It’s just moved in-tree and is run within Mozilla’s automated testing infrastructure.

For equivalent graphs check out:
Explicit
RSS
Miscellaneous

You can build your own graph from Perfherder. Just choose ‘+ Add test data’, ‘awsy’ for the framework and the tests and platforms you care about.

Wait, why?

I spent a few years maintaining and updating AWSY and some folks spent a fair amount of time before me. It was an ad hoc system that had bits and pieces bolted on over time. I brought it into the modern age from using the mozmill framework over to marionette, added support for e10s, and cleaned up some old slightly busted code. I tried to reuse packages developed by Mozilla to make things a bit easier (mozdownload and friends).

This was all pretty good, but things kept breaking. We weren’t in-tree, so breaking changes to marionette, mozdownload, etc. would cause failures for us and it would take a while to figure out what happened. Sometimes the hard drive filled up. Sometimes the status file would get corrupted due to a poorly timed shutdown. It just needed a lot of maintenance for a project with nobody dedicated to it.

The final straw was the retirement of archive.mozilla.org for what we call tinderbox builds, builds that are done more or less per push. This completely broke AWSY back in January and we decided it was just better to give in and go in-tree.

So is this a good thing?

It is a great thing. We’ve gone from 18,000 lines of code to 1,000 lines of code. That is not a typo. We now run on linux64, win32, and win64. Mac is coming soon. We turned on e10s. We have results on mozilla-inbound, autoland, try, mozilla-central, and mozilla-beta. We’re going to have automated crash analysis soon. We were able to use the project to give the green light to the e10s-multi project on memory usage.

Oh and guess what? Developers can run AWSY locally via mach. That’s right, try this out:

mach awsy-test --quick

Big thanks go out to Paul Yang and Bob Clary who pulled all this together. All I did was a quick draft of an awsy-lite implementation; they did the heavy lifting getting it in-tree, integrated with Taskcluster, and integrated with mach.

What’s next?

Now that we’re in-tree we can easily add new tests. Imagine getting data points for running the AWSY test with a specific add-on enabled to see if it regresses memory across revisions. And anyone can do this, no crazy local setup. Just mach awsy-test.

Are they slim yet, round 2

A year later let’s see how Firefox fares on Windows, Linux, and OSX with multiple content processes enabled.

Results

Graph comparing memory usage, chrome is still quite high

We can see that Firefox with four content processes fares better than Chrome on all platforms which is reassuring; Chrome is still about 2X worse on Windows and Linux. Our current plan is to only move up to four content processes, so this is great news.

Two content processes is still better than IE; with four we’re a bit worse. This is pretty impressive given last year we were in the same position with one content process.

Surprisingly, on Mac Firefox is better than Safari with two content processes. Last year we used 2X the memory with just one process; now we’re on par with four content processes.

I included Firefox with eight content processes to keep us honest. As you can see we actually do pretty well, but I don’t think it’s realistic to ship with that many nor do we currently plan to. We already have or are adding additional processes such as the plugin process for Flash and the GPU process. These need to be taken into consideration when choosing how many content processes to enable and pushing to eight doesn’t give us much breathing room. Making sure we have measurements now is important; it’s good to know where we can improve.

Overall I feel solid about these numbers, especially considering where we were just a year ago. This bodes well for the e10s-multi project.

Test setup

This is the same setup as last year. I load the first 30 pages of the tp5 page set (a snapshot of Alexa top 100 websites from a few years ago), each in its own tab, with 10 seconds in between loads and 60 seconds of settle time at the end.

Note: There was a minor change to the setup to give each page a unique domain. At least Safari and Chrome are roughly doing process per domain, so just using different ports on localhost was not enough. A simple solution was to modify my /etc/hosts file to add localhost-<1-30> aliases.
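As a rough sketch of that hosts-file change, the alias lines could be generated like this. Only the localhost-<1-30> naming comes from the setup above; the 127.0.0.1 loopback address and the helper name are assumptions for illustration.

```python
def hosts_aliases(count=30, address="127.0.0.1"):
    """Return one hosts-file line per alias: localhost-1 .. localhost-<count>.

    The address is an assumed loopback address, not taken from the original setup.
    """
    return [f"{address} localhost-{i}" for i in range(1, count + 1)]


# Print the lines; appending them to /etc/hosts is left to the reader.
print("\n".join(hosts_aliases()))
```

Each page in the test can then be served from its own alias, so browsers that split processes by domain see 30 distinct domains.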

Methodology

Measuring multiprocess browser memory usage is tricky. I’ve settled on a somewhat simple formula of:

total_memory = sum_uss(content processes) + sum_rss(parent processes); 

Where a parent process is defined as anything that is not a content process (I’ll explain in a moment). Historically there was just one parent process that managed all other processes. This is still somewhat the case, but each browser has other executables it may run in addition to content processes. A content process has a slightly different definition per browser, but is generally “where the pages are loaded”. This is an oversimplification, but it’s good enough for now.

My definitions:

| Browser | Content Definition | Example “parent” |
|---|---|---|
| Firefox | firefox processes launched with the -contentproc command line. | firefox without the -contentproc command line, plugin-process which is used for Flash, etc. |
| Chrome | chrome processes launched with the --type command line. | chrome without the --type command line, nacl_helper, etc. |
| Safari | WebContent processes. | Safari, SafariServices, SafariHistory, Webkit.Networking, etc. |
| IE | iexplore.exe process launched with the /prefetch command line. | iexplore without the /prefetch command line. |
| Edge | MicrosoftEdgeCP.exe processes. | MicrosoftEdge.exe, etc. |

For Firefox this is a reasonable and fair measurement; for other browsers we might be undercounting memory by a bit. For example, Edge has a parent executable, MicrosoftEdge.exe, and a different content executable, MicrosoftEdgeCP.exe. Arguably we should measure the RSS of one of the MicrosoftEdgeCP.exe processes and USS for the rest, so we’re probably undercounting. On the other hand we might end up overcounting if the parent and content processes are sharing dynamic libraries. In future measurements I may tweak how we sum the memory, but for now I’d rather possibly undercount than worry about being unfair to other browsers.
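The definitions above can be sketched as a small classifier. This is an illustration rather than the code the test actually uses: the Proc record, the function names, and the sample numbers are all hypothetical, and in practice the RSS and USS values would have to come from the operating system or a library such as psutil.

```python
from typing import NamedTuple


class Proc(NamedTuple):
    """Hypothetical snapshot of one process; sizes are in MB."""
    name: str
    cmdline: str
    rss: int
    uss: int


def is_content(browser: str, proc: Proc) -> bool:
    """Apply the per-browser content-process definitions listed above."""
    if browser == "firefox":
        return proc.name == "firefox" and "-contentproc" in proc.cmdline
    if browser == "chrome":
        return proc.name == "chrome" and "--type" in proc.cmdline
    if browser == "safari":
        return "WebContent" in proc.name
    if browser == "ie":
        return proc.name == "iexplore.exe" and "/prefetch" in proc.cmdline
    if browser == "edge":
        return proc.name == "MicrosoftEdgeCP.exe"
    raise ValueError(f"unknown browser: {browser}")


def total_memory(browser: str, procs: list) -> int:
    """USS for content processes plus RSS for everything else."""
    return sum(p.uss if is_content(browser, p) else p.rss for p in procs)


# Made-up numbers, purely for illustration.
firefox_procs = [
    Proc("firefox", "firefox", 300, 250),
    Proc("firefox", "firefox -contentproc", 200, 120),
    Proc("firefox", "firefox -contentproc", 180, 100),
    Proc("plugin-process", "plugin-process", 50, 40),
]
print(total_memory("firefox", firefox_procs))
```

Anything that fails the content test is counted as a parent, which is why helper executables like the plugin process contribute their full RSS.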

Raw numbers

| OS | Browser | Total Memory |
|---|---|---|
| Ubuntu 16.04 LTS | Chrome 54 (see note) | 1,478 MB |
| Ubuntu 16.04 LTS | Firefox 55 – 2 CP | 765 MB |
| Ubuntu 16.04 LTS | Firefox 55 – 4 CP | 817 MB |
| Ubuntu 16.04 LTS | Firefox 55 – 8 CP | 990 MB |
| macOS 10.12.3 | Chrome 59 | 1,365 MB |
| macOS 10.12.3 | Firefox 55 – 2 CP | 1,113 MB |
| macOS 10.12.3 | Firefox 55 – 4 CP | 1,215 MB |
| macOS 10.12.3 | Firefox 55 – 8 CP | 1,399 MB |
| macOS 10.12.3 | Safari 10.2 (see note) | 1,203 MB |
| Windows 10 | Chrome 59 | 1,382 MB |
| Windows 10 | Edge (see note) | N/A |
| Windows 10 | Firefox 55 – 2 CP | 587 MB |
| Windows 10 | Firefox 55 – 4 CP | 839 MB |
| Windows 10 | Firefox 55 – 8 CP | 905 MB |
| Windows 10 | IE 11 | 660 MB |

Browser Version Notes

  • Chrome 54 — aka chrome-unstable — was used on Ubuntu 16.04 LTS as that’s the latest branded version available (rather than Chromium)
  • Firefox Nightly 55 – 2 CP is Firefox with 2 content processes and one parent process, the default configuration for Nightly.
  • Firefox Nightly 55 – 4 CP is Firefox with 4 content processes and one parent process; this is a longer term goal.
  • Firefox Nightly 55 – 8 CP is Firefox with 8 content processes and one parent process; this is aspirational, but a good sanity check.
  • Safari Technology Preview 10.2 release 25 was used on macOS as that’s the latest branded version available (rather than Webkit nightly)
  • Edge was disqualified because it seemed to bypass the hosts file and wouldn’t load pages from unique domains. I can do measurements so I might revisit this, but it wouldn’t have been a fair comparison as-is.

Minimum alignment of allocation across platforms

In Firefox we use a custom allocator, mozjemalloc, based on a rather ancient version of jemalloc. The motivation for using a custom allocator is that it potentially gives us both performance and memory wins. I don’t know the full history, so I’ll let someone else write that up. What I do know is that we use it and it behaves a bit differently than system malloc implementations in a rather significant way: minimum alignment.

Why does this matter? Well it turns out C runtime implementations and/or compilers make some assumptions based on what the minimum allocation size and alignment are. For example in bug 1181142 we’re looking at a crash on Windows that happens in strcmp. The CRT decided to walk off the end of a page because it was comparing 4 bytes at a time.
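To see why a 4-byte read can fault, here is a minimal sketch of the address arithmetic. The 4096-byte page size and the example addresses are assumptions for illustration, and this is not the CRT’s actual code.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def read_crosses_page(address: int, width: int = 4, page_size: int = PAGE_SIZE) -> bool:
    """True if reading `width` bytes starting at `address` touches the next page."""
    first_page = address // page_size
    last_page = (address + width - 1) // page_size
    return first_page != last_page


# A 2-byte buffer placed in the last two bytes of a page: a 4-byte read
# starting there runs into the following page, which may not be mapped.
print(read_crosses_page(PAGE_SIZE - 2))

# With 4-byte alignment the same read starts at a multiple of 4 and stays
# inside the page.
print(read_crosses_page(PAGE_SIZE - 4))
```

As long as every allocation starts on a multiple of the read width, a read of that width can never straddle two pages, which is the assumption the CRT is relying on.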

Crossing the page boundary.

Why was it doing that? Because the minimum allocation size is at least 4 bytes, so why not? If you head over to MSDN it’s spelled out somewhat clearly (although older versions of that page lack the specific byte sizes):

A fundamental alignment is an alignment that’s less than or equal to the largest alignment that’s supported by the implementation without an alignment specification. (In Visual C++, this is the alignment that’s required for a double, or 8 bytes. In code that targets 64-bit platforms, it’s 16 bytes.)

We’ve had similar issues on Linux (and maybe OS X), see bug 691003 for more historical details.

As it turns out we’re still not exactly in compliance on Linux, which seems to stipulate 8-byte alignment on 32-bit and 16-byte alignment on 64-bit:

The address of a block returned by malloc or realloc in GNU systems is always a multiple of eight (or sixteen on 64-bit systems).

We haven’t seen a compelling reason (in the form of crashes) to go up to an 8-byte alignment on 32-bit platforms, but perhaps that’s due to Linux being such a small percentage of our users.

And let’s not forget about OS X, which as far as I can tell has always had a 16-byte alignment minimum. I can’t find where that’s spelled out in bytes, but go bang on malloc and you’ll always get a 16-byte aligned thing. My guess is this is a leftover from the PPC days and AltiVec. From the malloc man page for OS X:

The allocated memory is aligned such that it can be used for any data type, including AltiVec- and SSE-related types.

Again, we haven’t seen crashes pointing to the lack of 16-byte alignment; perhaps that’s because OS X is also a small percentage of our users. On the other hand maybe this is just an optimization and not an outright requirement.

So what happens when we do the right thing? Odds are fewer crashes, which is good. Maybe more memory usage (ask for a 1-byte thing on 64-bit Windows and you’re going to get a 16-byte thing back), although early testing hasn’t shown a huge impact. Perf-wise there might be a win: with guaranteed minimum sizes we can compare things a bit quicker (4, 8, or 16 bytes at a time).
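The memory cost can be sketched as simple rounding. This assumes the allocator rounds every request up to a multiple of the platform minimum, which is a simplification and not how mozjemalloc’s size classes actually work. The function names are hypothetical, and the OS X value is the observed behaviour described above rather than a documented number.

```python
def min_alignment(platform: str, is_64bit: bool) -> int:
    """Minimum malloc alignment in bytes, per the platform notes above."""
    if platform == "osx":
        return 16  # observed behaviour, not a documented figure
    if platform in ("windows", "linux"):
        return 16 if is_64bit else 8
    raise ValueError(f"unknown platform: {platform}")


def rounded_size(requested: int, alignment: int) -> int:
    """Round a request up to the next multiple of `alignment` (a power of two)."""
    return (requested + alignment - 1) & ~(alignment - 1)


# A 1-byte request on 64-bit Windows comes back as a 16-byte block.
print(rounded_size(1, min_alignment("windows", is_64bit=True)))
```

The overhead only matters for allocations smaller than the alignment, which may be part of why early testing didn’t show a huge impact.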

Are they slim yet?

In my previous post I focused on how Firefox compares against itself with multiple content processes. In this post I’d like to take a look at how Firefox compares to other browsers.

For this task I automated as much as I could; the code is available as the atsy project on GitHub. My goal here is to allow others to repeat my work, point out flaws, push fixes, etc. I’d love for this to be a standardized test for comparing browsers on a fixed set of pages.

As with my previous measurements, I’m going with:

total_memory = RSS(parent) + sum(USS(children))

An aside on the state of WebDriver and my hacky workarounds

When various WebDriver implementations get fixed we can make a cleaner test available. I had a dream of automating the tests across browsers using the WebDriver framework. Alas, trying to do anything with tabs and WebDriver across browsers and platforms is a fruitless endeavor. Chrome’s actually the only one I could get somewhat working with WebDriver.

Luckily Chrome and Firefox are completely automated. I had to do some trickery to get Chrome working, filed a bug, doesn’t sound like they’re interested in fixing it. I also had to do some trickery to get Firefox to work (I ended up using our marionette framework directly instead), there are some bugs, not much traction there either.

IE and Safari are semi-automated, in that I launch a browser for you, you click a button, and then hit enter when it’s done. Safari’s WebDriver extension is completely broken, nobody seems to care. IE’s WebDriver completely failed at tabs (among other things), I’m not sure where to file a bug for that.

Edge is mostly manual, its WebDriver implementation doesn’t support what I need (yet), but it’s new so I’ll give it a pass. Also you can’t just launch the browser with a file path, so there’s that. Also note I was stuck running it in a VM from modern.ie which was pretty old (they don’t have a newer one). I’d prefer not to do that, but I couldn’t upgrade my Windows 7 machine to 10 because Microsoft, Linux, bootloaders and sadness.

I didn’t test Opera, sorry. It uses Blink so hopefully the Chrome coverage is good enough.

The big picture

Browser memory compared

The numbers

| OS | Browser | Version | RSS + USS |
|---|---|---|---|
| OSX 10.10.5 | Chrome Canary | 50.0.2627.0 | 1,354 MiB |
| OSX 10.10.5 | Firefox Nightly (e10s) | 46.0a1 20160122030244 | 1,065 MiB |
| OSX 10.10.5 | Safari | 9.0.3 (10601.4.4) | 451 MiB |
| Ubuntu 14.04 | Google Chrome Unstable | 49.0.2618.8 dev (64-bit) | 944 MiB |
| Ubuntu 14.04 | Firefox Nightly (e10s) | 46.0a1 20160122030244 (64-bit) | 525 MiB |
| Windows 7 | Chrome Canary | 50.0.2631.0 canary (64-bit) | 1,132 MiB |
| Windows 7 | Firefox Nightly (e10s) | 47.0a1 20160126030244 (64-bit) | 512 MiB |
| Windows 7 | IE | 11.0.9600.18163 | 523 MiB |
| Windows 10 | Edge | 20.10240.16384.0 | 795 MiB |

So yeah, Chrome’s using about 2X the memory of Firefox on Windows and Linux. Let’s just read that again. That gives us a bit of breathing room.

It needs to be noted that Chrome is essentially doing 1 process per page in this test. In theory it’s configurable and I would have tried limiting its process count, but as far as I can tell they’ve let that feature decay and it no longer works. I should also note that Chrome has its own version of MemShrink, Project TRIM, so memory usage is an area they’re actively working on.

Safari does creepily well. We could attribute this to close OS integration, but I would guess I’ve missed some processes. If you take it at face value, Safari is using 1/3 the memory of Chrome, 1/2 the memory of Firefox. Even if I’m miscounting, I’d guess they still outperform both browsers.

IE was actually on par with Firefox which I found impressive. Edge is using about 50% more memory than IE, but I wouldn’t read too much into that as I’m comparing running IE on Windows 7 to Edge on an outdated Windows 10 VM.

Memory Usage of Firefox with e10s Enabled

Quick background

With the e10s project full steam ahead, likely to be enabled for many users in mid-2016, it seemed like a good time to measure the memory overhead of switching Firefox from a single-process architecture to a multi-process architecture. The concern here is simple: the more processes we have, the more memory we use. Starting Q4-2015 I began setting up a test to measure the memory usage of Firefox with a variable number of content processes.

Methodology

For the test I used a slightly modified version of the AWSY framework that I maintain for areweslimyet.com. This test runs through a sample pageset, the same one used in Talos perf testing, in an attempt to simulate a long-lived session.

The steps:

  1. Open Firefox configured to use N content processes.
  2. Measure memory usage.
  3. Open 100 URLs in 30 tabs, cycling through tabs once 30 are opened. Wait 10 seconds per tab.
  4. Measure memory usage.
  5. Close all tabs.
  6. Measure memory usage.

For this test I performed two iterations of this, reporting the startup memory usage from the first and the end of test memory usage (TabsOpen, TabsClosed) for the second.
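The tab cycling in step 3 can be sketched as below. I’m assuming here that once all 30 tabs exist, loads reuse them in order starting again from the first tab; the function name and tuple layout are hypothetical and not taken from the AWSY code.

```python
def tab_schedule(url_count=100, max_tabs=30):
    """Yield (url_index, tab_index, opens_new_tab) for each page load.

    The first `max_tabs` URLs each open a new tab; later URLs are assumed
    to reuse the existing tabs in round-robin order.
    """
    for url_index in range(url_count):
        yield url_index, url_index % max_tabs, url_index < max_tabs


schedule = list(tab_schedule())
# The 31st URL (index 30) is the first one to land in an existing tab.
print(schedule[30])
```

With 100 URLs and 30 tabs, the first ten tabs each end up loading four pages over a run and the remaining twenty load three.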

Note: Just summing the total memory usage of each Firefox process is not a useful metric as it will include memory shared between the main process and the content processes. For a more realistic baseline I chose to use a combination of RSS and USS (aka unique set size, private working bytes):

total_memory = RSS(parent_process) + sum(USS(content_processes))

For example if we had:

| Process | RSS | USS |
|---|---|---|
| parent | 100 | 50 |
| content_1 | 90 | 30 |
| content_2 | 95 | 40 |

total_memory = 100 + 30 + 40
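Here is the same example as a short sketch, next to the naive sum of RSS for comparison. The dictionary layout and function names are just for illustration.

```python
# Values from the example table above (units are arbitrary).
processes = {
    "parent": {"rss": 100, "uss": 50},
    "content_1": {"rss": 90, "uss": 30},
    "content_2": {"rss": 95, "uss": 40},
}


def total_memory(procs, parent="parent"):
    """RSS of the parent plus USS of every other process."""
    return procs[parent]["rss"] + sum(
        mem["uss"] for name, mem in procs.items() if name != parent
    )


def naive_total(procs):
    """Sum of RSS across all processes; counts shared memory more than once."""
    return sum(mem["rss"] for mem in procs.values())


print(total_memory(processes))  # 170
print(naive_total(processes))   # 285
```

The gap between the two numbers is memory that the content processes share with other processes and that the naive sum counts again for each of them.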

Results

Note on memory checkpoints:

  • Settled: 30 seconds have passed since previous checkpoint.
  • ForceGC: We manually invoked garbage collection.
  • We list the memory usage for each checkpoint using 0, 1, 2, 4, 8 content processes.

Linux, 64-bit

| Checkpoint | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Start | 190 MiB | 232 MiB | 223 MiB | 223 MiB | 229 MiB |
| StartSettled | 173 MiB | 219 MiB | 216 MiB | 219 MiB | 213 MiB |
| TabsOpen | 457 MiB | 544 MiB | 586 MiB | 714 MiB | 871 MiB |
| TabsOpenSettled | 448 MiB | 542 MiB | 582 MiB | 696 MiB | 872 MiB |
| TabsOpenForceGC | 415 MiB | 510 MiB | 560 MiB | 670 MiB | 820 MiB |
| TabsClosed | 386 MiB | 507 MiB | 401 MiB | 381 MiB | 381 MiB |
| TabsClosedSettled | 264 MiB | 359 MiB | 325 MiB | 308 MiB | 303 MiB |
| TabsClosedForceGC | 242 MiB | 322 MiB | 304 MiB | 285 MiB | 281 MiB |

Windows 7, 64-bit

32-bit Firefox

| Checkpoint | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Start | 172 MiB | 212 MiB | 207 MiB | 204 MiB | 213 MiB |
| StartSettled | 194 MiB | 236 MiB | 234 MiB | 232 MiB | 234 MiB |
| TabsOpen | 461 MiB | 537 MiB | 631 MiB | 800 MiB | 1,099 MiB |
| TabsOpenSettled | 463 MiB | 535 MiB | 635 MiB | 808 MiB | 1,108 MiB |
| TabsOpenForceGC | 447 MiB | 514 MiB | 593 MiB | 737 MiB | 990 MiB |
| TabsClosed | 429 MiB | 512 MiB | 435 MiB | 333 MiB | 347 MiB |
| TabsClosedSettled | 356 MiB | 427 MiB | 379 MiB | 302 MiB | 306 MiB |
| TabsClosedForceGC | 342 MiB | 392 MiB | 360 MiB | 297 MiB | 295 MiB |

64-bit Firefox

| Checkpoint | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Start | 245 MiB | 276 MiB | 275 MiB | 279 MiB | 295 MiB |
| StartSettled | 236 MiB | 290 MiB | 287 MiB | 288 MiB | 289 MiB |
| TabsOpen | 618 MiB | 699 MiB | 805 MiB | 1,061 MiB | 1,334 MiB |
| TabsOpenSettled | 625 MiB | 690 MiB | 795 MiB | 1,058 MiB | 1,338 MiB |
| TabsOpenForceGC | 600 MiB | 661 MiB | 740 MiB | 936 MiB | 1,184 MiB |
| TabsClosed | 568 MiB | 663 MiB | 543 MiB | 481 MiB | 435 MiB |
| TabsClosedSettled | 451 MiB | 517 MiB | 454 MiB | 426 MiB | 377 MiB |
| TabsClosedForceGC | 432 MiB | 480 MiB | 429 MiB | 412 MiB | 374 MiB |

OSX, 64-bit

| Checkpoint | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| Start | 319 MiB | 350 MiB | 342 MiB | 336 MiB | 336 MiB |
| StartSettled | 311 MiB | 393 MiB | 383 MiB | 384 MiB | 382 MiB |
| TabsOpen | 889 MiB | 1,038 MiB | 1,243 MiB | 1,397 MiB | 1,694 MiB |
| TabsOpenSettled | 876 MiB | 977 MiB | 1,105 MiB | 1,252 MiB | 1,632 MiB |
| TabsOpenForceGC | 795 MiB | 966 MiB | 1,096 MiB | 1,235 MiB | 1,540 MiB |
| TabsClosed | 794 MiB | 996 MiB | 977 MiB | 889 MiB | 883 MiB |
| TabsClosedSettled | 738 MiB | 925 MiB | 876 MiB | 823 MiB | 832 MiB |
| TabsClosedForceGC | 621 MiB | 800 MiB | 799 MiB | 755 MiB | 747 MiB |

Conclusions

Simply put: the more content processes we use, the more memory we use. On the plus side it’s not a 1:1 factor. With 8 content processes we see roughly a doubling of memory usage on the TabsOpenSettled measurement. It’s a bit worse on Windows, a bit better on OSX, but it’s not 8 times worse.

Overall we see a 10-20% increase in memory usage for the 1 content process case (which is what we plan on shipping initially). This seems like a fair tradeoff for potential security and performance benefits, but as we try to grow the number of content processes we’ll need to take another look at where that memory is being used.
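Those ratios can be checked directly against the TabsOpenSettled rows. This is just a sketch over numbers copied from the tables above; the variable and function names are mine.

```python
# TabsOpenSettled values in MiB, copied from the results tables.
# Columns are 0, 1, 2, 4, and 8 content processes.
CONTENT_PROCESSES = (0, 1, 2, 4, 8)
TABS_OPEN_SETTLED = {
    "Linux 64-bit": (448, 542, 582, 696, 872),
    "Windows 32-bit": (463, 535, 635, 808, 1108),
    "Windows 64-bit": (625, 690, 795, 1058, 1338),
    "OSX 64-bit": (876, 977, 1105, 1252, 1632),
}


def growth(values, processes):
    """Memory with `processes` content processes relative to the 0-process baseline."""
    return values[CONTENT_PROCESSES.index(processes)] / values[0]


for platform, values in TABS_OPEN_SETTLED.items():
    print(
        f"{platform}: 1 process = {growth(values, 1):.2f}x, "
        f"8 processes = {growth(values, 8):.2f}x"
    )
```

On these numbers the 8-process case lands between roughly 1.9x and 2.4x the baseline, and the 1-process overhead between roughly 10% and 21%.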

For the next steps I’d like to take a look at how our memory usage compares to other browsers. Expect a follow up post on that shortly.