RFC: Electrolysis manifesto: Chrome never blocks on content

Discussion:

Chris Jones

2010-04-23 07:22:43 UTC

tl;dr
o The biggest win for e10s on mobile is perceived UI responsiveness
o We lose that if chrome can block on content
o There's perhaps been some confusion about reentrant vs.
non-reentrant synchronous messaging
- non-reentrant synchronous messaging is perfectly OK content-->chrome
- reentrant is not, because it potentially blocks chrome on content
o Therefore, abolish chrome blocking on content, including reentrant
synchronous messages from content-->chrome
o (Jetpacks and debuggers are not being discussed)

(Spawned off from jduell's post in m.d.t.dom. I think we should discuss
the "what" before the "how".)

Let's start from first principles: we want e10s for mobile Firefox
mainly because by running chrome and content code in separate processes,
the chrome UI can stay responsive even if content isn't. e10s achieves
this by giving chrome and content separate event loops. Chrome can
process input events in its event loop without being delayed by content.
We might even be able to put a rough upper bound on how long chrome
takes before responding to arbitrary input; in fact, establishing and
minimizing such an upper bound should be an explicit goal IMHO.

We lose this improved responsiveness if the processing of events in the
chrome event loop can be delayed by web content. One way this can
happen is if chrome makes a synchronous request to content, i.e. blocks on
content. AFAIK, no one advocates this.

There's a second way that has recently arisen. To discuss it, some
background is required. Our IPC mechanisms support three messaging
models between A and B:

o 'async' --- A "fires and forgets" a message to B. Neither A blocks
on B nor B blocks on A.

o 'sync' --- A sends a non-reentrant, synchronous message to B. This
means A blocks on B until B replies to the message. However, while
replying to the message, B is not allowed to "re-enter" A, that is, send
a synchronous message to A. From A's perspective, sending the message
and receiving the reply is synchronous. But from B's perspective,
replying to the message is asynchronous.

o 'rpc' --- A sends a reentrant, synchronous message to B. This means
that while B replies to A, it is allowed to "re-enter" A with a
synchronous message. (And A can further re-enter B, and so forth.)
(This is sort of how function calls work, hence the name 'rpc', "remote
procedure call".)
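
To make the three models concrete, here is a minimal single-threaded sketch in Python. It is a toy, not the real IPC code: `Endpoint`, `connect`, `ReentryError`, and the message names are all made up for illustration, and "blocking" is modeled as a plain function call.

```python
from collections import deque


class ReentryError(RuntimeError):
    """A handler tried to block on a peer that is waiting on a 'sync' reply."""


class Endpoint:
    """Toy stand-in for one side of a channel (hypothetical, not the real API)."""

    def __init__(self, name):
        self.name = name
        self.peer = None
        self.handlers = {}            # message name -> callable(endpoint)
        self.inbox = deque()          # async messages not yet processed
        self.waiting_on_sync = False  # True while blocked in send_sync()

    def send_async(self, msg):
        # Fire and forget: enqueue on the peer and return immediately.
        self.peer.inbox.append(msg)

    def send_sync(self, msg):
        # Block until the peer replies; the peer may NOT re-enter us meanwhile.
        self._check_reentry()
        self.waiting_on_sync = True
        try:
            return self.peer.handlers[msg](self.peer)
        finally:
            self.waiting_on_sync = False

    def send_rpc(self, msg):
        # Block until the peer replies; the peer MAY re-enter us meanwhile.
        self._check_reentry()
        return self.peer.handlers[msg](self.peer)

    def _check_reentry(self):
        # A blocking send to a peer that is itself waiting in 'sync' is re-entry.
        if self.peer.waiting_on_sync:
            raise ReentryError(f"{self.name} may not re-enter {self.peer.name}")

    def drain(self):
        # Back in the event loop: process queued async messages in order.
        while self.inbox:
            self.handlers[self.inbox.popleft()](self)


def connect(a, b):
    a.peer, b.peer = b, a


content, chrome = Endpoint("content"), Endpoint("chrome")
connect(content, chrome)
seen = []
content.handlers["ping"] = lambda me: "pong"
content.handlers["note"] = lambda me: seen.append("note")
chrome.handlers["reenter"] = lambda me: me.send_rpc("ping")   # blocks on content
chrome.handlers["notify"] = lambda me: me.send_async("note")  # does not block

print(content.send_rpc("reenter"))   # 'rpc': chrome is allowed to re-enter content
content.send_sync("notify")          # 'sync': async from chrome is queued, not run
print(seen, list(content.inbox))
content.drain()                      # content is back in its event loop
print(seen)
try:
    content.send_sync("reenter")     # 'sync': chrome tries to re-enter the caller
except ReentryError as err:
    print("rejected:", err)
```

The point to notice is that only the 'rpc' call lets the chrome-side handler issue a blocking call back into content; under 'sync' the same handler is rejected, while an async notification is merely queued until content returns to its loop.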

So with that, the second way chrome can block on content is if content
is allowed to send an 'rpc' message to chrome. This would mean that,
while replying to content, chrome code could send back an 'rpc' message,
re-entering content and blocking on it. Once chrome re-enters content
in this way, it's at the mercy of content code to not invoke web content
directly or compute values parameterized by web content. If content did,
then chrome could block on content for an arbitrarily long time, and
responsiveness would be lost.

So I think we have two paths going forward:

(1) Forbid chrome from ever blocking on content, including forbidding
'rpc' messages from content to chrome. (And to reiterate, 'sync'
messages from content to chrome would be perfectly acceptable.)

(2) Allow content-->chrome 'rpc', but audit all content code that
might reply to chrome-->content re-entry to ensure that it never
computes something the duration of which is controllable by web content.

It seems to me that (1) is the path more likely to end in victory. That
said, it won't be easy, and things will break.
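
As a rough illustration of what enforcing (1) could look like, here is a sketch of a check over a protocol's declared messages. Everything in it is hypothetical: the `Message` record, `rule1_violations`, and the example message names are invented for this sketch and do not correspond to any existing tool or protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    name: str
    sender: str  # "chrome" or "content"
    kind: str    # "async", "sync", or "rpc"


def rule1_violations(messages):
    """Return (message, reason) pairs that proposal (1) would reject."""
    found = []
    for m in messages:
        if m.sender == "chrome" and m.kind in ("sync", "rpc"):
            # Chrome waits for a reply from content.
            found.append((m, "chrome blocks on content"))
        elif m.sender == "content" and m.kind == "rpc":
            # Re-entrant: chrome may call back into content while replying.
            found.append((m, "chrome could re-enter content while replying"))
    return found


# Hypothetical protocol, for illustration only.
protocol = [
    Message("Alert", "content", "sync"),
    Message("CreateWindow", "content", "rpc"),
    Message("Resize", "chrome", "async"),
    Message("QuerySelection", "chrome", "sync"),
]

for msg, reason in rule1_violations(protocol):
    print(f"{msg.kind} {msg.name} from {msg.sender}: {reason}")
```

Note that content-->chrome 'sync' and anything 'async' pass the check, which matches the rule as stated: content may block on chrome, but never the reverse.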

A compelling counterargument to (1) could take the form of, "If we can't
'rpc' from content to chrome, there's no possible way to implement
killer feature X."

And to be clear, this manifesto *does not* cover Jetpacks or debuggers:
they are separate problems. I don't know how current add-ons fit in
(debuggers excluded), but I also don't know that that's a problem we
need to solve fully before releasing fennec+electrolysis.

Cheers,
Chris

Mike Shaver

2010-04-23 12:13:34 UTC

(1) Forbid chrome from ever blocking on content, including forbidding 'rpc'
messages from content to chrome. (And to reiterate, 'sync' messages from
content to chrome would be perfectly acceptable.)
(2) Allow content-->chrome 'rpc', but audit all content code that might
reply to chrome-->content re-entry to ensure that it never computes
something the duration of which is controllable by web content.

With (2), if the content process is busy with something like a reflow,
it might block unacceptably before processing the "cheap" rpc reply
path. That's pretty much the source of the responsiveness issues we
want to address with e10s for Fennec, after all.

I think (1) is the right rule here.

Mike

Benjamin Smedberg

2010-04-23 13:41:07 UTC

Post by Mike Shaver

Why?

This is about content->chrome RPC. We have *lots* of codepaths where content
needs to block on chrome (alert()). The rule is that chrome must never block
on content. But I do think we're going to need to use RPC messages in the
other direction rather promiscuously, and I'm sure we're going to need RPC
messages content->jetpack and jetpack->chrome, which amounts to pretty much
the same thing.

--BDS

Mike Shaver

2010-04-23 14:14:18 UTC

On Fri, Apr 23, 2010 at 6:41 AM, Benjamin Smedberg

Post by Benjamin Smedberg

Post by Mike Shaver

(2) Allow content-->chrome 'rpc', but audit all content code that might
reply to chrome-->content re-entry to ensure that it never computes
something the duration of which is controllable by web content.

Why?
This is about content->chrome RPC. We have *lots* of codepaths where content
needs to block on chrome (alert()). The rule is that chrome must never block
on content.

What does "reply to chrome->content re-entry" mean, then, and why do
we care if it computes something for which the duration is
controllable by web content?

Mike

Benjamin Smedberg

2010-04-23 14:23:10 UTC

Post by Mike Shaver
What does "reply to chrome->content re-entry" mean, then, and why do
we care if it computes something for which the duration is
controllable by web content?

I think cjones is incorrect about chrome->content re-entry with RPC
messages, because we don't allow RPC messages in that direction. However,
the content->chrome message still has to be RPC (instead of sync) for one
reason: the return value from such a function may contain an actor handle
which was just created (during the call). We have to asynchronously deliver
that constructor message before we deliver the RPC reply, in order for the
content process to be aware of it.
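
The ordering constraint can be sketched with a toy receiver in Python. This is only an illustration of why the constructor message has to arrive before the reply; `ContentSide`, `run`, the tuple message format, and the actor id are assumptions made up for the example.

```python
class ContentSide:
    """Toy receiver; actor ids and message tuples are hypothetical."""

    def __init__(self):
        self.actors = {}  # actor id -> local actor object

    def deliver(self, message):
        kind, actor_id = message
        if kind == "ctor":
            # Register the newly constructed actor under its id.
            self.actors[actor_id] = f"actor#{actor_id}"
            return None
        if kind == "reply":
            # The reply only carries a handle; it must already be registered.
            return self.actors[actor_id]
        raise ValueError(f"unknown message kind: {kind}")


def run(messages):
    """Deliver messages in order and return the result of the last one."""
    side = ContentSide()
    result = None
    for m in messages:
        result = side.deliver(m)
    return result


print(run([("ctor", 7), ("reply", 7)]))   # ctor first: the handle resolves
try:
    run([("reply", 7), ("ctor", 7)])      # reply first: the handle is unknown
except KeyError as err:
    print("unknown actor handle:", err)
```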

--BDS

Mike Shaver

2010-04-23 14:27:41 UTC

On Fri, Apr 23, 2010 at 7:23 AM, Benjamin Smedberg

Post by Benjamin Smedberg

Post by Mike Shaver
What does "reply to chrome->content re-entry" mean, then, and why do
we care if it computes something for which the duration is
controllable by web content?

In that case, yeah, it's fine.

I presume that we can't end up with a content->chrome rpc blocking the
content process, and then the chrome process blocking because a queue
is too full for it to post an async message to the same content
process, right?

Mike

Benjamin Smedberg

2010-04-23 14:38:35 UTC

Post by Mike Shaver
I presume that we can't end up with a content->chrome rpc blocking the
content process, and then the chrome process blocking because a queue
is too full for it to post an async message to the same content
process, right?

I believe that's correct, yes: we cache those messages in a std::vector or
something, and should only be bounded by memory, not by any pipe-full problems.
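
A minimal sketch of that idea, in Python rather than C++ and with invented names (`Pipe`, `Sender`, `post_async`, `flush`): the sender parks messages in an in-memory queue whenever the pipe is full, so posting an async message never blocks. How the real channel code does this is not shown here; this only illustrates the "bounded by memory, not by the pipe" behaviour.

```python
from collections import deque


class Pipe:
    """Hypothetical fixed-capacity OS pipe."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()

    def try_write(self, msg):
        # Non-blocking write: refuse instead of waiting when the pipe is full.
        if len(self.buf) >= self.capacity:
            return False
        self.buf.append(msg)
        return True

    def read(self):
        return self.buf.popleft()


class Sender:
    def __init__(self, pipe):
        self.pipe = pipe
        self.pending = deque()  # overflow queue, bounded only by memory

    def post_async(self, msg):
        # Never blocks: queue first, then push as much as the pipe accepts.
        self.pending.append(msg)
        self.flush()

    def flush(self):
        # Move messages into the pipe in order until it fills up.
        while self.pending and self.pipe.try_write(self.pending[0]):
            self.pending.popleft()


pipe = Pipe(capacity=2)
sender = Sender(pipe)
for i in range(5):
    sender.post_async(i)          # returns immediately even when the pipe is full

print(list(pipe.buf), list(sender.pending))

received = []
while pipe.buf or sender.pending:
    received.append(pipe.read())  # the receiver makes room...
    sender.flush()                # ...and the sender tops the pipe up again
print(received)
```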

--BDS

Chris Jones

2010-04-23 19:16:47 UTC

Post by Benjamin Smedberg

Post by Mike Shaver
What does "reply to chrome->content re-entry" mean, then, and why do
we care if it computes something for which the duration is
controllable by web content?

I think cjones is incorrect about chrome->content re-entry with RPC
messages, because we don't allow RPC messages in that direction.

Incorrect how? All I'm saying is we shouldn't be doing this, and we
have new protocols that do, but see below.

Post by Benjamin Smedberg
However, the content->chrome message still has to be RPC (instead of
sync) for one reason: the return value from such a function may contain
an actor handle which was just created (during the call). We have to
asynchronously deliver that constructor message before we deliver the
RPC reply, in order for the content process to be aware of it.

If ctors are the only reason we're using RPC, I say it's a semantic bug
that we should fix. There are several ways we can special-case actor
ctors. I'll file a bug.

Cheers,
Chris

Boris Zbarsky

2010-04-23 19:40:59 UTC

Post by Chris Jones
If ctors are the only reason we're using RPC, I say it's a semantic bug
that we should fix. There are several ways we can special-case actor
ctors. I'll file a bug.

The reason I used rpc most recently was to support window.open.

For window.open, what needs to happen is that content code asks chrome
for a new window. Chrome calls back into the content process and has it
create a window, then hands that window to the original caller.

In terms of the actual messages on the wire, content sends a "give me a
window" message, chrome JS creates a <xul:browser>, whatever stuff
happens to create a new content process rendering area happens, then
chrome sends back the TabParent (which becomes a TabChild on the content
side).

I suppose we could change this by having the messages involved be async
but content spinning a (nested) event loop manually to prevent returning
to the open() caller... Would that work? Seems kinda fragile, but
maybe so's rpc.
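
Roughly what I have in mind, as a toy Python sketch (all names here, `ContentLoop`, `dispatch_one`, `open_window`, `fake_chrome`, and the event tuples, are made up; a real loop would wait for events rather than fail on an empty queue):

```python
from collections import deque


class ContentLoop:
    """Toy content-process event loop (hypothetical names throughout)."""

    def __init__(self):
        self.events = deque()
        self.handled = []
        self.new_window = None

    def dispatch_one(self):
        # Raises IndexError on an empty queue; a real loop would wait instead.
        kind, payload = self.events.popleft()
        self.handled.append(kind)
        if kind == "window-created":
            self.new_window = payload

    def open_window(self, send_async):
        # Ask chrome for a window without blocking on the channel...
        send_async("create-window")
        # ...then spin a nested loop so open() does not return early.
        while self.new_window is None:
            self.dispatch_one()
        window, self.new_window = self.new_window, None
        return window


loop = ContentLoop()


def fake_chrome(request):
    # Stand-in for chrome: an unrelated event happens to arrive first.
    loop.events.append(("timer", None))
    loop.events.append(("window-created", "window-1"))


print(loop.open_window(fake_chrome))
print(loop.handled)
```

The fragile part shows up in `handled`: the unrelated timer event runs inside the nested loop, i.e. before open() has returned to its caller.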

-Boris

Chris Jones

2010-04-23 19:47:28 UTC

Post by Boris Zbarsky

Post by Chris Jones
If ctors are the only reason we're using RPC, I say it's a semantic bug
that we should fix. There are several ways we can special-case actor
ctors. I'll file a bug.

The reason I used rpc most recently was to support window.open.
For window.open, what needs to happen is that content code asks chrome
for a new window. Chrome calls back into the content process and has it
create a window, then hands that window to the original caller.
In terms of the actual messages on the wire, content sends a "give me a
window" message, chrome JS creates a <xul:browser>, whatever stuff
happens to create a new content process rendering area happens, then
chrome sends back the TabParent (which becomes a TabChild on the content
side).

OK. Modulo the bug with 'rpc' and ctors not playing well together, I
think window.open (is this createWindow() in PIFrameEmbedding?) can be
'sync'. I'm working on fixing this bug right now.

Post by Boris Zbarsky
I suppose we could change this by having the messages involved be async
but content spinning a (nested) event loop manually to prevent returning
to the open() caller... Would that work? Seems kinda fragile, but
maybe so's rpc.

No thanks :). I think 'sync'++ is the right choice here.

Cheers,
Chris

Boris Zbarsky

2010-04-23 20:42:02 UTC

OK. Modulo the bug with 'rpc' and ctors not playing well together, I
think window.open (is this createWindow() in PIFrameEmbedding?)

Yes.

can be 'sync'. I'm working on fixing this bug right now.

But sync means the content process can't be reentered, no? In
particular, within the createWindow call we currently run
nsFrameLoader::ReallyStartLoadingInternal, which seems to send several
messages to the child process. Or are those all async except the ctor
and therefore ok within a sync call?

No thanks :). I think 'sync'++ is the right choice here.

If we can make it work, I agree.

-Boris

Chris Jones

2010-04-23 20:52:49 UTC

Post by Boris Zbarsky
But sync means the content process can't be reentered, no? In
particular, within the createWindow call we currently run
nsFrameLoader::ReallyStartLoadingInternal, which seems to send several
messages to the child process. Or are those all async except the ctor
and therefore ok within a sync call?

I don't know if they're async; if they are, they're fine to send, but
won't be processed until the 'sync' message is finished. Do the
notifications being sent contain information that the content process
can't deduce for itself?

Cheers,
Chris

Boris Zbarsky

2010-04-23 21:04:21 UTC

Post by Chris Jones

I don't know if they're async; if they are, they're fine to send, but
won't be processed until the 'sync' message is finished.

I think that should be fine; would need to audit the messages to be
sure. In particular, the race between the various loads being kicked
off here could need resolving; that should be doable, though.

Post by Chris Jones
Do the notifications being sent contain information that the content process
can't deduce for itself?

Generally speaking, yes. Not sure that information is relevant in this
case.

-Boris

Chris Jones

2010-04-23 20:46:32 UTC

Post by Chris Jones

Post by Benjamin Smedberg
However, the content->chrome message still has to be RPC (instead of
sync) for one reason: the return value from such a function may
contain an actor handle which was just created (during the call). We
have to asynchronously deliver that constructor message before we
deliver the RPC reply, in order for the content process to be aware of
it.

If ctors are the only reason we're using RPC, I say it's a semantic bug
that we should fix. There are several ways we can special-case actor
ctors. I'll file a bug.

I see us using this in |PIFrameEmbedding::createWindow()|, but why can't
that simply use the PIFrameEmbedding ctor? Did I miss other usages?

Cheers,
Chris

Chris Jones

2010-04-23 19:19:26 UTC

Post by Benjamin Smedberg
I'm sure we're
going to need RPC messages content->jetpack and jetpack->chrome, which
amounts to pretty much the same thing.

I don't have an opinion on chrome<-->jetpack RPC yet (content should be
fine), but are you referring to "RPC-for-ctor-ordering" or
"RPC-for-full-reentry"?

Cheers,
Chris