Supervisor.start_child blocks
Knee-deep in OTP work, I came across the following scenario: I had a Supervisor that can dynamically spawn workers, and each worker initializes its own state with an expensive function. To speed up the whole process I thought I would just wrap the call in a Task.start_link block to start the workers in parallel. Turns out, and this makes sense if you think about it, you can only add one worker at a time to a Supervisor.
Consider the following code:
defmodule MyApp do
  use Application

  def start(_type, _args) do
    import Supervisor.Spec, warn: false

    children = [
      worker(MyApp.Worker, [])
    ]

    opts = [strategy: :simple_one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end

  def run do
    Enum.map(1..10, fn i ->
      Task.start_link(fn ->
        Supervisor.start_child(MyApp.Supervisor, [i])
      end)
    end)
  end
end
with the following worker:
defmodule MyApp.Worker do
  def start_link(i) do
    IO.inspect i
    Agent.start_link(fn -> expensive(i) end)
  end

  def expensive(i) do
    :timer.sleep(1000)
    i
  end
end
If you run this function in IEx you get something like this:
iex(1)> MyApp.run
1
[ok: #PID<0.90.0>, ok: #PID<0.91.0>, ok: #PID<0.92.0>, ok: #PID<0.93.0>,
ok: #PID<0.94.0>, ok: #PID<0.95.0>, ok: #PID<0.96.0>, ok: #PID<0.97.0>,
ok: #PID<0.98.0>, ok: #PID<0.99.0>]
iex(2)> 2
3
4
5
...
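The integers show up 1 second apart. (Note: the returned PIDs belong to the Tasks, not to the workers.) What is happening is that Supervisor.start_child(MyApp.Supervisor, [i]) blocks until start_link has returned {:ok, pid}, and only then can the supervisor register another child process. The supervisor is a single process that handles one start_child request at a time, and Agent.start_link does not return until its initialization function has finished.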
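The serialization can be observed directly by timing a handful of concurrent start_child calls. The following is a minimal sketch, not the code from this post: it assumes a recent Elixir and swaps in DynamicSupervisor for the :simple_one_for_one strategy, and SlowWorker is a hypothetical module whose Agent sleeps 200 ms during initialization.

```elixir
defmodule SlowWorker do
  # Hypothetical worker: the Agent's init function sleeps, so start_link/1
  # does not return until the sleep has finished.
  def start_link(i) do
    Agent.start_link(fn ->
      Process.sleep(200)
      i
    end)
  end
end

# DynamicSupervisor is used here as a stand-in for :simple_one_for_one.
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

{micros, results} =
  :timer.tc(fn ->
    1..5
    |> Enum.map(fn i ->
      spec = %{id: SlowWorker, start: {SlowWorker, :start_link, [i]}}
      # Five tasks call the supervisor concurrently...
      Task.async(fn -> DynamicSupervisor.start_child(sup, spec) end)
    end)
    # ...but the supervisor process serves the requests one after another.
    |> Enum.map(&Task.await/1)
  end)

IO.inspect(results, label: "results")
IO.inspect(div(micros, 1000), label: "elapsed ms")
```

Although the five tasks run concurrently, the elapsed time should come out at about 5 x 200 ms or more rather than about 200 ms, because each start_link has to return before the supervisor handles the next request.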
The solution to this issue is to use a GenServer and to set the state asynchronously: init/1 sends the process a message and returns immediately, and handle_info/2 then does the expensive work. This is the changed code:
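defmodule MyApp.Worker do
  use GenServer

  def start_link(i) do
    IO.inspect i
    GenServer.start_link(__MODULE__, i)
  end

  def init(args) do
    # Queue a message to ourselves and return right away,
    # so start_link/1 does not wait for the expensive work.
    send(self(), :set_init_state)
    {:ok, args}
  end

  def handle_info(:set_init_state, i) do
    # The expensive work runs after the supervisor has moved on.
    :timer.sleep(3000)
    {:noreply, i * i}
  end
end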
and running IEx again:
iex(1)> MyApp.run
1
2
3
4
5
6
7
8
9
10
[ok: #PID<0.90.0>, ok: #PID<0.91.0>, ok: #PID<0.92.0>, ok: #PID<0.93.0>,
ok: #PID<0.94.0>, ok: #PID<0.95.0>, ok: #PID<0.96.0>, ok: #PID<0.97.0>,
ok: #PID<0.98.0>, ok: #PID<0.99.0>]
We can see that the workers all started immediately. But did they in fact change the state? Sure did!
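One way to check the state is :sys.get_state/1, which asks a GenServer for its current state. The sketch below is an assumption-laden illustration rather than the post's own code: DeferredWorker is a hypothetical copy of the worker with a shorter sleep, and it is started directly so that its PID is at hand (MyApp.run only returns the Task PIDs).

```elixir
defmodule DeferredWorker do
  use GenServer

  def start_link(i), do: GenServer.start_link(__MODULE__, i)

  @impl true
  def init(i) do
    # Sent before start_link/1 returns, so it is the first message in the mailbox.
    send(self(), :set_init_state)
    {:ok, i}
  end

  @impl true
  def handle_info(:set_init_state, i) do
    Process.sleep(300) # stand-in for the expensive initialization
    {:noreply, i * i}
  end
end

{:ok, pid} = DeferredWorker.start_link(7)

# The request queues up behind :set_init_state, so the reply
# reflects the state after the deferred initialization.
state = :sys.get_state(pid)
IO.inspect(state, label: "state after deferred init")
# prints: state after deferred init: 49
```

:sys.get_state/1 is meant for debugging; in application code a regular GenServer.call would be the usual way to read the state.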
What is this useful for? Let's say you want to start a connection to an external service, or start more workers as part of that worker in the supervision tree. Either way, you don’t want the top supervisor to wait until all the workers are initialized in sequence, especially when you can do it in parallel.
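For the external-connection case, here is a sketch of the same idea using the handle_continue/2 callback instead of a message to self. This is an alternative not used in the post, and it assumes Elixir 1.7 or later on OTP 21 or later. ConnWorker, fake_connect/1 and the "example.test" address are hypothetical names made up for the illustration.

```elixir
defmodule ConnWorker do
  use GenServer

  def start_link(url), do: GenServer.start_link(__MODULE__, url)

  @impl true
  def init(url) do
    # Return immediately; the {:continue, ...} tuple makes the GenServer run
    # handle_continue/2 before it processes any other message.
    {:ok, %{url: url, conn: nil}, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, state) do
    {:noreply, %{state | conn: fake_connect(state.url)}}
  end

  # Hypothetical stand-in for a slow connection handshake.
  defp fake_connect(url) do
    Process.sleep(300)
    {:connected, url}
  end
end

{:ok, pid} = ConnWorker.start_link("example.test")

# start_link/1 has already returned; this call waits for the continue step.
state = :sys.get_state(pid)
IO.inspect(state.conn, label: "conn")
```

The continue step runs before any message from another process, which matters if the worker is registered under a name and other processes can message it as soon as it starts.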