an overview of paracelsus
sylph is ostensibly a system for biological analysis, but its core is a system for automated geometric proofs about dynamical systems. i'm going to try to demystify that a bit in this post, and link it to the literature i'm drawing all this from. i'm still prototyping this, so it's not quite solid enough to just "write out the math", but thankfully the preceding work has been very good at describing the perspectives. at least, that's my excuse for dodging pedagogy in this post.
we call that backend proof system and numerical engine "paracelsus". it was developed from some past work i did in stable integration and targeting schemes for low-energy lunar transfers. about a year ago, vivien encouraged me to look at the applied math aspects of systems biology, and i was very amused to discover how much overlap there was, right down to the relationship between phase space structures and model optimization.
this post briefly explains the direction i've taken, what i haven't done, and how it compares to other modes of analysis. as this work solidifies, i'll be making more posts here about specific methods in use, what parts didn't work out, and how this system integrates in common biology workflows.
a brief description of mathematics in systems biology
systems and synthetic biologists use a standardized form of ordinary differential equations to model the functional relationships between components of a biological system. this is a slightly modified form of the chemical reaction networks commonly used in chemical stoichiometry. like in chemistry, these are time-invariant dynamical systems subject to the usual thermodynamic conservation laws. these are formalized in the sbml standard, which uses symbolic mathematical expressions to represent biological systems.
these ode models are conceptually simple, but are analytically intractable. biological systems have high dimensionality, and have very strange properties with respect to stability. the most common approach to analysis is to first reduce the dimensionality of a system using the "separation of time-scales" principle, using a steady-state assumption to linearize terms with negligible influence at the timescale under investigation. this produces a system with relatively low dimensionality, and lots of rate terms.
these systems are then discretized in time and integrated numerically to simulate the function of the biological system. a major problem with integrating these systems is that they are almost always "stiff", which means they accumulate numerical error quickly unless the timestep used is very small. in casual investigation, this can cause results to be far off the mark and impossible to reproduce. under optimization in experiment design or synthetic biology, this poses major barriers to computational efficiency and soundness of results. the common methods used to address stiff equations are applicable here, but significantly complicate the analysis process.
- ^ sbml core specifcation (pdf)
- ^ challenges in the calibration of large-scale ordinary differential equation models
numerical analysis of systems biology models
i've approached this problem a little differently in paracelsus. the primary form of analysis in paracelsus is identification of lie point and contact symmetries of differential equations.
when naively applied to the common reaction net models in systems biology, this exploits the underlying conservation laws and equilibrium mechanics of chemical kinetics to relax the "stiffness" in the model. in sylph's default mode of operation, paracelsus examines an input model, identifies a group generator that can be shown to reduce error terms, and finally constructs a specialized integrator from the group. i do this because you can't just construct a general stable integrator for reaction networks (the same way you can construct e.g. a symplectic integrator for n-body mechanics), but you can construct an integrator that is stable for an individual reaction network.
algorithmic search for an applicable group generator is a shockingly simple process. the ode is converted into a determining equation (a big pde), i construct the group from the terms in that equation (see the links below), and i take an extra step to see if inserting thermodynamic laws is valid for terms that are known to be individual reactions. conceptually, it's confusing as hell, but mechanically, all i'm doing is graph transformations and automated proofs against simple axioms originally provided by lie.
this process gives us access to much more information than just the solution to the equations. by examining the content of the determining equations, it's easy to identify dynamical properties of a model associated with specific terms (reactions, chemicals, gene interactions, etc). i don't do anything with this yet, but there is a lot of potential hiding in this intermediate information.
i am greatly indebted to the modern "cult of noether" for producing excellent pedagogical materials on this subject, and for publishing implementations for symbolic mathematics languages. as my tiny engineer brain still can't figure out what noether and cartan were trying to say, the progress i've made here was only possible because the subject was made so approachable. the work of peter olver, evelyn hubert, and elizabeth mansfield have been the most significant in my efforts so far.
- ^ applications of lie groups to differential equations
- ^ moving coframes i (pdf)
- ^ moving coframes ii (pdf)
- ^ (various papers on differential invariants)
- ^ a practical guide to invariant calculus
just getting solutions from the models isn't the primary roadblock to systems and synthetic biology. any old ode solver can give you that pretty quick. what they can't do is help you navigate permutations of a model holding some desired feature as an invariant.
the proof system in paracelsus is designed primarily for optimization over systems biology models, to assist experiment design, done by hand or with the assistance of reinforcement learning. when you ask paracelsus to do a parameter or permutation search, i use the same invariant search described above to construct the solution space of your specific goal. then i just drop gradient descent on it.
the benefit of this process is we can guide optimization based on a goal, either for a set of constraints on parameters or (preferably) a desired continuous function. we do this by reshaping the model to necessarily meet that goal, which does not require any sort of training or input dataset. then we look for a set of parameters or permutations that produce near-zero error. i'm not a control theorist, but i've heard this is similar to model construction in nonlinear control systems.
this process is currently extremely ad-hoc, as optimization methods tend to be. a proper explanation will have to wait for my methods to solidify, and for alice to implement latex typesetting on this blog.
what we haven't addressed
reducing stiffness and providing dynamical invariants is the problem that really obviously stands out in systems biology, but there's a lot more complexity lurking beyond it. here i'm going to describe the problems i haven't touched yet, but seem like major issues in sylph's future.
biological systems, when examined under the lens of equilibrium mechanics, are extremely prone to developing "bistable" systems. for non-biologists, that's just a pitchfork bifurcation. these systems also commonly present global bifurcations and cusps, but most systems biologists don't talk about those. i expect, as the design of genetic circuits advances past simple logic, efficiently mapping phase spaces of these systems will become critically important for experiment design and gene optimization. thankfully the literature on celestial mechanics provides a wonderful set of tools for this task. thinking more about this problem and adapting those methods is my clear next step.
in practical analysis, biologists make use of a few discrete approximations i haven't properly accounted for yet. constraint-based reconstruction for metabolic networks and discrete circuit simulation for genetic design are the ones that stand out to me.
metabolic analysis is often performed using integer programming over a set of sort-of-linear constraints gathered from experimental data. cobrapy is the most commonly used tool for this purpose. i haven't touched this area because i still haven't figured out what constraint-based metabolic modeling is in a mathematical sense, and how it fits into an experiment design process. please feel free to email me if you have any comments on this one.
synthetic biologists like to use discrete circuit analysis when designing genetic modifications. there have been efforts to use verilog, linear logic, and there's even a standardized circuit language with a myriad of discrete interpreters. from a mathematics perspective, this is strange, since gene expression is definitely an analog system with lots of important dynamical properties that will bite you if you ignore them. from an engineering perspective, this is normal, since analyzing complex analog circuits really sucks. incorporating this standard paradigm for genetic design into paracelsus could help automate a lot of the design tasks involved in synthetic biology. however, i'm not really sure how to bridge the gap between optimization of an ode and optimization of a corresponding logical system. if you've used any of these tools, feel free to email me about this one also.
the software side of paracelsus
paracelsus is an automated proof system and is implemented as sort of a domain-specific proof assistant. this is a very grandiose description of messy and naive code, but it's a startup you know. i use agda to manage the symbolic transformations, validation of results, and managing the protocol of operation with sylph. agda's too slow to even add integers on time, so i drop down to haskell to do the heavy lifting, for search and numerical evaluation. haskell's also too slow for numerical integration and optimization, so i use accelerate to compile the validated system to opencl/cuda.
the downside of the architecture is that my prototyping cycle is all formalization of things i've only ever implemented in mathematica, with a proof assistant i learned a few months ago. of course, the upside is that agda and haskell actually tell me when i'm wrong, so i've been able to develop this much faster than i could with traditional methods. accelerate's also a lot faster than my first pass at handwritten cuda would be, having a well-defined semantics for gpu compute is a goddamn miracle.
this is my attempt to demonstrate through practice something i think of as "mathematical computing", where you implement a complex system by just writing out a denotational semantics and proofs for it, then giving it supplemental operational semantics to make up the practical computing gap. this sort of development is becoming fairly common in domains that have a long history of formal verification: hardware definition languages written in coq, avionics and fire control derived entirely from optimal control definitions, information-theoretic spacecraft telemetry protocols, etc. i think it's equally applicable to every engineering and scientific field that relies heavily on simulation to design experiments and validate experimental results. of course, that's nearly all of them. i think systems/synthetic biology stands to benefit immensely from this approach as the scope of analysis expands.
the details of what exactly is proven in agda, how i did it, and my plans for further formalization are better left for another post. the short answer is that pretty much nothing besides algebraic transformations and error bounds have proofs, and a significant part of the implementation is only defined in haskell.
some previous mathematical work in this space
obviously i'm not the first applied mathematician to touch this field. i'm not even the first person to take the type theory axe to it. much like computational chemistry, systems biology has received a few waves of mathematical development since it entered common use. there is even a journal dedicated to the subject. i want to highlight some previous work in this space that took routes very different than constructing differential invariants, since these other methods shine some light on what paracelsus doesn't do.
many studies of dynamical stability and phase space structures in biology have been focused on approximation by logical systems, and algebraic decomposition of those approximate systems. this permits analysis of some topological properties of the dynamical system, like the connectedness and boundaries of the parameter space, and the precise locations of bifurcation transitions. one approach i see cited often on biorxiv is this paper, which examines the joint stability of two protein signaling pathways in embryogenesis. their methods and results are extremely interesting, but they lose me when they introduce a domain-specific method of real quantifier elimination.
parameter optimization for synthetic genetic circuits has been tried with pretty much every method you'd expect from computer scientists. one that caught my attention was this paper, which claims to use simulated annealing, but actually implements something a bit more novel. they decompose the parameter space by isolating the individual rate terms in a genetic model, then estimate an initial state by an ad-hoc linear heuristic, then step the model with continuous refitting of those parameters. this has the curious perk of completely avoiding error accumulation associated with stiffness, though it's not clear how they actually managed that.
a few people have approached biological system analysis using non-equilibrium mechanics, focusing instead on the conditional expectation properties of a full stochastic interpretation. recent work in this vein (with many citations) can be found here. in a dynamical framing, these methods approximate the phase space by fitting polynomial functions of stochastic variables to the conditional moments of the ode model. if you know a lot measure theory, this allows you to examine the relationships between each dimension of the dynamical system using only algebraic equations, and to identify phase space structures that are not as clear as bifurcations. i don't know very much measure theory, so i have no idea how sound that is in practice.
- ^ journal of mathematical biology
- ^ geometry and topology of parameter space: investigating measures of robustness in regulatory networks
- ^ a framework for scalable parameter estimation of gene circuit models using structural information
- ^ algebraic expressions of conditional expectations in gene regulatory networks
we'll be doing development and progress updates on this blog on-and-off over the next few months. my next post will cover more aspects of using sylph/paracelsus for enzyme and protein optimization, and i'll explain my take on optimization in less hazy words. we've got some customers who are already interested in this application, so i'm diving right in.