
Bryan He, Ludvig Bergenstråhle, Linnea Stenbeck, Abubakar Abid, Alma Andersson, Åke Borg, Jonas Maaskola, Joakim Lundeberg, and James Zou
Nature Biomedical Engineering, 2020.
Spatial transcriptomics allows for the measurement of RNA abundance at a high spatial resolution, making it possible to systematically link the morphology of cellular neighbourhoods and spatially localized gene expression. Here, we report the development of a deep learning algorithm for the prediction of local gene expression from haematoxylinandeosinstained histopathology images using a new dataset of 30,612 spatially resolved gene expression data matched to histopathology images from 23 patients with breast cancer. We identified over 100 genes, including known breast cancer biomarkers of intratumoral heterogeneity and the colocalization of tumour growth and immune activation, the expression of which can be predicted from the histopathology images at a resolution of 100 µm. We also show that the algorithm generalizes well to The Cancer Genome Atlas and to other breast cancer gene expression datasets without the need for retraining. Predicting the spatially resolved transcriptome of a tissue directly from tissue images may enable imagebased screening for molecular biomarkers with spatial variation.

David Ouyang, Bryan He, Amirata Ghorbani, Neal Yuan, Joseph Ebinger, Curt P. Langlotz, Paul A. Heidenreich, Robert A. Harrington, David H. Liang, Euan A. Ashley, and James Y. Zou
Nature, 2020.
Accurate assessment of cardiac function is crucial for diagnosing cardiovascular disease, screening for cardiotoxicity and deciding clinical management in patients with critical illness. However human assessment of cardiac function focuses on a limited sampling of cardiac cycles and has significant interobserver variability despite years of training. To overcome this challenge, we present the first beattobeat deep learning algorithm that surpasses human expert performance in the critical tasks of segmenting the left ventricle, estimating ejection fraction, and assessing cardiomyopathy. Trained on echocardiogram videos, our model accurately segments the left ventricle with a Dice Similarity Coefficient of 0.92, predicts ejection fraction with mean absolute error of 4.1%, and reliably classifies heart failure with reduced ejection fraction (AUC of 0.97). Prospective evaluation with repeated human measurements confirms that our model has less variance than experts. By leveraging information across multiple cardiac cycles, our model can identify subtle changes in ejection fraction, is more reproducible than human evaluation, and lays the foundation for precise diagnosis of cardiovascular disease. As a new resource to promote further innovation, we also make publicly available one of the largest medical video dataset of over 10,000 annotated echocardiograms.

Paroma Varma, Bryan He, Payal Bajaj, Nishith Khandwala, Imon Banerjee, Daniel Rubin, Christopher Ré
Advances in Neural Information Processing Systems (NeurIPS), 2017.
Obtaining enough labeled data to robustly train complex discriminative models is a major bottleneck in the machine learning pipeline. A popular solution is combining multiple sources of weak supervision using generative models. The structure of these models affects the quality of the training labels, but is difficult to learn without any ground truth labels. We instead rely on weak supervision sources having some structure by virtue of being encoded programmatically. We present Coral, a paradigm that infers generative model structure by statically analyzing the code for these heuristics, thus significantly reducing the amount of data required to learn structure. We prove that Coral's sample complexity scales quasilinearly with the number of heuristics and number of relations identified, improving over the standard sample complexity, which is exponential in n for learning nth degree relations. Empirically, Coral matches or outperforms traditional structure learning approaches by up to 3.81 F1 points. Using Coral to model dependencies instead of assuming independence results in better performance than a fully supervised model by 3.07 accuracy points when heuristics are used to label radiology data without ground truth labels.

Stephen H. Bach, Bryan He, Alexander Ratner, Christopher Ré
Proceedings of the 34th International Conference on Machine Learning, 2017.
Curating labeled training data has become the primary bottleneck in machine learning. Recent frameworks address this bottleneck with generative models to synthesize labels at scale from weak supervision sources. The generative model’s dependency structure directly affects the quality of the estimated labels, but selecting a structure automatically without any labeled data is a distinct challenge. We propose a structure estimation method that maximizes the l_{1}regularized marginal pseudolikelihood of the observed data. Our analysis shows that the amount of unlabeled data required to identify the true structure scales sublinearly in the number of possible dependencies for a broad class of models. Simulations show that our method is 100x faster than a maximum likelihood approach and selects 1/4 as many extraneous dependencies. We also show that our method provides an average of 1.5 F1 points of improvement over existing, userdeveloped information extraction applications on realworld data such as PubMed journal abstracts.

Bryan He, Christopher De Sa, Ioannis Mitliagkas, Christopher Ré
Advances in Neural Information Processing Systems (NeurIPS), 2016.
Gibbs sampling is a Markov Chain Monte Carlo sampling technique that iteratively samples variables from their conditional distributions. There are two common scan orders for the variables: random scan and systematic scan. Due to the benefits of locality in hardware, systematic scan is commonly used, even though most statistical guarantees are only for random scan. While it has been conjectured that the mixing times of random scan and systematic scan do not differ by more than a logarithmic factor, we show by counterexample that this is not the case, and we prove that that the mixing times do not differ by more than a polynomial factor under mild conditions. To prove these relative bounds, we introduce a method of augmenting the state space to study systematic scan using conductance.

Bryan He, Yisong Yue
Advances in Neural Information Processing Systems (NeurIPS), 2015.
Interactive submodular set cover is an interactive variant of submodular set cover over a hypothesis class of submodular functions, where the goal is to satisfy all sufficiently plausible submodular functions to a target threshold using as few (costweighted) actions as possible. It models settings where there is uncertainty regarding which submodular function to optimize. In this paper, we propose a new extension, which we call smooth interactive submodular set cover, that allows the target threshold to vary depending on the plausibility of each hypothesis. We present the first algorithm for this more general setting with theoretical guarantees on optimality. We further show how to extend our approach to deal with realvalued functions, which yields new theoretical results for realvalued submodular set cover for both the interactive and noninteractive settings.

Bryan He
Frontiers in Algorithmics and Algorithmic Aspects in Information and Management, 2012.
A floorplan is a rectangle subdivided into smaller rectangular blocks by horizontal and vertical line segments. Two floorplans are considered equivalent if and only if there is a bijection between the blocks in the two floorplans such that the corresponding blocks have the same horizontal and vertical boundaries. Mosaic floorplans use the same objects as floorplans but use an alternative definition of equivalence. Two mosaic floorplans are considered equivalent if and only if they can be converted into equivalent floorplans by sliding the line segments that divide the blocks. The QuarterState Sequence method of representing mosaic floorplans uses 4n bits, where n is the number of blocks. This paper introduces a method of representing an nblock mosaic floorplan with a (3n − 3)bit binary string. It has been proven that the shortest possible binary string representation of a mosaic floorplan has a length of (3n − o(n)) bits. Therefore, the representation presented in this paper is asymptotically optimal. Baxter permutations are a set of permutations defined by prohibited subsequences. There exists a bijection between mosaic floorplans and Baxter permutations. As a result, the methods introduced in this paper also create an optimal binary string representation of Baxter permutations.