Extracting predictive models from nonlinear systems is a central task in scientific machine learning. One key problem is the reconciliation between modern data-driven approaches and first principles. Despite rapid advances in machine learning techniques, embedding domain knowledge into data-driven models remains a challenge. In this work, we present a universal learning framework for extracting predictive models from nonlinear systems based on observations. Our framework can readily incorporate first principle knowledge because it naturally models nonlinear systems as continuous-time systems. This both improves the extracted models' extrapolation power and reduces the amount of data needed for training. In addition, our framework has the advantages of robustness to observational noise and applicability to irregularly sampled data. We demonstrate the effectiveness of our scheme by learning predictive models for a wide variety of systems including a stiff Van der Pol oscillator, the Lorenz system, and the Kuramoto-Sivashinsky equation. For the Lorenz system, different types of domain knowledge are incorporated to demonstrate the strength of knowledge embedding in data-driven system identification.