This page collects useful resources regarding Experiment-Oriented Computing (EOC), a concept introduced in my ESEC/FSE 2018 paper The Case for Experiment-Oriented Computing (local free download). Computational experimentation technology can be found in many forms, sometimes explicit and dedicated, but more often intertwined with other concerns. In almost all cases I’m aware of, however, there is no proper understanding of the wide scope that such technology can have. Nevertheless, it is useful to map the technology that does exist, since it can help us track, motivate and imagine the progress of proper EOC tools and systems. I’ve also created a related infographic for dissemination to practitioners.
A/B Test Libraries, Frameworks and Services
In Software Engineering, experimentation is often confused with mere A/B testing. This is actually a very popular technique, so it would be futile to try to curate all such tools here. Rather, I will focus on those that for some reason are particularly interesting or representative.
- Facebook’s PlanOut: a framework for large-scale A/B testing, used at Facebook.
- Optimizely: A popular service that calls itself “the world’s leading experimentation platform.” It allows non-programmers to modify existing web pages in order to determine the effects of such modifications, achieved by dynamically instrumenting the page before delivering it to customers. Other forms of experimentation are also possible, some relating to the personalization of pages with respect to users, locations and perhaps other factors.
- Unbounce: A popular service to design and deploy landing pages (and other artifacts, it seems). Critically, it allows for easy A/B testing, provided that the user spends some time designing the versions to be tested.
- Effective Experiments: A platform for managing experiments conducted through other tools. Seems useful for teams with long-term commitment to experimentation.
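The core mechanic these tools share is deterministic, hash-based assignment of users to variants, so that each user consistently sees the same variant without any per-user state being stored. A minimal sketch of that idea (my own illustration, not any of these tools' actual APIs):

```python
import hashlib

def assign_variant(experiment: str, user_id: str, variants: list) -> str:
    """Deterministically assign a user to a variant.

    Hashing the (experiment, user) pair means the same user always
    sees the same variant, and distinct experiments are assigned
    independently of one another.
    """
    digest = hashlib.sha1(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same user always lands in the same arm of a given experiment:
v1 = assign_variant("checkout_button", "user-42", ["control", "treatment"])
v2 = assign_variant("checkout_button", "user-42", ["control", "treatment"])
assert v1 == v2
```

Production systems layer much more on top (exposure logging, traffic allocation, overlapping experiment namespaces), but the stateless hash is the common foundation.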
Experiment Tracking and Versioning
- Version Control System for Data Science Projects: Uses git to track experiments. Includes a number of additional abstractions, such as metrics and Machine Learning pipelines, in order to allow easier assessment and reproduction of results.
- Sacred: Python library to define, run and track experiments. “Sacred is a tool to configure, organize, log and reproduce computational experiments. It is designed to introduce only minimal overhead, while encouraging modularity and configurability of experiments.”
- Sumatra: Primarily a command-line tool for “managing and tracking projects based on numerical simulation and/or analysis, with the aim of supporting reproducible research. It can be thought of as an automated electronic lab notebook for computational projects.” Also provides a Python library for deeper integration and customization.
- Experimenter: Uses git versioning in order to track both experimental setup and results.
- Reccrd: Used to be a Python library and a related online service to store experimentation results. The link, however, no longer points to the right place.
- MLflow: Describes itself as “an open source platform for the machine learning lifecycle”. Allows the tracking of experiments, the organization of projects (in particular, to permit easier reproducibility) and — less importantly from an experimentation point of view — the deployment of models to various tools.
- ModelDB: Allows the tracking of Machine Learning experiments results by close integration with selected libraries, currently Spark’s MLlib and scikit-learn.
- Comet: Cloud-based programmatic experiment tracking, hyperparameters optimization, source and results comparison, git versioning and documentation. Commercial, but offers free access to open source projects.
- Weights and Biases: Cloud-based programmatic experiment tracking, rather similar to MLflow’s tracking component. Interestingly, has some ways to log artifacts to make their later inspection easier (e.g., 3D objects for visualization). Free for small projects, commercial or not.
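What all of these trackers have in common is the recording of parameters and metrics per run, so that runs stay comparable and reproducible afterwards. The essence can be illustrated with a toy file-based logger (the names below are mine, not the API of any of the tools above):

```python
import json
import time
import uuid
from pathlib import Path

class RunTracker:
    """Toy experiment tracker: one JSON file per run."""

    def __init__(self, root="runs"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)
        self.run = {"id": uuid.uuid4().hex, "start": time.time(),
                    "params": {}, "metrics": []}

    def log_param(self, name, value):
        # Parameters describe the configuration of the run.
        self.run["params"][name] = value

    def log_metric(self, name, value, step=0):
        # Metrics are time-stamped observations produced by the run.
        self.run["metrics"].append({"name": name, "value": value, "step": step})

    def finish(self):
        # Persist the run so later runs can be compared against it.
        path = self.root / f"{self.run['id']}.json"
        path.write_text(json.dumps(self.run, indent=2))
        return path

tracker = RunTracker()
tracker.log_param("learning_rate", 0.01)
tracker.log_metric("accuracy", 0.93, step=1)
saved = tracker.finish()
```

Real trackers add environment capture, code versioning and artifact storage on top of this basic params-plus-metrics record.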
Experimentation with Users
- Amazon’s Mechanical Turk: One of the most well-known platforms to recruit users to complete arbitrary tasks online. Obviously useful for experimenting with users. For example, Toomim et al. have studied user preferences by performing experiments using Mechanical Turk.
- Clickworker: an alternative to Mechanical Turk, with various pre-defined use cases.
- Delve: a tool from Sidewalk Labs to automatically experiment with multiple possibilities of urban designs, in order to optimize metrics of interest to the experimenter. Looks quite amazing. Surprisingly, I was told by an authoritative source that architects actually like the idea – the main resistance comes from real estate developers.
- SIERRA (reSearch pIpEline for Reproducibility, Reusability, and Automation): a research tool capable of generating experimental setups from declarative, instead of imperative, specifications. They claim that instead of writing “I need to perform these steps to run the experiment, process the data and generate the graphs I want”, one can write “OK SIERRA: Here is the environment and simulator/robot platform I want to use, the deliverables I want to generate, and the data I want to appear on them for my research query–GO!”. [Related paper]
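The declarative-to-imperative translation at the heart of a tool like this can be illustrated very simply: given a specification of factors and their levels, expand it into the full set of concrete runs to execute. A hypothetical sketch (not SIERRA's actual input format):

```python
import itertools

def expand_runs(spec: dict) -> list:
    """Expand a declarative spec {factor: [levels]} into the
    cartesian product of concrete experimental runs."""
    factors = sorted(spec)
    return [dict(zip(factors, combo))
            for combo in itertools.product(*(spec[f] for f in factors))]

# Hypothetical factors for a swarm-robotics study:
spec = {"robots": [1, 10, 100], "scenario": ["foraging", "flocking"]}
runs = expand_runs(spec)
assert len(runs) == 6  # 3 swarm sizes x 2 scenarios
```

The experimenter states *what* should vary; the tool derives *which* runs to perform, executes them, and routes each run's output to the right place.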
Automated Machine Learning tools, to achieve their objectives, often experiment with various software designs, models and parameters. Some notable tools and libraries:
- Azure AutoML
- AWS SageMaker Autopilot
- Google Cloud AutoML
- H2O AutoML
- Lale: a very interesting AutoML library by IBM. What I liked most about it is the fact that we can define a structure to be explored, and then the library handles the necessary experiments to determine the best composition. See also the paper Lale: Consistent Automated Machine Learning.
Causal Inference Libraries
Causal inference aims to extract causal knowledge from historical data, when actual experimentation is not possible. So, though it is not really about experimentation per se, it might offer ways in which to motivate and improve experiments, as well as to treat experimental data.
- DoWhy, by Microsoft, in Python.
- Causal ML, by Uber, in Python.
- EconML, by Microsoft, in Python.
- causalToolbox, in R.
- CausalNex, by QuantumBlack, in Python.
- CausalImpact, by Google, in R. There’s a Python implementation by Dafiti.
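A central estimand these libraries compute is the average treatment effect after adjusting for a confounder (backdoor adjustment), and for a single discrete confounder it can be computed by hand via stratification. A toy sketch with made-up numbers:

```python
from collections import defaultdict

def adjusted_ate(records):
    """Average treatment effect via backdoor adjustment: stratify on
    the confounder z, take the treated-vs-control outcome difference
    within each stratum, then weight each stratum by its population share."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for z, t, y in records:
        strata[z][t].append(y)
    n = len(records)
    ate = 0.0
    for groups in strata.values():
        if not groups[0] or not groups[1]:
            continue  # stratum lacks support; simply skipped in this toy version
        diff = (sum(groups[1]) / len(groups[1])
                - sum(groups[0]) / len(groups[0]))
        weight = (len(groups[0]) + len(groups[1])) / n
        ate += weight * diff
    return ate

# (z, treatment, outcome) triples; z confounds both treatment and outcome
data = [(0, 0, 0.2), (0, 0, 0.3), (0, 1, 0.5), (0, 1, 0.6),
        (1, 0, 0.6), (1, 0, 0.7), (1, 1, 0.9), (1, 1, 1.0)]
assert abs(adjusted_ate(data) - 0.3) < 1e-9
```

The libraries above go far beyond this (graphical identification, machine-learned nuisance models, refutation tests), but the stratified difference is the conceptual starting point.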
- What-If Tool: Part of TensorBoard, allows users to interact with model features and examples in order to quickly assess their effect on learning. This is a manual tool, designed to let users manipulate and understand models in real time.
- Hydra: a CLI application framework from Facebook that seems to help in automating experimentation workflows by making configuration variation easy, including the management of corresponding outputs. Take a look at their tutorial.
Real-World Applications, Laboratories and Companies
- Experimentation at Uber
- Related presentation: A/B testing at Uber: How we built a BYOM (bring your own metrics) platform
- Uber’s Michelangelo Machine Learning management platform.
- Experimentation to optimize Pinterest’s recommendation system: Diversification in recommender systems: Using topical variety to increase user satisfaction
- Microsoft’s ExP Experimentation Platform. There are many original papers, talks and other contents here, particularly regarding A/B testing.
- Eureqa is arguably a precursor to present (2020) AutoML tools, though it uses an (apparently) unpopular method, namely symbolic regression. It was acquired by DataRobot in 2017 and is now part of their commercial offering.
- Designing Adaptive Experiments to Study Working Memory. A practical example of using Probabilistic Programming (through the Pyro library) to choose optimal experimental parameters while performing the experiments. More concretely: given how subjects have responded so far, what should be the length of the next sequence of digits used to measure memory capacity? See the paper by Foster et al. as well.
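The adaptive idea can be illustrated, much more crudely, with a classic 1-up/1-down staircase procedure: lengthen the digit sequence after a correct recall and shorten it after an error. This is a toy alternative to the Bayesian optimal design used in the tutorial, not its actual method:

```python
def staircase(responses, start=4, lo=2, hi=12):
    """Classic 1-up/1-down staircase: the next trial's sequence length
    rises after a correct recall and falls after an error, so trials
    concentrate near the subject's capacity threshold."""
    length = start
    lengths = [length]
    for correct in responses:
        length = min(hi, length + 1) if correct else max(lo, length - 1)
        lengths.append(length)
    return lengths

# A subject who recalls sequences up to length 6 but fails at 7
# ends up oscillating around their threshold:
assert staircase([True, True, True, False, True, False]) == [4, 5, 6, 7, 6, 7, 6]
```

The Bayesian approach in the tutorial generalizes this: instead of a fixed up/down rule, each next trial is chosen to maximize the expected information gain about the subject's capacity.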
The nature of experimentation itself:
- Radder, H. (2009). The philosophy of scientific experimentation: a review. Automated Experimentation, 1(1), 2.
- Kohavi, R., & Longbotham, R. (2017). Online controlled experiments and A/B testing. In Encyclopedia of Machine Learning and Data Mining (pp. 922-929). Springer US.
Computational Scientific Discovery:
- Gil, Y., Greaves, M., Hendler, J., & Hirsh, H. (2014). Amplify scientific discovery with artificial intelligence. Science, 346(6206), 171-172. [alternative download]
- The DISK project and related papers:
- Bakshy, E., Eckles, D., & Bernstein, M. S. (2014, April). Designing and deploying online field experiments. In Proceedings of the 23rd international conference on World wide web (pp. 283-292).
- Comment: Concerns the PlanOut system developed by Facebook.
- Tosch, E., Bakshy, E., Berger, E. D., Jensen, D. D., & Moss, J. E. B. (2019). PlanAlyzer: assessing threats to the validity of online experiments. Proceedings of the ACM on Programming Languages, 3(OOPSLA), 1-30. There’s a CACM republication here too.
- Comment: The works above concerning PlanOut treat the specification of experiments as something that differs from regular programs (and even from unusual programs such as those from Probabilistic Programming Languages), thereby elevating experimentation to a first-class computational entity. In this manner, they define new problems relevant to this scope, as well as their corresponding solutions. For this ontological reason alone, I find these to be important contributions.
- Foster, A., Jankowiak, M., Bingham, E., Horsfall, P., Teh, Y.W., Rainforth, T. and Goodman, N., (2019). Variational Bayesian Optimal Experimental Design. Advances in Neural Information Processing Systems 2019.
- John Harwell, London Lowmanstone, and Maria Gini (2022). SIERRA: A Modular Framework for Research Automation. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS ’22). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 1905–1907.
- Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B. E., Bussonnier, M., Frederic, J., … & Ivanov, P. (2016, May). Jupyter Notebooks-a publishing format for reproducible computational workflows. In ELPUB (pp. 87-90).
- Shen, H. (2014). Interactive notebooks: Sharing the code. Nature News, 515(7525), 151.
- Miao, H., Li, A., Davis, L. S., & Deshpande, A. (2017, April). Modelhub: Deep learning lifecycle management. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE) (pp. 1393-1394). IEEE.
- Gharib Gharibi et al. (2019). ModelKB: Towards Automated Management of the Modeling Lifecycle in Deep Learning. In Proceedings of the 2019 IEEE/ACM 7th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE).
- IEEE Computer Society (2018). Bulletin of the Technical Committee on Data Engineering: Special Issue on Machine Learning Life-cycle Management.
- Silva, M., Hines, M. R., Gallo, D., Liu, Q., Ryu, K. D., & Da Silva, D. (2013, March). Cloudbench: Experiment automation for cloud environments. In Cloud Engineering (IC2E), 2013 IEEE International Conference on (pp. 302-311). IEEE.
- Sparkes, A., Aubrey, W., Byrne, E., Clare, A., Khan, M. N., Liakata, M., … & Young, M. (2010). Towards Robot Scientists for autonomous scientific discovery. Automated Experimentation, 2(1), 1.
- Hunter, D., & Evans, N. (2016). Facebook emotional contagion experiment controversy. Research Ethics, 12(1), 2–3.
In Human-Computer Interaction (HCI), there are tools that help users to experiment with the design of various types of artifacts. These range from very simple approaches (e.g., quick previews) to highly sophisticated ones, based on optimization or learning techniques. Beyond their specific design applications, such tools are, by definition (the ‘Human’ part of HCI), very close to users, and therefore can be rich sources of inspiration for more general experimentation interfaces.
- Carter, S., & Nielsen, M. (2017). Using artificial intelligence to augment human intelligence. Distill, 2(12), e9.
- Seidel, S., Berente, N., Lindberg, A., Lyytinen, K., & Nickerson, J. V. (2018). Autonomous tools and design: a triple-loop approach to human-machine learning. Communications of the ACM, 62(1), 50-57.
- Yao, L., Chu, Z., Li, S., Li, Y., Gao, J., & Zhang, A. (2020). A Survey on Causal Inference. arXiv preprint arXiv:2002.02770.
- A concise presentation of the potential outcomes framework, including both traditional and modern methods.
Computational Scientific Discovery:
- Langley, P., Simon, H. A., Bradshaw, G. L., & Zytkow, J. M. (1987). Scientific discovery: Computational explorations of the creative processes. MIT press.
Active Learning. Settles (2012)’s description of the field explains its importance for experimentation very well: “Traditional ‘passive’ learning systems induce a hypothesis to explain whatever training data happens to be available (e.g., a collection of labeled instances). By contrast, the hallmark of an active learning system is that it eagerly develops and tests new hypotheses as part of a continuing, interactive learning process. Another way to think about it is that active learners develop a ‘line of inquiry,’ much in the way a scientist would design a series of experiments to help him or her draw conclusions as efficiently as possible.”
- Burr Settles (2012). Active Learning. Morgan & Claypool Publishers.
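The “line of inquiry” Settles describes takes its simplest form in uncertainty sampling: query the label of the unlabeled instance the current model is least sure about. A minimal sketch with a toy probability model (both the function names and the model are mine, for illustration only):

```python
def most_uncertain(unlabeled, predict_proba):
    """Uncertainty sampling: pick the instance whose predicted
    probability of the positive class is closest to 0.5, i.e. the
    one the current model is least confident about."""
    return min(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))

# Toy model: probability of the positive class grows linearly with x.
def toy_model(x):
    return min(1.0, max(0.0, x / 10))

pool = [0.5, 3.0, 5.2, 9.0]
query = most_uncertain(pool, toy_model)
assert query == 5.2  # closest to the model's decision boundary
```

After the queried label arrives, the model is retrained and the next query chosen, which is exactly the hypothesize-then-test loop that makes active learning a form of automated experimentation.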
Software analytics (and related experimental concerns). Although, in principle, software analytics can be entirely passive (and therefore not experimental), in reality software provides an ideal medium for supporting experimentation (i.e., because arbitrary interaction can be implemented). Hence, it is worth understanding the area.
- Bird, C., Menzies, T., & Zimmermann, T. (Eds.). (2015). The Art and Science of Analyzing Software Data. Elsevier.
- Menzies, T., Williams, L., & Zimmermann, T. (2016). Perspectives on Data Science for Software Engineering. Morgan Kaufmann.
Philosophy of Science. Unsurprisingly, I find the discipline to be quite insightful.
- Hanson, N. R. (1977). Patterns of discovery an inquiry into the conceptual foundations of science. CUP Archive.
- Pearl, J., & Mackenzie, D. (2018). The book of why: the new science of cause and effect. Basic Books.
- A readable exposition of the causal diagrams school of thought.
- A Crash Course in Causality: Inferring Causal Effects from Observational Data.
- Causal Diagrams: Draw Your Assumptions Before Your Conclusions.
- 2014: Everything You Need To Know about Facebook’s Controversial Emotion Experiment
- 2016: Why These Tech Companies Keep Running Thousands Of Failed Experiments
- 2017: The Surprising Power of Online Experiments
- Sarah Zhang (2019). The 500-Year-Long Science Experiment. In: The Atlantic.
- This is actually not at all automated, but comments on a unique challenge: how do you support experiments that last centuries? Here, scientists are looking for ways to do this manually, such as rewriting the experiment’s instructions every 25 years to keep them up to date for future generations. What would an EOC technology to support such long-term experiments look like? Perhaps, just as these scientists rewrite instructions manually, experimentation software could rewrite itself over time?