G. Gousios | TU Delft Repository

Frankenstein

Fast and lightweight call graph generation for software builds

Journal article (2024) - M. Keshani (author), G. Gousios (author), S. Proksch (author)

Call Graphs are a rich data source and form the foundation for advanced static analyses that can, for example, detect security vulnerabilities or dead code. This information is invaluable when it is immediately available, such as in the output of a build system. Call Graph generation is a whole-program analysis: not just the application, but also all its dependencies are processed together. Recent work has shown that even advanced static analyses can use summarization techniques to substantially improve runtime; however, existing analyses focus on soundness, and as such remain very expensive. When executed in the build system, which typically has limited resources, even powerful servers suffer from slow build times, rendering these analyses impractical in today’s fast-paced development. In this paper, we aim to strike a balance between improving static analyses while remaining practical for use cases that require quick results in low-resource environments. We propose a summarization-based implementation of a Class-Hierarchy Analysis algorithm for call graph generation of Java programs. Our approach leverages the fact that dependency sets often do not change between builds: we can generate call graphs for these dependencies, cache their generation for subsequent builds, and, using a novel stitching algorithm, Frankenstein, merge all partial results into a complete call graph for the whole program. Our evaluation results show that this lightweight approach can substantially outperform existing frameworks. In terms of speed improvements, Frankenstein surpasses the baselines by up to 38%, requiring an average of just 388 megabytes of memory. This makes the proposed approach practical for build systems with limited memory resources. Despite these optimizations, our generated call graphs maintain a near-identical set of edges when compared to the baselines, achieving an F₁ score of up to 0.98. This summarization-based approach for call graph generation paves the way for using extended static analyses in build processes.

Dynamic Prediction of Delays in Software Projects using Delay Patterns and Bayesian Modeling

Conference paper (2023) - E. Kula (author), Eric Greuter (author), A. van Deursen (author), G. Gousios (author)

Modern agile software projects are subject to constant change, making it essential to reassess overall delay risk throughout the project life cycle. Existing effort estimation models are static and not able to incorporate changes occurring during project execution. In this paper, ...

Pull Request Decisions Explained

An Empirical Overview

Journal article (2023) - Xunhui Zhang (author), Yue Yu (author), G. Gousios (author), A. Rastogi (author)

Context: The pull-based development model is widely used in open source projects, leading to the emergence of trends in distributed software development. One aspect that has garnered significant attention concerning pull request decisions is the identification of explanatory fact ...

Präzi

From package-based to call-based dependency networks

Journal article (2022) - J.I. Hejderup (author), M.M. Beller (author), K. Triantafyllou (author), G. Gousios (author)

Modern programming languages such as Java, JavaScript, and Rust encourage software reuse by hosting diverse and fast-growing repositories of highly interdependent packages (i.e., reusable libraries) for their users. The standard way to study the interdependence between software p ...

CodeFill

Multi-token Code Completion by Jointly learning from Structure and Naming Sequences

Conference paper (2022) - M. Izadi (author), Roberta Gismondi (author), G. Gousios (author)

Code completion is an essential feature of IDEs, yet current auto-completers are restricted to either grammar-based or NLP-based single-token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environ ...

Factors Affecting On-Time Delivery in Large-Scale Agile Software Development

Journal article (2022) - E. Kula (author), Eric Greuter (author), A. van Deursen (author), G. Gousios (author)

Late delivery of software projects and cost overruns have been common problems in the software industry for decades. Both problems are manifestations of deficiencies in effort estimation during project planning. With software projects being complex socio-technical systems, a larg ...

ConE: A Concurrent Edit Detection Tool for Large Scale Software Development

Journal article (2022) - C.S. Maddila (author), Nachiappan Nagappan (author), Christian Bird (author), G. Gousios (author), A. van Deursen (author)

Modern, complex software systems are being continuously extended and adjusted. The developers responsible for this may come from different teams or organizations, and may be distributed over the world. This may make it difficult to keep track of what other developers are doing, w ...

Can we trust tests to automate dependency updates?

A case study of Java Projects

Journal article (2022) - J.I. Hejderup (author), G. Gousios (author)

Developers are increasingly using services such as Dependabot to automate dependency updates. However, recent research has shown that developers perceive such services as unreliable, as they heavily rely on test coverage to detect conflicts in updates. To understand the prevalenc ...

Type4Py

Practical Deep Similarity Learning-Based Type Inference for Python

Conference paper (2022) - S.A.M. Mir (author), Evaldas Latoskinas (author), S. Proksch (author), G. Gousios (author)

Dynamic languages, such as Python and JavaScript, trade static typing for developer flexibility and productivity. Lack of static typing can cause run-time exceptions and is a major factor for weak IDE support. To alleviate these issues, PEP 484 introduced optional type annotation ...

Nudge

Accelerating Overdue Pull Requests toward Completion

Journal article (2022) - C.S. Maddila (author), Sai Surya Upadrasta (author), Chetan Bansal (author), Nachiappan Nagappan (author), G. Gousios (author), A. van Deursen (author)

Pull requests are a key part of the collaborative software development and code review process today. However, pull requests can also slow down the software development process when the reviewer(s) or the author do not actively engage with the pull request. In this work, we desig ...

Impact of Software Engineering Research in Practice: A Patent and Author Survey Analysis

Journal article (2022) - Zoe Kotti (author), G. Gousios (author), D. Spinellis (author)

Existing work on the practical impact of software engineering (SE) research examines industrial relevance rather than adoption of study results, hence the question of how results have been practically applied remains open. To answer this and investigate the outcomes of impactful ...

Fine-Grained Network Analysis for Modern Software Ecosystems

Journal article (2021) - Paolo Boldi (author), G. Gousios (author)

Modern software development is increasingly dependent on components, libraries, and frameworks coming from third-party vendors or open-source suppliers and made available through a number of platforms (or forges). This way of writing software puts an emphasis on reuse and on comp ...

Topic recommendation for software repositories using multi-label classification algorithms

Journal article (2021) - M. Izadi (author), Abbas Heydarnoori (author), G. Gousios (author)

Many platforms exploit collaborative tagging to provide their users with faster and more accurate results while searching or navigating. Tags can communicate different concepts such as the main features, technologies, functionality, and the goal of a software repository. Recently ...

Learning Off-By-One Mistakes

An Empirical Study

Conference paper (2021) - Hendrig Sellik (author), Onno van Paridon (author), G. Gousios (author), Maurício Aniche (author)

Mistakes in binary conditions are a source of error in many software systems. They happen when developers use, e.g., < or > instead of <= or >=. These boundary mistakes are hard to find and impose manual, labor-intensive work for software developers. While previous re ...

ManyTypes4Py

A benchmark Python dataset for machine learning-based type inference

Conference paper (2021) - S.A.M. Mir (author), Evaldas Latoskinas (author), G. Gousios (author)

In this paper, we present ManyTypes4Py, a large Python dataset for machine learning (ML)-based type inference. The dataset contains a total of 5,382 Python projects with more than 869K type annotations. Duplicate source code files were removed to eliminate the negative effect of ...

Modeling Team Dynamics for the Characterization and Prediction of Delays in User Stories

Conference paper (2021) - E. Kula (author), A. van Deursen (author), G. Gousios (author)

In agile software development, proper team structures and effort estimates are crucial to ensure the on-time delivery of software projects. Delivery performance can vary due to the influence of changes in teams, resulting in team dynamics that remain largely unexplored. In this p ...

Selecting third-party libraries: The practitioners' perspective

Conference paper (2020) - E. Larios Vargas (author), Maurício Aniche (author), Christoph Treude (author), Magiel Bruntink (author), G. Gousios (author)

The selection of third-party libraries is an essential element of virtually any software development project. However, deciding which libraries to choose is a challenging practical problem. Selecting the wrong library can severely impact a software project in terms of cost, time, ...

OffSide

Learning to Identify Mistakes in Boundary Conditions

Conference paper (2020) - Jón Arnar Briem (author), Jordi Smit (author), Hendrig Sellik (author), Pavel Rapoport (author), G. Gousios (author), Maurício Aniche (author)

Mistakes in boundary conditions are the cause of many bugs in software. These mistakes happen when, e.g., developers make use of '<' or '>' in cases where they should have used '<=' or '>='. Mistakes in boundary conditions are often hard to find and manually detecting ...

Dependency Solving Is Still Hard, but We Are Getting Better at It

Conference paper (2020) - Pietro Abate (author), Roberto Di Cosmo (author), G. Gousios (author), Stefano Zacchiroli (author)

Dependency solving is a hard (NP-complete) problem in all non-trivial component models due to either mutually incompatible versions of the same packages or explicitly declared package conflicts. As such, software upgrade planning needs to rely on highly specialized dependency sol ...

Questions for Data Scientists in Software Engineering: A Replication

Conference paper (2020) - H.K.M. Huijgens (author), A. Rastogi (author), Ernst Mulders (author), G. Gousios (author), A. van Deursen (author)

In 2014, a Microsoft study investigated the sort of questions that data science applied to software engineering should answer. This resulted in 145 questions that developers considered relevant for data scientists to answer, thus providing a research agenda to the community. Fast ...