`dupree`

Jul 5, 2019
code-analysis rstats
24 min read

dupree: All files are the same, but some are samier than others

… and on that dark and stormy night, the fabled real programmers came down from Mount Partition. It is said they smell your code before it is even written.

The sources of duplication are everywhere

We’ve all copied & pasted stuff. “Screw it, it worked then and it’ll work now!” “Could you add a graph like you did the other week into this report? You’ve got ’til yesterday.”

And, to be fair, abstracting code is a bit more complicated than just bloating out more code. Hell, if you write functions, you might have to test them…

I write a lot of analysis code. Sometimes I duplicate things in different projects, sometimes I do it in the same project, sometimes I do it in the same file. Most duplication doesn’t matter - I’ve written library(dplyr) in enough scripts to know trivial duplication.

But if you have long bits of code with substantial duplication between them, they can quickly come back to bite you.

I couldn’t find an R package to identify code duplication within or between scripts, packages and projects, so I wrote one: dupree.

Code duplication programs can be really complicated. This one isn’t. It converts the code blocks in your files into sentences and finds stretches of common ‘text’ within those sentences. There isn’t an abstract syntax tree in sight.

Plus it’s slow.

Triumphantly slow.

If you’re working on a big package this is not the tool for you.

Given that my day job involves aligning lots of little 70-character sentences against one massive 3-billion-character sentence, I was rather proud of how slowly dupree aligns things against things.

Here’s some code

# block 1
library(dplyr)

# block 2
data(iris)

# block 3
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

# block 4
lm(Sepal.Length ~ Petal.Length, data = filter(iris, Species == "setosa"))

## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = filter(iris, 
##     Species == "setosa"))
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##       4.2132        0.5423

# block 5
lm(Sepal.Length ~ Petal.Length, data = filter(iris, Species == "versicolor"))

## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = filter(iris, 
##     Species == "versicolor"))
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##       2.4075        0.8283

# block 6
lm(Sepal.Length ~ Petal.Length, qr = FALSE, data = filter(iris, Species == "virginica"))

## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = filter(iris, 
##     Species == "virginica"), qr = FALSE)
## 
## Coefficients:
##  (Intercept)  Petal.Length  
##       1.0597        0.9957

Now I’ve saved that code in a temporary file. Let’s have a look for any duplication within it.
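
The post doesn’t show how that file was written. One way to do it, as a minimal sketch: put the six blocks into a character vector and write them to a temporary file whose path is stored in f (the name used in the call below). The name example_code is mine, and the exact layout of the original file isn’t known, so the line numbers dupree reports for this version may differ from those shown later.

```r
# Assumption: `example_code` is an illustrative name; only `f` appears in the post.
example_code <- c(
  "# block 1",
  "library(dplyr)",
  "",
  "# block 2",
  "data(iris)",
  "",
  "# block 3",
  "summary(iris)",
  "",
  "# block 4",
  'lm(Sepal.Length ~ Petal.Length, data = filter(iris, Species == "setosa"))',
  "",
  "# block 5",
  'lm(Sepal.Length ~ Petal.Length, data = filter(iris, Species == "versicolor"))',
  "",
  "# block 6",
  'lm(Sepal.Length ~ Petal.Length, qr = FALSE, data = filter(iris, Species == "virginica"))'
)

# tempfile() only returns a path; writeLines() creates the file itself
f <- tempfile(fileext = ".R")
writeLines(example_code, f)
```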

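# Install dupree from GitHub if it isn't already available
if (!requireNamespace("dupree", quietly = TRUE)) {
  devtools::install_github(repo = "russHyde/dupree", dependencies = FALSE)
}

library(dupree)

# `f` is the path to the temporary file that holds the code above;
# %>% comes from dplyr, which was loaded in block 1
dupree::dupree(f, min_block_size = 10) %>%
  dplyr::select(-file_a, -file_b)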

## # A tibble: 2 x 5
##   block_a block_b line_a line_b score
##     <int>   <int>  <int>  <int> <dbl>
## 1       4       5     12     15 0.9  
## 2       4       6     12     18 0.818

What was that?

dupree split the file into 6 different code blocks (those # block X comments aren’t required for dupree to run, they’re just illustrative)

It ignored all the comments
It disregarded the first three blocks (library(dplyr), data(iris), summary(iris)) because they were small (that’s what the min_block_size argument was for)
Then it compared the final 3 code blocks against each other
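
The filtering step can be sketched roughly as follows. This is not dupree’s actual code: count_tokens is a hypothetical helper, and both the set of tokens I treat as trivial and the use of >= for the threshold are assumptions on my part. It uses utils::getParseData() from base R to list the tokens in each block.

```r
# Hypothetical helper: count the non-trivial tokens in one block of code.
# Assumption: comments, brackets, commas and the `=` of a named argument
# are treated as trivial; dupree's real rule may differ.
count_tokens <- function(block) {
  parsed <- utils::getParseData(parse(text = block, keep.source = TRUE))
  trivial <- c("COMMENT", "'('", "')'", "','", "EQ_SUB")
  sum(parsed$terminal & !(parsed$token %in% trivial))
}

blocks <- c(
  "library(dplyr)",
  "data(iris)",
  "summary(iris)",
  'lm(Sepal.Length ~ Petal.Length, data = filter(iris, Species == "setosa"))'
)

n_tokens <- vapply(blocks, count_tokens, integer(1), USE.NAMES = FALSE)

# Keep only the blocks that are big enough to be worth comparing
min_block_size <- 10
kept <- blocks[n_tokens >= min_block_size]
```

Under these assumptions the three short blocks have two tokens each and are dropped, while the lm() call survives.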

When dupree is run, it compares each pair of (non-trivial) code blocks and computes a score that indicates how similar those blocks are. At present it reports at least one similarity score for each of those blocks (so blocks 4, 5 and 6 are each present in at least one row of the results), but it doesn’t report the result of comparing every pair of code blocks. Block 5 versus block 6 was not reported because, for each of those blocks, the score for the comparison with block 4 is at least as large.
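
One way to get that kind of “best match per block” report, sketched in base R. This is my guess at the logic, not a copy of dupree’s internals. The 4-vs-5 and 4-vs-6 scores are the ones reported above; the 5-vs-6 score of 0.8 is a made-up value (the post only tells us it is no larger than 0.818), and best_pair_for is a hypothetical helper.

```r
# All pairwise scores. The last score is an assumed value for illustration.
pairs <- data.frame(
  block_a = c(4L, 4L, 5L),
  block_b = c(5L, 6L, 6L),
  score   = c(0.9, 0.818, 0.8)
)

# For one block, return the row index of its highest-scoring pair.
# which.max() returns the first maximum, so ties go to the earlier row.
best_pair_for <- function(block, pairs) {
  involved <- which(pairs$block_a == block | pairs$block_b == block)
  involved[which.max(pairs$score[involved])]
}

blocks <- sort(unique(c(pairs$block_a, pairs$block_b)))
best_rows <- vapply(blocks, best_pair_for, integer(1), pairs = pairs)

# Several blocks can share the same best pair, so drop repeated rows
reported <- pairs[sort(unique(best_rows)), ]
```

Here blocks 4 and 5 both pick the 4-vs-5 row and block 6 picks the 4-vs-6 row, so the 5-vs-6 comparison never appears in reported.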

The score is calculated in a pretty complicated way - but, the higher the score, the more similar two blocks are. Each of the code blocks is tokenized by converting the contained symbols (lm, formula, Sepal.Length, Petal.Length, …; though commas and other trivial symbols are disregarded) to a unique integer - the lm in block-4 is converted to the same integer as the lm in block-5, but the integer for lm is different from that for formula. Tokenizing constructs a vector of integers for each code block and it is these integer-vectors that are compared to determine the similarity score between the code blocks. Our go-to black box for similarity calculation is stringdist::seq_sim(..., method = "lcs") - check it out!
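
To make that a bit more concrete, here is a small base-R sketch of the idea: tokenize two blocks, map each distinct token to an integer, and score the two integer vectors by their longest common subsequence (LCS). It is not dupree’s implementation. The helper names are mine, the choice of which tokens to ignore is an assumption, and I’ve hand-rolled the LCS rather than calling stringdist so the example has no dependencies. The similarity used is 2 × LCS length / (length of a + length of b), which is my understanding of how an LCS similarity is scaled to lie between 0 and 1.

```r
# Keep the text of each meaningful token in a piece of code.
# Assumption: comments, brackets, commas and the `=` of a named argument
# are treated as trivial; dupree's real rule may differ.
tokenize <- function(code) {
  parsed <- utils::getParseData(parse(text = code, keep.source = TRUE))
  trivial <- c("COMMENT", "'('", "')'", "','", "EQ_SUB")
  parsed$text[parsed$terminal & !(parsed$token %in% trivial)]
}

# Length of the longest common subsequence of two vectors (dynamic programming)
lcs_length <- function(a, b) {
  n <- length(a)
  m <- length(b)
  tab <- matrix(0L, n + 1L, m + 1L)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      tab[i + 1L, j + 1L] <- if (a[i] == b[j]) {
        tab[i, j] + 1L
      } else {
        max(tab[i, j + 1L], tab[i + 1L, j])
      }
    }
  }
  tab[n + 1L, m + 1L]
}

# LCS similarity: the share of tokens, across both blocks, that sit in the LCS
lcs_similarity <- function(a, b) {
  2 * lcs_length(a, b) / (length(a) + length(b))
}

block_4 <- 'lm(Sepal.Length ~ Petal.Length, data = filter(iris, Species == "setosa"))'
block_5 <- 'lm(Sepal.Length ~ Petal.Length, data = filter(iris, Species == "versicolor"))'

tokens_4 <- tokenize(block_4)
tokens_5 <- tokenize(block_5)

# One shared vocabulary, so the same symbol always gets the same integer
vocabulary <- unique(c(tokens_4, tokens_5))
ints_4 <- match(tokens_4, vocabulary)
ints_5 <- match(tokens_5, vocabulary)

score_4_5 <- lcs_similarity(ints_4, ints_5)
```

Under these assumptions each block has ten tokens and they share nine of them in order, so the score is 2 × 9 / 20 = 0.9, in line with the value reported above for blocks 4 and 5.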

dupree works on single files or sets of files, and there are a couple of helpers for working with a directory tree or with an R package.

Why would you use it? Because it may help you reduce the amount of code you have to write and because it may help you simplify the code you have already written.

In a previous post I was talking about lintr. Let’s clone it and see if there is much duplication within the code (as of July 2019).

lintr_path <- file.path(tempdir(), "lintr")
lintr_repo <- git2r::clone("https://github.com/jimhester/lintr", lintr_path)

## cloning into '/tmp/RtmpoJljDU/lintr'...
## Receiving objects:   1% (32/3199),   17 kb
## Receiving objects:  11% (352/3199),   80 kb
## Receiving objects:  21% (672/3199),  169 kb
## Receiving objects:  31% (992/3199),  264 kb
## Receiving objects:  41% (1312/3199),  337 kb
## Receiving objects:  51% (1632/3199),  392 kb
## Receiving objects:  61% (1952/3199),  425 kb
## Receiving objects:  71% (2272/3199),  497 kb
## Receiving objects:  81% (2592/3199),  537 kb
## Receiving objects:  91% (2912/3199),  577 kb
## Receiving objects: 100% (3199/3199),  603 kb, done.

lintr_dups <- dupree::dupree_package(
  # use a higher min_block_size than we used above
  lintr_path, min_block_size = 40
)

Similarity scores across the reported code-block pairs are as follows:

plot(lintr_dups$score, ylab = "Similarity score")

So there are a few code blocks in there with a similarity score > 0.8. Here are some example files:

lintr_dups %>%
  dplyr::filter(score > 0.8 & file_a != file_b) %>%
  dplyr::mutate_at(c("file_a", "file_b"), basename) %>%
  head()

## # A tibble: 3 x 7
##   file_a             file_b             block_a block_b line_a line_b score
##   <chr>              <chr>                <int>   <int>  <int>  <int> <dbl>
## 1 equals_na_lintr.R  no_tab_linter.R          1       1      3      3 0.886
## 2 function_left_par… spaces_left_paren…       1       1      4      4 0.884
## 3 undesirable_funct… undesirable_opera…       1       2      6     16 0.815

A lot of the linter-functions have exactly the same structure. Maybe there’s an easier way to write linter-functions… [next time]

(Ooo, I have a network image of the similarities in lintr in a recent talk I gave)

Russ Hyde

2019-07-05