Russ Hyde
dupree: All files are the same, but some are samier than others
… and on that dark and stormy night, the fabled real programmers came down from Mount Partition. It is said they smell your code before it is even written.
The sources of duplication are everywhere
We’ve all copied & pasted stuff. “Screw it, it worked then and it’ll work now!”. “Could you add a graph like you did the other week into this report? You’ve got til yesterday”.
And to be fair, abstracting code is a bit more complicated than just bloating out more code. Hell, if you write functions, you might have to test them…
I write a lot of analysis code. Sometimes I duplicate things in different projects, sometimes I do it in the same project, sometimes I do it in the same file. Most duplication doesn’t matter - I’ve written library(dplyr) in enough scripts to know trivial duplication.
But if you have long bits of code that have substantial duplication between them they can quickly come back to bite you.
I couldn’t find an R package to identify code duplication in/between scripts, packages and projects, so I wrote one: dupree.
Code duplication programs can be really complicated. This one isn’t. It converts the code blocks in your files into sentences and finds stretches of common ‘text’ within those sentences. There isn’t an abstract syntax tree in sight.
Plus it’s slow.
Triumphantly slow.
If you’re working on a big package this is not the tool for you.
Given that my day job involves aligning lots of little 70-character sentences against one massive 3-billion-character sentence, I was rather proud of how slowly dupree aligns things against things.
Here’s some code
# block 1
# block 2
# block 3
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
# block 4
lm(Sepal.Length ~ Petal.Length, data = filter(iris, Species == "setosa"))
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = filter(iris,
## Species == "setosa"))
## Coefficients:
## (Intercept) Petal.Length
## 4.2132 0.5423
# block 5
lm(Sepal.Length ~ Petal.Length, data = filter(iris, Species == "versicolor"))
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = filter(iris,
## Species == "versicolor"))
## Coefficients:
## (Intercept) Petal.Length
## 2.4075 0.8283
# block 6
lm(Sepal.Length ~ Petal.Length, qr = FALSE, data = filter(iris, Species == "virginica"))
## Call:
## lm(formula = Sepal.Length ~ Petal.Length, data = filter(iris,
## Species == "virginica"), qr = FALSE)
## Coefficients:
## (Intercept) Petal.Length
## 1.0597 0.9957
Now I’ve saved that code in a temporary file. Let’s have a look for any duplication within it.
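# Install dupree from GitHub if it isn't already available
if (! "dupree" %in% installed.packages()){
  devtools::install_github(repo = "russHyde/dupree", dependencies = FALSE)
}
# f is the path to the temporary file holding the code above
dupree::dupree(f, min_block_size = 10) %>%
  dplyr::select(-file_a, -file_b)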
## # A tibble: 2 x 5
## block_a block_b line_a line_b score
## <int> <int> <int> <int> <dbl>
## 1 4 5 12 15 0.9
## 2 4 6 12 18 0.818
What was that?
dupree split the file into 6 different code blocks (those # block X comments aren’t required for dupree to run, they’re just illustrative)
It ignored all the comments
It disregarded the first three blocks because they were small (that’s what the min_block_size argument was for)
Then it compared the final 3 code blocks against each other
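The splitting and size-filtering steps can be sketched in base R. This is a rough illustration rather than dupree's actual code: it assumes a "block" is a top-level expression, that a block's size is its number of non-trivial tokens, and that brackets, commas and argument `=` signs are the trivial ones. The helper name `block_tokens` and the example script are made up for the sketch.

```r
# A small example script: two trivial blocks and one longer one.
code <- '
# block 1
library(stats)

# block 2
summary(iris)

# block 3
lm(Sepal.Length ~ Petal.Length, data = subset(iris, Species == "setosa"))
'

# Tokens kept for a block: terminal symbols, minus comments and punctuation.
# (Which symbols count as "trivial" is an assumption of this sketch.)
block_tokens <- function(block_text) {
  pd <- utils::getParseData(parse(text = block_text, keep.source = TRUE))
  keep <- pd$terminal &
    pd$token != "COMMENT" &
    !(pd$text %in% c("(", ")", ",", "="))
  pd$text[keep]
}

# Parsing (without evaluating) gives one entry per top-level expression.
exprs <- parse(text = code, keep.source = TRUE)
refs <- attr(exprs, "srcref")

# Recover the source text and starting line of each block.
blocks <- vapply(
  refs,
  function(ref) paste(as.character(ref), collapse = "\n"),
  character(1)
)
start_line <- vapply(refs, function(ref) unclass(ref)[1L], integer(1))

# Count the tokens in each block and keep only the big-enough ones.
sizes <- lengths(lapply(blocks, block_tokens))
min_block_size <- 10
kept <- data.frame(start_line, sizes)[sizes >= min_block_size, ]
print(kept)
```

Under those assumptions the two short blocks contain 2 tokens each and are dropped, while the `lm()` call contains 10 and survives the filter.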
When dupree is run, it compares each pair of (non-trivial) code blocks and computes a score that indicates how similar those blocks are. At present it reports at least one similarity score for each of those blocks (so blocks 4, 5 and 6 are each present in at least one row of the results), but it doesn’t report the results of comparing every pair of code blocks (block-5 versus block-6 was not reported because for each of these blocks the score for the comparison with block-4 is at least as large).
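One way to picture that reporting rule is "keep each block's best-scoring comparison, then drop the repeats". The sketch below is only a guess at the behaviour described, not dupree's implementation. The first two scores come from the output above; the block-5 versus block-6 score is a hypothetical placeholder, since dupree never reported it.

```r
# All pairwise comparisons between the three non-trivial blocks.
pairs <- data.frame(
  block_a = c(4L, 4L, 5L),
  block_b = c(5L, 6L, 6L),
  score   = c(0.9, 0.818, 0.8)  # 0.8 is a made-up value for illustration
)

# For each block, find the row holding its highest-scoring comparison.
all_blocks <- sort(unique(c(pairs$block_a, pairs$block_b)))
best_rows <- vapply(
  all_blocks,
  function(block) {
    involved <- which(pairs$block_a == block | pairs$block_b == block)
    involved[which.max(pairs$score[involved])]
  },
  integer(1)
)

# Several blocks can share the same best row, so keep each row only once.
reported <- pairs[sort(unique(best_rows)), ]
print(reported)
```

Blocks 4 and 5 both pick the 4-versus-5 row, and block 6 picks 4-versus-6, so the 5-versus-6 comparison never makes it into the result.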
The score is calculated in a pretty complicated way - but, the higher the score, the more similar two blocks are. Each of the code blocks is tokenized by converting the contained symbols (lm, formula, Sepal.Length, Petal.Length, …; though commas and other trivial symbols are disregarded) to a unique integer - the lm in block-4 is converted to the same integer as the lm in block-5, but the integer for lm is different from that for formula. Tokenizing constructs a vector of integers for each code block and it is these integer-vectors that are compared to determine the similarity score between the code blocks. Our go-to black box for similarity calculation is stringdist::seq_sim(..., method = "lcs") - check it out!
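To make the tokenize-then-compare idea concrete, here is a minimal base-R sketch. It is a hand-rolled stand-in, not dupree's code and not the stringdist call: the `tokenize`, `lcs_length` and `lcs_similarity` helpers are hypothetical, the choice of which symbols to discard is an assumption, and so is the normalisation (twice the longest-common-subsequence length divided by the combined length of the two vectors).

```r
block_4 <- 'lm(Sepal.Length ~ Petal.Length, data = filter(iris, Species == "setosa"))'
block_5 <- 'lm(Sepal.Length ~ Petal.Length, data = filter(iris, Species == "versicolor"))'
block_6 <- 'lm(Sepal.Length ~ Petal.Length, qr = FALSE, data = filter(iris, Species == "virginica"))'

# Split a block into its symbols, dropping comments and punctuation.
# (The set of "trivial" symbols is an assumption of this sketch.)
tokenize <- function(code) {
  pd <- utils::getParseData(parse(text = code, keep.source = TRUE))
  keep <- pd$terminal &
    pd$token != "COMMENT" &
    !(pd$text %in% c("(", ")", ",", "="))
  pd$text[keep]
}

tokens <- lapply(list(block_4, block_5, block_6), tokenize)

# Give every distinct symbol one integer, shared across all blocks.
vocabulary <- unique(unlist(tokens))
encoded <- lapply(tokens, match, table = vocabulary)

# Length of the longest common subsequence, by dynamic programming.
lcs_length <- function(a, b) {
  m <- matrix(0L, nrow = length(a) + 1L, ncol = length(b) + 1L)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      m[i + 1L, j + 1L] <- if (a[i] == b[j]) {
        m[i, j] + 1L
      } else {
        max(m[i, j + 1L], m[i + 1L, j])
      }
    }
  }
  m[length(a) + 1L, length(b) + 1L]
}

# Similarity in [0, 1]: 1 for identical vectors, 0 for nothing in common.
lcs_similarity <- function(a, b) {
  2 * lcs_length(a, b) / (length(a) + length(b))
}

round(lcs_similarity(encoded[[1]], encoded[[2]]), 3)  # 0.9
round(lcs_similarity(encoded[[1]], encoded[[3]]), 3)  # 0.818
```

With these assumptions, blocks 4 and 5 share 9 of their 10 tokens, and block 6 has two extra tokens (qr and FALSE), which gives scores in line with the 0.9 and 0.818 in the table above.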
dupree works on single files and on sets of files, and there are a couple of helpers for working with a directory tree or with an R package.
Why would you use it? Because it may help you reduce the amount of code you have to write and because it may help you simplify the code you have already written.
In a previous post I was talking about lintr. Let’s clone it and see if there is much duplication within the code (as of July 2019).
lintr_path <- file.path(tempdir(), "lintr")
lintr_repo <- git2r::clone("https://github.com/jimhester/lintr", lintr_path)
## cloning into '/tmp/RtmpoJljDU/lintr'...
## Receiving objects: 1% (32/3199), 17 kb
## Receiving objects: 11% (352/3199), 80 kb
## Receiving objects: 21% (672/3199), 169 kb
## Receiving objects: 31% (992/3199), 264 kb
## Receiving objects: 41% (1312/3199), 337 kb
## Receiving objects: 51% (1632/3199), 392 kb
## Receiving objects: 61% (1952/3199), 425 kb
## Receiving objects: 71% (2272/3199), 497 kb
## Receiving objects: 81% (2592/3199), 537 kb
## Receiving objects: 91% (2912/3199), 577 kb
## Receiving objects: 100% (3199/3199), 603 kb, done.
lintr_dups <- dupree::dupree_package(
  # use a higher min_block_size than we used above
  lintr_path, min_block_size = 40
)
Similarity scores across the reported code-block pairs are as follows:
plot(lintr_dups$score, ylab = "Similarity score")
So there are a few code-blocks in there with a similarity score > 0.8. Here are some example files:
lintr_dups %>%
  dplyr::filter(score > 0.8 & file_a != file_b) %>%
  dplyr::mutate_at(c("file_a", "file_b"), basename)
## # A tibble: 3 x 7
## file_a file_b block_a block_b line_a line_b score
## <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 equals_na_lintr.R no_tab_linter.R 1 1 3 3 0.886
## 2 function_left_par… spaces_left_paren… 1 1 4 4 0.884
## 3 undesirable_funct… undesirable_opera… 1 2 6 16 0.815
A lot of the linter-functions have exactly the same structure. Maybe there’s an easier way to write linter-functions… [next time]
(Ooo, I have a network image of the similarities in lintr in a recent talk I gave)