About my data-blog
Sep 23, 2018
This is my research blog about data science. Anything interesting from computing, data analysis, maths or biology may make it into here. I am in the process of migrating some content from Blogger; most of my early posts were about R, bioinformatics or the bash / conda / linux backdrop to my daily work.
The site was made from Rmarkdown posts using the R package blogdown and the static site engine Hugo.
Shiny App: StackOverflow WordCloud
Mar 10, 2021
Who answers what on Stack Overflow?
Here’s a little Shiny App I made that takes a Stack-Overflow User ID, finds out what subjects (tags) they answer about on Stack Overflow, and creates a wordcloud of their most frequent tags.
The source code can be found here.
It uses
the stackr package for accessing the Stack Exchange API; and
the CRAN package wordcloud for drawing the image.
The default image is my own wordcloud and, given the dominance of ‘r’, ‘python’, ‘snakemake’ and ‘conda’, reflects my data analysis / bioinformatics life.
Shiny Resources
Mar 5, 2021
On the “R-for-data-science” community slack channel, we are currently running a book-club to study “Mastering Shiny” by Hadley Wickham (on Tuesdays at 5PM GMT).
Here are some resources that you might find useful if learning “shiny”.
General
Cheat sheets:
https://github.com/rstudio/cheatsheets/raw/master/shiny.pdf
Books:
Engineering Production-Grade Shiny Apps
Outstanding User Interfaces with Shiny
Competitions & Examples:
Shiny Gallery
Winners of the 2020 Annual Shiny Contest
Awesome Lists:
`renv` in conda
Feb 23, 2021
Get your reproducibility here!
Reproducibility is nice. It’s nice to be able to rerun a project from a couple of years ago and get the same results, reports and figures. It’s even nicer if you can hand your project over to someone else, and they can generate the same results.
But it takes a lot of work.
What could change in the intervening couple-of-years, or on transfer from one computer to another, that would alter the results generated in a project?
Probabilistic modelling resources
Sep 22, 2019
Probabilistic Programming
I’ve been doing a bit of probabilistic modelling in STAN recently and have used JAGS for a long time. Probabilistic modelling (as embodied in probabilistic graphical models - PGMs - and Bayesian statistics and implemented in probabilistic programming languages and libraries like STAN) is a way to model some phenomenon that incorporates various sources of randomness, and the dependence between components of that model. The models used tend to be more sparse and more informative than would be generated by a neural-network based model for the same phenomenon (IMO).
New-Package Checklist
Aug 23, 2019
Things to do when starting a new R package.
Before setup
Name selected using available::suggest() and available::available("prospective_pkg", browse = FALSE)
New repository on github (without readme / gitignore / license)
Ensure all packages that your new package will rely on are available in your current R environment
Initial setup
New project in RStudio
New Project –> New Directory –> R Package
Define package name to match the github repo
`dupree`
Jul 5, 2019
R-function Pokemon
Jul 4, 2019
R-Function Pokemon and the Informal Formats of Formals
When writing R, I tend to use snake_case for object names. The bioconductor project tends to use camelCase (limma::makeContrasts, biomaRt::useMart) and a lot of base functions use dotted.case.
There are functions in R that use a few different formats for the function and argument names. For example,
scan has both dotted and camelCase parameters (na.strings, allowEscapes),
sapply has ALLUPPERCASE, DOTTED.UPPER.CASE and alllowercase parameters (FUN, USE.
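The mix of formats can be checked directly from the function definitions. The snippet below is a rough sketch using only base R: formals() returns a function's arguments, and name_format() is a hypothetical helper (not part of any package) that labels each argument name with a best-guess format.

```r
# Argument names of two base functions
scan_args <- names(formals(scan))
sapply_args <- names(formals(sapply))

# Hypothetical helper: label an argument name with a rough format category
name_format <- function(x) {
  if (x == "...") return("dots")
  if (grepl("^[[:upper:]]+(\\.[[:upper:]]+)+$", x)) return("DOTTED.UPPER.CASE")
  if (grepl("^[[:upper:]]+$", x)) return("ALLUPPERCASE")
  if (grepl(".", x, fixed = TRUE)) return("dotted.case")
  if (grepl("_", x, fixed = TRUE)) return("snake_case")
  if (grepl("^[[:lower:]]+[[:upper:]]", x)) return("camelCase")
  if (grepl("^[[:lower:]]+$", x)) return("alllowercase")
  "other"
}

# Named character vectors: argument name -> format label
scan_formats <- vapply(scan_args, name_format, character(1))
sapply_formats <- vapply(sapply_args, name_format, character(1))

# How many arguments of each format does each function have?
table(scan_formats)
table(sapply_formats)
```

The regular expressions are deliberately crude, so single-letter or unusual names may land in a category you would not choose by eye.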
`lintr`
Jul 1, 2019
Lint is the fluff on your clothes. Aside from all that fluff, you look fine.
lintr (Author: Jim Hester) compares the code in your files / packages against a style-guide. This helps ensure your source code looks pretty consistent across your package(s).
Why is that useful?
It might not be. I couldn’t find many objective studies of code readability amongst the thousands of opinion pieces that are online, so I can’t tell you whether consistent styling is all that valuable.
Code Analysis in R
Jun 27, 2019
You’ve been analysing data all day, now let’s analyse your code …
Within its programming toolkit, R has some really cool things for analysing code and for identifying / fixing issues in your code.
What kinds of code-level stuff (eg, software-design/architectural properties, code smells) might you want to be aware of when developing packages or writing analysis scripts? And what tools are available to do this?
Dependencies
between packages (pkgnet)
Data Analysis Project Architecture
Dec 6, 2018
Data analysis projects need a slightly different structure from general programming projects. Their organisation should ensure they are
reproducible
the project should be rerunnable and all results / reports generated should be automatically produced
running it on a different computer or in a different location should not affect the results
version-controlled
an explicit version of the source code, and any dependencies (eg, external github repos, CRAN packages, command line tools) should be used consistent
Matrix::Matrix
Sep 30, 2018
Originally posted 2018-05-04 to Blogger.
See here and here for some background on the Matrix package.
Matrix provides sparse-matrix objects to R. So if you’re making matrices that are mostly zero, use Matrix, not matrix.
I recently used Matrix while trying to work out the overlap sizes between a few hundred different sets of genes. The genesets were represented as a list of vectors of gene-ids; each vector being a single geneset.
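One way to get all pairwise overlap sizes from such a list is to build a sparse gene-by-geneset incidence matrix and take its cross-product. The sketch below is not necessarily what the original post does: the genesets and gene ids are made up, and it assumes no gene id is repeated within a geneset (sparseMatrix sums repeated entries).

```r
library(Matrix)

# Hypothetical toy genesets: a list of vectors of gene ids
genesets <- list(
  set_a = c("g1", "g2", "g3"),
  set_b = c("g2", "g3", "g4"),
  set_c = c("g5")
)

genes <- sort(unique(unlist(genesets)))

# Sparse incidence matrix: rows are genes, columns are genesets,
# with a 1 wherever a gene belongs to a geneset
incidence <- sparseMatrix(
  i = match(unlist(genesets), genes),
  j = rep(seq_along(genesets), lengths(genesets)),
  x = 1,
  dims = c(length(genes), length(genesets)),
  dimnames = list(genes, names(genesets))
)

# t(incidence) %*% incidence: entry [a, b] counts the genes shared by a and b;
# the diagonal holds the size of each geneset
overlap_sizes <- as.matrix(crossprod(incidence))
overlap_sizes
```

For a few hundred genesets the final dense conversion is small; the sparse representation matters for the gene-by-geneset matrix, which is mostly zero.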
UpSetR charts
Sep 30, 2018
Venn diagrams blow.
Multiset Venn diagrams both blow and suck:
Don’t make them; and
Don’t make me interpret them; and
Don’t try and put them in your presentations because you’ll get lost.
UpSetR provides a way to do the multi-set comparison thing without looking horrific.
We sample a few sets from the letters b-z:
library("UpSetR")
set.seed(1)
bucket_names <- paste0("set", 1:6)
buckets <- Map(
  function(x) {
    # Pick a random set size, then sample that many letters from b-z
    bucket_size <- sample(1:25, 1)
    sample(letters[-1], bucket_size, replace = FALSE)
  },
  bucket_names
)
lapply(buckets, sort)
## $set1
## [1] "f" "k" "n" "o" "t" "v" "x"
##
## $set2
## [1] "c" "d" "f" "h" "i" "j" "k" "m" "n" "o" "q" "r" "s" "w" "y" "z"
##
## $set3
## [1] "b" "e" "i" "k" "l" "m" "p" "v" "x" "y"
##
## $set4
## [1] "b" "c" "d" "f" "g" "i" "k" "l" "n" "o" "p" "q" "s" "t" "u" "v" "w"
## [18] "x" "y" "z"
##
## $set5
## [1] "c" "f" "h" "j" "k" "n" "q" "r" "s" "t" "v" "w" "y"
##
## $set6
## [1] "b" "c" "d" "e" "f" "g" "i" "j" "k" "l" "m" "n" "p" "q" "r" "s" "t"
## [18] "u" "w" "x" "z"
The function upset takes a data-frame as input.
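A list of sets therefore has to be turned into a table first. As a rough base-R sketch (using small made-up sets rather than the sampled buckets), one option is a 0/1 membership table with one row per element and one column per set. UpSetR also ships a fromList() helper for this conversion, which is probably the more convenient route in practice.

```r
# Hypothetical toy sets, standing in for the sampled buckets
toy_sets <- list(
  set1 = c("b", "c", "d"),
  set2 = c("c", "d", "e"),
  set3 = c("e", "f")
)

elements <- sort(unique(unlist(toy_sets)))

# One row per element, one integer 0/1 column per set
membership <- as.data.frame(
  lapply(toy_sets, function(s) as.integer(elements %in% s)),
  row.names = elements
)

membership

# If UpSetR is installed, the table could then be plotted with:
# UpSetR::upset(membership)
```

Each column sum equals the size of the corresponding set, which is a quick sanity check on the conversion.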
tidyr::nest
Sep 30, 2018
Originally posted 2018-05-02 to Blogger.
See here and here.
In dplyr, I often want to group_by some columns, apply a function to the subtables defined by grouping, and then dissolve away the grouping. The function applied to the subtables may return:
a single row for each group - in which case you’d use dplyr::summarise() or summarize();
a row for each row in the subtable - in which case there may be an appropriate window function;
Working Directories and RMarkdown
Sep 10, 2018
As discussed elsewhere, I organise my bioinformatics projects like this:
./jobs/
    - <jobname>/
        - conf/
        - data/
        - doc/
            - notebook.Rmd
            - some_other_script.Rmd
        - .here
        - lib/
        - results/
        - requirements.txt
        - scripts/
        - subjobs/
            - <nested_structure>
        - Snakefile
Here, the top level snakemake script controls the running of all scripts and the compiling of all documents. My labbooks are stored as R-markdown documents and get compiled / knitted to .pdfs by the packages rmarkdown and knitr.