About my data-blog Sep 23, 2018 This is my research blog about data science. Anything interesting from computing, data analysis, maths or biology may make it into here. I am in the process of migrating some content from Blogger; most of my early posts were about R, bioinformatics or the bash / conda / Linux backdrop to my daily work. The site was made from Rmarkdown posts using the R package blogdown and the static site engine Hugo.
Probabilistic modelling resources Sep 22, 2019 Probabilistic Programming I’ve been doing a bit of probabilistic modelling in Stan recently and have used JAGS for a long time. Probabilistic modelling (as embodied in probabilistic graphical models, or PGMs, and Bayesian statistics, and implemented in probabilistic programming languages and libraries like Stan) is a way to model some phenomenon that incorporates various sources of randomness, and the dependence between components of that model. The models used tend to be sparser and more informative than those a neural-network based model would generate for the same phenomenon (IMO).
New-Package Checklist Aug 23, 2019 Things to do when starting a new R package. Before setup: Name selected using available::suggest() and available::available("prospective_pkg", browse = FALSE); New repository on GitHub (without readme / gitignore / license); Ensure all packages that your new package will rely on are available in your current R environment. Initial setup: New project in RStudio (New Project -> New Directory -> R Package); Define package name to match the GitHub repo
`dupree` Jul 5, 2019 dupree
R-function Pokemon Jul 4, 2019 R-Function Pokemon and the Informal Formats of Formals When writing R, I tend to use snake_case for object names. The bioconductor project tends to use camelCase (limma::makeContrasts, biomaRt::useMart) and a lot of base functions use dotted.case. There are functions in R that use a few different formats for the function and argument names. For example, scan has both dotted and camelCase parameters (na.strings, allowEscapes), sapply has ALLUPPERCASE, DOTTED.UPPER.CASE and alllowercase parameters (FUN, USE.
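The mix of naming styles mentioned above can be checked directly from a function's formals. The following is a minimal sketch; `name_style()` is a hypothetical helper written for this illustration, and its regular expressions are only a rough classification of naming styles.

```r
# Argument names of two base functions, taken from their formals
scan_args <- names(formals(scan))
sapply_args <- names(formals(sapply))

# Hypothetical helper: rough classification of a name's style.
# Later assignments override earlier ones, so "dotted" wins over the rest.
name_style <- function(x) {
  style <- rep("other", length(x))
  style[grepl("^[a-z]+$", x, perl = TRUE)] <- "lowercase"
  style[grepl("^[A-Z]+$", x, perl = TRUE)] <- "UPPERCASE"
  style[grepl("^[a-z]+[A-Z]", x, perl = TRUE)] <- "camelCase"
  style[grepl("[A-Za-z]\\.[A-Za-z]", x, perl = TRUE)] <- "dotted"
  style
}

# Group each function's argument names by style
split(scan_args, name_style(scan_args))
split(sapply_args, name_style(sapply_args))
```

Grouping with `split()` puts `na.strings` and `allowEscapes` in different groups for `scan`, and separates `FUN`, `USE.NAMES` and `simplify` for `sapply`.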
`lintr` Jul 1, 2019 Lint is the fluff on your clothes. Aside from all that fluff, you look fine. lintr (Author: Jim Hester) compares the code in your files / packages against a style-guide. This helps ensure your source code looks pretty consistent across your package(s). Why is that useful? It might not be. I couldn’t find many objective studies of code readability amongst the thousands of opinion pieces that are online, so I can’t tell you whether consistent styling is all that valuable.
Code Analysis in R Jun 27, 2019 You’ve been analysing data all day, now let’s analyse your code … Within its programming toolkit, R has some really cool things for analysing code and for identifying / fixing issues in your code. What kinds of code-level stuff (eg, software-design/architectural properties, code smells) might you want to be aware of when developing packages or writing analysis scripts? And what tools are available to do this? Dependencies between packages (pkgnet)
Data Analysis Project Architecture Dec 6, 2018 Data analysis projects need a slightly different structure from general programming projects. Their organisation should ensure they are: reproducible: the project should be rerunnable and all results / reports generated should be automatically produced; running it on a different computer or in a different location should not affect the results; version-controlled: an explicit version of the source code, and any dependencies (eg, external github repos, CRAN packages, command line tools) should be used; consistent
Matrix::Matrix Sep 30, 2018 Originally posted 2018-05-04 to Blogger. See here and here for some background on the Matrix package. Matrix provides sparse-matrix objects to R. So if you’re making matrices that are mostly zero, use Matrix, not matrix. I recently used Matrix while trying to work out the overlap sizes between a few hundred different sets of genes. The genesets were represented as a list of vectors of gene-ids, each vector being a single geneset.
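One way to get pairwise overlap sizes from a list of genesets is to build a sparse gene-by-set indicator matrix and take its cross-product. This is a minimal sketch and not necessarily the approach used in the original post; the genesets and gene-ids below are made up, and it assumes no gene-id is repeated within a set.

```r
library(Matrix)

# Hypothetical genesets: a named list of character vectors of gene-ids
genesets <- list(
  set_a = c("g1", "g2", "g3"),
  set_b = c("g2", "g3", "g4"),
  set_c = c("g5")
)

genes <- unique(unlist(genesets))

# Sparse indicator matrix: entry (i, j) is 1 if gene i belongs to set j
membership <- sparseMatrix(
  i = match(unlist(genesets), genes),
  j = rep(seq_along(genesets), lengths(genesets)),
  x = 1,
  dims = c(length(genes), length(genesets)),
  dimnames = list(genes, names(genesets))
)

# t(membership) %*% membership: entry (a, b) counts genes shared by sets a and b;
# the diagonal holds the size of each set
overlaps <- crossprod(membership)
overlaps["set_a", "set_b"]
```

Only the non-zero memberships are stored, which is what makes this practical for a few hundred sets over many thousands of genes.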
UpSetR charts Sep 30, 2018 Venn diagrams blow. Multiset Venn diagrams both blow and suck: Don’t make them; and Don’t make me interpret them; and Don’t try and put them in your presentations because you’ll get lost. UpSetR provides a way to do the multi-set comparison thing without looking horrific. We sample a few sets from the letters b-z:

    library("UpSetR")
    set.seed(1)
    bucket_names <- paste0("set", 1:6)
    buckets <- Map(
      function(x){
        bucket_size <- sample(1:25, 1)
        bucket <- sample(letters[-1], bucket_size, replace = FALSE)
      },
      bucket_names
    )
    lapply(buckets, sort)
    ## $set1
    ## [1] "f" "k" "n" "o" "t" "v" "x"
    ##
    ## $set2
    ## [1] "c" "d" "f" "h" "i" "j" "k" "m" "n" "o" "q" "r" "s" "w" "y" "z"
    ##
    ## $set3
    ## [1] "b" "e" "i" "k" "l" "m" "p" "v" "x" "y"
    ##
    ## $set4
    ## [1] "b" "c" "d" "f" "g" "i" "k" "l" "n" "o" "p" "q" "s" "t" "u" "v" "w"
    ## [18] "x" "y" "z"
    ##
    ## $set5
    ## [1] "c" "f" "h" "j" "k" "n" "q" "r" "s" "t" "v" "w" "y"
    ##
    ## $set6
    ## [1] "b" "c" "d" "e" "f" "g" "i" "j" "k" "l" "m" "n" "p" "q" "r" "s" "t"
    ## [18] "u" "w" "x" "z"

The function upset takes a data-frame as input.
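Since upset wants a data-frame, a list of sets has to be converted into one row per element and one 0/1 column per set. Here is a minimal sketch of that conversion in base R, using small made-up sets rather than the sampled buckets above; UpSetR also ships a helper, `fromList()`, that does a similar conversion. The plot is only drawn if UpSetR is installed.

```r
# Hypothetical sets, much smaller than the sampled buckets
sets <- list(
  set1 = c("b", "c", "d"),
  set2 = c("c", "d", "e"),
  set3 = c("d", "f")
)

# All elements that appear in at least one set
elements <- sort(unique(unlist(sets)))

# One row per element, one 0/1 indicator column per set
membership <- as.data.frame(
  lapply(sets, function(s) as.integer(elements %in% s)),
  row.names = elements
)
membership

# Draw the chart only when UpSetR is available
if (requireNamespace("UpSetR", quietly = TRUE)) {
  UpSetR::upset(membership, nsets = length(sets))
}
```

Each row sum gives the number of sets an element belongs to, and each column sum gives a set's size, which is the information the chart is built from.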
tidyr::nest Sep 30, 2018 Originally posted 2018-05-02 to Blogger. See here and here. In dplyr, I often want to group_by some columns, apply a function to the subtables defined by grouping, and then dissolve away the grouping. The function applied to the subtables may return: a single row for each group - in which case you’d use dplyr::summarise() or summarize(); a row for each row in the subtable - in which case there may be an appropriate window function;
Working Directories and RMarkdown Sep 10, 2018 As discussed elsewhere, I organise my bioinformatics projects like this:

    ./jobs/
      - <jobname>/
        - conf/
        - data/
        - doc/
          - notebook.Rmd
          - some_other_script.Rmd
        - .here
        - lib/
        - results/
        - requirements.txt
        - scripts/
        - subjobs/
          - <nested_structure>
        - Snakefile

Here, the top-level Snakemake script controls the running of all scripts and the compiling of all documents. My labbooks are stored as R-markdown documents and get compiled / knitted to .pdfs by the packages rmarkdown and knitr.