Title: | Utilities for Preparation of Data Analysis |
---|---|
Description: | Miscellaneous small utilities are provided to mitigate issues with messy, inconsistent, or high-dimensional data and to help with preprocessing and preparing analyses. |
Authors: | Josef Frank |
Maintainer: | Josef Frank <[email protected]> |
License: | GPL-3 |
Version: | 1.0.3 |
Built: | 2024-11-24 03:23:32 UTC |
Source: | https://github.com/cran/preputils |
Proportional data are commonly modelled using a GLM approach with a logit link function. If the logit transformation is instead performed separately in advance, simple OLS methods can be applied.
beta2m(b)
b |
vector or matrix holding the original data |
Data are transformed according to M = log2(b/(1-b)). The input data are assumed to lie in the range 0 < b < 1; data outside this range lead to missing values. Corner cases (b = 0 or b = 1) can be handled by use of fixlimits().
A named vector/matrix with same dimensions as b and log2 transformed values
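The transformation is a single expression; a minimal base-R sketch (the name beta2m_sketch is hypothetical, assuming beta2m implements exactly the formula above) is:

```r
# logit transform in base 2: M = log2(b / (1 - b))
beta2m_sketch <- function(b) log2(b / (1 - b))

beta2m_sketch(0.5)  # 0, since 0.5/(1-0.5) = 1
beta2m_sketch(0.8)  # 2, since log2(0.8/0.2) = log2(4)
```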
a <- sapply(c(0.01,0.05,0.5,0.8,0.9), function(x) rbinom(30,100,x)/100)
matplot(a, pch=20)
matplot(beta2m(a), pch=20)
matplot(a, beta2m(a), pch=20)
Inspired by the SQL function of the same name. Given vectors of equal length, the function selects, for each observation, the first value that is not NA, in the order provided.
coalesce(...)
... |
vectors holding the values, separated by "," (commonly columns of a data.frame) |
For each position, the first non-missing value across the supplied vectors is selected.
A named vector holding the supplemented values
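A base-R equivalent of this first-non-NA selection could look like the following sketch (coalesce_sketch is a hypothetical name, not the package function):

```r
# take the first non-NA value per position, in argument order
coalesce_sketch <- function(...) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

coalesce_sketch(c(1, NA, NA), c(2, 2, NA), c(NA, 3, 3))  # 1 2 3
```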
a1 = c(1,NA,NA,NA)
a2 = c(2,2,NA,NA)
a3 = c(NA,3,3,NA)
cbind(a1, a2, a3, suppl=coalesce(a1,a2,a3))
In properly normalized databases, a 1:1 mapping should be complete and unique. In real-world data, however, ID mappings or database key candidates are repeated across observations, partly with missing entries in merged data sets. fillmap supplements NAs in mapping variables as far as possible.
fillmap(x, y, what = "xy", rmdup=FALSE, rmmiss=FALSE, printori=FALSE)
x , y
|
vectors of equal length, holding the mapping values, separated by "," |
what |
data to be returned: either the 1st ("x") or 2nd argument ("y"), or a data.table containing both ("xy") |
rmdup |
remove duplicates from mapping (TRUE) or return all rows in original order (FALSE) |
rmmiss |
remove rows where no mapping could be found (TRUE) or return all rows (FALSE) |
printori |
print original variables side by side |
fillmap assumes a 1:1 mapping between the provided variables, as is commonly the case, for example, in ID translation steps.
For all cases where a proper unambiguous 1:1 match exists, the missing values are filled in.
Vector or data.table with the original mapping data, where NAs are filled in with supplemented data where possible
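The core idea — completing NAs from unambiguous pairs observed elsewhere — can be sketched in base R as follows (fillmap_sketch is hypothetical and ignores the ambiguity checks and the rmdup/rmmiss options of the real function):

```r
# supplement NAs in y using the complete x -> y pairs observed elsewhere
fillmap_sketch <- function(x, y) {
  map <- unique(na.omit(data.frame(x, y)))  # known complete pairs
  miss <- is.na(y) & !is.na(x)
  y[miss] <- map$y[match(x[miss], map$x)]
  y
}

fillmap_sketch(c(1, 2, 1), c("a", "b", NA))  # "a" "b" "a"
```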
library(data.table)
pheno1 <- data.frame(id1=c(1,2,3,4), id2=c(11,22,NA,NA), phenodat=c(NA,NA,NA,"d"))
pheno2 <- data.frame(id1=c(NA,NA,NA), id2=c(11,22,33), phenodat=c("a","b","c"))
pheno3 <- data.frame(id1=c(4,3), id2=c(44,33), phenodat=c(NA,NA))
phenoges <- rbind(rbind(pheno1,pheno2), pheno3)
with(phenoges, fillmap(id1, phenodat))
with(phenoges, fillmap(id1, phenodat, rmdup=TRUE))
with(phenoges, fillmap(id1, phenodat, rmmiss=TRUE))
with(phenoges, fillmap(id1, phenodat, rmdup=TRUE, rmmiss=TRUE))
with(phenoges, fillmap(id2, phenodat))
with(phenoges, fillmap(id2, phenodat, rmdup=TRUE))
with(phenoges, fillmap(id2, phenodat, rmmiss=TRUE))
with(phenoges, fillmap(id2, phenodat, rmdup=TRUE, rmmiss=TRUE))
phenosupp <- with(phenoges, fillmap(id1, id2))
names(phenosupp) <- c("id1","id2")
phenosupp$phenodat <- fillmap(phenosupp$id1, phenoges$phenodat, what="y")
unique(phenosupp)
Noise removal in a data set by means of principal component analysis. Optionally calculates distances (reconstruction error and Mahalanobis distance).
filterpca(x,npc=NULL,pcs=NULL,scale.=F, method=c("k","t"),resulttype=c("p","d","b"),lambda=NULL)
x |
data set |
npc |
Number of leading principal components to be used for reconstruction of data set after filtering (positive integer) or number of last components to be skipped (negative integer). |
pcs |
Vector of integers providing column numbers of components to be included for reconstruction (positive numbers) or components to be skipped (negative numbers). In case of mixed signs, negative numbers are ignored. |
scale. |
should values be scaled to unit variance before PCA? |
method |
One of "k" or "t": "k" performs no further filtering beyond ignoring some components when projecting back into the original space; "t" additionally thresholds the data by setting all values with absolute value below lambda to 0 |
resulttype |
Type of resulting value, either matrix of projected values (p), distances (d) or a list containing both (b) |
lambda |
cutoff to be used for thresholding data. lambda = NULL instructs the function to use a predefined value of 5% of the mean deviation |
The function performs PCA on provided data set. Noise is removed by reconstructing original values either on only a subset of extracted PCs, thresholding PC-scores (setting all values with absolute value below provided cutoff to 0) or a combination of both.
Depending on requested resulttype:
p |
Matrix with original observations projected back onto original attribute space after filtering |
d |
Data frame with Mahalanobis distance of observations calculated only on subset of requested PCs and with reconstruction error |
b |
List containing both values mentioned above |
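The reconstruction step described above — projecting onto a subset of PCs and back — can be sketched with base R's prcomp (pca_filter_sketch is a hypothetical, simplified stand-in; the real filterpca additionally supports thresholding and distance outputs):

```r
# reconstruct the data from the first npc principal components only
pca_filter_sketch <- function(x, npc) {
  p <- prcomp(x, scale. = FALSE)
  recon <- p$x[, 1:npc, drop = FALSE] %*% t(p$rotation[, 1:npc, drop = FALSE])
  sweep(recon, 2, p$center, "+")  # undo the centering done by prcomp
}

r <- pca_filter_sketch(iris[-5], 2)
dim(r)  # 150 4: same shape as the input, with noise in the last PCs removed
```

Using all components reproduces the original data exactly, which is a handy sanity check.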
a = iris[-5]
b0 = filterpca(a, npc=4, res="b")
b1 = filterpca(a, npc=3, res="b")
b2 = filterpca(a, npc=2, res="b")
pairs(b0, pch=20, col=iris$Species)
pairs(b1, pch=20, col=iris$Species)
pairs(b2, pch=20, col=iris$Species)
Change extreme values in proportional data prior to logit transformation
fixlimits(x)
x |
name of vector to adjust |
The function assumes a data range of 0 <= x <= 1. Data outside this range are regarded as measurement errors and recoded to NA. In order to avoid generating missing values during the logit transformation, values >= 1 and <= 0, respectively, are shifted to lie strictly within (0,1), excluding the borders themselves, by recoding them to the mean of the respective border and the most extreme valid neighbour.
vector of same length as x with adjusted values
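A simplified sketch of this border handling in base R (fixlimits_sketch is hypothetical; the exact treatment of ties and NAs in the package function may differ):

```r
# assumption: out-of-range values become NA; exact 0/1 are moved halfway
# towards the nearest in-range neighbour
fixlimits_sketch <- function(x) {
  x[which(x < 0 | x > 1)] <- NA
  lo <- min(x[x > 0], na.rm = TRUE)  # smallest value above 0
  hi <- max(x[x < 1], na.rm = TRUE)  # largest value below 1
  x[which(x == 0)] <- (0 + lo) / 2
  x[which(x == 1)] <- (1 + hi) / 2
  x
}

fixlimits_sketch(0:5 / 5)  # 0 -> 0.1 and 1 -> 0.9; interior values unchanged
```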
fixlimits(0:5/5)
In properly normalized databases, no inconsistencies should be present. In real-world data, however, ID mappings or database key candidates are repeated across observations, especially in multi-centric studies with basic research data. incons tries to detect and flag these mapping discrepancies.
incons(x, y, printproblems=FALSE)
x , y
|
vectors of equal length, holding the mapping values, separated by "," |
printproblems |
Should a table of found problems be printed in addition to the returned flag? |
incons assumes a 1:1 mapping between provided variables, as is commonly the case for example in ID translation steps
A named vector indicating, for each observation, whether ambiguous mapping occurs (TRUE) or the mapping is clean (FALSE)
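The detection idea — flagging every observation whose x value maps to more than one y value, or vice versa — can be sketched as follows (incons_sketch is hypothetical; the package function additionally supports printing the problem table):

```r
# flag observations whose x maps to more than one y, or vice versa
incons_sketch <- function(x, y) {
  pair_new <- !duplicated(paste(x, y))           # first sighting of each pair
  amb_x <- x %in% x[pair_new & duplicated(x)]    # x seen with a second y
  amb_y <- y %in% y[pair_new & duplicated(y)]    # y seen with a second x
  amb_x | amb_y
}

incons_sketch(c(1, 2, 2, 3, 4), c("a", "b", "c", "d", "d"))
# FALSE TRUE TRUE TRUE TRUE
```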
id1 = c(1,2,2,3,4)
id2 = c("a","b","c","d","d")
ambiguous <- incons(id1, id2, print=TRUE)
data.frame(id1, id2, ambiguous)
Despite conducting analyses of proportional data in M space, for publication figures the estimated values are commonly shown in the original space (range between 0 and 1). This function backscales M values to the original space by inverting the logit transformation done by beta2m().
m2beta(M)
M |
vector or matrix holding the original data |
Data are transformed according to b = 2^M / (2^M + 1), the inverse of M = log2(b/(1-b)).
A named vector/matrix with same dimensions as M and transformed values
b = 1:99 / 100
M = beta2m(b)
plot(b, m2beta(M))
print(all.equal(b, m2beta(M)))
Changes range of numeric variables to have min=0 and max=1
normalize(x)
x |
name of object to normalize |
The function changes the range of the named numeric vector to finally have min(x)=0 and max(x)=1.
vector of same length as x with normalized values
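The rescaling is the standard min-max formula; a one-line base-R sketch (normalize_sketch is a hypothetical name) is:

```r
# min-max rescaling: (x - min) / (max - min) maps the range onto [0, 1]
normalize_sketch <- function(x) (x - min(x)) / (max(x) - min(x))

normalize_sketch(1:5)  # 0.00 0.25 0.50 0.75 1.00
```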
normalize(1:5)
Conduct PCA on variables with biggest variance in high dimensional data matrix
pcv(x, cols=5, sites=5000)
x |
name of data matrix |
cols |
number of principal components to extract |
sites |
number of attributes to consider |
pcv assumes data in a numeric matrix in variable-major format, i.e. every row corresponds to a variable, while the columns correspond to the individual observations. This is commonly the case for data from high-throughput experiments, where the number of data points per individual is high (> 10,000), while the size of batches is comparably small (dozens to hundreds). Variables with missing values are disregarded for the selection.
Use t() to transpose individual major data sets beforehand.
pcv selects the attributes with the highest variance, up to the number provided, but takes care to limit these to the actual size of the present data set.
This is often used as first step in high throughput measurements to detect global effects of known batch variables.
matrix with rows corresponding to observations and columns to extracted components. Values denote the scores on the extracted components for the respective observations.
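The selection-then-PCA procedure described above can be sketched in base R (pcv_sketch is a hypothetical, simplified stand-in for the package function):

```r
# keep the `sites` rows with highest variance, then PCA over observations
pcv_sketch <- function(x, cols = 5, sites = 5000) {
  v <- apply(x, 1, var)                       # per-variable (row) variance
  keep <- order(v, decreasing = TRUE)[seq_len(min(sites, sum(!is.na(v))))]
  cols <- min(cols, ncol(x))                  # cannot exceed data dimensions
  prcomp(t(x[keep, , drop = FALSE]))$x[, seq_len(cols), drop = FALSE]
}

pcs <- pcv_sketch(t(iris[1:4]), cols = 2, sites = 3)
dim(pcs)  # 150 2: one score row per observation, one column per component
```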
pcs <- pcv(t(iris[1:4]), cols=2)
cor(pcs, iris[-5])
Remove known categorical batch effects from high dimensional data sets
rmbat(x,batches)
x |
name of object to be processed. This is a matrix in attribute-major format (rows correspond to variables, columns to observations) |
batches |
Vector with batch identifiers for each of the columns in x |
For each variable, the mean values of all batches are shifted to the grand mean of the total sample. In case several independent batch effects are present in the data set, these can either be combined into one batch variable, or the batches can be removed one at a time by chaining the processing and calling the function with each of the batch variables in turn.
matrix with same dimensions as x and batch effects removed
This function is intended for use with methods that do not inherently allow inclusion of covariates in the analysis itself, e.g. PCA or heatmaps. If methods are available that allow inclusion of batches in the analysis, such as linear models, they are preferred, as the method above can otherwise greatly reduce power if batches are correlated with the effect variable.
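The per-variable recentering can be sketched with base R's ave() (rmbat_sketch is a hypothetical, simplified stand-in for the package function):

```r
# shift each variable's per-batch mean to that variable's grand mean
rmbat_sketch <- function(x, batches) {
  t(apply(x, 1, function(v) v - ave(v, batches) + mean(v)))
}

m <- rbind(c(1, 2, 5, 6), c(0, 0, 10, 10))
rmbat_sketch(m, c("a", "a", "b", "b"))
# within each row, batch "a" and batch "b" now share the same mean
```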
# create data set
n_obs = 8
n_var = 10
predictor <- rep(0:1, n_obs*0.5)
pure_effect <- outer(rnorm(n_var), predictor)
error <- matrix(rnorm(n_var*n_obs), n_var, n_obs)
batch1 <- rep(1:2, each=n_obs*0.5)
batch2 <- rep(c(1,2,1,2), each=n_obs*0.25)
batch_effect1 <- outer(rnorm(n_var)*2, scale(batch1))[,,1]
batch_effect2 <- outer(rnorm(n_var)*4, scale(batch2))[,,1]
batch_effect <- batch_effect1 + batch_effect2
data_measured <- pure_effect + batch_effect + error
zero = outer(rep(0,n_var), rep(0,n_obs))
b1 <- rmbat(batch_effect1, batch1)
b2 <- rmbat(batch_effect2, batch2)
# remove both effects at once via a combined batch variable...
b12a <- rmbat(batch_effect, paste(batch1, batch2))
# ...or one at a time by chaining the calls
b12b <- rmbat(rmbat(batch_effect, batch1), batch2)
all.equal(b1, zero)
all.equal(b2, zero)
all.equal(b12a, zero)
all.equal(b12b, zero)
Calculate variance inflation factors (VIF) for all numeric variables contained in data set
vifx(x)
x |
name of data frame for which the VIFs should be computed |
The function reads in the object named in x and builds a linear model for each numeric variable in the data set, regressing the selected variable on all other numeric variables contained in the data set.
The multiple R-squared is computed and transformed to the VIF using the formula VIF = 1 / (1 - R²).
A named vector with names given by the contained numeric variables and values by the computed respective VIFs
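The regress-each-on-the-rest loop can be sketched in base R (vif_sketch is a hypothetical, simplified stand-in for vifx):

```r
# regress each numeric column on all the others; VIF = 1 / (1 - R^2)
vif_sketch <- function(df) {
  num <- df[sapply(df, is.numeric)]
  sapply(names(num), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(num), v), v), data = num))$r.squared
    1 / (1 - r2)
  })
}

round(vif_sketch(iris), 2)  # one VIF per numeric column; values >= 1
```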
vifx(iris)
Several presets are provided for creating text data files. The functions are based on write.table, with some predefined extras to save time when writing data sets to plain text files.
write.tab(data, filename, sep="\t", quote=FALSE, row.names=FALSE, ...)
data |
name of object to be saved |
filename |
File name |
sep |
Column separator, see details |
quote |
Should text data be quoted? Default FALSE |
row.names |
Whether row names should be included in the output, default FALSE |
... |
Further arguments passed on to write.table() |
Both of the named functions just use the file name as second positional argument and call write.table(). The difference between the two is that write.tab predefines the column separator as "\t", while write.space uses the write.table default " ".
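Such a preset is a thin wrapper around write.table; a sketch (write.tab_sketch is a hypothetical name illustrating the idea, not the package function itself):

```r
# write.tab-style preset: tab-separated, unquoted, no row names
write.tab_sketch <- function(data, filename, sep = "\t", quote = FALSE,
                             row.names = FALSE, ...) {
  write.table(data, filename, sep = sep, quote = quote,
              row.names = row.names, ...)
}

f <- tempfile(fileext = ".txt")
write.tab_sketch(head(iris), f)
readLines(f, n = 2)  # header line followed by tab-separated data rows
```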
## Not run:
write.tab(iris, "~/iris_tab.txt")
write.space(iris, "~/iris_space.txt")
## End(Not run)