Multi-sample analysis with ASCAT

Edith Ross, Kerstin Haase

2024-05-22

Vignette Info

This vignette demonstrates how to use ASCAT to analyse multiple phylogenetically related samples. For the general usage of ASCAT including parameters that are not specific to multi-sample analysis please refer to the ASCAT webpage and the example pipeline.

Using ASCAT for multi-sample analyses

We start by loading the ASCAT package.

library(ASCAT)

Next we load the data.

ascat.bcMulti <- ascat.loadData(
  Tumor_LogR_file = system.file("extdata", "tumour.logR.txt", package="ASCAT"),
  Tumor_BAF_file = system.file("extdata", "tumour.BAF.txt", package="ASCAT"),
  Germline_LogR_file = system.file("extdata", "singlenormal.logR.txt", package="ASCAT"),
  Germline_BAF_file = system.file("extdata", "singlenormal.BAF.txt", package="ASCAT"))
## [1] Reading Tumor LogR data...
## [1] Reading Tumor BAF data...
## [1] Reading Germline LogR data...
## [1] Reading Germline BAF data...
## [1] Registering SNP locations...
## [1] Splitting genome in distinct chunks...

Both Tumor_LogR_file and Tumor_BAF_file are expected to contain a column for each of the samples to analyse.

head(ascat.bcMulti$Tumor_LogR)                          
##            S1       S2
## SNP1  0.03615 -1.03950
## SNP2  0.14998 -0.79433
## SNP3 -0.00891 -0.76137
## SNP4  0.40188 -0.67521
## SNP5  0.14902 -0.72980
## SNP6  0.24118 -1.11302
head(ascat.bcMulti$Tumor_BAF)       
##           S1      S2
## SNP1 0.51596 0.99262
## SNP2 0.67903 0.00255
## SNP3 1.00000 1.00000
## SNP4 0.00000 0.00000
## SNP5 1.00000 1.00000
## SNP6 0.45572 0.04925

The next step is to run the segmentation. When analysing phylogenetically related samples, it is expected that some of the copy number segment boundaries are shared between samples. In this case a joint segmentation of all samples is recommended. The synthetic data set used in this example was also simulated with partly shared segment boundaries. The ground truth copy number plots of the two samples we are going to analyse are shown in the following plots.

The multi-sample segmentation algorithm can be run using the function ascat.asmultipcf.

ascat.bcMulti <- ascat.asmultipcf(ascat.bcMulti,penalty = 5,out.dir=NA)
## [1] "Segmentlength 5"

Finally ASCAT can be run on the segmented data set.

ascat.outputMulti = ascat.runAscat(ascat.bcMulti)

Comparison with single sample segmentation

Finally, we compare our result to that of standard single sample segmentation using ascat.aspcf.

ascat.bc = ascat.loadData(system.file("extdata", "tumour.logR.txt", package="ASCAT"),
                          system.file("extdata", "tumour.BAF.txt", package="ASCAT"),
                          system.file("extdata", "normal.logR.txt", package="ASCAT"),
                          system.file("extdata", "normal.BAF.txt", package="ASCAT"))
## [1] Reading Tumor LogR data...
## [1] Reading Tumor BAF data...
## [1] Reading Germline LogR data...
## [1] Reading Germline BAF data...
## [1] Registering SNP locations...
## [1] Splitting genome in distinct chunks...
ascat.bc = ascat.aspcf(ascat.bc,penalty = 25,out.dir=NA)
## [1] Sample S1 (1/2)
## [1] Sample S2 (2/2)

Note that in the single-sample case the same segmentation sensitivity is achieved with a higher penalty parameter compared to the multi-sample case. This means, when switching from single- to multi-sample segmentation, the penalty parameter needs to be lowered to maintain a similar sensitivity.

We plot the segment boundaries inferred for each of the two samples by multi- and single-sample segmentation.

plot.segments(v1=cumsum(rle(ascat.bc$Tumor_LogR_segmented[,1])$lengths),
              v2=cumsum(rle(ascat.bc$Tumor_LogR_segmented[,2])$lengths),
              main="Single-sample segmentation")

plot.segments(v1=cumsum(rle(ascat.bcMulti$Tumor_LogR_segmented[,1])$lengths),
              v2=cumsum(rle(ascat.bcMulti$Tumor_LogR_segmented[,2])$lengths),
              main="Multi-sample segmentation")

In case of single-sample segmentation the inferred positions of most of the shared segment boundaries vary slightly between the two samples, whereas the multi-sample segmentation infers a common breakpoint when there is no significant evidence that the boundaries differ between samples.

##Comparison with another multi-sample copy number method

In order to compare asmultipcf segmentation to other copy number inference methods, we ran ASCAT on two samples from two patients from a metastatic prostate cancer study (Gundem et al). For the same study, copy number profiles are available from HATCHet (github repository; publication).

Loading all neccessary packages.

library(ggplot2)
library(plyr)

Because HATCHet can model a mixture of copy number states but ASCAT only detects the major clone, we need to define the major clone in the HATCHet data.

getMajorClone<-function(row){
  maj<-names(which.max(row[c(8,10,12)]))
  maj.cn<-row[(match(maj,names(row))-1)]
  return(maj.cn)
}

Next we define a function to load HATCHet data, define the major clone for each segment, i.e. row, and split copy number annotation from “a|b” format into two columns.

readHatchetFile<-function(path){
  hatchet.raw<-read.table(path, header=F, sep="\t", stringsAsFactors = F)
  
  ##find major clone
  major.cn<-apply(hatchet.raw, 1, function(z) getMajorClone(z))
  hatchet.raw$major.cn<-major.cn
  
  hatchet.short<-hatchet.raw[c(1:4,13)]
  names(hatchet.short)<-c("chr","start","end","sample","hatchet.cn")
  
  major.cn<-unlist(lapply(strsplit(hatchet.short$hatchet.cn, "\\|"), function(z) z[1]))
  minor.cn<-unlist(lapply(strsplit(hatchet.short$hatchet.cn, "\\|"), function(z) z[2]))
  
  hatchet.short$major.cn<-major.cn
  hatchet.short$minor.cn<-minor.cn
  
  return(hatchet.short[-5])
}

In order to provide a quantitative measure of how much ASCAT using asmultipcf and HATCHet agree with regard to the sample segmentation, we analysed what fraction of asmultipcf breakpoints has a corresponding HATCHet breakpoint. First, we centred the ASCAT breakpoints in the middle of two segments.

returnBreakpoints<-function(df.chr){
  bp.df<-NULL
  df.sorted<-df.chr[order(df.chr$start),]
  for(i in 1:(nrow(df.sorted)-1)){
    first<-df.sorted[i,]
    second<-df.sorted[(i+1),]
    ##verify real cn bp
    if(first$major.cn==second$major.cn & first$minor.cn==second$minor.cn){
      print("Consecutive rows have identical CN!")
    }
    else{
      bp.df<-rbind(bp.df, data.frame(bp.location=(first$end+((second$start-first$end)/2))))
    }
  }
  return(bp.df)
}

HATCHet output is provided for 50kb bins and not specifically segmented. Hence, in order to compare consecutive segments with the same major clonal copy number, we are merging neighbouring segments with matching allele specific copy number values. Then we calculate the distance of the closest HATCHet breakpoint for every asmultipcf breakpoint.

findMatchingBPs<-function(ascat.bps, hatchet.segs){
  ##limit hatchet df to true CN segs
  hatchet.true<-NULL
  hatchet.sorted<-hatchet.segs[order(hatchet.segs$start),]
  
  rles<-rle(paste0(hatchet.sorted$major.cn,"_",hatchet.sorted$minor.cn))$lengths
  idx<-1
  for(k in rles){
    hatchet.true<-rbind(hatchet.true, data.frame(start=hatchet.sorted[idx, "start"], end=hatchet.sorted[(idx+k-1), "end"]))
    idx<-(idx+k)
  }
  
  ascat.flag<-NULL
  for(i in 1:nrow(ascat.bps)){
    ascat.bp<-ascat.bps[i,"bp.location"]
    min.dist<-min(c(abs(ascat.bp-hatchet.true$end),abs(ascat.bp-hatchet.true$start)))
    
    ascat.flag<-rbind(ascat.flag, data.frame(bkp=ascat.bp, min.dist=min.dist))
  }
  return(ascat.flag)
}

This is the main function visualising the copy number of asmultipcf and HATCHet for all four analysed samples and calculates the fraction of ASCAT breakpoints with a HATCHet breakpoint fewer than 50kb bases away (size of HATCHet bins).

compareCN<-function(cn.ascat, cn.hatchet){
  ##split into samples
  shared.samples<-intersect(cn.ascat$sample, cn.hatchet$sample)
  
  for(s in shared.samples){
    ascat.sub<-cn.ascat[cn.ascat$sample==s,]
    hatchet.sub<-cn.hatchet[cn.hatchet$sample==s,]
    ##move values slightly for plotting
    ascat.plot<- ascat.sub
    hatchet.plot<- hatchet.sub
    ascat.plot$major.cn<-(as.numeric(ascat.plot$major.cn)+0.1)
    hatchet.plot$major.cn<-(as.numeric(hatchet.plot$major.cn)-0.1)
    
    ##visualise
    joint.df<-rbind(ascat.plot,hatchet.plot)
    joint.df$chr<-factor(joint.df$chr, levels=c(1:22,"X","Y"))
    p1<-ggplot(joint.df, aes(x=start, y=(as.numeric(major.cn)+as.numeric(minor.cn)), xend=end, yend=(as.numeric(major.cn)+as.numeric(minor.cn)), col=as.factor(method)))+
      geom_segment()+
      theme(axis.text.x=element_text(angle=45,hjust=1), panel.background = element_blank(), panel.grid.major = element_line(color="grey80"))+
      ggtitle(paste0("Sample ",s))+
      scale_color_manual(values=c("cyan4","coral3"), name="")+
      scale_x_continuous(name="")+
      scale_y_continuous(name="Copy number", limits=(c(0,15)))+
      facet_wrap(.~chr, scales = "free_x", nrow = 3)
    print(p1)
    
    
    ##give quality in measure: what fraction of ASCAT bp has hatchet bp in < threshold vincinity?
    bps<-ddply(ascat.sub, .(chr), function(z) returnBreakpoints(z))
    bp.list<-split(bps, bps$chr)
    
    bkp.list<-NULL
    for(c in names(bp.list)){
      if(c!="X"){
        bkp.list<-rbind(bkp.list,findMatchingBPs(bp.list[[c]], hatchet.sub[hatchet.sub$chr==c,]))
      }
    }
    ##use bin size of HATCHet as threshold
    threshold=50000
    bkp.list$match<-ifelse(bkp.list$min.dist<threshold,T,F)
    print(s)
    print(prop.table(table(bkp.list$match)))
  }
}

Now we can carry out the comparison. Load the HATCHet segmentation for patient A32.

path<-"https://raw.githubusercontent.com/raphael-group/hatchet-paper/master/cancer/prostate/A32/best.seg.ucn"
hatchet.seg<-readHatchetFile(path)
hatchet.seg$method<-"HATCHet"
hatchet.seg$sample<-gsub("-","",hatchet.seg$sample)

Now we can load the asmultipcf results for patient A32.

ascat.seg<-read.table(system.file("extdata", "A32.fast.segments.txt", package="ASCAT"), header=T, sep="\t", stringsAsFactors = F)
ascat.seg$tmp<-unlist(lapply(strsplit(ascat.seg$sample, "-"), function(z) z[1]))
ascat.seg<-ascat.seg[-1]
names(ascat.seg)<-c("chr","start","end","major.cn","minor.cn","sample")
ascat.seg$method<-"ASCAT"

Finally, we can create the visual and segmentation comparison between the copy number of asmultipcf and HATCHet.

compareCN(ascat.seg, hatchet.seg)

## [1] "A32C"
## 
##     FALSE      TRUE 
## 0.3591331 0.6408669

## [1] "A32A"
## 
##     FALSE      TRUE 
## 0.4606061 0.5393939

So for both samples of patient A32, between 54% and 64% of asmultipcf breakpoints have a corresponding HATCHet breakpoint.

We can now repeat the same analysis for patient A17.

path<-"https://raw.githubusercontent.com/raphael-group/hatchet-paper/master/cancer/prostate/A17/best.seg.ucn"
hatchet.seg<-readHatchetFile(path)
hatchet.seg$method<-"HATCHet"

ascat.seg<-read.table(system.file("extdata", "A17.fast.segments.txt", package="ASCAT"), header=T, sep="\t", stringsAsFactors = F)
ascat.seg$tmp<-unlist(lapply(strsplit(ascat.seg$sample, "-"), function(z) z[1]))
ascat.seg<-ascat.seg[-1]
names(ascat.seg)<-c("chr","start","end","major.cn","minor.cn","sample")
ascat.seg$method<-"ASCAT"
compareCN(ascat.seg, hatchet.seg)

## [1] "A17F"
## 
##     FALSE      TRUE 
## 0.2039801 0.7960199

## [1] "A17A"
## 
##     FALSE      TRUE 
## 0.1955017 0.8044983

For both samples of patient A17, around 80% of asmultipcf breakpoints have a corresponding HATCHet breakpoint.