Skip to content
Snippets Groups Projects
Unverified Commit b5cd0aa6 authored by hansenju's avatar hansenju Committed by GitHub
Browse files

Update Intro_to_SOULS.Rmd

BOULS/SOULS
parent 3b142d5a
Branches main
No related tags found
No related merge requests found
---
title: "Introduction to SOULS"
title: "Introduction to BOULS"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to SOULS}
%\VignetteIndexEntry{Introduction to BOULS}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
......@@ -15,44 +15,44 @@ knitr::opts_chunk$set(
```
```{r setup}
library(SOULS)
library(BOULS)
```
This is the introduction to the LCMS data processing approach SOULS (Segmentation of untargeted LCMS spectra). It is based on the xcms package [1-3] and particularly suitable for the development of large machine learning models over time. In contrast to the xcms workflow, it allows separate processing, as a unique peak list independent of the processing batch is achieved by summing up the signal intensities within defined segments in the LCMS spectra.
This is the introduction to the LCMS data processing approach BOULS (Bucketing of untargeted LCMS spectra). It is based on the xcms package [1-3] and particularly suitable for the development of large machine learning models over time. In contrast to the xcms workflow, it allows separate processing, as a unique peak list independent of the processing batch is achieved by summing up the signal intensities within defined segments in the LCMS spectra.
## Data
To demonstrate the functionality of this approach, example data is provided at this link: https://www.fdr.uni-hamburg.de/record/13535
Please download the and unzip the file. The phenodata folder contains a csv file is provided with information about the samples. It can be customized (e.g. name, variety, geogr. origin, instrument, harvesting year etc). The other folders provide mzML files. Vendor specific files have been converted to mzML format using MSConvert [4]. The example files have been shortened to one minute to enable processing on a normal laptop. In practice, the files and data sets are much larger, so we recommend using a workstation for real data sets when using the SOULS approach. A sample is defined as a reference sample for the retention time alignment for the development of large machine learning models over time. This could be e.g. the sample with the most detected peaks. This reference sample is added to each new data set to be processed, making the retention time alignment independent of the processing batch. The "Samples" folder contains the samples to be processed in the respective batch.
Please download the and unzip the file. The phenodata folder contains a csv file with information about the samples. It can be customized (e.g. name, variety, geogr. origin, instrument, harvesting year etc). The other folders provide mzML files. Vendor specific files have been converted to mzML format using MSConvert [4]. The example files have been shortened to one minute to enable processing on a normal laptop. In practice, the files and data sets are much larger, so we recommend using a workstation for real data sets when using the BOULS approach. A sample is defined as a reference sample for the retention time alignment for the development of large machine learning models over time. This could be e.g. the sample with the most detected peaks. This reference sample is added to each new data set to be processed, making the retention time alignment independent of the processing batch. The "Samples" folder contains the samples to be processed in the respective batch.
```{r}
# Please insert your paths to the 'Sample', the 'Reference' and the 'phenodata' folders.
path_mzMLs <- '/home/hansen/example_data/Samples'
path_Ref <- '/home/hansen/example_data/Reference'
path_csv <- '/home/hansen/example_data/phenodata'
path_mzMLs <- '/example_data/Samples'
path_Ref <- '/example_data/Reference'
path_csv <- '/example_data/phenodata'
```
## Data processing using SOULS
This package provides two functions for the processing of LCMS data. To facilitate a first try-out, the first function `process_souls()` includes all steps from data import to R to the segmentation of the spectra. The settings for the xcms functions are predefined. To adjust the xcms settings, the second function `souls()` can be included in the xcms workflow, replacing the correspondence step (https://bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html).
## Data processing using BOULS
This package provides two functions for the processing of LCMS data. To facilitate a first try-out, the first function `process_bouls()` includes all steps from data import to R to the bucketing of the spectra. The settings for the xcms functions are predefined. To adjust the xcms settings, the second function `bouls()` can be included in the xcms workflow, replacing the correspondence step (https://bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html).
### General processing using the SOULS approach
Here, the example data is processed using the predefined xcms settings. Two CPUs are used (num_workers parameter) for the processing. A retention time range of 300 s to 360 s and a mass range of 250 Da to 750 Da is processed. The size of the segments is 10 s in retention time dimension and 5 Da in mass dimension. The result is a matrix with summed intensities for the respective segments. The segments are named (rownames). The first value corresponds to the beginning of the segment in retention time dimension and the second value to the beginning of the segment in mass dimension. For example, the segment 300-310 s and 250-255 Da would have the name "300.250".
Here, the example data is processed using the predefined xcms settings. Two CPUs are used (num_workers parameter) for the processing. A retention time range of 300 s to 360 s and a mass range of 250 Da to 750 Da is processed. The size of the buckets is 10 s in retention time dimension and 5 Da in mass dimension. The result is a matrix with summed intensities for the respective buckets. The buckets are named (rownames). The first value corresponds to the beginning of the bucket in retention time dimension and the second value to the beginning of the bucket in mass dimension. For example, the bucket 300-310 s and 250-255 Da would have the name "300.250".
```{r , warning=FALSE}
seg.result <- process_souls(path_mzMLs = path_mzMLs,
bin.result <- process_bouls(path_mzMLs = path_mzMLs,
path_Ref = path_Ref,
path_csv = path_csv,
num_workers = 2,
RT_range = c(300, 360),
size_RT_segs = 10,
size_RT_bins = 10,
mz_range = c(250, 750),
size_mz_segs = 5)
head(seg.result)
size_mz_bins = 5)
head(bin.result)
```
### Individual processing using xcms and SOULS approach
For individual settings in peak picking and retention time alignment, the `souls()` function can be used in the xcms workflow instead of the [correspondence step](https://bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html#25_Correspondence). The XCMSnExp object resulting from the retention time alignment should be used as object. This is an example, how you can process your LCMS data using xcms and SOULS.
### Individual processing using xcms and BOULS approach
For individual settings in peak picking and retention time alignment, the `bouls()` function can be used in the xcms workflow instead of the [correspondence step](https://bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html#25_Correspondence). The XCMSnExp object resulting from the retention time alignment should be used as object. This is an example, how you can process your LCMS data using xcms and BOULS.
```{r , warning=FALSE}
## load mzMLs
......@@ -99,13 +99,13 @@ For individual settings in peak picking and retention time alignment, the `souls
)
)
seg.result <- souls(object = xdata_adj,
bin.result <- souls(object = xdata_adj,
num_workers = 2,
RT_range = c(300, 360),
size_RT_segs = 10,
size_RT_bins = 10,
mz_range = c(250, 750),
size_mz_segs = 5)
head(seg.result)
size_mz_bins = 5)
head(bin.result)
```
[1] Smith, C.A., Want, E.J., O'Maille, G., Abagyan,R., Siuzdak, G. (2006). “XCMS: Processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching and identification.” Analytical Chemistry, 78, 779–787.
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment