bioCancer Package
bioCancer is a platform-independent interface for dynamic interaction with cancer genomics data. The web application is implemented in R and based on the Shiny package. It runs in any modern web browser and requires no programming skills, which makes large, complex, and heterogeneous cancer genomics data more accessible. The data are provided by cBioPortal, which hosts data from 105 cancer genomics studies. The studies are updated monthly, based on the latest TCGA production runs. Users can easily access studies and search clinical data or genetic profiles. All data are displayed in tables that users can filter, combine, download, visualize, and summarize with statistics. For broader exploration, a zoomable circular layout can merge and display around twenty matrices in the same plot. The circular layout makes it easy to quickly identify relevant multi-assay changes in genes across multiple cancers or studies. The application implements multiple methods to classify genes by study or by disease and to cluster studies by biological process or other ontology annotation. From a gene list, users can predict a functional interaction network. Nodes and edges can be colored and formatted using omics cancer data. Users are free to choose which dimensions are included in the network and can set thresholds to view only significant biological scenarios. The application accepts multiple input data formats, so users can include their own data and analyze it with or without the cancer studies. All work done by a user can be saved in a session and reloaded later or shared with colleagues. The main R plotting features are available and easy to use: users only need to choose the type of plot and select the variables to view. All generated plots can be downloaded at high resolution. bioCancer has a dynamic sidebar dashboard that changes and displays functionality depending on the user's request, which reduces excessive clicking and erroneous queries.
It can be launched on a local machine on any system with R installed. All navigation panels are documented with examples. bioCancer is free and open to all users, and there is no login requirement.
Pipeline Overview
How to run bioCancer
library(bioCancer)
bioCancer()
Portal Panel
Display available Cancer Studies in Table
Studies Panel
This panel displays in table all available cancer studies hosted and maintained by Memorial Sloan Kettering Cancer Center (MSKCC). It provides access to data by The Cancer Genome Atlas as well as many carefully curated published data sets.
Every row lists one study by Identity, name, and description.
Browse the data
By default only 10 rows are shown at one time. You can change this setting through the Show ... entries dropdown. Press the Next and Previous buttons at the bottom-right of the screen to navigate through the data.
Sort
Click on a column header in the table to sort the data by the values of that variable. Clicking again will toggle between sorting in ascending and descending order. To sort on multiple columns at once press shift and then click on the 2nd, 3rd, etc. column to sort by.
Filters in Table
The search is possible for numerical or categorical variables. It is possible to match a string or to use a mathematical operator to filter data. For more detail see the help page in the Processing > View panel.
Global Search
Use the Filter box on the left (click the check-box first).
Column filter
Every column has its filter at the column header.
Download table as csv file
User can download table as csv file. Use the download icon in the top-right of the page.
Show Clinical Data in Table
The Clinical panel displays information related to patients, such as AGE, GENDER, and other variables depending on the study and type of cancer. Some variables are shared between studies and others are specific. Each row corresponds to one patient.
Show Profiles Data in Table
The Profiles panel displays information related to a gene list. The user needs to specify a Study, a Case, and a Genetic Profile to get the right profile.
It is more practical to select the case that has all data (case_all) and change only the profile.
There are in general, but not always, 6 types of genetic profiles:
- Copy Number Alteration (CNA)
- mRNA expression (mRNA)
- Mutations (Mut)
- Methylation (Met): there are two probes, HM_27 and HM_450
- microRNA expression (miRNA)
- Reverse Phase Protein Array (RPPA)
It is possible to find other kinds of data related to one of the listed types, for example the log or z_score of mRNA expression.
Load Gene List
The user can load the example gene lists or upload their own gene list.
When the user selects examples and clicks the Load examples button, the example gene lists are loaded in the Gene List dropdown.
When the user selects clipboard, it is possible to copy a gene list from a text file (one gene symbol per line) and click the Paste Gene List button. The gene list will be named Genes in the Gene List dropdown.
Load Profiles to Datasets
It is often useful to run statistical analyses or transformations on genetic profiles. Any table from the Profiles panel can be loaded into the Processing panel by checking Load Profiles to Datasets and pressing the button. The data frame will be named ProfData.
Processing Panel
Manage data and state: Load data into bioCancer, Save data to disk, Remove a dataset from memory, or Save/Load the full state of the app
Datasets
When you start bioCancer a dataset (epiGenomics) with information on how it was formatted is shown in the Processing panel.
It is good practice to add a description of the data and variables to each file you use. For the files that are part of bioCancer you will see a brief overview of the variables etc. below the table of the first 10 rows of the data. If you would like to add a description for your own data check the ‘Add/edit data description’ check-box. A window will open below the data table where you can add text in markdown format. The descriptions of the data included with bioCancer should serve as a good starting point.
If you would like to rename a dataset loaded in bioCancer check the Rename data box, enter a new name for the data, and click the Rename button.
Load data
The best way to load and save data for use in bioCancer (and R) is to
use the R-data format (rda). These are binary files that can be stored
compactly and read into R quickly. Choose rda
from the
Load data of type
dropdown and click
Choose Files
to locate the file(s) you want to load. If the
rda
file is available online choose rda (url)
from the dropdown, paste the url into the text input, and press
Load
.
You can get data from a spreadsheet (e.g., Excel or Google sheets)
into bioCancer in two ways. First, you can save data from the
spreadsheet in csv format and then, in bioCancer, choose
csv
from the Load data of type
dropdown. Most
likely you will have a header row in the csv file with variable names.
If the data are not comma separated you can choose semicolon or tab
separated. To load a csv file click ‘Choose files’ and locate the file
on your computer. If the csv
data is available online
choose csv (url)
from the dropdown, paste the url into the
text input shown, and press Load
.
Note: For Windows users with data that contain multibyte characters please make sure your data are in ANSI format so bioCancer can load the characters correctly.
Alternatively, you can select and copy the data in the spreadsheet
using CTRL-C (or CMD-C on mac), go to bioCancer, choose
clipboard
from the dropdown, and click the
Paste data
button. This is a short-cut that can be
convenient for smaller datasets that are cleanly formatted. If you see a
message in bioCancer that the data were not transferred cleanly try
saving the data in csv format and loading it into bioCancer as described
above.
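For readers who prefer to check a csv file before loading it, the same variants (comma or semicolon separated, with a header row) can be read with base R. This is a minimal sketch and not bioCancer's own loading code; the file contents and the column names Genes and FreqMut are made up for the illustration.

```r
# Write two small example files: comma separated and semicolon separated.
# Genes and FreqMut are illustrative column names, not required by bioCancer.
genes <- data.frame(Genes = c("TP53", "BRCA1", "EGFR"),
                    FreqMut = c(42.1, 3.5, 12.0))
comma_file <- tempfile(fileext = ".csv")
semi_file  <- tempfile(fileext = ".csv")
write.csv(genes, comma_file, row.names = FALSE)
write.table(genes, semi_file, sep = ";", row.names = FALSE)

# header = TRUE tells R that the first row holds the variable names;
# sep selects the separator used in the file.
from_comma <- read.csv(comma_file, header = TRUE)
from_semi  <- read.csv(semi_file, header = TRUE, sep = ";")
```

Both calls should return the same three rows, which is a quick way to confirm that the separator was chosen correctly.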
To access all data files bundled with bioCancer choose
examples
from the Load data of type
dropdown
and click Load examples
. These files are used to illustrate
the various analysis tools accessible in bioCancer. For example, the
catalog sales data is used as an example in the help file for regression
(i.e., Regression > Linear (OLS)).
Save data
As mentioned above, the most convenient way to get data in and out of
bioCancer is to use the R-data format (rda). Choose rda
from the Save data
dropdown and click the
Save data
button to save selected dataset to file.
When you save the data as an rda file, any description you added or edited (see Datasets above) is automatically saved with the file.
Getting data from bioCancer into a spreadsheet can be achieved in two
ways. First, you can save data in csv format and load the file into the
spreadsheet (i.e., choose csv
from the
Save data
dropdown and click the Save data
button). Alternatively, you can copy the data from bioCancer into the
clipboard by choosing clipboard
from the dropdown and
clicking the Copy data
button, open the spreadsheet, and
paste the data from bioCancer using CTRL-V (or CMD-V on mac).
Save and load state
You can save and load the state of the bioCancer app just as you would a data file. The state file (extension rda) will contain (1) the data loaded in bioCancer, (2) settings for the analyses you were working on, and (3) any reports or code from the R-menu. Save the state-file to your hard-disk and when you are ready to continue simply load it by selecting the state radio button and clicking the Choose file button.
The best way to save your analyses is to save the state of the app to
a file by clicking on the icon
in the navbar and then on Save state
. Similar functionality
is available in Data > Manage
tab.
This is convenient if you want to save your work to be completed at
another time, perhaps on another computer, or to review any assignments
you completed using bioCancer. You can also share the file with others
that would like to replicate your analyses. As an example, download and
then load the state_file RadiantState.rda
.
Go to Data > View
, Data > Visualize
to
see some of the settings loaded from the statefile. There is also a
report in R > Report
created using the Radiant
interface. The html file
RadiantState.html
contains the output.
A related feature in bioCancer is that state is maintained if you
accidentally navigate to another page, close (and reopen) the browser,
and/or hit refresh. Use Reset
in the
menu in the navigation
bar to return to a clean/new state.
Loading and saving state also works with RStudio. If you start bioCancer from RStudio and use > Stop to stop the app, lists called r_data and r_state will be put into RStudio's global workspace. If you start bioCancer again using bioCancer() it will use these lists to restore state. This can be convenient if you want to make changes to a data file in RStudio and load it back into bioCancer. Also, if you load a state file directly into RStudio it will be used when you start bioCancer to recreate a previous state.
Remove data from memory
If data are loaded that you no longer need access to in the current
session check the Remove data from memory
box. Then select
the data to remove and click the Remove data
button. One
datafile will always remain open.
Using commands to load and save data
The loadr command can be used to load data from a file directly into a bioCancer session and add it to the Datasets dropdown. The saver command can be used to extract data from bioCancer and save it to disk. Data can be loaded or saved in rda or rds format depending on the file extension chosen. These commands can be used both inside and outside the bioCancer browser interface. See ?loadr and ?saver for details.
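The difference between the two formats can be illustrated with base R alone. This sketch deliberately does not call loadr or saver (their exact arguments are documented in the help pages); it only shows what an rda file and an rds file store. The data frame and its columns are invented for the example.

```r
expr <- data.frame(Genes = c("TP53", "EGFR"), mRNA = c(250, 480))

rda_file <- tempfile(fileext = ".rda")
rds_file <- tempfile(fileext = ".rds")

# rda: stores one or more named objects; load() restores them under the same names
save(expr, file = rda_file)
# rds: stores a single object without its name; readRDS() returns the object
saveRDS(expr, file = rds_file)

restored <- new.env()
loaded_names <- load(rda_file, envir = restored)  # names of the restored objects
expr_from_rds <- readRDS(rds_file)
```

Because an rds file carries no object name, the caller decides what to call the result, which can be convenient when several versions of the same dataset are kept side by side.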
Show data in table form
Datasets
Choose one of the datasets from the Datasets
dropdown.
Files are loaded into bioCancer through the Manage tab.
Select columns
By default all columns in the data are shown. Click on any variable to focus on it alone. To select several variables use the SHIFT and ARROW keys on your keyboard. On a mac the CMD key can also be used to select multiple variables. The same effect is achieved on windows using the CTRL key. To select all variables use CTRL-A (or CMD-A on mac).
Browse the data
By default only 10 rows are shown at one time. You can change this setting through the Show ... entries dropdown. Press the Next and Previous buttons at the bottom-right of the screen to navigate through the data.
Sort
Click on a column header in the table to sort the data by the values of that variable. Clicking again will toggle between sorting in ascending and descending order. To sort on multiple columns at once press shift and then click on the 2nd, 3rd, etc. column to sort by.
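Sorting on several columns in the table corresponds to a multi-key sort in R. The sketch below uses a made-up data frame (the columns Diseases and FreqMut are only illustrative) and sorts by the first column in ascending order and by the second in descending order.

```r
studies <- data.frame(
  Diseases = c("Lung", "Breast", "Lung", "Breast"),
  FreqMut  = c(12, 30, 45, 8)
)

# Sort by Diseases (ascending), then by FreqMut (descending) within each disease.
# Negating a numeric column reverses its sort order.
sorted <- studies[order(studies$Diseases, -studies$FreqMut), ]
```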
Filter
There are several ways to select a subset of the data to view. The Filter box on the left (click the check-box first) can be used with > and < signs and you can also combine subset commands. For example, x > 3 & y == 2 would show only those rows for which the variable x has values larger than 3 and for which y has values equal to 2. Note that in R, and most other programming languages, = is used to assign a value and == to evaluate if the value of a variable is equal to some other value. In contrast != is used to determine if a variable is unequal to some value. You can also combine conditions on the same variable. For example, to select rows where mutation frequency is larger than 10 and smaller than 20 use FreqMut > 10 & FreqMut < 20. & is the symbol for and, and | is the symbol for or. The table below gives an overview of common operators.
You can also use string matching to select rows. For example, type grepl("lu", Diseases) to select rows with lung cancers. This search is case sensitive by default. For case insensitive search you would use grepl("TCGA", name, ignore.case = TRUE). Type your statement in the Filter box and press return to see the result on screen or an error below the box if the expression is invalid.
It is important to note that these filters are persistent. A
filter entered in one of the Data-tabs will also be applied to other
tabs and to any analysis conducted through the bioCancer menus. To
deactivate a filter uncheck the Filter
check-box. To remove
a filter simply erase it.
Operator | Description | Example |
---|---|---|
< | less than | price < 5000 |
<= | less than or equal to | carat <= 2 |
> | greater than | price > 1000 |
>= | greater than or equal to | carat >= 2 |
== | exactly equal to | cut == 'Fair' |
!= | not equal to | cut != 'Fair' |
\| | x OR y | price > 10000 \| cut == 'Premium' |
& | x AND y | carat < 2 & cut == 'Fair' |
%in% | x is one of y | cut %in% c('Fair', 'Good') |
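The filter expressions described in this section are ordinary R logical expressions, so they can be tried on a small data frame with subset(). The data below are invented for the illustration; only the column names name, Diseases, and FreqMut are borrowed from the examples in the text.

```r
studies <- data.frame(
  name     = c("TCGA lung", "tcga breast", "Other lung", "TCGA colon"),
  Diseases = c("lung", "breast", "lung", "colon"),
  FreqMut  = c(15, 25, 9, 18),
  stringsAsFactors = FALSE
)

# AND: mutation frequency strictly between 10 and 20
between_10_20 <- subset(studies, FreqMut > 10 & FreqMut < 20)
# OR: either condition is enough
low_or_high <- subset(studies, FreqMut < 10 | FreqMut > 20)
# %in%: the value is one of a set
lung_or_colon <- subset(studies, Diseases %in% c("lung", "colon"))
# String matching is case sensitive unless ignore.case = TRUE
tcga_exact    <- subset(studies, grepl("TCGA", name))
tcga_any_case <- subset(studies, grepl("TCGA", name, ignore.case = TRUE))
```

In the Filter box only the condition itself is typed (for example FreqMut > 10 & FreqMut < 20); the subset() wrapper is needed only when running the expression in plain R.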
Column filters and Search
For variables that have a limited number of different values (i.e., a factor) you can select the levels to keep from the column filter below the variable name. For example, to filter on rows with CNA = -1 click in the box below the CNA column header and select -1 from the dropdown menu shown. You can also type a string into these column filters followed by return. Note that matching is case-insensitive. In fact, typing 1 would produce the same result because the search will match any part of a string. Similarly, you can type a string to select observations for character variables (e.g., street names).
For numeric variables the column filter boxes have some special features that make them almost as powerful as the Filter box. For numerical and integer variables you can use ... to indicate a range. For example, to select mRNA values between 200 and 500 type 200 ... 500 and press return. The range is inclusive of the values typed. Furthermore, to filter on FreqMut, typing 20 ... will show only studies with a mutation frequency larger than or equal to 20. Numeric variables also have a slider that you can use to define the range of values to keep.
If you want to get really fancy you can use the search box
on the top right to search across all columns in the data using
regular expressions. For example, to find all rows that
have an entry in any column ending with the number 72 type
72$
(i.e., the $
sign is used to indicate the
end of an entry). For all rows with entries that start with 60 use
^60
(i.e., the ^
is used to indicate the first
character in an entry). Regular expressions are incredibly powerful for
search but this is a big topic area. To learn more about
regular expressions see this
tutorial.
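The two anchors mentioned above behave the same way in R's own regular expression functions, which makes it easy to test a pattern before typing it into the search box. The entries below are made up for the example, and the sketch only demonstrates the pattern semantics, not how the table search is implemented.

```r
entries <- c("172", "72a", "6072", "601", "160")

ends_with_72   <- grepl("72$", entries)  # TRUE only if the entry ends in 72
starts_with_60 <- grepl("^60", entries)  # TRUE only if the entry starts with 60

matches_end   <- entries[ends_with_72]
matches_start <- entries[starts_with_60]
```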
It is important to note that column sorting, column filters, and
search are not persistent. To store these settings for
use in other parts of bioCancer press the Store
button. You
can store the data and settings under a different dataset name by
changing the value in the text input to the left of the
Store
button. This feature can also be used to select a
subset of variables to keep. Just select the ones you want to keep and
press the Store
button. For more control over the variables
you want to keep or remove and to specify their order in the dataset use
the Data > Transform
tab.
Visualize data
Filter
Use the Filter
box to select (or omit) specific sets of
rows from the data. See the helpfile for Data > View for details.
Plot-type
Select the plot type you want. Choose histograms or density for one
or more single variable plots. For example, with the
epiGenomics
data loaded select Histogram
and
all (X) variables (use CTRL-a or CMD-a). This will create histograms for
all variables in your dataset. Scatter plots are used to visualize the
relationship between two variables. Select one or more variables to plot
on the Y-axis and one or more variables to plot on the X-axis. Line
plots are similar to scatter plots but they connect-the-dots and are
particularly useful for time-series data. Bar plots are used to show the
relationship between a categorical variable (X-axis) and the average
value of a numeric variable (Y-axis). Box-plots are also used when you
have a numeric Y-variable and a categorical X-variable. They are more
informative than bar charts but also require a bit more effort to
evaluate.
Box plots
The lower and upper “hinges” of the box correspond to the first and third quartiles (the 25th and 75th percentiles) in the data. The middle hinge is the median value of the data. The upper whisker extends from the upper hinge (i.e., the top of the box) to the highest value in the data that is within 1.5 x IQR of the upper hinge. IQR is the inter-quartile range, or distance, between the first and third quartiles. The lower whisker extends from the lower hinge to the lowest value in the data within 1.5 x IQR of the lower hinge. Data beyond the end of the whiskers could be outliers and are plotted as points (as suggested by Tukey).
In sum:
1. The upper whisker extends from Q3 to the largest data value that is at most Q3 + 1.5 x IQR
2. The lower whisker extends from Q1 to the smallest data value that is at least Q1 - 1.5 x IQR
You may have to read the two bullets above a few times before it sinks in. The plot below should help to explain the structure of the box plot.
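The whisker rules can also be checked numerically. The sketch below computes the hinges, whiskers, and outliers by hand for a small invented vector and compares them with base R's boxplot.stats(). It uses Tukey's hinges from fivenum(); plotting functions that compute quartiles with quantile() may give slightly different hinges on some data.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

hinges <- fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
q1  <- hinges[2]
q3  <- hinges[4]
iqr <- q3 - q1

# Whiskers end at the most extreme data values inside the 1.5 x IQR fences
upper_whisker <- max(x[x <= q3 + 1.5 * iqr])
lower_whisker <- min(x[x >= q1 - 1.5 * iqr])

# Anything beyond the whiskers is drawn as an individual point
outliers <- x[x < lower_whisker | x > upper_whisker]

reference <- boxplot.stats(x)  # $stats: lower whisker, hinges, median, upper whisker
```

Here the value 30 lies above the upper fence, so the upper whisker stops at 9 and 30 is reported as an outlier.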
Sub-plots and heat-maps
Facet row
and Facet column
can be used to
split the data into different groups and create separate plots for each
group.
If you select a scatter or line plot a Color
drop-down
will be shown. Selecting a Color
variable will create a
type of heat-map where the colors are linked to the values of the
Color
variable. Selecting a categorical variable from the
Color
dropdown for a line plot will split the data into
groups and will show a line of a different color for each group.
Line, loess, and jitter
To add a linear or non-linear regression line to a scatter plot check the Line and/or Loess boxes. If your data take on a limited number of values checking Jitter can be useful to get a better feel for where most of the data points are located. Jitter-ing simply adds a small random value to each data point so they do not overlap completely in the plot(s).
Axis scale
The relationship between variables depicted in a scatter plot may be
non-linear. There are numerous transformations we might apply to the
data so this relationship becomes (approximately) linear (see Data >
Transform) and easier to estimate. Perhaps the most common data
transformation applied to business data is the (natural) log. To see if
a log-linear or log-log transformation may be appropriate for your data
check the Log X
and/or Log Y
boxes.
By default the scale of the y-axis is the same across sub-plots when
using Facet row
. To allow the y-axis to be specific to each
sub-plot click the Scale-y
check-box.
Flip axes
To switch the variable on the X- and Y-axis check the
Flip
box.
Plot height and width
To make plots bigger or smaller adjust the values in the height and width boxes on the bottom left.
Customizing plots in R > Report
To customize a plot first generate the visualize command by clicking
the report (book) icon on the bottom left of your screen. The example
below illustrates how to customize a command in the
R > Report
tab. Notice that custom
is set
to TRUE
.
visualize(dataset = "diamonds", yvar = "price", xvar = "carat", type = "scatter", custom = TRUE) +
  ggtitle("A scatterplot") + ylab("price in $")
See the ggplot2 documentation page for available options https://docs.ggplot2.org.
Create pivot tables to explore your data
If you have used pivot-tables in Excel the functionality provided in the Pivot tab should be familiar to you. Similar to the Explore tab, you can generate summary statistics for variables in your data. You can also easily generate frequency tables. Perhaps the most powerful feature in Pivot is that you can describe the data by one or more other variables.
For example, with the epiGenomics data select Genes, Diseases and CNA from the Categorical variables drop-down. You can drag-and-drop the selected variables to change their order. The categories for the first variable will be the column headers. After selecting these three variables a frequency table of the data by Genes, Diseases, and CNA is shown. Choose Row, Column, or Total from the Normalize drop-down to normalize the frequencies by row, column, or overall total. If a normalize option is selected it can be convenient to check the Percentage box to express the numbers as percentages. Choose Color bar or Heat map from the Conditional formatting drop-down to emphasize the highest frequency counts.
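The three normalization options correspond to dividing each count by its row total, its column total, or the overall total. A minimal base R sketch of the idea, using an invented two-variable table rather than the actual epiGenomics data:

```r
epi <- data.frame(
  Diseases = c("Lung", "Lung", "Breast", "Breast", "Breast", "Lung"),
  CNA      = c("-1", "0", "0", "0", "-1", "0")
)

counts   <- table(epi$Diseases, epi$CNA)    # frequency table
by_row   <- prop.table(counts, margin = 1)  # each row sums to 1
by_col   <- prop.table(counts, margin = 2)  # each column sums to 1
by_total <- prop.table(counts)              # the whole table sums to 1

row_percent <- round(100 * by_row, 1)       # same numbers as percentages
```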
It is also possible to summarize numerical variables. Select
FreqMut
from the Numerical variables drop-down. This will
create the table shown below. Just as in the View tab you can sort the
table by clicking on the column headers. You can also use sliders (e.g.,
click in the input box below I1
) to limit the view to
values in a specified range. To view only information for
CNA
with 0
or -1
levels click in
the input box below the CNA
header.
You can also create a bar chart based on the generated table (see image above). To download the table to csv format or the plot to a png format click the download icon on the right.
Filter
Use the Filter
box to select (or omit) specific sets of
rows from the data. See the help file for Data > View for
details.
Summarize and explore your data
Generate summary statistics for one or more variables in your data. The most powerful feature in Explore is that you can easily describe the data by one or more other variables. Where the Pivot tab works best for frequency tables and to summarize a single numerical variable, the Explore tab allows you to summarize multiple variables at the same time using various statistics.
For example, if we select Genes from the xmRNA dataset we can see the number of observations (n), the mean, the median, etc.
The created summary table can be stored in bioCancer by clicking the
Store
button. This can be useful if you want to create
plots using the summarized data. To download the table to csv
format click the download icon on the top-right.
You can select options from Column variable
dropdown to
switch between different column headers. Select either the
functions
(e.g., mean, median, etc), the variables (e.g.,
Genes), or the levels of the (first) Group by
variable
(e.g., Studies).
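A grouped summary of this kind can be reproduced in base R with aggregate(). The sketch below is only an approximation of what the Explore tab shows; the data frame, the grouping column Studies, and the gene column TP53 are invented for the example.

```r
xmRNA <- data.frame(
  Studies = c("A", "A", "A", "B", "B"),
  TP53    = c(1, 2, 6, 10, 20)
)

# One row per study, with several statistics computed for the same variable
summary_tab <- aggregate(
  TP53 ~ Studies, data = xmRNA,
  FUN = function(v) c(n = length(v), mean = mean(v), median = median(v))
)

# aggregate() returns the statistics as a matrix column; flatten it into
# ordinary columns (TP53.n, TP53.mean, TP53.median)
summary_tab <- do.call(data.frame, summary_tab)
```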
Filter
Use the Filter
box to select (or omit) specific sets of
rows from the data. See the helpfile for Data > View for details.
Transform command log
All transformations applied in the Data > Transform tab
can be logged. If, for example, you apply a log
transformation to numeric variables the following code is generated and
put in the Transform command log window at the bottom of your
screen when you click the Store
button.
## transform variable
r_data[["epiGenomics"]] <- mutate_each(r_data[["epiGenomics"]], funs(log), ext = "_log", mRNA, Met450)
This is an important feature if you need to recreate your results at some point in the future or you want to re-run a report with new, but similar, data. Even more important is that there is a record of the steps taken to generate all results.
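The logged command relies on helper functions that bioCancer makes available inside the app, and mutate_each() has since been deprecated in dplyr. If the same step has to be reproduced in a plain R session, a base R sketch along the following lines should give an equivalent result; the small data frame stands in for the real epiGenomics data.

```r
epiGenomics <- data.frame(mRNA = c(1, 10, 100), Met450 = c(0.5, 0.2, 0.9))

vars <- c("mRNA", "Met450")
# Add one new column per selected variable, named with the "_log" extension
epiGenomics[paste0(vars, "_log")] <- lapply(epiGenomics[vars], log)
```

The original columns are kept, so the transformation can be checked against the untransformed values.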
To add commands contained in the command log window to a report in R > Report click the icon.
Filter
Filter functionality must be turned off when transforming variables. If a filter is active the transform functions will show a warning message. Either remove the filter statement or un-check the Filter check-box. Alternatively, navigate to the Data > View tab and click the Store button to store the filtered data and select the newly created dataset. Then return to the Transform tab to make the desired variable changes.
Type
When you select Type
from the
Transformation type
drop-down another drop-down menu is
shown that will allow you to change the type (or class) of one or more
variables. For example, you can change a variable of type integer to a
variable of type factor. Click the Store
button to change
variable(s) in the data set. A description of the transformations
included in bioCancer is provided below.
- As factor: convert a variable to type factor (i.e., a categorical variable)
- As number: convert a variable to type numeric
- As integer: convert a variable to type integer
- As character: convert a variable to type character (i.e., strings)
- As date (mdy): convert a variable to a date if the dates are ordered as month-day-year
- As date (dmy): convert a variable to a date if the dates are ordered as day-month-year
- As date (ymd): convert a variable to a date if the dates are ordered as year-month-day
- As date/time (mdy_hms): convert a variable to a date-time if the values are ordered as month-day-year-hour-minute-second
- As date/time (mdy_hm): convert a variable to a date-time if the values are ordered as month-day-year-hour-minute
- As date/time (dmy_hms): See mdy_hms
- As date/time (dmy_hm): See mdy_hm
- As date/time (ymd_hms): See mdy_hms
- As date/time (ymd_hm): See mdy_hm
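The effect of these type changes can be previewed in base R. This is a sketch under the assumption that the menu options behave like the usual R conversions; the column names and values are invented, and the date is parsed with as.Date() and an explicit format string rather than with the helper the app uses internally.

```r
raw <- data.frame(
  CNA   = c("-1", "0", "1", "0"),
  count = c("3", "7", "12", "5"),
  visit = c("02-01-2014", "11-23-2015", "07-04-2016", "12-31-2016"),
  stringsAsFactors = FALSE
)

raw$CNA   <- as.factor(raw$CNA)      # As factor: categorical variable
raw$count <- as.integer(raw$count)   # As integer
# As date (mdy): month-day-year order, stated explicitly in the format string
raw$visit <- as.Date(raw$visit, format = "%m-%d-%Y")
```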
Transform
When you select Transform from the Transformation type drop-down another drop-down menu is shown that will allow you to apply common transformations to one or more variables in the data. For example, to take the (natural) log of a variable select the variable(s) you want to transform and choose Log from the Apply function drop-down. A new variable is created with the extension specified in the Variable name extension text input (e.g., _log). Make sure to press return after changing the extension. Click the Store button to add the variable(s) to the data set. A description of the transformation functions included in bioCancer is provided below.
- Log: create a natural log-transformed version of the selected variable (i.e., log(x) or ln(x))
- Square: multiply a variable by itself (i.e., x^2 or square(x))
- Square-root: take the square-root of a variable (i.e., x^.5)
- Absolute: Absolute value of a variable (i.e., abs(x))
- Center: create a new variable with a mean of zero (i.e., x - mean(x))
- Standardize: create a new variable with a mean of zero and standard deviation of one (i.e., (x - mean(x))/sd(x))
- Invert: 1/x
- Median split: create a new factor with two levels (Above and Below) that splits the variable values at the median
- Deciles: create a new factor with 10 levels (deciles) that splits the variable values at the 10th, 20th, …, 90th percentiles.
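Several of the transformations listed above are one-liners in base R. The sketch below is an illustration, not the code bioCancer runs; in particular, how the app assigns values exactly equal to the median in a median split is an assumption here (they are placed in Below).

```r
x <- c(2, 4, 6, 8, 10)

centered     <- x - mean(x)              # Center: mean of zero
standardized <- (x - mean(x)) / sd(x)    # Standardize: mean 0, sd 1
inverted     <- 1 / x                    # Invert
# Median split: assumption, values equal to the median go to "Below"
median_split <- factor(ifelse(x > median(x), "Above", "Below"))

# Deciles: cut at the 10th, 20th, ..., 90th percentiles
y <- 1:100
deciles <- cut(y, breaks = quantile(y, probs = seq(0, 1, by = 0.1)),
               include.lowest = TRUE, labels = FALSE)
```

With 100 distinct values each decile group contains exactly ten observations, which is an easy sanity check on the break points.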
Create
Choose Create
from the Transformation type
drop-down. This is the most flexible command to create new or
transformed variables. However, it also requires some basic knowledge of
R-syntax. A new variable can be any function of other variables in the
(active) dataset. Some examples are given below. In each example the
name to the left of the =
sign is the name of the new
variable. To the right of the =
sign you can include other
variable names and basic R-functions. After you have typed the command
press return
to create the new variable and press
Store
to add it to the dataset.
1. Create a new variable z that is the difference between variables x and y:
   z = x - y
2. Create a new variable z that is a transformation of variable x but with mean equal to zero (note that this transformation is also available in the Transform drop-down as Center):
   z = x - mean(x)
3. Create a new logical variable z that takes on the value TRUE when x > y and FALSE otherwise:
   z = x > y
4. Create a new logical variable z that takes on the value TRUE when x is equal to y and FALSE otherwise:
   z = x == y
5. Create a variable z that is equal to x lagged by 3 periods:
   z = lag(x, 3)
6. Create a categorical variable with two levels:
   z = ifelse(x < y, 'smaller', 'bigger')
7. Create a categorical variable with three levels. An alternative approach would be to use the Recode function described below:
   z = ifelse(x < 60, '< 60', ifelse(x > 65, '> 65', '60-65'))
8. Convert an outlier to a missing value. For example, if we want to remove the maximum value from a variable called xmRNA that is equal to 400 we could use an ifelse statement and enter the command below in the Create box. Press return and Store to add the new xmRNA_rc variable. Note that if we had entered xmRNA on the left-hand side of the = sign the original variable would have been overwritten:
   xmRNA_rc = ifelse(xmRNA >= 400, NA, xmRNA)
9. Similarly, if a respondent with ID 3 provided information in the wrong scale on a survey (e.g., income in $1s rather than in $1000s) we could use an ifelse statement and enter the command below in the Create box. As before, press return and Store to add the new income_rc variable:
   income_rc = ifelse(ID == 3, income/1000, income)
10. If multiple respondents made the same scaling mistake (e.g., those with ID 1, 3, and 15) we again use Create and enter:
   income_rc = ifelse(ID %in% c(1, 3, 15), income/1000, income)
11. If you have a date in a format not available through the Type menu you can use the parse_date_time function. For a date formatted as “2-1-14” you would specify the command below (note that this format will also be parsed correctly by the mdy function in the Type menu):
   date = parse_date_time(x, "%m%d%y")
12. Determine the time difference between two dates/times in seconds:
   time_diff = as_duration(time2 - time1)
13. Extract the month from a date variable:
   month = month(date)
14. Other attributes that can be extracted from a date or date-time variable are minute, hour, day, week, quarter, year, and wday (for weekday). For wday and month it can be convenient to add label = TRUE to the call. For example, to extract the weekday from a date variable and use a label rather than a number:
   weekday = wday(date, label = TRUE)
15. Calculate the distance between two locations using lat-long information:
   trip_distance = as_distance(lat1, long1, lat2, long2)
Note: For examples 6, 7, and 14 above you may need to change the new variable to type factor before using it for further analysis (see Type above).
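Most of the Create commands can be tried in a plain R session before they are entered in the app. In the Create box variables are referred to by bare name, whereas the sketch below has to prefix them with the data frame. The data are invented, and the lag is written out by hand in base R as a stand-in for a lag helper.

```r
d <- data.frame(
  ID     = c(1, 2, 3, 4, 15),
  x      = c(10, 20, 30, 40, 50),
  y      = c(15, 20, 25, 45, 40),
  income = c(52000, 48, 61000, 39, 75000)
)

d$z_diff  <- d$x - d$y                                   # difference
d$z_equal <- d$x == d$y                                  # logical comparison
d$z_size  <- ifelse(d$x < d$y, "smaller", "bigger")      # two-level category
# Lag by 3 periods: shift values down and pad the start with NA
d$x_lag3  <- c(rep(NA, 3), head(d$x, -3))
# Rescale income only for the respondents who used the wrong scale
d$income_rc <- ifelse(d$ID %in% c(1, 3, 15), d$income / 1000, d$income)
```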
Recode
To use the recode feature select the variable you want to change and
choose Recode
from the Transformation type
drop-down. Provide one or more recode commands, separated by a
;
, and press return to see the newly created variable. Note
that you can specify the names for the recoded variable in the
Recoded variable name
input box (press return to submit
changes). Finally, click Store
to add the new variable to
the data. Some examples are given below.
Values of 20 or below are set to ‘Low’ and all others to ‘High’
lo:20 = 'Low'; else = 'High'
Values of 20 or above are set to ‘High’ and all others to ‘Low’
20:hi = 'High'; else = 'Low'
Values 1 through 12 are set to ‘A’, 13:24 to ‘B’, and the remainder to ‘C’
1:12 = 'A'; 13:24 = 'B'; else = 'C'
Collapse age categories for a cross-tab analysis. In the example below ‘<25’ and ‘25-34’ are recoded to ‘<35’, ‘35-44’ and ‘45-54’ are recoded to ‘35-54’, and ‘55-64’ and ‘>64’ are recoded to ‘>54’
'<25' = '<35'; '25-34' = '<35'; '35-44' = '35-54'; '45-54' = '35-54'; '55-64' = '>54'; '>64' = '>54'
To exclude a particular value (e.g., an outlier in the data) for subsequent analyses we can recode it to a missing value. For example, if we want to remove the maximum value from a variable called
FreqMut
that is equal to 102 we would (1) select the variable FreqMut
in the Select variable(s)
box and (2) enter the command below in the Recode
box. Press return
and Store
to add the recoded variable to the data
102 = NA
To recode specific numeric values (e.g., carat) to a new value (1) select the variable
carat
in the Select variable(s)
box and (2) enter the command below in the Recode
box to set the value for carat to 2 in all rows where carat is currently larger than or equal to 2. Press return
and Store
to add the recoded variable to the data
2:hi = 2
Note: Never use a =
symbol in a label
when using the recode function (e.g., 50:hi = ‘>= 50’) as this will
cause an error.
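For readers who want to reproduce these recodes in a script, the sketch below shows base R equivalents of several of the examples above. It is not the code bioCancer runs for Recode. The vectors are made up, and the boundary handling (20 counts as 'Low' under lo:20) is an assumption based on the inclusive range used in the carat example:

```r
x <- c(5, 20, 21, 40)

# lo:20 = 'Low'; else = 'High'
low_high <- ifelse(x <= 20, "Low", "High")

# 1:12 = 'A'; 13:24 = 'B'; else = 'C'
m <- c(1, 12, 13, 24, 30)
grp <- ifelse(m >= 1 & m <= 12, "A",
              ifelse(m >= 13 & m <= 24, "B", "C"))

# 2:hi = 2 (cap values at 2)
carat <- c(0.5, 1.9, 2.0, 3.1)
carat_rc <- pmin(carat, 2)

# 102 = NA (turn one specific value into a missing value)
FreqMut <- c(3, 102, 17)
FreqMut_rc <- replace(FreqMut, FreqMut == 102, NA)

list(low_high = low_high, grp = grp, carat_rc = carat_rc, FreqMut_rc = FreqMut_rc)
```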
Rename
Choose Rename
from the Transformation type
drop-down, select one or more variables, and enter new names for them in
the rename box shown. Separate each name by a ,
. Press
return to see the variables with their new names on screen and press
Store
to alter the variable names in the original data.
Replace
Choose Replace
from the Transformation type
drop-down if you want to replace existing variables in the data with new
ones created using, for example, Create, Transform, Clipboard, etc.
Select one or more variables to overwrite and the same number of
replacement variables. Press Store
to alter the data.
Clipboard
It is possible to manipulate your data in a spreadsheet (e.g., Excel
or Google sheets) and copy-and-paste the data back into bioCancer. If
you don’t have the original data in a spreadsheet already use the
clipboard feature in Data > Manage so you can paste it into
the spreadsheet or click the download icon on the top right of your
screen in the Data > View tab. Apply your transformations in
the spreadsheet program and then copy the new variable(s), with a header
label, to the clipboard (i.e., CTRL-C on Windows and CMD-C on Mac).
Select Clipboard
from the Transformation type
drop-down and paste the new data into the
Paste from spreadsheet
box. It is key that new variable(s)
have the same number of observations as the data in bioCancer. To add
the new variables to the data click Store
.
Note: Using the clipboard feature for data transformation is discouraged because it is not reproducible.
Normalize
Choose Normalize
from the
Transformation type
drop-down to standardize one or more
variables. For example, in the epiGenomics data we may want to express
the mRNA of a gene per unit of FreqMut. Select FreqMut
as the
normalizing variable and mRNA
in the
Select variable(s)
box. You will see summary statistics for
the new variable (e.g., mRNA_FreqMut
) in the main panel.
Store changes by clicking the Store
button.
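A minimal sketch of this kind of normalization in base R follows. It assumes that normalizing mRNA by FreqMut means dividing one column by the other (the section does not spell out the formula), and the gene names and values are made up:

```r
# Hypothetical stand-in for the epiGenomics data
epi <- data.frame(Genes   = c("TP53", "BRCA1", "MYC"),
                  mRNA    = c(12.0, 8.4, 20.0),
                  FreqMut = c(4, 2, 10))

# New variable named after both inputs, as in the main panel summary
epi$mRNA_FreqMut <- epi$mRNA / epi$FreqMut

summary(epi$mRNA_FreqMut)
```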
Reorder or remove columns
Choose Reorder/Remove columns
from the
Transformation type
drop-down. Drag-and-drop variables to
reorder them in the data. To remove a variable click the \(\times\) next to the label. Press
Store
to commit the changes.
Reorder or remove levels
If a (single) variable of type factor
is selected in
Select variable(s)
, choose
Reorder/Remove levels
from the
Transformation type
drop-down to reorder and/or remove
levels. Drag-and-drop levels to reorder them or click the \(\times\) to remove them. Press
Store
to commit the changes. To temporarily exclude levels
from the data use the Filter
box (see the help file linked
in the Data > View
tab).
Remove missing values
Choose Remove missing
from the
Transformation type
drop-down to eliminate rows with one or
more missing values. If all variables are selected, a row with a missing
value in any column will be removed. If one or more
variables are selected, only rows with missing
values for the selected variables will be removed. Press Store
to change
the data. If missing values were present you will see the number of
observations in the data summary change (i.e., the value of n
changes).
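The two cases described above (all variables selected versus a subset) can be sketched in base R as follows. The data frame is invented for illustration:

```r
dat <- data.frame(gene    = c("A", "B", "C", "D"),
                  mRNA    = c(1.2, NA, 0.7, 2.1),
                  FreqMut = c(10, 4, NA, 6))

# All variables selected: drop rows with a missing value in any column
all_complete <- dat[complete.cases(dat), ]

# Only mRNA selected: drop rows where mRNA is missing
mrna_complete <- dat[!is.na(dat$mRNA), ]

# Compare the number of observations (n) before and after
c(before = nrow(dat), all = nrow(all_complete), mRNA = nrow(mrna_complete))
```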
Remove duplicates
It is common to have one or more variables in a dataset that
should have only unique values (i.e., no duplicates).
Customer IDs, for example, should be unique unless the dataset
contains multiple orders for the same customer. In that case the
combination of customer ID and order ID should be
unique. To remove duplicates, select one or more variables to determine
uniqueness. Choose Remove duplicates
from the
Transformation type
drop-down and check how the summary
statistics change. Press Store
to change the data. If there
are duplicate rows you will see the number of observations in the data
summary change (i.e., the value of n and n_distinct
will change).
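As a sketch of the same idea in base R, the example below removes rows that repeat a combination of customer ID and order ID. The orders data frame and its column names are hypothetical:

```r
orders <- data.frame(customer_id = c(1, 1, 2, 2, 3),
                     order_id    = c(10, 11, 20, 20, 30),
                     amount      = c(5, 7, 9, 9, 4))

# Uniqueness is judged on the combination of customer ID and order ID
keys <- orders[c("customer_id", "order_id")]
orders_unique <- orders[!duplicated(keys), ]

nrow(orders)         # number of observations before
nrow(orders_unique)  # number of observations after
```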
Show duplicates
If there are duplicates in the data use Show duplicates
to get a better sense for the data points that have the same value in
multiple rows. If you want to explore duplicates using the View
tab make sure to Store
them in a different dataset (i.e.,
make sure not to overwrite the data you are working
on). If you choose to show duplicates based on all columns in the data
only one of the duplicate rows will be shown. These rows are
exactly the same so showing 2 or 3 isn’t helpful. If,
however, we look for duplicates based on a subset of the available
variables bioCancer will generate a dataset with all
rows that are deemed similar.
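To list every row that shares its key with another row, rather than dropping the repeats, a base R sketch could look like this. The data are invented, and the result is stored under a new name so the working data are not overwritten:

```r
orders <- data.frame(customer_id = c(1, 1, 2, 2, 3),
                     order_id    = c(10, 11, 20, 20, 30),
                     amount      = c(5, 7, 9, 9, 4))

keys <- orders[c("customer_id", "order_id")]

# Flag a row if the same key appears earlier OR later in the data
is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
orders_dups <- orders[is_dup, ]
orders_dups
```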
Combine two datasets
There are six join (or merge) options available in bioCancer from the dplyr package developed by Hadley Wickham and Romain Francois on GitHub.
The examples below are adapted from Cheatsheet for dplyr join
functions by Jenny Bryan and focus on three small datasets,
superheroes
, publishers
, and
avengers
, to illustrate the different join types
and other ways to combine datasets in R and bioCancer. The data is also
available in csv format through the links below:
name | alignment | gender | publisher |
---|---|---|---|
Magneto | bad | male | Marvel |
Storm | good | female | Marvel |
Mystique | bad | female | Marvel |
Batman | good | male | DC |
Joker | bad | male | DC |
Catwoman | bad | female | DC |
Hellboy | good | male | Dark Horse Comics |
publisher | yr_founded |
---|---|
DC | 1934 |
Marvel | 1939 |
Image | 1992 |
In the screen-shot of the Data > Combine tab below we see the two
datasets. The tables share the variable publisher which is
automatically selected for the join. Different join options are
available from the Combine type
dropdown. You can also
specify a name for the combined dataset in the Data name
text input box.
Inner join (superheroes, publishers)
If x = superheroes and y = publishers:
An inner join returns all rows from x with matching values in y, and all columns from both x and y. If there are multiple matches between x and y, all match combinations are returned.
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Magneto | bad | male | Marvel | 1939 |
Storm | good | female | Marvel | 1939 |
Mystique | bad | female | Marvel | 1939 |
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
In the table above we lose Hellboy because, although this
hero does appear in superheroes
, the publisher (Dark
Horse Comics) does not appear in publishers
. The join
result has all variables from superheroes
, plus
yr_founded, from publishers
. We can visualize an
inner join with the venn-diagram below:
The bioCancer commands are:
# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "inner_join")
# R
inner_join(superheroes, publishers, by = "publisher")
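To run the R command outside bioCancer the two tables have to exist as data frames first. The sketch below, assuming the dplyr package is installed, rebuilds them from the tables shown earlier and repeats the inner join:

```r
library(dplyr)

superheroes <- read.csv(text = "name,alignment,gender,publisher
Magneto,bad,male,Marvel
Storm,good,female,Marvel
Mystique,bad,female,Marvel
Batman,good,male,DC
Joker,bad,male,DC
Catwoman,bad,female,DC
Hellboy,good,male,Dark Horse Comics", stringsAsFactors = FALSE)

publishers <- read.csv(text = "publisher,yr_founded
DC,1934
Marvel,1939
Image,1992", stringsAsFactors = FALSE)

# Keep only superheroes whose publisher also appears in publishers
ij <- inner_join(superheroes, publishers, by = "publisher")
ij
```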
Left join (superheroes, publishers)
A left join returns all rows from x, and all columns from x and y. If there are multiple matches between x and y, all match combinations are returned.
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Magneto | bad | male | Marvel | 1939 |
Storm | good | female | Marvel | 1939 |
Mystique | bad | female | Marvel | 1939 |
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
Hellboy | good | male | Dark Horse Comics | NA |
The join result contains superheroes
with variable
yr_founded
from publishers
. Hellboy,
whose publisher does not appear in publishers
, has an
NA
for yr_founded. We can visualize a left join
with the venn-diagram below:
The bioCancer commands are:
# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "left_join")
# R
left_join(superheroes, publishers, by = "publisher")
Right join (superheroes, publishers)
A right join returns all rows from y, and all columns from y and x. If there are multiple matches between y and x, all match combinations are returned.
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
Magneto | bad | male | Marvel | 1939 |
Storm | good | female | Marvel | 1939 |
Mystique | bad | female | Marvel | 1939 |
NA | NA | NA | Image | 1992 |
The join result contains all rows and columns from
publishers
and all variables from superheroes
.
We lose Hellboy, whose publisher does not appear in
publishers
. Image is retained in the table but has
NA
values for the variables name,
alignment, and gender from superheroes
.
Notice that a join can change both the row and variable order so you
should not rely on these in your analysis. We can visualize a right join
with the venn-diagram below:
The bioCancer commands are:
# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "right_join")
# R
right_join(superheroes, publishers, by = "publisher")
Full join (superheroes, publishers)
A full join combines two datasets, keeping rows and columns that appear in either.
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Magneto | bad | male | Marvel | 1939 |
Storm | good | female | Marvel | 1939 |
Mystique | bad | female | Marvel | 1939 |
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
Hellboy | good | male | Dark Horse Comics | NA |
NA | NA | NA | Image | 1992 |
In this table we keep Hellboy (even though Dark Horse
Comics is not in publishers
) and Image (even
though the publisher is not listed in superheroes
) and get
variables from both datasets. Observations without a match are assigned
the value NA for variables from the other dataset. We can
visualize a full join with the venn-diagram below:
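The bioCancer commands are:
# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "full_join")
# R
full_join(superheroes, publishers, by = "publisher")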
Semi join (superheroes, publishers)
A semi join returns all rows from x that have matching values in y, keeping only columns from x. Whereas an inner join will return one row of x for each matching row of y, a semi join will never duplicate rows of x.
name | alignment | gender | publisher |
---|---|---|---|
Batman | good | male | DC |
Joker | bad | male | DC |
Catwoman | bad | female | DC |
Magneto | bad | male | Marvel |
Storm | good | female | Marvel |
Mystique | bad | female | Marvel |
We get a similar table as with inner_join
but it
contains only the variables in superheroes
. The bioCancer
commands are:
# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "semi_join")
# R
semi_join(superheroes, publishers, by = "publisher")
Anti join (superheroes, publishers)
An anti join returns all rows from x without matching values in y, keeping only columns from x.
name | alignment | gender | publisher |
---|---|---|---|
Hellboy | good | male | Dark Horse Comics |
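We now get only Hellboy, the only superhero
whose publisher is not in publishers
and we do not get the variable
yr_founded either. We can visualize an anti join with the
venn-diagram below:
The bioCancer commands are:
# bioCancer
combinedata("superheroes", "publishers", by = "publisher", type = "anti_join")
# R
anti_join(superheroes, publishers, by = "publisher")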
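The full, semi, and anti joins can be compared side by side in plain R. The sketch below, assuming the dplyr package is installed, rebuilds the two example tables and checks the number of rows and columns each join returns:

```r
library(dplyr)

superheroes <- read.csv(text = "name,alignment,gender,publisher
Magneto,bad,male,Marvel
Storm,good,female,Marvel
Mystique,bad,female,Marvel
Batman,good,male,DC
Joker,bad,male,DC
Catwoman,bad,female,DC
Hellboy,good,male,Dark Horse Comics", stringsAsFactors = FALSE)

publishers <- read.csv(text = "publisher,yr_founded
DC,1934
Marvel,1939
Image,1992", stringsAsFactors = FALSE)

# Full join: rows from either table, NA where there is no match
fj <- full_join(superheroes, publishers, by = "publisher")

# Semi join: matching rows of superheroes, columns of superheroes only
sj <- semi_join(superheroes, publishers, by = "publisher")

# Anti join: rows of superheroes without a match in publishers
aj <- anti_join(superheroes, publishers, by = "publisher")

sapply(list(full = fj, semi = sj, anti = aj), dim)
```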
Dataset order
Note that the order of the datasets selected may matter for a join. If we set up the Data > Combine tab as below the results are as follows:
Inner join (publishers, superheroes)
publisher | yr_founded | name | alignment | gender |
---|---|---|---|---|
DC | 1934 | Batman | good | male |
DC | 1934 | Joker | bad | male |
DC | 1934 | Catwoman | bad | female |
Marvel | 1939 | Magneto | bad | male |
Marvel | 1939 | Storm | good | female |
Marvel | 1939 | Mystique | bad | female |
Every publisher that has a match in superheroes
appears
multiple times, once for each match. Apart from variable and row order,
this is the same result we had for the inner join shown above.
Left and Right join (publishers, superheroes)
Apart from row and variable order, a left join of
publishers
and superheroes
is equivalent to a
right join of superheroes
and publishers
.
Similarly, a right join of publishers
and
superheroes
is equivalent to a left join of
superheroes
and publishers
.
Full join (publishers, superheroes)
As you might expect, apart from row and variable order, a full join
of publishers
and superheroes
is equivalent to
a full join of superheroes
and publishers
.
Semi join (publishers, superheroes)
publisher | yr_founded |
---|---|
Marvel | 1939 |
DC | 1934 |
With a semi join the effect of switching the dataset order is
clearer. Even though there are multiple matches for each publisher only
one is shown. Contrast this with an inner join where “If there are
multiple matches between x and y, all match combinations are returned.”
We see that publisher Image is lost in the table because it is
not in superheroes
.
Anti join (publishers, superheroes)
publisher | yr_founded |
---|---|
Image | 1992 |
Only publisher Image is retained because both
Marvel and DC are in superheroes
. We keep
only variables in publishers
.
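The effect of dataset order on semi and anti joins can be checked in plain R. This sketch, assuming the dplyr package is installed, rebuilds the example tables and runs both joins with publishers as the first dataset:

```r
library(dplyr)

superheroes <- read.csv(text = "name,alignment,gender,publisher
Magneto,bad,male,Marvel
Storm,good,female,Marvel
Mystique,bad,female,Marvel
Batman,good,male,DC
Joker,bad,male,DC
Catwoman,bad,female,DC
Hellboy,good,male,Dark Horse Comics", stringsAsFactors = FALSE)

publishers <- read.csv(text = "publisher,yr_founded
DC,1934
Marvel,1939
Image,1992", stringsAsFactors = FALSE)

# Publishers that have at least one superhero: each appears once
sj_rev <- semi_join(publishers, superheroes, by = "publisher")

# Publishers without any superhero in the data
aj_rev <- anti_join(publishers, superheroes, by = "publisher")

sj_rev
aj_rev
```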
Additional tools to combine datasets (avengers, superheroes)
When two datasets have the same columns (or rows) there are
additional ways in which we can combine them into a new dataset. We have
already used the superheroes
dataset and will now try to
combine it with the avengers
data. These two datasets have
the same number of rows and columns and the columns have the same
names.
In the screen-shot of the Data > Combine tab below we see the two
datasets. There is no need to select variables to combine the datasets
here. Any variables in Select variables
are ignored in the
commands below. Again, you can specify a name for the combined dataset
in the Data name
text input box.
Bind rows
name | alignment | gender | publisher |
---|---|---|---|
Thor | good | male | Marvel |
Iron Man | good | male | Marvel |
Hulk | good | male | Marvel |
Hawkeye | good | male | Marvel |
Black Widow | good | female | Marvel |
Captain America | good | male | Marvel |
Magneto | bad | male | Marvel |
Magneto | bad | male | Marvel |
Storm | good | female | Marvel |
Mystique | bad | female | Marvel |
Batman | good | male | DC |
Joker | bad | male | DC |
Catwoman | bad | female | DC |
Hellboy | good | male | Dark Horse Comics |
If the avengers
dataset were meant to extend the list of
superheroes we could just stack the two datasets, one below the other.
The new dataset has 14 rows and 4 columns. Due to a coding error in the
avengers
dataset (i.e., Magneto is not
an Avenger) there is a duplicate row in the new combined
dataset, which is something we probably don’t want.
The bioCancer commands are:
# bioCancer
combinedata("avengers", "superheroes", type = "bind_rows")
# R
bind_rows(avengers, superheroes)
Bind columns
name | alignment | gender | publisher | name | alignment | gender | publisher |
---|---|---|---|---|---|---|---|
Thor | good | male | Marvel | Magneto | bad | male | Marvel |
Iron Man | good | male | Marvel | Storm | good | female | Marvel |
Hulk | good | male | Marvel | Mystique | bad | female | Marvel |
Hawkeye | good | male | Marvel | Batman | good | male | DC |
Black Widow | good | female | Marvel | Joker | bad | male | DC |
Captain America | good | male | Marvel | Catwoman | bad | female | DC |
Magneto | bad | male | Marvel | Hellboy | good | male | Dark Horse Comics |
If the datasets had different columns for the same superheroes we could combine the two datasets side by side. In bioCancer you will see an error message if you try to bind these columns because they have the same names, something that we should always avoid. The method can be useful if we know that the row order of the two datasets is the same but the columns are all different.
Intersect
A good way to check if two datasets with the same columns have
duplicate rows is to choose intersect
from the
Combine type
dropdown. There is indeed one row that is
identical in the avengers
and superheroes
data
(i.e., Magneto).
The bioCancer commands are the same as shown above, except you will
need to replace bind_rows
by intersect
.
Union
name | alignment | gender | publisher |
---|---|---|---|
Thor | good | male | Marvel |
Iron Man | good | male | Marvel |
Hulk | good | male | Marvel |
Hawkeye | good | male | Marvel |
Black Widow | good | female | Marvel |
Captain America | good | male | Marvel |
Magneto | bad | male | Marvel |
Storm | good | female | Marvel |
Mystique | bad | female | Marvel |
Batman | good | male | DC |
Joker | bad | male | DC |
Catwoman | bad | female | DC |
Hellboy | good | male | Dark Horse Comics |
A union
of avengers
and
superheroes
will combine the datasets but will omit
duplicate rows (i.e., it will keep only one copy of the row for
Magneto). Likely what we want here.
The bioCancer commands are the same as shown above, except you will
need to replace bind_rows
by union
.
Setdiff
name | alignment | gender | publisher |
---|---|---|---|
Thor | good | male | Marvel |
Iron Man | good | male | Marvel |
Hulk | good | male | Marvel |
Hawkeye | good | male | Marvel |
Black Widow | good | female | Marvel |
Captain America | good | male | Marvel |
Finally, a setdiff
will keep rows from
avengers
that are not in superheroes
. If we
reverse the inputs (i.e., choose superheroes
from the
Datasets
dropdown and avengers
from the
Combine with
dropdown) we will end up with all rows from
superheroes
that are not in avengers
. In both
cases the entry for Magneto will be omitted.
The bioCancer commands are the same as shown above, except you will
need to replace bind_rows
by setdiff
.
For additional discussion see https://dplyr.tidyverse.org/articles/two-table.html
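The row-wise tools above (bind rows, intersect, union, setdiff) can be compared in plain R. The sketch below, assuming the dplyr package is installed, rebuilds avengers and superheroes from the tables in this section and counts the rows each operation returns:

```r
library(dplyr)

avengers <- read.csv(text = "name,alignment,gender,publisher
Thor,good,male,Marvel
Iron Man,good,male,Marvel
Hulk,good,male,Marvel
Hawkeye,good,male,Marvel
Black Widow,good,female,Marvel
Captain America,good,male,Marvel
Magneto,bad,male,Marvel", stringsAsFactors = FALSE)

superheroes <- read.csv(text = "name,alignment,gender,publisher
Magneto,bad,male,Marvel
Storm,good,female,Marvel
Mystique,bad,female,Marvel
Batman,good,male,DC
Joker,bad,male,DC
Catwoman,bad,female,DC
Hellboy,good,male,Dark Horse Comics", stringsAsFactors = FALSE)

stacked   <- bind_rows(avengers, superheroes)   # keeps the duplicate row
shared    <- intersect(avengers, superheroes)   # rows present in both
combined  <- union(avengers, superheroes)       # duplicates removed
only_aven <- setdiff(avengers, superheroes)     # rows only in avengers
only_hero <- setdiff(superheroes, avengers)     # rows only in superheroes

sapply(list(bind_rows = stacked, intersect = shared, union = combined,
            setdiff_a = only_aven, setdiff_s = only_hero), nrow)
```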
Enrichment Panel
Show multi-Omics Data in Circular Layout
The word Circomics
comes from the association between
Circos
and Omics
.
Circos is a package for visualizing data and information with circular layouts. Users can visualize multiple matrices of Omics data at the same time, and colored sectors make it easy to explore relationships between dimensions.
This function uses the CoffeeWheel package developed by Dr. Arman Aksoy.
Studies in Wheel
User needs to:
* Choose the Studies of interest.
* Visualize the availability of dimensions by checking Availability. The output is a table with Yes/No availability.
* Load Omics data for the selected Studies by checking Load. The output is a list of loaded dimensions for the selected Studies.
When Profiles Data are loaded, the button
Load Profiles in Datasets
appears. It uploads all Profiles
Data to the Processing
panel for further exploration or
analysis.
The Legend
checkbox displays the meaning of the color
palette.
Load Profiles in Datasets
For every dimension, the tables are merged by study and saved as:
xCNA
, xMetHM27
, xMetHM450
,
xmiRNA
, xmRNA
, xMut
,
xRPPA
in Datasets (Processing panel).
Genes / Diseases / Pathways Classification and clustering
Classification
The classifier uses geNetClassifier
methods [1] to
classify genes by disease based only on gene expression (mRNA). The
approach is implemented in an R package, named geNetClassifier,
available as an open access tool in Bioconductor. The process is
summarized in 5 steps:
* Select Studies.
* Get the sample size by processing > Samples.
* Set the sample size and the posterior probability.
* Select one Case and one Genetic Profile for every study, respecting the order of the studies. It is recommended to use _v2_mrna for all genetic profiles.
* Run the classifier by processing > Classifier.
The ranking is built by ordering the genes decreasingly by their
posterior probability for each study (class). Each gene is assigned to the
class in which it has the best ranking. As a result of this process, even
if a gene is found associated with several classes during the expression
analysis, each gene can only be on the ranking of one class [1]. The
resulting output is a table (Table 1) that associates genes to study and
displays PostProb
and gene expression sign
exprsUpDw
. The exprsMeanDiff
value is the
expression difference between the mean for each gene in the given class
and the mean in the closest class.
Table 1: Ranking Genes by Study
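The ranking rule described above can be illustrated with a toy example. This is only a sketch of the idea in base R, not geNetClassifier's implementation. The gene names, study names, and posterior probabilities are invented:

```r
# Hypothetical posterior probabilities: rows are genes, columns are studies
post_prob <- matrix(c(0.9, 0.2, 0.6, 0.3,
                      0.1, 0.8, 0.5, 0.7),
                    nrow = 4,
                    dimnames = list(c("G1", "G2", "G3", "G4"),
                                    c("studyA", "studyB")))

# Rank genes within each study; rank 1 = highest posterior probability
ranks <- apply(-post_prob, 2, rank, ties.method = "min")

# Assign each gene to the study in which it ranks best
best <- apply(ranks, 1, which.min)
assigned <- setNames(colnames(post_prob)[best], rownames(post_prob))
assigned
```

Each gene ends up in exactly one class, which matches the statement that a gene can appear on the ranking of only one class.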
Plot Clusters
Gene Diseases Association
GeneList/Diseases
predicts which diseases involve the genes in
your GeneList. It uses annotations from DisGeNET [2] and methods from the clusterProfiler
package [3].
The GeneList/Diseases
association uses a gene list as
input. The assessment of this prediction is based on two parameters:
* The number of genes that are involved in the disease (x-axis)
* The P-value of this association (color).
In the following example, there are two
annotations related to breast cancer which involve more than 130 genes
and have small P-values.
Figure 1: Genes / Diseases Association
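Over-representation analyses of this kind are commonly scored with a hypergeometric test. The sketch below works through one such P-value in base R. All counts are invented for illustration, and the calculation is a generic example rather than a reproduction of Figure 1:

```r
N <- 20000  # genes in the annotated background (hypothetical)
M <- 500    # background genes annotated to the disease
n <- 300    # genes in the user's gene list
k <- 15     # genes in the list that are annotated to the disease

# Probability of observing k or more annotated genes by chance
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same P-value from a one-sided Fisher exact test on the 2x2 table
tab <- matrix(c(k, M - k, n - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(genes_involved = k, p_value = p_value)
```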
The Disease Ontology
analysis uses the genes/study groups computed
by Classifier
(Table 1). The dotplot position indicates
which diseases are annotated for each genes/study group [4]. The dot size indicates
the ratio of genes involved in the disease for the same gene group
(lihc_tcga has 2/3 genes involved for the 4 diseases). The color
indicates the P-value.
Figure 2: Disease Ontology
The same process is possible with Gene Ontology (GO) and KEGG.
Figure 3: GO Pathway Enrichment
Figure 4: KEGG Pathway Enrichment
Function Interaction Network Enrichment
Edges Attributes
Function Interactions (FIs) Type
Arrowhead | Reaction | Arrowhead | Reaction |
---|---|---|---|
-> | activate, express, regulate | -| | inhibit |
diamond -<> | complex | curve | catalyze, reaction |
point -o | phosphorylate | – | binding, input, compound |
-< | dissociation | …. | predicted, indirect, ubiquitinated |
Use Linkers
Picks as few linker genes as possible to connect the input genes together. For example, if the algorithm finds that one gene can link all input genes together, it will not try other genes (not from the gene list) that could be used as linkers.
Linker genes are drawn with a box shape.
Layouts
dot
The dot engine flows the directed graph in the direction of rank (i.e., downstream nodes of the same rank are aligned). By default, the direction is from top to bottom.
twopi
The twopi engine provides radial layouts. Nodes are placed on concentric circles depending on their distance from a given root node.
neato
The neato engine provides spring model layouts. This is a suitable engine if the graph is not too large (less than 100 nodes) and you don’t know anything else about it. The neato engine attempts to minimize a global energy function, which is equivalent to statistical multi-dimensional scaling.
circo
The circo engine provides circular layouts. This is suitable for certain diagrams of multiple cyclic structures, such as certain telecommunications networks.
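The four names above are Graphviz layout engines. As a sketch of how the choice of engine enters a graph description, the R helper below (make_dot, a hypothetical name, with made-up gene interactions) writes a DOT string whose layout attribute selects the engine. Rendering is optional because it assumes the DiagrammeR package is installed:

```r
# Build a DOT description for a directed graph (hypothetical helper)
make_dot <- function(edges, engine = c("dot", "twopi", "neato", "circo")) {
  engine <- match.arg(engine)  # rejects engines not listed above
  edge_lines <- sprintf('  "%s" -> "%s";', edges$from, edges$to)
  paste0("digraph fi_network {\n",
         "  layout = ", engine, ";\n",
         paste(edge_lines, collapse = "\n"),
         "\n}")
}

edges <- data.frame(from = c("TP53", "TP53", "MDM2"),
                    to   = c("MDM2", "CDKN1A", "TP53"),
                    stringsAsFactors = FALSE)

dot_src <- make_dot(edges, engine = "neato")
cat(dot_src, "\n")

# Optional rendering, only if DiagrammeR is available
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  widget <- DiagrammeR::grViz(dot_src, engine = "neato")
}
```

Switching the engine argument between dot, twopi, neato, and circo changes only the layout line, so the same edge list can be compared under each layout.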
Nodes Attributes
From ReactomeFI
The size of a node is related to its number of interactions. A node with many interactions is drawn larger than a node with few interactions. This makes it easier to locate important genes in the network.
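Counting interactions per gene is a degree calculation on the edge list. Below is a small base R sketch with an invented edge list. The linear scaling from degree to node size is an arbitrary choice made for this example:

```r
edges <- data.frame(from = c("TP53", "TP53", "TP53", "MDM2"),
                    to   = c("MDM2", "CDKN1A", "ATM", "CDKN1A"),
                    stringsAsFactors = FALSE)

# Degree = number of edges in which a gene takes part
degree <- table(c(edges$from, edges$to))

# Map degree to a display size (hypothetical linear scaling)
node_size <- 10 + 5 * as.numeric(degree)
names(node_size) <- names(degree)

# Largest nodes first
node_size[order(-node_size)]
```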
From Classifier
mRNA
Attribute node color using exprsMeanDiff
values from
Classifier
panel.
Studies
Link study to associated genes from Classifier
table.
From Profiles Data
User needs to:
* Select studies (From Which Studies).
* Load profiles data (Load).
* Select Profiles Data.
* Set thresholds with the Sliders.
Legend
Interpretation
References
[1] Aibar S, Fontanillo C, Droste C, Roson-Burgo B, Campos-Laborie F, Hernandez-Rivas J and De Las Rivas J (2015). “Analyse multiple disease subtypes and build associated gene networks using genome-wide expression profiles.” BMC Genomics, 16(Suppl 5:S3). https://dx.doi.org/10.1186/1471-2164-16-S5-S3.
[2] Piñero, J., Queralt-Rosinach, N., Bravo, A., Deu-Pons, J., Bauer-Mehren, A., Baron, M., Ferran Sanz, and Furlong, L. I. (2015). DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database: The Journal of Biological Databases and Curation, 2015, bav028. https://doi.org/10.1093/database/bav028
[3] Yu G, Wang L, Han Y and He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), pp. 284-287. https://dx.doi.org/10.1089/omi.2011.0118.
[4] Yu G, Wang L, Yan G and He Q (2015). “DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis.” Bioinformatics, 31(4), pp. 608-609. https://dx.doi.org/10.1093/bioinformatics/btu684, https://bioinformatics.oxfordjournals.org/content/31/4/608.