University of Pretoria, Bioinformatics and Computational Biology Unit

Contact details
Traineeship proposition

Abstract 2018-2019: Analysis of germ line variants with specific effect prediction of noncoding variants in cancer-related genes of black female South African breast cancer patients

Breast cancer is the most widely described cancer in women and has the second-highest incidence of all cancers worldwide. The disease is far more common in women than in men, at a ratio of about 100 to 1 (1,2). Historically, breast cancer was rated the second most common cancer in North America, behind lung cancer (3,4,5). According to the American Cancer Society, however, breast cancer now outnumbers lung cancer as the most prominent cancer (6). In South Africa, studies have reported a lifetime breast cancer risk for women of 1 in 28, with 0.7% of all deaths caused by breast cancer. A total of 166 blood samples were previously collected from black South African women with breast carcinoma who visited the Oncology Clinic at Steve Biko Hospital. All patients gave consent for their samples, following ethics approval for the study. DNA was obtained from the peripheral blood samples using the procedure described by Johns and Paulus-Thomas (1989). All samples were tested for BRCA mutations, and all tested negative. The samples were subsequently analysed for germ line variants in selected cancer-related genes. After quality analysis, the first step was trimming the reads. The FASTX-Toolkit was used to trim the 100 bp paired-end reads to positions 5 and 95 at the 5' and 3' ends, respectively. Next, the reads were mapped against the hg19 reference genome using BWA-MEM. Samtools was then used to view, sort and index the aligned reads, and Qualimap was used to calculate how many reads aligned to each gene of interest. In the following step, duplicate reads were marked using Picard MarkDuplicates. The GATK toolkit was used for base quality score recalibration, which outputs a recalibrated BAM or CRAM file.
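The preprocessing steps described above can be sketched as a list of command lines. This is only an illustration, not the commands actually run in the study: it assumes GATK4-style syntax, a `picard` wrapper script on the PATH, Phred+33 quality encoding, and an Illumina read group. All file names (`hg19.fa`, `known_sites.vcf.gz`, the `_R1`/`_R2` FASTQ names) and the function name are hypothetical placeholders. The Qualimap call shows only a basic `bamqc` run; the per-gene read counting mentioned in the text would need additional options not shown here.

```python
import shlex
from typing import List


def preprocessing_commands(sample: str,
                           reference: str = "hg19.fa",
                           known_sites: str = "known_sites.vcf.gz") -> List[List[str]]:
    """Build the command lines for one sample, in the order given in the text.

    Every file name here is a hypothetical placeholder.
    """
    # Read group is required by GATK; the platform value is an assumption.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    trimmed = [f"{sample}_R{mate}.trimmed.fastq" for mate in (1, 2)]
    cmds: List[List[str]] = []

    # 1. Trim both mates: keep bases 5..95 of each 100 bp read.
    for mate, out in zip((1, 2), trimmed):
        cmds.append(["fastx_trimmer", "-Q33", "-f", "5", "-l", "95",
                     "-i", f"{sample}_R{mate}.fastq", "-o", out])

    # 2. Map the trimmed pairs against the reference with BWA-MEM.
    cmds.append(["bwa", "mem", "-R", read_group, "-o", f"{sample}.sam",
                 reference, *trimmed])

    # 3. Convert, sort and index the alignments with Samtools.
    cmds.append(["samtools", "view", "-b", "-o", f"{sample}.bam", f"{sample}.sam"])
    cmds.append(["samtools", "sort", "-o", f"{sample}.sorted.bam", f"{sample}.bam"])
    cmds.append(["samtools", "index", f"{sample}.sorted.bam"])

    # 4. Alignment quality report with Qualimap (basic run only).
    cmds.append(["qualimap", "bamqc", "-bam", f"{sample}.sorted.bam",
                 "-outdir", f"{sample}_qualimap"])

    # 5. Mark duplicate reads with Picard.
    cmds.append(["picard", "MarkDuplicates",
                 f"I={sample}.sorted.bam", f"O={sample}.dedup.bam",
                 f"M={sample}.dup_metrics.txt"])

    # 6. Base quality score recalibration: build the table, then apply it.
    cmds.append(["gatk", "BaseRecalibrator", "-R", reference,
                 "-I", f"{sample}.dedup.bam", "--known-sites", known_sites,
                 "-O", f"{sample}.recal.table"])
    cmds.append(["gatk", "ApplyBQSR", "-R", reference,
                 "-I", f"{sample}.dedup.bam",
                 "--bqsr-recal-file", f"{sample}.recal.table",
                 "-O", f"{sample}.recal.bam"])
    return cmds


if __name__ == "__main__":
    # Print the commands instead of running them.
    for cmd in preprocessing_commands("sample01"):
        print(shlex.join(cmd))
```

Building the commands as lists keeps the sketch testable without any of the tools installed; each list could be passed to `subprocess.run(cmd, check=True)` to execute it.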
After the base quality scores had been recalibrated and applied, the next step was to proceed to variant calling. For this step, the GATK HaplotypeCaller was used in GVCF mode. The actual variant calling consisted of more than one step; the details can be found in the electronic notebook. After variant calling with the HaplotypeCaller, variants were filtered using a specified cut-off value. This step used two tools. First, the type of variant, SNPs or indels, was selected with the GATK tool SelectVariants. The GATK VariantFiltration tool was then used to filter variant calls based on INFO and/or FORMAT annotations. In addition, before selection, variants with a frequency of ≥ 1% in the ExAC African (AFR) population (the polymorphisms) were removed. In the last step, the Ensembl Variant Effect Predictor (VEP) was used. VEP is a powerful toolset for the analysis, annotation and prioritization of genomic variants in coding and non-coding regions, and it provides access to an extensive collection of genomic annotation. For this study, the results of two functional effect predictors, CADD and GWAVA, were considered, and a variant was selected only if both methods predicted it to be deleterious. Moreover, VEP provided results for multiple transcripts per gene, as determined by mapping RefSeq identifiers to Ensembl transcripts via UCSC.
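The selection rule described above (drop polymorphisms, then keep only variants that both predictors call deleterious) can be illustrated with a small sketch. The 1% frequency limit comes from the text. The CADD and GWAVA cut-offs, the treatment of variants absent from ExAC as rare, the `Variant` fields and the example records are all assumptions made for illustration, since the abstract does not state the values actually used.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Variant:
    name: str
    exac_afr_af: Optional[float]  # allele frequency in ExAC AFR; None if absent
    cadd_phred: float             # CADD scaled (PHRED-like) score
    gwava: float                  # GWAVA score between 0 and 1


def select_candidates(variants: Iterable[Variant],
                      max_af: float = 0.01,         # 1% limit from the text
                      cadd_cutoff: float = 15.0,    # assumed cut-off
                      gwava_cutoff: float = 0.5     # assumed cut-off
                      ) -> List[Variant]:
    """Keep rare variants that both CADD and GWAVA score as deleterious."""
    kept = []
    for v in variants:
        # Remove polymorphisms: frequency of 1% or more in ExAC AFR.
        # Variants not present in ExAC are treated as rare (an assumption).
        if v.exac_afr_af is not None and v.exac_afr_af >= max_af:
            continue
        # Both predictors must agree that the variant is deleterious.
        if v.cadd_phred >= cadd_cutoff and v.gwava >= gwava_cutoff:
            kept.append(v)
    return kept


# Invented example records, not data from the study.
example = [
    Variant("var_a", 0.05, 25.0, 0.9),   # common in ExAC AFR: removed
    Variant("var_b", 0.001, 22.0, 0.7),  # rare, both predictors agree: kept
    Variant("var_c", None, 30.0, 0.2),   # only CADD flags it: dropped
    Variant("var_d", 0.01, 20.0, 0.8),   # exactly 1%: removed
]
selected_names = [v.name for v in select_candidates(example)]
print(selected_names)
```

Requiring agreement between the two predictors trades sensitivity for specificity: fewer candidates survive, but each is supported by two independent scores.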


Lunnon Road
Hillcrest, 0001 Pretoria
South Africa


Traineeship supervisor
Fourie Joubert

Testimony Mubashir Prempeh (2018-2019)

“Totally worth it”

Initially, I worried that I would not enjoy South Africa (SA). Whenever I talked about my work placement in SA, my family members and friends would bring up the crime rates, inequality and many other negative things. I decided to find out more about SA, so I registered for the 2018 information session of “Word Wereldburger”, organized by the province of West Flanders. At the event, students talked about their experiences during their internships in SA. They insisted that Pretoria was a safe place to go and that I should not worry, because I would enjoy my stay. Another important factor in choosing this destination was that the University of Pretoria is one of the best-ranked universities in SA. Moreover, the Centre for African Gene Technologies at the Dept. of Bioinformatics is one of the first and best information technology research facilities in Africa, and it collaborates with other universities and colleges (e.g. Hokkaido institute, University of Uppsala and Howest University of Applied Science). I was fortunate enough to work on germ line variant analysis in introns to determine whether the mutated genes conferred increased susceptibility to breast cancer. Every step in the project had its own difficulties but was very interesting and fun to accomplish. Supervisors, colleagues and students were all ready to help. As an intern, I never had the feeling that I had to carry out the project on my own or had nobody to go to when needed. During the project, I was introduced to new software tools, packages and programmes that will be useful for any data scientist in the future. My accommodation (Tuksdorp) was a five-minute walk from work, and all the stores and shops were within a 1 km radius. My nine weeks here were very interesting, fun and challenging. I enjoyed my short stay, and it may come as no surprise that I wanted to stay longer.
I would recommend SA to everybody who is looking to do good research in a challenging yet strong scientific environment.
