I've started to use again the R languages for data visualizations. Discovering the lessR package, which simplifies considerably the verbose syntax of the R language by encapsulating the functionality behind simple functions, I wondered whether it can be installed in Microsoft Fabric and used from notebooks. Besides the available documentation, for learning I used also David W Berging's book on R visualizations [3].
Into a new notebook, I used one cell for each installation or package retrieval (see [1], [2]):
#installing packages #install.packages("tidyverse") #is in Microsoft Fabric preinstalled install.packages("lessR") #retrieve packages from library library("tidyverse") library("lessR")
I attempted to read the data from a http location via the lessR Read function and it worked
d <- Read("http://lessRstats.com/data/employee.xlsx")
head(d)
However, attempting to use any of the lessR functions used for visualization displayed only the text output and not the visualizations. No matter what I did - suppressing the text, suppressing the generation of PDF files, the result was the same. It seems to be a problem with the output device, though I'm not sure how to solve this yet.
# supressing the text style(quiet=TRUE) # reenabling the text style(quiet=FALSE) # supressing PDF generation pdf(NULL)
# retrieving current device used
options()$device
I was able to run the ggplot2 scripts from [3] though only when the lessR was also installed (each script should be run in its own cell, otherwise only the last plot is shown):
# bard charts ggplot(d) + geom_bar(aes(Dept)) # histogram ggplot(d, aes(Salary)) + geom_histogram(binwidth=10000) # integrated violin/box/scatterplot ggplot(d, aes(x="", y=Salary)) + geom_violin(fill="gray90", bw=9500, alpha=.3) + geom_boxplot(fill="gray75", outlier.color="black", width=0.25) + geom_jitter(shape=16, position=position_jitter(0.05)) + theme(axis.title.y=element_blank()) + coord_flip() # enhanced scatterplot ggplot(d, aes(Years, Salary)) + geom_point() + geom_smooth(method=lm, color="black") + stat_ellipse(type="norm") + geom_vline(aes(xintercept=mean(Years, na.rm=TRUE)), color="gray70") + geom_hline(aes(yintercept=mean(Salary), na.rm=TRUE), color="gray70")
Similar results could be obtained by using the following lessR syntax in RStudio:
# bard charts BarChart(Dept) # histogram Histogram(Salary) # integrated violin/box/scatterplot Plot(Salary) # enhanced scatterplot Plot(Years, Salary, enhance=TRUE)
Trying to see whether I can access the data from a lakehouse via SparkR, I've downloaded the file from the support website [3], loaded the data into an available lakehouse (e.g. UAT), respectiveley loaded the data to a new table:
-- creating the table CREATE TABLE [dbo].[employee]( [Name] [varchar](8000) NULL, [Years] [int] NULL, [Gender] [varchar](8000) NULL, [Dept] [varchar](8000) NULL, [Salary] [float] NULL, [JobSat] [varchar](8000) NULL, [Plan] [int] NULL, [Pre] [int] NULL, [Post] [int] NULL ) ON [PRIMARY] GO -- checking the data SELECT * FROM [dbo].[employee]
I was able to access the content of the imported file via the following script:
#access the file from lakehouse
#csv_file <- "https://onelake.dfs.fabric.microsoft.com/<file_system>/<account_name>/Files/OpenSource/employee.csv"
#csv_file <- "abfss://<file_system>.dfs.fabric.microsoft.com/<account_name>/Files/OpenSource/employee.csv"
csv_file <- "Files/OpenSource/employee.csv"
df <- read.df(csv_file, source= "csv", header = "true", inferSchema = "true") display(df)
Initially, I wasn't able to access the table directly, though in the end I was able to retrieve the data (without and with the catalog's name):
# creating a data frame via SparkSQL dfEmp <- sql("SELECT * FROM Employee") head(dfEmp)
Comments:
1) Once the sessions timeout, it seems that one needs to rerun the scripts, which proves to be time-consuming as the installation takes about 5 minutes.
2) Being able to use lessR directly in Microsoft Fabric could be a real win given its simple syntax. I run most of the tests from the book [3] plus some of the recommended scripts and the results are satisfactory.
3) The connection via the ABFS path to the lakehouse works as well, but not via URL.
References:
[1] Microsoft Learn - Microsoft Fabric (2023) R library management(link)
[2] lessR (2024) Data (link)
[3] David W Gerbing (2020) R Visualizations: Derive Meaning from Data
[4] CRAN=R (2024) Package lassR (link)