Showing posts with label Python.

07 April 2025

🏭🗒️Microsoft Fabric: User Data Functions (UDFs) in Python [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 7-Apr-2025

[Microsoft Fabric] User Data Functions (UDFs)
  • {def} a platform that allows customers to host and run custom logic on Fabric from different types of items and data sources [1]
  • empower data developers to write custom logic and embed it into their Fabric ecosystem [1]
    • the logic can include internal algorithms and libraries [1]
    • {benefit} allows handling complex transformations, optimizing performance, and integrating diverse data sources beyond the capabilities of low-code tools [5]
      • via public libraries from PyPI
      • via native Fabric integrations to 
        • connect to Fabric data sources
          • e.g. warehouse, lakehouse or SQL Database
        • invoke functions from other Fabric items
          • e.g. notebooks, Power BI reports, data pipelines
      • by editing functions directly in the Fabric portal [1]
        • via the in-browser tooling or the VS Code extension
      • supports the Python 3.11 runtime [1]
    • {goal} reusability
      • by creating libraries of standardized functionality [1]
        • can be used in many solutions across the organization [1]
    • {goal} customization
      • the applications are tailored to customers’ needs [1]
    • {goal} encapsulation
      • functions can perform various tasks to build sophisticated workflows [1]
    • {goal} external connectivity
      • data functions can be invoked from external client applications using a REST endpoint [1]
      • allows integrations with external systems [1]
  • {feature} programming model
    • SDK that provides the necessary functionality to author and publish runnable functions in Fabric [4]
    • {benefit} allows seamless integration with other items in the Fabric ecosystem [4]
      • e.g. Fabric data sources
    • {concept} user data functions item 
      • contains one or many functions that can be invoked from the Fabric portal, from another Fabric item, or from an external application via the provided REST endpoint [4]
      • each function is a method in the Python script that accepts parameters and returns an output to the invoker [4]
    • {component} fabric.functions library 
      • provides the code needed to create user data functions in Python [4]
      • imported in the template by default [4]
    • {method} fn.UserDataFunctions() 
      • provides the execution context [4]
      • added at the beginning of the code file in all new user data functions items, before any function definitions [4]
    • {concept} functions
      • every function is identified by a @udf.function() decorator (see the sketch after this list)
        • can be invoked individually from the portal or an external invoker [4]
        • functions without the decorator can't be invoked directly [4]
  • {feature} invocation logs
    • function invocations are logged
    • {benefit} allows checking the status or debugging [3]
    • {limitation} can take a few minutes to appear [3]
      • {recommendation} refresh the page after a few minutes [3]
    • {limitation} daily ingestion limit: 250 MB [3]
      • the limit resets the next day, when a new log is made available [3]
    • {limitation} logs are sampled while preserving a statistically correct analysis of application data [3]
      • the sampling is done by User data functions to reduce the volume of logs ingested [3]
      • if some logs are partially missing, it might be because of sampling [3]
    • {limitation} supported log types: information, error, warning, trace [3]
  • {feature} Manage connections
    • {limitation} only supports connecting to Fabric-native data sources [2]
  • {limitation} only available in a subset of Fabric regions [2]
  • {limitation} editable by the owner only [2]
    • can only be modified and published by the user who is the owner of the User Data Functions Fabric item [2]
  • {limitation} adds further keywords to the list of reserved keywords
    • namely: req, context, reqInvocationId
    • can't be used as parameter names or function names [2]
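
A minimal sketch of a user data functions item, following the programming model above (the function names and logic are illustrative, not taken from the documentation):

import fabric.functions as fn

# execution context, placed before any function definitions
udf = fn.UserDataFunctions()

# the decorator marks the function as individually invocable
# (from the portal, another Fabric item, or the REST endpoint)
@udf.function()
def hello_fabric(name: str) -> str:
    return f"Hello, {format_name(name)}! Welcome to Fabric User Data Functions."

# helper without the decorator; it can't be invoked directly,
# only called from decorated functions
def format_name(name: str) -> str:
    return name.strip().title()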

References:
[1] Microsoft Learn (2025) Microsoft Fabric: What is Fabric User data functions (Preview)? [link]
[2] Microsoft Learn (2025) Microsoft Fabric: Service details and limitations of Fabric User Data Functions [link]
[3] Microsoft Learn (2025) Microsoft Fabric: User data functions invocation logs (Preview) [link]
[4] Microsoft Learn (2025) Microsoft Fabric: Fabric User data functions programming model overview (Preview) [link]
[5] Microsoft Fabric Updates Blog (2025) Announcing Fabric User Data Functions (Preview) [link]
[6] Microsoft Learn (2025) Microsoft Fabric: Use the Functions activity to run Fabric user data functions and Azure Functions [link]

Resources:
[R1] Microsoft Fabric Updates Blog (2025) Utilize User Data Functions in Data pipelines with the Functions activity (Preview) [link]

Acronyms:
PyPI - Python Package Index
REST - Representational State Transfer
SDK - Software Development Kit
UDF - User Data Function
VS - Visual Studio

27 February 2021

🐍Python: PySpark and GraphFrames (Test Drive)

Besides the challenges met while configuring the PySpark & GraphFrames environment, running my first example in the Spyder IDE also proved to be a bit more challenging than expected. Starting from an example provided in the Databricks documentation on GraphFrames, I had to add three more lines to establish the connection to the Spark cluster and to deactivate the context afterwards (only one SparkContext can be active per Java VM).

The following code displays the vertices and edges, as well as the in- and out-degrees, for a basic graph.

from graphframes import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

#establishing a connection to the Spark cluster (code added)
sc = SparkContext('local').getOrCreate()
spark = SparkSession(sc)

# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])

# Create a GraphFrame
g = GraphFrame(v, e)

g.vertices.show()
g.edges.show()

g.inDegrees.show()
g.outDegrees.show()

#stopping the active context (code added)
sc.stop()

Output:
 id    name  age
  a   Alice   34
  b     Bob   36
  c Charlie   30
  d   David   29
  e  Esther   32
  f   Fanny   36
  g   Gabby   60

src dst relationship
  a   b       friend
  b   c       follow
  c   b       follow
  f   c       follow
  e   f       follow
  e   d       friend
  d   a       friend
  a   e       friend

 id  inDegree
  f         1
  e         1
  d         1
  c         2
  b         2
  a         1

 id  outDegree
  f          1
  e          2
  d          1
  c          1
  b          1
  a          2

Notes:
Without the last line, running the code a second time will halt with the following error:
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local) created by __init__ at D:\Work\Python\untitled0.py:4
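
One way to avoid leaving a stale context behind when a run fails midway is to wrap the graph code in a try/finally block, so the context is always stopped (a minimal sketch):

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local').getOrCreate()
spark = SparkSession(sc)

try:
    # graph-related code goes here
    pass
finally:
    sc.stop() #always executed, even when the code above raises an error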

Loading the same data from a CSV file involves a small overhead, as the schema needs to be defined explicitly. The following code should produce the same output as above:

from graphframes import *
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import * 

#establishing a connection to the Spark cluster (code added)
sc = SparkContext('local').getOrCreate()
spark = SparkSession(sc)

nodes = [
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
]
edges = [
    StructField("src", StringType(), True),
    StructField("dst", StringType(), True),
    StructField("relationship", StringType(), True)
    ]

v = spark.read.csv(r"D:\data\nodes.csv", header=True, schema=StructType(nodes))

e = spark.read.csv(r"D:\data\edges.csv", header=True, schema=StructType(edges))

# Create a GraphFrame
g = GraphFrame(v, e)

g.vertices.show()
g.edges.show()

g.inDegrees.show()
g.outDegrees.show()

#stopping the active context (code added)
sc.stop()

The 'nodes.csv' file has the following content:
id,name,age
"a","Alice",34
"b","Bob",36
"c","Charlie",30
"d","David",29
"e","Esther",32
"f","Fanny",36
"g","Gabby",60

The 'edges.csv' file has the following content:
src,dst,relationship
"a","b","friend"
"b","c","follow"
"c","b","follow"
"f","c","follow"
"e","f","follow"
"e","d","friend"
"d","a","friend"
"a","e","friend"

Note:
There should be no spaces between values (e.g. "a", "b"), otherwise the results might deviate from expectations. 

Now, one can go and test further operations on the graph thus created:

#filtering edges 
gl = g.edges.filter("relationship = 'follow'").sort("src")
gl.show()
print("number edges: ", gl.count())

#filtering vertices
#gl = g.vertices.filter("age >= 30 and age<40").sort("id")
#gl.show()
#print("number vertices: ", gl.count())

# relationships involving edges and vertices
#motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
#motifs.show()
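
GraphFrames also exposes several standard graph algorithms; a small sketch running PageRank and counting triangles on the same graph (the parameter values are chosen arbitrarily):

#running PageRank over the graph
pr = g.pageRank(resetProbability=0.15, maxIter=10)
pr.vertices.select("id", "pagerank").sort("pagerank", ascending=False).show()

#counting the triangles each vertex participates in
tc = g.triangleCount()
tc.select("id", "count").show()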

Happy coding!

🐍Python: Installing PySpark and GraphFrames on a Windows 10 Machine

One of the To-Dos for this week was to set up the environment so I could start learning PySpark and GraphFrames based on the examples from Needham & Hodler’s free book on Graph Algorithms. Therefore, I downloaded and installed the Java JDK 8 from the Oracle website (requires an Oracle account) and the latest stable version of Python (Python 3.9.2), then downloaded and unzipped the Apache Spark package locally on a Windows 10 machine, as well as the Winutils tool, as described here.

The setup requires creating several environment variables and extending the Path variable with further values (delimited by ";"). In the end I added the following values:

Variable      Value
HADOOP_HOME   D:\Programs\spark-3.0.2-bin-hadoop2.7
SPARK_HOME    D:\Programs\spark-3.0.2-bin-hadoop2.7
JAVA_HOME     D:\Programs\Java\jdk1.8.0_281
PYTHONPATH    D:\Programs\Python\Python39\
PYTHONPATH    %SPARK_HOME%\python
PYTHONPATH    %SPARK_HOME%\python\lib\py4j-0.10.9-src.zip
PATH          %HADOOP_HOME%\bin
PATH          %SPARK_HOME%\bin
PATH          %PYTHONPATH%
PATH          %PYTHONPATH%\DLLs
PATH          %PYTHONPATH%\Lib
PATH          %JAVA_HOME%\bin
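
A quick way to check that the variables are visible to Python is to print them from a new console session (a minimal sketch):

import os

#checking the environment variables used by the setup
for var in ["HADOOP_HOME", "SPARK_HOME", "JAVA_HOME", "PYTHONPATH"]:
    print(var, "=", os.environ.get(var, "<not set>"))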

I then tried running the first example from Chapter 3 using the Spyder IDE, though the environment didn’t seem to recognize the 'graphframes' library. As long as it's not already available, the graphframes .jar file (e.g. graphframes-0.8.1-spark3.0-s_2.12.jar) corresponding to the installed Spark version must be downloaded and copied into the Spark folder where the other .jar files are located (e.g. .\spark-3.0.2-bin-hadoop2.7\jars). With this change I could finally run my example, though it took me several tries to get this right.
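
Alternatively, instead of copying the .jar manually, Spark can be instructed to resolve the package at startup via the spark.jars.packages configuration (a sketch, assuming the package coordinates match the installed Spark/Scala build and that the package repository is reachable):

from pyspark.sql import SparkSession

#letting Spark resolve the GraphFrames package at startup
spark = SparkSession.builder \
    .master("local") \
    .config("spark.jars.packages", "graphframes:graphframes:0.8.1-spark3.0-s_2.12") \
    .getOrCreate()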

During Python's installation I had to change the value for the LongPathsEnabled setting from 0 to 1 via regedit to allow path lengths longer than 260 characters, as mentioned in the documentation. The setting is available via the following path:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem

In the process I also tried installing ‘pyspark’ and ‘graphframes’ via the Anaconda tool with the following commands:

pip3 install --user pyspark
pip3 install --user graphframes

From Anaconda’s point of view the installation was correct, a fact which pointed me towards the missing 'graphframes' library.

It took me 4-5 hours of troubleshooting and searching until I got my environment set up. I still have two more warnings to solve, though I will look into them later:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped

Notes:
Spaces in folder names might create issues; therefore, I used 'Programs' instead of 'Program Files' as the main folder.
There seems to be some confusion about which environment variables are needed and how they need to be configured.
Unfortunately, the troubleshooting involved in setting up an environment and getting a simple example to work seems to be a recurring story over the years. The situation was the same with the programming languages of 15-20 years ago.

17 February 2021

📊🐍Python: Plotting Data with the Radar Chart

Today's task was to display a set of data using a radar chart built with the matplotlib.pyplot library. For this I considered the iris dataset available with the sklearn library. The dataset is stored as an array, therefore it was converted into a data frame for further manipulation. As the radar chart allows comparing only a small set of numerical values, I considered displaying only the mean values for each type of iris (setosa, versicolor, virginica).

Unfortunately, the radar chart doesn't seem to close the polygons based on the available dataset alone, therefore as a workaround I had to duplicate the first column within the result data frame. (It seems that the Plotly library does a better job at displaying radar charts, see example.)

Radar Chart

Here's the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets  
import pandas as pd

#preparing the data frame 
iris = datasets.load_iris()

ds = pd.DataFrame(data = iris.data, columns = iris.feature_names)
dr = ds.assign(target = iris.target) #iris type

group_by_iris = dr.groupby('target').mean()
group_by_iris[''] = group_by_iris[iris.feature_names[0]] #duplicating the first column

# creating the graph

angle = np.linspace(start=0, stop=2 * np.pi, num=len(group_by_iris.columns))

plt.figure(figsize=(5, 5))
plt.subplot(polar=True)

values = group_by_iris[:1].values[0]
plt.plot(angle, values, label='Iris-setosa', color='r')
plt.fill(angle, values, 'r', alpha=0.2)

values = group_by_iris[1:2].values[0]
plt.plot(angle, values, label='Iris-versicolor', color='g')
plt.fill(angle, values, 'g', alpha=0.2)

values = group_by_iris[2:3].values[0]
plt.plot(angle, values, label='Iris-virginica', color='b')
plt.fill(angle, values, 'b', alpha=0.2)

#labels
plt.title('Iris comparison', size=15)
labels = plt.thetagrids(np.degrees(angle), labels=group_by_iris.columns.values)
plt.legend()

plt.show()
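
As mentioned above, Plotly seems to handle the closing of the polygon on its own; a minimal sketch with plotly.express, assuming the plotly package is installed (the data frame is reshaped into the long format expected by line_polar):

import plotly.express as px

#reshaping the means into long format: one row per (iris type, feature)
dm = group_by_iris[iris.feature_names].reset_index().melt(
    id_vars='target', var_name='feature', value_name='value')
dm['target'] = dm['target'].map({0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'})

fig = px.line_polar(dm, r='value', theta='feature', color='target', line_close=True)
fig.show()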

Happy coding!

16 February 2021

📊🐍Python: Drawing Concentric Circles with matplotlib.pyplot

Today I tried using the matplotlib library for the first time for drawing a few concentric circles, though it proved to be a bit more challenging than expected, as the circles were distorted given the scale differences between the x and y axes. Because of this, the circles drawn via the Circle class (in blue) appear as ellipses. To show the difference, I used trigonometric functions to draw the circles (in green), applying a 5/7.5 multiplication factor for the x axis:

And here's the code:

import numpy as np
import math as m
import matplotlib.pyplot as plt

axis_dimensions = [-100,100, -100,100] #dimensions axis
dx=10       #distance between ticks on x axis 
dy=10       #distance between ticks on y axis
sfx = 5/7.5 #scale factor for x axis
r= 50       #radius

#drawing the grid
plt.axis(axis_dimensions)
plt.axis('on')
plt.grid(True, color='gray')
plt.xticks(np.arange(axis_dimensions[0], axis_dimensions[1], dx))
plt.yticks(np.arange(axis_dimensions[2], axis_dimensions[3], dy))

#adding labels
plt.title('Circles')
plt.xlabel('x axis')
plt.ylabel('y axis')

#drawing the geometric figures
for i in range(0,51,10):
    for angle in np.arange(m.radians(0),m.radians(360),m.radians(2)):
        #drawing circles via trigonometric functions
        x = (r+i)*m.cos(angle)*sfx
        y = (r+i)*m.sin(angle)
        plt.scatter(x,y,s=2,color ='g')
        
    #drawing with circles
    circle = plt.Circle((0,0),r+i,color='b', fill=False)
    plt.gca().add_patch(circle)

plt.show()
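
An alternative to the manual scale factor is to force an equal aspect ratio on the axes, so the Circle patches are rendered as true circles (a minimal sketch):

import matplotlib.pyplot as plt

plt.axis([-100, 100, -100, 100])
plt.gca().set_aspect('equal') #same scale on both axes, no distortion

#drawing the concentric circles without any correction factor
for i in range(0, 51, 10):
    circle = plt.Circle((0, 0), 50 + i, color='b', fill=False)
    plt.gca().add_patch(circle)

plt.show()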

Happy coding!

