
Friday, May 3, 2019

Presenting at DataWorks Summit in DC, "Introducing Apache MLflow", Wednesday, May 22


Introducing Apache MLflow: An Open Source Platform for the Machine Learning Lifecycle for On-Prem or in the Cloud


Wednesday, May 22
11:50 AM - 12:30 PM
Marquis Salon 7
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script runs, the results are logged automatically as a byproduct of those added lines, even if the person doing the training run makes no special effort to record them. Apache MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, Apache MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Projects and Models components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
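
For a flavor of what those "few lines of code" look like, here is a minimal sketch of MLflow tracking in Python (not code from the talk); the scikit-learn model and the alpha/rmse names are illustrative placeholders.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X, y)
    rmse = mean_squared_error(y, model.predict(X)) ** 0.5

    mlflow.log_param("alpha", alpha)          # hyperparameter used for this run
    mlflow.log_metric("rmse", rmse)           # evaluation metric for this run
    mlflow.sklearn.log_model(model, "model")  # deployable packaging of the model

Each run of the script then shows up automatically in the MLflow tracking UI (mlflow ui) with its parameters, metrics and logged model.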

https://dataworkssummit.com/washington-dc-2019/session/introducing-apache-mlflow-an-open-source-platform-for-the-machine-learning-lifecycle-for-on-prem-or-in-the-cloud/

Wednesday, March 16, 2016

I am presenting at NJ Data Science - Meetup/User Group in Princeton - Thursday, March 17, 2016

I am presenting at the NJ Data Science - Apache Spark meetup group in Princeton on Thursday, March 17, 2016.




How to Build a Recommendation Engine Using Apache Spark and Apache Zeppelin on the Hortonworks HDP Platform
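
For flavor, here is a minimal, hypothetical sketch of the kind of collaborative-filtering pipeline such a demo typically builds, using MLlib's ALS from PySpark; the ratings.csv file, its column layout and the parameter values are placeholders, not material from the talk.

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

# In Zeppelin the SparkContext is already provided as sc
sc = SparkContext(appName="RecommendationEngineSketch")

# Parse "userId,itemId,rating" lines into Rating objects
ratings = (sc.textFile("ratings.csv")
             .map(lambda line: line.split(","))
             .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2]))))

# Train a collaborative-filtering model with ALS (rank and iterations are illustrative values)
model = ALS.train(ratings, rank=10, iterations=10)

# Top 5 recommendations for user 1
print(model.recommendProducts(1, 5))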


Hope to see you there!


Tuesday, February 2, 2016

Please vote for my Hadoop Summit Talk


Use of Apache Solr, Apache Spark and OCR for Text Mining and Search capability for business process improvement and Advanced Analytics

Showcase how to use OCR (Optical Character Recognition) technology along with Apache Solr search and Apache Spark to provide text mining capabilities. A very common scenario is the need to index and search text in image files that were scanned in, for example patient charts, legal documents, etc. In this session we will demonstrate how to use OCR technology to convert scanned documents (jpg, gif, tiff, etc.) to text documents. The resulting text data can then be stored in Hive, HBase or Solr, and can then be used for further data analysis and exploration. We will also demonstrate how to use Apache Spark to text mine the data.
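
As a rough, hypothetical illustration of this pipeline (not code from the session), the sketch below OCRs a scanned image with the Tesseract engine via the pytesseract package and then does a simple word count with Spark; the file names and the word-count step are placeholder choices.

import pytesseract
from PIL import Image
from pyspark import SparkContext

# 1. OCR: convert a scanned image (jpg, gif, tiff, etc.) to plain text
text = pytesseract.image_to_string(Image.open("scanned_chart_001.tiff"))
with open("scanned_chart_001.txt", "w") as f:
    f.write(text)

# 2. Text mining with Spark: a trivial word count over the extracted text
sc = SparkContext(appName="OcrTextMiningSketch")
counts = (sc.textFile("scanned_chart_001.txt")
            .flatMap(lambda line: line.lower().split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(10))

The same extracted text could instead be written to Hive, HBase or Solr for indexing and search, as the abstract describes.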




Tuesday, December 1, 2015

Spark SQL Federation

Spark SQL comes with a nice feature called "JDBC To Other Databases", which, in practice, is a JDBC federation feature.

The example below uses the HDP Sandbox 2.3.2 and Spark 1.5.1 TP (https://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/):

1- Run the Spark SQL Thrift Server with the MySQL JDBC driver:

[root@sandbox incubator-zeppelin]# /root/dev/spark-1.5.1-bin-hadoop2.6/sbin//start-thriftserver.sh --hiveconf hive.server2.thrift.port=10010 --jars "/usr/share/java/mysql-connector-java.jar"

2- Open beeline and connect to the Spark SQL Thrift Server:

beeline -u "jdbc:hive2://localhost:10010/default" -n admin

3- Create a JDBC federated table pointing to an existing MySQL database, using beeline:

CREATE TABLE mysql_federated_sample
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "com.mysql.jdbc.Driver",
  url "jdbc:mysql://localhost/hive?user=hive&password=hive",
  dbtable "TBLS"
);
describe mysql_federated_sample;
select * from mysql_federated_sample;
select count(1) from mysql_federated_sample;

The code below does the same thing using spark-shell, Scala and DataFrames.

1- Open spark-shell with the MySQL JDBC driver:

/root/dev/spark-1.5.1-bin-hadoop2.6/bin/spark-shell  --jars "/usr/share/java/mysql-connector-java.jar"

2- Create a DataFrame pointing to the MySQL table:

val jdbcDF = sqlContext.read.format("jdbc").options( 
  Map(
  "driver" -> "com.mysql.jdbc.Driver",
  "url" -> "jdbc:mysql://localhost/hive?user=hive&password=hive",
  "dbtable" -> "TBLS"
  )
).load()

jdbcDF.show


Wednesday, November 25, 2015

Hidden Gem in HDP sandbox. SSH Web Server port 4200

If you go to port 4200 on your HDP sandbox (http://sandbox.hortonworks.com:4200/), you get a web-based SSH terminal, so you can log in to the sandbox shell straight from your browser without a separate SSH client.

Wednesday, November 11, 2015

Got access to Zeppelin Hub beta

The ability to sync notebooks to the hub and collaborate on them is going to make Spark development so much better.

 


Tuesday, November 10, 2015

IntelliJ and Maven: HDP Public Repos to Index

If you are trying to add the HDP repos to the IntelliJ Maven repository list, so that autocomplete shows all available versions and not just what is in your local repos, you may get an error when you try to run the update, like the screenshot below:


This is a known bug. Thanks to Shane Kumpf for pointing out a fix to get the IntelliJ Maven repos working with autocomplete:



This is a bug in IntelliJ 14.1 (and many earlier versions).
See IDEA-102693, which includes a zip with the fixed Maven plugin jars. Replace your IntelliJ jars with those from the zip file.
If that doesn't work, take a look at your idea.log (sudo find / -name idea.log to locate it) for any exceptions and research those and/or post your stack trace here.
