Building a Recommendation System with Azure, Hadoop, and Mahout

2013-7-17 11:48 | Posted by: 红黑魂 | Source: 开源中国 (OSChina)

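Summary: Want to help someone today? Based on a user's answer history, we can recommend new Stack Exchange questions to a user who is able to answer them, much as Amazon recommends products based on your purchase history. Not sure what Stack Exchange does? It runs ...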

Step 3 - Extract the data we need from the dump (User, Question)

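To extract the data, we will use Hadoop to distribute the work. The first thing to write is a simple Mapper. As mentioned earlier, we need the {OwnerUserId, ParentId} pair for every post with PostTypeId=2 (that is, every answer). This is because the input we will later feed to the recommendation job is {user, item}. To get there, first load Posts.xml into HDFS. You can use the hadoop fs command to copy the local file to the input path you want.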
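To make the target of the extraction concrete, here is a minimal sketch that runs locally, without Hadoop, and pulls the {OwnerUserId, ParentId} pair out of individual rows. The sample rows and the names `AnswerRow` and `TryExtract` are made up for illustration; only the attribute names come from the text above, and real rows in the dump carry many more attributes.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class AnswerRow
{
    // Returns "user,question" for an answer row (PostTypeId=2), or null for
    // anything else: questions, rows missing an attribute, or unparseable lines.
    public static string TryExtract(string line)
    {
        XElement row;
        try
        {
            row = XElement.Parse(line);
        }
        catch (XmlException)
        {
            return null; // not a complete XML element, e.g. the <posts> wrapper line
        }

        var postType = row.Attribute("PostTypeId");
        var owner = row.Attribute("OwnerUserId");
        var parent = row.Attribute("ParentId");
        if (postType == null || postType.Value != "2" || owner == null || parent == null)
            return null;

        return owner.Value + "," + parent.Value;
    }

    public static void Main()
    {
        // Hypothetical sample input: a wrapper tag, a question, an answer,
        // and an answer with no OwnerUserId.
        string[] lines =
        {
            "<posts>",
            "  <row Id=\"1\" PostTypeId=\"1\" OwnerUserId=\"10\" />",
            "  <row Id=\"2\" PostTypeId=\"2\" ParentId=\"1\" OwnerUserId=\"42\" />",
            "  <row Id=\"3\" PostTypeId=\"2\" ParentId=\"1\" />"
        };

        foreach (var line in lines)
        {
            var pair = TryExtract(line);
            if (pair != null) Console.WriteLine(pair);
        }
    }
}
```

Of the four sample lines, only the third yields a pair (`42,1`); the others are skipped because they are not answers, lack an owner, or are not a complete element.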

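Now it's time to write a custom mapper to extract the data. We will use the Hadoop on Azure .NET SDK to write the MapReduce job. Note that we specify the input and output folders in the job's configuration section. Start Visual Studio and create a C# console application. If you remember my earlier articles, you'll know that hadoop fs is used to access the HDFS file system; it also helps if you know some basic *nix commands such as ls and cat.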

Note: In those earlier articles, you can skip the opening sections on HDInsight and still get a better understanding of the MapReduce model and Hadoop on Azure.

You need to install the Hadoop Map Reduce package from the Hadoop SDK for .NET through the NuGet package manager.

`install-package Microsoft.Hadoop.MapReduce`

With the code below, we can:

Create a mapper
Create a job
Submit the job to the cluster

Here it is:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using Microsoft.Hadoop.MapReduce;

namespace StackExtractor
{

    //Our Mapper takes one line of XML input and emits {OwnerUserId,ParentId},
    //i.e. {User,Question}
    public class UserQuestionsMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            try
            {
                var obj = XElement.Parse(inputLine);
                var postType = obj.Attribute("PostTypeId");
                if (postType != null && postType.Value == "2")
                {
                    var owner = obj.Attribute("OwnerUserId");
                    var parent = obj.Attribute("ParentId");

                    // Write output data. Ignore records with null values, if any
                    if (owner != null && parent != null)
                    {
                        context.EmitLine(string.Format("{0},{1}", owner.Value, parent.Value));
                    }
                }
            }
            catch
            {
                //Ignore this line if we can't parse it
            }
        }
    }


    //Our Extraction Job using our Mapper
    public class UserQuestionsExtractionJob : HadoopJob<UserQuestionsMapper>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration();
            config.DeleteOutputFolder = true;
            config.InputPath = "/input/Cooking";
            config.OutputFolder = "/output/Cooking";
            return config;
        }
    }

    //Driver that submits this to the cluster in the cloud
    //and waits for the result. This will push your executables to Azure storage
    //and execute the command line on the head node (HDFS for Hadoop on Azure uses Azure storage)
    public class Driver
    {
        public static void Main()
        {
            try
            {
                var azureCluster = new Uri("https://{yoururl}.azurehdinsight.net:563");
                const string clusterUserName = "admin";
                const string clusterPassword = "{yourpassword}";

                // This is the name of the account under which Hadoop will execute jobs.
                // Normally this is just "Hadoop".
                const string hadoopUserName = "Hadoop";

                // Azure Storage Information.
                const string azureStorageAccount = "{yourstorage}.blob.core.windows.net";
                const string azureStorageKey =
                    "{yourstoragekey}";
                const string azureStorageContainer = "{yourcontainer}";
                const bool createContainerIfNotExist = true;
                Console.WriteLine("Connecting : {0} ", DateTime.Now);

                var hadoop = Hadoop.Connect(azureCluster,
                                            clusterUserName,
                                            hadoopUserName,
                                            clusterPassword,
                                            azureStorageAccount,
                                            azureStorageKey,
                                            azureStorageContainer,
                                            createContainerIfNotExist);

                Console.WriteLine("Starting: {0} ", DateTime.Now);
                // Submit the extraction job defined above and wait for it to finish
                var result = hadoop.MapReduceJob.ExecuteJob<UserQuestionsExtractionJob>();
                var info = result.Info;

                Console.WriteLine("Done: {0} ", DateTime.Now);
                Console.WriteLine("\nInfo From Server\n----------------------");
                Console.WriteLine("StandardError: " + info.StandardError);
                Console.WriteLine("\n----------------------");
                Console.WriteLine("StandardOut: " + info.StandardOut);
                Console.WriteLine("\n----------------------");
                Console.WriteLine("ExitCode: " + info.ExitCode);
            }
            catch (Exception ex)
            {
                // Print the exception message together with its stack trace
                Console.WriteLine("Error: {0} ", ex);
            }
            Console.WriteLine("Press Any Key To Exit..");
            Console.ReadLine();
        }
    }

}
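
Once the job finishes, the output folder should hold lines in the `user,question` form emitted by the mapper. As a quick sanity check before moving on, a sketch along these lines parses such lines and counts answers per user. The sample lines and the names `OutputCheck`, `ParsePair`, and `AnswersPerUser` are hypothetical, and it assumes you have already copied the output somewhere you can read it; the trimming is only a precaution against stray whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OutputCheck
{
    // Parses one "user,question" line; returns null if it is not two integers.
    public static Tuple<long, long> ParsePair(string line)
    {
        var parts = line.Trim().Split(',');
        long user, question;
        if (parts.Length != 2
            || !long.TryParse(parts[0], out user)
            || !long.TryParse(parts[1], out question))
            return null;
        return Tuple.Create(user, question);
    }

    // Counts how many answers each user contributed, skipping malformed lines.
    public static Dictionary<long, int> AnswersPerUser(IEnumerable<string> lines)
    {
        return lines.Select(ParsePair)
                    .Where(p => p != null)
                    .GroupBy(p => p.Item1)
                    .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        // Hypothetical output lines; the last one is malformed on purpose.
        var sample = new[] { "42,1", "42,7", "13,1", "not-a-pair" };
        foreach (var kv in AnswersPerUser(sample).OrderBy(kv => kv.Key))
            Console.WriteLine("user {0}: {1} answer(s)", kv.Key, kv.Value);
    }
}
```

For the sample above this reports one answer for user 13 and two for user 42. A user with many answers simply has many {user, item} rows, which is the shape the recommendation step expects as input.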