Input data may be stored on HDFS as multiple files, and each file consists of many blocks, each called a Block. When Spark reads these files as input, it parses them with the InputFormat that corresponds to the data format, typically combining several Blocks into one input split, called an InputSplit. Note that an InputSplit cannot span files. Spark then generates a concrete Task for each input split, so InputSplits and Tasks are in one-to-one correspondence. Each of these Tasks is then assigned to an Executor on some node in the cluster for execution.
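The relationship between files, InputSplits, and initial Tasks can be sketched with a small model. This is a simplified illustration in plain Python, not Spark or Hadoop code: the function name `count_input_splits`, the file names, and the 128 MB split size are all assumptions made for the example, and a real InputFormat applies additional rules when sizing splits.

```python
import math

MB = 1024 * 1024


def count_input_splits(file_sizes, split_size):
    """Return the number of splits per file in a simplified model.

    Each file is cut on its own, so no split ever spans two files.
    """
    return {
        name: math.ceil(size / split_size)
        for name, size in file_sizes.items()
    }


# Hypothetical input files and sizes, chosen only for illustration.
files = {
    "part-0.txt": 300 * MB,
    "part-1.txt": 100 * MB,
    "part-2.txt": 10 * MB,
}

splits = count_input_splits(files, split_size=128 * MB)

# One Task per InputSplit, so the initial Task count is the total split count.
initial_tasks = sum(splits.values())

print(splits)
print(initial_tasks)
```

In this model the 10 MB file still gets its own split rather than being merged with the tail of another file, which is the "cannot span files" rule in action.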
Each node can run one or more Executors.
Each Executor consists of several cores, and each core of an Executor can run only one Task at a time.
The result of each Task is one partition of the target RDD.
Note: a core here is a virtual core, not a physical CPU core of the machine. It can be thought of as a worker thread of the Executor.
The concurrency of Task execution = number of Executors * number of cores per Executor.
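The concurrency formula above can be worked through numerically. The sketch below is an idealized model, assuming each Task occupies exactly one core and all Tasks take about the same time. The helper names `task_slots` and `scheduling_waves` and the cluster sizes are hypothetical.

```python
import math


def task_slots(num_executors, cores_per_executor):
    """Tasks that can run at once: Executors * cores per Executor."""
    return num_executors * cores_per_executor


def scheduling_waves(num_tasks, slots):
    """Rounds needed to run all Tasks if every slot runs one Task per round."""
    return math.ceil(num_tasks / slots)


# Hypothetical cluster: 4 Executors with 2 cores each, and 20 Tasks to run.
slots = task_slots(num_executors=4, cores_per_executor=2)
waves = scheduling_waves(num_tasks=20, slots=slots)

print(slots)
print(waves)
```

With 8 slots and 20 Tasks, the last round runs only 4 Tasks, so half of the cores sit idle. This is one reason the partition count is often chosen with the slot count in mind.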
As for the number of partitions:
In the data-reading stage, for example sc.textFile, the number of initial Tasks equals the number of InputSplits that the input files are divided into.
In the Map stage, the number of partitions stays the same.
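In the Reduce stage, aggregating an RDD triggers a shuffle, and the number of partitions of the resulting RDD depends on the specific operation. For example, repartition produces the specified number of partitions, and some other operators let the partition count be configured.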
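The difference between the Map and Reduce stages can be illustrated with a toy model in which an RDD is a list of partitions. This is plain Python and not the Spark API: `map_partitions` and `repartition` are hypothetical helpers, and the round-robin redistribution is an assumption chosen for simplicity rather than a description of how Spark shuffles data.

```python
def map_partitions(partitions, fn):
    """Apply fn element-wise inside each partition; the partition count is unchanged."""
    return [[fn(x) for x in part] for part in partitions]


def repartition(partitions, num_partitions):
    """Redistribute all elements into num_partitions new partitions (toy shuffle)."""
    out = [[] for _ in range(num_partitions)]
    i = 0
    for part in partitions:
        for x in part:
            # Round-robin placement: an illustrative choice, not Spark's algorithm.
            out[i % num_partitions].append(x)
            i += 1
    return out


# A toy "RDD" with 3 partitions.
rdd = [[1, 2, 3], [4, 5], [6]]

mapped = map_partitions(rdd, lambda x: x * 10)  # still 3 partitions
reshuffled = repartition(mapped, 2)             # now exactly 2 partitions

print(len(mapped), mapped)
print(len(reshuffled), reshuffled)
```

The map step never moves data between partitions, while the repartition step reads every input partition to build each output partition. That all-to-all dependency is what makes it a shuffle.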