如何使用Pandas库将一个值与Python中的多个值进行比较?

问题描述:

目前,我有以下情况。如何使用Pandas库将一个值与Python中的多个值进行比较?

Excel Data Frame =   SQL Data Frame = 
________       ________ _______ ___________ _________ 
|sector|       |sector| | hour| | value_cs| value_ps| 
--------       -------- ------- ----------- --------- 
AXYZ        AXYZ  0  78.90  87.10 
BYYT        RACH  0  87.12  13.90 
IOPL        IOPL  0  93.10  13.87 
XFTR        AXYZ  1  27.90  12.87 
MANU        IOPL  1  23.09  90.09 
            FRES  2  34.09  12.34 
            YYYT  2  12.43  32.98 
            REWT  3  98.09  99.99 

我有一个Excel文件和一组SQL结果和我想的部门列的每个值从对所有的值Excel文件从SQL结果的部门列进行比对,结果,如果这两列的值之间存在匹配,则将SQL结果中的列小时,value_csvalue_ps添加到新数据框中。 注意: SQL结果的数据与Excel文件的数据大小不一样。

期望的结果

New data frame 1 for value cs 
    ________ ____ ___ ___ ___ ___ ___ ___  ____      
    |sector| |0| |1| |2| |3| |4| |5| |6| .... |23| 
    -------- ---- --- ---- --- --- --- ----  ----        
    AXYZ 78.90 27.90 78.89 54.90 98.23 85.0 45.90  68.23 
    BYYT 18.94 67.10 65.69 76.32 76.56 56.03 56.23  87.65 
    IOPL 93.10 23.09 34.29 97.34 34.34 14.54 34.91  23.21 
    ...  ... 

New data frame 2 for value ps 
    ________ ____ ___ ___ ___ ___ ___ ___  ____      
    |sector| |0| |1| |2| |3| |4| |5| |6| .... |23| 
    -------- ---- --- ---- --- --- --- ----  ----        
    AXYZ 87.10 12.87 49.89 84.90 76.23 15.01 12.90  68.23 
    BYYT 28.43 27.11 54.69 57.12 19.56 45.12 45.23  47.15 
    IOPL 13.87 90.09 24.19 47.34 18.34 21.54 67.11  13.61 
    ...  ... 

我遵循的方法是在SQL结果转换成数据帧,以及从Excel文件中的数据,但我不知道如何在不执行比较一个for循环,但只能使用Pandas(for循环会花费太多时间来执行计算)。

import pandas as pd 
import pypyodbc 
from datetime import date 

def get_and_compare(): 

    start_date = date.today() 

    retrieve_values = "[DEV].[CS].[QA_Export] @start_date='{start_date:%Y-%m-%d}'".format(start_date=start_date) 

    # Connect to the database 
    db_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="xxx", 
             pwd="xxx", Trusted_Connection="No") 

    # Get the sql result into dataframe 
    data_frame_sql = pd.read_sql(retrieve_values,db_connection) 


    #declare new data frames 
    new_df_one = pd.DataFrame(columns=['sector', 'value cs', 'hour 0', 'hour 1', 'hour 2', 'hour 3', 'hour 4', 
            'hour 5', 'hour 6', 'hour 7', 'hour 8', 'hour 9', 'hour 10', 'hour 11', 
            'hour 12', 'hour 13', 'hour 14', 'hour 15', 'hour 16', 'hour 17', 'hour 18', 
            'hour 19', 'hour 20', 'hour 21', 'hour 22', 'hour 23']) 

    new_df_two = pd.DataFrame(columns=['sector', 'value ps', 'hour 0', 'hour 1', 'hour 2', 'hour 3', 'hour 4', 
            'hour 5', 'hour 6', 'hour 7', 'hour 8', 'hour 9', 'hour 10', 'hour 11', 
            'hour 12', 'hour 13', 'hour 14', 'hour 15', 'hour 16', 'hour 17', 'hour 18', 
            'hour 19', 'hour 20', 'hour 21', 'hour 22', 'hour 23']) 


    # Read the Excel file 
    current_wb = pd.ExcelFile \ 
    ("C:\\U\\dev\\testing\\Main values to compare.xlsx") 

    # Get the specific sheet to compare 
    working_values = current_wb.parse("Main values") 

    #Get the column from Excel 
    sector_from_excel = working_values['sector'] 

    #Comparison to perform 
    #.... unknown part 

所有的建议,意见将不胜感激,以帮助我完成这部分的代码。

试试这个:

def get_and_compare(): 

    start_date = date.today() 

    retrieve_values = "[DEV].[CS].[QA_Export] @start_date='{start_date:%Y-%m-%d}'".format(start_date=start_date) 

    # Connect to the database 
    db_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="xxx", 
             pwd="xxx", Trusted_Connection="No") 

    # Get the sql result into dataframe 
    data_frame_sql = pd.read_sql(retrieve_values,db_connection) 


    # Read the Excel file 
    current_wb = pd.ExcelFile \ 
    ("C:\\U\\dev\\testing\\Main values to compare.xlsx") 

    # Get the specific sheet to compare 
    working_values = current_wb.parse("Main values") 

    #Get the column from Excel 
    sector_from_excel = working_values['sector'] 

    # perform inner join between DataFrames 
    # note: this requires that "sector" is a column (and not an index) 
    # in both DataFrames, and that it is also named as "sector" in each 
    merged_df = data_frame_sql.merge(sector_from_excel, how="inner", on="sector") 

    # use "pivot" to reshape data from wide to long 
    # first with value_cs 
    cs_value_df = merged_df.pivot(index="sector", columns="hour", values="value_cs") 

    # and then with value_ps 
    ps_value_df = merged_df.pivot(index="sector", columns="hour", values="value_ps") 

    # I'd suggest returning both DataFrames in a single object; 
    # in this case I'm using a dict 
    return {"value cs": cs_value_df, "value ps": ps_value_df} 

对于它的价值,我建议拆分此功能分为多个功能,也许是一个生成的SQL查询,一个读取Excel文件,和一个执行熊猫运营。将这么多的动作写入一个函数并不是一个好习惯 - 如果有必要这么做,调试将非常繁琐。